Version 0.2

This is an edited transcript of the data analyses performed during the lectures of the Empirical Methods in Software Engineering (EMSE) course.

This file has been generated from an R Markdown source file: AnalysisExample.Rmd.

The analysis is performed on the data collected through a questionnaire filled in by the students at the beginning of the first lecture.

library(knitr)

data = read.csv("EmpiricalMethodQuest.csv",stringsAsFactors=FALSE)

data$Q3 = factor(data$Q3,levels=c("Never heard of","Basic","Good","Expert"))
data$Q4 = factor(data$Q4,levels=c("Never heard of","Basic","Good","Expert"))

twenty.thirty = data$Q5 == "20/30"
data$Q5[twenty.thirty] = 25
data$Q5 = as.numeric(as.character(data$Q5))

data$Q6 = factor(data$Q6,levels=
                   c("It is the way to go","It is interesting","It is complex","It is useless"))
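The recoding steps above can be tried on a small made-up data frame. The sketch below is only an illustration: the data frame `toy` and its values are hypothetical and are not the questionnaire answers; they merely mimic the column types of Q3 and Q5.

```r
# Hypothetical answers, NOT the course data: they only mimic the column types.
toy <- data.frame(
  Q3 = c("Basic", "Good", "Never heard of", "Expert"),
  Q5 = c("40", "20/30", "70", NA),
  stringsAsFactors = FALSE
)

# Fix the order of the levels so that tables and plots follow the knowledge scale.
toy$Q3 <- factor(toy$Q3, levels = c("Never heard of", "Basic", "Good", "Expert"))

# Replace the range "20/30" with its midpoint; which() skips missing answers.
toy$Q5[which(toy$Q5 == "20/30")] <- 25

# The column is still character at this point, so convert it to numbers.
toy$Q5 <- as.numeric(toy$Q5)

levels(toy$Q3)
toy$Q5
```

Without the explicit `levels` argument the factor levels would be sorted alphabetically, which would scramble the ordinal scale in tables and plots.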

Data description

Question 1

The question was: Did you know what the scientific method was?

The distribution of answers is

table(as.data.frame(data$Q1))
## 
##  No Yes 
##   3  12

A better representation in RMarkdown is

Response  Freq
No           3
Yes         12
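A table like this could be produced along the following lines. This is a sketch, not the code of the original source file: it rebuilds the answers from the counts shown above (3 No, 12 Yes), the names `Q1` and `Q1.df` are illustrative, and it assumes the knitr package is installed (it falls back to plain printing otherwise).

```r
# Rebuild the 15 answers from the counts reported above (3 No, 12 Yes).
Q1 <- rep(c("No", "Yes"), times = c(3, 12))

# Turn the frequency table into a data frame with readable column names.
Q1.df <- as.data.frame(table(Q1), stringsAsFactors = FALSE)
names(Q1.df) <- c("Response", "Freq")

# kable() renders the data frame as a table; fall back to print() without knitr.
if (requireNamespace("knitr", quietly = TRUE)) {
  print(knitr::kable(Q1.df))
} else {
  print(Q1.df)
}
```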

It can also be represented with a bar chart:

Q1.t = table(data$Q1)
barplot(Q1.t)
# write each count at mid-height of its bar
text(c(.7,1.9),Q1.t/2,Q1.t)

Question 5 overview

boxplot(data$Q5)

Hypothesis testing

One sample t-test

Let’s start with a simple (one sample) null hypothesis:

\(H_{0}\): \(\mu_{Q5} = 50\)

and the corresponding alternative hypothesis:

\(H_{a}\): \(\mu_{Q5} \neq 50\)

The test works in this way:

  • we assume \(\alpha=5\%\) (the probability of a type I error), which corresponds to a confidence level \(1-\alpha=95\%\).

    alpha <- .05
  • we compute the \(t\) statistic as:

\[ t = \frac{\bar{Q5}-\mu}{s_{Q5}/\sqrt{n}} = \frac{49.6153846-50}{21.6469338/\sqrt{13}} = -0.0640622 \]

  • we assume the null hypothesis is true (\(\mu = 50\)), then

  • \(t\) will be distributed according to a Student’s t distribution with \(df=n-1=12\) degrees of freedom

  • as a comparison, the values whose probability is smaller than or equal to \(\alpha\) lie beyond a \(t_{critical}\) computed from the t distribution:

      n <- sum(!is.na(data$Q5))  # number of valid answers to Q5 (13)
      t.crit <- qt(1-alpha/2,df=n-1)
      t.crit
    ## [1] 2.178813

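  • the probability of observing a value at least as extreme (in absolute terms) is:

      # t statistic from the formula above
      t.Q5 <- (mean(data$Q5,na.rm=TRUE)-50)/(sd(data$Q5,na.rm=TRUE)/sqrt(n))
      p.value <- (1-pt(abs(t.Q5),df=n-1))*2
      p.value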
    ## [1] 0.9499755

  • the decision about rejecting or not the null hypothesis can be taken

    • on the basis of the critical value: reject if \(|t_{Q5}| > t_{critical}\); here \(0.0640622 < 2.1788128\), so \(H_0\) is not rejected

    • on the basis of the significance level: reject if \(p.value < \alpha\); here \(0.9499755 > 0.05\), so \(H_0\) is not rejected
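The steps listed above can be replayed without the questionnaire file, starting from the summary statistics that appear in the formula for \(t\). The following is a minimal sketch under that assumption; the names `mean.Q5`, `sd.Q5`, `mu0`, `reject.by.t` and `reject.by.p` are introduced here for illustration only.

```r
# Summary statistics of Q5 as they appear in the formula for t above.
n       <- 13          # valid answers, so df = n - 1 = 12
mean.Q5 <- 49.6153846
sd.Q5   <- 21.6469338
mu0     <- 50          # value of the mean under the null hypothesis
alpha   <- 0.05

# t statistic: distance of the sample mean from mu0 in standard-error units.
t.Q5 <- (mean.Q5 - mu0) / (sd.Q5 / sqrt(n))

# Two-tailed critical value and p-value from the t distribution.
t.crit  <- qt(1 - alpha / 2, df = n - 1)
p.value <- 2 * (1 - pt(abs(t.Q5), df = n - 1))

# The two decision rules are equivalent, so they must agree.
reject.by.t <- abs(t.Q5) > t.crit
reject.by.p <- p.value < alpha

c(t = t.Q5, t.crit = t.crit, p.value = p.value)
c(reject.by.t = reject.by.t, reject.by.p = reject.by.p)
```

Both rules give `FALSE` here: the sample mean is well within the range of values compatible with \(\mu = 50\).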

The procedure above is performed by the function t.test:

t.test(data$Q5,mu=50)
## 
##  One Sample t-test
## 
## data:  data$Q5
## t = -0.064062, df = 12, p-value = 0.95
## alternative hypothesis: true mean is not equal to 50
## 95 percent confidence interval:
##  36.53427 62.69650
## sample estimates:
## mean of x 
##  49.61538
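The 95 percent confidence interval in this output can also be reproduced by hand. The sketch below uses only the summary statistics reported above (n = 13, the sample mean and the standard deviation of Q5); the variable names are illustrative and not part of the original analysis.

```r
# Summary statistics of Q5 taken from the output and the formula above.
n       <- 13
mean.Q5 <- 49.6153846
sd.Q5   <- 21.6469338
alpha   <- 0.05

# Half-width of the interval: critical value times the standard error.
half.width <- qt(1 - alpha / 2, df = n - 1) * sd.Q5 / sqrt(n)
ci <- c(lower = mean.Q5 - half.width, upper = mean.Q5 + half.width)
ci

# A two-tailed test at level alpha rejects H0 exactly when mu0 lies outside the interval.
mu0 <- 50
outside <- mu0 < ci[["lower"]] || mu0 > ci[["upper"]]
outside
```

Since 50 lies inside the interval, this check leads to the same decision as the p-value: the null hypothesis is not rejected.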

Directional hypotheses

If \(H_a\) simply states that the mean differs from the reference value, without predicting the direction of the difference, then you would use the default (two-tailed) form of the t-test.

If \(H_a\) predicts a difference in a particular direction (e.g. the mean being larger than the reference value), then you would use a one-tailed t-test.

For instance, we could test the following hypotheses:

\(H_{0}\): \(\mu_{Q5} = 30\)

\(H_{a}\): \(\mu_{Q5} > 30\)

t.test(data$Q5,mu=30,alternative="greater")
## 
##  One Sample t-test
## 
## data:  data$Q5
## t = 3.2672, df = 12, p-value = 0.003369
## alternative hypothesis: true mean is greater than 30
## 95 percent confidence interval:
##  38.91492      Inf
## sample estimates:
## mean of x 
##  49.61538

When the observed difference lies in the predicted direction, the p-value of a two-tailed test is twice that of the equivalent one-tailed test.
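The relation between the one-tailed and the two-tailed p-value can be checked from the summary statistics reported above (n = 13, the sample mean and the standard deviation of Q5). This sketch computes both tails directly with `pt()` instead of calling `t.test()`, so it does not need the questionnaire file; the variable names are illustrative.

```r
# Summary statistics of Q5 as reported in the output above.
n       <- 13
mean.Q5 <- 49.6153846
sd.Q5   <- 21.6469338
mu0     <- 30

# Same t statistic for both alternatives; only the tail area changes.
t.Q5 <- (mean.Q5 - mu0) / (sd.Q5 / sqrt(n))

# One-tailed p-value for Ha: mu > mu0 (upper tail only).
p.greater <- pt(t.Q5, df = n - 1, lower.tail = FALSE)

# Two-tailed p-value for Ha: mu != mu0 (both tails).
p.two.sided <- 2 * pt(abs(t.Q5), df = n - 1, lower.tail = FALSE)

c(t = t.Q5, p.greater = p.greater, p.two.sided = p.two.sided)
```

The factor of two holds because the t distribution is symmetric and the observed \(t\) is positive, i.e. in the direction stated by \(H_a\); had the sample mean been below 30, the one-tailed p-value would have been larger than 0.5 instead.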

Two samples t-test

We want to compare the responses to Q5 based on the response given to question Q2.

The two samples of responses can be visualized using a box plot:

boxplot( Q5 ~ Q2, data=data)

The two sample t-test is performed similarly to the one sample version: