Version 0.2

This is an edited transcript of the data analyses performed during the lectures of the Empirical Methods in Software Engineering (EMSE) course.

This file has been generated from an R Markdown source file: AnalysisExample.Rmd.

The analysis is performed on the data collected through a questionnaire¹ filled in by the students at the beginning of the first lecture.

library(knitr)

data = read.csv("EmpiricalMethodQuest.csv",stringsAsFactors=F)

# encode the ordinal answers as factors with an explicit level order
data$Q3 = factor(data$Q3,levels=c("Never heard of","Basic","Good","Expert"))
data$Q4 = factor(data$Q4,levels=c("Never heard of","Basic","Good","Expert"))

# one respondent answered "20/30" to Q5: recode it as the midpoint, 25
twenty.thirty = data$Q5 == "20/30"
data$Q5[twenty.thirty] = 25
data$Q5 = as.numeric(as.character(data$Q5))

data$Q6 = factor(data$Q6,levels=
                   c("It is the way to go","It is interesting","It is complex","It is useless"))

Data description

Question 1

The question was: Did you know what the scientific method was?

The distribution of answers is

table(data$Q1)
## 
##  No Yes 
##   3  12

A better representation in R Markdown is

| Response | Freq |
|----------|-----:|
| No       |    3 |
| Yes      |   12 |
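In the R Markdown source such a table is presumably produced with knitr::kable; a minimal sketch (the column name Response is supplied here only for readability):

kable(as.data.frame(table(Response = data$Q1)))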

It can also be represented using a bar chart:

Q1.t <- table(data$Q1)           # frequencies, reused for the labels
barplot(Q1.t)
text(c(.7,1.9),Q1.t/2,Q1.t)      # print each frequency at mid-bar height

Question 5 overview

boxplot(data$Q5)
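The statistics summarized by the box plot can also be inspected numerically (a sketch; the output is not part of the original transcript):

summary(data$Q5)   # min, quartiles, median, mean, max (and NA count)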

Hypothesis testing

One sample t-test

Let’s start with a simple (one sample) null hypothesis:

\(H_{0}\): \(\mu_{Q5} = 50\)

and the corresponding alternative hypothesis

\(H_{a}\): \(\mu_{Q5} \neq 50\)

The test works as follows:

  • we fix \(\alpha=5\%\) (the probability of a type I error), which corresponds to a confidence level \(1-\alpha=95\%\).

    alpha <- .05
  • we compute the \(t\) statistic as:

\[ t = \frac{\bar{Q5}-\mu}{s_{Q5}/\sqrt{n}} = \frac{49.6153846-50}{21.6469338/\sqrt{13}} = -0.0640622\]

  • we assume the null hypothesis is true (\(\mu = 50\)), then

  • \(t\) will be distributed according to a Student’s t distribution with \(df=n-1=12\)

  • as a comparison, the values having a probability smaller than or equal to \(\alpha\) correspond to a \(t_{critical}\) computed from the t distribution:

      n <- sum(!is.na(data$Q5))   # number of non-missing answers to Q5
      t.crit <- qt(1-alpha/2,df=n-1)
      t.crit
    ## [1] 2.178813

  • the probability of observing an (absolute) value as extreme as or more extreme than the one obtained is:

      t.Q5 <- (mean(data$Q5,na.rm=TRUE)-50)/(sd(data$Q5,na.rm=TRUE)/sqrt(n))
      p.value <- (1-pt(abs(t.Q5),df=n-1))*2
      p.value
    ## [1] 0.9499755

  • the decision about rejecting or not the null hypothesis can be taken (see the check after this list)

    • on the basis of the critical value: reject if \(|t_{Q5}| > t_{critical}\); here \(0.0640622 < 2.1788128\), so we cannot reject \(H_0\)

    • on the basis of the p-value: reject if \(p < \alpha\); here \(0.9499755 > 5\%\), so again we cannot reject \(H_0\)
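In code, the two decision rules necessarily agree; a quick check using the quantities computed above:

abs(t.Q5) > t.crit   # FALSE: cannot reject H0
p.value < alpha      # FALSE: cannot reject H0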

The procedure above is performed by the function t.test:

t.test(data$Q5,mu=50)
## 
##  One Sample t-test
## 
## data:  data$Q5
## t = -0.064062, df = 12, p-value = 0.95
## alternative hypothesis: true mean is not equal to 50
## 95 percent confidence interval:
##  36.53427 62.69650
## sample estimates:
## mean of x 
##  49.61538

Directional hypotheses

If \(H_a\) simply states that the mean differs from the reference value, without predicting a direction, you would use the default form of the t-test (two-tailed).

If \(H_a\) predicts a difference in a particular direction (e.g. the mean being larger than the reference value), then you would use a one-tailed t-test.
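Since the whole \(\alpha\) is placed in a single tail, the one-tailed critical value is smaller than the two-tailed one; a quick comparison using the quantities defined above:

qt(1-alpha, df=n-1)     # one-tailed critical value, ~1.78
qt(1-alpha/2, df=n-1)   # two-tailed critical value, ~2.18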

For instance, we could perform the test for the following directional pair of hypotheses:

\(H_{0}\): \(\mu_{Q5} = 30\) vs. \(H_{a}\): \(\mu_{Q5} > 30\)

t.test(data$Q5,mu=30,alternative="greater")
## 
##  One Sample t-test
## 
## data:  data$Q5
## t = 3.2672, df = 12, p-value = 0.003369
## alternative hypothesis: true mean is greater than 30
## 95 percent confidence interval:
##  38.91492      Inf
## sample estimates:
## mean of x 
##  49.61538

The p-value of a two-tailed test is typically twice that of the equivalent one-tailed test:

t.test(data$Q5,mu=30)
## 
##  One Sample t-test
## 
## data:  data$Q5
## t = 3.2672, df = 12, p-value = 0.006738
## alternative hypothesis: true mean is not equal to 30
## 95 percent confidence interval:
##  36.53427 62.69650
## sample estimates:
## mean of x 
##  49.61538

Two-sample t-test

We want to compare the responses to Q5 based on the response given to question Q2.

The two samples of responses can be visualized using a box plot:

boxplot( Q5 ~ Q2, data=data)

The two sample t-test is performed similarly to the one sample version:

t.test( Q5 ~ Q2, data=data)
## 
##  Welch Two Sample t-test
## 
## data:  Q5 by Q2
## t = 1.2118, df = 8.3143, p-value = 0.2589
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -13.14504  42.66885
## sample estimates:
##  mean in group No mean in group Yes 
##          56.42857          41.66667
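By default R runs the Welch variant, which does not assume equal variances in the two groups. If that assumption is justified, the pooled-variance (Student) version can be requested with the var.equal parameter (a sketch; output not shown):

t.test( Q5 ~ Q2, data=data, var.equal=TRUE)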

Non-parametric tests

Ordinal (or better) variables

The Wilcoxon signed-rank test is the non-parametric counterpart of the one-sample t-test. It can be performed using the function wilcox.test:

wilcox.test( data$Q5 ,mu=50)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  data$Q5
## V = 24.5, p-value = 0.7942
## alternative hypothesis: true location is not equal to 50

The two-sample extension is the Mann-Whitney U test, which can be executed using the same function:

wilcox.test( Q5 ~ Q2, data=data)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Q5 by Q2
## W = 30, p-value = 0.2169
## alternative hypothesis: true location shift is not equal to 0

Categorical variables

It is possible to test the independence between two categorical variables using the \(\chi^2\) test.

The test operates on a contingency table that reports the combined frequencies of the two variables, here Q3 and Q4²:

|                | Never heard of | Basic | Good |
|----------------|---------------:|------:|-----:|
| Never heard of |              0 |     2 |    0 |
| Basic          |              1 |     7 |    0 |
| Good           |              0 |     3 |    2 |

The \(\chi^2\) test compares the observed frequencies \(O\) in the contingency table to a table of expected frequencies \(E\), which can be computed on the basis of the marginals using the following formula:

\[E_{i,j} = \frac{O_{i,*} \cdot O_{*,j}}{N}\]

where:

  • \(O_{i,*}\) are the row marginals (sum of all elements in row \(i\))
  • \(O_{*,j}\) are the column marginals (sum of all elements in column \(j\))
  • \(N\) is the sum of all frequencies (i.e. size of the sample)
O = table(droplevels(data$Q3), droplevels(data$Q4))  # observed frequencies, unused level "Expert" dropped
E = (margin.table(O,1) %*% t(margin.table(O,2)))/sum(O)
|                | Never heard of | Basic | Good      |
|----------------|---------------:|------:|----------:|
| Never heard of |      0.1333333 |   1.6 | 0.2666667 |
| Basic          |      0.5333333 |   6.4 | 1.0666667 |
| Good           |      0.3333333 |   4.0 | 0.6666667 |

The \(\chi^2\) statistic is computed as:

\[ \chi^2 = \sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}} = 5.28125\]

The statistic is distributed according to the \(\chi^2\) distribution with \(df = n - p = (r-1)\cdot(c-1) = 4\) degrees of freedom, where

  • \(n\) is the number of cells in the contingency table (9), and
  • \(p = r + c - 1 = 5\) is the number of independent constraints imposed by the marginals.

The observed \(\chi^2\) statistic value (5.28125) has to be compared to the critical value for the predefined \(\alpha\) level (5%); the test is one-tailed, so \(\chi^2_{critical} = 9.4877\) (the 95th percentile of the \(\chi^2\) distribution with 4 degrees of freedom). Since \(5.28125 < 9.4877\), we cannot reject the hypothesis of independence.

Alternatively, we can directly compare the p-value, 0.2596373, with the reference \(\alpha\) (5%): being larger, it leads to the same conclusion.
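The whole procedure is available as the chisq.test function; a minimal sketch, using the table O built above (with such small expected counts R warns that the \(\chi^2\) approximation may be inaccurate):

chisq.test(O)   # X-squared = 5.28125, df = 4, p-value = 0.2596, as computed above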

For 2 × 2 contingency tables it is possible to use the Fisher exact test. The test is based on the number of possible tables (keeping the observed marginals) that are more extreme, in terms of odds ratio, than the observed one.

We start with a 2 × 2 table, e.g. the one reporting the frequencies of the observed combinations of Q1 (rows) and Q2 (columns):

|     | No | Yes |
|-----|---:|----:|
| No  |  3 |   0 |
| Yes |  4 |   8 |

The Fisher exact test checks the null hypothesis that the odds ratio is equal to 1.

The test can be performed using the fisher.test function; note that the estimated odds ratio below is infinite because one cell of the table is zero:

fisher.test(table(data$Q1,data$Q2))
## 
##  Fisher's Exact Test for Count Data
## 
## data:  table(data$Q1, data$Q2)
## p-value = 0.07692
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.5349074       Inf
## sample estimates:
## odds ratio 
##        Inf

  1. The questionnaire consisted of the following items and the corresponding possible responses.

    Considering your knowledge before the previous lecture on the experimental method:

    Q1. Did you know what the scientific method was?

    • Yes
    • No

    Q2. Did you know the key role of falsification in the scientific method?

    • Yes
    • No

    Q3. What was your knowledge of logical argumentation?

    • Never heard of
    • Basic
    • Good
    • Expert

    Q4. What was your knowledge of statistical hypothesis testing?

    • Never heard of
    • Basic
    • Good
    • Expert

    In general, thinking about the experimental method:

    Q5. In the articles you will write for your PhD work, how often do you plan to use hypothesis testing?

    • ______ %

    Q6. What is your opinion about the empirical method?

    • It is the way to go
    • It is complex
    • It is useless
    • It is interesting
  2. In this case we excluded the level Expert from the contingency table because it never occurred.