--- title: "Data Analysis Examples" author: "Marco Torchiano" date: "20,24 November 2015" version: 0.2 output: html_document --- _Version 0.2_ _This is an edited transcript of the data analyses performed during the lectures of the [Empirical Methods in Software Engineering (EMSE)](http://softeng.polito.it/EMSE/) course._ This file has been generated from an _R Markdown_ source file: [`AnalysisExample.Rmd`](AnalysisExample.Rmd). The analysis is performed on the data collected through a questionnaire[^questionnaire] filled in by the students at the beginning of the first lecture. ```{r load data,warning=FALSE} library(knitr) data = read.csv("EmpiricalMethodQuest.csv",stringsAsFactors=F) data$Q3 = factor(data$Q3,levels=c("Never heard of","Basic","Good","Expert")) data$Q4 = factor(data$Q4,levels=c("Never heard of","Basic","Good","Expert")) twenty.thirty = data$Q5 == "20/30" data$Q5[twenty.thirty] = 25 data$Q5 = as.numeric(as.character(data$Q5)) data$Q6 = factor(data$Q6,levels= c("It is the way to go","It is interesting","It is complex","It is useless")) ``` Data description ================ Question 1 ---------- The question was: _Did you know what the scientific method was?_ The distribution of answers is ```{r Q1 distrib} table(as.data.frame(data$Q1)) ``` A better representation in RMarkdown is ```{r Q1 distrib table, echo=FALSE} Q1.t = table(data$Q1) kable(as.data.frame(Q1.t),col.names = c("Response","Freq")) ``` It can be represented also using a bar chart ```{r Q1 distrib bar} barplot(table(data$Q1)) text(c(.7,1.9),Q1.t/2,Q1.t) ``` Question 5 overview ------------------- ```{r } boxplot(data$Q5) ``` Hypothesis testing ================== One sample t-test ----------------- Let's start with a simple (one sample) null hypothesis: $H_{0}$: $\mu_{Q5} = 50$ and the relative alternative hypothesis $H_{a}$: $\mu_{Q5} \neq 50$ The test works in this way: * we assume an $\alpha=5%$ (=probability of type I error), that means a confidence level $1-\alpha=95%$. 
Hypothesis testing
==================

One sample t-test
-----------------

Let's start with a simple (one sample) null hypothesis:

$H_{0}$: $\mu_{Q5} = 50$

and the corresponding alternative hypothesis:

$H_{a}$: $\mu_{Q5} \neq 50$

The test works in this way:

* we assume an $\alpha=5\%$ (the probability of a type I error), which means a confidence level $1-\alpha=95\%$.

```{r}
alpha <- .05
```

* we compute the $t$ statistic as:

$$ t = \frac{\bar{Q5}-\mu}{s_{Q5}/\sqrt{n}} = \frac{`r (m.Q5<-mean(data$Q5,na.rm=T))`-50}{`r (s.Q5<-sd(data$Q5,na.rm=T))`/\sqrt{`r (n<-sum(!is.na(data$Q5)))`}} = `r (t.Q5 <- (m.Q5 - 50)/(s.Q5/sqrt(n)))`$$

* we assume the null hypothesis is true ($\mu = 50$); then $t$ will be distributed according to a Student's t distribution with $df=n-1=`r n-1`$:

```{r,echo=FALSE,fig.width=5,fig.height=5}
ts = seq(-5,5,.1)
plot(ts,dt(ts,df=n-1),t="l")
```

* as a comparison, the values having a probability smaller than or equal to $\alpha$ lie beyond the critical value $t_{critical}$, computed from the t distribution:

```{r}
t.crit <- qt(1-alpha/2,df=n-1)
t.crit
```

```{r,echo=FALSE,fig.width=5,fig.height=5}
ts = seq(-5,5,.1)
plot(ts,dt(ts,df=n-1),t="l")
# shade the two rejection regions beyond the critical values
ts = seq(t.crit,5,.1)
polygon(c(t.crit,ts,5),c(0,dt(ts,df=n-1),0),col="red",border=NA)
ts = seq(-5,-t.crit,.1)
polygon(c(-5,ts,-t.crit),c(0,dt(ts,df=n-1),0),col="red",border=NA)
segments(t.crit,0,t.crit,4,col="red",lty=2)
segments(-t.crit,0,-t.crit,4,col="red",lty=2)
```

* the probability of observing such an extreme value (in absolute terms) or a larger one is:

```{r}
p.value <- (1-pt(abs(t.Q5),df=n-1))*2
p.value
```

```{r,echo=FALSE,fig.width=5,fig.height=5}
t.Q5 = abs(t.Q5)
ts = seq(-5,5,.1)
plot(ts,dt(ts,df=n-1),t="l")
# shade the two tails beyond the observed statistic
ts = seq(t.Q5,5,.1)
polygon(c(t.Q5,ts,5),c(0,dt(ts,df=n-1),0),col="green",border=NA)
ts = seq(-5,-t.Q5,.1)
polygon(c(-5,ts,-t.Q5),c(0,dt(ts,df=n-1),0),col="green",border=NA)
segments(t.Q5,0,t.Q5,4,col="green",lty=2)
segments(-t.Q5,0,-t.Q5,4,col="green",lty=2)
```

* the decision about whether to reject the null hypothesis can be taken
    * on the basis of the critical value: reject if $|t_{Q5}| > t_{critical}$, i.e. $`r abs(t.Q5)` > `r t.crit`$
    * on the basis of the p-value: reject if $p.value < \alpha$, i.e. $`r p.value` < 5\%$

The procedure above is performed by the function `t.test`:

```{r}
t.test(data$Q5,mu=50)
```

Directional hypotheses
----------------------

If $H_a$ simply says the mean will be different from the reference value, but doesn't predict a direction for the difference, then you would use the default form of the t-test (_two-tailed_).

If $H_a$ predicts a difference in a particular direction (e.g. the mean being larger than a reference value), then you would use a _one-tailed_ t-test.

For instance we could perform the test for the following hypotheses:

$H_{0}$: $\mu_{Q5} \leq 30$ vs. $H_{a}$: $\mu_{Q5} > 30$

```{r}
t.test(data$Q5,mu=30,alternative="greater")
```

The _p-value_ of a _two-tailed_ test is typically twice that of the equivalent _one-tailed_ one:

```{r}
t.test(data$Q5,mu=30)
```

Two samples t-test
------------------

We want to compare the responses to Q5 based on the response given to question Q2. The two samples of responses can be visualized using a box plot:

```{r}
boxplot( Q5 ~ Q2, data=data)
```

The two sample t-test is performed similarly to the one sample version:

```{r}
t.test( Q5 ~ Q2, data=data)
```

Non-parametric tests
--------------------

### Ordinal (or better) variables

The **Wilcoxon signed rank test** is the non-parametric equivalent of the t-test. It can be performed using the function `wilcox.test`:

```{r, warning=FALSE}
wilcox.test( data$Q5 ,mu=50)
```

The two sample extension is the **Mann-Whitney U test**, which can be executed using the same function:

```{r, warning=FALSE}
wilcox.test( Q5 ~ Q2, data=data)
```
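Besides the p-value, `wilcox.test` can also report a point estimate and a confidence interval for the location shift between the two samples when called with `conf.int=TRUE`. A minimal sketch on the same two-sample comparison:

```{r, warning=FALSE}
# same Mann-Whitney comparison, additionally requesting the estimated
# location shift between the two groups and its 95% confidence interval
wilcox.test( Q5 ~ Q2, data=data, conf.int=TRUE)
```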
### Categorical variables

It is possible to test the independence of two categorical variables using the $\chi^2$ test.

The test operates on a contingency table that reports the combined frequencies^[In this case we excluded from the contingency table the level `Expert` because it never occurred]:

`r kable(O <- table(data$Q3,data$Q4,exclude="Expert"))`

The $\chi^2$ test compares the observed frequencies $O$ in the contingency table to a table of _expected_ frequencies, which can be computed on the basis of the marginals using the following formula:

$$E_{i,j} = \frac{O_{i,*} \cdot O_{*,j}}{N}$$

where:

- $O_{i,*}$ are the row marginals (sum of all elements in row $i$),
- $O_{*,j}$ are the column marginals (sum of all elements in column $j$),
- $N$ is the sum of all frequencies (i.e. the size of the sample).

```{r}
# outer product of the two marginals, divided by the sample size
E = (margin.table(O,1) %*% t(margin.table(O,2)))/sum(O)
```

`r kable(E)`

The $\chi^2$ statistic is computed as:

$$ \chi^2 = \sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}} = `r (chi.o <- sum((E-O)^2/E))`$$

The statistic is distributed according to the $\chi^2$ distribution with $df = n - p = (r-1)\cdot(c-1)$ degrees of freedom, where

- $n$ is the number of cells in the contingency table (`r length(E)`), and
- $p = r + c - 1$ is the number of estimated parameters (the marginal frequencies),

so that $df = (r-1)\cdot(c-1) = `r (df <- prod(dim(E)-1))`$.

The test is one-tailed: the observed $\chi^2$ statistic value (`r chi.o`) has to be compared to the critical value that leaves a probability $\alpha$ (5%) in the upper tail of the distribution, $\chi^2_{critical}=`r qchisq(.95,df)`$.

```{r,echo=FALSE,fig.width=5,fig.height=5}
chis = seq(0,20,.2)
plot(chis,dchisq(chis,df),t="l")
# observed statistic (green) vs. critical value (red)
segments(chi.o,0,chi.o,2,lty=2,col="green")
segments(qchisq(.95,df),0,qchisq(.95,df),2,lty=2,col="red")
```

Alternatively, we can directly compare the _p-value_, which is `r 1-pchisq(chi.o,df)`, to the reference $\alpha$ (5%).

For 2 x 2 contingency tables it is possible to use the **Fisher exact test**. The test considers all the possible tables having the same marginals as the observed one, and sums the probabilities of those that are at least as extreme (in terms of odds ratio) as the observed table.

We start with a 2x2 table, e.g. the one reporting the frequencies of the observed combinations of Q1 and Q2:

`r kable(table(data$Q1,data$Q2))`

The Fisher exact test checks the null hypothesis that the odds ratio is equal to 1. The test can be performed using the `fisher.test` function:

```{r}
fisher.test(table(data$Q1,data$Q2))
```

[^questionnaire]: The questionnaire consisted of the following items and the relative possible responses.

    Considering your knowledge before the previous lecture on the experimental method:

    Q1. Did you know what the scientific method was?

    - `Yes`
    - `No`

    Q2. Did you know the key role of falsification in the scientific method?

    - `Yes`
    - `No`

    Q3. What was your knowledge of logic argumentation?

    - Never heard of
    - Basic
    - Good
    - Expert

    Q4. What was your knowledge of statistical hypothesis testing?

    - Never heard of
    - Basic
    - Good
    - Expert

    In general, thinking about the experimental method:

    Q5. In the articles you will write for your PhD work, how often do you plan to use hypothesis testing?

    - ______ %

    Q6. What is your opinion about the empirical method?

    - It is the way to go
    - It is complex
    - It is useless
    - It is interesting
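As a complement to the Fisher exact test section above, the sample odds ratio can be computed directly from the 2x2 table as the cross-product ratio. The following is a minimal sketch; note that `fisher.test` reports a conditional maximum likelihood estimate of the odds ratio, which in general differs slightly from this simple sample estimate:

```{r}
ct <- table(data$Q1,data$Q2)
# sample odds ratio: cross-product ratio of the 2x2 table
(ct[1,1]*ct[2,2])/(ct[1,2]*ct[2,1])
```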