# Linear Models {#chap7}
Consider R. A. Fisher's classic Iris data set, consisting of several morphological measurements on the flowers of three species of _Iris_. This data set is famous in statistics and is often used as test data for a wide variety of statistical methods. Fisher was probably the most important statistician __and__ evolutionary biologist of the 20th century. He laid the foundations for our modern practice of statistics and he was a great evolutionary geneticist, partly responsible for inventing population genetics and establishing that natural selection could indeed be a powerful force in evolution. He was also a terrible grump and had lots of fights with other scientists. Although I like to find out about my idols, Fisher is someone that I am glad not to have met! Back to the data:
```{r, fig.cap="An Iris"}
knitr::include_graphics("sepalsPetals.jpg")
```
```{r, fig.cap="R. A. Fisher"}
knitr::include_graphics("fisher-smiling-50.jpg")
```
```{r}
data(iris)
head(iris)
tail(iris)
```
Here we have loaded the data frame called ``iris``. It is built into R, which is convenient for us. The ``head()`` and ``tail()`` functions show the first 6 and the last 6 lines of the data frame, allowing us to see the variable names and check that the data frame is usable.
```{r, fig.cap="Iris setosa"}
knitr::include_graphics("Kosaciec_szczecinkowaty_Iris_setosa.jpg")
```
```{r, fig.cap="Iris virginica"}
knitr::include_graphics("220px-Iris_virginica.jpg")
```
```{r, fig.cap="Iris versicolor"}
knitr::include_graphics("220px-Iris_versicolor_3.jpg")
```
```{r}
summary(iris)
```
Here we have a six-number summary of each of the numeric variables: the minimum, first quartile, median, mean, third quartile and maximum (for the factor ``Species``, we get the counts in each category instead). Together these numbers tell us a little about the distribution of the data (but not much).
## Graphing the Iris data
```{r, fig.height=2.5, fig.width=3.75}
library(ggplot2)
ggplot(aes(x = Species, y = Petal.Length), data = iris) +
geom_point() + ylim(0, 8)
```
Here we have a plot of the Petal Length for each species. The dots overlap a lot and it is hard to see what is going on, although you can see the maxima and minima quite clearly for each species.
```{r, fig.height=2.5, fig.width=3.75}
ggplot(aes(x = Species, y = Petal.Length), data = iris) +
geom_jitter(width = 0.4, height = 0) + ylim(0, 8)
```
Here are the same data but we have "jittered" the data along the x-axis, allowing us to see each data point separately from the others. This is a good way to display small amounts of data. Note that you should only jitter the data in the x direction in this case. Jittering the y data is wrong! (why?)
```{r, fig.height=2.5, fig.width=3.75}
ggplot(aes(x = Species, y = Petal.Length), data = iris) +
geom_boxplot() + ylim(0, 8)
```
If we have a larger amount of data, a box and whisker plot can be useful. It plots the median and interquartile range, and the outer "whiskers" extend to the most extreme data points lying within 1.5 times the interquartile range of the box. More extreme points (potential outliers) are plotted individually.
## Good Resources for Graphing
We won't have much time to go into advanced graphing techniques in this course, although I will try to highlight graphing methods as we go along. Fortunately, there are some good books to consult. They are online through the library:
- _R Graphics Cookbook_ by Chang
- _ggplot2: Elegant Graphics for Data Analysis_ by Wickham
## Analysing the iris data
How would you analyse these data? Since we have 3 distinct groups (species) to compare and the data are continuous and (probably) normally-distributed, the Analysis of Variance (ANOVA) might be your first choice. Here is an analysis for petal length in R:
```{r}
fit <- aov(Petal.Length ~ Species, data = iris)
summary(fit)
```
We see that there is evidence for significant variation in petal length among species. We reject the null hypothesis that there are no differences in petal length among the three species. Note that this is a very vague conclusion. We don't know which species differ in their petal length from the others, and we don't know how big the differences really are. A thorough analysis would establish this information.
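One standard follow-up (shown here only as a sketch; the text does not pursue it further) is Tukey's Honest Significant Difference procedure, which estimates every pairwise difference between the species means with adjusted confidence intervals:
```{r}
# Pairwise comparisons of species means, based on the ANOVA fit above
TukeyHSD(fit)
```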
## Model Diagnostics
When building a statistical model, checking the model using diagnostic plots is a crucial step. All models have assumptions and if these assumptions are violated then the model will potentially give wrong results. For the ANOVA model, there are several important assumptions:
- The data are independent
- The residuals follow a Normal distribution
- The variance of the residuals is constant over the range of the data (homoscedastic errors).
While all three assumptions are important, the first and third are crucial. It turns out that the ANOVA model is fairly robust to departures from Normality of the residuals, but you should still check this assumption, as large departures from Normality can cause problems. I usually rely on graphical approaches to checking assumptions:
```{r, fig.height=4, fig.width=8}
par(mfrow=c(1,2))
plot(fit, which = 1:2)
```
The first diagnostic plot above plots the residuals against the fitted values. If the assumption of homoscedasticity holds (try saying that 5 times fast!), there will be no pattern in the residuals. If you see a curve, either upwards or downwards, that is suggestive of curvature in your data. If there is a "trumpet"-like effect, increasing to the right or left, or the residual plot has a "bow tie" shape (i.e. narrower in the middle), then the data are not homoscedastic. The second plot is a Normal quantile-quantile plot of the residuals: if the residuals are approximately Normal, the points will fall close to the straight line.
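If you would like a numerical check to go alongside the plots (an optional extra, not part of the original analysis), base R provides Bartlett's test of equal variances across groups:
```{r}
# Formal test of homogeneity of variances across species
bartlett.test(Petal.Length ~ Species, data = iris)
```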
## Subset and Recode the Data
```{r}
iris2 <- subset(iris, Species != "setosa")
```
Hypothesis: There exists a difference between the species' means. How would you analyse these data?
- t-test?
- ANOVA?
- Regression?
- All three!!!
## Analysis using t test
```{r}
fit.ttest <- t.test(Petal.Length ~ Species,
data = iris2, var.equal = TRUE)
print(fit.ttest)
```
## Analysis using ANOVA
```{r}
fit.anova <- aov(Petal.Length ~ Species, data = iris2)
summary(fit.anova)
```
## Analysis using Regression
```{r}
fit.regression <- lm(Petal.Length ~ Species, data=iris2)
summary(fit.regression)
```
## Points to Note
- For an F test with __one__ degree of freedom in the numerator, $F = t^2$, with the denominator degrees of freedom for the F statistic equal to the degrees of freedom for the t statistic. (A quick numerical check of this follows the next code chunk.)
- We can get the regression output from the ANOVA fit using ``summary.lm()``:
```{r}
summary.lm(fit.anova)
```
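As promised above, here is a quick numerical check (a sketch using the objects fitted earlier in this chapter) that the squared t statistic equals the F statistic for this two-group comparison:
```{r}
# t statistic squared from the t test ...
unname(fit.ttest$statistic^2)
# ... equals the F statistic from the ANOVA table
summary(fit.anova)[[1]][["F value"]][1]
```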
## Regression Plot
```{r, fig.height=3.2, fig.width=5}
iris2$coded <- ifelse(iris2$Species == "versicolor", 0, 1)  # 0/1 dummy variable
options(warn = -1)  # suppress warnings caused by the NA placeholders below
mns <- tapply(iris2$Petal.Length, iris2$coded, mean)        # group means
df <- data.frame(Petal.Length = mns, coded = 0:1,
                 lower = c(NA, mns[1]), upper = c(NA, mns[2]))
ggplot(data = iris2, aes(coded, Petal.Length)) +
  geom_point() +
  geom_smooth(method = lm, color = "red", se = FALSE, formula = y ~ x) +
  geom_errorbar(data = df, mapping = aes(x = coded, ymin = lower,
    ymax = upper), width = 0.05, col = "blue") + ylim(2.5, 7.5) + xlim(0, 1) +
  geom_text(label = paste(expression(beta[1])), x = 0.5, y = 4.5, parse = TRUE) +
  geom_text(label = paste(expression(beta[1])), x = 0.95, y = 5, parse = TRUE) +
  geom_text(label = paste(expression(beta[0])), x = 0.05, y = 4, parse = TRUE) +
  geom_text(label = paste(expression(Petal.Length == beta[0] + beta[1] %*% Species)),
            x = 0.5, y = 7, parse = TRUE)
```
## Summary
- If we code our categorical variable as 0, 1, then the slope $\beta_1$ is the same as the difference between the means of the two categories (a numerical check appears after the next code chunk).
- The intercept $\beta_0$ is the mean for the baseline (0) category.
- With 3 or more categories, we construct the 0, 1 ``dummy variables``:
```{r}
cat.var <- LETTERS[1:3]
model.matrix(~cat.var)
```
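Returning to the two-species data, here is the promised check (a sketch using ``iris2``, ``fit.regression`` and the ``coded`` variable created earlier) that the slope equals the difference between the group means:
```{r}
# Group means for the coded species (0 = versicolor, 1 = virginica)
group.means <- tapply(iris2$Petal.Length, iris2$coded, mean)
unname(group.means[2] - group.means[1])  # difference between the means
coef(fit.regression)[2]                  # slope from the regression: same value
```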
## Linear Models
- Comparing groups or doing regression are __both__ examples of the General Linear Model.
- Regression: $y_i = \beta_0 + \beta_1x_i + \epsilon_i, \epsilon \sim NID(0, \sigma^2)$
- ANOVA: $y_{ij} = \mu + \tau_i + \epsilon_{ij}, \epsilon \sim NID(0, \sigma^2)$
- Why "Linear"? Definition: a linear transformation (equation) is defined by these 2 properties:
    - $f(x + y) = f(x) + f(y)$: Additivity
    - $f(Ax) = Af(x)$: Homogeneity
    - Idea: the whole is equal to the sum of its parts
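For example, $f(x) = 3x$ is linear: $f(x + y) = 3(x + y) = 3x + 3y = f(x) + f(y)$ and $f(Ax) = 3Ax = A(3x) = Af(x)$. By contrast, $f(x) = x^2$ is not linear, because $(x + y)^2 \neq x^2 + y^2$ in general.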
Linear models are the workhorse of all statistical modelling; they crop up everywhere. In particular, the computational ease with which linear models can be analysed has shaped the standard procedures for the design of experiments. The guidelines for the design of experiments in Chapter 6 exist precisely so that we can apply a linear modelling framework to the experimental system. This is why we need to spend so much time on linear models. An appreciation of the mathematics is very useful for understanding their uses, and their similarities to and differences from one another.
## Matrix Notation for Linear Models
The mathematical formulation of a linear regression model (perhaps the simplest linear model) is below:
$Y_i= \beta_0 + \beta_1 X_i + \epsilon_i, \epsilon_i \sim NID(0, \sigma^2)$
The $Y_i$'s are the values of the response variable. The $\beta$'s are the intercept and slope parameters; it is these that we would like to estimate. The $X_i$'s are the data for the explanatory variable. $\sigma^2$ is the error variance: the variance of the residuals around the line. A residual is simply the vertical distance (in the y direction) between a data point and the fitted line. We can write down an equation for each data point:
$$
Y_1 = \beta_0 + \beta_1 X_1 + \epsilon_1 \\
Y_2 = \beta_0 + \beta_1 X_2 + \epsilon_2 \\
Y_3 = \beta_0 + \beta_1 X_3 + \epsilon_3 \\
\vdots \qquad \vdots \qquad \vdots \\
Y_n = \beta_0 + \beta_1 X_n + \epsilon_n
$$
We can form column vectors for the elements of each equation:
$$
\begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_n \end{bmatrix} =
\begin{bmatrix} \beta_0 + \beta_1X_1 \\
\beta_0 + \beta_1X_2 \\
\beta_0 + \beta_1X_3 \\
\vdots \\
\beta_0 + \beta_1X_n
\end{bmatrix} +
\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \vdots \\ \epsilon_n \end{bmatrix}
$$
## Matrix Representation
We can then factor out the $\beta$ terms from the middle vector.
$$
\begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_n \end{bmatrix} =
\begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ 1 & X_3 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix} \times
\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} +
\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \vdots \\ \epsilon_n \end{bmatrix}
$$
We can then write down the matrix equation for the linear model:
$$
\mathbf{Y}= \mathbf{X}\mathbf{\beta}+\mathbf{\epsilon}
$$
where
$$
\mathbf{X} = \begin{bmatrix}
1 & X_1 \\
1 & X_2 \\
1 & X_3 \\
\vdots & \vdots \\
1 & X_n \end{bmatrix} \qquad
\mathbf{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} \qquad
\mathbf{\epsilon} = \begin{bmatrix}
\epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \vdots \\ \epsilon_n \end{bmatrix} \qquad
\mathbf{Y} = \begin{bmatrix}
Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_n \end{bmatrix}
$$
Here we have $\mathbf{X}$ which is called the __design matrix__. $\beta$ is the __vector of parameters__ (to be estimated). $\mathbf{\epsilon}$ is the __vector of errors__. $\mathbf{Y}$ is the __vector of responses__. In this, the General Linear Model, $\mathbf{X}$ can have more than 2 columns. The first column is usually a vector of 1's (corresponding to the intercept). The other columns contain the explanatory variables, which may be continuous (leading to __multiple regression__) or discrete (usually 0, 1), leading to the __Analysis of Variance__ (ANOVA). Mixtures of continuous and discrete explanatory variables lead to the __Analysis of Covariance__ (ANCOVA) and other more complicated models.
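To see what a design matrix looks like in practice, ``model.matrix()`` builds $\mathbf{X}$ from a model formula (a brief illustration using the built-in ``iris`` data; not part of the original analysis):
```{r}
# Continuous explanatory variable: a column of 1's plus the covariate (regression)
head(model.matrix(~ Petal.Width, data = iris))
# Discrete explanatory variable: a column of 1's plus 0/1 dummy variables (ANOVA)
head(model.matrix(~ Species, data = iris))
```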
## Distributional Assumptions
Finally, we have to specify the assumptions regarding the statistical distribution of the residuals. This will usually be a single number, given the symbol $\sigma^2$, specifying the variance of the residuals. If we want to specify the variance for each data value, we can write:
$$
\sigma^2_\epsilon = \sigma^2\mathbf{I} = \begin{bmatrix}
\sigma^2 & 0 & \cdots & 0 \\
0 & \sigma^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma^2
\end{bmatrix}
$$
where $\mathbf{I}$ is the identity matrix. Thus, each data point has a variance of $\sigma^2$ and the off-diagonals are all zero, signifying that there is __no correlation__ between the data points. We can write this as:
$$
\epsilon \sim N(0, \sigma^2\mathbf{I})
$$
We say that the epsilons ($\mathbf{\epsilon}$) are __distributed as__ a Normal distribution with zero mean and variance $\sigma^2\mathbf{I}$. __N__ is the usual symbol for the Normal distribution and the "~" means "is distributed as."
This leads to the Full model:
$\Huge\textbf{Y} = \mathbf{X\beta} + \mathbf{\epsilon}, \quad \mathbf{\epsilon} \sim N(0, \sigma^2\mathbf{I})$
__This equation is worth remembering.__
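To make the equation concrete, here is a minimal simulation from the model, using made-up values for $\mathbf{X}$, $\mathbf{\beta}$ and $\sigma^2$ (purely illustrative; not part of the original text):
```{r}
set.seed(1)
n <- 20
X <- cbind(1, runif(n, 0, 10))          # design matrix: intercept column plus one covariate
beta <- c(2, 0.5)                       # assumed "true" parameter values
epsilon <- rnorm(n, mean = 0, sd = 1)   # independent Normal errors, sigma^2 = 1
Y <- X %*% beta + epsilon               # simulated responses
head(cbind(Y, X))
```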
## Estimation: Ordinary Least Squares
We wish to estimate $\beta$ and $\sigma^2$. This is usually done by the method of Ordinary Least Squares (OLS). The derivation of these estimators is beyond the scope of the course. You may have seen the formula for $\hat{\beta}$ before: it was the last exercise in the matrix problem sheet.
$$
\hat{\beta} = (\mathbf{X^TX})^{-1}\mathbf{X^T} \mathbf{Y} \\
\hat{\sigma}^2 = \frac{1}{n - p}(\mathbf{Y} - \mathbf{X\hat{\beta}})^T (\mathbf{Y} - \mathbf{X\hat{\beta}})
$$
where $\hat{\sigma}^2$ is the unbiased estimator of $\sigma^2$, $n$ is the sample size, and $p$ is the number of estimated coefficients ($p = 2$ when we are estimating the slope and the intercept).
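As a check (a sketch using ``fit.regression`` and ``iris2`` from earlier in this chapter), we can evaluate these formulas directly with R's matrix operations and compare the results with the output of ``lm()``:
```{r}
X <- model.matrix(fit.regression)        # design matrix used by the fitted model
Y <- iris2$Petal.Length                  # vector of responses
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% Y   # (X'X)^{-1} X'Y
beta.hat
coef(fit.regression)                     # matches beta.hat
n <- nrow(X); p <- ncol(X)
resids <- Y - X %*% beta.hat             # vector of residuals
sigma2.hat <- as.numeric(t(resids) %*% resids) / (n - p)
sigma2.hat
summary(fit.regression)$sigma^2          # matches sigma2.hat
```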