# Single linear regression
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("complete")
```
## Overview
Single linear regression is an appropriate way to analyse data when:

- You have two continuous variables.
- One of the variables is explanatory and the other is a response.
  That is, one variable, the $x$, "causes" the $y$.
- The explanatory variable has been chosen, set or manipulated and the
  other variable is the measured response. This is sometimes described
  as the $x$ being "sampled without error".
- The response variable, $y$, is randomly sampled for each $x$ with a
  normal distribution and those normal distributions have the same
  variance.
- The relationship between the variables is linear.
Applying a single linear regression to data means putting a line of best
fit through it. The intercept and the slope of the true population
relationship are estimated from the sample you have. We test whether
those two parameters differ significantly from zero.
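In symbols, the model is a straight line plus normally distributed
scatter. One common way to write it is:

$$
y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2)
$$

where $\beta_0$ is the intercept, $\beta_1$ is the slope and
$\epsilon_i$ is the "residual" for observation $i$.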
### Reporting
Reporting [the significance of effect, direction of effect, magnitude of
effect](https://3mmarand.github.io/comp4biosci/what_statistical_model.html#reporting)
for a single linear regression means making the following clear to the
reader:
1. the significance of effect - whether the slope is significantly
different from zero
2. the direction of effect - whether the slope is positive or negative
3. the magnitude of effect - the slope itself
Figures should reflect what you have said in the statements. Ideally
they should show both the raw data and the statistical model.
We will explore all of these ideas with an example.
## 🎬 Your turn!
If you want to code along you will need to start a new [RStudio
project](workflow_rstudio.html#rstudio-projects), add a `data-raw`
folder and open a new script. You will also need to load the
**`tidyverse`** package [@tidyverse].
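For example, the first line of your script might be (this assumes
**`tidyverse`** is already installed):

```{r}
#| eval: false
library(tidyverse)
```
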
## Single linear regression
Three replicate water baths were set up at each of five temperatures
(10, 11, 12, 13 and 14 °C). Ten brine shrimp (*Artemia salina*) were
placed in each and their average respiration rate per water bath was
measured (in arbitrary units). The data are in
[shrimp.txt](data-raw/shrimp.txt).
### Import and explore
Import the data:
```{r}
shrimp <- read_table("data-raw/shrimp.txt")
```
```{r}
#| echo: false
knitr::kable(shrimp) |>
kableExtra::kable_styling() |>
kableExtra::scroll_box(height = "200px")
```
These data are in tidy format [@Wickham2014-nl] - all the respiration
values are in one column with another column indicating the water bath
temperature. There is only one water bath per row. This means they are
well formatted for analysis and plotting.
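If you want to check the structure before plotting, `glimpse()` (loaded
with the tidyverse) gives a compact view of the columns and their types.
This step is optional:

```{r}
glimpse(shrimp)
```
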
In the first instance, it is sensible to create a rough plot of our data
(See @fig-shrimp-rough). Plotting data early helps us in multiple ways:

- it helps identify whether there are any missing or extreme values
- it allows us to see if the relationship is roughly linear
- it tells us whether any relationship is positive or negative

Scatter plots (`geom_point()`) are a good choice for exploratory
plotting with data like these.
```{r}
#| label: fig-shrimp-rough
#| fig-cap: "A default scatter plot of the relationship between temperature and respiration rate in brine shrimp is enough for us to see that respiration rate seems to increase with temperature approximately linearly."
ggplot(data = shrimp,
aes(x = temperature, y = respiration)) +
geom_point()
```
The figure suggests that respiration rate increases with temperature and
there are no particularly extreme values. We can also see that any
relationship is roughly linear.
### Do a regression with `lm()`
We can create a single linear regression model like this:
```{r}
mod <- lm(data = shrimp, respiration ~ temperature)
```
And examine the model with:
```{r}
summary(mod)
```
```{r}
#| echo: false
b0 <- mod$coefficients[1] |> round(2)
b1 <- mod$coefficients[2] |> round(2)
```
What do all these results mean?

The `Estimate` column in the `Coefficients` table gives:

- the `(Intercept)`, known as $\beta_0$, which is the value of the *y*
  (the response) when the value of *x* (the explanatory) is zero.
- the slope, labelled `temperature` and known as $\beta_1$, which is the
  amount of *y* you add for each unit of *x*. The estimate for
  `temperature` is positive so respiration rate increases with
  temperature.
@fig-regression-lm-model shows the model and its parameters.
The *p*-values on each line are tests of whether that coefficient is
different from zero. Thus it is:
`temperature 0.91850 0.09182 10.003 1.79e-07 ***`
that tells us the slope is significantly different from zero and thus
there *is* a significant relationship between temperature and
respiration rate.
The *F* value and *p*-value in the last line are a test of whether the
model as a whole explains a significant amount of variation in the
response variable. For a regression, this is exactly equivalent to the
test of the slope against zero and the two *p*-values will be the same.
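If you would like to see that *F*-test laid out as an ANOVA table, or
confidence intervals for the two coefficients, the base R functions
`anova()` and `confint()` can be applied directly to the model object.
This is an optional extra, not a required part of the analysis:

```{r}
anova(mod)
confint(mod)
```
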
```{r}
#| echo: false
#| label: fig-regression-lm-model
#| fig-cap: "In a linear model, the first estimate is the intercept and the second estimate is the 'slope'."
ggplot(data = shrimp,
aes(x = temperature, y = respiration)) +
geom_point(colour = "black") +
  geom_abline(intercept = b0,
              slope = b1,
              colour = pal3[1], linetype = 2) +
geom_smooth(method = "lm",
se = FALSE,
colour = pal3[1]) +
scale_x_continuous(expand = c(0,0),
limits = c(0, 15),
name = "Temperature (C)") +
scale_y_continuous(expand = c(0,0),
                     limits = c(0, 20),
name = "Respiration (units)") +
geom_segment(aes(x = 2,
xend = 0,
y = b0 + 6,
yend = b0),
colour = pal3[2]) +
annotate("text",
x = 3,
y = b0 + 6.5,
label = glue::glue("Intercept (β0) is { b0 }"),
colour = pal3[2],
size = 4) +
geom_segment(aes(x = 9,
xend = 9,
y = b0 + b1 * 8,
yend = b0 + b1 * 9),
colour = pal3[3]) +
geom_segment(aes(x = 8,
xend = 9,
y = b0 + b1 * 8,
yend = b0 + b1 * 8),
colour = pal3[3]) +
geom_segment(aes(x = 9,
xend = 12,
y = b0 + b1 * 8.5,
yend = b0 + 1.3),
colour = pal3[2]) +
annotate("text",
x = 12,
y = b0 + 1,
label = glue::glue("slope (β1) is { b1 }"),
colour = pal3[2],
size = 4) +
theme_classic()
```
### Check assumptions

Check the assumptions: all general linear models assume the "residuals"
are normally distributed and have "homogeneity" of variance.

Our first check of these assumptions is to use common sense: respiration
is a continuous variable and we would expect it to be normally
distributed, and thus we would expect the residuals to be normally
distributed.
We then proceed by plotting residuals. The `plot()` function can be used
to plot the residuals against the fitted values (See
@fig-regression-plot1). This is a good way to check for homogeneity of
variance.
```{r}
#| label: fig-regression-plot1
#| fig-cap: "A plot of the residuals against the fitted values shows no obvious pattern as the points are roughly evenly distributed around the line. This is a good sign for the assumption of homogeneity of variance."
plot(mod, which = 1)
```
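A normal quantile-quantile (Q-Q) plot of the residuals is another widely
used check of the normality assumption. `plot()` will produce one with
`which = 2`; this is optional here because we also check normality with
a histogram and a Shapiro-Wilk test below.

```{r}
plot(mod, which = 2)
```
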
We can also use a histogram to check for normality (See
@fig-regression-plot2).
```{r}
#| label: fig-regression-plot2
#| fig-cap: "A histogram of residuals is symmetrical and seems consistent with a normal distribution. This is a good sign for the assumption of normally distributed residuals."
ggplot(mapping = aes(x = mod$residuals)) +
geom_histogram(bins = 5)
```
Finally, we can use the Shapiro-Wilk test to test for normality.
```{r}
shapiro.test(mod$residuals)
```
The *p*-value is greater than 0.05 so this test of the normality
assumption is not significant. Note that "not significant" means not
significantly different from a normal distribution. It does not mean
definitely normally distributed.
Taken together, these results suggest that the assumptions of normality
and homogeneity of variance are not violated.
### Report
Temperature explained a significant amount of the variation in
respiration rate (ANOVA: *F* = 67; *d.f*. = 1, 13; *p* \< 0.001). The
regression line is: Respiration rate = `r b0` + `r b1` \* temperature.
See @fig-shrimp.
::: {#fig-shrimp}
```{r}
#| code-fold: true
ggplot(data = shrimp,
aes(x = temperature, y = respiration)) +
geom_point(size = 2) +
geom_smooth(method = "lm",
se = FALSE,
colour = "black") +
scale_x_continuous(expand = c(0,0),
limits = c(0, 15.5),
name = "Temperature (C)") +
scale_y_continuous(expand = c(0,0),
limits = c(0, 20),
name = "Respiration (units)") +
theme_classic()
```
**Respiration rate of *Artemia salina* increases with temperature**.
Three replicate water baths were set up at each of five temperatures
(10, 11, 12, 13 and 14 °C). Ten brine shrimp (*Artemia salina*) were
placed in each and their average respiration rate per water bath was
measured. There was a significant effect of altering water bath
temperature on the average respiration rate (ANOVA: *F* = 67; *d.f*. =
1, 13; *p* \< 0.001). The regression line is: Respiration rate =
`r b0` + `r b1` \* temperature. Data analysis was conducted in R
[@R-core] with tidyverse packages [@tidyverse].
:::
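If you wanted to use the fitted model, `predict()` will estimate the
respiration rate at a temperature that was not directly sampled. The
value of 12.5 °C below is just an illustrative choice within the range
of the data:

```{r}
predict(mod, newdata = data.frame(temperature = 12.5))
```

Predictions are most trustworthy within the range of temperatures that
were actually measured.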
## Summary
1. Single linear regression is appropriate when you have one
   continuous explanatory variable and one continuous response and the
   relationship between the two is linear.
2. Applying a single linear regression to data means putting a line of
   best fit through it. We estimate the **coefficients** (also called
   the **parameters**) of the model. These are the intercept, $\beta_0$,
   and the slope, $\beta_1$. We test whether the parameters differ
   significantly from zero.
3. We can use `lm()` to fit a linear regression.
4. In the output of `lm()` the coefficients are listed in the
   `Estimate` column of the coefficients table. The *p*-value for each
   coefficient is a test of whether it differs from zero. At the
   bottom of the output there is a test of the model *overall*. In a
   single linear regression this is exactly the same as the test of
   $\beta_1$ and the *p*-values are identical. The R-squared value is
   the proportion of the variance in the response variable that is
   explained by the model.
5. The assumptions of the general linear model are that the residuals
are normally distributed and have homogeneity of variance. A
residual is the difference between the predicted value and the
observed value.
6. We examine a histogram of the residuals and use the Shapiro-Wilk
normality test to check the normality assumption. We check the
variance of the residuals is the same for all fitted values with a
residuals vs fitted plot.
7. If the assumptions are not met, we might need to transform the data
or use a different type of model.
8. When reporting the results of a regression we give the significance,
direction and size of the effect. Often we give the equation of the
   best fitting line. A figure should show the data and the line of
   best fit.