---
title: "Logistic regression"
---
*Last modified on `r Sys.Date()`*
```{r, echo = FALSE, message = FALSE}
source("libs/Common.R")
```
```{r Load fonts, echo=FALSE, eval=TRUE, warning=FALSE, message=FALSE}
# library(extrafont)
# font_import(prompt=FALSE, pattern="[A/a]rchitects")
# loadfonts(device="postscript")
```
-----
Packages used in this tutorial:
```{r packages, message=FALSE, results='hide'}
library(ggplot2) # Used for plotting data
library(dplyr) # Used for data manipulation
library(rms) # Used to extract p-value from logistic model
```
Another package used in this tutorial is `gdata`; rather than loading it, we will call its function directly from the package namespace (e.g. `gdata::mapLevels`) in section 2.
# Introduction
We'll be making use of median per-capita income data aggregated at the county level for the state of Maine. We will focus on the relationship between income and whether or not the county is on the coast.
```{r}
# Load dataset
dat <- read.csv("https://raw.githubusercontent.com/mgimond/Stats-in-R/gh-pages/Data/Income_and_education.csv", stringsAsFactors = TRUE)
# Limit the dataset to the two columns of interest
df <- select(dat, Coast, Income = Per.capita.income)
df
```
One approach to exploring this dataset is to see *how* per capita income varies as a function of the county's coastal status (i.e. whether or not the county borders the ocean). A t-test could be used to assess if incomes differ between coastal and non-coastal counties.
```{r, echo=FALSE, fig.height=1.8, fig.width=2.5}
ggplot(df, aes(x=Coast, y=Income)) + geom_boxplot()
```
Or, if one wanted to model that relationship, a [categorical regression analysis](regression.html#4_including_categorical_predictor_variables) could be implemented.
But what if we are interested in flipping the relationship? In other words, what if we wanted to see how the coastal status of a county relates to per capita income? More specifically, what if we wanted to see if county-level income could predict whether or not a county is on the coast? Visually, this relationship would look like:
```{r, fig.height=1.8, fig.width=2.5, echo=FALSE}
ggplot(df, aes(x=Income, y=Coast)) + geom_point(alpha=.5) + xlim(15000,35000)
```
This does not look like a *typical* scatter plot one sees in a regression analysis, but the relationship we are exploring is similar in concept--i.e. we are seeking a model of the form `Y = a + bX`. We could, of course, fit a `linear` model to the data as follows:
```{r, echo=FALSE, fig.height=1.8, fig.width=2.5}
ggplot(df, aes(x=Income, y=as.integer(Coast)-1)) + geom_point(alpha=.5) +
stat_smooth(method="lm", se=FALSE) + ylab("Coast") + scale_y_continuous(breaks=c(0,1), labels=c("no", "yes")) +
xlim(15000,35000)
M <- lm(as.integer(Coast) -1 ~ Income, df)
M.a <- sprintf("%0.2g",coef(M)[1])
M.b <- sprintf("%0.2g",coef(M)[2])
```
The model for the above fit is of the form *Coast = `r M.a` + `r M.b` Income*. Now, you may see a couple of issues with this model. For starters, the model implies that there are `Coast` values other than `yes` and `no` (e.g. what does the model return for an income value of $24,000?). In fact, the model is treating `Coast` as a numeric value where `no` is coded as `0` (no probability) and `yes` is coded as `1` (maximum probability). This makes sense when you re-frame the question along the lines of *what is the probability that the county is on the coast given the county's median per capita income?*
Another problem with the above model is that the straight line does a very poor job of *fitting* the data and, if we are treating the `Coast` axis as a probability limited to the range 0 to 1, the model implies that we can have a probability greater than `1` (e.g. an income value of $32,000 suggests a probability of about 1.17).
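To make these issues concrete, here is a quick sketch that refits the same straight line shown above and predicts the fitted values at those two income levels (`M.lin` is simply a name chosen for this check):
```{r}
# Refit the straight line with Coast recoded as 0/1
M.lin <- lm(as.integer(Coast) - 1 ~ Income, df)
# Predicted "probabilities" at $24,000 and $32,000 (the latter exceeds 1)
predict(M.lin, data.frame(Income = c(24000, 32000)))
```
A workaround is to fit a different model--one that is bounded by the minimum and maximum probabilities. Such a shape is called a **logistic curve**.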
```{r, echo=FALSE, fig.height=1.8, fig.width=2.5}
ggplot(df, aes(x=Income, y=as.integer(Coast)-1)) + geom_point(alpha=.5) +
stat_smooth(method="glm", se=FALSE, fullrange=TRUE,
method.args = list(family=binomial)) +
ylab("Coast") + scale_y_continuous(breaks=c(0,1), labels=c("no", "yes")) + xlim(15000,35000)
```
# The logistic regression model
The logistic regression model can be presented in one of two ways:
$$
log(\frac{p}{1-p}) = b_0 + b_1 x
$$
or, exponentiating both sides (noting that the `log` in the above equation is the *natural* log) and solving for `p`, we get,
$$
p = \frac{1}{1+e^{-(b_0 + b_1 x)}}
$$
where `p` is the probability of `y` occurring given a value `x`. In our example this translates to the probability of a county being on the coast given its median per capita income value. In the first equation, the fraction $\frac{p}{1-p}$ is referred to as the **odds**, which gives us the odds in favor of a `yes` (or `1` when represented using binomial values). The log of the *odds*, $log(\frac{p}{1-p})$, is referred to as the **logit**. Note that the probability can be computed from the odds as $\frac{odds}{1 + odds}$. Note too that, unlike a linear regression model, the logistic model has no error term.
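As a quick numeric illustration, here is a sketch that converts an assumed probability of 0.8 into odds and a logit, then recovers the probability from the odds:
```{r}
p <- 0.8              # an assumed probability, for illustration only
odds <- p / (1 - p)   # odds in favor of a "yes": 4
log(odds)             # the logit (natural log of the odds): ~1.39
odds / (1 + odds)     # recovering the probability from the odds: 0.8
```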
Whereas the linear regression parameters are estimated using the least-squares method, the logistic regression model parameters are estimated using the **maximum-likelihood** method. For our dataset, these parameters can be estimated in R using the `glm()` function as follows:
```{r}
M1 <- glm(Coast ~ Income, df, family = binomial)
M1
```
Thus, our model looks like:
$$
P_{coast} = \frac{1}{1+e^{-(-12.2 + 0.0005 Income)}}
$$
where $P_{coast}$ is the probability of a county being on the coast. To see what the relationship looks like for a range of income values, we can use the `predict()` function as follows:
```{r, fig.height=1.8, fig.width=2.5}
# Create a range of income values (we'll cover a wider range than the dataset)
# The range of values must be saved in a data frame and must have the same column
# name as that given in the original dataset
M.df <- data.frame(Income = seq(10000, 40000, 1000))
# Predict the Coast values (as a probability) using the above data
M.df$Coast <- predict(M1, newdata=M.df, type="response")
# Plot the modeled probability values
ggplot(M.df, aes(x=Income, y=Coast)) + geom_line()
```
Note how the logistic regression model converted the categorical variable `Coast` into a numeric one by assigning `0` to `no` and `1` to `yes`.
A simpler way to plot the model is to make use of `ggplot`'s `stat_smooth` function. However, this will require that we convert the `Coast` factor to numeric values manually since, unlike `glm`, `ggplot` will not do this for us automatically. One quick way to do this is to wrap the `Coast` factor with `as.numeric`:
```{r}
as.numeric(df$Coast)
```
Instead of seeing `yes`'s and `no`'s, we now have numbers (`1` and `2`). But which number is mapped to which factor? One easy way to map the levels is to use the `mapLevels` function from the package `gdata`.
```{r}
gdata::mapLevels(df$Coast)
```
The label `no` is mapped to `1` and the label `yes` is mapped to `2`.
However, since we are modeling the probability as a fraction that ranges from `0` to `1`, we will need to subtract `1` from the converted values as follows:
```{r}
as.numeric(df$Coast) - 1
```
So the label `no` is now mapped to `0` and the label `yes` is now mapped to `1`.
Next, we'll plot the values while making sure to map the numeric representation of `Coast` to the y-axis (and not the x-axis).
```{r, fig.height=1.8, fig.width=2.5}
ggplot(df, aes(x=Income, y=as.numeric(Coast) - 1)) +
geom_point(alpha=.5) +
stat_smooth(method="glm", se=FALSE, method.args = list(family=binomial)) +
ylab("Coast")
```
The logistic curve does not follow the complete sigmoid shape when limited to the original `Income` range. To see the full shape, we can increase the x-axis range using `xlim`, but this will also require that we instruct `stat_smooth` to extend the logistic curve to the new x-axis range by setting `fullrange` to `TRUE`.
```{r, fig.height=1.8, fig.width=2.5}
ggplot(df, aes(x=Income, y=as.numeric(Coast) - 1)) +
geom_point(alpha=.5) +
stat_smooth(method="glm", se=FALSE, fullrange=TRUE,
method.args = list(family=binomial)) +
ylab("Coast") + xlim(10000, 40000)
```
# Assessing the fit with a pseudo R^2^
> Note that even though many statistical software packages will compute a pseudo-R^2^ for logistic regression models, this measure of fit is not directly comparable to the R^2^ computed for linear regression models. In fact, some statisticians recommend against publishing pseudo-R^2^ values since they can be misinterpreted in a logistic model context.
To assess how well a logistic model fits the data, we make use of the **log-likelihood** method (this is similar to Pearson's correlation coefficient used with linear regression models). A *large* log-likelihood statistic indicates a poor fit (similar in idea to a large residual sum of squares statistic for a linear model). What we seek, therefore, is a *small* log-likelihood statistic. What constitutes a small or large statistic is determined by the log-likelihood statistic of a base model (aka *null* model) where *none* of the predictive terms are added to the equation, i.e.:
$$
p_{null} = \frac{1}{1+e^{-(b_0)}}
$$
In our working example, the log-likelihood statistic (labeled **-2LL** in some statistical packages) for the null model is,
```{r}
M1$null.deviance
```
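Equivalently, we can fit the null model explicitly as an intercept-only model; its deviance matches the `null.deviance` value above (`M0` is just a name chosen here):
```{r}
M0 <- glm(Coast ~ 1, df, family = binomial)
M0$deviance
```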
What we want is for the -2LL of the full model (i.e. the model with the `Income` predictor variable) to be smaller than that of the null model. To extract -2LL from the model, type:
```{r}
M1$deviance
```
This value is smaller than that of the null model--a good thing!
The difference between both log-likelihood values is referred to as the **model Chi-square**.
```{r}
modelChi <- M1$null.deviance - M1$deviance
```
Dividing the model Chi-square by the null log-likelihood value gives us one measure of the **pseudo R-square** (note that there is no exact way to compute the R-square value with a logistic regression model).
```{r}
pseudo.R2 <- modelChi / M1$null.deviance
pseudo.R2
```
In this working example, the model can account for `r round(pseudo.R2 * 100, 1)`% of the variability in the `Coast` variable. This pseudo R-square calculation is referred to as the **Hosmer and Lemeshow** R-square.
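Since this calculation only involves the two deviance values, it's easy to wrap in a small helper function (a hypothetical convenience, not part of any package):
```{r}
# Hypothetical helper: Hosmer and Lemeshow pseudo R-square from a glm object
hl.R2 <- function(mod) (mod$null.deviance - mod$deviance) / mod$null.deviance
hl.R2(M1)
```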
## Alternative pseudo R^2^
Here, we'll make use of the `rms` package's `lrm` function to compute another form of the pseudo R^2^ called the **Nagelkerke R^2^**.
```{r}
lrm(Coast ~ Income, df)
```
Note how this value of `r round(lrm(Coast ~ Income, df)$stats["R2"],2)` differs from that of the *Hosmer and Lemeshow* R^2^ whose value is `r round(pseudo.R2,2)`.
# Assessing the significance
## Model significance
A p-value for the logistic model can be approximated (note that it is difficult to associate an exact p-value with a logistic regression model).
First, pull the difference in degrees of freedom between the null and full model:
```{r}
Chidf <- M1$df.null - M1$df.residual
```
Then, compute the p-value using the chi-square statistic. This pseudo p-value is also called the **likelihood ratio p-value**.
```{r}
chisq.prob <- 1 - pchisq(modelChi, Chidf)
chisq.prob
```
If the p-value is small then we can reject the null hypothesis that the current model does not improve on the base model. Here, the p-value is `r round(chisq.prob, 2)` suggesting that the model is a significant improvement over the base model.
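As a cross-check, the same likelihood ratio test can be run with base R's `anova()` function by requesting a chi-square test:
```{r}
anova(M1, test = "Chisq")
```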
## Parameter significance
To assess the significance of a parameter as it compares to the base model, simply wrap the model object with the `summary` function.
```{r}
summary(M1)
```
The `Income` coefficient p-value is 0.036.
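If you need that p-value programmatically (e.g. to report it inline), it can be pulled from the coefficient table returned by `summary`:
```{r}
summary(M1)$coefficients["Income", "Pr(>|z|)"]
```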
# Multi-variable model
So far, we've worked with a single-variable model. We can augment the model by adding more variables. For example, we will add the fraction of the population that has attained a bachelor's degree to the model (for pedagogical purposes, we'll ignore the possibility of collinearity between the variables).
The entire workflow follows:
```{r}
# Grab variables of interest
df2 <- select(dat, Coast, Income = Per.capita.income, Edu = Fraction.with.Bachelor.s.or.greater)
# Run regression model
M2 <- glm(Coast ~ Income + Edu, df2, family = binomial)
# Compute pseudo R-square
modelChi <- M2$null.deviance - M2$deviance
pseudo.R2 <- modelChi / M2$null.deviance
pseudo.R2
# Compute the pseudo p-value
Chidf <- M2$df.null - M2$df.residual
modelChi <- M2$null.deviance - M2$deviance
1 - pchisq(modelChi, Chidf)
# Assess each parameter's significance
summary(M2)
```
Note the change in the `Income` coefficient p-value when adding another variable that may well be explaining the same variability in `Coast` (i.e. `Income` and `Edu` are very likely correlated).
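We can check that suspicion with a quick correlation between the two predictors:
```{r}
cor(df2$Income, df2$Edu)
```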
-----
**Session Info**:
```{r,results='asis', echo=FALSE}
#detach("package:extrafont")
pander::pander(sessionInfo(), locale = FALSE)
```