# 3. Logistic regression
## by Md Karim Uddin
**Part 1**
```{r}
date()
```
```{r}
# read the joined student alcohol consumption data
alc <- read.table("alc.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
```
Dimensions of the data:
``` {r}
dim(alc)
```
It can be seen that the dataset has 382 observations and 35 variables.
``` {r}
str(alc)
```
Initialize a plot of high_use and G3 (the final grade):
``` {r}
library(ggplot2)
g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex))
g1 + geom_boxplot() + ylab("grade")
```
``` {r}
g2 <- ggplot(alc, aes(x = high_use, y = absences, col = sex))
# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ggtitle("Student absences by alcohol consumption and sex")
```
# Logistic regression model
``` {r}
m <- glm(high_use ~ failures + absences + sex, data = alc, family = "binomial")
# print out a summary of the model
summary(m)
```
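The estimates in the summary are on the log-odds scale; exponentiating them turns them into odds ratios, which are easier to interpret and are computed next.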
``` {r}
# print out the model coefficients (on the log-odds scale)
coef(m)
```
``` {r}
library(magrittr)
library(knitr)
```
``` {r}
m <- glm(high_use ~ failures + absences + sex, data = alc, family = "binomial")
```
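The odds-ratio table below can be produced with code along these lines (a minimal sketch, assuming the usual approach of exponentiating the coefficients and their confidence intervals and printing the result with `kable`):

```{r}
# odds ratios: exponentiated coefficients of the model
OR <- coef(m) %>% exp
# exponentiated 95% confidence intervals of the coefficients
CI <- confint(m) %>% exp
cbind(OR, CI) %>% kable()
```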
|             | OR        | 2.5 %      | 97.5 %   |
|-------------|-----------|------------|----------|
| (Intercept) | 0.1491257 | 0.09395441 | 0.228611 |
| failures    | 1.5695984 | 1.08339644 | 2.294737 |
| absences    | 1.0977032 | 1.05169654 | 1.150848 |
| sexM        | 2.5629682 | 1.60381392 | 4.149405 |
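All three predictors have odds ratios above 1 and confidence intervals that exclude 1: more past class failures, more absences, and being male are each associated with higher odds of high alcohol consumption. For example, the odds of high use for male students are roughly 2.6 times those for female students.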
Predictive power of the model, part 1
``` {r}
library(dplyr)
m <- glm(high_use ~ failures + absences + sex, data = alc, family = "binomial")
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# see the last ten original classes, predicted probabilities, and class predictions
select(alc, failures, absences, sex, high_use, probability, prediction) %>% tail(10)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
```
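As a single-number summary of this table, the share of correctly classified students can be computed directly (a quick check added here, not part of the original chunk):

```{r}
# training-set accuracy: proportion of students whose high_use status is predicted correctly
mean(alc$high_use == alc$prediction)
```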
Predictive power of the model, part 2
``` {r}
#initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
```
Accuracy and loss functions
```{r}
# the logistic regression model m and the dataset alc with predictions are available
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# compute the proportion of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
```
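Since `high_use` is logical, it is coerced to 0/1 in the subtraction, so `abs(class - prob) > 0.5` is TRUE exactly for the observations whose predicted probability falls on the wrong side of the true class. The mean of these indicators is the training error, i.e. the complement of the accuracy above.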
# 10-fold cross-validation
``` {r}
library(boot)
# perform 10-fold cross-validation, using the loss function defined above as the cost
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average proportion of wrong predictions across the validation folds
cv$delta[1]
```
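This is the proportion of misclassified students averaged over the ten validation folds. Because the folds are drawn at random, the value changes slightly from run to run; a minimal sketch (an addition here, with an arbitrary seed) makes the knitted result reproducible:

```{r}
# fix the random seed so the 10-fold split, and hence the CV error, is reproducible
set.seed(2017)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
```

Comparing this out-of-sample error with the training error computed earlier shows how well the model generalises beyond the data it was fitted on.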