forked from eamonnbolger/DAS2022-Group-07
-
Notifications
You must be signed in to change notification settings - Fork 0
/
GLM.Rmd
176 lines (134 loc) · 6.36 KB
/
GLM.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
---
title: "GLM"
author: "DAS Group 07"
date: "19/03/2022"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r libraries, include=FALSE,echo=TRUE}
library(tidyverse)
library(kableExtra)
library(knitr)
library(skimr)
library(ggcorrplot)
library(gridExtra)
library(countrycode)
library(car)
library(moderndive)
library(jtools)
```
```{r data, echo= FALSE, eval = TRUE}
# import and process data
coffee <- read.csv('dataset7.csv')
```
```{r process, echo= FALSE, warning=FALSE}
coffee <- as_tibble(coffee)
coffee <- coffee %>%
rename(altitude = altitude_mean_meters) %>%
rename(defects = category_two_defects) %>%
rename(country = country_of_origin)
# removing the two data errors for altitude
coffee$altitude <- ifelse(coffee$altitude > 8000, NA, coffee$altitude)
#removing na obs
coffee <- coffee %>% na.omit()
coffee <- as.data.frame(coffee)
```
```{r add variable, echo= FALSE, warning=FALSE}
coffee$continent = countrycode(sourcevar = coffee[,"country"],
origin="country.name",
destination = "continent")
coffee[c(374,408,764,787),"continent"] <- 'Americas'
coffee <- as_tibble(coffee)
```
# Formal Data Analysis
In formal data analysis, we use three link functions to fit generalized linear models, then use step-wise regression to select reasonable explanatory variables based on AIC values, and finally use chi-square values to judge the fitness of the model.
```{r model selection, echo= FALSE, warning=FALSE}
coffee1 <- as.data.frame(coffee)
coffee1[which(coffee1$Qualityclass=="Poor"),]$Qualityclass=0
coffee1[which(coffee1$Qualityclass=="Good"),]$Qualityclass=1
coffee1$Qualityclass = as.factor(coffee1$Qualityclass)
```
```{r logit, echo= FALSE, warning=FALSE}
logit_model = glm(Qualityclass~aroma + flavor + acidity + defects + altitude + harvested + continent, data = coffee1, family = binomial(link = "logit"))
both_logit = step(logit_model,direction="both")
summ(both_logit)
both_logit$anova %>%
select(Step, AIC) %>%
kable() %>%
kable_styling(font_size = 10, latex_options = "hold_position")
## formula = Qualityclass ~ aroma + flavor + acidity + altitude + harvested
## AIC: 533.47, BIC = 562.23
## Fitness of model
summary(both_logit)$null.deviance - summary(both_logit)$deviance > qchisq(0.95,891-886)
## TRUE. we can reject the null hypothesis, and the terms are all significant
```
```{r probit, echo= FALSE, warning=FALSE}
# Probit link
probit_model = glm(Qualityclass~aroma + flavor + acidity + defects + altitude + harvested + continent, data = coffee1, family = binomial(link = "probit"))
both_probit = step(probit_model,direction = "both")
summ(both_probit)
both_probit$anova %>%
select(Step, AIC) %>%
kable() %>%
kable_styling(font_size = 10, latex_options = "hold_position")
## formula = Qualityclass ~ aroma + flavor + acidity + altitude + harvested
## AIC = 554.03, BIC = 582.79
## Fittness of model
summary(both_probit)$null.deviance - summary(both_probit)$deviance > qchisq(0.95,891-886)
## TRUE. we can reject the null hypothesis, and the terms are all significant
```
```{r cloglog, echo= FALSE, warning=FALSE}
clog_model = glm(Qualityclass~aroma + flavor + acidity + defects + altitude + harvested + continent, data = coffee1, family = binomial(link = "cloglog"))
both_clog = step(clog_model,direction = "both")
summ(both_clog)
both_clog$anova %>%
select(Step, AIC) %>%
kable() %>%
kable_styling(font_size = 10, latex_options = "hold_position")
## formula = Qualityclass ~ aroma + flavor + acidity + harvested
## AIC = 636.96, BIC = 660.93
## Fittness of model
summary(both_clog)$null.deviance - summary(both_clog)$deviance > qchisq(0.95,891-887)
## TRUE. we can reject the null hypothesis, and the terms are all significant
```
According to the stepwise regression results, the GLM explanatory variables of complementary log-log link are *aroma*, *flavor*, *acidity* and *harvested*, and the GLM explanatory variables of logit link and probit link are *aroma*, *flavor*, *acidity*, *altitude* and *harvested*.
The Pearson chi-squared statistics of three models are all greater than the 95th percentile of the $\chi^{2}(4)$ distribution. Therefore the models fit the data well and We need to choose the appropriate link function by comparing the information criteria.
# GLM Model Selection
The AIC and BIC values corresponding to each link function are as follows:
```{r aicbic, echo= FALSE, warning=FALSE}
summlogit = summ(both_logit)
summprobit = summ(both_probit)
summclog = summ(both_clog)
aic1 <- round(summlogit$model$aic, 3)
aic2 <- round(summprobit$model$aic, 3)
aic3 <- round(summclog$model$aic, 3)
```
Link | Link Function | AIC | BIC
:-------------------------|:--------------- |:--------:|:-----:
Logit link | $g\left(p_{i}\right)=\log \left(\frac{p_{i}}{1-p_{i}}\right)$ | 533.47 | 562.23
Probit link | $g\left(p_{i}\right)=\Phi^{-1}\left(p_{i}\right)=\beta_{0}+\beta_{1} x_{i}$ | 554.03 | 582.79
Complementary log-log link| $g\left(p_{i}\right)=\log \left[-\log \left(1-p_{i}\right)\right]=\beta_{0}+\beta_{1} x_{i}$ | 636.96 | 660.93
Based on the AIC and BIC values in the table above, the model using logit link fits best in three. So we finally choose the logit link function in GLM.
The GLM regression model of logit link is as follows:
$$
Y \sim B(m_i, p{(\text {Qualityclass = Good})}_i),
$$
$$
g\left(p{(\text {Qualityclass = Good})}_{i}\right)=\log \left(\frac{p{(\text {Qualityclass = Good})}_{i}}{1-p{(\text {Qualityclass = Good})}_{i}}\right),
$$
```{r logitformula, echo= FALSE, warning=FALSE}
logitsele = summary(both_logit)
Coefs <- round(coef(logitsele), 4)
```
$$
\log \left(\frac{p{(\text {Qualityclass = Good})}}{1-p{(\text {Qualityclass = Good})}}\right) = `r Coefs[1]` + `r Coefs[2]` \cdot aroma + `r Coefs[3]` \cdot flavor + `r Coefs[4]` \cdot acidity + `r Coefs[5]` \cdot altitude + `r Coefs[6]` \cdot harvested.
$$
Considering that the correlation coefficient of aroma, flavor and acidity in EDA is relatively large, we calculated the VIF value of the variables in the regression.
```{r VIF, echo= FALSE, warning=FALSE}
vif(both_logit) %>%
kable(caption = '\\label{tab:VIF} VIF of Variables', digits = 2)%>%
kable_styling(latex_options = "hold_position")
```
The results show that the VIF values are all small, excluding multicollinearity.