# 3. Logistic regression
## by Md Karim Uddin
**Part 1**
```{r}
date()
```
```{r}
# read the joined student alcohol consumption data
alc <- read.table("alc.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
```
Dimensions of the data:
``` {r}
dim(alc)
```
It can be seen that the dataset has 382 observations and 35 variables.
``` {r}
str(alc)
```
Initialize a plot of high_use and G3 (the final grade):
``` {r}
library(ggplot2)
g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex))
g1 + geom_boxplot() + ylab("grade")
```
``` {r}
g2 <- ggplot(alc, aes(x = high_use, y = absences, col = sex))
# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ggtitle("Student absences by alcohol consumption and sex")
```
# Logistic regression model
``` {r}
m <- glm(high_use ~ failures + absences + sex, data = alc, family = "binomial")
# print out a summary of the model
summary(m)
```
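The estimates in the summary are on the log-odds scale; exponentiating them turns them into odds ratios, which are easier to interpret and are computed next.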
``` {r}
# print out the model coefficients (on the log-odds scale)
coef(m)
```
``` {r}
library(magrittr)
library(knitr)
```
``` {r}
m <- glm(high_use ~ failures + absences + sex, data = alc, family = "binomial")
```
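The odds-ratio table below can be produced with code along these lines (a minimal sketch, assuming the usual approach of exponentiating the coefficients and their confidence intervals and printing the result with `kable`):

```{r}
# odds ratios: exponentiated coefficients of the model
OR <- coef(m) %>% exp
# exponentiated 95% confidence intervals of the coefficients
CI <- confint(m) %>% exp
cbind(OR, CI) %>% kable()
```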
|             | OR        | 2.5 %      | 97.5 %   |
|-------------|-----------|------------|----------|
| (Intercept) | 0.1491257 | 0.09395441 | 0.228611 |
| failures    | 1.5695984 | 1.08339644 | 2.294737 |
| absences    | 1.0977032 | 1.05169654 | 1.150848 |
| sexM        | 2.5629682 | 1.60381392 | 4.149405 |
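All three predictors have odds ratios above 1 and confidence intervals that exclude 1: more past class failures, more absences, and being male are each associated with higher odds of high alcohol consumption. For example, the odds of high use for male students are roughly 2.6 times those for female students.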
Predictive power of the model, part 1
``` {r}
library(dplyr)
m <- glm(high_use ~ failures + absences + sex, data = alc, family = "binomial")
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# see the last ten original classes, predicted probabilities, and class predictions
select(alc, failures, absences, sex, high_use, probability, prediction) %>% tail(10)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
```
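As a single-number summary of this table, the share of correctly classified students can be computed directly (a quick check added here, not part of the original chunk):

```{r}
# training-set accuracy: proportion of students whose high_use status is predicted correctly
mean(alc$high_use == alc$prediction)
```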
Predictive power of the model, part 2
``` {r}
#initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
```
Accuracy and loss functions
```{r}
# the logistic regression model m and the dataset alc with predictions are available
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# compute the proportion of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
```
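Since `high_use` is logical, it is coerced to 0/1 in the subtraction, so `abs(class - prob) > 0.5` is TRUE exactly for the observations whose predicted probability falls on the wrong side of the true class. The mean of these indicators is the training error, i.e. the complement of the accuracy above.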
# 10-fold cross-validation
``` {r}
library(boot)
# perform 10-fold cross-validation, using the loss function defined above as the cost
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average proportion of wrong predictions across the validation folds
cv$delta[1]
```
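This is the proportion of misclassified students averaged over the ten validation folds. Because the folds are drawn at random, the value changes slightly from run to run; a minimal sketch (an addition here, with an arbitrary seed) makes the knitted result reproducible:

```{r}
# fix the random seed so the 10-fold split, and hence the CV error, is reproducible
set.seed(2017)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
```

Comparing this out-of-sample error with the training error computed earlier shows how well the model generalises beyond the data it was fitted on.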