B3-NeuralNetworks.qmd

---
output: html_document
editor_options:
  chunk_output_type: console
---

# Artificial Neural Networks

Artificial neural networks are biologically inspired, the idea is that inputs are processed by weights, the neurons, the signals then accumulate at hidden nodes (axioms), and only if the sum of activations of several neurons exceed a certain threshold, the signal will be passed on.

```{r}
library(cito)
```

cito allows us to fit fully-connected neural networks within one line of code. When we come to other tasks such as image recognition we have to use frameworks with higher flexibility such as keras or torch.

Neural networks are harder to optimize (hey are optimized via backpropagation and gradient descent) and a few hyperparameters that control the optimization should be familiar:

+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------+
| Hyperparameter | Meaning                                                                                                                                                                                                       | Range                     |
+================+===============================================================================================================================================================================================================+===========================+
| learning rate  | the step size of the parameter updating in the iterative optimization routine, if too high, the optimizer will step over good local optima, if too small, the optimizer will be stuck in a bad local optima   | \[0.00001, 0.5\]          |
+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------+
| batch size     | NNs are optimized via stochastic gradient descent, i.e. only a batch of the data is used to update the parameters at a time                                                                                   | Depends on the data:      |
|                |                                                                                                                                                                                                               |                           |
|                |                                                                                                                                                                                                               | 10-250                    |
+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------+
| epoch          | the data is fed into the optimization in batches, once the entire data set has been used in the optimization, the epoch is complete (so e.g. n = 100, batch size = 20, it takes 5 steps to complete an epoch) | 100+ (use early stopping) |
+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------+

Example:

```{r}
data = airquality[complete.cases(airquality),]
data = scale(data)

model = dnn(Ozone~., 
            hidden = c(10L, 10L), # Architecture, number of hidden layers and nodes in each layer
            activation = c("selu", "selu"), # activation functions for the specific hidden layer
            loss = "mse", lr = 0.01, data = data, epochs = 150L, verbose = FALSE)
plot(model)
summary(model)
```

The architecture of the NN can be specified by the `hidden` argument, it is a vector where the length corresponds to the number of hidden layers and value of entry to the number of hidden neurons in each layer (and the same applies for the `activation` argument that specifies the activation functions in the hidden layers). It is hard to make recommendations about the architecture, a kind of general rule is that the width of the hidden layers is more important than the depth of the NN.

The loss function has to be adjusted to the response type:

+---------------------------+---------------------------------------+--------------------------------------------------+
| Loss                      | Type                                  | Example                                          |
+===========================+=======================================+==================================================+
| mse (mean squared error)  | Regression                            | Numeric values                                   |
+---------------------------+---------------------------------------+--------------------------------------------------+
| mae (mean absolute error) | Regression                            | Numeric values, often used for skewed data       |
+---------------------------+---------------------------------------+--------------------------------------------------+
| softmax                   | Classification, multi-label           | Species                                          |
+---------------------------+---------------------------------------+--------------------------------------------------+
| cross-entropy             | Classification, binary or multi-class | Survived/non-survived, Multi-species/communities |
+---------------------------+---------------------------------------+--------------------------------------------------+
| binomial                  | Classification, binary or multi-class | Binomial likelihood                              |
+---------------------------+---------------------------------------+--------------------------------------------------+
| poisson                   | Regression                            | Count data                                       |
+---------------------------+---------------------------------------+--------------------------------------------------+

::: callout-caution
## Importance of the learning rate

cito visualizes the training (see graphic). The reason for this is that the training can easily fail if the learning rate (lr) is poorly chosen. If the lr is too high, the optimizer "jumps" over good local optima, while it gets stuck in local optima if the lr is too small:

```{r}
model = dnn(Ozone~., 
            hidden = c(10L, 10L), 
            activation = c("selu", "selu"), 
            loss = "mse", lr = 0.4, data = data, epochs = 150L, verbose = FALSE)
```

If too high, the training will either directly fail (because the loss jumps to infinity) or the loss will be very wiggly and doesn't decrease over the number of epochs.

```{r}
model = dnn(Ozone~., 
            hidden = c(10L, 10L), 
            activation = c("selu", "selu"), 
            loss = "mse", lr = 0.0001, data = data, epochs = 150L, verbose = FALSE)
```

If too low, the loss will be very wiggly but doesn't decrease.
:::

::: callout-note
## Learning rate scheduler

Adjusting / reducing the learning rate during training is a common approach in neural networks. The idea is to start with a larger learning rate and then steadily decrease it during training (either systematically or based on specific properties):

```{r}
model = dnn(Ozone~., 
            hidden = c(10L, 10L), 
            activation = c("selu", "selu"), 
            loss = "mse", 
            lr = 0.1,
            lr_scheduler = config_lr_scheduler("step", step_size = 30, gamma = 0.1),
            # reduce learning all 30 epochs (new lr = 0.1* old lr)
            data = data, epochs = 150L, verbose = FALSE)
```
:::


## Regularization

We can use $\lambda$ and $\alpha$ to set L1 and L2 regularization on the weights in our NN:

```{r}
model = dnn(Ozone~., 
            hidden = c(10L, 10L), 
            activation = c("selu", "selu"), 
            loss = "mse", 
            lr = 0.05,
            lambda = 0.1,
            alpha = 0.5,
            lr_scheduler = config_lr_scheduler("step", step_size = 30, gamma = 0.1),
            # reduce learning all 30 epochs (new lr = 0.1* old lr)
            data = data, epochs = 150L, verbose = FALSE)
summary(model)
```

Be careful that you don't accidentally set all weights to 0 because of a too high regularization. We check the weights of the first layer:

```{r}
fields::image.plot(coef(model)[[1]][[1]]) # weights of the first layer
```

## Exercise


::: {.callout-caution icon="false"}
#### Question: Hyperparameter tuning - Titanic dataset

Tune architecture

-   Play around with the architecture and try to improve the AUC on the submission server

Bonus:

-   Tune the architecture! (depth and width of the NN via the hidden argument)


```{r}
library(EcoData)
library(dplyr)
library(missRanger)
data(titanic_ml)
data = titanic_ml
data = 
  data %>% select(survived, sex, age, fare, pclass)
data[,-1] = missRanger(data[,-1], verbose = 0)

data_sub =
  data %>%
    mutate(age = scales::rescale(age, c(0, 1)),
           fare = scales::rescale(fare, c(0, 1))) %>%
    mutate(sex = as.integer(sex) - 1L,
           pclass = as.integer(pclass - 1L))
data_new = data_sub[is.na(data_sub$survived),] # for which we want to make predictions at the end
data_obs = data_sub[!is.na(data_sub$survived),] # data with known response


model = dnn(survived~., 
          hidden = c(10L, 10L), # change
          activation = c("selu", "selu"), # change
          loss = "binomial", 
          lr = 0.05, #change
          validation = 0.2,
          lambda = 0.001, # change
          alpha = 0.1, # change
          lr_scheduler = config_lr_scheduler("reduce_on_plateau", patience = 10, factor = 0.9),
          data = data_obs, epochs = 40L, verbose = TRUE, plot= TRUE)

# Predictions:

predictions = predict(model, newdata = data_new)

write.csv(data.frame(y = predictions[,1]), file = "Max_titanic_ensemble.csv")
```


<!-- ```{r, eval=FALSE} -->
<!-- library(cito) -->
<!-- set.seed(42) -->
<!-- data_obs = data_sub[!is.na(data_sub$survived),]  -->
<!-- cv = 3 -->

<!-- outer_split = as.integer(cut(1:nrow(data_obs), breaks = cv)) -->

<!-- results = data.frame( -->
<!--   set = rep(NA, cv), -->
<!--   lambda = rep(NA, cv), -->
<!--   AUC = rep(NA, cv) -->
<!-- ) -->

<!-- for(i in 1:cv) { -->
<!--   train_outer = data_obs[outer_split != i, ] -->
<!--   test_outer = data_obs[outer_split == i, ] -->

<!--   tuning_results =  -->
<!--       sapply(1:length(hyper_lambda), function(k) { -->
<!--         model = dnn(survived~.,  -->
<!--             hidden = c(10L, 10L),  -->
<!--             activation = c("selu", "selu"),  -->
<!--             loss = "binomial",  -->
<!--             lr = 0.05, -->
<!--             lambda = ..., # change this line -->
<!--             alpha = 0.1, -->
<!--             lr_scheduler = config_lr_scheduler("step", step_size = 10, gamma = 0.1), -->
<!--             data = train_outer, epochs = 40L, verbose = FALSE, plot= FALSE) -->
<!--         return(Metrics::auc(test_outer$survived, predict(model, test_outer )[,2])) -->
<!--       }) -->
<!--   best_lambda = hyper_lambda[which.max(tuning_results)] -->

<!--   results[i, 1] = i -->
<!--   results[i, 2] = best_lambda -->
<!--   results[i, 3] = max(tuning_results) -->
<!-- } -->

<!-- print(results) -->
<!-- ``` -->


<!-- ```{r} -->
<!-- library(EcoData) -->
<!-- library(dplyr) -->
<!-- library(missRanger) -->
<!-- data(titanic_ml) -->
<!-- data = titanic_ml -->
<!-- data =  -->
<!--   data %>% select(survived, sex, age, fare, pclass) -->
<!-- data[,-1] = missRanger(data[,-1], verbose = 0) -->

<!-- data_sub = -->
<!--   data %>% -->
<!--     mutate(age = scales::rescale(age, c(0, 1)), -->
<!--            fare = scales::rescale(fare, c(0, 1))) %>% -->
<!--     mutate(sex = as.integer(sex) - 1L, -->
<!--            pclass = as.integer(pclass - 1L)) -->
<!-- data_new = data_sub[is.na(data_sub$survived),] # for which we want to make predictions at the end -->
<!-- data_obs = data_sub[!is.na(data_sub$survived),] # data with known response -->

<!-- ``` -->

<!-- `r hide("Click here to see the solution")` -->

<!-- ```{r} -->
<!-- library(cito) -->
<!-- set.seed(42) -->
<!-- cv = 3 -->
<!-- hyper_lambda = runif(5,0.0001, 0.02) -->

<!-- outer_split = as.integer(cut(1:nrow(data_obs), breaks = cv)) -->

<!-- results = data.frame( -->
<!--   set = rep(NA, cv), -->
<!--   lambda = rep(NA, cv), -->
<!--   AUC = rep(NA, cv) -->
<!-- ) -->

<!-- for(i in 1:cv) { -->
<!--   train_outer = data_obs[outer_split != i, ] -->
<!--   test_outer = data_obs[outer_split == i, ] -->

<!--   tuning_results =  -->
<!--       sapply(1:length(hyper_lambda), function(k) { -->
<!--         model = dnn(survived~.,  -->
<!--             hidden = c(10L, 10L),  -->
<!--             activation = c("selu", "selu"),  -->
<!--             loss = "binomial",  -->
<!--             lr = 0.05, -->
<!--             lambda = hyper_lambda[k], -->
<!--             alpha = 0.1, -->
<!--             lr_scheduler = config_lr_scheduler("step", step_size = 10, gamma = 0.1), -->
<!--             data = train_outer, epochs = 40L, verbose = FALSE, plot= FALSE) -->
<!--         return(Metrics::auc(test_outer$survived, predict(model, test_outer )[,1])) -->
<!--       }) -->
<!--   best_lambda = hyper_lambda[which.max(tuning_results)] -->

<!--   results[i, 1] = i -->
<!--   results[i, 2] = best_lambda -->
<!--   results[i, 3] = max(tuning_results) -->
<!-- } -->

<!-- print(results) -->
<!-- ``` -->

<!-- Make predictions: -->

<!-- ```{r, results='hide', warning=FALSE, message=FALSE} -->
<!-- prediction_ensemble =  -->
<!--   sapply(1:nrow(results), function(i) { -->
<!--   model = dnn(survived~.,  -->
<!--               hidden = c(10L, 10L),  -->
<!--               activation = c("selu", "selu"),  -->
<!--               loss = "binomial",  -->
<!--               lr = 0.05, -->
<!--               lambda = results$lambda[i], -->
<!--               alpha = 0.1, -->
<!--               lr_scheduler = config_lr_scheduler("step", step_size = 10, gamma = 0.1), -->
<!--               data = data_sub[is.na(data_sub$survived),] , epochs = 40L, verbose = FALSE, plot= FALSE) -->
<!--     return(predict(model, data_obs)[,1]) -->
<!--   }) -->

<!-- # Single predictions from the model with the highest AUC: -->
<!-- write.csv(data.frame(y = prediction_ensemble[,which.max(results$AUC)]), file = "Max_titanic_best_model.csv") -->

<!-- # Single predictions from the ensemble model: -->
<!-- write.csv(data.frame(y = apply(prediction_ensemble, 1, mean)), file = "Max_titanic_ensemble.csv") -->
<!-- ``` -->

<!-- `r unhide()` -->

<!-- `r hide("Click here to see the solution for architecture tuning")` -->

<!-- ```{r} -->
<!-- library(cito) -->
<!-- set.seed(42) -->
<!-- data_obs = data_sub[!is.na(data_sub$survived),]  -->
<!-- cv = 3 -->
<!-- hyper_lambda = runif(10,0.0001, 0.02) -->
<!-- hyper_alpha = runif(10,0, 1.0) -->
<!-- hyper_hidden = sample.int(10, 10) -->
<!-- hyper_nodes = sample(seq(5, 100), size = 10) -->

<!-- outer_split = as.integer(cut(1:nrow(data_obs), breaks = cv)) -->

<!-- results = data.frame( -->
<!--   set = rep(NA, cv), -->
<!--   lambda = rep(NA, cv), -->
<!--   alpha = rep(NA, cv), -->
<!--   hidden = rep(NA, cv), -->
<!--   nodes = rep(NA, cv), -->
<!--   AUC = rep(NA, cv) -->
<!-- ) -->

<!-- for(i in 1:cv) { -->
<!--   train_outer = data_obs[outer_split != i, ] -->
<!--   test_outer = data_obs[outer_split == i, ] -->

<!--   tuning_results =  -->
<!--       sapply(1:length(hyper_lambda), function(k) { -->
<!--         model = dnn(survived~.,  -->
<!--             hidden = rep(hyper_nodes[k], hyper_hidden[k]),  -->
<!--             activation = rep("selu", hyper_hidden[k]),  -->
<!--             loss = "binomial",  -->
<!--             lr = 0.05, -->
<!--             lambda = hyper_lambda[k], -->
<!--             alpha = hyper_alpha[k], -->
<!--             lr_scheduler = config_lr_scheduler("step", step_size = 10, gamma = 0.1), -->
<!--             data = train_outer, epochs = 40L, verbose = FALSE, plot= FALSE) -->
<!--         return(Metrics::auc(test_outer$survived, predict(model, test_outer )[,1])) -->
<!--       }) -->
<!--   results[i, 1] = i -->
<!--   results[i, 2] =  hyper_lambda[which.max(tuning_results)] -->
<!--   results[i, 3] =  hyper_alpha[which.max(tuning_results)] -->
<!--   results[i, 4] =  hyper_hidden[which.max(tuning_results)] -->
<!--   results[i, 5] =  hyper_nodes[which.max(tuning_results)]   -->
<!--   results[i, 6] = max(tuning_results) -->
<!-- } -->

<!-- print(results) -->
<!-- ``` -->

<!-- Make predictions: -->

<!-- ```{r, results='hide', warning=FALSE, message=FALSE} -->
<!-- prediction_ensemble =  -->
<!--   sapply(1:nrow(results), function(i) { -->
<!--   model = dnn(survived~.,  -->
<!--               hidden = rep(results$nodes[i], results$hidden[i]),  -->
<!--               activation = rep("selu", results$hidden[i]),  -->
<!--               loss = "binomial",  -->
<!--               lr = 0.05, -->
<!--               lambda = results$lambda[i], -->
<!--               alpha = results$alpha[i], -->
<!--               lr_scheduler = config_lr_scheduler("step", step_size = 10, gamma = 0.1), -->
<!--               data = data_sub[is.na(data_sub$survived),] , epochs = 40L, verbose = FALSE, plot= FALSE) -->
<!--     return(predict(model, data_obs)[,1]) -->
<!--   }) -->

<!-- # Single predictions from the model with the highest AUC: -->
<!-- write.csv(data.frame(y = prediction_ensemble[,which.max(results$AUC)]), file = "Max_titanic_best_model.csv") -->

<!-- # Single predictions from the ensemble model: -->
<!-- write.csv(data.frame(y = apply(prediction_ensemble, 1, mean)), file = "Max_titanic_ensemble.csv") -->
<!-- ``` -->

<!-- `r unhide()` -->

:::


<!-- ::: {.callout-caution icon="false"} -->
<!-- #### Question: Regularization -->

<!-- Change the following code to a pure L1 regularization and try different $\lambda$ values, what happens to the weights of the first layer? -->

<!-- ```{r} -->
<!-- data = airquality -->
<!-- model = dnn(Ozone~.,  -->
<!--             hidden = c(40L, 40L),  -->
<!--             activation = c("selu", "selu"),  -->
<!--             loss = "mse",  -->
<!--             lr = 0.05, -->
<!--             lambda = 0.0, -->
<!--             alpha = 0.5, -->
<!--             lr_scheduler = config_lr_scheduler("step", step_size = 30, gamma = 0.1), -->
<!--             # reduce learning all 30 epochs (new lr = 0.1* old lr) -->
<!--             data = data, epochs = 150L, verbose = FALSE) -->
<!-- fields::image.plot(coef(model)[[1]][[1]]) -->
<!-- ``` -->

<!-- `r hide("Click here to see the solution")` -->

<!-- $\alpha = 0.0$ means that only L1 is used: Weak regularization -->

<!-- ```{r} -->
<!-- model = dnn(Ozone~.,  -->
<!--             hidden = c(40L, 40L),  -->
<!--             activation = c("selu", "selu"),  -->
<!--             loss = "mse",  -->
<!--             lr = 0.05, -->
<!--             lambda = 0.01, -->
<!--             alpha = 0.0, -->
<!--             lr_scheduler = config_lr_scheduler("step", step_size = 30, gamma = 0.1), -->
<!--             # reduce learning all 30 epochs (new lr = 0.1* old lr) -->
<!--             data = data, epochs = 150L, verbose = FALSE, plot = FALSE) -->
<!-- fields::image.plot(coef(model)[[1]][[1]]) -->
<!-- ``` -->

<!-- Strong regularization -->

<!-- ```{r} -->
<!-- model = dnn(Ozone~.,  -->
<!--             hidden = c(40L, 40L),  -->
<!--             activation = c("selu", "selu"),  -->
<!--             loss = "mse",  -->
<!--             lr = 0.05, -->
<!--             lambda = 0.04, -->
<!--             alpha = 0.1, -->
<!--             lr_scheduler = config_lr_scheduler("step", step_size = 30, gamma = 0.1), -->
<!--             # reduce learning all 30 epochs (new lr = 0.1* old lr) -->
<!--             data = data, epochs = 150L, verbose = FALSE, plot= FALSE) -->
<!-- fields::image.plot(coef(model)[[1]][[1]]) -->
<!-- ``` -->

<!-- The weights get sparse, i.e. many of them are zero. -->

<!-- `r unhide()` -->
<!-- ::: -->


::: {.callout-caution icon="false"}
#### Question: Hyperparameter tuning - Plant-pollinator dataset

see @sec-plantpoll for more information about the dataset.

Prepare the data:

```{r}
library(EcoData)
library(missRanger)
library(dplyr)
data(plantPollinator_df)
plant_poll = plantPollinator_df

plant_poll_imputed = plant_poll %>% select(diameter,
                                           corolla,
                                           tongue,
                                           body,
                                           interaction,
                                           colour, 
                                           nectar,
                                           feeding,
                                           season)
# Remove response variable interaction
plant_poll_imputed = missRanger::missRanger(data = plant_poll_imputed %>%
                                              select(-interaction), verbose = 0)

# scale numeric variables
plant_poll_imputed[,sapply(plant_poll_imputed, is.numeric)] = scale(plant_poll_imputed[,sapply(plant_poll_imputed, is.numeric)])

# Add response back to the dataset after the imputatiob
plant_poll_imputed$interaction = plant_poll$interaction
plant_poll_imputed$colour = as.factor(plant_poll_imputed$colour)
plant_poll_imputed$nectar = as.factor(plant_poll_imputed$nectar)
plant_poll_imputed$feeding = as.factor(plant_poll_imputed$feeding)
plant_poll_imputed$season = as.factor(plant_poll_imputed$season)


data_new = plant_poll_imputed[is.na(plant_poll_imputed$interaction), ] # for which we want to make predictions at the end
data_obs = plant_poll_imputed[!is.na(plant_poll_imputed$interaction), ]# data with known response
dim(data_obs)
```

The dataset is large! More than 10,000 observations. For now, let's switch to a simple holdout strategy for validating our model (e.g. use 80% of the data to train the model and 20% of the data to validate your model.

Moreover:
```{r}
table(data_obs$interaction)
```
The data is strongly imbalanced, i.e. many 0s but only a few 1. There are different strategies how to deal with that, for example oversampling the 1s or undersampling the 0s. 

Undersampling the 0s:
```{r}
data_obs = data_obs[c(sample(which(data_obs$interaction == 0), 2000), which(data_obs$interaction == 1)),]
table(data_obs$interaction)
data_obs$interaction = as.integer(data_obs$interaction)
```


`r hide("Click here to see the solution")`

Minimal example:

```{r}
library(cito)
set.seed(42)
tuning_steps = 2
hyper_lambda = runif(tuning_steps,0.0001, 0.02)
hyper_alpha = runif(tuning_steps,0, 1.0)
hyper_hidden = sample.int(10, tuning_steps)
hyper_nodes = sample(seq(5, 100), size = tuning_steps)

outer_split = sample(nrow(data_obs), 0.2*nrow(data_obs))

results = data.frame(
  set = 1,
  lambda = rep(NA, 1),
  alpha = rep(NA, 1),
  hidden = rep(NA, 1),
  nodes = rep(NA, 1),
  AUC = rep(NA, 1)
)

train_outer = data_obs[-outer_split, ]
test_outer = data_obs[outer_split, ]

tuning_results = 
    sapply(1:length(hyper_lambda), function(k) {
      model = dnn(interaction~., 
          hidden = rep(hyper_nodes[k], hyper_hidden[k]), 
          activation = rep("selu", hyper_hidden[k]), 
          loss = "binomial", 
          lr = 0.05,
          lambda = hyper_lambda[k],
          alpha = hyper_alpha[k],
          batchsize = 100L, # increasing the batch size will reduce the runtime
          lr_scheduler = config_lr_scheduler("reduce_on_plateau", patience = 10, factor = 0.9),
          data = train_outer, epochs = 50L, verbose = FALSE, plot= FALSE)
      return(Metrics::auc(test_outer$interaction, predict(model, test_outer )[,1]))
    })
results[1, 1] = 1
results[1, 2] =  hyper_lambda[which.max(tuning_results)]
results[1, 3] =  hyper_alpha[which.max(tuning_results)]
results[1, 4] =  hyper_hidden[which.max(tuning_results)]
results[1, 5] =  hyper_nodes[which.max(tuning_results)]  
results[1, 6] = max(tuning_results)

print(results)
```

Make predictions:

```{r, results='hide', warning=FALSE, message=FALSE}
k = 1
model = dnn(interaction~., 
    hidden = rep(results$nodes[k], results$hidden[k]), 
    activation = rep("selu", hyper_hidden[k]), 
    loss = "binomial", 
    lr = 0.05,
    lambda = results$lambda[k],
    alpha = results$alpha[k],
    batchsize = 100L, # increasing the batch size will reduce the runtime
    lr_scheduler = config_lr_scheduler("reduce_on_plateau", patience = 10, factor = 0.9),
    data = train_outer, epochs = 50L, verbose = FALSE, plot= FALSE)

predictions = predict(model, newdata = data_new)[,1]

write.csv(data.frame(y = predictions), file = "Max_plant_poll_ensemble.csv")
```


`r unhide()`

:::