A4-MLpipeline.qmd

---
output: html_document
editor_options:
  chunk_output_type: console
---

# Machine learning pipeline {#workflow}

The Standard Machine Learning Pipeline using the Titanic Data set

Before we specialize on any tuning, it is important to understand that machine learning always consists of a pipeline of actions.

The typical machine learning workflow consist of:

-   Data cleaning and exploration (EDA = explorative data analysis) for example with tidyverse.
-   Preprocessing and feature selection.
-   Splitting data set into training and test set for evaluation.
-   Model fitting.
-   Model evaluation.
-   New predictions

![Machine Learning pipeline](images/pipeline.png)

<!-- Here is an (optional) video that explains the entire pipeline from a slightly different perspective: -->

<!-- ```{r chunk_chapter4_39, eval=knitr::is_html_output(excludes = "epub"), results = 'asis', echo = F} -->

<!-- cat( -->

<!--   '<iframe width="560" height="315"  -->

<!--   src="https://www.youtube.com/embed/nKW8Ndu7Mjw" -->

<!--   frameborder="0" allow="accelerometer; autoplay; encrypted-media; -->

<!--   gyroscope; picture-in-picture" allowfullscreen> -->

<!--   </iframe>' -->

<!-- ) -->

<!-- ``` -->

In the following example, we use tidyverse, a collection of R packages for data science / data manipulation mainly developed by Hadley Wickham.

::: {.callout-note appearance="default" collapse="true"}
## dplyr and tidyverse

The `dplyr` package is part of a framework called tidyverse. Unique features of the tidyverse are the pipe `%>%` operator and `tibble` objects.

-   The `%>%` operator:

    Applying several functions in sequence on an object often results in uncountable/confusing number of round brackets:

    ```{r}
    library(tidyverse)
    max(mean(range(c(5, 3, 2, 1))))
    ```

    The pipe operator simplifies that by saying "apply the next function on the result of the current function":

    ```{r}
    c(5, 3, 2, 1) %>% range %>% mean %>% max
    ```

    Which is easier to write, read, and to understand!

-   `tibble` objects are just an extension of data.frames. In the course we will use mostly data.frames, so it is better to transform the tibbles back to data.frames:

    ```{r}
    air_grouped = airquality %>% group_by(Month)

    class(air_grouped)
    air_grouped = as.data.frame(air_grouped)
    class(air_grouped)
    ```
:::

<!-- ```{r chunk_chapter4_40, eval=knitr::is_html_output(excludes = "epub"), results = 'asis', echo = F} -->

<!-- cat( -->

<!--   '<iframe width="560" height="315"  -->

<!--   src="https://www.youtube.com/embed/nRtp7wSEtJA" -->

<!--   frameborder="0" allow="accelerometer; autoplay; encrypted-media; -->

<!--   gyroscope; picture-in-picture" allowfullscreen> -->

<!--   </iframe>' -->

<!-- ) -->

<!-- ``` -->

::: column-margin
Another good reference is "**R for data science**" by Hadley Wickham: <a href="https://r4ds.had.co.nz/" target="_blank" rel="noopener"></a>.
:::

For this lecture you need the Titanic data set provided by us (via the `EcoData` package).

::: column-margin
You can find it in GRIPS (datasets.RData in the data set and submission section) or at <a href="http://rhsbio7.uni-regensburg.de:8500" target="_blank" rel="noopener">http://rhsbio7.uni-regensburg.de:8500</a> (VPN for University of Regensburg is required!).
:::

::: callout-important
### Motivation - We need a model that can predict the survival probability of new passengers.

We have split the data set into training and an outer test/prediction data sets (the test/prediction split has one column less than the train split, as the response for the test/outer split is unknown).

**The goal is to build a predictive model that can accurately predict the chances of survival for Titanic passengers!**

The dataset:

```{r}
library(tidyverse)
library(EcoData)
data(titanic_ml)
data = titanic_ml
```

The response variable:

```{r}
unique(data$survived)
```

0 = passenger died

1 = passenger survived

NA = we don't have information about the passenger, at the end, we will make predictions for these passengers!

**Important**: Preprocessing of the data must be done for the training and testing data together!!
:::

## Data preparation

Load necessary libraries:

```{r chunk_chapter4_41, message=FALSE}
library(tidyverse)
```

Load data set:

```{r chunk_chapter4_42}
library(EcoData)
data(titanic_ml)
data = titanic_ml
```

Standard summaries:

```{r chunk_chapter4_43}
str(data)
summary(data)
```

The name variable consists of 1309 unique factors (there are 1309 observations...) and could be now transformed. If you are interested in how to do that, take a look at the following box.

::: {.callout-tip collapse="true"}
## Feature engineering of the name variable

```{r chunk_chapter4_44}
length(unique(data$name))
```

However, there is a title in each name. Let's extract the titles:

1.  We will extract all names and split each name after each comma ",".
2.  We will split the second split of the name after a point "." and extract the titles.

```{r chunk_chapter4_45}
first_split = sapply(data$name,
                     function(x) stringr::str_split(x, pattern = ",")[[1]][2])
titles = sapply(first_split,
                function(x) strsplit(x, ".",fixed = TRUE)[[1]][1])
```

We get 18 unique titles:

```{r chunk_chapter4_46}
table(titles)
```

A few titles have a very low occurrence rate:

```{r chunk_chapter4_47}
titles = stringr::str_trim((titles))
titles %>%
 fct_count()
```

We will combine titles with low occurrences into one title, which we can easily do with the `forcats` package.

```{r chunk_chapter4_48}
titles2 =
  forcats::fct_collapse(titles,
                        officer = c("Capt", "Col", "Major", "Dr", "Rev"),
                        royal = c("Jonkheer", "Don", "Sir",
                                  "the Countess", "Dona", "Lady"),
                        miss = c("Miss", "Mlle"),
                        mrs = c("Mrs", "Mme", "Ms")
                        )
```

We can count titles again to see the new number of titles:

```{r chunk_chapter4_49}
titles2 %>%  
   fct_count()
```

Add new title variable to data set:

```{r chunk_chapter4_50}
data =
  data %>%
    mutate(title = titles2)
```
:::

### Imputation

NAs are a common problem in ML and most ML algorithms cannot handle NAs. For example, the age variable has 20% NAs:

```{r chunk_chapter4_51}
summary(data$age)
sum(is.na(data$age)) / nrow(data)
```

There are few options how to handle NAs:

-   Drop observations with NAs, however, we may lose many observations (not what we want!)

-   Imputation, fill the missing values

We impute (fill) the missing values, for example with the median age. However, age itself might depend on other variables such as sex, class and title. Thus, instead of filling the NAs with the overall median of the passengers, we want to fill the NAs with the median age of these groups so that the associations with the other groups are preserved (or in other words, that the new values are hopefully closer to the unknown true values).

In `tidyverse` we can "group" the data, i.e. we will nest the observations within categorical variables for which we assume that there may be an association with age (here: `group_by` after sex, pclass and title). After grouping, all operations (such as our `median(age....)`) will be done within the specified groups (to get better estimates of these missing NAs).

```{r chunk_chapter4_52}
data =
  data %>%
    select(survived, sex, age, fare, pclass) %>% 
    group_by(sex, pclass) %>%
    mutate(age2 = ifelse(is.na(age), median(age, na.rm = TRUE), age)) %>%
    mutate(fare2 = ifelse(is.na(fare), median(fare, na.rm = TRUE), fare)) %>%
    ungroup()
```

### Preprocessing and Feature Selection

Later (tomorrow), we want to use Keras in our example, but it cannot handle factors and requires the data to be scaled.

Normally, one would do this for all predictors, but as we only show the pipeline here, we have sub-selected a bunch of predictors and do this only for them. We first scale the numeric predictors and change the factors with only two groups/levels into integers (this can be handled by Keras).

```{r chunk_chapter4_53}
data_sub =
  data %>%
    select(survived, sex, age2, fare2, pclass) %>%
    mutate(age2 = scales::rescale(age2, c(0, 1)),
           fare2 = scales::rescale(fare2, c(0, 1))) %>%
    mutate(sex = as.integer(sex) - 1L,
           pclass = as.integer(pclass - 1L))
```

::: {.callout-tip collapse="true"}
## Transforming factors with more than two levels

Factors with more than two levels should be **one hot encoded** (Make columns for every different factor level and write 1 in the respective column for every taken feature value and 0 else. For example: $\{red, green, green, blue, red\} \rightarrow \{(0,0,1), (0,1,0), (0,1,0), (1,0,0), (0,0,1)\}$):

```{r chunk_chapter4_54, eval = FALSE}
one_title = model.matrix(~0+as.factor(title), data = data)
colnames(one_title) = levels(data$title)

one_sex = model.matrix(~0+as.factor(sex), data = data)
colnames(one_sex) = levels(data$sex)

one_pclass = model.matrix(~0+as.factor(pclass), data = data)
colnames(one_pclass) = paste0("pclass", 1:length(unique(data$pclass)))
```

And we have to add the dummy encoded variables to the data set:

```{r chunk_chapter4_55, eval = FALSE}
data = cbind(data.frame(survived= data$survived),
                 one_title, one_sex, age = data$age2,
                 fare = data$fare2, one_pclass)
head(data)
```
:::

## Modelling

### Split data for final predictions

To tune our hyperparameters and evaluate our models, we split the data into the training and testing data. The testing data are the observations where the response is NA:

```{r}
summary(data_sub$survived)
```

655 observations have NAs in our response variable, these are the observations for which we want to make predictions at the end of our pipeline.

```{r chunk_chapter4_56}
data_new = data_sub[is.na(data_sub$survived),]
data_obs = data_sub[!is.na(data_sub$survived),]
```

### Hyperparameter optimization

We want to tune our hyperparameters ($\lambda$ and $\alpha$). Normally, we should do a nested CV on our training data (data_obs), however, we assume that the test data on the submission server is our outer split, so we can tune our hyperparameters using a n-fold Cross-Validation which serves as our inner CV.

::: column-margin
Again, why is it important to tune hyperparameters? Hyperparameters (configuration parameters of our ML algorithms that (mostly) control their complexity) are usually tuned (optimized) in an automatic / systematic way. A common procedure, called random search, is to sample random configuration combinations from the set of hyperparameters and test for each combination the prediction error.
:::

We implement manually a CV to tune the learning rate. We start with a 3xCV and 10x different learning rates:

```{r}
library(cito)
set.seed(42)
model = dnn(survived~.,
            data = data_obs, 
            loss = "binomial",
            lr = tune(0.001, 0.1),
            tuning = config_tuning(CV = 3, steps = 10)
            )
model$tuning
```

<!-- ```{r} -->

<!-- library(glmnet) -->

<!-- library(glmnetUtils) -->

<!-- set.seed(42) -->

<!-- cv = 5 -->

<!-- hyper_lambda = runif(20,0, 0.2) -->

<!-- tuning_results =  -->

<!--     sapply(1:length(hyper_lambda), function(k) { -->

<!--         auc_inner = NULL # save results from CV -->

<!--         for(j in 1:cv) { -->

<!--           inner_split = as.integer(cut(1:nrow(data_obs), breaks = cv)) -->

<!--           train_inner = data_obs[inner_split != j, ] -->

<!--           test_inner = data_obs[inner_split == j, ] -->

<!--           model = glmnet(survived~.,data = train_inner, family = "binomial", lambda = hyper_lambda[k]) -->

<!--           auc_inner[j]= Metrics::auc(test_inner$survived, predict(model, test_inner, type = "response")) -->

<!--         } -->

<!--       return(mean(auc_inner)) -->

<!--     }) -->

<!-- results = data.frame(lambda = hyper_lambda, AUC = tuning_results) -->

<!-- print(results) -->

<!-- ``` -->

<!-- The best (highest AUC) $\lambda$ is then: -->

<!-- ```{r} -->

<!-- results[which.max(results$AUC),] -->

<!-- ``` -->

## Predictions and Submission

When we are satisfied with the performance of our model, we will create predictions for the new observations on the submission server. cito directly returns the best model so we do not have to fit the final model.

We submit our predictions to the submission server at <a href="http://rhsbio7.uni-regensburg.de:8500" target="_blank" rel="noopener">http://rhsbio7.uni-regensburg.de:8500</a>.

For the submission it is critical to change the predictions into a data.frame, select the second column (the probability to survive), and save it with the write.csv function:

```{r, results='hide', warning=FALSE, message=FALSE}

data_new = data_sub[is.na(data_sub$survived),]
predictions = predict(model, data_new, type = "response")[,1] 
write.csv(data.frame(y = predictions), file = "Max_1.csv")
```


## Exercises

::: callout-warning
#### Question: Hyperparameter tuning dnn - Titanic dataset

Tune architecture

-   Tune training parameters (learning rate, batch size) and regularization

**Hints**

cito has a feature to automatically tune hyperparameters under Cross Validation!

-   passing `tune(...)` to a hyperparameter will tell cito to tune this specific hyperparameter
-   the `tuning = config_tuning(...)` let you specify the cross-validation strategy and the number of hyperparameters that should be tested (steps = number of hyperparameter combinations that should be tried)
-   after tuning, cito will fit automatically a model with the best hyperparameters on the full data and will return this model

Minimal example with the iris dataset:

```{r, eval=FALSE}
library(cito)
df = iris
df[,1:4] = scale(df[,1:4])

model_tuned = dnn(Species~., 
                  loss = "softmax",
                  data = iris,
                  lambda = tune(lower = 0.0, upper = 0.2), # you can pass the "tune" function to a hyerparameter
                  tuning = config_tuning(CV = 3, steps = 20L)
                  )

# tuning results
model_tuned$tuning


# model_tuned is now already the best model!
```

```{r}
#| message: false
library(EcoData)
library(dplyr)
library(missRanger)
data(titanic_ml)
data = titanic_ml
data = 
  data %>% select(survived, sex, age, fare, pclass)
data[,-1] = missRanger(data[,-1], verbose = 0)

data_sub =
  data %>%
    mutate(age = scales::rescale(age, c(0, 1)),
           fare = scales::rescale(fare, c(0, 1))) %>%
    mutate(sex = as.integer(sex) - 1L,
           pclass = as.integer(pclass - 1L))
data_new = data_sub[is.na(data_sub$survived),] # for which we want to make predictions at the end
data_obs = data_sub[!is.na(data_sub$survived),] # data with known response


model = dnn(survived~., 
          hidden = c(10L, 10L), # change
          activation = c("selu", "selu"), # change
          loss = "binomial", 
          lr = 0.05, #change
          validation = 0.2,
          lambda = 0.001, # change
          alpha = 0.1, # change
          lr_scheduler = config_lr_scheduler("reduce_on_plateau", patience = 10, factor = 0.9),
          data = data_obs, epochs = 40L, verbose = FALSE, plot= TRUE)

# Predictions:

predictions = predict(model, newdata = data_new, type = "response") # change prediction type to response so that cito predicts probabilities

write.csv(data.frame(y = predictions[,1]), file = "Max_titanic_dnn.csv")
```
:::

<!-- ::: {.callout-warning} -->

<!-- #### Task: Tuning $\alpha$ and $\lambda$ -->

<!-- 1.  Extend the code from above and tune $\alpha$ and $\lambda$ (via 10xCV) -->

<!-- 2.  Train the model with best set of hyperparameters and submit your predictions -->

<!-- Submit your predictions (<http://rhsbio7.uni-regensburg.de:8500/>), which model has a higher AUC?. -->

<!-- **Important**: -->

<!-- Submissions only work if your preditions are probabilities and a data.frame (which you can create via `write.csv(data.frame(y = predictions ), file = "Max_1.csv")` -->

<!--  ) -->

<!-- ```{r} -->

<!-- library(EcoData) -->

<!-- library(dplyr) -->

<!-- library(missRanger) -->

<!-- library(glmnet) -->

<!-- library(glmnetUtils) -->

<!-- data(titanic_ml) -->

<!-- data = titanic_ml -->

<!-- data =  -->

<!--   data %>% select(survived, sex, age, fare, pclass) -->

<!-- # missRanger uses a random forest to impute NAs (RF is trained on the data to predict values for the NAs) -->

<!-- data[,-1] = missRanger(data[,-1], verbose = 0) -->

<!-- data_sub = -->

<!--   data %>% -->

<!--     mutate(age = scales::rescale(age, c(0, 1)), -->

<!--            fare = scales::rescale(fare, c(0, 1))) %>% -->

<!--     mutate(sex = as.integer(sex) - 1L, -->

<!--            pclass = as.integer(pclass - 1L)) -->

<!-- data_new = data_sub[is.na(data_sub$survived),] # for which we want to make predictions at the end -->

<!-- data_obs = data_sub[!is.na(data_sub$survived),] # data with known response -->

<!-- ``` -->

<!-- Bonus: -->

<!-- -   Try different features -->

<!-- -   Try cito -->

<!-- -   Try different datasets (see @sec-datasets) -->

<!-- Code template for a simple CV (for tuning $\lambda$): -->

<!-- - Extend the following code so that $\alpha$ is also tuned! -->

<!-- ```{r} -->

<!-- set.seed(42) -->

<!-- cv = 5 -->

<!-- hyper_lambda = runif(20,0, 0.2) -->

<!-- tuning_results = -->

<!--     sapply(1:length(hyper_lambda), function(k) { -->

<!--         auc_inner = NULL -->

<!--         for(j in 1:cv) { -->

<!--           inner_split = as.integer(cut(1:nrow(data_obs), breaks = cv)) -->

<!--           train_inner = data_obs[inner_split != j, ] -->

<!--           test_inner = data_obs[inner_split == j, ] -->

<!--           model = glmnet(survived~.,data = train_inner, lambda = hyper_lambda[k], family = "binomial") -->

<!--           auc_inner[j]= Metrics::auc(test_inner$survived, predict(model, test_inner, type = "response")) -->

<!--         } -->

<!--       return(mean(auc_inner)) -->

<!--     }) -->

<!-- results = data.frame(lambda = hyper_lambda, AUC = tuning_results) -->

<!-- print(results) -->

<!-- ``` -->

<!-- ::: -->

<!-- `r hide("Click here to see the solution")` -->

<!-- ```{r} -->

<!-- set.seed(42) -->

<!-- cv = 5 -->

<!-- hyper_lambda = runif(20,0, 0.2) -->

<!-- hyper_alpha = runif(20, 0, 1) -->

<!-- tuning_results =  -->

<!--     sapply(1:length(hyper_alpha), function(k) { -->

<!--         auc_inner = NULL -->

<!--         for(j in 1:cv) { -->

<!--           inner_split = as.integer(cut(1:nrow(data_obs), breaks = cv)) -->

<!--           train_inner = data_obs[inner_split != j, ] -->

<!--           test_inner = data_obs[inner_split == j, ] -->

<!--           model = glmnet(survived~.,data = train_inner, family = "binomial", alpha = hyper_alpha[k], lambda = hyper_lambda[k]) -->

<!--           auc_inner[j]= Metrics::auc(test_inner$survived, predict(model, test_inner, type = "response")) -->

<!--         } -->

<!--       return(mean(auc_inner)) -->

<!--     }) -->

<!-- results = data.frame(lambda = hyper_lambda, alpha = hyper_alpha,  AUC = tuning_results) -->

<!-- print(results) -->

<!-- print(results[which.max(results$AUC),]) -->

<!-- ``` -->

<!-- Predictions: -->

<!-- ```{r, results='hide', message=FALSE, warning=FALSE} -->

<!-- model = glmnet(survived~.,data = data_obs, family = "binomial",alpha = results[which.max(results$AUC),2]) -->

<!-- predictions = predict(model, data_new, alpha = results$alpha[i], s = results[which.max(results$AUC),1], type = "response")[,1] -->

<!-- write.csv(data.frame(y = predictions, file = "Max_titanic_best_model.csv")) -->

<!-- ``` -->

<!-- `r unhide()` -->

::: {.callout-warning}
#### Question: Hyperparameter tuning rf

| Hyperparameter    | Explanation                                                                                                                              |
|--------------|----------------------------------------------------------|
| mtry              | Subset of features randomly selected in each node (from which the algorithm can select the feature that will be used to split the data). |
| minimum node size | Minimal number of observations allowed in a node (before the branching is canceled)                                                      |
| max depth         | Maximum number of tree depth                                                                                                             |

Combing back to the titanic dataset from the morning, we want to optimize min node size in our RF using a simple CV.

Prepare the data:

```{r}
library(EcoData)
library(dplyr)
library(missRanger)
data(titanic_ml)
data = titanic_ml
data = 
  data %>% select(survived, sex, age, fare, pclass)
data[,-1] = missRanger(data[,-1], verbose = 0)

data_sub =
  data %>%
    mutate(age = scales::rescale(age, c(0, 1)),
           fare = scales::rescale(fare, c(0, 1))) %>%
    mutate(sex = as.integer(sex) - 1L,
           pclass = as.integer(pclass - 1L))
data_new = data_sub[is.na(data_sub$survived),] # for which we want to make predictions at the end
data_obs = data_sub[!is.na(data_sub$survived),] # data with known response
data_sub$survived = as.factor(data_sub$survived)
data_obs$survived = as.factor(data_obs$survived)
```

**Hints:**

-   adjust the '`type`' argument in the `predict(…)` method (the default is to predict classes)
-   when predicting probabilities, the randomForest will return a matrix, a column for each class, we are interested in the probability of surviving (so the second column)

**Bonus:**

-   tune min node size (and mtry)
-   use more features

::: {.callout-tip collapse="true" appearance="minimal"}
## Code template

```{r, eval=FALSE}
library(ranger)
data_obs = data_sub[!is.na(data_sub$survived),] 
set.seed(42)

cv = 3
hyper_minnodesize = ...

tuning_results =
    sapply(1:length(hyper_minnodesize), function(k) {
        auc_inner = NULL
        for(j in 1:cv) {
          inner_split = as.integer(cut(1:nrow(data_obs), breaks = cv))
          train_inner = data_obs[inner_split != j, ]
          test_inner = data_obs[inner_split == j, ]
          
          model = ranger(survived~.,data = train_inner, min.node.size = hyper_minnodesize[k], probability = TRUE)
          predictions = predict(model, test_inner)$predictions[,2]
          
          auc_inner[j]= Metrics::auc(test_inner$survived, predictions)
        }
      return(mean(auc_inner))
    })

results = data.frame(minnodesize = hyper_minnodesize, AUC = tuning_results)

print(results)
```
:::


:::

`r hide("Click here to see the solution")`


```{r}
library(ranger)
data_obs = data_sub[!is.na(data_sub$survived),] 
set.seed(42)

cv = 3
hyper_minnodesize = sample(300, 20)

tuning_results =
    sapply(1:length(hyper_minnodesize), function(k) {
        auc_inner = NULL
        for(j in 1:cv) {
          inner_split = as.integer(cut(1:nrow(data_obs), breaks = cv))
          train_inner = data_obs[inner_split != j, ]
          test_inner = data_obs[inner_split == j, ]
          model = ranger(survived~.,data = train_inner, min.node.size = hyper_minnodesize[k], probability = TRUE)
          predictions = predict(model, test_inner)$predictions[,2]
          
          auc_inner[j]= Metrics::auc(test_inner$survived, predictions)
        }
      return(mean(auc_inner))
    })

results = data.frame(minnodesize = hyper_minnodesize, AUC = tuning_results)

print(results)
```


Make predictions:

```{r, results='hide', warning=FALSE, message=FALSE}
model = ranger(survived~.,data = data_obs, min.node.size = results[which.max(results$AUC),1], probability = TRUE)

write.csv(data.frame(y = predict(model, data_new)$predictions[,1]), file = "Max_titanic_rf.csv")
```

`r unhide()`


::: {.callout-warning}
#### Question: Hyperparameter tuning BRT

Important hyperparameters:

| Hyperparameter | Explanation                                                                  |
|----------------|--------------------------------------------------------|
| eta            | learning rate (weighting of the sequential trees)                            |
| max depth      | maximal depth in the trees (small = low complexity, large = high complexity) |
| subsample      | subsample ratio of the data (bootstrap ratio)                                |
| lambda         | regularization strength of the individual trees                              |
| max tree       | maximal number of trees in the ensemble                                      |


Combing back to the titanic dataset from the morning, we want to optimize max depth and the eta parameter in xgboost.

Prepare the data:

```{r}
library(EcoData)
library(dplyr)
library(missRanger)
data(titanic_ml)
data = titanic_ml
data = 
  data %>% select(survived, sex, age, fare, pclass)
data[,-1] = missRanger(data[,-1], verbose = 0)

data_sub =
  data %>%
    mutate(age = scales::rescale(age, c(0, 1)),
           fare = scales::rescale(fare, c(0, 1))) %>%
    mutate(sex = as.integer(sex) - 1L,
           pclass = as.integer(pclass - 1L))
data_new = data_sub[is.na(data_sub$survived),] # for which we want to make predictions at the end
data_obs = data_sub[!is.na(data_sub$survived),] # data with known response
```

::: {.callout-tip collapse="true" appearance="minimal"}
## Code template

```{r, eval=FALSE}
library(xgboost)
set.seed(42)
data_obs = data_sub[!is.na(data_sub$survived),] 
cv = 3

outer_split = as.integer(cut(1:nrow(data_obs), breaks = cv))

# sample minnodesize values (must be integers)
hyper_depth = ...
hyper_eta = ...

tuning_results =
    sapply(1:length(hyper_eta), function(k) {
        auc_inner = NULL
        for(j in 1:cv) {
          inner_split = as.integer(cut(1:nrow(data_obs), breaks = cv))
          train_inner = data_obs[inner_split != j, ]
          test_inner = data_obs[inner_split == j, ]
          
          data_xg = xgb.DMatrix(data = as.matrix(train_inner[,-1]), label = train_inner$survived)
          
          model = xgboost(data_xg, nrounds = 16L, eta = hyper_eta[k], max_depth = hyper_depth[k], objective = "reg:logistic")
          predictions = predict(model, newdata = as.matrix(test_inner)[,-1])
          
          auc_inner[j]= Metrics::auc(test_inner$survived, predictions)
        }
      return(mean(auc_inner))
    })

results = data.frame(depth = hyper_depth, eta = hyper_eta, AUC = tuning_results)

print(results)
```
:::

:::


`r hide("Click here to see the solution")`

```{r}
library(xgboost)
set.seed(42)
data_obs = data_sub[!is.na(data_sub$survived),] 
cv = 3

outer_split = as.integer(cut(1:nrow(data_obs), breaks = cv))

# sample minnodesize values (must be integers)
hyper_depth = sample(200, 20)
hyper_eta = runif(20, 0, 1)


tuning_results =
    sapply(1:length(hyper_eta), function(k) {
        auc_inner = NULL
        for(j in 1:cv) {
          inner_split = as.integer(cut(1:nrow(data_obs), breaks = cv))
          train_inner = data_obs[inner_split != j, ]
          test_inner = data_obs[inner_split == j, ]
          
          data_xg = xgb.DMatrix(data = as.matrix(train_inner[,-1]), label = train_inner$survived)
          
          model = xgboost(data_xg, nrounds = 16L, eta = hyper_eta[k], max_depth = hyper_depth[k], objective = "reg:logistic")
          predictions = predict(model, newdata = as.matrix(test_inner)[,-1])
          
          auc_inner[j]= Metrics::auc(test_inner$survived, predictions)
        }
      return(mean(auc_inner))
    })

results = data.frame(depth = hyper_depth, eta = hyper_eta, AUC = tuning_results)

print(results)

```

Make predictions:

```{r, results='hide', warning=FALSE, message=FALSE}
data_xg = xgb.DMatrix(data = as.matrix(data_obs[,-1]), label = data_obs$survived)

model = xgboost(data_xg, nrounds = 16L, eta = results[which.max(results$AUC), 2], max_depth = results[which.max(results$AUC), 1], objective = "reg:logistic")

predictions = predict(model, newdata = as.matrix(data_new)[,-1])

# Single predictions from the ensemble model:
write.csv(data.frame(y = predictions), file = "Max_titanic_xgboost.csv")
```

`r unhide()`