---
output: html_document
editor_options:
chunk_output_type: console
---
# Datasets {#sec-datasets}
```{r}
#| echo: false
#| include: false
#| message: false
#| warning: false
reticulate::use_condaenv("r-reticulate")
library(tensorflow)
library(keras)
tf
```
You can download the data sets we use in the course <a href="http://rhsbio7.uni-regensburg.de:8500" target="_blank" rel="noopener">here</a> (ignore the browser warnings) or get them by installing the EcoData package:
```{r chunk_chapter8_0, eval=FALSE}
devtools::install_github(repo = "florianhartig/EcoData", subdir = "EcoData",
dependencies = TRUE, build_vignettes = FALSE)
```
## Titanic
The data set is a collection of Titanic passengers with information about their age, class, sex, and survival status. The task is simple: train a machine learning model and predict the survival probability.
The Titanic data set is very well explored and serves as a stepping stone in many machine learning careers. For inspiration and data exploration notebooks, check out this <a href="https://www.kaggle.com/c/titanic/data" target="_blank" rel="noopener">kaggle competition</a>.
**Response variable:** "survived"
A minimal working example:
1. Load data set:
```{r chunk_chapter8_1}
library(EcoData)
data(titanic_ml)
titanic = titanic_ml
summary(titanic)
```
2. Impute missing values (not our response variable!):
```{r chunk_chapter8_2, message=FALSE, warning=FALSE}
library(missRanger)
library(dplyr)
set.seed(123)
titanic_imputed = titanic %>% select(-name, -ticket, -cabin, -boat, -home.dest)
titanic_imputed = missRanger::missRanger(data = titanic_imputed %>%
select(-survived), verbose = 0)
titanic_imputed$survived = titanic$survived
```
3. Split into training and test set:
```{r chunk_chapter8_3}
train = titanic_imputed[!is.na(titanic$survived), ]
test = titanic_imputed[is.na(titanic$survived), ]
```
4. Train model:
```{r chunk_chapter8_4}
model = glm(survived~., data = train, family = binomial())
```
5. Predictions:
```{r chunk_chapter8_5}
# Use newdata (not data), otherwise predict.glm silently returns predictions for the training data.
preds = predict(model, newdata = test, type = "response")
head(preds)
```
6. Create submission csv:
```{r chunk_chapter8_6, eval=FALSE}
write.csv(data.frame(y = preds), file = "glm.csv")
```
And submit the csv on <a href="http://rhsbio7.uni-regensburg.de:8500" target="_blank" rel="noopener">http://rhsbio7.uni-regensburg.de:8500</a>.
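Because the survival status of the submission passengers is withheld, you cannot score your model locally on the test set. A minimal sketch of a local check, holding out part of the labelled training data and computing the AUC with the `Metrics` package (the 20% split and the seed are arbitrary choices here):
```{r, eval=FALSE}
set.seed(42)
# Hold out 20% of the labelled passengers as a local validation set.
holdout = sample(nrow(train), round(0.2 * nrow(train)))
fit = glm(survived~., data = train[-holdout, ], family = binomial())
val_preds = predict(fit, newdata = train[holdout, ], type = "response")
# Area under the ROC curve on the held-out passengers.
Metrics::auc(train$survived[holdout], val_preds)
```
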
## Plant-pollinator Database {#sec-plantpoll}
The plant-pollinator database is a collection of plant-pollinator interactions with traits for plants and pollinators. The idea is that pollinators interact with plants whose traits match their own (e.g. the tongue of a bee needs to match the shape of a flower). We explored the advantage of machine learning algorithms over traditional statistical models in predicting species interactions in our paper; if you are interested, have a look <a href="https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.13329" target="_blank" rel="noopener">here</a>. A random forest alternative to the GLM below is sketched at the end of this example.
```{r chunk_chapter8_7, echo=FALSE}
knitr::include_graphics("./images/TM.png")
```
**Response variable:** "interaction"
A minimal working example:
1. Load data set:
```{r chunk_chapter8_8}
library(EcoData)
data(plantPollinator_df)
plant_poll = plantPollinator_df
summary(plant_poll)
```
2. Impute missing values (not our response variable!). We will select only a few predictors here (you can, of course, work with all predictors):
```{r chunk_chapter8_9, message=FALSE, warning=FALSE}
library(missRanger)
library(dplyr)
set.seed(123)
plant_poll_imputed = plant_poll %>% select(diameter,
corolla,
tongue,
body,
interaction)
plant_poll_imputed = missRanger::missRanger(data = plant_poll_imputed %>%
select(-interaction), verbose = 0)
plant_poll_imputed$interaction = plant_poll$interaction
```
3. Split into training and test set:
```{r chunk_chapter8_10}
train = plant_poll_imputed[!is.na(plant_poll_imputed$interaction), ]
test = plant_poll_imputed[is.na(plant_poll_imputed$interaction), ]
```
4. Train model:
```{r chunk_chapter8_11}
model = glm(interaction~., data = train, family = binomial())
```
5. Predictions:
```{r chunk_chapter8_12}
preds = predict(model, newdata = test, type = "response")
head(preds)
```
6. Create submission csv:
```{r chunk_chapter8_13, eval=FALSE}
write.csv(data.frame(y = preds), file = "glm.csv")
```
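As mentioned above, machine learning algorithms tended to outperform the GLM-type models on this task in our paper. A minimal sketch that swaps the GLM for a random forest, using the `ranger` package as in the later examples (the column picked for the submission is an assumption; check `colnames(preds_rf)` to see which class each column refers to):
```{r, eval=FALSE}
library(ranger)
set.seed(123)
# Ensure a factor response, then fit a forest that returns class probabilities.
train$interaction = as.factor(train$interaction)
rf = ranger(interaction~., data = train, probability = TRUE)
preds_rf = predict(rf, data = test)$predictions
head(preds_rf)
# Submit the probability of an interaction (here assumed to be column 2).
write.csv(data.frame(y = preds_rf[, 2]), file = "rf_plant_poll.csv")
```
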
## Wine
The data set is a collection of wines of different quality. The aim is to predict the quality of a wine based on physicochemical predictors.
For inspiration and data exploration notebooks, check out this <a href="https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009" target="_blank" rel="noopener">kaggle competition</a>. For instance, check out this very nice <a href="https://www.kaggle.com/aditimulye/red-wine-quality-assesment-starter-pack" target="_blank" rel="noopener">notebook</a>, which addresses a few problems in the data.
**Response variable:** "quality"
We could theoretically treat this as a regression task, but we will stick with a classification model (a regression sketch is given after the example below).
A minimal working example:
1. Load data set:
```{r chunk_chapter8_14}
library(EcoData)
data(wine)
summary(wine)
```
2. Impute missing values (not our response variable!).
```{r chunk_chapter8_15, message=FALSE, warning=FALSE}
library(missRanger)
library(dplyr)
set.seed(123)
wine_imputed = missRanger::missRanger(data = wine %>% select(-quality), verbose = 0)
wine_imputed$quality = wine$quality
```
3. Split into training and test set:
```{r chunk_chapter8_16}
train = wine_imputed[!is.na(wine$quality), ]
test = wine_imputed[is.na(wine$quality), ]
```
4. Train model:
```{r chunk_chapter8_17, message=FALSE, warning=FALSE}
library(ranger)
set.seed(123)
rf = ranger(quality~., data = train, classification = TRUE)
```
5. Predictions:
```{r chunk_chapter8_18}
preds = predict(rf, data = test)$predictions
head(preds)
```
6. Create submission csv:
```{r chunk_chapter8_19, eval=FALSE}
write.csv(data.frame(y = preds), file = "rf.csv")
```
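If you want to try the regression route mentioned above, a minimal sketch that fits the random forest in regression mode and rounds the predictions back to integer quality scores (rounding is just one possible way to map the continuous predictions back to classes; quality is assumed to be stored as a numeric score, which is why `classification = TRUE` was needed above):
```{r, eval=FALSE}
library(ranger)
set.seed(123)
# Regression forest on the numeric quality scores.
rf_reg = ranger(quality~., data = train)
# Round the continuous predictions back to integer quality classes.
preds_reg = round(predict(rf_reg, data = test)$predictions)
table(preds_reg)
```
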
## Nasa
A collection of asteroids and their characteristics from kaggle. The aim is to predict whether an asteroid is hazardous or not. For inspiration and data exploration notebooks, check out this <a href="https://www.kaggle.com/shrutimehta/nasa-asteroids-classification" target="_blank" rel="noopener">kaggle competition</a>.
**Response variable:** "Hazardous"
1. Load data set:
```{r chunk_chapter8_20}
library(EcoData)
data(nasa)
summary(nasa)
```
2. Impute missing values (not our response variable!):
```{r chunk_chapter8_21, message=FALSE, warning=FALSE}
library(missRanger)
library(dplyr)
set.seed(123)
nasa_imputed = missRanger::missRanger(data = nasa %>% select(-Hazardous),
maxiter = 1, num.trees = 5L, verbose = 0)
nasa_imputed$Hazardous = nasa$Hazardous
```
3. Split into training and test set:
```{r chunk_chapter8_22}
train = nasa_imputed[!is.na(nasa$Hazardous), ]
test = nasa_imputed[is.na(nasa$Hazardous), ]
```
4. Train model:
```{r chunk_chapter8_23, message=FALSE, warning=FALSE}
library(ranger)
set.seed(123)
rf = ranger(Hazardous~., data = train, classification = TRUE,
probability = TRUE)
```
5. Predictions:
```{r chunk_chapter8_24}
preds = predict(rf, data = test)$predictions[,2]
head(preds)
```
6. Create submission csv:
```{r chunk_chapter8_25, eval=FALSE}
write.csv(data.frame(y = preds), file = "rf.csv")
```
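If you want to see which asteroid characteristics drive the hazard classification, you can refit the forest with an importance measure; a minimal sketch using ranger's impurity importance (the model above was fitted without one):
```{r, eval=FALSE}
library(ranger)
set.seed(123)
rf_imp = ranger(Hazardous~., data = train, classification = TRUE,
                probability = TRUE, importance = "impurity")
# Predictors sorted from most to least important (impurity-based).
sort(rf_imp$variable.importance, decreasing = TRUE)
```
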
## Flower
A collection of over 4000 flower images of 5 plant species. The data set is from <a href="https://www.kaggle.com/alxmamaev/flowers-recognition" target="_blank" rel="noopener">kaggle</a>, but we downsampled the images from $320 \times 240$ to $80 \times 80$ pixels. You can a) download the data set <a href="http://rhsbio7.uni-regensburg.de:8500" target="_blank" rel="noopener">here</a> or b) get it via the EcoData package.
**Notes:**
- Check out convolutional neural network notebooks on kaggle (they are often written in Python but you can still copy the architectures), e.g. <a href="https://www.kaggle.com/alirazaaliqadri/flower-recognition-tensorflow-keras-sequential" target="_blank" rel="noopener">this one</a>.
- Last year's winners used a transfer learning approach (they achieved around 70% accuracy); check out this <a href="https://www.kaggle.com/stpeteishii/flower-name-classify-densenet201" target="_blank" rel="noopener">notebook</a> and see also the section about transfer learning \@ref(transfer). A minimal transfer-learning sketch is given at the end of this section.
**Response variable:** "Plant species"
1. Load data set:
```{r chunk_chapter8_26, message=FALSE, warning=FALSE}
library(tensorflow)
library(keras)
flower = EcoData::dataset_flower()
train = flower$train/255
test = flower$test/255
labels = flower$labels
```
Let's visualize a flower:
```{r chunk_chapter8_27}
train[100,,,] %>%
image_to_array() %>%
as.raster() %>%
plot()
```
2. Build and train model:
```{r chunk_chapter8_28, eval=FALSE, warning=FALSE}
model = keras_model_sequential()
model %>%
layer_conv_2d(filters = 4L, kernel_size = 2L,
input_shape = list(80L, 80L, 3L)) %>%
layer_max_pooling_2d() %>%
layer_flatten() %>%
layer_dense(units = 5L, activation = "softmax")
### Model fitting ###
model %>%
compile(loss = loss_categorical_crossentropy,
optimizer = optimizer_adamax(learning_rate = 0.01))
model %>%
fit(x = train, y = keras::k_one_hot(labels, 5L))
```
3. Predictions:
```{r chunk_chapter8_29, eval=FALSE}
# Prediction on training data:
pred = apply(model %>% predict(train), 1, which.max)
Metrics::accuracy(pred - 1L, labels)
table(pred)
# Prediction for the submission server:
pred = model %>% predict(test) %>% apply(1, which.max) - 1L
table(pred)
```
4. Create submission csv:
```{r chunk_chapter8_30, eval=FALSE}
write.csv(data.frame(y = pred), file = "cnn.csv")
```
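As mentioned in the notes above, a transfer learning approach usually beats a small CNN trained from scratch. A minimal sketch of that idea, assuming internet access to download the pretrained ImageNet weights; DenseNet201 is just one possible backbone, and the 0-1 scaling used above only approximates the preprocessing the pretrained weights expect:
```{r, eval=FALSE}
# Pretrained convolutional base without the ImageNet classification head.
base = application_densenet201(include_top = FALSE, weights = "imagenet",
                               input_shape = c(80L, 80L, 3L), pooling = "avg")
freeze_weights(base)  # keep the pretrained weights fixed

inputs = layer_input(shape = c(80L, 80L, 3L))
outputs = inputs %>%
  base() %>%
  layer_dense(units = 5L, activation = "softmax")
model = keras_model(inputs, outputs)

model %>%
  compile(loss = loss_categorical_crossentropy,
          optimizer = optimizer_adamax(learning_rate = 0.01))
model %>%
  fit(x = train, y = keras::k_one_hot(labels, 5L), epochs = 5L)
```
Predictions and the submission csv then work exactly as in steps 3 and 4 above.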