---
title: "kuenm: An R package for detailed development of Maxent Ecological Niche Models"
author: "Marlon E. Cobos, A. Townsend Peterson, Luis Osorio-Olvera, and Narayani Barve"
output:
  github_document:
    toc: yes
    toc_depth: 4
csl: ecography.csl
bibliography: My Library.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
<br>
## Introduction
**kuenm** is an R package designed to make the process of model calibration and final model creation easier, more reproducible, and more robust. The package builds suites of candidate models that cover diverse Maxent calibrations and enables selection of optimal parameterizations for each study. It also aims to make creating final models and their transfers easier, as well as to permit assessing extrapolation risks when model transfers are needed.
This document is a brief tutorial for using the functions of the **kuenm** R package. A disease vector (a tick) is used as the example to make the tutorial clearer. The help of each function can be checked while performing the processes.
<br>
## Getting started
### Directory structure and necessary data
Since this package was designed to perform complex analyses while avoiding excessive demands on the computer (especially on the RAM used by R), it needs certain data to be organized carefully in the working directory. Following this structure (Figure 1) allows working with one or more species in a project and avoids potential problems during the analyses.
Before starting the analyses, the user must make sure that the working directory (the folder with information for an individual species) has the following components:
* A folder containing the distinct sets of environmental variables (i.e., M_variables in Figure 1) to be used (more than one is highly recommended, but not mandatory). These variables must represent environmental variation across the area over which models are calibrated.
* A csv file containing training and testing occurrence data together (preferably after cleaning and thinning original data to avoid problems like erroneous records and spatial autocorrelation). This data set consists of three fields: species name, longitude, and latitude. See Sp_joint.csv, in Figure 1.
* A csv file containing occurrence data for training models. This file and the next file generally represent exclusive subsets of the full set of records. Occurrences can be subset in multiple ways [@muscarella_enmeval:_2014], but some degree of independence of training and testing data is desired. See Sp_train.csv in Figure 1.
* A csv file containing species occurrence data for testing models as part of the calibration process (i.e., Sp_test.csv in Figure 1).
* If available, a csv file containing a completely independent subset of occurrence data (external to training and testing data) for a final, formal model evaluation. This dataset (i.e., for final model evaluation) is given as Sp_ind.csv in Figure 1.
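
As a minimal sketch of one way to produce the training and testing files, the base R code below randomly splits a joint occurrence table. The simulated records, the 75/25 proportion, and the temporary output folder are assumptions made only for illustration; a random split is just one option and does not by itself guarantee the independence mentioned above.

```{r, eval=FALSE, include=TRUE}
# Hypothetical joint occurrence table with the three required fields
set.seed(1)
occ <- data.frame(species = "Species_1",
                  longitude = runif(40, -100, -80),
                  latitude = runif(40, 30, 45))

# Randomly assign 75% of the records to training (assumed proportion)
train_rows <- sample(nrow(occ), size = round(0.75 * nrow(occ)))
occ_train <- occ[train_rows, ]
occ_test <- occ[-train_rows, ]

# Write the three files (here to a temporary folder, not a real project)
out <- tempdir()
write.csv(occ, file.path(out, "Sp_joint.csv"), row.names = FALSE)
write.csv(occ_train, file.path(out, "Sp_train.csv"), row.names = FALSE)
write.csv(occ_test, file.path(out, "Sp_test.csv"), row.names = FALSE)
```

The training and testing subsets are exclusive, and together they contain every record of the joint file.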
<br>
```{r Fig.1, echo=FALSE, message=FALSE, warning=FALSE, fig.height=4, fig.width=6, fig.cap="Figure 1. Directory structure and data for starting processing, as well as directory structure when the processes finish using the kuenm R package. Background colors represent data necessary before starting the analyses (blue) and data generated after the following steps: running the start function (yellow), creating candidate models (lighter green), evaluating candidate models (purple), preparing projection layers (light orange), generating final models and their projections (light gray), evaluating final models with independent data (brown), and analyzing extrapolation risks in projection areas or scenarios (darker green)."}
knitr::include_graphics("Structure.png")
```
<br>
### Installing the package
The **kuenm** R package is in a GitHub repository and can be installed and/or loaded using the following code (make sure to have an Internet connection). To guarantee the package functionality, a crucial requirement is to have the maxent.jar application in any user-defined directory (we encourage you to maintain it in a fixed directory). This software is available in the <a href="https://biodiversityinformatics.amnh.org/open_source/maxent/" target="_blank">Maxent repository</a>. Another important requirement for using Maxent, and therefore the kuenm package, is to have the Java Development Kit installed. The Java Development Kit is available in <a href="http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html" target="_blank">this repository</a>. Finally, for Windows users, Rtools needs to be installed on the computer; it is important that this software is added to the PATH. For instructions on how to download and install it, see <a href="http://jtleek.com/modules/01_DataScientistToolbox/02_10_rtools/#1" target="_blank">this guide</a>. Users of other operating systems may need to install other compiling tools.
```{r, eval=FALSE, include=TRUE}
# Installing and loading packages
if(!require(devtools)){
  install.packages("devtools")
}
if(!require(kuenm)){
  devtools::install_github("marlonecobos/kuenm")
}
library(kuenm)
```
<br>
### Downloading the example data
Data used as an example for testing this package correspond to the turkey tick *Amblyomma americanum*, a vector of various diseases, including human monocytotropic ehrlichiosis, canine and human granulocytic ehrlichiosis, tularemia, and southern tick-associated rash illness. This species is distributed broadly in North America and a complete analysis of the risk of its invasion in other regions is being developed by Raghavan et al. (in review).
These data are already structured as needed for doing analysis with this package, and can be downloaded (from <a href="http://doi.org/10.17161/1808.26376" target="_blank">kuenm example data</a>) and extracted using the code below.
```{r, eval=FALSE, include=TRUE}
# Replace "YOUR/DIRECTORY" with your actual directory.
download.file(url = "https://kuscholarworks.ku.edu/bitstream/handle/1808/26376/ku.enm_example_data.zip?sequence=3&isAllowed=y",
              destfile = "YOUR/DIRECTORY/ku.enm_example_data.zip", mode = "wb",
              quiet = FALSE) # download the zipped example folder to your directory
unzip(zipfile = "YOUR/DIRECTORY/ku.enm_example_data.zip",
      exdir = "YOUR/DIRECTORY") # unzip the example folder in your directory
unlink("YOUR/DIRECTORY/ku.enm_example_data.zip") # erase zip file
setwd("YOUR/DIRECTORY/ku.enm_example_data/A_americanum") # set the working directory
dir() # check what is in your working directory
# If you have your own data and they are organized as in the first part of Figure 1, change
# your directory and follow the instructions below.
```
Your working directory will be structured similarly to that presented for Species_1 in the left part of Figure 1.
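
Before going further, it can help to confirm that the expected components are present. The sketch below is only an illustration: `missing_components` is a hypothetical helper (not part of **kuenm**), and the file and folder names are the ones used as argument values later in this tutorial.

```{r, eval=FALSE, include=TRUE}
# Components named in this tutorial for the example species
needed <- c("aame_joint.csv", "aame_train.csv", "aame_test.csv", "M_variables")
optional <- c("aame_ind.csv", "G_variables") # final evaluation and transfers

# Hypothetical helper: returns the components that are missing from a folder
missing_components <- function(dir = ".", components) {
  components[!file.exists(file.path(dir, components))]
}

missing_components(".", needed)   # empty if all required components are in place
missing_components(".", optional) # these are only needed in later, optional steps
```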
<br>
## Doing analyses for a single species project
### Workflow recording
Once the working directory and data are ready, the function *kuenm_start* generates an R Markdown (.Rmd) file that serves as a guide for performing all the analyses included in this package (Figure 1, yellow area). By recording all the code chunks used in the process, this file also helps to make analyses more reproducible. This file will be written in the working directory. Using this function is optional, but it is recommended if recording an individual workflow for each species is desired.
```{r, eval=FALSE, include=TRUE}
help(kuenm_start)
```
```{r}
# Preparing variables to be used in arguments
file_name <- "aame_enm_process"
```
```{r, eval=FALSE, include=TRUE}
kuenm_start(file.name = file_name)
```
<br>
### Calibration of models
Note that, from this point, the following procedures will be performed in the R Markdown file previously created, but only if the *kuenm_start* function was used.
<br>
#### Creation of candidate models
The function *kuenm_cal* creates and executes a batch file for generating Maxent candidate models that will be written in subdirectories, named as the parameterizations selected, inside the output directory (Figure 1, light green area). Calibration models will be created with multiple combinations of regularization multipliers, feature classes, and sets of environmental predictors. For each combination, this function creates one Maxent model with the complete set of occurrences and another with training occurrences only. On some computers, the user will be asked if running the batch file is allowed before the modeling process starts in Maxent.
Maxent will run in a command-line interface (do not close the application) and its graphic interface will not show up, to avoid interfering with activities other than the modeling process.
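
To get a sense of how many models a calibration involves, the combinations can be counted before running anything. This is a rough sketch: the regularization multipliers are the ones used in this tutorial, whereas the feature classes and the names of the variable sets are assumptions chosen for illustration.

```{r, eval=FALSE, include=TRUE}
reg_mult <- c(seq(0.1, 1, 0.1), seq(2, 6, 1), 8, 10) # as in this tutorial
f_clas <- c("l", "q", "lq", "lqp")                   # assumed subset of feature classes
var_sets <- c("Set1", "Set2", "Set3")                # assumed names of variable sets

# One row per candidate parameterization
candidates <- expand.grid(reg_mult = reg_mult, f_clas = f_clas,
                          var_set = var_sets, stringsAsFactors = FALSE)

nrow(candidates)     # number of parameterizations
2 * nrow(candidates) # Maxent runs: one with all records, one with training records only
```

Because the number of runs grows multiplicatively, adding feature classes or variable sets can increase calibration time considerably.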
```{r, eval=FALSE, include=TRUE}
help(kuenm_cal)
```
```{r, eval=FALSE, include=TRUE}
# Variables with information to be used as arguments. Replace "YOUR/DIRECTORY" with your actual directory.
occ_joint <- "aame_joint.csv"
occ_tra <- "aame_train.csv"
M_var_dir <- "M_variables"
batch_cal <- "Candidate_models"
out_dir <- "Candidate_Models"
reg_mult <- c(seq(0.1, 1, 0.1), seq(2, 6, 1), 8, 10)
f_clas <- "all"
args <- NULL # e.g., "maximumbackground=20000" for increasing the number of pixels in the background
             # note that some arguments are fixed in the function and should not be changed
maxent_path <- "YOUR/DIRECTORY/ku.enm_example_data/A_americanum"
wait <- FALSE
run <- TRUE
```
```{r, eval=FALSE, include=TRUE}
kuenm_cal(occ.joint = occ_joint, occ.tra = occ_tra, M.var.dir = M_var_dir, batch = batch_cal,
          out.dir = out_dir, reg.mult = reg_mult, f.clas = f_clas, args = args,
          maxent.path = maxent_path, wait = wait, run = run)
```
<br>
#### Evaluation and selection of best models
The function *kuenm_ceval* evaluates model performance based on statistical significance (partial ROC), omission rate (*E* = a user-selected proportion of occurrence data that may present meaningful errors; see @peterson_rethinking_2008), and model complexity (AICc), and selects best models based on distinct, user-set criteria (see selection in function help). Partial ROC and omission rates are evaluated based on models created with training occurrences, whereas AICc values are calculated for models created with the full set of occurrences [@warren_ecological_2011]. Outputs are stored in a folder that will contain a .csv file with the statistics of models meeting each of the evaluation criteria, another with only the models selected based on the user-specified criteria, a third with performance metrics for all candidate models, a plot of model performance, and an HTML file reporting all the results of the model evaluation and selection process designed to guide further interpretations (Figure 1, purple area).
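
The logic of a sequential selection can be sketched with a small, made-up table. The column names, the values, the 0.05 significance level, and the cutoff of 2 AICc units are all assumptions for illustration; they are not taken from the package, so check the function help for the actual criteria behind each `selection` option.

```{r, eval=FALSE, include=TRUE}
# Hypothetical evaluation table; names and values are illustrative only
res <- data.frame(model = c("A", "B", "C", "D"),
                  proc_pval = c(0.00, 0.00, 0.20, 0.00),
                  omission = c(0.04, 0.03, 0.02, 0.12),
                  aicc = c(1502.1, 1500.3, 1490.0, 1495.7),
                  stringsAsFactors = FALSE)

E <- 5 # threshold, in percent, as in this tutorial

# 1) keep statistically significant models (assumed level of 0.05)
sig <- res[res$proc_pval <= 0.05, ]
# 2) keep models with omission rates at or below E
low_or <- sig[sig$omission <= E / 100, ]
# 3) among those, keep models within 2 AICc units of the lowest AICc (assumed cutoff)
low_or$delta_aicc <- low_or$aicc - min(low_or$aicc)
best <- low_or[low_or$delta_aicc <= 2, ]

best$model
```

In this toy table, model C has the lowest AICc but is discarded first because it is not significant, which shows why the order of the filters matters.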
```{r, eval=FALSE, include=TRUE}
help(kuenm_ceval)
```
```{r, eval=FALSE, include=TRUE}
occ_test <- "aame_test.csv"
out_eval <- "Calibration_results"
threshold <- 5
rand_percent <- 50
iterations <- 100
kept <- TRUE
selection <- "OR_AICc"
paral_proc <- FALSE # set this to TRUE to perform pROC calculations in parallel, recommended
                    # only if a powerful computer is used (see function's help)
# Note, some of the variables used here as arguments were already created for the previous function
```
```{r, eval=FALSE, include=TRUE}
cal_eval <- kuenm_ceval(path = out_dir, occ.joint = occ_joint, occ.tra = occ_tra, occ.test = occ_test, batch = batch_cal,
                        out.eval = out_eval, threshold = threshold, rand.percent = rand_percent, iterations = iterations,
                        kept = kept, selection = selection, parallel.proc = paral_proc)
```
<br>
### Final model creation
After selecting the parameterizations that produce the best models, the next step is to create final models and, if needed, transfer them to other areas or scenarios. The *kuenm_mod* function takes the .csv file with the best models from the model selection process, and writes and executes a batch file for creating final models with the selected parameterizations. Models and projections are stored in subdirectories inside an output folder; these subdirectories are named in the same way as those of the candidate models. By allowing projections (i.e., project = TRUE) and defining the folder holding the data for transfers (i.e., folder name in G.var.dir argument), this function automatically performs those transfers.
Maxent will run in a command-line interface, as it did when creating the calibration models (again, do not close the application). However, the process of creating final models may take considerably more time, especially when transferring to other regions or scenarios.
```{r, eval=FALSE, include=TRUE}
help(kuenm_mod)
```
```{r, eval=FALSE, include=TRUE}
batch_fin <- "Final_models"
mod_dir <- "Final_Models"
rep_n <- 10
rep_type <- "Bootstrap"
jackknife <- FALSE
out_format <- "logistic"
project <- TRUE
G_var_dir <- "G_variables"
ext_type <- "all"
write_mess <- FALSE
write_clamp <- FALSE
wait1 <- FALSE
run1 <- TRUE
args <- NULL # e.g., "maximumbackground=20000" for increasing the number of pixels in the background or
             # "outputgrids=false" which avoids writing grids of replicated models and only writes the
             # summary of them (e.g., average, median, etc.) when rep.n > 1
             # note that some arguments are fixed in the function and should not be changed
# Again, some of the variables used here as arguments were already created for previous functions
```
```{r, eval=FALSE, include=TRUE}
kuenm_mod(occ.joint = occ_joint, M.var.dir = M_var_dir, out.eval = out_eval, batch = batch_fin, rep.n = rep_n,
          rep.type = rep_type, jackknife = jackknife, out.dir = mod_dir, out.format = out_format, project = project,
          G.var.dir = G_var_dir, ext.type = ext_type, write.mess = write_mess, write.clamp = write_clamp,
          maxent.path = maxent_path, args = args, wait = wait1, run = run1)
```
<br>
### Final model evaluation
Final models should be evaluated using independent occurrence data (i.e., data that have not been used in the calibration process and that usually come from different sources). The *kuenm_feval* function evaluates final models based on statistical significance (partial ROC) and omission rate (*E*). This function will return a folder containing a .csv file with the results of the evaluation (see Figure 1, brown color).
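
The idea behind an omission rate at a given *E* can be illustrated with a few made-up suitability values. This is a simplified sketch under stated assumptions (the values are invented, and the threshold is taken as a simple percentile of the suitability at calibration records); it is not the code used by *kuenm_feval*.

```{r, eval=FALSE, include=TRUE}
# Hypothetical suitability values at calibration and at independent records
suit_cal <- seq(0.2, 1, length.out = 200)
suit_ind <- c(0.05, 0.10, seq(0.4, 1, length.out = 18))

E <- 5 # percent of calibration records allowed to be left out

# Threshold: suitability value below which E% of calibration records fall
thr <- quantile(suit_cal, probs = E / 100, names = FALSE)

# Omission rate: proportion of independent records predicted below the threshold
omission_rate <- mean(suit_ind < thr)
omission_rate
```

Here 2 of the 20 independent records fall below the threshold, so the omission rate is 0.1, which is higher than the 0.05 expected for *E* = 5%.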
```{r, eval=FALSE, include=TRUE}
help(kuenm_feval)
```
```{r, eval=FALSE, include=TRUE}
occ_ind <- "aame_ind.csv"
replicates <- TRUE
out_feval <- "Final_Models_evaluation"
# Most of the variables used here as arguments were already created for previous functions
```
```{r, eval=FALSE, include=TRUE}
fin_eval <- kuenm_feval(path = mod_dir, occ.joint = occ_joint, occ.ind = occ_ind, replicates = replicates,
                        out.eval = out_feval, threshold = threshold, rand.percent = rand_percent,
                        iterations = iterations, parallel.proc = paral_proc)
```
<br>
### Extrapolation risk analysis
If transfers were performed when creating final models, risks of extrapolation can be assessed using the *kuenm_mmop* function. This function calculates mobility-oriented parity (MOP) layers [@owens_constraints_2013] by comparing environmental values between the calibration area and one or multiple regions or scenarios to which ecological niche models were transferred. The layers produced with this function help to visualize where strict extrapolation risks exist, as well as different levels of similarity between the projection regions or scenarios and the calibration area. Results from this analysis will be written for each set of variables inside a specific folder (Figure 1, dark green areas).
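
The core comparison can be sketched in a few lines of base R. This is a simplified illustration and not the package's implementation: `mop_distance` is a hypothetical helper that works on plain matrices of environmental values (rows are points, columns are variables) rather than on raster layers, and it does not flag strict extrapolation.

```{r, eval=FALSE, include=TRUE}
# For each projection (G) point, the mean Euclidean distance to the nearest
# `percent`% of calibration (M) points
mop_distance <- function(m, g, percent = 10) {
  n_near <- max(1, ceiling(nrow(m) * percent / 100))
  apply(g, 1, function(pt) {
    d <- sqrt(rowSums(sweep(m, 2, pt)^2)) # distances from pt to every M point
    mean(sort(d)[seq_len(n_near)])
  })
}

set.seed(10)
m <- matrix(rnorm(200), ncol = 2) # hypothetical M conditions (100 points, 2 variables)
g <- rbind(c(0, 0), c(8, 8))      # one G point similar to M, one far outside it

mop_distance(m, g, percent = 10)  # larger values indicate conditions less similar to M
```

The `percent` argument plays the same role as the `percent` value set below: it defines how much of the calibration area is used as the reference for each comparison.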
```{r, eval=FALSE, include=TRUE}
help(kuenm_mmop)
```
```{r, eval=FALSE, include=TRUE}
sets_var <- "Set3" # a vector of various sets can be used
out_mop <- "MOP_results"
percent <- 10
paral <- FALSE # set this to TRUE to perform MOP calculations in parallel, recommended
               # only if a powerful computer is used (see function's help)
# Two of the variables used here as arguments were already created for previous functions
```
```{r, eval=FALSE, include=TRUE}
kuenm_mmop(G.var.dir = G_var_dir, M.var.dir = M_var_dir, sets.var = sets_var, out.mop = out_mop,
           percent = percent, parallel = paral)
```
<br>
## Other functionalities of kuenm
Other analyses **kuenm** allows are:
* <a href="https://github.com/marlonecobos/kuenm/tree/master/extra_vignettes/post-modeling.md" target="_blank">Post-modeling analyses</a>
<br>
## References