README.Rmd

---
title: "kuenm: An R package for detailed development of Maxent Ecological Niche Models"
author: "Marlon E. Cobos, A. Townsend Peterson, Luis Osorio-Olvera, and Narayani Barve"
output:
  github_document:
    toc: yes
    toc_depth: 4
csl: ecography.csl
bibliography: My Library.bib
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

<br>

## Introduction

**kuenm** is an R package designed to make the process of model calibration and final model creation easier and more reproducible, and at the same time more robust. The aim of this package is to design suites of candidate models to create diverse calibrations of Maxent models and enable selection of optimal parameterizations for each study. Other objectives of this program are to make the task of creating final models and their transfers easier, as well to permit assessing extrapolation risks when model transfers are needed.

This document is a brief tutorial for using the functions of the **kuenm** R package. The example of a disease vector (a tick) is used in this tutorial to make it more clear and understandable. Functions help can be checked while performing the processes. 

<br>

## Getting started
### Directory structure and necessary data

Since this package was designed to perform complex analyses while avoiding excessive demands on the computer (especially related to RAM memory used for R), it needs certain data that are organized carefully in the working directory. Following this structure (Figure 1) will allow working with one or more species in a project, and avoid potential problems during the analyses.

Before starting the analyses, the user must make sure that the working directory (the folder with information for an individual species) has the following components: 

* A folder containing the distinct sets of environmental variables (i.e., M_variables in Figure 1) to be used (more than one is highly recommended, but not mandatory). These variables must represent environmental variation across the area over which models are calibrated.
* A csv file containing training and testing occurrence data together (preferably after cleaning and thinning original data to avoid problems like wrong records and spatial auto-correlation). This data set consists of three fields: species name, longitude, and latitude. See Sp_joint.csv, in Figure 1.
* A csv file containing occurrence data for training models. This file and the next file generally represent exclusive subsets of the full set of records. Occurrences can be subsetted in multiple ways [@muscarella_enmeval:_2014], but some degree of independence of training and testing data is desired. See Sp_train.csv in Figure 1.
* A csv file containing species occurrence data for testing models as part of the calibration process (i.e., Sp_test.csv in Figure 1).
* If available, a csv file containing a completely independent subset of occurrence data&mdash;external to training and testing data&mdash;for a final, formal model evaluation. This dataset (i.e., for final model evaluation) is given as Sp_ind.csv in Figure 1. 

<br>

```{r Fig.1, echo=FALSE, message=FALSE, warning=FALSE, fig.height=4, fig.width=6, fig.cap="Figure 1. Directory structure and data for starting processing, as well as directory structure when the processes finish using the kuenm R package. Background colors represent data necessary before starting the analyses (blue) and data generated after the following steps: running the start function (yellow), creating candidate models (lighter green), evaluating candidate models (purple), preparing projection layers (light orange), generating final models and its projections (light gray), evaluating final models with independent data (brown), and analyzing extrapolation risks in projection areas or scenarios (darker green)."}

knitr::include_graphics("Structure.png")
```


<br>

### Installing the package

The **kuenm** R package is in a GitHub repository and can be installed and/or loaded using the following code (make sure to have Internet connection). To warranty the package functionality, a crucial requirement is to have the maxent.jar application in any user-defined directory (we encourage you to maintain it in a fixed directory). This software is available in the <a href="https://biodiversityinformatics.amnh.org/open_source/maxent/" target="_blank">Maxent repository</a>. Another important requirement for using Maxent and therefore the kuenm package is to have the Java Development Kit installed. The Java Development Kit is available in <a href="http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html" target="_blank">this repository</a>. Finally, for Windows users, Rtools needs to be installed in the computer; it is important that this software is added to the PATH. For instructions on how to download and install it see <a href="http://jtleek.com/modules/01_DataScientistToolbox/02_10_rtools/#1" target="_blank">this guide</a>. Users of other operative systems may need to install other compiling tools.

```{r, eval=FALSE, include=TRUE}
# Installing and loading packages
if(!require(devtools)){
    install.packages("devtools")
}

if(!require(kuenm)){
    devtools::install_github("marlonecobos/kuenm")
}

library(kuenm)
```

<br>

### Downloading the example data

Data used as an example for testing this package correspond to the turkey tick *Amblyomma americanum*, a vector of various  diseases, including human monocytotropic ehrlichiosis, canine and human granulocytic ehrlichiosis, tularemia, and southern tick-associated rash illness. This species is distributed broadly in North America and a complete analysis of the risk of its invasion in other regions is being developed by Raghavan et al. (in review).

These data are already structured as needed for doing analysis with this package, and can be downloaded (from <a href="http://doi.org/10.17161/1808.26376" target="_blank">kuenm example data</a>) and extracted using the code below.

```{r, eval=FALSE, include=TRUE}
# Change "YOUR/DIRECTORY" by your actual directory.
download.file(url = "https://kuscholarworks.ku.edu/bitstream/handle/1808/26376/ku.enm_example_data.zip?sequence=3&isAllowed=y", 
              destfile = "YOUR/DIRECTORY/ku.enm_example_data.zip", mode = "wb",
              quiet = FALSE) # donwload the zipped example folder in documents

unzip(zipfile = "YOUR/DIRECTORY/ku.enm_example_data.zip",
      exdir = "YOUR/DIRECTORY") # unzip the example folder in documents

unlink("YOUR/DIRECTORY/ku.enm_example_data.zip") # erase zip file

setwd("YOUR/DIRECTORY/ku.enm_example_data/A_americanum") # set the working directory

dir() # check what is in your working directory

# If you have your own data and they are organized as in the first part of Figure 1, change 
# your directory and follow the instructions below.
```

Your working directory will be structured similarly to that presented for Species_1 in the left part of Figure 1.

<br>

## Doing analyses for a single species project
### Workflow recording

Once the working directory and data are ready, the function *kuenm_start* will allow generating an R Markdown (.Rmd) file as a guide to performing all the analyses that this package includes (Figure 1, yellow area). By recording all the code chunks used in the process, this file also helps to make analyses more reproducible. This file will be written in the working directory. The usage of this function is optional, but it is recommended if recording individual workflows per each species is desired. 

```{r, eval=FALSE, include=TRUE}
help(kuenm_start)
```

```{r}
# Preparing variables to be used in arguments
file_name <- "aame_enm_process"
```

```{r, eval=FALSE, include=TRUE}
kuenm_start(file.name = file_name)
```

<br>

### Calibration of models

Note that, from this point, the following procedures will be performed in the R Markdown file previously created, but only if the *kuenm_start* function was used.

<br>

#### Creation of candidate models

The function *kuenm_cal* creates and executes a batch file for generating Maxent candidate models that will be written in subdirectories, named as the parameterizations selected, inside the output directory (Figure 1, light green area). Calibration models will be created with multiple combinations of regularization multipliers, feature classes, and sets of environmental predictors. For each combination, this function creates one Maxent model with the complete set of occurrences and another with training occurrences only. On some computers, the user will be asked if ruining the batch file is allowed before the modeling process starts in Maxent.

Maxent will run in command-line interface (do not close the application) and its graphic interface will not show up, to avoid interfering with activities other than the modeling process.

```{r, eval=FALSE, include=TRUE}
help(kuenm_cal)
```

```{r, eval=FALSE, include=TRUE}
# Variables with information to be used as arguments. Change "YOUR/DIRECTORY" by your actual directory.
occ_joint <- "aame_joint.csv"
occ_tra <- "aame_train.csv"
M_var_dir <- "M_variables"
batch_cal <- "Candidate_models"
out_dir <- "Candidate_Models"
reg_mult <- c(seq(0.1, 1, 0.1), seq(2, 6, 1), 8, 10)
f_clas <- "all"
args <- NULL # e.g., "maximumbackground=20000" for increasing the number of pixels in the bacground or
             # note that some arguments are fixed in the function and should not be changed
maxent_path <- "YOUR/DIRECTORY/ku.enm_example_data/A_americanum"
wait <- FALSE
run <- TRUE
```

```{r, eval=FALSE, include=TRUE}
kuenm_cal(occ.joint = occ_joint, occ.tra = occ_tra, M.var.dir = M_var_dir, batch = batch_cal,
          out.dir = out_dir, reg.mult = reg_mult, f.clas = f_clas, args = args,
          maxent.path = maxent_path, wait = wait, run = run)
```

<br>

#### Evaluation and selection of best models

The function *kuenm_ceval* evaluates model performance based on statistical significance (partial ROC), omission rate (*E* = a user-selected proportion of occurrence data that may present meaningful errors; see @peterson_rethinking_2008), and model complexity (AICc), and selects best models based on distinct, user-set criteria (see selection in function help). Partial ROC and omission rates are evaluated based on models created with training occurrences, whereas AICc values are calculated for models created with the full set of occurrences [@warren_ecological_2011]. Outputs are stored in a folder that will contain a .csv file with the statistics of models meeting each of the evaluation criteria, another with only the models selected based on the user-specified criteria, a third with performance metrics for all candidate models, a plot of model performance, and an HTML file reporting all the results of the model evaluation and selection process designed to guide further interpretations (Figure 1, purple area).

```{r, eval=FALSE, include=TRUE}
help(kuenm_ceval)
```

```{r, eval=FALSE, include=TRUE}
occ_test <- "aame_test.csv"
out_eval <- "Calibration_results"
threshold <- 5
rand_percent <- 50
iterations <- 100
kept <- TRUE
selection <- "OR_AICc"
paral_proc <- FALSE # make this true to perform pROC calculations in parallel, recommended
                    # only if a powerfull computer is used (see function's help)
# Note, some of the variables used here as arguments were already created for previous function
```

```{r, eval=FALSE, include=TRUE}
cal_eval <- kuenm_ceval(path = out_dir, occ.joint = occ_joint, occ.tra = occ_tra, occ.test = occ_test, batch = batch_cal,
                        out.eval = out_eval, threshold = threshold, rand.percent = rand_percent, iterations = iterations,
                        kept = kept, selection = selection, parallel.proc = paral_proc)
```
<br>

### Final model creation

After selecting parameterizations producing the best models, the next step is that of creating final models and, if needed, transferring them to other areas or scenarios. The *kuenm_mod* function takes the .csv file with the best models from the model selection process, and writes and executes a batch file for creating final models with the selected parameterizations. Models and projections are stored in subdirectories inside an output folder; these subdirectories will be named as with the candidate models. By allowing projections (i.e., project = TRUE) and defining the folder holding the data for transfers (i.e., folder name in G.var.dir argument), this function automatically performs those transfers.

Maxent will run in command-line interface, as it did when creating the calibration models (again, do not close the application). However, the process of creating final models may take considerably more time, especially when transferring to other regions or scenarios.

```{r, eval=FALSE, include=TRUE}
help(kuenm_mod)
```

```{r, eval=FALSE, include=TRUE}
batch_fin <- "Final_models"
mod_dir <- "Final_Models"
rep_n <- 10
rep_type <- "Bootstrap"
jackknife <- FALSE
out_format <- "logistic"
project <- TRUE
G_var_dir <- "G_variables"
ext_type <- "all"
write_mess <- FALSE
write_clamp <- FALSE
wait1 <- FALSE
run1 <- TRUE
args <- NULL # e.g., "maximumbackground=20000" for increasing the number of pixels in the bacground or
             # "outputgrids=false" which avoids writing grids of replicated models and only writes the 
             # summary of them (e.g., average, median, etc.) when rep.n > 1
             # note that some arguments are fixed in the function and should not be changed
# Again, some of the variables used here as arguments were already created for previous functions
```

```{r, eval=FALSE, include=TRUE}
kuenm_mod(occ.joint = occ_joint, M.var.dir = M_var_dir, out.eval = out_eval, batch = batch_fin, rep.n = rep_n,
          rep.type = rep_type, jackknife = jackknife, out.dir = mod_dir, out.format = out_format, project = project,
          G.var.dir = G_var_dir, ext.type = ext_type, write.mess = write_mess, write.clamp = write_clamp, 
          maxent.path = maxent_path, args = args, wait = wait1, run = run1)
```

<br>

### Final model evaluation

Final models should be evaluated using independent occurrence data (i.e., data that have not been used in the calibration process that usually come from different sources). The *kuenm_feval* function evaluates final models based on statistical significance (partial ROC) and omission rate (*E*). This function will return a folder containing a .csv file with the results of the evaluation (see Figure 1, brown color). 

```{r, eval=FALSE, include=TRUE}
help(kuenm_feval)
```

```{r, eval=FALSE, include=TRUE}
occ_ind <- "aame_ind.csv"
replicates <- TRUE
out_feval <- "Final_Models_evaluation"
# Most of the variables used here as arguments were already created for previous functions
```

```{r, eval=FALSE, include=TRUE}
fin_eval <- kuenm_feval(path = mod_dir, occ.joint = occ_joint, occ.ind = occ_ind, replicates = replicates,
                        out.eval = out_feval, threshold = threshold, rand.percent = rand_percent,
                        iterations = iterations, parallel.proc = paral_proc)
```

<br>

### Extrapolation risk analysis

If transfers were performed when creating final models, risks of extrapolation can be assessed using the *kuenm_mmop* function. This function calculates mobility-oriented parity (MOP) layers [@owens_constraints_2013] by comparing environmental values between the calibration area and one or multiple regions or scenarios to which ecological niche models were transferred. The layers produced with this function help to visualize were strict extrapolation risks exist, and different similarity levels between the projection regions or scenarios and the calibration area. Results from this analysis will be written for each set of variables inside an specific data (Figure 1, dark green areas).

```{r, eval=FALSE, include=TRUE}
help(kuenm_mmop)
```

```{r, eval=FALSE, include=TRUE}
sets_var <- "Set3" # a vector of various sets can be used
out_mop <- "MOP_results"
percent <- 10
paral <- FALSE # make this true to perform MOP calculations in parallel, recommended
               # only if a powerfull computer is used (see function's help)
# Two of the variables used here as arguments were already created for previous functions
```

```{r, eval=FALSE, include=TRUE}
kuenm_mmop(G.var.dir = G_var_dir, M.var.dir = M_var_dir, sets.var = sets_var, out.mop = out_mop,
           percent = percent, parallel = paral)
```

<br>

## Other functionalities of kuenm

Other analyses **kuenm** allows are:

* <a href="https://github.com/marlonecobos/kuenm/tree/master/extra_vignettes/post-modeling.md" target="_blank">Post-modeling analyses</a>

<br>

## References