Random Forest in R with Large Sample Sizes
Author: Jacob Nearing
First Created: 17 April 2019
Last Edited: 17 April 2019
- Introduction
- Background
- General Workflow
- Our Pipeline
- Load Packages and Read in Data
- Pre-processing
- Assessing Model Fit
- Identifying Important Features
This tutorial is aimed at individuals with a basic background in the R programming language who want to test how well they can use microbiome sequencing data to either classify samples into different categories or predict a continuous variable's outcome. In this tutorial we will go through how to set up your data for training a RF model, as well as the basic principles that surround model training. By the end you will have learned how to create random forest models in R, assess how well they perform, and identify the features of importance. Note that this tutorial is generally aimed at larger studies (greater than 100 samples). If you would like to see a similar tutorial on using random forest with lower sample sizes please see this tutorial.
To run through this tutorial you will need the following:
- Tutorial [ASV Table](link here)
- R (v3.3.2)
- RStudio - recommended, but not necessary (v1.0.136)
- randomForest R package (v4.6-12)
- caret R package (v6.0-73)
- pROC R package
- doMC R package
- DMwR R package
- RandomForest Utility Scripts in this Repo
If you would like to install and load all of the listed R packages manually you can run the following command within your R session:
deps <- c("randomForest", "pROC", "caret", "DMwR", "doMC")
for (dep in deps) {
  # Install the package if it is not already present, then load it
  if (!dep %in% installed.packages()[, "Package"]) {
    install.packages(as.character(dep), repos = "http://cran.us.r-project.org")
  }
  library(dep, character.only = TRUE)
}
However, the RandomForest Utility Script that is contained within this Repo will automatically install any missing packages and load them into your R session.
Random forest modelling is one of the many different algorithms developed within the machine learning field. For a brief description of other models that exist check out this [link](link here). Because of this, random forest may not always be the optimal machine learning algorithm for your data; however, random forest has been [shown](link here) to perform fairly well on microbial DNA sequencing data, is fairly easy to interpret, and is a great place to start your machine learning adventure.
Random forest models are based on decision trees, a method for classifying categorical variables and regressing continuous variables within a dataset. Decision trees work by helping you choose which fork in the road is best to go down to reach the optimal result. Imagine you're driving down a complex road network and there are two final destinations you can end up in depending on the turns you choose to make. Luckily you have lots of information on similar complex road networks others have driven down, so you can use the turns they made and where they ended up to determine which turns you should make (this can be thought of as the training dataset; more on this later). With this information you can make educated guesses about which split in the road to take at each turn, giving you the best chance of ending up at your desired destination. This is similar to how a decision tree works.
Random forest takes this algorithm further by creating multiple decision trees from different subsets of the training data presented to the algorithm. Each tree that is made then gets to vote on which class an object belongs to. For instance, if we were to head into a complex road network and give our random forest algorithm all of the turns we planned to make, each decision tree in the model could vote on which destination it thinks we would end up at. Generally, for classification the final result is whichever class the majority of trees vote for (more on this later).
Random forest models provide multiple advantages compared to single decision trees. They tend to be less overfit to the dataset they are trained on (although overfitting can still be an issue). They also allow us to evaluate the model by taking subsets of data that trees within the model have never seen before and testing how well those trees perform. This is how the out-of-bag (OOB) error rate is determined, a useful metric for evaluating how well your model predicts the data it was trained on.
Random forest models also allow users to determine which features within their dataset (in this case microbes) are the most important for making accurate predictions. The importance of each feature is determined by scrambling (permuting) each feature one at a time and re-running the model to measure the resulting change in accuracy. This is useful when trying to determine which microbes may be a major contributing signal between two different classes (such as those with and without disease).
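To make the OOB error rate and permutation importance concrete, here is a minimal toy sketch using the randomForest package on R's built-in iris dataset (a stand-alone illustration, not part of this tutorial's pipeline):

library(randomForest)

set.seed(42)
# importance=TRUE turns on the permutation-based importance measure
toy_rf <- randomForest(Species ~ ., data = iris, ntree = 501, importance = TRUE)

# OOB error rate, estimated from the samples left out of each tree's bootstrap
toy_rf$err.rate[nrow(toy_rf$err.rate), "OOB"]

# Mean decrease in accuracy when each feature is scrambled
importance(toy_rf, type = 1)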
The next section will go over the general workflow of our random forest pipeline and will explain what each function within the RF_utilities.R script is doing.
The general workflow falls into three different steps, outlined below.
Step 1: Splitting Data into Training and Test Datasets
This step is critical for evaluating model performance and avoiding reporting the performance of models that are grossly overfit. The idea here is quite simple: you take your dataset and split it up into training data and testing data. This split then allows you to determine how well your model performs on data it has never seen before, which is important for evaluating whether or not your model is overfit. Generally this split is done so that the majority of data is kept for training and a small subset is used for testing. In this tutorial we will use 20% of the data for testing and 80% for training, as shown in the sketch below.
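As a minimal sketch of what this split can look like with the caret package (assuming a feature table input_features with samples as columns and a factor of class labels classes, both created later in this tutorial):

library(caret)

set.seed(42)
# Hold out 20% of samples for testing; createDataPartition keeps
# the class proportions similar in both splits
train_idx <- createDataPartition(classes, p = 0.8, list = FALSE)

train_features <- input_features[, train_idx]
test_features  <- input_features[, -train_idx]
train_classes  <- classes[train_idx]
test_classes   <- classes[-train_idx]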
Step 2: Model Training
Once you have your training dataset the next step is to begin training your model. There are two parameters that can be changed in this step for random forest modelling: the number of trees, and the number of features randomly sampled as candidates at each split in the trees (mtry).
In general, as the number of trees increases the performance of the model will increase until a plateau is reached. This plateau can vary between datasets and sometimes it's worth trying multiple numbers to find the optimal number of trees. You can always set a very large number for this value and not worry about this issue; however, this will severely impact the run time.
Determining the best mtry is a bit more difficult because it varies from dataset to dataset. A good starting point is usually the square root of the number of features.
mtry is known as a hyper-parameter, and tuning hyper-parameters inappropriately can lead to model overfitting, which leads us to cross-validation. Cross-validation is one of many methods that can be used to combat overfitting when tuning hyper-parameters. In this case we will focus on k-fold repeated cross-validation. K-fold cross-validation works by taking the training dataset supplied to the model algorithm and further splitting it up into k evenly sized subsets (folds). A model is then trained on all but one fold and validated on the held-out fold, rotating until each fold has served as the validation set. This helps further protect against picking hyper-parameters that just happen to work really well on the split you made between training and test data. Generally this process is then repeated n times to determine which mtry value works best overall, allowing you to pick a final "best" model.
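Continuing the hypothetical sketch from Step 1, repeated k-fold cross-validation over a grid of mtry values might look like the following with the caret package (the fold count, repeat count, and mtry grid here are illustrative assumptions; the pipeline's actual settings live in RF_utilities.R):

library(caret)

set.seed(42)
# 5-fold cross-validation, repeated 3 times, scored by area under the ROC curve
cv_ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3,
                        classProbs = TRUE, summaryFunction = twoClassSummary)

# Tune mtry over a small grid centred near the square root of the feature count;
# caret expects samples as rows, hence the transpose
rf_fit <- train(x = t(train_features), y = train_classes,
                method = "rf", ntree = 501, metric = "ROC",
                trControl = cv_ctrl,
                tuneGrid = expand.grid(mtry = c(4, 8, 16)))
rf_fit$bestTune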
Step 3: Model Validation
This final step is where the test dataset comes into play. During this step you take the final model from training and use it to make classifications (for categorical variables) or regressions (for continuous variables) on the held-out test set. This allows you to determine how well you can expect your model to work on data it has never come across, which is important when reporting things such as how well your model can predict disease outcome. A minimal sketch of this step is shown below.
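Carrying the same hypothetical objects forward (rf_fit, test_features, and test_classes from the sketches above), validation on the held-out test set could look like:

library(pROC)

# Predict class probabilities for samples the model has never seen
test_probs <- predict(rf_fit, newdata = t(test_features), type = "prob")

# Summarise test-set performance as the area under the ROC curve
test_roc <- roc(response = test_classes, predictor = test_probs[, "EDD"])
auc(test_roc)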
## Our Pipeline
We will begin this section by going over the functions of the random forest classification pipeline. This will help you understand how each function works and how to interpret the data that results from them.
This section will go through how to run the random forest pipeline on a dataset that can be found [here]. This dataset was published in [2015 by Singh et al.,](link to paper) who were interested in looking at differences between healthy individuals and those with enteric infections.
The first step will be to read in the dataset and clean up the data.
Set your working directory
setwd("/Path/to/my/RF_tutorial/")
Read in the count table of genera and the metadata associated to each sample
Genus_values <- read.table("edd_singh_genera.tsv",
sep="\t", header=T, row.names=1)
metadata <- read.table("edd_singh.metadata.clean.tsv",
sep="\t", header=T, row.names=1)
Let's inspect these tables a bit:
dim(Genus_values)
[1] 153 283
We can see that the genus table has 283 different columns and 153 rows. Each column represents a different sample and each row represents a different genus. We will next want to clean up these two tables so that they only include samples that had at least 1000 reads.
indexs_to_keep <- which(metadata$total_reads >= 1000)
length(indexs_to_keep)
[1] 272
clean_Genus_values <- Genus_values[,indexs_to_keep]
clean_metadata <- metadata[indexs_to_keep,]
dim(clean_Genus_values)
[1] 153 272
dim(clean_metadata)
[1] 272 10
# read.table replaced the dashes in the sample IDs with periods when they were
# used as column names, so make the metadata row names match
rownames(clean_metadata) <- gsub("-","\\.",rownames(clean_metadata))
all.equal(colnames(clean_Genus_values), rownames(clean_metadata))
[1] TRUE
Great, the dimensions still match up and so do the row names. Now that this is complete, the next step is to remove genera that are "rare", meaning those found in only a small proportion of the total samples. To do this we will run the following lines of code to load our remove_rare function:
remove_rare <- function(table, cutoff_pro) {
  # Keep only rows (features) that are non-zero in more than
  # cutoff_pro (a proportion) of the samples (columns)
  row2keep <- c()
  cutoff <- ceiling(cutoff_pro * ncol(table))
  for (i in 1:nrow(table)) {
    row_nonzero <- length(which(table[i, ] > 0))
    if (row_nonzero > cutoff) {
      row2keep <- c(row2keep, i)
    }
  }
  return(table[row2keep, , drop = FALSE])
}
This function will remove any rows (genera) that are non-zero in less than some cutoff proportion of samples. For this dataset we set the cutoff proportion to 10% (so we will discard genera that are 0 in 90% or more of the samples).
non_rare_genera <- remove_rare(clean_Genus_values, 0.1)
dim(non_rare_genera)
[1] 73 272
Removal of rare features using these parameters left us with 73 of the original 153 genera. The next step in our data pre-processing is to convert our count data into a different representation. Common pre-processing procedures convert the data into relative abundances, centred log-ratios, or even principal components (given enough features). In this case we have chosen to convert our count table into centred log-ratio (CLR) values with a pseudocount of 1.
# CLR transform: add a pseudocount of 1, then subtract each sample's mean log abundance
input_features <- data.frame(apply(non_rare_genera + 1, 2, function(x) {log(x) - mean(log(x))}))
colSums(input_features)
You will notice that all of the column sums are very small, near-zero numbers, which is a good indication that the above transformation worked correctly.
Finally, the last step is to set up the classes of each sample that we want to predict.
classes <- factor(clean_metadata$DiseaseState, levels=c("EDD","H"))
In this case EDD will represent case samples of enteric infections and H will represent healthy individuals.
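Because the sampling options of the pipeline (described below) deal with class imbalance, it is worth checking the balance between the two classes at this point:

# Count how many samples fall into each class
table(classes)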
We now have all the features ready to input into our model as well as the classes those features represent. To run our model we can use the run_rf_classifcation_pipeline() function found in the RF_utilities.R script in this repository.
We will start by examining the input parameters of the R function:
rf_pipeline <- function(feature_table, classes, metric, sampling, SEED)
There are a total of 5 parameters included in the random forest classification pipeline (a hypothetical call is sketched after this list):
- feature_table: the table of features used for prediction; in our case this will be the abundance of each microbial sequence.
- classes: the classes that each sample in the feature table belongs to. Note that classes expects the first level to be the case class and the second to be the control class (this becomes important when using metrics other than ROC).
- metric: currently there are two metrics that this pipeline expects:
  - PR, which evaluates the model based on precision-recall curves (best suited for datasets with high class imbalance or datasets where you are not interested in true negative results).
  - ROC, which evaluates the model based on ROC curves. This is the most basic way to evaluate your model and takes into account the trade-off between sensitivity and specificity. We did not include accuracy as an evaluation metric in our pipeline as pure accuracy can give [misimpressions of model performance.](link to why accuracy is bad)
- sampling: allows you to determine how model training data is sampled. This can help improve your model in situations where class imbalance is a problem. There are currently four different sampling methods supported:
  - NULL: tells the pipeline not to change its sampling strategy. The model will be trained on data in the same class ratio as originally presented to it.
  - up: tells the pipeline to up-sample the minority class by randomly picking additional samples with replacement. Note that if you are using this strategy, validation through cross-validation and testing on held-out data is REQUIRED, as overfitting can be an issue.
  - down: tells the pipeline to under-sample the majority class so that there is equal class balance in the training data presented to the model. The same validation requirement applies.
  - smote: tells the pipeline to use the [SMOTE] sampling strategy. Briefly, this up-samples the minority class by making synthetic samples that represent features found in the minority class. The same validation requirement applies.
- SEED: allows you to input a seed number that makes the results from the pipeline reproducible.
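Putting this together, a call to the pipeline might look like the following sketch. The argument values are illustrative, and the function name is taken from the signature shown above; check RF_utilities.R for the exact name and defaults.

# Load the utility functions from this repository
source("RF_utilities.R")

# Classify EDD (case) vs H (control) samples, evaluated by ROC,
# with no re-sampling of the training data and a fixed seed for reproducibility
rf_results <- rf_pipeline(feature_table = input_features,
                          classes = classes,
                          metric = "ROC",
                          sampling = NULL,
                          SEED = 42)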
- Please feel free to post a question on the Microbiome Helper Google group if you have any issues.
- General comments or inquiries about Microbiome Helper can be sent to [email protected].