Random Forest in R with Large Sample Sizes
Author: Jacob Nearing
First Created: 17 April 2019
Last Edited: 17 April 2019
- Introduction
  - Background
- The Random Forest Pipeline
  - Load Packages and Read in Data
  - Pre-processing
  - Assessing Model Fit
  - Identifying Important Features
This tutorial is aimed at individuals with a basic background in the R programming language who want to test how well they can use microbiome sequencing data to either classify samples into different categories or predict a continuous variable's outcome. In this tutorial we will go through how to set up your data for training a RF model, as well as the basic principles that surround model training. By the end you will have learned how to create random forest models in R, assess how well they perform and identify the features of importance. Note that this tutorial is generally aimed at larger studies (greater than 100 samples). If you would like to see a similar tutorial on using random forest with lower sample sizes please see this tutorial.
To run through this tutorial you will need the following:
- Tutorial [ASV Table](link here)
- R (v3.3.2)
- RStudio - recommended, but not necessary (v1.0.136)
- randomForest R package (v4.6-12)
- caret R package (v6.0-73)
- pROC R package
- doMC R package
- DMwR R package
- RandomForest Utility Scripts in this Repo
If you would like to install and load all of the listed R packages manually, you can run the following commands within your R session:

    # Required packages for this tutorial
    deps = c("randomForest", "pROC", "caret", "DMwR", "doMC")

    # Install any package that is not already present, then load it
    for (dep in deps){
      if (dep %in% installed.packages()[,"Package"] == FALSE){
        install.packages(as.character(dep), repos = "http://cran.us.r-project.org")
      }
      library(dep, character.only = TRUE)
    }
However, the RandomForest Utility Script that is contained within this Repo will automatically install any missing packages and load them into your R session.
Random forest modelling is one of many different algorithms developed within the machine learning field. For a brief description of other models that exist, check out this [link](link here). Because so many alternatives exist, random forest may not always be the optimal machine learning algorithm for your data; however, random forest has been [shown](link here) to perform fairly well on microbial DNA sequencing data, is fairly easy to interpret, and is a great place to start your machine learning adventure.
Random forest models are based on decision trees, a method for classifying categorical variables and regressing continuous variables within a data set. Decision trees work by helping you choose which fork in the road is best to go down to reach the optimal result. Imagine you're driving through a complex road network and there are two final destinations you can end up at, depending on the turns you choose to make. Luckily, you have lots of information on similar complex road networks that others have driven through, so you can use the turns they made and where they ended up to determine which turns you should make (this can be thought of as the training data set; more on this later). With this information you can make educated guesses about which split in the road to take at each turn, giving you the best chance of ending up at your desired destination. This is similar to how a decision tree works.
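To make the road analogy concrete, below is a minimal sketch of a single decision tree. It uses the rpart package (which ships with R but is not one of this tutorial's required packages), and the `drives` data frame is invented purely for illustration.

```r
library(rpart)

set.seed(42)
# Invent a "training set" of 200 past drives: the turns taken and the final destination
drives <- data.frame(
  turn1 = factor(sample(c("left", "right"), 200, replace = TRUE)),
  turn2 = factor(sample(c("left", "right"), 200, replace = TRUE))
)
drives$destination <- factor(ifelse(drives$turn1 == "left" & drives$turn2 == "left",
                                    "townA", "townB"))

# Fit a classification tree that learns which turns lead to which destination
tree_fit <- rpart(destination ~ turn1 + turn2, data = drives, method = "class")

# Predict the destination of one drive based on the turns that were taken
predict(tree_fit, newdata = drives[1, ], type = "class")
```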
Random forest takes this algorithm further by creating multiple decision trees from different subsets of the training data that you present to the algorithm. Each tree that is made then gets to vote on which class an object belongs to. For instance, if we were to head into a complex road network and give our random forest algorithm all of the turns we planned to make, each decision tree in the model could vote on which destination it thinks we would end up at. For classification, the final result is generally whichever class the majority of trees vote for (more on this later).
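As a rough illustration of this voting, the sketch below trains a forest with the randomForest package listed above and asks the trees to vote. The feature table and class labels are simulated purely for the example; in the tutorial itself you will work with a real ASV table instead.

```r
library(randomForest)

set.seed(42)
# Simulate a small feature table: 100 samples by 20 features
features <- data.frame(matrix(rnorm(100 * 20), nrow = 100))
# A two-class outcome that partly depends on the first feature
classes <- factor(ifelse(features$X1 + rnorm(100) > 0, "destinationA", "destinationB"))

# Train a forest of 501 decision trees
rf_model <- randomForest(x = features, y = classes, ntree = 501)

# type = "vote" returns the fraction of trees voting for each class;
# type = "response" returns the majority-vote (predicted) class
head(predict(rf_model, features, type = "vote"))
head(predict(rf_model, features, type = "response"))
```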
Random forest models provide multiple advantages compared to single decision trees. They tend to be less overfit to the data set that they are trained on (although overfitting can still be an issue). They also allow us to evaluate the model by taking subsets of the data that individual trees have never seen before and testing how well those trees perform on them. This is how the out-of-bag error rate is determined.
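A minimal sketch of where this out-of-bag estimate lives in a fitted model, assuming the `rf_model` object from the simulated example above:

```r
# Printing the model reports the out-of-bag (OOB) estimate of the error rate
# along with a confusion matrix built only from out-of-bag predictions
print(rf_model)

# err.rate stores the cumulative error rates as trees are added;
# the "OOB" column is the overall out-of-bag error estimate
tail(rf_model$err.rate)

# Out-of-bag vote fractions: each sample is only voted on by the trees
# that never saw it during training
head(rf_model$votes)
```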
- Please feel free to post a question on the Microbiome Helper google group if you have any issues.
- General comments or inquiries about Microbiome Helper can be sent to [email protected].