Random Forest in R with Large Sample Sizes

nearinj edited this page Apr 17, 2019 · 29 revisions

Author: Jacob Nearing

First Created: 17 April 2019

Last Edited: 17 April 2019

Introduction

This tutorial is aimed at readers with a basic background in the R programming language who want to test how well microbiome sequencing data can be used either to classify samples into different categories or to predict a continuous outcome. We will go through how to set up your data for training a random forest (RF) model, as well as the basic principles of model training. By the end you will have learned how to create random forest models in R, assess how well they perform, and identify the most important features. Note that this tutorial is aimed at larger studies (more than 100 samples). If you would like to see a similar tutorial on using random forest with smaller sample sizes, please see this tutorial.

Requirements

To run through this tutorial you will need the following R packages installed: randomForest, pROC, caret, DMwR, and doMC.

To install and load all of the listed R packages, run the following command within your R session:

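# Packages required for this tutorial
deps <- c("randomForest", "pROC", "caret", "DMwR", "doMC")
for (dep in deps) {
  # Install the package only if it is not already installed
  if (!(dep %in% installed.packages()[, "Package"])) {
    # Note: if a package cannot be found, it may have been archived on CRAN
    install.packages(dep, repos = "http://cran.us.r-project.org")
  }
  # Load the package by the name stored in the string `dep`
  library(dep, character.only = TRUE)
}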
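
If you want to confirm that every package was attached before moving on, a quick check along the following lines may help. This is only a minimal sketch: the variable names `attached` and `missing_pkgs` are made up for illustration, and the check relies on the fact that library() places each package on the search path as "package:<name>".

```r
# Package names copied from the install command above
deps <- c("randomForest", "pROC", "caret", "DMwR", "doMC")

# library() attaches a package as "package:<name>" on the search path,
# so membership in search() shows whether the package is attached
attached <- paste0("package:", deps) %in% search()
names(attached) <- deps
print(attached)

# Report any packages that still need attention (none if all are attached)
missing_pkgs <- names(attached)[!attached]
if (length(missing_pkgs) > 0) {
  message("Not attached: ", paste(missing_pkgs, collapse = ", "))
}
```

Any package reported as not attached will need to be installed or loaded manually before continuing.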