Random Forest in R with Large Sample Sizes

Author: Jacob Nearing

First Created: 17 April 2019

Last Edited: 17 April 2019

Introduction
- Requirements
Background
Load Packages and Read in Data
Pre-processing
- Removing Rare Features
- Transforming Your Data
Assessing Model Fit
Identifying Important Features

Introduction

This tutorial is aimed at individuals with a basic background in the R programming language who want to test how well microbiome sequencing data can be used either to classify samples into different categories or to predict the outcome of a continuous variable. We will go through how to set up your data for training a random forest (RF) model, as well as the basic principles that surround model training. By the end you will have learned how to create random forest models in R, assess how well they perform, and identify the most important features. Note that this tutorial is aimed at larger studies (more than 100 samples). If you would like to see a similar tutorial on using random forest with smaller sample sizes, please see this tutorial.
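As a rough preview of the three steps named above (train, assess, identify important features), here is a minimal sketch. It assumes the randomForest package listed under Requirements is already installed, and it uses R's built-in `iris` data purely as a stand-in for a samples-by-features table; the variable names, the 80/20 split, and `ntree = 501` are illustrative choices, not values taken from this tutorial.

```r
library(randomForest)

set.seed(42)  # make the example reproducible

# Stand-in data: iris is NOT microbiome data. It only takes the place of a
# samples-by-features table plus a vector of class labels.
features <- iris[, 1:4]
labels <- iris$Species

# Hold out 20% of the samples for testing (the proportion is an arbitrary choice).
train_idx <- sample(nrow(features), size = round(0.8 * nrow(features)))

# Train a classification forest; importance = TRUE stores permutation importance.
rf_model <- randomForest(x = features[train_idx, ], y = labels[train_idx],
                         ntree = 501, importance = TRUE)

# Assess performance on the held-out samples.
predicted <- predict(rf_model, newdata = features[-train_idx, ])
accuracy <- mean(predicted == labels[-train_idx])
print(accuracy)

# Rank features by mean decrease in accuracy (type = 1).
imp <- importance(rf_model, type = 1)
print(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE])
```

Because `labels` is a factor, `randomForest()` fits a classification model; passing a numeric response instead would fit a regression model, which is the route for predicting a continuous variable.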

Requirements

To run through this tutorial you will need the tutorial data and the following software and packages installed:

Tutorial [ASV Table](link here)
R (v3.3.2)
RStudio - recommended, but not necessary (v1.0.136)
randomForest R package (v4.6-12)
caret R package (v6.0-73)
pROC R package
doMC R package
DMwR R package

If you would like to install and load all of the listed R packages, run the following commands within your R session:

deps <- c("randomForest", "pROC", "caret", "DMwR", "doMC")
for (dep in deps) {
  # Install the package only if it is not already present.
  if (!(dep %in% rownames(installed.packages()))) {
    install.packages(dep, repos = "http://cran.us.r-project.org")
  }
  # Load the package; character.only = TRUE lets library() accept a string.
  library(dep, character.only = TRUE)
}
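After the loop finishes, it can be worth checking which versions were actually installed, since they may differ from the versions listed above. The snippet below is a minimal sketch using only base R; the `report_versions` helper is a hypothetical name introduced here for illustration and is not part of the original tutorial.

```r
# Hypothetical helper (not from the original tutorial): report whether each
# required package is installed and, if so, which version.
report_versions <- function(pkgs) {
  # requireNamespace() returns TRUE/FALSE instead of raising an error.
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  versions <- vapply(pkgs, function(p) {
    if (installed[[p]]) as.character(packageVersion(p)) else NA_character_
  }, character(1))
  data.frame(package = pkgs, installed = installed, version = versions,
             row.names = NULL, stringsAsFactors = FALSE)
}

deps <- c("randomForest", "pROC", "caret", "DMwR", "doMC")
status <- report_versions(deps)
print(status)
```

Any row with `installed` equal to `FALSE` points to a package that failed to install and needs attention before continuing.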

Contact

Please feel free to post a question on the Microbiome Helper Google group if you have any issues.
General comments or inquiries about Microbiome Helper can be sent to [email protected].

