R Parallel

Purpose

Here, we briefly explain different ways to use R in parallel on the Harvard University FASRC Cannon cluster.

Parallel computing may be necessary to speed up code or to handle large datasets. It divides the workload into chunks, and each worker (i.e., CPU core) takes one chunk. The goal of parallel computing is to reduce the total computational time by having each worker process its chunk at the same time as the other workers.
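As a minimal sketch of this chunking idea (the chunk sizes and function below are our own choices, not a Cannon-specific recipe), base R's built-in parallel package can process chunks on separate cores:

```r
# Minimal sketch using base R's "parallel" package (shipped with R).
# Each list element is one chunk; mclapply forks up to mc.cores workers,
# one chunk per worker (forking is not available on Windows).
library(parallel)

chunks <- split(1:100, rep(1:4, each = 25))          # 4 chunks of 25 numbers
partial_sums <- mclapply(chunks, sum, mc.cores = 2)  # process chunks in parallel
total <- Reduce(`+`, partial_sums)                   # combine the partial results
print(total)  # sum of 1..100, i.e. 5050
```

The combine step (`Reduce`) is where the workers' independent results are merged back into a single answer.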

Cannon cluster basics

Cannon has 1,800+ compute nodes with 80,000+ CPU cores. Each compute node is the equivalent of a standalone computer and typically consists of CPU cores, memory, local storage, and sometimes a GPU card.

Sequential vs. multi-core vs. multi-node

A sequential (or serial) code uses a single CPU core, and each instruction is processed in sequence (indicated by the triangle).

A multi-core (and single-node) code uses one compute node (i.e., one "computer") and can use any number of the cores that make up that node (indicated by the stars). On Cannon, depending on the partition, the number of cores per node varies from 32 to 64. In addition, multi-core codes can take advantage of the memory shared between the cores.
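A sketch of the shared-memory advantage, using only base R (the vector size and core count are illustrative): forked workers read the parent's data in place instead of receiving a copy.

```r
# Single-node, shared-memory sketch: mclapply forks workers that share the
# parent's memory, so the large vector x is NOT copied to each worker.
library(parallel)

x <- rnorm(1e6)                          # large object in the parent process
n_cores <- 2                             # on Cannon, set to the cores you requested
idx <- splitIndices(length(x), n_cores)  # one index block per core
means <- mclapply(idx, function(i) mean(x[i]), mc.cores = n_cores)
overall <- mean(unlist(means))           # blocks are equal-sized, so this equals mean(x)
```

On Cannon, `n_cores` should match the cores requested from the scheduler rather than `detectCores()`, which reports all cores on the node regardless of your allocation.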

A multi-node code uses cores across multiple nodes (indicated by the Xs), which means that communication between nodes is required. This communication generally happens through the Message Passing Interface (MPI), a standard for exchanging messages between many computers working in parallel. Below we show a few R packages that wrap MPI.
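As a runnable stand-in for the multi-node pattern (the hostnames below are placeholders; on Cannon you would pass the node names allocated by the scheduler, or use an MPI-backed cluster via a package such as Rmpi), a PSOCK cluster launches R workers over the network and ships data to them, since memory is not shared across nodes:

```r
# Distributed-memory sketch: makeCluster() with hostnames starts one R worker
# per entry. "localhost" keeps this runnable on a single machine; on a real
# cluster these would be compute-node hostnames (or an MPI cluster type).
library(parallel)

hosts <- rep("localhost", 2)                # placeholder node hostnames
cl <- makeCluster(hosts)                    # defaults to a PSOCK cluster
res <- parSapply(cl, 1:8, function(i) i^2)  # inputs are shipped to the workers
stopCluster(cl)
print(res)  # 1 4 9 16 25 36 49 64
```

Unlike the forked, shared-memory case, every object a worker needs must be explicitly exported or passed to it, which is the defining cost of distributed memory.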

In addition to parallel codes, we may also need different strategies to deal with large datasets.

Below we provide a summary of R parallel packages that can be used on Cannon. You can find a complete list of available packages on CRAN, and more examples under Resources.

Processing large datasets
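One strategy that needs no extra packages is to stream a file in fixed-size chunks instead of loading it all at once (the file and chunk size below are illustrative stand-ins for a real large dataset):

```r
# Sketch: process a file too large for memory in fixed-size chunks,
# using only base R connections.
path <- tempfile()
writeLines(as.character(1:1000), path)  # stand-in for a large data file

con <- file(path, open = "r")
total <- 0
repeat {
  chunk <- readLines(con, n = 250)      # read 250 lines at a time
  if (length(chunk) == 0) break         # end of file reached
  total <- total + sum(as.numeric(chunk))
}
close(con)
print(total)  # sum of 1..1000, i.e. 500500
```

Because each chunk is independent, this pattern combines naturally with the parallel approaches above: chunks can be handed to workers instead of processed in a loop.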

Single-node, multi-core (shared memory)

Multi-node, distributed memory

Hybrid: multi-node + shared memory

Using nested futures and the package future.batchtools, we can run a job that is both multi-node and multi-core.
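A sketch of the nested-future pattern, assuming the future package is installed. The outer level of the plan dispatches node-level work and the inner level handles core-level work; here both levels run locally so the sketch works anywhere, whereas on Cannon the outer level would be `future.batchtools::batchtools_slurm` with your site's SLURM template.

```r
# Nested futures: plan() takes a list, one strategy per nesting level.
# Outer level = "per node" (on Cannon: batchtools_slurm); inner level =
# "per core" (on Cannon: multicore). Locally we use multisession/sequential.
library(future)

plan(list(
  multisession,  # outer level: stand-in for batchtools_slurm on Cannon
  sequential     # inner level: stand-in for multicore within a node
))

outer <- lapply(1:2, function(node) {
  future({
    # This body runs at the outer level; futures created here use the
    # inner (second) strategy of the plan.
    inner <- lapply(1:4, function(i) future(i^2))
    sum(vapply(inner, value, numeric(1)))  # 1 + 4 + 9 + 16
  })
})
res <- vapply(outer, value, numeric(1))
print(res)  # 30 30
```

Swapping strategies in `plan()` is the point of the design: the same code scales from a laptop to a multi-node SLURM job without rewriting the computation.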

Resources