Here, we briefly explain different ways to use R in parallel on the Harvard University FASRC Cannon cluster.
Parallel computing may be necessary to speed up a computation or to deal with large datasets. The workload is divided into chunks, and each worker (i.e., core) takes one chunk. The goal of parallel computing is to reduce total computation time by having each worker process its chunk at the same time as the other workers.
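As a toy sketch of this idea, using base R's `parallel` package (which ships with R): four independent tasks are divided among two workers, each processing its chunk at the same time.

```r
library(parallel)

# Four independent tasks; each simulates some work and returns a result.
slow_square <- function(x) {
  Sys.sleep(0.1)  # stand-in for real computation
  x^2
}

# Serial: one core processes the tasks one after another.
serial <- lapply(1:4, slow_square)

# Parallel: the tasks are divided among 2 workers (forked processes).
# mclapply() relies on fork(), so it falls back to serial on Windows.
par_res <- mclapply(1:4, slow_square, mc.cores = 2)

identical(serial, par_res)  # same results, in roughly half the wall time
```

The results come back in the original task order, so the parallel version is a drop-in replacement for `lapply()` here.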
Cannon has 1,800+ compute nodes with 80,000+ CPU cores. Each compute node is essentially a standalone computer, typically made up of CPU cores, memory, local storage, and sometimes a GPU card.
A sequential (or serial) code uses a single CPU core, and each instruction is processed in sequence (indicated by the triangle).
A multi-core (and single-node) code uses one compute node (i.e., one "computer") and can use any number of the cores that make up that node (indicated by the stars). On Cannon, depending on the partition, the number of cores per node varies from 32 to 64. In addition, multi-core codes can take advantage of the memory shared between the cores.
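A minimal illustration of the shared-memory point, again with base R's `parallel` package: workers forked by `mclapply()` can read objects from the parent session's memory without any explicit copying or communication.

```r
library(parallel)

big_vector <- rnorm(1e6)  # lives in the parent process's memory

# Forked workers read big_vector through the parent's (copy-on-write)
# memory; nothing has to be shipped to them, unlike with a PSOCK cluster.
idx_chunks <- split(seq_along(big_vector),
                    rep(1:4, each = length(big_vector) / 4))
chunk_means <- mclapply(idx_chunks,
                        function(idx) mean(big_vector[idx]),
                        mc.cores = 2)

mean(unlist(chunk_means))  # equals mean(big_vector): 4 equal-size chunks
```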
A multi-node code uses cores across multiple nodes (indicated by the Xs), which means that the nodes need a special form of communication. This communication generally happens through the Message Passing Interface (MPI), a standard for exchanging messages between many processes working in parallel. Below, we show a few R packages that wrap MPI.
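As a sketch of the message-passing style (assuming the `pbdMPI` package is installed and the script is launched with an MPI launcher such as `mpirun` or `srun`; it will not run in a plain interactive session):

```r
# hello_mpi.R -- launch with, e.g.: mpirun -np 4 Rscript hello_mpi.R
library(pbdMPI)

init()  # set up the MPI communicator

# Every rank (process) runs this same script; each works on its own chunk.
my_rank <- comm.rank()
n_ranks <- comm.size()
comm.cat("Hello from rank", my_rank, "of", n_ranks, "\n", all.rank = TRUE)

# Combine partial results across ranks with a reduction (a global sum).
local_sum <- sum(seq(my_rank + 1, 100, by = n_ranks))
total <- allreduce(local_sum, op = "sum")
comm.print(total, rank.print = 0)  # 5050, printed once by rank 0

finalize()
```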
In addition to parallelizing code, we may also need different strategies to deal with datasets too large to fit in memory.
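One such strategy, sketched here in base R (the packages linked below offer more scalable tools), is to stream a file through a connection in fixed-size chunks instead of loading it whole:

```r
# A toy "large" CSV so the example is self-contained.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:10500), path, row.names = FALSE)

# Process the file 1,000 rows at a time instead of reading it all at once.
con <- file(path, open = "r")
invisible(readLines(con, n = 1))   # consume the header line
total <- 0
repeat {
  chunk <- read.csv(con, header = FALSE, nrows = 1000, col.names = "x")
  total <- total + sum(chunk$x)
  if (nrow(chunk) < 1000) break    # last (short) chunk: we are done
}
close(con)

total  # equals sum(1:10500); only 1,000 rows were in memory at a time
```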
Below we provide a summary of R parallel packages that can be used on Cannon. You can find a complete list of available packages on CRAN, and more examples under Resources.
- Working with large data that does not fit into memory
- Processing single-instruction, multiple-data (SIMD) problems on shared- and distributed-memory systems
- Package `parallel`
  - FAS RC embarrassingly parallel documentation
  - FAS RC embarrassingly parallel Cannon example (using `parLapply`)
  - FAS RC embarrassingly parallel VDI example (using `parLapply`)
  - `parallel` documentation
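The `parLapply()` pattern looks roughly like this (a generic sketch, not FAS RC's exact example): create a cluster of workers, export what they need, compute, and shut the cluster down.

```r
library(parallel)

# On Cannon, request cores in your job script and match them here, e.g.
# n_cores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", "2")).
n_cores <- 2
cl <- makeCluster(n_cores)         # PSOCK workers: fresh R sessions

scale_factor <- 10
clusterExport(cl, "scale_factor")  # PSOCK workers do not share memory,
                                   # so needed objects must be exported

result <- parLapply(cl, 1:4, function(x) x * scale_factor)
stopCluster(cl)                    # always release the workers

unlist(result)  # 10 20 30 40
```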
- Package `future`
  - Install future on Cannon
  - Example of `multisession` (not shared memory) and `multicore` (shared memory) and its submit script
  - `future` documentation
- Package `Rmpi`
- Package `pbdMPI` (programming big data MPI)
  - Install pbdMPI on Cannon
  - Examples based on the `pbdMPI` demos; after installing the `pbdMPI` package, all demos can be found in your R library folder `$HOME/apps/R/4.0.5/pbdMPI/demo`
  - `pbdMPI` documentation and GitHub
  - pbdR website
Using nested futures and the package `future.batchtools`, we can run a job that is both multi-node and multi-core.
- Packages `future` and `future.batchtools`
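A hedged sketch of that nested pattern, assuming `future` and `future.batchtools` are installed and a batchtools SLURM template file is available (the template name and resource values below are placeholders, not FAS RC settings):

```r
library(future)
library(future.batchtools)

# Outer level: each future becomes its own SLURM job (one per node);
# inner level: within a job, futures fork across that node's cores.
plan(list(
  tweak(batchtools_slurm,
        template  = "batchtools.slurm.tmpl",        # placeholder template
        resources = list(ncpus = 4, walltime = 600)),
  multicore
))

jobs <- lapply(1:2, function(node_id) {
  future({                        # outer future: one SLURM job
    unlist(lapply(1:4, function(core_id) {
      value(future(core_id^2))    # inner future: one forked core
    }))
  })
})
lapply(jobs, value)               # block until both jobs finish
```

Because this submits real SLURM jobs, it only makes sense on the cluster itself; on a laptop you could swap the outer `batchtools_slurm` for `multisession` to test the same nesting logic.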
- For R basics, refer to R-Basics
- For R package installations, refer to:
  - General package installs: R-Packages
  - Packages sp, rgdal, rgeos, sf, and INLA
  - Packages ENMTools, ecospat, raster, rJava
  - Package rstan
- Parallel R:
  - HPC @ Louisiana State University training materials
  - HPC @ Norwegian University of Science and Technology training materials
  - R Programming for Data Science by Roger D. Peng
  - HPC @ University of Maryland, Baltimore County training materials