Nextflow #2377

**docs/apps/nextflow.md** (5 additions, 4 deletions)
# Nextflow

Nextflow is a scientific workflow management system for creating scalable,
portable, and reproducible workflows. It uses a Groovy-based language for
expressing the entire workflow in a single script and also supports running
scripts from other languages, such as R, Bash and Python, within workflow steps.


[TOC]

## Available

Versions available on CSC's servers

* Puhti: 21.10.6, 22.04.5, 22.10.1, 23.04.3, 24.01.0-edge.5903, 24.10.0
* Mahti: 22.05.0-edge, 24.04.4
* LUMI: 22.10.4

!!! info "Pay attention to usage of Nextflow version"
[...]

computational workflows. Nat. Biotechnol. 35, 316–319 (2017).

## More information

* [Nextflow official documentation](https://www.nextflow.io/docs/latest/index.html)
* [Running Nextflow on Puhti](../support/tutorials/nextflow-puhti.md)
* [High-throughput Nextflow workflow using HyperQueue](../support/tutorials/nextflow-hq.md)
* [Contact CSC Service Desk for technical support](../support/contact.md)
**docs/support/tutorials/nextflow-puhti.md** (168 additions, 120 deletions)
# Running Nextflow pipelines on supercomputers

[Nextflow](https://www.nextflow.io/) is a scientific workflow manager that
provides built-in support for HPC-friendly containers such as Apptainer
(formerly Singularity). One of the advantages of Nextflow is that the actual
pipeline logic is separated from the execution environment. The same script can
therefore be executed in different environments by changing the execution
configuration without touching the pipeline code itself. Nextflow uses the
`executor` setting to decide where each job should be run. Once the executor is
configured, Nextflow submits each process to the specified job scheduler on
your behalf.

The default executor is `local`, i.e. processes are run on the same computer
where Nextflow is launched. Several other
[executors](https://www.nextflow.io/docs/latest/executor.html) are supported;
in the CSC computing environment the SLURM and HyperQueue executors are the
best fit.

There are many other high-throughput tools and workflow managers for scientific
computing, and selecting the right tool can sometimes be challenging. Please
refer to our
[high-throughput computing and workflows page](../../computing/running/throughput.md)
for an overview of the relevant tools.

## Installation

### Nextflow

Nextflow itself is available as a module on Puhti, Mahti and LUMI. The default
version is usually the latest. Choose the Nextflow version based on the
requirements of your own pipeline. For reproducibility, it is recommended to
load the Nextflow module with an explicit version number.

To load the Nextflow module:

```bash
module load nextflow/<version> # e.g., module load nextflow/22.10.1
```

!!! warning
    Nextflow 23.04.3 and newer support only pipelines built with DSL2 syntax. Select an older version for DSL1 pipelines.
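
If needed, the expected DSL version can also be stated explicitly at the top of the pipeline script (or in `nextflow.config`), which makes version mismatches easier to spot. A minimal sketch:

```nextflow
// Declare the DSL version explicitly; DSL1 scripts fail on Nextflow 23.04.3 and newer
nextflow.enable.dsl = 2
```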

### Installation of tools used in Nextflow

#### Local installations

By default, Nextflow expects that the analysis tools are available locally. Tools can be activated from existing [modules](../../apps/by_discipline.md) or [your own custom module installations](../../computing/modules.md#using-your-own-module-files). See also how to [create containers](../../computing/containers/creating.md).
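
For example, an environment module can be loaded for an individual process with Nextflow's `module` directive. A minimal sketch (the module and tool names are placeholders, use whatever your pipeline actually needs):

```nextflow
process index_bam {

    // Load an environment module for this process only (placeholder name)
    module 'samtools'

    input:
    path bam

    output:
    path "${bam}.bai"

    script:
    """
    samtools index ${bam}
    """
}
```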

#### On-the-fly Apptainer installations
Containers can be smoothly integrated with Nextflow pipelines. No additional
modifications to Nextflow scripts are needed except enabling the
Apptainer engine in the Nextflow configuration file. Nextflow can pull remote
container images from container registries and convert them to Apptainer
images on the fly. The remote container images are
usually specified in the Nextflow script or configuration file by simply
prefixing the image name with `shub://` or `docker://`. It is also possible to
specify a different Apptainer image for each process definition in the
Nextflow pipeline script.
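
As a sketch, enabling the Apptainer engine and selecting container images could look like the following in `nextflow.config` (the image names below are placeholders only):

```bash title="nextflow.config"
// Use Apptainer instead of Docker on the supercomputers
apptainer {
    enabled    = true
    autoMounts = true
    // runOptions = '--bind /scratch'  // extra bind paths, if needed
}

process {
    // Default image for all processes (placeholder)
    container = 'docker://docker.io/library/ubuntu:22.04'

    // A different image for one named process (placeholder)
    withName: 'some_process' {
        container = 'docker://docker.io/library/python:3.11'
    }
}
```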

Most Nextflow pipelines pull the needed container images on the fly. However,
when multiple images are needed in a pipeline, it is a good idea to prepare the
containers locally before launching the Nextflow pipeline.

Practical considerations:

* Apptainer is installed on login and compute nodes and does not require loading a separate module on CSC supercomputers.
* To bind folders or adjust other [Apptainer settings](https://www.nextflow.io/docs/latest/reference/config.html#apptainer), use the `nextflow.config` file.
* If you are directly pulling multiple Apptainer images on the fly, please use the NVMe disk of a compute node for storing the images. To do this, first request NVMe disk in your batch job file and then set the Apptainer temporary folders as environment variables:

```bash title="batch_job.sh"
#SBATCH --gres=nvme:100 # Request 100 GB of space to local disk

export APPTAINER_TMPDIR=$LOCAL_SCRATCH
export APPTAINER_CACHEDIR=$LOCAL_SCRATCH
```

!!! warning
    Although Nextflow also supports Docker containers, these cannot be used as such on supercomputers due to the lack of administrative privileges for normal users.

## Usage

Nextflow pipelines can be run in different ways in a supercomputing environment:

1. [In interactive mode](../../computing/running/interactive-usage.md) with the local executor and limited resources. Useful mainly for debugging or testing very small workflows.
2. As a batch job with the local executor. Useful for small and medium-sized workflows.
3. As a batch job with the SLURM executor. This can use multiple nodes and different SLURM partitions (CPU and GPU), but may create significant overhead with many small jobs. Use it only if each job step takes at least 30 minutes per input file.
4. As a batch job with HyperQueue as a sub-job scheduler. This can use multiple nodes in the same batch job allocation and is the most complex setup. It suits cases where the workflow includes a lot of small job steps with many input files (high-throughput computing).

!!! info "Note"
Please do not launch heavy Nextflow workflows on login nodes.

For a general introduction to batch jobs, see the [example job scripts for Puhti](../../computing/running/example-job-scripts-puhti.md).

### Nextflow script

The following minimalist example demonstrates the basic syntax of a Nextflow script.

```nextflow title="workflow.nf"
#!/usr/bin/env nextflow

greets = Channel.fromList(["Moi", "Ciao", "Hello", "Hola", "Bonjour"])

/*
 * Use echo to print 'Hello !' in different languages to a file
 */
process sayHello {

    input:
    val greet

    output:
    path "${greet}.txt"

    script:
    """
    echo ${greet} > ${greet}.txt
    """
}

workflow {

    // Print a greeting
    sayHello(greets)
}
```
This script defines one process named `sayHello`. This process takes a set of greetings from different languages and then writes each one to a separate file in a random order.

The resulting terminal output would look similar to the text shown below:

```bash
N E X T F L O W ~ version 23.04.3
Launching `hello-world.nf` [intergalactic_panini] DSL2 - revision: 880a4a2dfd
executor > local (5)
[a0/bdf83f] process > sayHello (5) [100%] 5 of 5 ✔
```

### Running Nextflow pipeline with local executor interactively
To run Nextflow in an [interactive session](../../computing/running/interactive-usage.md):

```bash
sinteractive -c 2 -m 4G -d 250 -A project_2xxxx  # Replace with your project number
module load nextflow/23.04.3 # Load nextflow module
nextflow run workflow.nf
```

### Running Nextflow with local executor in a batch job

To launch a Nextflow job as a regular batch job that executes all job tasks in the same job
allocation, create the batch job file:

```bash title="nextflow_local_batch_job.sh"
#!/bin/bash
#SBATCH --time=00:15:00 # Change your runtime settings
#SBATCH --partition=test # Change partition as needed
#SBATCH --account=<project>              # Add your project name here
#SBATCH --cpus-per-task=1                # Change as needed
#SBATCH --mem-per-cpu=1G # Increase as needed

# Load Nextflow module
module load nextflow/23.04.3

# Actual Nextflow command here
nextflow run workflow.nf <options>
# nf-core pipeline example:
# nextflow run nf-core/scrnaseq -profile test,singularity -resume --outdir .
```

Finally, submit the job to the supercomputer:

```bash
sbatch nextflow_local_batch_job.sh
```
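
If a run stops part-way, for example because the time limit was reached, Nextflow can continue from its cached results instead of recomputing everything:

```bash
# Re-running the same command with -resume reuses the results of completed tasks
nextflow run workflow.nf <options> -resume
```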

### Running Nextflow with SLURM executor

With the SLURM executor, the batch job file reserves resources only for Nextflow itself. Nextflow then creates further SLURM jobs for the workflow's processes. The SLURM jobs created by Nextflow may be distributed to several nodes of a supercomputer and may also use different partitions for different processes, for example CPU and GPU. The SLURM executor should be used only if the job steps are at least 20-30 minutes long; otherwise it could overload SLURM.

!!! warning
    Please do not use the SLURM executor if your workflow includes a lot of short processes, as it could overload SLURM.

To enable the SLURM executor, set the `process.*` settings in the [nextflow.config file](https://www.nextflow.io/docs/latest/config.html). The settings are similar to those in [batch job files](../../computing/running/example-job-scripts-puhti.md).

```bash title="nextflow.config"
profiles {

    standard {
        process.executor = 'local'
    }

    puhti {
        process.clusterOptions = '--account=project_xxxx --ntasks-per-node=1 --cpus-per-task=4 --ntasks=1 --time=00:00:05'
        process.executor = 'slurm'
        process.queue = 'small'
        process.memory = '10GB'
    }

}
```

Create the batch job file; note the usage of the profile in the `nextflow run` command.

```bash title="nextflow_slurm_batch_job.sh"
#!/bin/bash
#SBATCH --time=00:15:00 # Change your runtime settings
#SBATCH --partition=test # Change partition as needed
#SBATCH --account=<project> # Add your project name here
#SBATCH --cpus-per-task=1 # Change as needed
#SBATCH --mem-per-cpu=1G # Increase as needed

# Load Nextflow module
module load nextflow/23.04.3

# Actual Nextflow command here
nextflow run workflow.nf -profile puhti
```

Finally, submit the job to the supercomputer:

```bash
sbatch nextflow_slurm_batch_job.sh
```

This will submit each process of your workflow as a separate batch job to the supercomputer.
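
The jobs generated by Nextflow can be followed with the usual SLURM commands, for example:

```bash
# List your queued and running jobs, including those submitted by Nextflow
squeue -u $USER

# After the run, check the resource usage of a finished job (replace the job ID)
sacct -j <jobid> --format=JobName,Elapsed,MaxRSS,State
```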


### Running Nextflow with HyperQueue executor

The [HyperQueue meta-scheduler](../../apps/hyperqueue.md) executor is suitable if your workflow includes a lot of short processes and you need several nodes for the computation. However, the executor settings can be complex depending on the pipeline.

Here is a batch script for running a
[nf-core pipeline](https://nf-co.re/pipelines):

```bash title="nextflow_hyperqueue_batch_job.sh"
#!/bin/bash
#SBATCH --job-name=nextflowjob
#SBATCH --account=<project>
# ... (part of the script is collapsed in this diff view) ...

hq worker wait "${SLURM_NTASKS}"
git clone https://github.com/nf-core/rnaseq.git -b 3.10
cd rnaseq

# Ensure Nextflow uses the right executor and knows how many jobs it can submit
# The `queueSize` can be limited as needed.
echo "executor {
queueSize = $(( 40*SLURM_NNODES ))
name = 'hq'
# ... (part of the script is collapsed in this diff view) ...

hq worker stop all
hq server stop
```

Finally, submit the job to the supercomputer:

```bash
sbatch nextflow_hyperqueue_batch_job.sh
```

## More information

* [Official Nextflow documentation](https://www.nextflow.io/docs/latest/index.html)
* [CSC's Nextflow documentation](../../apps/nextflow.md)
* [High-throughput Nextflow workflow using HyperQueue](nextflow-hq.md)
* [Master's thesis by Antoni Gołoś comparing automated workflow approaches on supercomputers](https://urn.fi/URN:NBN:fi:aalto-202406164397)
* [Full Nextflow code example by Antoni Gołoś with three different executors for Puhti](https://github.com/antonigoo/LIPHE-processing/tree/nextflow/workflow)