The CPS extractor pipeline is a Nextflow pipeline for processing *Streptococcus pneumoniae* reads to extract the capsular locus sequence (CPS) and check it for disruptive mutations.
The pipeline is designed to be easy to set up and use, and is suitable for use on local machines and high-performance computing (HPC) clusters alike. Once the necessary Docker/Singularity images have been downloaded, the pipeline can be used offline, unless you change the selection of any database or container image.
The development of this pipeline is part of the GPS Project (Global Pneumococcal Sequencing Project).
The current pipeline workflow is as follows:
1. The pipeline takes *S. pneumoniae* reads and uses SeroBA to determine their serotype.
2. The reads are assembled with Unicycler in bold mode (to avoid contig breaks where possible).
3. A blast search compares the assembly against a database of reference CPS sequences, and the CPS sequence with the best blast hit for the given serotype is extracted.
4. If any gaps are detected, they are filled in using a consensus sequence method.
5. The CPS sequence is annotated with Bakta and checked for potentially disruptive mutations with ARIBA.
6. A gene comparison plot of each sample versus the reference is generated with clinker.
7. Finally, Panaroo is used to assess gene content differences for individual genes in the CPS sequence.

Optionally, if you already know the serotype of your samples, SeroBA serotyping is skipped; instead, a pangenome analysis of all your samples is performed with Panaroo, and the amino acid sequences for each gene are concatenated for easy alignment and tree building. In this mode, clinker additionally generates a gene comparison plot for samples containing disruptive mutations, as well as a visualisation of the genetic variants within your sample set.
Each sample will have its own results folder. For example, for sample ERR311103:
```
ERR311103
├── ERR311103
│   ├── ERR311103_aliA_blast_results.xml
│   ├── ERR311103_blast_results.xml
│   ├── ERR311103_cps.fa
│   ├── ERR311103_cps.gff3
│   ├── ERR311103_dexb_blast_results.xml
│   ├── ERR311103_gene_order.tsv
│   ├── ERR311103_plot.html
│   ├── ariba_report.tsv
│   ├── cps_extractor.log
│   ├── gene_integrity.csv
│   ├── key_ariba_mutations.tsv
│   └── proteins
│       ├── ERR311103-aliA_protein.fa
│       └── ERR311103-dexB_protein.fa
```
Each results folder will contain the following:
- ARIBA report file (`ariba_report.tsv`)
- aliA blast results file (`sample_aliA_blast_results.xml`)
- dexB blast results file (`sample_dexb_blast_results.xml`)
- CPS reference blast results XML file (`sample_blast_results.xml`)
- The CPS sequence (`sample_cps.fa`)
- The CPS annotation (`sample_cps.gff3`)
- A log file (`cps_extractor.log`) containing the logs from the CPS extraction
- A gene integrity file (`gene_integrity.csv`) which shows which genes (if any) contain disruptive mutations
- A key mutations file (`key_ariba_mutations.tsv`) containing potentially important mutations in the CPS genes (e.g. frameshift mutations)
- A gene comparison plot (`sample_plot.html`) generated by clinker
- A gene order file (`sample_gene_order.tsv`) which shows the genes in the sample and the genes in the reference sequence
- A `proteins` folder which contains the amino acid sequence for each gene in the sample
There will also be two files in the output folder summarising the key mutations for all samples:
- Key mutations detected by ARIBA for all samples (`key_ariba_mutations_all_samples.tsv`)
- Disrupted genes detected by ARIBA for all samples (`disrupted_genes_all_samples.csv`)
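To skim these summaries at the command line, something like the following works (a minimal sketch, assuming the default `output` directory and the standard `column` utility):

```bash
# Pretty-print the comma-separated disrupted-gene summary as aligned columns
column -s, -t output/disrupted_genes_all_samples.csv

# Likewise for the tab/whitespace-separated key mutations summary
column -t output/key_ariba_mutations_all_samples.tsv
```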
If you run the pipeline with the `--serotype` argument, the pangenome analysis results will be in the `panaroo_pangenome_results` folder, and there will be a `proteins` folder containing amino acid sequences per gene for all samples. If any CPS genes contain disruptive mutations, a `potential_new_serotypes_plot.html` file generated by clinker will be in the output folder. In addition, there will be files showing the genetic variants within your sample set: `genetic_variants_plot.html`, `genetic_clusters.csv` and `genetic_groups.csv`.
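For example, a run with a known serotype (the serotype value below is illustrative) is launched with the `--serotype` option described under Usage, which skips SeroBA and enables the pangenome analysis described above:

```bash
./run_cps_extractor --input /path/to/raw-reads-directory --serotype 23F
```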
- A POSIX-compatible system (e.g. Linux, macOS, Windows with WSL) with Bash 3.2 or later
- Java 11 to 21 (OpenJDK or Oracle Java)
- Docker or Singularity/Apptainer
- For Linux, Singularity/Apptainer or Docker Engine is recommended over Docker Desktop for Linux. The latter is known to cause permission issues when running the pipeline on Linux.
- Nextflow >= 23.04
- Only Illumina paired-end short reads are supported
- Each sample is expected to be a pair of raw reads following this file name pattern: `*_{,R}{1,2}{,_001}.{fq,fastq}{,.gz}`
  - example 1: `SampleName_R1_001.fastq.gz`, `SampleName_R2_001.fastq.gz`
  - example 2: `SampleName_1.fastq.gz`, `SampleName_2.fastq.gz`
  - example 3: `SampleName_R1.fq`, `SampleName_R2.fq`
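Before launching, you may want to sanity-check your read pairs. A minimal sketch (this script is not part of the pipeline), assuming the default `input` directory and covering the most common naming variants:

```bash
#!/usr/bin/env bash
# Warn about forward reads in the "input" directory that lack a reverse mate
shopt -s nullglob
for r1 in input/*_1.fastq.gz input/*_1.fq.gz input/*_1.fastq input/*_1.fq \
          input/*_R1.fastq.gz input/*_R1.fq.gz input/*_R1.fastq input/*_R1.fq \
          input/*_R1_001.fastq.gz input/*_R1_001.fq.gz; do
  r2=${r1/_R1/_R2}                          # handles the _R1 and _R1_001 variants
  [[ $r2 == "$r1" ]] && r2=${r1/_1./_2.}    # handles the bare _1 variant
  [[ -e $r2 ]] || echo "WARNING: no mate found for $r1 (expected $r2)"
done
```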
- Clone the repository (if Git is installed on your system)
  - `git clone https://github.com/GlobalPneumoSeq/cps_extractor.git`
  - or download and unzip/extract the latest commit
- Go into the local directory of the pipeline; it is ready to use without installation (the directory name might be different)
  - `cd cps_extractor`
- Run the database setup to download all required additional files and container images, so that afterwards the pipeline can be used at any time, with or without an Internet connection.
  - ⚠️ Docker or Singularity must be running, and an Internet connection is required.
  - Using Docker as the container engine: `./run_cps_extractor --setup`
  - Using Singularity as the container engine: `./run_cps_extractor --setup -profile singularity`
⚠️ Docker or Singularity must be running.
ℹ️ By default, Docker is used as the container engine and all the processes are executed by the local machine. See Profile for details on running the pipeline with Singularity or on an HPC cluster.
- You can run the pipeline without options. It will attempt to get the raw reads from the default location (i.e. the `input` directory inside the `cps_extractor` local directory)
  - `./run_cps_extractor`
- You can also specify the location of the raw reads by adding the `--input` option
  - `./run_cps_extractor --input /path/to/raw-reads-directory`
```
Usage:
./run_cps_extractor [option] [value]

--input [PATH]      Path to the input directory that contains the reads to be processed. Default: ./input
--output [PATH]     Path to the output directory that saves the results. Default: output
--serotype [STR]    Serotype (if known). Default: None
--setup             Alternative workflow for setting up the required databases.
--version           Alternative workflow for getting versions of pipeline, container images, tools and databases.
--help              Print this help message.
```
- By default, Docker is used as the container engine and all the processes are executed by the local machine. To change this, you can use Nextflow's built-in `-profile` option to switch to other available profiles
  - ℹ️ `-profile` is a built-in Nextflow option; it only has one leading `-`
  - `nextflow run . -profile [profile name]`
- Available profiles:
| Profile Name | Details |
|---|---|
| `standard` (Default) | Docker is used as the container engine. Processes are executed locally. |
| `singularity` | Singularity is used as the container engine. Processes are executed locally. |
| `lsf` | The pipeline should be launched from an LSF cluster head node with this profile. Singularity is used as the container engine. Processes are submitted to your LSF cluster via `bsub` by the pipeline. (Tested on Wellcome Sanger Institute farm5 LSF cluster only) |
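For example, to launch the pipeline from an LSF cluster head node (the input path below is illustrative):

```bash
# The lsf profile uses Singularity and submits processes via bsub
./run_cps_extractor --input /path/to/raw-reads-directory -profile lsf
```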
- If the pipeline is interrupted mid-run, Nextflow's built-in `-resume` option can be used to resume the pipeline execution instead of starting from scratch again
- You should use the same command as the original run, only adding `-resume` at the end (i.e. all pipeline options should be identical)
  - ℹ️ `-resume` is a built-in Nextflow option; it only has one leading `-`
  - If the original command is `./run_cps_extractor --input /path/to/raw-reads-directory`
  - The command to resume the pipeline execution should be `./run_cps_extractor --input /path/to/raw-reads-directory -resume`
- During a run, Nextflow generates a considerable amount of intermediate files
- If the run has completed and you do not intend to use the `-resume` option or those intermediate files, you can remove them in one of the following ways:
  - Run the included `clean_pipeline` script
    - It runs the commands in manual removal for you
    - It removes the `work` directory and log files within the `cps_extractor` local directory
    - `./clean_pipeline`
  - Manual removal
    - Remove the `work` directory and log files within the `cps_extractor` local directory
    - `rm -rf work`
    - `rm -rf .nextflow.log*`
  - Run the `nextflow clean` command
    - This built-in command cleans up cache and work directories
    - By default, it only cleans up the latest run
    - For details and available options of `nextflow clean`, refer to the Nextflow documentation
    - `./nextflow clean`
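A couple of common `nextflow clean` variants, for illustration (check the Nextflow documentation for the full option list):

```bash
# Actually delete (rather than just list) the work files of the latest run
./nextflow clean -f

# Keep only the latest run and delete the work files of all earlier runs
./nextflow clean -but last -f
```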
- The tables below contain the available options that can be used when you run the pipeline
- Usage: `./run_cps_extractor [option] [value]`

ℹ️ To permanently change the value of an option, edit the `nextflow.config` file inside the `cps_extractor` local directory.
ℹ️ `$projectDir` is a Nextflow built-in implicit variable; it is defined as the local directory of `cps_extractor`.
ℹ️ Pipeline options are not built-in Nextflow options; they are led with `--` instead of `-`.
| Option | Values | Description |
|---|---|---|
| `--setup` | `true` or `false` (Default: `false`) | Use alternative workflow for initialisation, which downloads all required additional files and container images, and creates databases. Can be enabled by including `--setup` without a value. |
| `--version` | `true` or `false` (Default: `false`) | Use alternative workflow for showing versions of the pipeline, container images, tools and databases. Can be enabled by including `--version` without a value. (This workflow pulls the required container images if they are not yet available locally) |
| `--help` | `true` or `false` (Default: `false`) | Show help message. Can be enabled by including `--help` without a value. |
| Option | Values | Description |
|---|---|---|
| `--input` | Any valid path containing paired-end `fastq.gz` files (Default: `$projectDir/input`) | Input folder containing *S. pneumoniae* reads |
| `--output` | Any valid path (Default: `$projectDir/output`) | Output folder which stores the pipeline results |
| `--blastdb` | Any valid blast database path `.n*` (Default: `$projectDir/cps_reference_database/cps_blastdb`) | Path to blast database containing CPS references |
| `--prodigal_training_file` | Any valid path containing a prodigal training file (Default: `$projectDir/cps_reference_database/all.trn`) | Training file for improved annotation |
| `--bakta_db` | Any valid path containing a bakta database (Default: `$projectDir/cps_reference_database/bakta_db`) | Path to bakta database used for annotation |
| `--bakta_threads` | Any valid integer value (Default: `32`) | Threads used for bakta annotation |
| `--reference_database` | Any valid reference database path (Default: `$projectDir/cps_reference_database`) | Full reference database used by the pipeline |
| `--serotype` | Any valid serotype string (Default: None) | Manually set the serotype of your input sequences instead of having it determined by SeroBA |
| `--minimum_cps_length` | Any valid integer value (Default: `8000`) | Minimum length of CPS sequence to pass quality control |
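As an illustration, a run that overrides several of these defaults (all paths and values below are made up) could look like:

```bash
# Custom input/output paths, a known serotype, and fewer Bakta threads
# than the default of 32
./run_cps_extractor \
    --input /data/pneumo_reads \
    --output /data/cps_results \
    --serotype 19F \
    --bakta_threads 8
```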
The default database is stored at: https://github.com/GlobalPneumoSeq/cps_reference_database
See Citations.MD for the full list of citations.
Thanks to Harry Hung for his excellent Nextflow code architecture, which this pipeline also uses.