Skip to content

Latest commit

 

History

History
190 lines (165 loc) · 9.96 KB

README.md

File metadata and controls

190 lines (165 loc) · 9.96 KB

nanoporeWorkflow

Shell scripts and workflows for working with Nanopore data. Submits jobs to CDC's Aspen HPC using qsub.

⚠️ Don't bother reading if you aren't working on CDC's servers ⚠️

There are 2 main workflows:

  • run_basecall-w-gpu.sh - Guppy GPU basecalling, demultiplexing, and trimming. Followed by NanoPlot for generating seq run stats and graphs.
  • workflow-after-gpu-basecalling.sh - Assembly with Flye and polishing with Racon and Medaka

TOC

Install

Download the repository from the latest release (v0.5.0 is latest as of March 2021) and uncompress.

$ wget https://github.com/kapsakcj/nanoporeWorkflow/archive/v0.5.0.tar.gz 
$ tar -xzf v0.5.0.tar.gz

Optional - add the workflows to your $PATH (edit the PATH below to wherever you downloaded the repo). Refresh your environment by source'ing your .bashrc file.

# Be careful with this command - make sure the PATH is properly edited!
$ echo 'export PATH=$PATH:/path/to/nanoporeWorkflow-0.5.0/workflows' >> ~/.bashrc
$ source ~/.bashrc

Workflows

Guppy GPU basecalling, demultiplexing, trimming, and NanoPlot

run_basecall-w-gpu.sh - Guppy GPU basecalling, demultiplexing, and adapter/barcode trimming. Followed by NanoPlot for generating seq run stats and graphs.

Requirements

  • Must be logged into a server with the ability to run qsub for submitting jobs to Aspen:
    • Aspen head node
    • Aspen interactive node (run qlogin from aspen head node)
    • Monoliths 1-3 (cannot submit jobs from M4)
  • Currently supported MinION data:
    • R9.4.1 flowcell (FLO-MIN106)
      • Rapid barcoding kit (RBK-004)
      • Ligation sequencing kit (SQK-LSK109) + native barcodes 1-24 (NBD103/104/114)
    • R10 flowcell
      • Ligation sequencing kit (SQK-LSK109) + native barcodes 1-24 (NBD103/104/114)
  • Unsupported combos (want to add in the future!):
    • R9.4.1 + rapid or ligation sequencing kit without barcoding (RAD-004)
    • R10 + ligation without barcoding
    • R10.3 + ligation with & without barcoding

This workflow does the following:

  • Takes in 5 arguments (a double pipe || is the same as OR):
    1. -i path/to/fast5files/ - the input directory containing raw fast5 files
    2. -o path/to/outputDirectory/ - an output directory
    3. -b y || yes || n || no - were barcodes used?
    4. -f r941 || r10 - flowcell type used?
    5. -k rapid || ligation - sequencing kit used?
  • copies fast5s from input directory to /scicomp/scratch/$USER/guppy.gpu.XXXXXX
  • runs guppy_basecaller in high-accuracy mode on a GPU
  • Demultiplexes using guppy_basecaller and additionally trims adapter and barcode sequences (using --trim_barcodes ; --barcode_kits "EXP-NBD104 EXP-NBD114" or "SQK-RBK004" options)
  • Compresses the demultiplexed reads (--compress_fastq option)
  • Copies demultiplexed, trimmed, compressed reads into subdirectories in $OUTDIR/demux/barcodeXX
  • Runs NanoPlot to generate sequencing run stats on the entire run, as well as individual barcodes.

USAGE:

Pull up help/usage statement by running run_basecall-w-gpu.sh or run_basecall-w-gpu.sh -h

Usage: /path/to/nanoporeWorkflow-0.5.0/workflows/run_basecall-w-gpu.sh
                 -i path/to/fast5files/        searches recursively for fast5 files
                 -o path/to/outputDirectory/   output directory
                 -b y || yes || n || no        barcodes used?
                 -f r941 || r10                flowcell type used?
                 -k rapid || ligation          sequencing kit used?

example: /path/to/nanoporeWorkflow-0.5.0/workflows/run_basecall-w-gpu.sh -i fast5s/ -o output/ -b y -f r941 -k rapid

# EXAMPLE OUTPUT (reduced for brevity)
$OUTDIR
├── demux
│   ├── barcode01
│   │   └── fastq_runid_fbc8eee46271cbe60ee8a49d0ca657f6e92e174e_0_0.fastq.gz (there will be many .fastq.gz files per barcode)
│   ├── barcode02
│   │   └── fastq_runid_fbc8eee46271cbe60ee8a49d0ca657f6e92e174e_0_0.fastq.gz
│   ├── barcode03
│   │   └── fastq_runid_fbc8eee46271cbe60ee8a49d0ca657f6e92e174e_0_0.fastq.gz
│   ├── guppy-logs
│       └── guppy_basecaller_log-2020-04-17_09-45-00.log (there will be many guppy-logs)
│   ├── nanoplot
│       └── NanoPlot-report.html # additionally all images and other files produced by NanoPlot
│   ├── nanoplot-barcoded
│       └── NanoPlot-report.html # additionally all images and other files produced by NanoPlot, but for each barcode
│   ├── sequencing_summary.txt
│   ├── sequencing_telemetry.js
│   └── unclassified
│       └── fastq_runid_fbc8eee46271cbe60ee8a49d0ca657f6e92e174e_0_0.fastq.gz
└── log # qsub logs
    └── guppy.log
    └── nanoplot.log

Assembly with Flye and polishing with Racon and Medaka

workflow-after-gpu-basecalling.sh - Assembly with Flye and polishing with Racon and Medaka

Requirements

  • Must have previously run the above workflow run_basecall-w-gpu.sh
  • Must be logged into a server with the ability to qsub (Aspen, Monoliths 1-3).
  • OUTDIR argument must be the same directory as the OUTDIR from the run_basecall-w-gpu.sh workflow

This workflow does the following:

  • Takes in 1 argument:
    1. $outdir - The output directory from running run_basecall-w-gpu.sh, which contain demux/barcodeXX/ subdirectories
  • Prepares a barcoded sample - concatenates all fastq files into one, compresses, and counts read lengths
  • Runs filtlong to remove reads <500bp and downsample reads to 600 Mb (roughly 120X for a 5 Mb genome)
  • Assembles downsampled/filtered reads using flye (--plasmids and -g 5M options used)
  • Polishes flye draft assembly using racon 4 times
  • Polishes racon polished assembly using Medaka (specific to r9.4.1 flowcell, high accuracy basecaller model, and guppy version 3.6.x, --m r941_min_high_g360 option used)
  • Final, polished assembly for each barcode can be found in each barcode subdirectory demux/barcodeXX/final.asm.barcodeXX.fasta

USAGE

Pull up help/usage statement by running workflow-after-gpu-basecalling.sh or workflow-after-gpu-basecalling.sh -h

# note: ensure that the outdir supplied in this command is the exact same as the outdir you
# supplied when you ran the run_basecall-w-gpu.sh script
Usage: /path/to/nanoporeWorkflow-0.5.0/workflows/workflow-after-gpu-basecalling.sh outdir/

This workflow runs the following on barcodes 01-24:

filtlong     removes reads <500bp and downsamples to 600Mb (roughly 120X for 5Mb genome)
flye         assembles reads. --plasmids and -g 5M options used
racon        polishes 4X with Racon
medaka       polishes once with Medaka using r9.4.1 pore and HAC guppy basecaller profile

# EXAMPLE OUTPUT - only showing one barcode for brevity
$OUTDIR/
├── demux
│   ├── barcode01
│   │   ├── all.fastq.gz
│   │   ├── flye
|   |   ├── final.asm.barcodeXX.fasta
│   │   ├── log  # qsub logs for each barcode
│   │   │   ├── assemble-d64ffbc5-4012-44c5-8191-1a57d4a7d15c.log
│   │   │   ├── polish-medaka-00e52c16-0bd3-460d-b955-3a532be958b1.log
│   │   │   ├── polish-racon-d7ebc124-d100-43e0-b347-1e60bbc0bf18.log
│   │   │   └── prepSample-7ecc6f51-4937-40d1-a6bd-d83e66078984.log
│   │   ├── medaka
│   │   ├── racon
│   │   ├── readlengths.txt.gz
│   │   └── reads.minlen1000.600Mb.fastq.gz
│   ├── guppy-logs
│   ├── nanoplot
│   ├── nanoplot-barcoded
│   ├── sequencing_summary.txt
│   ├── sequencing_telemetry.js
│   └── unclassified
└── log # qsub logs
    └── guppy.log
    └── nanoplot.log

Notes on assembly and polishing workflow

  • It will check for the following files, to determine if it should skip any of the steps. Helps if one part doesn't run correctly and you don't want to repeat a certain step, e.g. re-assembling.
    • np_filter_filtlong.sh looks for ./demux/barcodeXX/reads.minlen500.600Mb.fastq.gz
    • np_assemble_flye.sh looks for ./demux/barcodeXX/flye/assembly.fasta
    • np_consensus_racon.sh looks for ./demux/barcodeXX/racon/ctg.consensus.iteration4.fasta
    • np_polish_medaka.sh looks for ./demux/barcodeXX/medaka/consensus.fasta

Contributing

If you are interested in contributing to nanoporeWorkflow, please take a look at the contribution guidelines. We welcome issues or pull requests!

Future plans

  • Add support for passing in a config file to workflow-after-gpu-basecalling.sh that contains:
    • Sample ID
    • barcode number (RBK or NBD)
    • estimated genome size (to be used as input parameter in various places)
  • Allow users to specify a read length for removing reads w/ filtlong
  • Test and add support for rasusa for randomly subsampling and filtering reads (filtlong is biased towards reads with highest q-scores)
  • add flags/options for other sequencing kits, barcoding kits, flowcells (direct RNAseq?)
    • R10.3 + ligation with native barcodes 1-24 (R10 flowcell discontinued)

Resources