-
Notifications
You must be signed in to change notification settings - Fork 203
PacBio CCS Amplicon SOP v2 (qiime2 2022.2)
This standard operating procedure (SOP) is based on QIIME2 and is meant for users who want to quickly run their PacBio CCS amplicon data through the Microbiome Helper virtual box image and for internal use. Note that this workflow simply adapts our current Illumina amplicon workflows by altering the first steps to be compatible with the PacBio datatype, therefore you will need to use and familiarize yourself with the Illumina workflow of your choice on the right menu bar in order to complete processing.
If you use this workflow make sure to keep track of the commands you use locally as this page will be updated over time (see "revisions" above for earlier versions).
This workflow assumes that you have QIIME2 installed in a conda environment (the appropriate version matching the Illumina workflow you will continue) and that you also have the seqtk toolset installed in this environment.
This workflow also assumes that the input is raw Circular Consensus Sequencing (CCS) PacBio data in demultiplexed FASTQ format located within a folder called raw_data
. The filenames can be almost anything you wish (contrary to most QIIME2 importing) since you are going to use a "manifest file" to list each file.
As per the Illumina amplicon SOP.
You should run the rest of the workflow in a conda environment, which makes sure the correct version of the Python packages required by QIIME 2 are being used. You can activate this conda environment with this command (you may need to swap in source
for conda
if you get an error):
conda activate qiime2-2022.2
The raw reads will be imported into the QIIME 2 "artifact" file format (with the extension QZA). The slight difference here compared to standard Illumina file importing is that you need to use a "manifest" file - consult the QIIME2 documentation about preparing it, but essentially it is just a tab-delimited text file containing the sample names + absolute path to each file in the raw_data/ folder we just made above.
cd raw_reads
gzip -d raw_data/*.fastq.gz
cd ..
mkdir reads_qza
qiime tools import \
--type SampleData[SequencesWithQuality] \
--input-path PacBioCCSmanifest.tsv \
--output-path reads_qza/raw_reads.qza \
--input-format SingleEndFastqManifestPhred33V2
You can run the demux summarize
command after trimming the reads to get a report of the number of reads per sample and quality distribution across the reads. This generates a more basic output compared to FASTQC/MultiQC, but is sufficient for this step.
qiime demux summarize \
--i-data reads_qza/raw_reads.qza \
--o-visualization reads_qza/raw_reads_summary.qzv
At this stage, the main 2 pipelines you can use are based on either deblur or DADA2. Below we will describe the commands for running DADA2, which now supports PacBio reads, as Deblur may not work correctly since it does not currently have a PacBio mode.
Run the DADA2 workflow to remove primers, correct reads and get amplicon sequence variants (ASVs). Note that we are not doing any form of initial quality filtering (unlike our Illumina SOP) or truncation in the below command due to the fact the CCS reads are already of very high quality when produced (99.9% consensus accuracy). You will probably also want to increase the number of threads used below to the maximum your system has available. If your PacBio data is not CCS in nature or you have used a lower consensus threshold (ie: <99%), then we would suggest you add in a quality-filter step. The below primers correspond to the full-length 16S (Bacteria-specific primer set).
mkdir dada2_output
qiime dada2 denoise-ccs --i-demultiplexed-seqs reads_qza/raw_reads.qza \
--p-min-len 1200 --p-max-len 1800 \
--p-front AGRGTTYGATYMTGGCTCAG --p-adapter RGYTACCTTGTTACGACTT \
--p-n-threads 10 \
--o-table dada2_ccs_output/table.qza \
--o-representative-sequences dada2_ccs_output/representative_sequences.qza \
--o-denoising-stats dada2_ccs_output/stats.qza \
--verbose
Note we are using linked anchored adapters here to ensure only reads having the primers at the extremities are retained and a size-range of 1300-1800 nt to allow some indels (there should be relatively few in actuality), but prevent amplicon dimers (we've observed ~1% in PacBio CCS data) from passing through, since the size of the 16S amplicon here is ~1500 nt.
If using our full-length 16S Archaea-specific primers (currently in testing), use the following command:
qiime dada2 denoise-ccs --i-demultiplexed-seqs reads_qza/raw_reads.qza \
--p-min-len 1200 --p-max-len 1800 \
--p-front TCCGGTTGATCCYGCCGG --p-adapter CRGTGWGTRCAAGGRGCA \
--p-n-threads 10 \
--o-table dada2_ccs_output/table.qza \
--o-representative-sequences dada2_ccs_output/representative_sequences.qza \
--o-denoising-stats dada2_ccs_output/stats.qza \
--verbose
If using our full-length 18S primers, use the following command (note a much wider size-range, around the ~1800 nt normal size, due to the fact some species of Eukaryotes have much larger indels in their rRNA than Bacteria do):
qiime dada2 denoise-ccs --i-demultiplexed-seqs reads_qza/raw_reads.qza \
--p-min-len 1000 --p-max-len 3000 \
--p-front CTGGTTGATYCTGCCAGT --p-adapter TGATCCTTCTGCAGGTTCACCTAC \
--p-n-threads 10 \
--o-table dada2_ccs_output/table.qza \
--o-representative-sequences dada2_ccs_output/representative_sequences.qza \
--o-denoising-stats dada2_ccs_output/stats.qza \
--verbose
If using our full-length fungal ITS primers, use the following command (again, size-range is modified for the proper range to keep variation, but hopefully remove larger dimers; ~600 nt average size):
qiime dada2 denoise-ccs --i-demultiplexed-seqs reads_qza/raw_reads.qza \
--p-min-len 300 --p-max-len 900 \
--p-front TAGAGGAAGTAAAAGTCGTAA --p-adapter TCCTCCGCTTWTTGWTWTGC \
--p-n-threads 10 \
--o-table dada2_ccs_output/table.qza \
--o-representative-sequences dada2_ccs_output/representative_sequences.qza \
--o-denoising-stats dada2_ccs_output/stats.qza \
--verbose
If you see substantial losses of read numbers after this step (ie: your file-sizes are now much smaller than the original "raw" CCS files), then make absolutely certain you are using the correct primer sequences (in the correct linked orientations) and have not filtered out most of your reads due to the size-range restrictions - this would indicate a problem in matching your fragment with the correct parameters (especially important if using custom primers and adjusting all the above values).
Once a denoising pipeline has been run you can summarize the output table with the below command, which will create a visualization artifact for you to view.
qiime feature-table summarize \
--i-table dada2_output/table.qza \
--o-visualization dada2_output/dada2_table_summary.qzv
Continue on with the standard Illumina amplicon SOP, keeping in mind the following notes:
- Remember to modify the expected filenames and folders from "deblur..." to "dada2..." throughout.
- There is no "bleed-through" phenomenon in PacBio sequencing (addressed in Step 4.1) and depth is much shallower than Illumina, therefore you may have to adjust the levels to which you filter out rare sequences (although some filtering to remove noise/singletons is probably still recommended).
- Remember to use the full-length versions of the taxonomic reference files when identifying ASVs.
- Please feel free to post a question on the Microbiome Helper google group if you have any issues.
- General comments or inquires about Microbiome Helper can be sent to [email protected].
Useful Links
Microbiome Helper 2
- Overview
- Tutorial data
- QIIME2 marker gene workflow
- QIIME2 basic statistics and visualisation
- Statistical analysis workflow in R
- Metagenomics first steps
- Taxonomic annotation with Kraken 2
- Getting a tree for metagenomic reads
- MAG assembly, binning, and curation with Anvi'o
Main SOPs
Amplicon SOP v2 (qiime2-amplicon-2024.5)
PacBio Amplicon SOP v2 (qiime2-2022.2)
Tutorials