Skip to content

PacBio CCS Amplicon SOP v2 (qiime2 2022.2)

Gina Filloramo edited this page Nov 4, 2022 · 34 revisions

This standard operating procedure (SOP) is based on QIIME2 and is meant for users who want to quickly run their PacBio CCS amplicon data through the Microbiome Helper virtual box image and for internal use. Note that this workflow simply adapts our current Illumina amplicon workflows by altering the first steps to be compatible with the PacBio datatype, therefore you will need to use and familiarize yourself with the Illumina workflow of your choice on the right menu bar in order to complete processing.

If you use this workflow make sure to keep track of the commands you use locally as this page will be updated over time (see "revisions" above for earlier versions).

Requirements

This workflow assumes that you have QIIME2 installed in a conda environment (the appropriate version matching the Illumina workflow you will continue) and that you also have the seqtk toolset installed in this environment.

This workflow also assumes that the input is raw Circular Consensus Sequencing (CCS) PacBio data in demultiplexed FASTQ format located within a folder called raw_data. The filenames can be almost anything you wish (contrary to most QIIME2 importing) since you are going to use a "manifest file" to list each file.

1. First steps

Steps 1.1 to 1.3

As per the Illumina amplicon SOP.

1.4 Activate QIIME 2 conda environment

You should run the rest of the workflow in a conda environment, which makes sure the correct version of the Python packages required by QIIME 2 are being used. You can activate this conda environment with this command (you may need to swap in source for conda if you get an error):

conda activate qiime2-2022.2

1.5 Import FASTQs as QIIME 2 artifact

The raw reads will be imported into the QIIME 2 "artifact" file format (with the extension QZA). The slight difference here compared to standard Illumina file importing is that you need to use a "manifest" file - consult the QIIME2 documentation about preparing it, but essentially it is just a tab-delimited text file containing the sample names + absolute path to each file in the raw_data/ folder we just made above.

cd raw_reads
gzip -d raw_data/*.fastq.gz
cd ..

mkdir reads_qza

qiime tools import \
    --type SampleData[SequencesWithQuality] \
    --input-path PacBioCCSmanifest.tsv \
    --output-path reads_qza/raw_reads.qza \
    --input-format SingleEndFastqManifestPhred33V2

1.6 Summarize trimmed FASTQs

You can run the demux summarize command after trimming the reads to get a report of the number of reads per sample and quality distribution across the reads. This generates a more basic output compared to FASTQC/MultiQC, but is sufficient for this step.

qiime demux summarize \
   --i-data reads_qza/raw_reads.qza \
   --o-visualization reads_qza/raw_reads_summary.qzv

2. Denoising the reads into amplicon sequence variants

At this stage, the main 2 pipelines you can use are based on either deblur or DADA2. Below we will describe the commands for running DADA2, which now supports PacBio reads, as Deblur may not work correctly since it does not currently have a PacBio mode.

2.1 Running DADA2

Run the DADA2 workflow to remove primers, correct reads and get amplicon sequence variants (ASVs). Note that we are not doing any form of initial quality filtering (unlike our Illumina SOP) or truncation in the below command due to the fact the CCS reads are already of very high quality when produced (99.9% consensus accuracy). You will probably also want to increase the number of threads used below to the maximum your system has available. If your PacBio data is not CCS in nature or you have used a lower consensus threshold (ie: <99%), then we would suggest you add in a quality-filter step. The below primers correspond to the full-length 16S (Bacteria-specific primer set).

mkdir dada2_output

qiime dada2 denoise-ccs --i-demultiplexed-seqs reads_qza/raw_reads.qza \
 --p-min-len 1200 --p-max-len 1800 \
 --p-front AGRGTTYGATYMTGGCTCAG --p-adapter RGYTACCTTGTTACGACTT \
 --p-n-threads 10 \
 --o-table dada2_ccs_output/table.qza \
 --o-representative-sequences dada2_ccs_output/representative_sequences.qza \
 --o-denoising-stats dada2_ccs_output/stats.qza \
 --verbose

Note we are using linked anchored adapters here to ensure only reads having the primers at the extremities are retained and a size-range of 1300-1800 nt to allow some indels (there should be relatively few in actuality), but prevent amplicon dimers (we've observed ~1% in PacBio CCS data) from passing through, since the size of the 16S amplicon here is ~1500 nt.

If using our full-length 16S Archaea-specific primers (currently in testing), use the following command:

qiime dada2 denoise-ccs --i-demultiplexed-seqs reads_qza/raw_reads.qza \
 --p-min-len 1200 --p-max-len 1800 \
 --p-front TCCGGTTGATCCYGCCGG --p-adapter CRGTGWGTRCAAGGRGCA \
 --p-n-threads 10 \
 --o-table dada2_ccs_output/table.qza \
 --o-representative-sequences dada2_ccs_output/representative_sequences.qza \
 --o-denoising-stats dada2_ccs_output/stats.qza \
 --verbose

If using our full-length 18S primers, use the following command (note a much wider size-range, around the ~1800 nt normal size, due to the fact some species of Eukaryotes have much larger indels in their rRNA than Bacteria do):

qiime dada2 denoise-ccs --i-demultiplexed-seqs reads_qza/raw_reads.qza \
 --p-min-len 1000 --p-max-len 3000 \
 --p-front CTGGTTGATYCTGCCAGT --p-adapter TGATCCTTCTGCAGGTTCACCTAC \
 --p-n-threads 10 \
 --o-table dada2_ccs_output/table.qza \
 --o-representative-sequences dada2_ccs_output/representative_sequences.qza \
 --o-denoising-stats dada2_ccs_output/stats.qza \
 --verbose

If using our full-length fungal ITS primers, use the following command (again, size-range is modified for the proper range to keep variation, but hopefully remove larger dimers; ~600 nt average size):

qiime dada2 denoise-ccs --i-demultiplexed-seqs reads_qza/raw_reads.qza \
 --p-min-len 300 --p-max-len 900 \
 --p-front TAGAGGAAGTAAAAGTCGTAA --p-adapter TCCTCCGCTTWTTGWTWTGC \
 --p-n-threads 10 \
 --o-table dada2_ccs_output/table.qza \
 --o-representative-sequences dada2_ccs_output/representative_sequences.qza \
 --o-denoising-stats dada2_ccs_output/stats.qza \
 --verbose

If you see substantial losses of read numbers after this step (ie: your file-sizes are now much smaller than the original "raw" CCS files), then make absolutely certain you are using the correct primer sequences (in the correct linked orientations) and have not filtered out most of your reads due to the size-range restrictions - this would indicate a problem in matching your fragment with the correct parameters (especially important if using custom primers and adjusting all the above values).

2.2 Summarizing DADA2 output

Once a denoising pipeline has been run you can summarize the output table with the below command, which will create a visualization artifact for you to view.

qiime feature-table summarize \
   --i-table dada2_output/table.qza \
   --o-visualization dada2_output/dada2_table_summary.qzv

Steps 3 and Onward

Continue on with the standard Illumina amplicon SOP, keeping in mind the following notes:

  1. Remember to modify the expected filenames and folders from "deblur..." to "dada2..." throughout.
  2. There is no "bleed-through" phenomenon in PacBio sequencing (addressed in Step 4.1) and depth is much shallower than Illumina, therefore you may have to adjust the levels to which you filter out rare sequences (although some filtering to remove noise/singletons is probably still recommended).
  3. Remember to use the full-length versions of the taxonomic reference files when identifying ASVs.
Clone this wiki locally