Skip to content

Microbiome Helper 2 Metagenomics Standard Operating Procedure Initial Steps

Robyn Wright edited this page Dec 5, 2024 · 4 revisions

Authors:
Modifications by:
Based on initial versions by:

Note that this workflow is currently a work in progress. Please see the Metagenomics SOP v3 in the right hand side bar for our current version. If you want to use the below commands be sure to keep track of them locally.

Initial points

  • The below options are not necessarily best for your data; it is important to explore different options, especially at the read QC stage.
  • You should check that only the FASTQs of interest are being specified at each step (e.g. with the ls command). If you are re-running commands multiple times or using optional steps the input filenames and/or folders could differ.
  • The tool parallel comes up several times in this workflow. Be sure to check out our tutorial on this tool here.
  • You can always run parallel with the --dry-run option to see what commands are being run to double-check they are doing what you want!

Requirements

  • This workflow starts with raw paired-end sequencing data in demultiplexed FASTQ format assumed to be located within a folder called raw_data.
  • If you do not have your own data and would like to run this, you can use some of our tutorial data from here.
  • concat_lanes.pl script.
  • fastqc
  • multiqc
  • kneaddata
  • Trimmomatic
  • Bowtie2 database for use with kneaddata (maybe just human?)

1. First Steps

1.1 Join lanes together (only some Illumina)

Concatenate multiple lanes of sequencing together (e.g. for NextSeq550 data; NS2000 does not need this). If you do NOT do this step, remember to change cat_lanes to raw_data in the further commands below that have assumed you are working with the most common type of metagenome data.

concat_lanes.pl raw_data/* -o cat_lanes -p 4

1.2 Inspect read quality

Run fastqc to allow manual inspection of the quality of sequences.

mkdir fastqc_out
fastqc -t 4 cat_lanes/* -o fastqc_out/

1.3 (Optional) Stitch F+R reads

Stitch paired-end reads together with PEAR (summary of stitching results are written to "pear_summary_log.txt"). Note: it is important to check the % of reads assembled. It may be better to concatenate the forward and reverse reads together if the assembly % is too low (see step 2.2).

run_pear.pl -p 4 -o stitched_reads cat_lanes/*

If you don't stitch your reads together at this step you will need to unzip your FASTQ files before continuing with the below commands.

2. Read Quality-Control and Contaminant Screens

2.1 Running KneadData

Use kneaddata to run pre-processing tools. First Trimmomatic is run to remove low quality sequences. Then Bowtie2 is run to screen out contaminant sequences. Below we are screening out reads that map to the human or PhiX genomes. Note KneadData is being run below on all unstitched FASTQ pairs with parallel, you can see our quick tutorial on this tool here. For a detailed breakdown of the options in the below command see this page. The forward and reverse reads will be specified by "_1" and "_2" in the output files, ignore the "R1" in each filename. Note that the \ characters at the end of each line are just to split the command over multiple lines to make it easier to read.

parallel -j 1 --link 'kneaddata -i {1} -i {2} -o kneaddata_out/ \
    -db /home/shared/bowtiedb/GRCh38_PhiX --trimmomatic /usr/local/prg/Trimmomatic-0.36/ \
    -t 4 --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:50" \
    --bowtie2-options "--very-sensitive --dovetail" --remove-intermediate-output' \
    ::: cat_lanes/*_R1.fastq ::: cat_lanes/*_R2.fastq

Optional: If you stitched the F+R reads together above, then use the following version.

parallel -j 1 --link 'kneaddata -i {} -o kneaddata_out/ \
    -db /home/shared/bowtiedb/GRCh38_PhiX --trimmomatic /usr/local/prg/Trimmomatic-0.36/ \
    -t 4 --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:50" \
    --bowtie2-options "--very-sensitive --dovetail" --remove-intermediate-output' \
    ::: stitched_reads/*.assembled.fastq

Clean up the output directory (helps downstream commands) by moving the discarded sequences to a subfolder:

mkdir kneaddata_out/contam_seq
mkdir kneaddata_out/unmatched_seq
mv kneaddata_out/*_contam*.fastq kneaddata_out/contam_seq
mv kneaddata_out/*_unmatched*.fastq kneaddata_out/unmatched_seq

You can produce a logfile summarizing the kneadData output with this command:

kneaddata_read_count_table --input kneaddata_out --output kneaddata_read_counts.txt

If you are having issues with running Kneaddata and all of your reads are being output as unmatched despite the fastq headers in both files matching, you may want to try using an older version of Kneaddata. It seems that this issue has been around for a while, and that the simplest fix is to use an older version of Kneaddata. I have found that version 0.7.4 works as expected.

2.2 Concatenate unstitched output (omit if data stitched above)

If you did not stitch your paired-end reads together with PEAR, then you can concatenate the forward and reverse FASTQs together now. Note this is done after quality filtering so that both reads in a pair are either discarded or retained. It's important to only specify the "paired" FASTQs outputted by kneaddata in the below command.

concat_paired_end.pl -p 4 --no_R_match -o cat_reads kneaddata_out/*_paired_*.fastq 

You should check over the commands that are printed to screen to make sure the correct FASTQs are being concatenated together.

Next steps

Now that you've carried out the initial steps, you can carry out taxonomic or functional annotation, or assemble MAGs:

  • Taxonomic annotation
  • Functional annotation
  • MAG assembly
Clone this wiki locally