-
Notifications
You must be signed in to change notification settings - Fork 204
Microbiome Helper 2 Metagenomics Standard Operating Procedure Initial Steps
Authors:
Modifications by:
Based on initial versions by:
Note that this workflow is currently a work in progress. Please see the Metagenomics SOP v3 in the right hand side bar for our current version. If you want to use the below commands be sure to keep track of them locally.
- The below options are not necessarily best for your data; it is important to explore different options, especially at the read QC stage.
- You should check that only the FASTQs of interest are being specified at each step (e.g. with the
ls
command). If you are re-running commands multiple times or using optional steps the input filenames and/or folders could differ. - The tool
parallel
comes up several times in this workflow. Be sure to check out our tutorial on this tool here. - You can always run parallel with the
--dry-run
option to see what commands are being run to double-check they are doing what you want!
- This workflow starts with raw paired-end sequencing data in demultiplexed FASTQ format assumed to be located within a folder called
raw_data
. - If you do not have your own data and would like to run this, you can use some of our tutorial data from here.
-
concat_lanes.pl
script. fastqc
multiqc
kneaddata
Trimmomatic
- Bowtie2 database for use with kneaddata (maybe just human?)
Concatenate multiple lanes of sequencing together (e.g. for NextSeq550 data; NS2000 does not need this). If you do NOT do this step, remember to change cat_lanes
to raw_data
in the further commands below that have assumed you are working with the most common type of metagenome data.
concat_lanes.pl raw_data/* -o cat_lanes -p 4
Run fastqc
to allow manual inspection of the quality of sequences.
mkdir fastqc_out
fastqc -t 4 cat_lanes/* -o fastqc_out/
Stitch paired-end reads together with PEAR (summary of stitching results are written to "pear_summary_log.txt"). Note: it is important to check the % of reads assembled. It may be better to concatenate the forward and reverse reads together if the assembly % is too low (see step 2.2).
run_pear.pl -p 4 -o stitched_reads cat_lanes/*
If you don't stitch your reads together at this step you will need to unzip your FASTQ files before continuing with the below commands.
Use kneaddata
to run pre-processing tools. First Trimmomatic is run to remove low quality sequences. Then Bowtie2 is run to screen out contaminant sequences. Below we are screening out reads that map to the human or PhiX genomes. Note KneadData is being run below on all unstitched FASTQ pairs with parallel
, you can see our quick tutorial on this tool here. For a detailed breakdown of the options in the below command see this page. The forward and reverse reads will be specified by "_1" and "_2" in the output files, ignore the "R1" in each filename. Note that the \
characters at the end of each line are just to split the command over multiple lines to make it easier to read.
parallel -j 1 --link 'kneaddata -i {1} -i {2} -o kneaddata_out/ \
-db /home/shared/bowtiedb/GRCh38_PhiX --trimmomatic /usr/local/prg/Trimmomatic-0.36/ \
-t 4 --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:50" \
--bowtie2-options "--very-sensitive --dovetail" --remove-intermediate-output' \
::: cat_lanes/*_R1.fastq ::: cat_lanes/*_R2.fastq
Optional: If you stitched the F+R reads together above, then use the following version.
parallel -j 1 --link 'kneaddata -i {} -o kneaddata_out/ \
-db /home/shared/bowtiedb/GRCh38_PhiX --trimmomatic /usr/local/prg/Trimmomatic-0.36/ \
-t 4 --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:50" \
--bowtie2-options "--very-sensitive --dovetail" --remove-intermediate-output' \
::: stitched_reads/*.assembled.fastq
Clean up the output directory (helps downstream commands) by moving the discarded sequences to a subfolder:
mkdir kneaddata_out/contam_seq
mkdir kneaddata_out/unmatched_seq
mv kneaddata_out/*_contam*.fastq kneaddata_out/contam_seq
mv kneaddata_out/*_unmatched*.fastq kneaddata_out/unmatched_seq
You can produce a logfile summarizing the kneadData output with this command:
kneaddata_read_count_table --input kneaddata_out --output kneaddata_read_counts.txt
If you are having issues with running Kneaddata and all of your reads are being output as unmatched despite the fastq headers in both files matching, you may want to try using an older version of Kneaddata. It seems that this issue has been around for a while, and that the simplest fix is to use an older version of Kneaddata. I have found that version 0.7.4 works as expected.
If you did not stitch your paired-end reads together with PEAR, then you can concatenate the forward and reverse FASTQs together now. Note this is done after quality filtering so that both reads in a pair are either discarded or retained. It's important to only specify the "paired" FASTQs outputted by kneaddata
in the below command.
concat_paired_end.pl -p 4 --no_R_match -o cat_reads kneaddata_out/*_paired_*.fastq
You should check over the commands that are printed to screen to make sure the correct FASTQs are being concatenated together.
Now that you've carried out the initial steps, you can carry out taxonomic or functional annotation, or assemble MAGs:
- Taxonomic annotation
- Functional annotation
- MAG assembly
- Please feel free to post a question on the Microbiome Helper google group if you have any issues.
- General comments or inquires about Microbiome Helper can be sent to [email protected].