Metagenomics Sequencing Pre-processing

Gavin Douglas edited this page Oct 4, 2017 · 12 revisions

Filtering out Contaminants and Low-Quality Sequences with kneadData

kneaddata is a helpful wrapper script for a number of pre-processing tools, including Bowtie2 to screen out contaminant sequences and Trimmomatic to exclude low-quality sequences. We have also written our own wrapper scripts to run these tools (see below), but kneaddata offers more flexibility in its options.

A basic kneaddata command for paired-end data can be run like this (note that the -i option is given twice, once for each read file):

kneaddata -i sample_R1.fastq -i sample_R2.fastq -o kneaddata_out -db /path/to/bowtie2/db \
--trimmomatic /path/to/trimmomatic 

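To run kneaddata on all of the example FASTQs on the Virtual Box Image, you can use a command like the one below. The --link option makes parallel pair the first forward file with the first reverse file, the second with the second, and so on; without it, parallel runs every combination of the two file lists (older releases of GNU parallel call this option --xapply). It's a good idea to first run the command with parallel's --dry-run option to double-check that the correct forward and reverse FASTQs are being run together:

parallel -j 1 --link 'kneaddata -i {1} -i {2} -o kneaddata_out/ \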
-db /home/shared/bowtiedb/GRCh38_PhiX --trimmomatic /usr/local/prg/Trimmomatic-0.36/ \
-t 4 --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:50" \
--bowtie2-options "--very-sensitive --mm --dovetail" --remove-intermediate-output' \
 ::: subsampled_fastqs/*_R1.fastq ::: subsampled_fastqs/*_R2.fastq
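
Before launching a long run, it can also help to confirm that every forward FASTQ has a reverse partner. The following is a minimal sketch, assuming the files follow the *_R1.fastq / *_R2.fastq naming used in the command above; check_fastq_pairs is a hypothetical helper and is not part of kneaddata or parallel.

```shell
# Sketch: confirm every forward FASTQ has a matching reverse FASTQ.
# check_fastq_pairs is a hypothetical helper (not part of kneaddata or parallel).
# Assumes the *_R1.fastq / *_R2.fastq naming convention.
# Note: it only checks from the forward side, so an R2 file with no R1 is not reported.
check_fastq_pairs() {
    local dir="$1" status=0 r1 r2
    for r1 in "$dir"/*_R1.fastq; do
        # If the glob matched nothing, the literal pattern is left unexpanded.
        if [ ! -e "$r1" ]; then
            echo "no *_R1.fastq files in $dir" >&2
            return 1
        fi
        r2="${r1%_R1.fastq}_R2.fastq"
        if [ -e "$r2" ]; then
            # Print the pair, tab-separated, in the order it would be processed.
            printf '%s\t%s\n' "$r1" "$r2"
        else
            echo "missing reverse file for $r1" >&2
            status=1
        fi
    done
    return "$status"
}
```

Calling check_fastq_pairs subsampled_fastqs prints one forward/reverse pair per line and returns a non-zero status if any forward file lacks a partner.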

You can then get a summary table of reads that passed at each pre-processing step per sample with this command:

kneaddata_read_count_table --input kneaddata_out --output kneaddata_read_counts.txt
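
To see at a glance how much data survived pre-processing, the count table can be reduced to a percentage of reads retained per sample. This is a rough sketch, assuming the table is tab-delimited with a header row and the sample name in the first column; percent_retained is a hypothetical helper, and the column numbers are passed in by the caller because the exact column layout is not specified here.

```shell
# Sketch: percentage of reads retained per sample from a tab-delimited count table.
# percent_retained is a hypothetical helper (not shipped with kneaddata).
# Assumptions: one header row, sample name in column 1, counts in the given columns.
percent_retained() {
    # $1 = table file
    # $2 = column number holding the raw read counts
    # $3 = column number holding the final read counts
    awk -F'\t' -v raw="$2" -v fin="$3" '
        NR == 1 { next }    # skip the header row
        $raw > 0 { printf "%s\t%.1f\n", $1, 100 * $fin / $raw }
    ' "$1"
}
```

To find the right column numbers, head -n 1 kneaddata_read_counts.txt | tr '\t' '\n' | nl lists the header fields with their positions. The function can then be called as, for example, percent_retained kneaddata_read_counts.txt 2 3, with the two numbers replaced by whichever columns hold the raw and final counts in your table.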

Wrapper script for Bowtie2

Our run_contaminant_filter.pl script wraps Bowtie2 to screen out human sequences. You can use the script like so:

run_contaminant_filter.pl -p 4 -o screened_reads/ stitched_reads/*.assembled*

Note that for human reads to be filtered out, the Bowtie2 index files for GRCh38_PhiX (or whichever reference you're using) need to be in /home/shared/bowtiedb/GRCh38_PhiX, and bowtie2 needs to be in your PATH.
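
Both requirements can be checked before starting a run. The sketch below is one way to do that; check_bowtie2_setup is a hypothetical helper, and it assumes the index files carry the .bt2 (or .bt2l) extension written by bowtie2-build. Because the index location could be read either as a directory or as an index basename, the sketch accepts both.

```shell
# Sketch: verify the prerequisites for the contaminant filter.
# check_bowtie2_setup is a hypothetical helper (not part of the wrapper scripts).
# Assumption: index files end in .bt2 or .bt2l, as written by bowtie2-build.
check_bowtie2_setup() {
    # $1 = program name to look up in PATH (e.g. bowtie2)
    # $2 = index location, either a directory or an index basename
    local prog="$1" index="$2" f found=0
    if ! command -v "$prog" >/dev/null 2>&1; then
        echo "$prog not found in PATH" >&2
        return 1
    fi
    # Try the basename form (index.1.bt2) and the directory form (index/*.bt2).
    for f in "$index".*.bt2 "$index".*.bt2l "$index"/*.bt2 "$index"/*.bt2l; do
        if [ -e "$f" ]; then
            found=1
        fi
    done
    if [ "$found" -ne 1 ]; then
        echo "no Bowtie2 index files found for $index" >&2
        return 1
    fi
}
```

For the setup described above, this would be called as check_bowtie2_setup bowtie2 /home/shared/bowtiedb/GRCh38_PhiX.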

Alternatively, you could use run_deconseq.pl, but it is much slower, and the databases it requires are no longer available online, so we have been unable to add them to the Virtual Box Image.

Wrapper script for Trimmomatic

run_trimmomatic.pl is a wrapper script that runs Trimmomatic on the specified FASTQs. The script automatically identifies forward and reverse FASTQ pairs from the filenames. Note: Trimmomatic assumes that the forward and reverse reads in the input files are in the same order! See the example command below; you can type run_trimmomatic.pl -h to see all the options.

run_trimmomatic.pl -l 5 -t 5 -r 15 -w 4 -m 70 -j /usr/local/prg/Trimmomatic-0.36/trimmomatic-0.36.jar \
--thread 1 -o trimmomatic_filtered screened_reads/*fastq
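
Since Trimmomatic relies on the forward and reverse reads being in the same order, it can be worth checking this for a pair of files beforehand. The following is a minimal sketch; read_ids and same_read_order are hypothetical helpers, and the sketch assumes uncompressed FASTQs with four lines per record and read IDs that match once a trailing /1 or /2 and any text after the first space are removed.

```shell
# Sketch: confirm two FASTQ files list their reads in the same order.
# read_ids and same_read_order are hypothetical helpers (not part of Trimmomatic).
# Assumptions: uncompressed input, four lines per record, and read IDs that are
# identical after dropping a trailing /1 or /2 and anything after the first space.
read_ids() {
    # Print the normalised ID from every header line (lines 1, 5, 9, ...).
    awk 'NR % 4 == 1 { id = $1; sub(/\/[12]$/, "", id); print id }' "$1"
}

same_read_order() {
    # $1 = forward FASTQ, $2 = reverse FASTQ
    local ids1 ids2 status
    ids1=$(mktemp)
    ids2=$(mktemp)
    read_ids "$1" > "$ids1"
    read_ids "$2" > "$ids2"
    # cmp -s is silent and succeeds only if the two ID lists are byte-identical.
    if cmp -s "$ids1" "$ids2"; then
        status=0
    else
        status=1
    fi
    rm -f "$ids1" "$ids2"
    return "$status"
}
```

For a hypothetical pair of files, same_read_order sample_R1.fastq sample_R2.fastq returns zero when the read IDs line up and non-zero otherwise, so it can be used directly in an if statement.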