Metagenomics Sequencing Pre-processing
Filtering out Contaminants and Low-Quality Sequences with kneadData
kneaddata is a helpful wrapper script for a number of pre-processing tools, including Bowtie2 to screen out contaminant sequences and Trimmomatic to exclude low-quality sequences. We have also written wrapper scripts to run these tools (see below), but using kneaddata allows for more flexibility in options.
A basic kneaddata command can be run like this (for paired-end data: note that the -i option is set twice):
kneaddata -i sample_R1.fastq -i sample_R2.fastq -o kneaddata_out -db /path/to/bowtie2/db \
--trimmomatic /path/to/trimmomatic
To run an example on all FASTQs on the Virtual Box Image you can use a command like the one below. The --link option makes parallel pair the first forward FASTQ with the first reverse FASTQ, the second with the second, and so on; without it, every forward file would be run against every reverse file. It's a good idea to first add the --dry-run option to parallel to double-check that the correct forward and reverse FASTQs are being run together:
parallel -j 1 --link 'kneaddata -i {1} -i {2} -o kneaddata_out/ \
-db /home/shared/bowtiedb/GRCh38_PhiX --trimmomatic /usr/local/prg/Trimmomatic-0.36/ \
-t 4 --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:50" \
--bowtie2-options "--very-sensitive --mm --dovetail" --remove-intermediate-output' \
::: subsampled_fastqs/*_R1.fastq ::: subsampled_fastqs/*_R2.fastq
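Before launching the run above, it can help to confirm that every forward FASTQ has a matching reverse FASTQ, since a missing or extra file would shift how the two file lists line up. The following is a minimal sketch, assuming the _R1.fastq / _R2.fastq naming used above; check_fastq_pairs is a hypothetical helper and not part of Microbiome Helper or kneaddata.

```shell
# Sketch: list forward/reverse FASTQ pairs in a directory and flag any file
# whose partner is missing. Assumes the _R1.fastq / _R2.fastq naming convention.
# check_fastq_pairs is a hypothetical helper name used for illustration only.
check_fastq_pairs() {
    local dir="$1" r1 r2 status=0

    # Every forward file should have a reverse partner.
    for r1 in "$dir"/*_R1.fastq; do
        [ -e "$r1" ] || continue            # glob matched nothing
        r2="${r1%_R1.fastq}_R2.fastq"       # expected partner file
        if [ -e "$r2" ]; then
            printf '%s\t%s\n' "$r1" "$r2"   # report the pair on stdout
        else
            printf 'MISSING_R2\t%s\n' "$r1" >&2
            status=1
        fi
    done

    # Every reverse file should have a forward partner.
    for r2 in "$dir"/*_R2.fastq; do
        [ -e "$r2" ] || continue
        r1="${r2%_R2.fastq}_R1.fastq"
        if [ ! -e "$r1" ]; then
            printf 'MISSING_R1\t%s\n' "$r2" >&2
            status=1
        fi
    done

    return "$status"
}
```

Running check_fastq_pairs subsampled_fastqs prints one tab-separated pair per line and returns a non-zero status if any file is unpaired.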
You can then get a summary table of reads that passed at each pre-processing step per sample with this command:
kneaddata_read_count_table --input kneaddata_out --output kneaddata_read_counts.txt
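The summary table can be reduced to a per-sample percentage of reads retained. The sketch below assumes the table is tab-separated with a header row and the sample name in the first column. The column headers are passed in as arguments because they may differ between kneaddata versions; the names "raw pair1" and "final pair1" used in the usage line are assumptions, so check the header of your own table first. percent_retained is a hypothetical helper.

```shell
# Sketch: percent of reads retained per sample, computed from two named columns
# of a tab-separated table with a header row. The column names are supplied by
# the caller; they are not hard-coded because they are an assumption here.
percent_retained() {
    local table="$1" before="$2" after="$3"
    awk -F'\t' -v before="$before" -v after="$after" '
        NR == 1 {
            # Locate the two requested columns by their header text.
            for (i = 1; i <= NF; i++) {
                if ($i == before) b = i
                if ($i == after)  a = i
            }
            if (!b || !a) {
                print "column not found in header" > "/dev/stderr"
                exit 1
            }
            next
        }
        # Skip samples with a zero starting count to avoid dividing by zero.
        $b > 0 { printf "%s\t%.1f\n", $1, 100 * $a / $b }
    ' "$table"
}
```

For example: percent_retained kneaddata_read_counts.txt "raw pair1" "final pair1"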
Our run_contaminant_filter.pl script wraps Bowtie2 to screen out human sequences. You can use the script like so:
run_contaminant_filter.pl -p 4 -o screened_reads/ stitched_reads/*.assembled*
Note that the GRCh38_PhiX (or whatever reference you're using) Bowtie2 index files need to be in /home/shared/bowtiedb/GRCh38_PhiX and bowtie2 needs to be in your PATH if you want to filter out human reads.
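Both requirements in the note above can be checked up front. This is a minimal sketch, assuming a standard (small) Bowtie2 index, which consists of six files ending in .1.bt2, .2.bt2, .3.bt2, .4.bt2, .rev.1.bt2 and .rev.2.bt2; large indexes use the .bt2l extension and would need the suffix changed. Both function names are hypothetical.

```shell
# Sketch: verify that a small Bowtie2 index is complete for a given prefix
# (directory plus base name). Hypothetical helper, not part of Bowtie2.
check_bowtie2_index() {
    local prefix="$1" suffix missing=0
    for suffix in 1 2 3 4 rev.1 rev.2; do
        if [ ! -e "${prefix}.${suffix}.bt2" ]; then
            printf 'missing index file: %s.%s.bt2\n' "$prefix" "$suffix" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Sketch: verify that the bowtie2 executable can be found in PATH.
check_bowtie2_on_path() {
    command -v bowtie2 >/dev/null 2>&1 || {
        echo "bowtie2 not found in PATH" >&2
        return 1
    }
}
```

A call might look like check_bowtie2_on_path && check_bowtie2_index /home/shared/bowtiedb/GRCh38_PhiX/PREFIX, where PREFIX stands for the base name of the index files, which depends on how the index was built.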
Alternatively, you could use run_deconseq.pl, but it is much slower and the required databases are no longer available online, so we have been unable to add them to the Virtual Box Image.
run_trimmomatic.pl is a wrapper script that will run Trimmomatic on specified FASTQs. This script will automatically identify forward and reverse FASTQ pairs from the filenames. Note: Trimmomatic assumes that input forward and reverse reads are in the same order! See an example command below; you can type run_trimmomatic.pl -h to see all the options.
run_trimmomatic.pl -l 5 -t 5 -r 15 -w 4 -m 70 -j /usr/local/prg/Trimmomatic-0.36/trimmomatic-0.36.jar \
--thread 1 -o trimmomatic_filtered screened_reads/*fastq
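Because Trimmomatic assumes forward and reverse reads are in the same order, it can be worth verifying this for a pair of files before trimming. The sketch below assumes four-line FASTQ records (no wrapped sequence lines) and read IDs that match between mates once a trailing /1 or /2 is removed; fastq_ids and same_read_order are hypothetical helpers.

```shell
# Sketch: print the read ID of every record in a four-line FASTQ file.
# The ID is the first whitespace-delimited token of the header line, with the
# leading "@" and any trailing "/1" or "/2" mate suffix removed.
fastq_ids() {
    awk 'NR % 4 == 1 {
        id = $1
        sub(/^@/, "", id)
        sub(/\/[12]$/, "", id)
        print id
    }' "$1"
}

# Sketch: return 0 if two FASTQ files list the same read IDs in the same order.
same_read_order() {
    local ids1 ids2 rc
    ids1=$(mktemp)
    ids2=$(mktemp)
    fastq_ids "$1" > "$ids1"
    fastq_ids "$2" > "$ids2"
    # cmp -s is silent and succeeds only if the two ID lists are identical.
    cmp -s "$ids1" "$ids2" && rc=0 || rc=$?
    rm -f "$ids1" "$ids2"
    return "$rc"
}
```

For example, same_read_order sample_R1.fastq sample_R2.fastq || echo "read order differs" reports a mismatch; the file names here are placeholders.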
- Please feel free to post a question on the Microbiome Helper Google group if you have any issues.
- General comments or inquiries about Microbiome Helper can be sent to [email protected].