-
Notifications
You must be signed in to change notification settings - Fork 204
Metagenomics Sequencing Pre processing
Filtering out Contaminants and Low-Quality Sequences with kneadData
kneaddata
is a helpful wrapper script for a number of pre-processing tools, including Bowtie2 to screen out contaminant sequences and Trimmomatic to exclude low-quality sequences. We also have written wrapper scripts to run these tools (see below), but using kneaddata
allows for more flexibility in options.
A basic kneaddata command can be run like this (the \
characters just break the command over multiple lines to make it easier to read):
kneaddata -i sample_R1.fastq -i sample_R2.fastq -o kneaddata_out \
-db /home/shared/bowtiedb/GRCh38_PhiX --trimmomatic /usr/local/prg/Trimmomatic-0.36/ \
-t 4 --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:50" \
--bowtie2-options "--very-sensitive --dovetail" --remove-intermediate-output
The options above are:
-
-i
Input FASTQ file. This option is given twice for paired-end data. -
-o
Output directory. -
-db
Path to bowtie2 database. -
--trimmomatic
Path to Trimmomatic folder containing jarfile. -
-t 4
Number of threads to use (4 in this case). -
--trimmomatic-options
Options for Trimmomatic to use, in quotations ("SLIDINGWINDOW:4:20 MINLEN:50" in this case). See the Trimmomatic website for more options. -
--bowtie2-options
Options for bowtie2 to use, in quotations. The options "--very-sensitive" and "--dovetail" in this set alignment parameters to be very sensitive and sets cases where mates extend past each other to be concordant (i.e. they will be called as contaminants and be excluded). -
--remove-intermediate-output
Intermediate files, including large FASTQs, will be removed.
To run an example on all FASTQs on the Virtual Box Image you can use a command like this:
parallel -j 1 --link 'kneaddata -i {1} -i {2} -o kneaddata_out/ \
-db /home/shared/bowtiedb/GRCh38_PhiX --trimmomatic /usr/local/prg/Trimmomatic-0.36/ \
-t 4 --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:50" \
--bowtie2-options "--very-sensitive --dovetail" --remove-intermediate-output' \
::: raw_data/*_R1.fastq ::: raw_data/*_R2.fastq
As always with parallel
commands, there are options specific to parallel
and specific to the program being run (kneaddata
in this case). It's a good idea to use the parallel
option --dry-run
when running large commands like this to double-check that the commands are being run as you'd like (e.g. the R1 and R2 FASTQs are being paired correctly). This option will cause the commands to be printed to screen without running them.
Below is the same command as above, but with the kneaddata command being run collapsed to "KNEADDATA".
parallel -j 1 --link 'KNEADDATA' ::: raw_data/*_R1.fastq ::: raw_data/*_R2.fastq
The parallel
options specified are -j 1
to specify that one job should be run at a time and --link
to specify that two different sets of input files will be provided. These two sets of input files are the ::: raw_data/*_R1.fastq
and ::: raw_data/*_R2.fastq
at the end of the command. Since --link
was specified each parallel
command will run the files in set1 and set2 together (specified as {1}
and {2}
in the full command). Importantly, if you do not set the --link
option all the files will just be looped over as one set.
If you have stitched reads or single-end data you can pre-process your data using this command instead:
parallel -j 1 'kneaddata -i {} -o kneaddata_out/ \
-db /home/shared/bowtiedb/GRCh38_PhiX --trimmomatic /usr/local/prg/Trimmomatic-0.36/ \
-t 4 --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:50" \
--remove-intermediate-output' \
::: raw_data/*.fastq
You can then get a summary table of reads that passed at each pre-processing step per sample with this command:
kneaddata_read_count_table --input kneaddata_out --output kneaddata_read_counts.txt
Our run_contaminant_filter.pl
script wraps Bowtie2 to screen out human sequences. You can use the script like so:
run_contaminant_filter.pl -p 4 -o screened_reads/ stitched_reads/*.assembled*
Note that the GRCh38_PhiX (or whatever reference you're using) Bowtie2 index files need to be in /home/shared/bowtiedb/GRCh38_PhiX
and bowtie2 needs to be in your PATH if you want to filter out human reads.
Alternatively you could use run_deconseq.pl
instead, but it is much slower and the required databases are no longer available online so we have been unable to add them to the Virtual Box Image.
run_trimmomatic.pl
is a wrapper script that will run Trimmomatic on specified FASTQs. This script will automatically identify forward and reverse FASTQ pairs from the filenames. Note: Trimmomatic assumes that input forward and reverse reads are in the same order! See an example command below, you can type run_trimmomatic.pl -h
to see all the options.
run_trimmomatic.pl -l 5 -t 5 -r 15 -w 4 -m 70 -j /usr/local/prg/Trimmomatic-0.36/trimmomatic-0.36.jar \
--thread 1 -o trimmomatic_filtered screened_reads/*fastq
- Please feel free to post a question on the Microbiome Helper google group if you have any issues.
- General comments or inquires about Microbiome Helper can be sent to [email protected].