Remove chimeric reads

Gavin Douglas edited this page Mar 10, 2016 · 24 revisions

Our script chimeraFilter.pl wraps usearch (v6.1), specifically the uchime algorithm, to remove chimeric reads.

Here is an example command:

chimeraFilter.pl -type 1 -db /usr/local/db/single_strand/Bacteria_RDP_trainset15_092015.udb fasta_files/*

Here, "-type 1" means that both reads clearly called as chimeric and reads with an ambiguous call are filtered out.
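To make the difference between "-type 1" and "-type 0" (see Options) concrete, here is a small sketch on a made-up table of uchime calls. It is only an illustration of the filtering rule, not the script's actual implementation; the read IDs and the two-column layout are assumptions.

```shell
# Made-up two-column table: read ID and uchime call (Y = chimeric, N = non-chimeric, ? = borderline).
calls=$(printf 'read1\tN\nread2\tY\nread3\t?\n')

# -type 1: keep only reads clearly called as non-chimeric (N).
kept_type1=$(printf '%s\n' "$calls" | awk -F'\t' '$2 == "N" {print $1}')

# -type 0: keep every read not called as chimeric (N and ?).
kept_type0=$(printf '%s\n' "$calls" | awk -F'\t' '$2 != "Y" {print $1}')

echo "type 1 keeps:" $kept_type1
echo "type 0 keeps:" $kept_type0
```

With this toy input, the borderline read is dropped under "-type 1" and kept under "-type 0".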

Note that a reference DB file must also be supplied with "-db". To use the UDB format rather than FASTA, build the file with the "-makeudb_usearch" command of usearch v6.1 (the same usearch version used for chimera checking).
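As a rough sketch, building a UDB file from a FASTA database might look like the following. The FASTA filename is hypothetical, and the "-output" option is an assumption about the usearch command line, so check the usearch v6.1 help before relying on it. The command only runs if usearch and the input file are present; otherwise it just prints what would be run.

```shell
# Hypothetical input file; replace with your own FASTA database.
db_fasta="Bacteria_RDP_trainset15_092015.fasta"
# Derive the output name by swapping the extension.
db_udb="${db_fasta%.fasta}.udb"

if command -v usearch >/dev/null 2>&1 && [ -f "$db_fasta" ]; then
    # Assumed syntax: usearch -makeudb_usearch <fasta> -output <udb>
    usearch -makeudb_usearch "$db_fasta" -output "$db_udb"
else
    echo "Would run: usearch -makeudb_usearch $db_fasta -output $db_udb"
fi
```

The resulting .udb file can then be passed to chimeraFilter.pl with "-db".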

Note that the settings of "mindiv" and "minh" (see http://www.drive5.com/usearch/manual/UCHIME_score.html) could have significant effects on results. However, so far we have found that small adjustments to these parameters have only a minor effect on sensitivity and specificity when running chimera checking for 16S sequences.

You can download the DB used in the above example [here](https://www.dropbox.com/s/8qr42doaez48oc3/Bacteria_RDP_trainset15_092015.udb?dl=0) (70 MB). It is originally from the Ribosome Database Project (RDP) and was parsed to include only bacteria.

Options:

  • -h, --help
    Displays the entire help documentation.

  • -v, --version
    Displays version number and exits.

  • -type <[0|1]>
    Non-chimeric output type: either only sequences that are clearly non-chimeric (1) or all sequences that are not called as chimeric (0, which includes borderline sequences, marked "?" in uchime output).

  • -mindiv
    Min % divergence between query and target sequence (default 1.5, note that this differs from the uchime default of 0.8).

  • -minh
    Min score to be called as chimeric (default 0.2, note that this differs from the uchime default of 0.28).

  • -o, --out_dir
    Output directory for filtered fastq files. Default is "non_chimeras".

  • -thread <# of CPUs>
    Using this option without a value will use all CPUs on the machine, while giving it a value will limit usage to that many CPUs. Without this option, only one CPU is used.

  • -log
    The location to write the log file.

  • -db, --database
    Database of 16S sequences to use as a reference (UDB or FASTA file).
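
Putting the options above together, a fuller command might look like the sketch below. The thread count and log filename are illustrative assumptions, and the "-mindiv" and "-minh" values shown are simply the script defaults written out explicitly. The script is only called if it is found on the PATH; otherwise the command is printed.

```shell
# Illustrative option values; the log filename and thread count are assumptions.
opts="-type 1 -mindiv 1.5 -minh 0.2 -thread 4 -o non_chimeras -log chimeraFilter_log.txt"
db="/usr/local/db/single_strand/Bacteria_RDP_trainset15_092015.udb"

if command -v chimeraFilter.pl >/dev/null 2>&1; then
    # $opts is intentionally unquoted so it splits into separate arguments.
    chimeraFilter.pl $opts -db "$db" fasta_files/*
else
    echo "Would run: chimeraFilter.pl $opts -db $db fasta_files/*"
fi
```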
