
Viromics Pipeline


Note that this workflow is continually being updated. If you want to use the below commands, be sure to keep a local copy of the versions you ran.

Last updated: 23 April 2018

Important points to keep in mind:

  • The below options are not necessarily best for your data; it is important to explore different options, especially at the read quality filtering stage.
  • If you are confused about a particular step be sure to check out the pages under Metagenomic Resources on the right side-bar.
  • You should check that only the FASTQs of interest are being specified at each step (e.g. with the ls command). If you are re-running commands multiple times or using optional steps the input filenames and/or folders could differ.
  • The tool parallel comes up several times in this workflow. Be sure to check out our tutorial on this tool here.
  • You can always run parallel with the --dry-run option to see what commands are being run to double-check they are doing what you want!

This workflow starts with raw paired-end MiSeq or NextSeq data in demultiplexed FASTQ format, assumed to be located within a folder called raw_data.

  1. (Optional) Concatenate multiple lanes of sequencing together (e.g. for NextSeq data). If you do this step remember to change raw_data to concat_data below.
concat_lanes.pl raw_data/* -o concat_data -p 4
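
A quick sanity check of the concatenated files (a minimal sketch; it assumes uncompressed FASTQs, where each read spans four lines). The read count of each concatenated file should equal the sum of its per-lane counts:

for f in concat_data/*.fastq; do echo "$f: $(($(wc -l < "$f") / 4)) reads"; done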
  2. Run FastQC to allow manual inspection of the quality of sequences.
mkdir fastqc_out
fastqc -t 4 raw_data/* -o fastqc_out/
  3. (Optional) Stitch paired-end reads together (a summary of the stitching results is written to "pear_summary_log.txt"). Note: it is important to check the % of reads assembled. It may be better to concatenate the forward and reverse reads together if the assembly % is too low (see the optional concatenation step in the OLD MGS PAGE section below).
run_pear.pl -p 4 -o stitched_reads raw_data/*
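
As a quick way to eyeball the assembly percentages, you can search the summary log (this assumes the log contains PEAR's per-sample "Assembled reads" lines):

grep -i "assembled" pear_summary_log.txt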

If you don't stitch your reads together at this step you will need to unzip your FASTQ files before running the below command.

  4. Use kneaddata to run pre-processing tools. First, Trimmomatic is run to remove low-quality sequences. Then Bowtie2 is run to screen out contaminant sequences; below we screen out reads that map to the human or PhiX genomes. Note that kneaddata is run below on all FASTQ pairs with parallel; you can see our quick tutorial on this tool here. For a detailed breakdown of the options in the below command see this page. The forward and reverse reads will be specified by "_1" and "_2" in the output files; ignore the "R1" in each filename. Note that the \ characters at the end of each line just split the command over multiple lines to make it easier to read.
parallel -j 1 --link 'kneaddata -i {1} -i {2} -o kneaddata_out/ \
-db /home/shared/bowtiedb/GRCh38_PhiX --trimmomatic /usr/local/prg/Trimmomatic-0.36/ \
-t 4 --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:50" \
--bowtie2-options "--very-sensitive --dovetail" --remove-intermediate-output' \
 ::: raw_data/*_R1.fastq ::: raw_data/*_R2.fastq
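
As noted in the points above, adding parallel's --dry-run option prints the kneaddata commands without running them, which is a good way to confirm the FASTQ pairs are linked correctly (kneaddata options omitted here for brevity):

parallel -j 1 --link --dry-run 'kneaddata -i {1} -i {2} -o kneaddata_out/' ::: raw_data/*_R1.fastq ::: raw_data/*_R2.fastq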

You can produce a logfile summarizing the kneaddata output with this command:

kneaddata_read_count_table --input kneaddata_out --output kneaddata_read_counts.txt
  5. Use metaSPAdes to cross-assemble all FASTQ files from step 4. You will first have to generate a YAML file listing your paired FASTQ file names (a sketch is shown after the command). See more on metaSPAdes usage here.
python metaspades.py --dataset your_yaml_file.yaml -t <THREADS> -m <MAX_MEMORY_GB> -k 21,33,55,77,99,127 -o assembled_viral_contigs
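
If you have not written a SPAdes dataset YAML before, below is a minimal sketch following the format described in the SPAdes manual, wrapped in a heredoc so it can be pasted into the shell. The FASTQ paths are placeholders; substitute the kneaddata-filtered pairs for your own samples:

cat > your_yaml_file.yaml <<'EOF'
[
  {
    orientation: "fr",
    type: "paired-end",
    left reads: [
      "kneaddata_out/sampleA_R1_kneaddata_paired_1.fastq",
      "kneaddata_out/sampleB_R1_kneaddata_paired_1.fastq"
    ],
    right reads: [
      "kneaddata_out/sampleA_R1_kneaddata_paired_2.fastq",
      "kneaddata_out/sampleB_R1_kneaddata_paired_2.fastq"
    ]
  }
]
EOF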
  6. Use VirFinder to predict viral contigs. This is an R package that can be downloaded here.
## In R-Studio / R command line
library(VirFinder)

VirFinder_result <- VF.pred("scaffolds.fasta")
VirFinder_result$qvalue <- VF.qvalue(VirFinder_result$pvalue)
write.table(VirFinder_result,
            "/path/to/your/wdir/VirFinder_result.txt", quote=FALSE, row.names=FALSE, col.names=TRUE, sep="\t")
  7. Filter out contigs with a p-value greater than 0.05 or a q-value greater than 0.1, and use reformat_fasta_edit.py to extract the desired contigs.
awk '$4 <= 0.05' VirFinder_result.txt > VirFinder_result_pval.txt

awk '$5 <= 0.1' VirFinder_result_pval.txt > VirFinder_result_pval_qval.txt

reformat_fasta_edit.py -f scaffolds.fasta --ids2keep VirFinder_result_pval_qval.txt > VirFinder_scaffolds.fasta
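
Note that the header line from VirFinder_result.txt is carried through both awk filters. If reformat_fasta_edit.py expects bare contig names (one per line) rather than the full table, a one-liner like the following would strip the header and keep only the name column (an assumption; check the script's help text first):

tail -n +2 VirFinder_result_pval_qval.txt | cut -f 1 > ids2keep.txt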
  8. Use VirSorter to identify further viral contigs from your cross-assembled samples. We found that the CyVerse iteration of VirSorter worked better than the command-line (Miniconda or Docker) versions.

  9. Concatenate the VirFinder and VirSorter contigs (the latter are found in Predicted_viral_sequences within the VirSorter output folder).

cat VirFinder_scaffolds.fasta VirSorter_scaffolds.fasta > VirFinder_VirSorter_scaffolds.fasta
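Since both tools can call the same contig, it is worth checking the combined file for duplicated headers (anything printed by the command below appears more than once). Keep in mind that VirSorter typically renames contigs, so an empty result does not guarantee the two sets are disjoint:

grep ">" VirFinder_VirSorter_scaffolds.fasta | sort | uniq -d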
  10. Use MetaBAT 2 to bin viral contigs.

First, create BAM files by aligning the reads of each sample separately to the assembled metagenome (in this case, VirFinder_VirSorter_scaffolds.fasta).

bowtie2-build -f --threads 40 VirFinder_VirSorter_scaffolds.fasta VF_VS_contigs_ref

parallel -j 40 --link 'bowtie2 -q --sensitive-local -x VF_VS_contigs_ref -1 {1} -2 {2} -S {1.}.sam' ::: path/to/QCed/fastq/*_1.fastq ::: path/to/QCed/fastq/*_2.fastq

Convert SAM to BAM, then sort the BAMs (tutorial here).

parallel -j 40 'samtools view -S -b {} > {.}.bam' ::: *sam

parallel -j 40 'samtools sort {} -o {.}.sorted.bam' ::: *bam

Generate a depth file from the sorted BAM files for input into MetaBAT:

./jgi_summarize_bam_contig_depths --outputDepth depth.txt *.sorted.bam

Run metabat for binning. See here for metabat usage and options. Parameters are as described in Anderson et al., 2017, Microbiome.

metabat --inFile VirFinder_VirSorter_scaffolds.fasta --outFile bins/VFVS_contigs --abdFile depth.txt --minContig=1500 --minClsSize=10000 -t 40 --unbinned

CheckM can be used to assess bin quality.
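For example, a minimal CheckM run over the bins (assuming the bins end in .fa and CheckM is installed) would be:

checkm lineage_wf -x fa -t 40 bins/ checkm_out/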

  11. Determine the abundances of viral populations (bins) by mapping sample reads to the bins using Bowtie2. Adapted from a tutorial written by Jin Choi for EDAMAME2016.

Build a Bowtie2 database from the bins:

bowtie2-build -f --threads 40 <Comma separated list of bins from metabat> <reference-db-name>
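
One way to generate that comma-separated list automatically, assuming the metabat bins end in .fa (check the extension in your bins/ folder first):

bowtie2-build -f --threads 40 $(ls bins/*.fa | paste -sd, -) bins_ref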

Run Bowtie2, using the paired FASTQ files as input against the database of bins created in the last step.

parallel -j <THREADS> --link 'bowtie2 -q --sensitive-local -x <reference-db-name> -1 {1} -2 {2} -S {1.}.sam' ::: path/to/QCed/fastq/*_1.fastq ::: path/to/QCed/fastq/*_2.fastq

Index the reference contigs with samtools (in this case, the concatenated VirFinder and VirSorter contigs file, VirFinder_VirSorter_scaffolds.fasta).

samtools faidx VirFinder_VirSorter_scaffolds.fasta

Convert SAM files to BAM, sort, then index the BAM files for quicker downstream processing. See here for more on SAM/BAM.

parallel -j 40 'samtools import VirFinder_VirSorter_scaffolds.fasta.fai {} {.}.bam' ::: *sam

parallel -j 40 'samtools sort {} -o {.}.sorted.bam' ::: *bam

parallel -j 40 'samtools index {}' ::: *sorted.bam

Make a summary of mapped read counts:

parallel -j 40 'samtools idxstats {} > {.}.idxstats.txt' ::: *sorted.bam

Use scripts from Jin Choi to process these files into one table. Note that Python 2.7 must be used to run the below script. You can create a miniconda environment for Python 2.7 using the instructions here.

git clone https://github.com/metajinomics/mapping_tools.git
python mapping_tools/get_count_table.py *.idxstats.txt > counts.txt
less counts.txt

Create a mapping file that links contig names to bins.

First, add a header line to the file:

echo "bin contig" > bins2contigs.txt

Parse out all the header lines (lines starting with ">") from each bin FASTA, assuming the files end in ".fa":

grep -H ">" *fa >> bins2contigs.txt

Replace all instances of ".fa:>" in this file with spaces:

sed -i 's/.fa:>/ /g' bins2contigs.txt

In R, collapse contig counts to bin abundances. counts.txt should be a file with contig names in the first column and sample names in the following columns, with read-mapping counts as the values. bins2contigs.txt should be a two-column file with bins in the first column and contigs in the second. A hypothetical layout of both files is sketched below.
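
For illustration, a hypothetical layout of the two files (the contig, bin, and sample names are invented here):

# counts.txt (tab-delimited)
contig                        sampleA  sampleB
NODE_1_length_45210_cov_12.3  1042     3
NODE_2_length_30018_cov_8.7   0        587

# bins2contigs.txt (space-delimited)
bin contig
VFVS_contigs.1 NODE_1_length_45210_cov_12.3
VFVS_contigs.2 NODE_2_length_30018_cov_8.7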

## In R-Studio / R command line

# Read in the two files. This assumes counts.txt is tab-delimited with a header
# line whose first column is named "contig" (check with names(counts)), and that
# bins2contigs.txt is the whitespace-delimited file made above.
counts <- read.table("counts.txt", header=TRUE, sep="\t", stringsAsFactors=FALSE)
bins2contigs <- read.table("bins2contigs.txt", header=TRUE, stringsAsFactors=FALSE)

# Set rownames to contig names in both dataframes
rownames(counts) <- counts$contig
rownames(bins2contigs) <- bins2contigs$contig

# Add a bin column to the count table by matching on contig name.
# Contigs without a bin get NA and are dropped by aggregate below.
counts$bin <- bins2contigs[rownames(counts), "bin"]

# Remove the contig column and sum the counts within each bin
counts <- counts[, -1]
bin_abundances <- aggregate(. ~ bin, counts, sum)

# Export bin_abundances as a text file. The resulting file can be analyzed like an OTU table.
write.table(bin_abundances, "bin_abundances.txt",
            quote=FALSE, row.names=FALSE, col.names=TRUE, sep="\t")

OLD MGS PAGE

  1. (Optional) Rarefy the filtered MGS FASTQs - this can be helpful if you are concerned about differences in read depth among your samples influencing your results. See this page for details.

  2. (Optional) If you did not stitch your paired-end reads together with PEAR, you can concatenate the forward and reverse FASTQs together now. Note this is done after quality filtering so that both reads in a pair are either discarded or retained. It's important to specify only the "paired" FASTQs output by kneaddata in the below command. You will need to move the sequences called as contaminants to a separate folder first (the first two commands below).

mkdir kneaddata_out/contam_seq

mv kneaddata_out/*_contam_*fastq kneaddata_out/contam_seq

concat_paired_end.pl -p 4 --no_R_match -o cat_reads kneaddata_out/*_paired_*.fastq 

You should check over the commands that are printed to screen to make sure the correct FASTQs are being concatenated together.

  3. Run HUMAnN2 with parallel to calculate the abundance of UniRef90 gene families and MetaCyc pathways.
mkdir humann2_out

parallel -j 4 'humann2 --threads 1 --input {} --output humann2_out/{/.}' ::: cat_reads/*fastq
  4. Join the HUMAnN2 output per sample into one table.
mkdir humann2_final_out

humann2_join_tables -s --input humann2_out/ --file_name pathabundance --output humann2_final_out/humann2_pathabundance.tsv
humann2_join_tables -s --input humann2_out/ --file_name pathcoverage --output humann2_final_out/humann2_pathcoverage.tsv
humann2_join_tables -s --input humann2_out/ --file_name genefamilies --output humann2_final_out/humann2_genefamilies.tsv
  5. Re-normalize the gene family and pathway abundances so that each sample is expressed in units of relative abundance.
humann2_renorm_table --input humann2_final_out/humann2_pathabundance.tsv --units relab --output humann2_final_out/humann2_pathabundance_relab.tsv
humann2_renorm_table --input humann2_final_out/humann2_genefamilies.tsv --units relab --output humann2_final_out/humann2_genefamilies_relab.tsv
  6. Split the HUMAnN2 abundance tables into stratified and unstratified tables (stratified tables include the taxa associated with each functional profile).
humann2_split_stratified_table --input humann2_final_out/humann2_pathabundance_relab.tsv --output humann2_final_out
humann2_split_stratified_table --input humann2_final_out/humann2_genefamilies_relab.tsv --output humann2_final_out
humann2_split_stratified_table --input humann2_final_out/humann2_pathcoverage.tsv --output humann2_final_out
  7. Convert the unstratified HUMAnN2 abundance tables to STAMP format by changing the header line. These commands remove the comment character and the spaces in the name of the first column. Trailing descriptions of the abundance datatype are also removed from each sample's column name.
sed 's/_Abundance-RPKs//g' humann2_final_out/humann2_genefamilies_relab_unstratified.tsv | sed 's/# Gene Family/GeneFamily/' > humann2_final_out/humann2_genefamilies_relab_unstratified.spf

sed 's/_Abundance//g' humann2_final_out/humann2_pathabundance_relab_unstratified.tsv | sed 's/# Pathway/Pathway/' > humann2_final_out/humann2_pathabundance_relab_unstratified.spf
  8. Since HUMAnN2 also runs MetaPhlAn2 as an initial step, we can use the output tables already created to get the taxonomic composition of our samples. First, gather the per-sample MetaPhlAn2 results into a single directory and merge them into a single table using MetaPhlAn2's merge_metaphlan_tables.py command. After this file is created, fix the header so that each column corresponds to a sample name without the trailing "_metaphlan_bugs_list" description.
mkdir metaphlan2_out
cp humann2_out/*/*/*metaphlan_bugs_list.tsv metaphlan2_out/
/usr/local/metaphlan2/utils/merge_metaphlan_tables.py metaphlan2_out/*metaphlan_bugs_list.tsv > metaphlan2_merged.txt
sed -i 's/_metaphlan_bugs_list//g' metaphlan2_merged.txt
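
A quick check that the header now contains clean sample names (prints one column name per line):

head -n 1 metaphlan2_merged.txt | tr '\t' '\n'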
  9. Lastly, we can convert this MetaPhlAn2 abundance table to STAMP format.
metaphlan_to_stamp.pl metaphlan2_merged.txt > metaphlan2_merged.spf