Viromics Pipeline
Note that this workflow is continually being updated and only represents one pipeline for viral annotation. If you want to use the below commands be sure to keep track of them locally.
Last updated: 8 Feb 2019
Important points to keep in mind:
- The below options are not necessarily best for your data; it is important to explore different options, especially at the read quality filtering stage.
- If you are confused about a particular step, be sure to check out the pages under Metagenomic Resources on the right side-bar.
- You should check that only the FASTQs of interest are being specified at each step (e.g. with the ls command). If you are re-running commands multiple times or using optional steps, the input filenames and/or folders could differ.
- The tool parallel comes up several times in this workflow. Be sure to check out our tutorial on this tool here.
- You can always run parallel with the --dry-run option to see what commands are being run, to double-check they are doing what you want!
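For example, adding --dry-run to the kneaddata command from step 4 below (other options trimmed for brevity) prints each expanded command instead of executing it:
parallel --dry-run -j 1 --link 'kneaddata -i {1} -i {2} -o kneaddata_out/' \
    ::: raw_data/*_R1.fastq ::: raw_data/*_R2.fastq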
The main dependencies of this pipeline are:
- SPAdes 3.13
- VirFinder R package
- VirSorter (note: we used the CyVerse iteration of VirSorter)
- MetaBAT2
- CheckM
Other software and scripts needed for the initial metagenomics steps (e.g. concat_lanes.pl, FastQC, PEAR, kneaddata) can be found in the Microbiome Helper virtual box.
This workflow starts with raw paired-end MiSeq or NextSeq data in demultiplexed FASTQ format, assumed to be located within a folder called raw_data.
This pipeline was adapted from Anderson, C. L., Sullivan, M. B. & Fernando, S. C. Dietary energy drives the dynamic response of bovine rumen viral communities. Microbiome 5, 155 (2017).
- (Optional) Concatenate multiple lanes of sequencing together (e.g. for NextSeq data). If you do this step, remember to change raw_data to concat_data below.
concat_lanes.pl raw_data/* -o concat_data -p 4
- Run fastqc to allow manual inspection of the quality of sequences.
mkdir fastqc_out
fastqc -t 4 raw_data/* -o fastqc_out/
- Stitch paired-end reads together (a summary of the stitching results is written to "pear_summary_log.txt"). Note: it is important to check the % of reads assembled. It may be better to concatenate the forward and reverse reads together if the assembly % is too low (see step 6 of our MGS SOP for more information).
run_pear.pl -p 4 -o stitched_reads concat_data/*
If you don't stitch your reads together at this step, you will need to unzip your FASTQ files before running the below command.
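For instance, assuming the raw FASTQs are gzipped with a .fastq.gz extension, they can be decompressed in parallel with:
parallel -j 4 'gunzip {}' ::: raw_data/*.fastq.gz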
- Use kneaddata to run pre-processing tools. First, Trimmomatic is run to remove low-quality sequences, then Bowtie2 is run to screen out contaminant sequences. Below we are screening out reads that map to the human or PhiX genomes. Note that kneaddata is being run below on all FASTQ pairs with parallel; you can see our quick tutorial on this tool here. For a detailed breakdown of the options in the below command see this page. The forward and reverse reads will be specified by "_1" and "_2" in the output files; ignore the "R1" in each filename. Note that the \ characters at the end of each line are just to split the command over multiple lines to make it easier to read.
parallel -j 1 --link 'kneaddata -i {1} -i {2} -o kneaddata_out/ \
-db /home/shared/bowtiedb/GRCh38_PhiX --trimmomatic /usr/local/prg/Trimmomatic-0.36/ \
-t 4 --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:50" \
--bowtie2-options "--very-sensitive --dovetail" --remove-intermediate-output' \
::: raw_data/*_R1.fastq ::: raw_data/*_R2.fastq
You can produce a log file summarizing the kneaddata output with this command:
kneaddata_read_count_table --input kneaddata_out --output kneaddata_read_counts.txt
- Use metaSPAdes to cross-assemble all FASTQ files from step 4. You will first have to generate a YAML file listing your paired FASTQ file names (a sketch of this file is given after the command below). See more on metaSPAdes usage here.
python metaspades.py --dataset your_yaml_file.yaml -t <THREADS> -m <MAX_MEMORY_GB> \
-k 21,33,55,77,99,127 -o assembled_viral_contigs
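For reference, below is a minimal sketch of what such a dataset YAML file could look like, based on the paired-end dataset format described in the SPAdes manual. The kneaddata output filenames are hypothetical placeholders; list the forward reads of all samples under "left reads" and the matching reverse reads, in the same order, under "right reads".
[
  {
    orientation: "fr",
    type: "paired-end",
    left reads: [
      "kneaddata_out/sampleA_R1_kneaddata_paired_1.fastq",
      "kneaddata_out/sampleB_R1_kneaddata_paired_1.fastq"
    ],
    right reads: [
      "kneaddata_out/sampleA_R1_kneaddata_paired_2.fastq",
      "kneaddata_out/sampleB_R1_kneaddata_paired_2.fastq"
    ]
  }
]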
- Use VirFinder to predict viral contigs. This is an R package that can be downloaded here.
## In R
library(VirFinder)
library(dplyr)   # needed for filter() below

VirFinder_result <- VF.pred("scaffolds.fasta")
Add FDR-corrected p-values (q-value):
VirFinder_result$qvalue <- VF.qvalue(VirFinder_result$pvalue)
Keep only contigs with a q-value < 0.1:
VirFinder_result_qval <- VirFinder_result %>% filter(qvalue < 0.1)
Write table to server or desktop from R:
write.table(VirFinder_result_qval,
"/path/to/your/wdir/VirFinder_result.txt", quote=FALSE,
row.names=FALSE, col.names=TRUE, sep="\t")
- Now, on the UNIX server, capture only the contig IDs and use reformat_fasta_edit.py to extract the desired contigs from the larger scaffolds file.
cut -f 1 VirFinder_result.txt > VF_ids_2_keep.txt
reformat_fasta_edit.py -f scaffolds.fasta --ids2keep VF_ids_2_keep.txt > VirFinder_scaffolds.fasta
- Use VirSorter to identify further viral contigs from your cross-assembled samples. We found that the CyVerse iteration of VirSorter worked best compared to the command-line (Miniconda or Docker) versions.
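If you do want to try the command-line (conda) version anyway, the invocation looks roughly like the one below; the database choice, thread count and data directory are placeholders, so check the --help output of your installed version before running it.
wrapper_phage_contigs_sorter_iPlant.pl -f scaffolds.fasta --db 1 \
--wdir virsorter_out --ncpu 8 --data-dir /path/to/virsorter-data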
- Concatenate the VirFinder and VirSorter contigs (the VirSorter contigs are found in Predicted_viral_sequences within the VirSorter output folder).
cat VirFinder_scaffolds.fasta VirSorter_scaffolds.fasta > VF_VS_contigs.fna
9b. Filter the resulting concatenated file for redundancy with cd-hit
cd-hit -i VF_VS_contigs.fna -o VF_VS_contigs_cd_hit.fna -c 1.0 -T <threads> -M <memory_in_MB>
- Use MetaBAT2 to bin viral contigs.
First, create BAM files by aligning the reads of each sample separately to the deduplicated set of contigs from VirSorter and VirFinder (in this case, VF_VS_contigs_cd_hit.fna).
bowtie2-build -f --threads 40 VF_VS_contigs_cd_hit.fna VF_VS_contigs_ref
parallel -j 40 --link 'bowtie2 -q --sensitive-local \
-x VF_VS_contigs_ref -1 {1} -2 {2} -S {1.}.sam' \
::: /path/to/QCed/paired-end/reads/*_1.fastq \
::: /path/to/QCed/paired-end/reads/*_2.fastq
Convert SAM to BAM then sort BAMs (Tutorial here).
parallel -j 40 'samtools view -S -b {} > {.}.bam' ::: *sam
parallel -j 40 'samtools sort {} -o {.}.sorted.bam' ::: *bam
Generate a depth file from the BAM files for input into MetaBAT (the script is found in the MetaBAT download folder):
./jgi_summarize_bam_contig_depths --outputDepth depth.txt *.sorted.bam
Run MetaBAT for binning. See here for MetaBAT usage and options. Parameters are as described in Anderson et al., 2017, Microbiome.
metabat --inFile VF_VS_contigs_cd_hit.fna --outFile bins/VFVS_contigs \
--abdFile depth.txt --minContig=1500 --minClsSize=10000 \
-t <threads> --unbinned
- Run CheckM to assess bin quality and bacterial contamination. We followed the taxonomic-specific workflow.
checkm taxon_set domain Bacteria bacteria.ms
checkm analyze bacteria.ms <bin folder> <output folder> -t <THREADS>
checkm qa bacteria.ms <output folder from last step> \
-o 1 --file checkm_qa.out --threads 10
Manually analyze checkm_qa.out for bins containing bacterial marker genes, and remove them from downstream analyses.
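Rather than scanning the table entirely by eye, one way to shortlist candidate bacterial bins is to re-run checkm qa with --tab_table and pull out bins with non-zero marker-gene completeness. The column position used below is an assumption based on the default -o 1 layout (Completeness, Contamination and Strain heterogeneity as the last three columns), so check the header of your own output first.
# Tab-separated version of the QA table (same inputs as above)
checkm qa bacteria.ms <output folder from last step> -o 1 --tab_table \
--file checkm_qa.tsv --threads 10
# Print the Bin Id of every bin whose Completeness (assumed 3rd-from-last column) is > 0,
# i.e. bins in which bacterial marker genes were detected
awk -F'\t' 'NR > 1 && $(NF-2) > 0 {print $1}' checkm_qa.tsv > candidate_bacterial_bins.txt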
- Determine abundances of viral populations (bins) by mapping sample reads to bins, using Bowtie2. Adapted from tutorial written by Jin Choi for EDAMAME2016.
Build a Bowtie2 database from the bins. Note that bowtie2-build expects the reference FASTAs as a single comma-separated list, so the bin filenames are joined with commas below.
Usage:
bowtie2-build -f --threads 40 <comma-separated bin FASTAs> <reference_db_name>
bowtie2-build -f --threads 40 $(echo *fa | tr ' ' ',') VFVS_ref
Run Bowtie2, using the QC'd paired FASTQ files as input against the bin database created in the last step.
parallel -j <THREADS> --link 'bowtie2 -q --sensitive-local \
-x VFVS_ref -1 {1} -2 {2} -S {1.}.sam' \
::: path/to/QCed/fastq/*_1.fastq ::: path/to/QCed/fastq/*_2.fastq
Concatenate all bins, then index with samtools:
cat *fa > VF_VS_all_bins.fna
samtools faidx VF_VS_all_bins.fna
Convert SAM files to BAM, sort, then index the BAM files for quicker downstream processing. See here for more on SAM/BAM.
parallel -j 40 'samtools import VF_VS_all_bins.fna.fai {} {.}.bam' ::: *sam
parallel -j 40 'samtools sort {} -o {.}.sorted.bam' ::: *bam
parallel -j 40 'samtools index {}' ::: *sorted.bam
Make summary of mapped read counts
parallel -j 40 'samtools idxstats {} > {.}.idxstats.txt' ::: *sorted.bam
Use scripts from Jin Choi to process these files into one table. Note that Python 2.7 must be used to run the below script. You can create a Miniconda environment for Python 2.7 using the instructions here.
git clone https://github.com/metajinomics/mapping_tools.git
python mapping_tools/get_count_table.py *.idxstats.txt > counts.txt
less counts.txt
Create a mapping file that links contig names to bins. First, add a header line to the file:
echo "bin contig" > bins2contigs.txt
Parse out all the header lines (lines starting with ">") from each bin FASTA, assuming they end in ".fa":
grep -H '>' *fa >> bins2contigs.txt
Replace all instances of ".fa:>" in this file with spaces, or split into two columns in Excel.
sed -i 's/.fa:>/ /g' bins2contigs.txt
In R, collapse contigs to bin names. counts.txt should be a file with contig names in the first column and sample names in the following columns, with read-mapping abundances as the values. bins2contigs.txt should be a two-column file with bins in the first column and contigs in the second.
# Read the two files into R (adjust the header/sep arguments to match your files)
counts <- read.table("counts.txt", header=TRUE, stringsAsFactors=FALSE)
bins2contigs <- read.table("bins2contigs.txt", header=TRUE, stringsAsFactors=FALSE)
# Make sure both are plain data frames
counts <- as.data.frame(counts)
bins2contigs <- as.data.frame(bins2contigs)
#Set rownames as contigs in both dataframes
rownames(counts) <- counts$contig
rownames(bins2contigs) <- bins2contigs$contig
counts$bin <- bins2contigs[rownames(counts), "bin"]
# Remove first column (the contig names, which are now the rownames)
counts <- counts[, -1]
# Make new bin abundances table; adds up counts of contigs --> bins
bin_abundances <- aggregate(. ~ bin, counts, sum)
- Normalize bin abundances based on length of bin and total number of reads in a given sample.
First, get length (in BP) of bins
parallel -j 40 'grep "^[ACGTN]" {} | tr -d "\n" | wc -m > {.}_length.txt' ::: *fa
ls *_length.txt > bin_names.txt
cat *length.txt > all_bins_length.txt
In Excel or R, combine bin_names and all_bins_length to generate a 2-column table of bins and their respective length.
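As one example, a short R sketch of this combining step is given below; it assumes bin_names.txt and all_bins_length.txt from the previous step line up row for row (they should, since both come from the same alphabetically sorted set of files) and uses the column names bin and Length that the code further down expects.
## In R
bin_lengths <- data.frame(
  bin    = sub("_length\\.txt$", "", readLines("bin_names.txt")),  # strip suffix to recover bin names
  Length = as.numeric(readLines("all_bins_length.txt")),           # one length (in bp) per bin
  stringsAsFactors = FALSE
)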
In R, divide rows of bin abundance table (bin_abundances) by bin length (using bin_lengths as a mapping file) and columns by column sums.
bin_abundances_rownamed <- bin_abundances
#Ensure rownames are the same in both dataframes
rownames(bin_abundances_rownamed) <- bin_abundances$bin
rownames(bin_lengths) <- bin_lengths$bin
bin_abundances_rownames <- bin_abundances_rownamed[as.character(bin_lengths$bin),]
bin_abundances_rownames <- bin_abundances_rownames[,-1]
# Divide row-wise by bin length, and column wise by colsums to get relative abundance.
bin_abundances_norm <- (bin_abundances_rownames/bin_lengths$Length)
bin_abundances_norm_colsum <- data.frame(sweep(bin_abundances_norm,
2, colSums(bin_abundances_norm), '/'))
# Multiply all cells by an integer so that the table can be used
# for diversity analyses in QIIME.
bin_abundances_norm_colsum_100 <- (bin_abundances_norm_colsum*100)
# Write Table to working directory
write.table(bin_abundances_norm_colsum_100,
"/working/directory/bin_abundances_100.txt",
quote=FALSE, row.names=TRUE, col.names=TRUE, sep="\t")
The resulting tab-separated table can then be imported into QIIME2 for diversity analyses, analyzed in R using packages like vegan, or input into STAMP for relative abundance comparisons across metadata groupings.
We used QIIME2 for alpha- and beta-diversity analyses. Bin abundance table and metadata file are found in the "tables" directory and the resulting QIIME2 artifact (.qza file) is found in the "artifacts" directory in the following example.
biom convert -i bin_abundances_100.txt -o bin_abundances_100.biom --to-hdf5
source activate qiime2-2019.1
qiime tools import \
--input-path tables/bin_abundances_100.biom \
--output-path artifacts/bins.qza \
--type 'FeatureTable[Frequency]'
# Run core-metrics on the bin abundance table. Outputs several diversity
# metrics, including Shannon diversity and Bray-Curtis dissimilarity plots.
qiime diversity core-metrics \
--i-table artifacts/bins.qza \
--p-sampling-depth 10 \
--m-metadata-file tables/metadata.txt \
--p-n-jobs 30 --output-dir diversity_metrics_bins_100
# Get Shannon Diversity metrics from bin abundances and compare by binary metadata:
qiime diversity alpha-group-significance \
--i-alpha-diversity diversity_metrics_bins_100/shannon_vector.qza \
--m-metadata-file tables/metadata.txt \
--o-visualization shan_div_out
- Please feel free to post a question on the Microbiome Helper Google group if you have any issues.
- General comments or inquiries about Microbiome Helper can be sent to [email protected].