Since VEBA functionality benefits from structure, it's good to have a list of identifiers that you can use for for-loops. In the examples, it will be the following: `identifiers.list`. However, for datasets with both metagenomics and metatranscriptomics, it's often useful to have a master list `identifiers.list` along with separate `identifiers.dna.list` and `identifiers.rna.list` for metagenomic and metatranscriptomic samples, respectively.
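For example, a minimal loop over the identifier list might look like the following (assuming one sample identifier per line):

```bash
# A minimal sketch: iterate over sample identifiers, one per line in identifiers.list.
# The same pattern works with identifiers.dna.list or identifiers.rna.list.
for ID in $(cat identifiers.list); do
    echo "Processing sample: ${ID}"
done
```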
Our VEBA project directory is going to be `veba_output` and each step will be in a subdirectory.

- e.g., `veba_output/preprocess` will have all of the preprocessed reads.

In the workflows that work on specific samples, there will be sample subdirectories.

- e.g., `veba_output/preprocess/SRR17458603/output`, `veba_output/preprocess/SRR17458606/output`, ...
- e.g., `veba_output/assembly/SRR17458603/output`, `veba_output/assembly/SRR17458606/output`, ...
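Putting the example paths above together, the project directory would be laid out roughly like this:

```
veba_output/
├── preprocess/
│   ├── SRR17458603/output/
│   └── SRR17458606/output/
└── assembly/
    ├── SRR17458603/output/
    └── SRR17458606/output/
```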
Many of these jobs should be run using a job scheduler like SLURM or SunGridEngine. This resource is useful for converting commands between SunGridEngine and SLURM. I've used both, and these are adaptations of the submission commands you can use as a template:
```bash
# Let's create an informative name. Remember, we are going to create a lot of jobs
# and log files for the different workflows if you have multiple samples.
N=preprocessing__${ID}
CMD="some command we want to run"

# SunGridEngine:
qsub -o logs/${N}.o -e logs/${N}.e -cwd -N ${N} -j y -pe threaded ${N_JOBS} "${CMD}"

# SLURM:
sbatch -J ${N} -N 1 -c ${N_JOBS} --ntasks-per-node=1 -o logs/${N}.o -e logs/${N}.e --export=ALL -t 12:00:00 --mem=20G --wrap="${CMD}"
```
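Putting the identifier list and the SLURM template together, a sketch for submitting one preprocessing job per sample might look like this (the command in `CMD` is a placeholder; substitute the actual workflow command):

```bash
# A sketch: submit one SLURM job per sample using the naming convention above.
mkdir -p logs
N_JOBS=4
for ID in $(cat identifiers.list); do
    N=preprocessing__${ID}
    CMD="some command we want to run for ${ID}"  # placeholder; substitute the actual workflow command
    sbatch -J ${N} -N 1 -c ${N_JOBS} --ntasks-per-node=1 -o logs/${N}.o -e logs/${N}.e --export=ALL -t 12:00:00 --mem=20G --wrap="${CMD}"
done
```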
- Interpreting module outputs - This guide serves as a reference to understand the most important outputs from each module along with how to analyze certain files downstream.
- Downloading and preprocessing fastq files - Explains how to download reads from NCBI and run VEBA's `preprocess.py` module to decontaminate metagenomic and/or metatranscriptomic reads.
- Complete end-to-end metagenomics analysis - Goes through assembling metagenomic reads, binning, clustering, classification, and annotation. We also show how to use the unbinned contigs in a pseudo-coassembly with guidelines on when it's a good idea to go this route.
- Recovering viruses from metatranscriptomics - Goes through assembling metatranscriptomic reads, viral binning, clustering, and classification.
- Setting up bona fide coassemblies for metagenomics or metatranscriptomics - In the case where all samples are of low depth, it may be useful to use coassembly instead of sample-specific approaches. This walkthrough goes through concatenating reads (see the sketch after this list), creating a reads table, coassembly of concatenated reads, aligning sample-specific reads to the coassembly for multiple sorted BAM files, and mapping reads for scaffold/transcript-level counts. Please note that a coassembly differs from the pseudo-coassembly concept introduced in the VEBA publication. For more information regarding the differences between bona fide coassembly and pseudo-coassembly, please refer to FAQ: What's the difference between a coassembly and a pseudo-coassembly?.
- Phylogenetic inference - Phylogenetic inference of eukaryotic diatoms.
- Bioprospecting for biosynthetic gene clusters - Detecting biosynthetic gene clusters (BGCs) and scoring the novelty of BGCs.
- CRISPR-Cas system screening with de novo genomes - How to use `CRISPRCasTyper` as a post hoc analysis for screening genomes.
- Taxonomic profiling de novo genomes - Explains how to build custom `Sylph` databases from de novo genomes and profile reads against them.
- Pathway profiling de novo genomes - Explains how to build custom `HUMAnN` databases from de novo genomes and annotations and align reads against them.
- Read mapping and counts tables - Traditional read mapping and generating counts tables at the contig, MAG, SLC, ORF, and SSO levels.
- Merging counts tables with taxonomy - Explains how to merge counts tables with taxonomy.
- Phylogenomic functional categories using de novo genomes - PhyloGenomic Functional Categories (PGFC) using annotations, clusters, and counts tables as implemented in Espinoza et al. 2022.
- Converting counts tables - Convert your counts table (with or without metadata) to anndata or biom format. Also supports Pandas pickle format.
- Adapting commands for Docker - Explains how to download and use Docker for running VEBA.
- Adapting commands for Singularity - Explains how to download and use Singularity for running VEBA.
- Adapting commands for AWS - Explains how to download and use Docker for running VEBA specifically on AWS.
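As referenced in the coassembly walkthrough above, a sketch of the read-concatenation step might look like the following. The `cleaned_1.fastq.gz`/`cleaned_2.fastq.gz` filenames and the `veba_output/misc` destination are assumptions here; substitute whatever your preprocessing step actually produced:

```bash
# A sketch of concatenating preprocessed paired-end reads for a bona fide coassembly.
# NOTE: cleaned_1.fastq.gz / cleaned_2.fastq.gz are assumed filenames; adjust as needed.
mkdir -p veba_output/misc
for ID in $(cat identifiers.list); do
    cat veba_output/preprocess/${ID}/output/cleaned_1.fastq.gz >> veba_output/misc/coassembly_R1.fastq.gz
    cat veba_output/preprocess/${ID}/output/cleaned_2.fastq.gz >> veba_output/misc/coassembly_R2.fastq.gz
done
```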
Coming Soon:

- Visualizing genome-clusters with `NetworkX`
- Workflow for low-depth samples with no bins
- Assigning eukaryotic taxonomy to unbinned contigs (`metaeuk taxtocontig`)
- Bioprospecting using the `PlasticDB` database
- Targeted pathway profiling of large and complex reference databases
- The final output files are in the `output` subdirectory and, to avoid redundant files, many of these are symlinked from the `intermediate` directory. This can cause issues if you are using a "scratch" directory where files are deleted after a certain amount of time. If you have a crontab set up, make sure it also touches symlinks and not just regular files (see the sketch below).
- You'll need to adjust the memory and time for different jobs. Assembly will take much longer than preprocessing; annotation will require more memory than mapping.
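For the symlink caveat above, a crontab-friendly sketch using GNU `touch -h` (which updates the link itself rather than its target) might look like this:

```bash
# A sketch: refresh modification times on both symlinks and regular files
# so a scratch-purging policy doesn't delete them.
# touch -h (GNU coreutils) updates the symlink itself instead of its target.
find veba_output/ -type l -exec touch -h {} \;
find veba_output/ -type f -exec touch {} \;
```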
directory. This can cause issues if you are using a "scratch" directory where the files are deleted after a certain amount of time. If you have a crontab set up, make sure it also touches symlinks and not just files. - You'll need to adjust the memory and time for different jobs. Assembly will take much longer than preprocessing. Annotation will require more memory than mapping.