Getting started with VEBA

Basics:

Since VEBA benefits from a structured project layout, it's good to keep a list of sample identifiers that you can iterate over with for-loops. In the examples, this file will be identifiers.list.

However, for datasets with both metagenomics and metatranscriptomics, it's often useful to have a master list identifiers.list plus separate identifiers.dna.list and identifiers.rna.list for the metagenomic and metatranscriptomic samples, respectively.
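As a minimal sketch, identifiers.list is just a plain-text file with one sample identifier per line (the SRR accessions below are the example samples used throughout this guide):

```bash
# identifiers.list contains one sample ID per line, e.g.:
#   SRR17458603
#   SRR17458606

# Iterate over all samples:
for ID in $(cat identifiers.list); do
    echo "Processing sample: ${ID}"
done
```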

Our VEBA project directory is going to be veba_output and each step will be in a subdirectory.

  • e.g., veba_output/preprocess will have all of the preprocessed reads.

In the workflows that operate on specific samples, there will be per-sample subdirectories (see the sketch after this list).

  • e.g., veba_output/preprocess/SRR17458603/output, veba_output/preprocess/SRR17458606/output, ...
  • e.g., veba_output/assembly/SRR17458603/output, veba_output/assembly/SRR17458606/output, ...
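As a quick sanity check on this layout, here is a sketch that assumes the identifiers.list file from above and the veba_output/preprocess/<ID>/output convention, and reports which samples have finished preprocessing:

```bash
# Verify that each sample has a preprocessing output directory
# before moving on to assembly.
for ID in $(cat identifiers.list); do
    if [ -d "veba_output/preprocess/${ID}/output" ]; then
        echo "[ok]      ${ID}"
    else
        echo "[missing] ${ID}" >&2
    fi
done
```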

Many of these jobs should be run with a job scheduler like SLURM or SunGridEngine. This resource is useful for converting commands between SunGridEngine and SLURM. I've used both, and the submission commands below are templates you can adapt:

```bash
# Create an informative job name.  With multiple samples, you will be
# creating many jobs and log files across the different workflows, so a
# consistent naming scheme helps.
N=preprocessing__${ID}

CMD="some command we want to run"

# SunGridEngine:
qsub -o logs/${N}.o -e logs/${N}.e -cwd -N ${N} -j y -pe threaded ${N_JOBS} "${CMD}"

# SLURM:
sbatch -J ${N} -N 1 -c ${N_JOBS} --ntasks-per-node=1 -o logs/${N}.o -e logs/${N}.e --export=ALL -t 12:00:00 --mem=20G --wrap="${CMD}"
```
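Putting the pieces together, here is a sketch of a per-sample submission loop for SLURM. The placeholder CMD, thread count, walltime, and memory are illustrative; substitute the actual VEBA module command and resources appropriate for your cluster:

```bash
mkdir -p logs
N_JOBS=4  # threads per job; adjust for your cluster

while read -r ID; do
    N=preprocessing__${ID}
    # Placeholder command; replace with the real workflow invocation.
    CMD="echo 'run preprocessing for ${ID} here'"
    sbatch -J ${N} -N 1 -c ${N_JOBS} --ntasks-per-node=1 \
        -o logs/${N}.o -e logs/${N}.e --export=ALL \
        -t 12:00:00 --mem=20G --wrap="${CMD}"
done < identifiers.list
```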

Quick guides:

  • Interpreting module outputs - This guide serves as a reference for understanding the most important outputs from each module, along with how to analyze certain files downstream.

Available walkthroughs:

Accessing SRA:
End-to-end workflows:
  • Complete end-to-end metagenomics analysis - Goes through assembling metagenomic reads, binning, clustering, classification, and annotation. We also show how to use the unbinned contigs in a pseudo-coassembly with guidelines on when it's a good idea to go this route.
  • Recovering viruses from metatranscriptomics - Goes through assembling metatranscriptomic reads, viral binning, clustering, and classification.
  • Setting up bona fide coassemblies for metagenomics or metatranscriptomics - In the case where all samples are of low depth, it may be useful to use coassembly instead of sample-specific approaches. This walkthrough goes through concatenating reads, creating a reads table, coassembly of the concatenated reads, aligning sample-specific reads to the coassembly to produce multiple sorted BAM files, and mapping reads for scaffold/transcript-level counts. Please note that a coassembly differs from the pseudo-coassembly concept introduced in the VEBA publication. For more information regarding the differences between a bona fide coassembly and a pseudo-coassembly, please refer to FAQ: What's the difference between a coassembly and a pseudo-coassembly?.
Phylogenetics:
Bioprospecting:
Read mapping, rapid profiling, feature engineering, and converting counts tables:
Containerization and AWS:

Coming Soon:

  • Visualizing genome-clusters with NetworkX
  • Workflow for low-depth samples with no bins
  • Assigning eukaryotic taxonomy to unbinned contigs (metaeuk taxtocontig)
  • Bioprospecting using PlasticDB database
  • Targeted pathway profiling of large and complex reference databases

Notes:
  • The final output files are in the output subdirectory, and to avoid redundant files many of these are symlinked from the intermediate directory. This can cause issues if you are using a "scratch" directory where files are deleted after a certain amount of time. If you have a crontab set up to keep files fresh, make sure it also touches symlinks and not just regular files (see the sketch after this list).
  • You'll need to adjust the memory and time for different jobs. Assembly will take much longer than preprocessing. Annotation will require more memory than mapping.
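For the scratch-purge issue mentioned above, a minimal crontab-friendly sketch, assuming your purge policy is based on modification time; touch -h refreshes a symlink itself rather than following it to its target:

```bash
# Refresh modification times on regular files AND symlinks under the
# project directory so a scratch purge does not delete them.
# -h (--no-dereference) makes touch update the link itself.
find veba_output/ \( -type f -o -type l \) -exec touch -h {} +
```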