Release VEBA_v1.4.1 · jolespin/veba

Release v1.4.1 Highlights:

VEBA Modules:
- Added profile-taxonomic.py module which uses sylph to build a sketch database for genomes and queries the genome database for taxonomic abundance.
- Added long read support for fastq_preprocessor, preprocess.py, assembly-long.py, coverage-long, and all binning modules.
- Redesigned binning-eukaryotic module to handle custom MetaEuk databases.
- Added new usage syntax `veba --module preprocess --params "${PARAMS}"`, where the Conda environment is abstracted and determined automatically in the backend. Changed all the walkthroughs to reflect this change.
- Added skani which is the new default for genome-level clustering based on ANI.
- Added Diamond DeepClust as an alternative to MMseqs2 for protein clustering.
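
The new invocation style can also be assembled from a workflow script, for example when launching the same module for many samples. The sketch below only builds the command line and does not run VEBA. The helper name `build_veba_command` and the example parameter string are assumptions for illustration and are not part of VEBA; check each module's help output for its real options.

```python
import shlex
from typing import List

def build_veba_command(module: str, params: str) -> List[str]:
    """Return the argv list for: veba --module <module> --params "<params>"."""
    # The whole parameter string is passed as ONE argument to --params,
    # mirroring the quoting in: veba --module preprocess --params "${PARAMS}"
    return ["veba", "--module", module, "--params", params]

# Hypothetical parameter string, used only to show the quoting.
params = "-1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -n sample -o veba_output/preprocess"
command = build_veba_command("preprocess", params)

# shlex.join quotes the parameter string so it can be pasted into a shell.
print(shlex.join(command))
```

Passing the list form to `subprocess.run` avoids shell quoting problems, because the parameter string stays a single argument.
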
VEBA Database (VDB_v6):
- Completely rebuilt VEBA's Microeukaryotic Protein Database to produce a clustered database MicroEuk100/90/50 similar to UniRef100/90/50. Available on doi:10.5281/zenodo.10139450.
- Number of sequences:
  - MicroEuk100 = 79,920,431 (19 GB)
  - MicroEuk90 = 51,767,730 (13 GB)
  - MicroEuk50 = 29,898,853 (6.5 GB)
- Number of source organisms per dataset:
  - MycoCosm = 2503
  - PhycoCosm = 174
  - EnsemblProtists = 233
  - MMETSP = 759
  - TARA_SAGv1 = 8
  - EukProt = 366
  - EukZoo = 27
  - TARA_SMAGv1 = 389
  - NR_Protists-Fungi = 48217
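
To see how much each clustering level reduces the database, the sequence counts listed above can be expressed as a fraction of MicroEuk100. This is a small arithmetic sketch that uses only the numbers in these notes; the variable names are illustrative.

```python
# Sequence counts copied from the VDB_v6 notes above.
sequence_counts = {
    "MicroEuk100": 79_920_431,
    "MicroEuk90": 51_767_730,
    "MicroEuk50": 29_898_853,
}

baseline = sequence_counts["MicroEuk100"]

# Fraction of MicroEuk100 sequences retained at each clustering level.
retained = {name: count / baseline for name, count in sequence_counts.items()}

for name, fraction in retained.items():
    print(f"{name}: {sequence_counts[name]:,} sequences ({fraction:.1%} of MicroEuk100)")
```

By these counts, MicroEuk90 keeps roughly 65% of the MicroEuk100 sequences and MicroEuk50 roughly 37%.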

**Release v1.4.0 Details**

* [2023.12.15] - Added `profile-taxonomic.py` module which uses `sylph` to build a sketch database for genomes and queries the genome database similar to `Kraken` for taxonomic abundance.
* [2023.12.14] - Removed requirement to have `--estimated_assembly_size` for Flye per [Flye Issue #652](https://github.com/mikolmogorov/Flye/issues/652).
* [2023.12.14] - Added `sylph` to `VEBA-profile_env` for abundance profiling of genomes.
* [2023.12.13] - Dereplicate duplicate contigs in `concatenate_fasta.py`.
* [2023.12.12] - Added `--reference_gzipped` to `index.py` and `mapping.py` with new default being that the reference fasta is not gzipped.
* [2023.12.11] - Added `skani` as new default for genome clustering in `cluster.py`, `global_clustering.py`, and `local_clustering.py`.
* [2023.12.11] - Added support for long reads in `fastq_preprocessor`, `preprocess.py`, `assembly-long.py`, `coverage-long`, and all binning modules.
* [2023.11.28] - Fixed `annotations.protein_clusters.tsv.gz` from `merge_annotations.py` added in patch update of `v1.3.1`.
* [2023.11.14] - Added support for missing values in `compile_eukaryotic_classifications.py`.
* [2023.11.13] - Added `--metaeuk_split_memory_limit` argument with (experimental) default set to `36G` in `binning-eukaryotic.py` and `eukaryotic_gene_modeling.py`.
* [2023.11.10] - Added `--compressed 1` to `mmseqs createdb` in `download_databases.sh` installation script.
* [2023.11.10] - Added a check to `check_fasta_duplicates.py` and `clean_fasta.py` to make sure there are no `>` characters in fasta sequences, which are caused by concatenating fasta files that are missing linebreaks.
* [2023.11.10] - Added `Diamond DeepClust` to `clustering_wrapper.py`, `global/local_clustering.py`, and `cluster.py`. Changed `mmseqs2_wrapper.py` to `clustering_wrapper.py`. Changed `easy-cluster` and `easy-linclust` to `mmseqs-cluster` and `mmseqs-linclust`.
* [2023.11.9] - Fixed viral quality in `merge_genome_quality_assessments.py`.
* [2023.11.3] - Changed `consensus_genome_classification.py` to `consensus_genome_classification_ranked.py`. Also changed the default behavior to allow for missing taxonomic levels.
* [2023.11.2] - Fixed a memory leak in `merge_annotations.py` that occurred when creating the `annotations.protein_clusters.tsv.gz` output table. However, still need to correct the formatting for empty sets and string lists.
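
The `>` check listed above guards against fasta files that were concatenated without a trailing newline, which fuses the next header onto the end of a sequence line. A minimal sketch of that kind of check follows. It is not the code in `check_fasta_duplicates.py` or `clean_fasta.py`, and the function name `find_fused_headers` is hypothetical.

```python
import io
from typing import Iterable, List, Tuple

def find_fused_headers(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, line) for sequence lines that contain '>'."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # Header lines legitimately start with '>'. A '>' anywhere in a
        # non-header line means a header was glued onto sequence text
        # by a missing line break.
        if line and not line.startswith(">") and ">" in line:
            problems.append((line_number, line))
    return problems

# Two records concatenated without a newline between them.
fasta = io.StringIO(">contig_1\nACGTACGT>contig_2\nTTGGCCAA\n")
print(find_fused_headers(fasta))  # [(2, 'ACGTACGT>contig_2')]
```

Reporting the line number makes it straightforward to locate the file boundary where the linebreak is missing.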
