Release v1.4.1 Highlights:
- `VEBA` Modules:
  - Added `profile-taxonomic.py` module which uses `sylph` to build a sketch database for genomes and queries the genome database for taxonomic abundance.
  - Added long read support for `fastq_preprocessor`, `preprocess.py`, `assembly-long.py`, `coverage-long`, and all binning modules.
  - Redesigned `binning-eukaryotic` module to handle custom `MetaEuk` databases.
  - Added new usage syntax `veba --module preprocess --params "${PARAMS}"` where the Conda environment is abstracted and determined automatically in the backend. Changed all the walkthroughs to reflect this change.
  - Added `skani` which is the new default for genome-level clustering based on ANI.
  - Added `Diamond DeepClust` as an alternative to `MMSEQS2` for protein clustering.
- `VEBA` Database (`VDB_v6`):
  - Completely rebuilt `VEBA's Microeukaryotic Protein Database` to produce a clustered database `MicroEuk100/90/50` similar to `UniRef100/90/50`. Available on doi:10.5281/zenodo.10139450.
  - Number of sequences:
    - MicroEuk100 = 79,920,431 (19 GB)
    - MicroEuk90 = 51,767,730 (13 GB)
    - MicroEuk50 = 29,898,853 (6.5 GB)
  - Number of source organisms per dataset:
    - MycoCosm = 2503
    - PhycoCosm = 174
    - EnsemblProtists = 233
    - MMETSP = 759
    - TARA_SAGv1 = 8
    - EukProt = 366
    - EukZoo = 27
    - TARA_SMAGv1 = 389
    - NR_Protists-Fungi = 48217

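
To put the MicroEuk sequence counts listed above in perspective, the following minimal sketch computes what fraction of the MicroEuk100 sequences remains at each clustering level. The counts are copied from these release notes. The dictionary and the `fraction_retained` helper are hypothetical names used only for illustration and are not part of VEBA.

```python
# Sequence counts copied from the release notes above.
sequence_counts = {
    "MicroEuk100": 79_920_431,
    "MicroEuk90": 51_767_730,
    "MicroEuk50": 29_898_853,
}


def fraction_retained(counts, baseline="MicroEuk100"):
    """Return each dataset's sequence count as a fraction of the baseline dataset."""
    base = counts[baseline]
    return {name: n / base for name, n in counts.items()}


fractions = fraction_retained(sequence_counts)
for name, fraction in fractions.items():
    # Print the raw count alongside the share of MicroEuk100 that survives clustering.
    print(f"{name}: {sequence_counts[name]:,} sequences ({fraction:.1%} of MicroEuk100)")
```

By these numbers, MicroEuk90 keeps roughly 65% of the MicroEuk100 sequences and MicroEuk50 roughly 37%.
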
**Release v1.4.0 Details**
* [2023.12.15] - Added `profile-taxonomic.py` module which uses `sylph` to build a sketch database for genomes and queries the genome database, similar to `Kraken`, for taxonomic abundance.
* [2023.12.14] - Removed requirement to have `--estimated_assembly_size` for Flye per [Flye Issue #652](https://github.com/mikolmogorov/Flye/issues/652).
* [2023.12.14] - Added `sylph` to `VEBA-profile_env` for abundance profiling of genomes.
* [2023.12.13] - Duplicate contigs are now dereplicated in `concatenate_fasta.py`.
* [2023.12.12] - Added `--reference_gzipped` to `index.py` and `mapping.py` with new default being that the reference fasta is not gzipped.
* [2023.12.11] - Added `skani` as new default for genome clustering in `cluster.py`, `global_clustering.py`, and `local_clustering.py`.
* [2023.12.11] - Added support for long reads in `fastq_preprocessor`, `preprocess.py`, `assembly-long.py`, `coverage-long`, and all binning modules.
* [2023.11.28] - Fixed `annotations.protein_clusters.tsv.gz` from `merge_annotations.py` added in patch update of `v1.3.1`.
* [2023.11.14] - Added support for missing values in `compile_eukaryotic_classifications.py`.
* [2023.11.13] - Added `--metaeuk_split_memory_limit` argument with (experimental) default set to `36G` in `binning-eukaryotic.py` and `eukaryotic_gene_modeling.py`.
* [2023.11.10] - Added `--compressed 1` to `mmseqs createdb` in `download_databases.sh` installation script.
* [2023.11.10] - Added a check to `check_fasta_duplicates.py` and `clean_fasta.py` to make sure there are no `>` characters in fasta sequences, which can be caused by concatenating fasta files that are missing a trailing linebreak.
* [2023.11.10] - Added `Diamond DeepClust` to `clustering_wrapper.py`, `global/local_clustering.py`, and `cluster.py`. Changed `mmseqs2_wrapper.py` to `clustering_wrapper.py`. Changed `easy-cluster` and `easy-linclust` to `mmseqs-cluster` and `mmseqs-linclust`.
* [2023.11.9] - Fixed viral quality in `merge_genome_quality_assessments.py`.
* [2023.11.3] - Changed `consensus_genome_classification.py` to `consensus_genome_classification_ranked.py`. Also changed the default behavior to allow for missing taxonomic levels.
* [2023.11.2] - Fixed a memory leak in `merge_annotations.py` that occurred when creating the `annotations.protein_clusters.tsv.gz` output table. The formatting for empty sets and string lists still needs to be corrected.
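
The `>` check described in the 2023.11.10 entry above guards against a common failure: when a fasta file lacks a trailing linebreak and another file is appended to it, the next header ends up fused onto the last sequence line. The following is a minimal sketch of such a check, not the actual code in `check_fasta_duplicates.py` or `clean_fasta.py`. The function name `find_embedded_header_markers` is hypothetical.

```python
import io


def find_embedded_header_markers(handle):
    """Yield (line_number, record_id) for every sequence line that contains '>'.

    A '>' inside a sequence line usually means two fasta files were
    concatenated without a linebreak between them.
    """
    record_id = None
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: remember the identifier of the current record.
            parts = line[1:].split()
            record_id = parts[0] if parts else ""
            continue
        if ">" in line:
            # Sequence line with a fused header from the next file.
            yield line_number, record_id


# Simulate concatenating a file without a trailing linebreak to a second file.
first_file = ">seq1\nACGT"        # no trailing "\n"
second_file = ">seq2\nTTTT\n"
merged = io.StringIO(first_file + second_file)

problems = list(find_embedded_header_markers(merged))
for line_number, record_id in problems:
    print(f"line {line_number}: '>' found inside sequence of record '{record_id}'")
```

Reporting the line number together with the affected record identifier makes it easier to find which input file was missing its final linebreak.
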