VEBA_v1.4.1

@jolespin jolespin released this 19 Dec 23:22

Release v1.4.1 Highlights:

  • VEBA Modules:

    • Added profile-taxonomic.py module which uses sylph to build a sketch database for genomes and queries the genome database for taxonomic abundance.
    • Added long read support for fastq_preprocessor, preprocess.py, assembly-long.py, coverage-long, and all binning modules.
    • Redesigned binning-eukaryotic module to handle custom MetaEuk databases.
    • Added new usage syntax veba --module preprocess --params "${PARAMS}", where the Conda environment is abstracted and determined automatically in the backend. Updated all the walkthroughs to reflect this change.
    • Added skani, which is the new default for genome-level clustering based on ANI.
    • Added Diamond DeepClust as an alternative to MMseqs2 for protein clustering.
  • VEBA Database (VDB_v6):

    • Completely rebuilt VEBA's Microeukaryotic Protein Database to produce a clustered database MicroEuk100/90/50 similar to UniRef100/90/50. Available on doi:10.5281/zenodo.10139450.

    • Number of sequences:

      • MicroEuk100 = 79,920,431 (19 GB)
      • MicroEuk90 = 51,767,730 (13 GB)
      • MicroEuk50 = 29,898,853 (6.5 GB)
    • Number of source organisms per dataset:

      • MycoCosm = 2503
      • PhycoCosm = 174
      • EnsemblProtists = 233
      • MMETSP = 759
      • TARA_SAGv1 = 8
      • EukProt = 366
      • EukZoo = 27
      • TARA_SMAGv1 = 389
      • NR_Protists-Fungi = 48217
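
The new `veba --module ... --params "${PARAMS}"` syntax from the highlights passes the whole parameter string as a single quoted argument. A minimal Python sketch of assembling that call is shown here. The helper `build_veba_command` is hypothetical, and the flags and file names inside `PARAMS` are illustrative placeholders, not options confirmed by these release notes.

```python
import shlex


def build_veba_command(module: str, params: str) -> list:
    """Return the argv list for: veba --module <module> --params "<params>"."""
    # The parameter string stays ONE argument, mirroring the quoted "${PARAMS}".
    return ["veba", "--module", module, "--params", params]


# Hypothetical parameter string: flags and paths are placeholders for illustration.
PARAMS = "-1 reads_1.fastq.gz -2 reads_2.fastq.gz -n sample_1 -o veba_output/preprocess"

command = build_veba_command("preprocess", PARAMS)

# Show the shell-quoted form of the command without executing it.
print(shlex.join(command))
```

On a machine where VEBA is installed, the resulting list could be handed to `subprocess.run(command, check=True)`. Keeping the parameters as one list element avoids re-splitting them before the wrapper sees them.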

**Release v1.4.0 Details**

* [2023.12.15] - Added `profile-taxonomic.py` module which uses `sylph` to build a sketch database for genomes and queries the genome database similar to `Kraken` for taxonomic abundance.
* [2023.12.14] - Removed requirement to have `--estimated_assembly_size` for Flye per [Flye Issue #652](https://github.com/mikolmogorov/Flye/issues/652).
* [2023.12.14] - Added `sylph` to `VEBA-profile_env` for abundance profiling of genomes.
* [2023.12.13] - Dereplicate duplicate contigs in `concatenate_fasta.py`.
* [2023.12.12] - Added `--reference_gzipped` to `index.py` and `mapping.py` with new default being that the reference fasta is not gzipped.
* [2023.12.11] - Added `skani` as new default for genome clustering in `cluster.py`, `global_clustering.py`, and `local_clustering.py`.
* [2023.12.11] - Added support for long reads in `fastq_preprocessor`, `preprocess.py`, `assembly-long.py`, `coverage-long`, and all binning modules.
* [2023.11.28] - Fixed `annotations.protein_clusters.tsv.gz` from `merge_annotations.py` added in patch update of `v1.3.1`.
* [2023.11.14] - Added support for missing values in `compile_eukaryotic_classifications.py`.
* [2023.11.13] - Added `--metaeuk_split_memory_limit` argument with (experimental) default set to `36G` in `binning-eukaryotic.py` and `eukaryotic_gene_modeling.py`.
* [2023.11.10] - Added `--compressed 1` to `mmseqs createdb` in `download_databases.sh` installation script.
* [2023.11.10] - Added a check to `check_fasta_duplicates.py` and `clean_fasta.py` to make sure there are no `>` characters in fasta sequences caused by concatenating fasta files that are missing linebreaks.
* [2023.11.10] - Added `Diamond DeepClust` to `clustering_wrapper.py`, `global/local_clustering.py`, and `cluster.py`. Changed `mmseqs2_wrapper.py` to `clustering_wrapper.py`. Changed `easy-cluster` and `easy-linclust` to `mmseqs-cluster` and `mmseqs-linclust`.
* [2023.11.9] - Fixed viral quality in `merge_genome_quality_assessments.py`.
* [2023.11.3] - Changed `consensus_genome_classification.py` to `consensus_genome_classification_ranked.py`. Also changed the default behavior to allow for missing taxonomic levels.
* [2023.11.2] - Fixed a memory leak in `merge_annotations.py` that occurred when creating the `annotations.protein_clusters.tsv.gz` output table. However, the formatting for empty sets and string lists still needs to be corrected.
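
The `>` check noted in the 2023.11.10 entry guards against records that get fused when a FASTA file without a trailing line break is concatenated with another. The sketch below illustrates that idea only; it is not the code in `check_fasta_duplicates.py` or `clean_fasta.py`, and the function name `find_fused_records` is hypothetical.

```python
from typing import Iterable, List, Tuple


def find_fused_records(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, line) for sequence lines that contain a '>' character."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        # Header lines start with '>'; a '>' inside any other line suggests that
        # the next record's header was glued onto the end of a sequence line.
        if not line.startswith(">") and ">" in line:
            problems.append((number, line))
    return problems


# First file lacks a trailing line break, so naive concatenation fuses two records.
file_a = ">contig_1\nACGT"
file_b = ">contig_2\nTTGA\n"
merged = file_a + file_b

fused = find_fused_records(merged.splitlines())
for number, line in fused:
    print(f"line {number}: unexpected '>' in sequence: {line}")
```

A check like this only reports the problem. Repairing it would mean splitting the offending line at the `>` and treating the remainder as a new header.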