Releases: jolespin/veba

VEBA_v2.3.0

22 Sep 16:52
2848a37
  • [2024.9.21] - Added KEGG Pathway Profiler to VEBA-database_env and VEBA-annotate_env, replacing MicrobeAnnotator-KEGG for module completion ratios. The ${VEBA_DATABASE}/Annotate/MicrobeAnnotator-KEGG database files are replaced by ${VEBA_DATABASE}/Annotate/KEGG-Pathway-Profiler/. Note: The new module completion ratio output does not have class labels for KEGG modules.
  • [2024.8.30] - Added ${N_JOBS} to download scripts with default set to maximum threads available

VEBA_v2.2.1

30 Aug 00:38
2a504ae
  • [2024.8.29] - Added VERSION file created in download_databases.sh
  • [2024.7.11] - The alignment fraction threshold for genome clustering was only applied to the reference but should also apply to the query. Added --af_mode to edgelist_to_clusters.py, global_clustering.py, local_clustering.py, and cluster.py with either relaxed = max([Alignment_fraction_ref, Alignment_fraction_query]) > minimum_af or strict = (Alignment_fraction_ref > minimum_af) & (Alignment_fraction_query > minimum_af).
  • [2024.7.3] - Added pigz to VEBA-annotate_env. Its absence isn't a problem with most conda installations, but it is needed for Docker containers.
  • [2024.6.21] - Changed choose_fastest_mirror.py to determine_fastest_mirror.py
  • [2024.6.20] - Added -m/--include_mrna to compile_metaeuk_identifiers.py for Issue #110
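
The two --af_mode settings from the [2024.7.11] entry can be sketched in Python as follows. This is a minimal illustration rather than code from VEBA: the function name passes_af and its argument names are hypothetical, and the 30.0 default is borrowed from the --af_threshold default mentioned in the v2.1.0 notes.

```python
def passes_af(af_ref, af_query, minimum_af=30.0, af_mode="relaxed"):
    """Return True if a genome pair passes the alignment-fraction filter.

    Hypothetical helper for illustration; not part of VEBA.
    """
    if af_mode == "relaxed":
        # One direction clearing the threshold is enough
        return max([af_ref, af_query]) > minimum_af
    if af_mode == "strict":
        # Both directions must clear the threshold
        return (af_ref > minimum_af) and (af_query > minimum_af)
    raise ValueError(f"Unknown af_mode: {af_mode!r}")


# A pair with a low reference alignment fraction and a high query alignment fraction
print(passes_af(12.5, 95.0, af_mode="relaxed"))  # True
print(passes_af(12.5, 95.0, af_mode="strict"))   # False
```

The relaxed mode keeps pairs where only one genome is well covered by the alignment, while the strict mode requires good coverage in both directions.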

VEBA_v2.2.0

10 Jun 01:16
05af0fd

Disclaimer:
I made some large updates in this version and I believe everything has been adequately tested, but in case anything has slipped through the cracks you can use v2.1.0, which has been thoroughly tested in accordance with the NAR Espinoza 2024 paper. Benefits of using this version include much faster and more robust prokaryotic classifications and fast/scalable HMM-based annotation modeling.

Large performance updates for this version including:

  • Updating GTDB-Tk 2.3.0 -> 2.4.0 which means the GTDB needed to be updated from r214.1 -> r220
  • VEBA-classify_env was split up into VEBA-classify-eukaryotic_env, VEBA-classify-prokaryotic_env, and VEBA-prokaryotic_env
  • annotate.py, classify-eukaryotic.py, and phylogeny.py (and their utility scripts) were rewritten to use PyHMMER (pyhmmsearch and pykofamsearch), which is faster than HMMSearch when multithreaded.
  • KOFAM was changed to KOfam

NOTE: Please don't use the tar.gz as it contains the 2.1.0 version for some reason:

VERSION="2.2.0"
# wget https://github.com/jolespin/veba/archive/refs/tags/v${VERSION}.tar.gz # The .tar.gz is out of date in this release
# tar -xvf v${VERSION}.tar.gz && mv veba-${VERSION} veba

# Alternative download
wget https://github.com/jolespin/veba/releases/download/v${VERSION}/v${VERSION}.zip
unzip -d veba v${VERSION}.zip

VEBA_v2.1.0-zen

03 Jun 18:46
05af0fd

This is the exact same version as VEBA_v2.1.0. New VEBA releases will now automatically be synced to Zenodo.

VEBA_v2.1.0

17 May 14:13
b67f0ed

Official release of VEBA v2.1.0 with updates to address peer reviewers. Mostly documentation but also including the following:

  • [2024.4.30] - Added concatenate_files.py which can concatenate files (including a mix of compressed and uncompressed files) using either arguments, a list file, or a glob. The reason is that Unix limits how many arguments a command can take (e.g., cat *.fasta > output.fasta will crash when *.fasta expands to 50k files).
  • [2024.4.29] - Added /volumes/workspace/ directory to Docker containers for situations when your input and output directories are the same.
  • [2024.4.29] - featureCounts can only handle 64 threads at a time so added min(64, opts.n_jobs) for all the modules/scripts that use featureCounts commands.
  • [2024.4.23] - Added uniprot_to_enzymes.py which reformats tables and fasta from https://www.uniprot.org/uniprotkb?query=ec%3A*
  • [2024.4.18] - Developed a faster CLI implementation of KofamScan called PyKofamSearch which leverages PyHMMER. This will be used in future versions of VEBA.
  • [2024.4.18] - Developed a faster CLI implementation of HMMSearch called PyHMMSearch which leverages PyHMMER. This will be used in future versions of VEBA.
  • [2024.3.26] - Added --metaeuk_split_memory_limit to metaeuk_wrapper.py.
  • [2024.3.26] - Added -d/--genome_identifier_directory_index to scaffolds_to_bins.py for directories that are structured path/to/genomes/bin_a/reference.fasta where you would use -d -2.
  • [2024.3.26] - Added --minimum_af to edgelist_to_clusters.py with an option to accept 4 column inputs [id_1]<tab>[id_2]<tab>[weight]<tab>[alignment_fraction]. global_clustering.py, local_clustering.py, and cluster.py now use this by default --af_threshold 30.0. If you want to retain previous behavior, just use --af_threshold 0.0.
  • [2024.3.18] - edgelist_to_clusters.py only includes edges where both nodes are in identifiers set. If --identifiers are provided, then only those identifiers are used. If not, then it includes all nodes.
  • [2024.3.18] - Added --export_representatives argument for edgelist_to_clusters.py to output table with [id_node]<tab>[id_cluster]<tab>[intra-cluster_connectivity]<tab>[representative]. Also includes this information in nx.Graph objects.
  • [2024.3.18] - Changed singleton weight to np.nan instead of np.inf for edgelist_to_clusters.py to allow for representative calculations.
  • YouTube channel (https://www.youtube.com/@VEBA-Multiomics)
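
The argument-limit problem described in the [2024.4.30] entry can be avoided by expanding the glob inside the program instead of in the shell. The following is a minimal sketch of that idea, not the implementation of concatenate_files.py: the function names open_binary and concatenate_glob are hypothetical, and it assumes compressed inputs are gzip files ending in .gz.

```python
import glob
import gzip
import os
import shutil


def open_binary(path):
    """Open a file for reading bytes, decompressing on the fly if it ends in .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, "rb")
    return open(path, "rb")


def concatenate_glob(pattern, output_path):
    """Concatenate all files matching `pattern` into `output_path` (uncompressed).

    Returns the number of input files. Hypothetical helper, not VEBA's code.
    """
    output_abs = os.path.abspath(output_path)
    # The glob is expanded inside Python, so no long argument list is handed to a program.
    # The output file is excluded in case it matches the pattern itself.
    paths = sorted(p for p in glob.glob(pattern) if os.path.abspath(p) != output_abs)
    with open(output_path, "wb") as f_out:
        for path in paths:
            with open_binary(path) as f_in:
                shutil.copyfileobj(f_in, f_out)
    return len(paths)
```

For example, concatenate_glob("genomes/*.fasta*", "output.fasta") would combine plain and gzipped FASTA files from a hypothetical genomes/ directory, regardless of how many files match.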

VEBA_v2.1.0b (pre-release)

08 May 21:11
783932d
Pre-release

Beta release of VEBA v2.1.0b with updates to address peer reviewers. Mostly documentation but also including the following:

  • [2024.4.30] - Added concatenate_files.py which can concatenate files (including a mix of compressed and uncompressed files) using either arguments, a list file, or a glob. The reason is that Unix limits how many arguments a command can take (e.g., cat *.fasta > output.fasta will crash when *.fasta expands to 50k files).
  • [2024.4.29] - Added /volumes/workspace/ directory to Docker containers for situations when your input and output directories are the same.
  • [2024.4.29] - featureCounts can only handle 64 threads at a time so added min(64, opts.n_jobs) for all the modules/scripts that use featureCounts commands.
  • [2024.4.23] - Added uniprot_to_enzymes.py which reformats tables and fasta from https://www.uniprot.org/uniprotkb?query=ec%3A*
  • [2024.4.18] - Developed a faster implementation of KofamScan called PyKofamSearch which leverages PyHMMER. This will be used in future versions of VEBA.
  • [2024.3.26] - Added --metaeuk_split_memory_limit to metaeuk_wrapper.py.
  • [2024.3.26] - Added -d/--genome_identifier_directory_index to scaffolds_to_bins.py for directories that are structured path/to/genomes/bin_a/reference.fasta where you would use -d -2.
  • [2024.3.26] - Added --minimum_af to edgelist_to_clusters.py with an option to accept 4 column inputs [id_1]<tab>[id_2]<tab>[weight]<tab>[alignment_fraction]. global_clustering.py, local_clustering.py, and cluster.py now use this by default --af_threshold 30.0. If you want to retain previous behavior, just use --af_threshold 0.0.
  • [2024.3.18] - edgelist_to_clusters.py only includes edges where both nodes are in identifiers set. If --identifiers are provided, then only those identifiers are used. If not, then it includes all nodes.
  • [2024.3.18] - Added --export_representatives argument for edgelist_to_clusters.py to output table with [id_node]<tab>[id_cluster]<tab>[intra-cluster_connectivity]<tab>[representative]. Also includes this information in nx.Graph objects.
  • [2024.3.18] - Changed singleton weight to np.nan instead of np.inf for edgelist_to_clusters.py to allow for representative calculations.
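
The -d/--genome_identifier_directory_index option in the [2024.3.26] entry selects one component of each file path as the genome identifier. A minimal sketch of that indexing, assuming POSIX-style paths; the function name genome_identifier is hypothetical and this is not the code from scaffolds_to_bins.py:

```python
from pathlib import PurePosixPath


def genome_identifier(path, directory_index=-2):
    """Pick the genome identifier from one component of a file path.

    Hypothetical helper; directory_index mirrors the idea behind -d -2.
    """
    return PurePosixPath(path).parts[directory_index]


print(genome_identifier("path/to/genomes/bin_a/reference.fasta"))  # bin_a
```

With an index of -2, the identifier is the directory that contains the FASTA file, which is useful when every genome file shares the same file name.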

VEBA_v2.0.0

10 Mar 20:12
07496fa
  • Changed default assembly algorithm to metaflye instead of flye in assembly-long.py
  • Added number_of_genomes, number_of_genome-clusters, number_of_proteins, and number_of_protein-clusters to feature_compression_ratios.tsv.gz from cluster.py
  • Added -A/--from_antismash in biosynthetic.py to use preexisting antiSMASH results. Also changed -i/--input to -i/--from_genomes.
  • Changed antismash_genbanks_to_table.py to biosynthetic_genbanks_to_table.py for future support of DeepBGC and GECCO
  • Added busco_version parameter to merge_busco_json.py with default set to 5.4.x and additional support for 5.6.x.
  • Added CONDA_ENVS_PATH to update_environment_scripts.sh, update_environment_variables.sh, and check_installation.sh
  • Added CONDA_ENVS_PATH to veba to allow for custom environment locations
  • Changed install.sh to support custom CONDA_ENVS_PATH argument bash install.sh path/to/log path/to/envs/
  • Added merge_counts_with_taxonomy.py

VEBA_v1.5.0

30 Jan 21:46
8a582da

Warning:
For this release, use the https://github.com/jolespin/veba/releases/download/v1.5.0/v1.5.0.zip asset, not the "Source code" assets, as those are out of date.

Release v1.5.0 Highlights:

  • Added VeryFastTree to phylogeny.py
  • Added --blacklist to compile_eukaryotic_classifications.py
  • Added compatibility for antismash_genbanks_to_table.py to operate on antiSMASH v7 genbanks
  • Added compile_phylogenomic_functional_categories.py script which automates the methodology from Espinoza et al. 2022 (doi:10.1093/pnasnexus/pgac239)
  • Fixed error in annotations.protein_clusters.tsv formatting from annotate.py
  • Fixed situation where unbinned.fasta was not added in binning-prokaryotic.py and bad symlinks were created for GFF, rRNA, and tRNA when no genomes were detected.
  • Fixed critical error where classify_eukaryotic.py was trying to access a deprecated database file from MicroEuk_v2.

Release v1.5.0 Details:

  • Cleaned up installation files
  • Changed veba/src/ to veba/bin/
  • Changed SCRIPT_VERSIONS to VEBA_SCRIPT_VERSIONS which are now in bin/ of conda environment
  • Fixed header being offset in annotations.protein_clusters.tsv where it could not be read with Pandas.
  • Fixed binning-prokaryotic.py creating symlinks to non-existent files, where "'.gff'", "'.rRNA'", and "'*.tRNA'" were created.
  • Fixed .strip method on Pandas series in antismash_genbanks_to_table.py for compatibility with antiSMASH 6 and 7
  • Fixed situation where unbinned.fasta is empty in binning-prokaryotic.py when there are no bins that pass qc.
  • Fixed minor error in coverage.py where samtools sort --reference was getting reads_table.tsv and not reference.fasta
  • Changed default behavior in assembly-long.py from deterministic to non-deterministic for increased speed (i.e., --no_deterministic --> --deterministic)
  • Added VeryFastTree as an option to phylogeny.py with FastTree remaining as the default.
  • Changed default --leniency parameter on classify_eukaryotic.py and consensus_genome_classification_ranked.py to 1.0 and added --leniency_genome_classification as a separate option.
  • Added --blacklist option to compile_eukaryotic_classifications.py with a default value of species:uncultured eukaryote in classify_eukaryotic.py
  • Fixed critical error where classify_eukaryotic.py was trying to access a deprecated database file from MicroEuk_v2.
  • Fixed minor error with eukaryotic_gene_modeling_wrapper.py not allowing for Tiara to run in backend.
  • Added compile_phylogenomic_functional_categories.py script which automates the methodology from Espinoza et al. 2022 (doi:10.1093/pnasnexus/pgac239)
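
The --blacklist default of species:uncultured eukaryote reads as a level:label pair. One way such a rule could be applied is sketched below. This is an illustration under stated assumptions, not the logic of compile_eukaryotic_classifications.py: the function names, the level:label parsing, and the representation of each classification as a dictionary keyed by taxonomic level are all hypothetical.

```python
def parse_blacklist_entry(entry):
    """Split a 'level:label' string into (level, label). Assumed format."""
    level, separator, label = entry.partition(":")
    if not separator or not level or not label:
        raise ValueError(f"Expected 'level:label', got {entry!r}")
    return level.strip(), label.strip()


def drop_blacklisted(records, blacklist=("species:uncultured eukaryote",)):
    """Keep only records that match none of the blacklist rules.

    Each record is assumed to be a dict keyed by taxonomic level.
    """
    rules = [parse_blacklist_entry(entry) for entry in blacklist]
    return [
        record
        for record in records
        if not any(record.get(level) == label for level, label in rules)
    ]


# Made-up classification records for illustration only
records = [
    {"id": "hit_1", "species": "uncultured eukaryote"},
    {"id": "hit_2", "species": "Example species"},
]
print([record["id"] for record in drop_blacklisted(records)])  # ['hit_2']
```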

VEBA_v1.4.2

22 Dec 03:41
2374798
  • [2023.12.21] - GTDB-Tk changed name of archaea summary file so VEBA was not adding this to final classification. Fixed this in classify-prokaryotic.py.
  • [2023.12.20] - Fixed files not being closed in compile_custom_humann_database_from_annotations.py and added options to use different annotation file formats (i.e., multilevel, header, and no header).
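
Unclosed file handles of the kind fixed in the [2023.12.20] entry are usually avoided with a context manager, which closes the file even if parsing fails partway through. A minimal sketch, assuming a tab-separated annotation table; the function name read_annotation_rows and the has_header flag are hypothetical and do not reflect the script's actual options:

```python
import csv


def read_annotation_rows(path, has_header=True):
    """Read a tab-separated table into a list of rows.

    The `with` block guarantees the file is closed when the function returns
    or raises. Hypothetical helper for illustration.
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        if has_header:
            next(reader, None)  # Skip the header row if the file has one
        return [row for row in reader]
```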

VEBA_v1.4.1

19 Dec 23:22

Release v1.4.1 Highlights:

  • VEBA Modules:

    • Added profile-taxonomic.py module which uses sylph to build a sketch database for genomes and queries the genome database for taxonomic abundance.
    • Added long read support for fastq_preprocessor, preprocess.py, assembly-long.py, coverage-long, and all binning modules.
    • Redesigned binning-eukaryotic module to handle custom MetaEuk databases
    • Added new usage syntax veba --module preprocess --params "${PARAMS}" where the Conda environment is abstracted and determined automatically in the backend. Changed all the walkthroughs to reflect this change.
    • Added skani which is the new default for genome-level clustering based on ANI.
    • Added Diamond DeepClust as an alternative to MMseqs2 for protein clustering.
  • VEBA Database (VDB_v6):

    • Completely rebuilt VEBA's Microeukaryotic Protein Database to produce a clustered database MicroEuk100/90/50 similar to UniRef100/90/50. Available on doi:10.5281/zenodo.10139450.

    • Number of sequences:

      • MicroEuk100 = 79,920,431 (19 GB)
      • MicroEuk90 = 51,767,730 (13 GB)
      • MicroEuk50 = 29,898,853 (6.5 GB)
    • Number of source organisms per dataset:

      • MycoCosm = 2503
      • PhycoCosm = 174
      • EnsemblProtists = 233
      • MMETSP = 759
      • TARA_SAGv1 = 8
      • EukProt = 366
      • EukZoo = 27
      • TARA_SMAGv1 = 389
      • NR_Protists-Fungi = 48217

Release v1.4.0 Details:

  • [2023.12.15] - Added profile-taxonomic.py module which uses sylph to build a sketch database for genomes and queries the genome database similar to Kraken for taxonomic abundance.
  • [2023.12.14] - Removed requirement to have --estimated_assembly_size for Flye per Flye Issue #652 (https://github.com/mikolmogorov/Flye/issues/652).
  • [2023.12.14] - Added sylph to VEBA-profile_env for abundance profiling of genomes.
  • [2023.12.13] - Dereplicate duplicate contigs in concatenate_fasta.py.
  • [2023.12.12] - Added --reference_gzipped to index.py and mapping.py with new default being that the reference fasta is not gzipped.
  • [2023.12.11] - Added skani as new default for genome clustering in cluster.py, global_clustering.py, and local_clustering.py.
  • [2023.12.11] - Added support for long reads in fastq_preprocessor, preprocess.py, assembly-long.py, coverage-long, and all binning modules.
  • [2023.11.28] - Fixed annotations.protein_clusters.tsv.gz from merge_annotations.py added in patch update of v1.3.1.
  • [2023.11.14] - Added support for missing values in compile_eukaryotic_classifications.py.
  • [2023.11.13] - Added --metaeuk_split_memory_limit argument with (experimental) default set to 36G in binning-eukaryotic.py and eukaryotic_gene_modeling.py.
  • [2023.11.10] - Added --compressed 1 to mmseqs createdb in download_databases.sh installation script.
  • [2023.11.10] - Added a check to check_fasta_duplicates.py and clean_fasta.py to make sure there are no > characters in fasta sequence caused from concatenating fasta files that are missing linebreaks.
  • [2023.11.10] - Added Diamond DeepClust to clustering_wrapper.py, global/local_clustering.py, and cluster.py. Changed mmseqs2_wrapper.py to clustering_wrapper.py. Changed easy-cluster and easy-linclust to mmseqs-cluster and mmseqs-linclust.
  • [2023.11.9] - Fixed viral quality in merge_genome_quality_assessments.py
  • [2023.11.3] - Changed consensus_genome_classification.py to consensus_genome_classification_ranked.py. Also, default behavior to allow for missing taxonomic levels.
  • [2023.11.2] - Fixed the merge_annotations.py resulting in a memory leak when creating the annotations.protein_clusters.tsv.gz output table. However, still need to correct the formatting for empty sets and string lists.
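
The [2023.11.10] check for > characters inside FASTA sequences targets files that were concatenated without a trailing linebreak, which fuses the next header onto the end of a sequence line. A minimal sketch of such a check, not the code from check_fasta_duplicates.py or clean_fasta.py; the function name find_embedded_headers is hypothetical:

```python
def find_embedded_headers(lines):
    """Return (line_number, line) for sequence lines that contain '>'.

    Header lines (starting with '>') are skipped; any other line containing
    '>' suggests two records were joined without a linebreak.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.startswith(">") and ">" in line:
            problems.append((number, line))
    return problems


# The second record's header was fused onto the first record's sequence
fasta_lines = [">contig_1\n", "ACGTACGT>contig_2\n", "TTGACCA\n"]
print(find_embedded_headers(fasta_lines))  # [(2, 'ACGTACGT>contig_2')]
```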