VEBA is currently under active development. If you are interested in requesting features or wish to report a bug, please post a GitHub issue prefixed with the tag `[Feature Request]` and `[Bug]`, respectively. If you want to contribute or have any other inquiries, contact me at `jol.espinoz[A|T]gmail[DOT]com`.
Release v2.0.0 Highlights:
- Added `-A/--from_antismash` in `biosynthetic.py` to use preexisting `antiSMASH` results. Also changed `-i/--input` to `-i/--from_genomes`.
- Added `number_of_genomes`, `number_of_genome-clusters`, `number_of_proteins`, and `number_of_protein-clusters` to `feature_compression_ratios.tsv.gz` from `cluster.py`
- Added custom path for `conda` environments
- Added `busco_version` parameter to `merge_busco_json.py` with default set to `5.4.x` and additional support for `5.6.x`.
- Changed `antismash_genbanks_to_table.py` to `biosynthetic_genbanks_to_table.py` for future support of `DeepBGC` and `GECCO`
Release v2.0.0 Details
- Changed default assembly algorithm to `metaflye` instead of `flye` in `assembly-long.py`
- Added `number_of_genomes`, `number_of_genome-clusters`, `number_of_proteins`, and `number_of_protein-clusters` to `feature_compression_ratios.tsv.gz` from `cluster.py`
- Added `-A/--from_antismash` in `biosynthetic.py` to use preexisting `antiSMASH` results. Also changed `-i/--input` to `-i/--from_genomes`.
- Changed `antismash_genbanks_to_table.py` to `biosynthetic_genbanks_to_table.py` for future support of `DeepBGC` and `GECCO`
- Added `busco_version` parameter to `merge_busco_json.py` with default set to `5.4.x` and additional support for `5.6.x`.
- Added `CONDA_ENVS_PATH` to `update_environment_scripts.sh`, `update_environment_variables.sh`, and `check_installation.sh`
- Added `CONDA_ENVS_PATH` to `veba` to allow for custom environment locations
- Changed `install.sh` to support custom `CONDA_ENVS_PATH` argument `bash install.sh path/to/log path/to/envs/`
- Added `merge_counts_with_taxonomy.py`
Release v1.5.0 Highlights:
- Added `VeryFastTree` to `phylogeny.py`
- Added `--blacklist` to `compile_eukaryotic_classifications.py`
- Added compatibility for `antismash_genbanks_to_table.py` to operate on `antiSMASH v7` genbanks
- Added `compile_phylogenomic_functional_categories.py` script which automates the methodology from Espinoza et al. 2022 (doi:10.1093/pnasnexus/pgac239)
- Fixed error in `annotations.protein_clusters.tsv` formatting from `annotate.py`
- Fixed situation where `unbinned.fasta` was not added in `binning-prokaryotic.py` and bad symlinks were created for GFF, rRNA, and tRNA when no genomes were detected.
- Fixed critical error where `classify_eukaryotic.py` was trying to access a deprecated database file from MicroEuk_v2.
Release v1.5.0 Details
- Cleaned up installation files
- Changed `veba/src/` to `veba/bin/`
- Changed `SCRIPT_VERSIONS` to `VEBA_SCRIPT_VERSIONS` which are now in `bin/` of conda environment
- Fixed header being offset in `annotations.protein_clusters.tsv` where it could not be read with Pandas.
- Fixed the creation of non-existent symlinks in `binning-prokaryotic.py` where `*.gff`, `*.rRNA`, and `*.tRNA` were created.
- Fixed .strip method on Pandas series in `antismash_genbanks_to_table.py` for compatibility with `antiSMASH 6 and 7`
- Fixed situation where `unbinned.fasta` is empty in `binning-prokaryotic.py` when there are no bins that pass QC.
- Fixed minor error in `coverage.py` where `samtools sort --reference` was getting `reads_table.tsv` and not `reference.fasta`
- Changed default behavior from deterministic to not deterministic for increase in speed in `assembly-long.py` (i.e., `--no_deterministic` --> `--deterministic`).
- Added `VeryFastTree` as an option to `phylogeny.py` with `FastTree` remaining as the default.
- Changed default `--leniency` parameter on `classify_eukaryotic.py` and `consensus_genome_classification_ranked.py` to `1.0` and added `--leniecy_genome_classification` as a separate option.
- Added `--blacklist` option to `compile_eukaryotic_classifications.py` with a default value of `species:uncultured eukaryote` in `classify_eukaryotic.py`
- Fixed critical error where `classify_eukaryotic.py` was trying to access a deprecated database file from MicroEuk_v2.
- Fixed minor error with `eukaryotic_gene_modeling_wrapper.py` not allowing for `Tiara` to run in backend.
- Added `compile_phylogenomic_functional_categories.py` script which automates the methodology from Espinoza et al. 2022 (doi:10.1093/pnasnexus/pgac239)
Release v1.4.2 Highlights:
- `VEBA` Modules:
  - Added `profile-taxonomic.py` module which uses `sylph` to build a sketch database for genomes and queries the genome database for taxonomic abundance.
  - Added long read support for `fastq_preprocessor`, `preprocess.py`, `assembly-long.py`, `coverage-long`, and all binning modules.
  - Redesigned `binning-eukaryotic` module to handle custom `MetaEuk` databases
  - Added new usage syntax `veba --module preprocess --params "${PARAMS}"` where the Conda environment is abstracted and determined automatically in the backend. Changed all the walkthroughs to reflect this change.
  - Added `skani` which is the new default for genome-level clustering based on ANI.
  - Added `Diamond DeepClust` as an alternative to `MMSEQS2` for protein clustering.
- `VEBA` Database (`VDB_v6`):
  - Completely rebuilt `VEBA's Microeukaryotic Protein Database` to produce a clustered database `MicroEuk100/90/50` similar to `UniRef100/90/50`. Available on doi:10.5281/zenodo.10139450.
Release v1.4.2 Details
- Fixed critical error where `classify_eukaryotic.py` was trying to access a deprecated database file from MicroEuk_v2.
- Added `profile-taxonomic.py` module which uses `sylph` to build a sketch database for genomes and queries the genome database similar to `Kraken` for taxonomic abundance.
- Removed requirement to have `--estimated_assembly_size` for Flye per Flye Issue #652.
- Added `sylph` to `VEBA-profile_env` for abundance profiling of genomes.
- Dereplicate duplicate contigs in `concatenate_fasta.py`.
- Added `--reference_gzipped` to `index.py` and `mapping.py` with new default being that the reference fasta is not gzipped.
- Added `skani` as new default for genome clustering in `cluster.py`, `global_clustering.py`, and `local_clustering.py`.
- Added support for long reads in `fastq_preprocessor`, `preprocess.py`, `assembly-long.py`, `coverage-long`, and all binning modules.
- Fixed `annotations.protein_clusters.tsv.gz` from `merge_annotations.py` added in patch update of `v1.3.1`.
- Added support for missing values in `compile_eukaryotic_classifications.py`.
- Added `--metaeuk_split_memory_limit` argument with (experimental) default set to `36G` in `binning-eukaryotic.py` and `eukaryotic_gene_modeling.py`.
- Added `--compressed 1` to `mmseqs createdb` in `download_databases.sh` installation script.
- Added a check to `check_fasta_duplicates.py` and `clean_fasta.py` to make sure there are no `>` characters in fasta sequence caused from concatenating fasta files that are missing linebreaks.
- Added `Diamond DeepClust` to `clustering_wrapper.py`, `global/local_clustering.py`, and `cluster.py`. Changed `mmseqs2_wrapper.py` to `clustering_wrapper.py`. Changed `easy-cluster` and `easy-linclust` to `mmseqs-cluster` and `mmseqs-linclust`.
- Fixed viral quality in `merge_genome_quality_assessments.py`
- Changed `consensus_genome_classification.py` to `consensus_genome_classification_ranked.py`. Also changed the default behavior to allow for missing taxonomic levels.
- Fixed `merge_annotations.py`, which was causing a memory leak when creating the `annotations.protein_clusters.tsv.gz` output table. However, still need to correct the formatting for empty sets and string lists.
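The `>` check added to `check_fasta_duplicates.py` and `clean_fasta.py` can be pictured with the minimal sketch below. It is not the scripts' actual code: the function name `find_embedded_headers` is hypothetical, and it assumes the problem shows up as a `>` on a line that does not itself start with `>`, which is what happens when a file without a trailing linebreak is concatenated with the next one.

```python
import io


def find_embedded_headers(handle):
    """Return (line_number, line) pairs for sequence lines that contain '>'.

    A '>' is only expected as the first character of a header line. When it
    appears inside a sequence line, two records were most likely joined
    without a linebreak between them.
    """
    problems = []
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        # Skip empty lines and proper header lines; flag everything else
        # that still contains a '>' character.
        if line and not line.startswith(">") and ">" in line:
            problems.append((line_number, line))
    return problems


# Two FASTA records joined where the first file lacked a trailing linebreak
# (hypothetical example data).
merged = io.StringIO(">seq1\nACGT>seq2\nTTGA\n")
print(find_embedded_headers(merged))
```

A caller could raise an error, or exit non-zero, whenever the returned list is not empty.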
Release v1.3.0 Highlights:
- `VEBA` Modules:
  - Added `profile-pathway.py` module and associated scripts for building `HUMAnN` databases from de novo genomes and annotations. Essentially, a reads-based functional profiling method via `HUMAnN` using binned genomes as the database.
  - Added `marker_gene_clustering.py` script which identifies core marker proteins that are present in all genomes within a genome cluster (i.e., pangenome) and unique to only that genome cluster. Clusters in either protein or nucleotide space.
  - Added `module_completion_ratios.py` script which calculates KEGG module completion ratios for genomes and pangenomes. Automatically run in backend of `annotate.py`.
  - Updated `annotate.py` and `merge_annotations.py` to provide better annotations for clustered proteins.
  - Added `merge_genome_quality.py` and `merge_taxonomy_classifications.py` which compile genome quality and taxonomy, respectively, for all organisms.
  - Added BGC clustering in protein and nucleotide space to `biosynthetic.py`. Also, produces prevalence tables that can be used for further clustering of BGCs.
  - Added `pangenome_core_sequences` in `cluster.py` which writes both protein and CDS sequences for each genome cluster.
  - Added PDF visualization of newick trees in `phylogeny.py`.
- `VEBA` Database (`VDB_v5.2`):
  - Added `CAZy`
  - Added `MicrobeAnnotator-KEGG`
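The marker-protein idea behind `marker_gene_clustering.py` (present in all genomes of a genome cluster, absent from every other genome cluster) can be sketched with plain sets. This is an illustration under assumptions, not the script's implementation: the function names and the nested-dictionary input are hypothetical, and the `minimum_core_prevalence` argument simply borrows the name of the `--minimum_core_prevalence` option listed in the v1.3.0 details.

```python
def core_protein_clusters(genome_to_clusters, minimum_core_prevalence=1.0):
    """Protein clusters found in at least the given fraction of genomes.

    genome_to_clusters: {genome_id: set of protein-cluster ids} for the
    genomes of ONE genome cluster (hypothetical input layout).
    """
    number_of_genomes = len(genome_to_clusters)
    counts = {}
    for clusters in genome_to_clusters.values():
        for cluster in clusters:
            counts[cluster] = counts.get(cluster, 0) + 1
    return {
        cluster
        for cluster, n in counts.items()
        if n / number_of_genomes >= minimum_core_prevalence
    }


def marker_protein_clusters(slc_to_genomes, minimum_core_prevalence=1.0):
    """Core protein clusters that never occur in any other genome cluster."""
    core = {
        slc: core_protein_clusters(genomes, minimum_core_prevalence)
        for slc, genomes in slc_to_genomes.items()
    }
    # Every protein cluster observed anywhere in each genome cluster
    observed = {
        slc: set().union(*genomes.values())
        for slc, genomes in slc_to_genomes.items()
    }
    markers = {}
    for slc, core_set in core.items():
        elsewhere = set().union(*(observed[o] for o in observed if o != slc))
        markers[slc] = core_set - elsewhere
    return markers


# Hypothetical toy data: two genome clusters with two genomes each
example = {
    "SLC1": {"g1": {"A", "B", "C"}, "g2": {"A", "B"}},
    "SLC2": {"g3": {"B", "D"}, "g4": {"B", "D", "E"}},
}
print(marker_protein_clusters(example))
```

In the toy data, `B` is core in both genome clusters and therefore is not a marker for either, while `A` and `D` are core and unique to one genome cluster each.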
Release v1.3.0 Details
- Updated `annotate.py` and `merge_annotations.py` to handle `CAZy`. They also properly address clustered protein annotations now.
- Added `module_completion_ratio.py` script which is a fork of `MicrobeAnnotator` `ko_mapper.py`. Also included a database Zenodo: 10020074 which will be included in `VDB_v5.2`
- Added a checkpoint for `tRNAscan-SE` in `binning-prokaryotic.py` and `eukaryotic_gene_modeling_wrapper.py`.
- Added `profile-pathway.py` module and `VEBA-profile_env` environments which is a wrapper around `HUMAnN` for the custom database created from `annotate.py` and `compile_custom_humann_database_from_annotations.py`
- Added `GenoPype version` to log output
- Added `merge_genome_quality.py` which combines `CheckV`, `CheckM2`, and `BUSCO` results.
- Added `compile_custom_humann_database_from_annotations.py` which compiles a `HUMAnN` protein database table from the output of `annotate.py` and taxonomy classifications.
- Added functionality to `merge_taxonomy_classifications.py` to allow for `--no_domain` and `--no_header` which will serve as input to `compile_custom_humann_database_from_annotations.py`
- Added `marker_gene_clustering.py` script which gets core marker genes unique to each SLC (i.e., pangenome). Also added `average_number_of_copies_per_genome` to protein clusters.
- Added `--minimum_core_prevalence` in `global_clustering.py`, `local_clustering.py`, and `cluster.py` which indicates the prevalence ratio at which protein clusters in an SLC will be considered core. Also removed `--no_singletons` from `cluster.py` to avoid complications with marker genes. Relabeled `--input` to `--genomes_table` in clustering scripts/module.
- Added a check in `coverage.py` to see if the `mapped.sorted.bam` files are created; if they are, then skip them. Not yet implemented for GNU parallel option.
- Changed default representative sequence format from table to fasta for `mmseqs2_wrapper.py`.
- Added `--nucleotide_fasta_output` to `antismash_genbank_to_table.py` which outputs the actual BGC DNA sequence. Changed `--fasta_output` to `--protein_fasta_output` and added output to `biosynthetic.py`. Changed BGC component identifiers to `[bgc_id]_[position_in_bgc]|[start]:[end]([strand])` to match with `MetaEuk` identifiers. Changed `bgc_type` to `protocluster_type`. `biosynthetic.py` now supports GFF files from `MetaEuk` (exon and gene features not supported by `antiSMASH`). Fixed error related to `antiSMASH` adding CDS (i.e., `allorf_[start]_[end]`) that are not in GFF so `antismash_genbank_to_table.py` failed in those cases.
- Added `ete3` to `VEBA-phylogeny_env.yml` and automatically renders trees to PDF.
- Added presets for `MEGAHIT` using the `--megahit_preset` option.
- The change for using `--mash_db` with `GTDB-Tk` violated the assumption that all prokaryotic classifications had a `msa_percent` field which caused the cluster-level taxonomy to fail. `compile_prokaryotic_genome_cluster_classification_scores_table.py` fixes this by using `fastani_ani` as the weight when genomes were classified using ANI and `msa_percent` for everything else. Initial error caused unclassified prokaryotic for all cluster-level classifications.
- Fixed small error where empty gff files with an asterisk in the name were created for samples that didn't have any prokaryotic MAGs.
- Fixed critical error where descriptions in header were not being removed in `eukaryota.scaffolds.list` and did not remove eukaryotic scaffolds in `seqkit grep` so `DAS_Tool` output eukaryotic MAGs in `identifier_mapping.tsv` and `__DASTool_scaffolds2bin.no_eukaryota.txt`
- Fixed `krona.html` in `biosynthetic.py` which was being created incorrectly from `compile_krona.py` script.
- Create `pangenome_core_sequences` in `global_clustering.py` and `local_clustering.py` which writes both protein and CDS sequences for each SLC. Also made default in `cluster.py` to NOT do local clustering switching `--no_local_clustering` to `--local_clustering`.
- Fixed `pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects` in `biosynthetic.py` when `Diamond` finds multiple regions in one hit that matches. Added `--sort_by` and `--ascending` to `concatenate_dataframes.py` along with automatic detection and removal of duplicate indices. Also added `--sort_by bitscore` in `biosynthetic.py`.
- Added core pangenome and singleton hits to clustering output
- Updated `--megahit_memory` default from 0.9 to 0.99
- Fixed error in `genomad_taxonomy_wrapper.py` where `viral_taxonomy.tsv` should have been `taxonomy.tsv`.
- Fixed minor error in `assembly.py`, caused by incorrect string formatting, that was preventing users from using `SPAdes` programs other than `spades.py`, `metaspades.py`, or `rnaspades.py`.
- Updated `bowtie2` in preprocess, assembly, and mapping modules. Updated `fastp` and `fastq_preprocessor` in preprocess module.
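The BGC component identifier layout mentioned above, `[bgc_id]_[position_in_bgc]|[start]:[end]([strand])`, can be split back into its fields with a short parser. The sketch below is an illustration only: `parse_component_id` is a hypothetical helper, it assumes that `|` never occurs inside the genome or contig identifier, and it accepts both `:` and the older `-` between start and end so that the example identifier from the v1.1.0 notes still parses.

```python
import re

# start, separator (':' or the older '-'), end, and strand in parentheses
LOCATION_PATTERN = re.compile(r"(\d+)[-:](\d+)\(([+-])\)")


def parse_component_id(component_id):
    """Split a BGC component identifier into its parts (hypothetical helper)."""
    genome_id, contig_id, region_field, location = component_id.split("|")
    # 'region001_1' -> region 'region001', 1st gene in the cluster
    region_id, position = region_field.rsplit("_", 1)
    match = LOCATION_PATTERN.fullmatch(location)
    if match is None:
        raise ValueError(f"Unrecognized location field: {location!r}")
    start, end, strand = match.groups()
    return {
        "bgc_id": "|".join([genome_id, contig_id, region_id]),
        "genome_id": genome_id,
        "contig_id": contig_id,
        "region_id": region_id,
        "position_in_bgc": int(position),
        "start": int(start),
        "end": int(end),
        "strand": strand,
    }


# Example identifier taken from the v1.1.0 release notes
example = (
    "SRR17458614__CONCOCT__P.2__9|NODE_3319_length_2682_cov_2.840502"
    "|region001_1|2-2681(+)"
)
for key, value in parse_component_id(example).items():
    print(key, value, sep="\t")
```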
Release v1.2.0 Highlights:
- `VEBA` Modules:
  - Updated `GTDB-Tk`, which now uses `Mash` for ANI screening to speed up classification (now provided in `VDB_v5.1` database)
  - rRNA and tRNA are identified for prokaryotic and eukaryotic genomes via `BARRNAP` and `tRNAscan-SE`
  - Eukaryotic genes (CDS, rRNA, tRNA) are analyzed separately for nuclear, mitochondrion, and plastid sequences
  - Genome GFF files include contigs, CDS, rRNA, and tRNA with tags for mitochondrion and plastids when applicable
  - Clustering automatically generates pangenome protein prevalence tables for each genome cluster
  - Ratios of singletons in each genome are now calculated
  - Virulence factor database (`VFDB`) is now included in annotations
  - UniRef50/90 is now included in annotations
  - `Krona` plots are generated for taxonomy classifications and biosynthetic gene cluster detection
  - Fixed a minor issue in `biosynthetic.py` where the fasta and genbank files were not properly symlinked. Also added virulence factor results to synopsis.
- `VEBA` Database (`VDB_v5.1`):
  - Added `VFDB`
  - Updated `GTDB v207_v2 → v214.1`
  - Changed `NR → UniRef50/90`
  - Deprecated `RefSeq non-redundant` in place of `UniRef`
Release v1.2.0 Details
- Fixed minor error in `binning-prokaryotic.py` where the `--veba_database` argument wasn't utilized and only the environment variable `VEBA_DATABASE` could be used.
- Updated the Docker images to have `/volumes/input`, `/volumes/output`, and `/volumes/database` directories to mount.
- Replaced `prodigal` with `pyrodigal` as it is faster and under active development.
- Added support for missing classifications in `compile_krona.py` and `consensus_genome_classification.py`.
- Updated `GTDB-Tk` from version `2.1.3` → `2.3.0` and `GTDB` from version `r202_v2` → `r214`. Changed `${VEBA_DATABASE}/Classify/GTDBTk` → `${VEBA_DATABASE}/Classify/GTDB`. Added `gtdb_r214.msh` to `GTDB` database for ANI screening.
- Added pangenome and singularity tables to `cluster.py` (and associated global/local clustering scripts) to output automatically.
- Added `compile_gff.py` to merge CDS, rRNA, and tRNA GFF files. Used in `binning-prokaryotic.py` and `binning-viral.py`. `binning-eukaryotic.py` uses the source of this in the backend of `filter_busco_results.py`. Includes GC content for contigs and various tags.
- Updated `BUSCO v5.3.2 -> v5.4.3` which changes the json output structure and made the appropriate changes in `filter_busco_results.py`.
- Added `eukaryotic_gene_modeling_wrapper.py` which 1) splits nuclear, mitochondrial, and plastid genomes; 2) performs gene modeling via `MetaEuk` and `Pyrodigal`; 3) performs rRNA detection via `BARRNAP`; 4) performs tRNA detection via `tRNAscan-SE`; 5) merges processed GFF files; and 6) calculates sequence statistics.
- Added `gene_biotype=protein_coding` to `P(y)rodigal(-GV)` GFF output.
- Added `VFDB` to `annotate.py` and database.
- Compiled and pushed `gtdb_r214.msh` mash file to Zenodo:8048187 which is now used by default in `classify-prokaryotic.py`. It is now included in `VDB_v5.1`.
- Cleaned up global and local clustering intermediate files. Added pangenome tables and singleton information to outputs.
Release v1.1.2 Details
- Created Docker images for all modules
- Replaced all absolute path symlinks with relative symlinks
- Changed `prokaryotic_taxonomy.tsv` and `prokaryotic_taxonomy.clusters.tsv` in `classify-prokaryotic.py` (along with eukaryotic and viral) files to `taxonomy.tsv` and `taxonomy.clusters.tsv` for uniformity.
- Updating all symlinks to relative links (also in `fastq_preprocessor`) to prepare for dockerization and updating all environments to use updated GenoPype 2023.4.13.
- Changed `nr` to `uniref` in `annotate.py` and added `propagate_annotations_from_representatives.py` script while simplifying `merge_annotations_and_taxonomy.py` to `merge_annotations.py` and excluding taxonomy operations.
- Changed `nr` to `UniRef90` and `UniRef50` in `VDB_v5`
- Changed `orfs_to_orthogroups.tsv` to `proteins_to_orthogroups.tsv` for consistency with the `cluster.py` module. Will eventually find some consistency with `scaffolds_to_bins/scaffolds_to_mags` but this will be later.
- Added a `scaffolds_to_mags.tsv` in the clustering output.
- Added `convert_counts_table.py` which converts a counts table (and metadata) to Pandas pickle, Anndata h5ad, or Biom hdf5
- Fixed output directory for `mapping.py` which now uses `output_directory/${NAME}` structure like `binning-*.py`.
- Removed "python" prefix for script calls and now uses shebang in script for executable. Also added single quotes around script filepath (e.g., `'[script_filepath]'`) to escape characters/spaces in filepath.
- Added support for `index.py` to accept individual `--references [file.fasta]` and `--gene_models [file.gff]`.
- Added `stdin` support for `scaffolds_to_bins.py` along with the ability to input genome tables [id_genome][filepath]. Also added progress bars.
- As a result of issues/22, `assembly.py`, `assembly-sequential.py`, `binning-*.py`, and `mapping.py` will use `-p --countReadPairs` for `featureCounts` and updates `subread 2.0.1 → subread 2.0.3`. For `binning-*.py`, long reads can be used with the `--long_reads` flag.
- Updated `cluster.py` and associated `global_clustering.py`/`local_clustering.py` scripts to use `mmseqs2_wrapper.py` which now automatically outputs representative sequences.
- Added `check_fasta_duplicates.py` script that gives `0` and `1` exit codes for fasta without and with duplicates, respectively. Added `reformat_representative_sequences.py` to reformat representative sequences from `MMSEQS2` into either a table or fasta file where the identifiers are cluster labels. Removed `--dbtype` from `[global/local]_clustering.py`. Removed appended prefix for `.graph.pkl` and `dict.pkl` in `edgelist_to_clusters.py`. Added `mmseqs2_wrapper.py` and `hmmer_wrapper.py` scripts.
- Added an option to `merge_generalized_mapping.py` to include the sample index in a filepath and also an option to remove empty features (useful for Salmon). Added an `executable='/bin/bash'` option to the `subprocess.Popen` calls in `GenoPype` to address issues/23.
- Added `genbanks/[id_genome]/` to output directory of `biosynthetic.py` which has symlinks to all the BGC genbanks from `antiSMASH`.
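The exit-code convention described for `check_fasta_duplicates.py` (`0` without duplicates, `1` with duplicates) might look roughly like the sketch below. It is not the script itself: the helper names are hypothetical, and it assumes that "duplicates" means repeated sequence identifiers, taken as the first whitespace-delimited token of each header line.

```python
import io


def duplicate_identifiers(handle):
    """Return the set of identifiers that occur more than once in a FASTA."""
    seen = set()
    duplicates = set()
    for line in handle:
        if line.startswith(">"):
            # Identifier = first token after '>', description is ignored
            fields = line[1:].split()
            identifier = fields[0] if fields else ""
            if identifier in seen:
                duplicates.add(identifier)
            seen.add(identifier)
    return duplicates


def exit_code(handle):
    """0 when every identifier is unique, 1 when at least one is repeated."""
    return 1 if duplicate_identifiers(handle) else 0


# Hypothetical example data
clean = io.StringIO(">contig_1 sample=A\nACGT\n>contig_2\nGGCC\n")
dirty = io.StringIO(">contig_1\nACGT\n>contig_1 copy\nGGCC\n")
print(exit_code(clean), exit_code(dirty))
```

In a command-line script, the returned value would typically be passed to `sys.exit` so that shell pipelines can branch on it.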
Release v1.1.1 Details
- Most important update includes fixing a broken VEBA-`binning-viral.yml` install recipe which had package conflicts for `aria2` 30e8b0a.
- Fixes on conda-related environment variables in the install scripts.
- Added `MIBiG` to database and `annotate.py`
- Added a composite label for annotations in `annotate.py`
- Added `--dastool_minimum_score` to `binning-prokaryotic.py` module
- Added a wrapper around `STAR` aligner
- Updated `merge_generalized_mapping.py` script to take in BAM files instead of being dependent on a specific directory.
- Added option to have no header in `subset_table.py`
Release v1.1.0 Details
- Modules:
  - `annotate.py`
    - Added `NCBIfam-AMRFinder` AMR domain annotations
    - Added `AntiFam` contamination annotations
    - Uses `taxopy` instead of `ete3` in backend with `merge_annotations_and_score_taxonomy.py`
  - `assembly.py`
    - Added a `transcripts_to_genes.py` script which creates a `genes_to_transcripts.tsv` table that can be used with `TransDecoder`.
  - `binning-prokaryotic.py`
    - Updated `CheckM` → `CheckM2`. This removes the dependency of `GTDB-Tk` and EXTREMELY REDUCES compute resource requirements (e.g., memory and time) as `CheckM2` automatically handles candidate phyla radiation. With this, several backend scripts were deprecated. This cleans up the binning pipeline and error messages SUBSTANTIALLY.
    - Uses `binning_wrapper.py` for all binning. This makes it easier to add new binning algorithms in the future (e.g., `VAMB`). Also, check out the new multi-split binning functionality described below.
    - Added `--skip_concoct` in addition to the already existing `--skip_maxbin2` option as `MaxBin2` takes very long when there are a lot of contigs and `CONCOCT` takes a long time when there are a lot of samples (i.e., BAM files). `MetaBAT2` is not optional.
  - `binning-viral.py`
    - Complete rewrite of this module which now uses `geNomad` as the default binning algorithm but still supports `VirFinder`.
    - If `VirFinder` is used, `genomad annotate` is run via the `genomad_taxonomy_wrapper.py` script included in the update.
    - Updated `Prodigal` → `Prodigal-GV` to handle additional viral genetic codes.
  - `biosynthetic.py`
    - Introduces `component_id` and `bgc_id` which are unique, parseable, and informative. For example, `component_id = SRR17458614__CONCOCT__P.2__9|NODE_3319_length_2682_cov_2.840502|region001_1|2-2681(+)` contains the unique `bgc_id` (i.e., `SRR17458614__CONCOCT__P.2__9|NODE_3319_length_2682_cov_2.840502|region001`), shows that it is the 1st gene in the cluster (the `_1` in `region001_1`), and the gene start/end/strand. The `bgc_id` is composed of the `genome_id|contig_id|region_id`.
  - `classify-prokaryotic.py`
    - Updated `GTDB-Tk v2.1.1` → `GTDB-Tk v2.2.3`. For now, `--skip_ani_screen` is the only option because of this thread. However, `--mash_db` may be an option in the near future.
    - Added functionality to classify prokaryotic genomes that were not binned via `VEBA` which is available with the `--genomes` option (`--prokaryotic_binning_directory` is still available which can leverage existing intermediate files).
  - `classify-eukaryotic.py`
    - Added functionality to classify eukaryotic genomes that were not binned via `VEBA` which is available with the `--genomes` option (`--eukaryotic_binning_directory` is still available which can leverage existing intermediate files). This is implemented by using the `eukaryota_odb10` markers from the `VEBA Microeukaryotic Database` to substantially improve performance and decrease resources required for gene models.
  - `classify-viral.py`
    - Complete rewrite of this module which does not rely on (deprecated) intermediate files from `CheckV`.
    - Uses taxonomy generated from `geNomad` and `consensus_genome_classification_unranked.py` (a wrapper around `taxopy`) that can handle the chaotic taxonomy of viruses.
    - Added functionality to classify viral genomes that were not binned via `VEBA` which is available with the `--genomes` option (`--viral_binning_directory` is still available which can leverage existing intermediate files).
  - `cluster.py`
    - Complete rewrite of this module which now uses `MMSEQS2` as the orthogroup detection algorithm instead of `OrthoFinder`. `OrthoFinder` is overkill for creating protein clusters and it generates thousands of intermediate files (e.g., fasta, alignments, trees, etc.) which substantially increases the compute time. `MMSEQS2` has very similar performance with a fraction of the resources and compute time. Clustered the entire Plastisphere dataset on a local machine in ~30 minutes compared to several days on an HPC.
    - Now that the resources are minimal, clustering is performed at global level as before (i.e., all samples in the dataset) and now at the local level, optionally but ON by default, which clusters all genomes within a sample. Accompanying wrapper scripts are `global_clustering.py` and `local_clustering.py`.
    - The genomic and functional feature compression ratios (FCR) (described here) are now calculated automatically. The calculation is `1 - number_of_clusters/number_of_features` which can easily be converted into an unsupervised biodiversity metric. This is calculated at the global (original implementation) and local levels.
    - Input is now a table with the following columns: `[organism_type]<tab>[id_sample]<tab>[id_mag]<tab>[genome]<tab>[proteins]` and is generated easily with the `compile_genomes_table.py` script. This allows clustering to be performed for prokaryotes, eukaryotes, and viruses all at the same time.
    - SLC-specific orthogroups (SSO) are now referred to as SLC-specific protein clusters (SSPC).
    - Support zfilling (e.g., `zfill=3, SLC7 → SLC007`) for genomic and protein clusters.
    - Deprecated `fastani_to_clusters.py` to now use the more generalizable `edgelist_to_clusters.py` which is used for both genomic and protein clusters. This also outputs a `NetworkX` graph and a pickled dictionary `{"cluster_a":{"component_1", "component_2", ..., "component_n"}}`
  - `phylogeny.py`
    - Updated `MUSCLE` to `v5` which has `-align` and `-super5` algorithms which are now accessible with `--alignment_algorithm`. Cannot use `stdin` so now the fasta files are not gzipped. The `merge_msa.py` now outputs uncompressed fasta as default and can output gzipped with the `--gzip` flag.
- `VEBA Database`: `VDB_v3.1` → `VDB_v4`
  - Updated `CheckV DB v1.0` → `CheckV DB v1.5`
  - Added `geNomad DB v1.2`
  - Added `CheckM2 DB`
  - Removed `CheckM DB`
  - Removed `taxa.sqlite` and `taxa.sqlite.traverse.pkl`
  - Added `reference.eukaryota_odb10.list` and corresponding `MMSEQS2` database (i.e., `microeukaryotic.eukaryota_odb10`)
  - Added `NCBIfam-AMRFinder` marker set for annotation
  - Added `AntiFam` marker set for contamination
  - Marker sets HMMs are now all gzipped (previously could not gzip because of the `CheckM` CPR workflow)
- Scripts:
  - Added:
    - `append_geneid_to_transdecoder_gff.py`
    - `bowtie2_wrapper.py`
    - `compile_genomes_table.py`
    - `consensus_genome_classification_unranked.py`
    - `cut_table.py`
    - `cut_table_by_column_labels.py`
    - `drop_missing_values.py`
    - `edgelist_to_clusters.py`
    - `filter_checkm2_results.py`
    - `genomad_taxonomy_wrapper.py`
    - `global_clustering.py`
    - `local_clustering.py`
    - `partition_multisplit_bins.py`
    - `scaffolds_to_clusters.py`
    - `scaffolds_to_samples.py`
    - `transcripts_to_genes.py`
    - `transdecoder_wrapper.py` (Note: Requires separate environment to run due to dependency conflicts)
  - Updated:
    - `antismash_genbanks_to_table.py` - Added option to output biosynthetic gene cluster (BGC) fasta. Adds unique (and parseable) BGC identifiers making the output much more useful.
    - `binning_wrapper.py` - This binning wrapper now includes functionality to use multi-split binning (i.e., concatenate contigs from different assemblies, map all reads to the contigs, bin all together, and then partition bins by sample). This concept AFAIK was first introduced in the `VAMB` paper.
    - `compile_reads_table.py` - Minimal change but now the extension excludes the `.` to make usage more consistent with other tools.
    - `consensus_genome_classification.py` - Changed the output to match that of `consensus_genome_classification_unranked.py`.
    - `filter_checkv_results.py` - Option to use taxonomy and viral summaries generated by `geNomad`.
    - `scaffolds_to_bins.py` - Support for getting scaffolds to bins for a list of genomes via `--genomes` argument while maintaining original support with `--binning_directory` argument.
    - `subset_table.py` - Added option to set index column and to drop duplicates.
    - `virfinder_wrapper.r` - Used to be `VirFinder_wrapper.R`. This now has an option to use FDR values instead of P values.
    - `merge_annotations_and_score_taxonomy.py` - Completely rewritten. Uses `taxopy` instead of `ete3`.
    - `merge_msa.py` - Output uncompressed protein fasta files by default and can compress with `--gzip` flag.
  - Deprecated:
    - `adjust_genomes_for_cpr.py`
    - `filter_checkm_results.py`
    - `fastani_to_clusters.py`
    - `partition_orthogroups.py`
    - `partition_clusters.py`
    - `compile_viral_classifications.py`
    - `build_taxa_sqlite.py`
- Miscellaneous:
  - Updated environments and now add versions to environments.
  - Added `mamba` to installation to speed up.
  - Added `transdecoder_wrapper.py` which is a wrapper around `TransDecoder` with direct support for `Diamond` and `HMMSearch` homology searches. Also includes `append_geneid_to_transdecoder_gff.py` which is run in the backend to clean up the GFF file and make them compatible with what is output by `Prodigal` and `MetaEuk` runs of `VEBA`.
  - Added support for using `n_jobs -1` to use all available threads (similar to `scikit-learn` methodology).
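Three pieces of the v1.1.0 `cluster.py` notes fit together in one small example: grouping an edgelist into clusters, zero-padding the cluster labels, and computing the feature compression ratio `1 - number_of_clusters/number_of_features`. The sketch below is an assumption-laden illustration rather than the code in `edgelist_to_clusters.py`: it treats clusters as connected components, uses a plain union-find instead of `NetworkX`, and numbers labels from 1 by decreasing cluster size, none of which is confirmed by the notes.

```python
def edgelist_to_clusters(edges, nodes=(), prefix="SLC", zfill=3):
    """Group nodes into connected components with zero-padded labels."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Register every node so that singletons become their own cluster
    for node in nodes:
        find(node)
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    components = {}
    for node in list(parent):
        components.setdefault(find(node), set()).add(node)

    # Deterministic order: largest cluster first, ties broken by smallest member
    ordered = sorted(components.values(), key=lambda c: (-len(c), min(c)))
    return {
        f"{prefix}{str(i).zfill(zfill)}": members
        for i, members in enumerate(ordered, start=1)
    }


def feature_compression_ratio(number_of_clusters, number_of_features):
    """FCR as given in the notes: 1 - number_of_clusters/number_of_features."""
    return 1 - number_of_clusters / number_of_features


# Hypothetical genomes linked when they pass an ANI threshold
genomes = ["g1", "g2", "g3", "g4", "g5", "g6"]
edges = [("g1", "g2"), ("g2", "g3"), ("g4", "g5")]
clusters = edgelist_to_clusters(edges, nodes=genomes)
print(clusters)
print(feature_compression_ratio(len(clusters), len(genomes)))
```

Six genomes collapse into three clusters here, so the ratio is 0.5. The same function applies to protein clusters by passing the number of protein clusters and the number of proteins.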
Release v1.0.4 Details
- Added `biopython` to `VEBA-assembly_env` which is needed when running `MEGAHIT` as the scaffolds are rewritten and an error was raised. aea51c3
- Updated Microeukaryotic protein database to exclude a few higher eukaryotes that were present in database, changed naming scheme to hash identifiers (from `cat reference.faa | seqkit fx2tab -s -n > id_to_hash.tsv`). Switching database from FigShare to Zenodo. Uses database version `VDB_v3` which has the updated microeukaryotic protein database (`VDB-Microeukaryotic_v2`) 0845ba6
Release v1.0.3e Details
- Patch fix for `install_veba.sh` where `install/environments/VEBA-assembly_env.yml` raised a compatibility error when creating the `VEBA-assembly_env` environment. c2ab957
- Patch fix for `VirFinder_wrapper.R` where `__version__ =` variable was throwing an R error when running `binning-viral.py` module. 19e8f38
- Patch fix for `filter_busco_results.py` where an error arose that produced empty `identifier_mapping.metaeuk.tsv` subset tables. 359e4569
- Patch fix for `compile_metaeuk_identifiers.py` where a Python error arose when duplicate gene identifiers were present. c248527
- Patch fix for `install_veba.sh` where `install/environments/VEBA-preprocess_env.yml` raised a compatibility error when creating the `VEBA-preprocess_env` environment 8ed6eea
- Added `biosynthetic.py` module which runs antiSMASH and converts genbank files to tabular format. 6c0ed82
- Added `megahit` support for `assembly.py` module (not yet available in `assembly-sequential.py`). 6c0ed82
- Changed `-P/--spades_program` to `-P/--program` for `assembly.py`. 6c0ed82
- Replaced penultimate step in `binning-prokaryotic.py` to use `adjust_genomes_for_cpr.py` instead of the extremely long series of bash commands. This will make it easier to diagnose errors in this critical step. 6c0ed82
- Added support for contig descriptions and added MAG identifier in fasta files in `binning-eukaryotic.py`. Now uses the `metaeuk_wrapper.py` script for the `MetaEuk` step. 6c0ed82
- Added separate option of `--run_metaplasmidspades` for `assembly-sequential.py` instead of making it mandatory (now it just runs `biosyntheticSPAdes` and `metaSPAdes` by default). 6c0ed82
- Added `--use_mag_as_description` in `parition_gene_models.py` script to include the MAG identifier in the contig description of the fasta header which is default in `binning-prokaryotic.py`. 6c0ed82
- Added `adjust_genomes_for_cpr.py` script to make the CPR adjustment step of `binning-prokaryotic.py` easier to run and understand. 6c0ed82
- Added support for fasta header descriptions in `binning-prokaryotic.py`. 6c0ed82
- Added functionality to `replace_fasta_descriptions.py` script to be able to use a string for replacing fasta headers in addition to the original functionality. 6c0ed82
Release v1.0.2a Details
- Updated GTDB-Tk in `VEBA-binning-prokaryotic_env` from `1.x` to `2.x` (this version uses much less memory): f3507dd
- Updated the GTDB-Tk database from `R202` to `R207_v2` to be compatible with GTDB-Tk v2.x: f3507dd
- Updated the GRCh38 no-alt analysis set to T2T CHM13v2.0 for the default human reference: 5ccb4e2
- Added an experimental `amplicon.py` module for short-read ASV detection via the DADA2 workflow of QIIME2: cd4ed2b
- Added additional functionality to `compile_reads_table.py` to handle advanced parsing of samples from fastq directories while also maintaining support for parsing filenames from `veba_output/preprocess`: cd4ed2b
- Added `sra-tools` to `VEBA-preprocess_env`: f3507dd
- Fixed symlinks to scripts for `install_veba.sh`: d1fad03
- Added missing `CHECKM_DATA_PATH` environment variable to `VEBA-binning-prokaryotic_env` and `VEBA-classify_env`: d1fad03
Release v1.0.1 Details
Release v1.0.0 Details
- Released with BMC Bioinformatics publication (doi:10.1186/s12859-022-04973-8).
Check:
- Start/end positions on
MetaEuk
gene ID might be off.
Critical:
- Return code for
cluster.py
when it fails during global and local clustering is 0 but should be 1. - Don't load all genomes, proteins, and cds into memory for clustering.
- Genome checkpoints in
tRNAscan-SE
aren't working properly. - Dereplicate CDS sequences in GFF from
MetaEuk
forantiSMASH
to work for eukaryotic genomes
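The first Critical item above (a failing clustering step that still reports return code 0) comes down to propagating a child process's exit status. The following is a minimal sketch of that idea, not VEBA's implementation: `run_step` is a hypothetical helper, and it assumes each step is launched as a subprocess.

```python
import subprocess
import sys


def run_step(command):
    """Run one pipeline step and return 0 on success or 1 on any failure.

    `run_step` is a hypothetical helper used only for illustration.
    """
    completed = subprocess.run(command)
    # Collapse every non-zero child status into 1 so a wrapper script
    # can pass it straight to sys.exit() instead of silently exiting with 0.
    return 0 if completed.returncode == 0 else 1


# Stand-in for a failing clustering step (exits with status 3)
failing_step = [sys.executable, "-c", "import sys; sys.exit(3)"]
status = run_step(failing_step)
```

A wrapper would then finish with `sys.exit(status)` so that schedulers and workflow managers see the failure.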
Definitely:
- Add number of unique protein clusters to
identifier_mapping.genomes.tsv.gz
incluster.py
to assess the most metabolically diverse representative. - Add
--proteins
option toclassify-eukaryotic.py
which aligns proteins toMicroEuk100.eukaryota_odb10
viaMMseqs2
and then proceeds with the pipeline. - Add
BiNI
biosynthetic novelty index tobiosynthetic.py
busco_wrapper.py
that relabels all the genes, runs analysis, then converts output to tsv.- Script to update genome clusters
- Script to update protein clusters
- Script to add
Diamond
orHMMSearch
annotations toannotations.proteins.tsv.gz
- Add
convert_reads_long_to_short.py
which will take windows of 150 bp for the long reads. - Add option to
compile_custom_humann_database_from_annotations.py
to only output best hit of a UniRef identifier per genome. - Use
pigz
instead ofgzip
- Create a taxdump for
MicroEuk
- Reimplement
compile_eukaryotic_classifications.py
- Add representative to
identifier_mapping.proteins.tsv.gz
- Use
aria2
in parallel instead ofwget
. - Add support for
Salmon
inmapping.py
andindex.py
. This can be used instead ofSTAR
which will require adding theexon
field toPyrodigal
GFF file (MetaEuk
modified GFF files already have exon ids). - [Optional] Number of plasmids (via
geNomad
) for each MAG.
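The proposed convert_reads_long_to_short.py item above describes taking 150 bp windows from long reads. One way this could look is sketched below. The function name, the non-overlapping default step, and the choice to drop a shorter trailing window are all assumptions, since the item does not specify them.

```python
def window_read(sequence, window_size=150, step=None, keep_partial=False):
    """Split one long read into fixed-size windows.

    Hypothetical helper: returns a list of (start, subsequence) tuples.
    `step` defaults to `window_size`, which gives non-overlapping windows.
    """
    if step is None:
        step = window_size
    windows = []
    for start in range(0, len(sequence), step):
        window = sequence[start:start + window_size]
        # Keep full-length windows; keep shorter ones only when requested
        if len(window) == window_size or (keep_partial and window):
            windows.append((start, window))
    return windows


# A 400 bp read yields two full 150 bp windows (the last 100 bp are dropped)
example_windows = window_read("A" * 400)
```

Quality strings from a fastq record could be windowed with the same start offsets.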
Eventually (Yes)?:
NextFlow
support- Install each module via
bioconda
- Consistent usage of the following terms: 1) dataframe vs. table; 2) protein-cluster vs. orthogroup. Dataframes should refer to generic tables while tables refer to specifics like "genomes table".
- Add coding density to GFF files
- Run
cmsearch
beforetRNAscan-SE
- dN/dS from pangenome analysis
- Add a
metabolic.py
module - For viral binning, take contigs that are not identified as viral via
geNomad -> CheckV
and use them with vRhyme
. - Add
vRhyme
tobinning_wrapper.py
and supportvRhyme
inbinning-viral.py
.
...Maybe (Not)?
- Swap
TransDecoder
forTransSuite
Developmental:
- Error with
amplicon.py
that works when run manually... (Developmental module)
There was a problem importing veba_output/misc/reads_table.tsv:
Missing one or more files for SingleLanePerSamplePairedEndFastqDirFmt: '.+_.+_L[0-9][0-9][0-9]_R[12]_001\\.fastq\\.gz'
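The message above quotes the filename pattern required by the format it names. A quick way to see which fastq filenames would not satisfy that pattern is sketched below. This only mirrors the regular expression from the message (the doubled backslashes there are string escaping); the function name is hypothetical, and whether the importer matches the full filename is an assumption.

```python
import re

# Pattern copied from the error message above
CASAVA_PATTERN = re.compile(r".+_.+_L[0-9][0-9][0-9]_R[12]_001\.fastq\.gz")


def non_matching_filenames(filenames):
    """Return the filenames that do not fully match the quoted pattern."""
    return [name for name in filenames if CASAVA_PATTERN.fullmatch(name) is None]


# Hypothetical filenames for illustration
candidates = ["S1_0_L001_R1_001.fastq.gz", "S1_R1.fastq.gz"]
offenders = non_matching_filenames(candidates)
```

Here `offenders` holds the names that would need renaming or a different import format.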
Daily Change Log:
- [2024.9.21] - Added
KEGG Pathway Profiler
toVEBA-database_env
andVEBA-annotate_env
which replacesMicrobeAnnotator-KEGG
for module completion ratios. Replacing${VEBA_DATABASE}/Annotate/MicrobeAnnotator-KEGG
with${VEBA_DATABASE}/Annotate/KEGG-Pathway-Profiler/
database files. Note: The new module completion ratio output does not have class labels for KEGG modules. - [2024.8.30] - Added ${N_JOBS} to download scripts with the default set to the maximum number of threads available
- [2024.8.29] - Added
VERSION
file created indownload_databases.sh
- [2024.7.11] - The alignment fraction threshold for genome clustering was only applied to the reference but should also apply to the query. Added
--af_mode
with either relaxed = max([Alignment_fraction_ref, Alignment_fraction_query]) > minimum_af
or strict = (Alignment_fraction_ref > minimum_af) & (Alignment_fraction_query > minimum_af)
to edgelist_to_clusters.py
,global_clustering.py
,local_clustering.py
, andcluster.py
. - [2024.7.3] - Added
pigz
toVEBA-annotate_env
which isn't a problem with mostconda
installations but needed fordocker
containers. - [2024.6.21] - Changed
choose_fastest_mirror.py
todetermine_fastest_mirror.py
- [2024.6.20] - Added
-m/--include_mrna
tocompile_metaeuk_identifiers.py
for Issue #110 - [2024.6.7] - Adapted
phylogeny.py
andpartition_pyhmmsearch.py
to usepyhmmsearch
instead ofhmmsearch
andKofam_Scan
. - [2024.6.7] - Adapted
annotate.py
,merge_annotations.py
, andcompile_ko_from_annotations.py
to usepyhmmsearch
andpykofamsearch
instead ofhmmsearch
andKofam_Scan
. - [2024.6.6] - Changed
Diamond
output format from-f 6 qseqid sseqid stitle pident length mismatch qlen qstart qend slen sstart send evalue bitscore qcovhsp scovhsp
to-f 6 qseqid sseqid stitle pident evalue bitscore qcovhsp scovhsp
- [2024.6.6] - Adapted
classify-eukaryotic.py
to usepyhmmsearch
instead ofhmmsearch
. - [2024.6.6] - Updating
GTDB-Tk
andBUSCO
introduced conflicting dependencies. To provide more flexibility for version updates,VEBA-classify_env
has been split out intoVEBA-classify-eukaryotic_env
,VEBA-classify-prokaryotic_env
, andVEBA-classify-viral_env
. - [2024.6.5] - Update
GTDB
version fromr214.1
tor220
in VEBA database versionVDB_v7
and inclassify-prokaryotic.py
. Corresponding mash database forr220
is available here: - [2024.6.5] - Added
choose_fastest_mirror.py
to utility scripts which checks the speed of multiple urls and then outputs the fastest one. - [2024.6.5] - Removing version name from
GTDB
.msh file. Previous versions included gtdb_r214.msh
but now they will be gtdb.msh
. - [2024.5.20] - Added
reformat_minpath_report.py
to reformat minpath reports.MinPath
isn't used directly by VEBA but it might be in the future. - [2024.4.30] - Added
concatenate_files.py
which can concatenate files (including a mix of compressed and uncompressed files) using either arguments, a list file, or a glob. The reason for this is that Unix limits the number of arguments that can be passed to a command (e.g., cat *.fasta > output.fasta
where *.fasta expands to 50k files will crash). - [2024.4.29] - Added
/volumes/workspace/
directory to Docker containers for situations when your input and output directories are the same. - [2024.4.29] -
featureCounts
can only handle 64 threads at a time so addedmin(64, opts.n_jobs)
for all the modules/scripts that usefeatureCounts
commands. - [2024.4.23] - Added
uniprot_to_enzymes.py
which reformats tables and fasta from https://www.uniprot.org/uniprotkb?query=ec%3A* - [2024.4.18] - Developed a faster implementation of
KofamScan
called PyKofamSearch
which leverages PyHmmer
. This will be used in future versions of VEBA. - [2024.3.26] - Added
--metaeuk_split_memory_limit
tometaeuk_wrapper.py
. - [2024.3.26] - Added
-d/--genome_identifier_directory_index
toscaffolds_to_bins.py
for directories that are structuredpath/to/genomes/bin_a/reference.fasta
where you would use-d -2
. - [2024.3.26] - Added
--minimum_af
toedgelist_to_clusters.py
with an option to accept 4 column inputs[id_1]<tab>[id_2]<tab>[weight]<tab>[alignment_fraction]
.global_clustering.py
,local_clustering.py
, andcluster.py
now use this by default--af_threshold 30.0
. If you want to retain previous behavior, just use--af_threshold 0.0
. - [2024.3.18] -
edgelist_to_clusters.py
only includes edges where both nodes are in identifiers set. If--identifiers
are provided, then only those identifiers are used. If not, then it includes all nodes. - [2024.3.18] - Added
--export_representatives
argument foredgelist_to_clusters.py
to output table with[id_node]<tab>[id_cluster]<tab>[intra-cluster_connectivity]<tab>[representative]
. Also includes this information innx.Graph
objects. - [2024.3.18] - Changed singleton weight to
np.nan
instead ofnp.inf
foredgelist_to_clusters.py
to allow for representative calculations. - [2024.3.8] - Changed default assembly algorithm to
metaflye
instead offlye
inassembly-long.py
- [2024.3.8] - Added
number_of_genomes
,number_of_genome-clusters
,number_of_proteins
, andnumber_of_protein-clusters
tofeature_compression_ratios.tsv.gz
fromcluster.py
- [2024.3.5] - Added
-A/--from_antismash
inbiosynthetic.py
to use preexistingantiSMASH
results. Also changed-i/--input
to-i/--from_genomes
. - [2024.3.4] - Changed
antismash_genbanks_to_table.py
to biosynthetic_genbanks_to_table.py
for future support of DeepBGC
and GECCO
- [2024.2.28] - Added
busco_version
parameter tomerge_busco_json.py
with default set to5.4.x
and additional support for5.6.x
. - [2024.2.24] - Added
CONDA_ENVS_PATH
toupdate_environment_scripts.sh
,update_environment_variables.sh
, andcheck_installation.sh
- [2024.2.17] - Added
CONDA_ENVS_PATH
toveba
to allow for custom environment locations - [2024.2.16] - Changed
install.sh
to support customCONDA_ENVS_PATH
argumentbash install.sh path/to/log path/to/envs/
- [2024.2.16] - Added
merge_counts_with_taxonomy.py
- [2024.1.28] - Replaced
src/
with bin/
and added -V|--full_versions to show all VEBA versions
- [2024.1.23] - Added
compile_phylogenomic_functional_categories.py
script which automates the methodology from Espinoza et al. 2022 (doi:10.1093/pnasnexus/pgac239) - [2024.1.22] - Fixed header being offset in
annotations.protein_clusters.tsv
where it could not be read with Pandas. - [2024.1.22] - Fixed
binning-prokaryotic.py
creating non-existent symlinks named '*.gff', '*.rRNA', and '*.tRNA'. - [2024.1.16] - Fixed .strip method on Pandas series in
antismash_genbanks_to_table.py
for compatibility with antiSMASH 6 and 7
- [2024.1.7] - Fixed situation where
unbinned.fasta
is empty inbinning-prokaryotic.py
when there are no bins that pass qc. - [2024.1.7] - Fixed minor error in
coverage.py
wheresamtools sort --reference
was gettingreads_table.tsv
and notreference.fasta
- [2024.1.4] - Changed default behavior from deterministic to non-deterministic for an increase in speed in
assembly-long.py
. (i.e.,--no_deterministic
-->--deterministic
) - [2024.1.2] - Added
VeryFastTree
as an option tophylogeny.py
withFastTree
remaining as the default. - [2023.12.30] - Changed default
--leniency
parameter onclassify_eukaryotic.py
andconsensus_genome_classification_ranked.py
to1.0
and added --leniency_genome_classification
as a separate option. - [2023.12.28] - Added
--blacklist
option tocompile_eukaryotic_classifications.py
with a default value ofspecies:uncultured eukaryote
inclassify_eukaryotic.py
- [2023.12.28] - Fixed critical error where
classify_eukaryotic.py
was trying to access a deprecated database file from MicroEuk_v2. - [2023.12.22] - Fixed minor error with
eukaryotic_gene_modeling_wrapper.py
not allowing forTiara
to run in backend. - [2023.12.21] -
GTDB-Tk
changed name of archaea summary file so VEBA was not adding this to final classification. Fixed this inclassify-prokaryotic.py
. - [2023.12.20] - Fixed files not being closed in
compile_custom_humann_database_from_annotations.py
and added options to use different annotation file formats (i.e., multilevel, header, and no header). - [2023.12.15] - Added
profile-taxonomic.py
module which usessylph
to build a sketch database for genomes and queries the genome database similar toKraken
for taxonomic abundance. - [2023.12.14] - Removed requirement to have
--estimated_assembly_size
for Flye per Flye Issue #652. - [2023.12.14] - Added
sylph
toVEBA-profile_env
for abundance profiling of genomes. - [2023.12.13] - Dereplicate duplicate contigs in
concatenate_fasta.py
. - [2023.12.12] - Added
--reference_gzipped
toindex.py
andmapping.py
with new default being that the reference fasta is not gzipped. - [2023.12.11] - Added
skani
as new default for genome clustering incluster.py
,global_clustering.py
, andlocal_clustering.py
. - [2023.12.11] - Added support for long reads in
fastq_preprocessor
,preprocess.py
,assembly-long.py
, and all binning modules. - [2023.11.28] - Fixed
annotations.protein_clusters.tsv.gz
frommerge_annotations.py
added in patch update ofv1.3.1
. - [2023.11.14] - Added support for missing values in
compile_eukaryotic_classifications.py
. - [2023.11.13] - Added
--metaeuk_split_memory_limit
argument with (experimental) default set to36G
inbinning-eukaryotic.py
andeukaryotic_gene_modeling.py
. - [2023.11.10] - Added
--compressed 1
tommseqs createdb
indownload_databases.sh
installation script. - [2023.11.10] - Added a check to
check_fasta_duplicates.py
andclean_fasta.py
to make sure there are no>
characters in fasta sequence caused from concatenating fasta files that are missing linebreaks. - [2023.11.10] - Added
Diamond DeepClust
toclustering_wrapper.py
,global/local_clustering.py
, andcluster.py
. Changedmmseqs2_wrapper.py
toclustering_wrapper.py
. Changedeasy-cluster
andeasy-linclust
tommseqs-cluster
andmmseqs-linclust
. - [2023.11.9] - Fixed viral quality in
merge_genome_quality_assessments.py
- [2023.11.3] - Changed
consensus_genome_classification.py
toconsensus_genome_classification_ranked.py
. Also changed the default behavior to allow for missing taxonomic levels. - [2023.11.2] - Fixed the
merge_annotations.py
resulting in a memory leak when creating theannotations.protein_clusters.tsv.gz
output table. However, still need to correct the formatting for empty sets and string lists. - [2023.10.27] - Update
annotate.py
andmerge_annotations.py
to handleCAZy
. They also properly address clustered protein annotations now. - [2023.10.18] - Added
module_completion_ratio.py
script which is a fork ofMicrobeAnnotator
ko_mapper.py
. Also included a database Zenodo: 10020074 which will be included inVDB_v5.2
- [2023.10.16] - Added a checkpoint for
tRNAscan-SE
inbinning-prokaryotic.py
andeukaryotic_gene_modeling_wrapper.py
. - [2023.10.16] - Added
profile-pathway.py
module andVEBA-profile_env
environments which is a wrapper aroundHUMAnN
for the custom database created fromannotate.py
andcompile_custom_humann_database_from_annotations.py
- [2023.10.16] - Added
GenoPype version
to log output - [2023.10.16] - Added
merge_genome_quality.py
which combinesCheckV
,CheckM2
, andBUSCO
results. - [2023.10.11] - Added
compile_custom_humann_database_from_annotations.py
which compiles aHUMAnN
protein database table from the output ofannotate.py
and taxonomy classifications. - [2023.10.11] - Added functionality to
merge_taxonomy_classifications.py
to allow for--no_domain
and--no_header
which will serve as input tocompile_custom_humann_database_from_annotations.py
- [2023.10.5] - Added
marker_gene_clustering.py
script which gets core marker genes unique to each SLC (i.e., pangenome). Added average_number_of_copies_per_genome
to protein clusters. - [2023.10.5] - Added
--minimum_core_prevalence
inglobal_clustering.py
,local_clustering.py
, andcluster.py
which sets the minimum prevalence ratio at which protein clusters in an SLC are considered core. Also removed --no_singletons
fromcluster.py
to avoid complications with marker genes. Relabeled--input
to--genomes_table
in clustering scripts/module. - [2023.9.21] - Added a check in
coverage.py
to see if themapped.sorted.bam
files are created, if they are then skip them. Not yet implemented for GNU parallel option. - [2023.9.15] - Changed default representative sequence format from table to fasta for
mmseqs2_wrapper.py
. - [2023.9.12] - Added
--nucleotide_fasta_output
toantismash_genbank_to_table.py
which outputs the actual BGC DNA sequence. Changed--fasta_output
to--protein_fasta_output
and added output tobiosynthetic.py
. Changed BGC component identifiers to[bgc_id]_[position_in_bgc]|[start]:[end]([strand])
to match withMetaEuk
identifiers. Changedbgc_type
toprotocluster_type
.biosynthetic.py
now supports GFF files fromMetaEuk
(exon and gene features not supported byantiSMASH
). Fixed error related toantiSMASH
adding CDS (i.e.,allorf_[start]_[end]
) that are not in GFF soantismash_genbank_to_table.py
failed in those cases. - [2023.9.12] - Added
ete3
toVEBA-phylogeny_env.yml
and automatically renders trees to PDF. #! Need to test - [2023.9.11] - Added presets for
MEGAHIT
using the--megahit_preset
option. - [2023.9.11] - The change for using
--mash_db
withGTDB-Tk
violated the assumption that all prokaryotic classifications had amsa_percent
field which caused the cluster-level taxonomy to fail.compile_prokaryotic_genome_cluster_classification_scores_table.py
fixes this by using fastani_ani
as the weight when genomes were classified using ANI andmsa_percent
for everything else. Initial error caused unclassified prokaryotic for all cluster-level classifications. - [2023.9.8] - Fixed small error where empty gff files with an asterisk in the name were created for samples that didn't have any prokaryotic MAGs.
- [2023.9.8] - Fixed critical error where descriptions in header were not being removed in
eukaryota.scaffolds.list
and did not remove eukaryotic scaffolds inseqkit grep
soDAS_Tool
output eukaryotic MAGs inidentifier_mapping.tsv
and__DASTool_scaffolds2bin.no_eukaryota.txt
- [2023.9.5] - Fixed
krona.html
inbiosynthetic.py
which was being created incorrectly fromcompile_krona.py
script. - [2023.8.30] - Create
pangenome_core_sequences
inglobal_clustering.py
andlocal_clustering.py
which writes both protein and CDS sequences for each SLC. Also made default incluster.py
to NOT do local clustering switching--no_local_clustering
to--local_clustering
. - [2023.8.30] -
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
inbiosynthetic.py
whenDiamond
finds multiple regions in one hit that matches. Added--sort_by
and--ascending
toconcatenate_dataframes.py
along with automatic detection and removal of duplicate indices. Also added--sort_by bitscore
inbiosynthetic.py
. - [2023.8.28] - Added core pangenome and singleton hits to clustering output
- [2023.8.25] - Updated
--megahit_memory
default from 0.9 to 0.99 - [2023.8.16] - Fixed error in
genomad_taxonomy_wrapper.py
whereviral_taxonomy.tsv
should have beentaxonomy.tsv
. - [2023.7.26] - Fixed minor error in
assembly.py
that was preventing users from usingSPAdes
programs that were notspades.py
,metaspades.py
, orrnaspades.py
that was the result of using an incorrect string formatting. - [2023.7.25] - Updated
bowtie2
in preprocess, assembly, and mapping modules. Updatedfastp
andfastq_preprocessor
in preprocess module. - [2023.7.7] - Added
compile_gff.py
to merge CDS, rRNA, and tRNA GFF files. Used inbinning-prokaryotic.py
andbinning-viral.py
.binning-eukaryotic.py
uses the source of this in the backend offilter_busco_results.py
. Includes GC content for contigs and various tags. - [2023.7.6] - Updated
BUSCO v5.3.2 -> v5.4.3
which changes the json output structure and made the appropriate changes infilter_busco_results.py
. - [2023.7.3] - Added
eukaryotic_gene_modeling_wrapper.py
which 1) splits nuclear, mitochondrial, and plastid genomes; 2) performs gene modeling viaMetaEuk
andPyrodigal
; 3) performs rRNA detection viaBARRNAP
; 4) performs tRNA detection viatRNAscan-SE
; 5) merges processed GFF files; and 6) calculates sequence statistics. - [2023.6.29] - Added
gene_biotype=protein_coding
toprodigal
GFF output. - [2023.6.20] - Added
VFDB
toannotate.py
and database. - [2023.6.16] - Compiled and pushed
gtdb_r214.msh
mash file to Zenodo:8048187 which is now used by default inclassify-prokaryotic.py
. It is now included inVDB_v5.1
. - [2023.6.15] - Cleaned up global and local clustering intermediate files. Added pangenome tables and singleton information to outputs.
- [2023.6.12] - Changed
${VEBA_DATABASE}/Classify/GTDBTk
→${VEBA_DATABASE}/Classify/GTDB
. - [2023.6.12] - Replace
prodigal
withpyrodigal
inbinning-prokaryotic.py
(prodigal
is still in environment b/cDAS_Tool
dependency). - [2023.6.12] -
consensus_genome_classification.py
now bases missing classifications on a missing weight value. Defaults for unclassified labels are Unclassified prokaryote
,Unclassified eukaryote
, andUnclassified virus
for the various classification modules. Also changed "id_genome_cluster" to "id" and "genomes" to "components" to generalize for eukaryotic classification. - [2023.6.12] -
global_clustering.py
andlocal_clustering.py
(accessible throughcluster.py
) now outputs NetworkX graph and Python dictionary pickled objects. - [2023.6.12] - Added support for missing values and unclassified taxa in
compile_krona.py
andconsensus_genome_classification.py
. - [2023.5.18] - Added
compile_protein_cluster_prevalence_table.py
script - [2023.5.17] - Added
convert_table_to_fasta.py
script - [2023.5.16] - Created Docker images for all modules
- [2023.5.16] - Replaced all absolute path symlinks with relative symlinks.
- [2023.5.15] - Changed
prokaryotic_taxonomy.tsv
andprokaryotic_taxonomy.clusters.tsv
inclassify-prokaryotic.py
(along with eukaryotic and viral) files totaxonomy.tsv
andtaxonomy.clusters.tsv
for uniformity. - [2023.5.15] - Updating all symlinks to relative links (also in
fastq_preprocessor
) to prepare for dockerization and updating all environments to use updated GenoPype 2023.4.13. - [2023.5.14] - Changed
nr
touniref
inannotate.py
and addedpropagate_annotations_from_representatives.py
script while simplifyingmerge_annotations_and_taxonomy.py
tomerge_annotations.py
and excluding taxonomy operations. - [2023.5.14] - Changed
nr
toUniRef90
andUniRef50
inVDB_v5
- [2023.5.12] - Changed
orfs_to_orthogroups.tsv
toproteins_to_orthogroups.tsv
for consistency with thecluster.py
module. Will eventually find some consistency with scaffolds_to_bins/scaffolds_to_mags
but this will be later. - [2023.5.12] - Added a
scaffolds_to_mags.tsv
in the clustering output. - [2023.5.8] - Added
convert_counts_table.py
which converts a counts table (and metadata) to Pandas pickle, Anndata h5ad, or Biom hdf5 - [2023.5.8] - Fixed output directory for
mapping.py
which now usesoutput_directory/${NAME}
structure likebinning-*.py
- [2023.5.8] - Removed "python" prefix for script calls and now uses the shebang in each script for the executable. Also added single quotes around the script filepath (e.g.,
'[script_filepath]'
) to escape characters/spaces in filepath. - [2023.5.8] - Added support for
index.py
to accept individual--references [file.fasta]
and--gene_models [file.gff]
. - [2023.4.25] - Added
stdin
support forscaffolds_to_bins.py
along with the ability to input genome tables [id_genome][filepath]. Also added progress bars. - [2023.4.23] - As a result of issues/22,
assembly.py
,assembly-sequential.py
,binning-*.py
, andmapping.py
will use-p --countReadPairs
forfeatureCounts
and updatessubread 2.0.1 → subread 2.0.3
. Forbinning-*.py
, long reads can be used with the--long_reads
flag. - [2023.4.20] - Updated
cluster.py
and associatedglobal_clustering.py
/local_clustering.py
scripts to usemmseqs2_wrapper.py
which now automatically outputs representative sequences. - [2023.4.17] - Added
check_fasta_duplicates.py
script that gives0
and1
exit codes for fasta without and with duplicates, respectively. Addedreformat_representative_sequences.py
to reformat representative sequences fromMMSEQS2
into either a table or fasta file where the identifiers are cluster labels. Removed --dbtype
from[global/local]_clustering.py
. Removed appended prefix for.graph.pkl
anddict.pkl
inedgelist_to_clusters.py
. Addedmmseqs2_wrapper.py
andhmmer_wrapper.py
scripts. - [2023.4.13] - Added an option to
merge_generalized_mapping.py
to include the sample index in a filepath and also an option to remove empty features (useful for Salmon). Added anexecutable='/bin/bash'
option to thesubprocess.Popen
calls inGenoPype
to address issues/23. - [2023.3.23] - Added
genbanks/[id_genome]/
to output directory ofbiosynthetic.py
which has symlinks to all the BGC genbanks fromantiSMASH
. - [2023.3.20] - Added
database
field tosource_taxonomy.tsv.gz
inVDB-Microeukaryotic_v2.1
as an additional file which will eventually replace the default file. Also changed SourceID
toid_source
in updated version. - [2023.3.17] - Fixed rare bug when
antiSMASH
genbank files have a space appended to the contig. Also fixed a typo in the BGC features fasta file name. - [2023.3.13] - Fixed
--skip_maxbin2
and--skip_concoct
arguments by adding missingseed
parameters (Issue #21). Added a wrapper aroundSTAR
RNAseq-aligner (star_wrapper.py
) in preparation for adding it as an option for mapping.py
. This also includes a helper script in compiling the summary log (compile_star_statistics.py
). - [2023.3.9] - Added
bgc_novelty_scorer.py
script to get novelty scores of biosynthetic gene clusters. - [2023.3.7] - Added prefix and minimum contig length threshold to
assembly.py
by default. Addedmerge_generalized_mapping.py
which can be used forbowtie2_wrapper.py
and (the future)star_wrapper.py
helper scripts. - [2023.3.6] - Added dereplicated
MIBiG
Diamond
database to (mibig_v3.1.dmnd
)VDB_v4.1
. Adds protein fasta files for genes in BGCs forbiosynthetic.py
which are used to run against themibig_v3.1.dmnd
database. - [2023.3.3] - Updated
binning-viral.py
module'sgeNomad
run to use--relaxed
settings by default sinceCheckV
is used after with conservative settings (https://portal.nersc.gov/genomad/post_classification_filtering.html#default-parameters-and-presets) - [2023.2.23] - The largest update to date. Please refer to v1.1 for details on what has been changed.
- [2023.01.20] - Changed
-a --ani
to-t --threshold
infastani_to_clusters.py
to match the usage inedgelist_to_clusters.py
which is a generalization offastani_to_clusters.py
developed forMMSEQS2
andDiamond
implementations. - [2023.01.12] - Updated
VDB-Microeukaryotic_v2
toVDB-Microeukaryotic_v2.1
to include areference.eukaryota_odb10.list
containing all the eukaryotic core markers. To accommodate this, I've also updated VDB_v3
toVDB_v3.1
, thedownload_databases.sh
script, and theVEBA-database_env.yml
environment file. Now amicroeukaryotic.eukaryota_odb10
will be available for streamlined eukaryotic classification. - [2023.01.11] -
biosynthetic.py
automatically removes assembly.gbk and assembly.json files because they are big and unnecessary. - [2023.01.08] - Added an internal checkpoint system for
biosynthetic.py
when re-running an incompleteantiSMASH
step (useful when running large numbers of genomes). Fixed the following environment files: VEBA-amplicon_env.yml
,VEBA-binning-prokaryotic_env.yml
, andVEBA-binning-eukaryotic_env.yml
as they had either PyPI or package conflict errors during installation. - [2023.01.05] - Added start, end, and strand to antismash output table in
antismash_genbanks_to_table.py
. Output is sorted by ["genome_id", "contig_id", "start", "end"]
. Fixed VEBA-phylogeny_env.yml
environment file. Important fix inupdate_environment_scripts.sh
for symlinking scripts in path. - [2023.01.03] - Moving
VEBA-biosynthetic_env
as a developmental environment so it won't be installed automatically. The reasoning for this is that antiSMASH
downloads and configures the antiSMASH database
in the backend which uses a lot of compute resources and takes a long time. Didn't want to slow up the installation more. - [2022.12.21] - Added
biopython
toVEBA-assembly_env
which is needed when runningMEGAHIT
as the scaffolds are rewritten. - [2022.12.12] - Fixed duplicate
step__step__program
labels forclassify-prokaryotic.py
module. Added support for prepending index/column levels andindex_col
selection inconcatenate_dataframes.py
. - [2022.12.07] - Fixed the compatibility issues for
VEBA-preprocess_env.yml
and issues with the following scripts:compile_metaeuk_identifiers.py
,filter_busco_results.py
, andVirFinder_wrapper.R
. - [2022.11.14] - Added
megahit
support forassembly.py
module (not yet available inassembly-sequential.py
). Changed-P/--spades_program
to-P/--program
forassembly.py
. Addedbiosynthetic
module which runs antiSMASH and converts genbank files to tabular format.binning-prokaryotic.py
defaults toTMPDIR
environment variable for CheckM step, if not available, then it uses[PROJECT_DIRECTORY]/[ID]/tmp
. See #12 of FAQ. - [2022.11.8] - Replaced penultimate step in
binning-prokaryotic.py
to useadjust_genomes_for_cpr.py
instead of the extremely long series of bash commands. This will make it easier to diagnose errors in this critical step. Also added support for contig descriptions and added MAG identifier in fasta files inbinning-eukaryotic.py
. Now uses themetaeuk_wrapper.py
script for theMetaEuk
step. Added separate option of--run_metaplasmidspades
forassembly-sequential.py
instead of making it mandatory (now it just runsbiosyntheticSPAdes
andmetaSPAdes
by default). - [2022.11.7] - Added
--use_mag_as_description
in partition_gene_models.py
script to include the MAG identifier in the contig description of the fasta header which is default inbinning-prokaryotic.py
. Addedadjust_genomes_for_cpr.py
script to more easily run and understand the CPR adjustment step of binning-prokaryotic.py
. Added support for fasta header descriptions inbinning-prokaryotic.py
. - [2022.11.4] - Added functionality to
replace_fasta_descriptions.py
script to be able to use a string for replacing fasta headers in addition to the original functionality. - [2022.10.26] - Fixed symlinks to scripts for
install_veba.sh
and added missingCHECKM_DATA_PATH
environment variable. Also addeduninstall_veba.sh
, addedupdate_environment_variables.sh
scripts, and cleaned up install/database scripts. - [2022.10.25] - Updated default
GTDB-Tk
database fromR202
toR207_v2
and along with this updatedGTDB-Tk
inVEBA-binning-prokaryotic_env
andVEBA-classify_env
. Also, updated thebinning-prokaryotic.py
to include thecheckm_output.filtered.tsv
instead of unfilteredoutput.tsv
. - [2022.10.24] - Added new functionality to
compile_reads_table.py
by adding a method to compile reads tables from Fastq directories. Compatible withQIIME2
manifest. Defaults to absolute path with added option--relative
for relative paths. Also added an experimentalamplicon.py
module for ASV detection/classification along with the appropriate environment recipe and README.md update. - [2022.10.18] - Replace
GRCh38 alt analysis set
with theCHM13v2.0 telomere-to-telomere build
for the included human reference. Also updated theVEBA-database_env
to includeunzip
and added a patch for users to update their human reference if desired. - [2022.10.16] - Added
edgelist.tsv
andgraph.pkl
to output directory for cluster.py. These files were already in intermediate 1__fastani but the file name was weird. (e.g., graph.pkl-ani_95.0.edgelist.tsv). Fixed and also changed output of graph infastani_to_clusters.py
:nx.write_gpickle(graph, "{}-ani_{}.graph.pkl".format(opts.export_pickle, tol)) (Adding the .graph. part). - [2022.08.27] - Added
metaeuk_wrapper.py
script - [2022.08.17] - Added --scaffolds_to_bins option to mapping.py. Automated
samtools index
andsamtools coverage
steps formapped.sorted.bam
files for use in the spatial coverage calculation, which is now automated as well, producing a genome_spatial_coverage.tsv.gz
file if--scaffolds_to_bins
is provided. Addedgenome_spatial_coverage.py
script that uses thesamtools coverage
files from themapping.py
module or custom runs. Fixed --table_header error ingroupby_table.py
script. - [2022.07.14] - Added
--absolute
argument tocompile_reads_table.py
to use absolute paths instead of relative paths. - [2022.07.14] - Added
genome_coverage_from_spades.py
script to help with coverage calculations for NCBI submissions. - [2022.07.13] - Added output files to documentation readme. Also added subread to viral binning environment and changed the output filenames for viral classification.
- [2022.07.08] - Fixed database arguments to use --veba_database. Also added --bam support for binning-viral.py
- [2022.06.21] - Added prefiltered_alignment_table.tsv.gz to phylogeny.py and merge_msa.py for debugging and finding a balance between removing genomes and removing markers. Changed samples and number_of_samples to genomes and number_of_genomes in merge_msa.py. Also added minimum_markers_aligned_ratio to remove poor quality genomes.
- [2022.06.20] - Added filter_hmmsearch_results.py and score thresholding table input for phylogeny.py and partition_hmmsearch.py. Changed minimum_genomes_aligned_ratio default to 0.95 instead of 0.5.
- [2022.06.03] - Changed coassembly.py to coverage.py, which includes changing veba_output/assembly/coassembly to veba_output/assembly/multisample and coassembly.fasta.* to reference.fasta.*. Also changed GNU parallel from the default to an option since it's much slower when the samples are different sizes. It can be selected using the --one_task_per_cpu argument. Should probably do this for cluster.py.
- [2022.05.25] - Added --skip_maxbin2 argument for binning-prokaryotic because MaxBin2 takes an extremely long time. For a 1.5 GB fasta file and ~50 samples in the coverage matrix, it takes over 40 hours per MaxBin2 run (2 per iteration). At that rate, 10 iterations would take more than 30 days.
- [2022.04.12] - Added coassembly module and support for multiple bam files in binning-prokaryotic, binning-eukaryotic, and binning-wrapper
- [2022.03.28] - Added GTDB-Tk to prokaryotic binning to check for CPR and then rerun CheckM using the proper parameters.
- [2022.03.14] - Created a binning_wrapper.py to normalize the binning process and add --minimum_genome_length capabilities. This is useful for eukaryotic binning but more complicated for prokaryotic binning because the current pipeline is hardcoded to handle errors on iterative binning. Also switched to CoverM for all coverage calculations because it's faster. Split out prokaryotic, eukaryotic, and viral binning environments. For eukaryotic binning, I've removed EukCC and use BUSCO v5 instead.
- [2022.03.01] - Added a domain classification script that is run during prokaryotic binning. I've created a hack that moves all of the eukaryotic genomes to another directory to allow for proper gene calls in a separate module. This hack will remain until DAS_Tool can handle custom gene sets because it cannot in the current version. The other option is to remove
- [2022.02.24] - Added saf file to assembly.py and feature counts of scaffolds/transcripts
- [2022.02.22] - Made the original preprocess.py → preprocess-kneaddata.py and the new preprocess.py a wrapper around fastq_preprocessor
- [2022.02.22] - Made the index.py module
- [2022.02.22] - concatenate_fasta.py and concatenate_gff.py
- [2022.02.02] - consensus_genome_classification.py
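
The [2022.08.17] entry above describes computing a genome-level spatial coverage from samtools coverage output and a scaffolds-to-bins mapping. The following is a minimal sketch of one way such a calculation could work. It is not the code in genome_spatial_coverage.py. The function name, the dictionary form of the mapping, and the definition used here (covered bases divided by total scaffold length per genome) are all assumptions made for illustration. The column names (#rname, startpos, endpos, covbases) are those of the standard samtools coverage tabular output.

```python
import csv
import io
from collections import defaultdict


def genome_spatial_coverage(coverage_tsv, scaffolds_to_bins):
    """Hypothetical helper: fraction of each genome covered by >= 1 read.

    coverage_tsv: file-like object holding tab-separated `samtools coverage` output
    scaffolds_to_bins: dict mapping scaffold identifier -> genome (bin) identifier
    """
    covered_bases = defaultdict(int)
    total_bases = defaultdict(int)
    for row in csv.DictReader(coverage_tsv, delimiter="\t"):
        genome = scaffolds_to_bins.get(row["#rname"])
        if genome is None:
            # Scaffold was not assigned to any bin, so it is ignored
            continue
        # samtools coverage reports 1-based inclusive start/end positions
        total_bases[genome] += int(row["endpos"]) - int(row["startpos"]) + 1
        covered_bases[genome] += int(row["covbases"])
    return {g: covered_bases[g] / total_bases[g] for g in total_bases}


# Usage with a small made-up table (values are illustrative only)
example = io.StringIO(
    "#rname\tstartpos\tendpos\tnumreads\tcovbases\tcoverage\tmeandepth\tmeanbaseq\tmeanmapq\n"
    "scaffold_1\t1\t1000\t50\t800\t80\t4.0\t36\t60\n"
    "scaffold_2\t1\t1000\t10\t200\t20\t1.0\t36\t60\n"
    "scaffold_3\t1\t500\t5\t100\t20\t0.5\t36\t60\n"
)
mapping = {"scaffold_1": "genome_A", "scaffold_2": "genome_A", "scaffold_3": "genome_B"}
result = genome_spatial_coverage(example, mapping)
```

Summing covered bases and lengths before dividing weights each scaffold by its length, so short scaffolds do not dominate the genome-level ratio the way a plain average of per-scaffold percentages would.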