VEBA_v1.3.0
Release v1.3.0:
- `VEBA` Modules:
  - Added `profile-pathway.py` module and associated scripts for building `HUMAnN` databases from de novo genomes and annotations. Essentially, this is a reads-based functional profiling method via `HUMAnN` that uses binned genomes as the database.
  - Added `marker_gene_clustering.py` script, which identifies core marker proteins that are present in all genomes within a genome cluster (i.e., pangenome) and unique to that genome cluster. Clusters in either protein or nucleotide space.
  - Added `module_completion_ratios.py` script, which calculates KEGG module completion ratios for genomes and pangenomes. It is run automatically in the backend of `annotate.py`.
  - Updated `annotate.py` and `merge_annotations.py` to provide better annotations for clustered proteins.
  - Added `merge_genome_quality.py` and `merge_taxonomy_classifications.py`, which compile genome quality and taxonomy, respectively, for all organisms.
  - Added BGC clustering in protein and nucleotide space to `biosynthetic.py`. It also produces prevalence tables that can be used for further clustering of BGCs.
  - Added `pangenome_core_sequences` in `cluster.py`, which writes both protein and CDS sequences for each genome cluster.
  - Added PDF visualization of Newick trees in `phylogeny.py`.
- `VEBA` Database (`VDB_v5.2`):
  - Added `CAZy`
  - Added `MicrobeAnnotator-KEGG`
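
The marker definition used above for `marker_gene_clustering.py` (present in every genome of a genome cluster, absent from all other genome clusters) can be illustrated with a minimal sketch. This is not the script's implementation: the function name, the input dictionaries, and the toy identifiers (`g1`, `PC1`, `SLC-A`) are hypothetical and only show the set logic.

```python
from collections import defaultdict


def find_marker_clusters(genome_to_proteins, genome_to_cluster):
    """Return {genome_cluster: protein clusters that are core AND unique}.

    Hypothetical inputs (not VEBA's file formats):
      genome_to_proteins: genome id -> set of protein-cluster ids it encodes
      genome_to_cluster:  genome id -> genome-cluster (pangenome) id
    """
    # Group genomes by the genome cluster they belong to.
    members = defaultdict(list)
    for genome, cluster in genome_to_cluster.items():
        members[cluster].append(genome)

    core = {}  # protein clusters found in every genome of a genome cluster
    pan = {}   # protein clusters found in at least one genome of it
    for cluster, genomes in members.items():
        sets = [set(genome_to_proteins[g]) for g in genomes]
        core[cluster] = set.intersection(*sets)
        pan[cluster] = set.union(*sets)

    markers = {}
    for cluster in members:
        # Everything observed in any other genome cluster disqualifies a marker.
        seen_elsewhere = set()
        for other, proteins in pan.items():
            if other != cluster:
                seen_elsewhere |= proteins
        markers[cluster] = core[cluster] - seen_elsewhere
    return markers


# Toy data: PC2 is shared by both genome clusters, so it is core but not a marker.
genome_to_proteins = {
    "g1": {"PC1", "PC2", "PC3"},
    "g2": {"PC1", "PC2"},
    "g3": {"PC2", "PC4"},
    "g4": {"PC2", "PC4", "PC5"},
}
genome_to_cluster = {"g1": "SLC-A", "g2": "SLC-A", "g3": "SLC-B", "g4": "SLC-B"}
markers = find_marker_clusters(genome_to_proteins, genome_to_cluster)
```

In this toy case `PC3` and `PC5` are unique to one genome cluster but missing from one of its genomes, so they are excluded as well.
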
**Release v1.3.0 Details**
- Updated `annotate.py` and `merge_annotations.py` to handle `CAZy`. They also now properly handle clustered protein annotations.
- Added `module_completion_ratios.py` script, which is a fork of `MicrobeAnnotator` `ko_mapper.py`. Also included a database (Zenodo: 10020074), which will be included in `VDB_v5.2`.
- Added a checkpoint for `tRNAscan-SE` in `binning-prokaryotic.py` and `eukaryotic_gene_modeling_wrapper.py`.
- Added `profile-pathway.py` module and `VEBA-profile_env` environment. The module is a wrapper around `HUMAnN` for the custom database created from `annotate.py` and `compile_custom_humann_database_from_annotations.py`.
- Added `GenoPype version` to log output.
- Added `merge_genome_quality.py`, which combines `CheckV`, `CheckM2`, and `BUSCO` results.
- Added `compile_custom_humann_database_from_annotations.py`, which compiles a `HUMAnN` protein database table from the output of `annotate.py` and taxonomy classifications.
- Added `--no_domain` and `--no_header` options to `merge_taxonomy_classifications.py`; the resulting output serves as input to `compile_custom_humann_database_from_annotations.py`.
- Added `marker_gene_clustering.py` script, which gets core marker genes unique to each SLC (i.e., pangenome). Also added `average_number_of_copies_per_genome` to protein clusters.
- Added `--minimum_core_prevalence` in `global_clustering.py`, `local_clustering.py`, and `cluster.py`, which sets the prevalence ratio at which protein clusters in an SLC are considered core. Also removed `--no_singletons` from `cluster.py` to avoid complications with marker genes. Relabeled `--input` to `--genomes_table` in clustering scripts/module.
- Added a check in `coverage.py` that skips samples whose `mapped.sorted.bam` files have already been created. Not yet implemented for the GNU parallel option.
- Changed default representative sequence format from table to fasta for `mmseqs2_wrapper.py`.
- Added `--nucleotide_fasta_output` to `antismash_genbank_to_table.py`, which outputs the actual BGC DNA sequence. Changed `--fasta_output` to `--protein_fasta_output` and added the output to `biosynthetic.py`. Changed BGC component identifiers to `[bgc_id]_[position_in_bgc]|[start]:[end]([strand])` to match `MetaEuk` identifiers. Changed `bgc_type` to `protocluster_type`. `biosynthetic.py` now supports GFF files from `MetaEuk` (exon and gene features are not supported by `antiSMASH`). Fixed an error caused by `antiSMASH` adding CDS (i.e., `allorf_[start]_[end]`) that are not in the GFF, which made `antismash_genbank_to_table.py` fail in those cases.
- Added `ete3` to `VEBA-phylogeny_env.yml`; trees are now automatically rendered to PDF.
- Added presets for `MEGAHIT` using the `--megahit_preset` option.
- The change to using `--mash_db` with `GTDB-Tk` violated the assumption that all prokaryotic classifications had a `msa_percent` field, which caused the cluster-level taxonomy to fail. `compile_prokaryotic_genome_cluster_classification_scores_table.py` fixes this by using `fastani_ani` as the weight when genomes were classified using ANI and `msa_percent` for everything else. The initial error caused all cluster-level classifications to be unclassified prokaryotic.
- Fixed small error where empty gff files with an asterisk in the name were created for samples that didn't have any prokaryotic MAGs.
- Fixed critical error where descriptions in headers were not being removed in `eukaryota.scaffolds.list`, so eukaryotic scaffolds were not removed by `seqkit grep` and `DAS_Tool` output eukaryotic MAGs in `identifier_mapping.tsv` and `__DASTool_scaffolds2bin.no_eukaryota.txt`.
- Fixed `krona.html` in `biosynthetic.py`, which was being created incorrectly from the `compile_krona.py` script.
- Created `pangenome_core_sequences` in `global_clustering.py` and `local_clustering.py`, which writes both protein and CDS sequences for each SLC. Also changed the default in `cluster.py` to NOT do local clustering, switching `--no_local_clustering` to `--local_clustering`.
- Fixed `pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects` in `biosynthetic.py`, raised when `Diamond` finds multiple matching regions in one hit. Added `--sort_by` and `--ascending` to `concatenate_dataframes.py` along with automatic detection and removal of duplicate indices. Also added `--sort_by bitscore` in `biosynthetic.py`.
- Added core pangenome and singleton hits to clustering output.
- Updated `--megahit_memory` default from 0.9 to 0.99.
- Fixed error in `genomad_taxonomy_wrapper.py` where `viral_taxonomy.tsv` should have been `taxonomy.tsv`.
- Fixed minor error in `assembly.py`, caused by incorrect string formatting, that prevented users from using `SPAdes` programs other than `spades.py`, `metaspades.py`, or `rnaspades.py`.
- Updated `bowtie2` in preprocess, assembly, and mapping modules. Updated `fastp` and `fastq_preprocessor` in preprocess module.
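
For downstream scripts that consume the new BGC component identifiers, the `[bgc_id]_[position_in_bgc]|[start]:[end]([strand])` layout listed in the details above can be built and parsed as in the sketch below. This is an illustration under assumptions rather than code from `antismash_genbank_to_table.py`: the helper names, the `BGCComponent` type, the example identifier, and the restriction of strand to `+` or `-` are all hypothetical.

```python
import re
from typing import NamedTuple


class BGCComponent(NamedTuple):
    """Fields of one BGC component identifier (hypothetical helper type)."""
    bgc_id: str
    position_in_bgc: int
    start: int
    end: int
    strand: str


# Assumes strand is "+" or "-"; bgc_id may itself contain underscores,
# so the greedy match keeps everything before the last "_<digits>|".
_COMPONENT_PATTERN = re.compile(
    r"^(?P<bgc_id>.+)_(?P<position>\d+)\|(?P<start>\d+):(?P<end>\d+)\((?P<strand>[+-])\)$"
)


def format_component_id(component: BGCComponent) -> str:
    """Build `[bgc_id]_[position_in_bgc]|[start]:[end]([strand])`."""
    return (
        f"{component.bgc_id}_{component.position_in_bgc}"
        f"|{component.start}:{component.end}({component.strand})"
    )


def parse_component_id(identifier: str) -> BGCComponent:
    """Split an identifier back into its fields, or raise ValueError."""
    match = _COMPONENT_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"Not a BGC component identifier: {identifier!r}")
    return BGCComponent(
        bgc_id=match["bgc_id"],
        position_in_bgc=int(match["position"]),
        start=int(match["start"]),
        end=int(match["end"]),
        strand=match["strand"],
    )


# Made-up identifier used only to exercise the round trip.
example = "contig_1_region001_3|100:400(+)"
parsed = parse_component_id(example)
rebuilt = format_component_id(parsed)
```

Round-tripping an identifier through both helpers is a cheap way to check that a table was written with the new layout rather than the old one.
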