Release VEBA_v1.3.0 · jolespin/veba

Release v1.3.0:

VEBA Modules:
- Added profile-pathway.py module and associated scripts for building HUMAnN databases from de novo genomes and annotations. Essentially, a reads-based functional profiling method via HUMAnN using binned genomes as the database.
- Added marker_gene_clustering.py script which identifies core marker proteins that are present in all genomes within a genome cluster (i.e., pangenome) and unique to only that genome cluster. Clusters in either protein or nucleotide space.
- Added module_completion_ratios.py script which calculates KEGG module completion ratios for genomes and pangenomes. Automatically run in backend of annotate.py.
- Updated annotate.py and merge_annotations.py to provide better annotations for clustered proteins.
- Added merge_genome_quality.py and merge_taxonomy_classifications.py, which compile genome quality and taxonomy, respectively, for all organisms.
- Added BGC clustering in protein and nucleotide space to biosynthetic.py. It also produces prevalence tables that can be used for further clustering of BGCs.
- Added pangenome_core_sequences in cluster.py, which writes both protein and CDS sequences for each genome cluster.
- Added PDF visualization of newick trees in phylogeny.py.

VEBA Database (VDB_v5.2):
- Added CAZy
- Added MicrobeAnnotator-KEGG

**Release v1.3.0 Details**

Updated annotate.py and merge_annotations.py to handle CAZy. They now also properly handle annotations for clustered proteins.
Added module_completion_ratios.py script, which is a fork of MicrobeAnnotator ko_mapper.py. Also added a database (Zenodo: 10020074), which will be included in VDB_v5.2.
Added a checkpoint for tRNAscan-SE in binning-prokaryotic.py and eukaryotic_gene_modeling_wrapper.py.
Added profile-pathway.py module, a wrapper around HUMAnN for the custom database created from annotate.py and compile_custom_humann_database_from_annotations.py, along with the VEBA-profile_env environment.
Added GenoPype version to log output
Added merge_genome_quality.py which combines CheckV, CheckM2, and BUSCO results.
Added compile_custom_humann_database_from_annotations.py which compiles a HUMAnN protein database table from the output of annotate.py and taxonomy classifications.
Added --no_domain and --no_header options to merge_taxonomy_classifications.py so that its output can serve as input to compile_custom_humann_database_from_annotations.py.
Added marker_gene_clustering.py script, which gets core marker genes unique to each SLC (i.e., pangenome). Also added average_number_of_copies_per_genome to protein clusters.
Added --minimum_core_prevalence to global_clustering.py, local_clustering.py, and cluster.py, which sets the minimum prevalence ratio at which a protein cluster in an SLC is considered core. Also removed --no_singletons from cluster.py to avoid complications with marker genes. Relabeled --input to --genomes_table in clustering scripts/module.
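
As a rough illustration of the prevalence threshold behind --minimum_core_prevalence, the sketch below counts how many genomes in one SLC carry each protein cluster and keeps those at or above the threshold. The function name, the input layout, and the use of `>=` are assumptions made for this example; this is not the code in cluster.py.

```python
from collections import defaultdict


def core_protein_clusters(genome_to_clusters, minimum_core_prevalence=1.0):
    """Return protein clusters whose prevalence across genomes meets the threshold.

    genome_to_clusters: dict mapping genome id -> iterable of protein cluster ids
    (hypothetical layout for one SLC/pangenome).
    """
    n_genomes = len(genome_to_clusters)
    if n_genomes == 0:
        return set()

    # Count each protein cluster once per genome, ignoring copy number.
    counts = defaultdict(int)
    for clusters in genome_to_clusters.values():
        for cluster in set(clusters):
            counts[cluster] += 1

    # Prevalence ratio = genomes containing the cluster / genomes in the SLC.
    return {c for c, n in counts.items() if n / n_genomes >= minimum_core_prevalence}


# Hypothetical SLC with four genomes.
slc = {
    "genome_a": {"PC1", "PC2", "PC3"},
    "genome_b": {"PC1", "PC2"},
    "genome_c": {"PC1", "PC3"},
    "genome_d": {"PC1", "PC2"},
}
strict_core = core_protein_clusters(slc, minimum_core_prevalence=1.0)
relaxed_core = core_protein_clusters(slc, minimum_core_prevalence=0.75)
```

With a threshold of 1.0 only clusters present in every genome count as core; lowering it to 0.75 also admits clusters found in three of the four genomes.
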
Added a check in coverage.py that skips samples whose mapped.sorted.bam files already exist. Not yet implemented for the GNU parallel option.
Changed default representative sequence format from table to fasta for mmseqs2_wrapper.py.
Added --nucleotide_fasta_output to antismash_genbank_to_table.py which outputs the actual BGC DNA sequence. Changed --fasta_output to --protein_fasta_output and added output to biosynthetic.py. Changed BGC component identifiers to [bgc_id]_[position_in_bgc]|[start]:[end]([strand]) to match with MetaEuk identifiers. Changed bgc_type to protocluster_type. biosynthetic.py now supports GFF files from MetaEuk (exon and gene features not supported by antiSMASH). Fixed error related to antiSMASH adding CDS (i.e., allorf_[start]_[end]) that are not in GFF so antismash_genbank_to_table.py failed in those cases.
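
To make the identifier layout [bgc_id]_[position_in_bgc]|[start]:[end]([strand]) concrete, here is a small sketch that builds and parses such identifiers. The helper names, the example BGC id, and the assumption that strand is written as `+` or `-` are illustrative only and are not taken from antismash_genbank_to_table.py.

```python
import re


def format_component_id(bgc_id, position, start, end, strand):
    """Build an identifier shaped like [bgc_id]_[position_in_bgc]|[start]:[end]([strand])."""
    return f"{bgc_id}_{position}|{start}:{end}({strand})"


# The greedy bgc_id group allows underscores inside the BGC id itself.
_COMPONENT_ID = re.compile(
    r"^(?P<bgc_id>.+)_(?P<position>\d+)\|(?P<start>\d+):(?P<end>\d+)\((?P<strand>[+-])\)$"
)


def parse_component_id(identifier):
    """Split an identifier back into (bgc_id, position, start, end, strand)."""
    match = _COMPONENT_ID.match(identifier)
    if match is None:
        raise ValueError(f"Unrecognized BGC component identifier: {identifier!r}")
    fields = match.groupdict()
    return (
        fields["bgc_id"],
        int(fields["position"]),
        int(fields["start"]),
        int(fields["end"]),
        fields["strand"],
    )


# Hypothetical BGC id and coordinates.
component_id = format_component_id("sample1__contig_7_region001", 3, 1200, 2450, "+")
parsed = parse_component_id(component_id)
```
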
Added ete3 to VEBA-phylogeny_env.yml and automatically renders trees to PDF.
Added presets for MEGAHIT using the --megahit_preset option.
The change to use --mash_db with GTDB-Tk violated the assumption that all prokaryotic classifications have an msa_percent field, which caused cluster-level taxonomy to fail. compile_prokaryotic_genome_cluster_classification_scores_table.py fixes this by using fastani_ani as the weight when genomes were classified using ANI and msa_percent for everything else. The initial error caused every cluster-level classification to be reported as unclassified prokaryotic.
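
The weighting rule described here can be sketched as follows. This is a simplified illustration, not the logic of compile_prokaryotic_genome_cluster_classification_scores_table.py: it assumes each genome is a dict with `classification`, `fastani_ani`, and `msa_percent` keys, that a missing value appears as `N/A`, and that a genome counts as "classified using ANI" whenever `fastani_ani` is present. The function names are hypothetical.

```python
from collections import defaultdict

# Values treated as "not available" (assumed placeholder convention).
MISSING = {None, "", "N/A"}


def classification_weight(row):
    """Use fastani_ani when the genome was classified by ANI, else msa_percent."""
    ani = row.get("fastani_ani")
    if ani not in MISSING:
        return float(ani)
    msa = row.get("msa_percent")
    return float(msa) if msa not in MISSING else 0.0


def consensus_classification(rows):
    """Sum weights per classification and return the highest-scoring one."""
    scores = defaultdict(float)
    for row in rows:
        scores[row["classification"]] += classification_weight(row)
    if not scores:
        return None
    return max(scores, key=scores.get)


# Hypothetical genomes from one cluster: one classified by ANI, two by MSA placement.
genomes = [
    {"classification": "s__A", "fastani_ani": "97.5", "msa_percent": "N/A"},
    {"classification": "s__B", "fastani_ani": "N/A", "msa_percent": "80.0"},
    {"classification": "s__B", "fastani_ani": "N/A", "msa_percent": "10.0"},
]
cluster_taxonomy = consensus_classification(genomes)
```

Without the fallback, the ANI-classified genome would have no usable weight, which is the kind of gap that led to the cluster-level failure.
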
Fixed small error where empty gff files with an asterisk in the name were created for samples that didn't have any prokaryotic MAGs.
Fixed critical error where descriptions in headers were not being removed in eukaryota.scaffolds.list, so seqkit grep did not remove eukaryotic scaffolds and DAS_Tool output eukaryotic MAGs in identifier_mapping.tsv and __DASTool_scaffolds2bin.no_eukaryota.txt.
Fixed krona.html in biosynthetic.py which was being created incorrectly from compile_krona.py script.
Created pangenome_core_sequences in global_clustering.py and local_clustering.py, which writes both protein and CDS sequences for each SLC. Also changed the default in cluster.py to NOT do local clustering, switching --no_local_clustering to --local_clustering.
Fixed pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects in biosynthetic.py, raised when Diamond finds multiple matching regions in one hit. Added --sort_by and --ascending to concatenate_dataframes.py along with automatic detection and removal of duplicate indices. Also added --sort_by bitscore in biosynthetic.py.
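
A minimal pandas sketch of the general idea, sorting by bitscore and keeping only the first row per duplicated index, is shown below. It is an illustration under assumptions rather than the code in concatenate_dataframes.py; the function name and the example table are hypothetical, with column names borrowed from the usual Diamond/BLAST tabular fields.

```python
import pandas as pd


def deduplicate_hits(df, sort_by="bitscore", ascending=False):
    """Sort hits, then keep the first (best) row for each duplicated index label."""
    df = df.sort_values(sort_by, ascending=ascending)
    # index.duplicated(keep="first") marks every repeat after the first occurrence.
    return df[~df.index.duplicated(keep="first")]


# Hypothetical hits table where one query matched two regions of the same subject.
hits = pd.DataFrame(
    {"sseqid": ["ref1", "ref1", "ref2"], "bitscore": [50.0, 120.0, 80.0]},
    index=pd.Index(["gene_1", "gene_1", "gene_2"], name="qseqid"),
)
best_hits = deduplicate_hits(hits)
```

After deduplication the index is unique, so later reindexing or concatenation no longer raises InvalidIndexError.
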
Added core pangenome and singleton hits to clustering output
Updated --megahit_memory default from 0.9 to 0.99
Fixed error in genomad_taxonomy_wrapper.py where viral_taxonomy.tsv should have been taxonomy.tsv.
Fixed minor error in assembly.py, caused by incorrect string formatting, that prevented users from using SPAdes programs other than spades.py, metaspades.py, or rnaspades.py.
Updated bowtie2 in preprocess, assembly, and mapping modules. Updated fastp and fastq_preprocessor in preprocess module.
