VEBA_v1.3.0

@jolespin jolespin released this 27 Oct 21:13
bb683ab

Release v1.3.0:

  • VEBA Modules:

    • Added profile-pathway.py module and associated scripts for building HUMAnN databases from de novo genomes and annotations. Essentially, a reads-based functional profiling method via HUMAnN using binned genomes as the database.
    • Added marker_gene_clustering.py script which identifies core marker proteins that are present in all genomes within a genome cluster (i.e., pangenome) and unique to only that genome cluster. Clusters in either protein or nucleotide space.
    • Added module_completion_ratios.py script which calculates KEGG module completion ratios for genomes and pangenomes. Automatically run in backend of annotate.py.
    • Updated annotate.py and merge_annotations.py to provide better annotations for clustered proteins.
    • Added merge_genome_quality.py and merge_taxonomy_classifications.py, which compile genome quality and taxonomy, respectively, for all organisms.
    • Added BGC clustering in protein and nucleotide space to biosynthetic.py. It also produces prevalence tables that can be used for further clustering of BGCs.
    • Added pangenome_core_sequences in cluster.py, which writes both protein and CDS sequences for each genome cluster.
    • Added PDF visualization of newick trees in phylogeny.py.
  • VEBA Database (VDB_v5.2):

    • Added CAZy
    • Added MicrobeAnnotator-KEGG
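
The marker selection described for marker_gene_clustering.py (protein clusters present in all genomes of a genome cluster and found in no other genome cluster) can be illustrated with a minimal sketch. This is not the script's actual implementation; the function name, the input dictionaries, and the example identifiers below are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def find_marker_clusters(genome_to_genome_cluster, genome_to_protein_clusters):
    """Return {genome_cluster: protein clusters that are core to it and unique to it}.

    Both arguments are hypothetical inputs for this sketch:
      genome_to_genome_cluster:   {genome_id: genome_cluster_id}
      genome_to_protein_clusters: {genome_id: set of protein_cluster_ids}
    """
    # Group genomes by the genome cluster (i.e., pangenome) they belong to
    genome_cluster_to_genomes = defaultdict(set)
    for genome, genome_cluster in genome_to_genome_cluster.items():
        genome_cluster_to_genomes[genome_cluster].add(genome)

    # Record every genome cluster in which each protein cluster is observed
    protein_cluster_to_genome_clusters = defaultdict(set)
    for genome, protein_clusters in genome_to_protein_clusters.items():
        for protein_cluster in protein_clusters:
            protein_cluster_to_genome_clusters[protein_cluster].add(
                genome_to_genome_cluster[genome]
            )

    markers = {}
    for genome_cluster, genomes in genome_cluster_to_genomes.items():
        # Core: present in every genome of this genome cluster
        core = set.intersection(
            *(set(genome_to_protein_clusters.get(genome, ())) for genome in genomes)
        )
        # Unique: never observed in any other genome cluster
        markers[genome_cluster] = {
            protein_cluster
            for protein_cluster in core
            if protein_cluster_to_genome_clusters[protein_cluster] == {genome_cluster}
        }
    return markers


# Hypothetical toy data: two genome clusters and four protein clusters
genome_to_genome_cluster = {"g1": "GC-0", "g2": "GC-0", "g3": "GC-1"}
genome_to_protein_clusters = {
    "g1": {"pc_a", "pc_b", "pc_c"},
    "g2": {"pc_a", "pc_b"},
    "g3": {"pc_b", "pc_d"},
}
markers = find_marker_clusters(genome_to_genome_cluster, genome_to_protein_clusters)
```

In this toy example, pc_b is core to both genome clusters and therefore is not a marker for either, while pc_c is present in only one genome of GC-0 and therefore is not core.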
**Release v1.3.0 Details**
  • Update annotate.py and merge_annotations.py to handle CAZy. They also properly address clustered protein annotations now.
  • Added module_completion_ratios.py script, which is a fork of MicrobeAnnotator ko_mapper.py. Also added a database (Zenodo: 10020074), which will be included in VDB_v5.2.
  • Added a checkpoint for tRNAscan-SE in binning-prokaryotic.py and eukaryotic_gene_modeling_wrapper.py.
  • Added profile-pathway.py module and VEBA-profile_env environment. The module is a wrapper around HUMAnN for the custom database created from annotate.py and compile_custom_humann_database_from_annotations.py.
  • Added GenoPype version to log output
  • Added merge_genome_quality.py which combines CheckV, CheckM2, and BUSCO results.
  • Added compile_custom_humann_database_from_annotations.py which compiles a HUMAnN protein database table from the output of annotate.py and taxonomy classifications.
  • Added --no_domain and --no_header options to merge_taxonomy_classifications.py so that its output can serve as input to compile_custom_humann_database_from_annotations.py.
  • Added marker_gene_clustering.py script, which gets core marker genes unique to each SLC (i.e., pangenome). Also added average_number_of_copies_per_genome to protein clusters.
  • Added --minimum_core_prevalence in global_clustering.py, local_clustering.py, and cluster.py, which sets the minimum prevalence ratio for a protein cluster in an SLC to be considered core. Also removed --no_singletons from cluster.py to avoid complications with marker genes. Relabeled --input to --genomes_table in clustering scripts/module.
  • Added a check in coverage.py that skips mapped.sorted.bam files that have already been created. Not yet implemented for the GNU parallel option.
  • Changed default representative sequence format from table to fasta for mmseqs2_wrapper.py.
  • Added --nucleotide_fasta_output to antismash_genbank_to_table.py which outputs the actual BGC DNA sequence. Changed --fasta_output to --protein_fasta_output and added output to biosynthetic.py. Changed BGC component identifiers to [bgc_id]_[position_in_bgc]|[start]:[end]([strand]) to match with MetaEuk identifiers. Changed bgc_type to protocluster_type. biosynthetic.py now supports GFF files from MetaEuk (exon and gene features not supported by antiSMASH). Fixed error related to antiSMASH adding CDS (i.e., allorf_[start]_[end]) that are not in GFF so antismash_genbank_to_table.py failed in those cases.
  • Added ete3 to VEBA-phylogeny_env.yml and automatically renders trees to PDF.
  • Added presets for MEGAHIT using the --megahit_preset option.
  • The change to using --mash_db with GTDB-Tk violated the assumption that all prokaryotic classifications had a msa_percent field, which caused the cluster-level taxonomy to fail. compile_prokaryotic_genome_cluster_classification_scores_table.py fixes this by using fastani_ani as the weight when genomes were classified using ANI and msa_percent for everything else. The initial error caused all cluster-level classifications to be reported as unclassified prokaryotic.
  • Fixed small error where empty gff files with an asterisk in the name were created for samples that didn't have any prokaryotic MAGs.
  • Fixed critical error where header descriptions were not being removed from eukaryota.scaffolds.list. As a result, seqkit grep did not remove eukaryotic scaffolds, and DAS_Tool output eukaryotic MAGs in identifier_mapping.tsv and __DASTool_scaffolds2bin.no_eukaryota.txt.
  • Fixed krona.html in biosynthetic.py which was being created incorrectly from compile_krona.py script.
  • Created pangenome_core_sequences in global_clustering.py and local_clustering.py, which writes both protein and CDS sequences for each SLC. Also, cluster.py no longer performs local clustering by default; --no_local_clustering was replaced with --local_clustering.
  • Fixed pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects in biosynthetic.py, which was raised when Diamond found multiple matching regions in one hit. Added --sort_by and --ascending to concatenate_dataframes.py along with automatic detection and removal of duplicate indices. Also added --sort_by bitscore in biosynthetic.py.
  • Added core pangenome and singleton hits to clustering output
  • Updated --megahit_memory default from 0.9 to 0.99
  • Fixed error in genomad_taxonomy_wrapper.py where viral_taxonomy.tsv should have been taxonomy.tsv.
  • Fixed minor error in assembly.py, caused by incorrect string formatting, that prevented users from using SPAdes programs other than spades.py, metaspades.py, or rnaspades.py.
  • Updated bowtie2 in preprocess, assembly, and mapping modules. Updated fastp and fastq_preprocessor in preprocess module.
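
The BGC component identifier format introduced above, [bgc_id]_[position_in_bgc]|[start]:[end]([strand]), can be built and parsed with a short helper. The following is a minimal sketch rather than code from antismash_genbank_to_table.py: the function names and the example values are hypothetical, and it assumes the strand is written as + or - and that coordinates are plain integers.

```python
import re

# Pattern for [bgc_id]_[position_in_bgc]|[start]:[end]([strand]).
# Assumption for this sketch: strand is "+" or "-", and the position and
# coordinates are non-negative integers. The bgc_id itself may contain underscores.
COMPONENT_ID_PATTERN = re.compile(
    r"^(?P<bgc_id>.+)_(?P<position_in_bgc>\d+)"
    r"\|(?P<start>\d+):(?P<end>\d+)\((?P<strand>[+-])\)$"
)


def format_component_id(bgc_id, position_in_bgc, start, end, strand):
    """Build an identifier in the [bgc_id]_[position_in_bgc]|[start]:[end]([strand]) layout."""
    return f"{bgc_id}_{position_in_bgc}|{start}:{end}({strand})"


def parse_component_id(identifier):
    """Split an identifier back into its fields, converting numeric fields to int."""
    match = COMPONENT_ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"Not a recognized BGC component identifier: {identifier!r}")
    fields = match.groupdict()
    for key in ("position_in_bgc", "start", "end"):
        fields[key] = int(fields[key])
    return fields


# Hypothetical example values
example_id = format_component_id("contig_12_region001", 3, 1500, 2400, "-")
example_fields = parse_component_id(example_id)
```

Because the bgc_id may itself contain underscores, the pattern relies on the final `_[digits]|` before the coordinates to locate the position within the BGC.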