Merge pull request #36 from jolespin/devel
v1.4.0
jolespin authored Dec 19, 2023
2 parents b9d706b + e95fe63 commit 793bc99
Showing 103 changed files with 10,029 additions and 1,016 deletions.
Binary file removed .DS_Store
Binary file not shown.
107 changes: 98 additions & 9 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,73 @@ ________________________________________________________________

#### Current Releases:

**Release v1.3.0:**
**Release v1.4.0 Highlights:**

* **`VEBA` Modules:**

* Added `profile-taxonomic.py` module which uses `sylph` to build a sketch database for genomes and queries the genome database for taxonomic abundance.
* Added long read support for `fastq_preprocessor`, `preprocess.py`, `assembly-long.py`, `coverage-long`, and all binning modules.
* Redesigned `binning-eukaryotic` module to handle custom `MetaEuk` databases.
* Added new usage syntax `veba --module preprocess --params "${PARAMS}"` where the Conda environment is abstracted and determined automatically in the backend. Changed all the walkthroughs to reflect this change.
* Added `skani` which is the new default for genome-level clustering based on ANI.
* Added `Diamond DeepClust` as an alternative to `MMSEQS2` for protein clustering.

* **`VEBA` Database (`VDB_v6`)**:

* Completely rebuilt `VEBA's Microeukaryotic Protein Database` to produce a clustered database `MicroEuk100/90/50` similar to `UniRef100/90/50`. Available on [doi:10.5281/zenodo.10139450](https://zenodo.org/records/10139451).

* **Number of sequences:**

* MicroEuk100 = 79,920,431 (19 GB)
* MicroEuk90 = 51,767,730 (13 GB)
* MicroEuk50 = 29,898,853 (6.5 GB)



* **Number of source organisms per dataset:**

* MycoCosm = 2503
* PhycoCosm = 174
* EnsemblProtists = 233
* MMETSP = 759
* TARA_SAGv1 = 8
* EukProt = 366
* EukZoo = 27
* TARA_SMAGv1 = 389
* NR_Protists-Fungi = 48217

<details>
<summary>**Release v1.4.0 Details**</summary>
* [2023.12.15] - Added `profile-taxonomic.py` module which uses `sylph` to build a sketch database for genomes and queries the genome database similar to `Kraken` for taxonomic abundance.
* [2023.12.14] - Removed requirement to have `--estimated_assembly_size` for Flye per [Flye Issue #652](https://github.com/fenderglass/Flye/issues/652).
* [2023.12.14] - Added `sylph` to `VEBA-profile_env` for abundance profiling of genomes.
* [2023.12.13] - Dereplicate duplicate contigs in `concatenate_fasta.py`.
* [2023.12.12] - Added `--reference_gzipped` to `index.py` and `mapping.py` with new default being that the reference fasta is not gzipped.
* [2023.12.11] - Added `skani` as new default for genome clustering in `cluster.py`, `global_clustering.py`, and `local_clustering.py`.
* [2023.12.11] - Added support for long reads in `fastq_preprocessor`, `preprocess.py`, `assembly-long.py`, `coverage-long`, and all binning modules.
* [2023.11.28] - Fixed `annotations.protein_clusters.tsv.gz` from `merge_annotations.py` added in patch update of `v1.3.1`.
* [2023.11.14] - Added support for missing values in `compile_eukaryotic_classifications.py`.
* [2023.11.13] - Added `--metaeuk_split_memory_limit` argument with (experimental) default set to `36G` in `binning-eukaryotic.py` and `eukaryotic_gene_modeling.py`.
* [2023.11.10] - Added `--compressed 1` to `mmseqs createdb` in `download_databases.sh` installation script.
* [2023.11.10] - Added a check to `check_fasta_duplicates.py` and `clean_fasta.py` to make sure there are no `>` characters in fasta sequences, which can be caused by concatenating fasta files that are missing linebreaks.
* [2023.11.10] - Added `Diamond DeepClust` to `clustering_wrapper.py`, `global/local_clustering.py`, and `cluster.py`. Changed `mmseqs2_wrapper.py` to `clustering_wrapper.py`. Changed `easy-cluster` and `easy-linclust` to `mmseqs-cluster` and `mmseqs-linclust`.
* [2023.11.9] - Fixed viral quality in `merge_genome_quality_assessments.py`
* [2023.11.3] - Changed `consensus_genome_classification.py` to `consensus_genome_classification_ranked.py`. Also changed the default behavior to allow for missing taxonomic levels.
* [2023.11.2] - Fixed a memory leak in `merge_annotations.py` that occurred when creating the `annotations.protein_clusters.tsv.gz` output table. The formatting for empty sets and string lists still needs to be corrected.

</details>
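
The `>` check described in the `[2023.11.10]` entry can be illustrated with a minimal sketch. This is not the code in `check_fasta_duplicates.py` or `clean_fasta.py`; the function name `find_embedded_header_markers` is hypothetical, and the sketch only assumes that a sequence line containing `>` signals two records joined without a linebreak.

```python
import io


def find_embedded_header_markers(handle):
    """Yield (line_number, header) for each sequence line that contains '>'.

    Hypothetical helper for illustration; not part of VEBA.
    """
    current_header = None
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A proper header line: remember which record we are in
            current_header = line[1:]
        elif ">" in line:
            # A '>' inside a sequence line suggests a missing linebreak
            yield line_number, current_header


# Two records concatenated where the first file lacked a trailing linebreak
broken = io.StringIO(">contig_1\nACGT>contig_2\nTTGA\n")
problems = list(find_embedded_header_markers(broken))

# A well-formed file for comparison
clean = io.StringIO(">contig_1\nACGT\n>contig_2\nTTGA\n")
no_problems = list(find_embedded_header_markers(clean))
```

Reporting the line number together with the enclosing header makes it easier to locate which input file was missing its final linebreak.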

**Release v1.3.0 Highlights:**

* **`VEBA` Modules:**
* Added `profile-pathway.py` module and associated scripts for building `HUMAnN` databases from *de novo* genomes and annotations. Essentially, a reads-based functional profiling method via `HUMAnN` using binned genomes as the database.
Expand Down Expand Up @@ -139,6 +205,7 @@ ________________________________________________________________
<summary>**Release v1.1.0 Details**</summary>

* **Modules**:

* `annotate.py`
* Added `NCBIfam-AMRFinder` AMR domain annotations
* Added `AntiFam` contamination annotations
Expand Down Expand Up @@ -238,6 +305,7 @@ ________________________________________________________________
* `build_taxa_sqlite.py`

* **Miscellaneous**:

* Updated environments and now add versions to environments.
* Added `mamba` to installation to speed up.
* Added `transdecoder_wrapper.py` which is a wrapper around `TransDecoder` with direct support for `Diamond` and `HMMSearch` homology searches. Also includes `append_geneid_to_transdecoder_gff.py` which is run in the backend to clean up the GFF file and make them compatible with what is output by `Prodigal` and `MetaEuk` runs of `VEBA`.
Expand Down Expand Up @@ -317,6 +385,8 @@ ________________________________________________________________

**Critical:**

* `binning-prokaryotic.py` doesn't produce an `unbinned.fasta` file for long reads if there aren't any genomes. It also creates a symlink called `genomes` in the working directory.
* Add a way to show all versions
* Genome checkpoints in `tRNAscan-SE` aren't working properly.
* Dereplicate CDS sequences in GFF from `MetaEuk` for `antiSMASH` to work for eukaryotic genomes
* Error with `amplicon.py` that works when run manually...
Expand All @@ -329,39 +399,58 @@ There was a problem importing veba_output/misc/reads_table.tsv:

**Definitely:**

* Use `pigz` instead of `gzip`
* Create a taxdump for `MicroEuk`
* Reimplement `compile_eukaryotic_classifications.py`
* Add representative to `identifier_mapping.proteins.tsv.gz`
* Add coding density to GFF files
* Split `download_databases.sh` into `download_databases.sh` (low memory, high threads) and `configure_databases.sh` (high memory, low-to-mid threads). Use `aria2` in parallel instead of `wget`.
* `NextFlow` support
* Consistent usage of the following terms: 1) dataframe vs. table; 2) protein-cluster vs. orthogroup.
* Add support for `FAMSA` in `phylogeny.py`
* Create an `assembly-longreads.py` module that uses `MetaFlye`
* Expand Microeukaryotic Protein Database to include more microeukaryotes (`Mycocosm` and `PhycoCosm` from `JGI`)
* Install each module via `bioconda`
* Add support for `Salmon` in `mapping.py` and `index.py`. This can be used instead of `STAR` which will require adding the `exon` field to `Prodigal` GFF file (`MetaEuk` modified GFF files already have exon ids).


**Probably (Yes)?:**
**Eventually (Yes)?:**

* Don't load all genomes, proteins, and cds into memory for clustering.
* Add support for `FAMSA` in `phylogeny.py`
* Consistent usage of the following terms: 1) dataframe vs. table; 2) protein-cluster vs. orthogroup.
* Add coding density to GFF files
* Add `vRhyme` to `binning_wrapper.py` and support `vRhyme` in `binning-viral.py`.
* Phylogenetic tree of `MicroEuk100`
* Convert HMMs to `MMSEQS2` (https://github.com/soedinglab/MMseqs2/wiki#how-to-create-a-target-profile-database-from-pfam)?
* Run `cmsearch` before `tRNAscan-SE`
* DN/DS from pangenome analysis
* Add [iPHoP](https://bitbucket.org/srouxjgi/iphop/src/main/) to `binning-viral.py`.
* Add a `metabolic.py` module
* Swap [`TransDecoder`](https://github.com/TransDecoder/TransDecoder) for [`TransSuite`](https://github.com/anonconda/TranSuite)
* Build a clustered version of the Microeukaryotic Protein Database that is more efficient to run. Similar to UniRef100, UniRef90, UniRef50.
* For viral binning, run `vRhyme` on contigs that are not identified as viral via `geNomad -> CheckV`.

**...Maybe (Not)?**

* Modify behavior of `annotate.py` to allow for skipping Pfam and/or KOFAM since they take a long time.


________________________________________________________________


<details>
<summary>**Daily Change Log:**</summary>

* [2023.12.15] - Added `profile-taxonomic.py` module which uses `sylph` to build a sketch database for genomes and queries the genome database similar to `Kraken` for taxonomic abundance.
* [2023.12.14] - Removed requirement to have `--estimated_assembly_size` for Flye per [Flye Issue #652](https://github.com/fenderglass/Flye/issues/652).
* [2023.12.14] - Added `sylph` to `VEBA-profile_env` for abundance profiling of genomes.
* [2023.12.13] - Dereplicate duplicate contigs in `concatenate_fasta.py`.
* [2023.12.12] - Added `--reference_gzipped` to `index.py` and `mapping.py` with new default being that the reference fasta is not gzipped.
* [2023.12.11] - Added `skani` as new default for genome clustering in `cluster.py`, `global_clustering.py`, and `local_clustering.py`.
* [2023.12.11] - Added support for long reads in `fastq_preprocessor`, `preprocess.py`, `assembly-long.py`, and all binning modules.
* [2023.11.28] - Fixed `annotations.protein_clusters.tsv.gz` from `merge_annotations.py` added in patch update of `v1.3.1`.
* [2023.11.14] - Added support for missing values in `compile_eukaryotic_classifications.py`.
* [2023.11.13] - Added `--metaeuk_split_memory_limit` argument with (experimental) default set to `36G` in `binning-eukaryotic.py` and `eukaryotic_gene_modeling.py`.
* [2023.11.10] - Added `--compressed 1` to `mmseqs createdb` in `download_databases.sh` installation script.
* [2023.11.10] - Added a check to `check_fasta_duplicates.py` and `clean_fasta.py` to make sure there are no `>` characters in fasta sequences, which can be caused by concatenating fasta files that are missing linebreaks.
* [2023.11.10] - Added `Diamond DeepClust` to `clustering_wrapper.py`, `global/local_clustering.py`, and `cluster.py`. Changed `mmseqs2_wrapper.py` to `clustering_wrapper.py`. Changed `easy-cluster` and `easy-linclust` to `mmseqs-cluster` and `mmseqs-linclust`.
* [2023.11.9] - Fixed viral quality in `merge_genome_quality_assessments.py`
* [2023.11.3] - Changed `consensus_genome_classification.py` to `consensus_genome_classification_ranked.py`. Also changed the default behavior to allow for missing taxonomic levels.
* [2023.11.2] - Fixed a memory leak in `merge_annotations.py` that occurred when creating the `annotations.protein_clusters.tsv.gz` output table. The formatting for empty sets and string lists still needs to be corrected.
* [2023.10.27] - Update `annotate.py` and `merge_annotations.py` to handle `CAZy`. They also properly address clustered protein annotations now.
* [2023.10.18] - Added `module_completion_ratio.py` script which is a fork of `MicrobeAnnotator` [`ko_mapper.py`](https://github.com/cruizperez/MicrobeAnnotator/blob/master/microbeannotator/pipeline/ko_mapper.py). Also included a database [Zenodo: 10020074](https://zenodo.org/records/10020074) which will be included in `VDB_v5.2`
* [2023.10.16] - Added a checkpoint for `tRNAscan-SE` in `binning-prokaryotic.py` and `eukaryotic_gene_modeling_wrapper.py`.
Expand Down
Binary file added MODULE_RESOURCES.xlsx
Binary file not shown.
51 changes: 35 additions & 16 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -45,19 +45,18 @@ ___________________________________________________________________
* **What's new in `VEBA v1.3.0`?**

* **`VEBA` Modules:**
* Added `profile-pathway.py` module and associated scripts for building `HUMAnN` databases from *de novo* genomes and annotations. Essentially, a reads-based functional profiling method via `HUMAnN` using binned genomes as the database.
* Added `marker_gene_clustering.py` script which identifies core marker proteins that are present in all genomes within a genome cluster (i.e., pangenome) and unique to only that genome cluster. Clusters in either protein or nucleotide space.
* Added `module_completion_ratios.py` script which calculates KEGG module completion ratios for genomes and pangenomes. Automatically run in backend of `annotate.py`.
* Updated `annotate.py` and `merge_annotations.py` to provide better annotations for clustered proteins.
* Added `merge_genome_quality.py` and `merge_taxonomy_classifications.py` which compile genome quality and taxonomy, respectively, for all organisms.
* Added BGC clustering in protein and nucleotide space to `biosynthetic.py`. Also, produces prevalence tables that can be used for further clustering of BGCs.
* Added `pangenome_core_sequences` in `cluster.py`, which writes both protein and CDS sequences for each genome cluster.
* Added PDF visualization of newick trees in `phylogeny.py`.


* **`VEBA` Database (`VDB_v5.2`)**:
* Added `CAZy`
* Added `MicrobeAnnotator-KEGG`

* Added `profile-taxonomic.py` module which uses `sylph` to build a sketch database for genomes and queries the genome database for taxonomic abundance.
* Added long read support for `fastq_preprocessor`, `preprocess.py`, `assembly-long.py`, `coverage-long`, and all binning modules.
* Redesigned `binning-eukaryotic` module to handle custom `MetaEuk` databases.
* Added new usage syntax `veba --module preprocess --params "${PARAMS}"` where the Conda environment is abstracted and determined automatically in the backend. Changed all the walkthroughs to reflect this change.
* Added `skani` which is the new default for genome-level clustering based on ANI.
* Added `Diamond DeepClust` as an alternative to `MMSEQS2` for protein clustering.

* **`VEBA` Database (`VDB_v6`)**:

* Completely rebuilt `VEBA's Microeukaryotic Protein Database` to produce a clustered database `MicroEuk100/90/50` similar to `UniRef100/90/50`. Available on [doi:10.5281/zenodo.10139450](https://zenodo.org/records/10139451).


Check out the [*VEBA* Change Log](CHANGELOG.md) for insight into what is being implemented in the upcoming version.

Expand All @@ -68,9 +67,9 @@ ___________________________________________________________________

### Installation and databases

**Current Stable Version:** [`v1.3.0`](https://github.com/jolespin/veba/releases/tag/v1.3.0)
**Current Stable Version:** [`v1.4.0`](https://github.com/jolespin/veba/releases/tag/v1.4.0)

**Current Database Version:** `VDB_v5.2`
**Current Database Version:** `VDB_v6`

Please refer to the [*Installation and Database Configuration Guide*](install/README.md) for software installation and database configuration.

Expand All @@ -85,7 +84,25 @@ ___________________________________________________________________
[*Usage and Resource Requirements Guide*](src/README.md) for parameters and module descriptions

[*Walkthrough Guides*](walkthroughs/README.md) for tutorials and workflows on how to get started


**Usage Example:**

Running `preprocess` module.

1) Available with `v1.4.0+`:

```
source activate VEBA
veba --module preprocess --params "{PARAMS}"
```

2) Available with `v1.0.0 - v1.4.0+`:

```
source activate VEBA-preprocess_env
preprocess.py "{PARAMS}"
```
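
The two forms above differ only in who selects the Conda environment. As a rough illustration of that idea, the sketch below maps a module name to an environment and builds a `conda run` command. It is not the actual `veba` wrapper: the `MODULE_TO_ENVIRONMENT` table and `build_command` function are hypothetical, the pairing of `profile-taxonomic` with `VEBA-profile_env` is an assumption, and `--example_option` is a placeholder rather than a real `preprocess.py` flag.

```python
import shlex

# Hypothetical lookup table. `VEBA-preprocess_env` and `VEBA-profile_env` are
# environment names mentioned in the README and change log; which module
# belongs to which environment beyond `preprocess` is an assumption here.
MODULE_TO_ENVIRONMENT = {
    "preprocess": "VEBA-preprocess_env",
    "profile-taxonomic": "VEBA-profile_env",
}


def build_command(module, params):
    """Return an argument list that runs `<module>.py` inside its environment."""
    if module not in MODULE_TO_ENVIRONMENT:
        raise KeyError(f"No environment registered for module: {module}")
    environment = MODULE_TO_ENVIRONMENT[module]
    # shlex.split honors shell-style quoting inside the params string
    return ["conda", "run", "-n", environment, f"{module}.py", *shlex.split(params)]


# `--example_option` is a placeholder, not a documented flag
command = build_command("preprocess", "--example_option 'some value'")
```

The resulting list could be passed to `subprocess.run(command, check=True)`; keeping the arguments as a list avoids a second round of shell quoting.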

<p align="right"><a href="#readme-top">^__^</a></p>

___________________________________________________________________
Expand All @@ -100,8 +117,10 @@ If you wish *VEBA* did something that isn't implemented, please submit a [`[Feat

<p align="right"><a href="#readme-top">^__^</a></p>


___________________________________________________________________


### Output structure
*VEBA* is built on the [*GenoPype*](https://github.com/jolespin/genopype) architecture which creates a reproducible and easy-to-navigate directory structure. *GenoPype*'s philosophy is to use the same names for all files but to have sample names as subdirectories. This makes it easier to glob files for grepping, concatenating, etc. *NextFlow* support is in the works...
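
To show why identical filenames under per-sample subdirectories are convenient to glob, here is a small self-contained sketch. The layout and the filename `scaffolds.fasta` are hypothetical stand-ins created in a temporary directory, not the exact paths *VEBA* writes, and `collect_per_sample_files` is an illustrative helper.

```python
import tempfile
from pathlib import Path


def collect_per_sample_files(root, filename):
    """Map sample name -> path for files named `filename` one level below `root`."""
    return {path.parent.name: path for path in sorted(Path(root).glob(f"*/{filename}"))}


with tempfile.TemporaryDirectory() as tmp:
    # Build a toy layout: <root>/<sample>/scaffolds.fasta (hypothetical names)
    for sample in ("sample_A", "sample_B"):
        sample_dir = Path(tmp) / sample
        sample_dir.mkdir()
        (sample_dir / "scaffolds.fasta").write_text(f">{sample}_contig_1\nACGT\n")

    found = collect_per_sample_files(tmp, "scaffolds.fasta")
    sample_names = sorted(found)
    # Concatenate all per-sample files in a deterministic order
    combined = "".join(found[name].read_text() for name in sample_names)
```

Because the sample name lives in the directory rather than the filename, a single pattern recovers every sample's file along with its label.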

Expand Down
Binary file modified SOURCES.xlsx
Binary file not shown.
4 changes: 2 additions & 2 deletions VERSION
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
1.3.0
VDB_v5.2
1.4.0b
VDB_v6