Commit

Merge pull request #118 from jolespin/devel
* [2024.8.29] - Added `VERSION` file created in `download_databases.sh`
* [2024.7.11] - Alignment fraction threshold for genome clustering only applied to reference but should also apply to query.  Added `--af_mode` with either `relaxed = max([Alignment_fraction_ref, Alignment_fraction_query]) > minimum_af` or `strict = (Alignment_fraction_ref > minimum_af) & (Alignment_fraction_query > minimum_af)` to `edgelist_to_clusters.py`, `global_clustering.py`, `local_clustering.py`, and `cluster.py`.
* [2024.7.3] - Added `pigz` to `VEBA-annotate_env` which isn't a problem with most `conda` installations but needed for `docker` containers.
* [2024.6.21] - Changed `choose_fastest_mirror.py` to `determine_fastest_mirror.py`
* [2024.6.20] - Added `-m/--include_mrna` to `compile_metaeuk_identifiers.py` for [Issue #110](https://github.com/jolespin/veba/issues/110)
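
The two `--af_mode` settings described in the [2024.7.11] entry can be sketched as a small predicate. This is a minimal illustration, not the code in `edgelist_to_clusters.py`: the function name `passes_af_threshold` and the example values are assumptions, and only the two comparison expressions come from the changelog entry.

```python
def passes_af_threshold(af_ref, af_query, minimum_af, af_mode="relaxed"):
    """Hypothetical helper: decide whether a genome pair passes the
    alignment-fraction filter under the given mode."""
    if af_mode == "relaxed":
        # Passes if either direction covers enough of its genome.
        return max([af_ref, af_query]) > minimum_af
    if af_mode == "strict":
        # Passes only if both directions cover enough of their genomes.
        return (af_ref > minimum_af) and (af_query > minimum_af)
    raise ValueError(f"Unknown af_mode: {af_mode!r}")


# Illustrative values only: a well-covered reference, a poorly covered query.
relaxed_result = passes_af_threshold(0.9, 0.2, minimum_af=0.5, af_mode="relaxed")
strict_result = passes_af_threshold(0.9, 0.2, minimum_af=0.5, af_mode="strict")
```

With these example values the pair passes in `relaxed` mode and fails in `strict` mode, which is the asymmetric case the entry describes.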
jolespin authored Aug 30, 2024
2 parents ae5ef02 + 679623b commit 2a504ae
Showing 23 changed files with 218 additions and 36 deletions.
166 changes: 166 additions & 0 deletions .gitignore
@@ -0,0 +1,166 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/latest/usage/project/#working-with-version-control
.pdm.toml
.pdm-python
.pdm-build/

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/


.DS_Store
Icon?
5 changes: 4 additions & 1 deletion CHANGELOG.md
@@ -413,12 +413,14 @@ ________________________________________________________________

**Critical:**

* Return code for `cluster.py` when it fails during global and local clustering is 0 but should be 1.
* Don't load all genomes, proteins, and cds into memory for clustering.
* Genome checkpoints in `tRNAscan-SE` aren't working properly.
* Dereplicate CDS sequences in GFF from `MetaEuk` for `antiSMASH` to work for eukaryotic genomes

**Definitely:**

* Add number of unique protein clusters to `identifier_mapping.genomes.tsv.gz` in `cluster.py` to assess the most metabolically diverse representative.
* Add a `--proteins` option to `classify-eukaryotic.py` which aligns proteins to `MicroEuk100.eukaryota_odb10` via `MMseqs2` and then proceeds with the pipeline.
* Add `BiNI` biosynthetic novelty index to `biosynthetic.py`
* `busco_wrapper.py` that relabels all the genes, runs analysis, then converts output to tsv.
@@ -470,7 +472,8 @@ ________________________________________________________________
<details>
<summary> <b>Daily Change Log:</b> </summary>

* [2024.7.11] - * Alignment fraction threshold for genome clustering only applied to reference but should also apply to query. Added `--af_mode` with either `relaxed = max([Alignment_fraction_ref, Alignment_fraction_query]) > minimum_af` or `strict = (Alignment_fraction_ref > minimum_af) & (Alignment_fraction_query > minimum_af)` to `edgelist_to_clusters.py`, `global_clustering.py`, `local_clustering.py`, and `cluster.py`.
* [2024.8.29] - Added `VERSION` file created in `download_databases.sh`
* [2024.7.11] - Alignment fraction threshold for genome clustering only applied to reference but should also apply to query. Added `--af_mode` with either `relaxed = max([Alignment_fraction_ref, Alignment_fraction_query]) > minimum_af` or `strict = (Alignment_fraction_ref > minimum_af) & (Alignment_fraction_query > minimum_af)` to `edgelist_to_clusters.py`, `global_clustering.py`, `local_clustering.py`, and `cluster.py`.
* [2024.7.3] - Added `pigz` to `VEBA-annotate_env` which isn't a problem with most `conda` installations but needed for `docker` containers.
* [2024.6.21] - Changed `choose_fastest_mirror.py` to `determine_fastest_mirror.py`
* [2024.6.20] - Added `-m/--include_mrna` to `compile_metaeuk_identifiers.py` for [Issue #110](https://github.com/jolespin/veba/issues/110)
12 changes: 11 additions & 1 deletion FAQ.md
@@ -661,7 +661,17 @@ Check out the [*VEBA* step-by-step guide](https://github.com/jolespin/veba/tree/

______________________

#### How can I update the database from VEBA v2.1.0 (VEBA Database: VDB_v6) to VEBA v2.2.0 (VEBA Database: VDB_v7)?
#### Why is GTDB-Tk taking so long to download?

This is a known issue that seems to depend on which region you're in; see [GTDB-Tk Issue #522](https://github.com/Ecogenomics/GTDBTk/issues/522#issuecomment-2182847947).
I developed a script, `veba/bin/scripts/determine_fastest_mirror.py` (formerly `choose_fastest_mirror.py`), that takes both mirror URLs and tells you which one is faster.

<p align="right"><a href="#faq-top">^__^</a></p>
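
The general idea of comparing mirrors can be sketched as below. This is not the implementation of `determine_fastest_mirror.py`, whose interface is not shown here; the function names, the byte budget, and the timeout are all assumptions chosen for illustration. The sketch times a partial download from each URL and reports the quickest one.

```python
import time
from urllib.request import urlopen


def time_partial_download(url, n_bytes=1_000_000, timeout=10.0):
    """Hypothetical helper: seconds taken to read up to n_bytes from url."""
    start = time.perf_counter()
    with urlopen(url, timeout=timeout) as response:
        response.read(n_bytes)
    return time.perf_counter() - start


def fastest_mirror(urls, timer=time_partial_download):
    """Return (best_url, timings); unreachable mirrors get infinite time."""
    timings = {}
    for url in urls:
        try:
            timings[url] = timer(url)
        except OSError:
            # URLError and socket timeouts are both OSError subclasses.
            timings[url] = float("inf")
    return min(timings, key=timings.get), timings
```

Passing the `timer` as a parameter keeps the comparison logic testable without network access; in real use you would call `fastest_mirror([...])` with the two mirror URLs you want to compare.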

______________________


#### How can I update the database from VEBA v2.1.0 (VEBA Database: VDB_v6) to VEBA ≥v2.2.0 (VEBA Database: VDB_v7)?

After uninstalling and installing the new environments:

4 changes: 2 additions & 2 deletions README.md
@@ -49,11 +49,11 @@ ___________________________________________________________________

### Announcements

* **Current Stable Version:** [`v2.2.0`](https://github.com/jolespin/veba/releases/tag/v2.2.0)
* **Current Stable Version:** [`v2.2.1`](https://github.com/jolespin/veba/releases/tag/v2.2.0)

* **Current Database Version:** [`VDB_v7`](install/DATABASE.md)

If you are updating to v2.2.0 you will need to modify your existing database.
If you are updating to v2.2.0 you will need to modify your existing database.
Please see [FAQs](FAQ.md#how-can-i-update-the-database-from-veba-v210-veba-database-vdb_v6-to-veba-v220-veba-database-vdb_v7) for more details.

<details>
2 changes: 1 addition & 1 deletion VERSION
@@ -1,2 +1,2 @@
2.2.1b
2.2.1
VDB_v7
4 changes: 2 additions & 2 deletions bin/binning-prokaryotic.py
@@ -13,7 +13,7 @@
pd.options.display.max_colwidth = 100
# from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
__version__ = "2024.4.29"
__version__ = "2024.8.29"

# Assembly
def get_coverage_cmd( input_filepaths, output_filepaths, output_directory, directories, opts):
@@ -105,7 +105,7 @@ def get_pyrodigal_cmd(input_filepaths, output_filepaths, output_directory, direc

"&&",

"rm",
"rm -rf",
os.path.join(directories["tmp"], "tmp.*")

]
3 changes: 1 addition & 2 deletions bin/biosynthetic.py
@@ -417,8 +417,7 @@ def get_diamond_cmd( input_filepaths, output_filepaths, output_directory, direct

"&&",

"rm",
"-rf",
"rm -rf",
os.path.join(directories["tmp"], "components.concatenated.faa"),
os.path.join(output_directory, "*.no_header.tsv"),

4 changes: 2 additions & 2 deletions bin/mapping.py
@@ -13,7 +13,7 @@
pd.options.display.max_colwidth = 100
# from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
__version__ = "2024.4.29"
__version__ = "2024.8.29"


# Bowtie2
@@ -184,7 +184,7 @@ def get_featurecounts_cmd(input_filepaths, output_filepaths, output_directory, d
else:
cmd += [
"&&",
"rm {}".format(os.path.join(output_directory, "featurecounts.*.tsv")),
"rm -rf {}".format(os.path.join(output_directory, "featurecounts.*.tsv")),
]

if opts.proteins_to_orthogroups:
4 changes: 2 additions & 2 deletions bin/scripts/binning_wrapper.py
@@ -12,7 +12,7 @@

# from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
__version__ = "2023.12.4"
__version__ = "2024.8.29"

def get_maxbin2_cmd( input_filepaths, output_filepaths, output_directory, directories, opts):
# Create dummy scaffolds_to_bins.tsv to overwrite later. This makes DAS_Tool easier to run
@@ -281,7 +281,7 @@ def get_concoct_cmd( input_filepaths, output_filepaths, output_directory, direct
# "&&",
# # VAMB
# "(",
# "rm -r {}".format(output_directory), # There can't be an existing directory for some reason
# "rm -rf {}".format(output_directory), # There can't be an existing directory for some reason
# "&&",
# os.environ["vamb"],
# "--fasta {}".format(input_filepaths[0]),
4 changes: 2 additions & 2 deletions bin/scripts/bowtie2_wrapper.py
@@ -12,7 +12,7 @@
pd.options.display.max_colwidth = 100
# from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
__version__ = "2024.4.29"
__version__ = "2024.8.29"


# Bowtie2
@@ -145,7 +145,7 @@ def get_featurecounts_cmd(input_filepaths, output_filepaths, output_directory, d
else:
cmd += [
"&&",
"rm {}".format(os.path.join(output_directory, "featurecounts.tsv")),
"rm -rf {}".format(os.path.join(output_directory, "featurecounts.tsv")),
]

return cmd
4 changes: 2 additions & 2 deletions bin/scripts/eukaryotic_gene_modeling_wrapper.py
@@ -13,7 +13,7 @@

# from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
__version__ = "2023.12.22"
__version__ = "2024.8.29"

# Tiara
def get_tiara_cmd(input_filepaths, output_filepaths, output_directory, directories, opts):
@@ -230,7 +230,7 @@ def get_pyrodigal_cmd(input_filepaths, output_filepaths, output_directory, direc

"&&",

"rm",
"rm -rf",
os.path.join(directories["tmp"], "tmp.*"),
]

6 changes: 3 additions & 3 deletions bin/scripts/prokaryotic_gene_modeling_wrapper.py
@@ -12,7 +12,7 @@

# from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
__version__ = "2023.8.28"
__version__ = "2024.8.29"

# Pyrodigal
def get_pyrodigal_cmd(input_filepaths, output_filepaths, output_directory, directories, opts):
@@ -55,7 +55,7 @@ def get_pyrodigal_cmd(input_filepaths, output_filepaths, output_directory, direc

"&&",

"rm",
"rm -rf",
os.path.join(directories["tmp"], "tmp.*")

]
@@ -167,7 +167,7 @@ def get_pyrodigal_cmd(input_filepaths, output_filepaths, output_directory, direc

"&&",

"rm",
"rm -rf",
os.path.join(directories["tmp"], "tmp.*"),
]

4 changes: 2 additions & 2 deletions bin/scripts/transdecoder_wrapper.py
@@ -12,7 +12,7 @@

# from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
__version__ = "2023.5.8"
__version__ = "2024.8.29"


# TransDecoder
@@ -149,7 +149,7 @@ def get_transdecoder_predict_cmd(input_filepaths, output_filepaths, output_direc

"&&",

"rm -f pipeliner.*.cmds",
"rm -rf pipeliner.*.cmds",
]
return cmd
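
The commit does not state why the cleanup steps above were switched to `rm -rf`. One plausible reading (an assumption, not something the source confirms) is that a plain `rm` exits non-zero when its target does not exist, which would make an `&&`-chained cleanup step report failure, whereas `rm -rf` exits zero. A minimal sketch of that difference on a POSIX system, using a hypothetical file name:

```python
import os
import subprocess
import tempfile


def cleanup_return_codes():
    """Return (plain_rm_code, forced_rm_code) for a path that does not exist."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical leftover file that was never created.
        missing = os.path.join(tmp, "tmp.missing")
        plain = subprocess.run(
            ["rm", missing], stderr=subprocess.DEVNULL
        ).returncode
        forced = subprocess.run(
            ["rm", "-rf", missing], stderr=subprocess.DEVNULL
        ).returncode
    return plain, forced


plain_code, forced_code = cleanup_return_codes()
```

Here `plain_code` is non-zero and `forced_code` is zero, so only the `-rf` form lets a cleanup step succeed when there is nothing to remove.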

Binary file removed images/Schematic.png
Binary file not shown.
Binary file added images/Schematic_v2.0.0.png
4 changes: 2 additions & 2 deletions install/README.md
@@ -80,7 +80,7 @@ The `VEBA` installation is going to configure some `conda` environments for you
```
# For stable version, download and decompress the tarball:
VERSION="2.2.0"
VERSION="2.2.1"
# wget https://github.com/jolespin/veba/archive/refs/tags/v${VERSION}.tar.gz # The .tar.gz is out of date in this release
# tar -xvf v${VERSION}.tar.gz && mv veba-${VERSION} veba
@@ -89,7 +89,7 @@ wget https://github.com/jolespin/veba/releases/download/v${VERSION}/v${VERSION}.
unzip -d veba v${VERSION}.zip
# For developmental version, clone the repository:
# Note: This is not recommended because between v2.1.0 and v2.2.0, case changes were introduced (KOFAM -> KOfam)
# Note: This is not recommended because between v2.1.0 and v2.2.0, case changes were introduced (KOFAM -> KOfam)
# and these changes are not updating on GitHub. Please use official releases instead of pulling the repo:
# git clone --branch devel https://github.com/jolespin/veba.git