Merge pull request #118 from jolespin/devel

* [2024.8.29] - Added `VERSION` file created in `download_databases.sh`
* [2024.7.11] - Alignment fraction threshold for genome clustering only applied to reference but should also apply to query. Added `--af_mode` with either `relaxed = max([Alignment_fraction_ref, Alignment_fraction_query]) > minimum_af` or `strict = (Alignment_fraction_ref > minimum_af) & (Alignment_fraction_query > minimum_af)` to `edgelist_to_clusters.py`, `global_clustering.py`, `local_clustering.py`, and `cluster.py`.
* [2024.7.3] - Added `pigz` to `VEBA-annotate_env` which isn't a problem with most `conda` installations but needed for `docker` containers.
* [2024.6.21] - Changed `choose_fastest_mirror.py` to `determine_fastest_mirror.py`
* [2024.6.20] - Added `-m/--include_mrna` to `compile_metaeuk_identifiers.py` for [Issue #110](#110)
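
The two `--af_mode` settings described in the 2024.7.11 entry can be illustrated with a minimal sketch. This is not the code from `edgelist_to_clusters.py`; the function name `passes_af_threshold`, the example edges, and the `minimum_af` value of 0.3 are all hypothetical and chosen only to show how the two expressions differ.

```python
def passes_af_threshold(af_ref, af_query, minimum_af, af_mode="relaxed"):
    """Return True if an alignment passes the alignment-fraction filter.

    Hypothetical helper that mirrors the two expressions in the changelog.
    """
    if af_mode == "relaxed":
        # Only the larger of the two alignment fractions must exceed the threshold
        return max([af_ref, af_query]) > minimum_af
    if af_mode == "strict":
        # Both the reference and the query alignment fractions must exceed it
        return (af_ref > minimum_af) and (af_query > minimum_af)
    raise ValueError(f"af_mode must be 'relaxed' or 'strict', not {af_mode!r}")


# Made-up edges: (reference, query, Alignment_fraction_ref, Alignment_fraction_query)
edges = [
    ("genome_A", "genome_B", 0.90, 0.20),
    ("genome_A", "genome_C", 0.80, 0.70),
    ("genome_B", "genome_C", 0.10, 0.25),
]
minimum_af = 0.3  # illustrative threshold, not necessarily the tool's default

relaxed = [(r, q) for r, q, af_r, af_q in edges
           if passes_af_threshold(af_r, af_q, minimum_af, "relaxed")]
strict = [(r, q) for r, q, af_r, af_q in edges
          if passes_af_threshold(af_r, af_q, minimum_af, "strict")]

print(relaxed)
print(strict)
```

In this toy input the `genome_A`/`genome_B` edge is kept under `relaxed` but dropped under `strict`, because only the reference side aligns well. That asymmetric case is the one the changelog entry says the reference-only threshold missed.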
jolespin · Aug 30, 2024 · 2a504ae
2 parents ae5ef02 + 679623b
commit 2a504ae
Showing 23 changed files with 218 additions and 36 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,166 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+.pybuilder/
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+#   For a library or package, you might want to ignore these files since the code is
+#   intended to run in multiple environments; otherwise, check them in:
+# .python-version
+
+# pipenv
+#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+#   However, in case of collaboration, if having platform-specific dependencies or dependencies
+#   having no cross-platform support, pipenv may install dependencies that don't work, or not
+#   install all needed dependencies.
+#Pipfile.lock
+
+# poetry
+#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+#   This is especially recommended for binary packages to ensure reproducibility, and is more
+#   commonly ignored for libraries.
+#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+#poetry.lock
+
+# pdm
+#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+#pdm.lock
+#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
+#   in version control.
+#   https://pdm.fming.dev/latest/usage/project/#working-with-version-control
+.pdm.toml
+.pdm-python
+.pdm-build/
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# pytype static type analyzer
+.pytype/
+
+# Cython debug symbols
+cython_debug/
+
+# PyCharm
+#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+#  and can be added to the global gitignore or merged into this file.  For a more nuclear
+#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
+#.idea/
+
+
+.DS_Store
+Icon?
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -413,12 +413,14 @@ ________________________________________________________________
 
 **Critical:**
 
+* Return code for `cluster.py` when it fails during global and local clustering is 0 but should be 1.
 * Don't load all genomes, proteins, and cds into memory for clustering.
 * Genome checkpoints in `tRNAscan-SE` aren't working properly.
 * Dereplicate CDS sequences in GFF from `MetaEuk` for `antiSMASH` to work for eukaryotic genomes
 
 **Definitely:**
 
+* Add number of unique protein clusters to `identifier_mapping.genomes.tsv.gz` in `cluster.py` to assess the most metabolically diverse representative.
 * Add a `--proteins` option to `classify-eukaryotic.py` which aligns proteins to `MicroEuk100.eukaryota_odb10` via `MMseqs2` and then proceeds with the pipeline.
 * Add `BiNI` biosynthetic novelty index to `biosynthetic.py`
 * `busco_wrapper.py` that relabels all the genes, runs analysis, then converts output to tsv.
@@ -470,7 +472,8 @@ ________________________________________________________________
 <details>
 	<summary> <b>Daily Change Log:</b> </summary>
 
-* [2024.7.11] - * Alignment fraction threshold for genome clustering only applied to reference but should also apply to query.  Added `--af_mode` with either `relaxed = max([Alignment_fraction_ref, Alignment_fraction_query]) > minimum_af` or `strict = (Alignment_fraction_ref > minimum_af) & (Alignment_fraction_query > minimum_af)` to `edgelist_to_clusters.py`, `global_clustering.py`, `local_clustering.py`, and `cluster.py`.
+* [2024.8.29] - Added `VERSION` file created in `download_databases.sh`
+* [2024.7.11] - Alignment fraction threshold for genome clustering only applied to reference but should also apply to query.  Added `--af_mode` with either `relaxed = max([Alignment_fraction_ref, Alignment_fraction_query]) > minimum_af` or `strict = (Alignment_fraction_ref > minimum_af) & (Alignment_fraction_query > minimum_af)` to `edgelist_to_clusters.py`, `global_clustering.py`, `local_clustering.py`, and `cluster.py`.
 * [2024.7.3] - Added `pigz` to `VEBA-annotate_env` which isn't a problem with most `conda` installations but needed for `docker` containers.
 * [2024.6.21] - Changed `choose_fastest_mirror.py` to `determine_fastest_mirror.py`
 * [2024.6.20] - Added `-m/--include_mrna` to `compile_metaeuk_identifiers.py` for [Issue #110](https://github.com/jolespin/veba/issues/110)

diff --git a/FAQ.md b/FAQ.md
@@ -661,7 +661,17 @@ Check out the [*VEBA* step-by-step guide](https://github.com/jolespin/veba/tree/
 
 ______________________
 
-#### How can I update the database from VEBA v2.1.0 (VEBA Database: VDB_v6) to VEBA v2.2.0 (VEBA Database: VDB_v7)?
+#### Why is GTDB-Tk taking so long to download?
+
+This is a known issue and seems to depend on which region you're in, so check out [GTDB-Tk Issue #522](https://github.com/Ecogenomics/GTDBTk/issues/522#issuecomment-2182847947).
+I developed a script in `veba/bin/scripts/determine_fastest_mirror.py` (formerly `choose_fastest_mirror.py`) where you can give both URL mirrors and it will tell you which one is faster.
+
+<p align="right"><a href="#faq-top">^__^</a></p>
+
+______________________
+
+
+#### How can I update the database from VEBA v2.1.0 (VEBA Database: VDB_v6) to VEBA ≥v2.2.0 (VEBA Database: VDB_v7)?
 
 After uninstalling and installing the new environments: 
 

diff --git a/README.md b/README.md
@@ -49,11 +49,11 @@ ___________________________________________________________________
 
 ### Announcements
 
-* **Current Stable Version:** [`v2.2.0`](https://github.com/jolespin/veba/releases/tag/v2.2.0)
+* **Current Stable Version:** [`v2.2.1`](https://github.com/jolespin/veba/releases/tag/v2.2.1)
 
 * **Current Database Version:** [`VDB_v7`](install/DATABASE.md)
 
-	If you are updating to v2.2.0 you will need to modify your existing database.  
+	If you are updating to ≥v2.2.0 you will need to modify your existing database.  
 	Please see [FAQs](FAQ.md#how-can-i-update-the-database-from-veba-v210-veba-database-vdb_v6-to-veba-v220-veba-database-vdb_v7) for more details.
 
 	<details>

diff --git a/VERSION b/VERSION
@@ -1,2 +1,2 @@
-2.2.1b
+2.2.1
 VDB_v7
diff --git a/bin/binning-prokaryotic.py b/bin/binning-prokaryotic.py
@@ -13,7 +13,7 @@
 pd.options.display.max_colwidth = 100
 # from tqdm import tqdm
 __program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2024.4.29"
+__version__ = "2024.8.29"
 
 # Assembly
 def get_coverage_cmd( input_filepaths, output_filepaths, output_directory, directories, opts):
@@ -105,7 +105,7 @@ def get_pyrodigal_cmd(input_filepaths, output_filepaths, output_directory, direc
 
             "&&",
 
-        "rm",
+        "rm -rf",
         os.path.join(directories["tmp"], "tmp.*")
 
     ]

diff --git a/bin/biosynthetic.py b/bin/biosynthetic.py
@@ -417,8 +417,7 @@ def get_diamond_cmd( input_filepaths, output_filepaths, output_directory, direct
 
             "&&",
 
-        "rm",
-        "-rf",
+        "rm -rf",
         os.path.join(directories["tmp"], "components.concatenated.faa"),
         os.path.join(output_directory, "*.no_header.tsv"),
 

diff --git a/bin/mapping.py b/bin/mapping.py
@@ -13,7 +13,7 @@
 pd.options.display.max_colwidth = 100
 # from tqdm import tqdm
 __program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2024.4.29"
+__version__ = "2024.8.29"
 
 
 # Bowtie2
@@ -184,7 +184,7 @@ def get_featurecounts_cmd(input_filepaths, output_filepaths, output_directory, d
     else:
         cmd += [
             "&&",
-        "rm {}".format(os.path.join(output_directory, "featurecounts.*.tsv")),
+        "rm -rf {}".format(os.path.join(output_directory, "featurecounts.*.tsv")),
         ]
 
     if opts.proteins_to_orthogroups:

diff --git a/bin/scripts/binning_wrapper.py b/bin/scripts/binning_wrapper.py
@@ -12,7 +12,7 @@
 
 # from tqdm import tqdm
 __program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2023.12.4"
+__version__ = "2024.8.29"
 
 def get_maxbin2_cmd( input_filepaths, output_filepaths, output_directory, directories, opts):
     # Create dummy scaffolds_to_bins.tsv to overwrite later. This makes DAS_Tool easier to run
@@ -281,7 +281,7 @@ def get_concoct_cmd( input_filepaths, output_filepaths, output_directory, direct
 #     "&&",
 #     # VAMB
 #     "(",
-#     "rm -r {}".format(output_directory), # There can't be an existing directory for some reason
+#     "rm -rf {}".format(output_directory), # There can't be an existing directory for some reason
 #     "&&",
 #     os.environ["vamb"],
 #     "--fasta {}".format(input_filepaths[0]),

diff --git a/bin/scripts/bowtie2_wrapper.py b/bin/scripts/bowtie2_wrapper.py
@@ -12,7 +12,7 @@
 pd.options.display.max_colwidth = 100
 # from tqdm import tqdm
 __program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2024.4.29"
+__version__ = "2024.8.29"
 
 
 # Bowtie2
@@ -145,7 +145,7 @@ def get_featurecounts_cmd(input_filepaths, output_filepaths, output_directory, d
     else:
         cmd += [
             "&&",
-        "rm {}".format(os.path.join(output_directory, "featurecounts.tsv")),
+        "rm -rf {}".format(os.path.join(output_directory, "featurecounts.tsv")),
         ]
 
     return cmd

diff --git a/bin/scripts/eukaryotic_gene_modeling_wrapper.py b/bin/scripts/eukaryotic_gene_modeling_wrapper.py
@@ -13,7 +13,7 @@
 
 # from tqdm import tqdm
 __program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2023.12.22"
+__version__ = "2024.8.29"
 
 # Tiara
 def get_tiara_cmd(input_filepaths, output_filepaths, output_directory, directories, opts):
@@ -230,7 +230,7 @@ def get_pyrodigal_cmd(input_filepaths, output_filepaths, output_directory, direc
 
             "&&",
 
-        "rm",
+        "rm -rf",
         os.path.join(directories["tmp"], "tmp.*"),
         ]
 

diff --git a/bin/scripts/prokaryotic_gene_modeling_wrapper.py b/bin/scripts/prokaryotic_gene_modeling_wrapper.py
@@ -12,7 +12,7 @@
 
 # from tqdm import tqdm
 __program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2023.8.28"
+__version__ = "2024.8.29"
 
 # Pyrodigal
 def get_pyrodigal_cmd(input_filepaths, output_filepaths, output_directory, directories, opts):
@@ -55,7 +55,7 @@ def get_pyrodigal_cmd(input_filepaths, output_filepaths, output_directory, direc
 
             "&&",
 
-        "rm",
+        "rm -rf",
         os.path.join(directories["tmp"], "tmp.*")
 
     ]
@@ -167,7 +167,7 @@ def get_pyrodigal_cmd(input_filepaths, output_filepaths, output_directory, direc
 
             "&&",
 
-        "rm",
+        "rm -rf",
         os.path.join(directories["tmp"], "tmp.*"),
         ]
 

diff --git a/bin/scripts/transdecoder_wrapper.py b/bin/scripts/transdecoder_wrapper.py
@@ -12,7 +12,7 @@
 
 # from tqdm import tqdm
 __program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2023.5.8"
+__version__ = "2024.8.29"
 
 
 # TransDecoder
@@ -149,7 +149,7 @@ def get_transdecoder_predict_cmd(input_filepaths, output_filepaths, output_direc
 
             "&&",
 
-        "rm -f pipeliner.*.cmds",
+        "rm -rf pipeliner.*.cmds",
     ]
     return cmd
 

diff --git a/images/Schematic.png b/images/Schematic.png
diff --git a/images/Schematic.pdf b/images/Schematic_v2.0.0.pdf
diff --git a/images/Schematic_v2.0.0.png b/images/Schematic_v2.0.0.png
diff --git a/install/README.md b/install/README.md
@@ -80,7 +80,7 @@ The `VEBA` installation is going to configure some `conda` environments for you
 ```
 # For stable version, download and decompress the tarball:
 
-VERSION="2.2.0"
+VERSION="2.2.1"
 # wget https://github.com/jolespin/veba/archive/refs/tags/v${VERSION}.tar.gz # The .tar.gz is out of date in this release
 # tar -xvf v${VERSION}.tar.gz && mv veba-${VERSION} veba
 
@@ -89,7 +89,7 @@ wget https://github.com/jolespin/veba/releases/download/v${VERSION}/v${VERSION}.
 unzip -d veba v${VERSION}.zip
 
 # For developmental version, clone the repository:
-# Note: This is not recommended because between v2.1.0 and v2.2.0, case changes were introduced (KOFAM -> KOfam)
+# Note: This is not recommended because between v2.1.0 and ≥v2.2.0, case changes were introduced (KOFAM -> KOfam)
 # and these changes are not updating on GitHub.  Please use official releases instead of pulling the repo:
 # git clone --branch devel https://github.com/jolespin/veba.git