diff --git a/.DS_Store b/.DS_Store
deleted file mode 100644
index 2acc3c3..0000000
Binary files a/.DS_Store and /dev/null differ
diff --git a/CHANGELOG.md b/CHANGELOG.md
index a00fb2e..7f1709c 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -6,7 +6,73 @@ ________________________________________________________________
#### Current Releases:
-**Release v1.3.0:**
+**Release v1.4.0 Highlights:**
+
+* **`VEBA` Modules:**
+
+	* Added `profile-taxonomic.py` module which uses `sylph` to build a sketch database from genomes and query it for taxonomic abundance.
+	* Added long-read support for `fastq_preprocessor`, `preprocess.py`, `assembly-long.py`, `coverage-long`, and all binning modules.
+	* Redesigned the `binning-eukaryotic` module to handle custom `MetaEuk` databases.
+	* Added new usage syntax `veba --module preprocess --params "${PARAMS}"` where the Conda environment is abstracted and determined automatically in the backend. Updated all walkthroughs to reflect this change.
+	* Added `skani`, the new default for genome-level clustering based on ANI.
+ * Added `Diamond DeepClust` as an alternative to `MMSEQS2` for protein clustering.
+
+* **`VEBA` Database (`VDB_v6`):**
+
+	* Completely rebuilt `VEBA`'s Microeukaryotic Protein Database to produce a clustered database, `MicroEuk100/90/50`, similar to `UniRef100/90/50`. Available on [doi:10.5281/zenodo.10139450](https://zenodo.org/records/10139451).
+
+ * **Number of sequences:**
+
+ * MicroEuk100 = 79,920,431 (19 GB)
+
+ * MicroEuk90 = 51,767,730 (13 GB)
+
+ * MicroEuk50 = 29,898,853 (6.5 GB)
+
+
+
+ * **Number of source organisms per dataset:**
+
+ * MycoCosm = 2503
+
+ * PhycoCosm = 174
+
+ * EnsemblProtists = 233
+
+ * MMETSP = 759
+
+ * TARA_SAGv1 = 8
+
+ * EukProt = 366
+
+ * EukZoo = 27
+
+ * TARA_SMAGv1 = 389
+
+ * NR_Protists-Fungi = 48217
+
+
+**Release v1.4.0 Details**
+* [2023.12.15] - Added `profile-taxonomic.py` module which uses `sylph` to build a sketch database for genomes and queries the genome database (similar to `Kraken`) for taxonomic abundance.
+* [2023.12.14] - Removed requirement to have `--estimated_assembly_size` for Flye per [Flye Issue #652](https://github.com/fenderglass/Flye/issues/652).
+* [2023.12.14] - Added `sylph` to `VEBA-profile_env` for abundance profiling of genomes.
+* [2023.12.13] - Dereplicate duplicate contigs in `concatenate_fasta.py`.
+* [2023.12.12] - Added `--reference_gzipped` to `index.py` and `mapping.py` with new default being that the reference fasta is not gzipped.
+* [2023.12.11] - Added `skani` as new default for genome clustering in `cluster.py`, `global_clustering.py`, and `local_clustering.py`.
+* [2023.12.11] - Added support for long reads in `fastq_preprocessor`, `preprocess.py`, `assembly-long.py`, `coverage-long`, and all binning modules.
+* [2023.11.28] - Fixed `annotations.protein_clusters.tsv.gz` output from `merge_annotations.py` introduced in the `v1.3.1` patch.
+* [2023.11.14] - Added support for missing values in `compile_eukaryotic_classifications.py`.
+* [2023.11.13] - Added `--metaeuk_split_memory_limit` argument with (experimental) default set to `36G` in `binning-eukaryotic.py` and `eukaryotic_gene_modeling.py`.
+* [2023.11.10] - Added `--compressed 1` to `mmseqs createdb` in `download_databases.sh` installation script.
+* [2023.11.10] - Added a check to `check_fasta_duplicates.py` and `clean_fasta.py` to make sure there are no `>` characters within fasta sequences, which can result from concatenating fasta files that are missing line breaks.
+* [2023.11.10] - Added `Diamond DeepClust` to `clustering_wrapper.py`, `global/local_clustering.py`, and `cluster.py`. Changed `mmseqs2_wrapper.py` to `clustering_wrapper.py`. Changed `easy-cluster` and `easy-linclust` to `mmseqs-cluster` and `mmseqs-linclust`.
+* [2023.11.9] - Fixed viral quality assessment in `merge_genome_quality_assessments.py`.
+* [2023.11.3] - Changed `consensus_genome_classification.py` to `consensus_genome_classification_ranked.py`. Also changed the default behavior to allow for missing taxonomic levels.
+* [2023.11.2] - Fixed a memory leak in `merge_annotations.py` when creating the `annotations.protein_clusters.tsv.gz` output table. However, the formatting for empty sets and string lists still needs to be corrected.
+
+
+
+**Release v1.3.0 Highlights:**
* **`VEBA` Modules:**
* Added `profile-pathway.py` module and associated scripts for building `HUMAnN` databases from *de novo* genomes and annotations. Essentially, a reads-based functional profiling method via `HUMAnN` using binned genomes as the database.
@@ -139,6 +205,7 @@ ________________________________________________________________
**Release v1.1.0 Details**
* **Modules**:
+
* `annotate.py`
* Added `NCBIfam-AMRFinder` AMR domain annotations
	* Added `AntiFam` contamination annotations
@@ -238,6 +305,7 @@ ________________________________________________________________
* `build_taxa_sqlite.py`
* **Miscellaneous**:
+
* Updated environments and now add versions to environments.
* Added `mamba` to installation to speed up.
* Added `transdecoder_wrapper.py` which is a wrapper around `TransDecoder` with direct support for `Diamond` and `HMMSearch` homology searches. Also includes `append_geneid_to_transdecoder_gff.py` which is run in the backend to clean up the GFF file and make them compatible with what is output by `Prodigal` and `MetaEuk` runs of `VEBA`.
@@ -317,6 +385,8 @@ ________________________________________________________________
**Critical:**
+* `binning-prokaryotic.py` doesn't produce an `unbinned.fasta` file for long reads if there aren't any genomes. It also creates a symlink called `genomes` in the working directory.
+* Add a way to show all versions
* Genome checkpoints in `tRNAscan-SE` aren't working properly.
* Dereplicate CDS sequences in GFF from `MetaEuk` for `antiSMASH` to work for eukaryotic genomes
* Error with `amplicon.py` that works when run manually...
@@ -329,39 +399,58 @@ There was a problem importing veba_output/misc/reads_table.tsv:
**Definitely:**
+* Use `pigz` instead of `gzip`
+* Create a taxdump for `MicroEuk`
+* Reimplement `compile_eukaryotic_classifications.py`
* Add representative to `identifier_mapping.proteins.tsv.gz`
-* Add coding density to GFF files
* Split `download_databases.sh` into `download_databases.sh` (low memory, high threads) and `configure_databases.sh` (high memory, low-to-mid threads). Use `aria2` in parallel instead of `wget`.
* `NextFlow` support
-* Consistent usage of the following terms: 1) dataframe vs. table; 2) protein-cluster vs. orthogroup.
-* Add support for `FAMSA` in `phylogeny.py`
-* Create a `assembly-longreads.py` module that uses `MetaFlye`
-* Expand Microeukaryotic Protein Database to include more microeukaryotes (`Mycocosm` and `PhycoCosm` from `JGI`)
* Install each module via `bioconda`
* Add support for `Salmon` in `mapping.py` and `index.py`. This can be used instead of `STAR` which will require adding the `exon` field to `Prodigal` GFF file (`MetaEuk` modified GFF files already have exon ids).
-**Probably (Yes)?:**
+**Eventually (Yes)?:**
+* Don't load all genomes, proteins, and CDS into memory for clustering.
+* Add support for `FAMSA` in `phylogeny.py`
+* Consistent usage of the following terms: 1) dataframe vs. table; 2) protein-cluster vs. orthogroup.
+* Add coding density to GFF files
+* Add `vRhyme` to `binning_wrapper.py` and support `vRhyme` in `binning-viral.py`.
+* Phylogenetic tree of `MicroEuk100`
* Convert HMMs to `MMSEQS2` (https://github.com/soedinglab/MMseqs2/wiki#how-to-create-a-target-profile-database-from-pfam)?
* Run `cmsearch` before `tRNAscan-SE`
* dN/dS from pangenome analysis
* Add [iPHoP](https://bitbucket.org/srouxjgi/iphop/src/main/) to `binning-viral.py`.
* Add a `metabolic.py` module
* Swap [`TransDecoder`](https://github.com/TransDecoder/TransDecoder) for [`TransSuite`](https://github.com/anonconda/TranSuite)
-* Build a clustered version of the Microeukaryotic Protein Database that is more efficient to run. Similar to UniRef100, UniRef90, UniRef50.
+* For viral binning, bin the contigs that are not identified as viral via `geNomad -> CheckV` with `vRhyme`.
**...Maybe (Not)?**
* Modify behavior of `annotate.py` to allow for skipping Pfam and/or KOFAM since they take a long time.
-
________________________________________________________________
**Daily Change Log:**
+* [2023.12.15] - Added `profile-taxonomic.py` module which uses `sylph` to build a sketch database for genomes and queries the genome database (similar to `Kraken`) for taxonomic abundance.
+* [2023.12.14] - Removed requirement to have `--estimated_assembly_size` for Flye per [Flye Issue #652](https://github.com/fenderglass/Flye/issues/652).
+* [2023.12.14] - Added `sylph` to `VEBA-profile_env` for abundance profiling of genomes.
+* [2023.12.13] - Dereplicate duplicate contigs in `concatenate_fasta.py`.
+* [2023.12.12] - Added `--reference_gzipped` to `index.py` and `mapping.py` with new default being that the reference fasta is not gzipped.
+* [2023.12.11] - Added `skani` as new default for genome clustering in `cluster.py`, `global_clustering.py`, and `local_clustering.py`.
+* [2023.12.11] - Added support for long reads in `fastq_preprocessor`, `preprocess.py`, `assembly-long.py`, `coverage-long`, and all binning modules.
+* [2023.11.28] - Fixed `annotations.protein_clusters.tsv.gz` output from `merge_annotations.py` introduced in the `v1.3.1` patch.
+* [2023.11.14] - Added support for missing values in `compile_eukaryotic_classifications.py`.
+* [2023.11.13] - Added `--metaeuk_split_memory_limit` argument with (experimental) default set to `36G` in `binning-eukaryotic.py` and `eukaryotic_gene_modeling.py`.
+* [2023.11.10] - Added `--compressed 1` to `mmseqs createdb` in `download_databases.sh` installation script.
+* [2023.11.10] - Added a check to `check_fasta_duplicates.py` and `clean_fasta.py` to make sure there are no `>` characters within fasta sequences, which can result from concatenating fasta files that are missing line breaks.
+* [2023.11.10] - Added `Diamond DeepClust` to `clustering_wrapper.py`, `global/local_clustering.py`, and `cluster.py`. Changed `mmseqs2_wrapper.py` to `clustering_wrapper.py`. Changed `easy-cluster` and `easy-linclust` to `mmseqs-cluster` and `mmseqs-linclust`.
+* [2023.11.9] - Fixed viral quality assessment in `merge_genome_quality_assessments.py`.
+* [2023.11.3] - Changed `consensus_genome_classification.py` to `consensus_genome_classification_ranked.py`. Also changed the default behavior to allow for missing taxonomic levels.
+* [2023.11.2] - Fixed a memory leak in `merge_annotations.py` when creating the `annotations.protein_clusters.tsv.gz` output table. However, the formatting for empty sets and string lists still needs to be corrected.
* [2023.10.27] - Update `annotate.py` and `merge_annotations.py` to handle `CAZy`. They also properly address clustered protein annotations now.
* [2023.10.18] - Added `module_completion_ratio.py` script which is a fork of `MicrobeAnnotator` [`ko_mapper.py`](https://github.com/cruizperez/MicrobeAnnotator/blob/master/microbeannotator/pipeline/ko_mapper.py). Also included a database [Zenodo: 10020074](https://zenodo.org/records/10020074) which will be included in `VDB_v5.2`
* [2023.10.16] - Added a checkpoint for `tRNAscan-SE` in `binning-prokaryotic.py` and `eukaryotic_gene_modeling_wrapper.py`.
diff --git a/MODULE_RESOURCES.xlsx b/MODULE_RESOURCES.xlsx
new file mode 100644
index 0000000..bf99e33
Binary files /dev/null and b/MODULE_RESOURCES.xlsx differ
diff --git a/SOURCES.xlsx b/SOURCES.xlsx
index 33478b5..b0f60d3 100644
Binary files a/SOURCES.xlsx and b/SOURCES.xlsx differ
diff --git a/VERSION b/VERSION
index 0a7e926..a0fef3f 100644
--- a/VERSION
+++ b/VERSION
@@ -1,2 +1,2 @@
-1.3.0
-VDB_v5.2
+1.4.0b
+VDB_v6
diff --git a/data/MicrobeAnnotator_KEGG/01.KEGG_DB/00.KEGG_Data_Scrapper.py b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/00.KEGG_Data_Scrapper.py
new file mode 100644
index 0000000..e1f5f8f
--- /dev/null
+++ b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/00.KEGG_Data_Scrapper.py
@@ -0,0 +1,165 @@
+from bs4 import BeautifulSoup
+import pandas as pd
+import re
+import pickle
+import ast
+import requests
+
+
+""" Script to download and parse KEGG module information and store it in data files """
+
+def download_kegg_modules(module_name_file, chrome_driver):  # chrome_driver is unused (legacy argument)
+    module_ids = []
+ module_names = {}
+ module_components_raw = {}
+ # Parse module names
+ with open(module_name_file) as module_input:
+ for line in module_input:
+ line = line.strip().split("\t")
+ module_ids.append(line[0])
+ module_names[line[0]] = line[1]
+ # Access KEGG and download module information
+ for identifier in module_ids:
+ url = "https://www.kegg.jp/kegg-bin/show_module?" + identifier
+ site_request = requests.get(url)
+ soup = BeautifulSoup(site_request.text, "html.parser")
+ module_definition = ""
+ module_definition_bool = False
+ definition = soup.find(class_ = 'definition')
+ for line in (definition.text).splitlines():
+ if line.strip() == "":
+ continue
+            elif module_definition_bool:
+ module_definition = line.strip()
+ module_definition_bool = False
+ elif line.strip() == 'Definition':
+ module_definition_bool = True
+ print(module_definition)
+ module_components_raw[identifier] = module_definition
+ return module_components_raw
+
+
+def parse_regular_module_dictionary(bifurcating_list_file, structural_list_file, module_components_raw):
+ bifurcating_list = []
+ structural_list = []
+ # Populate bifurcating and structural lists
+ with open(bifurcating_list_file, 'r') as bif_list:
+ for line in bif_list:
+ bifurcating_list.append(line.strip())
+    with open(structural_list_file, 'r') as struct_list:
+        for line in struct_list:
+            structural_list.append(line.strip())
+ # Parse raw module information
+ module_steps_parsed = {}
+ for key, values in module_components_raw.items():
+ values = values.replace(" --", "")
+ values = values.replace("-- ", "")
+ if key in bifurcating_list or key in structural_list:
+ continue
+ else:
+ module = []
+ parenthesis_count = 0
+ for character in values:
+ if character == "(":
+ parenthesis_count += 1
+ module.append(character)
+ elif character == " ":
+ if parenthesis_count == 0:
+ module.append(character)
+ else:
+ module.append("_")
+ elif character == ")":
+ parenthesis_count -= 1
+ module.append(character)
+ else:
+ module.append(character)
+ steps = ''.join(module).split()
+ module_steps_parsed[key] = steps
+ # Remove modules depending on other modules
+ temporal_dictionary = module_steps_parsed.copy()
+ for key, values in temporal_dictionary.items():
+ for value in values:
+ if re.search(r'M[0-9]{5}', value) is not None:
+ del module_steps_parsed[key]
+ break
+ return module_steps_parsed
+
+
+def create_final_regular_dictionary(module_steps_parsed, module_components_raw, outfile):
+ final_regular_dict = {}
+ # Parse module steps and export them into a text file
+ with open(outfile, 'w') as output:
+ for key, value in module_steps_parsed.items():
+ output.write("{}\n".format(key))
+ output.write("{}\n".format(module_components_raw[key]))
+ output.write("{}\n".format(value))
+ output.write("{}\n".format("=="))
+ final_regular_dict[key] = {}
+ step_number = 0
+ for step in value:
+ step_number += 1
+ count = 0
+ options = 0
+ temp_string = ""
+ for char in step:
+ if char == "(":
+ count += 1
+ options += 1
+ if len(temp_string) > 1 and temp_string[-1] == "-":
+ temp_string += "%"
+ elif char == ")":
+ count -= 1
+ if count >= 1:
+ temp_string += char
+ else:
+ continue
+ elif char == ",":
+ if count >= 2:
+ temp_string += char
+ print(step)
+ else:
+ temp_string += " "
+ else:
+ temp_string += char
+ if options >= 2:
+ temp_string = temp_string.replace(")_", "_")
+                    if re.search(r'%.*\)', temp_string) is None:
+ temp_string = temp_string.replace(")", "")
+ temp_string = "".join(temp_string.rsplit("__", 1))
+ temp_string = temp_string.split()
+ if isinstance(temp_string, str):
+ temp_string = temp_string.split()
+ temp_string = sorted(temp_string, key=len)
+ final_regular_dict[key][step_number] = temp_string
+ output.write("{}\n".format(temp_string))
+ output.write("{}\n".format("++++++++++++++++++"))
+ return final_regular_dict
+
+
+def export_module_dictionary(dictionary, location):
+    # Use a context manager so the pickle file is closed even if dump fails
+    with open(location, "wb") as pickle_out:
+        pickle.dump(dictionary, pickle_out)
+
+
+
+def transform_module_dictionaries(bifurcating_data, structural_data, output_bifur, output_struct):
+ bifurcating_dictionary = ast.literal_eval(open(bifurcating_data).read())
+ export_module_dictionary(bifurcating_dictionary, output_bifur)
+ structural_dictionary = ast.literal_eval(open(structural_data).read())
+ export_module_dictionary(structural_dictionary, output_struct)
+
+
+# Execute parsing functions
+
+module_components_raw = download_kegg_modules("00.Module_Names.txt", 'chromedriver')
+module_steps_parsed = parse_regular_module_dictionary("01.Bifurcating_List.txt",
+ "02.Structural_List.txt", module_components_raw)
+final_regular_dict = create_final_regular_dictionary(module_steps_parsed, module_components_raw, "05.Modules_Parsed.txt")
+
+
+export_module_dictionary(final_regular_dict, "../01.KEGG_Regular_Module_Information.pickle")
+transform_module_dictionaries("03.Bifurcating_Modules.dict",
+ "04.Structural_Modules.dict",
+ "../02.KEGG_Bifurcating_Module_Information.pickle",
+ "../03.KEGG_Structural_Module_Information.pickle")
\ No newline at end of file
diff --git a/data/MicrobeAnnotator_KEGG/01.KEGG_DB/00.Module_Names.txt b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/00.Module_Names.txt
new file mode 100644
index 0000000..db9ec87
--- /dev/null
+++ b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/00.Module_Names.txt
@@ -0,0 +1,394 @@
+M00015 Proline biosynthesis, glutamate => proline Arginine and proline metabolism #8a3222
+M00028 Ornithine biosynthesis, glutamate => ornithine Arginine and proline metabolism #8a3222
+M00029 Urea cycle Arginine and proline metabolism #8a3222
+M00047 Creatine pathway Arginine and proline metabolism #8a3222
+M00763 Ornithine biosynthesis, mediated by LysW, glutamate => ornithine Arginine and proline metabolism #8a3222
+M00844 Arginine biosynthesis, ornithine => arginine Arginine and proline metabolism #8a3222
+M00845 Arginine biosynthesis, glutamate => acetylcitrulline => arginine Arginine and proline metabolism #8a3222
+M00879 Arginine succinyltransferase pathway, arginine => glutamate Arginine and proline metabolism #8a3222
+M00022 Shikimate pathway, phosphoenolpyruvate + erythrose-4P => chorismate Aromatic amino acid metabolism #8641b6
+M00023 Tryptophan biosynthesis, chorismate => tryptophan Aromatic amino acid metabolism #8641b6
+M00024 Phenylalanine biosynthesis, chorismate => phenylalanine Aromatic amino acid metabolism #8641b6
+M00025 Tyrosine biosynthesis, chorismate => tyrosine Aromatic amino acid metabolism #8641b6
+M00037 Melatonin biosynthesis, tryptophan => serotonin => melatonin Aromatic amino acid metabolism #8641b6
+M00038 Tryptophan metabolism, tryptophan => kynurenine => 2-aminomuconate Aromatic amino acid metabolism #8641b6
+M00040 Tyrosine biosynthesis, prephanate => pretyrosine => tyrosine Aromatic amino acid metabolism #8641b6
+M00042 Catecholamine biosynthesis, tyrosine => dopamine => noradrenaline => adrenaline Aromatic amino acid metabolism #8641b6
+M00043 Thyroid hormone biosynthesis, tyrosine => triiodothyronine--thyroxine Aromatic amino acid metabolism #8641b6
+M00044 Tyrosine degradation, tyrosine => homogentisate Aromatic amino acid metabolism #8641b6
+M00533 Homoprotocatechuate degradation, homoprotocatechuate => 2-oxohept-3-enedioate Aromatic amino acid metabolism #8641b6
+M00545 Trans-cinnamate degradation, trans-cinnamate => acetyl-CoA Aromatic amino acid metabolism #8641b6
+M00418 Toluene degradation, anaerobic, toluene => benzoyl-CoA Aromatics degradation #76d25b
+M00419 Cymene degradation, p-cymene => p-cumate Aromatics degradation #76d25b
+M00534 Naphthalene degradation, naphthalene => salicylate Aromatics degradation #76d25b
+M00537 Xylene degradation, xylene => methylbenzoate Aromatics degradation #76d25b
+M00538 Toluene degradation, toluene => benzoate Aromatics degradation #76d25b
+M00539 Cumate degradation, p-cumate => 2-oxopent-4-enoate + 2-methylpropanoate Aromatics degradation #76d25b
+M00540 Benzoate degradation, cyclohexanecarboxylic acid =>pimeloyl-CoA Aromatics degradation #76d25b
+M00541 Benzoyl-CoA degradation, benzoyl-CoA => 3-hydroxypimeloyl-CoA Aromatics degradation #76d25b
+M00543 Biphenyl degradation, biphenyl => 2-oxopent-4-enoate + benzoate Aromatics degradation #76d25b
+M00544 Carbazole degradation, carbazole => 2-oxopent-4-enoate + anthranilate Aromatics degradation #76d25b
+M00547 Benzene--toluene degradation, benzene => catechol -- toluene => 3-methylcatechol Aromatics degradation #76d25b
+M00548 Benzene degradation, benzene => catechol Aromatics degradation #76d25b
+M00551 Benzoate degradation, benzoate => catechol -- methylbenzoate => methylcatechol Aromatics degradation #76d25b
+M00568 Catechol ortho-cleavage, catechol => 3-oxoadipate Aromatics degradation #76d25b
+M00569 Catechol meta-cleavage, catechol => acetyl-CoA -- 4-methylcatechol => propanoyl-CoA Aromatics degradation #76d25b
+M00623 Phthalate degradation 1, phthalate => protocatechuate Aromatics degradation #76d25b
+M00624 Terephthalate degradation, terephthalate => 3,4-dihydroxybenzoate Aromatics degradation #76d25b
+M00636 Phthalate degradation 2, phthalate => protocatechuate Aromatics degradation #76d25b
+M00637 Anthranilate degradation, anthranilate => catechol Aromatics degradation #76d25b
+M00638 Salicylate degradation, salicylate => gentisate Aromatics degradation #76d25b
+M00878 Phenylacetate degradation, phenylaxetate => acetyl-CoA--succinyl-CoA Aromatics degradation #76d25b
+M00142 NADH:ubiquinone oxidoreductase, mitochondria ATP synthesis #cdd346
+M00143 NADH dehydrogenase (ubiquinone) Fe-S protein--flavoprotein complex, mitochondria ATP synthesis #cdd346
+M00144 NADH:quinone oxidoreductase, prokaryotes ATP synthesis #cdd346
+M00145 NAD(P)H:quinone oxidoreductase, chloroplasts and cyanobacteria ATP synthesis #cdd346
+M00146 NADH dehydrogenase (ubiquinone) 1 alpha subcomplex ATP synthesis #cdd346
+M00147 NADH dehydrogenase (ubiquinone) 1 beta subcomplex ATP synthesis #cdd346
+M00148 Succinate dehydrogenase (ubiquinone) ATP synthesis #cdd346
+M00149 Succinate dehydrogenase, prokaryotes ATP synthesis #cdd346
+M00150 Fumarate reductase, prokaryotes ATP synthesis #cdd346
+M00151 Cytochrome bc1 complex respiratory unit ATP synthesis #cdd346
+M00152 Cytochrome bc1 complex ATP synthesis #cdd346
+M00153 Cytochrome bd ubiquinol oxidase ATP synthesis #cdd346
+M00154 Cytochrome c oxidase ATP synthesis #cdd346
+M00155 Cytochrome c oxidase, prokaryotes ATP synthesis #cdd346
+M00156 Cytochrome c oxidase, cbb3-type ATP synthesis #cdd346
+M00157 F-type ATPase, prokaryotes and chloroplasts ATP synthesis #cdd346
+M00158 F-type ATPase, eukaryotes ATP synthesis #cdd346
+M00159 V-type ATPase, prokaryotes ATP synthesis #cdd346
+M00160 V-type ATPase, eukaryotes ATP synthesis #cdd346
+M00162 Cytochrome b6f complex ATP synthesis #cdd346
+M00416 Cytochrome aa3-600 menaquinol oxidase ATP synthesis #cdd346
+M00417 Cytochrome o ubiquinol oxidase ATP synthesis #cdd346
+M00672 Penicillin biosynthesis, aminoadipate + cycteine + valine => penicillin Beta-Lactam biosynthesis #3b2882
+M00673 Cephamycin C biosynthesis, aminoadipate + cycteine + valine => cephamycin C Beta-Lactam biosynthesis #3b2882
+M00674 Clavaminate biosynthesis, arginine + glyceraldehyde-3P => clavaminate Beta-Lactam biosynthesis #3b2882
+M00675 Carbapenem-3-carboxylate biosynthesis, pyrroline-5-carboxylate + malonyl-CoA => carbapenem-3-carboxylate Beta-Lactam biosynthesis #3b2882
+M00736 Nocardicin A biosynthesis, L-pHPG + arginine + serine => nocardicin A Beta-Lactam biosynthesis #3b2882
+M00039 Monolignol biosynthesis, phenylalanine--tyrosine => monolignol Biosynthesis of other secondary metabolites #cbde82
+M00137 Flavanone biosynthesis, phenylalanine => naringenin Biosynthesis of other secondary metabolites #cbde82
+M00138 Flavonoid biosynthesis, naringenin => pelargonidin Biosynthesis of other secondary metabolites #cbde82
+M00370 Glucosinolate biosynthesis, tryptophan => glucobrassicin Biosynthesis of other secondary metabolites #cbde82
+M00661 Paspaline biosynthesis, geranylgeranyl-PP + indoleglycerol phosphate => paspaline Biosynthesis of other secondary metabolites #cbde82
+M00785 Cycloserine biosynthesis, arginine--serine => cycloserine Biosynthesis of other secondary metabolites #cbde82
+M00786 Fumitremorgin alkaloid biosynthesis, tryptophan + proline => fumitremorgin C--A Biosynthesis of other secondary metabolites #cbde82
+M00787 Bacilysin biosynthesis, prephenate => bacilysin Biosynthesis of other secondary metabolites #cbde82
+M00788 Terpentecin biosynthesis, GGAP => terpentecin Biosynthesis of other secondary metabolites #cbde82
+M00789 Rebeccamycin biosynthesis, tryptophan => rebeccamycin Biosynthesis of other secondary metabolites #cbde82
+M00790 Pyrrolnitrin biosynthesis, tryptophan => pyrrolnitrin Biosynthesis of other secondary metabolites #cbde82
+M00805 Staurosporine biosynthesis, tryptophan => staurosporine Biosynthesis of other secondary metabolites #cbde82
+M00808 Violacein biosynthesis, tryptophan => violacein Biosynthesis of other secondary metabolites #cbde82
+M00814 Acarbose biosynthesis, sedoheptulopyranose-7P => acarbose Biosynthesis of other secondary metabolites #cbde82
+M00815 Validamycin A biosynthesis, sedoheptulopyranose-7P => validamycin A Biosynthesis of other secondary metabolites #cbde82
+M00819 Pentalenolactone biosynthesis, farnesyl-PP => pentalenolactone Biosynthesis of other secondary metabolites #cbde82
+M00835 Pyocyanine biosynthesis, chorismate => pyocyanine Biosynthesis of other secondary metabolites #cbde82
+M00837 Prodigiosin biosynthesis, L-proline => prodigiosin Biosynthesis of other secondary metabolites #cbde82
+M00838 Undecylprodigiosin biosynthesis, L-proline => undecylprodigiosin Biosynthesis of other secondary metabolites #cbde82
+M00848 Aurachin biosynthesis, anthranilate => aurachin A Biosynthesis of other secondary metabolites #cbde82
+M00875 Staphyloferrin B biosynthesis, L-serine => staphyloferrin B Biosynthesis of other secondary metabolites #cbde82
+M00876 Staphyloferrin A biosynthesis, L-ornithine => staphyloferrin A Biosynthesis of other secondary metabolites #cbde82
+M00877 Kanosamine biosynthesis glucose 6-phosphate => kanosamine Biosynthesis of other secondary metabolites #cbde82
+M00019 Valine--isoleucine biosynthesis, pyruvate => valine -- 2-oxobutanoate => isoleucine Branched-chain amino acid metabolism #656cdb
+M00036 Leucine degradation, leucine => acetoacetate + acetyl-CoA Branched-chain amino acid metabolism #656cdb
+M00432 Leucine biosynthesis, 2-oxoisovalerate => 2-oxoisocaproate Branched-chain amino acid metabolism #656cdb
+M00535 Isoleucine biosynthesis, pyruvate => 2-oxobutanoate Branched-chain amino acid metabolism #656cdb
+M00570 Isoleucine biosynthesis, threonine => 2-oxobutanoate => isoleucine Branched-chain amino acid metabolism #656cdb
+M00165 Reductive pentose phosphate cycle (Calvin cycle) Carbon fixation #408937
+M00166 Reductive pentose phosphate cycle, ribulose-5P => glyceraldehyde-3P Carbon fixation #408937
+M00167 Reductive pentose phosphate cycle, glyceraldehyde-3P => ribulose-5P Carbon fixation #408937
+M00168 CAM (Crassulacean acid metabolism), dark Carbon fixation #408937
+M00169 CAM (Crassulacean acid metabolism), light Carbon fixation #408937
+M00170 C4-dicarboxylic acid cycle, phosphoenolpyruvate carboxykinase type Carbon fixation #408937
+M00171 C4-dicarboxylic acid cycle, NAD - malic enzyme type Carbon fixation #408937
+M00172 C4-dicarboxylic acid cycle, NADP - malic enzyme type Carbon fixation #408937
+M00173 Reductive citrate cycle (Arnon-Buchanan cycle) Carbon fixation #408937
+M00374 Dicarboxylate-hydroxybutyrate cycle Carbon fixation #408937
+M00375 Hydroxypropionate-hydroxybutylate cycle Carbon fixation #408937
+M00376 3-Hydroxypropionate bi-cycle Carbon fixation #408937
+M00377 Reductive acetyl-CoA pathway (Wood-Ljungdahl pathway) Carbon fixation #408937
+M00579 Phosphate acetyltransferase-acetate kinase pathway, acetyl-CoA => acetate Carbon fixation #408937
+M00620 Incomplete reductive citrate cycle, acetyl-CoA => oxoglutarate Carbon fixation #408937
+M00001 Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate Central carbohydrate metabolism #c644a5
+M00002 Glycolysis, core module involving three-carbon compounds Central carbohydrate metabolism #c644a5
+M00003 Gluconeogenesis, oxaloacetate => fructose-6P Central carbohydrate metabolism #c644a5
+M00004 Pentose phosphate pathway (Pentose phosphate cycle) Central carbohydrate metabolism #c644a5
+M00005 PRPP biosynthesis, ribose 5P => PRPP Central carbohydrate metabolism #c644a5
+M00006 Pentose phosphate pathway, oxidative phase, glucose 6P => ribulose 5P Central carbohydrate metabolism #c644a5
+M00007 Pentose phosphate pathway, non-oxidative phase, fructose 6P => ribose 5P Central carbohydrate metabolism #c644a5
+M00008 Entner-Doudoroff pathway, glucose-6P => glyceraldehyde-3P + pyruvate Central carbohydrate metabolism #c644a5
+M00009 Citrate cycle (TCA cycle, Krebs cycle) Central carbohydrate metabolism #c644a5
+M00010 Citrate cycle, first carbon oxidation, oxaloacetate => 2-oxoglutarate Central carbohydrate metabolism #c644a5
+M00011 Citrate cycle, second carbon oxidation, 2-oxoglutarate => oxaloacetate Central carbohydrate metabolism #c644a5
+M00307 Pyruvate oxidation, pyruvate => acetyl-CoA Central carbohydrate metabolism #c644a5
+M00308 Semi-phosphorylative Entner-Doudoroff pathway, gluconate => glycerate-3P Central carbohydrate metabolism #c644a5
+M00309 Non-phosphorylative Entner-Doudoroff pathway, gluconate--galactonate => glycerate Central carbohydrate metabolism #c644a5
+M00580 Pentose phosphate pathway, archaea, fructose 6P => ribose 5P Central carbohydrate metabolism #c644a5
+M00633 Semi-phosphorylative Entner-Doudoroff pathway, gluconate--galactonate => glycerate-3P Central carbohydrate metabolism #c644a5
+M00112 Tocopherol--tocotorienol biosynthesis Cofactor and vitamin metabolism #5fda98
+M00115 NAD biosynthesis, aspartate => NAD Cofactor and vitamin metabolism #5fda98
+M00116 Menaquinone biosynthesis, chorismate => menaquinol Cofactor and vitamin metabolism #5fda98
+M00117 Ubiquinone biosynthesis, prokaryotes, chorismate => ubiquinone Cofactor and vitamin metabolism #5fda98
+M00119 Pantothenate biosynthesis, valine--L-aspartate => pantothenate Cofactor and vitamin metabolism #5fda98
+M00120 Coenzyme A biosynthesis, pantothenate => CoA Cofactor and vitamin metabolism #5fda98
+M00121 Heme biosynthesis, plants and bacteria, glutamate => heme Cofactor and vitamin metabolism #5fda98
+M00122 Cobalamin biosynthesis, cobinamide => cobalamin Cofactor and vitamin metabolism #5fda98
+M00123 Biotin biosynthesis, pimeloyl-ACP--CoA => biotin Cofactor and vitamin metabolism #5fda98
+M00124 Pyridoxal biosynthesis, erythrose-4P => pyridoxal-5P Cofactor and vitamin metabolism #5fda98
+M00125 Riboflavin biosynthesis, GTP => riboflavin--FMN--FAD Cofactor and vitamin metabolism #5fda98
+M00126 Tetrahydrofolate biosynthesis, GTP => THF Cofactor and vitamin metabolism #5fda98
+M00127 Thiamine biosynthesis, AIR => thiamine-P--thiamine-2P Cofactor and vitamin metabolism #5fda98
+M00128 Ubiquinone biosynthesis, eukaryotes, 4-hydroxybenzoate => ubiquinone Cofactor and vitamin metabolism #5fda98
+M00140 C1-unit interconversion, prokaryotes Cofactor and vitamin metabolism #5fda98
+M00141 C1-unit interconversion, eukaryotes Cofactor and vitamin metabolism #5fda98
+M00572 Pimeloyl-ACP biosynthesis, BioC-BioH pathway, malonyl-ACP => pimeloyl-ACP Cofactor and vitamin metabolism #5fda98
+M00573 Biotin biosynthesis, BioI pathway, long-chain-acyl-ACP => pimeloyl-ACP => biotin Cofactor and vitamin metabolism #5fda98
+M00577 Biotin biosynthesis, BioW pathway, pimelate => pimeloyl-CoA => biotin Cofactor and vitamin metabolism #5fda98
+M00622 Nicotinate degradation, nicotinate => fumarate Cofactor and vitamin metabolism #5fda98
+M00810 Nicotine degradation, pyridine pathway, nicotine => 2,6-dihydroxypyridine--succinate semialdehyde Cofactor and vitamin metabolism #5fda98
+M00811 Nicotine degradation, pyrrolidine pathway, nicotine => succinate semialdehyde Cofactor and vitamin metabolism #5fda98
+M00836 Coenzyme F430 biosynthesis, sirohydrochlorin => coenzyme F430 Cofactor and vitamin metabolism #5fda98
+M00840 Tetrahydrofolate biosynthesis, mediated by ribA and trpF, GTP => THF Cofactor and vitamin metabolism #5fda98
+M00841 Tetrahydrofolate biosynthesis, mediated by PTPS, GTP => THF Cofactor and vitamin metabolism #5fda98
+M00842 Tetrahydrobiopterin biosynthesis, GTP => BH4 Cofactor and vitamin metabolism #5fda98
+M00843 L-threo-Tetrahydrobiopterin biosynthesis, GTP => L-threo-BH4 Cofactor and vitamin metabolism #5fda98
+M00846 Siroheme biosynthesis, glutamate => siroheme Cofactor and vitamin metabolism #5fda98
+M00847 Heme biosynthesis, archaea, siroheme => heme Cofactor and vitamin metabolism #5fda98
+M00868 Heme biosynthesis, animals and fungi, glycine => heme Cofactor and vitamin metabolism #5fda98
+M00880 Molybdenum cofactor biosynthesis, GTP => molybdenum cofactor Cofactor and vitamin metabolism #5fda98
+M00017	Methionine biosynthesis, aspartate => homoserine => methionine	Cysteine and methionine metabolism	#782975
+M00021 Cysteine biosynthesis, serine => cysteine Cysteine and methionine metabolism #782975
+M00034 Methionine salvage pathway Cysteine and methionine metabolism #782975
+M00035 Methionine degradation Cysteine and methionine metabolism #782975
+M00338 Cysteine biosynthesis, homocysteine + serine => cysteine Cysteine and methionine metabolism #782975
+M00368 Ethylene biosynthesis, methionine => ethylene Cysteine and methionine metabolism #782975
+M00609 Cysteine biosynthesis, methionine => cysteine Cysteine and methionine metabolism #782975
+M00625 Methicillin resistance Drug resistance #869534
+M00627 beta-Lactam resistance, Bla system Drug resistance #869534
+M00639 Multidrug resistance, efflux pump MexCD-OprJ Drug resistance #869534
+M00641 Multidrug resistance, efflux pump MexEF-OprN Drug resistance #869534
+M00642 Multidrug resistance, efflux pump MexJK-OprM Drug resistance #869534
+M00643 Multidrug resistance, efflux pump MexXY-OprM Drug resistance #869534
+M00649 Multidrug resistance, efflux pump AdeABC Drug resistance #869534
+M00651 Vancomycin resistance, D-Ala-D-Lac type Drug resistance #869534
+M00652 Vancomycin resistance, D-Ala-D-Ser type Drug resistance #869534
+M00696 Multidrug resistance, efflux pump AcrEF-TolC Drug resistance #869534
+M00697 Multidrug resistance, efflux pump MdtEF-TolC Drug resistance #869534
+M00698 Multidrug resistance, efflux pump BpeEF-OprC Drug resistance #869534
+M00700 Multidrug resistance, efflux pump AbcA Drug resistance #869534
+M00702 Multidrug resistance, efflux pump NorB Drug resistance #869534
+M00704 Tetracycline resistance, efflux pump Tet38 Drug resistance #869534
+M00705 Multidrug resistance, efflux pump MepA Drug resistance #869534
+M00714 Multidrug resistance, efflux pump QacA Drug resistance #869534
+M00718 Multidrug resistance, efflux pump MexAB-OprM Drug resistance #869534
+M00725 Cationic antimicrobial peptide (CAMP) resistance, dltABCD operon Drug resistance #869534
+M00726 Cationic antimicrobial peptide (CAMP) resistance, lysyl-phosphatidylglycerol (L-PG) synthase MprF Drug resistance #869534
+M00730 Cationic antimicrobial peptide (CAMP) resistance, VraFG transporter Drug resistance #869534
+M00744 Cationic antimicrobial peptide (CAMP) resistance, protease PgtE Drug resistance #869534
+M00745 Imipenem resistance, repression of porin OprD Drug resistance #869534
+M00746 Multidrug resistance, repression of porin OmpF Drug resistance #869534
+M00769 Multidrug resistance, efflux pump MexPQ-OpmE Drug resistance #869534
+M00851 Carbapenem resistance Drug resistance #869534
+M00824 9-membered enediyne core biosynthesis, malonyl-CoA => 3-hydroxyhexadeca-4,6,8,10,12,14-hexaenoyl-ACP => 9-membered enediyne core Enediyne biosynthesis #d27bde
+M00825 10-membered enediyne core biosynthesis, malonyl-CoA => 3-hydroxyhexadeca-4,6,8,10,12,14-hexaenoyl-ACP => 10-membered enediyne core Enediyne biosynthesis #d27bde
+M00826 C-1027 benzoxazolinate moiety biosynthesis, chorismate => benzoxazolinyl-CoA Enediyne biosynthesis #d27bde
+M00827 C-1027 beta-amino acid moiety biosynthesis, tyrosine => 3-chloro-4,5-dihydroxy-beta-phenylalanyl-PCP Enediyne biosynthesis #d27bde
+M00828 Maduropeptin beta-hydroxy acid moiety biosynthesis, tyrosine => 3-(4-hydroxyphenyl)-3-oxopropanoyl-PCP Enediyne biosynthesis #d27bde
+M00829 3,6-Dimethylsalicylyl-CoA biosynthesis, malonyl-CoA => 6-methylsalicylate => 3,6-dimethylsalicylyl-CoA Enediyne biosynthesis #d27bde
+M00830 Neocarzinostatin naphthoate moiety biosynthesis, malonyl-CoA => 2-hydroxy-5-methyl-1-naphthoate => 2-hydroxy-7-methoxy-5-methyl-1-naphthoyl-CoA Enediyne biosynthesis #d27bde
+M00831 Kedarcidin 2-hydroxynaphthoate moiety biosynthesis, malonyl-CoA => 3,6,8-trihydroxy-2-naphthoate => 3-hydroxy-7,8-dimethoxy-6-isopropoxy-2-naphthoyl-CoA Enediyne biosynthesis #d27bde
+M00832 Kedarcidin 2-aza-3-chloro-beta-tyrosine moiety biosynthesis, azatyrosine => 2-aza-3-chloro-beta-tyrosyl-PCP Enediyne biosynthesis #d27bde
+M00833 Calicheamicin biosynthesis, calicheamicinone => calicheamicin Enediyne biosynthesis #d27bde
+M00834 Calicheamicin orsellinate moiety biosynthesis, malonyl-CoA => orsellinate-ACP => 5-iodo-2,3-dimethoxyorsellinate-ACP Enediyne biosynthesis #d27bde
+M00082 Fatty acid biosynthesis, initiation Fatty acid metabolism #d9a344
+M00083 Fatty acid biosynthesis, elongation Fatty acid metabolism #d9a344
+M00085 Fatty acid elongation in mitochondria Fatty acid metabolism #d9a344
+M00086 beta-Oxidation, acyl-CoA synthesis Fatty acid metabolism #d9a344
+M00087 beta-Oxidation Fatty acid metabolism #d9a344
+M00415 Fatty acid elongation in endoplasmic reticulum Fatty acid metabolism #d9a344
+M00861 beta-Oxidation, peroxisome, VLCFA Fatty acid metabolism #d9a344
+M00873 Fatty acid biosynthesis in mitochondria, animals Fatty acid metabolism #d9a344
+M00874 Fatty acid biosynthesis in mitochondria, fungi Fatty acid metabolism #d9a344
+M00055 N-glycan precursor biosynthesis Glycan biosynthesis #588cd6
+M00056 O-glycan biosynthesis, mucin type core Glycan biosynthesis #588cd6
+M00065 GPI-anchor biosynthesis, core oligosaccharide Glycan biosynthesis #588cd6
+M00068 Glycosphingolipid biosynthesis, globo-series, LacCer => Gb4Cer Glycan biosynthesis #588cd6
+M00069 Glycosphingolipid biosynthesis, ganglio series, LacCer => GT3 Glycan biosynthesis #588cd6
+M00070 Glycosphingolipid biosynthesis, lacto-series, LacCer => Lc4Cer Glycan biosynthesis #588cd6
+M00071 Glycosphingolipid biosynthesis, neolacto-series, LacCer => nLc4Cer Glycan biosynthesis #588cd6
+M00072 N-glycosylation by oligosaccharyltransferase Glycan biosynthesis #588cd6
+M00073 N-glycan precursor trimming Glycan biosynthesis #588cd6
+M00074 N-glycan biosynthesis, high-mannose type Glycan biosynthesis #588cd6
+M00075 N-glycan biosynthesis, complex type Glycan biosynthesis #588cd6
+M00872 O-glycan biosynthesis, mannose type (core M3) Glycan biosynthesis #588cd6
+M00057 Glycosaminoglycan biosynthesis, linkage tetrasaccharide Glycosaminoglycan metabolism #d66432
+M00058 Glycosaminoglycan biosynthesis, chondroitin sulfate backbone Glycosaminoglycan metabolism #d66432
+M00059 Glycosaminoglycan biosynthesis, heparan sulfate backbone Glycosaminoglycan metabolism #d66432
+M00076 Dermatan sulfate degradation Glycosaminoglycan metabolism #d66432
+M00077 Chondroitin sulfate degradation Glycosaminoglycan metabolism #d66432
+M00078 Heparan sulfate degradation Glycosaminoglycan metabolism #d66432
+M00079 Keratan sulfate degradation Glycosaminoglycan metabolism #d66432
+M00026 Histidine biosynthesis, PRPP => histidine Histidine metabolism #66d7bf
+M00045 Histidine degradation, histidine => N-formiminoglutamate => glutamate Histidine metabolism #66d7bf
+M00066 Lactosylceramide biosynthesis Lipid metabolism #d53e55
+M00067 Sulfoglycolipids biosynthesis, ceramide--1-alkyl-2-acylglycerol => sulfatide--seminolipid Lipid metabolism #d53e55
+M00088 Ketone body biosynthesis, acetyl-CoA => acetoacetate--3-hydroxybutyrate--acetone Lipid metabolism #d53e55
+M00089 Triacylglycerol biosynthesis Lipid metabolism #d53e55
+M00090 Phosphatidylcholine (PC) biosynthesis, choline => PC Lipid metabolism #d53e55
+M00091 Phosphatidylcholine (PC) biosynthesis, PE => PC Lipid metabolism #d53e55
+M00092 Phosphatidylethanolamine (PE) biosynthesis, ethanolamine => PE Lipid metabolism #d53e55
+M00093 Phosphatidylethanolamine (PE) biosynthesis, PA => PS => PE Lipid metabolism #d53e55
+M00094 Ceramide biosynthesis Lipid metabolism #d53e55
+M00098 Acylglycerol degradation Lipid metabolism #d53e55
+M00099 Sphingosine biosynthesis Lipid metabolism #d53e55
+M00100 Sphingosine degradation Lipid metabolism #d53e55
+M00113 Jasmonic acid biosynthesis Lipid metabolism #d53e55
+M00060 KDO2-lipid A biosynthesis, Raetz pathway, LpxL-LpxM type Lipopolysaccharide metabolism #83d2de
+M00063 CMP-KDO biosynthesis Lipopolysaccharide metabolism #83d2de
+M00064 ADP-L-glycero-D-manno-heptose biosynthesis Lipopolysaccharide metabolism #83d2de
+M00866 KDO2-lipid A biosynthesis, Raetz pathway, non-LpxL-LpxM type Lipopolysaccharide metabolism #83d2de
+M00867 KDO2-lipid A modification pathway Lipopolysaccharide metabolism #83d2de
+M00016 Lysine biosynthesis, succinyl-DAP pathway, aspartate => lysine Lysine metabolism #d84e8b
+M00030 Lysine biosynthesis, AAA pathway, 2-oxoglutarate => 2-aminoadipate => lysine Lysine metabolism #d84e8b
+M00031 Lysine biosynthesis, mediated by LysW, 2-aminoadipate => lysine Lysine metabolism #d84e8b
+M00032 Lysine degradation, lysine => saccharopine => acetoacetyl-CoA Lysine metabolism #d84e8b
+M00433 Lysine biosynthesis, 2-oxoglutarate => 2-oxoadipate Lysine metabolism #d84e8b
+M00525 Lysine biosynthesis, acetyl-DAP pathway, aspartate => lysine Lysine metabolism #d84e8b
+M00526 Lysine biosynthesis, DAP dehydrogenase pathway, aspartate => lysine Lysine metabolism #d84e8b
+M00527 Lysine biosynthesis, DAP aminotransferase pathway, aspartate => lysine Lysine metabolism #d84e8b
+M00773 Tylosin biosynthesis, methylmalonyl-CoA + malonyl-CoA => tylactone => tylosin Macrolide biosynthesis #2e4b26
+M00774 Erythromycin biosynthesis, propanoyl-CoA + methylmalonyl-CoA => deoxyerythronolide B => erythromycin A--B Macrolide biosynthesis #2e4b26
+M00775 Oleandomycin biosynthesis, malonyl-CoA + methylmalonyl-CoA => 8,8a-deoxyoleandolide => oleandomycin Macrolide biosynthesis #2e4b26
+M00776 Pikromycin--methymycin biosynthesis, methylmalonyl-CoA + malonyl-CoA => narbonolide--10-deoxymethynolide => pikromycin--methymycin Macrolide biosynthesis #2e4b26
+M00777 Avermectin biosynthesis, 2-methylbutanoyl-CoA--isobutyryl-CoA => 6,8a-Seco-6,8a-deoxy-5-oxoavermectin 1a--1b aglycone => avermectin A1a--B1a--A1b--B1b Macrolide biosynthesis #2e4b26
+M00611 Oxygenic photosynthesis in plants and cyanobacteria Metabolic capacity #9378c3
+M00612 Anoxygenic photosynthesis in purple bacteria Metabolic capacity #9378c3
+M00613 Anoxygenic photosynthesis in green nonsulfur bacteria Metabolic capacity #9378c3
+M00614 Anoxygenic photosynthesis in green sulfur bacteria Metabolic capacity #9378c3
+M00615 Nitrate assimilation Metabolic capacity #9378c3
+M00616 Sulfate-sulfur assimilation Metabolic capacity #9378c3
+M00617 Methanogen Metabolic capacity #9378c3
+M00618 Acetogen Metabolic capacity #9378c3
+M00174 Methane oxidation, methanotroph, methane => formaldehyde Methane metabolism #9e7336
+M00344 Formaldehyde assimilation, xylulose monophosphate pathway Methane metabolism #9e7336
+M00345 Formaldehyde assimilation, ribulose monophosphate pathway Methane metabolism #9e7336
+M00346 Formaldehyde assimilation, serine pathway Methane metabolism #9e7336
+M00356 Methanogenesis, methanol => methane Methane metabolism #9e7336
+M00357 Methanogenesis, acetate => methane Methane metabolism #9e7336
+M00358 Coenzyme M biosynthesis Methane metabolism #9e7336
+M00378 F420 biosynthesis Methane metabolism #9e7336
+M00422 Acetyl-CoA pathway, CO2 => acetyl-CoA Methane metabolism #9e7336
+M00563 Methanogenesis, methylamine--dimethylamine--trimethylamine => methane Methane metabolism #9e7336
+M00567 Methanogenesis, CO2 => methane Methane metabolism #9e7336
+M00608 2-Oxocarboxylic acid chain extension, 2-oxoglutarate => 2-oxoadipate => 2-oxopimelate => 2-oxosuberate Methane metabolism #9e7336
+M00175 Nitrogen fixation, nitrogen => ammonia Nitrogen metabolism #2c2351
+M00528 Nitrification, ammonia => nitrite Nitrogen metabolism #2c2351
+M00529 Denitrification, nitrate => nitrogen Nitrogen metabolism #2c2351
+M00530 Dissimilatory nitrate reduction, nitrate => ammonia Nitrogen metabolism #2c2351
+M00531 Assimilatory nitrate reduction, nitrate => ammonia Nitrogen metabolism #2c2351
+M00804 Complete nitrification, comammox, ammonia => nitrite => nitrate Nitrogen metabolism #2c2351
+M00027 GABA (gamma-Aminobutyrate) shunt Other amino acid metabolism #c5d7a9
+M00118 Glutathione biosynthesis, glutamate => glutathione Other amino acid metabolism #c5d7a9
+M00369 Cyanogenic glycoside biosynthesis, tyrosine => dhurrin Other amino acid metabolism #c5d7a9
+M00012 Glyoxylate cycle Other carbohydrate metabolism #872b4e
+M00013 Malonate semialdehyde pathway, propanoyl-CoA => acetyl-CoA Other carbohydrate metabolism #872b4e
+M00014 Glucuronate pathway (uronate pathway) Other carbohydrate metabolism #872b4e
+M00061 D-Glucuronate degradation, D-glucuronate => pyruvate + D-glyceraldehyde 3P Other carbohydrate metabolism #872b4e
+M00081 Pectin degradation Other carbohydrate metabolism #872b4e
+M00114 Ascorbate biosynthesis, plants, glucose-6P => ascorbate Other carbohydrate metabolism #872b4e
+M00129 Ascorbate biosynthesis, animals, glucose-1P => ascorbate Other carbohydrate metabolism #872b4e
+M00130	Inositol phosphate metabolism, PI => PIP2 => Ins(1,4,5)P3 => Ins(1,3,4,5)P4	Other carbohydrate metabolism	#872b4e
+M00131 Inositol phosphate metabolism, Ins(1,3,4,5)P4 => Ins(1,3,4)P3 => myo-inositol Other carbohydrate metabolism #872b4e
+M00132 Inositol phosphate metabolism, Ins(1,3,4)P3 => phytate Other carbohydrate metabolism #872b4e
+M00373 Ethylmalonyl pathway Other carbohydrate metabolism #872b4e
+M00532 Photorespiration Other carbohydrate metabolism #872b4e
+M00549 Nucleotide sugar biosynthesis, glucose => UDP-glucose Other carbohydrate metabolism #872b4e
+M00550 Ascorbate degradation, ascorbate => D-xylulose-5P Other carbohydrate metabolism #872b4e
+M00552 D-galactonate degradation, De Ley-Doudoroff pathway, D-galactonate => glycerate-3P Other carbohydrate metabolism #872b4e
+M00554 Nucleotide sugar biosynthesis, galactose => UDP-galactose Other carbohydrate metabolism #872b4e
+M00565 Trehalose biosynthesis, D-glucose 1P => trehalose Other carbohydrate metabolism #872b4e
+M00630 D-Galacturonate degradation (fungi), D-galacturonate => glycerol Other carbohydrate metabolism #872b4e
+M00631 D-Galacturonate degradation (bacteria), D-galacturonate => pyruvate + D-glyceraldehyde 3P Other carbohydrate metabolism #872b4e
+M00632 Galactose degradation, Leloir pathway, galactose => alpha-D-glucose-1P Other carbohydrate metabolism #872b4e
+M00740 Methylaspartate cycle Other carbohydrate metabolism #872b4e
+M00741 Propanoyl-CoA metabolism, propanoyl-CoA => succinyl-CoA Other carbohydrate metabolism #872b4e
+M00761 Undecaprenylphosphate alpha-L-Ara4N biosynthesis, UDP-GlcA => undecaprenyl phosphate alpha-L-Ara4N Other carbohydrate metabolism #872b4e
+M00854 Glycogen biosynthesis, glucose-1P => glycogen--starch Other carbohydrate metabolism #872b4e
+M00855 Glycogen degradation, glycogen => glucose-6P Other carbohydrate metabolism #872b4e
+M00097 beta-Carotene biosynthesis, GGAP => beta-carotene Other terpenoid biosynthesis #6e9368
+M00371 Castasterone biosynthesis, campesterol => castasterone Other terpenoid biosynthesis #6e9368
+M00372 Abscisic acid biosynthesis, beta-carotene => abscisic acid Other terpenoid biosynthesis #6e9368
+M00363 EHEC pathogenicity signature, Shiga toxin Pathogenicity #66406d
+M00542 EHEC--EPEC pathogenicity signature, T3SS and effectors Pathogenicity #66406d
+M00564 Helicobacter pylori pathogenicity signature, cagA pathogenicity island Pathogenicity #66406d
+M00574 Pertussis pathogenicity signature, pertussis toxin Pathogenicity #66406d
+M00575 Pertussis pathogenicity signature, T1SS Pathogenicity #66406d
+M00576 ETEC pathogenicity signature, heat-labile and heat-stable enterotoxins Pathogenicity #66406d
+M00850 Vibrio cholerae pathogenicity signature, cholera toxins Pathogenicity #66406d
+M00852 Vibrio cholerae pathogenicity signature, toxin coregulated pilus Pathogenicity #66406d
+M00853 ETEC pathogenicity signature, colonization factors Pathogenicity #66406d
+M00856 Salmonella enterica pathogenicity signature, typhoid toxin Pathogenicity #66406d
+M00857 Salmonella enterica pathogenicity signature, Vi antigen Pathogenicity #66406d
+M00859 Bacillus anthracis pathogenicity signature, anthrax toxin Pathogenicity #66406d
+M00860 Bacillus anthracis pathogenicity signature, polyglutamic acid capsule biosynthesis Pathogenicity #66406d
+M00161 Photosystem II Photosynthesis #cfa68a
+M00163 Photosystem I Photosynthesis #cfa68a
+M00597 Anoxygenic photosystem II [BR:ko00194] Photosynthesis #cfa68a
+M00598 Anoxygenic photosystem I [BR:ko00194] Photosynthesis #cfa68a
+M00660 Xanthomonas spp. pathogenicity signature, T3SS and effectors Plant pathogenicity #461d27
+M00133 Polyamine biosynthesis, arginine => agmatine => putrescine => spermidine Polyamine biosynthesis #a5b3da
+M00134 Polyamine biosynthesis, arginine => ornithine => putrescine Polyamine biosynthesis #a5b3da
+M00135 GABA biosynthesis, eukaryotes, putrescine => GABA Polyamine biosynthesis #a5b3da
+M00136 GABA biosynthesis, prokaryotes, putrescine => GABA Polyamine biosynthesis #a5b3da
+M00793 dTDP-L-rhamnose biosynthesis Polyketide sugar unit biosynthesis #5c4f24
+M00794 dTDP-6-deoxy-D-allose biosynthesis Polyketide sugar unit biosynthesis #5c4f24
+M00795 dTDP-beta-L-noviose biosynthesis Polyketide sugar unit biosynthesis #5c4f24
+M00796 dTDP-D-mycaminose biosynthesis Polyketide sugar unit biosynthesis #5c4f24
+M00797 dTDP-D-desosamine biosynthesis Polyketide sugar unit biosynthesis #5c4f24
+M00798 dTDP-L-mycarose biosynthesis Polyketide sugar unit biosynthesis #5c4f24
+M00799 dTDP-L-oleandrose biosynthesis Polyketide sugar unit biosynthesis #5c4f24
+M00800 dTDP-L-megosamine biosynthesis Polyketide sugar unit biosynthesis #5c4f24
+M00801 dTDP-L-olivose biosynthesis Polyketide sugar unit biosynthesis #5c4f24
+M00802 dTDP-D-forosamine biosynthesis Polyketide sugar unit biosynthesis #5c4f24
+M00803 dTDP-D-angolosamine biosynthesis Polyketide sugar unit biosynthesis #5c4f24
+M00048 Inosine monophosphate biosynthesis, PRPP + glutamine => IMP Purine metabolism #e0a7d2
+M00049 Adenine ribonucleotide biosynthesis, IMP => ADP,ATP Purine metabolism #e0a7d2
+M00050	Guanine ribonucleotide biosynthesis, IMP => GDP,GTP	Purine metabolism	#e0a7d2
+M00546 Purine degradation, xanthine => urea Purine metabolism #e0a7d2
+M00046 Pyrimidine degradation, uracil => beta-alanine, thymine => 3-aminoisobutanoate Pyrimidine metabolism #25585e
+M00051 Uridine monophosphate biosynthesis, glutamine (+ PRPP) => UMP Pyrimidine metabolism #25585e
+M00052 Pyrimidine ribonucleotide biosynthesis, UMP => UDP--UTP,CDP--CTP Pyrimidine metabolism #25585e
+M00053	Pyrimidine deoxyribonucleotide biosynthesis, CDP--CTP => dCDP--dCTP,dTDP--dTTP	Pyrimidine metabolism	#25585e
+M00018 Threonine biosynthesis, aspartate => homoserine => threonine Serine and threonine metabolism #de7d78
+M00020 Serine biosynthesis, glycerate-3P => serine Serine and threonine metabolism #de7d78
+M00033 Ectoine biosynthesis, aspartate => ectoine Serine and threonine metabolism #de7d78
+M00555 Betaine biosynthesis, choline => betaine Serine and threonine metabolism #de7d78
+M00101 Cholesterol biosynthesis, squalene 2,3-epoxide => cholesterol Sterol biosynthesis #4e96a2
+M00102 Ergocalciferol biosynthesis Sterol biosynthesis #4e96a2
+M00103 Cholecalciferol biosynthesis Sterol biosynthesis #4e96a2
+M00104 Bile acid biosynthesis, cholesterol => cholate--chenodeoxycholate Sterol biosynthesis #4e96a2
+M00106 Conjugated bile acid biosynthesis, cholate => taurocholate--glycocholate Sterol biosynthesis #4e96a2
+M00107	Steroid hormone biosynthesis, cholesterol => pregnenolone => progesterone	Sterol biosynthesis	#4e96a2
+M00108 C21-Steroid hormone biosynthesis, progesterone => corticosterone--aldosterone Sterol biosynthesis #4e96a2
+M00109 C21-Steroid hormone biosynthesis, progesterone => cortisol--cortisone Sterol biosynthesis #4e96a2
+M00110 C19--C18-Steroid hormone biosynthesis, pregnenolone => androstenedione => estrone Sterol biosynthesis #4e96a2
+M00862 beta-Oxidation, peroxisome, tri--dihydroxycholestanoyl-CoA => choloyl--chenodeoxycholoyl-CoA Sterol biosynthesis #4e96a2
+M00176 Assimilatory sulfate reduction, sulfate => H2S Sulfur metabolism #4e96a2
+M00595 Thiosulfate oxidation by SOX complex, thiosulfate => sulfate Sulfur metabolism #4e96a2
+M00596 Dissimilatory sulfate reduction, sulfate => H2S Sulfur metabolism #4e96a2
+M00664 Nodulation Symbiosis #88574e
+M00095 C5 isoprenoid biosynthesis, mevalonate pathway Terpenoid backbone biosynthesis #4e6089
+M00096 C5 isoprenoid biosynthesis, non-mevalonate pathway Terpenoid backbone biosynthesis #4e6089
+M00364 C10-C20 isoprenoid biosynthesis, bacteria Terpenoid backbone biosynthesis #4e6089
+M00365 C10-C20 isoprenoid biosynthesis, archaea Terpenoid backbone biosynthesis #4e6089
+M00366 C10-C20 isoprenoid biosynthesis, plants Terpenoid backbone biosynthesis #4e6089
+M00367 C10-C20 isoprenoid biosynthesis, non-plant eukaryotes Terpenoid backbone biosynthesis #4e6089
+M00849 C5 isoprenoid biosynthesis, mevalonate pathway, archaea Terpenoid backbone biosynthesis #4e6089
+M00778 Type II polyketide backbone biosynthesis, acyl-CoA + malonyl-CoA => polyketide Type II polyketide biosynthesis #af7194
+M00779 Dihydrokalafungin biosynthesis, octaketide => dihydrokalafungin Type II polyketide biosynthesis #af7194
+M00780 Tetracycline--oxytetracycline biosynthesis, pretetramide => tetracycline--oxytetracycline Type II polyketide biosynthesis #af7194
+M00781 Nogalavinone--aklavinone biosynthesis, deoxynogalonate--deoxyaklanonate => nogalavinone--aklavinone Type II polyketide biosynthesis #af7194
+M00782 Mithramycin biosynthesis, 4-demethylpremithramycinone => mithramycin Type II polyketide biosynthesis #af7194
+M00783 Tetracenomycin C--8-demethyltetracenomycin C biosynthesis, tetracenomycin F2 => tetracenomycin C--8-demethyltetracenomycin C Type II polyketide biosynthesis #af7194
+M00784 Elloramycin biosynthesis, 8-demethyltetracenomycin C => elloramycin A Type II polyketide biosynthesis #af7194
+M00823 Chlortetracycline biosynthesis, pretetramide => chlortetracycline Type II polyketide biosynthesis #af7194
\ No newline at end of file
diff --git a/data/MicrobeAnnotator_KEGG/01.KEGG_DB/01.Bifurcating_List.txt b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/01.Bifurcating_List.txt
new file mode 100644
index 0000000..8d909f9
--- /dev/null
+++ b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/01.Bifurcating_List.txt
@@ -0,0 +1,23 @@
+M00373
+M00532
+M00376
+M00378
+M00088
+M00031
+M00763
+M00133
+M00075
+M00872
+M00125
+M00119
+M00122
+M00827
+M00828
+M00832
+M00833
+M00837
+M00838
+M00785
+M00307
+M00048
+M00127
\ No newline at end of file
diff --git a/data/MicrobeAnnotator_KEGG/01.KEGG_DB/02.Structural_List.txt b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/02.Structural_List.txt
new file mode 100644
index 0000000..7fbba00
--- /dev/null
+++ b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/02.Structural_List.txt
@@ -0,0 +1,10 @@
+M00144
+M00149
+M00151
+M00152
+M00154
+M00155
+M00153
+M00156
+M00158
+M00160
\ No newline at end of file
diff --git a/data/MicrobeAnnotator_KEGG/01.KEGG_DB/03.Bifurcating_Modules.dict b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/03.Bifurcating_Modules.dict
new file mode 100644
index 0000000..a09f7b4
--- /dev/null
+++ b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/03.Bifurcating_Modules.dict
@@ -0,0 +1 @@
+{'M00373':{'M00373_1':{1:'K00626',2:'K00023',3:'K17865',4:'K14446',5:'K14447',6:'K14448',7:'K14449',8:'K08691',9:'K14451'},'M00373_2':{1:'K00626',2:'K00023',3:'K17865',4:'K14446',5:'K14447',6:'K14448',7:'K14449',8:'K01965+K01966',9:'K05606',10:'K01847'}},'M00532':{'M00532_1':{1:'K01601-K01602',2:'K19269',3:'K11517',4:'K03781',5:'K14272',6:'K00600',7:'K00830',8:'K15893,K15919',9:'K15918'},'M00532_2':{1:'K01601-K01602',2:'K19269',3:'K11517',4:'K03781',5:'K14272',6:'K00600',7:'K00830',8:'K00281+K00605+K00382+K02437'}},'M00376':{'M00376_1':{1:'K02160+K01961+K01962+K01963',2:'K14468',3:'K14469',4:'K15052',5:'K05606',6:['K01847','K01848+K01849'],7:'K14471+K14472',8:'K00239+K00240+K00241',9:'K01679'},'M00376_2':{1:'K02160+K01961+K01962+K01963',2:'K14468',3:'K14469',4:'K08691',5:'K14449',6:'K14470',7:'K09709'}},'M00378':{'M00378_1':{1:['K11779','K11780+K11781'],2:'K11212',3:'K12234'},'M00378_2':{1:'K14941',2:'K11212',3:'K12234'}},'M00088':{'M00088_1':{1:'K00626',2:'K01641',3:'K01640',4:'K00019'},'M00088_2':{1:'K00626',2:'K01641',3:'K01640',4:'K01574'}},'M00031':{'M00031_1':{1:'K05826',2:'K05827',3:'K05828',4:'K05829',5:'K05830',6:'K05831'}},'M00763':{'M00763_1':{1:'K05826',2:'K19412',3:'K05828',4:'K05829',5:'K05830',6:'K05831'}},'M00133':{'M00133_1':{1:'K01583,K01584,K01585,K02626',2:'K01480',3:'K01611'},'M00133_2':{1:'K00797',2:'K01611'}},'M00075':{'M00075_1':{1:'K01231',2:'K00736',3:'K00737'},'M00075_2':{1:'K01231',2:'K00736',3:'K00738',4:'K00744,K09661',5:'K13748'},'M00075_3':{1:'K01231',2:'K00736',3:'K00717',4:'K07966,K07967,K07968',5:'K00778,K00779'}},'M00872':{'M00872_1':{1:'K00728',2:'K18207',3:'K09654',4:'K17547',5:'K19872',6:'K19873',7:'K21052',8:'K21032',9:'K09668'},'M00872_2':{1:'K21031',2:'K19872',3:'K19873',4:'K21052',5:'K21032',6:'K09668'}},'M00125':{'M00125_1':{1:'K01497,K14652',2:['K01498_K00082','K11752'],3:'K22912,K20860,K20861,K20862,K21063,K21064',4:'K00794',5:'K00793',6:['K00861,K20884_K00953,K22949','K11753']},'M00125_2':{1:'K02858,K14652',2:'K00794',
3:'K00793',4:['K00861,K20884_K00953,K22949','K11753']}},'M00119':{'M00119_1':{1:'K00826',2:'K00606',3:'K00077',4:'K01918,K13799'},'M00119_2':{1:'K01579',2:'K01918,K13799'}},'M00122':{'M00122_1':{1:'K00798,K19221',2:'K02232',3:'K02225,K02227',4:'K02231',5:'K02233'},'M00122_2':{1:'K00768',2:'K02226,K22316',3:'K02233'}},'M00827':{'M00827_1':{1:'K21183',2:'K21181',3:'K21182',4:'K16431',5:'K21184',6:'K21185'}},'M00828':{'M00828_1':{1:'K21183',2:'K21181',3:'K21182',4:'K21188'}},'M00832':{'M00832_1':{1:'K21183',2:'K21227',3:'K21228',4:'K16431',5:'K21185'}},'M00833':{'M00833_1':{1:'K21254',2:'K21255',3:'K21256',4:'K21257',5:'K21258',6:'K21261',7:'K21262',8:'K21263'},'M00833_2':{1:'K21259',2:'K21260',3:'K21261',4:'K21262',5:'K21263'}},'M00837':{'M00837_1':{1:'K21780+K21781',2:'K21782',3:'K21783',4:'K21784',5:'K21785',6:'K21786',7:'K21787'},'M00837_2':{1:'K21428',2:'K21778',3:'K21779',4:'K21787'}},'M00838':{'M00838_1':{1:'K21780+K21781',2:'K21782',3:'K21783',4:'K21784',5:'K21785',6:'K21786',7:'K21787'},'M00837_2':{1:'K21791',2:'K21792',3:'K21793',4:'K21787'}},'M00785':{'M00785_1':{1:'K19741',2:'K19723',3:'K19725',4:'K19724',5:'K19727'},'M00785_2':{1:'K19726',2:'K19725',3:'K19724',4:'K19727'}},'M00307':{'M00307_1':{1:'K03737'},'M00307_2':{1:'K00169+K00170+K00171+K00172'},'M00307_3':{1:'K00161+K00162+K00627+K00382-K13997'},'M00307_4':{1:'K00163+K00627+K00382-K13997'}},'M00048':{'M00048_1':{1:'K00764',2:'K01945,K11787,K11788,K13713',3:'K00601,K11175,K08289,K11787,K01492',4:['K01952','K23269+K23264+K23265','K23270+K23265'], 5:'K01933,K11787',6:'K01923,K01587,K13713',7:'K01756',8:['K00602','K01492','K06863_K11176']}, 
'M00048_2':{1:'K00764',2:'K01945,K11787,K11788,K13713',3:'K00601,K11175,K08289,K11787,K01492',4:['K01952','K23269+K23264+K23265','K23270+K23265'],5:'K11788',6:['K01587','K11808','K01589_K01588'],7:'K01923,K01587,K13713',8:'K01756',9:['K00602','K01492','K06863_K11176']}},'M00127':{'M00127_1':{1:'K03147',2:'K00877,K00941,K14153',3:'K00788,K14153,K14154',4:'K00946'},'M00127_2':{1:'K00878,K14154',2:'K00788,K14153,K14154',3:'K00946'}}}
\ No newline at end of file
diff --git a/data/MicrobeAnnotator_KEGG/01.KEGG_DB/04.Structural_Modules.dict b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/04.Structural_Modules.dict
new file mode 100644
index 0000000..b9aa2e8
--- /dev/null
+++ b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/04.Structural_Modules.dict
@@ -0,0 +1 @@
+{'M00144':['K00330', 'K00331+K00332+K00333,K00331+K13378,K13380','K00334+K00335+K00336+K00337+K00338+K00339+K00340','K00341+K00342,K15863','K00343'],'M00149':['K00241','K00242,K18859,K18860','K00239+K00240'],'M00151':[['K03890+K03891+K03889','K03886+K03887+K03888','K00412+K00413,K00410_K00411']],'M00152':['K00412+K00413,K00410','K00411+K00414+K00415+K00416+K00417+K00418+K00419+K00420'],'M00154':['K02257+K02262+K02256+K02261+K02263+K02264+K02265+K02266+K02267+K02268','K02269,K02270-K02271','K02272-K02273+K02258+K02259+K02260'],'M00155':['K02275','K02274+K02276,K15408','K02277'],'M00153':['K00425+K00426','K00424,K22501'],'M00156':['K00404+K00405,K15862','K00407+K00406'],'M00158':['K02132+K02133+K02136+K02134+K02135+K02137+K02126+K02127+K02128+K02138','K02129,K01549','K02130,K02139','K02140','K02141,K02131','K02142-K02143+K02125'],'M00160':['K02145+K02147+K02148+K02149+K02150+K02151+K02152+K02144+K02154','K03661,K02155','K02146+K02153+K03662']}
\ No newline at end of file
diff --git a/data/MicrobeAnnotator_KEGG/01.KEGG_DB/05.Modules_Parsed.txt b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/05.Modules_Parsed.txt
new file mode 100644
index 0000000..c8229a5
--- /dev/null
+++ b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/05.Modules_Parsed.txt
@@ -0,0 +1,3343 @@
+M00001
+(K00844,K12407,K00845,K00886,K08074,K00918) (K01810,K06859,K13810,K15916) (K00850,K16370,K00918) (K01623,K01624,K11645,K16305,K16306) K01803 ((K00134,K00150) K00927,K11389) (K01834,K15633,K15634,K15635) K01689 (K00873,K12406)
+['(K00844,K12407,K00845,K00886,K08074,K00918)', '(K01810,K06859,K13810,K15916)', '(K00850,K16370,K00918)', '(K01623,K01624,K11645,K16305,K16306)', 'K01803', '((K00134,K00150)_K00927,K11389)', '(K01834,K15633,K15634,K15635)', 'K01689', '(K00873,K12406)']
+==
+['K00844', 'K12407', 'K00845', 'K00886', 'K08074', 'K00918']
+['K01810', 'K06859', 'K13810', 'K15916']
+['K00850', 'K16370', 'K00918']
+['K01623', 'K01624', 'K11645', 'K16305', 'K16306']
+['K01803']
+['K11389', 'K00134,K00150_K00927']
+['K01834', 'K15633', 'K15634', 'K15635']
+['K01689']
+['K00873', 'K12406']
+++++++++++++++++++
+M00002
+K01803 ((K00134,K00150) K00927,K11389) (K01834,K15633,K15634,K15635) K01689 (K00873,K12406)
+['K01803', '((K00134,K00150)_K00927,K11389)', '(K01834,K15633,K15634,K15635)', 'K01689', '(K00873,K12406)']
+==
+['K01803']
+['K11389', 'K00134,K00150_K00927']
+['K01834', 'K15633', 'K15634', 'K15635']
+['K01689']
+['K00873', 'K12406']
+++++++++++++++++++
+M00003
+(K01596,K01610) K01689 (K01834,K15633,K15634,K15635) K00927 (K00134,K00150) K01803 ((K01623,K01624,K11645) (K03841,K02446,K11532,K01086,K04041),K01622)
+['(K01596,K01610)', 'K01689', '(K01834,K15633,K15634,K15635)', 'K00927', '(K00134,K00150)', 'K01803', '((K01623,K01624,K11645)_(K03841,K02446,K11532,K01086,K04041),K01622)']
+==
+['K01596', 'K01610']
+['K01689']
+['K01834', 'K15633', 'K15634', 'K15635']
+['K00927']
+['K00134', 'K00150']
+['K01803']
+['K01622', 'K01623,K01624,K11645_K03841,K02446,K11532,K01086,K04041']
+++++++++++++++++++
+M00009
+(K01647,K05942) (K01681,K01682) (K00031,K00030) (K00164+K00658+K00382,K00174+K00175-K00177-K00176) (K01902+K01903,K01899+K01900,K18118) (K00234+K00235+K00236+K00237,K00239+K00240+K00241-(K00242,K18859,K18860),K00244+K00245+K00246-K00247) (K01676,K01679,K01677+K01678) (K00026,K00025,K00024,K00116)
+['(K01647,K05942)', '(K01681,K01682)', '(K00031,K00030)', '(K00164+K00658+K00382,K00174+K00175-K00177-K00176)', '(K01902+K01903,K01899+K01900,K18118)', '(K00234+K00235+K00236+K00237,K00239+K00240+K00241-(K00242,K18859,K18860),K00244+K00245+K00246-K00247)', '(K01676,K01679,K01677+K01678)', '(K00026,K00025,K00024,K00116)']
+==
+['K01647', 'K05942']
+['K01681', 'K01682']
+['K00031', 'K00030']
+['K00164+K00658+K00382', 'K00174+K00175-K00177-K00176']
+['K18118', 'K01902+K01903', 'K01899+K01900']
+['K00234+K00235+K00236+K00237', 'K00244+K00245+K00246-K00247', 'K00239+K00240+K00241-K00242,K18859,K18860']
+['K01676', 'K01679', 'K01677+K01678']
+['K00026', 'K00025', 'K00024', 'K00116']
+++++++++++++++++++
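The lists after `==` in each block expand every top-level token into its alternatives: judging from the output, commas at parenthesis depth 0 or 1 separate alternatives, deeper commas are kept verbatim, and all parentheses are dropped (so `((K00134,K00150) K00927,K11389)` yields `K00134,K00150_K00927` and `K11389`). A hedged sketch of that stage under this inferred rule (the name `split_alternatives` is illustrative); note the log prints each list in arbitrary set order:

```python
def split_alternatives(token: str) -> list:
    """Expand one tokenized step into its list of alternatives.

    Commas at parenthesis depth 0 or 1 split alternatives; deeper commas
    survive inside an alternative, and parentheses are stripped, matching
    entries such as ['K11389', 'K00134,K00150_K00927'] above.
    """
    depth, alt, alts = 0, "", []
    for ch in token:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth <= 1:
            alts.append(alt)
            alt = ""
        else:
            alt += ch
    alts.append(alt)
    return alts
```

Applied to each token from a block's definition line, this reproduces the per-step lists up to ordering.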
+M00010
+(K01647,K05942) (K01681,K01682) (K00031,K00030)
+['(K01647,K05942)', '(K01681,K01682)', '(K00031,K00030)']
+==
+['K01647', 'K05942']
+['K01681', 'K01682']
+['K00031', 'K00030']
+++++++++++++++++++
+M00011
+(K00164+K00658+K00382,K00174+K00175-K00177-K00176) (K01902+K01903,K01899+K01900,K18118) (K00234+K00235+K00236+K00237,K00239+K00240+K00241-(K00242,K18859,K18860),K00244+K00245+K00246-K00247) (K01676,K01679,K01677+K01678) (K00026,K00025,K00024,K00116)
+['(K00164+K00658+K00382,K00174+K00175-K00177-K00176)', '(K01902+K01903,K01899+K01900,K18118)', '(K00234+K00235+K00236+K00237,K00239+K00240+K00241-(K00242,K18859,K18860),K00244+K00245+K00246-K00247)', '(K01676,K01679,K01677+K01678)', '(K00026,K00025,K00024,K00116)']
+==
+['K00164+K00658+K00382', 'K00174+K00175-K00177-K00176']
+['K18118', 'K01902+K01903', 'K01899+K01900']
+['K00234+K00235+K00236+K00237', 'K00244+K00245+K00246-K00247', 'K00239+K00240+K00241-K00242,K18859,K18860']
+['K01676', 'K01679', 'K01677+K01678']
+['K00026', 'K00025', 'K00024', 'K00116']
+++++++++++++++++++
+M00004
+(K13937,((K00036,K19243) (K01057,K07404))) K00033 K01783 (K01807,K01808) K00615 K00616 (K01810,K06859,K13810,K15916)
+['(K13937,((K00036,K19243)_(K01057,K07404)))', 'K00033', 'K01783', '(K01807,K01808)', 'K00615', 'K00616', '(K01810,K06859,K13810,K15916)']
+==
+['K13937', 'K00036,K19243_K01057,K07404']
+['K00033']
+['K01783']
+['K01807', 'K01808']
+['K00615']
+['K00616']
+['K01810', 'K06859', 'K13810', 'K15916']
+++++++++++++++++++
+M00006
+(K13937,((K00036,K19243) (K01057,K07404))) K00033
+['(K13937,((K00036,K19243)_(K01057,K07404)))', 'K00033']
+==
+['K13937', 'K00036,K19243_K01057,K07404']
+['K00033']
+++++++++++++++++++
+M00007
+K00615 (K00616,K13810) K01783 (K01807,K01808)
+['K00615', '(K00616,K13810)', 'K01783', '(K01807,K01808)']
+==
+['K00615']
+['K00616', 'K13810']
+['K01783']
+['K01807', 'K01808']
+++++++++++++++++++
+M00580
+(K08094 (K08093,K13812),K13831) K01807
+['(K08094_(K08093,K13812),K13831)', 'K01807']
+==
+['K13831', 'K08094_K08093,K13812']
+['K01807']
+++++++++++++++++++
+M00005
+K00948
+['K00948']
+==
+['K00948']
+++++++++++++++++++
+M00008
+K00036 (K01057,K07404) K01690 K01625
+['K00036', '(K01057,K07404)', 'K01690', 'K01625']
+==
+['K00036']
+['K01057', 'K07404']
+['K01690']
+['K01625']
+++++++++++++++++++
+M00308
+K05308 K00874 K01625 (K00134 K00927,K00131,K18978)
+['K05308', 'K00874', 'K01625', '(K00134_K00927,K00131,K18978)']
+==
+['K05308']
+['K00874']
+['K01625']
+['K00131', 'K18978', 'K00134_K00927']
+++++++++++++++++++
+M00633
+K05308 K18126 K11395 (K00131,K18978)
+['K05308', 'K18126', 'K11395', '(K00131,K18978)']
+==
+['K05308']
+['K18126']
+['K11395']
+['K00131', 'K18978']
+++++++++++++++++++
+M00309
+K05308 (K11395,K18127) (K18020+K18021+K18022,K18128,K03738)
+['K05308', '(K11395,K18127)', '(K18020+K18021+K18022,K18128,K03738)']
+==
+['K05308']
+['K11395', 'K18127']
+['K18128', 'K03738', 'K18020+K18021+K18022']
+++++++++++++++++++
+M00014
+K00012 ((K12447 K16190),(K00699 (K01195,K14756))) K00002 K13247 -- K03331 (K05351,K00008) K00854
+['K00012', '((K12447_K16190),(K00699_(K01195,K14756)))', 'K00002', 'K13247', 'K03331', '(K05351,K00008)', 'K00854']
+==
+['K00012']
+['K12447_K16190', 'K00699_K01195,K14756']
+['K00002']
+['K13247']
+['K03331']
+['K05351', 'K00008']
+['K00854']
+++++++++++++++++++
+M00630
+(K18106,K19634) K18102 K18103 K18107
+['(K18106,K19634)', 'K18102', 'K18103', 'K18107']
+==
+['K18106', 'K19634']
+['K18102']
+['K18103']
+['K18107']
+++++++++++++++++++
+M00631
+K01812 K00041 (K01685,K16849+K16850) K00874 (K01625,K17463)
+['K01812', 'K00041', '(K01685,K16849+K16850)', 'K00874', '(K01625,K17463)']
+==
+['K01812']
+['K00041']
+['K01685', 'K16849+K16850']
+['K00874']
+['K01625', 'K17463']
+++++++++++++++++++
+M00061
+K01812 K00040 (K01686,K08323) K00874 (K01625,K17463)
+['K01812', 'K00040', '(K01686,K08323)', 'K00874', '(K01625,K17463)']
+==
+['K01812']
+['K00040']
+['K01686', 'K08323']
+['K00874']
+['K01625', 'K17463']
+++++++++++++++++++
+M00081
+K01051 K01184 K01213
+['K01051', 'K01184', 'K01213']
+==
+['K01051']
+['K01184']
+['K01213']
+++++++++++++++++++
+M00632
+K01785 K00849 K00965 K01784
+['K01785', 'K00849', 'K00965', 'K01784']
+==
+['K01785']
+['K00849']
+['K00965']
+['K01784']
+++++++++++++++++++
+M00552
+K01684 K00883 K01631 K00134 K00927
+['K01684', 'K00883', 'K01631', 'K00134', 'K00927']
+==
+['K01684']
+['K00883']
+['K01631']
+['K00134']
+['K00927']
+++++++++++++++++++
+M00129
+K00963 K00012 K00699 (K01195,K14756) K00002 K01053 K00103
+['K00963', 'K00012', 'K00699', '(K01195,K14756)', 'K00002', 'K01053', 'K00103']
+==
+['K00963']
+['K00012']
+['K00699']
+['K01195', 'K14756']
+['K00002']
+['K01053']
+['K00103']
+++++++++++++++++++
+M00114
+((K01810,K06859,K13810) (K01809,K16011),K15916) (K16881,(K17497,K01840,K15778) (K00966,K00971,K16011)) K10046 K14190 (K10047,K18649) (K00064,K17744) K00225
+['((K01810,K06859,K13810)_(K01809,K16011),K15916)', '(K16881,(K17497,K01840,K15778)_(K00966,K00971,K16011))', 'K10046', 'K14190', '(K10047,K18649)', '(K00064,K17744)', 'K00225']
+==
+['K15916', 'K01810,K06859,K13810_K01809,K16011']
+['K16881', 'K17497,K01840,K15778_K00966,K00971,K16011']
+['K10046']
+['K14190']
+['K10047', 'K18649']
+['K00064', 'K17744']
+['K00225']
+++++++++++++++++++
+M00550
+K02821+K02822+K03475 K03476 K03078 K03079 K03077
+['K02821+K02822+K03475', 'K03476', 'K03078', 'K03079', 'K03077']
+==
+['K02821+K02822+K03475']
+['K03476']
+['K03078']
+['K03079']
+['K03077']
+++++++++++++++++++
+M00854
+(K00963 (K00693+K00750,K16150,K16153,K13679,K20812)),(K00975 (K00703,K13679,K20812)) (K00700,K16149)
+['(K00963_(K00693+K00750,K16150,K16153,K13679,K20812)),(K00975_(K00703,K13679,K20812))', '(K00700,K16149)']
+==
+['K00975_K00703,K13679,K20812', 'K00963_K00693+K00750,K16150,K16153,K13679,K20812']
+['K00700', 'K16149']
+++++++++++++++++++
+M00855
+(K00688,K16153) (K01196,((K00705,K22451) (K02438,K01200))) (K15779,K01835,K15778)
+['(K00688,K16153)', '(K01196,((K00705,K22451)_(K02438,K01200)))', '(K15779,K01835,K15778)']
+==
+['K00688', 'K16153']
+['K01196', 'K00705,K22451_K02438,K01200']
+['K15779', 'K01835', 'K15778']
+++++++++++++++++++
+M00565
+K00975 K00703 (K00700,K16149) K01214 K06044 K01236
+['K00975', 'K00703', '(K00700,K16149)', 'K01214', 'K06044', 'K01236']
+==
+['K00975']
+['K00703']
+['K00700', 'K16149']
+['K01214']
+['K06044']
+['K01236']
+++++++++++++++++++
+M00549
+(K00844,K00845,K12407,K00886) K01835 K00963
+['(K00844,K00845,K12407,K00886)', 'K01835', 'K00963']
+==
+['K00844', 'K00845', 'K12407', 'K00886']
+['K01835']
+['K00963']
+++++++++++++++++++
+M00554
+K00849 K00965
+['K00849', 'K00965']
+==
+['K00849']
+['K00965']
+++++++++++++++++++
+M00761
+K10011 K07806 K10012 K13014
+['K10011', 'K07806', 'K10012', 'K13014']
+==
+['K10011']
+['K07806']
+['K10012']
+['K13014']
+++++++++++++++++++
+M00012
+K01647 (K01681,K01682) K01637 (K01638,K19282) (K00026,K00025,K00024)
+['K01647', '(K01681,K01682)', 'K01637', '(K01638,K19282)', '(K00026,K00025,K00024)']
+==
+['K01647']
+['K01681', 'K01682']
+['K01637']
+['K01638', 'K19282']
+['K00026', 'K00025', 'K00024']
+++++++++++++++++++
+M00740
+K01647 K01681 K00031 K00261 K19268+K01846 K04835 K19280 K14449 K19281 K19282 K00024
+['K01647', 'K01681', 'K00031', 'K00261', 'K19268+K01846', 'K04835', 'K19280', 'K14449', 'K19281', 'K19282', 'K00024']
+==
+['K01647']
+['K01681']
+['K00031']
+['K00261']
+['K19268+K01846']
+['K04835']
+['K19280']
+['K14449']
+['K19281']
+['K19282']
+['K00024']
+++++++++++++++++++
+M00013
+(K00248,K00232) (K07511,K07514,K07515,K14729) K05605 K23146 K00140
+['(K00248,K00232)', '(K07511,K07514,K07515,K14729)', 'K05605', 'K23146', 'K00140']
+==
+['K00248', 'K00232']
+['K07511', 'K07514', 'K07515', 'K14729']
+['K05605']
+['K23146']
+['K00140']
+++++++++++++++++++
+M00741
+(K01965+K01966,K11263+(K18472,K19312+K22568),K01964+K15036+K15037) K05606 (K01847,K01848+K01849)
+['(K01965+K01966,K11263+(K18472,K19312+K22568),K01964+K15036+K15037)', 'K05606', '(K01847,K01848+K01849)']
+==
+['K01965+K01966', 'K01964+K15036+K15037', 'K11263+K18472,K19312+K22568']
+['K05606']
+['K01847', 'K01848+K01849']
+++++++++++++++++++
+M00130
+(K00888,K19801,K13711) (K00889,K13712) (K01116,K05857,K05858,K05859,K05860,K05861) K00911
+['(K00888,K19801,K13711)', '(K00889,K13712)', '(K01116,K05857,K05858,K05859,K05860,K05861)', 'K00911']
+==
+['K00888', 'K19801', 'K13711']
+['K00889', 'K13712']
+['K01116', 'K05857', 'K05858', 'K05859', 'K05860', 'K05861']
+['K00911']
+++++++++++++++++++
+M00131
+K01106 (K01107,K15422) K01109 (K01092,K15759,K10047,K18649)
+['K01106', '(K01107,K15422)', 'K01109', '(K01092,K15759,K10047,K18649)']
+==
+['K01106']
+['K01107', 'K15422']
+['K01109']
+['K01092', 'K15759', 'K10047', 'K18649']
+++++++++++++++++++
+M00132
+(K00913,K01765) K00915 K10572
+['(K00913,K01765)', 'K00915', 'K10572']
+==
+['K00913', 'K01765']
+['K00915']
+['K10572']
+++++++++++++++++++
+M00165
+K00855 (K01601-K01602) K00927 (K05298,K00150,K00134) (K01623,K01624) (K03841,K02446,K11532,K01086) K00615 (K01623,K01624) (K01100,K11532,K01086) K00615 (K01807,K01808)
+['K00855', '(K01601-K01602)', 'K00927', '(K05298,K00150,K00134)', '(K01623,K01624)', '(K03841,K02446,K11532,K01086)', 'K00615', '(K01623,K01624)', '(K01100,K11532,K01086)', 'K00615', '(K01807,K01808)']
+==
+['K00855']
+['K01601-K01602']
+['K00927']
+['K05298', 'K00150', 'K00134']
+['K01623', 'K01624']
+['K03841', 'K02446', 'K11532', 'K01086']
+['K00615']
+['K01623', 'K01624']
+['K01100', 'K11532', 'K01086']
+['K00615']
+['K01807', 'K01808']
+++++++++++++++++++
+M00166
+K00855 (K01601-K01602) K00927 (K05298,K00150,K00134)
+['K00855', '(K01601-K01602)', 'K00927', '(K05298,K00150,K00134)']
+==
+['K00855']
+['K01601-K01602']
+['K00927']
+['K05298', 'K00150', 'K00134']
+++++++++++++++++++
+M00167
+(K01623,K01624) (K03841,K02446,K11532,K01086) K00615 (K01623,K01624) (K01100,K11532,K01086) K00615 (K01807,K01808)
+['(K01623,K01624)', '(K03841,K02446,K11532,K01086)', 'K00615', '(K01623,K01624)', '(K01100,K11532,K01086)', 'K00615', '(K01807,K01808)']
+==
+['K01623', 'K01624']
+['K03841', 'K02446', 'K11532', 'K01086']
+['K00615']
+['K01623', 'K01624']
+['K01100', 'K11532', 'K01086']
+['K00615']
+['K01807', 'K01808']
+++++++++++++++++++
+M00168
+K01595 (K00025,K00026,K00024)
+['K01595', '(K00025,K00026,K00024)']
+==
+['K01595']
+['K00025', 'K00026', 'K00024']
+++++++++++++++++++
+M00169
+K00029 K01006
+['K00029', 'K01006']
+==
+['K00029']
+['K01006']
+++++++++++++++++++
+M00172
+K01595 K00051 K00029 K01006
+['K01595', 'K00051', 'K00029', 'K01006']
+==
+['K01595']
+['K00051']
+['K00029']
+['K01006']
+++++++++++++++++++
+M00171
+K01595 K14454 K14455 (K00025,K00026) K00028 (K00814,K14272) K01006
+['K01595', 'K14454', 'K14455', '(K00025,K00026)', 'K00028', '(K00814,K14272)', 'K01006']
+==
+['K01595']
+['K14454']
+['K14455']
+['K00025', 'K00026']
+['K00028']
+['K00814', 'K14272']
+['K01006']
+++++++++++++++++++
+M00170
+K01595 K14454 K14455 K01610
+['K01595', 'K14454', 'K14455', 'K01610']
+==
+['K01595']
+['K14454']
+['K14455']
+['K01610']
+++++++++++++++++++
+M00173
+(K00169+K00170+K00171+K00172,K03737) ((K01007,K01006) K01595,K01959+K01960,K01958) K00024 (K01676,K01679,K01677+K01678) (K00239+K00240-K00241-K00242,K00244+K00245-K00246-K00247,K18556+K18557+K18558+K18559+K18560) (K01902+K01903) (K00174+K00175-K00177-K00176) K00031 (K01681,K01682) (K15230+K15231,K15232+K15233 K15234)
+['(K00169+K00170+K00171+K00172,K03737)', '((K01007,K01006)_K01595,K01959+K01960,K01958)', 'K00024', '(K01676,K01679,K01677+K01678)', '(K00239+K00240-K00241-K00242,K00244+K00245-K00246-K00247,K18556+K18557+K18558+K18559+K18560)', '(K01902+K01903)', '(K00174+K00175-K00177-K00176)', 'K00031', '(K01681,K01682)', '(K15230+K15231,K15232+K15233_K15234)']
+==
+['K03737', 'K00169+K00170+K00171+K00172']
+['K01958', 'K01959+K01960', 'K01007,K01006_K01595']
+['K00024']
+['K01676', 'K01679', 'K01677+K01678']
+['K00239+K00240-K00241-K00242', 'K00244+K00245-K00246-K00247', 'K18556+K18557+K18558+K18559+K18560']
+['K01902+K01903']
+['K00174+K00175-K00177-K00176']
+['K00031']
+['K01681', 'K01682']
+['K15230+K15231', 'K15232+K15233_K15234']
+++++++++++++++++++
+M00375
+K01964+K15037+K15036 K15017 K15039 K15018 K15019 K15020 K05606 K01848+K01849 (K15038,K15017) K14465 (K14466,K18861) K14534 K15016 K00626
+['K01964+K15037+K15036', 'K15017', 'K15039', 'K15018', 'K15019', 'K15020', 'K05606', 'K01848+K01849', '(K15038,K15017)', 'K14465', '(K14466,K18861)', 'K14534', 'K15016', 'K00626']
+==
+['K01964+K15037+K15036']
+['K15017']
+['K15039']
+['K15018']
+['K15019']
+['K15020']
+['K05606']
+['K01848+K01849']
+['K15038', 'K15017']
+['K14465']
+['K14466', 'K18861']
+['K14534']
+['K15016']
+['K00626']
+++++++++++++++++++
+M00374
+K00169+K00170+K00171+K00172 K01007 K01595 K00024 (K01676,K01677+K01678) (K00239+K00240-K00241-K18860) K01902+K01903 (K15038,K15017) K14465 (K14467,K18861) K14534 K15016 K00626
+['K00169+K00170+K00171+K00172', 'K01007', 'K01595', 'K00024', '(K01676,K01677+K01678)', '(K00239+K00240-K00241-K18860)', 'K01902+K01903', '(K15038,K15017)', 'K14465', '(K14467,K18861)', 'K14534', 'K15016', 'K00626']
+==
+['K00169+K00170+K00171+K00172']
+['K01007']
+['K01595']
+['K00024']
+['K01676', 'K01677+K01678']
+['K00239+K00240-K00241-K18860']
+['K01902+K01903']
+['K15038', 'K15017']
+['K14465']
+['K14467', 'K18861']
+['K14534']
+['K15016']
+['K00626']
+++++++++++++++++++
+M00377
+K00198 K05299-K15022 K01938 K01491 K00297 K15023 K14138+K00197+K00194
+['K00198', 'K05299-K15022', 'K01938', 'K01491', 'K00297', 'K15023', 'K14138+K00197+K00194']
+==
+['K00198']
+['K05299-K15022']
+['K01938']
+['K01491']
+['K00297']
+['K15023']
+['K14138+K00197+K00194']
+++++++++++++++++++
+M00579
+(K00625,K13788,K15024) K00925
+['(K00625,K13788,K15024)', 'K00925']
+==
+['K00625', 'K13788', 'K15024']
+['K00925']
+++++++++++++++++++
+M00620
+K00169+K00170+K00171+K00172 K01959+K01960 K00024 K01677+K01678 K18209+K18210 K01902+K01903 K00174+K00175+K00176+K00177
+['K00169+K00170+K00171+K00172', 'K01959+K01960', 'K00024', 'K01677+K01678', 'K18209+K18210', 'K01902+K01903', 'K00174+K00175+K00176+K00177']
+==
+['K00169+K00170+K00171+K00172']
+['K01959+K01960']
+['K00024']
+['K01677+K01678']
+['K18209+K18210']
+['K01902+K01903']
+['K00174+K00175+K00176+K00177']
+++++++++++++++++++
+M00567
+(K00200+K00201+K00202+K00203-K11261+(K00205,K11260,K00204)) K00672 K01499 (K00319,K13942) K00320 (K00577+K00578+K00579+K00580+K00581-K00582-K00583+K00584) (K00399+K00401+K00402) (K22480+K22481+K22482,K03388+K03389+K03390,K08264+K08265,K03388+K03389+K03390+K14127+(K14126+K14128,K22516+K00125))
+['(K00200+K00201+K00202+K00203-K11261+(K00205,K11260,K00204))', 'K00672', 'K01499', '(K00319,K13942)', 'K00320', '(K00577+K00578+K00579+K00580+K00581-K00582-K00583+K00584)', '(K00399+K00401+K00402)', '(K22480+K22481+K22482,K03388+K03389+K03390,K08264+K08265,K03388+K03389+K03390+K14127+(K14126+K14128,K22516+K00125))']
+==
+['K00200+K00201+K00202+K00203-K11261+K00205,K11260,K00204']
+['K00672']
+['K01499']
+['K00319', 'K13942']
+['K00320']
+['K00577+K00578+K00579+K00580+K00581-K00582-K00583+K00584']
+['K00399+K00401+K00402']
+['K08264+K08265', 'K22480+K22481+K22482', 'K03388+K03389+K03390', 'K03388+K03389+K03390+K14127+K14126+K14128,K22516+K00125']
+++++++++++++++++++
+M00357
+(K00925 (K00625,K13788),K01895) (K00193+K00197+K00194) (K00577+K00578+K00579+K00580+K00581-K00582-K00583+K00584) (K00399+K00401+K00402) (K22480+K22481+K22482,K03388+K03389+K03390,K08264+K08265,K03388+K03389+K03390+K14127+(K14126+K14128,K22516+K00125))
+['(K00925_(K00625,K13788),K01895)', '(K00193+K00197+K00194)', '(K00577+K00578+K00579+K00580+K00581-K00582-K00583+K00584)', '(K00399+K00401+K00402)', '(K22480+K22481+K22482,K03388+K03389+K03390,K08264+K08265,K03388+K03389+K03390+K14127+(K14126+K14128,K22516+K00125))']
+==
+['K01895', 'K00925_K00625,K13788']
+['K00193+K00197+K00194']
+['K00577+K00578+K00579+K00580+K00581-K00582-K00583+K00584']
+['K00399+K00401+K00402']
+['K08264+K08265', 'K22480+K22481+K22482', 'K03388+K03389+K03390', 'K03388+K03389+K03390+K14127+K14126+K14128,K22516+K00125']
+++++++++++++++++++
+M00356
+K14080+K04480+K14081 K00399+K00401+K00402 (K22480+K22481+K22482,K03388+K03389+K03390,K08264+K08265,K03388+K03389+K03390+K14127+(K14126+K14128,K22516+K00125))
+['K14080+K04480+K14081', 'K00399+K00401+K00402', '(K22480+K22481+K22482,K03388+K03389+K03390,K08264+K08265,K03388+K03389+K03390+K14127+(K14126+K14128,K22516+K00125))']
+==
+['K14080+K04480+K14081']
+['K00399+K00401+K00402']
+['K08264+K08265', 'K22480+K22481+K22482', 'K03388+K03389+K03390', 'K03388+K03389+K03390+K14127+K14126+K14128,K22516+K00125']
+++++++++++++++++++
+M00563
+K14082 ((K16177-K16176),(K16179-K16178),(K14084-K14083)) K00399+K00401+K00402 (K22480+K22481+K22482,K03388+K03389+K03390,K08264+K08265,K03388+K03389+K03390+K14127+(K14126+K14128,K22516+K00125))
+['K14082', '((K16177-K16176),(K16179-K16178),(K14084-K14083))', 'K00399+K00401+K00402', '(K22480+K22481+K22482,K03388+K03389+K03390,K08264+K08265,K03388+K03389+K03390+K14127+(K14126+K14128,K22516+K00125))']
+==
+['K14082']
+['K16177-K16176', 'K16179-K16178', 'K14084-K14083']
+['K00399+K00401+K00402']
+['K08264+K08265', 'K22480+K22481+K22482', 'K03388+K03389+K03390', 'K03388+K03389+K03390+K14127+K14126+K14128,K22516+K00125']
+++++++++++++++++++
+M00358
+K08097 K05979 K05884 K13039+K06034
+['K08097', 'K05979', 'K05884', 'K13039+K06034']
+==
+['K08097']
+['K05979']
+['K05884']
+['K13039+K06034']
+++++++++++++++++++
+M00608
+K10977 K16792+K16793 K10978
+['K10977', 'K16792+K16793', 'K10978']
+==
+['K10977']
+['K16792+K16793']
+['K10978']
+++++++++++++++++++
+M00174
+((K10944+K10945+K10946),(K16157+K16158+K16159+K16160+K16161+K16162)) ((K14028-K14029),K23995)
+['((K10944+K10945+K10946),(K16157+K16158+K16159+K16160+K16161+K16162))', '((K14028-K14029),K23995)']
+==
+['K10944+K10945+K10946', 'K16157+K16158+K16159+K16160+K16161+K16162']
+['K23995', 'K14028-K14029']
+++++++++++++++++++
+M00346
+K00600 K00830 K00018 K11529 K01689 K01595 K00024 K08692+K14067 K08691
+['K00600', 'K00830', 'K00018', 'K11529', 'K01689', 'K01595', 'K00024', 'K08692+K14067', 'K08691']
+==
+['K00600']
+['K00830']
+['K00018']
+['K11529']
+['K01689']
+['K01595']
+['K00024']
+['K08692+K14067']
+['K08691']
+++++++++++++++++++
+M00345
+(((K08093,K13812) K08094),K13831) (K00850,K16370) K01624
+['(((K08093,K13812)_K08094),K13831)', '(K00850,K16370)', 'K01624']
+==
+['K13831', 'K08093,K13812_K08094']
+['K00850', 'K16370']
+['K01624']
+++++++++++++++++++
+M00344
+K17100 K00863 K01624 K03841
+['K17100', 'K00863', 'K01624', 'K03841']
+==
+['K17100']
+['K00863']
+['K01624']
+['K03841']
+++++++++++++++++++
+M00422
+K00192+K00195 K00193+K00197+K00194
+['K00192+K00195', 'K00193+K00197+K00194']
+==
+['K00192+K00195']
+['K00193+K00197+K00194']
+++++++++++++++++++
+M00175
+K02588+K02586+K02591-K00531,K22896+K22897+K22898+K22899
+['K02588+K02586+K02591-K00531,K22896+K22897+K22898+K22899']
+==
+['K02588+K02586+K02591-K00531', 'K22896+K22897+K22898+K22899']
+++++++++++++++++++
+M00531
+(K00367,K10534,K00372-K00360) (K00366,K17877)
+['(K00367,K10534,K00372-K00360)', '(K00366,K17877)']
+==
+['K00367', 'K10534', 'K00372-K00360']
+['K00366', 'K17877']
+++++++++++++++++++
+M00530
+(K00370+K00371+K00374,K02567+K02568) (K00362+K00363,K03385+K15876)
+['(K00370+K00371+K00374,K02567+K02568)', '(K00362+K00363,K03385+K15876)']
+==
+['K02567+K02568', 'K00370+K00371+K00374']
+['K00362+K00363', 'K03385+K15876']
+++++++++++++++++++
+M00529
+(K00370+K00371+K00374,K02567+K02568) (K00368,K15864) (K04561+K02305) K00376
+['(K00370+K00371+K00374,K02567+K02568)', '(K00368,K15864)', '(K04561+K02305)', 'K00376']
+==
+['K02567+K02568', 'K00370+K00371+K00374']
+['K00368', 'K15864']
+['K04561+K02305']
+['K00376']
+++++++++++++++++++
+M00528
+K10944+K10945+K10946 K10535
+['K10944+K10945+K10946', 'K10535']
+==
+['K10944+K10945+K10946']
+['K10535']
+++++++++++++++++++
+M00804
+K10944+K10945+K10946 K10535 K00370+K00371
+['K10944+K10945+K10946', 'K10535', 'K00370+K00371']
+==
+['K10944+K10945+K10946']
+['K10535']
+['K00370+K00371']
+++++++++++++++++++
+M00176
+(K13811,K00958+K00860,K00955+K00957,K00956+K00957+K00860) K00390 (K00380+K00381,K00392)
+['(K13811,K00958+K00860,K00955+K00957,K00956+K00957+K00860)', 'K00390', '(K00380+K00381,K00392)']
+==
+['K13811', 'K00958+K00860', 'K00955+K00957', 'K00956+K00957+K00860']
+['K00390']
+['K00392', 'K00380+K00381']
+++++++++++++++++++
+M00596
+K00958 (K00394+K00395) (K11180+K11181)
+['K00958', '(K00394+K00395)', '(K11180+K11181)']
+==
+['K00958']
+['K00394+K00395']
+['K11180+K11181']
+++++++++++++++++++
+M00595
+K17222+K17223+K17224-K17225-K22622+K17226+K17227
+['K17222+K17223+K17224-K17225-K22622+K17226+K17227']
+==
+['K17222+K17223+K17224-K17225-K22622+K17226+K17227']
+++++++++++++++++++
+M00161
+K02703+K02706+K02705+K02704+K02707+K02708
+['K02703+K02706+K02705+K02704+K02707+K02708']
+==
+['K02703+K02706+K02705+K02704+K02707+K02708']
+++++++++++++++++++
+M00163
+K02689+K02690+K02691+K02692+K02693+K02694
+['K02689+K02690+K02691+K02692+K02693+K02694']
+==
+['K02689+K02690+K02691+K02692+K02693+K02694']
+++++++++++++++++++
+M00597
+K08928+K08929
+['K08928+K08929']
+==
+['K08928+K08929']
+++++++++++++++++++
+M00598
+K08940+K08941+K08942+K08943
+['K08940+K08941+K08942+K08943']
+==
+['K08940+K08941+K08942+K08943']
+++++++++++++++++++
+M00145
+K05574+K05582+K05581+K05579+K05572+K05580+K05578+K05576+K05577+K05575+K05573-K05583-K05584-K05585
+['K05574+K05582+K05581+K05579+K05572+K05580+K05578+K05576+K05577+K05575+K05573-K05583-K05584-K05585']
+==
+['K05574+K05582+K05581+K05579+K05572+K05580+K05578+K05576+K05577+K05575+K05573-K05583-K05584-K05585']
+++++++++++++++++++
+M00142
+K03878+K03879+K03880+K03881+K03882+K03883+K03884
+['K03878+K03879+K03880+K03881+K03882+K03883+K03884']
+==
+['K03878+K03879+K03880+K03881+K03882+K03883+K03884']
+++++++++++++++++++
+M00143
+K03934+K03935+K03936+K03937+K03938+K03939+K03940+K03941+K03942+K03943-K03944
+['K03934+K03935+K03936+K03937+K03938+K03939+K03940+K03941+K03942+K03943-K03944']
+==
+['K03934+K03935+K03936+K03937+K03938+K03939+K03940+K03941+K03942+K03943-K03944']
+++++++++++++++++++
+M00146
+K03945+K03946+K03947+K03948+K03949+K03950+K03951+K03952+K03953+K03954+K03955+K03956+K11352+K11353
+['K03945+K03946+K03947+K03948+K03949+K03950+K03951+K03952+K03953+K03954+K03955+K03956+K11352+K11353']
+==
+['K03945+K03946+K03947+K03948+K03949+K03950+K03951+K03952+K03953+K03954+K03955+K03956+K11352+K11353']
+++++++++++++++++++
+M00147
+K03957+K03958+K03959+K03960+K03961+K03962+K03963+K03964+K03965+K03966+K11351+K03967+K03968
+['K03957+K03958+K03959+K03960+K03961+K03962+K03963+K03964+K03965+K03966+K11351+K03967+K03968']
+==
+['K03957+K03958+K03959+K03960+K03961+K03962+K03963+K03964+K03965+K03966+K11351+K03967+K03968']
+++++++++++++++++++
+M00150
+K00244+K00245+K00246+K00247
+['K00244+K00245+K00246+K00247']
+==
+['K00244+K00245+K00246+K00247']
+++++++++++++++++++
+M00148
+K00236+K00237+K00234+K00235
+['K00236+K00237+K00234+K00235']
+==
+['K00236+K00237+K00234+K00235']
+++++++++++++++++++
+M00162
+K02635+K02637+K02634+K02636+K02642+K02643+K03689+K02640
+['K02635+K02637+K02634+K02636+K02642+K02643+K03689+K02640']
+==
+['K02635+K02637+K02634+K02636+K02642+K02643+K03689+K02640']
+++++++++++++++++++
+M00154
+(K02257+K02262+K02256+K02261+K02263+K02264+K02265+K02266+K02267+K02268+(K02269,K02270-K02271)+K02272-K02273+K02258+K02259+K02260)
+['(K02257+K02262+K02256+K02261+K02263+K02264+K02265+K02266+K02267+K02268+(K02269,K02270-K02271)+K02272-K02273+K02258+K02259+K02260)']
+==
+['K02257+K02262+K02256+K02261+K02263+K02264+K02265+K02266+K02267+K02268+K02269,K02270-K02271+K02272-K02273+K02258+K02259+K02260']
+++++++++++++++++++
+M00155
+K02275+(K02274+K02276,K15408)-K02277
+['K02275+(K02274+K02276,K15408)-K02277']
+==
+['K15408-K02277', 'K02275+K02274+K02276']
+++++++++++++++++++
+M00153
+K00425+K00426+(K00424,K22501)
+['K00425+K00426+(K00424,K22501)']
+==
+['K22501', 'K00425+K00426+K00424']
+++++++++++++++++++
+M00417
+K02297+K02298+K02299+K02300
+['K02297+K02298+K02299+K02300']
+==
+['K02297+K02298+K02299+K02300']
+++++++++++++++++++
+M00416
+K02827+K02826+K02828+K02829
+['K02827+K02826+K02828+K02829']
+==
+['K02827+K02826+K02828+K02829']
+++++++++++++++++++
+M00156
+((K00404+K00405,K15862)+K00407+K00406)
+['((K00404+K00405,K15862)+K00407+K00406)']
+==
+['K00404+K00405,K15862+K00407+K00406']
+++++++++++++++++++
+M00157
+K02111+K02112+K02113+K02114+K02115+K02108+K02109+K02110
+['K02111+K02112+K02113+K02114+K02115+K02108+K02109+K02110']
+==
+['K02111+K02112+K02113+K02114+K02115+K02108+K02109+K02110']
+++++++++++++++++++
+M00158
+K02132+K02133+K02136+K02134+K02135+K02137+K02126+K02127+K02128+K02138+(K02129,K01549)+(K02130,K02139)+K02140+(K02141,K02131)-K02142-K02143+K02125
+['K02132+K02133+K02136+K02134+K02135+K02137+K02126+K02127+K02128+K02138+(K02129,K01549)+(K02130,K02139)+K02140+(K02141,K02131)-K02142-K02143+K02125']
+==
+['K01549+K02130', 'K02139+K02140+K02141', 'K02131-K02142-K02143+K02125', 'K02132+K02133+K02136+K02134+K02135+K02137+K02126+K02127+K02128+K02138+K02129']
+++++++++++++++++++
+M00159
+K02117+K02118+K02119+K02120+K02121+K02122+K02107+K02123+K02124
+['K02117+K02118+K02119+K02120+K02121+K02122+K02107+K02123+K02124']
+==
+['K02117+K02118+K02119+K02120+K02121+K02122+K02107+K02123+K02124']
+++++++++++++++++++
+M00160
+K02145+K02147+K02148+K02149+K02150+K02151+K02152+K02144+K02154+(K03661,K02155)+K02146+K02153+K03662
+['K02145+K02147+K02148+K02149+K02150+K02151+K02152+K02144+K02154+(K03661,K02155)+K02146+K02153+K03662']
+==
+['K02155+K02146+K02153+K03662', 'K02145+K02147+K02148+K02149+K02150+K02151+K02152+K02144+K02154+K03661']
+++++++++++++++++++
+M00082
+(K11262,(K02160+K01961,K11263)+(K01962+K01963,K18472)) (K00665,K00667+K00668,K11533,(K00645 (K00648,K18473)))
+['(K11262,(K02160+K01961,K11263)+(K01962+K01963,K18472))', '(K00665,K00667+K00668,K11533,(K00645_(K00648,K18473)))']
+==
+['K11262', 'K02160+K01961,K11263+K01962+K01963,K18472']
+['K00665', 'K11533', 'K00667+K00668', 'K00645_K00648,K18473']
+++++++++++++++++++
+M00083
+K00665,(K00667 K00668),K11533,((K00647,K09458) K00059 (K02372,K01716,K16363) (K00208,K02371,K10780,K00209))
+['K00665,(K00667_K00668),K11533,((K00647,K09458)_K00059_(K02372,K01716,K16363)_(K00208,K02371,K10780,K00209))']
+==
+['K00665', 'K11533', 'K00667_K00668', 'K00647,K09458_K00059_K02372,K01716,K16363_K00208,K02371,K10780,K00209']
+++++++++++++++++++
+M00873
+K18660 K03955+K00645 K09458 K11539+K13370 K22540 K07512
+['K18660', 'K03955+K00645', 'K09458', 'K11539+K13370', 'K22540', 'K07512']
+==
+['K18660']
+['K03955+K00645']
+['K09458']
+['K11539+K13370']
+['K22540']
+['K07512']
+++++++++++++++++++
+M00874
+K11262 K03955+K00645 K09458 K00059 K22541 K07512
+['K11262', 'K03955+K00645', 'K09458', 'K00059', 'K22541', 'K07512']
+==
+['K11262']
+['K03955+K00645']
+['K09458']
+['K00059']
+['K22541']
+['K07512']
+++++++++++++++++++
+M00085
+(K07508,K07509) (K00022 K07511,K07515) K07512
+['(K07508,K07509)', '(K00022_K07511,K07515)', 'K07512']
+==
+['K07508', 'K07509']
+['K07515', 'K00022_K07511']
+['K07512']
+++++++++++++++++++
+M00415
+(K10247,K10205,K10248,K10249,K10244,K10203,K10250,K15397,K10245,K10246) K10251 K10703 K10258
+['(K10247,K10205,K10248,K10249,K10244,K10203,K10250,K15397,K10245,K10246)', 'K10251', 'K10703', 'K10258']
+==
+['K10247', 'K10205', 'K10248', 'K10249', 'K10244', 'K10203', 'K10250', 'K15397', 'K10245', 'K10246']
+['K10251']
+['K10703']
+['K10258']
+++++++++++++++++++
+M00086
+K01897,K15013
+['K01897,K15013']
+==
+['K01897', 'K15013']
+++++++++++++++++++
+M00087
+(K00232,K00249,K00255,K06445,K09479) (((K01692,K07511,K13767) (K00022,K07516)),K01825,K01782,K07514,K07515,K10527) (K00632,K07508,K07509,K07513)
+['(K00232,K00249,K00255,K06445,K09479)', '(((K01692,K07511,K13767)_(K00022,K07516)),K01825,K01782,K07514,K07515,K10527)', '(K00632,K07508,K07509,K07513)']
+==
+['K00232', 'K00249', 'K00255', 'K06445', 'K09479']
+['K01825', 'K01782', 'K07514', 'K07515', 'K10527', 'K01692,K07511,K13767_K00022,K07516']
+['K00632', 'K07508', 'K07509', 'K07513']
+++++++++++++++++++
+M00861
+K00232 K12405 (K07513,K08764)
+['K00232', 'K12405', '(K07513,K08764)']
+==
+['K00232']
+['K12405']
+['K07513', 'K08764']
+++++++++++++++++++
+M00101
+K01852 K05917 K00222 K07750 K07748 (K09827,K13373) K09828 K01824 K00227 K00213
+['K01852', 'K05917', 'K00222', 'K07750', 'K07748', '(K09827,K13373)', 'K09828', 'K01824', 'K00227', 'K00213']
+==
+['K01852']
+['K05917']
+['K00222']
+['K07750']
+['K07748']
+['K09827', 'K13373']
+['K09828']
+['K01824']
+['K00227']
+['K00213']
+++++++++++++++++++
+M00102
+K00559 K09829 K00227 K09831 K00223
+['K00559', 'K09829', 'K00227', 'K09831', 'K00223']
+==
+['K00559']
+['K09829']
+['K00227']
+['K09831']
+['K00223']
+++++++++++++++++++
+M00103
+K07419 K07438
+['K07419', 'K07438']
+==
+['K07419']
+['K07438']
+++++++++++++++++++
+M00104
+K00489 K12408 K07431 K00251 K00037 K00488 K08748 K01796 K10214 K12405 K08764 K11992
+['K00489', 'K12408', 'K07431', 'K00251', 'K00037', 'K00488', 'K08748', 'K01796', 'K10214', 'K12405', 'K08764', 'K11992']
+==
+['K00489']
+['K12408']
+['K07431']
+['K00251']
+['K00037']
+['K00488']
+['K08748']
+['K01796']
+['K10214']
+['K12405']
+['K08764']
+['K11992']
+++++++++++++++++++
+M00106
+K08748 K00659
+['K08748', 'K00659']
+==
+['K08748']
+['K00659']
+++++++++++++++++++
+M00862
+K10214 K12405 K08764
+['K10214', 'K12405', 'K08764']
+==
+['K10214']
+['K12405']
+['K08764']
+++++++++++++++++++
+M00107
+K00498 K00070
+['K00498', 'K00070']
+==
+['K00498']
+['K00070']
+++++++++++++++++++
+M00108
+K00513 (K00497,K07433) K07433
+['K00513', '(K00497,K07433)', 'K07433']
+==
+['K00513']
+['K00497', 'K07433']
+['K07433']
+++++++++++++++++++
+M00109
+K00512 K00513 K00497 (K15680,K00071)
+['K00512', 'K00513', 'K00497', '(K15680,K00071)']
+==
+['K00512']
+['K00513']
+['K00497']
+['K15680', 'K00071']
+++++++++++++++++++
+M00110
+K00512 K00070 K07434
+['K00512', 'K00070', 'K07434']
+==
+['K00512']
+['K00070']
+['K07434']
+++++++++++++++++++
+M00089
+(K00629,K13506,K13507,K00630,K13508) (K00655,K13509,K13523,K19007,K13513,K13517,K13519,K14674,K22831) (K01080,K15728,K18693) (K11155,K11160,K14456,K22848,K22849)
+['(K00629,K13506,K13507,K00630,K13508)', '(K00655,K13509,K13523,K19007,K13513,K13517,K13519,K14674,K22831)', '(K01080,K15728,K18693)', '(K11155,K11160,K14456,K22848,K22849)']
+==
+['K00629', 'K13506', 'K13507', 'K00630', 'K13508']
+['K00655', 'K13509', 'K13523', 'K19007', 'K13513', 'K13517', 'K13519', 'K14674', 'K22831']
+['K01080', 'K15728', 'K18693']
+['K11155', 'K11160', 'K14456', 'K22848', 'K22849']
+++++++++++++++++++
+M00098
+(K01046,K12298,K16816,K13534,K14073,K14074,K14075,K14076,K22283,K14452,K22284,K14674,K14675,K17900) K01054
+['(K01046,K12298,K16816,K13534,K14073,K14074,K14075,K14076,K22283,K14452,K22284,K14674,K14675,K17900)', 'K01054']
+==
+['K01046', 'K12298', 'K16816', 'K13534', 'K14073', 'K14074', 'K14075', 'K14076', 'K22283', 'K14452', 'K22284', 'K14674', 'K14675', 'K17900']
+['K01054']
+++++++++++++++++++
+M00090
+(K00866,K14156) K00968 (K00994,K13644)
+['(K00866,K14156)', 'K00968', '(K00994,K13644)']
+==
+['K00866', 'K14156']
+['K00968']
+['K00994', 'K13644']
+++++++++++++++++++
+M00091
+K00551,(K16369 K00550),K00570
+['K00551,(K16369_K00550),K00570']
+==
+['K00551', 'K00570', 'K16369_K00550']
+++++++++++++++++++
+M00092
+(K00894,K14156) K00967 (K00993,K13644)
+['(K00894,K14156)', 'K00967', '(K00993,K13644)']
+==
+['K00894', 'K14156']
+['K00967']
+['K00993', 'K13644']
+++++++++++++++++++
+M00093
+K00981 (K00998,K17103) K01613
+['K00981', '(K00998,K17103)', 'K01613']
+==
+['K00981']
+['K00998', 'K17103']
+['K01613']
+++++++++++++++++++
+M00094
+K00654 K04708 (K04709,K04710,K23727) K04712
+['K00654', 'K04708', '(K04709,K04710,K23727)', 'K04712']
+==
+['K00654']
+['K04708']
+['K04709', 'K04710', 'K23727']
+['K04712']
+++++++++++++++++++
+M00066
+K00720 K07553
+['K00720', 'K07553']
+==
+['K00720']
+['K07553']
+++++++++++++++++++
+M00067
+K04628 K01019
+['K04628', 'K01019']
+==
+['K04628']
+['K01019']
+++++++++++++++++++
+M00099
+K00654 K04708 (K04709,K04710,K23727) K04712 (K01441,K12348,K12349)
+['K00654', 'K04708', '(K04709,K04710,K23727)', 'K04712', '(K01441,K12348,K12349)']
+==
+['K00654']
+['K04708']
+['K04709', 'K04710', 'K23727']
+['K04712']
+['K01441', 'K12348', 'K12349']
+++++++++++++++++++
+M00100
+K04718 K01634
+['K04718', 'K01634']
+==
+['K04718']
+['K01634']
+++++++++++++++++++
+M00113
+K00454 K01723 K10525 K05894 K10526 K00232 K10527 K07513 --
+['K00454', 'K01723', 'K10525', 'K05894', 'K10526', 'K00232', 'K10527', 'K07513']
+==
+['K00454']
+['K01723']
+['K10525']
+['K05894']
+['K10526']
+['K00232']
+['K10527']
+['K07513']
+++++++++++++++++++
+M00049
+K01939 K01756 (K00939,K18532,K18533,K00944) (K00940,K00873,K12406)
+['K01939', 'K01756', '(K00939,K18532,K18533,K00944)', '(K00940,K00873,K12406)']
+==
+['K01939']
+['K01756']
+['K00939', 'K18532', 'K18533', 'K00944']
+['K00940', 'K00873', 'K12406']
+++++++++++++++++++
+M00050
+K00088 K01951 K00942 (K00940,K18533,K00873,K12406)
+['K00088', 'K01951', 'K00942', '(K00940,K18533,K00873,K12406)']
+==
+['K00088']
+['K01951']
+['K00942']
+['K00940', 'K18533', 'K00873', 'K12406']
+++++++++++++++++++
+M00546
+(K00106,K00087+K13479+K13480,K13481+K13482,K11177+K11178+K13483) (K00365,K16838,K16839,K22879) (K13484,K07127 (K13485,K16838,K16840)) (K01466,K16842) K01477
+['(K00106,K00087+K13479+K13480,K13481+K13482,K11177+K11178+K13483)', '(K00365,K16838,K16839,K22879)', '(K13484,K07127_(K13485,K16838,K16840))', '(K01466,K16842)', 'K01477']
+==
+['K00106', 'K13481+K13482', 'K00087+K13479+K13480', 'K11177+K11178+K13483']
+['K00365', 'K16838', 'K16839', 'K22879']
+['K13484', 'K07127_K13485,K16838,K16840']
+['K01466', 'K16842']
+['K01477']
+++++++++++++++++++
+M00051
+(K11540,(K11541 K01465),((K01954,K01955+K01956) (K00609+K00610,K00608) K01465)) (K00226,K00254,K17828) (K13421,K00762 K01591)
+['(K11540,(K11541_K01465),((K01954,K01955+K01956)_(K00609+K00610,K00608)_K01465))', '(K00226,K00254,K17828)', '(K13421,K00762_K01591)']
+==
+['K11540', 'K11541_K01465', 'K01954,K01955+K01956_K00609+K00610,K00608_K01465']
+['K00226', 'K00254', 'K17828']
+['K13421', 'K00762_K01591']
+++++++++++++++++++
+M00052
+(K13800,K13809,K09903) (K00940,K18533) K01937
+['(K13800,K13809,K09903)', '(K00940,K18533)', 'K01937']
+==
+['K13800', 'K13809', 'K09903']
+['K00940', 'K18533']
+['K01937']
+++++++++++++++++++
+M00053
+(K00524,K00525+K00526,K10807+K10808) (K00940,K18533) (K00527,K21636) K01494 K01520 (K00560,K13998) K00943 K00940
+['(K00524,K00525+K00526,K10807+K10808)', '(K00940,K18533)', '(K00527,K21636)', 'K01494', 'K01520', '(K00560,K13998)', 'K00943', 'K00940']
+==
+['K00524', 'K00525+K00526', 'K10807+K10808']
+['K00940', 'K18533']
+['K00527', 'K21636']
+['K01494']
+['K01520']
+['K00560', 'K13998']
+['K00943']
+['K00940']
+++++++++++++++++++
+M00046
+(K00207,K17722+K17723) K01464 (K01431,K06016)
+['(K00207,K17722+K17723)', 'K01464', '(K01431,K06016)']
+==
+['K00207', 'K17722+K17723']
+['K01464']
+['K01431', 'K06016']
+++++++++++++++++++
+M00020
+K00058 K00831 (K01079,K02203,K22305)
+['K00058', 'K00831', '(K01079,K02203,K22305)']
+==
+['K00058']
+['K00831']
+['K01079', 'K02203', 'K22305']
+++++++++++++++++++
+M00018
+(K00928,K12524,K12525,K12526) K00133 (K00003,K12524,K12525) (K00872,K02204,K02203) K01733
+['(K00928,K12524,K12525,K12526)', 'K00133', '(K00003,K12524,K12525)', '(K00872,K02204,K02203)', 'K01733']
+==
+['K00928', 'K12524', 'K12525', 'K12526']
+['K00133']
+['K00003', 'K12524', 'K12525']
+['K00872', 'K02204', 'K02203']
+['K01733']
+++++++++++++++++++
+M00555
+(K17755,((K00108,K11440,K00499) (K00130,K14085)))
+['(K17755,((K00108,K11440,K00499)_(K00130,K14085)))']
+==
+['K17755', 'K00108,K11440,K00499_K00130,K14085']
+++++++++++++++++++
+M00033
+K00928 K00133 K00836 K06718 K06720
+['K00928', 'K00133', 'K00836', 'K06718', 'K06720']
+==
+['K00928']
+['K00133']
+['K00836']
+['K06718']
+['K06720']
+++++++++++++++++++
+M00021
+(K00640,K23304) (K01738,K13034,K17069)
+['(K00640,K23304)', '(K01738,K13034,K17069)']
+==
+['K00640', 'K23304']
+['K01738', 'K13034', 'K17069']
+++++++++++++++++++
+M00338
+(K01697,K10150) K01758
+['(K01697,K10150)', 'K01758']
+==
+['K01697', 'K10150']
+['K01758']
+++++++++++++++++++
+M00609
+K00789 K17462 K01243 K07173 K17216 K17217
+['K00789', 'K17462', 'K01243', 'K07173', 'K17216', 'K17217']
+==
+['K00789']
+['K17462']
+['K01243']
+['K07173']
+['K17216']
+['K17217']
+++++++++++++++++++
+M00017
+(K00928,K12524,K12525) K00133 (K00003,K12524,K12525) (K00651,K00641) K01739 (K01760,K14155) (K00548,K24042,K00549)
+['(K00928,K12524,K12525)', 'K00133', '(K00003,K12524,K12525)', '(K00651,K00641)', 'K01739', '(K01760,K14155)', '(K00548,K24042,K00549)']
+==
+['K00928', 'K12524', 'K12525']
+['K00133']
+['K00003', 'K12524', 'K12525']
+['K00651', 'K00641']
+['K01739']
+['K01760', 'K14155']
+['K00548', 'K24042', 'K00549']
+++++++++++++++++++
+M00034
+K00789 K01611 K00797 ((K01243,K01244) K00899,K00772) K08963 (K16054,K08964 (K09880,K08965 K08966)) K08967 (K00815,K08969,K23977,K00832,K00838)
+['K00789', 'K01611', 'K00797', '((K01243,K01244)_K00899,K00772)', 'K08963', '(K16054,K08964_(K09880,K08965_K08966))', 'K08967', '(K00815,K08969,K23977,K00832,K00838)']
+==
+['K00789']
+['K01611']
+['K00797']
+['K00772', 'K01243,K01244_K00899']
+['K08963']
+['K16054', 'K08964_K09880,K08965_K08966']
+['K08967']
+['K00815', 'K08969', 'K23977', 'K00832', 'K00838']
+++++++++++++++++++
+M00035
+K00789 (K00558,K17398,K17399) K01251 (K01697,K10150)
+['K00789', '(K00558,K17398,K17399)', 'K01251', '(K01697,K10150)']
+==
+['K00789']
+['K00558', 'K17398', 'K17399']
+['K01251']
+['K01697', 'K10150']
+++++++++++++++++++
+M00368
+K00789 (K01762,K20772) K05933
+['K00789', '(K01762,K20772)', 'K05933']
+==
+['K00789']
+['K01762', 'K20772']
+['K05933']
+++++++++++++++++++
+M00019
+K01652+(K01653,K11258) K00053 K01687 K00826
+['K01652+(K01653,K11258)', 'K00053', 'K01687', 'K00826']
+==
+['K11258', 'K01652+K01653']
+['K00053']
+['K01687']
+['K00826']
+++++++++++++++++++
+M00535
+K09011 K01703+K01704 K00052
+['K09011', 'K01703+K01704', 'K00052']
+==
+['K09011']
+['K01703+K01704']
+['K00052']
+++++++++++++++++++
+M00570
+(K17989,K01754) K01652+(K01653,K11258) K00053 K01687 K00826
+['(K17989,K01754)', 'K01652+(K01653,K11258)', 'K00053', 'K01687', 'K00826']
+==
+['K17989', 'K01754']
+['K11258', 'K01652+K01653']
+['K00053']
+['K01687']
+['K00826']
+++++++++++++++++++
+M00432
+K01649 (K01702,K01703+K01704) K00052
+['K01649', '(K01702,K01703+K01704)', 'K00052']
+==
+['K01649']
+['K01702', 'K01703+K01704']
+['K00052']
+++++++++++++++++++
+M00036
+K00826 ((K00166+K00167,K11381)+K09699+K00382) (K00253,K00249) (K01968+K01969) (K05607,K13766) K01640
+['K00826', '((K00166+K00167,K11381)+K09699+K00382)', '(K00253,K00249)', '(K01968+K01969)', '(K05607,K13766)', 'K01640']
+==
+['K00826']
+['K00166+K00167,K11381+K09699+K00382']
+['K00253', 'K00249']
+['K01968+K01969']
+['K05607', 'K13766']
+['K01640']
+++++++++++++++++++
+M00016
+(K00928,K12524,K12525,K12526) K00133 K01714 K00215 K00674 (K00821,K14267) K01439 K01778 (K01586,K12526)
+['(K00928,K12524,K12525,K12526)', 'K00133', 'K01714', 'K00215', 'K00674', '(K00821,K14267)', 'K01439', 'K01778', '(K01586,K12526)']
+==
+['K00928', 'K12524', 'K12525', 'K12526']
+['K00133']
+['K01714']
+['K00215']
+['K00674']
+['K00821', 'K14267']
+['K01439']
+['K01778']
+['K01586', 'K12526']
+++++++++++++++++++
+M00525
+K00928 K00133 K01714 K00215 K05822 K00841 K05823 K01778 K01586
+['K00928', 'K00133', 'K01714', 'K00215', 'K05822', 'K00841', 'K05823', 'K01778', 'K01586']
+==
+['K00928']
+['K00133']
+['K01714']
+['K00215']
+['K05822']
+['K00841']
+['K05823']
+['K01778']
+['K01586']
+++++++++++++++++++
+M00526
+(K00928,K12524,K12525,K12526) K00133 K01714 K00215 K03340 (K01586,K12526)
+['(K00928,K12524,K12525,K12526)', 'K00133', 'K01714', 'K00215', 'K03340', '(K01586,K12526)']
+==
+['K00928', 'K12524', 'K12525', 'K12526']
+['K00133']
+['K01714']
+['K00215']
+['K03340']
+['K01586', 'K12526']
+++++++++++++++++++
+M00527
+(K00928,K12524,K12525,K12526) K00133 K01714 K00215 K10206 K01778 (K01586,K12526)
+['(K00928,K12524,K12525,K12526)', 'K00133', 'K01714', 'K00215', 'K10206', 'K01778', '(K01586,K12526)']
+==
+['K00928', 'K12524', 'K12525', 'K12526']
+['K00133']
+['K01714']
+['K00215']
+['K10206']
+['K01778']
+['K01586', 'K12526']
+++++++++++++++++++
+M00030
+K01655 K17450 K01705 K05824 K00838 K00143 (K00293,K24034) K00290
+['K01655', 'K17450', 'K01705', 'K05824', 'K00838', 'K00143', '(K00293,K24034)', 'K00290']
+==
+['K01655']
+['K17450']
+['K01705']
+['K05824']
+['K00838']
+['K00143']
+['K00293', 'K24034']
+['K00290']
+++++++++++++++++++
+M00433
+K01655 (K17450 K01705,K16792+K16793) K05824
+['K01655', '(K17450_K01705,K16792+K16793)', 'K05824']
+==
+['K01655']
+['K17450_K01705', 'K16792+K16793']
+['K05824']
+++++++++++++++++++
+M00032
+K14157 K14085 K00825 (K15791+K00658+K00382) K00252 (K07514,(K07515,K07511) K00022)
+['K14157', 'K14085', 'K00825', '(K15791+K00658+K00382)', 'K00252', '(K07514,(K07515,K07511)_K00022)']
+==
+['K14157']
+['K14085']
+['K00825']
+['K15791+K00658+K00382']
+['K00252']
+['K07514', 'K07515,K07511_K00022']
+++++++++++++++++++
+M00028
+(K00618,K00619,K14681,K14682,K00620,K22477,K22478) ((K00930,K22478) K00145,K12659) (K00818,K00821) (K01438,K14677,K00620)
+['(K00618,K00619,K14681,K14682,K00620,K22477,K22478)', '((K00930,K22478)_K00145,K12659)', '(K00818,K00821)', '(K01438,K14677,K00620)']
+==
+['K00618', 'K00619', 'K14681', 'K14682', 'K00620', 'K22477', 'K22478']
+['K12659', 'K00930,K22478_K00145']
+['K00818', 'K00821']
+['K01438', 'K14677', 'K00620']
+++++++++++++++++++
+M00844
+K00611 K01940 (K01755,K14681)
+['K00611', 'K01940', '(K01755,K14681)']
+==
+['K00611']
+['K01940']
+['K01755', 'K14681']
+++++++++++++++++++
+M00845
+K22478 K00145 K00821 K09065 K01438 K01940 K01755
+['K22478', 'K00145', 'K00821', 'K09065', 'K01438', 'K01940', 'K01755']
+==
+['K22478']
+['K00145']
+['K00821']
+['K09065']
+['K01438']
+['K01940']
+['K01755']
+++++++++++++++++++
+M00029
+K01948 K00611 K01940 (K01755,K14681) K01476
+['K01948', 'K00611', 'K01940', '(K01755,K14681)', 'K01476']
+==
+['K01948']
+['K00611']
+['K01940']
+['K01755', 'K14681']
+['K01476']
+++++++++++++++++++
+M00015
+((K00931 K00147),K12657) K00286
+['((K00931_K00147),K12657)', 'K00286']
+==
+['K12657', 'K00931_K00147']
+['K00286']
+++++++++++++++++++
+M00047
+K00613 K00542 K00933
+['K00613', 'K00542', 'K00933']
+==
+['K00613']
+['K00542']
+['K00933']
+++++++++++++++++++
+M00879
+K00673 K01484 K00840 K06447 K05526
+['K00673', 'K01484', 'K00840', 'K06447', 'K05526']
+==
+['K00673']
+['K01484']
+['K00840']
+['K06447']
+['K05526']
+++++++++++++++++++
+M00134
+K01476 K01581
+['K01476', 'K01581']
+==
+['K01476']
+['K01581']
+++++++++++++++++++
+M00135
+K00657 K00274 (K00128,K14085,K00149) --
+['K00657', 'K00274', '(K00128,K14085,K00149)']
+==
+['K00657']
+['K00274']
+['K00128', 'K14085', 'K00149']
+++++++++++++++++++
+M00136
+K09470 K09471 K09472 K09473
+['K09470', 'K09471', 'K09472', 'K09473']
+==
+['K09470']
+['K09471']
+['K09472']
+['K09473']
+++++++++++++++++++
+M00026
+(K00765-K02502) (K01523 K01496,K11755,K14152) (K01814,K24017) (K02501+K02500,K01663) ((K01693 K00817 (K04486,K05602,K18649)),(K01089 K00817)) (K00013,K14152)
+['(K00765-K02502)', '(K01523_K01496,K11755,K14152)', '(K01814,K24017)', '(K02501+K02500,K01663)', '((K01693_K00817_(K04486,K05602,K18649)),(K01089_K00817))', '(K00013,K14152)']
+==
+['K00765-K02502']
+['K11755', 'K14152', 'K01523_K01496']
+['K01814', 'K24017']
+['K01663', 'K02501+K02500']
+['K01089_K00817', 'K01693_K00817_K04486,K05602,K18649']
+['K00013', 'K14152']
+++++++++++++++++++
+M00045
+K01745 K01712 K01468 (K01479,K00603,K13990,(K05603 K01458))
+['K01745', 'K01712', 'K01468', '(K01479,K00603,K13990,(K05603_K01458))']
+==
+['K01745']
+['K01712']
+['K01468']
+['K01479', 'K00603', 'K13990', 'K05603_K01458']
+++++++++++++++++++
+M00022
+(K01626,K03856,K13853) (((K01735,K13829) ((K03785,K03786) K00014,K13832)),K13830) ((K00891,K13829) (K00800,K24018),K13830) K01736
+['(K01626,K03856,K13853)', '(((K01735,K13829)_((K03785,K03786)_K00014,K13832)),K13830)', '((K00891,K13829)_(K00800,K24018),K13830)', 'K01736']
+==
+['K01626', 'K03856', 'K13853']
+['K13830', 'K01735,K13829_K03785,K03786_K00014,K13832']
+['K13830', 'K00891,K13829_K00800,K24018']
+['K01736']
+++++++++++++++++++
+M00023
+(((K01657+K01658,K13503,K13501,K01656) K00766),K13497) (((K01817,K24017) (K01656,K01609)),K13498,K13501) (K01695+(K01696,K06001),K01694)
+['(((K01657+K01658,K13503,K13501,K01656)_K00766),K13497)', '(((K01817,K24017)_(K01656,K01609)),K13498,K13501)', '(K01695+(K01696,K06001),K01694)']
+==
+['K13497', 'K01657+K01658,K13503,K13501,K01656_K00766']
+['K13498', 'K13501', 'K01817,K24017_K01656,K01609']
+['K01694', 'K01695+K01696,K06001']
+++++++++++++++++++
+M00024
+((K01850,K04092,K14187,K04093,K04516,K06208,K06209,K13853) (K01713,K04518,K05359),K14170) (K00832,K00838)
+['((K01850,K04092,K14187,K04093,K04516,K06208,K06209,K13853)_(K01713,K04518,K05359),K14170)', '(K00832,K00838)']
+==
+['K14170', 'K01850,K04092,K14187,K04093,K04516,K06208,K06209,K13853_K01713,K04518,K05359']
+['K00832', 'K00838']
+++++++++++++++++++
+M00025
+(((K01850,K04092,K14170,K04093,K04516,K06208,K06209,K13853) K04517),K14187) (K00815,K00832,K00838)
+['(((K01850,K04092,K14170,K04093,K04516,K06208,K06209,K13853)_K04517),K14187)', '(K00815,K00832,K00838)']
+==
+['K14187', 'K01850,K04092,K14170,K04093,K04516,K06208,K06209,K13853_K04517']
+['K00815', 'K00832', 'K00838']
+++++++++++++++++++
+M00040
+(K00832,K15849) (K00220,K24018,K15226,K15227)
+['(K00832,K15849)', '(K00220,K24018,K15226,K15227)']
+==
+['K00832', 'K15849']
+['K00220', 'K24018', 'K15226', 'K15227']
+++++++++++++++++++
+M00042
+(K00505,K00501) (K01592,K01593) K00503 K00553
+['(K00505,K00501)', '(K01592,K01593)', 'K00503', 'K00553']
+==
+['K00505', 'K00501']
+['K01592', 'K01593']
+['K00503']
+['K00553']
+++++++++++++++++++
+M00043
+K00431
+['K00431']
+==
+['K00431']
+++++++++++++++++++
+M00044
+(K00815,K00838,K03334) K00457 K00451 K01800 (K01555,K16171)
+['(K00815,K00838,K03334)', 'K00457', 'K00451', 'K01800', '(K01555,K16171)']
+==
+['K00815', 'K00838', 'K03334']
+['K00457']
+['K00451']
+['K01800']
+['K01555', 'K16171']
+++++++++++++++++++
+M00533
+K00455 K00151 K01826 K05921
+['K00455', 'K00151', 'K01826', 'K05921']
+==
+['K00455']
+['K00151']
+['K01826']
+['K05921']
+++++++++++++++++++
+M00545
+(((K05708+K05709+K05710+K00529) K05711),K05712) K05713 K05714 K02554 K01666 K04073
+['(((K05708+K05709+K05710+K00529)_K05711),K05712)', 'K05713', 'K05714', 'K02554', 'K01666', 'K04073']
+==
+['K05712', 'K05708+K05709+K05710+K00529_K05711']
+['K05713']
+['K05714']
+['K02554']
+['K01666']
+['K04073']
+++++++++++++++++++
+M00037
+K00502 K01593 K00669 K00543
+['K00502', 'K01593', 'K00669', 'K00543']
+==
+['K00502']
+['K01593']
+['K00669']
+['K00543']
+++++++++++++++++++
+M00038
+(K00453,K00463) (K01432,K14263,K07130) K00486 K01556 K00452 K03392 (K10217,K23234)
+['(K00453,K00463)', '(K01432,K14263,K07130)', 'K00486', 'K01556', 'K00452', 'K03392', '(K10217,K23234)']
+==
+['K00453', 'K00463']
+['K01432', 'K14263', 'K07130']
+['K00486']
+['K01556']
+['K00452']
+['K03392']
+['K10217', 'K23234']
+++++++++++++++++++
+M00027
+K01580 (K13524,K07250,K00823,K16871) (K00135,K00139,K17761)
+['K01580', '(K13524,K07250,K00823,K16871)', '(K00135,K00139,K17761)']
+==
+['K01580']
+['K13524', 'K07250', 'K00823', 'K16871']
+['K00135', 'K00139', 'K17761']
+++++++++++++++++++
+M00369
+K13027 K13029 K13030
+['K13027', 'K13029', 'K13030']
+==
+['K13027']
+['K13029']
+['K13030']
+++++++++++++++++++
+M00118
+(K11204+K11205,K01919) (K21456,K01920)
+['(K11204+K11205,K01919)', '(K21456,K01920)']
+==
+['K01919', 'K11204+K11205']
+['K21456', 'K01920']
+++++++++++++++++++
+M00055
+K01001 (K07432+K07441) K03842 K03843 K03844 K03845 K03846 K03847 K03846 K00729 K03848 K03849 K03850
+['K01001', '(K07432+K07441)', 'K03842', 'K03843', 'K03844', 'K03845', 'K03846', 'K03847', 'K03846', 'K00729', 'K03848', 'K03849', 'K03850']
+==
+['K01001']
+['K07432+K07441']
+['K03842']
+['K03843']
+['K03844']
+['K03845']
+['K03846']
+['K03847']
+['K03846']
+['K00729']
+['K03848']
+['K03849']
+['K03850']
+++++++++++++++++++
+M00072
+K07151+K12666+K12667+K12668+K12669+K12670-K00730-K12691
+['K07151+K12666+K12667+K12668+K12669+K12670-K00730-K12691']
+==
+['K07151+K12666+K12667+K12668+K12669+K12670-K00730-K12691']
+++++++++++++++++++
+M00073
+K01228 K05546 K23741 K01230
+['K01228', 'K05546', 'K23741', 'K01230']
+==
+['K01228']
+['K05546']
+['K23741']
+['K01230']
+++++++++++++++++++
+M00074
+K05546 K23741 K01230 K05528 K05529+K05530 K05529+K05531+K05532+K05533+K05534 K05535
+['K05546', 'K23741', 'K01230', 'K05528', 'K05529+K05530', 'K05529+K05531+K05532+K05533+K05534', 'K05535']
+==
+['K05546']
+['K23741']
+['K01230']
+['K05528']
+['K05529+K05530']
+['K05529+K05531+K05532+K05533+K05534']
+['K05535']
+++++++++++++++++++
+M00056
+K00710 (K00731,K09653) (K00727,K09662,K09663) K00739
+['K00710', '(K00731,K09653)', '(K00727,K09662,K09663)', 'K00739']
+==
+['K00710']
+['K00731', 'K09653']
+['K00727', 'K09662', 'K09663']
+['K00739']
+++++++++++++++++++
+M00065
+(K03857+K03859+K03858+K03861+K03860+(K11001,K11002)-K09658) K03434 K05283 (K05284+K07541) K07542 K05285 K05286 (K05288+K05287)
+['(K03857+K03859+K03858+K03861+K03860+(K11001,K11002)-K09658)', 'K03434', 'K05283', '(K05284+K07541)', 'K07542', 'K05285', 'K05286', '(K05288+K05287)']
+==
+['K03857+K03859+K03858+K03861+K03860+K11001,K11002-K09658']
+['K03434']
+['K05283']
+['K05284+K07541']
+['K07542']
+['K05285']
+['K05286']
+['K05288+K05287']
+++++++++++++++++++
+M00070
+K03766 (K07819,K07820,K03877)
+['K03766', '(K07819,K07820,K03877)']
+==
+['K03766']
+['K07819', 'K07820', 'K03877']
+++++++++++++++++++
+M00071
+K03766 (K07966,K07967,K07968,K07969)
+['K03766', '(K07966,K07967,K07968,K07969)']
+==
+['K03766']
+['K07966', 'K07967', 'K07968', 'K07969']
+++++++++++++++++++
+M00068
+K01988 K00719
+['K01988', 'K00719']
+==
+['K01988']
+['K00719']
+++++++++++++++++++
+M00069
+K03370 (K03371,K03369)
+['K03370', '(K03371,K03369)']
+==
+['K03370']
+['K03371', 'K03369']
+++++++++++++++++++
+M00057
+K00771 K00733 K00734 K10158
+['K00771', 'K00733', 'K00734', 'K10158']
+==
+['K00771']
+['K00733']
+['K00734']
+['K10158']
+++++++++++++++++++
+M00058
+K00746 (K13499,K00747,K03419)
+['K00746', '(K13499,K00747,K03419)']
+==
+['K00746']
+['K13499', 'K00747', 'K03419']
+++++++++++++++++++
+M00059
+(K02369,K02370) (K02366,K02367) (K02368,K02370) (K02576,K02577,K02578,K02579) K01793
+['(K02369,K02370)', '(K02366,K02367)', '(K02368,K02370)', '(K02576,K02577,K02578,K02579)', 'K01793']
+==
+['K02369', 'K02370']
+['K02366', 'K02367']
+['K02368', 'K02370']
+['K02576', 'K02577', 'K02578', 'K02579']
+['K01793']
+++++++++++++++++++
+M00076
+-- K01136 K01217 K01135 K01197 K01195
+['K01136', 'K01217', 'K01135', 'K01197', 'K01195']
+==
+['K01136']
+['K01217']
+['K01135']
+['K01197']
+['K01195']
+++++++++++++++++++
+M00077
+K01135 K01197 K01195 K01132
+['K01135', 'K01197', 'K01195', 'K01132']
+==
+['K01135']
+['K01197']
+['K01195']
+['K01132']
+++++++++++++++++++
+M00078
+(K07964,K07965) K01136 K01217 K01565 K10532 K01205 -- K01195 K01137
+['(K07964,K07965)', 'K01136', 'K01217', 'K01565', 'K10532', 'K01205', 'K01195', 'K01137']
+==
+['K07964', 'K07965']
+['K01136']
+['K01217']
+['K01565']
+['K10532']
+['K01205']
+['K01195']
+['K01137']
+++++++++++++++++++
+M00079
+-- K01132 K12309 K01137 K12373
+['K01132', 'K12309', 'K01137', 'K12373']
+==
+['K01132']
+['K12309']
+['K01137']
+['K12373']
+++++++++++++++++++
+M00060
+K00677 K02535 K02536 K03269 K00748 K00912 K02527 K02517 K02560
+['K00677', 'K02535', 'K02536', 'K03269', 'K00748', 'K00912', 'K02527', 'K02517', 'K02560']
+==
+['K00677']
+['K02535']
+['K02536']
+['K03269']
+['K00748']
+['K00912']
+['K02527']
+['K02517']
+['K02560']
+++++++++++++++++++
+M00866
+K00677 (K02535,K16363) K02536 K03269 K00748 K00912 K02527 K02517 K09778
+['K00677', '(K02535,K16363)', 'K02536', 'K03269', 'K00748', 'K00912', 'K02527', 'K02517', 'K09778']
+==
+['K00677']
+['K02535', 'K16363']
+['K02536']
+['K03269']
+['K00748']
+['K00912']
+['K02527']
+['K02517']
+['K09778']
+++++++++++++++++++
+M00867
+K12977 K03760 K23082+K23083 K23159 K09953
+['K12977', 'K03760', 'K23082+K23083', 'K23159', 'K09953']
+==
+['K12977']
+['K03760']
+['K23082+K23083']
+['K23159']
+['K09953']
+++++++++++++++++++
+M00063
+K06041 K01627 K03270 K00979
+['K06041', 'K01627', 'K03270', 'K00979']
+==
+['K06041']
+['K01627']
+['K03270']
+['K00979']
+++++++++++++++++++
+M00064
+K03271 (K03272,K21344) K03273 (K03272,K21345) K03274
+['K03271', '(K03272,K21344)', 'K03273', '(K03272,K21345)', 'K03274']
+==
+['K03271']
+['K03272', 'K21344']
+['K03273']
+['K03272', 'K21345']
+['K03274']
+++++++++++++++++++
+M00127
+K03147 (K00877,K00941,K14153)(K00878,K14154)(K00788,K14153,K14154) K00946
+['K03147', '(K00877,K00941,K14153)(K00878,K14154)(K00788,K14153,K14154)', 'K00946']
+==
+['K03147']
+['K00877', 'K00941', 'K14153', 'K14154', 'K14153K00878', 'K14154K00788']
+['K00946']
+++++++++++++++++++
+M00124
+K03472 K03473 K00831 K00097 K03474 (K00275,K23998)
+['K03472', 'K03473', 'K00831', 'K00097', 'K03474', '(K00275,K23998)']
+==
+['K03472']
+['K03473']
+['K00831']
+['K00097']
+['K03474']
+['K00275', 'K23998']
+++++++++++++++++++
+M00115
+K00278 K03517 K00767 (K00969,K06210) (K01916,K01950)
+['K00278', 'K03517', 'K00767', '(K00969,K06210)', '(K01916,K01950)']
+==
+['K00278']
+['K03517']
+['K00767']
+['K00969', 'K06210']
+['K01916', 'K01950']
+++++++++++++++++++
+M00810
+K19818+K19819+K19820 (K19826,K19890) K19185+K19186+K19187 K19188 -K20155
+['K19818+K19819+K19820', '(K19826,K19890)', 'K19185+K19186+K19187', 'K19188', '-K20155']
+==
+['K19818+K19819+K19820']
+['K19826', 'K19890']
+['K19185+K19186+K19187']
+['K19188']
+['-K20155']
+++++++++++++++++++
+M00811
+K20170,K20169 (K20170,(K20158 K19700)) K20171-K20172 K15359,K18276
+['K20170,K20169', '(K20170,(K20158_K19700))', 'K20171-K20172', 'K15359,K18276']
+==
+['K20170', 'K20169']
+['K20170', 'K20158_K19700']
+['K20171-K20172']
+['K15359', 'K18276']
+++++++++++++++++++
+M00622
+K18029+K18030 K14974 K18028 K15357 K13995 K01799
+['K18029+K18030', 'K14974', 'K18028', 'K15357', 'K13995', 'K01799']
+==
+['K18029+K18030']
+['K14974']
+['K18028']
+['K15357']
+['K13995']
+['K01799']
+++++++++++++++++++
+M00120
+(K00867,K03525,K09680,K01947) ((K01922,K21977) K01598,K13038) (K02318,(K00954,K02201) K00859)
+['(K00867,K03525,K09680,K01947)', '((K01922,K21977)_K01598,K13038)', '(K02318,(K00954,K02201)_K00859)']
+==
+['K00867', 'K03525', 'K09680', 'K01947']
+['K13038', 'K01922,K21977_K01598']
+['K02318', 'K00954,K02201_K00859']
+++++++++++++++++++
+M00572
+K02169 (K00647,K09458) K00059 K02372 K00208 (K02170,K09789,K19560,K19561)
+['K02169', '(K00647,K09458)', 'K00059', 'K02372', 'K00208', '(K02170,K09789,K19560,K19561)']
+==
+['K02169']
+['K00647', 'K09458']
+['K00059']
+['K02372']
+['K00208']
+['K02170', 'K09789', 'K19560', 'K19561']
+++++++++++++++++++
+M00123
+K00652 ((K00833,K19563) K01935,K19562) K01012
+['K00652', '((K00833,K19563)_K01935,K19562)', 'K01012']
+==
+['K00652']
+['K19562', 'K00833,K19563_K01935']
+['K01012']
+++++++++++++++++++
+M00573
+K16593 K00652 K19563 K01935 K01012
+['K16593', 'K00652', 'K19563', 'K01935', 'K01012']
+==
+['K16593']
+['K00652']
+['K19563']
+['K01935']
+['K01012']
+++++++++++++++++++
+M00577
+K01906 K00652 (K00833,K19563) K01935 K01012
+['K01906', 'K00652', '(K00833,K19563)', 'K01935', 'K01012']
+==
+['K01906']
+['K00652']
+['K00833', 'K19563']
+['K01935']
+['K01012']
+++++++++++++++++++
+M00126
+(K01495,K09007,K22391) (K01077,K01113,(K08310,K19965)) ((K13939,(K13940,K01633 K00950) K00796),(K01633 K13941)) (K11754,K20457) (K00287,K13998)
+['(K01495,K09007,K22391)', '(K01077,K01113,(K08310,K19965))', '((K13939,(K13940,K01633_K00950)_K00796),(K01633_K13941))', '(K11754,K20457)', '(K00287,K13998)']
+==
+['K01495', 'K09007', 'K22391']
+['K01077', 'K01113', 'K08310,K19965']
+['K01633_K13941', 'K13939,K13940,K01633_K00950_K00796']
+['K11754', 'K20457']
+['K00287', 'K13998']
+++++++++++++++++++
+M00840
+K14652 K22100 -- K01633 K13941 K22099 K00287
+['K14652', 'K22100', 'K01633', 'K13941', 'K22099', 'K00287']
+==
+['K14652']
+['K22100']
+['K01633']
+['K13941']
+['K22099']
+['K00287']
+++++++++++++++++++
+M00841
+K01495 K22101 K00950 K00796 K11754 K13998
+['K01495', 'K22101', 'K00950', 'K00796', 'K11754', 'K13998']
+==
+['K01495']
+['K22101']
+['K00950']
+['K00796']
+['K11754']
+['K13998']
+++++++++++++++++++
+M00842
+K01495 K01737 K00072
+['K01495', 'K01737', 'K00072']
+==
+['K01495']
+['K01737']
+['K00072']
+++++++++++++++++++
+M00843
+K01495 K01737 K17745
+['K01495', 'K01737', 'K17745']
+==
+['K01495']
+['K01737']
+['K17745']
+++++++++++++++++++
+M00880
+((K03639 K03637),K20967) (K03635,K21142) (((K03831,K03638) K03750),K15376)
+['((K03639_K03637),K20967)', '(K03635,K21142)', '(((K03831,K03638)_K03750),K15376)']
+==
+['K20967', 'K03639_K03637']
+['K03635', 'K21142']
+['K15376', 'K03831,K03638_K03750']
+++++++++++++++++++
+M00140
+K00600 (K01491,(K00300 K01500)) K01938
+['K00600', '(K01491,(K00300_K01500))', 'K01938']
+==
+['K00600']
+['K01491', 'K00300_K01500']
+['K01938']
+++++++++++++++++++
+M00141
+K00600 (K00288,(K13403 K13402))
+['K00600', '(K00288,(K13403_K13402))']
+==
+['K00600']
+['K00288', 'K13403_K13402']
+++++++++++++++++++
+M00868
+K00643 K01698 K01749 K01719 K01599 K00228 K00231 K01772
+['K00643', 'K01698', 'K01749', 'K01719', 'K01599', 'K00228', 'K00231', 'K01772']
+==
+['K00643']
+['K01698']
+['K01749']
+['K01719']
+['K01599']
+['K00228']
+['K00231']
+['K01772']
+++++++++++++++++++
+M00121
+(K01885,K14163) K02492 K01845 K01698 K01749 (K01719,K13542,K13543) K01599 (K00228,K02495) (K00230,K00231) K01772
+['(K01885,K14163)', 'K02492', 'K01845', 'K01698', 'K01749', '(K01719,K13542,K13543)', 'K01599', '(K00228,K02495)', '(K00230,K00231)', 'K01772']
+==
+['K01885', 'K14163']
+['K02492']
+['K01845']
+['K01698']
+['K01749']
+['K01719', 'K13542', 'K13543']
+['K01599']
+['K00228', 'K02495']
+['K00230', 'K00231']
+['K01772']
+++++++++++++++++++
+M00846
+(K01885,K14163) K02492 K01845 K01698 K01749 (K01719,K13542,K13543) (K02302,(K00589,K02303,K02496,K13542,K13543)+K02304-K03794)
+['(K01885,K14163)', 'K02492', 'K01845', 'K01698', 'K01749', '(K01719,K13542,K13543)', '(K02302,(K00589,K02303,K02496,K13542,K13543)+K02304-K03794)']
+==
+['K01885', 'K14163']
+['K02492']
+['K01845']
+['K01698']
+['K01749']
+['K01719', 'K13542', 'K13543']
+['K02302', 'K00589,K02303,K02496,K13542,K13543+K02304-K03794']
+++++++++++++++++++
+M00847
+K22225 K22226 K22227
+['K22225', 'K22226', 'K22227']
+==
+['K22225']
+['K22226']
+['K22227']
+++++++++++++++++++
+M00836
+K22011 K22012 (K21610+K21611) K21612
+['K22011', 'K22012', '(K21610+K21611)', 'K21612']
+==
+['K22011']
+['K22012']
+['K21610+K21611']
+['K21612']
+++++++++++++++++++
+M00117
+(K03181,K18240) K03179 (K03182+K03186) K18800 K00568 K03185 K03183 K03184 K00568
+['(K03181,K18240)', 'K03179', '(K03182+K03186)', 'K18800', 'K00568', 'K03185', 'K03183', 'K03184', 'K00568']
+==
+['K03181', 'K18240']
+['K03179']
+['K03182+K03186']
+['K18800']
+['K00568']
+['K03185']
+['K03183']
+['K03184']
+['K00568']
+++++++++++++++++++
+M00128
+K06125 K00591 K06126 K06127 K06134 K00591
+['K06125', 'K00591', 'K06126', 'K06127', 'K06134', 'K00591']
+==
+['K06125']
+['K00591']
+['K06126']
+['K06127']
+['K06134']
+['K00591']
+++++++++++++++++++
+M00116
+K02552 K02551 K08680 K02549 K01911 K01661 K19222 K02548 K03183
+['K02552', 'K02551', 'K08680', 'K02549', 'K01911', 'K01661', 'K19222', 'K02548', 'K03183']
+==
+['K02552']
+['K02551']
+['K08680']
+['K02549']
+['K01911']
+['K01661']
+['K19222']
+['K02548']
+['K03183']
+++++++++++++++++++
+M00112
+K09833 (K12502,K18534) K09834 K05928
+['K09833', '(K12502,K18534)', 'K09834', 'K05928']
+==
+['K09833']
+['K12502', 'K18534']
+['K09834']
+['K05928']
+++++++++++++++++++
+M00095
+K00626 K01641 K00021 K00869 (K00938,K13273) K01597 K01823
+['K00626', 'K01641', 'K00021', 'K00869', '(K00938,K13273)', 'K01597', 'K01823']
+==
+['K00626']
+['K01641']
+['K00021']
+['K00869']
+['K00938', 'K13273']
+['K01597']
+['K01823']
+++++++++++++++++++
+M00849
+K00626 K01641 (K00021,K00054) ((K00869 K17942),(K18689 K18690 K22813)) K06981 K01823
+['K00626', 'K01641', '(K00021,K00054)', '((K00869_K17942),(K18689_K18690_K22813))', 'K06981', 'K01823']
+==
+['K00626']
+['K01641']
+['K00021', 'K00054']
+['K00869_K17942', 'K18689_K18690_K22813']
+['K06981']
+['K01823']
+++++++++++++++++++
+M00096
+K01662 K00099 (K00991,K12506) K00919 (K01770,K12506) K03526 K03527 K01823
+['K01662', 'K00099', '(K00991,K12506)', 'K00919', '(K01770,K12506)', 'K03526', 'K03527', 'K01823']
+==
+['K01662']
+['K00099']
+['K00991', 'K12506']
+['K00919']
+['K01770', 'K12506']
+['K03526']
+['K03527']
+['K01823']
+++++++++++++++++++
+M00364
+K01823 (K00795,K13789,K13787)
+['K01823', '(K00795,K13789,K13787)']
+==
+['K01823']
+['K00795', 'K13789', 'K13787']
+++++++++++++++++++
+M00365
+K01823 K13787
+['K01823', 'K13787']
+==
+['K01823']
+['K13787']
+++++++++++++++++++
+M00366
+K01823 K14066 K00787 K13789
+['K01823', 'K14066', 'K00787', 'K13789']
+==
+['K01823']
+['K14066']
+['K00787']
+['K13789']
+++++++++++++++++++
+M00367
+K01823 K00787 K00804
+['K01823', 'K00787', 'K00804']
+==
+['K01823']
+['K00787']
+['K00804']
+++++++++++++++++++
+M00097
+K02291 K02293 K15744 K00514 K09835 K06443
+['K02291', 'K02293', 'K15744', 'K00514', 'K09835', 'K06443']
+==
+['K02291']
+['K02293']
+['K15744']
+['K00514']
+['K09835']
+['K06443']
+++++++++++++++++++
+M00372
+(K15746,K15747) K09838 -K14594 K09840 K09841 K09842
+['(K15746,K15747)', 'K09838', '-K14594', 'K09840', 'K09841', 'K09842']
+==
+['K15746', 'K15747']
+['K09838']
+['-K14594']
+['K09840']
+['K09841']
+['K09842']
+++++++++++++++++++
+M00371
+(K09587,K12639) K09588 K09591 (K12637,K12638) K20623 (K09590,K12640)
+['(K09587,K12639)', 'K09588', 'K09591', '(K12637,K12638)', 'K20623', '(K09590,K12640)']
+==
+['K09587', 'K12639']
+['K09588']
+['K09591']
+['K12637', 'K12638']
+['K20623']
+['K09590', 'K12640']
+++++++++++++++++++
+M00773
+K15988 K15989+K15990 K15992 K15991 K15993 K15994 K15995 K15996
+['K15988', 'K15989+K15990', 'K15992', 'K15991', 'K15993', 'K15994', 'K15995', 'K15996']
+==
+['K15988']
+['K15989+K15990']
+['K15992']
+['K15991']
+['K15993']
+['K15994']
+['K15995']
+['K15996']
+++++++++++++++++++
+M00774
+K10817 K14366 K14367 K14368+K15997 K14370 K14369
+['K10817', 'K14366', 'K14367', 'K14368+K15997', 'K14370', 'K14369']
+==
+['K10817']
+['K14366']
+['K14367']
+['K14368+K15997']
+['K14370']
+['K14369']
+++++++++++++++++++
+M00775
+K16007 K16008 K16009 K13320 K16010
+['K16007', 'K16008', 'K16009', 'K13320', 'K16010']
+==
+['K16007']
+['K16008']
+['K16009']
+['K13320']
+['K16010']
+++++++++++++++++++
+M00776
+K16000+K16001+K16002-K16003 K16004 K16005 K16006
+['K16000+K16001+K16002-K16003', 'K16004', 'K16005', 'K16006']
+==
+['K16000+K16001+K16002-K16003']
+['K16004']
+['K16005']
+['K16006']
+++++++++++++++++++
+M00777
+K14371 K14372 K14373 K14374 K14375
+['K14371', 'K14372', 'K14373', 'K14374', 'K14375']
+==
+['K14371']
+['K14372']
+['K14373']
+['K14374']
+['K14375']
+++++++++++++++++++
+M00824
+K15314 K21160+K21161+K21162+K21163+K21164+K21165+K21166+K21167
+['K15314', 'K21160+K21161+K21162+K21163+K21164+K21165+K21166+K21167']
+==
+['K15314']
+['K21160+K21161+K21162+K21163+K21164+K21165+K21166+K21167']
+++++++++++++++++++
+M00825
+K15314 K21168+K21169+K21170+K21171+K21172+K21173+K21174
+['K15314', 'K21168+K21169+K21170+K21171+K21172+K21173+K21174']
+==
+['K15314']
+['K21168+K21169+K21170+K21171+K21172+K21173+K21174']
+++++++++++++++++++
+M00826
+K20159+K21175 K20156 K21176 K21177 K21178 K21179
+['K20159+K21175', 'K20156', 'K21176', 'K21177', 'K21178', 'K21179']
+==
+['K20159+K21175']
+['K20156']
+['K21176']
+['K21177']
+['K21178']
+['K21179']
+++++++++++++++++++
+M00829
+K15320 K21191 K21192
+['K15320', 'K21191', 'K21192']
+==
+['K15320']
+['K21191']
+['K21192']
+++++++++++++++++++
+M00830
+K20422 K20420 K20421 K20423
+['K20422', 'K20420', 'K20421', 'K20423']
+==
+['K20422']
+['K20420']
+['K20421']
+['K20423']
+++++++++++++++++++
+M00831
+K21221 K21222 K21223 K21224 K21225
+['K21221', 'K21222', 'K21223', 'K21224', 'K21225']
+==
+['K21221']
+['K21222']
+['K21223']
+['K21224']
+['K21225']
+++++++++++++++++++
+M00834
+K21254 K21255 K21256 K21257 K21258
+['K21254', 'K21255', 'K21256', 'K21257', 'K21258']
+==
+['K21254']
+['K21255']
+['K21256']
+['K21257']
+['K21258']
+++++++++++++++++++
+M00778
+K05551+K05552+K05553 -K12420 ((K05554,K14249,K15884,K15885) (K05555,K14250),K15886)
+['K05551+K05552+K05553', '-K12420', '((K05554,K14249,K15884,K15885)_(K05555,K14250),K15886)']
+==
+['K05551+K05552+K05553']
+['-K12420']
+['K15886', 'K05554,K14249,K15884,K15885_K05555,K14250']
+++++++++++++++++++
+M00779
+K05556 (K14626,K14627) (K14628,K14629) (K14630+K14631,K14632)
+['K05556', '(K14626,K14627)', '(K14628,K14629)', '(K14630+K14631,K14632)']
+==
+['K05556']
+['K14626', 'K14627']
+['K14628', 'K14629']
+['K14632', 'K14630+K14631']
+++++++++++++++++++
+M00780
+K14251 K14252 K14253 K14254 K14255 K14256 K21301
+['K14251', 'K14252', 'K14253', 'K14254', 'K14255', 'K14256', 'K21301']
+==
+['K14251']
+['K14252']
+['K14253']
+['K14254']
+['K14255']
+['K14256']
+['K21301']
+++++++++++++++++++
+M00823
+K14251 K14252 K14253 K14254 K14255 K14256 K21301 K14257+K21297
+['K14251', 'K14252', 'K14253', 'K14254', 'K14255', 'K14256', 'K21301', 'K14257+K21297']
+==
+['K14251']
+['K14252']
+['K14253']
+['K14254']
+['K14255']
+['K14256']
+['K21301']
+['K14257+K21297']
+++++++++++++++++++
+M00781
+K15941 K15942 K15943 K15944
+['K15941', 'K15942', 'K15943', 'K15944']
+==
+['K15941']
+['K15942']
+['K15943']
+['K15944']
+++++++++++++++++++
+M00782
+K15959 K15960 K15961 K15963 K15964 K15965 K15966 K15967
+['K15959', 'K15960', 'K15961', 'K15963', 'K15964', 'K15965', 'K15966', 'K15967']
+==
+['K15959']
+['K15960']
+['K15961']
+['K15963']
+['K15964']
+['K15965']
+['K15966']
+['K15967']
+++++++++++++++++++
+M00783
+K15968 K15969 K15886 -K15970 K15971 K15972
+['K15968', 'K15969', 'K15886', '-K15970', 'K15971', 'K15972']
+==
+['K15968']
+['K15969']
+['K15886']
+['-K15970']
+['K15971']
+['K15972']
+++++++++++++++++++
+M00784
+K19566 K19567 K19568 K19569 K19570
+['K19566', 'K19567', 'K19568', 'K19569', 'K19570']
+==
+['K19566']
+['K19567']
+['K19568']
+['K19569']
+['K19570']
+++++++++++++++++++
+M00793
+K00973 K01710 (K01790 K00067,K23987)
+['K00973', 'K01710', '(K01790_K00067,K23987)']
+==
+['K00973']
+['K01710']
+['K23987', 'K01790_K00067']
+++++++++++++++++++
+M00794
+K13312 K13313
+['K13312', 'K13313']
+==
+['K13312']
+['K13313']
+++++++++++++++++++
+M00795
+K19855 K12710 K17625
+['K19855', 'K12710', 'K17625']
+==
+['K19855']
+['K12710']
+['K17625']
+++++++++++++++++++
+M00796
+K19853 K19854 K13307
+['K19853', 'K19854', 'K13307']
+==
+['K19853']
+['K19854']
+['K13307']
+++++++++++++++++++
+M00797
+K13308 K13309 (K13310,K16436) (K13311,K13326)
+['K13308', 'K13309', '(K13310,K16436)', '(K13311,K13326)']
+==
+['K13308']
+['K13309']
+['K13310', 'K16436']
+['K13311', 'K13326']
+++++++++++++++++++
+M00798
+K16435 K13315 K13317 (K13316,K16438) K13318
+['K16435', 'K13315', 'K13317', '(K13316,K16438)', 'K13318']
+==
+['K16435']
+['K13315']
+['K13317']
+['K13316', 'K16438']
+['K13318']
+++++++++++++++++++
+M00799
+K16435 K13315 K16438 K19856 K19857
+['K16435', 'K13315', 'K16438', 'K19856', 'K19857']
+==
+['K16435']
+['K13315']
+['K16438']
+['K19856']
+['K19857']
+++++++++++++++++++
+M00800
+K16435 K16436 K13326 K16438 K13322
+['K16435', 'K16436', 'K13326', 'K16438', 'K13322']
+==
+['K16435']
+['K16436']
+['K13326']
+['K16438']
+['K13322']
+++++++++++++++++++
+M00801
+K16435 K13327 K19858 K13319
+['K16435', 'K13327', 'K19858', 'K13319']
+==
+['K16435']
+['K13327']
+['K19858']
+['K13319']
+++++++++++++++++++
+M00802
+K16435 K13327 K13328 K13329 K13330
+['K16435', 'K13327', 'K13328', 'K13329', 'K13330']
+==
+['K16435']
+['K13327']
+['K13328']
+['K13329']
+['K13330']
+++++++++++++++++++
+M00803
+K16435 K19859 K16436 K13332
+['K16435', 'K19859', 'K16436', 'K13332']
+==
+['K16435']
+['K19859']
+['K16436']
+['K13332']
+++++++++++++++++++
+M00672
+K12743 K04126 K10852
+['K12743', 'K04126', 'K10852']
+==
+['K12743']
+['K04126']
+['K10852']
+++++++++++++++++++
+M00673
+K12743 K04126 K04127 K12744 K12745 K04128 K18062 K18063
+['K12743', 'K04126', 'K04127', 'K12744', 'K12745', 'K04128', 'K18062', 'K18063']
+==
+['K12743']
+['K04126']
+['K04127']
+['K12744']
+['K12745']
+['K04128']
+['K18062']
+['K18063']
+++++++++++++++++++
+M00675
+K18317 K18316 K18315
+['K18317', 'K18316', 'K18315']
+==
+['K18317']
+['K18316']
+['K18315']
+++++++++++++++++++
+M00736
+K19102+K19103+K05375 K19104 K19105 K19106
+['K19102+K19103+K05375', 'K19104', 'K19105', 'K19106']
+==
+['K19102+K19103+K05375']
+['K19104']
+['K19105']
+['K19106']
+++++++++++++++++++
+M00674
+K12673 K12674 K12675 K12676
+['K12673', 'K12674', 'K12675', 'K12676']
+==
+['K12673']
+['K12674']
+['K12675']
+['K12676']
+++++++++++++++++++
+M00039
+(K10775,K13064) K00487 K01904 K13065 K09754 K00588 K09753 K09755 K13066 (K00083,K22395)
+['(K10775,K13064)', 'K00487', 'K01904', 'K13065', 'K09754', 'K00588', 'K09753', 'K09755', 'K13066', '(K00083,K22395)']
+==
+['K10775', 'K13064']
+['K00487']
+['K01904']
+['K13065']
+['K09754']
+['K00588']
+['K09753']
+['K09755']
+['K13066']
+['K00083', 'K22395']
+++++++++++++++++++
+M00137
+K10775 K00487 K01904 K00660 K01859
+['K10775', 'K00487', 'K01904', 'K00660', 'K01859']
+==
+['K10775']
+['K00487']
+['K01904']
+['K00660']
+['K01859']
+++++++++++++++++++
+M00138
+K00475 K13082 K05277
+['K00475', 'K13082', 'K05277']
+==
+['K00475']
+['K13082']
+['K05277']
+++++++++++++++++++
+M00661
+K18385 K18386 K18387
+['K18385', 'K18386', 'K18387']
+==
+['K18385']
+['K18386']
+['K18387']
+++++++++++++++++++
+M00370
+(K11812,K11813) K11818 K11819 K11820 K11821
+['(K11812,K11813)', 'K11818', 'K11819', 'K11820', 'K11821']
+==
+['K11812', 'K11813']
+['K11818']
+['K11819']
+['K11820']
+['K11821']
+++++++++++++++++++
+M00814
+K19969 K19979 K19974 K20424 K20425 K20426 K20427 K20430 -- --
+['K19969', 'K19979', 'K19974', 'K20424', 'K20425', 'K20426', 'K20427', 'K20430']
+==
+['K19969']
+['K19979']
+['K19974']
+['K20424']
+['K20425']
+['K20426']
+['K20427']
+['K20430']
+++++++++++++++++++
+M00815
+K19969 K20431 K20432 K20433 K20434 K20435 K20436 K20437 K20438
+['K19969', 'K20431', 'K20432', 'K20433', 'K20434', 'K20435', 'K20436', 'K20437', 'K20438']
+==
+['K19969']
+['K20431']
+['K20432']
+['K20433']
+['K20434']
+['K20435']
+['K20436']
+['K20437']
+['K20438']
+++++++++++++++++++
+M00786
+K18281 K14132 K17475 K18280 K17827 K17826 K14134 K17825 K18279
+['K18281', 'K14132', 'K17475', 'K18280', 'K17827', 'K17826', 'K14134', 'K17825', 'K18279']
+==
+['K18281']
+['K14132']
+['K17475']
+['K18280']
+['K17827']
+['K17826']
+['K14134']
+['K17825']
+['K18279']
+++++++++++++++++++
+M00789
+K14266 K19884 K19885 K19886+K19887 K19888 K19889
+['K14266', 'K19884', 'K19885', 'K19886+K19887', 'K19888', 'K19889']
+==
+['K14266']
+['K19884']
+['K19885']
+['K19886+K19887']
+['K19888']
+['K19889']
+++++++++++++++++++
+M00790
+K14266 K19981 K14257 K19982
+['K14266', 'K19981', 'K14257', 'K19982']
+==
+['K14266']
+['K19981']
+['K14257']
+['K19982']
+++++++++++++++++++
+M00805
+K20075 K20076 K20077+K20078 K20079 K20080 K20081 K20082
+['K20075', 'K20076', 'K20077+K20078', 'K20079', 'K20080', 'K20081', 'K20082']
+==
+['K20075']
+['K20076']
+['K20077+K20078']
+['K20079']
+['K20080']
+['K20081']
+['K20082']
+++++++++++++++++++
+M00808
+K20086 K20087+K20088 K20089 K20090
+['K20086', 'K20087+K20088', 'K20089', 'K20090']
+==
+['K20086']
+['K20087+K20088']
+['K20089']
+['K20090']
+++++++++++++++++++
+M00835
+K13063 K20261 K06998 K20260 K20262 K21103 K20940
+['K13063', 'K20261', 'K06998', 'K20260', 'K20262', 'K21103', 'K20940']
+==
+['K13063']
+['K20261']
+['K06998']
+['K20260']
+['K20262']
+['K21103']
+['K20940']
+++++++++++++++++++
+M00877
+K18652 K18653 K18654
+['K18652', 'K18653', 'K18654']
+==
+['K18652']
+['K18653']
+['K18654']
+++++++++++++++++++
+M00787
+K19546 K19547 K19550 K19549 K19548 K13037
+['K19546', 'K19547', 'K19550', 'K19549', 'K19548', 'K13037']
+==
+['K19546']
+['K19547']
+['K19550']
+['K19549']
+['K19548']
+['K13037']
+++++++++++++++++++
+M00848
+K09460 K02078+K14245+K14246+K22798 K22799 K22800 K21272 K21271 K22801 K22802
+['K09460', 'K02078+K14245+K14246+K22798', 'K22799', 'K22800', 'K21272', 'K21271', 'K22801', 'K22802']
+==
+['K09460']
+['K02078+K14245+K14246+K22798']
+['K22799']
+['K22800']
+['K21272']
+['K21271']
+['K22801']
+['K22802']
+++++++++++++++++++
+M00788
+K19835 K19834
+['K19835', 'K19834']
+==
+['K19835']
+['K19834']
+++++++++++++++++++
+M00819
+K12250 K15907 -- K18056 K17747 K18091 K18057 K17476
+['K12250', 'K15907', 'K18056', 'K17747', 'K18091', 'K18057', 'K17476']
+==
+['K12250']
+['K15907']
+['K18056']
+['K17747']
+['K18091']
+['K18057']
+['K17476']
+++++++++++++++++++
+M00876
+K21898 K23446 K23447
+['K21898', 'K23446', 'K23447']
+==
+['K21898']
+['K23446']
+['K23447']
+++++++++++++++++++
+M00875
+K23371 K21949 K21721 K23372 K23373 K23374 K23375
+['K23371', 'K21949', 'K21721', 'K23372', 'K23373', 'K23374', 'K23375']
+==
+['K23371']
+['K21949']
+['K21721']
+['K23372']
+['K23373']
+['K23374']
+['K23375']
+++++++++++++++++++
+M00538
+K15760+K15761-K15762+K15763+K15764-K15765 K00055 K00141
+['K15760+K15761-K15762+K15763+K15764-K15765', 'K00055', 'K00141']
+==
+['K15760+K15761-K15762+K15763+K15764-K15765']
+['K00055']
+['K00141']
+++++++++++++++++++
+M00537
+K15757+K15758 K00055 K00141
+['K15757+K15758', 'K00055', 'K00141']
+==
+['K15757+K15758']
+['K00055']
+['K00141']
+++++++++++++++++++
+M00419
+K10616+K18293 K10617 K10618
+['K10616+K18293', 'K10617', 'K10618']
+==
+['K10616+K18293']
+['K10617']
+['K10618']
+++++++++++++++++++
+M00547
+K03268+K16268+K18089+K18090 K16269
+['K03268+K16268+K18089+K18090', 'K16269']
+==
+['K03268+K16268+K18089+K18090']
+['K16269']
+++++++++++++++++++
+M00548
+K16249+K16243+K16244+K16242+K16245+K16246
+['K16249+K16243+K16244+K16242+K16245+K16246']
+==
+['K16249+K16243+K16244+K16242+K16245+K16246']
+++++++++++++++++++
+M00551
+K05549+K05550+K05784 K05783
+['K05549+K05550+K05784', 'K05783']
+==
+['K05549+K05550+K05784']
+['K05783']
+++++++++++++++++++
+M00637
+(K05599+K05600+K11311,K16319+K16320+K18248+K18249)
+['(K05599+K05600+K11311,K16319+K16320+K18248+K18249)']
+==
+['K05599+K05600+K11311', 'K16319+K16320+K18248+K18249']
+++++++++++++++++++
+M00568
+K03381 K01856 K03464 (K01055,K14727)
+['K03381', 'K01856', 'K03464', '(K01055,K14727)']
+==
+['K03381']
+['K01856']
+['K03464']
+['K01055', 'K14727']
+++++++++++++++++++
+M00569
+(K00446,K07104) ((K10217 K01821 K01617),K10216) (K18364,K02554) (K18365,K01666) (K18366,K04073)
+['(K00446,K07104)', '((K10217_K01821_K01617),K10216)', '(K18364,K02554)', '(K18365,K01666)', '(K18366,K04073)']
+==
+['K00446', 'K07104']
+['K10216', 'K10217_K01821_K01617']
+['K18364', 'K02554']
+['K18365', 'K01666']
+['K18366', 'K04073']
+++++++++++++++++++
+M00539
+K10619+K16303+K16304+K18227 K10620 K10621 K10622 K10623
+['K10619+K16303+K16304+K18227', 'K10620', 'K10621', 'K10622', 'K10623']
+==
+['K10619+K16303+K16304+K18227']
+['K10620']
+['K10621']
+['K10622']
+['K10623']
+++++++++++++++++++
+M00543
+K08689+K15750+K18087+K18088 K08690 K00462 K10222
+['K08689+K15750+K18087+K18088', 'K08690', 'K00462', 'K10222']
+==
+['K08689+K15750+K18087+K18088']
+['K08690']
+['K00462']
+['K10222']
+++++++++++++++++++
+M00544
+K15751-K15752-K15753 K15754+K15755 K15756
+['K15751-K15752-K15753', 'K15754+K15755', 'K15756']
+==
+['K15751-K15752-K15753']
+['K15754+K15755']
+['K15756']
+++++++++++++++++++
+M00418
+K07540 K07543+K07544 K07545 K07546 K07547+K07548 K07549+K07550
+['K07540', 'K07543+K07544', 'K07545', 'K07546', 'K07547+K07548', 'K07549+K07550']
+==
+['K07540']
+['K07543+K07544']
+['K07545']
+['K07546']
+['K07547+K07548']
+['K07549+K07550']
+++++++++++++++++++
+M00541
+(K04112+K04113+K04114+K04115,K19515+K19516) K07537 K07538 K07539
+['(K04112+K04113+K04114+K04115,K19515+K19516)', 'K07537', 'K07538', 'K07539']
+==
+['K19515+K19516', 'K04112+K04113+K04114+K04115']
+['K07537']
+['K07538']
+['K07539']
+++++++++++++++++++
+M00540
+K04116 K04117 K07534 K07535 K07536
+['K04116', 'K04117', 'K07534', 'K07535', 'K07536']
+==
+['K04116']
+['K04117']
+['K07534']
+['K07535']
+['K07536']
+++++++++++++++++++
+M00534
+K14579+K14580+K14578+K14581 K14582 K14583 K14584 K14585 K00152
+['K14579+K14580+K14578+K14581', 'K14582', 'K14583', 'K14584', 'K14585', 'K00152']
+==
+['K14579+K14580+K14578+K14581']
+['K14582']
+['K14583']
+['K14584']
+['K14585']
+['K00152']
+++++++++++++++++++
+M00638
+K18242+K18243+K14578+K14581
+['K18242+K18243+K14578+K14581']
+==
+['K18242+K18243+K14578+K14581']
+++++++++++++++++++
+M00624
+K18074+K18075+K18077 K18076
+['K18074+K18075+K18077', 'K18076']
+==
+['K18074+K18075+K18077']
+['K18076']
+++++++++++++++++++
+M00623
+K18068+K18069 K18067 K04102
+['K18068+K18069', 'K18067', 'K04102']
+==
+['K18068+K18069']
+['K18067']
+['K04102']
+++++++++++++++++++
+M00636
+K18251+K18252-K18253-K18254 K18255 K18256
+['K18251+K18252-K18253-K18254', 'K18255', 'K18256']
+==
+['K18251+K18252-K18253-K18254']
+['K18255']
+['K18256']
+++++++++++++++++++
+M00878
+K01912 K02609+K02610+K02611+K02612+K02613 K15866 K02618 K02615 K01692 K00074
+['K01912', 'K02609+K02610+K02611+K02612+K02613', 'K15866', 'K02618', 'K02615', 'K01692', 'K00074']
+==
+['K01912']
+['K02609+K02610+K02611+K02612+K02613']
+['K15866']
+['K02618']
+['K02615']
+['K01692']
+['K00074']
+++++++++++++++++++
+M00852
+K10961 K10920 K10919 K10930 K10931 K10962 K10932 K10963 K10933 K10964 K10965 K10934 K10935 K10966
+['K10961', 'K10920', 'K10919', 'K10930', 'K10931', 'K10962', 'K10932', 'K10963', 'K10933', 'K10964', 'K10965', 'K10934', 'K10935', 'K10966']
+==
+['K10961']
+['K10920']
+['K10919']
+['K10930']
+['K10931']
+['K10962']
+['K10932']
+['K10963']
+['K10933']
+['K10964']
+['K10965']
+['K10934']
+['K10935']
+['K10966']
+++++++++++++++++++
+M00850
+(K10928+K10929) K10954 K10952 K10953 K10948 K11018
+['(K10928+K10929)', 'K10954', 'K10952', 'K10953', 'K10948', 'K11018']
+==
+['K10928+K10929']
+['K10954']
+['K10952']
+['K10953']
+['K10948']
+['K11018']
+++++++++++++++++++
+M00542
+K03221+K03219+K03222+K03226+K03227+K03228+K03229+K03230+K03224+K03225 K12784 K12787 K12785 K12786 K12788 K16041 K16042
+['K03221+K03219+K03222+K03226+K03227+K03228+K03229+K03230+K03224+K03225', 'K12784', 'K12787', 'K12785', 'K12786', 'K12788', 'K16041', 'K16042']
+==
+['K03221+K03219+K03222+K03226+K03227+K03228+K03229+K03230+K03224+K03225']
+['K12784']
+['K12787']
+['K12785']
+['K12786']
+['K12788']
+['K16041']
+['K16042']
+++++++++++++++++++
+M00363
+K11006 K11007
+['K11006', 'K11007']
+==
+['K11006']
+['K11007']
+++++++++++++++++++
+M00853
+K22850 -K22851 K22852 K22853 K22854
+['K22850', '-K22851', 'K22852', 'K22853', 'K22854']
+==
+['K22850']
+['-K22851']
+['K22852']
+['K22853']
+['K22854']
+++++++++++++++++++
+M00576
+(K10928+K10929) (K16883,K16884)
+['(K10928+K10929)', '(K16883,K16884)']
+==
+['K10928+K10929']
+['K16883', 'K16884']
+++++++++++++++++++
+M00856
+K11014 K11023 K19298 K22918
+['K11014', 'K11023', 'K19298', 'K22918']
+==
+['K11014']
+['K11023']
+['K19298']
+['K22918']
+++++++++++++++++++
+M00857
+K22914 K22926 K22925 K22915 K22916 K22917 (K22924+K22921+K22923+K22922)
+['K22914', 'K22926', 'K22925', 'K22915', 'K22916', 'K22917', '(K22924+K22921+K22923+K22922)']
+==
+['K22914']
+['K22926']
+['K22925']
+['K22915']
+['K22916']
+['K22917']
+['K22924+K22921+K22923+K22922']
+++++++++++++++++++
+M00575
+K22944 K11004 K07389 K11003 K12340
+['K22944', 'K11004', 'K07389', 'K11003', 'K12340']
+==
+['K22944']
+['K11004']
+['K07389']
+['K11003']
+['K12340']
+++++++++++++++++++
+M00574
+K11023 -K11024 K11025 K11026 K11027
+['K11023', '-K11024', 'K11025', 'K11026', 'K11027']
+==
+['K11023']
+['-K11024']
+['K11025']
+['K11026']
+['K11027']
+++++++++++++++++++
+M00564
+K15842 K12086 -K12087 K12088 K12089 K12090 K03196 K12091 -K12092 K12093 K12094 K12095 K12096 K12097 K12098 -K12099 -K12100 K12101 K12102 K12103 K12104 K12105 K12106 K12107 K12108 K12109 K12110
+['K15842', 'K12086', '-K12087', 'K12088', 'K12089', 'K12090', 'K03196', 'K12091', '-K12092', 'K12093', 'K12094', 'K12095', 'K12096', 'K12097', 'K12098', '-K12099', '-K12100', 'K12101', 'K12102', 'K12103', 'K12104', 'K12105', 'K12106', 'K12107', 'K12108', 'K12109', 'K12110']
+==
+['K15842']
+['K12086']
+['-K12087']
+['K12088']
+['K12089']
+['K12090']
+['K03196']
+['K12091']
+['-K12092']
+['K12093']
+['K12094']
+['K12095']
+['K12096']
+['K12097']
+['K12098']
+['-K12099']
+['-K12100']
+['K12101']
+['K12102']
+['K12103']
+['K12104']
+['K12105']
+['K12106']
+['K12107']
+['K12108']
+['K12109']
+['K12110']
+++++++++++++++++++
+M00859
+K11030 K08645 K11029
+['K11030', 'K08645', 'K11029']
+==
+['K11030']
+['K08645']
+['K11029']
+++++++++++++++++++
+M00860
+K22976 K22977 K22980 K07282 K22116 K01932 K22981
+['K22976', 'K22977', 'K22980', 'K07282', 'K22116', 'K01932', 'K22981']
+==
+['K22976']
+['K22977']
+['K22980']
+['K07282']
+['K22116']
+['K01932']
+['K22981']
+++++++++++++++++++
+M00851
+(K18768,K18970,K19316,K22346,K18794,K19318,K18971,K18793,K19319,K19320,K19321,K19322,K18972,K19211,K18976,K21277,K18782,K18781,K18780,K19099,K19216)
+['(K18768,K18970,K19316,K22346,K18794,K19318,K18971,K18793,K19319,K19320,K19321,K19322,K18972,K19211,K18976,K21277,K18782,K18781,K18780,K19099,K19216)']
+==
+['K18768', 'K18970', 'K19316', 'K22346', 'K18794', 'K19318', 'K18971', 'K18793', 'K19319', 'K19320', 'K19321', 'K19322', 'K18972', 'K19211', 'K18976', 'K21277', 'K18782', 'K18781', 'K18780', 'K19099', 'K19216']
+++++++++++++++++++
+M00625
+K02547 K02546 K02545
+['K02547', 'K02546', 'K02545']
+==
+['K02547']
+['K02546']
+['K02545']
+++++++++++++++++++
+M00627
+K02172 K02171 (K18766,K17836)
+['K02172', 'K02171', '(K18766,K17836)']
+==
+['K02172']
+['K02171']
+['K18766', 'K17836']
+++++++++++++++++++
+M00745
+(K18072 K18073),(K07644 K07665),K18297 K18093
+['(K18072_K18073),(K07644_K07665),K18297', 'K18093']
+==
+['K18297', 'K18072_K18073', 'K07644_K07665']
+['K18093']
+++++++++++++++++++
+M00651
+(K18345 K18344 K07260 K18346),(K18351 K18352 K18354 K18353) (K18347 K15739 K08641)
+['(K18345_K18344_K07260_K18346),(K18351_K18352_K18354_K18353)', '(K18347_K15739_K08641)']
+==
+['K18345_K18344_K07260_K18346', 'K18351_K18352_K18354_K18353']
+['K18347_K15739_K08641']
+++++++++++++++++++
+M00652
+K18350 K18349 K18348 K18856 K18866
+['K18350', 'K18349', 'K18348', 'K18856', 'K18866']
+==
+['K18350']
+['K18349']
+['K18348']
+['K18856']
+['K18866']
+++++++++++++++++++
+M00704
+K18906 K08168
+['K18906', 'K08168']
+==
+['K18906']
+['K08168']
+++++++++++++++++++
+M00725
+K19077 K19078 K03367+K03739+K14188+K03740
+['K19077', 'K19078', 'K03367+K03739+K14188+K03740']
+==
+['K19077']
+['K19078']
+['K03367+K03739+K14188+K03740']
+++++++++++++++++++
+M00726
+K19077 K19078 K14205
+['K19077', 'K19078', 'K14205']
+==
+['K19077']
+['K19078']
+['K14205']
+++++++++++++++++++
+M00730
+K19077 K19078 K19079+K19080
+['K19077', 'K19078', 'K19079+K19080']
+==
+['K19077']
+['K19078']
+['K19079+K19080']
+++++++++++++++++++
+M00744
+K07637 K07660 K08477
+['K07637', 'K07660', 'K08477']
+==
+['K07637']
+['K07660']
+['K08477']
+++++++++++++++++++
+M00718
+K18131 K03585+K18138+K18139
+['K18131', 'K03585+K18138+K18139']
+==
+['K18131']
+['K03585+K18138+K18139']
+++++++++++++++++++
+M00639
+K18294 K18295+K18296-K08721
+['K18294', 'K18295+K18296-K08721']
+==
+['K18294']
+['K18295+K18296-K08721']
+++++++++++++++++++
+M00641
+K18297 K18298+K18299-K18300
+['K18297', 'K18298+K18299-K18300']
+==
+['K18297']
+['K18298+K18299-K18300']
+++++++++++++++++++
+M00642
+K18301 K18302+K18303-K18139
+['K18301', 'K18302+K18303-K18139']
+==
+['K18301']
+['K18302+K18303-K18139']
+++++++++++++++++++
+M00643
+K18129 K18094+K18095+K18139
+['K18129', 'K18094+K18095+K18139']
+==
+['K18129']
+['K18094+K18095+K18139']
+++++++++++++++++++
+M00769
+K18304 K19591 K19595+K19594+K19593
+['K18304', 'K19591', 'K19595+K19594+K19593']
+==
+['K18304']
+['K19591']
+['K19595+K19594+K19593']
+++++++++++++++++++
+M00649
+K18143 K18144 K18145+K18146-K18147
+['K18143', 'K18144', 'K18145+K18146-K18147']
+==
+['K18143']
+['K18144']
+['K18145+K18146-K18147']
+++++++++++++++++++
+M00696
+K18140 K18141+K18142+K12340
+['K18140', 'K18141+K18142+K12340']
+==
+['K18140']
+['K18141+K18142+K12340']
+++++++++++++++++++
+M00697
+K07690 K18898+K18899+K12340
+['K07690', 'K18898+K18899+K12340']
+==
+['K07690']
+['K18898+K18899+K12340']
+++++++++++++++++++
+M00698
+K18900 K18901+K18902+K18903
+['K18900', 'K18901+K18902+K18903']
+==
+['K18900']
+['K18901+K18902+K18903']
+++++++++++++++++++
+M00700
+(K18906,K18907) K18104
+['(K18906,K18907)', 'K18104']
+==
+['K18906', 'K18907']
+['K18104']
+++++++++++++++++++
+M00702
+(K18906,K18907) K08170
+['(K18906,K18907)', 'K08170']
+==
+['K18906', 'K18907']
+['K08170']
+++++++++++++++++++
+M00714
+K18938 K08167
+['K18938', 'K08167']
+==
+['K18938']
+['K08167']
+++++++++++++++++++
+M00705
+K18909 K18908
+['K18909', 'K18908']
+==
+['K18909']
+['K18908']
+++++++++++++++++++
+M00746
+K13632 K18513 K09476
+['K13632', 'K18513', 'K09476']
+==
+['K13632']
+['K18513']
+['K09476']
+++++++++++++++++++
+M00660
+K03222+K03226+K03227+K03228+K03229+K03230+K03224+K03225+K03223+K18374+K18376 K18373 K18375 K18377 K18378 K18379 K18380 K18381
+['K03222+K03226+K03227+K03228+K03229+K03230+K03224+K03225+K03223+K18374+K18376', 'K18373', 'K18375', 'K18377', 'K18378', 'K18379', 'K18380', 'K18381']
+==
+['K03222+K03226+K03227+K03228+K03229+K03230+K03224+K03225+K03223+K18374+K18376']
+['K18373']
+['K18375']
+['K18377']
+['K18378']
+['K18379']
+['K18380']
+['K18381']
+++++++++++++++++++
+M00664
+K14658 K14659 K14666 K14657
+['K14658', 'K14659', 'K14666', 'K14657']
+==
+['K14658']
+['K14659']
+['K14666']
+['K14657']
+++++++++++++++++++
diff --git a/data/MicrobeAnnotator_KEGG/01.KEGG_DB/06.Module_Groups.txt b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/06.Module_Groups.txt
new file mode 100644
index 0000000..9c7c0c4
--- /dev/null
+++ b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/06.Module_Groups.txt
@@ -0,0 +1,394 @@
+M00015 Arginine and proline metabolism #8a3222
+M00028 Arginine and proline metabolism #8a3222
+M00029 Arginine and proline metabolism #8a3222
+M00047 Arginine and proline metabolism #8a3222
+M00763 Arginine and proline metabolism #8a3222
+M00844 Arginine and proline metabolism #8a3222
+M00845 Arginine and proline metabolism #8a3222
+M00879 Arginine and proline metabolism #8a3222
+M00022 Aromatic amino acid metabolism #8641b6
+M00023 Aromatic amino acid metabolism #8641b6
+M00024 Aromatic amino acid metabolism #8641b6
+M00025 Aromatic amino acid metabolism #8641b6
+M00037 Aromatic amino acid metabolism #8641b6
+M00038 Aromatic amino acid metabolism #8641b6
+M00040 Aromatic amino acid metabolism #8641b6
+M00042 Aromatic amino acid metabolism #8641b6
+M00043 Aromatic amino acid metabolism #8641b6
+M00044 Aromatic amino acid metabolism #8641b6
+M00533 Aromatic amino acid metabolism #8641b6
+M00545 Aromatic amino acid metabolism #8641b6
+M00418 Aromatics degradation #76d25b
+M00419 Aromatics degradation #76d25b
+M00534 Aromatics degradation #76d25b
+M00537 Aromatics degradation #76d25b
+M00538 Aromatics degradation #76d25b
+M00539 Aromatics degradation #76d25b
+M00540 Aromatics degradation #76d25b
+M00541 Aromatics degradation #76d25b
+M00543 Aromatics degradation #76d25b
+M00544 Aromatics degradation #76d25b
+M00547 Aromatics degradation #76d25b
+M00548 Aromatics degradation #76d25b
+M00551 Aromatics degradation #76d25b
+M00568 Aromatics degradation #76d25b
+M00569 Aromatics degradation #76d25b
+M00623 Aromatics degradation #76d25b
+M00624 Aromatics degradation #76d25b
+M00636 Aromatics degradation #76d25b
+M00637 Aromatics degradation #76d25b
+M00638 Aromatics degradation #76d25b
+M00878 Aromatics degradation #76d25b
+M00142 ATP synthesis #cdd346
+M00143 ATP synthesis #cdd346
+M00144 ATP synthesis #cdd346
+M00145 ATP synthesis #cdd346
+M00146 ATP synthesis #cdd346
+M00147 ATP synthesis #cdd346
+M00148 ATP synthesis #cdd346
+M00149 ATP synthesis #cdd346
+M00150 ATP synthesis #cdd346
+M00151 ATP synthesis #cdd346
+M00152 ATP synthesis #cdd346
+M00153 ATP synthesis #cdd346
+M00154 ATP synthesis #cdd346
+M00155 ATP synthesis #cdd346
+M00156 ATP synthesis #cdd346
+M00157 ATP synthesis #cdd346
+M00158 ATP synthesis #cdd346
+M00159 ATP synthesis #cdd346
+M00160 ATP synthesis #cdd346
+M00162 ATP synthesis #cdd346
+M00416 ATP synthesis #cdd346
+M00417 ATP synthesis #cdd346
+M00672 Beta-Lactam biosynthesis #3b2882
+M00673 Beta-Lactam biosynthesis #3b2882
+M00674 Beta-Lactam biosynthesis #3b2882
+M00675 Beta-Lactam biosynthesis #3b2882
+M00736 Beta-Lactam biosynthesis #3b2882
+M00039 Biosynthesis of other secondary metabolites #cbde82
+M00137 Biosynthesis of other secondary metabolites #cbde82
+M00138 Biosynthesis of other secondary metabolites #cbde82
+M00370 Biosynthesis of other secondary metabolites #cbde82
+M00661 Biosynthesis of other secondary metabolites #cbde82
+M00785 Biosynthesis of other secondary metabolites #cbde82
+M00786 Biosynthesis of other secondary metabolites #cbde82
+M00787 Biosynthesis of other secondary metabolites #cbde82
+M00788 Biosynthesis of other secondary metabolites #cbde82
+M00789 Biosynthesis of other secondary metabolites #cbde82
+M00790 Biosynthesis of other secondary metabolites #cbde82
+M00805 Biosynthesis of other secondary metabolites #cbde82
+M00808 Biosynthesis of other secondary metabolites #cbde82
+M00814 Biosynthesis of other secondary metabolites #cbde82
+M00815 Biosynthesis of other secondary metabolites #cbde82
+M00819 Biosynthesis of other secondary metabolites #cbde82
+M00835 Biosynthesis of other secondary metabolites #cbde82
+M00837 Biosynthesis of other secondary metabolites #cbde82
+M00838 Biosynthesis of other secondary metabolites #cbde82
+M00848 Biosynthesis of other secondary metabolites #cbde82
+M00875 Biosynthesis of other secondary metabolites #cbde82
+M00876 Biosynthesis of other secondary metabolites #cbde82
+M00877 Biosynthesis of other secondary metabolites #cbde82
+M00019 Branched-chain amino acid metabolism #656cdb
+M00036 Branched-chain amino acid metabolism #656cdb
+M00432 Branched-chain amino acid metabolism #656cdb
+M00535 Branched-chain amino acid metabolism #656cdb
+M00570 Branched-chain amino acid metabolism #656cdb
+M00165 Carbon fixation #408937
+M00166 Carbon fixation #408937
+M00167 Carbon fixation #408937
+M00168 Carbon fixation #408937
+M00169 Carbon fixation #408937
+M00170 Carbon fixation #408937
+M00171 Carbon fixation #408937
+M00172 Carbon fixation #408937
+M00173 Carbon fixation #408937
+M00374 Carbon fixation #408937
+M00375 Carbon fixation #408937
+M00376 Carbon fixation #408937
+M00377 Carbon fixation #408937
+M00579 Carbon fixation #408937
+M00620 Carbon fixation #408937
+M00001 Central carbohydrate metabolism #c644a5
+M00002 Central carbohydrate metabolism #c644a5
+M00003 Central carbohydrate metabolism #c644a5
+M00004 Central carbohydrate metabolism #c644a5
+M00005 Central carbohydrate metabolism #c644a5
+M00006 Central carbohydrate metabolism #c644a5
+M00007 Central carbohydrate metabolism #c644a5
+M00008 Central carbohydrate metabolism #c644a5
+M00009 Central carbohydrate metabolism #c644a5
+M00010 Central carbohydrate metabolism #c644a5
+M00011 Central carbohydrate metabolism #c644a5
+M00307 Central carbohydrate metabolism #c644a5
+M00308 Central carbohydrate metabolism #c644a5
+M00309 Central carbohydrate metabolism #c644a5
+M00580 Central carbohydrate metabolism #c644a5
+M00633 Central carbohydrate metabolism #c644a5
+M00112 Cofactor and vitamin metabolism #5fda98
+M00115 Cofactor and vitamin metabolism #5fda98
+M00116 Cofactor and vitamin metabolism #5fda98
+M00117 Cofactor and vitamin metabolism #5fda98
+M00119 Cofactor and vitamin metabolism #5fda98
+M00120 Cofactor and vitamin metabolism #5fda98
+M00121 Cofactor and vitamin metabolism #5fda98
+M00122 Cofactor and vitamin metabolism #5fda98
+M00123 Cofactor and vitamin metabolism #5fda98
+M00124 Cofactor and vitamin metabolism #5fda98
+M00125 Cofactor and vitamin metabolism #5fda98
+M00126 Cofactor and vitamin metabolism #5fda98
+M00127 Cofactor and vitamin metabolism #5fda98
+M00128 Cofactor and vitamin metabolism #5fda98
+M00140 Cofactor and vitamin metabolism #5fda98
+M00141 Cofactor and vitamin metabolism #5fda98
+M00572 Cofactor and vitamin metabolism #5fda98
+M00573 Cofactor and vitamin metabolism #5fda98
+M00577 Cofactor and vitamin metabolism #5fda98
+M00622 Cofactor and vitamin metabolism #5fda98
+M00810 Cofactor and vitamin metabolism #5fda98
+M00811 Cofactor and vitamin metabolism #5fda98
+M00836 Cofactor and vitamin metabolism #5fda98
+M00840 Cofactor and vitamin metabolism #5fda98
+M00841 Cofactor and vitamin metabolism #5fda98
+M00842 Cofactor and vitamin metabolism #5fda98
+M00843 Cofactor and vitamin metabolism #5fda98
+M00846 Cofactor and vitamin metabolism #5fda98
+M00847 Cofactor and vitamin metabolism #5fda98
+M00868 Cofactor and vitamin metabolism #5fda98
+M00880 Cofactor and vitamin metabolism #5fda98
+M00017 Cysteine and methionine metabolism #782975
+M00021 Cysteine and methionine metabolism #782975
+M00034 Cysteine and methionine metabolism #782975
+M00035 Cysteine and methionine metabolism #782975
+M00338 Cysteine and methionine metabolism #782975
+M00368 Cysteine and methionine metabolism #782975
+M00609 Cysteine and methionine metabolism #782975
+M00625 Drug resistance #869534
+M00627 Drug resistance #869534
+M00639 Drug resistance #869534
+M00641 Drug resistance #869534
+M00642 Drug resistance #869534
+M00643 Drug resistance #869534
+M00649 Drug resistance #869534
+M00651 Drug resistance #869534
+M00652 Drug resistance #869534
+M00696 Drug resistance #869534
+M00697 Drug resistance #869534
+M00698 Drug resistance #869534
+M00700 Drug resistance #869534
+M00702 Drug resistance #869534
+M00704 Drug resistance #869534
+M00705 Drug resistance #869534
+M00714 Drug resistance #869534
+M00718 Drug resistance #869534
+M00725 Drug resistance #869534
+M00726 Drug resistance #869534
+M00730 Drug resistance #869534
+M00744 Drug resistance #869534
+M00745 Drug resistance #869534
+M00746 Drug resistance #869534
+M00769 Drug resistance #869534
+M00851 Drug resistance #869534
+M00824 Enediyne biosynthesis #d27bde
+M00825 Enediyne biosynthesis #d27bde
+M00826 Enediyne biosynthesis #d27bde
+M00827 Enediyne biosynthesis #d27bde
+M00828 Enediyne biosynthesis #d27bde
+M00829 Enediyne biosynthesis #d27bde
+M00830 Enediyne biosynthesis #d27bde
+M00831 Enediyne biosynthesis #d27bde
+M00832 Enediyne biosynthesis #d27bde
+M00833 Enediyne biosynthesis #d27bde
+M00834 Enediyne biosynthesis #d27bde
+M00082 Fatty acid metabolism #d9a344
+M00083 Fatty acid metabolism #d9a344
+M00085 Fatty acid metabolism #d9a344
+M00086 Fatty acid metabolism #d9a344
+M00087 Fatty acid metabolism #d9a344
+M00415 Fatty acid metabolism #d9a344
+M00861 Fatty acid metabolism #d9a344
+M00873 Fatty acid metabolism #d9a344
+M00874 Fatty acid metabolism #d9a344
+M00055 Glycan biosynthesis #588cd6
+M00056 Glycan biosynthesis #588cd6
+M00065 Glycan biosynthesis #588cd6
+M00068 Glycan biosynthesis #588cd6
+M00069 Glycan biosynthesis #588cd6
+M00070 Glycan biosynthesis #588cd6
+M00071 Glycan biosynthesis #588cd6
+M00072 Glycan biosynthesis #588cd6
+M00073 Glycan biosynthesis #588cd6
+M00074 Glycan biosynthesis #588cd6
+M00075 Glycan biosynthesis #588cd6
+M00872 Glycan biosynthesis #588cd6
+M00057 Glycosaminoglycan metabolism #d66432
+M00058 Glycosaminoglycan metabolism #d66432
+M00059 Glycosaminoglycan metabolism #d66432
+M00076 Glycosaminoglycan metabolism #d66432
+M00077 Glycosaminoglycan metabolism #d66432
+M00078 Glycosaminoglycan metabolism #d66432
+M00079 Glycosaminoglycan metabolism #d66432
+M00026 Histidine metabolism #66d7bf
+M00045 Histidine metabolism #66d7bf
+M00066 Lipid metabolism #d53e55
+M00067 Lipid metabolism #d53e55
+M00088 Lipid metabolism #d53e55
+M00089 Lipid metabolism #d53e55
+M00090 Lipid metabolism #d53e55
+M00091 Lipid metabolism #d53e55
+M00092 Lipid metabolism #d53e55
+M00093 Lipid metabolism #d53e55
+M00094 Lipid metabolism #d53e55
+M00098 Lipid metabolism #d53e55
+M00099 Lipid metabolism #d53e55
+M00100 Lipid metabolism #d53e55
+M00113 Lipid metabolism #d53e55
+M00060 Lipopolysaccharide metabolism #83d2de
+M00063 Lipopolysaccharide metabolism #83d2de
+M00064 Lipopolysaccharide metabolism #83d2de
+M00866 Lipopolysaccharide metabolism #83d2de
+M00867 Lipopolysaccharide metabolism #83d2de
+M00016 Lysine metabolism #d84e8b
+M00030 Lysine metabolism #d84e8b
+M00031 Lysine metabolism #d84e8b
+M00032 Lysine metabolism #d84e8b
+M00433 Lysine metabolism #d84e8b
+M00525 Lysine metabolism #d84e8b
+M00526 Lysine metabolism #d84e8b
+M00527 Lysine metabolism #d84e8b
+M00773 Macrolide biosynthesis #2e4b26
+M00774 Macrolide biosynthesis #2e4b26
+M00775 Macrolide biosynthesis #2e4b26
+M00776 Macrolide biosynthesis #2e4b26
+M00777 Macrolide biosynthesis #2e4b26
+M00611 Metabolic capacity #9378c3
+M00612 Metabolic capacity #9378c3
+M00613 Metabolic capacity #9378c3
+M00614 Metabolic capacity #9378c3
+M00615 Metabolic capacity #9378c3
+M00616 Metabolic capacity #9378c3
+M00617 Metabolic capacity #9378c3
+M00618 Metabolic capacity #9378c3
+M00174 Methane metabolism #9e7336
+M00344 Methane metabolism #9e7336
+M00345 Methane metabolism #9e7336
+M00346 Methane metabolism #9e7336
+M00356 Methane metabolism #9e7336
+M00357 Methane metabolism #9e7336
+M00358 Methane metabolism #9e7336
+M00378 Methane metabolism #9e7336
+M00422 Methane metabolism #9e7336
+M00563 Methane metabolism #9e7336
+M00567 Methane metabolism #9e7336
+M00608 Methane metabolism #9e7336
+M00175 Nitrogen metabolism #2c2351
+M00528 Nitrogen metabolism #2c2351
+M00529 Nitrogen metabolism #2c2351
+M00530 Nitrogen metabolism #2c2351
+M00531 Nitrogen metabolism #2c2351
+M00804 Nitrogen metabolism #2c2351
+M00027 Other amino acid metabolism #c5d7a9
+M00118 Other amino acid metabolism #c5d7a9
+M00369 Other amino acid metabolism #c5d7a9
+M00012 Other carbohydrate metabolism #872b4e
+M00013 Other carbohydrate metabolism #872b4e
+M00014 Other carbohydrate metabolism #872b4e
+M00061 Other carbohydrate metabolism #872b4e
+M00081 Other carbohydrate metabolism #872b4e
+M00114 Other carbohydrate metabolism #872b4e
+M00129 Other carbohydrate metabolism #872b4e
+M00130 Other carbohydrate metabolism #872b4e
+M00131 Other carbohydrate metabolism #872b4e
+M00132 Other carbohydrate metabolism #872b4e
+M00373 Other carbohydrate metabolism #872b4e
+M00532 Other carbohydrate metabolism #872b4e
+M00549 Other carbohydrate metabolism #872b4e
+M00550 Other carbohydrate metabolism #872b4e
+M00552 Other carbohydrate metabolism #872b4e
+M00554 Other carbohydrate metabolism #872b4e
+M00565 Other carbohydrate metabolism #872b4e
+M00630 Other carbohydrate metabolism #872b4e
+M00631 Other carbohydrate metabolism #872b4e
+M00632 Other carbohydrate metabolism #872b4e
+M00740 Other carbohydrate metabolism #872b4e
+M00741 Other carbohydrate metabolism #872b4e
+M00761 Other carbohydrate metabolism #872b4e
+M00854 Other carbohydrate metabolism #872b4e
+M00855 Other carbohydrate metabolism #872b4e
+M00097 Other terpenoid biosynthesis #6e9368
+M00371 Other terpenoid biosynthesis #6e9368
+M00372 Other terpenoid biosynthesis #6e9368
+M00363 Pathogenicity #66406d
+M00542 Pathogenicity #66406d
+M00564 Pathogenicity #66406d
+M00574 Pathogenicity #66406d
+M00575 Pathogenicity #66406d
+M00576 Pathogenicity #66406d
+M00850 Pathogenicity #66406d
+M00852 Pathogenicity #66406d
+M00853 Pathogenicity #66406d
+M00856 Pathogenicity #66406d
+M00857 Pathogenicity #66406d
+M00859 Pathogenicity #66406d
+M00860 Pathogenicity #66406d
+M00161 Photosynthesis #cfa68a
+M00163 Photosynthesis #cfa68a
+M00597 Photosynthesis #cfa68a
+M00598 Photosynthesis #cfa68a
+M00660 Plant pathogenicity #461d27
+M00133 Polyamine biosynthesis #a5b3da
+M00134 Polyamine biosynthesis #a5b3da
+M00135 Polyamine biosynthesis #a5b3da
+M00136 Polyamine biosynthesis #a5b3da
+M00793 Polyketide sugar unit biosynthesis #5c4f24
+M00794 Polyketide sugar unit biosynthesis #5c4f24
+M00795 Polyketide sugar unit biosynthesis #5c4f24
+M00796 Polyketide sugar unit biosynthesis #5c4f24
+M00797 Polyketide sugar unit biosynthesis #5c4f24
+M00798 Polyketide sugar unit biosynthesis #5c4f24
+M00799 Polyketide sugar unit biosynthesis #5c4f24
+M00800 Polyketide sugar unit biosynthesis #5c4f24
+M00801 Polyketide sugar unit biosynthesis #5c4f24
+M00802 Polyketide sugar unit biosynthesis #5c4f24
+M00803 Polyketide sugar unit biosynthesis #5c4f24
+M00048 Purine metabolism #e0a7d2
+M00049 Purine metabolism #e0a7d2
+M00050 Purine metabolism #e0a7d2
+M00546 Purine metabolism #e0a7d2
+M00046 Pyrimidine metabolism #25585e
+M00051 Pyrimidine metabolism #25585e
+M00052 Pyrimidine metabolism #25585e
+M00053 Pyrimidine metabolism #25585e
+M00018 Serine and threonine metabolism #de7d78
+M00020 Serine and threonine metabolism #de7d78
+M00033 Serine and threonine metabolism #de7d78
+M00555 Serine and threonine metabolism #de7d78
+M00101 Sterol biosynthesis #4e96a2
+M00102 Sterol biosynthesis #4e96a2
+M00103 Sterol biosynthesis #4e96a2
+M00104 Sterol biosynthesis #4e96a2
+M00106 Sterol biosynthesis #4e96a2
+M00107 Sterol biosynthesis #4e96a2
+M00108 Sterol biosynthesis #4e96a2
+M00109 Sterol biosynthesis #4e96a2
+M00110 Sterol biosynthesis #4e96a2
+M00862 Sterol biosynthesis #4e96a2
+M00176 Sulfur metabolism #4e96a2
+M00595 Sulfur metabolism #4e96a2
+M00596 Sulfur metabolism #4e96a2
+M00664 Symbiosis #88574e
+M00095 Terpenoid backbone biosynthesis #4e6089
+M00096 Terpenoid backbone biosynthesis #4e6089
+M00364 Terpenoid backbone biosynthesis #4e6089
+M00365 Terpenoid backbone biosynthesis #4e6089
+M00366 Terpenoid backbone biosynthesis #4e6089
+M00367 Terpenoid backbone biosynthesis #4e6089
+M00849 Terpenoid backbone biosynthesis #4e6089
+M00778 Type II polyketide biosynthesis #af7194
+M00779 Type II polyketide biosynthesis #af7194
+M00780 Type II polyketide biosynthesis #af7194
+M00781 Type II polyketide biosynthesis #af7194
+M00782 Type II polyketide biosynthesis #af7194
+M00783 Type II polyketide biosynthesis #af7194
+M00784 Type II polyketide biosynthesis #af7194
+M00823 Type II polyketide biosynthesis #af7194
diff --git a/data/MicrobeAnnotator_KEGG/KEGG_Bifurcating_Module_Information.pkl b/data/MicrobeAnnotator_KEGG/KEGG_Bifurcating_Module_Information.pkl
new file mode 100644
index 0000000..7535b86
Binary files /dev/null and b/data/MicrobeAnnotator_KEGG/KEGG_Bifurcating_Module_Information.pkl differ
diff --git a/data/MicrobeAnnotator_KEGG/KEGG_Module-KOs.pkl b/data/MicrobeAnnotator_KEGG/KEGG_Module-KOs.pkl
new file mode 100644
index 0000000..cba82d5
Binary files /dev/null and b/data/MicrobeAnnotator_KEGG/KEGG_Module-KOs.pkl differ
diff --git a/data/MicrobeAnnotator_KEGG/KEGG_Module_Information.txt b/data/MicrobeAnnotator_KEGG/KEGG_Module_Information.txt
new file mode 100644
index 0000000..db9ec87
--- /dev/null
+++ b/data/MicrobeAnnotator_KEGG/KEGG_Module_Information.txt
@@ -0,0 +1,394 @@
+M00015 Proline biosynthesis, glutamate => proline Arginine and proline metabolism #8a3222
+M00028 Ornithine biosynthesis, glutamate => ornithine Arginine and proline metabolism #8a3222
+M00029 Urea cycle Arginine and proline metabolism #8a3222
+M00047 Creatine pathway Arginine and proline metabolism #8a3222
+M00763 Ornithine biosynthesis, mediated by LysW, glutamate => ornithine Arginine and proline metabolism #8a3222
+M00844 Arginine biosynthesis, ornithine => arginine Arginine and proline metabolism #8a3222
+M00845 Arginine biosynthesis, glutamate => acetylcitrulline => arginine Arginine and proline metabolism #8a3222
+M00879 Arginine succinyltransferase pathway, arginine => glutamate Arginine and proline metabolism #8a3222
+M00022 Shikimate pathway, phosphoenolpyruvate + erythrose-4P => chorismate Aromatic amino acid metabolism #8641b6
+M00023 Tryptophan biosynthesis, chorismate => tryptophan Aromatic amino acid metabolism #8641b6
+M00024 Phenylalanine biosynthesis, chorismate => phenylalanine Aromatic amino acid metabolism #8641b6
+M00025 Tyrosine biosynthesis, chorismate => tyrosine Aromatic amino acid metabolism #8641b6
+M00037 Melatonin biosynthesis, tryptophan => serotonin => melatonin Aromatic amino acid metabolism #8641b6
+M00038 Tryptophan metabolism, tryptophan => kynurenine => 2-aminomuconate Aromatic amino acid metabolism #8641b6
+M00040 Tyrosine biosynthesis, prephanate => pretyrosine => tyrosine Aromatic amino acid metabolism #8641b6
+M00042 Catecholamine biosynthesis, tyrosine => dopamine => noradrenaline => adrenaline Aromatic amino acid metabolism #8641b6
+M00043 Thyroid hormone biosynthesis, tyrosine => triiodothyronine--thyroxine Aromatic amino acid metabolism #8641b6
+M00044 Tyrosine degradation, tyrosine => homogentisate Aromatic amino acid metabolism #8641b6
+M00533 Homoprotocatechuate degradation, homoprotocatechuate => 2-oxohept-3-enedioate Aromatic amino acid metabolism #8641b6
+M00545 Trans-cinnamate degradation, trans-cinnamate => acetyl-CoA Aromatic amino acid metabolism #8641b6
+M00418 Toluene degradation, anaerobic, toluene => benzoyl-CoA Aromatics degradation #76d25b
+M00419 Cymene degradation, p-cymene => p-cumate Aromatics degradation #76d25b
+M00534 Naphthalene degradation, naphthalene => salicylate Aromatics degradation #76d25b
+M00537 Xylene degradation, xylene => methylbenzoate Aromatics degradation #76d25b
+M00538 Toluene degradation, toluene => benzoate Aromatics degradation #76d25b
+M00539 Cumate degradation, p-cumate => 2-oxopent-4-enoate + 2-methylpropanoate Aromatics degradation #76d25b
+M00540 Benzoate degradation, cyclohexanecarboxylic acid =>pimeloyl-CoA Aromatics degradation #76d25b
+M00541 Benzoyl-CoA degradation, benzoyl-CoA => 3-hydroxypimeloyl-CoA Aromatics degradation #76d25b
+M00543 Biphenyl degradation, biphenyl => 2-oxopent-4-enoate + benzoate Aromatics degradation #76d25b
+M00544 Carbazole degradation, carbazole => 2-oxopent-4-enoate + anthranilate Aromatics degradation #76d25b
+M00547 Benzene--toluene degradation, benzene => catechol -- toluene => 3-methylcatechol Aromatics degradation #76d25b
+M00548 Benzene degradation, benzene => catechol Aromatics degradation #76d25b
+M00551 Benzoate degradation, benzoate => catechol -- methylbenzoate => methylcatechol Aromatics degradation #76d25b
+M00568 Catechol ortho-cleavage, catechol => 3-oxoadipate Aromatics degradation #76d25b
+M00569 Catechol meta-cleavage, catechol => acetyl-CoA -- 4-methylcatechol => propanoyl-CoA Aromatics degradation #76d25b
+M00623 Phthalate degradation 1, phthalate => protocatechuate Aromatics degradation #76d25b
+M00624 Terephthalate degradation, terephthalate => 3,4-dihydroxybenzoate Aromatics degradation #76d25b
+M00636 Phthalate degradation 2, phthalate => protocatechuate Aromatics degradation #76d25b
+M00637 Anthranilate degradation, anthranilate => catechol Aromatics degradation #76d25b
+M00638 Salicylate degradation, salicylate => gentisate Aromatics degradation #76d25b
+M00878 Phenylacetate degradation, phenylaxetate => acetyl-CoA--succinyl-CoA Aromatics degradation #76d25b
+M00142 NADH:ubiquinone oxidoreductase, mitochondria ATP synthesis #cdd346
+M00143 NADH dehydrogenase (ubiquinone) Fe-S protein--flavoprotein complex, mitochondria ATP synthesis #cdd346
+M00144 NADH:quinone oxidoreductase, prokaryotes ATP synthesis #cdd346
+M00145 NAD(P)H:quinone oxidoreductase, chloroplasts and cyanobacteria ATP synthesis #cdd346
+M00146 NADH dehydrogenase (ubiquinone) 1 alpha subcomplex ATP synthesis #cdd346
+M00147 NADH dehydrogenase (ubiquinone) 1 beta subcomplex ATP synthesis #cdd346
+M00148 Succinate dehydrogenase (ubiquinone) ATP synthesis #cdd346
+M00149 Succinate dehydrogenase, prokaryotes ATP synthesis #cdd346
+M00150 Fumarate reductase, prokaryotes ATP synthesis #cdd346
+M00151 Cytochrome bc1 complex respiratory unit ATP synthesis #cdd346
+M00152 Cytochrome bc1 complex ATP synthesis #cdd346
+M00153 Cytochrome bd ubiquinol oxidase ATP synthesis #cdd346
+M00154 Cytochrome c oxidase ATP synthesis #cdd346
+M00155 Cytochrome c oxidase, prokaryotes ATP synthesis #cdd346
+M00156 Cytochrome c oxidase, cbb3-type ATP synthesis #cdd346
+M00157 F-type ATPase, prokaryotes and chloroplasts ATP synthesis #cdd346
+M00158 F-type ATPase, eukaryotes ATP synthesis #cdd346
+M00159 V-type ATPase, prokaryotes ATP synthesis #cdd346
+M00160 V-type ATPase, eukaryotes ATP synthesis #cdd346
+M00162 Cytochrome b6f complex ATP synthesis #cdd346
+M00416 Cytochrome aa3-600 menaquinol oxidase ATP synthesis #cdd346
+M00417 Cytochrome o ubiquinol oxidase ATP synthesis #cdd346
+M00672 Penicillin biosynthesis, aminoadipate + cycteine + valine => penicillin Beta-Lactam biosynthesis #3b2882
+M00673 Cephamycin C biosynthesis, aminoadipate + cycteine + valine => cephamycin C Beta-Lactam biosynthesis #3b2882
+M00674 Clavaminate biosynthesis, arginine + glyceraldehyde-3P => clavaminate Beta-Lactam biosynthesis #3b2882
+M00675 Carbapenem-3-carboxylate biosynthesis, pyrroline-5-carboxylate + malonyl-CoA => carbapenem-3-carboxylate Beta-Lactam biosynthesis #3b2882
+M00736 Nocardicin A biosynthesis, L-pHPG + arginine + serine => nocardicin A Beta-Lactam biosynthesis #3b2882
+M00039 Monolignol biosynthesis, phenylalanine--tyrosine => monolignol Biosynthesis of other secondary metabolites #cbde82
+M00137 Flavanone biosynthesis, phenylalanine => naringenin Biosynthesis of other secondary metabolites #cbde82
+M00138 Flavonoid biosynthesis, naringenin => pelargonidin Biosynthesis of other secondary metabolites #cbde82
+M00370 Glucosinolate biosynthesis, tryptophan => glucobrassicin Biosynthesis of other secondary metabolites #cbde82
+M00661 Paspaline biosynthesis, geranylgeranyl-PP + indoleglycerol phosphate => paspaline Biosynthesis of other secondary metabolites #cbde82
+M00785 Cycloserine biosynthesis, arginine--serine => cycloserine Biosynthesis of other secondary metabolites #cbde82
+M00786 Fumitremorgin alkaloid biosynthesis, tryptophan + proline => fumitremorgin C--A Biosynthesis of other secondary metabolites #cbde82
+M00787 Bacilysin biosynthesis, prephenate => bacilysin Biosynthesis of other secondary metabolites #cbde82
+M00788 Terpentecin biosynthesis, GGAP => terpentecin Biosynthesis of other secondary metabolites #cbde82
+M00789 Rebeccamycin biosynthesis, tryptophan => rebeccamycin Biosynthesis of other secondary metabolites #cbde82
+M00790 Pyrrolnitrin biosynthesis, tryptophan => pyrrolnitrin Biosynthesis of other secondary metabolites #cbde82
+M00805 Staurosporine biosynthesis, tryptophan => staurosporine Biosynthesis of other secondary metabolites #cbde82
+M00808 Violacein biosynthesis, tryptophan => violacein Biosynthesis of other secondary metabolites #cbde82
+M00814 Acarbose biosynthesis, sedoheptulopyranose-7P => acarbose Biosynthesis of other secondary metabolites #cbde82
+M00815 Validamycin A biosynthesis, sedoheptulopyranose-7P => validamycin A Biosynthesis of other secondary metabolites #cbde82
+M00819 Pentalenolactone biosynthesis, farnesyl-PP => pentalenolactone Biosynthesis of other secondary metabolites #cbde82
+M00835 Pyocyanine biosynthesis, chorismate => pyocyanine Biosynthesis of other secondary metabolites #cbde82
+M00837 Prodigiosin biosynthesis, L-proline => prodigiosin Biosynthesis of other secondary metabolites #cbde82
+M00838 Undecylprodigiosin biosynthesis, L-proline => undecylprodigiosin Biosynthesis of other secondary metabolites #cbde82
+M00848 Aurachin biosynthesis, anthranilate => aurachin A Biosynthesis of other secondary metabolites #cbde82
+M00875 Staphyloferrin B biosynthesis, L-serine => staphyloferrin B Biosynthesis of other secondary metabolites #cbde82
+M00876 Staphyloferrin A biosynthesis, L-ornithine => staphyloferrin A Biosynthesis of other secondary metabolites #cbde82
+M00877 Kanosamine biosynthesis glucose 6-phosphate => kanosamine Biosynthesis of other secondary metabolites #cbde82
+M00019 Valine--isoleucine biosynthesis, pyruvate => valine -- 2-oxobutanoate => isoleucine Branched-chain amino acid metabolism #656cdb
+M00036 Leucine degradation, leucine => acetoacetate + acetyl-CoA Branched-chain amino acid metabolism #656cdb
+M00432 Leucine biosynthesis, 2-oxoisovalerate => 2-oxoisocaproate Branched-chain amino acid metabolism #656cdb
+M00535 Isoleucine biosynthesis, pyruvate => 2-oxobutanoate Branched-chain amino acid metabolism #656cdb
+M00570 Isoleucine biosynthesis, threonine => 2-oxobutanoate => isoleucine Branched-chain amino acid metabolism #656cdb
+M00165 Reductive pentose phosphate cycle (Calvin cycle) Carbon fixation #408937
+M00166 Reductive pentose phosphate cycle, ribulose-5P => glyceraldehyde-3P Carbon fixation #408937
+M00167 Reductive pentose phosphate cycle, glyceraldehyde-3P => ribulose-5P Carbon fixation #408937
+M00168 CAM (Crassulacean acid metabolism), dark Carbon fixation #408937
+M00169 CAM (Crassulacean acid metabolism), light Carbon fixation #408937
+M00170 C4-dicarboxylic acid cycle, phosphoenolpyruvate carboxykinase type Carbon fixation #408937
+M00171 C4-dicarboxylic acid cycle, NAD - malic enzyme type Carbon fixation #408937
+M00172 C4-dicarboxylic acid cycle, NADP - malic enzyme type Carbon fixation #408937
+M00173 Reductive citrate cycle (Arnon-Buchanan cycle) Carbon fixation #408937
+M00374 Dicarboxylate-hydroxybutyrate cycle Carbon fixation #408937
+M00375 Hydroxypropionate-hydroxybutylate cycle Carbon fixation #408937
+M00376 3-Hydroxypropionate bi-cycle Carbon fixation #408937
+M00377 Reductive acetyl-CoA pathway (Wood-Ljungdahl pathway) Carbon fixation #408937
+M00579 Phosphate acetyltransferase-acetate kinase pathway, acetyl-CoA => acetate Carbon fixation #408937
+M00620 Incomplete reductive citrate cycle, acetyl-CoA => oxoglutarate Carbon fixation #408937
+M00001 Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate Central carbohydrate metabolism #c644a5
+M00002 Glycolysis, core module involving three-carbon compounds Central carbohydrate metabolism #c644a5
+M00003 Gluconeogenesis, oxaloacetate => fructose-6P Central carbohydrate metabolism #c644a5
+M00004 Pentose phosphate pathway (Pentose phosphate cycle) Central carbohydrate metabolism #c644a5
+M00005 PRPP biosynthesis, ribose 5P => PRPP Central carbohydrate metabolism #c644a5
+M00006 Pentose phosphate pathway, oxidative phase, glucose 6P => ribulose 5P Central carbohydrate metabolism #c644a5
+M00007 Pentose phosphate pathway, non-oxidative phase, fructose 6P => ribose 5P Central carbohydrate metabolism #c644a5
+M00008 Entner-Doudoroff pathway, glucose-6P => glyceraldehyde-3P + pyruvate Central carbohydrate metabolism #c644a5
+M00009 Citrate cycle (TCA cycle, Krebs cycle) Central carbohydrate metabolism #c644a5
+M00010 Citrate cycle, first carbon oxidation, oxaloacetate => 2-oxoglutarate Central carbohydrate metabolism #c644a5
+M00011 Citrate cycle, second carbon oxidation, 2-oxoglutarate => oxaloacetate Central carbohydrate metabolism #c644a5
+M00307 Pyruvate oxidation, pyruvate => acetyl-CoA Central carbohydrate metabolism #c644a5
+M00308 Semi-phosphorylative Entner-Doudoroff pathway, gluconate => glycerate-3P Central carbohydrate metabolism #c644a5
+M00309 Non-phosphorylative Entner-Doudoroff pathway, gluconate--galactonate => glycerate Central carbohydrate metabolism #c644a5
+M00580 Pentose phosphate pathway, archaea, fructose 6P => ribose 5P Central carbohydrate metabolism #c644a5
+M00633 Semi-phosphorylative Entner-Doudoroff pathway, gluconate--galactonate => glycerate-3P Central carbohydrate metabolism #c644a5
+M00112 Tocopherol--tocotorienol biosynthesis Cofactor and vitamin metabolism #5fda98
+M00115 NAD biosynthesis, aspartate => NAD Cofactor and vitamin metabolism #5fda98
+M00116 Menaquinone biosynthesis, chorismate => menaquinol Cofactor and vitamin metabolism #5fda98
+M00117 Ubiquinone biosynthesis, prokaryotes, chorismate => ubiquinone Cofactor and vitamin metabolism #5fda98
+M00119 Pantothenate biosynthesis, valine--L-aspartate => pantothenate Cofactor and vitamin metabolism #5fda98
+M00120 Coenzyme A biosynthesis, pantothenate => CoA Cofactor and vitamin metabolism #5fda98
+M00121 Heme biosynthesis, plants and bacteria, glutamate => heme Cofactor and vitamin metabolism #5fda98
+M00122 Cobalamin biosynthesis, cobinamide => cobalamin Cofactor and vitamin metabolism #5fda98
+M00123 Biotin biosynthesis, pimeloyl-ACP--CoA => biotin Cofactor and vitamin metabolism #5fda98
+M00124 Pyridoxal biosynthesis, erythrose-4P => pyridoxal-5P Cofactor and vitamin metabolism #5fda98
+M00125 Riboflavin biosynthesis, GTP => riboflavin--FMN--FAD Cofactor and vitamin metabolism #5fda98
+M00126 Tetrahydrofolate biosynthesis, GTP => THF Cofactor and vitamin metabolism #5fda98
+M00127 Thiamine biosynthesis, AIR => thiamine-P--thiamine-2P Cofactor and vitamin metabolism #5fda98
+M00128 Ubiquinone biosynthesis, eukaryotes, 4-hydroxybenzoate => ubiquinone Cofactor and vitamin metabolism #5fda98
+M00140 C1-unit interconversion, prokaryotes Cofactor and vitamin metabolism #5fda98
+M00141 C1-unit interconversion, eukaryotes Cofactor and vitamin metabolism #5fda98
+M00572 Pimeloyl-ACP biosynthesis, BioC-BioH pathway, malonyl-ACP => pimeloyl-ACP Cofactor and vitamin metabolism #5fda98
+M00573 Biotin biosynthesis, BioI pathway, long-chain-acyl-ACP => pimeloyl-ACP => biotin Cofactor and vitamin metabolism #5fda98
+M00577 Biotin biosynthesis, BioW pathway, pimelate => pimeloyl-CoA => biotin Cofactor and vitamin metabolism #5fda98
+M00622 Nicotinate degradation, nicotinate => fumarate Cofactor and vitamin metabolism #5fda98
+M00810 Nicotine degradation, pyridine pathway, nicotine => 2,6-dihydroxypyridine--succinate semialdehyde Cofactor and vitamin metabolism #5fda98
+M00811 Nicotine degradation, pyrrolidine pathway, nicotine => succinate semialdehyde Cofactor and vitamin metabolism #5fda98
+M00836 Coenzyme F430 biosynthesis, sirohydrochlorin => coenzyme F430 Cofactor and vitamin metabolism #5fda98
+M00840 Tetrahydrofolate biosynthesis, mediated by ribA and trpF, GTP => THF Cofactor and vitamin metabolism #5fda98
+M00841 Tetrahydrofolate biosynthesis, mediated by PTPS, GTP => THF Cofactor and vitamin metabolism #5fda98
+M00842 Tetrahydrobiopterin biosynthesis, GTP => BH4 Cofactor and vitamin metabolism #5fda98
+M00843 L-threo-Tetrahydrobiopterin biosynthesis, GTP => L-threo-BH4 Cofactor and vitamin metabolism #5fda98
+M00846 Siroheme biosynthesis, glutamate => siroheme Cofactor and vitamin metabolism #5fda98
+M00847 Heme biosynthesis, archaea, siroheme => heme Cofactor and vitamin metabolism #5fda98
+M00868 Heme biosynthesis, animals and fungi, glycine => heme Cofactor and vitamin metabolism #5fda98
+M00880 Molybdenum cofactor biosynthesis, GTP => molybdenum cofactor Cofactor and vitamin metabolism #5fda98
+M00017 Methionine biosynthesis, apartate => homoserine => methionine Cysteine and methionine metabolism #782975
+M00021 Cysteine biosynthesis, serine => cysteine Cysteine and methionine metabolism #782975
+M00034 Methionine salvage pathway Cysteine and methionine metabolism #782975
+M00035 Methionine degradation Cysteine and methionine metabolism #782975
+M00338 Cysteine biosynthesis, homocysteine + serine => cysteine Cysteine and methionine metabolism #782975
+M00368 Ethylene biosynthesis, methionine => ethylene Cysteine and methionine metabolism #782975
+M00609 Cysteine biosynthesis, methionine => cysteine Cysteine and methionine metabolism #782975
+M00625 Methicillin resistance Drug resistance #869534
+M00627 beta-Lactam resistance, Bla system Drug resistance #869534
+M00639 Multidrug resistance, efflux pump MexCD-OprJ Drug resistance #869534
+M00641 Multidrug resistance, efflux pump MexEF-OprN Drug resistance #869534
+M00642 Multidrug resistance, efflux pump MexJK-OprM Drug resistance #869534
+M00643 Multidrug resistance, efflux pump MexXY-OprM Drug resistance #869534
+M00649 Multidrug resistance, efflux pump AdeABC Drug resistance #869534
+M00651 Vancomycin resistance, D-Ala-D-Lac type Drug resistance #869534
+M00652 Vancomycin resistance, D-Ala-D-Ser type Drug resistance #869534
+M00696 Multidrug resistance, efflux pump AcrEF-TolC Drug resistance #869534
+M00697 Multidrug resistance, efflux pump MdtEF-TolC Drug resistance #869534
+M00698 Multidrug resistance, efflux pump BpeEF-OprC Drug resistance #869534
+M00700 Multidrug resistance, efflux pump AbcA Drug resistance #869534
+M00702 Multidrug resistance, efflux pump NorB Drug resistance #869534
+M00704 Tetracycline resistance, efflux pump Tet38 Drug resistance #869534
+M00705 Multidrug resistance, efflux pump MepA Drug resistance #869534
+M00714 Multidrug resistance, efflux pump QacA Drug resistance #869534
+M00718 Multidrug resistance, efflux pump MexAB-OprM Drug resistance #869534
+M00725 Cationic antimicrobial peptide (CAMP) resistance, dltABCD operon Drug resistance #869534
+M00726 Cationic antimicrobial peptide (CAMP) resistance, lysyl-phosphatidylglycerol (L-PG) synthase MprF Drug resistance #869534
+M00730 Cationic antimicrobial peptide (CAMP) resistance, VraFG transporter Drug resistance #869534
+M00744 Cationic antimicrobial peptide (CAMP) resistance, protease PgtE Drug resistance #869534
+M00745 Imipenem resistance, repression of porin OprD Drug resistance #869534
+M00746 Multidrug resistance, repression of porin OmpF Drug resistance #869534
+M00769 Multidrug resistance, efflux pump MexPQ-OpmE Drug resistance #869534
+M00851 Carbapenem resistance Drug resistance #869534
+M00824 9-membered enediyne core biosynthesis, malonyl-CoA => 3-hydroxyhexadeca-4,6,8,10,12,14-hexaenoyl-ACP => 9-membered enediyne core Enediyne biosynthesis #d27bde
+M00825 10-membered enediyne core biosynthesis, malonyl-CoA => 3-hydroxyhexadeca-4,6,8,10,12,14-hexaenoyl-ACP => 10-membered enediyne core Enediyne biosynthesis #d27bde
+M00826 C-1027 benzoxazolinate moiety biosynthesis, chorismate => benzoxazolinyl-CoA Enediyne biosynthesis #d27bde
+M00827 C-1027 beta-amino acid moiety biosynthesis, tyrosine => 3-chloro-4,5-dihydroxy-beta-phenylalanyl-PCP Enediyne biosynthesis #d27bde
+M00828 Maduropeptin beta-hydroxy acid moiety biosynthesis, tyrosine => 3-(4-hydroxyphenyl)-3-oxopropanoyl-PCP Enediyne biosynthesis #d27bde
+M00829 3,6-Dimethylsalicylyl-CoA biosynthesis, malonyl-CoA => 6-methylsalicylate => 3,6-dimethylsalicylyl-CoA Enediyne biosynthesis #d27bde
+M00830 Neocarzinostatin naphthoate moiety biosynthesis, malonyl-CoA => 2-hydroxy-5-methyl-1-naphthoate => 2-hydroxy-7-methoxy-5-methyl-1-naphthoyl-CoA Enediyne biosynthesis #d27bde
+M00831 Kedarcidin 2-hydroxynaphthoate moiety biosynthesis, malonyl-CoA => 3,6,8-trihydroxy-2-naphthoate => 3-hydroxy-7,8-dimethoxy-6-isopropoxy-2-naphthoyl-CoA Enediyne biosynthesis #d27bde
+M00832 Kedarcidin 2-aza-3-chloro-beta-tyrosine moiety biosynthesis, azatyrosine => 2-aza-3-chloro-beta-tyrosyl-PCP Enediyne biosynthesis #d27bde
+M00833 Calicheamicin biosynthesis, calicheamicinone => calicheamicin Enediyne biosynthesis #d27bde
+M00834 Calicheamicin orsellinate moiety biosynthesis, malonyl-CoA => orsellinate-ACP => 5-iodo-2,3-dimethoxyorsellinate-ACP Enediyne biosynthesis #d27bde
+M00082 Fatty acid biosynthesis, initiation Fatty acid metabolism #d9a344
+M00083 Fatty acid biosynthesis, elongation Fatty acid metabolism #d9a344
+M00085 Fatty acid elongation in mitochondria Fatty acid metabolism #d9a344
+M00086 beta-Oxidation, acyl-CoA synthesis Fatty acid metabolism #d9a344
+M00087 beta-Oxidation Fatty acid metabolism #d9a344
+M00415 Fatty acid elongation in endoplasmic reticulum Fatty acid metabolism #d9a344
+M00861 beta-Oxidation, peroxisome, VLCFA Fatty acid metabolism #d9a344
+M00873 Fatty acid biosynthesis in mitochondria, animals Fatty acid metabolism #d9a344
+M00874 Fatty acid biosynthesis in mitochondria, fungi Fatty acid metabolism #d9a344
+M00055 N-glycan precursor biosynthesis Glycan biosynthesis #588cd6
+M00056 O-glycan biosynthesis, mucin type core Glycan biosynthesis #588cd6
+M00065 GPI-anchor biosynthesis, core oligosaccharide Glycan biosynthesis #588cd6
+M00068 Glycosphingolipid biosynthesis, globo-series, LacCer => Gb4Cer Glycan biosynthesis #588cd6
+M00069 Glycosphingolipid biosynthesis, ganglio series, LacCer => GT3 Glycan biosynthesis #588cd6
+M00070 Glycosphingolipid biosynthesis, lacto-series, LacCer => Lc4Cer Glycan biosynthesis #588cd6
+M00071 Glycosphingolipid biosynthesis, neolacto-series, LacCer => nLc4Cer Glycan biosynthesis #588cd6
+M00072 N-glycosylation by oligosaccharyltransferase Glycan biosynthesis #588cd6
+M00073 N-glycan precursor trimming Glycan biosynthesis #588cd6
+M00074 N-glycan biosynthesis, high-mannose type Glycan biosynthesis #588cd6
+M00075 N-glycan biosynthesis, complex type Glycan biosynthesis #588cd6
+M00872 O-glycan biosynthesis, mannose type (core M3) Glycan biosynthesis #588cd6
+M00057 Glycosaminoglycan biosynthesis, linkage tetrasaccharide Glycosaminoglycan metabolism #d66432
+M00058 Glycosaminoglycan biosynthesis, chondroitin sulfate backbone Glycosaminoglycan metabolism #d66432
+M00059 Glycosaminoglycan biosynthesis, heparan sulfate backbone Glycosaminoglycan metabolism #d66432
+M00076 Dermatan sulfate degradation Glycosaminoglycan metabolism #d66432
+M00077 Chondroitin sulfate degradation Glycosaminoglycan metabolism #d66432
+M00078 Heparan sulfate degradation Glycosaminoglycan metabolism #d66432
+M00079 Keratan sulfate degradation Glycosaminoglycan metabolism #d66432
+M00026 Histidine biosynthesis, PRPP => histidine Histidine metabolism #66d7bf
+M00045 Histidine degradation, histidine => N-formiminoglutamate => glutamate Histidine metabolism #66d7bf
+M00066 Lactosylceramide biosynthesis Lipid metabolism #d53e55
+M00067 Sulfoglycolipids biosynthesis, ceramide--1-alkyl-2-acylglycerol => sulfatide--seminolipid Lipid metabolism #d53e55
+M00088 Ketone body biosynthesis, acetyl-CoA => acetoacetate--3-hydroxybutyrate--acetone Lipid metabolism #d53e55
+M00089 Triacylglycerol biosynthesis Lipid metabolism #d53e55
+M00090 Phosphatidylcholine (PC) biosynthesis, choline => PC Lipid metabolism #d53e55
+M00091 Phosphatidylcholine (PC) biosynthesis, PE => PC Lipid metabolism #d53e55
+M00092 Phosphatidylethanolamine (PE) biosynthesis, ethanolamine => PE Lipid metabolism #d53e55
+M00093 Phosphatidylethanolamine (PE) biosynthesis, PA => PS => PE Lipid metabolism #d53e55
+M00094 Ceramide biosynthesis Lipid metabolism #d53e55
+M00098 Acylglycerol degradation Lipid metabolism #d53e55
+M00099 Sphingosine biosynthesis Lipid metabolism #d53e55
+M00100 Sphingosine degradation Lipid metabolism #d53e55
+M00113 Jasmonic acid biosynthesis Lipid metabolism #d53e55
+M00060 KDO2-lipid A biosynthesis, Raetz pathway, LpxL-LpxM type Lipopolysaccharide metabolism #83d2de
+M00063 CMP-KDO biosynthesis Lipopolysaccharide metabolism #83d2de
+M00064 ADP-L-glycero-D-manno-heptose biosynthesis Lipopolysaccharide metabolism #83d2de
+M00866 KDO2-lipid A biosynthesis, Raetz pathway, non-LpxL-LpxM type Lipopolysaccharide metabolism #83d2de
+M00867 KDO2-lipid A modification pathway Lipopolysaccharide metabolism #83d2de
+M00016 Lysine biosynthesis, succinyl-DAP pathway, aspartate => lysine Lysine metabolism #d84e8b
+M00030 Lysine biosynthesis, AAA pathway, 2-oxoglutarate => 2-aminoadipate => lysine Lysine metabolism #d84e8b
+M00031 Lysine biosynthesis, mediated by LysW, 2-aminoadipate => lysine Lysine metabolism #d84e8b
+M00032 Lysine degradation, lysine => saccharopine => acetoacetyl-CoA Lysine metabolism #d84e8b
+M00433 Lysine biosynthesis, 2-oxoglutarate => 2-oxoadipate Lysine metabolism #d84e8b
+M00525 Lysine biosynthesis, acetyl-DAP pathway, aspartate => lysine Lysine metabolism #d84e8b
+M00526 Lysine biosynthesis, DAP dehydrogenase pathway, aspartate => lysine Lysine metabolism #d84e8b
+M00527 Lysine biosynthesis, DAP aminotransferase pathway, aspartate => lysine Lysine metabolism #d84e8b
+M00773 Tylosin biosynthesis, methylmalonyl-CoA + malonyl-CoA => tylactone => tylosin Macrolide biosynthesis #2e4b26
+M00774 Erythromycin biosynthesis, propanoyl-CoA + methylmalonyl-CoA => deoxyerythronolide B => erythromycin A--B Macrolide biosynthesis #2e4b26
+M00775 Oleandomycin biosynthesis, malonyl-CoA + methylmalonyl-CoA => 8,8a-deoxyoleandolide => oleandomycin Macrolide biosynthesis #2e4b26
+M00776 Pikromycin--methymycin biosynthesis, methylmalonyl-CoA + malonyl-CoA => narbonolide--10-deoxymethynolide => pikromycin--methymycin Macrolide biosynthesis #2e4b26
+M00777 Avermectin biosynthesis, 2-methylbutanoyl-CoA--isobutyryl-CoA => 6,8a-Seco-6,8a-deoxy-5-oxoavermectin 1a--1b aglycone => avermectin A1a--B1a--A1b--B1b Macrolide biosynthesis #2e4b26
+M00611 Oxygenic photosynthesis in plants and cyanobacteria Metabolic capacity #9378c3
+M00612 Anoxygenic photosynthesis in purple bacteria Metabolic capacity #9378c3
+M00613 Anoxygenic photosynthesis in green nonsulfur bacteria Metabolic capacity #9378c3
+M00614 Anoxygenic photosynthesis in green sulfur bacteria Metabolic capacity #9378c3
+M00615 Nitrate assimilation Metabolic capacity #9378c3
+M00616 Sulfate-sulfur assimilation Metabolic capacity #9378c3
+M00617 Methanogen Metabolic capacity #9378c3
+M00618 Acetogen Metabolic capacity #9378c3
+M00174 Methane oxidation, methanotroph, methane => formaldehyde Methane metabolism #9e7336
+M00344 Formaldehyde assimilation, xylulose monophosphate pathway Methane metabolism #9e7336
+M00345 Formaldehyde assimilation, ribulose monophosphate pathway Methane metabolism #9e7336
+M00346 Formaldehyde assimilation, serine pathway Methane metabolism #9e7336
+M00356 Methanogenesis, methanol => methane Methane metabolism #9e7336
+M00357 Methanogenesis, acetate => methane Methane metabolism #9e7336
+M00358 Coenzyme M biosynthesis Methane metabolism #9e7336
+M00378 F420 biosynthesis Methane metabolism #9e7336
+M00422 Acetyl-CoA pathway, CO2 => acetyl-CoA Methane metabolism #9e7336
+M00563 Methanogenesis, methylamine--dimethylamine--trimethylamine => methane Methane metabolism #9e7336
+M00567 Methanogenesis, CO2 => methane Methane metabolism #9e7336
+M00608 2-Oxocarboxylic acid chain extension, 2-oxoglutarate => 2-oxoadipate => 2-oxopimelate => 2-oxosuberate Methane metabolism #9e7336
+M00175 Nitrogen fixation, nitrogen => ammonia Nitrogen metabolism #2c2351
+M00528 Nitrification, ammonia => nitrite Nitrogen metabolism #2c2351
+M00529 Denitrification, nitrate => nitrogen Nitrogen metabolism #2c2351
+M00530 Dissimilatory nitrate reduction, nitrate => ammonia Nitrogen metabolism #2c2351
+M00531 Assimilatory nitrate reduction, nitrate => ammonia Nitrogen metabolism #2c2351
+M00804 Complete nitrification, comammox, ammonia => nitrite => nitrate Nitrogen metabolism #2c2351
+M00027 GABA (gamma-Aminobutyrate) shunt Other amino acid metabolism #c5d7a9
+M00118 Glutathione biosynthesis, glutamate => glutathione Other amino acid metabolism #c5d7a9
+M00369 Cyanogenic glycoside biosynthesis, tyrosine => dhurrin Other amino acid metabolism #c5d7a9
+M00012 Glyoxylate cycle Other carbohydrate metabolism #872b4e
+M00013 Malonate semialdehyde pathway, propanoyl-CoA => acetyl-CoA Other carbohydrate metabolism #872b4e
+M00014 Glucuronate pathway (uronate pathway) Other carbohydrate metabolism #872b4e
+M00061 D-Glucuronate degradation, D-glucuronate => pyruvate + D-glyceraldehyde 3P Other carbohydrate metabolism #872b4e
+M00081 Pectin degradation Other carbohydrate metabolism #872b4e
+M00114 Ascorbate biosynthesis, plants, glucose-6P => ascorbate Other carbohydrate metabolism #872b4e
+M00129 Ascorbate biosynthesis, animals, glucose-1P => ascorbate Other carbohydrate metabolism #872b4e
+M00130 Inositol phosphate metabolism, PI => PIP2 => Ins(1,4,5)P3 => Ins(1,3,4,5)P4 Other carbohydrate metabolism #872b4e
+M00131 Inositol phosphate metabolism, Ins(1,3,4,5)P4 => Ins(1,3,4)P3 => myo-inositol Other carbohydrate metabolism #872b4e
+M00132 Inositol phosphate metabolism, Ins(1,3,4)P3 => phytate Other carbohydrate metabolism #872b4e
+M00373 Ethylmalonyl pathway Other carbohydrate metabolism #872b4e
+M00532 Photorespiration Other carbohydrate metabolism #872b4e
+M00549 Nucleotide sugar biosynthesis, glucose => UDP-glucose Other carbohydrate metabolism #872b4e
+M00550 Ascorbate degradation, ascorbate => D-xylulose-5P Other carbohydrate metabolism #872b4e
+M00552 D-galactonate degradation, De Ley-Doudoroff pathway, D-galactonate => glycerate-3P Other carbohydrate metabolism #872b4e
+M00554 Nucleotide sugar biosynthesis, galactose => UDP-galactose Other carbohydrate metabolism #872b4e
+M00565 Trehalose biosynthesis, D-glucose 1P => trehalose Other carbohydrate metabolism #872b4e
+M00630 D-Galacturonate degradation (fungi), D-galacturonate => glycerol Other carbohydrate metabolism #872b4e
+M00631 D-Galacturonate degradation (bacteria), D-galacturonate => pyruvate + D-glyceraldehyde 3P Other carbohydrate metabolism #872b4e
+M00632 Galactose degradation, Leloir pathway, galactose => alpha-D-glucose-1P Other carbohydrate metabolism #872b4e
+M00740 Methylaspartate cycle Other carbohydrate metabolism #872b4e
+M00741 Propanoyl-CoA metabolism, propanoyl-CoA => succinyl-CoA Other carbohydrate metabolism #872b4e
+M00761 Undecaprenylphosphate alpha-L-Ara4N biosynthesis, UDP-GlcA => undecaprenyl phosphate alpha-L-Ara4N Other carbohydrate metabolism #872b4e
+M00854 Glycogen biosynthesis, glucose-1P => glycogen--starch Other carbohydrate metabolism #872b4e
+M00855 Glycogen degradation, glycogen => glucose-6P Other carbohydrate metabolism #872b4e
+M00097 beta-Carotene biosynthesis, GGAP => beta-carotene Other terpenoid biosynthesis #6e9368
+M00371 Castasterone biosynthesis, campesterol => castasterone Other terpenoid biosynthesis #6e9368
+M00372 Abscisic acid biosynthesis, beta-carotene => abscisic acid Other terpenoid biosynthesis #6e9368
+M00363 EHEC pathogenicity signature, Shiga toxin Pathogenicity #66406d
+M00542 EHEC--EPEC pathogenicity signature, T3SS and effectors Pathogenicity #66406d
+M00564 Helicobacter pylori pathogenicity signature, cagA pathogenicity island Pathogenicity #66406d
+M00574 Pertussis pathogenicity signature, pertussis toxin Pathogenicity #66406d
+M00575 Pertussis pathogenicity signature, T1SS Pathogenicity #66406d
+M00576 ETEC pathogenicity signature, heat-labile and heat-stable enterotoxins Pathogenicity #66406d
+M00850 Vibrio cholerae pathogenicity signature, cholera toxins Pathogenicity #66406d
+M00852 Vibrio cholerae pathogenicity signature, toxin coregulated pilus Pathogenicity #66406d
+M00853 ETEC pathogenicity signature, colonization factors Pathogenicity #66406d
+M00856 Salmonella enterica pathogenicity signature, typhoid toxin Pathogenicity #66406d
+M00857 Salmonella enterica pathogenicity signature, Vi antigen Pathogenicity #66406d
+M00859 Bacillus anthracis pathogenicity signature, anthrax toxin Pathogenicity #66406d
+M00860 Bacillus anthracis pathogenicity signature, polyglutamic acid capsule biosynthesis Pathogenicity #66406d
+M00161 Photosystem II Photosynthesis #cfa68a
+M00163 Photosystem I Photosynthesis #cfa68a
+M00597 Anoxygenic photosystem II [BR:ko00194] Photosynthesis #cfa68a
+M00598 Anoxygenic photosystem I [BR:ko00194] Photosynthesis #cfa68a
+M00660 Xanthomonas spp. pathogenicity signature, T3SS and effectors Plant pathogenicity #461d27
+M00133 Polyamine biosynthesis, arginine => agmatine => putrescine => spermidine Polyamine biosynthesis #a5b3da
+M00134 Polyamine biosynthesis, arginine => ornithine => putrescine Polyamine biosynthesis #a5b3da
+M00135 GABA biosynthesis, eukaryotes, putrescine => GABA Polyamine biosynthesis #a5b3da
+M00136 GABA biosynthesis, prokaryotes, putrescine => GABA Polyamine biosynthesis #a5b3da
+M00793 dTDP-L-rhamnose biosynthesis Polyketide sugar unit biosynthesis #5c4f24
+M00794 dTDP-6-deoxy-D-allose biosynthesis Polyketide sugar unit biosynthesis #5c4f24
+M00795 dTDP-beta-L-noviose biosynthesis Polyketide sugar unit biosynthesis #5c4f24
+M00796 dTDP-D-mycaminose biosynthesis Polyketide sugar unit biosynthesis #5c4f24
+M00797 dTDP-D-desosamine biosynthesis Polyketide sugar unit biosynthesis #5c4f24
+M00798 dTDP-L-mycarose biosynthesis Polyketide sugar unit biosynthesis #5c4f24
+M00799 dTDP-L-oleandrose biosynthesis Polyketide sugar unit biosynthesis #5c4f24
+M00800 dTDP-L-megosamine biosynthesis Polyketide sugar unit biosynthesis #5c4f24
+M00801 dTDP-L-olivose biosynthesis Polyketide sugar unit biosynthesis #5c4f24
+M00802 dTDP-D-forosamine biosynthesis Polyketide sugar unit biosynthesis #5c4f24
+M00803 dTDP-D-angolosamine biosynthesis Polyketide sugar unit biosynthesis #5c4f24
+M00048 Inosine monophosphate biosynthesis, PRPP + glutamine => IMP Purine metabolism #e0a7d2
+M00049 Adenine ribonucleotide biosynthesis, IMP => ADP,ATP Purine metabolism #e0a7d2
+M00050 Guanine ribonucleotide biosynthesis, IMP => GDP,GTP Purine metabolism #e0a7d2
+M00546 Purine degradation, xanthine => urea Purine metabolism #e0a7d2
+M00046 Pyrimidine degradation, uracil => beta-alanine, thymine => 3-aminoisobutanoate Pyrimidine metabolism #25585e
+M00051 Uridine monophosphate biosynthesis, glutamine (+ PRPP) => UMP Pyrimidine metabolism #25585e
+M00052 Pyrimidine ribonucleotide biosynthesis, UMP => UDP--UTP,CDP--CTP Pyrimidine metabolism #25585e
+M00053 Pyrimidine deoxyribonucleotide biosynthesis, CDP--CTP => dCDP--dCTP,dTDP--dTTP Pyrimidine metabolism #25585e
+M00018 Threonine biosynthesis, aspartate => homoserine => threonine Serine and threonine metabolism #de7d78
+M00020 Serine biosynthesis, glycerate-3P => serine Serine and threonine metabolism #de7d78
+M00033 Ectoine biosynthesis, aspartate => ectoine Serine and threonine metabolism #de7d78
+M00555 Betaine biosynthesis, choline => betaine Serine and threonine metabolism #de7d78
+M00101 Cholesterol biosynthesis, squalene 2,3-epoxide => cholesterol Sterol biosynthesis #4e96a2
+M00102 Ergocalciferol biosynthesis Sterol biosynthesis #4e96a2
+M00103 Cholecalciferol biosynthesis Sterol biosynthesis #4e96a2
+M00104 Bile acid biosynthesis, cholesterol => cholate--chenodeoxycholate Sterol biosynthesis #4e96a2
+M00106 Conjugated bile acid biosynthesis, cholate => taurocholate--glycocholate Sterol biosynthesis #4e96a2
+M00107 Steroid hormone biosynthesis, cholesterol => pregnenolone => progesterone Sterol biosynthesis #4e96a2
+M00108 C21-Steroid hormone biosynthesis, progesterone => corticosterone--aldosterone Sterol biosynthesis #4e96a2
+M00109 C21-Steroid hormone biosynthesis, progesterone => cortisol--cortisone Sterol biosynthesis #4e96a2
+M00110 C19--C18-Steroid hormone biosynthesis, pregnenolone => androstenedione => estrone Sterol biosynthesis #4e96a2
+M00862 beta-Oxidation, peroxisome, tri--dihydroxycholestanoyl-CoA => choloyl--chenodeoxycholoyl-CoA Sterol biosynthesis #4e96a2
+M00176 Assimilatory sulfate reduction, sulfate => H2S Sulfur metabolism #4e96a2
+M00595 Thiosulfate oxidation by SOX complex, thiosulfate => sulfate Sulfur metabolism #4e96a2
+M00596 Dissimilatory sulfate reduction, sulfate => H2S Sulfur metabolism #4e96a2
+M00664 Nodulation Symbiosis #88574e
+M00095 C5 isoprenoid biosynthesis, mevalonate pathway Terpenoid backbone biosynthesis #4e6089
+M00096 C5 isoprenoid biosynthesis, non-mevalonate pathway Terpenoid backbone biosynthesis #4e6089
+M00364 C10-C20 isoprenoid biosynthesis, bacteria Terpenoid backbone biosynthesis #4e6089
+M00365 C10-C20 isoprenoid biosynthesis, archaea Terpenoid backbone biosynthesis #4e6089
+M00366 C10-C20 isoprenoid biosynthesis, plants Terpenoid backbone biosynthesis #4e6089
+M00367 C10-C20 isoprenoid biosynthesis, non-plant eukaryotes Terpenoid backbone biosynthesis #4e6089
+M00849 C5 isoprenoid biosynthesis, mevalonate pathway, archaea Terpenoid backbone biosynthesis #4e6089
+M00778 Type II polyketide backbone biosynthesis, acyl-CoA + malonyl-CoA => polyketide Type II polyketide biosynthesis #af7194
+M00779 Dihydrokalafungin biosynthesis, octaketide => dihydrokalafungin Type II polyketide biosynthesis #af7194
+M00780 Tetracycline--oxytetracycline biosynthesis, pretetramide => tetracycline--oxytetracycline Type II polyketide biosynthesis #af7194
+M00781 Nogalavinone--aklavinone biosynthesis, deoxynogalonate--deoxyaklanonate => nogalavinone--aklavinone Type II polyketide biosynthesis #af7194
+M00782 Mithramycin biosynthesis, 4-demethylpremithramycinone => mithramycin Type II polyketide biosynthesis #af7194
+M00783 Tetracenomycin C--8-demethyltetracenomycin C biosynthesis, tetracenomycin F2 => tetracenomycin C--8-demethyltetracenomycin C Type II polyketide biosynthesis #af7194
+M00784 Elloramycin biosynthesis, 8-demethyltetracenomycin C => elloramycin A Type II polyketide biosynthesis #af7194
+M00823 Chlortetracycline biosynthesis, pretetramide => chlortetracycline Type II polyketide biosynthesis #af7194
\ No newline at end of file
diff --git a/data/MicrobeAnnotator_KEGG/KEGG_Regular_Module_Information.pkl b/data/MicrobeAnnotator_KEGG/KEGG_Regular_Module_Information.pkl
new file mode 100644
index 0000000..c2ff119
Binary files /dev/null and b/data/MicrobeAnnotator_KEGG/KEGG_Regular_Module_Information.pkl differ
diff --git a/data/MicrobeAnnotator_KEGG/KEGG_Structural_Module_Information.pkl b/data/MicrobeAnnotator_KEGG/KEGG_Structural_Module_Information.pkl
new file mode 100644
index 0000000..ba85377
Binary files /dev/null and b/data/MicrobeAnnotator_KEGG/KEGG_Structural_Module_Information.pkl differ
diff --git a/data/MicrobeAnnotator_KEGG/MicrobeAnnotator-KEGG.tar.gz b/data/MicrobeAnnotator_KEGG/MicrobeAnnotator-KEGG.tar.gz
new file mode 100644
index 0000000..8c3f1d8
Binary files /dev/null and b/data/MicrobeAnnotator_KEGG/MicrobeAnnotator-KEGG.tar.gz differ
diff --git a/data/MicrobeAnnotator_KEGG/MicrobeAnnotator-KEGG.tar.gz.md5 b/data/MicrobeAnnotator_KEGG/MicrobeAnnotator-KEGG.tar.gz.md5
new file mode 100644
index 0000000..12fdf2c
--- /dev/null
+++ b/data/MicrobeAnnotator_KEGG/MicrobeAnnotator-KEGG.tar.gz.md5
@@ -0,0 +1 @@
+7207b9efe0124c6e9781cf4cf4fa24de MicrobeAnnotator-KEGG.tar.gz
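The `.md5` sidecar above can be checked with `md5sum -c` after downloading the tarball. A minimal, self-contained sketch using a toy file (hypothetical filenames, not the real archive):

```shell
# Toy stand-in for the real archive (do not overwrite real data)
printf 'toy archive contents\n' > example.tar.gz
# Record the checksum in the same "hash  filename" format as the bundled .md5 file
md5sum example.tar.gz > example.tar.gz.md5
# Recompute and compare; exits non-zero on mismatch
md5sum -c example.tar.gz.md5
```

For the real download, the equivalent would be running `md5sum -c MicrobeAnnotator-KEGG.tar.gz.md5` in the directory containing the tarball.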
diff --git a/data/MicrobeAnnotator_KEGG/README.md b/data/MicrobeAnnotator_KEGG/README.md
new file mode 100644
index 0000000..3c1a62d
--- /dev/null
+++ b/data/MicrobeAnnotator_KEGG/README.md
@@ -0,0 +1,69 @@
+# MicrobeAnnotator-KEGG
+
+**If this is used in any way, please cite the source publication:**
+
+Ruiz-Perez, C.A., Conrad, R.E. & Konstantinidis, K.T. MicrobeAnnotator: a user-friendly, comprehensive functional annotation pipeline for microbial genomes. BMC Bioinformatics 22, 11 (2021). https://doi.org/10.1186/s12859-020-03940-5
+
+**This data has been incorporated from the following source:**
+
+https://github.com/cruizperez/MicrobeAnnotator/tree/master/microbeannotator/data
+
+**File Descriptions:**
+
+* `KEGG_Regular_Module_Information.pkl` - Python dictionary of regular modules from `MicrobeAnnotator` of `{id_module:structured_kegg_orthologs}`
+* `KEGG_Bifurcating_Module_Information.pkl` - Python dictionary of bifurcating modules from `MicrobeAnnotator` of `{id_module:structured_kegg_orthologs}`
+* `KEGG_Structural_Module_Information.pkl` - Python dictionary of structural modules from `MicrobeAnnotator` of `{id_module:structured_kegg_orthologs}`
+* `KEGG_Module_Information.txt` - Table containing KEGG orthologs, higher-level categories, and module colors
+* `KEGG_Module-KOs.pkl` - Flattened dictionary of `{id_module:{KO_1, KO_2, ..., KO_M}}`. Note: this is not structured and should be used cautiously, as KEGG modules and completion calculations are complex. Generated with the Python code below:
+
+```python
+import pickle, glob, os
+from collections import defaultdict
+
+kegg_directory = "{}/MicrobeAnnotator_KEGG/".format(os.environ["VEBA_DATABASE"])
+
+delimiters = [",","_","-","+"]
+
+# Load MicrobeAnnotator KEGG dictionaries
+module_to_kos__unprocessed = defaultdict(set)
+for fp in glob.glob(os.path.join(kegg_directory, "*.pkl")):
+    with open(fp, "rb") as f:
+        d = pickle.load(f)
+
+    for id_module, v1 in d.items():
+        if isinstance(v1, list):
+            try:
+                module_to_kos__unprocessed[id_module].update(v1)
+            except TypeError:
+                # Mixed list of strings and nested lists; add strings whole
+                for v2 in v1:
+                    if isinstance(v2, str):
+                        module_to_kos__unprocessed[id_module].add(v2)
+                    else:
+                        module_to_kos__unprocessed[id_module].update(v2)
+        else:
+            for k2, v2 in v1.items():
+                if isinstance(v2, list):
+                    try:
+                        module_to_kos__unprocessed[id_module].update(v2)
+                    except TypeError:
+                        for v3 in v2:
+                            if isinstance(v3, str):
+                                module_to_kos__unprocessed[id_module].add(v3)
+                            else:
+                                module_to_kos__unprocessed[id_module].update(v3)
+
+# Flatten composite KEGG ortholog identifiers
+module_to_kos__processed = dict()
+for id_module, kos_unprocessed in module_to_kos__unprocessed.items():
+    kos_processed = set()
+    for id_ko in kos_unprocessed:
+        composite = False
+        for sep in delimiters:
+            if sep in id_ko:
+                id_ko = id_ko.replace(sep, ";")
+                composite = True
+        if composite:
+            kos_processed.update(set(map(str.strip, filter(bool, id_ko.split(";")))))
+        else:
+            kos_processed.add(id_ko)
+    module_to_kos__processed[id_module] = kos_processed
+
+# Write the flattened dictionary
+with open(os.path.join(kegg_directory, "KEGG_Module-KOs.pkl"), "wb") as f:
+    pickle.dump(module_to_kos__processed, f)
+```
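The composite-identifier splitting above can be exercised in isolation. A minimal sketch with made-up KO identifiers (the `flatten_ko` helper is illustrative, not part of `MicrobeAnnotator`):

```python
delimiters = [",", "_", "-", "+"]

def flatten_ko(id_ko):
    """Split a composite KO identifier (e.g. 'K00001+K00002') into individual KOs."""
    for sep in delimiters:
        id_ko = id_ko.replace(sep, ";")
    return {token.strip() for token in id_ko.split(";") if token.strip()}

print(sorted(flatten_ko("K00001+K00002")))  # ['K00001', 'K00002']
print(sorted(flatten_ko("K00010")))         # ['K00010']
```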
\ No newline at end of file
diff --git a/data/README.md b/data/README.md
index 10b3b4b..6525e90 100644
--- a/data/README.md
+++ b/data/README.md
@@ -9,4 +9,13 @@ The following fastq files are subsets of the original SRA sequences designed for
| S3 | SRR17458630 | FASTQ | DNA | 2389989 | 75 | 150.4 | 151 | 56.38 |
| S4 | SRR17458638 | FASTQ | DNA | 3142566 | 75 | 150.5 | 151 | 46.34 |
-[**Download**](https://zenodo.org/record/7946802#.ZGVSpuzMKDU)
\ No newline at end of file
+The dataset also includes the following:
+
+* Metagenomic assemblies using metaSPAdes with sorted BAM files from Bowtie2
+* Genomes, gene models, etc.
+* Taxonomy classifications at the genome and genome cluster level
+* Annotations for genes and protein clusters
+* Biosynthetic gene clusters
+* Clusters for genomes and proteins
+
+[**Download**](https://zenodo.org/records/10094990)
\ No newline at end of file
diff --git a/install/README.md b/install/README.md
index d7d56b0..f92e986 100644
--- a/install/README.md
+++ b/install/README.md
@@ -3,16 +3,18 @@ ____________________________________________________________
#### Software installation
One issue with having large-scale pipeline suites with open-source software is the issue of dependencies. One solution for this is to have a modular software structure where each module has its own `conda` environment. This allows for minimizing dependency constraints as this software suite uses an array of diverse packages from different developers.
-The basis for these environments is creating a separate environment for each module with the `VEBA-` prefix and `_env` as the suffix. For example `VEBA-assembly_env` or `VEBA-binning-prokaryotic_env`. Because of this, `VEBA` is currently not available as a `conda` package but each module will be in the near future. In the meantime, please use the `veba/install/install_veba.sh` script which installs each environment from the yaml files in `veba/install/environments/`. After installing the environments, use the `veba/install/download_databases.sh` script to download and configure the databases while also adding the environment variables to the activate/deactivate scripts in each environment. To install anything manually, just read the scripts as they are well documented and refer to different URL and paths for specific installation options.
+The basis for these environments is creating a separate environment for each module with the `VEBA-` prefix and `_env` as the suffix. For example, `VEBA-assembly_env` or `VEBA-binning-prokaryotic_env`. Because of this, `VEBA` is currently not available as a `conda` package, but each module will be in the near future. In the meantime, please use the `veba/install/install.sh` script, which installs each environment from the yaml files in `veba/install/environments/`. After installing the environments, use the `veba/install/download_databases.sh` script to download and configure the databases while also adding the environment variables to the activate/deactivate scripts in each environment. To install anything manually, just read the scripts; they are well documented and refer to the different URLs and paths for specific installation options.
-The majority of the time taken to build database is downloading/decompressing large archives, `Diamond` database creation of `UniRef`, and `MMSEQS2` database creation of microeukaryotic protein database.
+The majority of the time taken to build the database is downloading/decompressing large archives (e.g., `UniRef` & `GTDB`), `Diamond` database creation of `UniRef`, and `MMSEQS2` database creation of the `MicroEuk` databases.
Total size is `243 GB` but if you have certain databases installed already then you can just symlink them so the `VEBA_DATABASE` path has the correct structure. Note, the exact size may vary as Pfam and UniRef are updated regularly.
Each major version will be packaged as a [release](https://github.com/jolespin/veba/releases) which will include a log of module and script versions.
-**Download Anaconda:**
-[https://www.anaconda.com/products/distribution](https://www.anaconda.com/products/distribution)
+**Download Miniconda (or Anaconda):**
+
+* [https://docs.conda.io/projects/miniconda/en/latest/](https://docs.conda.io/projects/miniconda/en/latest/) (Recommended)
+* [https://www.anaconda.com/products/distribution](https://www.anaconda.com/products/distribution)
____________________________________________________________
@@ -33,7 +35,7 @@ Currently, **Conda environments for VEBA are ONLY configured for Linux** and, du
* Download/configure databases
-**0. Clean up your conda installation [Optional, but recommended]**
+**0. Clean up your conda installation [Optional, but highly recommended]**
The `VEBA` installation is going to configure some `conda` environments for you and some of them have quite a bit of packages. To minimize the likelihood of [weird errors](https://forum.qiime2.org/t/valueerror-unsupported-format-character-t-0x54-at-index-3312-when-creating-environment-from-environment-file/25237), it's recommended to do the following:
@@ -83,7 +85,7 @@ The `VEBA` installation is going to configure some `conda` environments for you
```
# For stable version, download and decompress the tarball:
-VERSION="1.3.0"
+VERSION="1.4.0"
wget https://github.com/jolespin/veba/archive/refs/tags/v${VERSION}.tar.gz
tar -xvf v${VERSION}.tar.gz && mv veba-${VERSION} veba
@@ -106,14 +108,16 @@ cd veba/install
The update from `CheckM1` -> `CheckM2` and installation of `antiSMASH` require more memory and may require grid access if head node is limited.
```
-bash install_veba.sh
+bash install.sh
```
**3. Activate the database conda environment, download, and configure databases**
**Recommended resource allocatation:** 48 GB memory (time is dependent on I/O of database repositories)
-⚠️ **This step should use ~48 GB memory** and should be run using a compute grid via SLURM or SunGridEngine. If this command is run on the head node it will likely fail or timeout if a connection is interrupted. The most computationally intensive steps are creating a `Diamond` database of `UniRef` and a `MMSEQS2` database of the microeukaryotic protein database. Note the duration will depend on several factors including your internet connection speed and the I/O of public repositories.
+⚠️ **This step should use ~48 GB memory** and should be run using a compute grid via `SLURM` or `SunGridEngine`. **If this command is run on the head node it will likely fail or timeout if a connection is interrupted.** The most computationally intensive steps are creating a `Diamond` database of `UniRef` and a `MMSEQS2` database of the `MicroEuk100/90/50`.
+
+Note the duration will depend on several factors including your internet connection speed and the I/O of public repositories.
**Future releases will split the downloading and configuration to better make use of resources.**
@@ -163,7 +167,7 @@ qsub -o logs/${N}.o -e logs/${N}.e -cwd -N ${N} -j y -pe threaded ${N_JOBS} "${C
PARTITION=[partition name]
ACCOUNT=[account name]
-sbatch -A ${ACCOUNT} -p ${PARTITION} -J ${N} -N 1 -c ${N_JOBS} --ntasks-per-node=1 -o logs/${N}.o -e logs/${N}.e --export=ALL -t 12:00:00 --mem=64G --wrap="${CMD}"
+sbatch -A ${ACCOUNT} -p ${PARTITION} -J ${N} -N 1 -c ${N_JOBS} --ntasks-per-node=1 -o logs/${N}.o -e logs/${N}.e --export=ALL -t 16:00:00 --mem=24G --wrap="${CMD}"
```
Now, you should have the following environments:
@@ -183,6 +187,7 @@ VEBA-phylogeny_env
VEBA-preprocess_env
VEBA-profile_env
```
+
All the environments should have the `VEBA_DATABASE` environment variable set. If not, then add it manually to ~/.bash_profile: `export VEBA_DATABASE=/path/to/veba_database`.
You can check to make sure the `conda` environments were created and all of the environment variables were created using the following command:
@@ -218,7 +223,7 @@ ____________________________________________________________
```
# Remove conda enivronments
-bash uninstall_veba.sh
+bash uninstall.sh
# Remove VEBA database
rm -rfv /path/to/veba_database
@@ -230,6 +235,6 @@ ____________________________________________________________
There are currently 2 ways to update veba:
 1. Basic uninstall reinstall - You can uninstall and reinstall using the scripts in `veba/install/` directory. It's recommended to do a fresh reinstall when updating from `v1.0.x` → `v1.2.x`.
-2. Patching existing installation - Complete reinstalls of *VEBA* environments and databases is time consuming so [we've detailed how to do specific patches **for advanced users**](PATCHES.md). If you don't feel comfortable running these commands, then just do a fresh install if you would like to update.
+2. Patching existing installation - A guide for updating specific modules in an existing installation is TBD.
diff --git a/install/PATCHES.md b/install/deprecated/PATCHES.md
similarity index 100%
rename from install/PATCHES.md
rename to install/deprecated/PATCHES.md
diff --git a/install/download_databases.sh b/install/download_databases.sh
index 12833fd..06c4d48 100644
--- a/install/download_databases.sh
+++ b/install/download_databases.sh
@@ -1,11 +1,12 @@
#!/bin/bash
-# __version__ = "2023.10.23"
-# VEBA_DATABASE_VERSION = "VDB_v5.2"
-# MICROEUKAYROTIC_DATABASE_VERSION = "VDB-Microeukaryotic_v2.1"
+# __version__ = "2023.12.11"
+# VEBA_DATABASE_VERSION = "VDB_v6"
+# MICROEUKARYOTIC_DATABASE_VERSION = "MicroEuk_v3"
# Create database
DATABASE_DIRECTORY=${1:-"."}
REALPATH_DATABASE_DIRECTORY=$(realpath $DATABASE_DIRECTORY)
+SCRIPT_DIRECTORY=$(dirname "$0")
# N_JOBS=$(2:-"1")
@@ -28,7 +29,7 @@ echo ". .. ... ..... ........ ............."
echo "i * Processing NCBITaxonomy"
echo ". .. ... ..... ........ ............."
mkdir -v -p ${DATABASE_DIRECTORY}/Classify/NCBITaxonomy
-wget -v -P ${DATABASE_DIRECTORY}/Classify/NCBITaxonomy https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.FULL.gz
+# wget -v -P ${DATABASE_DIRECTORY}/Classify/NCBITaxonomy https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.FULL.gz
wget -v -P ${DATABASE_DIRECTORY} https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
# python -c 'import sys; from ete3 import NCBITaxa; NCBITaxa(taxdump_file="%s/taxdump.tar.gz"%(sys.argv[1]), dbfile="%s/Classify/NCBITaxonomy/taxa.sqlite"%(sys.argv[1]))' $DATABASE_DIRECTORY
tar xzfv ${DATABASE_DIRECTORY}/taxdump.tar.gz -C ${DATABASE_DIRECTORY}/Classify/NCBITaxonomy/
@@ -86,18 +87,56 @@ echo ". .. ... ..... ........ ............."
echo "v * Processing Microeukaryotic MMSEQS2 database"
echo ". .. ... ..... ........ ............."
-# Download v2.1 from Zenodo
-wget -v -O ${DATABASE_DIRECTORY}/Microeukaryotic.tar.gz https://zenodo.org/record/7485114/files/VDB-Microeukaryotic_v2.tar.gz?download=1
-mkdir -p ${DATABASE_DIRECTORY}/Classify/Microeukaryotic && tar -xvzf ${DATABASE_DIRECTORY}/Microeukaryotic.tar.gz -C ${DATABASE_DIRECTORY}/Classify/Microeukaryotic --strip-components=1
-mmseqs createdb ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.faa.gz ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/microeukaryotic
-rm -rf ${DATABASE_DIRECTORY}/Microeukaryotic.tar.gz
+## Download v2.1 from Zenodo
+# wget -v -O ${DATABASE_DIRECTORY}/Microeukaryotic.tar.gz https://zenodo.org/record/7485114/files/VDB-Microeukaryotic_v2.tar.gz?download=1
+# mkdir -p ${DATABASE_DIRECTORY}/Classify/Microeukaryotic && tar -xvzf ${DATABASE_DIRECTORY}/Microeukaryotic.tar.gz -C ${DATABASE_DIRECTORY}/Classify/Microeukaryotic --strip-components=1
+# mmseqs createdb --compressed 1 ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.faa.gz ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/microeukaryotic
+# rm -rf ${DATABASE_DIRECTORY}/Microeukaryotic.tar.gz
-# eukaryota_odb10 subset of Microeukaryotic Protein Database
-wget -v -O ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.eukaryota_odb10.list https://zenodo.org/record/7485114/files/reference.eukaryota_odb10.list?download=1
-seqkit grep -f ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.eukaryota_odb10.list ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.faa.gz > ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.eukaryota_odb10.faa
-mmseqs createdb ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.eukaryota_odb10.faa ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/microeukaryotic.eukaryota_odb10
-rm -rf ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.eukaryota_odb10.faa
-rm -rf ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.faa.gz # Comment this out if you want to keep the actual protein sequences
+# # eukaryota_odb10 subset of Microeukaryotic Protein Database
+# wget -v -O ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.eukaryota_odb10.list https://zenodo.org/record/7485114/files/reference.eukaryota_odb10.list?download=1
+# seqkit grep -f ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.eukaryota_odb10.list ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.faa.gz > ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.eukaryota_odb10.faa
+# mmseqs createdb --compressed 1 ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.eukaryota_odb10.faa ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/microeukaryotic.eukaryota_odb10
+# rm -rf ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.eukaryota_odb10.faa
+# rm -rf ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.faa.gz # Comment this out if you want to keep the actual protein sequences
+
+# Download MicroEuk_v3 from Zenodo
+wget -v -O ${DATABASE_DIRECTORY}/MicroEuk_v3.tar.gz https://zenodo.org/records/10139451/files/MicroEuk_v3.tar.gz?download=1
+tar xvzf ${DATABASE_DIRECTORY}/MicroEuk_v3.tar.gz -C ${DATABASE_DIRECTORY}
+mkdir -p ${DATABASE_DIRECTORY}/Classify/MicroEuk
+
+# Source Taxonomy
+cp -rf ${DATABASE_DIRECTORY}/MicroEuk_v3/source_taxonomy.tsv.gz ${DATABASE_DIRECTORY}/Classify/MicroEuk
+
+# MicroEuk100
+gzip -d ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk100.faa.gz
+mmseqs createdb --compressed 1 ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk100.faa ${DATABASE_DIRECTORY}/Classify/MicroEuk/MicroEuk100
+
+# MicroEuk100.eukaryota_odb10
+gzip -d ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk100.eukaryota_odb10.list.gz
+seqkit grep -f ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk100.eukaryota_odb10.list ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk100.faa | mmseqs createdb --compressed 1 stdin ${DATABASE_DIRECTORY}/Classify/MicroEuk/MicroEuk100.eukaryota_odb10
+
+# MicroEuk90
+gzip -d -c ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk90_clusters.tsv.gz | cut -f1 | sort -u > ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk90.list
+seqkit grep -f ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk90.list ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk100.faa | mmseqs createdb --compressed 1 stdin ${DATABASE_DIRECTORY}/Classify/MicroEuk/MicroEuk90
+
+# MicroEuk50
+gzip -d -c ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk50_clusters.tsv.gz | cut -f1 | sort -u > ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk50.list
+seqkit grep -f ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk50.list ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk100.faa | mmseqs createdb --compressed 1 stdin ${DATABASE_DIRECTORY}/Classify/MicroEuk/MicroEuk50
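The MicroEuk90/50 steps above rely on the MMseqs2 cluster-TSV convention: column 1 is the representative sequence ID and column 2 is a member, so `cut -f1 | sort -u` recovers the set of representatives. A minimal sketch with hypothetical IDs:

```shell
# Sketch (hypothetical IDs): an MMseqs2-style cluster TSV maps a
# representative (column 1) to each member (column 2); the unique values
# of column 1 are the representative sequences kept in the reduced set.
printf 'repA\tmem1\nrepA\tmem2\nrepB\tmem3\n' | cut -f1 | sort -u
```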
+
+# source_to_lineage.dict.pkl.gz
+build_source_to_lineage_dictionary.py -i ${DATABASE_DIRECTORY}/MicroEuk_v3/source_taxonomy.tsv.gz -o ${DATABASE_DIRECTORY}/Classify/MicroEuk/source_to_lineage.dict.pkl.gz
+
+# target_to_source.dict.pkl.gz
+build_target_to_source_dictionary.py -i ${DATABASE_DIRECTORY}/MicroEuk_v3/identifier_mapping.proteins.tsv.gz -o ${DATABASE_DIRECTORY}/Classify/MicroEuk/target_to_source.dict.pkl.gz
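The two `.dict.pkl.gz` artifacts above are gzipped pickled dictionaries: one mapping source organism to lineage, the other mapping protein target to source organism. A rough sketch of the assumed shape (toy keys and values, not the scripts' actual implementation):

```python
import gzip
import pickle

# Hypothetical toy mappings mirroring the assumed shape of the two artifacts
source_to_lineage = {"SourceA": "Eukaryota;Fungi;Ascomycota"}
target_to_source = {"protein_001": "SourceA"}

# Serialize as a gzipped pickle, the same on-disk format as the artifacts
with gzip.open("/tmp/source_to_lineage.dict.pkl.gz", "wb") as f:
    pickle.dump(source_to_lineage, f)

# Round-trip to confirm the mapping survives serialization
with gzip.open("/tmp/source_to_lineage.dict.pkl.gz", "rb") as f:
    reloaded = pickle.load(f)
```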
+
+# Remove intermediate files
+rm -rf ${DATABASE_DIRECTORY}/MicroEuk_v3/
+rm -rf ${DATABASE_DIRECTORY}/MicroEuk_v3.tar.gz
# MarkerSets
echo ". .. ... ..... ........ ............."
@@ -213,11 +252,17 @@ rm -rf ${DATABASE_DIRECTORY}/Contamination/AntiFam/*.seed
mkdir -v -p ${DATABASE_DIRECTORY}/Contamination/kmers
wget -v -O ${DATABASE_DIRECTORY}/Contamination/kmers/ribokmers.fa.gz https://figshare.com/ndownloader/files/36220587
-# Replacing GRCh38 with CHM13v2.0 in v2022.10.18
+# T2T-CHM13v2.0
+# Bowtie2 Index
wget -v -P ${DATABASE_DIRECTORY} https://genome-idx.s3.amazonaws.com/bt/chm13v2.0.zip
unzip -d ${DATABASE_DIRECTORY}/Contamination/ ${DATABASE_DIRECTORY}/chm13v2.0.zip
rm -rf ${DATABASE_DIRECTORY}/chm13v2.0.zip
+# # Minimap2 index (uncomment if you plan to use long reads; 7.1 GB)
+# wget -v -P ${DATABASE_DIRECTORY} https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz
+# minimap2 -d ${DATABASE_DIRECTORY}/Contamination/chm13v2.0/chm13v2.0.mmi ${DATABASE_DIRECTORY}/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz
+# rm -rf ${DATABASE_DIRECTORY}/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz
+
echo ". .. ... ..... ........ ............."
echo "xii * Adding the following environment variable to VEBA environments: export VEBA_DATABASE=${REALPATH_DATABASE_DIRECTORY}"
# CONDA_BASE=$(which conda | python -c "import sys; print('/'.join(sys.stdin.read().split('/')[:-2]))")
diff --git a/install/environments/VEBA-assembly_env.yml b/install/environments/VEBA-assembly_env.yml
index 692c79e..6d5a013 100644
--- a/install/environments/VEBA-assembly_env.yml
+++ b/install/environments/VEBA-assembly_env.yml
@@ -1,4 +1,4 @@
-name: VEBA-assembly_env__2023.5.15
+name: VEBA-assembly_env__2023.11.30
channels:
- conda-forge
- bioconda
@@ -16,15 +16,16 @@ dependencies:
- bz2file=0.98=py_0
- bzip2=1.0.8=h7f98852_4
- c-ares=1.18.1=h7f98852_0
- - ca-certificates=2022.12.7=ha878542_0
+ - ca-certificates=2023.11.17=hbcca054_0
- cairo=1.16.0=ha61ee94_1014
- - certifi=2022.12.7=pyhd8ed1ab_0
+ - certifi=2023.11.17=pyhd8ed1ab_0
- cffi=1.15.1=py39he91dace_2
- charset-normalizer=2.1.1=pyhd8ed1ab_0
- colorama=0.4.6=pyhd8ed1ab_0
- coreutils=9.3=h0b41bf4_0
- - cryptography=38.0.4=py39hd97740a_0
+ - cryptography=41.0.7=py39hd4f0224_0
- expat=2.5.0=h27087fc_0
+ - flye=2.9.3=py39hd65a603_0
- font-ttf-dejavu-sans-mono=2.37=hab24e00_0
- font-ttf-inconsolata=3.000=h77eed37_0
- font-ttf-source-code-pro=2.038=h77eed37_0
@@ -38,21 +39,22 @@ dependencies:
- giflib=5.2.1=h36c2ea0_2
- graphite2=1.3.13=h58526e2_1001
- harfbuzz=5.3.0=h418a68e_0
- - htslib=1.16=h6bc39ce_0
+ - htslib=1.18=h81da01d_0
- icu=70.1=h27087fc_0
- idna=3.4=pyhd8ed1ab_0
- jpeg=9e=h166bdaf_2
+ - k8=0.2.5=hdcf5f25_4
- kernel-headers_linux-64=3.10.0=h4a8ded7_13
- keyutils=1.6.1=h166bdaf_0
- - krb5=1.19.3=h3790be6_0
- - lcms2=2.14=h6ed2654_0
+ - krb5=1.21.2=h659d440_0
+ - lcms2=2.12=hddcbb42_0
- ld_impl_linux-64=2.39=hcc3a1bd_1
- lerc=4.0.0=h27087fc_0
- libblas=3.9.0=16_linux64_openblas
- libcblas=3.9.0=16_linux64_openblas
- - libcups=2.3.3=h3e49a29_2
- - libcurl=7.86.0=h7bff187_1
- - libdeflate=1.13=h166bdaf_0
+ - libcups=2.3.3=h4637d8d_4
+ - libcurl=8.2.1=hca28451_0
+ - libdeflate=1.19=hd590300_0
- libedit=3.1.20191231=he28a2e2_2
- libev=4.33=h516909a_1
- libffi=3.4.2=h7f98852_5
@@ -64,14 +66,14 @@ dependencies:
- libhwloc=2.8.0=h32351e8_1
- libiconv=1.17=h166bdaf_0
- liblapack=3.9.0=16_linux64_openblas
- - libnghttp2=1.47.0=hdcd2b5c_1
+ - libnghttp2=1.52.0=h61bc06f_0
- libnsl=2.0.0=h7f98852_0
- libopenblas=0.3.21=pthreads_h78a6416_3
- libpng=1.6.39=h753d276_0
- libsqlite=3.40.0=h753d276_0
- - libssh2=1.10.0=haa6b8db_3
+ - libssh2=1.11.0=h0841786_0
- libstdcxx-ng=12.2.0=h46fd767_19
- - libtiff=4.4.0=h0e0dad5_3
+ - libtiff=4.2.0=hf544144_3
- libuuid=2.32.1=h7f98852_1000
- libwebp-base=1.2.4=h166bdaf_0
- libxcb=1.13=h7f98852_1004
@@ -79,11 +81,12 @@ dependencies:
- libzlib=1.2.13=h166bdaf_4
- llvm-openmp=8.0.1=hc9558a2_0
- megahit=1.2.9=h2e03b76_1
+ - minimap2=2.26=he4a0461_2
- ncurses=6.3=h27087fc_1
- numpy=1.23.5=py39h3d75532_0
- - openjdk=17.0.3=hafdced1_4
+ - openjdk=11.0.1=h516909a_1016
- openmp=8.0.1=0
- - openssl=1.1.1t=h0b41bf4_0
+ - openssl=3.2.0=hd590300_1
- pandas=1.5.2=py39h4661b88_0
- pathlib2=2.3.7.post1=py39hf3d152e_2
- pbzip2=1.1.13=0
@@ -93,9 +96,9 @@ dependencies:
- pixman=0.40.0=h36c2ea0_0
- pthread-stubs=0.4=h36c2ea0_1001
- pycparser=2.21=pyhd8ed1ab_0
- - pyopenssl=22.1.0=pyhd8ed1ab_0
+ - pyopenssl=23.3.0=pyhd8ed1ab_0
- pysocks=1.7.1=pyha2e5f31_6
- - python=3.9.15=h47a2c10_0_cpython
+ - python=3.9.16=h2782a2a_0_cpython
- python-dateutil=2.8.2=pyhd8ed1ab_0
- python-tzdata=2022.7=pyhd8ed1ab_0
- python_abi=3.9=3_cp39
diff --git a/install/environments/VEBA-cluster_env.yml b/install/environments/VEBA-cluster_env.yml
index 2f2d189..b9ff294 100644
--- a/install/environments/VEBA-cluster_env.yml
+++ b/install/environments/VEBA-cluster_env.yml
@@ -1,4 +1,4 @@
-name: VEBA-cluster_env__v2023.5.15
+name: VEBA-cluster_env__v2023.12.8
channels:
- conda-forge
- bioconda
@@ -9,27 +9,36 @@ dependencies:
- _openmp_mutex=4.5=2_gnu
- aria2=1.36.0=h1e4e653_3
- biopython=1.80=py311hd4cff14_0
+ - blast=2.14.1=pl5321h6f7f691_0
- brotlipy=0.7.0=py311hd4cff14_1005
- bz2file=0.98=py_0
- bzip2=1.0.8=h7f98852_4
- c-ares=1.18.1=h7f98852_0
- - ca-certificates=2022.12.7=ha878542_0
- - certifi=2022.12.7=pyhd8ed1ab_0
+ - ca-certificates=2023.11.17=hbcca054_0
+ - certifi=2023.11.17=pyhd8ed1ab_0
- cffi=1.15.1=py311h409f033_3
- charset-normalizer=2.1.1=pyhd8ed1ab_0
- colorama=0.4.6=pyhd8ed1ab_0
- coreutils=9.3=h0b41bf4_0
- cryptography=39.0.0=py311h9b4c7bb_0
- - fastani=1.33=h0fdf51a_1
+ - curl=8.1.2=h409715c_0
+ - diamond=2.1.8=h43eeafb_0
+ - entrez-direct=16.2=he881be0_1
+ - fastani=1.34=h4dfc31f_1
- gawk=5.1.0=h7f98852_0
- genopype=2023.5.15=py_0
- gettext=0.21.1=h27087fc_0
- gsl=2.7=he838d99_0
- icu=70.1=h27087fc_0
- idna=3.4=pyhd8ed1ab_0
+ - keyutils=1.6.1=h166bdaf_0
+ - krb5=1.20.1=h81ceb04_0
- ld_impl_linux-64=2.40=h41732ed_0
- libblas=3.9.0=16_linux64_openblas
- libcblas=3.9.0=16_linux64_openblas
+ - libcurl=8.1.2=h409715c_0
+ - libedit=3.1.20191231=he28a2e2_2
+ - libev=4.33=h516909a_1
- libffi=3.4.2=h7f98852_5
- libgcc-ng=12.2.0=h65d4601_19
- libgfortran-ng=12.2.0=h69a702a_19
@@ -38,6 +47,7 @@ dependencies:
- libiconv=1.17=h166bdaf_0
- libidn2=2.3.4=h166bdaf_0
- liblapack=3.9.0=16_linux64_openblas
+ - libnghttp2=1.52.0=h61bc06f_0
- libnsl=2.0.0=h7f98852_0
- libopenblas=0.3.21=pthreads_h78a6416_3
- libsqlite=3.40.0=h753d276_0
@@ -48,13 +58,34 @@ dependencies:
- libxml2=2.10.3=h7463322_0
- libzlib=1.2.13=h166bdaf_4
- mmseqs2=14.7e284=pl5321hf1761c0_0
+ - ncbi-vdb=3.0.0=pl5321h87f3376_0
- ncurses=6.3=h27087fc_1
- networkx=3.0=pyhd8ed1ab_0
- numpy=1.24.1=py311h8e6699e_0
- - openssl=3.0.8=h0b41bf4_0
+ - openssl=3.2.0=hd590300_1
- pandas=1.5.3=py311h2872171_0
- pathlib2=2.3.7.post1=py311h38be061_2
+ - pcre=8.45=h9c3ff4c_0
- perl=5.32.1=2_h7f98852_perl5
+ - perl-archive-tar=2.40=pl5321hdfd78af_0
+ - perl-carp=1.38=pl5321hdfd78af_4
+ - perl-common-sense=3.75=pl5321hdfd78af_0
+ - perl-compress-raw-bzip2=2.201=pl5321h87f3376_1
+ - perl-compress-raw-zlib=2.105=pl5321h87f3376_0
+ - perl-encode=3.19=pl5321hec16e2b_1
+ - perl-exporter=5.72=pl5321hdfd78af_2
+ - perl-exporter-tiny=1.002002=pl5321hdfd78af_0
+ - perl-extutils-makemaker=7.70=pl5321hd8ed1ab_0
+ - perl-io-compress=2.201=pl5321hdbdd923_2
+ - perl-io-zlib=1.14=pl5321hdfd78af_0
+ - perl-json=4.10=pl5321hdfd78af_0
+ - perl-json-xs=2.34=pl5321h4ac6f70_6
+ - perl-list-moreutils=0.430=pl5321hdfd78af_0
+ - perl-list-moreutils-xs=0.430=pl5321h031d066_2
+ - perl-parent=0.236=pl5321hdfd78af_2
+ - perl-pathtools=3.75=pl5321hec16e2b_3
+ - perl-scalar-list-utils=1.62=pl5321hec16e2b_1
+ - perl-types-serialiser=1.01=pl5321hdfd78af_0
- pip=23.0=pyhd8ed1ab_0
- pycparser=2.21=pyhd8ed1ab_0
- pyopenssl=23.0.0=pyhd8ed1ab_0
@@ -71,6 +102,7 @@ dependencies:
- seqkit=2.3.1=h9ee0642_0
- setuptools=66.1.1=pyhd8ed1ab_0
- six=1.16.0=pyh6c4a22f_0
+ - skani=0.2.1=h4ac6f70_0
- soothsayer_utils=2022.6.24=py_0
- tk=8.6.12=h27826a3_0
- tqdm=4.64.1=pyhd8ed1ab_0
@@ -80,4 +112,5 @@ dependencies:
- wget=1.20.3=ha35d2d1_1
- wheel=0.38.4=pyhd8ed1ab_0
- xz=5.2.6=h166bdaf_0
- - zlib=1.2.13=h166bdaf_4
\ No newline at end of file
+ - zlib=1.2.13=h166bdaf_4
+ - zstd=1.5.5=hfc55251_0
\ No newline at end of file
diff --git a/install/environments/VEBA-database_env.yml b/install/environments/VEBA-database_env.yml
index 8e56e1c..f78e9c4 100644
--- a/install/environments/VEBA-database_env.yml
+++ b/install/environments/VEBA-database_env.yml
@@ -1,4 +1,4 @@
-name: VEBA-database_env__v2023.6.20
+name: VEBA-database_env__v2023.11.30
channels:
- conda-forge
- bioconda
@@ -14,8 +14,8 @@ dependencies:
- bz2file=0.98=py_0
- bzip2=1.0.8=h7f98852_4
- c-ares=1.19.1=hd590300_0
- - ca-certificates=2023.5.7=hbcca054_0
- - certifi=2023.5.7=pyhd8ed1ab_0
+ - ca-certificates=2023.11.17=hbcca054_0
+ - certifi=2023.11.17=pyhd8ed1ab_0
- charset-normalizer=3.1.0=pyhd8ed1ab_0
- colorama=0.4.6=pyhd8ed1ab_0
- coreutils=9.3=h0b41bf4_0
@@ -27,6 +27,7 @@ dependencies:
- gettext=0.21.1=h27087fc_0
- icu=72.1=hcb278e6_0
- idna=3.4=pyhd8ed1ab_0
+ - k8=0.2.5=hdcf5f25_4
- keyutils=1.6.1=h166bdaf_0
- krb5=1.20.1=h81ceb04_0
- ld_impl_linux-64=2.40=h41732ed_0
@@ -57,10 +58,11 @@ dependencies:
- libuuid=2.38.1=h0b41bf4_0
- libxml2=2.11.4=h0d562d8_0
- libzlib=1.2.13=hd590300_5
+ - minimap2=2.26=he4a0461_2
- mmseqs2=14.7e284=pl5321h6a68c12_2
- ncurses=6.4=hcb278e6_0
- numpy=1.25.0=py311h64a7726_0
- - openssl=3.1.1=hd590300_1
+ - openssl=3.2.0=hd590300_1
- pandas=2.0.2=py311h320fe9a_0
- pathlib2=2.3.7.post1=py311h38be061_2
- pcre=8.45=h9c3ff4c_0
diff --git a/install/environments/VEBA-mapping_env.yml b/install/environments/VEBA-mapping_env.yml
index 5af32f1..feb6918 100644
--- a/install/environments/VEBA-mapping_env.yml
+++ b/install/environments/VEBA-mapping_env.yml
@@ -1,106 +1,84 @@
-name: VEBA-mapping_env__v2023.7.25
+name: VEBA-mapping_env__v2023.11.17
channels:
- conda-forge
- bioconda
- jolespin
- defaults
+ - qiime2
dependencies:
- _libgcc_mutex=0.1=conda_forge
- - _openmp_mutex=4.5=1_gnu
- - anndata=0.9.0=pyhd8ed1ab_0
- - bbmap=38.95=h5c4e2a8_1
- - biom-format=2.1.14=py39h72bdee0_2
- - biopython=1.79=py39h3811e60_1
- - bowtie2=2.5.1=py39h6fed5c7_2
- - brotlipy=0.7.0=py39h3811e60_1003
+ - _openmp_mutex=4.5=2_gnu
+ - biopython=1.81=py310h2372a71_1
+ - bowtie2=2.5.2=py310ha0a81b8_0
+ - brotli-python=1.1.0=py310hc6cd4ac_1
- bz2file=0.98=py_0
- - bzip2=1.0.8=h7f98852_4
- - c-ares=1.18.1=h7f98852_0
+ - bzip2=1.0.8=hd590300_5
+ - c-ares=1.21.0=hd590300_0
- ca-certificates=2023.7.22=hbcca054_0
- - cached-property=1.5.2=hd8ed1ab_1
- - cached_property=1.5.2=pyha770c72_1
- certifi=2023.7.22=pyhd8ed1ab_0
- - cffi=1.15.0=py39h4bc2ebd_0
- - charset-normalizer=2.0.12=pyhd8ed1ab_0
- - click=8.1.3=unix_pyhd8ed1ab_2
- - colorama=0.4.4=pyh9f0ad1d_0
- - coreutils=9.3=h0b41bf4_0
- - cryptography=41.0.2=py39hd4f0224_0
+ - charset-normalizer=3.3.2=pyhd8ed1ab_0
+ - colorama=0.4.6=pyhd8ed1ab_0
+ - coreutils=9.4=hd590300_0
- genopype=2023.5.15=py_0
- - h5py=3.7.0=nompi_py39h63b1161_100
- - hdf5=1.12.1=nompi_h4df4325_104
- - htslib=1.17=h81da01d_2
- - icu=72.1=hcb278e6_0
- - idna=3.3=pyhd8ed1ab_0
- - importlib-metadata=6.3.0=pyha770c72_0
- - importlib_metadata=6.3.0=hd8ed1ab_0
+ - htslib=1.18=h81da01d_0
+ - icu=73.2=h59595ed_0
+ - idna=3.4=pyhd8ed1ab_0
- keyutils=1.6.1=h166bdaf_0
- - krb5=1.21.1=h659d440_0
- - ld_impl_linux-64=2.36.1=hea4e1c9_2
- - libblas=3.9.0=13_linux64_openblas
- - libcblas=3.9.0=13_linux64_openblas
- - libcurl=8.2.0=hca28451_0
- - libdeflate=1.18=h0b41bf4_0
+ - krb5=1.21.2=h659d440_0
+ - ld_impl_linux-64=2.40=h41732ed_0
+ - libblas=3.9.0=19_linux64_openblas
+ - libcblas=3.9.0=19_linux64_openblas
+ - libcurl=8.4.0=hca28451_0
+ - libdeflate=1.19=hd590300_0
- libedit=3.1.20191231=he28a2e2_2
- libev=4.33=h516909a_1
- libffi=3.4.2=h7f98852_5
- - libgcc-ng=12.2.0=h65d4601_19
- - libgfortran-ng=11.2.0=h69a702a_12
- - libgfortran5=11.2.0=h5c6108e_12
- - libgomp=12.2.0=h65d4601_19
- - libhwloc=2.9.1=nocuda_h7313eea_6
+ - libgcc-ng=13.2.0=h807b86a_3
+ - libgfortran-ng=13.2.0=h69a702a_3
+ - libgfortran5=13.2.0=ha4646dd_3
+ - libgomp=13.2.0=h807b86a_3
+ - libhwloc=2.9.3=default_h554bfaf_1009
- libiconv=1.17=h166bdaf_0
- - liblapack=3.9.0=13_linux64_openblas
- - libnghttp2=1.52.0=h61bc06f_0
- - libnsl=2.0.0=h7f98852_0
- - libopenblas=0.3.18=pthreads_h8fe5266_0
- - libsqlite=3.42.0=h2797004_0
+ - liblapack=3.9.0=19_linux64_openblas
+ - libnghttp2=1.58.0=h47da74e_0
+ - libnsl=2.0.1=hd590300_0
+ - libopenblas=0.3.24=pthreads_h413a1c8_0
+ - libsqlite=3.44.0=h2797004_0
- libssh2=1.11.0=h0841786_0
- - libstdcxx-ng=12.2.0=h46fd767_19
- - libuuid=2.32.1=h7f98852_1000
- - libxml2=2.11.4=h0d562d8_0
- - libzlib=1.2.13=h166bdaf_4
- - lz4-c=1.9.3=h9c3ff4c_1
- - natsort=8.3.1=pyhd8ed1ab_0
- - ncurses=6.3=h9c3ff4c_0
- - numpy=1.24.2=py39h7360e5f_0
- - openjdk=8.0.312=h7f98852_0
- - openssl=3.1.1=hd590300_1
- - packaging=23.0=pyhd8ed1ab_0
- - pandas=1.4.1=py39hde0f152_0
- - pathlib2=2.3.7.post1=py39hf3d152e_0
- - pbzip2=1.1.13=0
- - perl=5.32.1=2_h7f98852_perl5
- - pip=22.0.3=pyhd8ed1ab_0
- - pycparser=2.21=pyhd8ed1ab_0
- - pyopenssl=23.2.0=pyhd8ed1ab_1
- - pysocks=1.7.1=py39hf3d152e_4
- - python=3.9.16=h2782a2a_0_cpython
+ - libstdcxx-ng=13.2.0=h7e041cc_3
+ - libuuid=2.38.1=h0b41bf4_0
+ - libxml2=2.11.5=h232c23b_1
+ - libzlib=1.2.13=hd590300_5
+ - ncurses=6.4=h59595ed_2
+ - numpy=1.26.0=py310hb13e2d6_0
+ - openssl=3.1.4=hd590300_0
+ - pandas=2.1.3=py310hcc13569_0
+ - pathlib2=2.3.7.post1=py310hff52083_3
+ - perl=5.32.1=4_hd590300_perl5
+ - pip=23.3.1=pyhd8ed1ab_0
+ - pysocks=1.7.1=pyha2e5f31_6
+ - python=3.10.13=hd12c33a_0_cpython
- python-dateutil=2.8.2=pyhd8ed1ab_0
- - python-tzdata=2021.5=pyhd8ed1ab_0
- - python_abi=3.9=2_cp39
- - pytz=2021.3=pyhd8ed1ab_0
- - pytz-deprecation-shim=0.1.0.post0=py39hf3d152e_1
+ - python-tzdata=2023.3=pyhd8ed1ab_0
+ - python_abi=3.10=4_cp310
+ - pytz=2023.3.post1=pyhd8ed1ab_0
- readline=8.2=h8228510_1
- - requests=2.27.1=pyhd8ed1ab_0
- - samtools=1.17=hd87286a_1
- - scandir=1.10.0=py39h3811e60_4
- - scipy=1.9.3=py39hddc5342_2
- - setuptools=60.9.3=py39hf3d152e_0
+ - requests=2.31.0=pyhd8ed1ab_0
+ - salmon=0.8.1=0
+ - samtools=1.18=h50ea8bc_1
+ - scandir=1.10.0=py310h2372a71_7
+ - seqkit=2.6.0=h9ee0642_0
+ - setuptools=68.2.2=pyhd8ed1ab_0
- six=1.16.0=pyh6c4a22f_0
- soothsayer_utils=2022.6.24=py_0
- - sqlite=3.37.0=h9cd32fc_0
- - star=2.7.10a=h9ee0642_0
- - subread=2.0.3=h7132678_1
- - tbb=2021.9.0=hf52228f_0
- - tk=8.6.12=h27826a3_0
- - tqdm=4.62.3=pyhd8ed1ab_0
- - typing_extensions=4.5.0=pyha770c72_0
- - tzdata=2021e=he74cb21_0
- - tzlocal=4.1=py39hf3d152e_1
- - urllib3=1.26.8=pyhd8ed1ab_1
- - wheel=0.37.1=pyhd8ed1ab_0
+ - subread=2.0.6=he4a0461_0
+ - tbb=2021.10.0=h00ab1b0_2
+ - tk=8.6.13=noxft_h4845f30_101
+ - tqdm=4.66.1=pyhd8ed1ab_0
+ - tzdata=2023c=h71feb2d_0
+ - tzlocal=5.2=py310hff52083_0
+ - urllib3=2.1.0=pyhd8ed1ab_0
+ - wheel=0.41.3=pyhd8ed1ab_0
- xz=5.2.6=h166bdaf_0
- - zipp=3.15.0=pyhd8ed1ab_0
- - zlib=1.2.13=h166bdaf_4
- - zstd=1.5.2=ha95c52a_0
+ - zlib=1.2.13=hd590300_5
+ - zstd=1.5.5=hfc55251_0
\ No newline at end of file
diff --git a/install/environments/VEBA-preprocess_env.yml b/install/environments/VEBA-preprocess_env.yml
index d2f59b2..a7f174b 100644
--- a/install/environments/VEBA-preprocess_env.yml
+++ b/install/environments/VEBA-preprocess_env.yml
@@ -1,4 +1,4 @@
-name: VEBA-preprocess_env__v2023.8.21
+name: VEBA-preprocess_env__v2023.12.12
channels:
- conda-forge
- bioconda
@@ -7,46 +7,50 @@ channels:
dependencies:
- _libgcc_mutex=0.1=conda_forge
- _openmp_mutex=4.5=2_gnu
- - alsa-lib=1.2.8=h166bdaf_0
+ - alsa-lib=1.2.7.2=h166bdaf_0
- argparse-manpage-birdtools=1.7.0=pyhd8ed1ab_0
- - aria2=1.36.0=h8b6cd97_3
- - arrow-cpp=10.0.1=h3e2b116_1_cpu
- - aws-c-auth=0.6.21=h3cb7b9d_0
- - aws-c-cal=0.5.20=hd3b2fe5_3
- - aws-c-common=0.8.5=h166bdaf_0
- - aws-c-compression=0.2.16=hf5f93bc_0
- - aws-c-event-stream=0.2.15=h2c1f3d0_11
- - aws-c-http=0.6.27=hb11a807_3
- - aws-c-io=0.13.11=hf1b0a34_1
- - aws-c-mqtt=0.7.13=h93e60df_9
- - aws-c-s3=0.1.51=h1222a00_14
- - aws-c-sdkutils=0.1.7=hf5f93bc_0
- - aws-checksums=0.1.13=hf5f93bc_5
- - aws-crt-cpp=0.18.16=hb1454fd_1
- - aws-sdk-cpp=1.9.379=hdc6349a_5
+ - aria2=1.36.0=h1e4e653_3
+ - arrow-cpp=12.0.0=ha770c72_1_cpu
+ - aws-c-auth=0.6.26=h2c7c9e7_6
+ - aws-c-cal=0.5.26=h71eb795_0
+ - aws-c-common=0.8.17=hd590300_0
+ - aws-c-compression=0.2.16=h4f47f36_6
+ - aws-c-event-stream=0.2.20=h69ce273_6
+ - aws-c-http=0.7.7=h7b8353a_3
+ - aws-c-io=0.13.21=h2c99d58_4
+ - aws-c-mqtt=0.8.6=h3a1964a_15
+ - aws-c-s3=0.2.8=h0933b68_4
+ - aws-c-sdkutils=0.1.9=h4f47f36_1
+ - aws-checksums=0.1.14=h4f47f36_6
+ - aws-crt-cpp=0.19.9=h85076f6_5
+ - aws-sdk-cpp=1.10.57=hf40e4db_10
- awscli=1.27.23=py39hf3d152e_0
- bbmap=39.01=h5c4e2a8_0
+ - binutils_impl_linux-64=2.39=he00db2b_1
- bird_tool_utils_python=0.4.1=pyhdfd78af_0
- botocore=1.29.23=pyhd8ed1ab_0
- bowtie2=2.5.1=py39h3321a2d_0
- brotlipy=0.7.0=py39hb9d737c_1005
- bz2file=0.98=py_0
- bzip2=1.0.8=h7f98852_4
- - c-ares=1.18.1=h7f98852_0
- - ca-certificates=2023.7.22=hbcca054_0
+ - c-ares=1.22.1=hd590300_0
+ - ca-certificates=2023.11.17=hbcca054_0
- cairo=1.16.0=ha61ee94_1014
- - certifi=2023.7.22=pyhd8ed1ab_0
+ - certifi=2023.11.17=pyhd8ed1ab_0
- cffi=1.15.1=py39he91dace_2
- charset-normalizer=2.1.1=pyhd8ed1ab_0
+ - chopper=0.7.0=hdcf5f25_0
+ - clang=15.0.3=ha770c72_0
+ - clang-15=15.0.3=default_h2e3cab8_0
- colorama=0.4.4=pyh9f0ad1d_0
- coreutils=9.3=h0b41bf4_0
- - cryptography=38.0.4=py39hd97740a_0
- - curl=7.86.0=h7bff187_1
+ - cryptography=41.0.7=py39hd4f0224_0
+ - curl=8.4.0=hca28451_0
- docutils=0.16=py39hf3d152e_3
- expat=2.5.0=h27087fc_0
- extern=0.4.1=py_0
- fastp=0.23.4=h5f740d0_0
- - fastq_preprocessor=2023.7.24=py_0
+ - fastq_preprocessor=2023.12.12=py_0
- font-ttf-dejavu-sans-mono=2.37=hab24e00_0
- font-ttf-inconsolata=3.000=h77eed37_0
- font-ttf-source-code-pro=2.038=h77eed37_0
@@ -55,6 +59,7 @@ dependencies:
- fonts-conda-ecosystem=1=0
- fonts-conda-forge=1=0
- freetype=2.12.1=hca18f0e_1
+ - gcc_impl_linux-64=12.2.0=hcc96c02_19
- genopype=2023.5.15=py_0
- gettext=0.21.1=h27087fc_0
- gflags=2.2.2=he1b5a44_1004
@@ -62,54 +67,62 @@ dependencies:
- glog=0.6.0=h6f12383_0
- graphite2=1.3.13=h58526e2_1001
- harfbuzz=5.3.0=h418a68e_0
- - hdf5=1.12.1=nompi_h2386368_104
- - htslib=1.16=h6bc39ce_0
+ - hdf5=1.14.2=nompi_h4f84152_100
+ - htslib=1.18=h81da01d_0
- icu=70.1=h27087fc_0
- idna=3.4=pyhd8ed1ab_0
- isa-l=2.30.0=ha770c72_4
- jmespath=1.0.1=pyhd8ed1ab_0
- jpeg=9e=h166bdaf_2
+ - k8=0.2.5=hdcf5f25_4
+ - kernel-headers_linux-64=2.6.32=he073ed8_16
- keyutils=1.6.1=h166bdaf_0
- kingfisher=0.1.0=pyh7cba7a3_1
- - krb5=1.19.3=h3790be6_0
- - lcms2=2.14=h6ed2654_0
+ - krb5=1.21.2=h659d440_0
+ - lcms2=2.12=hddcbb42_0
- ld_impl_linux-64=2.39=hcc3a1bd_1
- lerc=4.0.0=h27087fc_0
- - libabseil=20220623.0=cxx17_h48a1fff_5
- - libarrow=10.0.1=hcf5dfb8_1_cpu
+ - libabseil=20230125.0=cxx17_hcb278e6_1
+ - libaec=1.1.2=h59595ed_1
+ - libarrow=12.0.0=h1cdf7b0_1_cpu
- libblas=3.9.0=16_linux64_openblas
- libbrotlicommon=1.0.9=h166bdaf_8
- libbrotlidec=1.0.9=h166bdaf_8
- libbrotlienc=1.0.9=h166bdaf_8
- libcblas=3.9.0=16_linux64_openblas
+ - libclang-cpp15=15.0.3=default_h2e3cab8_0
- libcrc32c=1.1.2=h9c3ff4c_0
- - libcups=2.3.3=h3e49a29_2
- - libcurl=7.86.0=h7bff187_1
- - libdeflate=1.13=h166bdaf_0
+ - libcups=2.3.3=h4637d8d_4
+ - libcurl=8.4.0=hca28451_0
+ - libdeflate=1.19=hd590300_0
- libedit=3.1.20191231=he28a2e2_2
- libev=4.33=h516909a_1
- - libevent=2.1.10=h9b69904_4
+ - libevent=2.1.12=hf998b51_1
- libffi=3.4.2=h7f98852_5
+ - libgcc-devel_linux-64=12.2.0=h3b97bd3_19
- libgcc-ng=12.2.0=h65d4601_19
- - libgfortran-ng=12.2.0=h69a702a_19
- - libgfortran5=12.2.0=h337968e_19
+ - libgfortran-ng=13.2.0=h69a702a_0
+ - libgfortran5=13.2.0=ha4646dd_0
- libglib=2.74.1=h606061b_1
- libgomp=12.2.0=h65d4601_19
- - libgoogle-cloud=2.5.0=hcb5eced_0
- - libgrpc=1.49.1=h05bd8bd_1
+ - libgoogle-cloud=2.10.0=hac9eb74_0
+ - libgrpc=1.54.2=hcf146ea_0
- libhwloc=2.8.0=h32351e8_1
- libiconv=1.17=h166bdaf_0
- liblapack=3.9.0=16_linux64_openblas
- - libnghttp2=1.47.0=hdcd2b5c_1
+ - libllvm15=15.0.3=h503ea73_0
+ - libnghttp2=1.58.0=h47da74e_0
- libnsl=2.0.0=h7f98852_0
+ - libnuma=2.0.16=h0b41bf4_1
- libopenblas=0.3.21=pthreads_h78a6416_3
- libpng=1.6.39=h753d276_0
- - libprotobuf=3.21.10=h6239696_0
+ - libprotobuf=3.21.12=hfc55251_2
+ - libsanitizer=12.2.0=h46fd767_19
- libsqlite=3.40.0=h753d276_0
- - libssh2=1.10.0=haa6b8db_3
+ - libssh2=1.11.0=h0841786_0
- libstdcxx-ng=12.2.0=h46fd767_19
- - libthrift=0.16.0=h491838f_2
- - libtiff=4.4.0=h0e0dad5_3
+ - libthrift=0.18.1=h8fd135c_2
+ - libtiff=4.2.0=hf544144_3
- libutf8proc=2.8.0=h166bdaf_0
- libuuid=2.32.1=h7f98852_1000
- libwebp-base=1.2.4=h166bdaf_0
@@ -117,12 +130,14 @@ dependencies:
- libxml2=2.9.14=h22db469_4
- libzlib=1.2.13=h166bdaf_4
- lz4-c=1.9.3=h9c3ff4c_1
+ - minimap2=2.26=he4a0461_2
- ncbi-ngs-sdk=2.9.0=0
+ - ncbi-vdb=3.0.9=hdbdd923_0
- ncurses=6.3=h27087fc_1
- numpy=1.23.5=py39h3d75532_0
- - openjdk=17.0.3=hafdced1_4
- - openssl=1.1.1u=hd590300_0
- - orc=1.8.0=h09e0d61_0
+ - openjdk=17.0.3=hea3dc9f_3
+ - openssl=3.2.0=hd590300_1
+ - orc=1.8.3=h2f23424_1
- ossuuid=1.6.2=hf484d3e_1000
- pandas=1.5.2=py39h4661b88_0
- parquet-cpp=1.5.1=2
@@ -164,38 +179,41 @@ dependencies:
- pip=22.3.1=pyhd8ed1ab_0
- pixman=0.40.0=h36c2ea0_0
- pthread-stubs=0.4=h36c2ea0_1001
- - pyarrow=10.0.1=py39h33d4778_1_cpu
+ - pyarrow=12.0.0=py39he4327e9_1_cpu
- pyasn1=0.4.8=py_0
- pycparser=2.21=pyhd8ed1ab_0
- - pyopenssl=22.1.0=pyhd8ed1ab_0
+ - pyopenssl=23.3.0=pyhd8ed1ab_0
- pysocks=1.7.1=pyha2e5f31_6
- - python=3.9.15=h47a2c10_0_cpython
+ - python=3.9.16=h2782a2a_0_cpython
- python-dateutil=2.8.2=pyhd8ed1ab_0
- python-tzdata=2022.7=pyhd8ed1ab_0
- python_abi=3.9=3_cp39
- pytz=2022.6=pyhd8ed1ab_0
- pytz-deprecation-shim=0.1.0.post0=py39hf3d152e_3
- pyyaml=5.4.1=py39hb9d737c_4
- - re2=2022.06.01=h27087fc_1
+ - rdma-core=28.9=h59595ed_1
+ - re2=2023.02.02=hcb278e6_0
- readline=8.1.2=h0f457ee_0
- requests=2.28.1=pyhd8ed1ab_1
- rsa=4.7.2=pyh44b312d_0
- - s2n=1.3.28=h8d01263_0
+ - s2n=1.3.44=h06160fa_0
- s3transfer=0.6.0=pyhd8ed1ab_0
- samtools=1.16.1=h6899075_1
- scandir=1.10.0=py39hb9d737c_6
- seqkit=2.3.1=h9ee0642_0
- setuptools=65.5.1=pyhd8ed1ab_0
- six=1.16.0=pyh6c4a22f_0
- - snappy=1.1.9=hbd366e4_2
+ - snappy=1.1.10=h9fff704_0
- soothsayer_utils=2022.6.24=py_0
- - sra-tools=3.0.0=pl5321hd0d85c6_1
+ - sra-tools=3.0.9=h9f5acd7_0
- sracat=0.2=h9f5acd7_1
+ - sysroot_linux-64=2.12=he073ed8_16
- tbb=2021.7.0=h924138e_1
- tk=8.6.12=h27826a3_0
- tqdm=4.64.1=pyhd8ed1ab_0
- tzdata=2022g=h191b570_0
- tzlocal=4.2=py39hf3d152e_2
+ - ucx=1.14.1=h64cca9d_5
- urllib3=1.26.13=pyhd8ed1ab_0
- wheel=0.38.4=pyhd8ed1ab_0
- xorg-fixesproto=5.0=h7f98852_1002
@@ -218,4 +236,4 @@ dependencies:
- xz=5.2.6=h166bdaf_0
- yaml=0.2.5=h7f98852_2
- zlib=1.2.13=h166bdaf_4
- - zstd=1.5.2=h6239696_4
\ No newline at end of file
+ - zstd=1.5.5=hfc55251_0
\ No newline at end of file
diff --git a/install/environments/VEBA-profile_env.yml b/install/environments/VEBA-profile_env.yml
index bccbda2..f6f3fab 100644
--- a/install/environments/VEBA-profile_env.yml
+++ b/install/environments/VEBA-profile_env.yml
@@ -1,4 +1,4 @@
-name: VEBA-profile_env__v2023.10.16
+name: VEBA-profile_env__v2023.12.14
channels:
- conda-forge
- bioconda
@@ -21,12 +21,12 @@ dependencies:
- bz2file=0.98=py_0
- bzip2=1.0.8=h7f98852_4
- c-ares=1.20.1=hd590300_0
- - ca-certificates=2023.7.22=hbcca054_0
+ - ca-certificates=2023.11.17=hbcca054_0
- cached-property=1.5.2=hd8ed1ab_1
- cached_property=1.5.2=pyha770c72_1
- cairo=1.16.0=hb05425b_5
- capnproto=0.9.1=ha19adfc_4
- - certifi=2023.7.22=pyhd8ed1ab_0
+ - certifi=2023.11.17=pyhd8ed1ab_0
- charset-normalizer=3.3.0=pyhd8ed1ab_0
- click=8.1.7=unix_pyh707e725_0
- cmseq=1.0.4=pyhb7b1952_0
@@ -119,7 +119,7 @@ dependencies:
- numpy=1.26.0=py310hb13e2d6_0
- openjdk=17.0.3=h4335b31_6
- openjpeg=2.5.0=h488ebb8_3
- - openssl=3.1.3=hd590300_0
+ - openssl=3.2.0=hd590300_1
- ossuuid=1.6.2=hf484d3e_1000
- packaging=23.2=pyhd8ed1ab_0
- pandas=2.1.1=py310hcc13569_1
@@ -198,6 +198,7 @@ dependencies:
- six=1.16.0=pyh6c4a22f_0
- soothsayer_utils=2022.6.24=py_0
- statsmodels=0.14.0=py310h1f7b6fc_2
+ - sylph=0.4.1=h4ac6f70_0
- tbb=2021.7.0=h924138e_1
- tk=8.6.13=h2797004_0
- tqdm=4.66.1=pyhd8ed1ab_0
diff --git a/install/install_veba.sh b/install/install.sh
similarity index 57%
rename from install/install_veba.sh
rename to install/install.sh
index 8c8fa6d..ac81638 100644
--- a/install/install_veba.sh
+++ b/install/install.sh
@@ -1,12 +1,14 @@
#!/bin/bash
-# __version__ = "2023.3.27"
+# __version__ = "2023.12.19"
SCRIPT_PATH=$(realpath $0)
PREFIX=$(echo $SCRIPT_PATH | python -c "import sys; print('/'.join(sys.stdin.read().split('/')[:-1]))")
-CONDA_BASE=$(conda run -n base bash -c "echo \${CONDA_PREFIX}")
+# CONDA_BASE=$(conda run -n base bash -c "echo \${CONDA_PREFIX}")
+CONDA_BASE=$(conda info --base)
# Update permissions
echo "Updating permissions for scripts in ${PREFIX}/../src"
+chmod 755 ${PREFIX}/../src/veba
chmod 755 ${PREFIX}/../src/*.py
chmod 755 ${PREFIX}/../src/scripts/*
@@ -15,12 +17,34 @@ conda install -c conda-forge mamba -y
# conda update mamba -y # Recommended
# Environments
+# Main environment
+echo "Creating VEBA main environment"
+
+ENV_NAME="VEBA"
+mamba create -y -n $ENV_NAME -c conda-forge -c bioconda -c jolespin seqkit genopype networkx biopython biom-format anndata || (echo "Error when creating main VEBA environment" ; exit 1) &> ${PREFIX}/environments/VEBA.log
+
+# Copy main executable
+echo -e "\t*Copying main VEBA executable into ${ENV_NAME} environment path"
+cp -r ${PREFIX}/../src/veba ${CONDA_BASE}/envs/${ENV_NAME}/bin/
+# Copy over files to environment bin/
+echo -e "\t*Copying VEBA modules into ${ENV_NAME} environment path"
+cp -r ${PREFIX}/../src/*.py ${CONDA_BASE}/envs/${ENV_NAME}/bin/
+echo -e "\t*Copying VEBA utility scripts into ${ENV_NAME} environment path"
+cp -r ${PREFIX}/../src/scripts/ ${CONDA_BASE}/envs/${ENV_NAME}/bin/
+# Symlink the utility scripts to bin/
+echo -e "\t*Symlinking VEBA utility scripts into ${ENV_NAME} environment path"
+ln -sf ${CONDA_BASE}/envs/${ENV_NAME}/bin/scripts/* ${CONDA_BASE}/envs/${ENV_NAME}/bin/
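The copy-then-symlink pattern above keeps one canonical `scripts/` directory inside the environment while still exposing each utility directly on `PATH`. A minimal sketch using a hypothetical throwaway prefix in place of the Conda environment:

```shell
# Sketch (hypothetical temp prefix): copy scripts into <env>/bin/scripts/,
# then symlink each one into <env>/bin/ so it resolves on PATH.
ENV_BIN=$(mktemp -d)/bin
mkdir -p ${ENV_BIN}/scripts
touch ${ENV_BIN}/scripts/example_utility.py
ln -sf ${ENV_BIN}/scripts/* ${ENV_BIN}/
ls -l ${ENV_BIN}/example_utility.py
```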
+
+# Version
+cp -rf ${PREFIX}/../VERSION ${CONDA_BASE}/envs/${ENV_NAME}/bin/VEBA_VERSION
+
+# Module environments
for ENV_YAML in ${PREFIX}/environments/VEBA*.yml; do
# Get environment name
ENV_NAME=$(basename $ENV_YAML .yml)
# Create conda environment
- echo "Creating ${ENV_NAME} environment"
+ echo "Creating ${ENV_NAME} module environment"
mamba env create -n $ENV_NAME -f $ENV_YAML || (echo "Error when creating VEBA environment: ${ENV_YAML}" ; exit 1) &> ${ENV_YAML}.log
# Copy over files to environment bin/
@@ -32,6 +56,9 @@ for ENV_YAML in ${PREFIX}/environments/VEBA*.yml; do
echo -e "\t*Symlinking VEBA utility scripts into ${ENV_NAME} environment path"
ln -sf ${CONDA_BASE}/envs/${ENV_NAME}/bin/scripts/* ${CONDA_BASE}/envs/${ENV_NAME}/bin/
+ # Version
+ cp -rf ${PREFIX}/../VERSION ${CONDA_BASE}/envs/${ENV_NAME}/bin/VEBA_VERSION
+
done
echo -e " _ _ _______ ______ _______\n \ / |______ |_____] |_____|\n \/ |______ |_____] | |"
diff --git a/install/uninstall_veba.sh b/install/uninstall.sh
similarity index 100%
rename from install/uninstall_veba.sh
rename to install/uninstall.sh
diff --git a/install/update_environment_scripts.sh b/install/update_environment_scripts.sh
index 2c98bc7..59a98b0 100644
--- a/install/update_environment_scripts.sh
+++ b/install/update_environment_scripts.sh
@@ -1,5 +1,5 @@
#!/usr/bin/env bash
-# __version__ = "2023.01.05"
+# __version__ = "2023.12.18"
# Usage: git clone https://github.com/jolespin/veba && update_environment_scripts.sh /path/to/veba_repository
echo "-----------------------------------------------------------------------------------------------------"
@@ -17,13 +17,14 @@ if [ $# -eq 0 ]; then
chmod 775 ${VEBA_REPOSITORY_DIRECTORY}/src/*
chmod 775 ${VEBA_REPOSITORY_DIRECTORY}/src/scripts/*
- else
+ else
VEBA_REPOSITORY_DIRECTORY=$1
fi
-CONDA_BASE=$(conda run -n base bash -c "echo \${CONDA_PREFIX}")
+# CONDA_BASE=$(conda run -n base bash -c "echo \${CONDA_PREFIX}")
+CONDA_BASE=$(conda info --base)
echo "-----------------------------------------------------------------------------------------------------"
echo " * Source VEBA: ${VEBA_REPOSITORY_DIRECTORY}"
@@ -31,9 +32,10 @@ echo " * Destination VEBA environments CONDA_BASE: ${CONDA_BASE}"
echo "-----------------------------------------------------------------------------------------------------"
# Environments
-for ENV_PREFIX in ${CONDA_BASE}/envs/VEBA-*; do
+for ENV_PREFIX in ${CONDA_BASE}/envs/VEBA ${CONDA_BASE}/envs/VEBA-*;
+do
echo $ENV_PREFIX
cp ${VEBA_REPOSITORY_DIRECTORY}/src/*.py ${ENV_PREFIX}/bin/
cp -r ${VEBA_REPOSITORY_DIRECTORY}/src/scripts/ ${ENV_PREFIX}/bin/
ln -sf ${ENV_PREFIX}/bin/scripts/* ${ENV_PREFIX}/bin/
- done
+done
diff --git a/src/MODULE_RESOURCES b/src/MODULE_RESOURCES
deleted file mode 100644
index a30553e..0000000
--- a/src/MODULE_RESOURCES
+++ /dev/null
@@ -1,18 +0,0 @@
-Status Environment Module Resources Recommended Threads Description
-Stable VEBA-preprocess_env preprocess.py 4GB-16GB 4 Fastq quality trimming, adapter removal, decontamination, and read statistics calculations
-Stable VEBA-assembly_env assembly.py 32GB-128GB+ 16 Assemble reads, align reads to assembly, and count mapped reads
-Stable VEBA-assembly_env coverage.py 24GB 16 Align reads to (concatenated) reference and counts mapped reads
-Stable VEBA-binning-prokaryotic_env binning-prokaryotic.py 16GB 4 Iterative consensus binning for recovering prokaryotic genomes with lineage-specific quality assessment
-Stable VEBA-binning-eukaryotic_env binning-eukaryotic.py 128GB 4 Binning for recovering eukaryotic genomes with exon-aware gene modeling and lineage-specific quality assessment
-Stable VEBA-binning-viral_env binning-viral.py 16GB 4 Detection of viral genomes and quality assessment
-Stable VEBA-classify_env classify-prokaryotic.py 64GB 32 Taxonomic classification of prokaryotic genomes
-Stable VEBA-classify_env classify-eukaryotic.py 32GB 1 Taxonomic classification of eukaryotic genomes
-Stable VEBA-classify_env classify-viral.py 16GB 4 Taxonomic classification of viral genomes
-Stable VEBA-cluster_env cluster.py 32GB+ 32 Species-level clustering of genomes and lineage-specific orthogroup detection
-Stable VEBA-annotate_env annotate.py 64GB 32 Annotates translated gene calls against NR, Pfam, and KOFAM
-Stable VEBA-phylogeny_env phylogeny.py 16GB+ 32 Constructs phylogenetic trees given a marker set
-Stable VEBA-mapping_env index.py 16GB 4 Builds local or global index for alignment to genomes
-Stable VEBA-mapping_env mapping.py 16GB 4 Aligns reads to local or global index of genomes
-Stable VEBA-biosynthetic_env biosynthetic.py 16GB 16 Identify biosynthetic gene clusters in prokaryotes and fungi
-Developmental VEBA-assembly_env assembly-sequential.py 32GB-128GB+ 16 Assemble metagenomes sequentially
-Developmental VEBA-amplicon_env amplicon.py 96GB 16 Automated read trim position detection, DADA2 ASV detection, taxonomic classification, and file conversion
\ No newline at end of file
diff --git a/src/README.md b/src/README.md
index 574149f..790091e 100755
--- a/src/README.md
+++ b/src/README.md
@@ -3,25 +3,29 @@
# Modules
[![Schematic](../images/Schematic.png)](../images/Schematic.pdf)
-| Status | Environment | Module | Resources | Recommended Threads | Description |
-|---------------|------------------------------|-------------------------|-------------|---------------------|-----------------------------------------------------------------------------------------------------------------|
-| Stable | [VEBA-preprocess_env](https://github.com/jolespin/veba/blob/main/install/environments/VEBA-preprocess_env.yml) | [preprocess.py](https://github.com/jolespin/veba/tree/main/src#preprocesspy) | 4GB-16GB | 4 | Fastq quality trimming, adapter removal, decontamination, and read statistics calculations |
-| Stable | [VEBA-assembly_env](https://github.com/jolespin/veba/blob/main/install/environments/VEBA-assembly_env.yml) | [assembly.py](https://github.com/jolespin/veba/tree/main/src#assemblypy) | 32GB-128GB+ | 4-16 | Assemble reads, align reads to assembly, and count mapped reads |
-| Stable | [VEBA-assembly_env](https://github.com/jolespin/veba/blob/main/install/environments/VEBA-assembly_env.yml) | [coverage.py](https://github.com/jolespin/veba/tree/main/src#coveragepy) | 24GB | 16 | Align reads to (concatenated) reference and counts mapped reads |
-| Stable | [VEBA-binning-prokaryotic_env](https://github.com/jolespin/veba/blob/main/install/environments/VEBA-binning-prokaryotic_env.yml) | [binning-prokaryotic.py](https://github.com/jolespin/veba/tree/main/src#binning-prokaryoticpy) | 16GB | 4 | Iterative consensus binning for recovering prokaryotic genomes with lineage-specific quality assessment |
-| Stable | [VEBA-binning-eukaryotic_env](https://github.com/jolespin/veba/blob/main/install/environments/VEBA-binning-eukaryotic_env.yml) | [binning-eukaryotic.py](https://github.com/jolespin/veba/tree/main/src#binning-eukaryoticpy) | 128GB | 4 | Binning for recovering eukaryotic genomes with exon-aware gene modeling and lineage-specific quality assessment |
-| Stable | [VEBA-binning-viral_env](https://github.com/jolespin/veba/blob/main/install/environments/VEBA-binning-viral_env.yml) | [binning-viral.py](https://github.com/jolespin/veba/tree/main/src#binning-viralpy) | 16GB | 4 | Detection of viral genomes and quality assessment |
-| Stable | [VEBA-classify_env](https://github.com/jolespin/veba/blob/main/install/environments/VEBA-classify_env.yml) | [classify-prokaryotic.py](https://github.com/jolespin/veba/tree/main/src#classify-prokaryoticpy) | 72GB | 32 | Taxonomic classification of prokaryotic genomes |
-| Stable | [VEBA-classify_env](https://github.com/jolespin/veba/blob/main/install/environments/VEBA-classify_env.yml) | [classify-eukaryotic.py](https://github.com/jolespin/veba/tree/main/src#classify-eukaryoticpy) | 32GB | 1 | Taxonomic classification of eukaryotic genomes |
-| Stable | [VEBA-classify_env](https://github.com/jolespin/veba/blob/main/install/environments/VEBA-classify_env.yml) | [classify-viral.py](https://github.com/jolespin/veba/tree/main/src#classify-viralpy) | 16GB | 4 | Taxonomic classification of viral genomes |
-| Stable | [VEBA-cluster_env](https://github.com/jolespin/veba/blob/main/install/environments/[VEBA-cluster_env.yml) | [cluster.py](https://github.com/jolespin/veba/tree/main/src#clusterpy) | 32GB+ | 32 | Species-level clustering of genomes and lineage-specific orthogroup detection |
-| Stable | [VEBA-annotate_env](https://github.com/jolespin/veba/blob/main/install/environments/VEBA-annotate_env.yml) | [annotate.py](https://github.com/jolespin/veba/tree/main/src#annotatepy) | 64GB | 32 | Annotates translated gene calls against UniRef, MiBIG, VFDB, Pfam, AntiFam, and KOFAM |
-| Stable | [VEBA-phylogeny_env](https://github.com/jolespin/veba/blob/main/install/environments/VEBA-phylogeny_env.yml) | [phylogeny.py](https://github.com/jolespin/veba/tree/main/src#phylogenypy) | 16GB+ | 32 | Constructs phylogenetic trees given a marker set |
-| Stable | [VEBA-mapping_env](https://github.com/jolespin/veba/blob/main/install/environments/VEBA-mapping_env.yml) | [index.py](https://github.com/jolespin/veba/tree/main/src#indexpy) | 16GB | 4 | Builds local or global index for alignment to genomes |
-| Stable | [VEBA-mapping_env](https://github.com/jolespin/veba/blob/main/install/environments/VEBA-mapping_env.yml) | [mapping.py](https://github.com/jolespin/veba/tree/main/src#mappingpy) | 16GB | 4 | Aligns reads to local or global index of genomes |
-| Stable | [VEBA-biosynthetic_env](https://github.com/jolespin/veba/blob/main/install/environments/VEBA-biosynthetic_env.yml) | [biosynthetic.py](https://github.com/jolespin/veba/tree/main/src#biosyntheticpy) | 16GB | 16 | Identify biosynthetic gene clusters in prokaryotes and fungi |
-| Developmental | [VEBA-assembly_env](https://github.com/jolespin/veba/blob/main/install/environments/VEBA-assembly_env.yml) | [assembly-sequential.py](https://github.com/jolespin/veba/tree/main/src#assembly-sequentialpy) | 32GB-128GB+ | 16 | Assemble metagenomes sequentially |
-| Developmental | [VEBA-amplicon_env](https://github.com/jolespin/veba/blob/main/install/environments/devel/VEBA-amplicon_env.yml) | [amplicon.py](https://github.com/jolespin/veba/tree/main/src#ampliconpy) | 96GB | 16 | Automated read trim position detection, DADA2 ASV detection, taxonomic classification, and file conversion |
+| Status | Module | Environment | Executable | Resources | Recommended Threads | Description |
+|---------------|----------------------|------------------------------|-------------------------|-------------|---------------------|-------------------------------------------------------------------------------------------------------------------|
+| Stable | preprocess | VEBA-preprocess_env | preprocess.py | 4GB-16GB | 4 | Fastq quality trimming, adapter removal, decontamination, and read statistics calculations (Short Reads) |
+| Stable        | preprocess-long      | VEBA-preprocess_env          | preprocess-long.py      | 4GB-16GB    | 4                   | Fastq quality trimming, adapter removal, decontamination, and read statistics calculations (Long Reads)           |
+| Stable | assembly | VEBA-assembly_env | assembly.py | 32GB-128GB+ | 16 | Assemble short reads, align reads to assembly, and count mapped reads |
+| Stable | assembly-long | VEBA-assembly_env | assembly-long.py | 32GB-128GB+ | 16 | Assemble long reads, align reads to assembly, and count mapped reads |
+| Stable        | coverage             | VEBA-assembly_env            | coverage.py             | 24GB        | 16                  | Align short reads to (concatenated) reference and count mapped reads                                              |
+| Stable        | coverage-long        | VEBA-assembly_env            | coverage-long.py        | 24GB        | 16                  | Align long reads to (concatenated) reference and count mapped reads                                               |
+| Stable | binning-prokaryotic | VEBA-binning-prokaryotic_env | binning-prokaryotic.py | 16GB | 4 | Iterative consensus binning for recovering prokaryotic genomes with lineage-specific quality assessment |
+| Stable | binning-eukaryotic | VEBA-binning-eukaryotic_env | binning-eukaryotic.py | 128GB | 4 | Binning for recovering eukaryotic genomes with exon-aware gene modeling and lineage-specific quality assessment |
+| Stable | binning-viral | VEBA-binning-viral_env | binning-viral.py | 16GB | 4 | Detection of viral genomes and quality assessment |
+| Stable | classify-prokaryotic | VEBA-classify_env | classify-prokaryotic.py | 64GB | 32 | Taxonomic classification of prokaryotic genomes |
+| Stable | classify-eukaryotic | VEBA-classify_env | classify-eukaryotic.py | 32GB | 1 | Taxonomic classification of eukaryotic genomes |
+| Stable | classify-viral | VEBA-classify_env | classify-viral.py | 16GB | 4 | Taxonomic classification of viral genomes |
+| Stable | cluster | VEBA-cluster_env | cluster.py | 32GB+ | 32 | Species-level clustering of genomes and lineage-specific orthogroup detection |
+| Stable | annotate | VEBA-annotate_env | annotate.py | 64GB | 32 | Annotates translated gene calls against NR, Pfam, and KOFAM |
+| Stable | phylogeny | VEBA-phylogeny_env | phylogeny.py | 16GB+ | 32 | Constructs phylogenetic trees given a marker set |
+| Stable | index | VEBA-mapping_env | index.py | 16GB | 4 | Builds local or global index for alignment to genomes |
+| Stable | mapping | VEBA-mapping_env | mapping.py | 16GB | 4 | Aligns reads to local or global index of genomes |
+| Stable | biosynthetic | VEBA-biosynthetic_env | biosynthetic.py | 16GB | 16 | Identify biosynthetic gene clusters in prokaryotes and fungi |
+| Stable | profile-pathway | VEBA-profile_env | profile-pathway.py | 16GB | 4 | Pathway profiling of de novo genomes |
+| Deprecated | assembly-sequential | VEBA-assembly_env | assembly-sequential.py | 32GB-128GB+ | 16 | Assemble metagenomes sequentially |
+| Developmental | amplicon | VEBA-amplicon_env | amplicon.py | 96GB | 16 | Automated read trim position detection, DADA2 ASV detection, taxonomic classification, and file conversion |
diff --git a/src/amplicon.py b/src/amplicon.py
index c673ee7..f66abf9 100755
--- a/src/amplicon.py
+++ b/src/amplicon.py
@@ -14,7 +14,7 @@
pd.options.display.max_colwidth = 100
# from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2023.10.16"
+__version__ = "2023.11.30"
# Reads archive
def get_reads_cmd( input_filepaths, output_filepaths, output_directory, directories, opts):
@@ -626,6 +626,7 @@ def main(args=None):
print("Script version:", __version__, file=sys.stdout)
print("Moment:", get_timestamp(), file=sys.stdout)
print("Directory:", os.getcwd(), file=sys.stdout)
+    if "TMPDIR" in os.environ: print("TMPDIR:", os.environ["TMPDIR"], file=sys.stdout)
print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout)
configure_parameters(opts, directories)
sys.stdout.flush()
diff --git a/src/annotate.py b/src/annotate.py
index c050e86..eda3cf5 100755
--- a/src/annotate.py
+++ b/src/annotate.py
@@ -15,7 +15,7 @@
pd.options.display.max_colwidth = 100
# from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2023.10.25"
+__version__ = "2023.11.30"
def get_preprocess_cmd( input_filepaths, output_filepaths, output_directory, directories, opts, program):
cmd = [
@@ -880,6 +880,7 @@ def main(args=None):
print("VEBA Database:", opts.veba_database, file=sys.stdout)
print("Moment:", get_timestamp(), file=sys.stdout)
print("Directory:", os.getcwd(), file=sys.stdout)
+    if "TMPDIR" in os.environ: print("TMPDIR:", os.environ["TMPDIR"], file=sys.stdout)
print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout)
configure_parameters(opts, directories)
diff --git a/src/assembly-long.py b/src/assembly-long.py
new file mode 100755
index 0000000..0c35cc2
--- /dev/null
+++ b/src/assembly-long.py
@@ -0,0 +1,627 @@
+#!/usr/bin/env python
+from __future__ import print_function, division
+import sys, os, argparse, glob
+from collections import OrderedDict, defaultdict
+
+import pandas as pd
+
+# Soothsayer Ecosystem
+from genopype import *
+from genopype import __version__ as genopype_version
+from soothsayer_utils import *
+
+pd.options.display.max_colwidth = 100
+# from tqdm import tqdm
+__program__ = os.path.split(sys.argv[0])[-1]
+__version__ = "2023.12.14"
+
+# Assembly
+def get_assembly_cmd( input_filepaths, output_filepaths, output_directory, directories, opts):
+ # Command
+ cmd = [
+ os.environ["flye"],
+ "--{} {}".format(opts.reads_type, input_filepaths[0]),
+ "-g {}".format(opts.estimated_assembly_size) if opts.estimated_assembly_size else "",
+ "-o {}".format(output_directory),
+ "-t {}".format(opts.n_jobs),
+ "--deterministic" if not opts.no_deterministic else "",
+ "--meta" if opts.program == "metaflye" else "",
+ opts.assembler_options,
+
+ # Get failed length cutoff fasta
+ "&&",
+
+ "mv",
+ os.path.join(output_directory, "assembly.fasta"),
+ os.path.join(output_directory, "assembly_original.fasta"),
+
+ "&&",
+
+ "cat",
+ os.path.join(output_directory, "assembly_original.fasta"),
+ "|",
+ os.environ["seqkit"],
+ "seq",
+ "-M {}".format(max(opts.minimum_contig_length - 1, 1)),
+ "|",
+ "gzip",
+ ">",
+ os.path.join(output_directory, "assembly_failed_length_cutoff.fasta.gz"),
+
+ # Filter out small scaffolds and add prefix if applicable
+ "&&",
+
+ "cat",
+ os.path.join(output_directory, "assembly_original.fasta"),
+ "|",
+ os.environ["seqkit"],
+ "seq",
+ "-m {}".format(opts.minimum_contig_length),
+ "|",
+ os.environ["seqkit"],
+ "replace",
+ "-r {}".format(opts.scaffold_prefix),
+ "-p '^'",
+ ">",
+ os.path.join(output_directory, "assembly.fasta"),
+
+ "&&",
+
+ "rm -rf",
+ os.path.join(output_directory, "assembly_original.fasta"),
+
+ "&&",
+
+ os.environ["fasta_to_saf.py"],
+ "-i",
+ os.path.join(output_directory, "assembly.fasta"),
+ ">",
+ os.path.join(output_directory, "assembly.fasta.saf"),
+ ]
+
+
+
+ # files_to_remove = [
+ # ]
+
+ # for fn in files_to_remove:
+ # cmd += [
+ # "&&",
+ # "rm -rf {}".format(os.path.join(output_directory, fn)),
+ # ]
+ return cmd
+
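The `seqkit seq` pair above partitions the assembly around `--minimum_contig_length`: `-M` (set to the cutoff minus one) captures the contigs that fail the cutoff, while `-m` keeps those that pass. A hedged sketch of that logic, using pure Python over in-memory records rather than the actual FASTA stream:

```python
def partition_by_length(records, minimum_contig_length):
    # Mirror the seqkit filters: contigs shorter than the cutoff go to
    # `failed` (-> assembly_failed_length_cutoff.fasta.gz), the rest to
    # `passed` (-> assembly.fasta).
    passed, failed = [], []
    for header, seq in records:
        (passed if len(seq) >= minimum_contig_length else failed).append((header, seq))
    return passed, failed

records = [("contig_1", "ACGTACGT"), ("contig_2", "ACG")]
passed, failed = partition_by_length(records, 4)
```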
+# Bowtie2
+def get_alignment_cmd(input_filepaths, output_filepaths, output_directory, directories, opts):
+
+ # Command
+ cmd = [
+ # Clear temporary directory just in case
+ "rm -rf {}".format(os.path.join(directories["tmp"], "*")),
+ "&&",
+
+ # MiniMap2 Index
+ "(",
+ os.environ["minimap2"],
+ "-t {}".format(opts.n_jobs),
+ "-d {}".format(output_filepaths[0]), # Index
+ opts.minimap2_index_options,
+ input_filepaths[1], # Reference
+ ")",
+
+ "&&",
+
+ # MiniMap2
+ "(",
+ os.environ["minimap2"],
+ "-a",
+ "-t {}".format(opts.n_jobs),
+ "-x {}".format(opts.minimap2_preset),
+ opts.minimap2_options,
+ output_filepaths[0],
+ input_filepaths[0],
+
+
+
+ # Convert to sorted BAM
+ "|",
+
+ os.environ["samtools"],
+ "view",
+ "-b",
+ "-h",
+ "-F 4",
+
+ "|",
+
+ os.environ["samtools"],
+ "sort",
+ "--threads {}".format(opts.n_jobs),
+ "--reference {}".format(input_filepaths[1]),
+ "-T {}".format(os.path.join(directories["tmp"], "samtools_sort")),
+ ">",
+ output_filepaths[1],
+ ")",
+
+ "&&",
+
+ "(",
+ os.environ["samtools"],
+ "index",
+ "-@ {}".format(opts.n_jobs),
+ output_filepaths[1],
+ ")",
+ ]
+
+ return cmd
+
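The `samtools view -b -h -F 4` stage above drops unmapped records before sorting; `-F 4` is a bitwise exclusion filter on the SAM FLAG field. A minimal illustration of the bit test it performs:

```python
SAM_FLAG_UNMAPPED = 0x4  # the bit tested by `samtools view -F 4`

def is_unmapped(flag):
    # Records with this bit set are excluded from the sorted BAM
    return bool(flag & SAM_FLAG_UNMAPPED)
```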
+
+# featureCounts
+def get_featurecounts_cmd(input_filepaths, output_filepaths, output_directory, directories, opts):
+
+ # Command
+
+ # ORF-Level Counts
+ cmd = [
+ "mkdir -p {}".format(os.path.join(directories["tmp"], "featurecounts")),
+ "&&",
+ "(",
+ os.environ["featureCounts"],
+ # "-G {}".format(input_filepaths[0]),
+ "-a {}".format(input_filepaths[1]),
+ "-o {}".format(os.path.join(output_directory, "featurecounts.tsv")),
+ "-F SAF",
+ "-L",
+ "--tmpDir {}".format(os.path.join(directories["tmp"], "featurecounts")),
+ "-T {}".format(opts.n_jobs),
+ opts.featurecounts_options,
+ input_filepaths[2],
+ ")",
+ "&&",
+ "gzip -f {}".format(os.path.join(output_directory, "featurecounts.tsv")),
+ ]
+ return cmd
+
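featureCounts reads the `-F SAF` annotation produced by `fasta_to_saf.py`. That script is not shown in this diff, but a conversion consistent with the standard SAF columns (GeneID, Chr, Start, End, Strand; 1-based inclusive coordinates) would look roughly like:

```python
def fasta_to_saf_rows(records):
    # One SAF feature spanning each contig, so featureCounts reports
    # per-contig read counts. Coordinates are 1-based and inclusive.
    rows = ["\t".join(["GeneID", "Chr", "Start", "End", "Strand"])]
    for name, seq in records:
        rows.append("\t".join([name, name, "1", str(len(seq)), "+"]))
    return rows
```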
+# seqkit
+def get_seqkit_cmd(input_filepaths, output_filepaths, output_directory, directories, opts):
+
+ # Command
+
+ # ORF-Level Counts
+ cmd = [
+
+ os.environ["seqkit"],
+ "stats",
+ "-a",
+ "-j {}".format(opts.n_jobs),
+ "-T",
+ "-b",
+ os.path.join(directories[("intermediate","1__assembly")], "*.fasta"),
+ "|",
+ "gzip",
+ ">",
+ output_filepaths[0],
+ ]
+ return cmd
+
+# Symlink
+def get_symlink_cmd(input_filepaths, output_filepaths, output_directory, directories, opts):
+ # Command
+ cmd = [
+ "DST={}; (for SRC in {}; do SRC=$(realpath --relative-to $DST $SRC); ln -sf $SRC $DST; done)".format(
+ output_directory,
+ " ".join(input_filepaths),
+ )
+ ]
+ return cmd
+
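The shell one-liner above links outputs using paths computed relative to the destination directory, so the symlinks survive moving the project root. A Python equivalent for a single file (a hypothetical helper, not part of the module):

```python
import os

def relative_symlink(src, dst_dir):
    # Equivalent to: SRC=$(realpath --relative-to $DST $SRC); ln -sf $SRC $DST
    rel = os.path.relpath(src, dst_dir)
    link = os.path.join(dst_dir, os.path.basename(src))
    if os.path.lexists(link):
        os.remove(link)  # `ln -sf` semantics: overwrite an existing link
    os.symlink(rel, link)
    return link
```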
+# ============
+# Run Pipeline
+# ============
+# Set environment variables
+def add_executables_to_environment(opts):
+ """
+ Adapted from Soothsayer: https://github.com/jolespin/soothsayer
+ """
+ accessory_scripts = {
+ "fasta_to_saf.py",
+ }
+
+ required_executables={
+ "flye",
+ "minimap2",
+ "samtools",
+ "featureCounts",
+ "seqkit",
+ } | accessory_scripts
+
+ if opts.path_config == "CONDA_PREFIX":
+ executables = dict()
+ for name in required_executables:
+ executables[name] = os.path.join(os.environ["CONDA_PREFIX"], "bin", name)
+ else:
+ if opts.path_config is None:
+ opts.path_config = os.path.join(opts.script_directory, "veba_config.tsv")
+ opts.path_config = format_path(opts.path_config)
+        assert os.path.exists(opts.path_config), "config file does not exist. Have you created one in the following directory?\n{}\nIf not, either create one, check this filepath: {}, or give the path to a proper config file using --path_config".format(opts.script_directory, opts.path_config)
+ assert os.stat(opts.path_config).st_size > 1, "config file seems to be empty. Please add 'name' and 'executable' columns for the following program names: {}".format(required_executables)
+ df_config = pd.read_csv(opts.path_config, sep="\t")
+ assert {"name", "executable"} <= set(df_config.columns), "config must have `name` and `executable` columns. Please adjust file: {}".format(opts.path_config)
+ df_config = df_config.loc[:,["name", "executable"]].dropna(how="any", axis=0).applymap(str)
+ # Get executable paths
+ executables = OrderedDict(zip(df_config["name"], df_config["executable"]))
+ assert required_executables <= set(list(executables.keys())), "config must have the required executables for this run. Please adjust file: {}\nIn particular, add info for the following: {}".format(opts.path_config, required_executables - set(list(executables.keys())))
+
+ # Display
+ for name in sorted(accessory_scripts):
+ executables[name] = "'{}'".format(os.path.join(opts.script_directory, "scripts", name)) # Can handle spaces in path
+
+ print(format_header( "Adding executables to path from the following source: {}".format(opts.path_config), "-"), file=sys.stdout)
+ for name, executable in executables.items():
+ if name in required_executables:
+ print(name, executable, sep = " --> ", file=sys.stdout)
+ os.environ[name] = executable.strip()
+ print("", file=sys.stdout)
+
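When `--path_config` points at a file rather than `CONDA_PREFIX`, the loader above expects a tab-separated table with `name` and `executable` columns covering every required executable. A hypothetical `veba_config.tsv` (paths are illustrative only):

```
name	executable
flye	/opt/conda/envs/VEBA-assembly_env/bin/flye
minimap2	/opt/conda/envs/VEBA-assembly_env/bin/minimap2
samtools	/opt/conda/envs/VEBA-assembly_env/bin/samtools
featureCounts	/opt/conda/envs/VEBA-assembly_env/bin/featureCounts
seqkit	/opt/conda/envs/VEBA-assembly_env/bin/seqkit
```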
+
+# Pipeline
+def create_pipeline(opts, directories, f_cmds):
+
+ # .................................................................
+ # Primordial
+ # .................................................................
+ # Commands file
+ pipeline = ExecutablePipeline(name=__program__, description=opts.name, f_cmds=f_cmds, checkpoint_directory=directories["checkpoints"], log_directory=directories["log"])
+
+ # ==========
+ # Assembly
+ # ==========
+
+ step = 1
+
+ # Info
+ program = "assembly"
+ program_label = "{}__{}".format(step, program)
+ description = "Assembling long reads via {}".format(opts.program.capitalize())
+
+ # Add to directories
+ output_directory = directories[("intermediate", program_label)] = create_directory(os.path.join(directories["intermediate"], program_label))
+
+
+ # i/o
+ input_filepaths = [opts.reads]
+ output_filenames = ["assembly.fasta", "assembly.fasta.saf"]
+ output_filepaths = list(map(lambda filename: os.path.join(output_directory, filename), output_filenames))
+
+ params = {
+ "input_filepaths":input_filepaths,
+ "output_filepaths":output_filepaths,
+ "output_directory":output_directory,
+ "opts":opts,
+ "directories":directories,
+ }
+
+ cmd = get_assembly_cmd(**params)
+ pipeline.add_step(
+ id=program_label,
+ description = description,
+ step=step,
+ cmd=cmd,
+ input_filepaths = input_filepaths,
+ output_filepaths = output_filepaths,
+ validate_inputs=True,
+ validate_outputs=True,
+ log_prefix=program_label,
+
+ )
+
+ # ==========
+ # Alignment
+ # ==========
+
+ step = 2
+
+ # Info
+ program = "alignment"
+ program_label = "{}__{}".format(step, program)
+ description = "Aligning reads to assembly"
+
+ # Add to directories
+ output_directory = directories[("intermediate", program_label)] = create_directory(os.path.join(directories["intermediate"], program_label))
+
+
+ # i/o
+ input_filepaths = [
+ opts.reads,
+ os.path.join(directories[("intermediate", "1__assembly")], "assembly.fasta"),
+ ]
+
+ output_filepaths = [
+ os.path.join(directories[("intermediate", "1__assembly")], "assembly.fasta.mmi"),
+ os.path.join(output_directory, "mapped.sorted.bam"),
+ ]
+
+ params = {
+ "input_filepaths":input_filepaths,
+ "output_filepaths":output_filepaths,
+ "output_directory":output_directory,
+ "opts":opts,
+ "directories":directories,
+ }
+
+ cmd = get_alignment_cmd(**params)
+ pipeline.add_step(
+ id=program_label,
+ description = description,
+ step=step,
+ cmd=cmd,
+ input_filepaths = input_filepaths,
+ output_filepaths = output_filepaths,
+ validate_inputs=True,
+ validate_outputs=True,
+ log_prefix=program_label,
+
+ )
+
+
+
+ # ==========
+ # featureCounts
+ # ==========
+ step = 3
+
+ # Info
+ program = "featurecounts"
+ program_label = "{}__{}".format(step, program)
+ description = "Counting reads"
+
+ # Add to directories
+ output_directory = directories[("intermediate", program_label)] = create_directory(os.path.join(directories["intermediate"], program_label))
+
+ # i/o
+
+ input_filepaths = [
+ os.path.join(directories[("intermediate", "1__assembly")], "assembly.fasta"),
+ os.path.join(directories[("intermediate", "1__assembly")], "assembly.fasta.saf"),
+ os.path.join(directories[("intermediate", "2__alignment")], "mapped.sorted.bam"),
+ ]
+
+ output_filenames = ["featurecounts.tsv.gz"]
+ output_filepaths = list(map(lambda filename: os.path.join(output_directory, filename), output_filenames))
+
+ params = {
+ "input_filepaths":input_filepaths,
+ "output_filepaths":output_filepaths,
+ "output_directory":output_directory,
+ "opts":opts,
+ "directories":directories,
+ }
+
+ cmd = get_featurecounts_cmd(**params)
+ pipeline.add_step(
+ id=program_label,
+ description = description,
+ step=step,
+ cmd=cmd,
+ input_filepaths = input_filepaths,
+ output_filepaths = output_filepaths,
+ validate_inputs=True,
+ validate_outputs=True,
+ log_prefix=program_label,
+
+ )
+
+ # ==========
+ # stats
+ # ==========
+
+ step = 4
+
+ # Info
+ program = "seqkit"
+ program_label = "{}__{}".format(step, program)
+ description = "Assembly statistics"
+
+ # Add to directories
+ output_directory = directories[("intermediate", program_label)] = create_directory(os.path.join(directories["intermediate"], program_label))
+
+
+ # i/o
+ input_filepaths = [
+ os.path.join(directories[("intermediate", "1__assembly")], "*.fasta"),
+
+ ]
+
+ output_filenames = ["seqkit_stats.tsv.gz"]
+ output_filepaths = list(map(lambda filename: os.path.join(output_directory, filename), output_filenames))
+
+ params = {
+ "input_filepaths":input_filepaths,
+ "output_filepaths":output_filepaths,
+ "output_directory":output_directory,
+ "opts":opts,
+ "directories":directories,
+ }
+
+ cmd = get_seqkit_cmd(**params)
+ pipeline.add_step(
+ id=program_label,
+ description = description,
+ step=step,
+ cmd=cmd,
+ input_filepaths = input_filepaths,
+ output_filepaths = output_filepaths,
+ validate_inputs=True,
+ validate_outputs=True,
+ log_prefix=program_label,
+
+ )
+
+
+ # =============
+ # Symlink
+ # =============
+ step = 5
+
+ # Info
+ program = "symlink"
+ program_label = "{}__{}".format(step, program)
+ description = "Symlinking relevant output files"
+
+ # Add to directories
+ output_directory = directories["output"]
+
+ # i/o
+
+ input_filepaths = [
+ os.path.join(directories[("intermediate", "1__assembly")], "assembly.fasta"),
+ os.path.join(directories[("intermediate", "1__assembly")], "assembly.fasta.mmi"),
+ os.path.join(directories[("intermediate", "2__alignment")], "mapped.sorted.bam"),
+ os.path.join(directories[("intermediate", "2__alignment")], "mapped.sorted.bam.bai"),
+ os.path.join(directories[("intermediate", "3__featurecounts")], "featurecounts.tsv.gz"),
+ os.path.join(directories[("intermediate", "4__seqkit")], "seqkit_stats.tsv.gz"),
+ ]
+
+    output_filenames = map(os.path.basename, input_filepaths)
+ output_filepaths = list(map(lambda fn:os.path.join(directories["output"], fn), output_filenames))
+
+ params = {
+ "input_filepaths":input_filepaths,
+ "output_filepaths":output_filepaths,
+ "output_directory":output_directory,
+ "opts":opts,
+ "directories":directories,
+ }
+
+ cmd = get_symlink_cmd(**params)
+ pipeline.add_step(
+ id=program_label,
+ description = description,
+ step=step,
+ cmd=cmd,
+ input_filepaths = input_filepaths,
+ output_filepaths = output_filepaths,
+ validate_inputs=True,
+ validate_outputs=True,
+ log_prefix=program_label,
+
+ )
+
+ return pipeline
+
+# Configure parameters
+def configure_parameters(opts, directories):
+ # os.environ[]
+
+ # Scaffold prefix
+ if opts.scaffold_prefix == "NONE":
+ opts.scaffold_prefix = ""
+ else:
+ if "NAME" in opts.scaffold_prefix:
+ opts.scaffold_prefix = opts.scaffold_prefix.replace("NAME", opts.name)
+ print("Using the following prefix for all {} scaffolds: {}".format(opts.program, opts.scaffold_prefix), file=sys.stdout)
+
+ # Set environment variables
+ add_executables_to_environment(opts=opts)
+
+def main(args=None):
+ # Path info
+ script_directory = os.path.dirname(os.path.abspath( __file__ ))
+ script_filename = __program__
+ # Path info
+ description = """
+ Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable)
+    usage = "{} -i <reads.fq[.gz]> -n <name> [-g <estimated_assembly_size>] -o <project_directory>".format(__program__)
+ epilog = "Copyright 2021 Josh L. Espinoza (jespinoz@jcvi.org)"
+
+ # Parser
+ parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter)
+ # Pipeline
+ parser_io = parser.add_argument_group('Required I/O arguments')
+ parser_io.add_argument("-i","--reads", type=str, required=True, help = "path/to/reads.fq[.gz]")
+ parser_io.add_argument("-n", "--name", type=str, required=True, help="Name of sample")
+ parser_io.add_argument("-o","--project_directory", type=str, default="veba_output/assembly", help = "path/to/project_directory [Default: veba_output/assembly]")
+
+ # Utility
+ parser_utility = parser.add_argument_group('Utility arguments')
+    parser_utility.add_argument("--path_config", type=str, default="CONDA_PREFIX", help="path/to/config.tsv [Default: CONDA_PREFIX]") #site-packages in future
+ parser_utility.add_argument("-p", "--n_jobs", type=int, default=1, help = "Number of threads [Default: 1]")
+ parser_utility.add_argument("--random_state", type=int, default=0, help = "Random state [Default: 0]")
+ parser_utility.add_argument("--restart_from_checkpoint", type=str, default=None, help = "Restart from a particular checkpoint [Default: None]")
+ parser_utility.add_argument("-v", "--version", action='version', version="{} v{}".format(__program__, __version__))
+    parser_utility.add_argument("--tmpdir", type=str, help="Set temporary directory")
+
+ # Assembler
+ parser_assembler = parser.add_argument_group('Assembler arguments')
+    parser_assembler.add_argument("-P", "--program", type=str, default="flye", choices={"flye", "metaflye"}, help="Assembler | {flye, metaflye} [Default: 'flye']")
+ parser_assembler.add_argument("-s", "--scaffold_prefix", type=str, default="NAME__", help="Assembler | Special options: Use NAME to use --name. Use NONE to not include a prefix. [Default: 'NAME__']")
+    parser_assembler.add_argument("-m", "--minimum_contig_length", type=int, default=1, help="Minimum contig length. Be lenient here because longer thresholds can be applied during binning downstream. For metagenomes, 1000 is recommended. [Default: 1]")
+ parser_assembler.add_argument("-t", "--reads_type", type=str, default="nano-hq", choices={"nano-hq", "nano-corr", "nano-raw", "pacbio-hifi", "pacbio-corr", "pacbio-raw"}, help="Reads type for (meta)flye. {nano-hq, nano-corr, nano-raw, pacbio-hifi, pacbio-corr, pacbio-raw} [Default: nano-hq] ")
+ parser_assembler.add_argument("-g", "--estimated_assembly_size", type=str, help="Estimated assembly size (e.g., 5m, 2.6g)")
+    parser_assembler.add_argument("--no_deterministic", action="store_true", help="Do not use deterministic mode. Assembly will be faster because it is multithreaded, but results may differ between reruns")
+ parser_assembler.add_argument("--assembler_options", type=str, default="", help="Assembler options for Flye-based programs (e.g. --arg 1 ) [Default: '']")
+
+ # Aligner
+ parser_aligner = parser.add_argument_group('MiniMap2 arguments')
+ parser_aligner.add_argument("--minimap2_preset", type=str, default="map-ont", help="MiniMap2 | MiniMap2 preset {map-pb, map-ont, map-hifi} [Default: map-ont]")
+ # parser_aligner.add_argument("--no_create_index", action="store_true", help="Do not create a MiniMap2 index")
+ parser_aligner.add_argument("--minimap2_index_options", type=str, default="", help="MiniMap2 | More options (e.g. --arg 1 ) [Default: '']\nhttps://github.com/lh3/minimap2")
+ parser_aligner.add_argument("--minimap2_options", type=str, default="", help="MiniMap2 | More options (e.g. --arg 1 ) [Default: '']\nhttps://github.com/lh3/minimap2")
+
+ # featureCounts
+ parser_featurecounts = parser.add_argument_group('featureCounts arguments')
+ parser_featurecounts.add_argument("--featurecounts_options", type=str, default="", help="featureCounts | More options (e.g. --arg 1 ) [Default: ''] | http://bioinf.wehi.edu.au/featureCounts/")
+
+
+ # Options
+ opts = parser.parse_args()
+ opts.script_directory = script_directory
+ opts.script_filename = script_filename
+
+ # Threads
+ if opts.n_jobs == -1:
+ from multiprocessing import cpu_count
+ opts.n_jobs = cpu_count()
+ assert opts.n_jobs >= 1, "--n_jobs must be ≥ 1. To select all available threads, use -1."
+
+
+ # Directories
+ directories = dict()
+ directories["project"] = create_directory(opts.project_directory)
+ directories["sample"] = create_directory(os.path.join(directories["project"], opts.name))
+ directories["output"] = create_directory(os.path.join(directories["sample"], "output"))
+ directories["log"] = create_directory(os.path.join(directories["sample"], "log"))
+ if not opts.tmpdir:
+ opts.tmpdir = os.path.join(directories["sample"], "tmp")
+ directories["tmp"] = create_directory(opts.tmpdir)
+ directories["checkpoints"] = create_directory(os.path.join(directories["sample"], "checkpoints"))
+ directories["intermediate"] = create_directory(os.path.join(directories["sample"], "intermediate"))
+ # os.environ["TMPDIR"] = directories["tmp"]
+
+ # Info
+ print(format_header(__program__, "="), file=sys.stdout)
+ print(format_header("Configuration:", "-"), file=sys.stdout)
+ print(format_header("Name: {}".format(opts.name), "."), file=sys.stdout)
+ print("Python version:", sys.version.replace("\n"," "), file=sys.stdout)
+ print("Python path:", sys.executable, file=sys.stdout) #sys.path[2]
+ print("GenoPype version:", genopype_version, file=sys.stdout) #sys.path[2]
+ print("Script version:", __version__, file=sys.stdout)
+ print("Moment:", get_timestamp(), file=sys.stdout)
+ print("Directory:", os.getcwd(), file=sys.stdout)
+    if "TMPDIR" in os.environ: print("TMPDIR:", os.environ["TMPDIR"], file=sys.stdout)
+ print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout)
+ configure_parameters(opts, directories)
+ sys.stdout.flush()
+
+ # Run pipeline
+ with open(os.path.join(directories["sample"], "commands.sh"), "w") as f_cmds:
+ pipeline = create_pipeline(
+ opts=opts,
+ directories=directories,
+ f_cmds=f_cmds,
+ )
+ pipeline.compile()
+ pipeline.execute(restart_from_checkpoint=opts.restart_from_checkpoint)
+
+if __name__ == "__main__":
+ main()
diff --git a/src/assembly.py b/src/assembly.py
index 5156eff..32fc4fd 100755
--- a/src/assembly.py
+++ b/src/assembly.py
@@ -13,7 +13,7 @@
pd.options.display.max_colwidth = 100
# from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2023.10.16"
+__version__ = "2023.11.30"
# Assembly
def get_assembly_cmd( input_filepaths, output_filepaths, output_directory, directories, opts):
@@ -683,8 +683,8 @@ def main(args=None):
parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter)
# Pipeline
parser_io = parser.add_argument_group('Required I/O arguments')
- parser_io.add_argument("-1","--forward_reads", type=str, help = "path/to/forward_reads.fq")
- parser_io.add_argument("-2","--reverse_reads", type=str, help = "path/to/reverse_reads.fq")
+ parser_io.add_argument("-1","--forward_reads", type=str, help = "path/to/forward_reads.fq[.gz]")
+ parser_io.add_argument("-2","--reverse_reads", type=str, help = "path/to/reverse_reads.fq[.gz]")
parser_io.add_argument("-n", "--name", type=str, help="Name of sample", required=True)
parser_io.add_argument("-o","--project_directory", type=str, default="veba_output/assembly", help = "path/to/project_directory [Default: veba_output/assembly]")
@@ -758,6 +758,7 @@ def main(args=None):
print("Script version:", __version__, file=sys.stdout)
print("Moment:", get_timestamp(), file=sys.stdout)
print("Directory:", os.getcwd(), file=sys.stdout)
+ if "TMPDIR" in os.environ: print("TMPDIR:", os.environ["TMPDIR"], file=sys.stdout)
print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout)
configure_parameters(opts, directories)
sys.stdout.flush()
diff --git a/src/binning-eukaryotic.py b/src/binning-eukaryotic.py
index f8cfaf2..9fdc054 100755
--- a/src/binning-eukaryotic.py
+++ b/src/binning-eukaryotic.py
@@ -14,7 +14,7 @@
pd.options.display.max_colwidth = 100
# from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2023.10.16"
+__version__ = "2023.12.2"
# DATABASE_METAEUK="/usr/local/scratch/CORE/jespinoz/db/veba/v1.0/Classify/Eukaryotic/eukaryotic"
@@ -310,11 +310,13 @@ def get_eukaryotic_gene_modeling_cmd(input_filepaths, output_filepaths, output_d
# Run Eukaryotic Gene Modeling
"&&",
+
os.environ["eukaryotic_gene_modeling_wrapper.py"],
"--fasta {}".format(os.path.join(directories["tmp"], "scaffolds.binned.eukaryotic.fasta")),
"--scaffolds_to_bins {}".format(input_filepaths[1]),
"--tiara_results {}".format(input_filepaths[2]),
"--metaeuk_database {}".format(opts.metaeuk_database),
+ "--metaeuk_split_memory_limit {}".format(opts.metaeuk_split_memory_limit),
"-o {}".format(output_directory),
"-p {}".format(opts.n_jobs),
@@ -1016,8 +1018,10 @@ def main(args=None):
# MetaEuk
parser_metaeuk = parser.add_argument_group('MetaEuk arguments')
+ parser_metaeuk.add_argument("-M", "--microeuk_database", type=str, choices={"MicroEuk100", "MicroEuk90", "MicroEuk50"}, default="MicroEuk50", help="MicroEuk database {MicroEuk100, MicroEuk90, MicroEuk50} [Default: MicroEuk50]")
parser_metaeuk.add_argument("--metaeuk_sensitivity", type=float, default=4.0, help="MetaEuk | Sensitivity: 1.0 faster; 4.0 fast; 7.5 sensitive [Default: 4.0]")
parser_metaeuk.add_argument("--metaeuk_evalue", type=float, default=0.01, help="MetaEuk | List matches below this E-value (range 0.0-inf) [Default: 0.01]")
+ parser_metaeuk.add_argument("--metaeuk_split_memory_limit", type=str, default="36G", help="MetaEuk | Set max memory per split. E.g. 800B, 5K, 10M, 1G. Use 0 to use all available system memory. (Default value is experimental) [Default: 36G]")
parser_metaeuk.add_argument("--metaeuk_options", type=str, default="", help="MetaEuk | More options (e.g. --arg 1 ) [Default: ''] https://github.com/soedinglab/metaeuk")
# --split-memory-limit 70G: https://github.com/soedinglab/metaeuk/issues/59
@@ -1071,7 +1075,7 @@ def main(args=None):
if opts.veba_database is None:
assert "VEBA_DATABASE" in os.environ, "Please set the following environment variable 'export VEBA_DATABASE=/path/to/veba_database' or provide path to --veba_database"
opts.veba_database = os.environ["VEBA_DATABASE"]
- opts.metaeuk_database = os.path.join(opts.veba_database, "Classify", "Microeukaryotic", "microeukaryotic")
+ opts.metaeuk_database = os.path.join(opts.veba_database, "Classify", "MicroEuk", opts.microeuk_database)
# Directories
@@ -1097,6 +1101,7 @@ def main(args=None):
print("VEBA Database:", opts.veba_database, file=sys.stdout)
print("Moment:", get_timestamp(), file=sys.stdout)
print("Directory:", os.getcwd(), file=sys.stdout)
+ if "TMPDIR" in os.environ: print("TMPDIR:", os.environ["TMPDIR"], file=sys.stdout)
print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout)
configure_parameters(opts, directories)
sys.stdout.flush()
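The rewiring above replaces the monolithic `microeukaryotic` database path with a tier selected via `--microeuk_database`. A hypothetical standalone sketch of that resolution (the `resolve_metaeuk_database` name is illustrative, not part of VEBA):

```python
import os

MICROEUK_TIERS = {"MicroEuk100", "MicroEuk90", "MicroEuk50"}

def resolve_metaeuk_database(veba_database, microeuk_database="MicroEuk50"):
    # Mirrors the new VDB_v6 layout: <VEBA_DATABASE>/Classify/MicroEuk/<tier>
    assert microeuk_database in MICROEUK_TIERS, "Unknown MicroEuk tier"
    return os.path.join(veba_database, "Classify", "MicroEuk", microeuk_database)

print(resolve_metaeuk_database("/path/to/veba_database", "MicroEuk90"))
```

The default tier matches the module's `-M` default (`MicroEuk50`), the smallest of the clustered databases.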
diff --git a/src/binning-prokaryotic.py b/src/binning-prokaryotic.py
index 29f80c9..a52eb54 100755
--- a/src/binning-prokaryotic.py
+++ b/src/binning-prokaryotic.py
@@ -13,7 +13,7 @@
pd.options.display.max_colwidth = 100
# from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2023.10.16"
+__version__ = "2023.11.30"
# Assembly
def get_coverage_cmd( input_filepaths, output_filepaths, output_directory, directories, opts):
@@ -1683,6 +1683,7 @@ def main(args=None):
print("VEBA Database:", opts.veba_database, file=sys.stdout)
print("Moment:", get_timestamp(), file=sys.stdout)
print("Directory:", os.getcwd(), file=sys.stdout)
+ if "TMPDIR" in os.environ: print("TMPDIR:", os.environ["TMPDIR"], file=sys.stdout)
print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout)
configure_parameters(opts, directories)
sys.stdout.flush()
diff --git a/src/binning-viral.py b/src/binning-viral.py
index f109b01..55f299e 100755
--- a/src/binning-viral.py
+++ b/src/binning-viral.py
@@ -14,7 +14,7 @@
pd.options.display.max_colwidth = 100
# from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2023.10.16"
+__version__ = "2023.11.30"
# geNomad
def get_genomad_cmd(input_filepaths, output_filepaths, output_directory, directories, opts):
@@ -953,6 +953,7 @@ def main(args=None):
print("VEBA Database:", opts.veba_database, file=sys.stdout)
print("Moment:", get_timestamp(), file=sys.stdout)
print("Directory:", os.getcwd(), file=sys.stdout)
+ if "TMPDIR" in os.environ: print("TMPDIR:", os.environ["TMPDIR"], file=sys.stdout)
print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout)
configure_parameters(opts, directories)
sys.stdout.flush()
diff --git a/src/biosynthetic.py b/src/biosynthetic.py
index 5c1cb77..9996c68 100755
--- a/src/biosynthetic.py
+++ b/src/biosynthetic.py
@@ -13,7 +13,7 @@
pd.options.display.max_colwidth = 100
# from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2023.10.16"
+__version__ = "2023.12.18"
# antiSMASH
def get_antismash_cmd( input_filepaths, output_filepaths, output_directory, directories, opts):
@@ -336,7 +336,7 @@ def get_mmseqs2_protein_cmd(input_filepaths, output_filepaths, output_directory,
"&&",
- os.environ["mmseqs2_wrapper.py"],
+ os.environ["clustering_wrapper.py"],
"--fasta {}".format(os.path.join(directories["tmp"], "components.concatenated.faa")),
"--output_directory {}".format(output_directory),
"--no_singletons" if bool(opts.no_singletons) else "",
@@ -415,7 +415,7 @@ def get_mmseqs2_nucleotide_cmd(input_filepaths, output_filepaths, output_directo
"&&",
- os.environ["mmseqs2_wrapper.py"],
+ os.environ["clustering_wrapper.py"],
"--fasta {}".format(os.path.join(directories["tmp"], "bgcs.concatenated.fasta")),
"--output_directory {}".format(output_directory),
"--no_singletons" if bool(opts.no_singletons) else "",
@@ -483,7 +483,7 @@ def add_executables_to_environment(opts):
"concatenate_dataframes.py",
"bgc_novelty_scorer.py",
"compile_krona.py",
- "mmseqs2_wrapper.py",
+ "clustering_wrapper.py",
"compile_protein_cluster_prevalence_table.py",
}
@@ -860,7 +860,7 @@ def main(args=None):
# antiSMASH
parser_antismash = parser.add_argument_group('antiSMASH arguments')
parser_antismash.add_argument("-t", "--taxon", type=str, default="bacteria", help="Taxonomic classification of input sequence {bacteria,fungi} [Default: bacteria]")
- parser_antismash.add_argument("--minimum_contig_length", type=int, default=1500, help="Minimum contig length. [Default: 1500] ")
+ parser_antismash.add_argument("--minimum_contig_length", type=int, default=1, help="Minimum contig length. [Default: 1] ")
parser_antismash.add_argument("-d", "--antismash_database", type=str, default=os.path.join(site.getsitepackages()[0], "antismash", "databases"), help="antiSMASH | Database directory path [Default: {}]".format(os.path.join(site.getsitepackages()[0], "antismash", "databases")))
parser_antismash.add_argument("-s", "--hmmdetection_strictness", type=str, default="relaxed", help="antiSMASH | Defines which level of strictness to use for HMM-based cluster detection {strict,relaxed,loose} [Default: relaxed] ")
parser_antismash.add_argument("--tta_threshold", type=float, default=0.65, help="antiSMASH | Lowest GC content to annotate TTA codons at [Default: 0.65]")
@@ -881,7 +881,7 @@ def main(args=None):
# MMSEQS2
parser_mmseqs2 = parser.add_argument_group('MMSEQS2 arguments')
- parser_mmseqs2.add_argument("-a", "--algorithm", type=str, default="easy-cluster", help="MMSEQS2 | {easy-cluster, easy-linclust} [Default: easy-cluster]")
+ parser_mmseqs2.add_argument("-a", "--algorithm", type=str, default="mmseqs-cluster", choices={"mmseqs-cluster", "mmseqs-linclust"}, help="MMSEQS2 | {mmseqs-cluster, mmseqs-linclust} [Default: mmseqs-cluster]")
parser_mmseqs2.add_argument("-f","--representative_output_format", type=str, default="fasta", help = "Format of output for representative sequences: {table, fasta} [Default: fasta]") # Should fasta be the new default?
@@ -943,6 +943,7 @@ def main(args=None):
print("Script version:", __version__, file=sys.stdout)
print("Moment:", get_timestamp(), file=sys.stdout)
print("Directory:", os.getcwd(), file=sys.stdout)
+ if "TMPDIR" in os.environ: print("TMPDIR:", os.environ["TMPDIR"], file=sys.stdout)
print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout)
configure_parameters(opts, directories)
sys.stdout.flush()
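`biosynthetic.py`, like the other modules, assembles each shell command as a token list in which disabled optional flags collapse to empty strings (e.g. `"--no_singletons" if bool(opts.no_singletons) else ""`). A minimal sketch of that pattern; the `build_command` helper is illustrative, not a VEBA function:

```python
def build_command(tokens):
    # Drop empty tokens (optional flags that were not enabled) and join the rest.
    return " ".join(filter(bool, tokens))

no_singletons = False  # assumed value for illustration
cmd = build_command([
    "clustering_wrapper.py",
    "--output_directory output",
    "--no_singletons" if no_singletons else "",
])
print(cmd)  # clustering_wrapper.py --output_directory output
```

The `filter(bool, ...)` step is what makes the conditional-empty-string idiom safe: omitted flags never reach the final command line.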
diff --git a/src/classify-eukaryotic.py b/src/classify-eukaryotic.py
index 216c26c..a9bb93d 100755
--- a/src/classify-eukaryotic.py
+++ b/src/classify-eukaryotic.py
@@ -14,7 +14,7 @@
pd.options.display.max_colwidth = 100
# from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2023.10.16"
+__version__ = "2023.11.30"
# Assembly
def get_concatenate_cmd( input_filepaths, output_filepaths, output_directory, directories, opts):
@@ -160,7 +160,7 @@ def get_compile_cmd( input_filepaths, output_filepaths, output_directory, direct
return cmd
-def get_consensus_genome_classification_cmd( input_filepaths, output_filepaths, output_directory, directories, opts):
+def get_consensus_genome_classification_ranked_cmd( input_filepaths, output_filepaths, output_directory, directories, opts):
# Command
cmd = [
@@ -172,7 +172,7 @@ def get_consensus_genome_classification_cmd( input_filepaths, output_filepaths,
"|",
"tail -n +2",
"|",
- os.environ["consensus_genome_classification.py"],
+ os.environ["consensus_genome_classification_ranked.py"],
"--leniency {}".format(opts.leniency),
"-o {}".format(output_filepaths[0]),
"-r c__,o__,f__,g__,s__",
@@ -224,7 +224,7 @@ def get_consensus_cluster_classification_cmd( input_filepaths, output_filepaths,
"-n id_genome_cluster",
"-i 0",
"|",
- os.environ["consensus_genome_classification.py"],
+ os.environ["consensus_genome_classification_ranked.py"],
"--leniency {}".format(opts.leniency),
"-o {}".format(output_filepaths[0]),
"-r c__,o__,f__,g__,s__",
@@ -252,7 +252,7 @@ def add_executables_to_environment(opts):
"filter_hmmsearch_results.py",
"subset_table.py",
"compile_eukaryotic_classifications.py",
- "consensus_genome_classification.py",
+ "consensus_genome_classification_ranked.py",
"insert_column_to_table.py",
"metaeuk_wrapper.py",
"scaffolds_to_bins.py",
@@ -481,7 +481,7 @@ def create_pipeline(opts, directories, f_cmds):
# ==========
step += 1
- program = "consensus_genome_classification"
+ program = "consensus_genome_classification_ranked"
program_label = "{}__{}".format(step, program)
# Add to directories
output_directory = directories["output"]# = create_directory(os.path.join(directories["intermediate"], program_label))
@@ -504,7 +504,7 @@ def create_pipeline(opts, directories, f_cmds):
"directories":directories,
}
- cmd = get_consensus_genome_classification_cmd(**params)
+ cmd = get_consensus_genome_classification_ranked_cmd(**params)
pipeline.add_step(
id=program,
@@ -698,6 +698,7 @@ def main(args=None):
print("VEBA Database:", opts.veba_database, file=sys.stdout)
print("Moment:", get_timestamp(), file=sys.stdout)
print("Directory:", os.getcwd(), file=sys.stdout)
+ if "TMPDIR" in os.environ: print("TMPDIR:", os.environ["TMPDIR"], file=sys.stdout)
print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout)
configure_parameters(opts, directories)
sys.stdout.flush()
diff --git a/src/classify-prokaryotic.py b/src/classify-prokaryotic.py
index b5abb15..e6f1d47 100755
--- a/src/classify-prokaryotic.py
+++ b/src/classify-prokaryotic.py
@@ -15,7 +15,7 @@
pd.options.display.max_colwidth = 100
# from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2023.10.16"
+__version__ = "2023.11.30"
# GTDB-Tk
def get_gtdbtk_cmd( input_filepaths, output_filepaths, output_directory, directories, opts):
@@ -138,7 +138,7 @@ def get_consensus_cluster_classification_cmd( input_filepaths, output_filepaths,
"-i {}".format(input_filepaths[0]),
"-c {}".format(input_filepaths[1]),
"|",
- os.environ["consensus_genome_classification.py"],
+ os.environ["consensus_genome_classification_ranked.py"],
"--leniency {}".format(opts.leniency),
"-o {}".format(output_filepaths[0]),
"-u 'Unclassified prokaryote'",
@@ -158,7 +158,7 @@ def add_executables_to_environment(opts):
"compile_prokaryotic_genome_cluster_classification_scores_table.py",
# "cut_table_by_column_labels.py",
"concatenate_dataframes.py",
- "consensus_genome_classification.py",
+ "consensus_genome_classification_ranked.py",
# "insert_column_to_table.py",
"compile_krona.py",
@@ -443,6 +443,7 @@ def main(args=None):
print("VEBA Database:", opts.veba_database, file=sys.stdout)
print("Moment:", get_timestamp(), file=sys.stdout)
print("Directory:", os.getcwd(), file=sys.stdout)
+ if "TMPDIR" in os.environ: print("TMPDIR:", os.environ["TMPDIR"], file=sys.stdout)
print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout)
configure_parameters(opts, directories)
sys.stdout.flush()
diff --git a/src/classify-viral.py b/src/classify-viral.py
index ed0da0f..50cca6b 100755
--- a/src/classify-viral.py
+++ b/src/classify-viral.py
@@ -14,7 +14,7 @@
pd.options.display.max_colwidth = 100
# from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2023.10.16"
+__version__ = "2023.11.30"
def get_concatenate_cmd( input_filepaths, output_filepaths, output_directory, directories, opts):
@@ -359,6 +359,7 @@ def main(args=None):
print("VEBA Database:", opts.veba_database, file=sys.stdout)
print("Moment:", get_timestamp(), file=sys.stdout)
print("Directory:", os.getcwd(), file=sys.stdout)
+ if "TMPDIR" in os.environ: print("TMPDIR:", os.environ["TMPDIR"], file=sys.stdout)
print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout)
configure_parameters(opts, directories)
sys.stdout.flush()
diff --git a/src/cluster.py b/src/cluster.py
index 320ef00..e13263f 100755
--- a/src/cluster.py
+++ b/src/cluster.py
@@ -1,6 +1,6 @@
#!/usr/bin/env python
from __future__ import print_function, division
-import sys, os, argparse, glob
+import sys, os, argparse, glob, warnings
from collections import OrderedDict, defaultdict
import pandas as pd
@@ -13,7 +13,7 @@
pd.options.display.max_colwidth = 100
# from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2023.10.24"
+__version__ = "2023.12.11"
# Global clustering
def get_global_clustering_cmd( input_filepaths, output_filepaths, output_directory, directories, opts):
@@ -26,18 +26,35 @@ def get_global_clustering_cmd( input_filepaths, output_filepaths, output_directo
# "--no_singletons" if bool(opts.no_singletons) else "",
"-p {}".format(opts.n_jobs),
+ "--genome_clustering_algorithm {}".format(opts.genome_clustering_algorithm),
"--ani_threshold {}".format(opts.ani_threshold),
"--genome_cluster_prefix {}".format(opts.genome_cluster_prefix) if bool(opts.genome_cluster_prefix) else "",
"--genome_cluster_suffix {}".format(opts.genome_cluster_suffix) if bool(opts.genome_cluster_suffix) else "",
"--genome_cluster_prefix_zfill {}".format(opts.genome_cluster_prefix_zfill) if bool(opts.genome_cluster_prefix_zfill) else "",
+ "--skani_target_ani {}".format(opts.skani_target_ani),
+ "--skani_minimum_af {}".format(opts.skani_minimum_af),
+ "--skani_no_confidence_interval" if opts.skani_no_confidence_interval else "",
+
+ "--skani_nonviral_preset {}".format(opts.skani_nonviral_preset),
+ "--skani_nonviral_compression_factor {}".format(opts.skani_nonviral_compression_factor),
+ "--skani_nonviral_marker_kmer_compression_factor {}".format(opts.skani_nonviral_marker_kmer_compression_factor),
+ "--skani_nonviral_options {}".format(opts.skani_nonviral_options) if bool(opts.skani_nonviral_options) else "",
+
+ "--skani_viral_preset {}".format(opts.skani_viral_preset),
+ "--skani_viral_compression_factor {}".format(opts.skani_viral_compression_factor),
+ "--skani_viral_marker_kmer_compression_factor {}".format(opts.skani_viral_marker_kmer_compression_factor),
+ "--skani_viral_options {}".format(opts.skani_viral_options) if bool(opts.skani_viral_options) else "",
+
"--fastani_options {}".format(opts.fastani_options) if bool(opts.fastani_options) else "",
- "--algorithm {}".format(opts.algorithm),
+
+ "--protein_clustering_algorithm {}".format(opts.protein_clustering_algorithm),
"--minimum_identity_threshold {}".format(opts.minimum_identity_threshold),
"--minimum_coverage_threshold {}".format(opts.minimum_coverage_threshold),
"--protein_cluster_prefix {}".format(opts.protein_cluster_prefix) if bool(opts.protein_cluster_prefix) else "",
"--protein_cluster_suffix {}".format(opts.protein_cluster_suffix) if bool(opts.protein_cluster_suffix) else "",
"--protein_cluster_prefix_zfill {}".format(opts.protein_cluster_prefix_zfill) if bool(opts.protein_cluster_prefix_zfill) else "",
"--mmseqs2_options {}".format(opts.mmseqs2_options) if bool(opts.mmseqs2_options) else "",
+ "--diamond_options {}".format(opts.diamond_options) if bool(opts.diamond_options) else "",
"--minimum_core_prevalence {}".format(opts.minimum_core_prevalence),
"&&",
@@ -60,18 +77,36 @@ def get_local_clustering_cmd( input_filepaths, output_filepaths, output_director
"-o {}".format(output_directory),
# "--no_singletons" if bool(opts.no_singletons) else "",
"-p {}".format(opts.n_jobs),
+
+ "--genome_clustering_algorithm {}".format(opts.genome_clustering_algorithm),
"--ani_threshold {}".format(opts.ani_threshold),
"--genome_cluster_prefix {}".format(opts.genome_cluster_prefix) if bool(opts.genome_cluster_prefix) else "",
"--genome_cluster_suffix {}".format(opts.genome_cluster_suffix) if bool(opts.genome_cluster_suffix) else "",
"--genome_cluster_prefix_zfill {}".format(opts.genome_cluster_prefix_zfill) if bool(opts.genome_cluster_prefix_zfill) else "",
+ "--skani_target_ani {}".format(opts.skani_target_ani),
+ "--skani_minimum_af {}".format(opts.skani_minimum_af),
+ "--skani_no_confidence_interval" if opts.skani_no_confidence_interval else "",
+
+ "--skani_nonviral_preset {}".format(opts.skani_nonviral_preset),
+ "--skani_nonviral_compression_factor {}".format(opts.skani_nonviral_compression_factor),
+ "--skani_nonviral_marker_kmer_compression_factor {}".format(opts.skani_nonviral_marker_kmer_compression_factor),
+ "--skani_nonviral_options {}".format(opts.skani_nonviral_options) if bool(opts.skani_nonviral_options) else "",
+
+ "--skani_viral_preset {}".format(opts.skani_viral_preset),
+ "--skani_viral_compression_factor {}".format(opts.skani_viral_compression_factor),
+ "--skani_viral_marker_kmer_compression_factor {}".format(opts.skani_viral_marker_kmer_compression_factor),
+ "--skani_viral_options {}".format(opts.skani_viral_options) if bool(opts.skani_viral_options) else "",
+
"--fastani_options {}".format(opts.fastani_options) if bool(opts.fastani_options) else "",
- "--algorithm {}".format(opts.algorithm),
+
+ "--protein_clustering_algorithm {}".format(opts.protein_clustering_algorithm),
"--minimum_identity_threshold {}".format(opts.minimum_identity_threshold),
"--minimum_coverage_threshold {}".format(opts.minimum_coverage_threshold),
"--protein_cluster_prefix {}".format(opts.protein_cluster_prefix) if bool(opts.protein_cluster_prefix) else "",
"--protein_cluster_suffix {}".format(opts.protein_cluster_suffix) if bool(opts.protein_cluster_suffix) else "",
"--protein_cluster_prefix_zfill {}".format(opts.protein_cluster_prefix_zfill) if bool(opts.protein_cluster_prefix_zfill) else "",
"--mmseqs2_options {}".format(opts.mmseqs2_options) if bool(opts.mmseqs2_options) else "",
+ "--diamond_options {}".format(opts.diamond_options) if bool(opts.diamond_options) else "",
"--minimum_core_prevalence {}".format(opts.minimum_core_prevalence),
"&&",
@@ -107,8 +142,10 @@ def add_executables_to_environment(opts):
required_executables={
# 1
+ "skani",
"fastANI",
"mmseqs",
+ "diamond",
} | accessory_scripts
if opts.path_config == "CONDA_PREFIX":
@@ -142,6 +179,21 @@ def add_executables_to_environment(opts):
# Pipeline
def create_pipeline(opts, directories, f_cmds):
+
+ # Genome clustering algorithm
+ GENOME_CLUSTERING_ALGORITHM = opts.genome_clustering_algorithm.lower()
+ if GENOME_CLUSTERING_ALGORITHM == "fastani":
+ GENOME_CLUSTERING_ALGORITHM = "FastANI"
+ if GENOME_CLUSTERING_ALGORITHM == "skani":
+ GENOME_CLUSTERING_ALGORITHM = "skani"
+
+ # Protein clustering algorithm
+ PROTEIN_CLUSTERING_ALGORITHM = opts.protein_clustering_algorithm.split("-")[0].lower()
+ if PROTEIN_CLUSTERING_ALGORITHM == "mmseqs":
+ PROTEIN_CLUSTERING_ALGORITHM = PROTEIN_CLUSTERING_ALGORITHM.upper()
+ if PROTEIN_CLUSTERING_ALGORITHM == "diamond":
+ PROTEIN_CLUSTERING_ALGORITHM = PROTEIN_CLUSTERING_ALGORITHM.capitalize()
+
# .................................................................
# Primordial
# .................................................................
@@ -159,7 +211,7 @@ def create_pipeline(opts, directories, f_cmds):
output_directory = directories[("intermediate", program_label)] = create_directory(os.path.join(directories["intermediate"], program_label))
# Info
- description = "Global clustering of genomes (FastANI) and proteins (MMSEQS2)"
+ description = "Global clustering of genomes ({}) and proteins ({})".format(GENOME_CLUSTERING_ALGORITHM, PROTEIN_CLUSTERING_ALGORITHM)
# i/o
input_filepaths = [opts.genomes_table]
@@ -206,7 +258,7 @@ def create_pipeline(opts, directories, f_cmds):
output_directory = directories[("intermediate", program_label)] = create_directory(os.path.join(directories["intermediate"], program_label))
# Info
- description = "Local clustering of genomes (FastANI) and proteins (MMSEQS2)"
+ description = "Local clustering of genomes ({}) and proteins ({})".format(GENOME_CLUSTERING_ALGORITHM, PROTEIN_CLUSTERING_ALGORITHM)
# i/o
input_filepaths = [opts.genomes_table]
@@ -245,8 +297,20 @@ def create_pipeline(opts, directories, f_cmds):
# Configure parameters
def configure_parameters(opts, directories):
- assert_acceptable_arguments(opts.algorithm, {"easy-cluster", "easy-linclust"})
+
+ assert_acceptable_arguments(opts.protein_clustering_algorithm, {"easy-cluster", "easy-linclust", "mmseqs-cluster", "mmseqs-linclust", "diamond-cluster", "diamond-linclust"})
+ if opts.protein_clustering_algorithm in {"easy-cluster", "easy-linclust"}:
+ d = {"easy-cluster":"mmseqs-cluster", "easy-linclust":"mmseqs-linclust"}
+ warnings.warn("\n\nPlease use `{}` instead of `{}` for MMSEQS2 clustering.".format(d[opts.protein_clustering_algorithm], opts.protein_clustering_algorithm))
+ opts.protein_clustering_algorithm = d[opts.protein_clustering_algorithm]
+ if opts.skani_nonviral_preset.lower() == "none":
+ opts.skani_nonviral_preset = None
+
+ if opts.skani_viral_preset.lower() == "none":
+ opts.skani_viral_preset = None
+
+ assert 0 < opts.minimum_core_prevalence <= 1.0, "--minimum_core_prevalence must be a float in the range (0.0, 1.0]"
# Set environment variables
add_executables_to_environment(opts=opts)
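The backward-compatibility branch above can be read as a small alias shim; a hypothetical standalone version (function and dict names are illustrative):

```python
import warnings

# Pre-v1.4.0 MMSEQS2 names mapped onto the new syntax.
LEGACY_ALIASES = {"easy-cluster": "mmseqs-cluster", "easy-linclust": "mmseqs-linclust"}

def normalize_protein_clustering_algorithm(name):
    # Accept the legacy names, but warn and return the new spelling.
    if name in LEGACY_ALIASES:
        warnings.warn("Please use `{}` instead of `{}` for MMSEQS2 clustering.".format(LEGACY_ALIASES[name], name))
        return LEGACY_ALIASES[name]
    return name

print(normalize_protein_clustering_algorithm("easy-cluster"))  # mmseqs-cluster
```

Warning rather than erroring keeps existing v1.3.x invocations working while steering users toward the renamed options.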
@@ -257,7 +321,7 @@ def main(args=None):
# Path info
description = """
Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable)
- usage = "{} -i -o -A 95 -a easy-cluster".format(__program__)
+ usage = "{} -i -o -A 95 -P mmseqs-cluster".format(__program__)
epilog = "Copyright 2021 Josh L. Espinoza (jespinoz@jcvi.org)"
# Parser
@@ -276,24 +340,45 @@ def main(args=None):
parser_utility.add_argument("--restart_from_checkpoint", type=str, default=None, help = "Restart from a particular checkpoint [Default: None]")
parser_utility.add_argument("-v", "--version", action='version', version="{} v{}".format(__program__, __version__))
- # FastANI
+ # ANI
+ parser_genome_clustering = parser.add_argument_group('Genome clustering arguments')
+ parser_genome_clustering.add_argument("-G", "--genome_clustering_algorithm", type=str, choices={"fastani", "skani"}, default="skani", help="Program to use for ANI calculations. `skani` is faster and more memory efficient. For v1.0.0 - v1.3.x behavior, use `fastani`. [Default: skani]")
+ parser_genome_clustering.add_argument("-A", "--ani_threshold", type=float, default=95.0, help="Species-level cluster (SLC) ANI threshold (Range (0.0, 100.0]) [Default: 95.0]")
+ parser_genome_clustering.add_argument("--genome_cluster_prefix", type=str, default="SLC-", help="Cluster prefix [Default: 'SLC-']")
+ parser_genome_clustering.add_argument("--genome_cluster_suffix", type=str, default="", help="Cluster suffix [Default: '']")
+ parser_genome_clustering.add_argument("--genome_cluster_prefix_zfill", type=int, default=0, help="Cluster prefix zfill. Use 7 to match identifiers from OrthoFinder. Use 0 to add no zfill. [Default: 0]") #7
+
+ parser_skani = parser.add_argument_group('Skani triangle arguments')
+ parser_skani.add_argument("--skani_target_ani", type=float, default=80, help="skani | Target ANI used for screening. If you set --skani_target_ani equal to --ani_threshold, you may inadvertently screen out genomes with ANI ≥ --ani_threshold [Default: 80]")
+ parser_skani.add_argument("--skani_minimum_af", type=float, default=15, help="skani | Only consider genome pairs with an aligned fraction greater than this value [Default: 15]")
+ parser_skani.add_argument("--skani_no_confidence_interval", action="store_true", help="skani | Do not output [5,95] ANI confidence intervals computed via percentile bootstrap on the putative ANI distribution")
+ # parser_skani.add_argument("--skani_low_memory", action="store_true", help="Skani | More options (e.g. --arg 1 ) https://github.com/bluenote-1577/skani [Default: '']")
+
+ parser_skani = parser.add_argument_group('[Prokaryotic & Eukaryotic] Skani triangle arguments')
+ parser_skani.add_argument("--skani_nonviral_preset", type=str, default="medium", choices={"fast", "medium", "slow", "none"}, help="skani [Prokaryotic & Eukaryotic] | Use `none` if you are setting skani -c (compression factor) {fast, medium, slow, none} [Default: medium]")
+ parser_skani.add_argument("--skani_nonviral_compression_factor", type=int, default=125, help="skani [Prokaryotic & Eukaryotic]| Compression factor (k-mer subsampling rate). [Default: 125]")
+ parser_skani.add_argument("--skani_nonviral_marker_kmer_compression_factor", type=int, default=1000, help="skani [Prokaryotic & Eukaryotic] | Marker k-mer compression factor. Markers are used for filtering. [Default: 1000]")
+ parser_skani.add_argument("--skani_nonviral_options", type=str, default="", help="skani [Prokaryotic & Eukaryotic] | More options for `skani triangle` (e.g. --arg 1 ) [Default: '']")
+
+ parser_skani = parser.add_argument_group('[Viral] Skani triangle arguments')
+ parser_skani.add_argument("--skani_viral_preset", type=str, default="slow", choices={"fast", "medium", "slow", "none"}, help="skani [Viral] | Use `none` if you are setting skani -c (compression factor) {fast, medium, slow, none} [Default: slow]")
+ parser_skani.add_argument("--skani_viral_compression_factor", type=int, default=30, help="skani [Viral] | Compression factor (k-mer subsampling rate). [Default: 30]")
+ parser_skani.add_argument("--skani_viral_marker_kmer_compression_factor", type=int, default=200, help="skani [Viral] | Marker k-mer compression factor. Markers are used for filtering. The default is already lowered for small genomes (e.g. plasmids or viruses). [Default: 200]")
+ parser_skani.add_argument("--skani_viral_options", type=str, default="", help="skani [Viral] | More options for `skani triangle` (e.g. --arg 1 ) [Default: '']")
+
parser_fastani = parser.add_argument_group('FastANI arguments')
- parser_fastani.add_argument("-A", "--ani_threshold", type=float, default=95.0, help="FastANI | Species-level cluster (SLC) ANI threshold (Range (0.0, 100.0]) [Default: 95.0]")
- parser_fastani.add_argument("--genome_cluster_prefix", type=str, default="SLC-", help="Cluster prefix [Default: 'SLC-")
- parser_fastani.add_argument("--genome_cluster_suffix", type=str, default="", help="Cluster suffix [Default: '")
- parser_fastani.add_argument("--genome_cluster_prefix_zfill", type=int, default=0, help="Cluster prefix zfill. Use 7 to match identifiers from OrthoFinder. Use 0 to add no zfill. [Default: 0]") #7
parser_fastani.add_argument("--fastani_options", type=str, default="", help="FastANI | More options (e.g. --arg 1 ) [Default: '']")
-
- # MMSEQS2
- parser_mmseqs2 = parser.add_argument_group('MMSEQS2 arguments')
- parser_mmseqs2.add_argument("-a", "--algorithm", type=str, default="easy-cluster", help="MMSEQS2 | {easy-cluster, easy-linclust} [Default: easy-cluster]")
- parser_mmseqs2.add_argument("-t", "--minimum_identity_threshold", type=float, default=50.0, help="MMSEQS2 | SLC-Specific Protein Cluster (SSPC, previously referred to as SSO) percent identity threshold (Range (0.0, 100.0]) [Default: 50.0]")
- parser_mmseqs2.add_argument("-c", "--minimum_coverage_threshold", type=float, default=0.8, help="MMSEQS2 | SSPC coverage threshold (Range (0.0, 1.0]) [Default: 0.8]")
- parser_mmseqs2.add_argument("--protein_cluster_prefix", type=str, default="SSPC-", help="Cluster prefix [Default: 'SSPC-")
- parser_mmseqs2.add_argument("--protein_cluster_suffix", type=str, default="", help="Cluster suffix [Default: '")
- parser_mmseqs2.add_argument("--protein_cluster_prefix_zfill", type=int, default=0, help="Cluster prefix zfill. Use 7 to match identifiers from OrthoFinder. Use 0 to add no zfill. [Default: 0]") #7
- parser_mmseqs2.add_argument("--mmseqs2_options", type=str, default="", help="MMSEQS2 | More options (e.g. --arg 1 ) [Default: '']")
+ # Clustering
+ parser_protein_clustering = parser.add_argument_group('Protein clustering arguments')
+ parser_protein_clustering.add_argument("-P", "--protein_clustering_algorithm", type=str, choices={"mmseqs-cluster", "mmseqs-linclust", "diamond-cluster", "diamond-linclust"}, default="mmseqs-cluster", help="Clustering algorithm | Diamond can only be used for protein clustering {mmseqs-cluster, mmseqs-linclust, diamond-cluster, diamond-linclust} [Default: mmseqs-cluster]")
+ parser_protein_clustering.add_argument("-t", "--minimum_identity_threshold", type=float, default=50.0, help="Clustering | Percent identity threshold (Range (0.0, 100.0]) [Default: 50.0]")
+ parser_protein_clustering.add_argument("-c", "--minimum_coverage_threshold", type=float, default=0.8, help="Clustering | Coverage threshold (Range (0.0, 1.0]) [Default: 0.8]")
+ parser_protein_clustering.add_argument("--protein_cluster_prefix", type=str, default="SSPC-", help="Cluster prefix [Default: 'SSPC-']")
+ parser_protein_clustering.add_argument("--protein_cluster_suffix", type=str, default="", help="Cluster suffix [Default: '']")
+ parser_protein_clustering.add_argument("--protein_cluster_prefix_zfill", type=int, default=0, help="Cluster prefix zfill. Use 7 to match identifiers from OrthoFinder. Use 0 to add no zfill. [Default: 0]") #7
+ parser_protein_clustering.add_argument("--mmseqs2_options", type=str, default="", help="MMSEQS2 | More options (e.g. --arg 1 ) [Default: '']")
+ parser_protein_clustering.add_argument("--diamond_options", type=str, default="", help="Diamond | More options (e.g. --arg 1 ) [Default: '']")
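The new `-P/--protein_clustering_algorithm` choices above encode both the program and its clustering mode in a single token. A minimal sketch of how a backend might split that token into a dispatchable pair; the helper name `parse_clustering_algorithm` is hypothetical, not part of VEBA, but the choices mirror the argparse group above:

```python
# Sketch (hypothetical helper): split an algorithm token like
# "diamond-linclust" into a (program, mode) pair for backend dispatch.
def parse_clustering_algorithm(algorithm):
    valid = {"mmseqs-cluster", "mmseqs-linclust", "diamond-cluster", "diamond-linclust"}
    if algorithm not in valid:
        raise ValueError("Unknown clustering algorithm: {}".format(algorithm))
    program, mode = algorithm.split("-")
    return program, mode
```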
# Pangenome
parser_pangenome = parser.add_argument_group('Pangenome arguments')
@@ -329,6 +414,7 @@ def main(args=None):
print("Script version:", __version__, file=sys.stdout)
print("Moment:", get_timestamp(), file=sys.stdout)
print("Directory:", os.getcwd(), file=sys.stdout)
+ if "TMPDIR" in os.environ: print("Temporary directory:", os.environ["TMPDIR"], file=sys.stdout)
print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout)
configure_parameters(opts, directories)
sys.stdout.flush()
diff --git a/src/coverage-long.py b/src/coverage-long.py
new file mode 100755
index 0000000..d282754
--- /dev/null
+++ b/src/coverage-long.py
@@ -0,0 +1,587 @@
+#!/usr/bin/env python
+from __future__ import print_function, division
+import sys, os, argparse, glob
+from collections import OrderedDict, defaultdict
+
+import pandas as pd
+
+# Soothsayer Ecosystem
+from genopype import *
+from genopype import __version__ as genopype_version
+from soothsayer_utils import *
+
+pd.options.display.max_colwidth = 100
+# from tqdm import tqdm
+__program__ = os.path.split(sys.argv[0])[-1]
+__version__ = "2023.12.18"
+
+# Assembly
+def get_index_cmd( input_filepaths, output_filepaths, output_directory, directories, opts):
+
+ # Command
+ cmd = [
+ # Filtering out small contigs
+ "cat",
+ opts.fasta,
+ "|",
+ os.environ["seqkit"],
+ "seq",
+ "-m {}".format(opts.minimum_contig_length),
+ "-j {}".format(opts.n_jobs),
+ opts.seqkit_seq_options,
+ ">",
+ output_filepaths[0],
+
+ # Create SAF file
+ "&&",
+ os.environ["fasta_to_saf.py"],
+ "-i {}".format(output_filepaths[0]),
+ ">",
+ output_filepaths[1],
+
+ "&&",
+
+ # Minimap2 Index
+ os.environ["minimap2"],
+ "-t {}".format(opts.n_jobs),
+ # "--seed {}".format(opts.random_state),
+ opts.minimap2_index_options,
+ "-d {}".format(output_filepaths[3]), # Index
+ output_filepaths[0], # Reference
+
+ # Get stats for reference
+ "&&",
+ os.environ["seqkit"],
+ "stats",
+ "-a",
+ "-j {}".format(opts.n_jobs),
+ "-T",
+ "-b",
+ output_filepaths[0],
+ ">",
+ output_filepaths[2],
+ ]
+
+ return cmd
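`get_index_cmd` pipes the length-filtered FASTA through `fasta_to_saf.py` so that featureCounts (invoked later with `-F SAF`) can count reads per contig. A rough sketch of that conversion, assuming the standard SAF layout (GeneID, Chr, Start, End, Strand) with one full-length record per contig; the helper below is illustrative, not VEBA's actual script:

```python
# Illustrative sketch (NOT VEBA's fasta_to_saf.py): build one full-length
# SAF record per contig. SAF columns: GeneID, Chr, Start, End, Strand.
def fasta_to_saf_records(fasta_lines):
    records = []
    name, length = None, 0
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, name, 1, length, "+"))
            name = line[1:].split()[0]  # record ID up to first whitespace
            length = 0
        else:
            length += len(line)
    if name is not None:
        records.append((name, name, 1, length, "+"))
    return records
```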
+
+
+# # Bowtie2
+# def get_alignment_gnuparallel_cmd(input_filepaths, output_filepaths, output_directory, directories, opts):
+
+# # Command
+# cmd = [
+
+# # MAKE THIS A FOR LOOP WITH MAX THREADS FOR EACH ONE. THE REASON FOR THIS IS THAT IF THERE IS A SMALL SAMPLE IT WILL BE DONE QUICK BUT THE LARGER SAMPLES ARE GOING TO BE STUCK WITH ONE THREAD STILL
+# """
+# # Clear temporary directory just in case
+
+# rm -rf %s
+
+# # Minimap2
+# %s --jobs %d -a %s -C "\t" "mkdir -p %s && %s -x %s -1 {2} -2 {3} --threads 1 --seed %d --no-unal %s | %s sort --threads 1 --reference %s -T %s > %s && %s index -@ 1 %s"
+
+# """%(
+# os.path.join(directories["tmp"], "*"),
+
+# # Parallel
+# os.environ["parallel"],
+# opts.n_jobs,
+# input_filepaths[0],
+
+# # Make directory
+# os.path.join(output_directory, "{1}"),
+
+# # Bowtie2
+# os.environ["minimap2"],
+# input_filepaths[1],
+# opts.random_state,
+# opts.bowtie2_options,
+
+# # Samtools sort
+# os.environ["samtools"],
+# input_filepaths[0],
+# os.path.join(directories["tmp"], "samtools_sort_{1}"),
+# os.path.join(output_directory, "{1}", "mapped.sorted.bam"),
+
+# # Samtools index
+# os.environ["samtools"],
+# os.path.join(output_directory, "{1}", "mapped.sorted.bam"),
+
+# ),
+
+
+# ]
+
+# return cmd
+
+def get_alignment_cmd(input_filepaths, output_filepaths, output_directory, directories, opts):
+
+ # Command
+ cmd = [
+
+"""
+ # Clear temporary directory just in case
+rm -rf %s
+
+# Read lines
+READ_TABLE=%s
+
+while IFS= read -r LINE
+do echo $LINE
+ # Split fields
+ ID_SAMPLE=$(echo $LINE | cut -f1 -d " ")
+ READS=$(echo $LINE | cut -f2 -d " ")
+
+ # Create subdirectory
+ mkdir -p %s
+
+ OUTPUT_BAM="%s"
+
+ # Minimap2
+ if [[ -e "$OUTPUT_BAM" && -s "$OUTPUT_BAM" ]]; then
+ echo "[Skipping (Exists)] [Minimap2] [$ID_SAMPLE]"
+ else
+ echo "[Running] [Minimap2] [$ID_SAMPLE]"
+ %s -a -x %s -t %d %s %s $READS | %s view -h -b -F 4 | %s sort -@ %d --reference %s -T %s > $OUTPUT_BAM && %s index -@ %d $OUTPUT_BAM
+ fi
+done < $READ_TABLE
+
+"""%(
+ # Clear temporary directory just in case
+ os.path.join(directories["tmp"], "*"),
+
+ # Read lines
+ input_filepaths[0],
+
+ # Make directory
+ os.path.join(output_directory, "${ID_SAMPLE}"),
+
+ # Output BAM
+ os.path.join(output_directory, "${ID_SAMPLE}", "mapped.sorted.bam"),
+
+
+ # Minimap2
+ os.environ["minimap2"],
+ opts.minimap2_preset,
+ opts.n_jobs,
+ opts.minimap2_options,
+ input_filepaths[2],
+
+
+ # Samtools view
+ os.environ["samtools"],
+
+
+ # Samtools sort
+ os.environ["samtools"],
+ opts.n_jobs,
+ input_filepaths[1],
+ os.path.join(directories["tmp"], "samtools_sort_${ID_SAMPLE}"),
+ # os.path.join(output_directory, "${ID_SAMPLE}", "mapped.sorted.bam"),
+
+ # Samtools index
+ os.environ["samtools"],
+ opts.n_jobs,
+ # os.path.join(output_directory, "${ID_SAMPLE}", "mapped.sorted.bam"),
+ ),
+
+ ]
+
+ return cmd
+
+
+# featureCounts
+def get_featurecounts_cmd(input_filepaths, output_filepaths, output_directory, directories, opts):
+
+ # Command
+
+ # ORF-Level Counts
+ cmd = [
+ "mkdir -p {}".format(os.path.join(directories["tmp"], "featurecounts")),
+ "&&",
+ "(",
+ os.environ["featureCounts"],
+ # "-G {}".format(input_filepaths[0]),
+ "-a {}".format(input_filepaths[0]),
+ "-o {}".format(os.path.join(output_directory, "featurecounts.tsv")),
+ "-F SAF",
+ "--tmpDir {}".format(os.path.join(directories["tmp"], "featurecounts")),
+ "-T {}".format(opts.n_jobs),
+ "-L",
+ opts.featurecounts_options,
+ *input_filepaths[1:],
+ ")",
+ "&&",
+ "gzip -f {}".format(os.path.join(output_directory, "featurecounts.tsv")),
+ ]
+ return cmd
+
+
+
+# Symlink
+def get_symlink_cmd(input_filepaths, output_filepaths, output_directory, directories, opts):
+ # Command
+ cmd = [
+ "DST={}; (for SRC in {}; do SRC=$(realpath --relative-to $DST $SRC); ln -sf $SRC $DST; done)".format(
+ output_directory,
+ " ".join(input_filepaths),
+ )
+ ]
+ return cmd
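`get_symlink_cmd` uses `realpath --relative-to` so the links placed in the output directory remain valid if the whole project directory is moved. The same idiom in pure Python, as a sketch (`symlink_relative` is a hypothetical helper, not part of this module):

```python
import os

# Pure-Python sketch of the `realpath --relative-to` + `ln -sf` idiom in
# get_symlink_cmd: link sources relative to the destination directory so
# links survive relocating the project directory as a whole.
def symlink_relative(src, dst_directory):
    rel_src = os.path.relpath(os.path.realpath(src), start=dst_directory)
    dst = os.path.join(dst_directory, os.path.basename(src.rstrip("/")))
    if os.path.lexists(dst):
        os.remove(dst)  # mimic `ln -sf` (overwrite an existing link)
    os.symlink(rel_src, dst)
    return dst
```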
+
+# ============
+# Run Pipeline
+# ============
+# Set environment variables
+def add_executables_to_environment(opts):
+ """
+ Adapted from Soothsayer: https://github.com/jolespin/soothsayer
+ """
+ accessory_scripts = {
+ "fasta_to_saf.py"
+ }
+
+ required_executables={
+ "minimap2",
+ "samtools",
+ "featureCounts",
+ "seqkit",
+ # "parallel",
+ } | accessory_scripts
+
+ if opts.path_config == "CONDA_PREFIX":
+ executables = dict()
+ for name in required_executables:
+ executables[name] = os.path.join(os.environ["CONDA_PREFIX"], "bin", name)
+ else:
+ if opts.path_config is None:
+ opts.path_config = os.path.join(opts.script_directory, "veba_config.tsv")
+ opts.path_config = format_path(opts.path_config)
+ assert os.path.exists(opts.path_config), "config file does not exist. Have you created one in the following directory?\n{}\nIf not, either create one, check this filepath:{}, or give the path to a proper config file using --path_config".format(opts.script_directory, opts.path_config)
+ assert os.stat(opts.path_config).st_size > 1, "config file seems to be empty. Please add 'name' and 'executable' columns for the following program names: {}".format(required_executables)
+ df_config = pd.read_csv(opts.path_config, sep="\t")
+ assert {"name", "executable"} <= set(df_config.columns), "config must have `name` and `executable` columns. Please adjust file: {}".format(opts.path_config)
+ df_config = df_config.loc[:,["name", "executable"]].dropna(how="any", axis=0).applymap(str)
+ # Get executable paths
+ executables = OrderedDict(zip(df_config["name"], df_config["executable"]))
+ assert required_executables <= set(list(executables.keys())), "config must have the required executables for this run. Please adjust file: {}\nIn particular, add info for the following: {}".format(opts.path_config, required_executables - set(list(executables.keys())))
+
+ # Display
+ for name in sorted(accessory_scripts):
+ executables[name] = "'{}'".format(os.path.join(opts.script_directory, "scripts", name)) # Can handle spaces in path
+ print(format_header( "Adding executables to path from the following source: {}".format(opts.path_config), "-"), file=sys.stdout)
+ for name, executable in executables.items():
+ if name in required_executables:
+ print(name, executable, sep = " --> ", file=sys.stdout)
+ os.environ[name] = executable.strip()
+ print("", file=sys.stdout)
+
+# Pipeline
+def create_pipeline(opts, directories, f_cmds):
+
+ # .................................................................
+ # Primordial
+ # .................................................................
+ # Commands file
+ pipeline = ExecutablePipeline(name=__program__, description="Coverage", f_cmds=f_cmds, checkpoint_directory=directories["checkpoints"], log_directory=directories["log"])
+
+ # ==========
+ # Assembly
+ # ==========
+
+ step = 1
+
+ # Info
+ program = "index"
+ program_label = "{}__{}".format(step, program)
+ description = "Preprocess fasta file and build minimap2 index"
+
+ # Add to directories
+ output_directory = directories[("intermediate", program_label)] = create_directory(os.path.join(directories["intermediate"], program_label))
+
+
+ # i/o
+ input_filepaths = [opts.fasta]
+ output_filenames = ["reference.fasta", "reference.fasta.saf", "seqkit_stats.tsv", "reference.mmi"]
+
+
+ output_filepaths = list(map(lambda filename: os.path.join(output_directory, filename), output_filenames))
+
+ params = {
+ "input_filepaths":input_filepaths,
+ "output_filepaths":output_filepaths,
+ "output_directory":output_directory,
+ "opts":opts,
+ "directories":directories,
+ }
+
+ cmd = get_index_cmd(**params)
+ pipeline.add_step(
+ id=program_label,
+ description = description,
+ step=step,
+ cmd=cmd,
+ input_filepaths = input_filepaths,
+ output_filepaths = output_filepaths,
+ validate_inputs=True,
+ validate_outputs=True,
+ log_prefix=program_label,
+
+ )
+
+ # ==========
+ # Alignment
+ # ==========
+
+ step = 2
+
+ # Info
+ program = "alignment"
+ program_label = "{}__{}".format(step, program)
+ description = "Aligning reads to reference"
+
+ # Add to directories
+ output_directory = directories[("intermediate", program_label)] = create_directory(os.path.join(directories["intermediate"], program_label))
+
+
+ # i/o
+ input_filepaths = [
+ opts.reads,
+ os.path.join(directories[("intermediate", "1__index")], "reference.fasta"),
+ os.path.join(directories[("intermediate", "1__index")], "reference.mmi"),
+ ]
+
+
+
+ output_filenames = ["*/mapped.sorted.bam"]
+ output_filepaths = list(map(lambda filename: os.path.join(output_directory, filename), output_filenames))
+
+
+ params = {
+ "input_filepaths":input_filepaths,
+ "output_filepaths":output_filepaths,
+ "output_directory":output_directory,
+ "opts":opts,
+ "directories":directories,
+ }
+
+ # if not opts.one_task_per_cpu:
+ cmd = get_alignment_cmd(**params)
+ # else:
+ # cmd = get_alignment_gnuparallel_cmd(**params)
+ pipeline.add_step(
+ id=program_label,
+ description = description,
+ step=step,
+ cmd=cmd,
+ input_filepaths = input_filepaths,
+ output_filepaths = output_filepaths,
+ validate_inputs=True,
+ validate_outputs=True,
+ log_prefix=program_label,
+
+ )
+
+ # ==========
+ # featureCounts
+ # ==========
+ step = 3
+
+ # Info
+ program = "featurecounts"
+ program_label = "{}__{}".format(step, program)
+ description = "Counting reads"
+
+ # Add to directories
+ output_directory = directories[("intermediate", program_label)] = create_directory(os.path.join(directories["intermediate"], program_label))
+
+ # i/o
+
+ input_filepaths = [
+ os.path.join(directories[("intermediate", "1__index")], "reference.fasta.saf"),
+ os.path.join(directories[("intermediate", "2__alignment")], "*", "mapped.sorted.bam"),
+ ]
+
+ output_filenames = ["featurecounts.tsv.gz"]
+ output_filepaths = list(map(lambda filename: os.path.join(output_directory, filename), output_filenames))
+
+ params = {
+ "input_filepaths":input_filepaths,
+ "output_filepaths":output_filepaths,
+ "output_directory":output_directory,
+ "opts":opts,
+ "directories":directories,
+ }
+
+ cmd = get_featurecounts_cmd(**params)
+ pipeline.add_step(
+ id=program_label,
+ description = description,
+ step=step,
+ cmd=cmd,
+ input_filepaths = input_filepaths,
+ output_filepaths = output_filepaths,
+ validate_inputs=True,
+ validate_outputs=True,
+ log_prefix=program_label,
+
+ )
+
+
+
+
+ # =============
+ # Symlink
+ # =============
+ step = 4
+
+ # Info
+ program = "symlink"
+ program_label = "{}__{}".format(step, program)
+ description = "Symlinking relevant output files"
+
+ # Add to directories
+ output_directory = directories["output"]
+
+ # i/o
+
+ input_filepaths = [
+ os.path.join(directories[("intermediate", "1__index")], "reference.fasta"),
+ os.path.join(directories[("intermediate", "1__index")], "reference.fasta.saf"),
+ os.path.join(directories[("intermediate", "1__index")], "seqkit_stats.tsv"),
+ os.path.join(directories[("intermediate", "2__alignment")], "*"),
+ os.path.join(directories[("intermediate", "3__featurecounts")], "featurecounts.tsv.gz"),
+ ]
+
+ output_filenames = map(lambda fp: fp.split("/")[-1], input_filepaths)
+ output_filepaths = list(map(lambda fn:os.path.join(directories["output"], fn), output_filenames))
+
+ params = {
+ "input_filepaths":input_filepaths,
+ "output_filepaths":output_filepaths,
+ "output_directory":output_directory,
+ "opts":opts,
+ "directories":directories,
+ }
+
+ cmd = get_symlink_cmd(**params)
+ pipeline.add_step(
+ id=program_label,
+ description = description,
+ step=step,
+ cmd=cmd,
+ input_filepaths = input_filepaths,
+ output_filepaths = output_filepaths,
+ validate_inputs=True,
+ validate_outputs=True,
+ log_prefix=program_label,
+
+ )
+
+ return pipeline
+
+# Configure parameters
+def configure_parameters(opts, directories):
+ # os.environ[]
+
+ # assert not bool(opts.unpaired_reads), "Cannot have --unpaired_reads if --forward_reads. Note, this behavior may be changed in the future but it's an adaptation of interleaved reads."
+ df = pd.read_csv(opts.reads, sep="\t", header=None)
+ n, m = df.shape
+ assert m == 2, "--reads must be a 2-column table separated by tabs with no header. Currently there are {} columns".format(m)
+ # Set environment variables
+ add_executables_to_environment(opts=opts)
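`configure_parameters` expects `--reads` to be a headerless, tab-separated table of `[id_sample][path/to/reads.fastq.gz]`. A small sketch of that validation step that also returns a sample-to-reads mapping (the `load_reads_table` name is illustrative, not part of this module):

```python
import pandas as pd

# Sketch of the --reads table validation above: a headerless, tab-separated
# table of [id_sample][path/to/reads.fastq.gz]. Returns {sample: reads path}.
def load_reads_table(path_or_handle):
    df = pd.read_csv(path_or_handle, sep="\t", header=None)
    n, m = df.shape
    assert m == 2, "--reads must be a 2-column tab-separated table with no header. Currently there are {} columns".format(m)
    return dict(zip(df[0], df[1]))
```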
+
+def main(args=None):
+ # Path info
+ script_directory = os.path.dirname(os.path.abspath( __file__ ))
+ script_filename = __program__
+ # Path info
+ description = """
+ Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable)
+ usage = "{} -f <reference.fasta> -r <reads_table.tsv> -o <output_directory>".format(__program__)
+ epilog = "Copyright 2021 Josh L. Espinoza (jespinoz@jcvi.org)"
+
+ # Parser
+ parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter)
+ # Pipeline
+ parser_io = parser.add_argument_group('Required I/O arguments')
+ parser_io.add_argument("-f","--fasta", type=str, required=True, help = "path/to/reference.fasta. Recommended usage is for merging unbinned contigs. [Required]")
+ parser_io.add_argument("-r","--reads", type=str, required = True, help = "path/to/reads_table.tsv with the following format: [id_sample][path/to/reads.fastq.gz], No header")
+ parser_io.add_argument("-o","--output_directory", type=str, default="veba_output/assembly/multisample", help = "path/to/project_directory [Default: veba_output/assembly/multisample]")
+
+ # Utility
+ parser_utility = parser.add_argument_group('Utility arguments')
+ parser_utility.add_argument("--path_config", type=str, default="CONDA_PREFIX", help="path/to/config.tsv [Default: CONDA_PREFIX]") #site-packges in future
+ parser_utility.add_argument("-p", "--n_jobs", type=int, default=1, help = "Number of threads [Default: 1]")
+ parser_utility.add_argument("--random_state", type=int, default=0, help = "Random state [Default: 0]")
+ parser_utility.add_argument("--restart_from_checkpoint", type=str, default=None, help = "Restart from a particular checkpoint [Default: None]")
+ parser_utility.add_argument("-v", "--version", action='version', version="{} v{}".format(__program__, __version__))
+ parser_utility.add_argument("--tmpdir", type=str, help="Set temporary directory") #site-packges in future
+
+ # SeqKit
+ parser_seqkit = parser.add_argument_group('SeqKit seq arguments')
+ parser_seqkit.add_argument("-m", "--minimum_contig_length", type=int, default=1, help="seqkit seq | Minimum contig length [Default: 1]")
+ parser_seqkit.add_argument("--seqkit_seq_options", type=str, default="", help="seqkit seq | More options (e.g. --arg 1 ) [Default: '']")
+
+
+ # Aligner
+ parser_aligner = parser.add_argument_group('Minimap2 arguments')
+ parser_aligner.add_argument("--minimap2_preset", type=str, default="map-ont", help="Minimap2 | Preset {map-pb, map-ont, map-hifi} [Default: map-ont]")
+ parser_aligner.add_argument("--minimap2_index_options", type=str, default="", help="Minimap2 | More options (e.g. --arg 1 ) [Default: '']")
+ # parser_aligner.add_argument("--one_task_per_cpu", action="store_true", help="Use GNU parallel to run GNU parallel with 1 task per CPU. Useful if all samples are roughly the same size but inefficient if depth varies.")
+ parser_aligner.add_argument("--minimap2_options", type=str, default="", help="Minimap2 | More options (e.g. --arg 1 ) [Default: '']")
+
+ # featureCounts
+ parser_featurecounts = parser.add_argument_group('featureCounts arguments')
+ parser_featurecounts.add_argument("--featurecounts_options", type=str, default="", help="featureCounts | More options (e.g. --arg 1 ) [Default: ''] | http://bioinf.wehi.edu.au/featureCounts/")
+
+
+ # Options
+ opts = parser.parse_args()
+ opts.script_directory = script_directory
+ opts.script_filename = script_filename
+
+ # Threads
+ if opts.n_jobs == -1:
+ from multiprocessing import cpu_count
+ opts.n_jobs = cpu_count()
+ assert opts.n_jobs >= 1, "--n_jobs must be ≥ 1. To select all available threads, use -1."
+
+ # Directories
+ directories = dict()
+ directories["project"] = create_directory(opts.output_directory)
+ directories["output"] = create_directory(os.path.join(directories["project"], "output"))
+ directories["log"] = create_directory(os.path.join(directories["project"], "log"))
+ if not opts.tmpdir:
+ opts.tmpdir = os.path.join(directories["project"], "tmp")
+ directories["tmp"] = create_directory(opts.tmpdir)
+ directories["checkpoints"] = create_directory(os.path.join(directories["project"], "checkpoints"))
+ directories["intermediate"] = create_directory(os.path.join(directories["project"], "intermediate"))
+ os.environ["TMPDIR"] = directories["tmp"]
+
+ # Info
+ print(format_header(__program__, "="), file=sys.stdout)
+ print(format_header("Configuration:", "-"), file=sys.stdout)
+ print("Python version:", sys.version.replace("\n"," "), file=sys.stdout)
+ print("Python path:", sys.executable, file=sys.stdout) #sys.path[2]
+ print("GenoPype version:", genopype_version, file=sys.stdout) #sys.path[2]
+ print("Script version:", __version__, file=sys.stdout)
+ print("Moment:", get_timestamp(), file=sys.stdout)
+ print("Directory:", os.getcwd(), file=sys.stdout)
+ if "TMPDIR" in os.environ: print("Temporary directory:", os.environ["TMPDIR"], file=sys.stdout)
+ print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout)
+ configure_parameters(opts, directories)
+ sys.stdout.flush()
+
+ # Run pipeline
+ with open(os.path.join(directories["project"], "commands.sh"), "w") as f_cmds:
+ pipeline = create_pipeline(
+ opts=opts,
+ directories=directories,
+ f_cmds=f_cmds,
+ )
+ pipeline.compile()
+ pipeline.execute(restart_from_checkpoint=opts.restart_from_checkpoint)
+
+if __name__ == "__main__":
+ main()
diff --git a/src/coverage.py b/src/coverage.py
index 77c0131..b7b331f 100755
--- a/src/coverage.py
+++ b/src/coverage.py
@@ -13,7 +13,7 @@
pd.options.display.max_colwidth = 100
# from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2023.10.16"
+__version__ = "2023.11.30"
# .............................................................................
# Notes
@@ -525,7 +525,7 @@ def main(args=None):
# Aligner
parser_seqkit = parser.add_argument_group('SeqKit seq arguments')
- parser_seqkit.add_argument("-m", "--minimum_contig_length", type=int, default=1500, help="seqkit seq | Minimum contig length [Default: 1500]")
+ parser_seqkit.add_argument("-m", "--minimum_contig_length", type=int, default=1, help="seqkit seq | Minimum contig length [Default: 1]")
parser_seqkit.add_argument("--seqkit_seq_options", type=str, default="", help="seqkit seq | More options (e.g. --arg 1 ) [Default: '']")
@@ -572,6 +572,7 @@ def main(args=None):
print("Script version:", __version__, file=sys.stdout)
print("Moment:", get_timestamp(), file=sys.stdout)
print("Directory:", os.getcwd(), file=sys.stdout)
+ if "TMPDIR" in os.environ: print("Temporary directory:", os.environ["TMPDIR"], file=sys.stdout)
print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout)
configure_parameters(opts, directories)
sys.stdout.flush()
diff --git a/src/deprecated/preprocess.py b/src/deprecated/preprocess.py
new file mode 100755
index 0000000..73adac7
--- /dev/null
+++ b/src/deprecated/preprocess.py
@@ -0,0 +1,151 @@
+#!/usr/bin/env python
+from __future__ import print_function, division
+import sys, os, argparse, glob
+from collections import OrderedDict
+
+import pandas as pd
+
+# Soothsayer Ecosystem
+from genopype import *
+from genopype import __version__ as genopype_version
+
+from soothsayer_utils import *
+import fastq_preprocessor
+
+
+__program__ = os.path.split(sys.argv[0])[-1]
+__version__ = "2023.11.28"
+
+# ============
+# Run Pipeline
+# ============
+# Set environment variables
+def add_executables_to_environment(opts):
+ """
+ Adapted from Soothsayer: https://github.com/jolespin/soothsayer
+ """
+ accessory_scripts = set([])
+
+ required_executables={
+ "repair.sh",
+ "bbduk.sh",
+ "bowtie2",
+ "fastp",
+ "seqkit",
+ "fastq_preprocessor",
+ "minimap2",
+ "pigz",
+ "chopper",
+ } | accessory_scripts
+
+ if opts.path_config == "CONDA_PREFIX":
+ executables = dict()
+ for name in required_executables:
+ executables[name] = os.path.join(os.environ["CONDA_PREFIX"], "bin", name)
+ else:
+ opts.path_config = format_path(opts.path_config)
+ assert os.path.exists(opts.path_config), "config file does not exist. Have you created one in the following directory?\n{}\nIf not, either create one, check this filepath:{}, or give the path to a proper config file using --path_config".format(opts.script_directory, opts.path_config)
+ assert os.stat(opts.path_config).st_size > 1, "config file seems to be empty. Please add 'name' and 'executable' columns for the following program names: {}".format(required_executables)
+ df_config = pd.read_csv(opts.path_config, sep="\t")
+ assert {"name", "executable"} <= set(df_config.columns), "config must have `name` and `executable` columns. Please adjust file: {}".format(opts.path_config)
+ df_config = df_config.loc[:,["name", "executable"]].dropna(how="any", axis=0).applymap(str)
+ # Get executable paths
+ executables = OrderedDict(zip(df_config["name"], df_config["executable"]))
+ assert required_executables <= set(list(executables.keys())), "config must have the required executables for this run. Please adjust file: {}\nIn particular, add info for the following: {}".format(opts.path_config, required_executables - set(list(executables.keys())))
+
+ # Display
+ for name in sorted(accessory_scripts):
+ executables[name] = "'{}'".format(os.path.join(opts.script_directory, "scripts", name)) # Can handle spaces in path
+ print(format_header( "Adding executables to path from the following source: {}".format(opts.path_config), "-"), file=sys.stdout)
+ for name, executable in executables.items():
+ if name in required_executables:
+ print(name, executable, sep = " --> ", file=sys.stdout)
+ os.environ[name] = executable.strip()
+ print("", file=sys.stdout)
+
+
+# Configure parameters
+def configure_parameters(opts, directories):
+
+ assert opts.forward_reads != opts.reverse_reads, "You probably mislabeled the input files because `r1` should not be the same as `r2`: {}".format(opts.forward_reads)
+ assert_acceptable_arguments(opts.retain_trimmed_reads, {0,1})
+ assert_acceptable_arguments(opts.retain_contaminated_reads, {0,1})
+
+ # Set environment variables
+ add_executables_to_environment(opts=opts)
+
+def main(args=None):
+ # Path info
+ script_directory = os.path.dirname(os.path.abspath( __file__ ))
+ script_filename = __program__
+ # Path info
+ description = """
+ Wrapper around github.com/jolespin/fastq_preprocessor
+ Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable)
+ usage = "{} -1 <reads_1.fastq> -2 <reads_2.fastq> -n <name> -o <output_directory> |Optional| -x <contamination_index> -k <kmer_database>".format(__program__)
+ epilog = "Copyright 2022 Josh L. Espinoza (jespinoz@jcvi.org)"
+
+ # Parser
+ parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter)
+ # Pipeline
+ parser_io = parser.add_argument_group('Required I/O arguments')
+ parser_io.add_argument("-1","--forward_reads", type=str, help = "path/to/reads_1.fastq")
+ parser_io.add_argument("-2","--reverse_reads", type=str, help = "path/to/reads_2.fastq")
+ parser_io.add_argument("-n", "--name", type=str, help="Name of sample", required=True)
+ parser_io.add_argument("-o","--project_directory", type=str, default="veba_output/preprocess", help = "path/to/project_directory [Default: veba_output/preprocess]")
+
+ # Utility
+ parser_utility = parser.add_argument_group('Utility arguments')
+ parser_utility.add_argument("--path_config", type=str, default="CONDA_PREFIX", help="path/to/config.tsv. Must have at least 2 columns [name, executable] [Default: CONDA_PREFIX]") #site-packges in future
+ parser_utility.add_argument("-p", "--n_jobs", type=int, default=1, help = "Number of threads [Default: 1]")
+ parser_utility.add_argument("--random_state", type=int, default=0, help = "Random state [Default: 0]")
+ parser_utility.add_argument("--restart_from_checkpoint", type=int, help = "Restart from a particular checkpoint")
+ parser_utility.add_argument("-v", "--version", action='version', version="{} v{}".format(__program__, __version__))
+
+ # Fastp
+ parser_fastp = parser.add_argument_group('Fastp arguments')
+ parser_fastp.add_argument("-m", "--minimum_read_length", type=int, default=75, help="Fastp | Minimum read length [Default: 75]")
+ parser_fastp.add_argument("-a", "--adapters", type=str, default="detect", help="Fastp | path/to/adapters.fasta [Default: detect]")
+ parser_fastp.add_argument("--fastp_options", type=str, default="", help="Fastp | More options (e.g. --arg 1 ) [Default: '']")
+
+ # Bowtie
+ parser_bowtie2 = parser.add_argument_group('Bowtie2 arguments')
+ parser_bowtie2.add_argument("-x", "--contamination_index", type=str, help="Bowtie2 | path/to/contamination_index\n(e.g., Human T2T CHM13 v2 in $VEBA_DATABASE/Contamination/chm13v2.0/chm13v2.0)")
+ parser_bowtie2.add_argument("--retain_trimmed_reads", default=0, type=int, help = "Retain fastp trimmed fastq after decontamination. 0=No, 1=yes [Default: 0]")
+ parser_bowtie2.add_argument("--retain_contaminated_reads", default=0, type=int, help = "Retain contaminated fastq after decontamination. 0=No, 1=yes [Default: 0]")
+ parser_bowtie2.add_argument("--bowtie2_options", type=str, default="", help="Bowtie2 | More options (e.g. --arg 1 ) [Default: '']\nhttp://bowtie-bio.sourceforge.net/bowtie2/manual.shtml")
+
+ # BBDuk
+ parser_bbduk = parser.add_argument_group('BBDuk arguments')
+ parser_bbduk.add_argument("-k","--kmer_database", type=str, help="BBDuk | path/to/kmer_database\n(e.g., Ribokmers in $VEBA_DATABASE/Contamination/kmers/ribokmers.fa.gz)")
+ parser_bbduk.add_argument("--kmer_size", type=int, default=31, help="BBDuk | k-mer size [Default: 31]")
+ parser_bbduk.add_argument("--retain_kmer_hits", default=0, type=int, help = "Retain reads that map to k-mer database. 0=No, 1=yes [Default: 0]")
+ parser_bbduk.add_argument("--retain_non_kmer_hits", default=0, type=int, help = "Retain reads that do not map to k-mer database. 0=No, 1=yes [Default: 0]")
+ parser_bbduk.add_argument("--bbduk_options", type=str, default="", help="BBDuk | More options (e.g., --arg 1) [Default: '']")
+
+ # Options
+ opts = parser.parse_args()
+ # opts.script_directory = script_directory
+ # opts.script_filename = script_filename
+
+ # Threads
+ if opts.n_jobs == -1:
+ from multiprocessing import cpu_count
+ opts.n_jobs = cpu_count()
+ assert opts.n_jobs >= 1, "--n_jobs must be ≥ 1. To select all available threads, use -1."
+
+ #Get arguments
+ args = list()
+ for k,v in opts.__dict__.items():
+ if v is not None:
+ args += ["--{}".format(k), str(v)]
+ # args = flatten(map(lambda item: ("--{}".format(item[0]), item[1]), opts.__dict__.items()))
+ sys.argv = [sys.argv[0]] + args
+
+ # Wrapper
+ fastq_preprocessor.main(args)
+
+
+
+if __name__ == "__main__":
+ main()
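The wrapper above flattens the parsed namespace back into command-line tokens before delegating to `fastq_preprocessor.main`. That round trip can be sketched as follows; like the original loop, it only drops `None` values and stringifies everything else (booleans and lists would need special handling), and the `namespace_to_argv` name is hypothetical:

```python
# Sketch of the namespace-to-argv round trip used by the wrapper: parsed
# options are flattened back into CLI tokens for another argparse-based
# entry point. Drops None values; stringifies the rest.
def namespace_to_argv(options):
    argv = []
    for key, value in options.items():
        if value is not None:
            argv += ["--{}".format(key), str(value)]
    return argv
```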
diff --git a/src/index.py b/src/index.py
index 8f532d4..5c10154 100755
--- a/src/index.py
+++ b/src/index.py
@@ -7,7 +7,7 @@
from soothsayer_utils import *
__program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2023.10.16"
+__version__ = "2023.12.12"
# ==============
# Agostic commands
@@ -22,11 +22,22 @@ def get_concatenate_fasta_cmd( input_filepaths, output_filepaths, output_directo
"-i {}".format(input_filepaths[0]),
"-o {}".format(output_directory),
"-m {}".format(opts.minimum_contig_length),
- "-x {}".format("fa.gz"),
+ "-x {}".format("fa.gz" if opts.reference_gzipped else "fa"),
"-b reference",
"-M {}".format(opts.mode),
-
+ "&&",
+
+ "cat",
+ os.path.join(output_directory, "reference.fa.gz" if opts.reference_gzipped else "reference.fa"),
+ "|",
+ os.environ["seqkit"],
+ "fx2tab",
+ "-i",
+ "-s",
+ "-n",
+ ">",
+ os.path.join(output_directory, "reference.id_to_hash.tsv"),
]
return cmd
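The amended command also emits `reference.id_to_hash.tsv` via `seqkit fx2tab`, mapping each record ID to a hash of its sequence. A rough Python analogue for illustration; MD5 of the uppercased sequence is an assumption here, not necessarily the hash function seqkit computes:

```python
import hashlib

# Rough analogue of the reference.id_to_hash.tsv table built with
# `seqkit fx2tab`. The MD5-of-uppercased-sequence choice is an assumption
# for illustration only.
def id_to_hash(fasta_lines):
    table = {}
    name, chunks = None, []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                table[name] = hashlib.md5("".join(chunks).upper().encode()).hexdigest()
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        table[name] = hashlib.md5("".join(chunks).upper().encode()).hexdigest()
    return table
```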
@@ -51,22 +62,25 @@ def get_concatenate_gff_cmd( input_filepaths, output_filepaths, output_directory
def get_bowtie2_local_cmd(input_filepaths, output_filepaths, output_directory, directories, opts):
os.environ["TMPDIR"] = directories["tmp"]
# Command
+
cmd = [
"""
-
+OUTPUT_DIRECTORY=%s
+FASTA_FILENAME=%s
for ID_SAMPLE in $(cut -f1 %s);
- do %s --threads %d --seed %d %s/${ID_SAMPLE}/reference.fa.gz %s/${ID_SAMPLE}/reference.fa.gz
+ do %s --threads %d --seed %d ${OUTPUT_DIRECTORY}/${ID_SAMPLE}/${FASTA_FILENAME} ${OUTPUT_DIRECTORY}/${ID_SAMPLE}/${FASTA_FILENAME}
done
"""%(
+ output_directory,
+ "reference.fa.gz" if opts.reference_gzipped else "reference.fa",
opts.references,
os.environ["bowtie2-build"],
opts.n_jobs,
opts.random_state,
- output_directory,
- output_directory,
),
]
+
return cmd
# ==============
@@ -115,10 +129,10 @@ def create_local_pipeline(opts, directories, f_cmds):
]
output_filenames = [
- "*/reference.fa.gz",
+ "*/reference.fa.gz" if opts.reference_gzipped else "*/reference.fa",
"*/reference.saf",
-
- ]
+ "*/reference.id_to_hash.tsv",
+ ]
output_filepaths = list(map(lambda filename: os.path.join(output_directory, filename), output_filenames))
params = {
@@ -207,8 +221,9 @@ def create_local_pipeline(opts, directories, f_cmds):
# Info
description = "Build mapping index"
# i/o
+
input_filepaths = list(
- map(lambda id_sample: os.path.join(directories["output"], id_sample, "reference.fa.gz"),
+ map(lambda id_sample: os.path.join(directories["output"], id_sample, "reference.fa.gz" if opts.reference_gzipped else "reference.fa"),
opts.samples,
),
)
@@ -273,8 +288,10 @@ def create_global_pipeline(opts, directories, f_cmds):
]
output_filenames = [
- "reference.fa.gz",
+ "reference.fa.gz" if opts.reference_gzipped else "reference.fa",
"reference.saf",
+ "reference.id_to_hash.tsv",
+
]
output_filepaths = list(map(lambda filename: os.path.join(output_directory, filename), output_filenames))
@@ -365,13 +382,22 @@ def create_global_pipeline(opts, directories, f_cmds):
# Info
description = "Build mapping index"
# i/o
- input_filepaths = [
- os.path.join(directories["output"], "reference.fa.gz"),
- ]
+ if opts.reference_gzipped:
+ input_filepaths = [
+ os.path.join(directories["output"], "reference.fa.gz"),
+ ]
+
+ output_filenames = [
+ "reference.fa.gz.*.bt2",
+ ]
+ else:
+ input_filepaths = [
+ os.path.join(directories["output"], "reference.fa"),
+ ]
- output_filenames = [
- "reference.fa.gz.*.bt2",
- ]
+ output_filenames = [
+ "reference.fa.*.bt2",
+ ]
output_filepaths = list(map(lambda filename: os.path.join(output_directory, filename), output_filenames))
params = {
@@ -417,7 +443,8 @@ def add_executables_to_environment(opts):
required_executables = set([
- "bowtie2-build",
+ "seqkit",
+ "bowtie2-build",
])| accessory_scripts
if opts.path_config == "CONDA_PREFIX":
@@ -509,8 +536,9 @@ def main(args=None):
parser_io.add_argument("-r","--references", type=str, required=True, help = "local mode: [id_sample][path/to/reference.fa] and global mode: [path/to/reference.fa]")
parser_io.add_argument("-g","--gene_models", type=str, required=True, help = "local mode: [id_sample][path/to/reference.gff] and global mode: [path/to/reference.gff]")
parser_io.add_argument("-o","--output_directory", type=str, default="veba_output/index", help = "path/to/project_directory [Default: veba_output/index]")
- parser_io.add_argument("-m", "--minimum_contig_length", type=int, default=1500, help="Minimum contig length [Default: 1500]")
+ parser_io.add_argument("-m", "--minimum_contig_length", type=int, default=1, help="Minimum contig length [Default: 1]")
parser_io.add_argument("-M", "--mode", type=str, default="infer", help="Concatenate all references with global and build index or build index for each reference {global, local, infer}")
+ parser_io.add_argument("-z", "--reference_gzipped",action="store_true", help="Gzip the reference to generate `reference.fa.gz` instead of `reference.fa`")
# parser_io.add_argument("-c", "--copy_files", action="store_true", help="Copy files instead of symlinking. Only applies to global.")
# Utility
@@ -559,6 +587,7 @@ def main(args=None):
print("Script version:", __version__, file=sys.stdout)
print("Moment:", get_timestamp(), file=sys.stdout)
print("Directory:", os.getcwd(), file=sys.stdout)
+    if "TMPDIR" in os.environ: print("Temporary directory:", os.environ["TMPDIR"], file=sys.stdout)
print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout)
configure_parameters(opts, directories)
sys.stdout.flush()
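The hunks above thread a single `reference.fa.gz`/`reference.fa` toggle through the bash loop that runs `bowtie2-build` once per sample. A minimal Python sketch of that pattern (function names here are illustrative, not from the source; the real script substitutes into a shell heredoc using the `bowtie2-build` path from the environment):

```python
def reference_filename(reference_gzipped):
    # Mirrors the --reference_gzipped toggle: one basename feeds both
    # bowtie2-build arguments (input FASTA and index prefix).
    return "reference.fa.gz" if reference_gzipped else "reference.fa"

def build_index_loop(samples, output_directory, reference_gzipped, threads=1, seed=0):
    # Render one bowtie2-build invocation per sample, as the bash loop does.
    fasta = reference_filename(reference_gzipped)
    return [
        "bowtie2-build --threads {t} --seed {s} {o}/{i}/{f} {o}/{i}/{f}".format(
            t=threads, s=seed, o=output_directory, i=id_sample, f=fasta,
        )
        for id_sample in samples
    ]
```

Centralizing the filename in one place is what lets the patch replace the two hard-coded `%s/${ID_SAMPLE}/reference.fa.gz` occurrences with a single `FASTA_FILENAME` variable.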
diff --git a/src/mapping.py b/src/mapping.py
index 8db61cc..b06175c 100755
--- a/src/mapping.py
+++ b/src/mapping.py
@@ -13,7 +13,7 @@
pd.options.display.max_colwidth = 100
# from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2023.10.16"
+__version__ = "2023.12.12"
# Bowtie2
@@ -451,9 +451,12 @@ def configure_parameters(opts, directories):
assert os.path.isdir(opts.reference_index), "If --reference_saf is not provided, then --reference_index must be provided as a directory containing a file 'reference.saf'"
opts.reference_saf = os.path.join(opts.reference_index, "reference.saf")
- # Check if --reference_index is a directory, if it is then set reference.fa.gz as the directory
+ # Check if --reference_index is a directory, if it is then set reference.fa as the directory
if os.path.isdir(opts.reference_index):
- opts.reference_index = os.path.join(opts.reference_index, "reference.fa.gz")
+ if opts.reference_gzipped:
+ opts.reference_index = os.path.join(opts.reference_index, "reference.fa.gz")
+ else:
+ opts.reference_index = os.path.join(opts.reference_index, "reference.fa")
# If --reference_fasta isn't provided then set it to the --reference_index
if opts.reference_fasta is None:
@@ -491,10 +494,11 @@ def main(args=None):
parser_io.add_argument("-o","--project_directory", type=str, default="veba_output/mapping", help = "path/to/project_directory [Default: veba_output/mapping]")
parser_reference = parser.add_argument_group('Reference arguments')
- parser_reference.add_argument("-x", "--reference_index",type=str, required=True, help="path/to/bowtie2_index. Either a file or directory. If directory, then it assumes the index is named `reference.fa.gz`")
+ parser_reference.add_argument("-x", "--reference_index",type=str, required=True, help="path/to/bowtie2_index. Either a file or directory. If directory, then it assumes the index is named `reference.fa`")
parser_reference.add_argument("-r", "--reference_fasta", type=str, required=False, help = "path/to/reference.fasta. If not provided then it is set to the --reference_index" ) # ; or (2) a directory of fasta files [Must all have the same extension. Use `query_ext` argument]
parser_reference.add_argument("-a", "--reference_gff",type=str, required=False, help="path/to/reference.gff. If not provided then --reference_index must be a directory that contains the file: 'reference.gff'")
parser_reference.add_argument("-s", "--reference_saf",type=str, required=False, help="path/to/reference.saf. If not provided then --reference_index must be a directory that contains the file: 'reference.saf'")
+ parser_reference.add_argument("-z", "--reference_gzipped",action="store_true", help="If --reference_index directory, then it assumes the index is named `reference.fa.gz` instead of `reference.fa`")
# parser_io.add_argument("-S","--scaffold_identifier_mapping", type=str, required=False, help = "path/to/scaffold_identifiers.tsv, Format: [id_scaffold][id_mag][id_cluster], No header")
# parser_io.add_argument("-O","--orf_identifier_mapping", type=str, required=False, help = "path/to/scaffold_identifiers.tsv, Format: [id_scaffold][id_mag][id_cluster], No header")
@@ -558,6 +562,7 @@ def main(args=None):
print("Script version:", __version__, file=sys.stdout)
print("Moment:", get_timestamp(), file=sys.stdout)
print("Directory:", os.getcwd(), file=sys.stdout)
+    if "TMPDIR" in os.environ: print("Temporary directory:", os.environ["TMPDIR"], file=sys.stdout)
print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout)
configure_parameters(opts, directories)
sys.stdout.flush()
diff --git a/src/phylogeny.py b/src/phylogeny.py
index 002cce6..0730b4e 100755
--- a/src/phylogeny.py
+++ b/src/phylogeny.py
@@ -14,7 +14,7 @@
pd.options.display.max_colwidth = 100
# from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2023.10.27"
+__version__ = "2023.11.30"
# Assembly
def preprocess( input_filepaths, output_filepaths, output_directory, directories, opts):
@@ -650,6 +650,7 @@ def main(args=None):
print("Script version:", __version__, file=sys.stdout)
print("Moment:", get_timestamp(), file=sys.stdout)
print("Directory:", os.getcwd(), file=sys.stdout)
+    if "TMPDIR" in os.environ: print("Temporary directory:", os.environ["TMPDIR"], file=sys.stdout)
print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout)
configure_parameters(opts, directories)
sys.stdout.flush()
diff --git a/src/preprocess-long.py b/src/preprocess-long.py
new file mode 100755
index 0000000..fe1e58a
--- /dev/null
+++ b/src/preprocess-long.py
@@ -0,0 +1,21 @@
+#!/usr/bin/env python
+from __future__ import print_function, division
+import sys, os, argparse
+from soothsayer_utils import format_header, read_script_as_module
+
+script_directory = os.path.dirname(os.path.abspath( __file__ ))
+
+try:
+ from fastq_preprocessor import fastq_preprocessor_long
+except ImportError:
+ fastq_preprocessor_long = read_script_as_module("fastq_preprocessor_long", os.path.join(script_directory, "fastq_preprocessor_long.py"))
+
+__program__ = os.path.split(sys.argv[0])[-1]
+__version__ = "2023.11.29"
+
+if __name__ == "__main__":
+ print(format_header("VEBA Preprocessing Wrapper (fastq_preprocessor v{})".format(fastq_preprocessor_long.__version__)), file=sys.stderr)
+ label = "Mode: Long Nanopore and PacBio reads"
+ print(label, file=sys.stderr)
+ print(len(label)*"-", file=sys.stderr)
+ fastq_preprocessor_long.main(sys.argv[1:])
diff --git a/src/preprocess.py b/src/preprocess.py
index d28ccc2..146b03c 100755
--- a/src/preprocess.py
+++ b/src/preprocess.py
@@ -1,148 +1,21 @@
#!/usr/bin/env python
from __future__ import print_function, division
-import sys, os, argparse, glob
-from collections import OrderedDict
-
-import pandas as pd
-
-# Soothsayer Ecosystem
-from genopype import *
-from genopype import __version__ as genopype_version
-
-from soothsayer_utils import *
-import fastq_preprocessor
+import sys, os, argparse
+from soothsayer_utils import format_header, read_script_as_module
+script_directory = os.path.dirname(os.path.abspath( __file__ ))
+try:
+ from fastq_preprocessor import fastq_preprocessor_short
+except ImportError:
+ fastq_preprocessor_short = read_script_as_module("fastq_preprocessor_short", os.path.join(script_directory, "fastq_preprocessor_short.py"))
+
__program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2023.10.16"
-
-# ============
-# Run Pipeline
-# ============
-# Set environment variables
-def add_executables_to_environment(opts):
- """
- Adapted from Soothsayer: https://github.com/jolespin/soothsayer
- """
- accessory_scripts = set([])
-
- required_executables={
- "repair.sh",
- "bbduk.sh",
- "bowtie2",
- "fastp",
- "seqkit",
- "fastq_preprocessor",
- } | accessory_scripts
-
- if opts.path_config == "CONDA_PREFIX":
- executables = dict()
- for name in required_executables:
- executables[name] = os.path.join(os.environ["CONDA_PREFIX"], "bin", name)
- else:
- opts.path_config = format_path(opts.path_config)
- assert os.path.exists(opts.path_config), "config file does not exist. Have you created one in the following directory?\n{}\nIf not, either create one, check this filepath:{}, or give the path to a proper config file using --path_config".format(opts.script_directory, opts.path_config)
- assert os.stat(opts.path_config).st_size > 1, "config file seems to be empty. Please add 'name' and 'executable' columns for the following program names: {}".format(required_executables)
- df_config = pd.read_csv(opts.path_config, sep="\t")
- assert {"name", "executable"} <= set(df_config.columns), "config must have `name` and `executable` columns. Please adjust file: {}".format(opts.path_config)
- df_config = df_config.loc[:,["name", "executable"]].dropna(how="any", axis=0).applymap(str)
- # Get executable paths
- executables = OrderedDict(zip(df_config["name"], df_config["executable"]))
- assert required_executables <= set(list(executables.keys())), "config must have the required executables for this run. Please adjust file: {}\nIn particular, add info for the following: {}".format(opts.path_config, required_executables - set(list(executables.keys())))
-
- # Display
- for name in sorted(accessory_scripts):
- executables[name] = "'{}'".format(os.path.join(opts.script_directory, "scripts", name)) # Can handle spaces in path
- print(format_header( "Adding executables to path from the following source: {}".format(opts.path_config), "-"), file=sys.stdout)
- for name, executable in executables.items():
- if name in required_executables:
- print(name, executable, sep = " --> ", file=sys.stdout)
- os.environ[name] = executable.strip()
- print("", file=sys.stdout)
-
-
-# Configure parameters
-def configure_parameters(opts, directories):
-
- assert opts.forward_reads != opts.reverse_reads, "You probably mislabeled the input files because `r1` should not be the same as `r2`: {}".format(opts.forward_reads)
- assert_acceptable_arguments(opts.retain_trimmed_reads, {0,1})
- assert_acceptable_arguments(opts.retain_decontaminated_reads, {0,1})
-
- # Set environment variables
- add_executables_to_environment(opts=opts)
-
-def main(args=None):
- # Path info
- script_directory = os.path.dirname(os.path.abspath( __file__ ))
- script_filename = __program__
- # Path info
- description = """
- Wrapper around github.com/jolespin/fastq_preprocessor
- Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable)
- usage = "{} -1 -2 -n -o |Optional| -x -k ".format(__program__)
- epilog = "Copyright 2022 Josh L. Espinoza (jespinoz@jcvi.org)"
-
- # Parser
- parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter)
- # Pipeline
- parser_io = parser.add_argument_group('Required I/O arguments')
- parser_io.add_argument("-1","--forward_reads", type=str, help = "path/to/reads_1.fastq")
- parser_io.add_argument("-2","--reverse_reads", type=str, help = "path/to/reads_2.fastq")
- parser_io.add_argument("-n", "--name", type=str, help="Name of sample", required=True)
- parser_io.add_argument("-o","--project_directory", type=str, default="veba_output/preprocess", help = "path/to/project_directory [Default: veba_output/preprocess]")
-
- # Utility
- parser_utility = parser.add_argument_group('Utility arguments')
- parser_utility.add_argument("--path_config", type=str, default="CONDA_PREFIX", help="path/to/config.tsv. Must have at least 2 columns [name, executable] [Default: CONDA_PREFIX]") #site-packges in future
- parser_utility.add_argument("-p", "--n_jobs", type=int, default=1, help = "Number of threads [Default: 1]")
- parser_utility.add_argument("--random_state", type=int, default=0, help = "Random state [Default: 0]")
- parser_utility.add_argument("--restart_from_checkpoint", type=int, help = "Restart from a particular checkpoint")
- parser_utility.add_argument("-v", "--version", action='version', version="{} v{}".format(__program__, __version__))
-
- # Fastp
- parser_fastp = parser.add_argument_group('Fastp arguments')
- parser_fastp.add_argument("-m", "--minimum_read_length", type=int, default=75, help="Fastp | Minimum read length [Default: 75]")
- parser_fastp.add_argument("-a", "--adapters", type=str, default="detect", help="Fastp | path/to/adapters.fasta [Default: detect]")
- parser_fastp.add_argument("--fastp_options", type=str, default="", help="Fastp | More options (e.g. --arg 1 ) [Default: '']")
-
- # Bowtie
- parser_bowtie2 = parser.add_argument_group('Bowtie2 arguments')
- parser_bowtie2.add_argument("-x", "--contamination_index", type=str, help="Bowtie2 | path/to/contamination_index\n(e.g., Human T2T CHM13 v2 in $VEBA_DATABASE/Contamination/chm13v2.0/chm13v2.0)")
- parser_bowtie2.add_argument("--retain_trimmed_reads", default=0, type=int, help = "Retain fastp trimmed fastq after decontamination. 0=No, 1=yes [Default: 0]")
- parser_bowtie2.add_argument("--retain_contaminated_reads", default=0, type=int, help = "Retain contaminated fastq after decontamination. 0=No, 1=yes [Default: 0]")
- parser_bowtie2.add_argument("--bowtie2_options", type=str, default="", help="Bowtie2 | More options (e.g. --arg 1 ) [Default: '']\nhttp://bowtie-bio.sourceforge.net/bowtie2/manual.shtml")
-
- # BBDuk
- parser_bbduk = parser.add_argument_group('BBDuk arguments')
- parser_bbduk.add_argument("-k","--kmer_database", type=str, help="BBDuk | path/to/kmer_database\n(e.g., Ribokmers in $VEBA_DATABASE/Contamination/kmers/ribokmers.fa.gz)")
- parser_bbduk.add_argument("--kmer_size", type=int, default=31, help="BBDuk | k-mer size [Default: 31]")
- parser_bbduk.add_argument("--retain_kmer_hits", default=0, type=int, help = "Retain reads that map to k-mer database. 0=No, 1=yes [Default: 0]")
- parser_bbduk.add_argument("--retain_non_kmer_hits", default=0, type=int, help = "Retain reads that do not map to k-mer database. 0=No, 1=yes [Default: 0]")
- parser_bbduk.add_argument("--bbduk_options", type=str, default="", help="BBDuk | More options (e.g., --arg 1) [Default: '']")
-
- # Options
- opts = parser.parse_args()
- # opts.script_directory = script_directory
- # opts.script_filename = script_filename
-
- # Threads
- if opts.n_jobs == -1:
- from multiprocessing import cpu_count
- opts.n_jobs = cpu_count()
- assert opts.n_jobs >= 1, "--n_jobs must be ≥ 1. To select all available threads, use -1."
-
- #Get arguments
- args = list()
- for k,v in opts.__dict__.items():
- if v is not None:
- args += ["--{}".format(k), str(v)]
- # args = flatten(map(lambda item: ("--{}".format(item[0]), item[1]), opts.__dict__.items()))
- sys.argv = [sys.argv[0]] + args
-
- # Wrapper
- fastq_preprocessor.main(args)
-
-
+__version__ = "2023.11.29"
if __name__ == "__main__":
- main()
+ print(format_header("VEBA Preprocessing Wrapper (fastq_preprocessor v{})".format(fastq_preprocessor_short.__version__)), file=sys.stderr)
+ label = "Mode: Paired Illumina Reads"
+ print(label, file=sys.stderr)
+ print(len(label)*"-", file=sys.stderr)
+ fastq_preprocessor_short.main(sys.argv[1:])
\ No newline at end of file
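Both rewritten wrappers use the same fallback: import `fastq_preprocessor` as a package if installed, otherwise load the sibling script directly via `soothsayer_utils.read_script_as_module`. A stdlib-only sketch of that loading technique (the helper name below is mine, and assumes `read_script_as_module` behaves like standard spec-based loading):

```python
import importlib.util

def load_script_as_module(module_name, path):
    # Execute a standalone .py file and return it as a module object,
    # so a script can be used like an installed package.
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```

This is why the wrappers can print `fastq_preprocessor_short.__version__` and call `.main(sys.argv[1:])` regardless of whether the dependency was pip-installed or shipped alongside the script.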
diff --git a/src/profile-pathway.py b/src/profile-pathway.py
index 3f674f4..d84738a 100755
--- a/src/profile-pathway.py
+++ b/src/profile-pathway.py
@@ -13,7 +13,7 @@
pd.options.display.max_colwidth = 100
# from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2023.10.16"
+__version__ = "2023.11.30"
DIAMOND_DATABASE_SUFFIX = "_v201901b.dmnd"
@@ -625,6 +625,7 @@ def main(args=None):
print("Script version:", __version__, file=sys.stdout)
print("Moment:", get_timestamp(), file=sys.stdout)
print("Directory:", os.getcwd(), file=sys.stdout)
+    if "TMPDIR" in os.environ: print("Temporary directory:", os.environ["TMPDIR"], file=sys.stdout)
print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout)
configure_parameters(opts, directories)
sys.stdout.flush()
diff --git a/src/profile-taxonomy.py b/src/profile-taxonomy.py
new file mode 100755
index 0000000..2aa4db0
--- /dev/null
+++ b/src/profile-taxonomy.py
@@ -0,0 +1,357 @@
+#!/usr/bin/env python
+from __future__ import print_function, division
+import sys, os, argparse, glob, gzip
+from collections import OrderedDict, defaultdict
+
+import pandas as pd
+
+# Soothsayer Ecosystem
+from genopype import *
+from genopype import __version__ as genopype_version
+from soothsayer_utils import *
+
+pd.options.display.max_colwidth = 100
+# from tqdm import tqdm
+__program__ = os.path.split(sys.argv[0])[-1]
+__version__ = "2023.12.19"
+
+# Preprocess reads
+def get_sylph_sketch_cmd( input_filepaths, output_filepaths, output_directory, directories, opts):
+ cmd = [
+ os.environ["sylph"],
+ "sketch",
+ "-t {}".format(opts.n_jobs),
+ "-c {}".format(opts.sylph_sketch_subsampling_rate),
+ "-k {}".format(opts.sylph_sketch_k),
+ "--min-spacing {}".format(opts.sylph_sketch_minimum_spacing),
+ "-1 {}".format(opts.forward_reads),
+ "-2 {}".format(opts.reverse_reads),
+ "-d {}".format(output_directory),
+
+ "&&",
+
+ "mv",
+ "-v",
+ os.path.join(output_directory, "{}.paired.sylsp".format(os.path.split(opts.forward_reads)[1])),
+ os.path.join(output_directory, "reads.sylsp"),
+ ]
+
+ return cmd
+
+def get_sylph_profile_cmd( input_filepaths, output_filepaths, output_directory, directories, opts):
+ # Command
+ cmd = [
+ os.environ["sylph"],
+ "profile",
+ "-t {}".format(opts.n_jobs),
+ "--minimum-ani {}".format(opts.sylph_profile_minimum_ani),
+ "--min-number-kmers {}".format(opts.sylph_profile_minimum_number_kmers),
+ "--min-count-correct {}".format(opts.sylph_profile_minimum_count_correct),
+ opts.sylph_profile_options,
+ " ".join(input_filepaths),
+ "|",
+ "gzip",
+ ">",
+ os.path.join(output_directory, "sylph_profile.tsv.gz"),
+
+ "&&",
+
+ os.environ["reformat_sylph_profile_single_sample_output.py"],
+ "-i {}".format(os.path.join(output_directory, "sylph_profile.tsv.gz")),
+ "-o {}".format(output_directory),
+ "-c {}".format(opts.genome_clusters) if opts.genome_clusters else "",
+ "-f Taxonomic_abundance",
+ "-x {}".format(opts.extension),
+ "--header" if opts.header else "",
+ ]
+
+ return cmd
+
+
+
+# ============
+# Run Pipeline
+# ============
+# Set environment variables
+def add_executables_to_environment(opts):
+ """
+ Adapted from Soothsayer: https://github.com/jolespin/soothsayer
+ """
+ accessory_scripts = set([
+ "reformat_sylph_profile_single_sample_output.py",
+ ]
+ )
+
+ required_executables={
+ "sylph",
+ # "seqkit",
+
+ } | accessory_scripts
+
+ if opts.path_config == "CONDA_PREFIX":
+ executables = dict()
+ for name in required_executables:
+ executables[name] = os.path.join(os.environ["CONDA_PREFIX"], "bin", name)
+ else:
+ if opts.path_config is None:
+ opts.path_config = os.path.join(opts.script_directory, "veba_config.tsv")
+ opts.path_config = format_path(opts.path_config)
+ assert os.path.exists(opts.path_config), "config file does not exist. Have you created one in the following directory?\n{}\nIf not, either create one, check this filepath:{}, or give the path to a proper config file using --path_config".format(opts.script_directory, opts.path_config)
+ assert os.stat(opts.path_config).st_size > 1, "config file seems to be empty. Please add 'name' and 'executable' columns for the following program names: {}".format(required_executables)
+ df_config = pd.read_csv(opts.path_config, sep="\t")
+ assert {"name", "executable"} <= set(df_config.columns), "config must have `name` and `executable` columns. Please adjust file: {}".format(opts.path_config)
+ df_config = df_config.loc[:,["name", "executable"]].dropna(how="any", axis=0).applymap(str)
+ # Get executable paths
+ executables = OrderedDict(zip(df_config["name"], df_config["executable"]))
+ assert required_executables <= set(list(executables.keys())), "config must have the required executables for this run. Please adjust file: {}\nIn particular, add info for the following: {}".format(opts.path_config, required_executables - set(list(executables.keys())))
+
+ # Display
+ for name in sorted(accessory_scripts):
+ executables[name] = "'{}'".format(os.path.join(opts.script_directory, "scripts", name)) # Can handle spaces in path
+
+ print(format_header( "Adding executables to path from the following source: {}".format(opts.path_config), "-"), file=sys.stdout)
+ for name, executable in executables.items():
+ if name in required_executables:
+ print(name, executable, sep = " --> ", file=sys.stdout)
+ os.environ[name] = executable.strip()
+ print("", file=sys.stdout)
+
+
+# Pipeline
+def create_pipeline(opts, directories, f_cmds):
+
+ # .................................................................
+ # Primordial
+ # .................................................................
+ # Commands file
+ pipeline = ExecutablePipeline(name=__program__, description=opts.name, f_cmds=f_cmds, checkpoint_directory=directories["checkpoints"], log_directory=directories["log"])
+
+ # ==========
+ # Preprocess reads
+ # ==========
+
+ if opts.input_reads_format == "paired":
+
+ step = 0
+
+ # Info
+ program = "sylph_sketch"
+ program_label = "{}__{}".format(step, program)
+ description = "Sketch input reads"
+
+ # Add to directories
+ output_directory = directories["output"]
+ # i/o
+ input_filepaths = [opts.forward_reads, opts.reverse_reads]
+ output_filepaths = [
+ os.path.join(output_directory, "reads.sylsp"),
+ ]
+
+ params = {
+ "input_filepaths":input_filepaths,
+ "output_filepaths":output_filepaths,
+ "output_directory":output_directory,
+ "opts":opts,
+ "directories":directories,
+ }
+
+ cmd = get_sylph_sketch_cmd(**params)
+ pipeline.add_step(
+ id=program_label,
+ description = description,
+ step=step,
+ cmd=cmd,
+ input_filepaths = input_filepaths,
+ output_filepaths = output_filepaths,
+ validate_inputs=True,
+ validate_outputs=True,
+ log_prefix=program_label,
+ )
+ else:
+ output_filepaths = [opts.reads_sketch]
+
+
+ # ==========
+ # Profile
+ # ==========
+
+ step = 1
+
+ # Info
+ program = "sylph_profile"
+ program_label = "{}__{}".format(step, program)
+    description = "Profile reads against genome database(s)"
+
+ # Add to directories
+ output_directory = directories["output"]
+
+ # i/o
+ input_filepaths = output_filepaths + opts.sylph_databases
+
+
+ output_filepaths = [
+ os.path.join(output_directory, "sylph_profile.tsv.gz"),
+ os.path.join(output_directory, "taxonomic_abundance.tsv.gz"),
+ ]
+ if opts.genome_clusters:
+ input_filepaths += [
+ opts.genome_clusters,
+ ]
+ output_filepaths += [
+ os.path.join(output_directory, "taxonomic_abundance.clusters.tsv.gz"),
+ ]
+
+ params = {
+ "input_filepaths":input_filepaths,
+ "output_filepaths":output_filepaths,
+ "output_directory":output_directory,
+ "opts":opts,
+ "directories":directories,
+ }
+
+ cmd = get_sylph_profile_cmd(**params)
+ pipeline.add_step(
+ id=program_label,
+ description = description,
+ step=step,
+ cmd=cmd,
+ input_filepaths = input_filepaths,
+ output_filepaths = output_filepaths,
+ validate_inputs=True,
+ validate_outputs=True,
+ log_prefix=program_label,
+
+ )
+
+
+
+ return pipeline
+
+# Configure parameters
+def configure_parameters(opts, directories):
+
+ for db in opts.sylph_databases:
+ assert db.endswith(".syldb"), "{} must have .syldb file extension".format(db)
+
+ # --input_reads_format
+ assert_acceptable_arguments(opts.input_reads_format, {"paired", "sketch", "auto"})
+ if opts.input_reads_format == "auto":
+ if any([opts.forward_reads, opts.reverse_reads]):
+ assert opts.forward_reads != opts.reverse_reads, "You probably mislabeled the input files because `forward_reads` should not be the same as `reverse_reads`: {}".format(opts.forward_reads)
+ assert opts.forward_reads is not None, "If running in --input_reads_format paired mode, --forward_reads and --reverse_reads are needed."
+ assert opts.reverse_reads is not None, "If running in --input_reads_format paired mode, --forward_reads and --reverse_reads are needed."
+ opts.input_reads_format = "paired"
+ if opts.reads_sketch is not None:
+ assert opts.forward_reads is None, "If running in --input_reads_format sketch mode, you cannot provide --forward_reads, --reverse_reads"
+ assert opts.reverse_reads is None, "If running in --input_reads_format sketch mode, you cannot provide --forward_reads, --reverse_reads"
+ opts.input_reads_format = "sketch"
+
+ print("Auto detecting reads format: {}".format(opts.input_reads_format), file=sys.stdout)
+ assert_acceptable_arguments(opts.input_reads_format, {"paired", "sketch"})
+
+ # Set environment variables
+ add_executables_to_environment(opts=opts)
+
+def main(args=None):
+ # Path info
+ script_directory = os.path.dirname(os.path.abspath( __file__ ))
+ script_filename = __program__
+ # Path info
+ description = """
+ Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable)
+ usage = "{} -1 -2 |-s -n -o -d ".format(__program__)
+ epilog = "Copyright 2021 Josh L. Espinoza (jespinoz@jcvi.org)"
+
+
+ # Parser
+ parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter)
+
+ # Pipeline
+ parser_io = parser.add_argument_group('Required I/O arguments')
+ parser_io.add_argument("-1","--forward_reads", type=str, help = "path/to/forward_reads.fq[.gz]")
+    parser_io.add_argument("-2","--reverse_reads", type=str, help = "path/to/reverse_reads.fq[.gz]")
+ parser_io.add_argument("-s","--reads_sketch", type=str, help = "path/to/reads_sketch.sylsp (e.g., sylph sketch output) (Cannot be used with --forward_reads and --reverse_reads)")
+ parser_io.add_argument("-n", "--name", type=str, required=True, help="Name of sample")
+ parser_io.add_argument("-d","--sylph_databases", type=str, nargs="+", required=True, help = "Sylph database(s) with all genomes. Can be multiple databases delimited by spaces. Use compile_custom_sylph_sketch_database_from_genomes.py to build database.")
+ parser_io.add_argument("-o","--project_directory", type=str, default="veba_output/profiling/taxonomy", help = "path/to/project_directory [Default: veba_output/profiling/taxonomy]")
+ parser_io.add_argument("-c","--genome_clusters", type=str, help = "path/to/mags_to_slcs.tsv. [id_genome][id_genome-cluster], No header. Aggregates counts for genome clusters.")
+    parser_io.add_argument("-F", "--input_reads_format", choices={"paired", "sketch", "auto"}, type=str, default="auto", help = "Input reads format {paired, sketch, auto} [Default: auto]")
+ parser_io.add_argument("-x","--extension", type=str, default="fa", help = "Fasta file extension for bins. Assumes all genomes have the same file extension. [Default: fa]")
+
+
+ # Utility
+ parser_utility = parser.add_argument_group('Utility arguments')
+    parser_utility.add_argument("--path_config", type=str, default="CONDA_PREFIX", help="path/to/config.tsv [Default: CONDA_PREFIX]") # site-packages in future
+ parser_utility.add_argument("-p", "--n_jobs", type=int, default=1, help = "Number of threads [Default: 1]")
+ parser_utility.add_argument("--random_state", type=int, default=0, help = "Random state [Default: 0]")
+ parser_utility.add_argument("--restart_from_checkpoint", type=str, default=None, help = "Restart from a particular checkpoint [Default: None]")
+ parser_utility.add_argument("-v", "--version", action='version', version="{} v{}".format(__program__, __version__))
+    parser_utility.add_argument("--tmpdir", type=str, help="Set temporary directory")
+
+ # Sylph
+ parser_sylph_sketch = parser.add_argument_group('Sylph sketch arguments (Fastq)')
+ parser_sylph_sketch.add_argument("--sylph_sketch_k", type=int, choices={21,31}, default=31, help="Sylph sketch [Fastq] | Value of k. Only k = 21, 31 are currently supported. [Default: 31]")
+ parser_sylph_sketch.add_argument("--sylph_sketch_minimum_spacing", type=int, default=30, help="Sylph sketch [Fastq] | Minimum spacing between selected k-mers on the genomes [Default: 30]")
+ parser_sylph_sketch.add_argument("--sylph_sketch_subsampling_rate", type=int, default=100, help="Sylph sketch [Fastq] | Subsampling rate. sylph runs without issues if the -c for all genomes is ≥ the -c for reads. [Default: 100]")
+ parser_sylph_sketch.add_argument("--sylph_sketch_options", type=str, default="", help="Sylph sketch [Fastq] | More options for `sylph sketch` (e.g. --arg 1 ) [Default: '']")
+
+ parser_sylph_profile = parser.add_argument_group('Sylph profile arguments')
+ parser_sylph_profile.add_argument("--sylph_profile_minimum_ani", type=float, default=95, help="Sylph profile | Minimum adjusted ANI to consider (0-100). [Default: 95]")
+    parser_sylph_profile.add_argument("--sylph_profile_minimum_number_kmers", type=int, default=20, help="Sylph profile | Exclude genomes with fewer than this number of sampled k-mers. Sylph's default is 50, but lowering it to 20 accounts for viruses and small CPR genomes. [Default: 20]")
+    parser_sylph_profile.add_argument("--sylph_profile_minimum_count_correct", type=int, default=3, help="Sylph profile | Minimum k-mer multiplicity needed for coverage correction. Higher values give more precision but lower sensitivity [Default: 3]")
+ parser_sylph_profile.add_argument("--sylph_profile_options", type=str, default="", help="Sylph profile | More options for `sylph profile` (e.g. --arg 1 ) [Default: '']")
+ parser_sylph_profile.add_argument("--header", action="store_true", help = "Include header in taxonomic abundance tables")
+
+ # Options
+ opts = parser.parse_args()
+ opts.script_directory = script_directory
+ opts.script_filename = script_filename
+
+ # Threads
+ if opts.n_jobs == -1:
+ from multiprocessing import cpu_count
+ opts.n_jobs = cpu_count()
+ assert opts.n_jobs >= 1, "--n_jobs must be ≥ 1. To select all available threads, use -1."
+
+
+ # Directories
+ directories = dict()
+ directories["project"] = create_directory(opts.project_directory)
+ directories["sample"] = create_directory(os.path.join(directories["project"], opts.name))
+ directories["output"] = create_directory(os.path.join(directories["sample"], "output"))
+
+ directories["log"] = create_directory(os.path.join(directories["sample"], "log"))
+ if not opts.tmpdir:
+ opts.tmpdir = os.path.join(directories["sample"], "tmp")
+ directories["tmp"] = create_directory(opts.tmpdir)
+ directories["checkpoints"] = create_directory(os.path.join(directories["sample"], "checkpoints"))
+ directories["intermediate"] = create_directory(os.path.join(directories["sample"], "intermediate"))
+ os.environ["TMPDIR"] = directories["tmp"]
+
+ # Info
+ print(format_header(__program__, "="), file=sys.stdout)
+ print(format_header("Configuration:", "-"), file=sys.stdout)
+ print(format_header("Name: {}".format(opts.name), "."), file=sys.stdout)
+ print("Python version:", sys.version.replace("\n"," "), file=sys.stdout)
+ print("Python path:", sys.executable, file=sys.stdout) #sys.path[2]
+ print("GenoPype version:", genopype_version, file=sys.stdout) #sys.path[2]
+ print("Script version:", __version__, file=sys.stdout)
+ print("Moment:", get_timestamp(), file=sys.stdout)
+ print("Directory:", os.getcwd(), file=sys.stdout)
+ if "TMPDIR" in os.environ: print("TMPDIR:", os.environ["TMPDIR"], file=sys.stdout)
+ print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout)
+ configure_parameters(opts, directories)
+ sys.stdout.flush()
+
+ # Run pipeline
+ with open(os.path.join(directories["sample"], "commands.sh"), "w") as f_cmds:
+ pipeline = create_pipeline(
+ opts=opts,
+ directories=directories,
+ f_cmds=f_cmds,
+ )
+ pipeline.compile()
+ pipeline.execute(restart_from_checkpoint=opts.restart_from_checkpoint)
+
+if __name__ == "__main__":
+ main()
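The `--n_jobs` handling in the block above follows a common CLI convention: `-1` expands to all available cores, and any other value must be at least 1. A minimal sketch of that resolution (the helper name is illustrative, not part of the script):

```python
from multiprocessing import cpu_count

def resolve_n_jobs(n_jobs):
    # -1 expands to all available threads; any other value must be >= 1.
    if n_jobs == -1:
        n_jobs = cpu_count()
    assert n_jobs >= 1, "--n_jobs must be >= 1. To select all available threads, use -1."
    return n_jobs

print(resolve_n_jobs(4))   # 4
print(resolve_n_jobs(-1))  # number of CPUs on this machine
```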
diff --git a/src/scripts/binning_wrapper.py b/src/scripts/binning_wrapper.py
index cfd1c0e..a9904e6 100755
--- a/src/scripts/binning_wrapper.py
+++ b/src/scripts/binning_wrapper.py
@@ -12,7 +12,7 @@
# from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2023.5.8"
+__version__ = "2023.12.4"
def get_maxbin2_cmd( input_filepaths, output_filepaths, output_directory, directories, opts):
# Create dummy scaffolds_to_bins.tsv to overwrite later. This makes DAS_Tool easier to run
@@ -740,6 +740,9 @@ def add_executables_to_environment(opts):
"merge_cutup_clustering.py",
"extract_fasta_bins.py",
}
+ # if opts.algorithm == "vrhyme":
+ # required_executables |= {"vRhyme"}
+
# if opts.algorithm == "metacoag":
# required_executables |= {"MetaCoAG"}
@@ -845,7 +848,7 @@ def main(argv=None):
# Binning
parser_binning = parser.add_argument_group('Binning arguments')
- parser_binning.add_argument("-a", "--algorithm", type=str, default="metabat2", help="Binning algorithm: {concoct, metabat2, maxbin2} Future: {metacoag, vamb} [Default: metabat2] ")
+ parser_binning.add_argument("-a", "--algorithm", type=str, default="metabat2", help="Binning algorithm: {concoct, metabat2, maxbin2} Future: {vrhyme} [Default: metabat2] ")
parser_binning.add_argument("-m", "--minimum_contig_length", type=int, default=1500, help="Minimum contig length. [Default: 1500] ")
parser_binning.add_argument("-s", "--minimum_genome_length", type=int, default=150000, help="Minimum genome length. [Default: 150000] ")
parser_binning.add_argument("-P","--bin_prefix", type=str, default="DEFAULT", help = "Prefix for bin names. Special strings include: 1) --bin_prefix NONE which does not include a bin prefix; and 2) --bin_prefix DEFAULT then prefix is [ALGORITHM_UPPERCASE]__")
@@ -870,8 +873,8 @@ def main(argv=None):
# parser_metacoag = parser.add_argument_group('MetaCoAG arguments')
# parser_metacoag.add_argument("--metacoag_options", type=str, default="", help="MetaCoAG | More options (e.g. --arg 1 ) [Default: '']")
- # parser_vamb = parser.add_argument_group('VAMB arguments')
- # parser_vamb.add_argument("--vamb_options", type=str, default="", help="VAMB | More options (e.g. --arg 1 ) [Default: '']")
+ # parser_vrhyme = parser.add_argument_group('vRhyme arguments')
+ # parser_vrhyme.add_argument("--vrhyme_options", type=str, default="", help="vRhyme | More options (e.g. --arg 1 ) [Default: '']")
# Options
opts = parser.parse_args(argv)
diff --git a/src/scripts/build_source_to_lineage_dictionary.py b/src/scripts/build_source_to_lineage_dictionary.py
new file mode 100755
index 0000000..e593068
--- /dev/null
+++ b/src/scripts/build_source_to_lineage_dictionary.py
@@ -0,0 +1,69 @@
+#!/usr/bin/env python
+from __future__ import print_function, division
+import sys, os, argparse, gzip, pickle
+from tqdm import tqdm
+import pandas as pd
+
+# from tqdm import tqdm
+__program__ = os.path.split(sys.argv[0])[-1]
+__version__ = "2023.11.13"
+
+def main(args=None):
+ # Path info
+ script_directory = os.path.dirname(os.path.abspath( __file__ ))
+ script_filename = __program__
+
+ # Path info
+ description = """
+ Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable)
+ usage = "{} -i -o ".format(__program__)
+ epilog = "Copyright 2021 Josh L. Espinoza (jespinoz@jcvi.org)"
+
+ # Parser
+ parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter)
+ # Pipeline
+ parser.add_argument("-i","--input", default="stdin", type=str, help = "Path to table [id_source][class][order][family][genus][species], with header. Can include more columns but the first column must be `id_source`. [Default: stdin]")
+ parser.add_argument("-o","--output", required=True, type=str, help = "Path to dictionary pickle object. Can be gzipped. (Recommended name: source_to_lineage.dict.pkl.gz)")
+ parser.add_argument("--separator", default=";", type=str, help = "Separator field for taxonomy [Default: ; ]")
+
+ # Options
+ opts = parser.parse_args()
+ opts.script_directory = script_directory
+ opts.script_filename = script_filename
+
+ # Input
+ if opts.input == "stdin":
+ opts.input = sys.stdin
+
+
+ print(" * Reading identifier mappings from the following file: {}".format("stdin" if opts.input == sys.stdin else opts.input), file=sys.stderr)
+ source_to_lineage = dict()
+ df_input = pd.read_csv(opts.input, sep="\t", index_col=0)
+ for id_source, row in tqdm(df_input.loc[:,["class", "order", "family", "genus", "species"]].iterrows(), total=df_input.shape[0]):
+ lineage = list()
+ for level, taxon in row.items():
+ v = level[0] + "__"
+ if pd.notnull(taxon):
+ v += taxon
+ lineage.append(v)
+ source_to_lineage[id_source] = opts.separator.join(lineage)
+
+
+
+ print(" * Writing Python dictionary: {}".format(opts.output), file=sys.stderr)
+ f_out = None
+ if opts.output.endswith((".gz", ".pgz")):
+ f_out = gzip.open(opts.output, "wb")
+ else:
+ f_out = open(opts.output, "wb")
+ assert f_out is not None, "Unrecognized file format: {}".format(opts.output)
+ pickle.dump(source_to_lineage, f_out)
+
+if __name__ == "__main__":
+ main()
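The lineage construction in the loop above turns each taxonomic level into a `<first letter>__<taxon>` token (just the prefix, e.g. `s__`, when the taxon is null) and joins the tokens with the separator. A minimal standalone sketch of the same logic, using a plain dict and a `None` check in place of `pd.notnull` (the function name and example taxa are illustrative):

```python
def build_lineage(level_to_taxon, separator=";"):
    # Each level becomes "<first letter>__<taxon>"; a missing taxon
    # leaves just the prefix (e.g. "s__"), mirroring the loop above.
    lineage = []
    for level, taxon in level_to_taxon.items():
        token = level[0] + "__"
        if taxon is not None:
            token += taxon
        lineage.append(token)
    return separator.join(lineage)

example = {"class": "Mamiellophyceae", "order": "Mamiellales",
           "family": "Mamiellaceae", "genus": "Micromonas", "species": None}
print(build_lineage(example))
# c__Mamiellophyceae;o__Mamiellales;f__Mamiellaceae;g__Micromonas;s__
```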
diff --git a/src/scripts/build_target_to_source_dictionary.py b/src/scripts/build_target_to_source_dictionary.py
new file mode 100755
index 0000000..367bc94
--- /dev/null
+++ b/src/scripts/build_target_to_source_dictionary.py
@@ -0,0 +1,77 @@
+#!/usr/bin/env python
+from __future__ import print_function, division
+import sys, os, argparse, gzip, pickle
+
+# from tqdm import tqdm
+__program__ = os.path.split(sys.argv[0])[-1]
+__version__ = "2023.11.15"
+
+def main(args=None):
+ # Path info
+ script_directory = os.path.dirname(os.path.abspath( __file__ ))
+ script_filename = __program__
+
+ # Path info
+ description = """
+ Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable)
+ usage = "{} -i -o ".format(__program__)
+ epilog = "Copyright 2021 Josh L. Espinoza (jespinoz@jcvi.org)"
+
+ # Parser
+ parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter)
+ # Pipeline
+ parser.add_argument("-i","--input", default="stdin", type=str, help = "Path to identifier mapping table [id_database][id_source][id_protein][id_hash], No header. [Default: stdin]")
+ parser.add_argument("-o","--output", required=True, type=str, help = "Path to dictionary pickle object. Can be gzipped. (Recommended name: target_to_source.dict.pkl.gz)")
+ parser.add_argument("-n","--number_of_sequences", type=int, help = "Number of sequences. If provided, tqdm is required to display a progress bar.")
+
+ # Options
+ opts = parser.parse_args()
+ opts.script_directory = script_directory
+ opts.script_filename = script_filename
+
+ # Input
+ f_in = None
+ if opts.input == "stdin":
+ f_in = sys.stdin
+ else:
+ if opts.input.endswith(".gz"):
+ f_in = gzip.open(opts.input, "rt")
+ else:
+ f_in = open(opts.input, "r")
+ assert f_in is not None, "Unrecognized file format: {}".format(opts.input)
+
+ if opts.number_of_sequences is not None:
+ from tqdm import tqdm
+ input_iterable = tqdm(f_in, total=opts.number_of_sequences, unit=" sequences")
+ else:
+ input_iterable = f_in
+
+ print(" * Reading identifier mappings from the following file: {}".format(opts.input), file=sys.stderr)
+ target_to_source = dict()
+ for line in input_iterable:
+ line = line.strip()
+ if line:
+ fields = line.split("\t")
+ id_hash = fields[3]
+ id_source = fields[1]
+ target_to_source[id_hash] = id_source
+ if f_in != sys.stdin:
+ f_in.close()
+
+ print(" * Writing Python dictionary: {}".format(opts.output), file=sys.stderr)
+ f_out = None
+ if opts.output.endswith((".gz", ".pgz")):
+ f_out = gzip.open(opts.output, "wb")
+ else:
+ f_out = open(opts.output, "wb")
+ assert f_out is not None, "Unrecognized file format: {}".format(opts.output)
+ pickle.dump(target_to_source, f_out)
+
+if __name__ == "__main__":
+ main()
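The dictionary written above can be loaded back the same way it is written: gzip-wrapped when the filename ends in `.gz`/`.pgz`, plain binary otherwise. A minimal round-trip sketch (the `load_pickle` helper is illustrative, not part of the script):

```python
import gzip, pickle, os, tempfile

def load_pickle(path):
    # Mirror the writer: gzip for .gz/.pgz extensions, plain binary otherwise.
    opener = gzip.open if path.endswith((".gz", ".pgz")) else open
    with opener(path, "rb") as f:
        return pickle.load(f)

# Round-trip demonstration with a toy hash-to-source mapping
mapping = {"hash123": "MMETSP0001"}
path = os.path.join(tempfile.mkdtemp(), "target_to_source.dict.pkl.gz")
with gzip.open(path, "wb") as f:
    pickle.dump(mapping, f)
print(load_pickle(path))  # {'hash123': 'MMETSP0001'}
```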
diff --git a/src/scripts/check_fasta_duplicates.py b/src/scripts/check_fasta_duplicates.py
index b4ca8dd..527b508 100755
--- a/src/scripts/check_fasta_duplicates.py
+++ b/src/scripts/check_fasta_duplicates.py
@@ -3,7 +3,7 @@
from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2023.4.17"
+__version__ = "2023.11.10"
def main(args=None):
# Path info
@@ -30,13 +30,15 @@ def main(args=None):
if not opts.input:
identifiers = set()
duplicates = set()
- for line in tqdm(sys.stdin, "stdin"):
+ for i, line in tqdm(enumerate(sys.stdin), "stdin"):
if line.startswith(">"):
id = line[1:].split(" ")[0].strip()
if id not in identifiers:
identifiers.add(id)
else:
duplicates.add(id)
+ else:
+ assert ">" not in line, "Line={} has a '>' character in the sequence which will cause an error. This can arise from concatenating fasta files where a record is missing a final linebreak".format(i+1)
if duplicates:
print("# Duplicates:", *sorted(duplicates), file=sys.stdout, sep="\n", end=None)
sys.exit(1)
@@ -48,13 +50,16 @@ def main(args=None):
identifiers = set()
duplicates = set()
f = {True:gzip.open(fp, "rt"), False:open(fp, "r")}[fp.endswith(".gz")]
- for line in tqdm(f, fp):
+ for i,line in tqdm(enumerate(f), fp):
if line.startswith(">"):
id = line[1:].split(" ")[0]
if id not in identifiers:
identifiers.add(id)
else:
duplicates.add(id)
+ else:
+ assert ">" not in line, "Line={} has a '>' character in the sequence which will cause an error. This can arise from concatenating fasta files where a record is missing a final linebreak".format(i+1)
+
if duplicates:
files_with_duplicates.add(fp)
print(f"[Fail] {fp}", file=sys.stdout)
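The duplicate check above reduces to set membership over FASTA identifiers, plus the new guard against `>` characters inside sequence lines. The same logic over an in-memory list of lines (function name illustrative):

```python
def find_duplicates(lines):
    # Track seen FASTA identifiers; flag any id seen more than once, and
    # reject '>' inside sequence lines (a sign of broken concatenations).
    identifiers, duplicates = set(), set()
    for i, line in enumerate(lines):
        if line.startswith(">"):
            id = line[1:].split(" ")[0].strip()
            (duplicates if id in identifiers else identifiers).add(id)
        else:
            assert ">" not in line, "Line={} has a '>' in the sequence".format(i + 1)
    return duplicates

print(find_duplicates([">a desc", "ACGT", ">b", "GGTT", ">a", "TTAA"]))  # {'a'}
```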
diff --git a/src/scripts/clean_fasta.py b/src/scripts/clean_fasta.py
new file mode 100755
index 0000000..92192c6
--- /dev/null
+++ b/src/scripts/clean_fasta.py
@@ -0,0 +1,124 @@
+#!/usr/bin/env python
+import sys, os, argparse, gzip
+from Bio.SeqIO.FastaIO import SimpleFastaParser
+from tqdm import tqdm
+
+__program__ = os.path.split(sys.argv[0])[-1]
+__version__ = "2023.11.10"
+
+def main(args=None):
+ # Path info
+ script_directory = os.path.dirname(os.path.abspath( __file__ ))
+ script_filename = __program__
+
+ # Path info
+ description = """
+ Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable)
+ usage = "{} -i -o ".format(__program__)
+ epilog = "Copyright 2021 Josh L. Espinoza (jespinoz@jcvi.org)"
+
+ # Parser
+ parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter)
+
+ # Pipeline
+ parser.add_argument("-i","--input", default="stdin", type=str, help = "Input fasta file")
+ parser.add_argument("-o","--output", default="stdout", type=str, help = "Output fasta file")
+ parser.add_argument("-r","--retain_description", action="store_true", help = "Retain description")
+ parser.add_argument("-s","--retain_stop_codon", action="store_true", help = "Retain stop codon character (if one exists)")
+ parser.add_argument("-m","--minimum_sequence_length", default=1, type=int, help = "Minimum sequence length accepted [Default: 1]")
+ parser.add_argument("--stop_codon_character", default="*", type=str, help = "Stop codon character [Default: *] ")
+ # parser.add_argument("-t","--molecule_type", help = "Comma-separated list of names for the --scaffolds_to_bins")
+
+
+ # Options
+ opts = parser.parse_args()
+ opts.script_directory = script_directory
+ opts.script_filename = script_filename
+
+ assert opts.minimum_sequence_length > 0
+
+ # Input
+ f_in = None
+ if opts.input == "stdin":
+ f_in = sys.stdin
+ else:
+ if opts.input.endswith(".gz"):
+ f_in = gzip.open(opts.input, "rt")
+ else:
+ f_in = open(opts.input, "r")
+ assert f_in is not None
+
+ # Output
+ f_out = None
+ if opts.output == "stdout":
+ f_out = sys.stdout
+ else:
+ if opts.output.endswith(".gz"):
+ f_out = gzip.open(opts.output, "wt")
+ else:
+ f_out = open(opts.output, "w")
+ assert f_out is not None
+
+ # retain_description=True
+ # retain_stop_codon=True
+ if all([
+ opts.retain_description,
+ opts.retain_stop_codon,
+ ]):
+ for header, seq in tqdm(SimpleFastaParser(f_in), "Reading fasta input"):
+ header = header.strip()
+ if len(seq) >= opts.minimum_sequence_length:
+ assert ">" not in seq, "`{}` has a '>' character in the sequence which will cause an error. This can arise from concatenating fasta files where a record is missing a final linebreak".format(header)
+ print(">{}\n{}".format(header,seq), file=f_out)
+
+ # retain_description=False
+ # retain_stop_codon=True
+ if all([
+ not opts.retain_description,
+ opts.retain_stop_codon,
+ ]):
+ for header, seq in tqdm(SimpleFastaParser(f_in), "Reading fasta input"):
+ id = header.split(" ")[0].strip()
+ if len(seq) >= opts.minimum_sequence_length:
+ assert ">" not in seq, "`{}` has a '>' character in the sequence which will cause an error. This can arise from concatenating fasta files where a record is missing a final linebreak".format(header)
+ print(">{}\n{}".format(id,seq), file=f_out)
+
+ # retain_description=True
+ # retain_stop_codon=False
+ if all([
+ opts.retain_description,
+ not opts.retain_stop_codon,
+ ]):
+ for header, seq in tqdm(SimpleFastaParser(f_in), "Reading fasta input"):
+ header = header.strip()
+ if seq.endswith(opts.stop_codon_character):
+ seq = seq[:-1]
+ if len(seq) >= opts.minimum_sequence_length:
+ assert ">" not in seq, "`{}` has a '>' character in the sequence which will cause an error. This can arise from concatenating fasta files where a record is missing a final linebreak".format(header)
+ print(">{}\n{}".format(header,seq), file=f_out)
+
+ # retain_description=False
+ # retain_stop_codon=False
+ if all([
+ not opts.retain_description,
+ not opts.retain_stop_codon,
+ ]):
+ for header, seq in tqdm(SimpleFastaParser(f_in), "Reading fasta input"):
+ id = header.split(" ")[0].strip()
+ if seq.endswith(opts.stop_codon_character):
+ seq = seq[:-1]
+ if len(seq) >= opts.minimum_sequence_length:
+ assert ">" not in seq, "`{}` has a '>' character in the sequence which will cause an error. This can arise from concatenating fasta files where a record is missing a final linebreak".format(header)
+ print(">{}\n{}".format(id,seq), file=f_out)
+
+ # Close
+ if f_in != sys.stdin:
+ f_in.close()
+ if f_out != sys.stdout:
+ f_out.close()
+
+if __name__ == "__main__":
+ main()
+
+
+
diff --git a/src/scripts/clustering_wrapper.py b/src/scripts/clustering_wrapper.py
new file mode 100755
index 0000000..b8eddbb
--- /dev/null
+++ b/src/scripts/clustering_wrapper.py
@@ -0,0 +1,439 @@
+#!/usr/bin/env python
+from __future__ import print_function, division
+import sys, os, argparse, glob, shutil, time, warnings
+from multiprocessing import cpu_count
+from collections import OrderedDict, defaultdict
+
+import pandas as pd
+
+# Soothsayer Ecosystem
+from genopype import *
+from soothsayer_utils import *
+
+# from tqdm import tqdm
+__program__ = os.path.split(sys.argv[0])[-1]
+__version__ = "2023.11.10"
+
+# Check
+def get_check_cmd(input_filepaths, output_filepaths, output_directory, directories, opts):
+ # Command
+ cmd = [
+ os.environ["check_fasta_duplicates.py"],
+ opts.fasta,
+ ]
+
+ return cmd
+
+def get_mmseqs2_cmd( input_filepaths, output_filepaths, output_directory, directories, opts):
+
+ # Command
+ cmd = [
+ os.environ["mmseqs"],
+ "easy-{}".format(opts.algorithm.split("-")[1]),
+ opts.fasta,
+ os.path.join(output_directory, "mmseqs2"),
+ directories["tmp"],
+ "--threads {}".format(opts.n_jobs),
+ "--min-seq-id {}".format(opts.minimum_identity_threshold/100),
+ "-c {}".format(opts.minimum_coverage_threshold),
+ "--cov-mode 1",
+ opts.mmseqs2_options,
+
+ "&&",
+
+ "mv",
+ os.path.join(output_directory, "mmseqs2_cluster.tsv"),
+ os.path.join(output_directory, "clusters.tsv"),
+
+ "&&",
+
+ "mv",
+ os.path.join(output_directory, "mmseqs2_rep_seq.fasta"),
+ os.path.join(output_directory, "representatives.fasta"),
+
+ "&&",
+
+ "gzip",
+ os.path.join(output_directory, "representatives.fasta"),
+
+ "&&",
+
+ "rm -rf",
+ os.path.join(output_directory, "mmseqs2_all_seqs.fasta"),
+ os.path.join(directories["tmp"], "*"),
+ ]
+
+ return cmd
+
+def get_diamond_cmd( input_filepaths, output_filepaths, output_directory, directories, opts):
+
+ # Command
+ cmd = [
+ os.environ["diamond"],
+ {"diamond-cluster":"cluster", "diamond-linclust":"linclust"}[opts.algorithm],
+ "--db",
+ opts.fasta,
+ "--out",
+ os.path.join(output_directory, "clusters.tsv"),
+ "--tmpdir",
+ directories["tmp"],
+ "--threads {}".format(opts.n_jobs),
+ "--approx-id {}".format(opts.minimum_identity_threshold),
+ "--member-cover {}".format(opts.minimum_coverage_threshold*100),
+ opts.diamond_options,
+
+ "&&",
+
+ "cut -f1",
+ os.path.join(output_directory, "clusters.tsv"),
+ "|",
+ "sort -u",
+ ">",
+ os.path.join(output_directory, "representatives.list"),
+
+ "&&",
+
+ os.environ["seqkit"],
+ "grep",
+ "-w 0",
+ "-f",
+ os.path.join(output_directory, "representatives.list"),
+ opts.fasta,
+ "|",
+ "gzip",
+ ">",
+ os.path.join(output_directory, "representatives.fasta.gz"),
+
+ "&&",
+
+ "rm -rf",
+ os.path.join(directories["tmp"], "*"),
+ ]
+
+ return cmd
+
+# Compile
+def get_compile_cmd(input_filepaths, output_filepaths, output_directory, directories, opts):
+
+ # Command
+ cmd = [
+
+ os.environ["edgelist_to_clusters.py"],
+ "-i {}".format(input_filepaths[0]),
+ "--no_singletons" if bool(opts.no_singletons) else "",
+ "--cluster_prefix {}".format(opts.cluster_prefix) if bool(opts.cluster_prefix) else "",
+ "--cluster_suffix {}".format(opts.cluster_suffix) if bool(opts.cluster_suffix) else "",
+ "--cluster_prefix_zfill {}".format(opts.cluster_prefix_zfill),
+ "-o {}".format(os.path.join(output_directory, "{}.tsv".format(opts.basename))),
+ # "-g {}".format(os.path.join(output_directory, "{}.networkx_graph.pkl".format(opts.basename))),
+ # "-d {}".format(os.path.join(output_directory, "{}.dict.pkl".format(opts.basename))),
+ "--identifiers {}".format(opts.identifiers) if bool(opts.identifiers) else "",
+
+ "&&",
+
+ os.environ["reformat_representative_sequences.py"],
+ "-c {}".format(os.path.join(output_directory, "{}.tsv".format(opts.basename))),
+ "-i {}".format(input_filepaths[1]),
+ "-f {}".format(opts.representative_output_format),
+ "-o {}".format(output_filepaths[1]),
+ ]
+
+ if opts.no_sequences_and_header:
+ cmd += [
+ "--no_sequences",
+ "--no_header",
+ ]
+
+ return cmd
+
+# ============
+# Run Pipeline
+# ============
+# Set environment variables
+def add_executables_to_environment(opts):
+ """
+ Adapted from Soothsayer: https://github.com/jolespin/soothsayer
+ """
+ accessory_scripts = set([
+ "check_fasta_duplicates.py",
+ "edgelist_to_clusters.py",
+ "reformat_representative_sequences.py",
+ ])
+
+ required_executables={
+ "mmseqs",
+ "diamond",
+ "seqkit",
+
+ } | accessory_scripts
+
+ if opts.path_config == "CONDA_PREFIX":
+ executables = dict()
+ for name in required_executables:
+ executables[name] = os.path.join(os.environ["CONDA_PREFIX"], "bin", name)
+ else:
+ if opts.path_config is None:
+ opts.path_config = os.path.join(opts.script_directory, "veba_config.tsv")
+ opts.path_config = format_path(opts.path_config)
+ assert os.path.exists(opts.path_config), "config file does not exist. Have you created one in the following directory?\n{}\nIf not, either create one, check this filepath: {}, or give the path to a proper config file using --path_config".format(opts.script_directory, opts.path_config)
+ assert os.stat(opts.path_config).st_size > 1, "config file seems to be empty. Please add 'name' and 'executable' columns for the following program names: {}".format(required_executables)
+ df_config = pd.read_csv(opts.path_config, sep="\t")
+ assert {"name", "executable"} <= set(df_config.columns), "config must have `name` and `executable` columns. Please adjust file: {}".format(opts.path_config)
+ df_config = df_config.loc[:,["name", "executable"]].dropna(how="any", axis=0).applymap(str)
+ # Get executable paths
+ executables = OrderedDict(zip(df_config["name"], df_config["executable"]))
+ assert required_executables <= set(list(executables.keys())), "config must have the required executables for this run. Please adjust file: {}\nIn particular, add info for the following: {}".format(opts.path_config, required_executables - set(list(executables.keys())))
+
+ # Display
+ for name in sorted(accessory_scripts):
+ executables[name] = "'{}'".format(os.path.join(opts.script_directory, name)) # Can handle spaces in path
+
+ print(format_header( "Adding executables to path from the following source: {}".format(opts.path_config), "-"), file=sys.stdout)
+ for name, executable in executables.items():
+ if name in required_executables:
+ print(name, executable, sep = " --> ", file=sys.stdout)
+ os.environ[name] = executable.strip()
+ print("", file=sys.stdout)
+
+# Pipeline
+def create_pipeline(opts, directories, f_cmds):
+
+ # .................................................................
+ # Primordial
+ # .................................................................
+ # Commands file
+ pipeline = ExecutablePipeline(name=__program__, f_cmds=f_cmds, checkpoint_directory=directories["checkpoints"], log_directory=directories["log"])
+
+
+ # ==========
+ # Preprocessing
+ # ==========
+
+ program = "check"
+ # Add to directories
+ output_directory = directories["tmp"]
+
+ # Info
+ step = 0
+ description = "Check sequences for duplicates"
+
+ # i/o
+ input_filepaths = [opts.fasta]
+ output_filepaths = [
+ ]
+
+ params = {
+ "input_filepaths":input_filepaths,
+ "output_filepaths":output_filepaths,
+ "output_directory":output_directory,
+ "opts":opts,
+ "directories":directories,
+ }
+
+ cmd = get_check_cmd(**params)
+
+ pipeline.add_step(
+ id=program,
+ description = description,
+ step=step,
+ cmd=cmd,
+ input_filepaths = input_filepaths,
+ output_filepaths = output_filepaths,
+ validate_inputs=True,
+ validate_outputs=False,
+ )
+
+ # ==========
+ # Clustering
+ # ==========
+ step = 1
+
+ # i/o
+ output_directory = directories["intermediate"]
+
+ input_filepaths = [opts.fasta]
+ output_filenames = [
+ "clusters.tsv",
+ "representatives.fasta.gz",
+ ]
+ output_filepaths = list(map(lambda filename: os.path.join(output_directory, filename), output_filenames))
+
+ params = {
+ "input_filepaths":input_filepaths,
+ "output_filepaths":output_filepaths,
+ "output_directory":output_directory,
+ "opts":opts,
+ "directories":directories,
+ }
+
+ if opts.algorithm.split("-")[0] == "mmseqs":
+ program = "mmseqs2"
+ # Info
+ description = "Cluster sequences via MMSEQS2"
+ cmd = get_mmseqs2_cmd(**params)
+
+ if opts.algorithm.split("-")[0] == "diamond":
+ program = "diamond"
+ description = "Cluster sequences via Diamond"
+ cmd = get_diamond_cmd(**params)
+
+ pipeline.add_step(
+ id=program,
+ description = description,
+ step=step,
+ cmd=cmd,
+ input_filepaths = input_filepaths,
+ output_filepaths = output_filepaths,
+ validate_inputs=True,
+ validate_outputs=True,
+ )
+
+ # ==========
+ # Compile
+ # ==========
+
+ program = "compile"
+ # Add to directories
+ output_directory = directories["output"]
+
+ # Info
+ step = 2
+ description = "Compile clustering results"
+
+ # i/o
+ input_filepaths = output_filepaths
+ output_filenames = [
+ "{}.tsv".format(opts.basename),
+ ]
+ if opts.representative_output_format == "table":
+ output_filenames += ["representative_sequences.tsv.gz"]
+ if opts.representative_output_format == "fasta":
+ output_filenames += ["representative_sequences.fasta.gz"]
+ output_filepaths = list(map(lambda filename: os.path.join(output_directory, filename), output_filenames))
+
+ params = {
+ "input_filepaths":input_filepaths,
+ "output_filepaths":output_filepaths,
+ "output_directory":output_directory,
+ "opts":opts,
+ "directories":directories,
+ }
+
+ cmd = get_compile_cmd(**params)
+
+ pipeline.add_step(
+ id=program,
+ description = description,
+ step=step,
+ cmd=cmd,
+ input_filepaths = input_filepaths,
+ output_filepaths = output_filepaths,
+ validate_inputs=True,
+ validate_outputs=True,
+ )
+
+ return pipeline
+
+# Configure parameters
+def configure_parameters(opts, directories):
+
+ assert_acceptable_arguments(opts.algorithm, {"easy-cluster", "easy-linclust", "mmseqs-cluster", "mmseqs-linclust", "diamond-cluster", "diamond-linclust"})
+ if opts.algorithm in {"easy-cluster", "easy-linclust"}:
+ d = {"easy-cluster":"mmseqs-cluster", "easy-linclust":"mmseqs-linclust"}
+ warnings.warn("\n\nPlease use `{}` instead of `{}` for MMSEQS2 clustering.".format(d[opts.algorithm], opts.algorithm))
+ opts.algorithm = d[opts.algorithm]
+ assert_acceptable_arguments(opts.representative_output_format, {"table", "fasta"})
+ # Set environment variables
+ add_executables_to_environment(opts=opts)
+
+def main(args=None):
+ # Path info
+ script_directory = os.path.dirname(os.path.abspath( __file__ ))
+ script_filename = __program__
+ # Path info
+ description = """
+ Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable)
+ usage = "{} -i -o ".format(__program__)
+
+ epilog = "Copyright 2021 Josh L. Espinoza (jespinoz@jcvi.org)"
+
+ # Parser
+ parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter)
+ # Pipeline
+ parser_io = parser.add_argument_group('Required I/O arguments')
+ parser_io.add_argument("-i","--fasta", type=str, help = "Fasta file")
+ parser_io.add_argument("-o","--output_directory", type=str, default="clustering_output", help = "path/to/project_directory [Default: clustering_output]")
+ parser_io.add_argument("-e", "--no_singletons", action="store_true", help="Exclude singletons")
+ parser_io.add_argument("-b", "--basename", type=str, default="clusters", help="Basename for clustering files [Default: clusters]")
+
+ # Utility
+ parser_utility = parser.add_argument_group('Utility arguments')
+ parser_utility.add_argument("--path_config", type=str, default="CONDA_PREFIX", help="path/to/config.tsv [Default: CONDA_PREFIX]") #site-packges in future
+ parser_utility.add_argument("-p", "--n_jobs", type=int, default=1, help = "Number of threads [Default: 1]")
+ parser_utility.add_argument("--restart_from_checkpoint", type=str, default=None, help = "Restart from a particular checkpoint [Default: None]")
+ parser_utility.add_argument("-v", "--version", action='version', version="{} v{}".format(__program__, __version__))
+ # parser_utility.add_argument("--verbose", action='store_true')
+
+ # Clustering
+ parser_clustering = parser.add_argument_group('Clustering arguments')
+ parser_clustering.add_argument("-a", "--algorithm", type=str, default="mmseqs-cluster", help="Clustering algorithm | Diamond can only be used for clustering proteins {mmseqs-cluster, mmseqs-linclust, diamond-cluster, diamond-linclust} [Default: mmseqs-cluster]")
+ parser_clustering.add_argument("-t", "--minimum_identity_threshold", type=float, default=50.0, help="Clustering | Percent identity threshold (Range (0.0, 100.0]) [Default: 50.0]")
+ parser_clustering.add_argument("-c", "--minimum_coverage_threshold", type=float, default=0.8, help="Clustering | Coverage threshold (Range (0.0, 1.0]) [Default: 0.8]")
+ parser_clustering.add_argument("--cluster_prefix", type=str, default="SC-", help="Sequence cluster prefix [Default: 'SC-']")
+ parser_clustering.add_argument("--cluster_suffix", type=str, default="", help="Sequence cluster suffix [Default: '']")
+ parser_clustering.add_argument("--cluster_prefix_zfill", type=int, default=0, help="Sequence cluster prefix zfill. Use 7 to match identifiers from OrthoFinder. Use 0 to add no zfill. [Default: 0]") #7
+ parser_clustering.add_argument("--mmseqs2_options", type=str, default="", help="MMSEQS2 | More options (e.g. --arg 1 ) [Default: '']")
+ parser_clustering.add_argument("--diamond_options", type=str, default="", help="Diamond | More options (e.g. --arg 1 ) [Default: '']")
+ parser_clustering.add_argument("--identifiers", type=str, help = "Identifiers to include for `edgelist_to_clusters.py`. If identifiers are missing and singletons are allowed, they will be included as singleton clusters with a weight of np.inf")
+ parser_clustering.add_argument("--no_sequences_and_header", action="store_true", help = "Don't include sequences or header in table. Useful for concatenation and reduced redundancy of sequences")
+ parser_clustering.add_argument("-f","--representative_output_format", type=str, default="fasta", help = "Format of output for representative sequences: {table, fasta} [Default: fasta]") # Should fasta be the new default?
+
+ # Options
+ opts = parser.parse_args()
+
+ opts.script_directory = script_directory
+ opts.script_filename = script_filename
+
+ # Threads
+ if opts.n_jobs == -1:
+ opts.n_jobs = cpu_count()
+ assert opts.n_jobs >= 1, "--n_jobs must be ≥ 1 (or -1 to use all available threads)"
+
+ # Directories
+ directories = dict()
+ directories["project"] = create_directory(opts.output_directory)
+ directories["output"] = create_directory(os.path.join(directories["project"], "output"))
+ directories["log"] = create_directory(os.path.join(directories["project"], "log"))
+ directories["tmp"] = create_directory(os.path.join(directories["project"], "tmp"))
+ directories["checkpoints"] = create_directory(os.path.join(directories["project"], "checkpoints"))
+ directories["intermediate"] = create_directory(os.path.join(directories["project"], "intermediate"))
+ os.environ["TMPDIR"] = directories["tmp"]
+
+ # Info
+ print(format_header(__program__, "="), file=sys.stdout)
+ print(format_header("Configuration:", "-"), file=sys.stdout)
+ print("Python version:", sys.version.replace("\n"," "), file=sys.stdout)
+ print("Python path:", sys.executable, file=sys.stdout) #sys.path[2]
+ print("Script version:", __version__, file=sys.stdout)
+ print("Moment:", get_timestamp(), file=sys.stdout)
+ print("Directory:", os.getcwd(), file=sys.stdout)
+ print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout)
+ configure_parameters(opts, directories)
+ sys.stdout.flush()
+
+ # Run pipeline
+ with open(os.path.join(directories["project"], "commands.sh"), "w") as f_cmds:
+ pipeline = create_pipeline(
+ opts=opts,
+ directories=directories,
+ f_cmds=f_cmds,
+ )
+ pipeline.compile()
+ pipeline.execute(restart_from_checkpoint=opts.restart_from_checkpoint)
+
+if __name__ == "__main__":
+ main(sys.argv[1:])
+
+
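Note the unit conventions the two clustering backends expect, which the command builders above convert between: MMseqs2 takes fractions (`--min-seq-id`, `-c`), while Diamond takes percentages (`--approx-id`, `--member-cover`). With the wrapper's CLI convention (identity in percent, coverage as a fraction), the conversion can be sketched as (function name illustrative):

```python
def backend_thresholds(minimum_identity_threshold, minimum_coverage_threshold):
    # Wrapper CLI convention: identity in percent (0, 100], coverage as fraction (0, 1].
    # MMseqs2 wants fractions for both; Diamond wants percentages for both.
    return {
        "mmseqs": {"--min-seq-id": minimum_identity_threshold / 100,
                   "-c": minimum_coverage_threshold},
        "diamond": {"--approx-id": minimum_identity_threshold,
                    "--member-cover": minimum_coverage_threshold * 100},
    }

print(backend_thresholds(50.0, 0.8))
```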
diff --git a/src/scripts/compile_custom_humann_database_from_annotations.py b/src/scripts/compile_custom_humann_database_from_annotations.py
index a644bb5..6604413 100755
--- a/src/scripts/compile_custom_humann_database_from_annotations.py
+++ b/src/scripts/compile_custom_humann_database_from_annotations.py
@@ -11,7 +11,7 @@
pd.options.display.max_colwidth = 100
# from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2023.10.11"
+__version__ = "2023.12.15"
def main(args=None):
@@ -31,7 +31,7 @@ def main(args=None):
parser.add_argument("-a","--annotations", type=str, required=True, help = "path/to/annotations.tsv[.gz] Output from annotations.py. Multi-level header that contains (UniRef, sseqid)")
parser.add_argument("-t","--taxonomy", type=str, required=True, help = "path/to/taxonomy.tsv[.gz] [id_genome][classification] (No header). Use output from `merge_taxonomy_classifications.py` with --no_header and --no_domain")
parser.add_argument("-s","--sequences", type=str, required=True, help = "path/to/proteins.fasta[.gz]")
- parser.add_argument("-o","--output", type=str, default="stdout", help = "path/to/humann_uniref_annotations.tsv[.gz] [Default: stdout]")
+ parser.add_argument("-o","--output", type=str, default="stdout", help = "path/to/humann_uniref_annotations.tsv[.gz] (veba_output/profiling/databases/) is recommended [Default: stdout]")
parser.add_argument("--sep", default=";", help = "Separator for taxonomic levels [Default: ;]")
# parser.add_argument("--mandatory_taxonomy_prefixes", help = "Comma-separated values for mandatory prefix levels. (e.g., 'c__,f__,g__,s__')")
# parser.add_argument("--discarded_file", help = "Proteins that have been discarded due to incomplete lineage")
diff --git a/src/scripts/compile_custom_sylph_sketch_database_from_genomes.py b/src/scripts/compile_custom_sylph_sketch_database_from_genomes.py
new file mode 100755
index 0000000..9c25424
--- /dev/null
+++ b/src/scripts/compile_custom_sylph_sketch_database_from_genomes.py
@@ -0,0 +1,239 @@
+#!/usr/bin/env python
+from __future__ import print_function, division
+import sys, os, argparse, glob, shutil, time, warnings
+from multiprocessing import cpu_count
+from collections import OrderedDict, defaultdict
+
+import pandas as pd
+
+# Soothsayer Ecosystem
+from genopype import *
+from genopype import __version__ as genopype_version
+from soothsayer_utils import *
+
+# from tqdm import tqdm
+__program__ = os.path.split(sys.argv[0])[-1]
+__version__ = "2023.12.15"
+
+# ============
+# Run Pipeline
+# ============
+# Set environment variables
+def add_executables_to_environment(opts):
+ """
+ Adapted from Soothsayer: https://github.com/jolespin/soothsayer
+ """
+ accessory_scripts = set([
+
+ ])
+
+ required_executables={
+ "sylph",
+
+ } | accessory_scripts
+
+ if opts.path_config == "CONDA_PREFIX":
+ executables = dict()
+ for name in required_executables:
+ executables[name] = os.path.join(os.environ["CONDA_PREFIX"], "bin", name)
+ else:
+ if opts.path_config is None:
+ opts.path_config = os.path.join(opts.script_directory, "veba_config.tsv")
+ opts.path_config = format_path(opts.path_config)
+ assert os.path.exists(opts.path_config), "config file does not exist. Have you created one in the following directory?\n{}\nIf not, either create one, check this filepath: {}, or give the path to a proper config file using --path_config".format(opts.script_directory, opts.path_config)
+ assert os.stat(opts.path_config).st_size > 1, "config file seems to be empty. Please add 'name' and 'executable' columns for the following program names: {}".format(required_executables)
+ df_config = pd.read_csv(opts.path_config, sep="\t")
+ assert {"name", "executable"} <= set(df_config.columns), "config must have `name` and `executable` columns. Please adjust file: {}".format(opts.path_config)
+ df_config = df_config.loc[:,["name", "executable"]].dropna(how="any", axis=0).applymap(str)
+ # Get executable paths
+ executables = OrderedDict(zip(df_config["name"], df_config["executable"]))
+ assert required_executables <= set(list(executables.keys())), "config must have the required executables for this run. Please adjust file: {}\nIn particular, add info for the following: {}".format(opts.path_config, required_executables - set(list(executables.keys())))
+
+ # Display
+ for name in sorted(accessory_scripts):
+ executables[name] = "'{}'".format(os.path.join(opts.script_directory, name)) # Can handle spaces in path
+
+ print(format_header( "Adding executables to path from the following source: {}".format(opts.path_config), "-"), file=sys.stdout)
+ for name, executable in executables.items():
+ if name in required_executables:
+ print(name, executable, sep = " --> ", file=sys.stdout)
+ os.environ[name] = executable.strip()
+ print("", file=sys.stdout)
+
+
+# Configure parameters
+def configure_parameters(opts, directories):
+
+
+ # Set environment variables
+ add_executables_to_environment(opts=opts)
+
+def main(args=None):
+ # Path info
+ script_directory = os.path.dirname(os.path.abspath( __file__ ))
+ script_filename = __program__
+ # Path info
+ description = """
+ Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable)
+ usage = "{} -i <input.tsv> -o <output_directory>".format(__program__)
+
+ epilog = "Copyright 2021 Josh L. Espinoza (jespinoz@jcvi.org)"
+
+ # Parser
+ parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter)
+ # Pipeline
+ parser_io = parser.add_argument_group('Required I/O arguments')
+ parser_io.add_argument("-i","--input", type=str, default="stdin", help = "path/to/input.tsv. Format: Must include the following columns (No header): [organism_type][path/to/genome.fa]. You can get this from `cut -f1,4 veba_output/misc/genomes_table.tsv` [Default: stdin]")
+ parser_io.add_argument("-o","--output_directory", type=str, default="veba_output/profiling/databases", help = "path/to/output_directory for databases [Default: veba_output/profiling/databases]")
+ parser_io.add_argument("--viral_tag", type=str, default="viral", help = "[Not case sensitive] Tag/Label of viral organisms in first column of --input (e.g., viral, virus, viron) [Default: viral]")
+
+
+ # Utility
+ parser_utility = parser.add_argument_group('Utility arguments')
+ parser_utility.add_argument("--path_config", type=str, default="CONDA_PREFIX", help="path/to/config.tsv [Default: CONDA_PREFIX]") #site-packges in future
+ parser_utility.add_argument("-p", "--n_jobs", type=int, default=1, help = "Number of threads [Default: 1]")
+ parser_utility.add_argument("-v", "--version", action='version', version="{} v{}".format(__program__, __version__))
+ # parser_utility.add_argument("--verbose", action='store_true')
+
+ # Sylph
+ parser_sylph = parser.add_argument_group('Sylph sketch arguments')
+ parser_sylph.add_argument("-k", "--sylph_k", type=int, choices={21,31}, default=31, help="Sylph | Value of k. Only k = 21, 31 are currently supported. [Default: 31]")
+ parser_sylph.add_argument("-s", "--sylph_minimum_spacing", type=int, default=30, help="Sylph | Minimum spacing between selected k-mers on the genomes [Default: 30]")
+
+ parser_sylph_nonviral = parser.add_argument_group('[Prokaryotic & Eukaryotic] Sylph sketch arguments')
+ parser_sylph_nonviral.add_argument("--sylph_nonviral_subsampling_rate", type=int, default=200, help="Sylph [Prokaryotic & Eukaryotic]| Subsampling rate. [Default: 200]")
+ parser_sylph_nonviral.add_argument("--sylph_nonviral_options", type=str, default="", help="Sylph [Prokaryotic & Eukaryotic] | More options for `sylph sketch` (e.g. --arg 1 ) [Default: '']")
+
+ parser_sylph_viral = parser.add_argument_group('[Viral] Sylph sketch arguments')
+ parser_sylph_viral.add_argument("--sylph_viral_subsampling_rate", type=int, default=100, help="Sylph [Viral]| Subsampling rate. [Default: 100]")
+ parser_sylph_viral.add_argument("--sylph_viral_options", type=str, default="", help="Sylph [Viral] | More options for `sylph sketch` (e.g. --arg 1 ) [Default: '']")
+
+ # Options
+ opts = parser.parse_args()
+
+ opts.script_directory = script_directory
+ opts.script_filename = script_filename
+
+ # Threads
+ if opts.n_jobs == -1:
+ opts.n_jobs = cpu_count()
+ assert opts.n_jobs >= 1, "--n_jobs must be ≥ 1 (or -1 to use all available threads)"
+
+ # Directories
+ directories = dict()
+ directories["output"] = create_directory(opts.output_directory)
+ directories["intermediate"] = create_directory(os.path.join(directories["output"], "intermediate"))
+ directories["log"] = create_directory(os.path.join(directories["intermediate"], "log"))
+ directories["checkpoints"] = create_directory(os.path.join(directories["intermediate"], "checkpoints"))
+
+ # Info
+ print(format_header(__program__, "="), file=sys.stdout)
+ print(format_header("Configuration:", "-"), file=sys.stdout)
+ print("Python version:", sys.version.replace("\n"," "), file=sys.stdout)
+ print("Python path:", sys.executable, file=sys.stdout) #sys.path[2]
+ print("Script version:", __version__, file=sys.stdout)
+ print("GenoPype version:", genopype_version, file=sys.stdout) #sys.path[2]
+ print("Moment:", get_timestamp(), file=sys.stdout)
+ print("Directory:", os.getcwd(), file=sys.stdout)
+ print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout)
+ configure_parameters(opts, directories)
+ sys.stdout.flush()
+
+ # Start timer
+ t0 = time.time()
+ # print(format_header("* ({}) Creating directories:".format(format_duration(t0)), opts.output_directory), file=sys.stdout)
+ # os.makedirs(opts.output_directory, exist_ok=True)
+
+ # Load input
+ if opts.input == "stdin":
+ opts.input = sys.stdin
+ df_genomes = pd.read_csv(opts.input, sep="\t", header=None)
+ assert df_genomes.shape[1] == 2, "Must include the following columns (No header): [organism_type][genome]. Suggested input is from the `compile_genomes_table.py` script using `cut -f1,4` to get the necessary columns."
+ df_genomes.columns = ["organism_type", "genome"]
+
+ opts.viral_tag = opts.viral_tag.lower()
+
+ print(format_header("* ({}) Organizing genomes by organism_type".format(format_duration(t0))), file=sys.stdout)
+ organism_to_genomes = defaultdict(set)
+ for i, (organism_type, genome_filepath) in pv(df_genomes.iterrows(), unit="genomes ", total=df_genomes.shape[0]):
+ organism_type = organism_type.lower()
+ if organism_type == opts.viral_tag:
+ organism_to_genomes["viral"].add(genome_filepath)
+ else:
+ organism_to_genomes["nonviral"].add(genome_filepath)
+ # del df_genomes
+
+ # Commands
+ f_cmds = open(os.path.join(directories["intermediate"], "commands.sh"), "w")
+
+ for organism_type, filepaths in organism_to_genomes.items():
+ # Write genomes to file
+ print(format_header("* ({}) Creating genome database: (N={}) for organism_type='{}'".format(format_duration(t0),len(filepaths), organism_type)), file=sys.stdout)
+
+ genome_filepaths_list = os.path.join(directories["intermediate"], "{}_genomes.list".format(organism_type))
+ with open(genome_filepaths_list, "w") as f:
+ for fp in sorted(filepaths):
+ print(fp, file=f)
+
+ name = "sylph__{}".format(organism_type)
+ description = "[Program = sylph sketch] [Organism_Type = {}]".format(organism_type)
+
+ arguments = [
+ os.environ["sylph"],
+ "sketch",
+ "-t {}".format(opts.n_jobs),
+ "--gl {}".format(genome_filepaths_list),
+ "-o {}".format(os.path.join(opts.output_directory, "genome_database-{}".format(organism_type))),
+ "-k {}".format(opts.sylph_k),
+ "--min-spacing {}".format(opts.sylph_minimum_spacing),
+ ]
+
+ if organism_type == "nonviral":
+ arguments += [
+ "-c {}".format(opts.sylph_nonviral_subsampling_rate),
+ opts.sylph_nonviral_options,
+ ]
+
+ else:
+ arguments += [
+ "-c {}".format(opts.sylph_viral_subsampling_rate),
+ opts.sylph_viral_options,
+ ]
+ print(arguments, file=sys.stdout)
+ cmd = Command(
+ arguments,
+ name=name,
+ f_cmds=f_cmds,
+ )
+
+
+ # Run command
+ cmd.run(
+ checkpoint_message_notexists="[Running ({})] | {}".format(format_duration(t0), description),
+ checkpoint_message_exists="[Loading Checkpoint ({})] | {}".format(format_duration(t0), description),
+ write_stdout=os.path.join(directories["log"], "{}.o".format(name)),
+ write_stderr=os.path.join(directories["log"], "{}.e".format(name)),
+ write_returncode=os.path.join(directories["log"], "{}.returncode".format(name)),
+ checkpoint=os.path.join(directories["checkpoints"], name),
+ )
+
+ if hasattr(cmd, "returncode_"):
+ if cmd.returncode_ != 0:
+ print("[Error] | {}".format(description), file=sys.stdout)
+ print("Check the following files:\ncat {}".format(os.path.join(directories["log"], "{}.*".format(name))), file=sys.stdout)
+ sys.exit(cmd.returncode_)
+ else:
+ output_filepath = os.path.join(opts.output_directory, "genome_database-{}.syldb".format(organism_type))
+ size_bytes = os.path.getsize(output_filepath)
+ size_mb = size_bytes >> 20
+ if size_mb < 1:
+ print("Output Database:", output_filepath, "({} bytes)".format(size_bytes), file=sys.stdout)
+ else:
+ print("Output Database:", output_filepath, "({} MB)".format(size_mb), file=sys.stdout)
+
+ f_cmds.close()
+
+if __name__ == "__main__":
+ main(sys.argv[1:])
+
+
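The organism-type grouping that `compile_custom_sylph_sketch_database_from_genomes.py` performs before sketching (case-insensitive match against `--viral_tag`, everything else collapsed to "nonviral") can be shown in isolation. A minimal sketch; the helper name and sample rows are illustrative:

```python
from collections import defaultdict

def group_genomes_by_organism_type(rows, viral_tag="viral"):
    """Split (organism_type, genome_path) rows into viral vs. nonviral sets,
    mirroring the case-insensitive tag matching used in the script."""
    viral_tag = viral_tag.lower()
    organism_to_genomes = defaultdict(set)
    for organism_type, genome_filepath in rows:
        if organism_type.lower() == viral_tag:
            organism_to_genomes["viral"].add(genome_filepath)
        else:
            # Prokaryotic and eukaryotic genomes share one sketch database
            organism_to_genomes["nonviral"].add(genome_filepath)
    return organism_to_genomes

rows = [
    ("prokaryotic", "a.fa"),
    ("Viral", "v1.fa"),
    ("eukaryotic", "e.fa"),
]
groups = group_genomes_by_organism_type(rows)
print(sorted(groups["viral"]))     # ['v1.fa']
print(sorted(groups["nonviral"]))  # ['a.fa', 'e.fa']
```

The two groups then get different `sylph sketch` subsampling rates (`-c 200` nonviral vs. `-c 100` viral by default), which is why the split happens before any commands are built.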
diff --git a/src/scripts/compile_eukaryotic_classifications.py b/src/scripts/compile_eukaryotic_classifications.py
index 609526e..4841d85 100755
--- a/src/scripts/compile_eukaryotic_classifications.py
+++ b/src/scripts/compile_eukaryotic_classifications.py
@@ -6,7 +6,7 @@
from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2023.3.20"
+__version__ = "2023.12.14"
def main(args=None):
@@ -16,20 +16,23 @@ def main(args=None):
# Path info
description = """
Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable)
- usage = "{} -i -s -c -o ".format(__program__)
+ usage = "{} -i <identifier_mapping.metaeuk.tsv> -s <scaffolds_to_bins.tsv> -c <clusters.tsv> -o <output.tsv>".format(__program__)
epilog = "Copyright 2021 Josh L. Espinoza (jespinoz@jcvi.org)"
# Parser
parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter)
# Pipeline
parser.add_argument("-i","--metaeuk_identifier_mapping", type=str, required=True, help = "path/to/identifier_mapping.metaeuk.tsv")
- parser.add_argument("-s","--scaffolds_to_bins", type=str, required=True, help = "path/to/scaffolds_to_bins.tsv")
- parser.add_argument("-c","--clusters", type=str, help = "path/to/clusters.tsv, Format: [id_mag][id_cluster], No header [Optional]")
- parser.add_argument("-o","--output", type=str, default="stdout", help = "path/to/output.tsv [Default: stdout]")
- parser.add_argument("--eukaryotic_database", type=str, default=None, required=True, help="path/to/eukaryotic_database (e.g. --arg 1 )")
+ parser.add_argument("-s","--scaffolds_to_bins", type=str, required=False, help = "path/to/scaffolds_to_bins.tsv")
+ # parser.add_argument("-g","--genes_to_contigs", type=str, required=False, help = "path/to/genes_to_contigs.tsv cannot be used with --scaffolds_to_bins")
+ parser.add_argument("-c","--clusters", type=str, help = "path/to/clusters.tsv, Format: [id_genome][id_cluster], No header [Optional]")
+ parser.add_argument("-o","--output", type=str, default="stdout", help = "path/to/gene-source_lineage.tsv [Default: stdout]")
+ parser.add_argument("-d", "--eukaryotic_database", type=str, default=None, required=True, help="path/to/eukaryotic_database directory")
# parser.add_argument("--veba_database", type=str, default=None, help=f"VEBA database location. [Default: $VEBA_DATABASE environment variable]")
parser.add_argument("--header", type=int, default=1, help="Include header in output {0=No, 1=Yes) [Default: 1]")
parser.add_argument("--debug", action="store_true")
+ parser.add_argument("--remove_genes_with_missing_values", action="store_true")
+ parser.add_argument("--use_original_metaeuk_gene_identifiers", action="store_true")
# Options
opts = parser.parse_args()
@@ -44,20 +47,15 @@ def main(args=None):
# opts.eukaryotic_database = os.path.join(opts.veba_database, "Classify", "Microeukaryotic")
# I/O
- # Scaffolds -> Bins
- fp = opts.scaffolds_to_bins
- print("* Reading scaffolds to bins table {}".format(fp), file=sys.stderr)
- scaffold_to_bin = pd.read_csv(fp, sep="\t", index_col=0, header=None).iloc[:,0]
- if opts.debug:
- print(fp, file=sys.stderr)
- scaffold_to_bin.head().to_csv(sys.stderr, sep="\t", header=None)
- print("\n", file=sys.stderr)
+
# SourceID -> Taxonomy
fp = os.path.join(opts.eukaryotic_database,"source_taxonomy.tsv.gz")
print("* Reading source taxonomy table {}".format(fp), file=sys.stderr)
df_source_taxonomy = pd.read_csv(fp, sep="\t", index_col=0)
df_source_taxonomy.index = df_source_taxonomy.index.map(str)
+ df_source_taxonomy = pd.DataFrame(df_source_taxonomy.to_dict()) # Hack for duplicate entries that will be resolved in MicroEuk_v3.1
+
if opts.debug:
print(fp, file=sys.stderr)
df_source_taxonomy.head().to_csv(sys.stderr, sep="\t")
@@ -65,7 +63,7 @@ def main(args=None):
# VEBA -> SourceID
fp = os.path.join(opts.eukaryotic_database,"target_to_source.dict.pkl.gz")
- print("* Reading target to source mapping {}".format(fp), file=sys.stderr)
+ print("* Reading target to source mapping {} (Note: This one takes a little longer to load...)".format(fp), file=sys.stderr)
with gzip.open(fp, "rb") as f:
target_to_source = pickle.load(f)
#target_to_source = pd.read_csv(fp, sep="\t", index_col=0, dtype=str, usecols=["id_veba", "id_source"], squeeze=True)#.iloc[:,0]
@@ -83,32 +81,44 @@ def main(args=None):
df_metaeuk.head().to_csv(sys.stderr, sep="\t")
print("\n", file=sys.stderr)
- orf_to_bitscore = df_metaeuk["bitscore"].map(float)
- orf_to_scaffold = df_metaeuk["C_acc"].map(str)
- orf_to_mag = orf_to_scaffold.map(lambda id_scaffold: scaffold_to_bin[id_scaffold])
-
- orf_to_target = df_metaeuk["T_acc"]
- orf_to_source = orf_to_target.map(lambda id_target: target_to_source.get(id_target,np.nan))
- if np.any(pd.isnull(orf_to_source)):
+ gene_to_bitscore = df_metaeuk["bitscore"].map(float)
+ gene_to_scaffold = df_metaeuk["C_acc"].map(str)
+ gene_to_genome = pd.Series([np.nan]*df_metaeuk.shape[0], index=df_metaeuk.index)
+ gene_to_target = df_metaeuk["T_acc"]
+ gene_to_source = gene_to_target.map(lambda id_target: target_to_source.get(id_target,np.nan))
+
+ if opts.scaffolds_to_bins:
+ # Scaffolds -> Bins
+ fp = opts.scaffolds_to_bins
+ print("* Reading scaffolds to bins table {}".format(fp), file=sys.stderr)
+ scaffold_to_bin = pd.read_csv(fp, sep="\t", index_col=0, header=None).iloc[:,0]
+ if opts.debug:
+ print(fp, file=sys.stderr)
+ scaffold_to_bin.head().to_csv(sys.stderr, sep="\t", header=None)
+ print("\n", file=sys.stderr)
+ gene_to_genome = gene_to_scaffold.map(lambda id_scaffold: scaffold_to_bin[id_scaffold])
+
+ if np.any(pd.isnull(gene_to_source)):
warnings.warn("The following gene - target identifiers are not in the database file: {}".format(
os.path.join(opts.eukaryotic_database,"target_to_source.dict.pkl.gz"),
),
)
- orf_to_target[orf_to_source[orf_to_source.isnull()].index].to_frame().to_csv(sys.stderr, sep="\t", header=None)
- orf_to_source = orf_to_source.dropna()
+ gene_to_target[gene_to_source[gene_to_source.isnull()].index].to_frame().to_csv(sys.stderr, sep="\t", header=None)
+ gene_to_source = gene_to_source.dropna()
# Lineage
- orf_to_lineage = OrderedDict()
+ gene_to_lineage = OrderedDict()
missing_lineage = list()
- for id_orf, id_source in tqdm(orf_to_source.items(), desc="Retrieving lineage", unit = " ORFs"):
+ for id_gene, id_source in tqdm(gene_to_source.items(), desc="Retrieving lineage", unit = " genes"):
if id_source in df_source_taxonomy.index:
lineage = df_source_taxonomy.loc[id_source, ["class", "order", "family", "genus", "species"]] # class order family genus species
+ lineage = lineage.fillna("")
lineage = ";".join(map(lambda items: "".join(items), zip(["c__", "o__", "f__", "g__", "s__"], lineage)))
- orf_to_lineage[id_orf] = lineage
+ gene_to_lineage[id_gene] = lineage
else:
missing_lineage.append(id_source)
- orf_to_lineage = pd.Series(orf_to_lineage)
+ gene_to_lineage = pd.Series(gene_to_lineage)
if len(missing_lineage):
warnings.warn("The following source identifiers are not in the database file: {}\n{}`".format(
@@ -118,31 +128,47 @@ def main(args=None):
)
# Output
- # ["id_orf", "id_mag", "bitscore", "lineage"]
- df_orf_classifications = pd.concat([
- orf_to_scaffold.to_frame("id_scaffold"),
- orf_to_mag.to_frame("id_mag"),
- orf_to_target.to_frame("id_target"),
- orf_to_source.to_frame("id_source"),
- orf_to_lineage.to_frame("lineage"),
- orf_to_bitscore.to_frame("bitscore"),
- ],
- axis=1)
- df_orf_classifications.index.name = "id_gene"
+ df_gene_classifications = pd.DataFrame({
+ "id_scaffold":gene_to_scaffold,
+ "id_genome":gene_to_genome,
+ "id_target":gene_to_target,
+ "id_source":gene_to_source,
+ "lineage":gene_to_lineage,
+ "bitscore":gene_to_bitscore,
+ })
+ df_gene_classifications.index.name = "id_gene"
+
+
+ # df_gene_classifications = pd.concat([
+ # gene_to_scaffold.to_frame("id_scaffold"),
+ # gene_to_genome.to_frame("id_genome"),
+ # gene_to_target.to_frame("id_target"),
+ # gene_to_source.to_frame("id_source"),
+ # gene_to_lineage.to_frame("lineage"),
+ # gene_to_bitscore.to_frame("bitscore"),
+ # ],
+ # axis=1)
+ # df_gene_classifications.index.name = "id_gene"
# Add clusters if provided
if opts.clusters:
if opts.clusters != "None": # Hack for when called internally
- mag_to_cluster = pd.read_csv(opts.clusters, sep="\t", index_col=0, header=None).iloc[:,0]
- orf_to_cluster = orf_to_mag.map(lambda id_orf: mag_to_cluster[id_orf])
- df_orf_classifications.insert(loc=2, column="id_cluster", value=orf_to_cluster)
+ genome_to_cluster = pd.read_csv(opts.clusters, sep="\t", index_col=0, header=None).iloc[:,0]
+ gene_to_cluster = gene_to_genome.map(lambda id_genome: genome_to_cluster[id_genome])
+ df_gene_classifications.insert(loc=2, column="id_cluster", value=gene_to_cluster)
# Output
if opts.output == "stdout":
opts.output = sys.stdout
- df_orf_classifications = df_orf_classifications.dropna(how="any", axis=0)
- df_orf_classifications.to_csv(opts.output, sep="\t", header=bool(opts.header))
+ if opts.remove_genes_with_missing_values:
+ df_gene_classifications = df_gene_classifications.dropna(how="any", axis=0)
+
+ if not opts.use_original_metaeuk_gene_identifiers:
+ metaeuk_to_gene = df_metaeuk["gene_id"].to_dict()
+ df_gene_classifications.index = df_gene_classifications.index.map(lambda x: metaeuk_to_gene[x])
+
+ df_gene_classifications.to_csv(opts.output, sep="\t", header=bool(opts.header))
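The rank-prefixed lineage string built inside the `gene_to_lineage` loop above (`c__;o__;f__;g__;s__`, with missing levels filled with empty strings) amounts to the following. A minimal sketch; `format_lineage` and the example taxa are illustrative:

```python
def format_lineage(levels, prefixes=("c__", "o__", "f__", "g__", "s__")):
    """Join taxonomic levels with their rank prefixes, as in the
    gene_to_lineage loop; missing levels are rendered as empty strings."""
    levels = ["" if level is None else level for level in levels]
    return ";".join("".join(pair) for pair in zip(prefixes, levels))

lineage = format_lineage(
    ["Mamiellophyceae", "Mamiellales", "Bathycoccaceae", "Ostreococcus", None]
)
print(lineage)
# c__Mamiellophyceae;o__Mamiellales;f__Bathycoccaceae;g__Ostreococcus;s__
```

Filling missing levels with `""` rather than dropping them keeps every lineage the same depth, which matters because the diff now makes dropping incomplete genes opt-in via `--remove_genes_with_missing_values`.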
diff --git a/src/scripts/compile_prokaryotic_genome_cluster_classification_scores_table.py b/src/scripts/compile_prokaryotic_genome_cluster_classification_scores_table.py
index 1e4a031..3ee1147 100755
--- a/src/scripts/compile_prokaryotic_genome_cluster_classification_scores_table.py
+++ b/src/scripts/compile_prokaryotic_genome_cluster_classification_scores_table.py
@@ -30,7 +30,6 @@ def main(argv=None):
parser_io.add_argument("--fill_missing_weight", type=float, help = "Fill missing weight between [0, 100.0]. [Default is to throw error if value is missing]")
parser_io.add_argument("--header", action="store_true", help = "Include header")
-
# Options
opts = parser.parse_args()
opts.script_directory = script_directory
diff --git a/src/scripts/compile_reads_table.py b/src/scripts/compile_reads_table.py
index 3b3edd8..8075113 100755
--- a/src/scripts/compile_reads_table.py
+++ b/src/scripts/compile_reads_table.py
@@ -7,7 +7,7 @@
pd.options.display.max_colwidth = 100
# from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2023.8.28"
+__version__ = "2023.12.18"
def parse_basename(query: str, naming_scheme: str):
"""
@@ -43,6 +43,7 @@ def main(args=None):
parser_preprocess_directory = parser.add_argument_group('[Mode 1] Preprocess Directory arguments')
parser_preprocess_directory.add_argument("-i","--preprocess_directory", type=str, help = "path/to/preprocess directory (e.g., veba_output/preprocess) [Cannot be used with --fastq_directory]")
parser_preprocess_directory.add_argument("-b","--basename", default="cleaned", type=str, help = "File basename to search VEBA preprocess directory [preprocess_directory]/[id_sample]/[output]/[basename]_1/2.fastq.gz [Default: cleaned]")
+ parser_preprocess_directory.add_argument("-L","--long", action="store_true", help = "Use if reads are ONT or PacBio")
parser_fastq_directory = parser.add_argument_group('[Mode 2] Fastq Directory arguments')
parser_fastq_directory.add_argument("-f","--fastq_directory", type=str, help = "path/to/fastq_directory [Cannot be used with --preprocess_directory]")
@@ -55,6 +56,7 @@ def main(args=None):
parser_output.add_argument("-0", "--sample_label", default="sample-id", type=str, help = "Sample ID column label [Reverse: sample-id]")
parser_output.add_argument("-1", "--forward_label", default="forward-absolute-filepath", type=str, help = "Forward filepath column label [Default: forward-absolute-filepath]")
parser_output.add_argument("-2", "--reverse_label", default="reverse-absolute-filepath", type=str, help = "Reverse filepath column label [Default: reverse-absolute-filepath]")
+ parser_output.add_argument("-3", "--long_label", default="reads-filepath", type=str, help = "Long reads filepath column label [Default: reads-filepath]")
parser_output.add_argument("--header", action="store_true", help = "Write header")
parser_output.add_argument("--volume_prefix", type=str, help = "Docker container prefix to volume path")
@@ -69,27 +71,41 @@ def main(args=None):
output = defaultdict(dict)
# Build table from preprocess directory
if opts.preprocess_directory:
- for fp in glob.glob(os.path.join(opts.preprocess_directory, "*", "output", "{}_1.fastq.gz".format(opts.basename))):
- id_sample = fp.split("/")[-3]
- output[id_sample][opts.forward_label] = fp
- for fp in glob.glob(os.path.join(opts.preprocess_directory, "*", "output", "{}_2.fastq.gz".format(opts.basename))):
- id_sample = fp.split("/")[-3]
- output[id_sample][opts.reverse_label] = fp
- # Build table from fastq directory
- if opts.fastq_directory:
- for fp in glob.glob(os.path.join(opts.fastq_directory, "*.{}".format(opts.extension))):
- basename = fp.split("/")[-1]
- id_sample, direction = parse_basename(basename, naming_scheme=opts.naming_scheme)
- # id_sample = "_R".join(basename.split("_R")[:-1])
- if direction == "1":
+ if not opts.long:
+ for fp in glob.glob(os.path.join(opts.preprocess_directory, "*", "output", "{}_1.fastq.gz".format(opts.basename))):
+ id_sample = fp.split("/")[-3]
output[id_sample][opts.forward_label] = fp
- if direction == "2":
+ for fp in glob.glob(os.path.join(opts.preprocess_directory, "*", "output", "{}_2.fastq.gz".format(opts.basename))):
+ id_sample = fp.split("/")[-3]
output[id_sample][opts.reverse_label] = fp
- df_output = pd.DataFrame(output).T.sort_index().loc[:,[opts.forward_label, opts.reverse_label]]
+ else:
+ for fp in glob.glob(os.path.join(opts.preprocess_directory, "*", "output", "{}.fastq.gz".format(opts.basename))):
+ id_sample = fp.split("/")[-3]
+ output[id_sample][opts.long_label] = fp
+
+ # Build table from fastq directory
+ if opts.fastq_directory:
+ if not opts.long:
+ for fp in glob.glob(os.path.join(opts.fastq_directory, "*.{}".format(opts.extension))):
+ basename = fp.split("/")[-1]
+ id_sample, direction = parse_basename(basename, naming_scheme=opts.naming_scheme)
+ # id_sample = "_R".join(basename.split("_R")[:-1])
+ if direction == "1":
+ output[id_sample][opts.forward_label] = fp
+ if direction == "2":
+ output[id_sample][opts.reverse_label] = fp
+ else:
+ print("Long-read support (-L) is currently only available with --preprocess_directory, not --fastq_directory", file=sys.stderr)
+ sys.exit(1)
+
+ if not opts.long:
+ df_output = pd.DataFrame(output).T.sort_index().loc[:,[opts.forward_label, opts.reverse_label]]
+ else:
+ df_output = pd.DataFrame(output).T.sort_index().loc[:,[opts.long_label]]
df_output.index.name = opts.sample_label
# Check missing values
- missing_values = df_output.notnull().sum(axis=1)[lambda x: x < 2].index
+ missing_values = df_output.notnull().sum(axis=1)[lambda x: x < df_output.shape[1]].index
assert missing_values.size == 0, "Missing fastq for the following samples: {}".format(missing_values.index)
# Absolute paths
@@ -97,10 +113,14 @@ def main(args=None):
df_output = df_output.applymap(lambda fp: os.path.abspath(fp))
else:
if opts.header:
- if "absolute" in opts.forward_label.lower():
- print("You've selected --relative and may want to either not use a header or remove 'absolute' from the --forward_label: {}".format(opts.forward_label), file=sys.stderr)
- if "absolute" in opts.reverse_label.lower():
- print("You've selected --relative and may want to either not use a header or remove 'absolute' from the --reverse_label: {}".format(opts.reverse_label), file=sys.stderr)
+ if not opts.long:
+ if "absolute" in opts.forward_label.lower():
+ print("You've selected --relative and may want to either not use a header or remove 'absolute' from the --forward_label: {}".format(opts.forward_label), file=sys.stderr)
+ if "absolute" in opts.reverse_label.lower():
+ print("You've selected --relative and may want to either not use a header or remove 'absolute' from the --reverse_label: {}".format(opts.reverse_label), file=sys.stderr)
+ else:
+ if "absolute" in opts.long_label.lower():
+ print("You've selected --relative and may want to either not use a header or remove 'absolute' from the --long_label: {}".format(opts.long_label), file=sys.stderr)
# Docker volume prefix
if opts.volume_prefix:
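In the new long-read mode, `compile_reads_table.py` recovers the sample ID from the preprocess path layout `[preprocess_directory]/[id_sample]/output/[basename].fastq.gz` via `fp.split("/")[-3]`. A minimal sketch of that indexing; the helper name is illustrative:

```python
def sample_from_preprocess_path(fp):
    """Recover the sample ID from a VEBA preprocess path of the form
    [preprocess_directory]/[id_sample]/output/[basename].fastq.gz,
    mirroring the fp.split("/")[-3] indexing used in the script."""
    return fp.split("/")[-3]

fp = "veba_output/preprocess/sample_1/output/cleaned.fastq.gz"
print(sample_from_preprocess_path(fp))  # sample_1
```

The same indexing works for paired-end and long-read paths because both modes keep the `[id_sample]/output/` depth fixed; only the filename pattern differs (`_1/_2` suffixes vs. a single file).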
diff --git a/src/scripts/concatenate_assembly.py b/src/scripts/concatenate_assembly.py
new file mode 100755
index 0000000..bcec4ff
--- /dev/null
+++ b/src/scripts/concatenate_assembly.py
@@ -0,0 +1,99 @@
+#!/usr/bin/env python
+import sys, os, argparse, gzip
+from Bio.SeqIO.FastaIO import SimpleFastaParser
+from tqdm import tqdm
+
+__program__ = os.path.split(sys.argv[0])[-1]
+__version__ = "2023.12.18"
+
+def main(args=None):
+ # Path info
+ script_directory = os.path.dirname(os.path.abspath( __file__ ))
+ script_filename = __program__
+
+ # Path info
+ description = """
+ Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable)
+ usage = "{} -i <input.fasta> -o <output.fasta>".format(__program__)
+ epilog = "Copyright 2021 Josh L. Espinoza (jespinoz@jcvi.org)"
+
+ # Parser
+ parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter)
+
+ # Pipeline
+ parser.add_argument("-i","--input", default="stdin", type=str, help = "Input fasta file")
+ parser.add_argument("-o","--output", default="stdout", type=str, help = "Output fasta file")
+ parser.add_argument("-n", "--name", type=str, required=True, help = "Name to use for pseudo-scaffold")
+ parser.add_argument("-N", "--pad", type=int, default=100, help = "Number of N to use for joining contigs")
+ parser.add_argument("-d", "--description", type=str, help = "Description to use [Default: Input filepath]")
+ parser.add_argument("-m","--minimum_sequence_length", default=1, type=int, help = "Minimum sequence length accepted [Default: 1]")
+ parser.add_argument("-w","--wrap", default=1000, type=int, help = "Wrap fasta. Use 0 for no wrapping [Default: 1000]")
+
+ # Options
+ opts = parser.parse_args()
+ opts.script_directory = script_directory
+ opts.script_filename = script_filename
+
+ assert opts.minimum_sequence_length > 0
+ assert opts.pad >= 0
+
+ # Input
+ f_in = None
+ if opts.input == "stdin":
+ f_in = sys.stdin
+ else:
+ if opts.input.endswith(".gz"):
+ f_in = gzip.open(opts.input, "rt")
+ else:
+ f_in = open(opts.input, "r")
+ assert f_in is not None
+
+ # Output
+ f_out = None
+ if opts.output == "stdout":
+ f_out = sys.stdout
+ else:
+ if opts.output.endswith(".gz"):
+ f_out = gzip.open(opts.output, "wt")
+ else:
+ f_out = open(opts.output, "w")
+ assert f_out is not None
+
+ # Concatenated assembly
+
+ if not opts.description:
+ opts.description = "assembly_filepath: {}".format(opts.input)
+ else:
+ if opts.description == "NONE":
+ opts.description = ""
+ pseudoscaffold_header = "{} {}".format(opts.name, opts.description).strip()
+
+ print(">{}".format(pseudoscaffold_header), file=f_out)
+ sequences = list()
+ for header, seq in tqdm(SimpleFastaParser(f_in), "Reading fasta input"):
+ if len(seq) >= opts.minimum_sequence_length:
+ sequences.append(seq)
+ number_of_sequences = len(sequences)
+ sequences = ("N"*opts.pad).join(sequences)
+
+ # Write the pseudo-scaffold sequence (header already written above)
+ if opts.wrap > 0:
+ for i in range(0, len(sequences), opts.wrap):
+ wrapped_sequence = sequences[i:i+opts.wrap]
+ # Write header and wrapped sequence
+ print(wrapped_sequence, file=f_out)
+ else:
+ print(sequences, file=f_out)
+
+
+ # Close
+ if f_in != sys.stdin:
+ f_in.close()
+ if f_out != sys.stdout:
+ f_out.close()
+
+if __name__ == "__main__":
+ main()
+
+
+
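The pseudo-scaffold construction above (filter short contigs, join survivors with a run of `N`) can be sketched in isolation. This is a minimal illustration, not the script itself; `pad` and `minimum_sequence_length` mirror the `-N`/`-m` options, and the contig strings are made up.

```python
# Sketch of the core pseudo-scaffold operation: contigs passing the length
# filter are concatenated, separated by `pad` ambiguous bases (N).
def build_pseudoscaffold(contigs, pad=100, minimum_sequence_length=1):
    kept = [seq for seq in contigs if len(seq) >= minimum_sequence_length]
    return ("N" * pad).join(kept)

# Two contigs joined with a 5-N spacer; the 2 bp contig is dropped by -m 3
result = build_pseudoscaffold(["ACGT", "GG", "TTAA"], pad=5, minimum_sequence_length=3)
print(result)  # ACGTNNNNNTTAA
```

Downstream coordinates on the pseudo-scaffold depend on `pad`, which is why the script exposes it as an explicit option rather than hard-coding a spacer length.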
diff --git a/src/scripts/concatenate_fasta.py b/src/scripts/concatenate_fasta.py
index 0977942..38d12ae 100755
--- a/src/scripts/concatenate_fasta.py
+++ b/src/scripts/concatenate_fasta.py
@@ -1,6 +1,6 @@
#!/usr/bin/env python
from __future__ import print_function, division
-import sys, os, argparse
+import sys, os, argparse, hashlib
import pandas as pd
from Bio.SeqIO.FastaIO import SimpleFastaParser
@@ -12,45 +12,7 @@
pd.options.display.max_colwidth = 100
# from tqdm import tqdm
__program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2022.02.17"
-
-
-
-def fasta_to_saf(path, compression="infer"):
- """
- # GeneID Chr Start End Strand
- # http://bioinf.wehi.edu.au/featureCounts/
-
- # Useful:
- import re
- record_id = "lcl|NC_018632.1_cds_WP_039228897.1_1 [gene=dnaA] [locus_tag=MASE_RS00005] [protein=chromosomal replication initiator protein DnaA] [protein_id=WP_039228897.1] [location=410..2065] [gbkey=CDS]"
- re.search("\[locus_tag=(\w+)\]", record_id).group(1)
- # 'MASE_RS00005'
-
- """
-
-
- saf_data = list()
-
- if path == "stdin":
- f = sys.stdin
- else:
- f = get_file_object(path, mode="read", compression=compression, verbose=False)
-
- for id_record, seq in pv(SimpleFastaParser(f), "Reading sequences [{}]".format(path)):
- id_record = id_record.split(" ")[0]
- fields = [
- id_record,
- id_record,
- 1,
- len(seq),
- "+",
- ]
- saf_data.append(fields)
- if f is not sys.stdin:
- f.close()
- return pd.DataFrame(saf_data, columns=["GeneID", "Chr", "Start", "End", "Strand"])
-
+__version__ = "2023.12.13"
def main(args=None):
# Path info
@@ -120,7 +82,15 @@ def main(args=None):
safe_mode=False,
verbose=False,
)
+
saf_filepath = os.path.join(opts.output_directory, "{}.saf".format(id_sample))
+
+ f_duplicates = get_file_object(
+ path=os.path.join(opts.output_directory, "{}.duplicates_removed.list".format(id_sample)),
+ mode="write",
+ safe_mode=False,
+ verbose=False,
+ )
else:
os.makedirs(os.path.join(opts.output_directory, id_sample), exist_ok=True)
@@ -130,29 +100,43 @@ def main(args=None):
safe_mode=False,
verbose=False,
)
+
saf_filepath = os.path.join(opts.output_directory, id_sample, "{}.saf".format(opts.basename))
+ f_duplicates = get_file_object(
+ path=os.path.join(opts.output_directory, id_sample, "{}.duplicates_removed.list".format(opts.basename)),
+ mode="write",
+ safe_mode=False,
+ verbose=False,
+ )
# Read input fasta, filter out short sequences, and write to concatenated file
+ sequence_hashes = set()
saf_data = list()
for fp in pv(filepaths, description=id_sample, unit= " files"):
f_query = get_file_object(fp, mode="read", verbose=False)
for id, seq in SimpleFastaParser(f_query):
if len(seq) >= opts.minimum_contig_length:
- print(">{}\n{}".format(id, seq), file=f_out)
+ id_hash = hashlib.md5(seq.upper().encode()).hexdigest()
id_record = id.split(" ")[0]
- fields = [
- id_record,
- id_record,
- 1,
- len(seq),
- "+",
- ]
- saf_data.append(fields)
+ if id_hash not in sequence_hashes:
+ print(">{}\n{}".format(id, seq), file=f_out)
+ fields = [
+ id_record,
+ id_record,
+ 1,
+ len(seq),
+ "+",
+ ]
+ saf_data.append(fields)
+ sequence_hashes.add(id_hash)
+ else:
+ print(id_record, file=f_duplicates)
f_query.close()
f_out.close()
+ f_duplicates.close()
df_saf = pd.DataFrame(saf_data, columns=["GeneID", "Chr", "Start", "End", "Strand"])
df_saf.to_csv(saf_filepath, sep="\t", index=None)
@@ -173,26 +157,39 @@ def main(args=None):
saf_filepath = os.path.join(opts.output_directory, "{}.saf".format(opts.basename))
+ f_duplicates = get_file_object(
+ path=os.path.join(opts.output_directory, "{}.duplicates_removed.list".format(opts.basename)),
+ mode="write",
+ safe_mode=False,
+ verbose=False,
+ )
+
# Read input fasta, filter out short sequences, and write to concatenated file
+ sequence_hashes = set()
saf_data = list()
for fp in pv(filepaths, unit= " files"):
f_query = get_file_object(fp, mode="read", verbose=False)
for id, seq in SimpleFastaParser(f_query):
if len(seq) >= opts.minimum_contig_length:
- print(">{}\n{}".format(id, seq), file=f_out)
+ id_hash = hashlib.md5(seq.upper().encode()).hexdigest()
id_record = id.split(" ")[0]
- fields = [
- id_record,
- id_record,
- 1,
- len(seq),
- "+",
- ]
- saf_data.append(fields)
-
+ if id_hash not in sequence_hashes:
+ print(">{}\n{}".format(id, seq), file=f_out)
+ fields = [
+ id_record,
+ id_record,
+ 1,
+ len(seq),
+ "+",
+ ]
+ saf_data.append(fields)
+ else:
+ print(id_record, file=f_duplicates)
f_query.close()
f_out.close()
+ f_duplicates.close()
+
df_saf = pd.DataFrame(saf_data, columns=["GeneID", "Chr", "Start", "End", "Strand"])
df_saf.to_csv(saf_filepath, sep="\t", index=None)
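The deduplication added to `concatenate_fasta.py` keys on an MD5 of the uppercased sequence, keeping the first record seen and logging later identical sequences to a `.duplicates_removed.list` file. A stripped-down sketch of that logic (record IDs here are illustrative, not from the repository):

```python
import hashlib

# Minimal sketch of the duplicate-removal logic: the first record with a given
# (case-insensitive) sequence is kept; later identical sequences are reported.
def deduplicate(records):
    seen = set()
    kept, duplicates = [], []
    for record_id, seq in records:
        seq_hash = hashlib.md5(seq.upper().encode()).hexdigest()
        if seq_hash not in seen:
            seen.add(seq_hash)
            kept.append((record_id, seq))
        else:
            duplicates.append(record_id)
    return kept, duplicates

records = [("contig_1", "ACGT"), ("contig_2", "acgt"), ("contig_3", "TTAA")]
kept, duplicates = deduplicate(records)
print(duplicates)  # contig_2 matches contig_1 after uppercasing
```

Hashing the uppercased sequence makes the comparison case-insensitive and keeps memory bounded by one digest per unique sequence rather than the sequences themselves.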
diff --git a/src/scripts/consensus_genome_classification_ranked.py b/src/scripts/consensus_genome_classification_ranked.py
new file mode 100755
index 0000000..2c190fa
--- /dev/null
+++ b/src/scripts/consensus_genome_classification_ranked.py
@@ -0,0 +1,222 @@
+#!/usr/bin/env python
+from __future__ import print_function, division
+import sys, os, argparse
+from collections import OrderedDict, defaultdict
+import pandas as pd
+import numpy as np
+
+
+pd.options.display.max_colwidth = 100
+# from tqdm import tqdm
+__program__ = os.path.split(sys.argv[0])[-1]
+__version__ = "2023.11.3"
+
+# RANK_TO_PREFIX="superkingdom:d__,phylum:p__,class:c__,order:o__,family:f__,genus:g__,species:s__"
+
+RANK_PREFIXES="d__,p__,c__,o__,f__,g__,s__"
+
+# Fill empty taxonomic levels for consensus classification
+def fill_lower_taxonomy_levels(
+ classifications:pd.Series,
+ rank_prefixes:list,
+ delimiter:str=";",
+ ):
+
+ rank_prefixes = list(rank_prefixes)
+ number_of_taxonomic_levels = len(rank_prefixes)
+ classifications_ = dict()
+ for id_genome, classification in pd.Series(classifications).items():
+ taxonomy = classification.split(delimiter)
+ classifications_[id_genome] = delimiter.join(taxonomy + rank_prefixes[len(taxonomy):])
+ return pd.Series(classifications_)[classifications.index]
+
+# Get consensus classification
+def get_consensus_classification(
+ classification:pd.Series,
+ classification_weights:pd.Series,
+ genome_to_genomecluster:pd.Series,
+ rank_prefixes:list,
+ number_of_taxonomic_levels="infer",
+ delimiter=";",
+ leniency:float=1.382,
+ ):
+ # Assertions
+ assert np.all(classification.notnull())
+ assert np.all(classification_weights.notnull())
+ assert np.all(genome_to_genomecluster.notnull())
+
+ # Set and index overlap
+ a = set(classification.index)
+ b = set(classification_weights.index)
+ c = set(genome_to_genomecluster.index)
+ assert a == b, "`classification` and `classification_weights` must have the same keys in the index"
+ assert a <= c, "`classification` and `classification_weights` must be a subset (or equal) to the keys in `genome_to_genomecluster` index"
+ index_genomes = pd.Index(sorted(a & b & c ))
+ classification = classification[index_genomes]
+ classification_weights = classification_weights[index_genomes]
+ genome_to_genomecluster = genome_to_genomecluster[index_genomes]
+
+ # Taxonomic levels
+ taxonomic_levels = classification.map(lambda x: x.count(delimiter)).unique()
+ if len(taxonomic_levels):
+ assert len(taxonomic_levels) == 1, "Taxonomic levels in `classification` should all have the same number of delimiters" #! Might need to change this to allow for missing taxonomic levels
+ else:
+ number_of_taxonomic_levels = 1
+
+ if number_of_taxonomic_levels == "infer":
+ number_of_taxonomic_levels = taxonomic_levels[0] + 1
+
+ # Scaling factors
+ scaling_factors = np.arange(1, number_of_taxonomic_levels + 1) # d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Actinomycetales;f__Dermabacteraceae;g__Brachybacterium
+ scaling_factors = np.power(scaling_factors, leniency)
+
+ # Get container for scores [SLC -> Taxonomy -> Score]
+ #
+ # For example, the following MAG:
+ # CLASSIFICATION=d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Mycobacteriales;f__Mycobacteriaceae;g__Corynebacterium;s__Corynebacterium aurimucosum_E
+ # MSA_PERCENT=80.0
+ #
+ # Would be stored and appended for its corresponding SLC:
+ # d__Bacteria += 80.0
+ # d__Bacteria;p__Actinobacteriota += 80.0
+ # d__Bacteria;p__Actinobacteriota;c__Actinomycetia += 80.0
+ # d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Mycobacteriales += 80.0
+ # d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Mycobacteriales;f__Mycobacteriaceae += 80.0
+ # d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Mycobacteriales;f__Mycobacteriaceae;g__Corynebacterium += 80.0
+ # d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Mycobacteriales;f__Mycobacteriaceae;g__Corynebacterium;s__Corynebacterium aurimucosum_E += 80.0
+ genomecluster_taxa_scores = defaultdict(lambda: defaultdict(float))
+
+ # Iterate through MAG, classification, and score
+ df = pd.concat([genome_to_genomecluster.to_frame("id"), classification.to_frame("classification"), classification_weights.to_frame("weight")], axis=1)
+ genomecluster_to_genomes = defaultdict(list)
+ for id_genome, (id_genome_cluster, classification, w) in df.iterrows():
+ genomecluster_to_genomes[id_genome_cluster].append(id_genome)
+ # Split the taxonomy classification by levels
+ levels = classification.split(delimiter)
+ # Remove the empty taxonomy levels (e.g., g__Corynebacterium;s__ --> g__Corynebacterium)
+ # levels = list(filter(lambda x:x not in rank_prefixes, levels))
+ number_of_query_levels = len(levels)
+ # Iterate through each level, scale score by the leniency weights, and add to running sum
+ for i in range(1, number_of_query_levels + 1):
+ taxon_at_level = levels[i-1]
+ taxon_level_is_missing = taxon_at_level in rank_prefixes
+ if taxon_level_is_missing:
+ weighted_score = 0.0
+ print("`{}` is missing taxonomic level `{}`".format(id_genome, taxon_at_level), file=sys.stderr)
+
+ else:
+ weighted_score = float(w) * scaling_factors[i-1]
+ genomecluster_taxa_scores[id_genome_cluster][tuple(levels[:i])] += weighted_score
+ genomecluster_to_genomes = pd.Series(genomecluster_to_genomes)
+
+ # Build dataframe
+ genomecluster_taxa_scores = pd.Series(genomecluster_taxa_scores)
+ df_consensus_classification = pd.DataFrame(genomecluster_taxa_scores.map(lambda taxa_scores: sorted(taxa_scores.items(), key=lambda x:(x[1], len(x[0])), reverse=True)[0]).to_dict(), index=["consensus_classification", "score"]).T
+ df_consensus_classification["consensus_classification"] = df_consensus_classification["consensus_classification"].map(";".join)
+ df_consensus_classification["number_of_unique_classifications"] = df["classification"].groupby(genome_to_genomecluster).apply(lambda x: len(set(x)))
+ df_consensus_classification["number_of_components"] = genomecluster_to_genomes.map(len) #df["classification"].groupby(genome_to_genomecluster).apply(len)
+ df_consensus_classification["components"] = genomecluster_to_genomes
+ df_consensus_classification["classifications"] = df["classification"].groupby(genome_to_genomecluster).apply(lambda x: list(x))
+ df_consensus_classification["weights"] = df["weight"].groupby(genome_to_genomecluster).apply(lambda x: list(x))
+ df_consensus_classification.index.name = "id"
+
+ # Homogeneity
+ slc_taxa_homogeneity = defaultdict(lambda: defaultdict(float))
+ for id_genome_cluster, (classifications, weights) in df_consensus_classification[["classifications", "weights"]].iterrows():
+ for (c, w) in zip(classifications, weights):
+ slc_taxa_homogeneity[id_genome_cluster][c] += w
+ df_consensus_classification["homogeneity"] = pd.DataFrame(slc_taxa_homogeneity).T.apply(lambda x: np.nanmax(x)/np.nansum(x), axis=1)
+
+ fields = [
+ "consensus_classification",
+ "homogeneity",
+ "number_of_unique_classifications",
+ "number_of_components",
+ "components",
+ "classifications",
+ "weights",
+ "score",
+ ]
+ return df_consensus_classification.loc[:,fields]
+
+
+
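The rank-weighted accumulation performed by `get_consensus_classification` (each genome's weight scaled by `rank_index ** leniency`, summed per taxonomic prefix per cluster, deepest highest-scoring prefix wins) can be illustrated with a toy example. This is a hypothetical sketch with made-up classifications and weights, not the function above:

```python
from collections import defaultdict

# Toy sketch of rank-weighted consensus scoring: weight is scaled by
# i ** leniency so deeper (more specific) ranks count more, then summed
# for every taxonomic prefix.
def score_prefixes(classifications_with_weights, leniency=1.382, delimiter=";"):
    scores = defaultdict(float)
    for classification, weight in classifications_with_weights:
        levels = classification.split(delimiter)
        for i in range(1, len(levels) + 1):
            scores[delimiter.join(levels[:i])] += weight * (i ** leniency)
    return dict(scores)

scores = score_prefixes([
    ("d__Bacteria;p__Actinobacteriota", 80.0),
    ("d__Bacteria;p__Proteobacteria", 60.0),
])
# The consensus pick maximizes (score, depth), matching the sort key above
best = max(scores.items(), key=lambda kv: (kv[1], kv[0].count(";")))
print(best[0])  # d__Bacteria;p__Actinobacteriota
```

Because the depth-2 prefix carrying the larger weight outscores the shared domain-level prefix (80 * 2^1.382 > 80 + 60), the consensus resolves to phylum level rather than retreating to `d__Bacteria`.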
+def main(args=None):
+ # Path info
+ script_directory = os.path.dirname(os.path.abspath( __file__ ))
+ script_filename = __program__
+ # Path info
+ description = """
+ Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable)
+ usage = "{} -i -o