diff --git a/.DS_Store b/.DS_Store
deleted file mode 100644
index 2acc3c3..0000000
Binary files a/.DS_Store and /dev/null differ
diff --git a/CHANGELOG.md b/CHANGELOG.md
index a00fb2e..7f1709c 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -6,7 +6,73 @@
________________________________________________________________

#### Current Releases:

-**Release v1.3.0:**
+**Release v1.4.0 Highlights:**
+
+* **`VEBA` Modules:**
+
+    * Added `profile-taxonomic.py` module, which uses `sylph` to build a sketch database for genomes and queries it for taxonomic abundance.
+    * Added long-read support for `fastq_preprocessor`, `preprocess.py`, `assembly-long.py`, `coverage-long`, and all binning modules.
+    * Redesigned the `binning-eukaryotic` module to handle custom `MetaEuk` databases.
+    * Added new usage syntax `veba --module preprocess --params "${PARAMS}"`, where the Conda environment is abstracted and determined automatically in the backend (see the sketch below). Updated all the walkthroughs to reflect this change.
+    * Added `skani`, which is the new default for genome-level clustering based on ANI.
+    * Added `Diamond DeepClust` as an alternative to `MMSEQS2` for protein clustering.
+
+* **`VEBA` Database (`VDB_v6`)**:
+
+    * Completely rebuilt `VEBA`'s `Microeukaryotic Protein Database` to produce a clustered database, `MicroEuk100/90/50`, similar to `UniRef100/90/50`. Available at [doi:10.5281/zenodo.10139450](https://zenodo.org/records/10139451).
+
+        * **Number of sequences:**
+
+            * MicroEuk100 = 79,920,431 (19 GB)
+            * MicroEuk90 = 51,767,730 (13 GB)
+            * MicroEuk50 = 29,898,853 (6.5 GB)
+
+        * **Number of source organisms per dataset:**
+
+            * MycoCosm = 2,503
+            * PhycoCosm = 174
+            * EnsemblProtists = 233
+            * MMETSP = 759
+            * TARA_SAGv1 = 8
+            * EukProt = 366
+            * EukZoo = 27
+            * TARA_SMAGv1 = 389
+            * NR_Protists-Fungi = 48,217
+
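The abstracted invocation means users no longer need to know which Conda environment a given module lives in. Below is a minimal sketch of how such a dispatcher can resolve a module name to an environment and delegate via `conda run`; the `MODULE_TO_ENV` mapping and `run_module` helper are illustrative assumptions, not `VEBA`'s actual backend logic.

```python
# Hypothetical sketch of module-to-environment dispatch; the mapping and
# helper below are illustrative assumptions, not VEBA's implementation.
import shlex
import subprocess
import sys

MODULE_TO_ENV = {
    "preprocess": "VEBA-preprocess_env",      # assumed environment names;
    "assembly-long": "VEBA-assembly_env",     # only VEBA-profile_env is
    "profile-taxonomic": "VEBA-profile_env",  # named in this changelog
}

def run_module(module: str, params: str) -> int:
    # Resolve the Conda environment for the requested module, then run the
    # module script inside that environment.
    env = MODULE_TO_ENV.get(module)
    if env is None:
        sys.exit(f"Unknown module: {module}")
    command = ["conda", "run", "-n", env, f"{module}.py", *shlex.split(params)]
    return subprocess.call(command)

# Equivalent in spirit to: veba --module preprocess --params "${PARAMS}"
# run_module("preprocess", "-1 reads_1.fq.gz -2 reads_2.fq.gz -n sample_1")
```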
+**Release v1.4.0 Details**
+
+* [2023.12.15] - Added `profile-taxonomic.py` module, which uses `sylph` to build a sketch database for genomes and queries it, similar to `Kraken`, for taxonomic abundance.
+* [2023.12.14] - Removed the requirement to provide `--estimated_assembly_size` for Flye per [Flye Issue #652](https://github.com/fenderglass/Flye/issues/652).
+* [2023.12.14] - Added `sylph` to `VEBA-profile_env` for abundance profiling of genomes.
+* [2023.12.13] - Dereplicate duplicate contigs in `concatenate_fasta.py` (see the sketch below).
+* [2023.12.12] - Added `--reference_gzipped` to `index.py` and `mapping.py`; the new default assumes the reference fasta is not gzipped.
+* [2023.12.11] - Added `skani` as the new default for genome clustering in `cluster.py`, `global_clustering.py`, and `local_clustering.py`.
+* [2023.12.11] - Added support for long reads in `fastq_preprocessor`, `preprocess.py`, `assembly-long.py`, `coverage-long`, and all binning modules.
+* [2023.11.28] - Fixed the `annotations.protein_clusters.tsv.gz` output from `merge_annotations.py` that was added in the `v1.3.1` patch update.
+* [2023.11.14] - Added support for missing values in `compile_eukaryotic_classifications.py`.
+* [2023.11.13] - Added `--metaeuk_split_memory_limit` argument with an (experimental) default of `36G` in `binning-eukaryotic.py` and `eukaryotic_gene_modeling.py`.
+* [2023.11.10] - Added `--compressed 1` to `mmseqs createdb` in the `download_databases.sh` installation script.
+* [2023.11.10] - Added a check to `check_fasta_duplicates.py` and `clean_fasta.py` to make sure there are no `>` characters in fasta sequences, which can result from concatenating fasta files that are missing linebreaks.
+* [2023.11.10] - Added `Diamond DeepClust` to `clustering_wrapper.py`, `global/local_clustering.py`, and `cluster.py`. Changed `mmseqs2_wrapper.py` to `clustering_wrapper.py`. Changed `easy-cluster` and `easy-linclust` to `mmseqs-cluster` and `mmseqs-linclust`.
+* [2023.11.9] - Fixed viral quality in `merge_genome_quality_assessments.py`.
+* [2023.11.3] - Changed `consensus_genome_classification.py` to `consensus_genome_classification_ranked.py`. Also changed the default behavior to allow for missing taxonomic levels.
+* [2023.11.2] - Fixed `merge_annotations.py` causing a memory leak when creating the `annotations.protein_clusters.tsv.gz` output table. The formatting for empty sets and string lists still needs to be corrected.
+
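The [2023.12.13] dereplication and the [2023.11.10] `>` check above amount to streaming concatenation with an identifier filter. A minimal sketch of the idea, assuming dereplication by contig identifier; this is not the actual `concatenate_fasta.py` or `check_fasta_duplicates.py` code.

```python
# Hypothetical sketch: concatenate FASTA files while dropping contigs whose
# identifier was already written, and flag '>' characters inside sequence
# lines (a symptom of concatenating files that are missing linebreaks).
import sys

def concatenate_dereplicated(paths, out_handle=sys.stdout):
    seen = set()      # contig identifiers already written
    write = False     # whether the current record should be emitted
    for path in paths:
        with open(path) as fh:
            for line in fh:
                line = line.rstrip("\n")
                if line.startswith(">"):
                    identifier = line[1:].split()[0]
                    write = identifier not in seen
                    seen.add(identifier)
                    if write:
                        print(line, file=out_handle)
                elif ">" in line:
                    raise ValueError(f"'>' found in a sequence line of {path}")
                elif write:
                    print(line, file=out_handle)

# concatenate_dereplicated(["sample_1/scaffolds.fasta", "sample_2/scaffolds.fasta"])
```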
+
+**Release v1.3.0 Highlights:**

* **`VEBA` Modules:**

    * Added `profile-pathway.py` module and associated scripts for building `HUMAnN` databases from *de novo* genomes and annotations. Essentially, a reads-based functional profiling method via `HUMAnN` using binned genomes as the database.
@@ -139,6 +205,7 @@
________________________________________________________________

**Release v1.1.0 Details**

* **Modules**:
+
    * `annotate.py`
        * Added `NCBIfam-AMRFinder` AMR domain annotations
        * Added `AntiFam` contamination annotations
@@ -238,6 +305,7 @@
________________________________________________________________

    * `build_taxa_sqlite.py`
* **Miscellaneous**:
+
    * Updated environments and now add versions to environments.
    * Added `mamba` to the installation to speed it up.
    * Added `transdecoder_wrapper.py`, which is a wrapper around `TransDecoder` with direct support for `Diamond` and `HMMSearch` homology searches. Also includes `append_geneid_to_transdecoder_gff.py`, which is run in the backend to clean up the GFF files and make them compatible with what is output by `Prodigal` and `MetaEuk` runs of `VEBA`.
@@ -317,6 +385,8 @@
________________________________________________________________

**Critical:**

+* `binning-prokaryotic.py` doesn't produce an `unbinned.fasta` file for long reads if there aren't any genomes. It also creates a symlink called `genomes` in the working directory.
+* Add a way to show all versions
* Genome checkpoints in `tRNAscan-SE` aren't working properly.
* Dereplicate CDS sequences in the GFF from `MetaEuk` so `antiSMASH` works for eukaryotic genomes
* Error with `amplicon.py` that works when run manually...
@@ -329,39 +399,58 @@
There was a problem importing veba_output/misc/reads_table.tsv:

**Definitely:**

+* Use `pigz` instead of `gzip` (see the sketch after these lists)
+* Create a taxdump for `MicroEuk`
+* Reimplement `compile_eukaryotic_classifications.py`
* Add representative to `identifier_mapping.proteins.tsv.gz`
-* Add coding density to GFF files
* Split `download_databases.sh` into `download_databases.sh` (low memory, high threads) and `configure_databases.sh` (high memory, low-to-mid threads). Use `aria2` in parallel instead of `wget`.
* `NextFlow` support
-* Consistent usage of the following terms: 1) dataframe vs. table; 2) protein-cluster vs. orthogroup.
-* Add support for `FAMSA` in `phylogeny.py`
-* Create a `assembly-longreads.py` module that uses `MetaFlye`
-* Expand Microeukaryotic Protein Database to include more microeukaryotes (`Mycocosm` and `PhycoCosm` from `JGI`)
* Install each module via `bioconda`
* Add support for `Salmon` in `mapping.py` and `index.py`. This can be used instead of `STAR`, which would require adding the `exon` field to the `Prodigal` GFF file (`MetaEuk` modified GFF files already have exon ids).

-**Probably (Yes)?:**
+**Eventually (Yes)?:**

+* Don't load all genomes, proteins, and CDS into memory for clustering.
+* Add support for `FAMSA` in `phylogeny.py`
+* Consistent usage of the following terms: 1) dataframe vs. table; 2) protein-cluster vs. orthogroup.
+* Add coding density to GFF files
+* Add `vRhyme` to `binning_wrapper.py` and support `vRhyme` in `binning-viral.py`.
+* Phylogenetic tree of `MicroEuk100`
* Convert HMMs to `MMSEQS2` (https://github.com/soedinglab/MMseqs2/wiki#how-to-create-a-target-profile-database-from-pfam)?
* Run `cmsearch` before `tRNAscan-SE`
* dN/dS from pangenome analysis
* Add [iPHoP](https://bitbucket.org/srouxjgi/iphop/src/main/) to `binning-viral.py`.
* Add a `metabolic.py` module
* Swap [`TransDecoder`](https://github.com/TransDecoder/TransDecoder) for [`TranSuite`](https://github.com/anonconda/TranSuite)
-* Build a clustered version of the Microeukaryotic Protein Database that is more efficient to run. Similar to UniRef100, UniRef90, UniRef50.
+* For viral binning, use contigs that are not identified as viral via `geNomad -> CheckV` with `vRhyme`.

**...Maybe (Not)?**

* Modify the behavior of `annotate.py` to allow skipping Pfam and/or KOFAM annotations since they take a long time.
-
________________________________________________________________
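For the `pigz` item in the **Definitely** list above, the swap is mechanical: prefer the parallel compressor when it is available and fall back to `gzip`. A small sketch under that assumption; the `compress_file` helper is hypothetical, not part of `VEBA`.

```python
# Hypothetical helper: use parallel pigz when it is on PATH, otherwise fall
# back to single-threaded gzip. Both tools compress the file in place,
# replacing it with a .gz version.
import shutil
import subprocess

def compress_file(path: str, threads: int = 4) -> None:
    pigz = shutil.which("pigz")
    if pigz:
        subprocess.run([pigz, "-p", str(threads), path], check=True)
    else:
        subprocess.run(["gzip", path], check=True)

# compress_file("veba_output/assembly/sample_1/scaffolds.fasta", threads=8)
```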
**Daily Change Log:**

+* [2023.12.15] - Added `profile-taxonomic.py` module, which uses `sylph` to build a sketch database for genomes and queries it, similar to `Kraken`, for taxonomic abundance.
+* [2023.12.14] - Removed the requirement to provide `--estimated_assembly_size` for Flye per [Flye Issue #652](https://github.com/fenderglass/Flye/issues/652).
+* [2023.12.14] - Added `sylph` to `VEBA-profile_env` for abundance profiling of genomes.
+* [2023.12.13] - Dereplicate duplicate contigs in `concatenate_fasta.py`.
+* [2023.12.12] - Added `--reference_gzipped` to `index.py` and `mapping.py`; the new default assumes the reference fasta is not gzipped.
+* [2023.12.11] - Added `skani` as the new default for genome clustering in `cluster.py`, `global_clustering.py`, and `local_clustering.py`.
+* [2023.12.11] - Added support for long reads in `fastq_preprocessor`, `preprocess.py`, `assembly-long.py`, `coverage-long`, and all binning modules.
+* [2023.11.28] - Fixed the `annotations.protein_clusters.tsv.gz` output from `merge_annotations.py` that was added in the `v1.3.1` patch update.
+* [2023.11.14] - Added support for missing values in `compile_eukaryotic_classifications.py`.
+* [2023.11.13] - Added `--metaeuk_split_memory_limit` argument with an (experimental) default of `36G` in `binning-eukaryotic.py` and `eukaryotic_gene_modeling.py`.
+* [2023.11.10] - Added `--compressed 1` to `mmseqs createdb` in the `download_databases.sh` installation script.
+* [2023.11.10] - Added a check to `check_fasta_duplicates.py` and `clean_fasta.py` to make sure there are no `>` characters in fasta sequences, which can result from concatenating fasta files that are missing linebreaks.
+* [2023.11.10] - Added `Diamond DeepClust` to `clustering_wrapper.py`, `global/local_clustering.py`, and `cluster.py`. Changed `mmseqs2_wrapper.py` to `clustering_wrapper.py`. Changed `easy-cluster` and `easy-linclust` to `mmseqs-cluster` and `mmseqs-linclust`.
+* [2023.11.9] - Fixed viral quality in `merge_genome_quality_assessments.py`.
+* [2023.11.3] - Changed `consensus_genome_classification.py` to `consensus_genome_classification_ranked.py`. Also changed the default behavior to allow for missing taxonomic levels.
+* [2023.11.2] - Fixed `merge_annotations.py` causing a memory leak when creating the `annotations.protein_clusters.tsv.gz` output table. The formatting for empty sets and string lists still needs to be corrected.
* [2023.10.27] - Updated `annotate.py` and `merge_annotations.py` to handle `CAZy`. They also properly address clustered protein annotations now.
* [2023.10.18] - Added `module_completion_ratio.py` script, which is a fork of `MicrobeAnnotator`'s [`ko_mapper.py`](https://github.com/cruizperez/MicrobeAnnotator/blob/master/microbeannotator/pipeline/ko_mapper.py). Also included a database [Zenodo: 10020074](https://zenodo.org/records/10020074), which will be included in `VDB_v5.2` (see the sketch below).
* [2023.10.16] - Added a checkpoint for `tRNAscan-SE` in `binning-prokaryotic.py` and `eukaryotic_gene_modeling_wrapper.py`.
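As a rough illustration of what a module completion ratio measures, here is a simplified sketch scored against the parsed KEGG module steps built by the scripts under `data/MicrobeAnnotator_KEGG` below. It is not `module_completion_ratio.py` or `ko_mapper.py` themselves, and it ignores the `_` and nested `,` constructs that the real parser handles.

```python
# Simplified sketch: score a parsed KEGG module against the KO identifiers
# detected in a genome. Each step is a list of alternatives; '+' joins
# required subunits of a complex, and components after '-' are treated as
# optional and dropped here.
def step_is_satisfied(alternatives, detected_kos):
    for alternative in alternatives:
        required = alternative.split("-")[0].split("+")
        if all(ko in detected_kos for ko in required):
            return True
    return False

def module_completion_ratio(parsed_steps, detected_kos):
    satisfied = sum(
        step_is_satisfied(alternatives, detected_kos)
        for alternatives in parsed_steps.values()
    )
    return satisfied / len(parsed_steps)

# Step structure mirrors the parsed entries in 05.Modules_Parsed.txt, e.g.
# the first two steps of M00010: (K01647,K05942) (K01681,K01682)
# steps = {1: ["K01647", "K05942"], 2: ["K01681", "K01682"]}
# module_completion_ratio(steps, {"K01647", "K01682"})  # -> 1.0
```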
diff --git a/MODULE_RESOURCES.xlsx b/MODULE_RESOURCES.xlsx
new file mode 100644
index 0000000..bf99e33
Binary files /dev/null and b/MODULE_RESOURCES.xlsx differ
diff --git a/SOURCES.xlsx b/SOURCES.xlsx
index 33478b5..b0f60d3 100644
Binary files a/SOURCES.xlsx and b/SOURCES.xlsx differ
diff --git a/VERSION b/VERSION
index 0a7e926..a0fef3f 100644
--- a/VERSION
+++ b/VERSION
@@ -1,2 +1,2 @@
-1.3.0
-VDB_v5.2
+1.4.0b
+VDB_v6
diff --git a/data/MicrobeAnnotator_KEGG/01.KEGG_DB/00.KEGG_Data_Scrapper.py b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/00.KEGG_Data_Scrapper.py
new file mode 100644
index 0000000..e1f5f8f
--- /dev/null
+++ b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/00.KEGG_Data_Scrapper.py
@@ -0,0 +1,165 @@
+from bs4 import BeautifulSoup
+import re
+import pickle
+import ast
+import requests
+
+
+"""Script to download and parse KEGG module information and store it in data files."""
+
+def download_kegg_modules(module_name_file, chrome_driver):
+    # Note: chrome_driver is unused; it appears to be left over from an
+    # earlier browser-driven version of this scraper.
+    module_ids = []
+    module_names = {}
+    module_components_raw = {}
+    # Parse module names
+    with open(module_name_file) as module_input:
+        for line in module_input:
+            line = line.strip().split("\t")
+            module_ids.append(line[0])
+            module_names[line[0]] = line[1]
+    # Access KEGG and download module information
+    for identifier in module_ids:
+        url = "https://www.kegg.jp/kegg-bin/show_module?" + identifier
+        site_request = requests.get(url)
+        soup = BeautifulSoup(site_request.text, "html.parser")
+        module_definition = ""
+        module_definition_bool = False
+        definition = soup.find(class_="definition")
+        # The definition string is on the first non-empty line after the
+        # line reading "Definition".
+        for line in definition.text.splitlines():
+            if line.strip() == "":
+                continue
+            elif module_definition_bool:
+                module_definition = line.strip()
+                module_definition_bool = False
+            elif line.strip() == "Definition":
+                module_definition_bool = True
+        print(module_definition)
+        module_components_raw[identifier] = module_definition
+    return module_components_raw
+
+
+def parse_regular_module_dictionary(bifurcating_list_file, structural_list_file, module_components_raw):
+    bifurcating_list = []
+    structural_list = []
+    # Populate bifurcating and structural lists
+    with open(bifurcating_list_file, 'r') as bif_list:
+        for line in bif_list:
+            bifurcating_list.append(line.strip())
+    with open(structural_list_file, 'r') as struct_list:
+        for line in struct_list:
+            structural_list.append(line.strip())
+    # Parse raw module information; bifurcating and structural modules are
+    # curated separately and skipped here.
+    module_steps_parsed = {}
+    for key, values in module_components_raw.items():
+        values = values.replace(" --", "")
+        values = values.replace("-- ", "")
+        if key in bifurcating_list or key in structural_list:
+            continue
+        else:
+            # Replace spaces inside parentheses with underscores so that a
+            # whitespace split yields one token per step, e.g.
+            # "(K00844,K12407) K01810" -> ["(K00844,K12407)", "K01810"].
+            module = []
+            parenthesis_count = 0
+            for character in values:
+                if character == "(":
+                    parenthesis_count += 1
+                    module.append(character)
+                elif character == " ":
+                    if parenthesis_count == 0:
+                        module.append(character)
+                    else:
+                        module.append("_")
+                elif character == ")":
+                    parenthesis_count -= 1
+                    module.append(character)
+                else:
+                    module.append(character)
+            steps = ''.join(module).split()
+            module_steps_parsed[key] = steps
+    # Remove modules depending on other modules
+    temporal_dictionary = module_steps_parsed.copy()
+    for key, values in temporal_dictionary.items():
+        for value in values:
+            if re.search(r'M[0-9]{5}', value) is not None:
+                del module_steps_parsed[key]
+                break
+    return module_steps_parsed
+
+
+def create_final_regular_dictionary(module_steps_parsed, module_components_raw, outfile):
+    final_regular_dict = {}
+    # Parse module steps and export them into a text file
+    with open(outfile, 'w') as output:
+        for key, value in module_steps_parsed.items():
+            output.write("{}\n".format(key))
+            output.write("{}\n".format(module_components_raw[key]))
+            output.write("{}\n".format(value))
+            output.write("{}\n".format("=="))
+            final_regular_dict[key] = {}
+            step_number = 0
+            for step in value:
+                step_number += 1
+                count = 0
+                options = 0
+                temp_string = ""
+                for char in step:
+                    if char == "(":
+                        count += 1
+                        options += 1
+                        # '%' marks a parenthesis that follows a '-' so the
+                        # optional group survives the cleanup pass below.
+                        if len(temp_string) > 1 and temp_string[-1] == "-":
+                            temp_string += "%"
+                    elif char == ")":
+                        count -= 1
+                        if count >= 1:
+                            temp_string += char
+                        else:
+                            continue
+                    elif char == ",":
+                        if count >= 2:
+                            temp_string += char
+                            print(step)
+                        else:
+                            temp_string += " "
+                    else:
+                        temp_string += char
+                if options >= 2:
+                    temp_string = temp_string.replace(")_", "_")
+                    if re.search(r'%.*\)', temp_string) is None:
+                        temp_string = temp_string.replace(")", "")
+                temp_string = "".join(temp_string.rsplit("__", 1))
+                temp_string = temp_string.split()
+                if isinstance(temp_string, str):
+                    temp_string = temp_string.split()
+                temp_string = sorted(temp_string, key=len)
+                final_regular_dict[key][step_number] = temp_string
+                output.write("{}\n".format(temp_string))
+            output.write("{}\n".format("++++++++++++++++++"))
+    return final_regular_dict
+
+
+def export_module_dictionary(dictionary, location):
+    with open(location, "wb") as pickle_out:
+        pickle.dump(dictionary, pickle_out)
+
+
+def transform_module_dictionaries(bifurcating_data, structural_data, output_bifur, output_struct):
+    with open(bifurcating_data) as bif_handle:
+        bifurcating_dictionary = ast.literal_eval(bif_handle.read())
+    export_module_dictionary(bifurcating_dictionary, output_bifur)
+    with open(structural_data) as struct_handle:
+        structural_dictionary = ast.literal_eval(struct_handle.read())
+    export_module_dictionary(structural_dictionary, output_struct)
+
+
+# Execute parsing functions
+
+module_components_raw = download_kegg_modules("00.Module_Names.txt", 'chromedriver')
+module_steps_parsed = parse_regular_module_dictionary("01.Bifurcating_List.txt",
+                                                      "02.Structural_List.txt", module_components_raw)
+final_regular_dict = create_final_regular_dictionary(module_steps_parsed, module_components_raw, "05.Modules_Parsed.txt")
+
+
+export_module_dictionary(final_regular_dict, "../01.KEGG_Regular_Module_Information.pickle")
+transform_module_dictionaries("03.Bifurcating_Modules.dict",
+                              "04.Structural_Modules.dict",
+                              "../02.KEGG_Bifurcating_Module_Information.pickle",
+                              "../03.KEGG_Structural_Module_Information.pickle")
\ No newline at end of file
diff --git a/data/MicrobeAnnotator_KEGG/01.KEGG_DB/00.Module_Names.txt b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/00.Module_Names.txt
new file mode 100644
index 0000000..db9ec87
--- /dev/null
+++ b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/00.Module_Names.txt
@@ -0,0 +1,394 @@
+M00015 Proline biosynthesis, glutamate => proline Arginine and proline metabolism #8a3222
+M00028 Ornithine biosynthesis, glutamate => ornithine Arginine and proline metabolism #8a3222
+M00029 Urea cycle Arginine and proline metabolism #8a3222
+M00047 Creatine pathway Arginine and proline metabolism #8a3222
+M00763 Ornithine biosynthesis, mediated by LysW, glutamate => ornithine Arginine and proline metabolism #8a3222
+M00844 Arginine biosynthesis, ornithine => arginine Arginine and proline metabolism #8a3222
+M00845 Arginine biosynthesis, glutamate => acetylcitrulline => arginine Arginine and proline metabolism #8a3222
+M00879 Arginine succinyltransferase pathway, arginine => glutamate Arginine and proline metabolism #8a3222
+M00022 Shikimate pathway,
phosphoenolpyruvate + erythrose-4P => chorismate Aromatic amino acid metabolism #8641b6 +M00023 Tryptophan biosynthesis, chorismate => tryptophan Aromatic amino acid metabolism #8641b6 +M00024 Phenylalanine biosynthesis, chorismate => phenylalanine Aromatic amino acid metabolism #8641b6 +M00025 Tyrosine biosynthesis, chorismate => tyrosine Aromatic amino acid metabolism #8641b6 +M00037 Melatonin biosynthesis, tryptophan => serotonin => melatonin Aromatic amino acid metabolism #8641b6 +M00038 Tryptophan metabolism, tryptophan => kynurenine => 2-aminomuconate Aromatic amino acid metabolism #8641b6 +M00040 Tyrosine biosynthesis, prephanate => pretyrosine => tyrosine Aromatic amino acid metabolism #8641b6 +M00042 Catecholamine biosynthesis, tyrosine => dopamine => noradrenaline => adrenaline Aromatic amino acid metabolism #8641b6 +M00043 Thyroid hormone biosynthesis, tyrosine => triiodothyronine--thyroxine Aromatic amino acid metabolism #8641b6 +M00044 Tyrosine degradation, tyrosine => homogentisate Aromatic amino acid metabolism #8641b6 +M00533 Homoprotocatechuate degradation, homoprotocatechuate => 2-oxohept-3-enedioate Aromatic amino acid metabolism #8641b6 +M00545 Trans-cinnamate degradation, trans-cinnamate => acetyl-CoA Aromatic amino acid metabolism #8641b6 +M00418 Toluene degradation, anaerobic, toluene => benzoyl-CoA Aromatics degradation #76d25b +M00419 Cymene degradation, p-cymene => p-cumate Aromatics degradation #76d25b +M00534 Naphthalene degradation, naphthalene => salicylate Aromatics degradation #76d25b +M00537 Xylene degradation, xylene => methylbenzoate Aromatics degradation #76d25b +M00538 Toluene degradation, toluene => benzoate Aromatics degradation #76d25b +M00539 Cumate degradation, p-cumate => 2-oxopent-4-enoate + 2-methylpropanoate Aromatics degradation #76d25b +M00540 Benzoate degradation, cyclohexanecarboxylic acid =>pimeloyl-CoA Aromatics degradation #76d25b +M00541 Benzoyl-CoA degradation, benzoyl-CoA => 3-hydroxypimeloyl-CoA Aromatics degradation #76d25b +M00543 Biphenyl degradation, biphenyl => 2-oxopent-4-enoate + benzoate Aromatics degradation #76d25b +M00544 Carbazole degradation, carbazole => 2-oxopent-4-enoate + anthranilate Aromatics degradation #76d25b +M00547 Benzene--toluene degradation, benzene => catechol -- toluene => 3-methylcatechol Aromatics degradation #76d25b +M00548 Benzene degradation, benzene => catechol Aromatics degradation #76d25b +M00551 Benzoate degradation, benzoate => catechol -- methylbenzoate => methylcatechol Aromatics degradation #76d25b +M00568 Catechol ortho-cleavage, catechol => 3-oxoadipate Aromatics degradation #76d25b +M00569 Catechol meta-cleavage, catechol => acetyl-CoA -- 4-methylcatechol => propanoyl-CoA Aromatics degradation #76d25b +M00623 Phthalate degradation 1, phthalate => protocatechuate Aromatics degradation #76d25b +M00624 Terephthalate degradation, terephthalate => 3,4-dihydroxybenzoate Aromatics degradation #76d25b +M00636 Phthalate degradation 2, phthalate => protocatechuate Aromatics degradation #76d25b +M00637 Anthranilate degradation, anthranilate => catechol Aromatics degradation #76d25b +M00638 Salicylate degradation, salicylate => gentisate Aromatics degradation #76d25b +M00878 Phenylacetate degradation, phenylaxetate => acetyl-CoA--succinyl-CoA Aromatics degradation #76d25b +M00142 NADH:ubiquinone oxidoreductase, mitochondria ATP synthesis #cdd346 +M00143 NADH dehydrogenase (ubiquinone) Fe-S protein--flavoprotein complex, mitochondria ATP synthesis #cdd346 +M00144 NADH:quinone oxidoreductase, 
prokaryotes ATP synthesis #cdd346 +M00145 NAD(P)H:quinone oxidoreductase, chloroplasts and cyanobacteria ATP synthesis #cdd346 +M00146 NADH dehydrogenase (ubiquinone) 1 alpha subcomplex ATP synthesis #cdd346 +M00147 NADH dehydrogenase (ubiquinone) 1 beta subcomplex ATP synthesis #cdd346 +M00148 Succinate dehydrogenase (ubiquinone) ATP synthesis #cdd346 +M00149 Succinate dehydrogenase, prokaryotes ATP synthesis #cdd346 +M00150 Fumarate reductase, prokaryotes ATP synthesis #cdd346 +M00151 Cytochrome bc1 complex respiratory unit ATP synthesis #cdd346 +M00152 Cytochrome bc1 complex ATP synthesis #cdd346 +M00153 Cytochrome bd ubiquinol oxidase ATP synthesis #cdd346 +M00154 Cytochrome c oxidase ATP synthesis #cdd346 +M00155 Cytochrome c oxidase, prokaryotes ATP synthesis #cdd346 +M00156 Cytochrome c oxidase, cbb3-type ATP synthesis #cdd346 +M00157 F-type ATPase, prokaryotes and chloroplasts ATP synthesis #cdd346 +M00158 F-type ATPase, eukaryotes ATP synthesis #cdd346 +M00159 V-type ATPase, prokaryotes ATP synthesis #cdd346 +M00160 V-type ATPase, eukaryotes ATP synthesis #cdd346 +M00162 Cytochrome b6f complex ATP synthesis #cdd346 +M00416 Cytochrome aa3-600 menaquinol oxidase ATP synthesis #cdd346 +M00417 Cytochrome o ubiquinol oxidase ATP synthesis #cdd346 +M00672 Penicillin biosynthesis, aminoadipate + cycteine + valine => penicillin Beta-Lactam biosynthesis #3b2882 +M00673 Cephamycin C biosynthesis, aminoadipate + cycteine + valine => cephamycin C Beta-Lactam biosynthesis #3b2882 +M00674 Clavaminate biosynthesis, arginine + glyceraldehyde-3P => clavaminate Beta-Lactam biosynthesis #3b2882 +M00675 Carbapenem-3-carboxylate biosynthesis, pyrroline-5-carboxylate + malonyl-CoA => carbapenem-3-carboxylate Beta-Lactam biosynthesis #3b2882 +M00736 Nocardicin A biosynthesis, L-pHPG + arginine + serine => nocardicin A Beta-Lactam biosynthesis #3b2882 +M00039 Monolignol biosynthesis, phenylalanine--tyrosine => monolignol Biosynthesis of other secondary metabolites #cbde82 +M00137 Flavanone biosynthesis, phenylalanine => naringenin Biosynthesis of other secondary metabolites #cbde82 +M00138 Flavonoid biosynthesis, naringenin => pelargonidin Biosynthesis of other secondary metabolites #cbde82 +M00370 Glucosinolate biosynthesis, tryptophan => glucobrassicin Biosynthesis of other secondary metabolites #cbde82 +M00661 Paspaline biosynthesis, geranylgeranyl-PP + indoleglycerol phosphate => paspaline Biosynthesis of other secondary metabolites #cbde82 +M00785 Cycloserine biosynthesis, arginine--serine => cycloserine Biosynthesis of other secondary metabolites #cbde82 +M00786 Fumitremorgin alkaloid biosynthesis, tryptophan + proline => fumitremorgin C--A Biosynthesis of other secondary metabolites #cbde82 +M00787 Bacilysin biosynthesis, prephenate => bacilysin Biosynthesis of other secondary metabolites #cbde82 +M00788 Terpentecin biosynthesis, GGAP => terpentecin Biosynthesis of other secondary metabolites #cbde82 +M00789 Rebeccamycin biosynthesis, tryptophan => rebeccamycin Biosynthesis of other secondary metabolites #cbde82 +M00790 Pyrrolnitrin biosynthesis, tryptophan => pyrrolnitrin Biosynthesis of other secondary metabolites #cbde82 +M00805 Staurosporine biosynthesis, tryptophan => staurosporine Biosynthesis of other secondary metabolites #cbde82 +M00808 Violacein biosynthesis, tryptophan => violacein Biosynthesis of other secondary metabolites #cbde82 +M00814 Acarbose biosynthesis, sedoheptulopyranose-7P => acarbose Biosynthesis of other secondary metabolites #cbde82 +M00815 Validamycin A biosynthesis, 
sedoheptulopyranose-7P => validamycin A Biosynthesis of other secondary metabolites #cbde82 +M00819 Pentalenolactone biosynthesis, farnesyl-PP => pentalenolactone Biosynthesis of other secondary metabolites #cbde82 +M00835 Pyocyanine biosynthesis, chorismate => pyocyanine Biosynthesis of other secondary metabolites #cbde82 +M00837 Prodigiosin biosynthesis, L-proline => prodigiosin Biosynthesis of other secondary metabolites #cbde82 +M00838 Undecylprodigiosin biosynthesis, L-proline => undecylprodigiosin Biosynthesis of other secondary metabolites #cbde82 +M00848 Aurachin biosynthesis, anthranilate => aurachin A Biosynthesis of other secondary metabolites #cbde82 +M00875 Staphyloferrin B biosynthesis, L-serine => staphyloferrin B Biosynthesis of other secondary metabolites #cbde82 +M00876 Staphyloferrin A biosynthesis, L-ornithine => staphyloferrin A Biosynthesis of other secondary metabolites #cbde82 +M00877 Kanosamine biosynthesis glucose 6-phosphate => kanosamine Biosynthesis of other secondary metabolites #cbde82 +M00019 Valine--isoleucine biosynthesis, pyruvate => valine -- 2-oxobutanoate => isoleucine Branched-chain amino acid metabolism #656cdb +M00036 Leucine degradation, leucine => acetoacetate + acetyl-CoA Branched-chain amino acid metabolism #656cdb +M00432 Leucine biosynthesis, 2-oxoisovalerate => 2-oxoisocaproate Branched-chain amino acid metabolism #656cdb +M00535 Isoleucine biosynthesis, pyruvate => 2-oxobutanoate Branched-chain amino acid metabolism #656cdb +M00570 Isoleucine biosynthesis, threonine => 2-oxobutanoate => isoleucine Branched-chain amino acid metabolism #656cdb +M00165 Reductive pentose phosphate cycle (Calvin cycle) Carbon fixation #408937 +M00166 Reductive pentose phosphate cycle, ribulose-5P => glyceraldehyde-3P Carbon fixation #408937 +M00167 Reductive pentose phosphate cycle, glyceraldehyde-3P => ribulose-5P Carbon fixation #408937 +M00168 CAM (Crassulacean acid metabolism), dark Carbon fixation #408937 +M00169 CAM (Crassulacean acid metabolism), light Carbon fixation #408937 +M00170 C4-dicarboxylic acid cycle, phosphoenolpyruvate carboxykinase type Carbon fixation #408937 +M00171 C4-dicarboxylic acid cycle, NAD - malic enzyme type Carbon fixation #408937 +M00172 C4-dicarboxylic acid cycle, NADP - malic enzyme type Carbon fixation #408937 +M00173 Reductive citrate cycle (Arnon-Buchanan cycle) Carbon fixation #408937 +M00374 Dicarboxylate-hydroxybutyrate cycle Carbon fixation #408937 +M00375 Hydroxypropionate-hydroxybutylate cycle Carbon fixation #408937 +M00376 3-Hydroxypropionate bi-cycle Carbon fixation #408937 +M00377 Reductive acetyl-CoA pathway (Wood-Ljungdahl pathway) Carbon fixation #408937 +M00579 Phosphate acetyltransferase-acetate kinase pathway, acetyl-CoA => acetate Carbon fixation #408937 +M00620 Incomplete reductive citrate cycle, acetyl-CoA => oxoglutarate Carbon fixation #408937 +M00001 Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate Central carbohydrate metabolism #c644a5 +M00002 Glycolysis, core module involving three-carbon compounds Central carbohydrate metabolism #c644a5 +M00003 Gluconeogenesis, oxaloacetate => fructose-6P Central carbohydrate metabolism #c644a5 +M00004 Pentose phosphate pathway (Pentose phosphate cycle) Central carbohydrate metabolism #c644a5 +M00005 PRPP biosynthesis, ribose 5P => PRPP Central carbohydrate metabolism #c644a5 +M00006 Pentose phosphate pathway, oxidative phase, glucose 6P => ribulose 5P Central carbohydrate metabolism #c644a5 +M00007 Pentose phosphate pathway, non-oxidative phase, fructose 6P 
=> ribose 5P Central carbohydrate metabolism #c644a5 +M00008 Entner-Doudoroff pathway, glucose-6P => glyceraldehyde-3P + pyruvate Central carbohydrate metabolism #c644a5 +M00009 Citrate cycle (TCA cycle, Krebs cycle) Central carbohydrate metabolism #c644a5 +M00010 Citrate cycle, first carbon oxidation, oxaloacetate => 2-oxoglutarate Central carbohydrate metabolism #c644a5 +M00011 Citrate cycle, second carbon oxidation, 2-oxoglutarate => oxaloacetate Central carbohydrate metabolism #c644a5 +M00307 Pyruvate oxidation, pyruvate => acetyl-CoA Central carbohydrate metabolism #c644a5 +M00308 Semi-phosphorylative Entner-Doudoroff pathway, gluconate => glycerate-3P Central carbohydrate metabolism #c644a5 +M00309 Non-phosphorylative Entner-Doudoroff pathway, gluconate--galactonate => glycerate Central carbohydrate metabolism #c644a5 +M00580 Pentose phosphate pathway, archaea, fructose 6P => ribose 5P Central carbohydrate metabolism #c644a5 +M00633 Semi-phosphorylative Entner-Doudoroff pathway, gluconate--galactonate => glycerate-3P Central carbohydrate metabolism #c644a5 +M00112 Tocopherol--tocotorienol biosynthesis Cofactor and vitamin metabolism #5fda98 +M00115 NAD biosynthesis, aspartate => NAD Cofactor and vitamin metabolism #5fda98 +M00116 Menaquinone biosynthesis, chorismate => menaquinol Cofactor and vitamin metabolism #5fda98 +M00117 Ubiquinone biosynthesis, prokaryotes, chorismate => ubiquinone Cofactor and vitamin metabolism #5fda98 +M00119 Pantothenate biosynthesis, valine--L-aspartate => pantothenate Cofactor and vitamin metabolism #5fda98 +M00120 Coenzyme A biosynthesis, pantothenate => CoA Cofactor and vitamin metabolism #5fda98 +M00121 Heme biosynthesis, plants and bacteria, glutamate => heme Cofactor and vitamin metabolism #5fda98 +M00122 Cobalamin biosynthesis, cobinamide => cobalamin Cofactor and vitamin metabolism #5fda98 +M00123 Biotin biosynthesis, pimeloyl-ACP--CoA => biotin Cofactor and vitamin metabolism #5fda98 +M00124 Pyridoxal biosynthesis, erythrose-4P => pyridoxal-5P Cofactor and vitamin metabolism #5fda98 +M00125 Riboflavin biosynthesis, GTP => riboflavin--FMN--FAD Cofactor and vitamin metabolism #5fda98 +M00126 Tetrahydrofolate biosynthesis, GTP => THF Cofactor and vitamin metabolism #5fda98 +M00127 Thiamine biosynthesis, AIR => thiamine-P--thiamine-2P Cofactor and vitamin metabolism #5fda98 +M00128 Ubiquinone biosynthesis, eukaryotes, 4-hydroxybenzoate => ubiquinone Cofactor and vitamin metabolism #5fda98 +M00140 C1-unit interconversion, prokaryotes Cofactor and vitamin metabolism #5fda98 +M00141 C1-unit interconversion, eukaryotes Cofactor and vitamin metabolism #5fda98 +M00572 Pimeloyl-ACP biosynthesis, BioC-BioH pathway, malonyl-ACP => pimeloyl-ACP Cofactor and vitamin metabolism #5fda98 +M00573 Biotin biosynthesis, BioI pathway, long-chain-acyl-ACP => pimeloyl-ACP => biotin Cofactor and vitamin metabolism #5fda98 +M00577 Biotin biosynthesis, BioW pathway, pimelate => pimeloyl-CoA => biotin Cofactor and vitamin metabolism #5fda98 +M00622 Nicotinate degradation, nicotinate => fumarate Cofactor and vitamin metabolism #5fda98 +M00810 Nicotine degradation, pyridine pathway, nicotine => 2,6-dihydroxypyridine--succinate semialdehyde Cofactor and vitamin metabolism #5fda98 +M00811 Nicotine degradation, pyrrolidine pathway, nicotine => succinate semialdehyde Cofactor and vitamin metabolism #5fda98 +M00836 Coenzyme F430 biosynthesis, sirohydrochlorin => coenzyme F430 Cofactor and vitamin metabolism #5fda98 +M00840 Tetrahydrofolate biosynthesis, mediated by ribA and trpF, 
GTP => THF Cofactor and vitamin metabolism #5fda98 +M00841 Tetrahydrofolate biosynthesis, mediated by PTPS, GTP => THF Cofactor and vitamin metabolism #5fda98 +M00842 Tetrahydrobiopterin biosynthesis, GTP => BH4 Cofactor and vitamin metabolism #5fda98 +M00843 L-threo-Tetrahydrobiopterin biosynthesis, GTP => L-threo-BH4 Cofactor and vitamin metabolism #5fda98 +M00846 Siroheme biosynthesis, glutamate => siroheme Cofactor and vitamin metabolism #5fda98 +M00847 Heme biosynthesis, archaea, siroheme => heme Cofactor and vitamin metabolism #5fda98 +M00868 Heme biosynthesis, animals and fungi, glycine => heme Cofactor and vitamin metabolism #5fda98 +M00880 Molybdenum cofactor biosynthesis, GTP => molybdenum cofactor Cofactor and vitamin metabolism #5fda98 +M00017 Methionine biosynthesis, apartate => homoserine => methionine Cysteine and methionine metabolism #782975 +M00021 Cysteine biosynthesis, serine => cysteine Cysteine and methionine metabolism #782975 +M00034 Methionine salvage pathway Cysteine and methionine metabolism #782975 +M00035 Methionine degradation Cysteine and methionine metabolism #782975 +M00338 Cysteine biosynthesis, homocysteine + serine => cysteine Cysteine and methionine metabolism #782975 +M00368 Ethylene biosynthesis, methionine => ethylene Cysteine and methionine metabolism #782975 +M00609 Cysteine biosynthesis, methionine => cysteine Cysteine and methionine metabolism #782975 +M00625 Methicillin resistance Drug resistance #869534 +M00627 beta-Lactam resistance, Bla system Drug resistance #869534 +M00639 Multidrug resistance, efflux pump MexCD-OprJ Drug resistance #869534 +M00641 Multidrug resistance, efflux pump MexEF-OprN Drug resistance #869534 +M00642 Multidrug resistance, efflux pump MexJK-OprM Drug resistance #869534 +M00643 Multidrug resistance, efflux pump MexXY-OprM Drug resistance #869534 +M00649 Multidrug resistance, efflux pump AdeABC Drug resistance #869534 +M00651 Vancomycin resistance, D-Ala-D-Lac type Drug resistance #869534 +M00652 Vancomycin resistance, D-Ala-D-Ser type Drug resistance #869534 +M00696 Multidrug resistance, efflux pump AcrEF-TolC Drug resistance #869534 +M00697 Multidrug resistance, efflux pump MdtEF-TolC Drug resistance #869534 +M00698 Multidrug resistance, efflux pump BpeEF-OprC Drug resistance #869534 +M00700 Multidrug resistance, efflux pump AbcA Drug resistance #869534 +M00702 Multidrug resistance, efflux pump NorB Drug resistance #869534 +M00704 Tetracycline resistance, efflux pump Tet38 Drug resistance #869534 +M00705 Multidrug resistance, efflux pump MepA Drug resistance #869534 +M00714 Multidrug resistance, efflux pump QacA Drug resistance #869534 +M00718 Multidrug resistance, efflux pump MexAB-OprM Drug resistance #869534 +M00725 Cationic antimicrobial peptide (CAMP) resistance, dltABCD operon Drug resistance #869534 +M00726 Cationic antimicrobial peptide (CAMP) resistance, lysyl-phosphatidylglycerol (L-PG) synthase MprF Drug resistance #869534 +M00730 Cationic antimicrobial peptide (CAMP) resistance, VraFG transporter Drug resistance #869534 +M00744 Cationic antimicrobial peptide (CAMP) resistance, protease PgtE Drug resistance #869534 +M00745 Imipenem resistance, repression of porin OprD Drug resistance #869534 +M00746 Multidrug resistance, repression of porin OmpF Drug resistance #869534 +M00769 Multidrug resistance, efflux pump MexPQ-OpmE Drug resistance #869534 +M00851 Carbapenem resistance Drug resistance #869534 +M00824 9-membered enediyne core biosynthesis, malonyl-CoA => 3-hydroxyhexadeca-4,6,8,10,12,14-hexaenoyl-ACP => 
9-membered enediyne core Enediyne biosynthesis #d27bde +M00825 10-membered enediyne core biosynthesis, malonyl-CoA => 3-hydroxyhexadeca-4,6,8,10,12,14-hexaenoyl-ACP => 10-membered enediyne core Enediyne biosynthesis #d27bde +M00826 C-1027 benzoxazolinate moiety biosynthesis, chorismate => benzoxazolinyl-CoA Enediyne biosynthesis #d27bde +M00827 C-1027 beta-amino acid moiety biosynthesis, tyrosine => 3-chloro-4,5-dihydroxy-beta-phenylalanyl-PCP Enediyne biosynthesis #d27bde +M00828 Maduropeptin beta-hydroxy acid moiety biosynthesis, tyrosine => 3-(4-hydroxyphenyl)-3-oxopropanoyl-PCP Enediyne biosynthesis #d27bde +M00829 3,6-Dimethylsalicylyl-CoA biosynthesis, malonyl-CoA => 6-methylsalicylate => 3,6-dimethylsalicylyl-CoA Enediyne biosynthesis #d27bde +M00830 Neocarzinostatin naphthoate moiety biosynthesis, malonyl-CoA => 2-hydroxy-5-methyl-1-naphthoate => 2-hydroxy-7-methoxy-5-methyl-1-naphthoyl-CoA Enediyne biosynthesis #d27bde +M00831 Kedarcidin 2-hydroxynaphthoate moiety biosynthesis, malonyl-CoA => 3,6,8-trihydroxy-2-naphthoate => 3-hydroxy-7,8-dimethoxy-6-isopropoxy-2-naphthoyl-CoA Enediyne biosynthesis #d27bde +M00832 Kedarcidin 2-aza-3-chloro-beta-tyrosine moiety biosynthesis, azatyrosine => 2-aza-3-chloro-beta-tyrosyl-PCP Enediyne biosynthesis #d27bde +M00833 Calicheamicin biosynthesis, calicheamicinone => calicheamicin Enediyne biosynthesis #d27bde +M00834 Calicheamicin orsellinate moiety biosynthesis, malonyl-CoA => orsellinate-ACP => 5-iodo-2,3-dimethoxyorsellinate-ACP Enediyne biosynthesis #d27bde +M00082 Fatty acid biosynthesis, initiation Fatty acid metabolism #d9a344 +M00083 Fatty acid biosynthesis, elongation Fatty acid metabolism #d9a344 +M00085 Fatty acid elongation in mitochondria Fatty acid metabolism #d9a344 +M00086 beta-Oxidation, acyl-CoA synthesis Fatty acid metabolism #d9a344 +M00087 beta-Oxidation Fatty acid metabolism #d9a344 +M00415 Fatty acid elongation in endoplasmic reticulum Fatty acid metabolism #d9a344 +M00861 beta-Oxidation, peroxisome, VLCFA Fatty acid metabolism #d9a344 +M00873 Fatty acid biosynthesis in mitochondria, animals Fatty acid metabolism #d9a344 +M00874 Fatty acid biosynthesis in mitochondria, fungi Fatty acid metabolism #d9a344 +M00055 N-glycan precursor biosynthesis Glycan biosynthesis #588cd6 +M00056 O-glycan biosynthesis, mucin type core Glycan biosynthesis #588cd6 +M00065 GPI-anchor biosynthesis, core oligosaccharide Glycan biosynthesis #588cd6 +M00068 Glycosphingolipid biosynthesis, globo-series, LacCer => Gb4Cer Glycan biosynthesis #588cd6 +M00069 Glycosphingolipid biosynthesis, ganglio series, LacCer => GT3 Glycan biosynthesis #588cd6 +M00070 Glycosphingolipid biosynthesis, lacto-series, LacCer => Lc4Cer Glycan biosynthesis #588cd6 +M00071 Glycosphingolipid biosynthesis, neolacto-series, LacCer => nLc4Cer Glycan biosynthesis #588cd6 +M00072 N-glycosylation by oligosaccharyltransferase Glycan biosynthesis #588cd6 +M00073 N-glycan precursor trimming Glycan biosynthesis #588cd6 +M00074 N-glycan biosynthesis, high-mannose type Glycan biosynthesis #588cd6 +M00075 N-glycan biosynthesis, complex type Glycan biosynthesis #588cd6 +M00872 O-glycan biosynthesis, mannose type (core M3) Glycan biosynthesis #588cd6 +M00057 Glycosaminoglycan biosynthesis, linkage tetrasaccharide Glycosaminoglycan metabolism #d66432 +M00058 Glycosaminoglycan biosynthesis, chondroitin sulfate backbone Glycosaminoglycan metabolism #d66432 +M00059 Glycosaminoglycan biosynthesis, heparan sulfate backbone Glycosaminoglycan metabolism #d66432 +M00076 Dermatan sulfate 
degradation Glycosaminoglycan metabolism #d66432 +M00077 Chondroitin sulfate degradation Glycosaminoglycan metabolism #d66432 +M00078 Heparan sulfate degradation Glycosaminoglycan metabolism #d66432 +M00079 Keratan sulfate degradation Glycosaminoglycan metabolism #d66432 +M00026 Histidine biosynthesis, PRPP => histidine Histidine metabolism #66d7bf +M00045 Histidine degradation, histidine => N-formiminoglutamate => glutamate Histidine metabolism #66d7bf +M00066 Lactosylceramide biosynthesis Lipid metabolism #d53e55 +M00067 Sulfoglycolipids biosynthesis, ceramide--1-alkyl-2-acylglycerol => sulfatide--seminolipid Lipid metabolism #d53e55 +M00088 Ketone body biosynthesis, acetyl-CoA => acetoacetate--3-hydroxybutyrate--acetone Lipid metabolism #d53e55 +M00089 Triacylglycerol biosynthesis Lipid metabolism #d53e55 +M00090 Phosphatidylcholine (PC) biosynthesis, choline => PC Lipid metabolism #d53e55 +M00091 Phosphatidylcholine (PC) biosynthesis, PE => PC Lipid metabolism #d53e55 +M00092 Phosphatidylethanolamine (PE) biosynthesis, ethanolamine => PE Lipid metabolism #d53e55 +M00093 Phosphatidylethanolamine (PE) biosynthesis, PA => PS => PE Lipid metabolism #d53e55 +M00094 Ceramide biosynthesis Lipid metabolism #d53e55 +M00098 Acylglycerol degradation Lipid metabolism #d53e55 +M00099 Sphingosine biosynthesis Lipid metabolism #d53e55 +M00100 Sphingosine degradation Lipid metabolism #d53e55 +M00113 Jasmonic acid biosynthesis Lipid metabolism #d53e55 +M00060 KDO2-lipid A biosynthesis, Raetz pathway, LpxL-LpxM type Lipopolysaccharide metabolism #83d2de +M00063 CMP-KDO biosynthesis Lipopolysaccharide metabolism #83d2de +M00064 ADP-L-glycero-D-manno-heptose biosynthesis Lipopolysaccharide metabolism #83d2de +M00866 KDO2-lipid A biosynthesis, Raetz pathway, non-LpxL-LpxM type Lipopolysaccharide metabolism #83d2de +M00867 KDO2-lipid A modification pathway Lipopolysaccharide metabolism #83d2de +M00016 Lysine biosynthesis, succinyl-DAP pathway, aspartate => lysine Lysine metabolism #d84e8b +M00030 Lysine biosynthesis, AAA pathway, 2-oxoglutarate => 2-aminoadipate => lysine Lysine metabolism #d84e8b +M00031 Lysine biosynthesis, mediated by LysW, 2-aminoadipate => lysine Lysine metabolism #d84e8b +M00032 Lysine degradation, lysine => saccharopine => acetoacetyl-CoA Lysine metabolism #d84e8b +M00433 Lysine biosynthesis, 2-oxoglutarate => 2-oxoadipate Lysine metabolism #d84e8b +M00525 Lysine biosynthesis, acetyl-DAP pathway, aspartate => lysine Lysine metabolism #d84e8b +M00526 Lysine biosynthesis, DAP dehydrogenase pathway, aspartate => lysine Lysine metabolism #d84e8b +M00527 Lysine biosynthesis, DAP aminotransferase pathway, aspartate => lysine Lysine metabolism #d84e8b +M00773 Tylosin biosynthesis, methylmalonyl-CoA + malonyl-CoA => tylactone => tylosin Macrolide biosynthesis #2e4b26 +M00774 Erythromycin biosynthesis, propanoyl-CoA + methylmalonyl-CoA => deoxyerythronolide B => erythromycin A--B Macrolide biosynthesis #2e4b26 +M00775 Oleandomycin biosynthesis, malonyl-CoA + methylmalonyl-CoA => 8,8a-deoxyoleandolide => oleandomycin Macrolide biosynthesis #2e4b26 +M00776 Pikromycin--methymycin biosynthesis, methylmalonyl-CoA + malonyl-CoA => narbonolide--10-deoxymethynolide => pikromycin--methymycin Macrolide biosynthesis #2e4b26 +M00777 Avermectin biosynthesis, 2-methylbutanoyl-CoA--isobutyryl-CoA => 6,8a-Seco-6,8a-deoxy-5-oxoavermectin 1a--1b aglycone => avermectin A1a--B1a--A1b--B1b Macrolide biosynthesis #2e4b26 +M00611 Oxygenic photosynthesis in plants and cyanobacteria Metabolic capacity #9378c3 +M00612 
Anoxygenic photosynthesis in purple bacteria Metabolic capacity #9378c3 +M00613 Anoxygenic photosynthesis in green nonsulfur bacteria Metabolic capacity #9378c3 +M00614 Anoxygenic photosynthesis in green sulfur bacteria Metabolic capacity #9378c3 +M00615 Nitrate assimilation Metabolic capacity #9378c3 +M00616 Sulfate-sulfur assimilation Metabolic capacity #9378c3 +M00617 Methanogen Metabolic capacity #9378c3 +M00618 Acetogen Metabolic capacity #9378c3 +M00174 Methane oxidation, methanotroph, methane => formaldehyde Methane metabolism #9e7336 +M00344 Formaldehyde assimilation, xylulose monophosphate pathway Methane metabolism #9e7336 +M00345 Formaldehyde assimilation, ribulose monophosphate pathway Methane metabolism #9e7336 +M00346 Formaldehyde assimilation, serine pathway Methane metabolism #9e7336 +M00356 Methanogenesis, methanol => methane Methane metabolism #9e7336 +M00357 Methanogenesis, acetate => methane Methane metabolism #9e7336 +M00358 Coenzyme M biosynthesis Methane metabolism #9e7336 +M00378 F420 biosynthesis Methane metabolism #9e7336 +M00422 Acetyl-CoA pathway, CO2 => acetyl-CoA Methane metabolism #9e7336 +M00563 Methanogenesis, methylamine--dimethylamine--trimethylamine => methane Methane metabolism #9e7336 +M00567 Methanogenesis, CO2 => methane Methane metabolism #9e7336 +M00608 2-Oxocarboxylic acid chain extension, 2-oxoglutarate => 2-oxoadipate => 2-oxopimelate => 2-oxosuberate Methane metabolism #9e7336 +M00175 Nitrogen fixation, nitrogen => ammonia Nitrogen metabolism #2c2351 +M00528 Nitrification, ammonia => nitrite Nitrogen metabolism #2c2351 +M00529 Denitrification, nitrate => nitrogen Nitrogen metabolism #2c2351 +M00530 Dissimilatory nitrate reduction, nitrate => ammonia Nitrogen metabolism #2c2351 +M00531 Assimilatory nitrate reduction, nitrate => ammonia Nitrogen metabolism #2c2351 +M00804 Complete nitrification, comammox, ammonia => nitrite => nitrate Nitrogen metabolism #2c2351 +M00027 GABA (gamma-Aminobutyrate) shunt Other amino acid metabolism #c5d7a9 +M00118 Glutathione biosynthesis, glutamate => glutathione Other amino acid metabolism #c5d7a9 +M00369 Cyanogenic glycoside biosynthesis, tyrosine => dhurrin Other amino acid metabolism #c5d7a9 +M00012 Glyoxylate cycle Other carbohydrate metabolism #872b4e +M00013 Malonate semialdehyde pathway, propanoyl-CoA => acetyl-CoA Other carbohydrate metabolism #872b4e +M00014 Glucuronate pathway (uronate pathway) Other carbohydrate metabolism #872b4e +M00061 D-Glucuronate degradation, D-glucuronate => pyruvate + D-glyceraldehyde 3P Other carbohydrate metabolism #872b4e +M00081 Pectin degradation Other carbohydrate metabolism #872b4e +M00114 Ascorbate biosynthesis, plants, glucose-6P => ascorbate Other carbohydrate metabolism #872b4e +M00129 Ascorbate biosynthesis, animals, glucose-1P => ascorbate Other carbohydrate metabolism #872b4e +M00130 Inositol phosphate metabolism, PI=> PIP2 => Ins(1,4,5)P3 => Ins(1,3,4,5)P4 Other carbohydrate metabolism #872b4e +M00131 Inositol phosphate metabolism, Ins(1,3,4,5)P4 => Ins(1,3,4)P3 => myo-inositol Other carbohydrate metabolism #872b4e +M00132 Inositol phosphate metabolism, Ins(1,3,4)P3 => phytate Other carbohydrate metabolism #872b4e +M00373 Ethylmalonyl pathway Other carbohydrate metabolism #872b4e +M00532 Photorespiration Other carbohydrate metabolism #872b4e +M00549 Nucleotide sugar biosynthesis, glucose => UDP-glucose Other carbohydrate metabolism #872b4e +M00550 Ascorbate degradation, ascorbate => D-xylulose-5P Other carbohydrate metabolism #872b4e +M00552 D-galactonate 
degradation, De Ley-Doudoroff pathway, D-galactonate => glycerate-3P Other carbohydrate metabolism #872b4e +M00554 Nucleotide sugar biosynthesis, galactose => UDP-galactose Other carbohydrate metabolism #872b4e +M00565 Trehalose biosynthesis, D-glucose 1P => trehalose Other carbohydrate metabolism #872b4e +M00630 D-Galacturonate degradation (fungi), D-galacturonate => glycerol Other carbohydrate metabolism #872b4e +M00631 D-Galacturonate degradation (bacteria), D-galacturonate => pyruvate + D-glyceraldehyde 3P Other carbohydrate metabolism #872b4e +M00632 Galactose degradation, Leloir pathway, galactose => alpha-D-glucose-1P Other carbohydrate metabolism #872b4e +M00740 Methylaspartate cycle Other carbohydrate metabolism #872b4e +M00741 Propanoyl-CoA metabolism, propanoyl-CoA => succinyl-CoA Other carbohydrate metabolism #872b4e +M00761 Undecaprenylphosphate alpha-L-Ara4N biosynthesis, UDP-GlcA => undecaprenyl phosphate alpha-L-Ara4N Other carbohydrate metabolism #872b4e +M00854 Glycogen biosynthesis, glucose-1P => glycogen--starch Other carbohydrate metabolism #872b4e +M00855 Glycogen degradation, glycogen => glucose-6P Other carbohydrate metabolism #872b4e +M00097 beta-Carotene biosynthesis, GGAP => beta-carotene Other terpenoid biosynthesis #6e9368 +M00371 Castasterone biosynthesis, campesterol => castasterone Other terpenoid biosynthesis #6e9368 +M00372 Abscisic acid biosynthesis, beta-carotene => abscisic acid Other terpenoid biosynthesis #6e9368 +M00363 EHEC pathogenicity signature, Shiga toxin Pathogenicity #66406d +M00542 EHEC--EPEC pathogenicity signature, T3SS and effectors Pathogenicity #66406d +M00564 Helicobacter pylori pathogenicity signature, cagA pathogenicity island Pathogenicity #66406d +M00574 Pertussis pathogenicity signature, pertussis toxin Pathogenicity #66406d +M00575 Pertussis pathogenicity signature, T1SS Pathogenicity #66406d +M00576 ETEC pathogenicity signature, heat-labile and heat-stable enterotoxins Pathogenicity #66406d +M00850 Vibrio cholerae pathogenicity signature, cholera toxins Pathogenicity #66406d +M00852 Vibrio cholerae pathogenicity signature, toxin coregulated pilus Pathogenicity #66406d +M00853 ETEC pathogenicity signature, colonization factors Pathogenicity #66406d +M00856 Salmonella enterica pathogenicity signature, typhoid toxin Pathogenicity #66406d +M00857 Salmonella enterica pathogenicity signature, Vi antigen Pathogenicity #66406d +M00859 Bacillus anthracis pathogenicity signature, anthrax toxin Pathogenicity #66406d +M00860 Bacillus anthracis pathogenicity signature, polyglutamic acid capsule biosynthesis Pathogenicity #66406d +M00161 Photosystem II Photosynthesis #cfa68a +M00163 Photosystem I Photosynthesis #cfa68a +M00597 Anoxygenic photosystem II [BR:ko00194] Photosynthesis #cfa68a +M00598 Anoxygenic photosystem I [BR:ko00194] Photosynthesis #cfa68a +M00660 Xanthomonas spp. 
pathogenicity signature, T3SS and effectors Plant pathogenicity #461d27 +M00133 Polyamine biosynthesis, arginine => agmatine => putrescine => spermidine Polyamine biosynthesis #a5b3da +M00134 Polyamine biosynthesis, arginine => ornithine => putrescine Polyamine biosynthesis #a5b3da +M00135 GABA biosynthesis, eukaryotes, putrescine => GABA Polyamine biosynthesis #a5b3da +M00136 GABA biosynthesis, prokaryotes, putrescine => GABA Polyamine biosynthesis #a5b3da +M00793 dTDP-L-rhamnose biosynthesis Polyketide sugar unit biosynthesis #5c4f24 +M00794 dTDP-6-deoxy-D-allose biosynthesis Polyketide sugar unit biosynthesis #5c4f24 +M00795 dTDP-beta-L-noviose biosynthesis Polyketide sugar unit biosynthesis #5c4f24 +M00796 dTDP-D-mycaminose biosynthesis Polyketide sugar unit biosynthesis #5c4f24 +M00797 dTDP-D-desosamine biosynthesis Polyketide sugar unit biosynthesis #5c4f24 +M00798 dTDP-L-mycarose biosynthesis Polyketide sugar unit biosynthesis #5c4f24 +M00799 dTDP-L-oleandrose biosynthesis Polyketide sugar unit biosynthesis #5c4f24 +M00800 dTDP-L-megosamine biosynthesis Polyketide sugar unit biosynthesis #5c4f24 +M00801 dTDP-L-olivose biosynthesis Polyketide sugar unit biosynthesis #5c4f24 +M00802 dTDP-D-forosamine biosynthesis Polyketide sugar unit biosynthesis #5c4f24 +M00803 dTDP-D-angolosamine biosynthesis Polyketide sugar unit biosynthesis #5c4f24 +M00048 Inosine monophosphate biosynthesis, PRPP + glutamine => IMP Purine metabolism #e0a7d2 +M00049 Adenine ribonucleotide biosynthesis, IMP => ADP,ATP Purine metabolism #e0a7d2 +M00050 Guanine ribonucleotide biosynthesis IMP => GDP,GTP Purine metabolism #e0a7d2 +M00546 Purine degradation, xanthine => urea Purine metabolism #e0a7d2 +M00046 Pyrimidine degradation, uracil => beta-alanine, thymine => 3-aminoisobutanoate Pyrimidine metabolism #25585e +M00051 Uridine monophosphate biosynthesis, glutamine (+ PRPP) => UMP Pyrimidine metabolism #25585e +M00052 Pyrimidine ribonucleotide biosynthesis, UMP => UDP--UTP,CDP--CTP Pyrimidine metabolism #25585e +M00053 Pyrimidine deoxyribonuleotide biosynthesis, CDP--CTP => dCDP--dCTP,dTDP--dTTP Pyrimidine metabolism #25585e +M00018 Threonine biosynthesis, aspartate => homoserine => threonine Serine and threonine metabolism #de7d78 +M00020 Serine biosynthesis, glycerate-3P => serine Serine and threonine metabolism #de7d78 +M00033 Ectoine biosynthesis, aspartate => ectoine Serine and threonine metabolism #de7d78 +M00555 Betaine biosynthesis, choline => betaine Serine and threonine metabolism #de7d78 +M00101 Cholesterol biosynthesis, squalene 2,3-epoxide => cholesterol Sterol biosynthesis #4e96a2 +M00102 Ergocalciferol biosynthesis Sterol biosynthesis #4e96a2 +M00103 Cholecalciferol biosynthesis Sterol biosynthesis #4e96a2 +M00104 Bile acid biosynthesis, cholesterol => cholate--chenodeoxycholate Sterol biosynthesis #4e96a2 +M00106 Conjugated bile acid biosynthesis, cholate => taurocholate--glycocholate Sterol biosynthesis #4e96a2 +M00107 Steroid hormone biosynthesis, cholesterol => prognenolone => progesterone Sterol biosynthesis #4e96a2 +M00108 C21-Steroid hormone biosynthesis, progesterone => corticosterone--aldosterone Sterol biosynthesis #4e96a2 +M00109 C21-Steroid hormone biosynthesis, progesterone => cortisol--cortisone Sterol biosynthesis #4e96a2 +M00110 C19--C18-Steroid hormone biosynthesis, pregnenolone => androstenedione => estrone Sterol biosynthesis #4e96a2 +M00862 beta-Oxidation, peroxisome, tri--dihydroxycholestanoyl-CoA => choloyl--chenodeoxycholoyl-CoA Sterol biosynthesis #4e96a2 +M00176 Assimilatory 
sulfate reduction, sulfate => H2S Sulfur metabolism #4e96a2 +M00595 Thiosulfate oxidation by SOX complex, thiosulfate => sulfate Sulfur metabolism #4e96a2 +M00596 Dissimilatory sulfate reduction, sulfate => H2S Sulfur metabolism #4e96a2 +M00664 Nodulation Symbiosis #88574e +M00095 C5 isoprenoid biosynthesis, mevalonate pathway Terpenoid backbone biosynthesis #4e6089 +M00096 C5 isoprenoid biosynthesis, non-mevalonate pathway Terpenoid backbone biosynthesis #4e6089 +M00364 C10-C20 isoprenoid biosynthesis, bacteria Terpenoid backbone biosynthesis #4e6089 +M00365 C10-C20 isoprenoid biosynthesis, archaea Terpenoid backbone biosynthesis #4e6089 +M00366 C10-C20 isoprenoid biosynthesis, plants Terpenoid backbone biosynthesis #4e6089 +M00367 C10-C20 isoprenoid biosynthesis, non-plant eukaryotes Terpenoid backbone biosynthesis #4e6089 +M00849 C5 isoprenoid biosynthesis, mevalonate pathway, archaea Terpenoid backbone biosynthesis #4e6089 +M00778 Type II polyketide backbone biosynthesis, acyl-CoA + malonyl-CoA => polyketide Type II polyketide biosynthesis #af7194 +M00779 Dihydrokalafungin biosynthesis, octaketide => dihydrokalafungin Type II polyketide biosynthesis #af7194 +M00780 Tetracycline--oxytetracycline biosynthesis, pretetramide => tetracycline--oxytetracycline Type II polyketide biosynthesis #af7194 +M00781 Nogalavinone--aklavinone biosynthesis, deoxynogalonate--deoxyaklanonate => nogalavinone--aklavinone Type II polyketide biosynthesis #af7194 +M00782 Mithramycin biosynthesis, 4-demethylpremithramycinone => mithramycin Type II polyketide biosynthesis #af7194 +M00783 Tetracenomycin C--8-demethyltetracenomycin C biosynthesis, tetracenomycin F2 => tetracenomycin C--8-demethyltetracenomycin C Type II polyketide biosynthesis #af7194 +M00784 Elloramycin biosynthesis, 8-demethyltetracenomycin C => elloramycin A Type II polyketide biosynthesis #af7194 +M00823 Chlortetracycline biosynthesis, pretetramide => chlortetracycline Type II polyketide biosynthesis #af7194 \ No newline at end of file diff --git a/data/MicrobeAnnotator_KEGG/01.KEGG_DB/01.Bifurcating_List.txt b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/01.Bifurcating_List.txt new file mode 100644 index 0000000..8d909f9 --- /dev/null +++ b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/01.Bifurcating_List.txt @@ -0,0 +1,23 @@ +M00373 +M00532 +M00376 +M00378 +M00088 +M00031 +M00763 +M00133 +M00075 +M00872 +M00125 +M00119 +M00122 +M00827 +M00828 +M00832 +M00833 +M00837 +M00838 +M00785 +M00307 +M00048 +M00127 \ No newline at end of file diff --git a/data/MicrobeAnnotator_KEGG/01.KEGG_DB/02.Structural_List.txt b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/02.Structural_List.txt new file mode 100644 index 0000000..7fbba00 --- /dev/null +++ b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/02.Structural_List.txt @@ -0,0 +1,10 @@ +M00144 +M00149 +M00151 +M00152 +M00154 +M00155 +M00153 +M00156 +M00158 +M00160 \ No newline at end of file diff --git a/data/MicrobeAnnotator_KEGG/01.KEGG_DB/03.Bifurcating_Modules.dict b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/03.Bifurcating_Modules.dict new file mode 100644 index 0000000..a09f7b4 --- /dev/null +++ b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/03.Bifurcating_Modules.dict @@ -0,0 +1 @@ 
+{'M00373':{'M00373_1':{1:'K00626',2:'K00023',3:'K17865',4:'K14446',5:'K14447',6:'K14448',7:'K14449',8:'K08691',9:'K14451'},'M00373_2':{1:'K00626',2:'K00023',3:'K17865',4:'K14446',5:'K14447',6:'K14448',7:'K14449',8:'K01965+K01966',9:'K05606',10:'K01847'}},'M00532':{'M00532_1':{1:'K01601-K01602',2:'K19269',3:'K11517',4:'K03781',5:'K14272',6:'K00600',7:'K00830',8:'K15893,K15919',9:'K15918'},'M00532_2':{1:'K01601-K01602',2:'K19269',3:'K11517',4:'K03781',5:'K14272',6:'K00600',7:'K00830',8:'K00281+K00605+K00382+K02437'}},'M00376':{'M00376_1':{1:'K02160+K01961+K01962+K01963',2:'K14468',3:'K14469',4:'K15052',5:'K05606',6:['K01847','K01848+K01849'],7:'K14471+K14472',8:'K00239+K00240+K00241',9:'K01679'},'M00376_2':{1:'K02160+K01961+K01962+K01963',2:'K14468',3:'K14469',4:'K08691',5:'K14449',6:'K14470',7:'K09709'}},'M00378':{'M00378_1':{1:['K11779','K11780+K11781'],2:'K11212',3:'K12234'},'M00378_2':{1:'K14941',2:'K11212',3:'K12234'}},'M00088':{'M00088_1':{1:'K00626',2:'K01641',3:'K01640',4:'K00019'},'M00088_2':{1:'K00626',2:'K01641',3:'K01640',4:'K01574'}},'M00031':{'M00031_1':{1:'K05826',2:'K05827',3:'K05828',4:'K05829',5:'K05830',6:'K05831'}},'M00763':{'M00763_1':{1:'K05826',2:'K19412',3:'K05828',4:'K05829',5:'K05830',6:'K05831'}},'M00133':{'M00133_1':{1:'K01583,K01584,K01585,K02626',2:'K01480',3:'K01611'},'M00133_2':{1:'K00797',2:'K01611'}},'M00075':{'M00075_1':{1:'K01231',2:'K00736',3:'K00737'},'M00075_2':{1:'K01231',2:'K00736',3:'K00738',4:'K00744,K09661',5:'K13748'},'M00075_3':{1:'K01231',2:'K00736',3:'K00717',4:'K07966,K07967,K07968',5:'K00778,K00779'}},'M00872':{'M00872_1':{1:'K00728',2:'K18207',3:'K09654',4:'K17547',5:'K19872',6:'K19873',7:'K21052',8:'K21032',9:'K09668'},'M00872_2':{1:'K21031',2:'K19872',3:'K19873',4:'K21052',5:'K21032',6:'K09668'}},'M00125':{'M00125_1':{1:'K01497,K14652',2:['K01498_K00082','K11752'],3:'K22912,K20860,K20861,K20862,K21063,K21064',4:'K00794',5:'K00793',6:['K00861,K20884_K00953,K22949','K11753']},'M00125_2':{1:'K02858,K14652',2:'K00794',3:'K00793',4:['K00861,K20884_K00953,K22949','K11753']}},'M00119':{'M00119_1':{1:'K00826',2:'K00606',3:'K00077',4:'K01918,K13799'},'M00119_2':{1:'K01579',2:'K01918,K13799'}},'M00122':{'M00122_1':{1:'K00798,K19221',2:'K02232',3:'K02225,K02227',4:'K02231',5:'K02233'},'M00122_2':{1:'K00768',2:'K02226,K22316',3:'K02233'}},'M00827':{'M00827_1':{1:'K21183',2:'K21181',3:'K21182',4:'K16431',5:'K21184',6:'K21185'}},'M00828':{'M00828_1':{1:'K21183',2:'K21181',3:'K21182',4:'K21188'}},'M00832':{'M00832_1':{1:'K21183',2:'K21227',3:'K21228',4:'K16431',5:'K21185'}},'M00833':{'M00833_1':{1:'K21254',2:'K21255',3:'K21256',4:'K21257',5:'K21258',6:'K21261',7:'K21262',8:'K21263'},'M00833_2':{1:'K21259',2:'K21260',3:'K21261',4:'K21262',5:'K21263'}},'M00837':{'M00837_1':{1:'K21780+K21781',2:'K21782',3:'K21783',4:'K21784',5:'K21785',6:'K21786',7:'K21787'},'M00837_2':{1:'K21428',2:'K21778',3:'K21779',4:'K21787'}},'M00838':{'M00838_1':{1:'K21780+K21781',2:'K21782',3:'K21783',4:'K21784',5:'K21785',6:'K21786',7:'K21787'},'M00837_2':{1:'K21791',2:'K21792',3:'K21793',4:'K21787'}},'M00785':{'M00785_1':{1:'K19741',2:'K19723',3:'K19725',4:'K19724',5:'K19727'},'M00785_2':{1:'K19726',2:'K19725',3:'K19724',4:'K19727'}},'M00307':{'M00307_1':{1:'K03737'},'M00307_2':{1:'K00169+K00170+K00171+K00172'},'M00307_3':{1:'K00161+K00162+K00627+K00382-K13997'},'M00307_4':{1:'K00163+K00627+K00382-K13997'}},'M00048':{'M00048_1':{1:'K00764',2:'K01945,K11787,K11788,K13713',3:'K00601,K11175,K08289,K11787,K01492',4:['K01952','K23269+K23264+K23265','K23270+K23265'], 
5:'K01933,K11787',6:'K01923,K01587,K13713',7:'K01756',8:['K00602','K01492','K06863_K11176']}, 'M00048_2':{1:'K00764',2:'K01945,K11787,K11788,K13713',3:'K00601,K11175,K08289,K11787,K01492',4:['K01952','K23269+K23264+K23265','K23270+K23265'],5:'K11788',6:['K01587','K11808','K01589_K01588'],7:'K01923,K01587,K13713',8:'K01756',9:['K00602','K01492','K06863_K11176']}},'M00127':{'M00127_1':{1:'K03147',2:'K00877,K00941,K14153',3:'K00788,K14153,K14154',4:'K00946'},'M00127_2':{1:'K00878,K14154',2:'K00788,K14153,K14154',3:'K00946'}}} \ No newline at end of file diff --git a/data/MicrobeAnnotator_KEGG/01.KEGG_DB/04.Structural_Modules.dict b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/04.Structural_Modules.dict new file mode 100644 index 0000000..b9aa2e8 --- /dev/null +++ b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/04.Structural_Modules.dict @@ -0,0 +1 @@ +{'M00144':['K00330', 'K00331+K00332+K00333,K00331+K13378,K13380','K00334+K00335+K00336+K00337+K00338+K00339+K00340','K00341+K00342,K15863','K00343'],'M00149':['K00241','K00242,K18859,K18860','K00239+K00240'],'M00151':[['K03890+K03891+K03889','K03886+K03887+K03888','K00412+K00413,K00410_K00411']],'M00152':['K00412+K00413,K00410','K00411+K00414+K00415+K00416+K00417+K00418+K00419+K00420'],'M00154':['K02257+K02262+K02256+K02261+K02263+K02264+K02265+K02266+K02267+K02268','K02269,K02270-K02271','K02272-K02273+K02258+K02259+K02260'],'M00155':['K02275','K02274+K02276,K15408','K02277'],'M00153':['K00425+K00426','K00424,K22501'],'M00156':['K00404+K00405,K15862','K00407+K00406'],'M00158':['K02132+K02133+K02136+K02134+K02135+K02137+K02126+K02127+K02128+K02138','K02129,K01549','K02130,K02139','K02140','K02141,K02131','K02142-K02143+K02125'],'M00160':['K02145+K02147+K02148+K02149+K02150+K02151+K02152+K02144+K02154','K03661,K02155','K02146+K02153+K03662']} \ No newline at end of file diff --git a/data/MicrobeAnnotator_KEGG/01.KEGG_DB/05.Modules_Parsed.txt b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/05.Modules_Parsed.txt new file mode 100644 index 0000000..c8229a5 --- /dev/null +++ b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/05.Modules_Parsed.txt @@ -0,0 +1,3343 @@ +M00001 +(K00844,K12407,K00845,K00886,K08074,K00918) (K01810,K06859,K13810,K15916) (K00850,K16370,K00918) (K01623,K01624,K11645,K16305,K16306) K01803 ((K00134,K00150) K00927,K11389) (K01834,K15633,K15634,K15635) K01689 (K00873,K12406) +['(K00844,K12407,K00845,K00886,K08074,K00918)', '(K01810,K06859,K13810,K15916)', '(K00850,K16370,K00918)', '(K01623,K01624,K11645,K16305,K16306)', 'K01803', '((K00134,K00150)_K00927,K11389)', '(K01834,K15633,K15634,K15635)', 'K01689', '(K00873,K12406)'] +== +['K00844', 'K12407', 'K00845', 'K00886', 'K08074', 'K00918'] +['K01810', 'K06859', 'K13810', 'K15916'] +['K00850', 'K16370', 'K00918'] +['K01623', 'K01624', 'K11645', 'K16305', 'K16306'] +['K01803'] +['K11389', 'K00134,K00150_K00927'] +['K01834', 'K15633', 'K15634', 'K15635'] +['K01689'] +['K00873', 'K12406'] +++++++++++++++++++ +M00002 +K01803 ((K00134,K00150) K00927,K11389) (K01834,K15633,K15634,K15635) K01689 (K00873,K12406) +['K01803', '((K00134,K00150)_K00927,K11389)', '(K01834,K15633,K15634,K15635)', 'K01689', '(K00873,K12406)'] +== +['K01803'] +['K11389', 'K00134,K00150_K00927'] +['K01834', 'K15633', 'K15634', 'K15635'] +['K01689'] +['K00873', 'K12406'] +++++++++++++++++++ +M00003 +(K01596,K01610) K01689 (K01834,K15633,K15634,K15635) K00927 (K00134,K00150) K01803 ((K01623,K01624,K11645) (K03841,K02446,K11532,K01086,K04041),K01622) +['(K01596,K01610)', 'K01689', '(K01834,K15633,K15634,K15635)', 'K00927', '(K00134,K00150)', 
'K01803', '((K01623,K01624,K11645)_(K03841,K02446,K11532,K01086,K04041),K01622)'] +== +['K01596', 'K01610'] +['K01689'] +['K01834', 'K15633', 'K15634', 'K15635'] +['K00927'] +['K00134', 'K00150'] +['K01803'] +['K01622', 'K01623,K01624,K11645_K03841,K02446,K11532,K01086,K04041'] +++++++++++++++++++ +M00009 +(K01647,K05942) (K01681,K01682) (K00031,K00030) (K00164+K00658+K00382,K00174+K00175-K00177-K00176) (K01902+K01903,K01899+K01900,K18118) (K00234+K00235+K00236+K00237,K00239+K00240+K00241-(K00242,K18859,K18860),K00244+K00245+K00246-K00247) (K01676,K01679,K01677+K01678) (K00026,K00025,K00024,K00116) +['(K01647,K05942)', '(K01681,K01682)', '(K00031,K00030)', '(K00164+K00658+K00382,K00174+K00175-K00177-K00176)', '(K01902+K01903,K01899+K01900,K18118)', '(K00234+K00235+K00236+K00237,K00239+K00240+K00241-(K00242,K18859,K18860),K00244+K00245+K00246-K00247)', '(K01676,K01679,K01677+K01678)', '(K00026,K00025,K00024,K00116)'] +== +['K01647', 'K05942'] +['K01681', 'K01682'] +['K00031', 'K00030'] +['K00164+K00658+K00382', 'K00174+K00175-K00177-K00176'] +['K18118', 'K01902+K01903', 'K01899+K01900'] +['K00234+K00235+K00236+K00237', 'K00244+K00245+K00246-K00247', 'K00239+K00240+K00241-%K00242,K18859,K18860)'] +['K01676', 'K01679', 'K01677+K01678'] +['K00026', 'K00025', 'K00024', 'K00116'] +++++++++++++++++++ +M00010 +(K01647,K05942) (K01681,K01682) (K00031,K00030) +['(K01647,K05942)', '(K01681,K01682)', '(K00031,K00030)'] +== +['K01647', 'K05942'] +['K01681', 'K01682'] +['K00031', 'K00030'] +++++++++++++++++++ +M00011 +(K00164+K00658+K00382,K00174+K00175-K00177-K00176) (K01902+K01903,K01899+K01900,K18118) (K00234+K00235+K00236+K00237,K00239+K00240+K00241-(K00242,K18859,K18860),K00244+K00245+K00246-K00247) (K01676,K01679,K01677+K01678) (K00026,K00025,K00024,K00116) +['(K00164+K00658+K00382,K00174+K00175-K00177-K00176)', '(K01902+K01903,K01899+K01900,K18118)', '(K00234+K00235+K00236+K00237,K00239+K00240+K00241-(K00242,K18859,K18860),K00244+K00245+K00246-K00247)', '(K01676,K01679,K01677+K01678)', '(K00026,K00025,K00024,K00116)'] +== +['K00164+K00658+K00382', 'K00174+K00175-K00177-K00176'] +['K18118', 'K01902+K01903', 'K01899+K01900'] +['K00234+K00235+K00236+K00237', 'K00244+K00245+K00246-K00247', 'K00239+K00240+K00241-%K00242,K18859,K18860)'] +['K01676', 'K01679', 'K01677+K01678'] +['K00026', 'K00025', 'K00024', 'K00116'] +++++++++++++++++++ +M00004 +(K13937,((K00036,K19243) (K01057,K07404))) K00033 K01783 (K01807,K01808) K00615 K00616 (K01810,K06859,K13810,K15916) +['(K13937,((K00036,K19243)_(K01057,K07404)))', 'K00033', 'K01783', '(K01807,K01808)', 'K00615', 'K00616', '(K01810,K06859,K13810,K15916)'] +== +['K13937', 'K00036,K19243_K01057,K07404'] +['K00033'] +['K01783'] +['K01807', 'K01808'] +['K00615'] +['K00616'] +['K01810', 'K06859', 'K13810', 'K15916'] +++++++++++++++++++ +M00006 +(K13937,((K00036,K19243) (K01057,K07404))) K00033 +['(K13937,((K00036,K19243)_(K01057,K07404)))', 'K00033'] +== +['K13937', 'K00036,K19243_K01057,K07404'] +['K00033'] +++++++++++++++++++ +M00007 +K00615 (K00616,K13810) K01783 (K01807,K01808) +['K00615', '(K00616,K13810)', 'K01783', '(K01807,K01808)'] +== +['K00615'] +['K00616', 'K13810'] +['K01783'] +['K01807', 'K01808'] +++++++++++++++++++ +M00580 +(K08094 (K08093,K13812),K13831) K01807 +['(K08094_(K08093,K13812),K13831)', 'K01807'] +== +['K13831', 'K08094_K08093,K13812'] +['K01807'] +++++++++++++++++++ +M00005 +K00948 +['K00948'] +== +['K00948'] +++++++++++++++++++ +M00008 +K00036 (K01057,K07404) K01690 K01625 +['K00036', '(K01057,K07404)', 'K01690', 'K01625'] +== 
+['K00036'] +['K01057', 'K07404'] +['K01690'] +['K01625'] +++++++++++++++++++ +M00308 +K05308 K00874 K01625 (K00134 K00927,K00131,K18978) +['K05308', 'K00874', 'K01625', '(K00134_K00927,K00131,K18978)'] +== +['K05308'] +['K00874'] +['K01625'] +['K00131', 'K18978', 'K00134_K00927'] +++++++++++++++++++ +M00633 +K05308 K18126 K11395 (K00131,K18978) +['K05308', 'K18126', 'K11395', '(K00131,K18978)'] +== +['K05308'] +['K18126'] +['K11395'] +['K00131', 'K18978'] +++++++++++++++++++ +M00309 +K05308 (K11395,K18127) (K18020+K18021+K18022,K18128,K03738) +['K05308', '(K11395,K18127)', '(K18020+K18021+K18022,K18128,K03738)'] +== +['K05308'] +['K11395', 'K18127'] +['K18128', 'K03738', 'K18020+K18021+K18022'] +++++++++++++++++++ +M00014 +K00012 ((K12447 K16190),(K00699 (K01195,K14756))) K00002 K13247 -- K03331 (K05351,K00008) K00854 +['K00012', '((K12447_K16190),(K00699_(K01195,K14756)))', 'K00002', 'K13247', 'K03331', '(K05351,K00008)', 'K00854'] +== +['K00012'] +['K12447_K16190', 'K00699_K01195,K14756'] +['K00002'] +['K13247'] +['K03331'] +['K05351', 'K00008'] +['K00854'] +++++++++++++++++++ +M00630 +(K18106,K19634) K18102 K18103 K18107 +['(K18106,K19634)', 'K18102', 'K18103', 'K18107'] +== +['K18106', 'K19634'] +['K18102'] +['K18103'] +['K18107'] +++++++++++++++++++ +M00631 +K01812 K00041 (K01685,K16849+K16850) K00874 (K01625,K17463) +['K01812', 'K00041', '(K01685,K16849+K16850)', 'K00874', '(K01625,K17463)'] +== +['K01812'] +['K00041'] +['K01685', 'K16849+K16850'] +['K00874'] +['K01625', 'K17463'] +++++++++++++++++++ +M00061 +K01812 K00040 (K01686,K08323) K00874 (K01625,K17463) +['K01812', 'K00040', '(K01686,K08323)', 'K00874', '(K01625,K17463)'] +== +['K01812'] +['K00040'] +['K01686', 'K08323'] +['K00874'] +['K01625', 'K17463'] +++++++++++++++++++ +M00081 +K01051 K01184 K01213 +['K01051', 'K01184', 'K01213'] +== +['K01051'] +['K01184'] +['K01213'] +++++++++++++++++++ +M00632 +K01785 K00849 K00965 K01784 +['K01785', 'K00849', 'K00965', 'K01784'] +== +['K01785'] +['K00849'] +['K00965'] +['K01784'] +++++++++++++++++++ +M00552 +K01684 K00883 K01631 K00134 K00927 +['K01684', 'K00883', 'K01631', 'K00134', 'K00927'] +== +['K01684'] +['K00883'] +['K01631'] +['K00134'] +['K00927'] +++++++++++++++++++ +M00129 +K00963 K00012 K00699 (K01195,K14756) K00002 K01053 K00103 +['K00963', 'K00012', 'K00699', '(K01195,K14756)', 'K00002', 'K01053', 'K00103'] +== +['K00963'] +['K00012'] +['K00699'] +['K01195', 'K14756'] +['K00002'] +['K01053'] +['K00103'] +++++++++++++++++++ +M00114 +((K01810,K06859,K13810) (K01809,K16011),K15916) (K16881,(K17497,K01840,K15778) (K00966,K00971,K16011)) K10046 K14190 (K10047,K18649) (K00064,K17744) K00225 +['((K01810,K06859,K13810)_(K01809,K16011),K15916)', '(K16881,(K17497,K01840,K15778)_(K00966,K00971,K16011))', 'K10046', 'K14190', '(K10047,K18649)', '(K00064,K17744)', 'K00225'] +== +['K15916', 'K01810,K06859,K13810_K01809,K16011'] +['K16881', 'K17497,K01840,K15778_K00966,K00971,K16011'] +['K10046'] +['K14190'] +['K10047', 'K18649'] +['K00064', 'K17744'] +['K00225'] +++++++++++++++++++ +M00550 +K02821+K02822+K03475 K03476 K03078 K03079 K03077 +['K02821+K02822+K03475', 'K03476', 'K03078', 'K03079', 'K03077'] +== +['K02821+K02822+K03475'] +['K03476'] +['K03078'] +['K03079'] +['K03077'] +++++++++++++++++++ +M00854 +(K00963 (K00693+K00750,K16150,K16153,K13679,K20812)),(K00975 (K00703,K13679,K20812)) (K00700,K16149) +['(K00963_(K00693+K00750,K16150,K16153,K13679,K20812)),(K00975_(K00703,K13679,K20812))', '(K00700,K16149)'] +== +['K00975_K00703,K13679,K20812', 
'K00963_K00693+K00750,K16150,K16153,K13679,K20812'] +['K00700', 'K16149'] +++++++++++++++++++ +M00855 +(K00688,K16153) (K01196,((K00705,K22451) (K02438,K01200))) (K15779,K01835,K15778) +['(K00688,K16153)', '(K01196,((K00705,K22451)_(K02438,K01200)))', '(K15779,K01835,K15778)'] +== +['K00688', 'K16153'] +['K01196', 'K00705,K22451_K02438,K01200'] +['K15779', 'K01835', 'K15778'] +++++++++++++++++++ +M00565 +K00975 K00703 (K00700,K16149) K01214 K06044 K01236 +['K00975', 'K00703', '(K00700,K16149)', 'K01214', 'K06044', 'K01236'] +== +['K00975'] +['K00703'] +['K00700', 'K16149'] +['K01214'] +['K06044'] +['K01236'] +++++++++++++++++++ +M00549 +(K00844,K00845,K12407,K00886) K01835 K00963 +['(K00844,K00845,K12407,K00886)', 'K01835', 'K00963'] +== +['K00844', 'K00845', 'K12407', 'K00886'] +['K01835'] +['K00963'] +++++++++++++++++++ +M00554 +K00849 K00965 +['K00849', 'K00965'] +== +['K00849'] +['K00965'] +++++++++++++++++++ +M00761 +K10011 K07806 K10012 K13014 +['K10011', 'K07806', 'K10012', 'K13014'] +== +['K10011'] +['K07806'] +['K10012'] +['K13014'] +++++++++++++++++++ +M00012 +K01647 (K01681,K01682) K01637 (K01638,K19282) (K00026,K00025,K00024) +['K01647', '(K01681,K01682)', 'K01637', '(K01638,K19282)', '(K00026,K00025,K00024)'] +== +['K01647'] +['K01681', 'K01682'] +['K01637'] +['K01638', 'K19282'] +['K00026', 'K00025', 'K00024'] +++++++++++++++++++ +M00740 +K01647 K01681 K00031 K00261 K19268+K01846 K04835 K19280 K14449 K19281 K19282 K00024 +['K01647', 'K01681', 'K00031', 'K00261', 'K19268+K01846', 'K04835', 'K19280', 'K14449', 'K19281', 'K19282', 'K00024'] +== +['K01647'] +['K01681'] +['K00031'] +['K00261'] +['K19268+K01846'] +['K04835'] +['K19280'] +['K14449'] +['K19281'] +['K19282'] +['K00024'] +++++++++++++++++++ +M00013 +(K00248,K00232) (K07511,K07514,K07515,K14729) K05605 K23146 K00140 +['(K00248,K00232)', '(K07511,K07514,K07515,K14729)', 'K05605', 'K23146', 'K00140'] +== +['K00248', 'K00232'] +['K07511', 'K07514', 'K07515', 'K14729'] +['K05605'] +['K23146'] +['K00140'] +++++++++++++++++++ +M00741 +(K01965+K01966,K11263+(K18472,K19312+K22568),K01964+K15036+K15037) K05606 (K01847,K01848+K01849) +['(K01965+K01966,K11263+(K18472,K19312+K22568),K01964+K15036+K15037)', 'K05606', '(K01847,K01848+K01849)'] +== +['K01965+K01966', 'K01964+K15036+K15037', 'K11263+K18472,K19312+K22568'] +['K05606'] +['K01847', 'K01848+K01849'] +++++++++++++++++++ +M00130 +(K00888,K19801,K13711) (K00889,K13712) (K01116,K05857,K05858,K05859,K05860,K05861) K00911 +['(K00888,K19801,K13711)', '(K00889,K13712)', '(K01116,K05857,K05858,K05859,K05860,K05861)', 'K00911'] +== +['K00888', 'K19801', 'K13711'] +['K00889', 'K13712'] +['K01116', 'K05857', 'K05858', 'K05859', 'K05860', 'K05861'] +['K00911'] +++++++++++++++++++ +M00131 +K01106 (K01107,K15422) K01109 (K01092,K15759,K10047,K18649) +['K01106', '(K01107,K15422)', 'K01109', '(K01092,K15759,K10047,K18649)'] +== +['K01106'] +['K01107', 'K15422'] +['K01109'] +['K01092', 'K15759', 'K10047', 'K18649'] +++++++++++++++++++ +M00132 +(K00913,K01765) K00915 K10572 +['(K00913,K01765)', 'K00915', 'K10572'] +== +['K00913', 'K01765'] +['K00915'] +['K10572'] +++++++++++++++++++ +M00165 +K00855 (K01601-K01602) K00927 (K05298,K00150,K00134) (K01623,K01624) (K03841,K02446,K11532,K01086) K00615 (K01623,K01624) (K01100,K11532,K01086) K00615 (K01807,K01808) +['K00855', '(K01601-K01602)', 'K00927', '(K05298,K00150,K00134)', '(K01623,K01624)', '(K03841,K02446,K11532,K01086)', 'K00615', '(K01623,K01624)', '(K01100,K11532,K01086)', 'K00615', '(K01807,K01808)'] +== +['K00855'] +['K01601-K01602'] 
+['K00927'] +['K05298', 'K00150', 'K00134'] +['K01623', 'K01624'] +['K03841', 'K02446', 'K11532', 'K01086'] +['K00615'] +['K01623', 'K01624'] +['K01100', 'K11532', 'K01086'] +['K00615'] +['K01807', 'K01808'] +++++++++++++++++++ +M00166 +K00855 (K01601-K01602) K00927 (K05298,K00150,K00134) +['K00855', '(K01601-K01602)', 'K00927', '(K05298,K00150,K00134)'] +== +['K00855'] +['K01601-K01602'] +['K00927'] +['K05298', 'K00150', 'K00134'] +++++++++++++++++++ +M00167 +(K01623,K01624) (K03841,K02446,K11532,K01086) K00615 (K01623,K01624) (K01100,K11532,K01086) K00615 (K01807,K01808) +['(K01623,K01624)', '(K03841,K02446,K11532,K01086)', 'K00615', '(K01623,K01624)', '(K01100,K11532,K01086)', 'K00615', '(K01807,K01808)'] +== +['K01623', 'K01624'] +['K03841', 'K02446', 'K11532', 'K01086'] +['K00615'] +['K01623', 'K01624'] +['K01100', 'K11532', 'K01086'] +['K00615'] +['K01807', 'K01808'] +++++++++++++++++++ +M00168 +K01595 (K00025,K00026,K00024) +['K01595', '(K00025,K00026,K00024)'] +== +['K01595'] +['K00025', 'K00026', 'K00024'] +++++++++++++++++++ +M00169 +K00029 K01006 +['K00029', 'K01006'] +== +['K00029'] +['K01006'] +++++++++++++++++++ +M00172 +K01595 K00051 K00029 K01006 +['K01595', 'K00051', 'K00029', 'K01006'] +== +['K01595'] +['K00051'] +['K00029'] +['K01006'] +++++++++++++++++++ +M00171 +K01595 K14454 K14455 (K00025,K00026) K00028 (K00814,K14272) K01006 +['K01595', 'K14454', 'K14455', '(K00025,K00026)', 'K00028', '(K00814,K14272)', 'K01006'] +== +['K01595'] +['K14454'] +['K14455'] +['K00025', 'K00026'] +['K00028'] +['K00814', 'K14272'] +['K01006'] +++++++++++++++++++ +M00170 +K01595 K14454 K14455 K01610 +['K01595', 'K14454', 'K14455', 'K01610'] +== +['K01595'] +['K14454'] +['K14455'] +['K01610'] +++++++++++++++++++ +M00173 +(K00169+K00170+K00171+K00172,K03737) ((K01007,K01006) K01595,K01959+K01960,K01958) K00024 (K01676,K01679,K01677+K01678) (K00239+K00240-K00241-K00242,K00244+K00245-K00246-K00247,K18556+K18557+K18558+K18559+K18560) (K01902+K01903) (K00174+K00175-K00177-K00176) K00031 (K01681,K01682) (K15230+K15231,K15232+K15233 K15234) +['(K00169+K00170+K00171+K00172,K03737)', '((K01007,K01006)_K01595,K01959+K01960,K01958)', 'K00024', '(K01676,K01679,K01677+K01678)', '(K00239+K00240-K00241-K00242,K00244+K00245-K00246-K00247,K18556+K18557+K18558+K18559+K18560)', '(K01902+K01903)', '(K00174+K00175-K00177-K00176)', 'K00031', '(K01681,K01682)', '(K15230+K15231,K15232+K15233_K15234)'] +== +['K03737', 'K00169+K00170+K00171+K00172'] +['K01958', 'K01959+K01960', 'K01007,K01006_K01595'] +['K00024'] +['K01676', 'K01679', 'K01677+K01678'] +['K00239+K00240-K00241-K00242', 'K00244+K00245-K00246-K00247', 'K18556+K18557+K18558+K18559+K18560'] +['K01902+K01903'] +['K00174+K00175-K00177-K00176'] +['K00031'] +['K01681', 'K01682'] +['K15230+K15231', 'K15232+K15233_K15234'] +++++++++++++++++++ +M00375 +K01964+K15037+K15036 K15017 K15039 K15018 K15019 K15020 K05606 K01848+K01849 (K15038,K15017) K14465 (K14466,K18861) K14534 K15016 K00626 +['K01964+K15037+K15036', 'K15017', 'K15039', 'K15018', 'K15019', 'K15020', 'K05606', 'K01848+K01849', '(K15038,K15017)', 'K14465', '(K14466,K18861)', 'K14534', 'K15016', 'K00626'] +== +['K01964+K15037+K15036'] +['K15017'] +['K15039'] +['K15018'] +['K15019'] +['K15020'] +['K05606'] +['K01848+K01849'] +['K15038', 'K15017'] +['K14465'] +['K14466', 'K18861'] +['K14534'] +['K15016'] +['K00626'] +++++++++++++++++++ +M00374 +K00169+K00170+K00171+K00172 K01007 K01595 K00024 (K01676,K01677+K01678) (K00239+K00240-K00241-K18860) K01902+K01903 (K15038,K15017) K14465 (K14467,K18861) K14534 
K15016 K00626 +['K00169+K00170+K00171+K00172', 'K01007', 'K01595', 'K00024', '(K01676,K01677+K01678)', '(K00239+K00240-K00241-K18860)', 'K01902+K01903', '(K15038,K15017)', 'K14465', '(K14467,K18861)', 'K14534', 'K15016', 'K00626'] +== +['K00169+K00170+K00171+K00172'] +['K01007'] +['K01595'] +['K00024'] +['K01676', 'K01677+K01678'] +['K00239+K00240-K00241-K18860'] +['K01902+K01903'] +['K15038', 'K15017'] +['K14465'] +['K14467', 'K18861'] +['K14534'] +['K15016'] +['K00626'] +++++++++++++++++++ +M00377 +K00198 K05299-K15022 K01938 K01491 K00297 K15023 K14138+K00197+K00194 +['K00198', 'K05299-K15022', 'K01938', 'K01491', 'K00297', 'K15023', 'K14138+K00197+K00194'] +== +['K00198'] +['K05299-K15022'] +['K01938'] +['K01491'] +['K00297'] +['K15023'] +['K14138+K00197+K00194'] +++++++++++++++++++ +M00579 +(K00625,K13788,K15024) K00925 +['(K00625,K13788,K15024)', 'K00925'] +== +['K00625', 'K13788', 'K15024'] +['K00925'] +++++++++++++++++++ +M00620 +K00169+K00170+K00171+K00172 K01959+K01960 K00024 K01677+K01678 K18209+K18210 K01902+K01903 K00174+K00175+K00176+K00177 +['K00169+K00170+K00171+K00172', 'K01959+K01960', 'K00024', 'K01677+K01678', 'K18209+K18210', 'K01902+K01903', 'K00174+K00175+K00176+K00177'] +== +['K00169+K00170+K00171+K00172'] +['K01959+K01960'] +['K00024'] +['K01677+K01678'] +['K18209+K18210'] +['K01902+K01903'] +['K00174+K00175+K00176+K00177'] +++++++++++++++++++ +M00567 +(K00200+K00201+K00202+K00203-K11261+(K00205,K11260,K00204)) K00672 K01499 (K00319,K13942) K00320 (K00577+K00578+K00579+K00580+K00581-K00582-K00583+K00584) (K00399+K00401+K00402) (K22480+K22481+K22482,K03388+K03389+K03390,K08264+K08265,K03388+K03389+K03390+K14127+(K14126+K14128,K22516+K00125)) +['(K00200+K00201+K00202+K00203-K11261+(K00205,K11260,K00204))', 'K00672', 'K01499', '(K00319,K13942)', 'K00320', '(K00577+K00578+K00579+K00580+K00581-K00582-K00583+K00584)', '(K00399+K00401+K00402)', '(K22480+K22481+K22482,K03388+K03389+K03390,K08264+K08265,K03388+K03389+K03390+K14127+(K14126+K14128,K22516+K00125))'] +== +['K00200+K00201+K00202+K00203-K11261+K00205,K11260,K00204'] +['K00672'] +['K01499'] +['K00319', 'K13942'] +['K00320'] +['K00577+K00578+K00579+K00580+K00581-K00582-K00583+K00584'] +['K00399+K00401+K00402'] +['K08264+K08265', 'K22480+K22481+K22482', 'K03388+K03389+K03390', 'K03388+K03389+K03390+K14127+K14126+K14128,K22516+K00125'] +++++++++++++++++++ +M00357 +(K00925 (K00625,K13788),K01895) (K00193+K00197+K00194) (K00577+K00578+K00579+K00580+K00581-K00582-K00583+K00584) (K00399+K00401+K00402) (K22480+K22481+K22482,K03388+K03389+K03390,K08264+K08265,K03388+K03389+K03390+K14127+(K14126+K14128,K22516+K00125)) +['(K00925_(K00625,K13788),K01895)', '(K00193+K00197+K00194)', '(K00577+K00578+K00579+K00580+K00581-K00582-K00583+K00584)', '(K00399+K00401+K00402)', '(K22480+K22481+K22482,K03388+K03389+K03390,K08264+K08265,K03388+K03389+K03390+K14127+(K14126+K14128,K22516+K00125))'] +== +['K01895', 'K00925_K00625,K13788'] +['K00193+K00197+K00194'] +['K00577+K00578+K00579+K00580+K00581-K00582-K00583+K00584'] +['K00399+K00401+K00402'] +['K08264+K08265', 'K22480+K22481+K22482', 'K03388+K03389+K03390', 'K03388+K03389+K03390+K14127+K14126+K14128,K22516+K00125'] +++++++++++++++++++ +M00356 +K14080+K04480+K14081 K00399+K00401+K00402 (K22480+K22481+K22482,K03388+K03389+K03390,K08264+K08265,K03388+K03389+K03390+K14127+(K14126+K14128,K22516+K00125)) +['K14080+K04480+K14081', 'K00399+K00401+K00402', '(K22480+K22481+K22482,K03388+K03389+K03390,K08264+K08265,K03388+K03389+K03390+K14127+(K14126+K14128,K22516+K00125))'] +== 
+['K14080+K04480+K14081'] +['K00399+K00401+K00402'] +['K08264+K08265', 'K22480+K22481+K22482', 'K03388+K03389+K03390', 'K03388+K03389+K03390+K14127+K14126+K14128,K22516+K00125'] +++++++++++++++++++ +M00563 +K14082 ((K16177-K16176),(K16179-K16178),(K14084-K14083)) K00399+K00401+K00402 (K22480+K22481+K22482,K03388+K03389+K03390,K08264+K08265,K03388+K03389+K03390+K14127+(K14126+K14128,K22516+K00125)) +['K14082', '((K16177-K16176),(K16179-K16178),(K14084-K14083))', 'K00399+K00401+K00402', '(K22480+K22481+K22482,K03388+K03389+K03390,K08264+K08265,K03388+K03389+K03390+K14127+(K14126+K14128,K22516+K00125))'] +== +['K14082'] +['K16177-K16176', 'K16179-K16178', 'K14084-K14083'] +['K00399+K00401+K00402'] +['K08264+K08265', 'K22480+K22481+K22482', 'K03388+K03389+K03390', 'K03388+K03389+K03390+K14127+K14126+K14128,K22516+K00125'] +++++++++++++++++++ +M00358 +K08097 K05979 K05884 K13039+K06034 +['K08097', 'K05979', 'K05884', 'K13039+K06034'] +== +['K08097'] +['K05979'] +['K05884'] +['K13039+K06034'] +++++++++++++++++++ +M00608 +K10977 K16792+K16793 K10978 +['K10977', 'K16792+K16793', 'K10978'] +== +['K10977'] +['K16792+K16793'] +['K10978'] +++++++++++++++++++ +M00174 +((K10944+K10945+K10946),(K16157+K16158+K16159+K16160+K16161+K16162)) ((K14028-K14029),K23995) +['((K10944+K10945+K10946),(K16157+K16158+K16159+K16160+K16161+K16162))', '((K14028-K14029),K23995)'] +== +['K10944+K10945+K10946', 'K16157+K16158+K16159+K16160+K16161+K16162'] +['K23995', 'K14028-K14029'] +++++++++++++++++++ +M00346 +K00600 K00830 K00018 K11529 K01689 K01595 K00024 K08692+K14067 K08691 +['K00600', 'K00830', 'K00018', 'K11529', 'K01689', 'K01595', 'K00024', 'K08692+K14067', 'K08691'] +== +['K00600'] +['K00830'] +['K00018'] +['K11529'] +['K01689'] +['K01595'] +['K00024'] +['K08692+K14067'] +['K08691'] +++++++++++++++++++ +M00345 +(((K08093,K13812) K08094),K13831) (K00850,K16370) K01624 +['(((K08093,K13812)_K08094),K13831)', '(K00850,K16370)', 'K01624'] +== +['K13831', 'K08093,K13812_K08094'] +['K00850', 'K16370'] +['K01624'] +++++++++++++++++++ +M00344 +K17100 K00863 K01624 K03841 +['K17100', 'K00863', 'K01624', 'K03841'] +== +['K17100'] +['K00863'] +['K01624'] +['K03841'] +++++++++++++++++++ +M00422 +K00192+K00195 K00193+K00197+K00194 +['K00192+K00195', 'K00193+K00197+K00194'] +== +['K00192+K00195'] +['K00193+K00197+K00194'] +++++++++++++++++++ +M00175 +K02588+K02586+K02591-K00531,K22896+K22897+K22898+K22899 +['K02588+K02586+K02591-K00531,K22896+K22897+K22898+K22899'] +== +['K02588+K02586+K02591-K00531', 'K22896+K22897+K22898+K22899'] +++++++++++++++++++ +M00531 +(K00367,K10534,K00372-K00360) (K00366,K17877) +['(K00367,K10534,K00372-K00360)', '(K00366,K17877)'] +== +['K00367', 'K10534', 'K00372-K00360'] +['K00366', 'K17877'] +++++++++++++++++++ +M00530 +(K00370+K00371+K00374,K02567+K02568) (K00362+K00363,K03385+K15876) +['(K00370+K00371+K00374,K02567+K02568)', '(K00362+K00363,K03385+K15876)'] +== +['K02567+K02568', 'K00370+K00371+K00374'] +['K00362+K00363', 'K03385+K15876'] +++++++++++++++++++ +M00529 +(K00370+K00371+K00374,K02567+K02568) (K00368,K15864) (K04561+K02305) K00376 +['(K00370+K00371+K00374,K02567+K02568)', '(K00368,K15864)', '(K04561+K02305)', 'K00376'] +== +['K02567+K02568', 'K00370+K00371+K00374'] +['K00368', 'K15864'] +['K04561+K02305'] +['K00376'] +++++++++++++++++++ +M00528 +K10944+K10945+K10946 K10535 +['K10944+K10945+K10946', 'K10535'] +== +['K10944+K10945+K10946'] +['K10535'] +++++++++++++++++++ +M00804 +K10944+K10945+K10946 K10535 K00370+K00371 +['K10944+K10945+K10946', 'K10535', 'K00370+K00371'] +== 
+['K10944+K10945+K10946'] +['K10535'] +['K00370+K00371'] +++++++++++++++++++ +M00176 +(K13811,K00958+K00860,K00955+K00957,K00956+K00957+K00860) K00390 (K00380+K00381,K00392) +['(K13811,K00958+K00860,K00955+K00957,K00956+K00957+K00860)', 'K00390', '(K00380+K00381,K00392)'] +== +['K13811', 'K00958+K00860', 'K00955+K00957', 'K00956+K00957+K00860'] +['K00390'] +['K00392', 'K00380+K00381'] +++++++++++++++++++ +M00596 +K00958 (K00394+K00395) (K11180+K11181) +['K00958', '(K00394+K00395)', '(K11180+K11181)'] +== +['K00958'] +['K00394+K00395'] +['K11180+K11181'] +++++++++++++++++++ +M00595 +K17222+K17223+K17224-K17225-K22622+K17226+K17227 +['K17222+K17223+K17224-K17225-K22622+K17226+K17227'] +== +['K17222+K17223+K17224-K17225-K22622+K17226+K17227'] +++++++++++++++++++ +M00161 +K02703+K02706+K02705+K02704+K02707+K02708 +['K02703+K02706+K02705+K02704+K02707+K02708'] +== +['K02703+K02706+K02705+K02704+K02707+K02708'] +++++++++++++++++++ +M00163 +K02689+K02690+K02691+K02692+K02693+K02694 +['K02689+K02690+K02691+K02692+K02693+K02694'] +== +['K02689+K02690+K02691+K02692+K02693+K02694'] +++++++++++++++++++ +M00597 +K08928+K08929 +['K08928+K08929'] +== +['K08928+K08929'] +++++++++++++++++++ +M00598 +K08940+K08941+K08942+K08943 +['K08940+K08941+K08942+K08943'] +== +['K08940+K08941+K08942+K08943'] +++++++++++++++++++ +M00145 +K05574+K05582+K05581+K05579+K05572+K05580+K05578+K05576+K05577+K05575+K05573-K05583-K05584-K05585 +['K05574+K05582+K05581+K05579+K05572+K05580+K05578+K05576+K05577+K05575+K05573-K05583-K05584-K05585'] +== +['K05574+K05582+K05581+K05579+K05572+K05580+K05578+K05576+K05577+K05575+K05573-K05583-K05584-K05585'] +++++++++++++++++++ +M00142 +K03878+K03879+K03880+K03881+K03882+K03883+K03884 +['K03878+K03879+K03880+K03881+K03882+K03883+K03884'] +== +['K03878+K03879+K03880+K03881+K03882+K03883+K03884'] +++++++++++++++++++ +M00143 +K03934+K03935+K03936+K03937+K03938+K03939+K03940+K03941+K03942+K03943-K03944 +['K03934+K03935+K03936+K03937+K03938+K03939+K03940+K03941+K03942+K03943-K03944'] +== +['K03934+K03935+K03936+K03937+K03938+K03939+K03940+K03941+K03942+K03943-K03944'] +++++++++++++++++++ +M00146 +K03945+K03946+K03947+K03948+K03949+K03950+K03951+K03952+K03953+K03954+K03955+K03956+K11352+K11353 +['K03945+K03946+K03947+K03948+K03949+K03950+K03951+K03952+K03953+K03954+K03955+K03956+K11352+K11353'] +== +['K03945+K03946+K03947+K03948+K03949+K03950+K03951+K03952+K03953+K03954+K03955+K03956+K11352+K11353'] +++++++++++++++++++ +M00147 +K03957+K03958+K03959+K03960+K03961+K03962+K03963+K03964+K03965+K03966+K11351+K03967+K03968 +['K03957+K03958+K03959+K03960+K03961+K03962+K03963+K03964+K03965+K03966+K11351+K03967+K03968'] +== +['K03957+K03958+K03959+K03960+K03961+K03962+K03963+K03964+K03965+K03966+K11351+K03967+K03968'] +++++++++++++++++++ +M00150 +K00244+K00245+K00246+K00247 +['K00244+K00245+K00246+K00247'] +== +['K00244+K00245+K00246+K00247'] +++++++++++++++++++ +M00148 +K00236+K00237+K00234+K00235 +['K00236+K00237+K00234+K00235'] +== +['K00236+K00237+K00234+K00235'] +++++++++++++++++++ +M00162 +K02635+K02637+K02634+K02636+K02642+K02643+K03689+K02640 +['K02635+K02637+K02634+K02636+K02642+K02643+K03689+K02640'] +== +['K02635+K02637+K02634+K02636+K02642+K02643+K03689+K02640'] +++++++++++++++++++ +M00154 +(K02257+K02262+K02256+K02261+K02263+K02264+K02265+K02266+K02267+K02268+(K02269,K02270-K02271)+K02272-K02273+K02258+K02259+K02260) +['(K02257+K02262+K02256+K02261+K02263+K02264+K02265+K02266+K02267+K02268+(K02269,K02270-K02271)+K02272-K02273+K02258+K02259+K02260)'] +== 
+['K02257+K02262+K02256+K02261+K02263+K02264+K02265+K02266+K02267+K02268+K02269,K02270-K02271+K02272-K02273+K02258+K02259+K02260'] +++++++++++++++++++ +M00155 +K02275+(K02274+K02276,K15408)-K02277 +['K02275+(K02274+K02276,K15408)-K02277'] +== +['K15408-K02277', 'K02275+K02274+K02276'] +++++++++++++++++++ +M00153 +K00425+K00426+(K00424,K22501) +['K00425+K00426+(K00424,K22501)'] +== +['K22501', 'K00425+K00426+K00424'] +++++++++++++++++++ +M00417 +K02297+K02298+K02299+K02300 +['K02297+K02298+K02299+K02300'] +== +['K02297+K02298+K02299+K02300'] +++++++++++++++++++ +M00416 +K02827+K02826+K02828+K02829 +['K02827+K02826+K02828+K02829'] +== +['K02827+K02826+K02828+K02829'] +++++++++++++++++++ +M00156 +((K00404+K00405,K15862)+K00407+K00406) +['((K00404+K00405,K15862)+K00407+K00406)'] +== +['K00404+K00405,K15862+K00407+K00406'] +++++++++++++++++++ +M00157 +K02111+K02112+K02113+K02114+K02115+K02108+K02109+K02110 +['K02111+K02112+K02113+K02114+K02115+K02108+K02109+K02110'] +== +['K02111+K02112+K02113+K02114+K02115+K02108+K02109+K02110'] +++++++++++++++++++ +M00158 +K02132+K02133+K02136+K02134+K02135+K02137+K02126+K02127+K02128+K02138+(K02129,K01549)+(K02130,K02139)+K02140+(K02141,K02131)-K02142-K02143+K02125 +['K02132+K02133+K02136+K02134+K02135+K02137+K02126+K02127+K02128+K02138+(K02129,K01549)+(K02130,K02139)+K02140+(K02141,K02131)-K02142-K02143+K02125'] +== +['K01549+K02130', 'K02139+K02140+K02141', 'K02131-K02142-K02143+K02125', 'K02132+K02133+K02136+K02134+K02135+K02137+K02126+K02127+K02128+K02138+K02129'] +++++++++++++++++++ +M00159 +K02117+K02118+K02119+K02120+K02121+K02122+K02107+K02123+K02124 +['K02117+K02118+K02119+K02120+K02121+K02122+K02107+K02123+K02124'] +== +['K02117+K02118+K02119+K02120+K02121+K02122+K02107+K02123+K02124'] +++++++++++++++++++ +M00160 +K02145+K02147+K02148+K02149+K02150+K02151+K02152+K02144+K02154+(K03661,K02155)+K02146+K02153+K03662 +['K02145+K02147+K02148+K02149+K02150+K02151+K02152+K02144+K02154+(K03661,K02155)+K02146+K02153+K03662'] +== +['K02155+K02146+K02153+K03662', 'K02145+K02147+K02148+K02149+K02150+K02151+K02152+K02144+K02154+K03661'] +++++++++++++++++++ +M00082 +(K11262,(K02160+K01961,K11263)+(K01962+K01963,K18472)) (K00665,K00667+K00668,K11533,(K00645 (K00648,K18473))) +['(K11262,(K02160+K01961,K11263)+(K01962+K01963,K18472))', '(K00665,K00667+K00668,K11533,(K00645_(K00648,K18473)))'] +== +['K11262', 'K02160+K01961,K11263+K01962+K01963,K18472'] +['K00665', 'K11533', 'K00667+K00668', 'K00645_K00648,K18473'] +++++++++++++++++++ +M00083 +K00665,(K00667 K00668),K11533,((K00647,K09458) K00059 (K02372,K01716,K16363) (K00208,K02371,K10780,K00209)) +['K00665,(K00667_K00668),K11533,((K00647,K09458)_K00059_(K02372,K01716,K16363)_(K00208,K02371,K10780,K00209))'] +== +['K00665', 'K11533', 'K00667_K00668', 'K00647,K09458_K00059_K02372,K01716,K16363_K00208,K02371,K10780,K00209'] +++++++++++++++++++ +M00873 +K18660 K03955+K00645 K09458 K11539+K13370 K22540 K07512 +['K18660', 'K03955+K00645', 'K09458', 'K11539+K13370', 'K22540', 'K07512'] +== +['K18660'] +['K03955+K00645'] +['K09458'] +['K11539+K13370'] +['K22540'] +['K07512'] +++++++++++++++++++ +M00874 +K11262 K03955+K00645 K09458 K00059 K22541 K07512 +['K11262', 'K03955+K00645', 'K09458', 'K00059', 'K22541', 'K07512'] +== +['K11262'] +['K03955+K00645'] +['K09458'] +['K00059'] +['K22541'] +['K07512'] +++++++++++++++++++ +M00085 +(K07508,K07509) (K00022 K07511,K07515) K07512 +['(K07508,K07509)', '(K00022_K07511,K07515)', 'K07512'] +== +['K07508', 'K07509'] +['K07515', 'K00022_K07511'] +['K07512'] +++++++++++++++++++ 
+M00415 +(K10247,K10205,K10248,K10249,K10244,K10203,K10250,K15397,K10245,K10246) K10251 K10703 K10258 +['(K10247,K10205,K10248,K10249,K10244,K10203,K10250,K15397,K10245,K10246)', 'K10251', 'K10703', 'K10258'] +== +['K10247', 'K10205', 'K10248', 'K10249', 'K10244', 'K10203', 'K10250', 'K15397', 'K10245', 'K10246'] +['K10251'] +['K10703'] +['K10258'] +++++++++++++++++++ +M00086 +K01897,K15013 +['K01897,K15013'] +== +['K01897', 'K15013'] +++++++++++++++++++ +M00087 +(K00232,K00249,K00255,K06445,K09479) (((K01692,K07511,K13767) (K00022,K07516)),K01825,K01782,K07514,K07515,K10527) (K00632,K07508,K07509,K07513) +['(K00232,K00249,K00255,K06445,K09479)', '(((K01692,K07511,K13767)_(K00022,K07516)),K01825,K01782,K07514,K07515,K10527)', '(K00632,K07508,K07509,K07513)'] +== +['K00232', 'K00249', 'K00255', 'K06445', 'K09479'] +['K01825', 'K01782', 'K07514', 'K07515', 'K10527', 'K01692,K07511,K13767_K00022,K07516'] +['K00632', 'K07508', 'K07509', 'K07513'] +++++++++++++++++++ +M00861 +K00232 K12405 (K07513,K08764) +['K00232', 'K12405', '(K07513,K08764)'] +== +['K00232'] +['K12405'] +['K07513', 'K08764'] +++++++++++++++++++ +M00101 +K01852 K05917 K00222 K07750 K07748 (K09827,K13373) K09828 K01824 K00227 K00213 +['K01852', 'K05917', 'K00222', 'K07750', 'K07748', '(K09827,K13373)', 'K09828', 'K01824', 'K00227', 'K00213'] +== +['K01852'] +['K05917'] +['K00222'] +['K07750'] +['K07748'] +['K09827', 'K13373'] +['K09828'] +['K01824'] +['K00227'] +['K00213'] +++++++++++++++++++ +M00102 +K00559 K09829 K00227 K09831 K00223 +['K00559', 'K09829', 'K00227', 'K09831', 'K00223'] +== +['K00559'] +['K09829'] +['K00227'] +['K09831'] +['K00223'] +++++++++++++++++++ +M00103 +K07419 K07438 +['K07419', 'K07438'] +== +['K07419'] +['K07438'] +++++++++++++++++++ +M00104 +K00489 K12408 K07431 K00251 K00037 K00488 K08748 K01796 K10214 K12405 K08764 K11992 +['K00489', 'K12408', 'K07431', 'K00251', 'K00037', 'K00488', 'K08748', 'K01796', 'K10214', 'K12405', 'K08764', 'K11992'] +== +['K00489'] +['K12408'] +['K07431'] +['K00251'] +['K00037'] +['K00488'] +['K08748'] +['K01796'] +['K10214'] +['K12405'] +['K08764'] +['K11992'] +++++++++++++++++++ +M00106 +K08748 K00659 +['K08748', 'K00659'] +== +['K08748'] +['K00659'] +++++++++++++++++++ +M00862 +K10214 K12405 K08764 +['K10214', 'K12405', 'K08764'] +== +['K10214'] +['K12405'] +['K08764'] +++++++++++++++++++ +M00107 +K00498 K00070 +['K00498', 'K00070'] +== +['K00498'] +['K00070'] +++++++++++++++++++ +M00108 +K00513 (K00497,K07433) K07433 +['K00513', '(K00497,K07433)', 'K07433'] +== +['K00513'] +['K00497', 'K07433'] +['K07433'] +++++++++++++++++++ +M00109 +K00512 K00513 K00497 (K15680,K00071) +['K00512', 'K00513', 'K00497', '(K15680,K00071)'] +== +['K00512'] +['K00513'] +['K00497'] +['K15680', 'K00071'] +++++++++++++++++++ +M00110 +K00512 K00070 K07434 +['K00512', 'K00070', 'K07434'] +== +['K00512'] +['K00070'] +['K07434'] +++++++++++++++++++ +M00089 +(K00629,K13506,K13507,K00630,K13508) (K00655,K13509,K13523,K19007,K13513,K13517,K13519,K14674,K22831) (K01080,K15728,K18693) (K11155,K11160,K14456,K22848,K22849) +['(K00629,K13506,K13507,K00630,K13508)', '(K00655,K13509,K13523,K19007,K13513,K13517,K13519,K14674,K22831)', '(K01080,K15728,K18693)', '(K11155,K11160,K14456,K22848,K22849)'] +== +['K00629', 'K13506', 'K13507', 'K00630', 'K13508'] +['K00655', 'K13509', 'K13523', 'K19007', 'K13513', 'K13517', 'K13519', 'K14674', 'K22831'] +['K01080', 'K15728', 'K18693'] +['K11155', 'K11160', 'K14456', 'K22848', 'K22849'] +++++++++++++++++++ +M00098 
+(K01046,K12298,K16816,K13534,K14073,K14074,K14075,K14076,K22283,K14452,K22284,K14674,K14675,K17900) K01054 +['(K01046,K12298,K16816,K13534,K14073,K14074,K14075,K14076,K22283,K14452,K22284,K14674,K14675,K17900)', 'K01054'] +== +['K01046', 'K12298', 'K16816', 'K13534', 'K14073', 'K14074', 'K14075', 'K14076', 'K22283', 'K14452', 'K22284', 'K14674', 'K14675', 'K17900'] +['K01054'] +++++++++++++++++++ +M00090 +(K00866,K14156) K00968 (K00994,K13644) +['(K00866,K14156)', 'K00968', '(K00994,K13644)'] +== +['K00866', 'K14156'] +['K00968'] +['K00994', 'K13644'] +++++++++++++++++++ +M00091 +K00551,(K16369 K00550),K00570 +['K00551,(K16369_K00550),K00570'] +== +['K00551', 'K00570', 'K16369_K00550'] +++++++++++++++++++ +M00092 +(K00894,K14156) K00967 (K00993,K13644) +['(K00894,K14156)', 'K00967', '(K00993,K13644)'] +== +['K00894', 'K14156'] +['K00967'] +['K00993', 'K13644'] +++++++++++++++++++ +M00093 +K00981 (K00998,K17103) K01613 +['K00981', '(K00998,K17103)', 'K01613'] +== +['K00981'] +['K00998', 'K17103'] +['K01613'] +++++++++++++++++++ +M00094 +K00654 K04708 (K04709,K04710,K23727) K04712 +['K00654', 'K04708', '(K04709,K04710,K23727)', 'K04712'] +== +['K00654'] +['K04708'] +['K04709', 'K04710', 'K23727'] +['K04712'] +++++++++++++++++++ +M00066 +K00720 K07553 +['K00720', 'K07553'] +== +['K00720'] +['K07553'] +++++++++++++++++++ +M00067 +K04628 K01019 +['K04628', 'K01019'] +== +['K04628'] +['K01019'] +++++++++++++++++++ +M00099 +K00654 K04708 (K04709,K04710,K23727) K04712 (K01441,K12348,K12349) +['K00654', 'K04708', '(K04709,K04710,K23727)', 'K04712', '(K01441,K12348,K12349)'] +== +['K00654'] +['K04708'] +['K04709', 'K04710', 'K23727'] +['K04712'] +['K01441', 'K12348', 'K12349'] +++++++++++++++++++ +M00100 +K04718 K01634 +['K04718', 'K01634'] +== +['K04718'] +['K01634'] +++++++++++++++++++ +M00113 +K00454 K01723 K10525 K05894 K10526 K00232 K10527 K07513 -- +['K00454', 'K01723', 'K10525', 'K05894', 'K10526', 'K00232', 'K10527', 'K07513'] +== +['K00454'] +['K01723'] +['K10525'] +['K05894'] +['K10526'] +['K00232'] +['K10527'] +['K07513'] +++++++++++++++++++ +M00049 +K01939 K01756 (K00939,K18532,K18533,K00944) (K00940,K00873,K12406) +['K01939', 'K01756', '(K00939,K18532,K18533,K00944)', '(K00940,K00873,K12406)'] +== +['K01939'] +['K01756'] +['K00939', 'K18532', 'K18533', 'K00944'] +['K00940', 'K00873', 'K12406'] +++++++++++++++++++ +M00050 +K00088 K01951 K00942 (K00940,K18533,K00873,K12406) +['K00088', 'K01951', 'K00942', '(K00940,K18533,K00873,K12406)'] +== +['K00088'] +['K01951'] +['K00942'] +['K00940', 'K18533', 'K00873', 'K12406'] +++++++++++++++++++ +M00546 +(K00106,K00087+K13479+K13480,K13481+K13482,K11177+K11178+K13483) (K00365,K16838,K16839,K22879) (K13484,K07127 (K13485,K16838,K16840)) (K01466,K16842) K01477 +['(K00106,K00087+K13479+K13480,K13481+K13482,K11177+K11178+K13483)', '(K00365,K16838,K16839,K22879)', '(K13484,K07127_(K13485,K16838,K16840))', '(K01466,K16842)', 'K01477'] +== +['K00106', 'K13481+K13482', 'K00087+K13479+K13480', 'K11177+K11178+K13483'] +['K00365', 'K16838', 'K16839', 'K22879'] +['K13484', 'K07127_K13485,K16838,K16840'] +['K01466', 'K16842'] +['K01477'] +++++++++++++++++++ +M00051 +(K11540,(K11541 K01465),((K01954,K01955+K01956) (K00609+K00610,K00608) K01465)) (K00226,K00254,K17828) (K13421,K00762 K01591) +['(K11540,(K11541_K01465),((K01954,K01955+K01956)_(K00609+K00610,K00608)_K01465))', '(K00226,K00254,K17828)', '(K13421,K00762_K01591)'] +== +['K11540', 'K11541_K01465', 'K01954,K01955+K01956_K00609+K00610,K00608_K01465'] +['K00226', 'K00254', 'K17828'] +['K13421', 
'K00762_K01591'] +++++++++++++++++++ +M00052 +(K13800,K13809,K09903) (K00940,K18533) K01937 +['(K13800,K13809,K09903)', '(K00940,K18533)', 'K01937'] +== +['K13800', 'K13809', 'K09903'] +['K00940', 'K18533'] +['K01937'] +++++++++++++++++++ +M00053 +(K00524,K00525+K00526,K10807+K10808) (K00940,K18533) (K00527,K21636) K01494 K01520 (K00560,K13998) K00943 K00940 +['(K00524,K00525+K00526,K10807+K10808)', '(K00940,K18533)', '(K00527,K21636)', 'K01494', 'K01520', '(K00560,K13998)', 'K00943', 'K00940'] +== +['K00524', 'K00525+K00526', 'K10807+K10808'] +['K00940', 'K18533'] +['K00527', 'K21636'] +['K01494'] +['K01520'] +['K00560', 'K13998'] +['K00943'] +['K00940'] +++++++++++++++++++ +M00046 +(K00207,K17722+K17723) K01464 (K01431,K06016) +['(K00207,K17722+K17723)', 'K01464', '(K01431,K06016)'] +== +['K00207', 'K17722+K17723'] +['K01464'] +['K01431', 'K06016'] +++++++++++++++++++ +M00020 +K00058 K00831 (K01079,K02203,K22305) +['K00058', 'K00831', '(K01079,K02203,K22305)'] +== +['K00058'] +['K00831'] +['K01079', 'K02203', 'K22305'] +++++++++++++++++++ +M00018 +(K00928,K12524,K12525,K12526) K00133 (K00003,K12524,K12525) (K00872,K02204,K02203) K01733 +['(K00928,K12524,K12525,K12526)', 'K00133', '(K00003,K12524,K12525)', '(K00872,K02204,K02203)', 'K01733'] +== +['K00928', 'K12524', 'K12525', 'K12526'] +['K00133'] +['K00003', 'K12524', 'K12525'] +['K00872', 'K02204', 'K02203'] +['K01733'] +++++++++++++++++++ +M00555 +(K17755,((K00108,K11440,K00499) (K00130,K14085))) +['(K17755,((K00108,K11440,K00499)_(K00130,K14085)))'] +== +['K17755', 'K00108,K11440,K00499_K00130,K14085'] +++++++++++++++++++ +M00033 +K00928 K00133 K00836 K06718 K06720 +['K00928', 'K00133', 'K00836', 'K06718', 'K06720'] +== +['K00928'] +['K00133'] +['K00836'] +['K06718'] +['K06720'] +++++++++++++++++++ +M00021 +(K00640,K23304) (K01738,K13034,K17069) +['(K00640,K23304)', '(K01738,K13034,K17069)'] +== +['K00640', 'K23304'] +['K01738', 'K13034', 'K17069'] +++++++++++++++++++ +M00338 +(K01697,K10150) K01758 +['(K01697,K10150)', 'K01758'] +== +['K01697', 'K10150'] +['K01758'] +++++++++++++++++++ +M00609 +K00789 K17462 K01243 K07173 K17216 K17217 +['K00789', 'K17462', 'K01243', 'K07173', 'K17216', 'K17217'] +== +['K00789'] +['K17462'] +['K01243'] +['K07173'] +['K17216'] +['K17217'] +++++++++++++++++++ +M00017 +(K00928,K12524,K12525) K00133 (K00003,K12524,K12525) (K00651,K00641) K01739 (K01760,K14155) (K00548,K24042,K00549) +['(K00928,K12524,K12525)', 'K00133', '(K00003,K12524,K12525)', '(K00651,K00641)', 'K01739', '(K01760,K14155)', '(K00548,K24042,K00549)'] +== +['K00928', 'K12524', 'K12525'] +['K00133'] +['K00003', 'K12524', 'K12525'] +['K00651', 'K00641'] +['K01739'] +['K01760', 'K14155'] +['K00548', 'K24042', 'K00549'] +++++++++++++++++++ +M00034 +K00789 K01611 K00797 ((K01243,K01244) K00899,K00772) K08963 (K16054,K08964 (K09880,K08965 K08966)) K08967 (K00815,K08969,K23977,K00832,K00838) +['K00789', 'K01611', 'K00797', '((K01243,K01244)_K00899,K00772)', 'K08963', '(K16054,K08964_(K09880,K08965_K08966))', 'K08967', '(K00815,K08969,K23977,K00832,K00838)'] +== +['K00789'] +['K01611'] +['K00797'] +['K00772', 'K01243,K01244_K00899'] +['K08963'] +['K16054', 'K08964_K09880,K08965_K08966'] +['K08967'] +['K00815', 'K08969', 'K23977', 'K00832', 'K00838'] +++++++++++++++++++ +M00035 +K00789 (K00558,K17398,K17399) K01251 (K01697,K10150) +['K00789', '(K00558,K17398,K17399)', 'K01251', '(K01697,K10150)'] +== +['K00789'] +['K00558', 'K17398', 'K17399'] +['K01251'] +['K01697', 'K10150'] +++++++++++++++++++ +M00368 +K00789 (K01762,K20772) K05933 
+['K00789', '(K01762,K20772)', 'K05933'] +== +['K00789'] +['K01762', 'K20772'] +['K05933'] +++++++++++++++++++ +M00019 +K01652+(K01653,K11258) K00053 K01687 K00826 +['K01652+(K01653,K11258)', 'K00053', 'K01687', 'K00826'] +== +['K11258', 'K01652+K01653'] +['K00053'] +['K01687'] +['K00826'] +++++++++++++++++++ +M00535 +K09011 K01703+K01704 K00052 +['K09011', 'K01703+K01704', 'K00052'] +== +['K09011'] +['K01703+K01704'] +['K00052'] +++++++++++++++++++ +M00570 +(K17989,K01754) K01652+(K01653,K11258) K00053 K01687 K00826 +['(K17989,K01754)', 'K01652+(K01653,K11258)', 'K00053', 'K01687', 'K00826'] +== +['K17989', 'K01754'] +['K11258', 'K01652+K01653'] +['K00053'] +['K01687'] +['K00826'] +++++++++++++++++++ +M00432 +K01649 (K01702,K01703+K01704) K00052 +['K01649', '(K01702,K01703+K01704)', 'K00052'] +== +['K01649'] +['K01702', 'K01703+K01704'] +['K00052'] +++++++++++++++++++ +M00036 +K00826 ((K00166+K00167,K11381)+K09699+K00382) (K00253,K00249) (K01968+K01969) (K05607,K13766) K01640 +['K00826', '((K00166+K00167,K11381)+K09699+K00382)', '(K00253,K00249)', '(K01968+K01969)', '(K05607,K13766)', 'K01640'] +== +['K00826'] +['K00166+K00167,K11381+K09699+K00382'] +['K00253', 'K00249'] +['K01968+K01969'] +['K05607', 'K13766'] +['K01640'] +++++++++++++++++++ +M00016 +(K00928,K12524,K12525,K12526) K00133 K01714 K00215 K00674 (K00821,K14267) K01439 K01778 (K01586,K12526) +['(K00928,K12524,K12525,K12526)', 'K00133', 'K01714', 'K00215', 'K00674', '(K00821,K14267)', 'K01439', 'K01778', '(K01586,K12526)'] +== +['K00928', 'K12524', 'K12525', 'K12526'] +['K00133'] +['K01714'] +['K00215'] +['K00674'] +['K00821', 'K14267'] +['K01439'] +['K01778'] +['K01586', 'K12526'] +++++++++++++++++++ +M00525 +K00928 K00133 K01714 K00215 K05822 K00841 K05823 K01778 K01586 +['K00928', 'K00133', 'K01714', 'K00215', 'K05822', 'K00841', 'K05823', 'K01778', 'K01586'] +== +['K00928'] +['K00133'] +['K01714'] +['K00215'] +['K05822'] +['K00841'] +['K05823'] +['K01778'] +['K01586'] +++++++++++++++++++ +M00526 +(K00928,K12524,K12525,K12526) K00133 K01714 K00215 K03340 (K01586,K12526) +['(K00928,K12524,K12525,K12526)', 'K00133', 'K01714', 'K00215', 'K03340', '(K01586,K12526)'] +== +['K00928', 'K12524', 'K12525', 'K12526'] +['K00133'] +['K01714'] +['K00215'] +['K03340'] +['K01586', 'K12526'] +++++++++++++++++++ +M00527 +(K00928,K12524,K12525,K12526) K00133 K01714 K00215 K10206 K01778 (K01586,K12526) +['(K00928,K12524,K12525,K12526)', 'K00133', 'K01714', 'K00215', 'K10206', 'K01778', '(K01586,K12526)'] +== +['K00928', 'K12524', 'K12525', 'K12526'] +['K00133'] +['K01714'] +['K00215'] +['K10206'] +['K01778'] +['K01586', 'K12526'] +++++++++++++++++++ +M00030 +K01655 K17450 K01705 K05824 K00838 K00143 (K00293,K24034) K00290 +['K01655', 'K17450', 'K01705', 'K05824', 'K00838', 'K00143', '(K00293,K24034)', 'K00290'] +== +['K01655'] +['K17450'] +['K01705'] +['K05824'] +['K00838'] +['K00143'] +['K00293', 'K24034'] +['K00290'] +++++++++++++++++++ +M00433 +K01655 (K17450 K01705,K16792+K16793) K05824 +['K01655', '(K17450_K01705,K16792+K16793)', 'K05824'] +== +['K01655'] +['K17450_K01705', 'K16792+K16793'] +['K05824'] +++++++++++++++++++ +M00032 +K14157 K14085 K00825 (K15791+K00658+K00382) K00252 (K07514,(K07515,K07511) K00022) +['K14157', 'K14085', 'K00825', '(K15791+K00658+K00382)', 'K00252', '(K07514,(K07515,K07511)_K00022)'] +== +['K14157'] +['K14085'] +['K00825'] +['K15791+K00658+K00382'] +['K00252'] +['K07514', 'K07515,K07511_K00022'] +++++++++++++++++++ +M00028 +(K00618,K00619,K14681,K14682,K00620,K22477,K22478) ((K00930,K22478) K00145,K12659) 
(K00818,K00821) (K01438,K14677,K00620) +['(K00618,K00619,K14681,K14682,K00620,K22477,K22478)', '((K00930,K22478)_K00145,K12659)', '(K00818,K00821)', '(K01438,K14677,K00620)'] +== +['K00618', 'K00619', 'K14681', 'K14682', 'K00620', 'K22477', 'K22478'] +['K12659', 'K00930,K22478_K00145'] +['K00818', 'K00821'] +['K01438', 'K14677', 'K00620'] +++++++++++++++++++ +M00844 +K00611 K01940 (K01755,K14681) +['K00611', 'K01940', '(K01755,K14681)'] +== +['K00611'] +['K01940'] +['K01755', 'K14681'] +++++++++++++++++++ +M00845 +K22478 K00145 K00821 K09065 K01438 K01940 K01755 +['K22478', 'K00145', 'K00821', 'K09065', 'K01438', 'K01940', 'K01755'] +== +['K22478'] +['K00145'] +['K00821'] +['K09065'] +['K01438'] +['K01940'] +['K01755'] +++++++++++++++++++ +M00029 +K01948 K00611 K01940 (K01755,K14681) K01476 +['K01948', 'K00611', 'K01940', '(K01755,K14681)', 'K01476'] +== +['K01948'] +['K00611'] +['K01940'] +['K01755', 'K14681'] +['K01476'] +++++++++++++++++++ +M00015 +((K00931 K00147),K12657) K00286 +['((K00931_K00147),K12657)', 'K00286'] +== +['K12657', 'K00931_K00147'] +['K00286'] +++++++++++++++++++ +M00047 +K00613 K00542 K00933 +['K00613', 'K00542', 'K00933'] +== +['K00613'] +['K00542'] +['K00933'] +++++++++++++++++++ +M00879 +K00673 K01484 K00840 K06447 K05526 +['K00673', 'K01484', 'K00840', 'K06447', 'K05526'] +== +['K00673'] +['K01484'] +['K00840'] +['K06447'] +['K05526'] +++++++++++++++++++ +M00134 +K01476 K01581 +['K01476', 'K01581'] +== +['K01476'] +['K01581'] +++++++++++++++++++ +M00135 +K00657 K00274 (K00128,K14085,K00149) -- +['K00657', 'K00274', '(K00128,K14085,K00149)'] +== +['K00657'] +['K00274'] +['K00128', 'K14085', 'K00149'] +++++++++++++++++++ +M00136 +K09470 K09471 K09472 K09473 +['K09470', 'K09471', 'K09472', 'K09473'] +== +['K09470'] +['K09471'] +['K09472'] +['K09473'] +++++++++++++++++++ +M00026 +(K00765-K02502) (K01523 K01496,K11755,K14152) (K01814,K24017) (K02501+K02500,K01663) ((K01693 K00817 (K04486,K05602,K18649)),(K01089 K00817)) (K00013,K14152) +['(K00765-K02502)', '(K01523_K01496,K11755,K14152)', '(K01814,K24017)', '(K02501+K02500,K01663)', '((K01693_K00817_(K04486,K05602,K18649)),(K01089_K00817))', '(K00013,K14152)'] +== +['K00765-K02502'] +['K11755', 'K14152', 'K01523_K01496'] +['K01814', 'K24017'] +['K01663', 'K02501+K02500'] +['K01089_K00817', 'K01693_K00817_K04486,K05602,K18649'] +['K00013', 'K14152'] +++++++++++++++++++ +M00045 +K01745 K01712 K01468 (K01479,K00603,K13990,(K05603 K01458)) +['K01745', 'K01712', 'K01468', '(K01479,K00603,K13990,(K05603_K01458))'] +== +['K01745'] +['K01712'] +['K01468'] +['K01479', 'K00603', 'K13990', 'K05603_K01458'] +++++++++++++++++++ +M00022 +(K01626,K03856,K13853) (((K01735,K13829) ((K03785,K03786) K00014,K13832)),K13830) ((K00891,K13829) (K00800,K24018),K13830) K01736 +['(K01626,K03856,K13853)', '(((K01735,K13829)_((K03785,K03786)_K00014,K13832)),K13830)', '((K00891,K13829)_(K00800,K24018),K13830)', 'K01736'] +== +['K01626', 'K03856', 'K13853'] +['K13830', 'K01735,K13829_K03785,K03786_K00014,K13832'] +['K13830', 'K00891,K13829_K00800,K24018'] +['K01736'] +++++++++++++++++++ +M00023 +(((K01657+K01658,K13503,K13501,K01656) K00766),K13497) (((K01817,K24017) (K01656,K01609)),K13498,K13501) (K01695+(K01696,K06001),K01694) +['(((K01657+K01658,K13503,K13501,K01656)_K00766),K13497)', '(((K01817,K24017)_(K01656,K01609)),K13498,K13501)', '(K01695+(K01696,K06001),K01694)'] +== +['K13497', 'K01657+K01658,K13503,K13501,K01656_K00766'] +['K13498', 'K13501', 'K01817,K24017_K01656,K01609'] +['K01694', 'K01695+K01696,K06001'] +++++++++++++++++++ 
+M00024 +((K01850,K04092,K14187,K04093,K04516,K06208,K06209,K13853) (K01713,K04518,K05359),K14170) (K00832,K00838) +['((K01850,K04092,K14187,K04093,K04516,K06208,K06209,K13853)_(K01713,K04518,K05359),K14170)', '(K00832,K00838)'] +== +['K14170', 'K01850,K04092,K14187,K04093,K04516,K06208,K06209,K13853_K01713,K04518,K05359'] +['K00832', 'K00838'] +++++++++++++++++++ +M00025 +(((K01850,K04092,K14170,K04093,K04516,K06208,K06209,K13853) K04517),K14187) (K00815,K00832,K00838) +['(((K01850,K04092,K14170,K04093,K04516,K06208,K06209,K13853)_K04517),K14187)', '(K00815,K00832,K00838)'] +== +['K14187', 'K01850,K04092,K14170,K04093,K04516,K06208,K06209,K13853_K04517'] +['K00815', 'K00832', 'K00838'] +++++++++++++++++++ +M00040 +(K00832,K15849) (K00220,K24018,K15226,K15227) +['(K00832,K15849)', '(K00220,K24018,K15226,K15227)'] +== +['K00832', 'K15849'] +['K00220', 'K24018', 'K15226', 'K15227'] +++++++++++++++++++ +M00042 +(K00505,K00501) (K01592,K01593) K00503 K00553 +['(K00505,K00501)', '(K01592,K01593)', 'K00503', 'K00553'] +== +['K00505', 'K00501'] +['K01592', 'K01593'] +['K00503'] +['K00553'] +++++++++++++++++++ +M00043 +K00431 +['K00431'] +== +['K00431'] +++++++++++++++++++ +M00044 +(K00815,K00838,K03334) K00457 K00451 K01800 (K01555,K16171) +['(K00815,K00838,K03334)', 'K00457', 'K00451', 'K01800', '(K01555,K16171)'] +== +['K00815', 'K00838', 'K03334'] +['K00457'] +['K00451'] +['K01800'] +['K01555', 'K16171'] +++++++++++++++++++ +M00533 +K00455 K00151 K01826 K05921 +['K00455', 'K00151', 'K01826', 'K05921'] +== +['K00455'] +['K00151'] +['K01826'] +['K05921'] +++++++++++++++++++ +M00545 +(((K05708+K05709+K05710+K00529) K05711),K05712) K05713 K05714 K02554 K01666 K04073 +['(((K05708+K05709+K05710+K00529)_K05711),K05712)', 'K05713', 'K05714', 'K02554', 'K01666', 'K04073'] +== +['K05712', 'K05708+K05709+K05710+K00529_K05711'] +['K05713'] +['K05714'] +['K02554'] +['K01666'] +['K04073'] +++++++++++++++++++ +M00037 +K00502 K01593 K00669 K00543 +['K00502', 'K01593', 'K00669', 'K00543'] +== +['K00502'] +['K01593'] +['K00669'] +['K00543'] +++++++++++++++++++ +M00038 +(K00453,K00463) (K01432,K14263,K07130) K00486 K01556 K00452 K03392 (K10217,K23234) +['(K00453,K00463)', '(K01432,K14263,K07130)', 'K00486', 'K01556', 'K00452', 'K03392', '(K10217,K23234)'] +== +['K00453', 'K00463'] +['K01432', 'K14263', 'K07130'] +['K00486'] +['K01556'] +['K00452'] +['K03392'] +['K10217', 'K23234'] +++++++++++++++++++ +M00027 +K01580 (K13524,K07250,K00823,K16871) (K00135,K00139,K17761) +['K01580', '(K13524,K07250,K00823,K16871)', '(K00135,K00139,K17761)'] +== +['K01580'] +['K13524', 'K07250', 'K00823', 'K16871'] +['K00135', 'K00139', 'K17761'] +++++++++++++++++++ +M00369 +K13027 K13029 K13030 +['K13027', 'K13029', 'K13030'] +== +['K13027'] +['K13029'] +['K13030'] +++++++++++++++++++ +M00118 +(K11204+K11205,K01919) (K21456,K01920) +['(K11204+K11205,K01919)', '(K21456,K01920)'] +== +['K01919', 'K11204+K11205'] +['K21456', 'K01920'] +++++++++++++++++++ +M00055 +K01001 (K07432+K07441) K03842 K03843 K03844 K03845 K03846 K03847 K03846 K00729 K03848 K03849 K03850 +['K01001', '(K07432+K07441)', 'K03842', 'K03843', 'K03844', 'K03845', 'K03846', 'K03847', 'K03846', 'K00729', 'K03848', 'K03849', 'K03850'] +== +['K01001'] +['K07432+K07441'] +['K03842'] +['K03843'] +['K03844'] +['K03845'] +['K03846'] +['K03847'] +['K03846'] +['K00729'] +['K03848'] +['K03849'] +['K03850'] +++++++++++++++++++ +M00072 +K07151+K12666+K12667+K12668+K12669+K12670-K00730-K12691 +['K07151+K12666+K12667+K12668+K12669+K12670-K00730-K12691'] +== 
+['K07151+K12666+K12667+K12668+K12669+K12670-K00730-K12691'] +++++++++++++++++++ +M00073 +K01228 K05546 K23741 K01230 +['K01228', 'K05546', 'K23741', 'K01230'] +== +['K01228'] +['K05546'] +['K23741'] +['K01230'] +++++++++++++++++++ +M00074 +K05546 K23741 K01230 K05528 K05529+K05530 K05529+K05531+K05532+K05533+K05534 K05535 +['K05546', 'K23741', 'K01230', 'K05528', 'K05529+K05530', 'K05529+K05531+K05532+K05533+K05534', 'K05535'] +== +['K05546'] +['K23741'] +['K01230'] +['K05528'] +['K05529+K05530'] +['K05529+K05531+K05532+K05533+K05534'] +['K05535'] +++++++++++++++++++ +M00056 +K00710 (K00731,K09653) (K00727,K09662,K09663) K00739 +['K00710', '(K00731,K09653)', '(K00727,K09662,K09663)', 'K00739'] +== +['K00710'] +['K00731', 'K09653'] +['K00727', 'K09662', 'K09663'] +['K00739'] +++++++++++++++++++ +M00065 +(K03857+K03859+K03858+K03861+K03860+(K11001,K11002)-K09658) K03434 K05283 (K05284+K07541) K07542 K05285 K05286 (K05288+K05287) +['(K03857+K03859+K03858+K03861+K03860+(K11001,K11002)-K09658)', 'K03434', 'K05283', '(K05284+K07541)', 'K07542', 'K05285', 'K05286', '(K05288+K05287)'] +== +['K03857+K03859+K03858+K03861+K03860+K11001,K11002-K09658'] +['K03434'] +['K05283'] +['K05284+K07541'] +['K07542'] +['K05285'] +['K05286'] +['K05288+K05287'] +++++++++++++++++++ +M00070 +K03766 (K07819,K07820,K03877) +['K03766', '(K07819,K07820,K03877)'] +== +['K03766'] +['K07819', 'K07820', 'K03877'] +++++++++++++++++++ +M00071 +K03766 (K07966,K07967,K07968,K07969) +['K03766', '(K07966,K07967,K07968,K07969)'] +== +['K03766'] +['K07966', 'K07967', 'K07968', 'K07969'] +++++++++++++++++++ +M00068 +K01988 K00719 +['K01988', 'K00719'] +== +['K01988'] +['K00719'] +++++++++++++++++++ +M00069 +K03370 (K03371,K03369) +['K03370', '(K03371,K03369)'] +== +['K03370'] +['K03371', 'K03369'] +++++++++++++++++++ +M00057 +K00771 K00733 K00734 K10158 +['K00771', 'K00733', 'K00734', 'K10158'] +== +['K00771'] +['K00733'] +['K00734'] +['K10158'] +++++++++++++++++++ +M00058 +K00746 (K13499,K00747,K03419) +['K00746', '(K13499,K00747,K03419)'] +== +['K00746'] +['K13499', 'K00747', 'K03419'] +++++++++++++++++++ +M00059 +(K02369,K02370) (K02366,K02367) (K02368,K02370) (K02576,K02577,K02578,K02579) K01793 +['(K02369,K02370)', '(K02366,K02367)', '(K02368,K02370)', '(K02576,K02577,K02578,K02579)', 'K01793'] +== +['K02369', 'K02370'] +['K02366', 'K02367'] +['K02368', 'K02370'] +['K02576', 'K02577', 'K02578', 'K02579'] +['K01793'] +++++++++++++++++++ +M00076 +-- K01136 K01217 K01135 K01197 K01195 +['K01136', 'K01217', 'K01135', 'K01197', 'K01195'] +== +['K01136'] +['K01217'] +['K01135'] +['K01197'] +['K01195'] +++++++++++++++++++ +M00077 +K01135 K01197 K01195 K01132 +['K01135', 'K01197', 'K01195', 'K01132'] +== +['K01135'] +['K01197'] +['K01195'] +['K01132'] +++++++++++++++++++ +M00078 +(K07964,K07965) K01136 K01217 K01565 K10532 K01205 -- K01195 K01137 +['(K07964,K07965)', 'K01136', 'K01217', 'K01565', 'K10532', 'K01205', 'K01195', 'K01137'] +== +['K07964', 'K07965'] +['K01136'] +['K01217'] +['K01565'] +['K10532'] +['K01205'] +['K01195'] +['K01137'] +++++++++++++++++++ +M00079 +-- K01132 K12309 K01137 K12373 +['K01132', 'K12309', 'K01137', 'K12373'] +== +['K01132'] +['K12309'] +['K01137'] +['K12373'] +++++++++++++++++++ +M00060 +K00677 K02535 K02536 K03269 K00748 K00912 K02527 K02517 K02560 +['K00677', 'K02535', 'K02536', 'K03269', 'K00748', 'K00912', 'K02527', 'K02517', 'K02560'] +== +['K00677'] +['K02535'] +['K02536'] +['K03269'] +['K00748'] +['K00912'] +['K02527'] +['K02517'] +['K02560'] +++++++++++++++++++ +M00866 +K00677 
(K02535,K16363) K02536 K03269 K00748 K00912 K02527 K02517 K09778 +['K00677', '(K02535,K16363)', 'K02536', 'K03269', 'K00748', 'K00912', 'K02527', 'K02517', 'K09778'] +== +['K00677'] +['K02535', 'K16363'] +['K02536'] +['K03269'] +['K00748'] +['K00912'] +['K02527'] +['K02517'] +['K09778'] +++++++++++++++++++ +M00867 +K12977 K03760 K23082+K23083 K23159 K09953 +['K12977', 'K03760', 'K23082+K23083', 'K23159', 'K09953'] +== +['K12977'] +['K03760'] +['K23082+K23083'] +['K23159'] +['K09953'] +++++++++++++++++++ +M00063 +K06041 K01627 K03270 K00979 +['K06041', 'K01627', 'K03270', 'K00979'] +== +['K06041'] +['K01627'] +['K03270'] +['K00979'] +++++++++++++++++++ +M00064 +K03271 (K03272,K21344) K03273 (K03272,K21345) K03274 +['K03271', '(K03272,K21344)', 'K03273', '(K03272,K21345)', 'K03274'] +== +['K03271'] +['K03272', 'K21344'] +['K03273'] +['K03272', 'K21345'] +['K03274'] +++++++++++++++++++ +M00127 +K03147 (K00877,K00941,K14153)(K00878,K14154)(K00788,K14153,K14154) K00946 +['K03147', '(K00877,K00941,K14153)(K00878,K14154)(K00788,K14153,K14154)', 'K00946'] +== +['K03147'] +['K00877', 'K00941', 'K14153', 'K14154', 'K14153K00878', 'K14154K00788'] +['K00946'] +++++++++++++++++++ +M00124 +K03472 K03473 K00831 K00097 K03474 (K00275,K23998) +['K03472', 'K03473', 'K00831', 'K00097', 'K03474', '(K00275,K23998)'] +== +['K03472'] +['K03473'] +['K00831'] +['K00097'] +['K03474'] +['K00275', 'K23998'] +++++++++++++++++++ +M00115 +K00278 K03517 K00767 (K00969,K06210) (K01916,K01950) +['K00278', 'K03517', 'K00767', '(K00969,K06210)', '(K01916,K01950)'] +== +['K00278'] +['K03517'] +['K00767'] +['K00969', 'K06210'] +['K01916', 'K01950'] +++++++++++++++++++ +M00810 +K19818+K19819+K19820 (K19826,K19890) K19185+K19186+K19187 K19188 -K20155 +['K19818+K19819+K19820', '(K19826,K19890)', 'K19185+K19186+K19187', 'K19188', '-K20155'] +== +['K19818+K19819+K19820'] +['K19826', 'K19890'] +['K19185+K19186+K19187'] +['K19188'] +['-K20155'] +++++++++++++++++++ +M00811 +K20170,K20169 (K20170,(K20158 K19700)) K20171-K20172 K15359,K18276 +['K20170,K20169', '(K20170,(K20158_K19700))', 'K20171-K20172', 'K15359,K18276'] +== +['K20170', 'K20169'] +['K20170', 'K20158_K19700'] +['K20171-K20172'] +['K15359', 'K18276'] +++++++++++++++++++ +M00622 +K18029+K18030 K14974 K18028 K15357 K13995 K01799 +['K18029+K18030', 'K14974', 'K18028', 'K15357', 'K13995', 'K01799'] +== +['K18029+K18030'] +['K14974'] +['K18028'] +['K15357'] +['K13995'] +['K01799'] +++++++++++++++++++ +M00120 +(K00867,K03525,K09680,K01947) ((K01922,K21977) K01598,K13038) (K02318,(K00954,K02201) K00859) +['(K00867,K03525,K09680,K01947)', '((K01922,K21977)_K01598,K13038)', '(K02318,(K00954,K02201)_K00859)'] +== +['K00867', 'K03525', 'K09680', 'K01947'] +['K13038', 'K01922,K21977_K01598'] +['K02318', 'K00954,K02201_K00859'] +++++++++++++++++++ +M00572 +K02169 (K00647,K09458) K00059 K02372 K00208 (K02170,K09789,K19560,K19561) +['K02169', '(K00647,K09458)', 'K00059', 'K02372', 'K00208', '(K02170,K09789,K19560,K19561)'] +== +['K02169'] +['K00647', 'K09458'] +['K00059'] +['K02372'] +['K00208'] +['K02170', 'K09789', 'K19560', 'K19561'] +++++++++++++++++++ +M00123 +K00652 ((K00833,K19563) K01935,K19562) K01012 +['K00652', '((K00833,K19563)_K01935,K19562)', 'K01012'] +== +['K00652'] +['K19562', 'K00833,K19563_K01935'] +['K01012'] +++++++++++++++++++ +M00573 +K16593 K00652 K19563 K01935 K01012 +['K16593', 'K00652', 'K19563', 'K01935', 'K01012'] +== +['K16593'] +['K00652'] +['K19563'] +['K01935'] +['K01012'] +++++++++++++++++++ +M00577 +K01906 K00652 (K00833,K19563) K01935 K01012 
+['K01906', 'K00652', '(K00833,K19563)', 'K01935', 'K01012'] +== +['K01906'] +['K00652'] +['K00833', 'K19563'] +['K01935'] +['K01012'] +++++++++++++++++++ +M00126 +(K01495,K09007,K22391) (K01077,K01113,(K08310,K19965)) ((K13939,(K13940,K01633 K00950) K00796),(K01633 K13941)) (K11754,K20457) (K00287,K13998) +['(K01495,K09007,K22391)', '(K01077,K01113,(K08310,K19965))', '((K13939,(K13940,K01633_K00950)_K00796),(K01633_K13941))', '(K11754,K20457)', '(K00287,K13998)'] +== +['K01495', 'K09007', 'K22391'] +['K01077', 'K01113', 'K08310,K19965'] +['K01633_K13941', 'K13939,K13940,K01633_K00950_K00796'] +['K11754', 'K20457'] +['K00287', 'K13998'] +++++++++++++++++++ +M00840 +K14652 K22100 -- K01633 K13941 K22099 K00287 +['K14652', 'K22100', 'K01633', 'K13941', 'K22099', 'K00287'] +== +['K14652'] +['K22100'] +['K01633'] +['K13941'] +['K22099'] +['K00287'] +++++++++++++++++++ +M00841 +K01495 K22101 K00950 K00796 K11754 K13998 +['K01495', 'K22101', 'K00950', 'K00796', 'K11754', 'K13998'] +== +['K01495'] +['K22101'] +['K00950'] +['K00796'] +['K11754'] +['K13998'] +++++++++++++++++++ +M00842 +K01495 K01737 K00072 +['K01495', 'K01737', 'K00072'] +== +['K01495'] +['K01737'] +['K00072'] +++++++++++++++++++ +M00843 +K01495 K01737 K17745 +['K01495', 'K01737', 'K17745'] +== +['K01495'] +['K01737'] +['K17745'] +++++++++++++++++++ +M00880 +((K03639 K03637),K20967) (K03635,K21142) (((K03831,K03638) K03750),K15376) +['((K03639_K03637),K20967)', '(K03635,K21142)', '(((K03831,K03638)_K03750),K15376)'] +== +['K20967', 'K03639_K03637'] +['K03635', 'K21142'] +['K15376', 'K03831,K03638_K03750'] +++++++++++++++++++ +M00140 +K00600 (K01491,(K00300 K01500)) K01938 +['K00600', '(K01491,(K00300_K01500))', 'K01938'] +== +['K00600'] +['K01491', 'K00300_K01500'] +['K01938'] +++++++++++++++++++ +M00141 +K00600 (K00288,(K13403 K13402)) +['K00600', '(K00288,(K13403_K13402))'] +== +['K00600'] +['K00288', 'K13403_K13402'] +++++++++++++++++++ +M00868 +K00643 K01698 K01749 K01719 K01599 K00228 K00231 K01772 +['K00643', 'K01698', 'K01749', 'K01719', 'K01599', 'K00228', 'K00231', 'K01772'] +== +['K00643'] +['K01698'] +['K01749'] +['K01719'] +['K01599'] +['K00228'] +['K00231'] +['K01772'] +++++++++++++++++++ +M00121 +(K01885,K14163) K02492 K01845 K01698 K01749 (K01719,K13542,K13543) K01599 (K00228,K02495) (K00230,K00231) K01772 +['(K01885,K14163)', 'K02492', 'K01845', 'K01698', 'K01749', '(K01719,K13542,K13543)', 'K01599', '(K00228,K02495)', '(K00230,K00231)', 'K01772'] +== +['K01885', 'K14163'] +['K02492'] +['K01845'] +['K01698'] +['K01749'] +['K01719', 'K13542', 'K13543'] +['K01599'] +['K00228', 'K02495'] +['K00230', 'K00231'] +['K01772'] +++++++++++++++++++ +M00846 +(K01885,K14163) K02492 K01845 K01698 K01749 (K01719,K13542,K13543) (K02302,(K00589,K02303,K02496,K13542,K13543)+K02304-K03794) +['(K01885,K14163)', 'K02492', 'K01845', 'K01698', 'K01749', '(K01719,K13542,K13543)', '(K02302,(K00589,K02303,K02496,K13542,K13543)+K02304-K03794)'] +== +['K01885', 'K14163'] +['K02492'] +['K01845'] +['K01698'] +['K01749'] +['K01719', 'K13542', 'K13543'] +['K02302', 'K00589,K02303,K02496,K13542,K13543+K02304-K03794'] +++++++++++++++++++ +M00847 +K22225 K22226 K22227 +['K22225', 'K22226', 'K22227'] +== +['K22225'] +['K22226'] +['K22227'] +++++++++++++++++++ +M00836 +K22011 K22012 (K21610+K21611) K21612 +['K22011', 'K22012', '(K21610+K21611)', 'K21612'] +== +['K22011'] +['K22012'] +['K21610+K21611'] +['K21612'] +++++++++++++++++++ +M00117 +(K03181,K18240) K03179 (K03182+K03186) K18800 K00568 K03185 K03183 K03184 K00568 +['(K03181,K18240)', 'K03179', 
'(K03182+K03186)', 'K18800', 'K00568', 'K03185', 'K03183', 'K03184', 'K00568'] +== +['K03181', 'K18240'] +['K03179'] +['K03182+K03186'] +['K18800'] +['K00568'] +['K03185'] +['K03183'] +['K03184'] +['K00568'] +++++++++++++++++++ +M00128 +K06125 K00591 K06126 K06127 K06134 K00591 +['K06125', 'K00591', 'K06126', 'K06127', 'K06134', 'K00591'] +== +['K06125'] +['K00591'] +['K06126'] +['K06127'] +['K06134'] +['K00591'] +++++++++++++++++++ +M00116 +K02552 K02551 K08680 K02549 K01911 K01661 K19222 K02548 K03183 +['K02552', 'K02551', 'K08680', 'K02549', 'K01911', 'K01661', 'K19222', 'K02548', 'K03183'] +== +['K02552'] +['K02551'] +['K08680'] +['K02549'] +['K01911'] +['K01661'] +['K19222'] +['K02548'] +['K03183'] +++++++++++++++++++ +M00112 +K09833 (K12502,K18534) K09834 K05928 +['K09833', '(K12502,K18534)', 'K09834', 'K05928'] +== +['K09833'] +['K12502', 'K18534'] +['K09834'] +['K05928'] +++++++++++++++++++ +M00095 +K00626 K01641 K00021 K00869 (K00938,K13273) K01597 K01823 +['K00626', 'K01641', 'K00021', 'K00869', '(K00938,K13273)', 'K01597', 'K01823'] +== +['K00626'] +['K01641'] +['K00021'] +['K00869'] +['K00938', 'K13273'] +['K01597'] +['K01823'] +++++++++++++++++++ +M00849 +K00626 K01641 (K00021,K00054) ((K00869 K17942),(K18689 K18690 K22813)) K06981 K01823 +['K00626', 'K01641', '(K00021,K00054)', '((K00869_K17942),(K18689_K18690_K22813))', 'K06981', 'K01823'] +== +['K00626'] +['K01641'] +['K00021', 'K00054'] +['K00869_K17942', 'K18689_K18690_K22813'] +['K06981'] +['K01823'] +++++++++++++++++++ +M00096 +K01662 K00099 (K00991,K12506) K00919 (K01770,K12506) K03526 K03527 K01823 +['K01662', 'K00099', '(K00991,K12506)', 'K00919', '(K01770,K12506)', 'K03526', 'K03527', 'K01823'] +== +['K01662'] +['K00099'] +['K00991', 'K12506'] +['K00919'] +['K01770', 'K12506'] +['K03526'] +['K03527'] +['K01823'] +++++++++++++++++++ +M00364 +K01823 (K00795,K13789,K13787) +['K01823', '(K00795,K13789,K13787)'] +== +['K01823'] +['K00795', 'K13789', 'K13787'] +++++++++++++++++++ +M00365 +K01823 K13787 +['K01823', 'K13787'] +== +['K01823'] +['K13787'] +++++++++++++++++++ +M00366 +K01823 K14066 K00787 K13789 +['K01823', 'K14066', 'K00787', 'K13789'] +== +['K01823'] +['K14066'] +['K00787'] +['K13789'] +++++++++++++++++++ +M00367 +K01823 K00787 K00804 +['K01823', 'K00787', 'K00804'] +== +['K01823'] +['K00787'] +['K00804'] +++++++++++++++++++ +M00097 +K02291 K02293 K15744 K00514 K09835 K06443 +['K02291', 'K02293', 'K15744', 'K00514', 'K09835', 'K06443'] +== +['K02291'] +['K02293'] +['K15744'] +['K00514'] +['K09835'] +['K06443'] +++++++++++++++++++ +M00372 +(K15746,K15747) K09838 -K14594 K09840 K09841 K09842 +['(K15746,K15747)', 'K09838', '-K14594', 'K09840', 'K09841', 'K09842'] +== +['K15746', 'K15747'] +['K09838'] +['-K14594'] +['K09840'] +['K09841'] +['K09842'] +++++++++++++++++++ +M00371 +(K09587,K12639) K09588 K09591 (K12637,K12638) K20623 (K09590,K12640) +['(K09587,K12639)', 'K09588', 'K09591', '(K12637,K12638)', 'K20623', '(K09590,K12640)'] +== +['K09587', 'K12639'] +['K09588'] +['K09591'] +['K12637', 'K12638'] +['K20623'] +['K09590', 'K12640'] +++++++++++++++++++ +M00773 +K15988 K15989+K15990 K15992 K15991 K15993 K15994 K15995 K15996 +['K15988', 'K15989+K15990', 'K15992', 'K15991', 'K15993', 'K15994', 'K15995', 'K15996'] +== +['K15988'] +['K15989+K15990'] +['K15992'] +['K15991'] +['K15993'] +['K15994'] +['K15995'] +['K15996'] +++++++++++++++++++ +M00774 +K10817 K14366 K14367 K14368+K15997 K14370 K14369 +['K10817', 'K14366', 'K14367', 'K14368+K15997', 'K14370', 'K14369'] +== +['K10817'] +['K14366'] +['K14367'] 
+['K14368+K15997'] +['K14370'] +['K14369'] +++++++++++++++++++ +M00775 +K16007 K16008 K16009 K13320 K16010 +['K16007', 'K16008', 'K16009', 'K13320', 'K16010'] +== +['K16007'] +['K16008'] +['K16009'] +['K13320'] +['K16010'] +++++++++++++++++++ +M00776 +K16000+K16001+K16002-K16003 K16004 K16005 K16006 +['K16000+K16001+K16002-K16003', 'K16004', 'K16005', 'K16006'] +== +['K16000+K16001+K16002-K16003'] +['K16004'] +['K16005'] +['K16006'] +++++++++++++++++++ +M00777 +K14371 K14372 K14373 K14374 K14375 +['K14371', 'K14372', 'K14373', 'K14374', 'K14375'] +== +['K14371'] +['K14372'] +['K14373'] +['K14374'] +['K14375'] +++++++++++++++++++ +M00824 +K15314 K21160+K21161+K21162+K21163+K21164+K21165+K21166+K21167 +['K15314', 'K21160+K21161+K21162+K21163+K21164+K21165+K21166+K21167'] +== +['K15314'] +['K21160+K21161+K21162+K21163+K21164+K21165+K21166+K21167'] +++++++++++++++++++ +M00825 +K15314 K21168+K21169+K21170+K21171+K21172+K21173+K21174 +['K15314', 'K21168+K21169+K21170+K21171+K21172+K21173+K21174'] +== +['K15314'] +['K21168+K21169+K21170+K21171+K21172+K21173+K21174'] +++++++++++++++++++ +M00826 +K20159+K21175 K20156 K21176 K21177 K21178 K21179 +['K20159+K21175', 'K20156', 'K21176', 'K21177', 'K21178', 'K21179'] +== +['K20159+K21175'] +['K20156'] +['K21176'] +['K21177'] +['K21178'] +['K21179'] +++++++++++++++++++ +M00829 +K15320 K21191 K21192 +['K15320', 'K21191', 'K21192'] +== +['K15320'] +['K21191'] +['K21192'] +++++++++++++++++++ +M00830 +K20422 K20420 K20421 K20423 +['K20422', 'K20420', 'K20421', 'K20423'] +== +['K20422'] +['K20420'] +['K20421'] +['K20423'] +++++++++++++++++++ +M00831 +K21221 K21222 K21223 K21224 K21225 +['K21221', 'K21222', 'K21223', 'K21224', 'K21225'] +== +['K21221'] +['K21222'] +['K21223'] +['K21224'] +['K21225'] +++++++++++++++++++ +M00834 +K21254 K21255 K21256 K21257 K21258 +['K21254', 'K21255', 'K21256', 'K21257', 'K21258'] +== +['K21254'] +['K21255'] +['K21256'] +['K21257'] +['K21258'] +++++++++++++++++++ +M00778 +K05551+K05552+K05553 -K12420 ((K05554,K14249,K15884,K15885) (K05555,K14250),K15886) +['K05551+K05552+K05553', '-K12420', '((K05554,K14249,K15884,K15885)_(K05555,K14250),K15886)'] +== +['K05551+K05552+K05553'] +['-K12420'] +['K15886', 'K05554,K14249,K15884,K15885_K05555,K14250'] +++++++++++++++++++ +M00779 +K05556 (K14626,K14627) (K14628,K14629) (K14630+K14631,K14632) +['K05556', '(K14626,K14627)', '(K14628,K14629)', '(K14630+K14631,K14632)'] +== +['K05556'] +['K14626', 'K14627'] +['K14628', 'K14629'] +['K14632', 'K14630+K14631'] +++++++++++++++++++ +M00780 +K14251 K14252 K14253 K14254 K14255 K14256 K21301 +['K14251', 'K14252', 'K14253', 'K14254', 'K14255', 'K14256', 'K21301'] +== +['K14251'] +['K14252'] +['K14253'] +['K14254'] +['K14255'] +['K14256'] +['K21301'] +++++++++++++++++++ +M00823 +K14251 K14252 K14253 K14254 K14255 K14256 K21301 K14257+K21297 +['K14251', 'K14252', 'K14253', 'K14254', 'K14255', 'K14256', 'K21301', 'K14257+K21297'] +== +['K14251'] +['K14252'] +['K14253'] +['K14254'] +['K14255'] +['K14256'] +['K21301'] +['K14257+K21297'] +++++++++++++++++++ +M00781 +K15941 K15942 K15943 K15944 +['K15941', 'K15942', 'K15943', 'K15944'] +== +['K15941'] +['K15942'] +['K15943'] +['K15944'] +++++++++++++++++++ +M00782 +K15959 K15960 K15961 K15963 K15964 K15965 K15966 K15967 +['K15959', 'K15960', 'K15961', 'K15963', 'K15964', 'K15965', 'K15966', 'K15967'] +== +['K15959'] +['K15960'] +['K15961'] +['K15963'] +['K15964'] +['K15965'] +['K15966'] +['K15967'] +++++++++++++++++++ +M00783 +K15968 K15969 K15886 -K15970 K15971 K15972 +['K15968', 'K15969', 'K15886', 
'-K15970', 'K15971', 'K15972'] +== +['K15968'] +['K15969'] +['K15886'] +['-K15970'] +['K15971'] +['K15972'] +++++++++++++++++++ +M00784 +K19566 K19567 K19568 K19569 K19570 +['K19566', 'K19567', 'K19568', 'K19569', 'K19570'] +== +['K19566'] +['K19567'] +['K19568'] +['K19569'] +['K19570'] +++++++++++++++++++ +M00793 +K00973 K01710 (K01790 K00067,K23987) +['K00973', 'K01710', '(K01790_K00067,K23987)'] +== +['K00973'] +['K01710'] +['K23987', 'K01790_K00067'] +++++++++++++++++++ +M00794 +K13312 K13313 +['K13312', 'K13313'] +== +['K13312'] +['K13313'] +++++++++++++++++++ +M00795 +K19855 K12710 K17625 +['K19855', 'K12710', 'K17625'] +== +['K19855'] +['K12710'] +['K17625'] +++++++++++++++++++ +M00796 +K19853 K19854 K13307 +['K19853', 'K19854', 'K13307'] +== +['K19853'] +['K19854'] +['K13307'] +++++++++++++++++++ +M00797 +K13308 K13309 (K13310,K16436) (K13311,K13326) +['K13308', 'K13309', '(K13310,K16436)', '(K13311,K13326)'] +== +['K13308'] +['K13309'] +['K13310', 'K16436'] +['K13311', 'K13326'] +++++++++++++++++++ +M00798 +K16435 K13315 K13317 (K13316,K16438) K13318 +['K16435', 'K13315', 'K13317', '(K13316,K16438)', 'K13318'] +== +['K16435'] +['K13315'] +['K13317'] +['K13316', 'K16438'] +['K13318'] +++++++++++++++++++ +M00799 +K16435 K13315 K16438 K19856 K19857 +['K16435', 'K13315', 'K16438', 'K19856', 'K19857'] +== +['K16435'] +['K13315'] +['K16438'] +['K19856'] +['K19857'] +++++++++++++++++++ +M00800 +K16435 K16436 K13326 K16438 K13322 +['K16435', 'K16436', 'K13326', 'K16438', 'K13322'] +== +['K16435'] +['K16436'] +['K13326'] +['K16438'] +['K13322'] +++++++++++++++++++ +M00801 +K16435 K13327 K19858 K13319 +['K16435', 'K13327', 'K19858', 'K13319'] +== +['K16435'] +['K13327'] +['K19858'] +['K13319'] +++++++++++++++++++ +M00802 +K16435 K13327 K13328 K13329 K13330 +['K16435', 'K13327', 'K13328', 'K13329', 'K13330'] +== +['K16435'] +['K13327'] +['K13328'] +['K13329'] +['K13330'] +++++++++++++++++++ +M00803 +K16435 K19859 K16436 K13332 +['K16435', 'K19859', 'K16436', 'K13332'] +== +['K16435'] +['K19859'] +['K16436'] +['K13332'] +++++++++++++++++++ +M00672 +K12743 K04126 K10852 +['K12743', 'K04126', 'K10852'] +== +['K12743'] +['K04126'] +['K10852'] +++++++++++++++++++ +M00673 +K12743 K04126 K04127 K12744 K12745 K04128 K18062 K18063 +['K12743', 'K04126', 'K04127', 'K12744', 'K12745', 'K04128', 'K18062', 'K18063'] +== +['K12743'] +['K04126'] +['K04127'] +['K12744'] +['K12745'] +['K04128'] +['K18062'] +['K18063'] +++++++++++++++++++ +M00675 +K18317 K18316 K18315 +['K18317', 'K18316', 'K18315'] +== +['K18317'] +['K18316'] +['K18315'] +++++++++++++++++++ +M00736 +K19102+K19103+K05375 K19104 K19105 K19106 +['K19102+K19103+K05375', 'K19104', 'K19105', 'K19106'] +== +['K19102+K19103+K05375'] +['K19104'] +['K19105'] +['K19106'] +++++++++++++++++++ +M00674 +K12673 K12674 K12675 K12676 +['K12673', 'K12674', 'K12675', 'K12676'] +== +['K12673'] +['K12674'] +['K12675'] +['K12676'] +++++++++++++++++++ +M00039 +(K10775,K13064) K00487 K01904 K13065 K09754 K00588 K09753 K09755 K13066 (K00083,K22395) +['(K10775,K13064)', 'K00487', 'K01904', 'K13065', 'K09754', 'K00588', 'K09753', 'K09755', 'K13066', '(K00083,K22395)'] +== +['K10775', 'K13064'] +['K00487'] +['K01904'] +['K13065'] +['K09754'] +['K00588'] +['K09753'] +['K09755'] +['K13066'] +['K00083', 'K22395'] +++++++++++++++++++ +M00137 +K10775 K00487 K01904 K00660 K01859 +['K10775', 'K00487', 'K01904', 'K00660', 'K01859'] +== +['K10775'] +['K00487'] +['K01904'] +['K00660'] +['K01859'] +++++++++++++++++++ +M00138 +K00475 K13082 K05277 +['K00475', 'K13082', 'K05277'] +== 
+['K00475'] +['K13082'] +['K05277'] +++++++++++++++++++ +M00661 +K18385 K18386 K18387 +['K18385', 'K18386', 'K18387'] +== +['K18385'] +['K18386'] +['K18387'] +++++++++++++++++++ +M00370 +(K11812,K11813) K11818 K11819 K11820 K11821 +['(K11812,K11813)', 'K11818', 'K11819', 'K11820', 'K11821'] +== +['K11812', 'K11813'] +['K11818'] +['K11819'] +['K11820'] +['K11821'] +++++++++++++++++++ +M00814 +K19969 K19979 K19974 K20424 K20425 K20426 K20427 K20430 -- -- +['K19969', 'K19979', 'K19974', 'K20424', 'K20425', 'K20426', 'K20427', 'K20430'] +== +['K19969'] +['K19979'] +['K19974'] +['K20424'] +['K20425'] +['K20426'] +['K20427'] +['K20430'] +++++++++++++++++++ +M00815 +K19969 K20431 K20432 K20433 K20434 K20435 K20436 K20437 K20438 +['K19969', 'K20431', 'K20432', 'K20433', 'K20434', 'K20435', 'K20436', 'K20437', 'K20438'] +== +['K19969'] +['K20431'] +['K20432'] +['K20433'] +['K20434'] +['K20435'] +['K20436'] +['K20437'] +['K20438'] +++++++++++++++++++ +M00786 +K18281 K14132 K17475 K18280 K17827 K17826 K14134 K17825 K18279 +['K18281', 'K14132', 'K17475', 'K18280', 'K17827', 'K17826', 'K14134', 'K17825', 'K18279'] +== +['K18281'] +['K14132'] +['K17475'] +['K18280'] +['K17827'] +['K17826'] +['K14134'] +['K17825'] +['K18279'] +++++++++++++++++++ +M00789 +K14266 K19884 K19885 K19886+K19887 K19888 K19889 +['K14266', 'K19884', 'K19885', 'K19886+K19887', 'K19888', 'K19889'] +== +['K14266'] +['K19884'] +['K19885'] +['K19886+K19887'] +['K19888'] +['K19889'] +++++++++++++++++++ +M00790 +K14266 K19981 K14257 K19982 +['K14266', 'K19981', 'K14257', 'K19982'] +== +['K14266'] +['K19981'] +['K14257'] +['K19982'] +++++++++++++++++++ +M00805 +K20075 K20076 K20077+K20078 K20079 K20080 K20081 K20082 +['K20075', 'K20076', 'K20077+K20078', 'K20079', 'K20080', 'K20081', 'K20082'] +== +['K20075'] +['K20076'] +['K20077+K20078'] +['K20079'] +['K20080'] +['K20081'] +['K20082'] +++++++++++++++++++ +M00808 +K20086 K20087+K20088 K20089 K20090 +['K20086', 'K20087+K20088', 'K20089', 'K20090'] +== +['K20086'] +['K20087+K20088'] +['K20089'] +['K20090'] +++++++++++++++++++ +M00835 +K13063 K20261 K06998 K20260 K20262 K21103 K20940 +['K13063', 'K20261', 'K06998', 'K20260', 'K20262', 'K21103', 'K20940'] +== +['K13063'] +['K20261'] +['K06998'] +['K20260'] +['K20262'] +['K21103'] +['K20940'] +++++++++++++++++++ +M00877 +K18652 K18653 K18654 +['K18652', 'K18653', 'K18654'] +== +['K18652'] +['K18653'] +['K18654'] +++++++++++++++++++ +M00787 +K19546 K19547 K19550 K19549 K19548 K13037 +['K19546', 'K19547', 'K19550', 'K19549', 'K19548', 'K13037'] +== +['K19546'] +['K19547'] +['K19550'] +['K19549'] +['K19548'] +['K13037'] +++++++++++++++++++ +M00848 +K09460 K02078+K14245+K14246+K22798 K22799 K22800 K21272 K21271 K22801 K22802 +['K09460', 'K02078+K14245+K14246+K22798', 'K22799', 'K22800', 'K21272', 'K21271', 'K22801', 'K22802'] +== +['K09460'] +['K02078+K14245+K14246+K22798'] +['K22799'] +['K22800'] +['K21272'] +['K21271'] +['K22801'] +['K22802'] +++++++++++++++++++ +M00788 +K19835 K19834 +['K19835', 'K19834'] +== +['K19835'] +['K19834'] +++++++++++++++++++ +M00819 +K12250 K15907 -- K18056 K17747 K18091 K18057 K17476 +['K12250', 'K15907', 'K18056', 'K17747', 'K18091', 'K18057', 'K17476'] +== +['K12250'] +['K15907'] +['K18056'] +['K17747'] +['K18091'] +['K18057'] +['K17476'] +++++++++++++++++++ +M00876 +K21898 K23446 K23447 +['K21898', 'K23446', 'K23447'] +== +['K21898'] +['K23446'] +['K23447'] +++++++++++++++++++ +M00875 +K23371 K21949 K21721 K23372 K23373 K23374 K23375 +['K23371', 'K21949', 'K21721', 'K23372', 'K23373', 'K23374', 'K23375'] +== 
+['K23371'] +['K21949'] +['K21721'] +['K23372'] +['K23373'] +['K23374'] +['K23375'] +++++++++++++++++++ +M00538 +K15760+K15761-K15762+K15763+K15764-K15765 K00055 K00141 +['K15760+K15761-K15762+K15763+K15764-K15765', 'K00055', 'K00141'] +== +['K15760+K15761-K15762+K15763+K15764-K15765'] +['K00055'] +['K00141'] +++++++++++++++++++ +M00537 +K15757+K15758 K00055 K00141 +['K15757+K15758', 'K00055', 'K00141'] +== +['K15757+K15758'] +['K00055'] +['K00141'] +++++++++++++++++++ +M00419 +K10616+K18293 K10617 K10618 +['K10616+K18293', 'K10617', 'K10618'] +== +['K10616+K18293'] +['K10617'] +['K10618'] +++++++++++++++++++ +M00547 +K03268+K16268+K18089+K18090 K16269 +['K03268+K16268+K18089+K18090', 'K16269'] +== +['K03268+K16268+K18089+K18090'] +['K16269'] +++++++++++++++++++ +M00548 +K16249+K16243+K16244+K16242+K16245+K16246 +['K16249+K16243+K16244+K16242+K16245+K16246'] +== +['K16249+K16243+K16244+K16242+K16245+K16246'] +++++++++++++++++++ +M00551 +K05549+K05550+K05784 K05783 +['K05549+K05550+K05784', 'K05783'] +== +['K05549+K05550+K05784'] +['K05783'] +++++++++++++++++++ +M00637 +(K05599+K05600+K11311,K16319+K16320+K18248+K18249) +['(K05599+K05600+K11311,K16319+K16320+K18248+K18249)'] +== +['K05599+K05600+K11311', 'K16319+K16320+K18248+K18249'] +++++++++++++++++++ +M00568 +K03381 K01856 K03464 (K01055,K14727) +['K03381', 'K01856', 'K03464', '(K01055,K14727)'] +== +['K03381'] +['K01856'] +['K03464'] +['K01055', 'K14727'] +++++++++++++++++++ +M00569 +(K00446,K07104) ((K10217 K01821 K01617),K10216) (K18364,K02554) (K18365,K01666) (K18366,K04073) +['(K00446,K07104)', '((K10217_K01821_K01617),K10216)', '(K18364,K02554)', '(K18365,K01666)', '(K18366,K04073)'] +== +['K00446', 'K07104'] +['K10216', 'K10217_K01821_K01617'] +['K18364', 'K02554'] +['K18365', 'K01666'] +['K18366', 'K04073'] +++++++++++++++++++ +M00539 +K10619+K16303+K16304+K18227 K10620 K10621 K10622 K10623 +['K10619+K16303+K16304+K18227', 'K10620', 'K10621', 'K10622', 'K10623'] +== +['K10619+K16303+K16304+K18227'] +['K10620'] +['K10621'] +['K10622'] +['K10623'] +++++++++++++++++++ +M00543 +K08689+K15750+K18087+K18088 K08690 K00462 K10222 +['K08689+K15750+K18087+K18088', 'K08690', 'K00462', 'K10222'] +== +['K08689+K15750+K18087+K18088'] +['K08690'] +['K00462'] +['K10222'] +++++++++++++++++++ +M00544 +K15751-K15752-K15753 K15754+K15755 K15756 +['K15751-K15752-K15753', 'K15754+K15755', 'K15756'] +== +['K15751-K15752-K15753'] +['K15754+K15755'] +['K15756'] +++++++++++++++++++ +M00418 +K07540 K07543+K07544 K07545 K07546 K07547+K07548 K07549+K07550 +['K07540', 'K07543+K07544', 'K07545', 'K07546', 'K07547+K07548', 'K07549+K07550'] +== +['K07540'] +['K07543+K07544'] +['K07545'] +['K07546'] +['K07547+K07548'] +['K07549+K07550'] +++++++++++++++++++ +M00541 +(K04112+K04113+K04114+K04115,K19515+K19516) K07537 K07538 K07539 +['(K04112+K04113+K04114+K04115,K19515+K19516)', 'K07537', 'K07538', 'K07539'] +== +['K19515+K19516', 'K04112+K04113+K04114+K04115'] +['K07537'] +['K07538'] +['K07539'] +++++++++++++++++++ +M00540 +K04116 K04117 K07534 K07535 K07536 +['K04116', 'K04117', 'K07534', 'K07535', 'K07536'] +== +['K04116'] +['K04117'] +['K07534'] +['K07535'] +['K07536'] +++++++++++++++++++ +M00534 +K14579+K14580+K14578+K14581 K14582 K14583 K14584 K14585 K00152 +['K14579+K14580+K14578+K14581', 'K14582', 'K14583', 'K14584', 'K14585', 'K00152'] +== +['K14579+K14580+K14578+K14581'] +['K14582'] +['K14583'] +['K14584'] +['K14585'] +['K00152'] +++++++++++++++++++ +M00638 +K18242+K18243+K14578+K14581 +['K18242+K18243+K14578+K14581'] +== +['K18242+K18243+K14578+K14581'] 
+++++++++++++++++++ +M00624 +K18074+K18075+K18077 K18076 +['K18074+K18075+K18077', 'K18076'] +== +['K18074+K18075+K18077'] +['K18076'] +++++++++++++++++++ +M00623 +K18068+K18069 K18067 K04102 +['K18068+K18069', 'K18067', 'K04102'] +== +['K18068+K18069'] +['K18067'] +['K04102'] +++++++++++++++++++ +M00636 +K18251+K18252-K18253-K18254 K18255 K18256 +['K18251+K18252-K18253-K18254', 'K18255', 'K18256'] +== +['K18251+K18252-K18253-K18254'] +['K18255'] +['K18256'] +++++++++++++++++++ +M00878 +K01912 K02609+K02610+K02611+K02612+K02613 K15866 K02618 K02615 K01692 K00074 +['K01912', 'K02609+K02610+K02611+K02612+K02613', 'K15866', 'K02618', 'K02615', 'K01692', 'K00074'] +== +['K01912'] +['K02609+K02610+K02611+K02612+K02613'] +['K15866'] +['K02618'] +['K02615'] +['K01692'] +['K00074'] +++++++++++++++++++ +M00852 +K10961 K10920 K10919 K10930 K10931 K10962 K10932 K10963 K10933 K10964 K10965 K10934 K10935 K10966 +['K10961', 'K10920', 'K10919', 'K10930', 'K10931', 'K10962', 'K10932', 'K10963', 'K10933', 'K10964', 'K10965', 'K10934', 'K10935', 'K10966'] +== +['K10961'] +['K10920'] +['K10919'] +['K10930'] +['K10931'] +['K10962'] +['K10932'] +['K10963'] +['K10933'] +['K10964'] +['K10965'] +['K10934'] +['K10935'] +['K10966'] +++++++++++++++++++ +M00850 +(K10928+K10929) K10954 K10952 K10953 K10948 K11018 +['(K10928+K10929)', 'K10954', 'K10952', 'K10953', 'K10948', 'K11018'] +== +['K10928+K10929'] +['K10954'] +['K10952'] +['K10953'] +['K10948'] +['K11018'] +++++++++++++++++++ +M00542 +K03221+K03219+K03222+K03226+K03227+K03228+K03229+K03230+K03224+K03225 K12784 K12787 K12785 K12786 K12788 K16041 K16042 +['K03221+K03219+K03222+K03226+K03227+K03228+K03229+K03230+K03224+K03225', 'K12784', 'K12787', 'K12785', 'K12786', 'K12788', 'K16041', 'K16042'] +== +['K03221+K03219+K03222+K03226+K03227+K03228+K03229+K03230+K03224+K03225'] +['K12784'] +['K12787'] +['K12785'] +['K12786'] +['K12788'] +['K16041'] +['K16042'] +++++++++++++++++++ +M00363 +K11006 K11007 +['K11006', 'K11007'] +== +['K11006'] +['K11007'] +++++++++++++++++++ +M00853 +K22850 -K22851 K22852 K22853 K22854 +['K22850', '-K22851', 'K22852', 'K22853', 'K22854'] +== +['K22850'] +['-K22851'] +['K22852'] +['K22853'] +['K22854'] +++++++++++++++++++ +M00576 +(K10928+K10929) (K16883,K16884) +['(K10928+K10929)', '(K16883,K16884)'] +== +['K10928+K10929'] +['K16883', 'K16884'] +++++++++++++++++++ +M00856 +K11014 K11023 K19298 K22918 +['K11014', 'K11023', 'K19298', 'K22918'] +== +['K11014'] +['K11023'] +['K19298'] +['K22918'] +++++++++++++++++++ +M00857 +K22914 K22926 K22925 K22915 K22916 K22917 (K22924+K22921+K22923+K22922) +['K22914', 'K22926', 'K22925', 'K22915', 'K22916', 'K22917', '(K22924+K22921+K22923+K22922)'] +== +['K22914'] +['K22926'] +['K22925'] +['K22915'] +['K22916'] +['K22917'] +['K22924+K22921+K22923+K22922'] +++++++++++++++++++ +M00575 +K22944 K11004 K07389 K11003 K12340 +['K22944', 'K11004', 'K07389', 'K11003', 'K12340'] +== +['K22944'] +['K11004'] +['K07389'] +['K11003'] +['K12340'] +++++++++++++++++++ +M00574 +K11023 -K11024 K11025 K11026 K11027 +['K11023', '-K11024', 'K11025', 'K11026', 'K11027'] +== +['K11023'] +['-K11024'] +['K11025'] +['K11026'] +['K11027'] +++++++++++++++++++ +M00564 +K15842 K12086 -K12087 K12088 K12089 K12090 K03196 K12091 -K12092 K12093 K12094 K12095 K12096 K12097 K12098 -K12099 -K12100 K12101 K12102 K12103 K12104 K12105 K12106 K12107 K12108 K12109 K12110 +['K15842', 'K12086', '-K12087', 'K12088', 'K12089', 'K12090', 'K03196', 'K12091', '-K12092', 'K12093', 'K12094', 'K12095', 'K12096', 'K12097', 'K12098', '-K12099', '-K12100', 
'K12101', 'K12102', 'K12103', 'K12104', 'K12105', 'K12106', 'K12107', 'K12108', 'K12109', 'K12110'] +== +['K15842'] +['K12086'] +['-K12087'] +['K12088'] +['K12089'] +['K12090'] +['K03196'] +['K12091'] +['-K12092'] +['K12093'] +['K12094'] +['K12095'] +['K12096'] +['K12097'] +['K12098'] +['-K12099'] +['-K12100'] +['K12101'] +['K12102'] +['K12103'] +['K12104'] +['K12105'] +['K12106'] +['K12107'] +['K12108'] +['K12109'] +['K12110'] +++++++++++++++++++ +M00859 +K11030 K08645 K11029 +['K11030', 'K08645', 'K11029'] +== +['K11030'] +['K08645'] +['K11029'] +++++++++++++++++++ +M00860 +K22976 K22977 K22980 K07282 K22116 K01932 K22981 +['K22976', 'K22977', 'K22980', 'K07282', 'K22116', 'K01932', 'K22981'] +== +['K22976'] +['K22977'] +['K22980'] +['K07282'] +['K22116'] +['K01932'] +['K22981'] +++++++++++++++++++ +M00851 +(K18768,K18970,K19316,K22346,K18794,K19318,K18971,K18793,K19319,K19320,K19321,K19322,K18972,K19211,K18976,K21277,K18782,K18781,K18780,K19099,K19216) +['(K18768,K18970,K19316,K22346,K18794,K19318,K18971,K18793,K19319,K19320,K19321,K19322,K18972,K19211,K18976,K21277,K18782,K18781,K18780,K19099,K19216)'] +== +['K18768', 'K18970', 'K19316', 'K22346', 'K18794', 'K19318', 'K18971', 'K18793', 'K19319', 'K19320', 'K19321', 'K19322', 'K18972', 'K19211', 'K18976', 'K21277', 'K18782', 'K18781', 'K18780', 'K19099', 'K19216'] +++++++++++++++++++ +M00625 +K02547 K02546 K02545 +['K02547', 'K02546', 'K02545'] +== +['K02547'] +['K02546'] +['K02545'] +++++++++++++++++++ +M00627 +K02172 K02171 (K18766,K17836) +['K02172', 'K02171', '(K18766,K17836)'] +== +['K02172'] +['K02171'] +['K18766', 'K17836'] +++++++++++++++++++ +M00745 +(K18072 K18073),(K07644 K07665),K18297 K18093 +['(K18072_K18073),(K07644_K07665),K18297', 'K18093'] +== +['K18297', 'K18072_K18073', 'K07644_K07665'] +['K18093'] +++++++++++++++++++ +M00651 +(K18345 K18344 K07260 K18346),(K18351 K18352 K18354 K18353) (K18347 K15739 K08641) +['(K18345_K18344_K07260_K18346),(K18351_K18352_K18354_K18353)', '(K18347_K15739_K08641)'] +== +['K18345_K18344_K07260_K18346', 'K18351_K18352_K18354_K18353'] +['K18347_K15739_K08641'] +++++++++++++++++++ +M00652 +K18350 K18349 K18348 K18856 K18866 +['K18350', 'K18349', 'K18348', 'K18856', 'K18866'] +== +['K18350'] +['K18349'] +['K18348'] +['K18856'] +['K18866'] +++++++++++++++++++ +M00704 +K18906 K08168 +['K18906', 'K08168'] +== +['K18906'] +['K08168'] +++++++++++++++++++ +M00725 +K19077 K19078 K03367+K03739+K14188+K03740 +['K19077', 'K19078', 'K03367+K03739+K14188+K03740'] +== +['K19077'] +['K19078'] +['K03367+K03739+K14188+K03740'] +++++++++++++++++++ +M00726 +K19077 K19078 K14205 +['K19077', 'K19078', 'K14205'] +== +['K19077'] +['K19078'] +['K14205'] +++++++++++++++++++ +M00730 +K19077 K19078 K19079+K19080 +['K19077', 'K19078', 'K19079+K19080'] +== +['K19077'] +['K19078'] +['K19079+K19080'] +++++++++++++++++++ +M00744 +K07637 K07660 K08477 +['K07637', 'K07660', 'K08477'] +== +['K07637'] +['K07660'] +['K08477'] +++++++++++++++++++ +M00718 +K18131 K03585+K18138+K18139 +['K18131', 'K03585+K18138+K18139'] +== +['K18131'] +['K03585+K18138+K18139'] +++++++++++++++++++ +M00639 +K18294 K18295+K18296-K08721 +['K18294', 'K18295+K18296-K08721'] +== +['K18294'] +['K18295+K18296-K08721'] +++++++++++++++++++ +M00641 +K18297 K18298+K18299-K18300 +['K18297', 'K18298+K18299-K18300'] +== +['K18297'] +['K18298+K18299-K18300'] +++++++++++++++++++ +M00642 +K18301 K18302+K18303-K18139 +['K18301', 'K18302+K18303-K18139'] +== +['K18301'] +['K18302+K18303-K18139'] +++++++++++++++++++ +M00643 +K18129 K18094+K18095+K18139 +['K18129', 
'K18094+K18095+K18139'] +== +['K18129'] +['K18094+K18095+K18139'] +++++++++++++++++++ +M00769 +K18304 K19591 K19595+K19594+K19593 +['K18304', 'K19591', 'K19595+K19594+K19593'] +== +['K18304'] +['K19591'] +['K19595+K19594+K19593'] +++++++++++++++++++ +M00649 +K18143 K18144 K18145+K18146-K18147 +['K18143', 'K18144', 'K18145+K18146-K18147'] +== +['K18143'] +['K18144'] +['K18145+K18146-K18147'] +++++++++++++++++++ +M00696 +K18140 K18141+K18142+K12340 +['K18140', 'K18141+K18142+K12340'] +== +['K18140'] +['K18141+K18142+K12340'] +++++++++++++++++++ +M00697 +K07690 K18898+K18899+K12340 +['K07690', 'K18898+K18899+K12340'] +== +['K07690'] +['K18898+K18899+K12340'] +++++++++++++++++++ +M00698 +K18900 K18901+K18902+K18903 +['K18900', 'K18901+K18902+K18903'] +== +['K18900'] +['K18901+K18902+K18903'] +++++++++++++++++++ +M00700 +(K18906,K18907) K18104 +['(K18906,K18907)', 'K18104'] +== +['K18906', 'K18907'] +['K18104'] +++++++++++++++++++ +M00702 +(K18906,K18907) K08170 +['(K18906,K18907)', 'K08170'] +== +['K18906', 'K18907'] +['K08170'] +++++++++++++++++++ +M00714 +K18938 K08167 +['K18938', 'K08167'] +== +['K18938'] +['K08167'] +++++++++++++++++++ +M00705 +K18909 K18908 +['K18909', 'K18908'] +== +['K18909'] +['K18908'] +++++++++++++++++++ +M00746 +K13632 K18513 K09476 +['K13632', 'K18513', 'K09476'] +== +['K13632'] +['K18513'] +['K09476'] +++++++++++++++++++ +M00660 +K03222+K03226+K03227+K03228+K03229+K03230+K03224+K03225+K03223+K18374+K18376 K18373 K18375 K18377 K18378 K18379 K18380 K18381 +['K03222+K03226+K03227+K03228+K03229+K03230+K03224+K03225+K03223+K18374+K18376', 'K18373', 'K18375', 'K18377', 'K18378', 'K18379', 'K18380', 'K18381'] +== +['K03222+K03226+K03227+K03228+K03229+K03230+K03224+K03225+K03223+K18374+K18376'] +['K18373'] +['K18375'] +['K18377'] +['K18378'] +['K18379'] +['K18380'] +['K18381'] +++++++++++++++++++ +M00664 +K14658 K14659 K14666 K14657 +['K14658', 'K14659', 'K14666', 'K14657'] +== +['K14658'] +['K14659'] +['K14666'] +['K14657'] +++++++++++++++++++ diff --git a/data/MicrobeAnnotator_KEGG/01.KEGG_DB/06.Module_Groups.txt b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/06.Module_Groups.txt new file mode 100644 index 0000000..9c7c0c4 --- /dev/null +++ b/data/MicrobeAnnotator_KEGG/01.KEGG_DB/06.Module_Groups.txt @@ -0,0 +1,394 @@ +M00015 Arginine and proline metabolism #8a3222 +M00028 Arginine and proline metabolism #8a3222 +M00029 Arginine and proline metabolism #8a3222 +M00047 Arginine and proline metabolism #8a3222 +M00763 Arginine and proline metabolism #8a3222 +M00844 Arginine and proline metabolism #8a3222 +M00845 Arginine and proline metabolism #8a3222 +M00879 Arginine and proline metabolism #8a3222 +M00022 Aromatic amino acid metabolism #8641b6 +M00023 Aromatic amino acid metabolism #8641b6 +M00024 Aromatic amino acid metabolism #8641b6 +M00025 Aromatic amino acid metabolism #8641b6 +M00037 Aromatic amino acid metabolism #8641b6 +M00038 Aromatic amino acid metabolism #8641b6 +M00040 Aromatic amino acid metabolism #8641b6 +M00042 Aromatic amino acid metabolism #8641b6 +M00043 Aromatic amino acid metabolism #8641b6 +M00044 Aromatic amino acid metabolism #8641b6 +M00533 Aromatic amino acid metabolism #8641b6 +M00545 Aromatic amino acid metabolism #8641b6 +M00418 Aromatics degradation #76d25b +M00419 Aromatics degradation #76d25b +M00534 Aromatics degradation #76d25b +M00537 Aromatics degradation #76d25b +M00538 Aromatics degradation #76d25b +M00539 Aromatics degradation #76d25b +M00540 Aromatics degradation #76d25b +M00541 Aromatics degradation #76d25b +M00543 Aromatics degradation 
#76d25b +M00544 Aromatics degradation #76d25b +M00547 Aromatics degradation #76d25b +M00548 Aromatics degradation #76d25b +M00551 Aromatics degradation #76d25b +M00568 Aromatics degradation #76d25b +M00569 Aromatics degradation #76d25b +M00623 Aromatics degradation #76d25b +M00624 Aromatics degradation #76d25b +M00636 Aromatics degradation #76d25b +M00637 Aromatics degradation #76d25b +M00638 Aromatics degradation #76d25b +M00878 Aromatics degradation #76d25b +M00142 ATP synthesis #cdd346 +M00143 ATP synthesis #cdd346 +M00144 ATP synthesis #cdd346 +M00145 ATP synthesis #cdd346 +M00146 ATP synthesis #cdd346 +M00147 ATP synthesis #cdd346 +M00148 ATP synthesis #cdd346 +M00149 ATP synthesis #cdd346 +M00150 ATP synthesis #cdd346 +M00151 ATP synthesis #cdd346 +M00152 ATP synthesis #cdd346 +M00153 ATP synthesis #cdd346 +M00154 ATP synthesis #cdd346 +M00155 ATP synthesis #cdd346 +M00156 ATP synthesis #cdd346 +M00157 ATP synthesis #cdd346 +M00158 ATP synthesis #cdd346 +M00159 ATP synthesis #cdd346 +M00160 ATP synthesis #cdd346 +M00162 ATP synthesis #cdd346 +M00416 ATP synthesis #cdd346 +M00417 ATP synthesis #cdd346 +M00672 Beta-Lactam biosynthesis #3b2882 +M00673 Beta-Lactam biosynthesis #3b2882 +M00674 Beta-Lactam biosynthesis #3b2882 +M00675 Beta-Lactam biosynthesis #3b2882 +M00736 Beta-Lactam biosynthesis #3b2882 +M00039 Biosynthesis of other secondary metabolites #cbde82 +M00137 Biosynthesis of other secondary metabolites #cbde82 +M00138 Biosynthesis of other secondary metabolites #cbde82 +M00370 Biosynthesis of other secondary metabolites #cbde82 +M00661 Biosynthesis of other secondary metabolites #cbde82 +M00785 Biosynthesis of other secondary metabolites #cbde82 +M00786 Biosynthesis of other secondary metabolites #cbde82 +M00787 Biosynthesis of other secondary metabolites #cbde82 +M00788 Biosynthesis of other secondary metabolites #cbde82 +M00789 Biosynthesis of other secondary metabolites #cbde82 +M00790 Biosynthesis of other secondary metabolites #cbde82 +M00805 Biosynthesis of other secondary metabolites #cbde82 +M00808 Biosynthesis of other secondary metabolites #cbde82 +M00814 Biosynthesis of other secondary metabolites #cbde82 +M00815 Biosynthesis of other secondary metabolites #cbde82 +M00819 Biosynthesis of other secondary metabolites #cbde82 +M00835 Biosynthesis of other secondary metabolites #cbde82 +M00837 Biosynthesis of other secondary metabolites #cbde82 +M00838 Biosynthesis of other secondary metabolites #cbde82 +M00848 Biosynthesis of other secondary metabolites #cbde82 +M00875 Biosynthesis of other secondary metabolites #cbde82 +M00876 Biosynthesis of other secondary metabolites #cbde82 +M00877 Biosynthesis of other secondary metabolites #cbde82 +M00019 Branched-chain amino acid metabolism #656cdb +M00036 Branched-chain amino acid metabolism #656cdb +M00432 Branched-chain amino acid metabolism #656cdb +M00535 Branched-chain amino acid metabolism #656cdb +M00570 Branched-chain amino acid metabolism #656cdb +M00165 Carbon fixation #408937 +M00166 Carbon fixation #408937 +M00167 Carbon fixation #408937 +M00168 Carbon fixation #408937 +M00169 Carbon fixation #408937 +M00170 Carbon fixation #408937 +M00171 Carbon fixation #408937 +M00172 Carbon fixation #408937 +M00173 Carbon fixation #408937 +M00374 Carbon fixation #408937 +M00375 Carbon fixation #408937 +M00376 Carbon fixation #408937 +M00377 Carbon fixation #408937 +M00579 Carbon fixation #408937 +M00620 Carbon fixation #408937 +M00001 Central carbohydrate metabolism #c644a5 +M00002 Central carbohydrate metabolism #c644a5 
+M00003 Central carbohydrate metabolism #c644a5 +M00004 Central carbohydrate metabolism #c644a5 +M00005 Central carbohydrate metabolism #c644a5 +M00006 Central carbohydrate metabolism #c644a5 +M00007 Central carbohydrate metabolism #c644a5 +M00008 Central carbohydrate metabolism #c644a5 +M00009 Central carbohydrate metabolism #c644a5 +M00010 Central carbohydrate metabolism #c644a5 +M00011 Central carbohydrate metabolism #c644a5 +M00307 Central carbohydrate metabolism #c644a5 +M00308 Central carbohydrate metabolism #c644a5 +M00309 Central carbohydrate metabolism #c644a5 +M00580 Central carbohydrate metabolism #c644a5 +M00633 Central carbohydrate metabolism #c644a5 +M00112 Cofactor and vitamin metabolism #5fda98 +M00115 Cofactor and vitamin metabolism #5fda98 +M00116 Cofactor and vitamin metabolism #5fda98 +M00117 Cofactor and vitamin metabolism #5fda98 +M00119 Cofactor and vitamin metabolism #5fda98 +M00120 Cofactor and vitamin metabolism #5fda98 +M00121 Cofactor and vitamin metabolism #5fda98 +M00122 Cofactor and vitamin metabolism #5fda98 +M00123 Cofactor and vitamin metabolism #5fda98 +M00124 Cofactor and vitamin metabolism #5fda98 +M00125 Cofactor and vitamin metabolism #5fda98 +M00126 Cofactor and vitamin metabolism #5fda98 +M00127 Cofactor and vitamin metabolism #5fda98 +M00128 Cofactor and vitamin metabolism #5fda98 +M00140 Cofactor and vitamin metabolism #5fda98 +M00141 Cofactor and vitamin metabolism #5fda98 +M00572 Cofactor and vitamin metabolism #5fda98 +M00573 Cofactor and vitamin metabolism #5fda98 +M00577 Cofactor and vitamin metabolism #5fda98 +M00622 Cofactor and vitamin metabolism #5fda98 +M00810 Cofactor and vitamin metabolism #5fda98 +M00811 Cofactor and vitamin metabolism #5fda98 +M00836 Cofactor and vitamin metabolism #5fda98 +M00840 Cofactor and vitamin metabolism #5fda98 +M00841 Cofactor and vitamin metabolism #5fda98 +M00842 Cofactor and vitamin metabolism #5fda98 +M00843 Cofactor and vitamin metabolism #5fda98 +M00846 Cofactor and vitamin metabolism #5fda98 +M00847 Cofactor and vitamin metabolism #5fda98 +M00868 Cofactor and vitamin metabolism #5fda98 +M00880 Cofactor and vitamin metabolism #5fda98 +M00017 Cysteine and methionine metabolism #782975 +M00021 Cysteine and methionine metabolism #782975 +M00034 Cysteine and methionine metabolism #782975 +M00035 Cysteine and methionine metabolism #782975 +M00338 Cysteine and methionine metabolism #782975 +M00368 Cysteine and methionine metabolism #782975 +M00609 Cysteine and methionine metabolism #782975 +M00625 Drug resistance #869534 +M00627 Drug resistance #869534 +M00639 Drug resistance #869534 +M00641 Drug resistance #869534 +M00642 Drug resistance #869534 +M00643 Drug resistance #869534 +M00649 Drug resistance #869534 +M00651 Drug resistance #869534 +M00652 Drug resistance #869534 +M00696 Drug resistance #869534 +M00697 Drug resistance #869534 +M00698 Drug resistance #869534 +M00700 Drug resistance #869534 +M00702 Drug resistance #869534 +M00704 Drug resistance #869534 +M00705 Drug resistance #869534 +M00714 Drug resistance #869534 +M00718 Drug resistance #869534 +M00725 Drug resistance #869534 +M00726 Drug resistance #869534 +M00730 Drug resistance #869534 +M00744 Drug resistance #869534 +M00745 Drug resistance #869534 +M00746 Drug resistance #869534 +M00769 Drug resistance #869534 +M00851 Drug resistance #869534 +M00824 Enediyne biosynthesis #d27bde +M00825 Enediyne biosynthesis #d27bde +M00826 Enediyne biosynthesis #d27bde +M00827 Enediyne biosynthesis #d27bde +M00828 Enediyne biosynthesis #d27bde +M00829 
Enediyne biosynthesis #d27bde +M00830 Enediyne biosynthesis #d27bde +M00831 Enediyne biosynthesis #d27bde +M00832 Enediyne biosynthesis #d27bde +M00833 Enediyne biosynthesis #d27bde +M00834 Enediyne biosynthesis #d27bde +M00082 Fatty acid metabolism #d9a344 +M00083 Fatty acid metabolism #d9a344 +M00085 Fatty acid metabolism #d9a344 +M00086 Fatty acid metabolism #d9a344 +M00087 Fatty acid metabolism #d9a344 +M00415 Fatty acid metabolism #d9a344 +M00861 Fatty acid metabolism #d9a344 +M00873 Fatty acid metabolism #d9a344 +M00874 Fatty acid metabolism #d9a344 +M00055 Glycan biosynthesis #588cd6 +M00056 Glycan biosynthesis #588cd6 +M00065 Glycan biosynthesis #588cd6 +M00068 Glycan biosynthesis #588cd6 +M00069 Glycan biosynthesis #588cd6 +M00070 Glycan biosynthesis #588cd6 +M00071 Glycan biosynthesis #588cd6 +M00072 Glycan biosynthesis #588cd6 +M00073 Glycan biosynthesis #588cd6 +M00074 Glycan biosynthesis #588cd6 +M00075 Glycan biosynthesis #588cd6 +M00872 Glycan biosynthesis #588cd6 +M00057 Glycosaminoglycan metabolism #d66432 +M00058 Glycosaminoglycan metabolism #d66432 +M00059 Glycosaminoglycan metabolism #d66432 +M00076 Glycosaminoglycan metabolism #d66432 +M00077 Glycosaminoglycan metabolism #d66432 +M00078 Glycosaminoglycan metabolism #d66432 +M00079 Glycosaminoglycan metabolism #d66432 +M00026 Histidine metabolism #66d7bf +M00045 Histidine metabolism #66d7bf +M00066 Lipid metabolism #d53e55 +M00067 Lipid metabolism #d53e55 +M00088 Lipid metabolism #d53e55 +M00089 Lipid metabolism #d53e55 +M00090 Lipid metabolism #d53e55 +M00091 Lipid metabolism #d53e55 +M00092 Lipid metabolism #d53e55 +M00093 Lipid metabolism #d53e55 +M00094 Lipid metabolism #d53e55 +M00098 Lipid metabolism #d53e55 +M00099 Lipid metabolism #d53e55 +M00100 Lipid metabolism #d53e55 +M00113 Lipid metabolism #d53e55 +M00060 Lipopolysaccharide metabolism #83d2de +M00063 Lipopolysaccharide metabolism #83d2de +M00064 Lipopolysaccharide metabolism #83d2de +M00866 Lipopolysaccharide metabolism #83d2de +M00867 Lipopolysaccharide metabolism #83d2de +M00016 Lysine metabolism #d84e8b +M00030 Lysine metabolism #d84e8b +M00031 Lysine metabolism #d84e8b +M00032 Lysine metabolism #d84e8b +M00433 Lysine metabolism #d84e8b +M00525 Lysine metabolism #d84e8b +M00526 Lysine metabolism #d84e8b +M00527 Lysine metabolism #d84e8b +M00773 Macrolide biosynthesis #2e4b26 +M00774 Macrolide biosynthesis #2e4b26 +M00775 Macrolide biosynthesis #2e4b26 +M00776 Macrolide biosynthesis #2e4b26 +M00777 Macrolide biosynthesis #2e4b26 +M00611 Metabolic capacity #9378c3 +M00612 Metabolic capacity #9378c3 +M00613 Metabolic capacity #9378c3 +M00614 Metabolic capacity #9378c3 +M00615 Metabolic capacity #9378c3 +M00616 Metabolic capacity #9378c3 +M00617 Metabolic capacity #9378c3 +M00618 Metabolic capacity #9378c3 +M00174 Methane metabolism #9e7336 +M00344 Methane metabolism #9e7336 +M00345 Methane metabolism #9e7336 +M00346 Methane metabolism #9e7336 +M00356 Methane metabolism #9e7336 +M00357 Methane metabolism #9e7336 +M00358 Methane metabolism #9e7336 +M00378 Methane metabolism #9e7336 +M00422 Methane metabolism #9e7336 +M00563 Methane metabolism #9e7336 +M00567 Methane metabolism #9e7336 +M00608 Methane metabolism #9e7336 +M00175 Nitrogen metabolism #2c2351 +M00528 Nitrogen metabolism #2c2351 +M00529 Nitrogen metabolism #2c2351 +M00530 Nitrogen metabolism #2c2351 +M00531 Nitrogen metabolism #2c2351 +M00804 Nitrogen metabolism #2c2351 +M00027 Other amino acid metabolism #c5d7a9 +M00118 Other amino acid metabolism #c5d7a9 +M00369 Other amino acid metabolism 
#c5d7a9 +M00012 Other carbohydrate metabolism #872b4e +M00013 Other carbohydrate metabolism #872b4e +M00014 Other carbohydrate metabolism #872b4e +M00061 Other carbohydrate metabolism #872b4e +M00081 Other carbohydrate metabolism #872b4e +M00114 Other carbohydrate metabolism #872b4e +M00129 Other carbohydrate metabolism #872b4e +M00130 Other carbohydrate metabolism #872b4e +M00131 Other carbohydrate metabolism #872b4e +M00132 Other carbohydrate metabolism #872b4e +M00373 Other carbohydrate metabolism #872b4e +M00532 Other carbohydrate metabolism #872b4e +M00549 Other carbohydrate metabolism #872b4e +M00550 Other carbohydrate metabolism #872b4e +M00552 Other carbohydrate metabolism #872b4e +M00554 Other carbohydrate metabolism #872b4e +M00565 Other carbohydrate metabolism #872b4e +M00630 Other carbohydrate metabolism #872b4e +M00631 Other carbohydrate metabolism #872b4e +M00632 Other carbohydrate metabolism #872b4e +M00740 Other carbohydrate metabolism #872b4e +M00741 Other carbohydrate metabolism #872b4e +M00761 Other carbohydrate metabolism #872b4e +M00854 Other carbohydrate metabolism #872b4e +M00855 Other carbohydrate metabolism #872b4e +M00097 Other terpenoid biosynthesis #6e9368 +M00371 Other terpenoid biosynthesis #6e9368 +M00372 Other terpenoid biosynthesis #6e9368 +M00363 Pathogenicity #66406d +M00542 Pathogenicity #66406d +M00564 Pathogenicity #66406d +M00574 Pathogenicity #66406d +M00575 Pathogenicity #66406d +M00576 Pathogenicity #66406d +M00850 Pathogenicity #66406d +M00852 Pathogenicity #66406d +M00853 Pathogenicity #66406d +M00856 Pathogenicity #66406d +M00857 Pathogenicity #66406d +M00859 Pathogenicity #66406d +M00860 Pathogenicity #66406d +M00161 Photosynthesis #cfa68a +M00163 Photosynthesis #cfa68a +M00597 Photosynthesis #cfa68a +M00598 Photosynthesis #cfa68a +M00660 Plant pathogenicity #461d27 +M00133 Polyamine biosynthesis #a5b3da +M00134 Polyamine biosynthesis #a5b3da +M00135 Polyamine biosynthesis #a5b3da +M00136 Polyamine biosynthesis #a5b3da +M00793 Polyketide sugar unit biosynthesis #5c4f24 +M00794 Polyketide sugar unit biosynthesis #5c4f24 +M00795 Polyketide sugar unit biosynthesis #5c4f24 +M00796 Polyketide sugar unit biosynthesis #5c4f24 +M00797 Polyketide sugar unit biosynthesis #5c4f24 +M00798 Polyketide sugar unit biosynthesis #5c4f24 +M00799 Polyketide sugar unit biosynthesis #5c4f24 +M00800 Polyketide sugar unit biosynthesis #5c4f24 +M00801 Polyketide sugar unit biosynthesis #5c4f24 +M00802 Polyketide sugar unit biosynthesis #5c4f24 +M00803 Polyketide sugar unit biosynthesis #5c4f24 +M00048 Purine metabolism #e0a7d2 +M00049 Purine metabolism #e0a7d2 +M00050 Purine metabolism #e0a7d2 +M00546 Purine metabolism #e0a7d2 +M00046 Pyrimidine metabolism #25585e +M00051 Pyrimidine metabolism #25585e +M00052 Pyrimidine metabolism #25585e +M00053 Pyrimidine metabolism #25585e +M00018 Serine and threonine metabolism #de7d78 +M00020 Serine and threonine metabolism #de7d78 +M00033 Serine and threonine metabolism #de7d78 +M00555 Serine and threonine metabolism #de7d78 +M00101 Sterol biosynthesis #4e96a2 +M00102 Sterol biosynthesis #4e96a2 +M00103 Sterol biosynthesis #4e96a2 +M00104 Sterol biosynthesis #4e96a2 +M00106 Sterol biosynthesis #4e96a2 +M00107 Sterol biosynthesis #4e96a2 +M00108 Sterol biosynthesis #4e96a2 +M00109 Sterol biosynthesis #4e96a2 +M00110 Sterol biosynthesis #4e96a2 +M00862 Sterol biosynthesis #4e96a2 +M00176 Sulfur metabolism #4e96a2 +M00595 Sulfur metabolism #4e96a2 +M00596 Sulfur metabolism #4e96a2 +M00664 Symbiosis #88574e +M00095 Terpenoid backbone 
biosynthesis #4e6089 +M00096 Terpenoid backbone biosynthesis #4e6089 +M00364 Terpenoid backbone biosynthesis #4e6089 +M00365 Terpenoid backbone biosynthesis #4e6089 +M00366 Terpenoid backbone biosynthesis #4e6089 +M00367 Terpenoid backbone biosynthesis #4e6089 +M00849 Terpenoid backbone biosynthesis #4e6089 +M00778 Type II polyketide biosynthesis #af7194 +M00779 Type II polyketide biosynthesis #af7194 +M00780 Type II polyketide biosynthesis #af7194 +M00781 Type II polyketide biosynthesis #af7194 +M00782 Type II polyketide biosynthesis #af7194 +M00783 Type II polyketide biosynthesis #af7194 +M00784 Type II polyketide biosynthesis #af7194 +M00823 Type II polyketide biosynthesis #af7194 diff --git a/data/MicrobeAnnotator_KEGG/KEGG_Bifurcating_Module_Information.pkl b/data/MicrobeAnnotator_KEGG/KEGG_Bifurcating_Module_Information.pkl new file mode 100644 index 0000000..7535b86 Binary files /dev/null and b/data/MicrobeAnnotator_KEGG/KEGG_Bifurcating_Module_Information.pkl differ diff --git a/data/MicrobeAnnotator_KEGG/KEGG_Module-KOs.pkl b/data/MicrobeAnnotator_KEGG/KEGG_Module-KOs.pkl new file mode 100644 index 0000000..cba82d5 Binary files /dev/null and b/data/MicrobeAnnotator_KEGG/KEGG_Module-KOs.pkl differ diff --git a/data/MicrobeAnnotator_KEGG/KEGG_Module_Information.txt b/data/MicrobeAnnotator_KEGG/KEGG_Module_Information.txt new file mode 100644 index 0000000..db9ec87 --- /dev/null +++ b/data/MicrobeAnnotator_KEGG/KEGG_Module_Information.txt @@ -0,0 +1,394 @@ +M00015 Proline biosynthesis, glutamate => proline Arginine and proline metabolism #8a3222 +M00028 Ornithine biosynthesis, glutamate => ornithine Arginine and proline metabolism #8a3222 +M00029 Urea cycle Arginine and proline metabolism #8a3222 +M00047 Creatine pathway Arginine and proline metabolism #8a3222 +M00763 Ornithine biosynthesis, mediated by LysW, glutamate => ornithine Arginine and proline metabolism #8a3222 +M00844 Arginine biosynthesis, ornithine => arginine Arginine and proline metabolism #8a3222 +M00845 Arginine biosynthesis, glutamate => acetylcitrulline => arginine Arginine and proline metabolism #8a3222 +M00879 Arginine succinyltransferase pathway, arginine => glutamate Arginine and proline metabolism #8a3222 +M00022 Shikimate pathway, phosphoenolpyruvate + erythrose-4P => chorismate Aromatic amino acid metabolism #8641b6 +M00023 Tryptophan biosynthesis, chorismate => tryptophan Aromatic amino acid metabolism #8641b6 +M00024 Phenylalanine biosynthesis, chorismate => phenylalanine Aromatic amino acid metabolism #8641b6 +M00025 Tyrosine biosynthesis, chorismate => tyrosine Aromatic amino acid metabolism #8641b6 +M00037 Melatonin biosynthesis, tryptophan => serotonin => melatonin Aromatic amino acid metabolism #8641b6 +M00038 Tryptophan metabolism, tryptophan => kynurenine => 2-aminomuconate Aromatic amino acid metabolism #8641b6 +M00040 Tyrosine biosynthesis, prephenate => pretyrosine => tyrosine Aromatic amino acid metabolism #8641b6 +M00042 Catecholamine biosynthesis, tyrosine => dopamine => noradrenaline => adrenaline Aromatic amino acid metabolism #8641b6 +M00043 Thyroid hormone biosynthesis, tyrosine => triiodothyronine--thyroxine Aromatic amino acid metabolism #8641b6 +M00044 Tyrosine degradation, tyrosine => homogentisate Aromatic amino acid metabolism #8641b6 +M00533 Homoprotocatechuate degradation, homoprotocatechuate => 2-oxohept-3-enedioate Aromatic amino acid metabolism #8641b6 +M00545 Trans-cinnamate degradation, trans-cinnamate => acetyl-CoA Aromatic amino acid metabolism #8641b6 +M00418 Toluene
degradation, anaerobic, toluene => benzoyl-CoA Aromatics degradation #76d25b +M00419 Cymene degradation, p-cymene => p-cumate Aromatics degradation #76d25b +M00534 Naphthalene degradation, naphthalene => salicylate Aromatics degradation #76d25b +M00537 Xylene degradation, xylene => methylbenzoate Aromatics degradation #76d25b +M00538 Toluene degradation, toluene => benzoate Aromatics degradation #76d25b +M00539 Cumate degradation, p-cumate => 2-oxopent-4-enoate + 2-methylpropanoate Aromatics degradation #76d25b +M00540 Benzoate degradation, cyclohexanecarboxylic acid => pimeloyl-CoA Aromatics degradation #76d25b +M00541 Benzoyl-CoA degradation, benzoyl-CoA => 3-hydroxypimeloyl-CoA Aromatics degradation #76d25b +M00543 Biphenyl degradation, biphenyl => 2-oxopent-4-enoate + benzoate Aromatics degradation #76d25b +M00544 Carbazole degradation, carbazole => 2-oxopent-4-enoate + anthranilate Aromatics degradation #76d25b +M00547 Benzene--toluene degradation, benzene => catechol -- toluene => 3-methylcatechol Aromatics degradation #76d25b +M00548 Benzene degradation, benzene => catechol Aromatics degradation #76d25b +M00551 Benzoate degradation, benzoate => catechol -- methylbenzoate => methylcatechol Aromatics degradation #76d25b +M00568 Catechol ortho-cleavage, catechol => 3-oxoadipate Aromatics degradation #76d25b +M00569 Catechol meta-cleavage, catechol => acetyl-CoA -- 4-methylcatechol => propanoyl-CoA Aromatics degradation #76d25b +M00623 Phthalate degradation 1, phthalate => protocatechuate Aromatics degradation #76d25b +M00624 Terephthalate degradation, terephthalate => 3,4-dihydroxybenzoate Aromatics degradation #76d25b +M00636 Phthalate degradation 2, phthalate => protocatechuate Aromatics degradation #76d25b +M00637 Anthranilate degradation, anthranilate => catechol Aromatics degradation #76d25b +M00638 Salicylate degradation, salicylate => gentisate Aromatics degradation #76d25b +M00878 Phenylacetate degradation, phenylacetate => acetyl-CoA--succinyl-CoA Aromatics degradation #76d25b +M00142 NADH:ubiquinone oxidoreductase, mitochondria ATP synthesis #cdd346 +M00143 NADH dehydrogenase (ubiquinone) Fe-S protein--flavoprotein complex, mitochondria ATP synthesis #cdd346 +M00144 NADH:quinone oxidoreductase, prokaryotes ATP synthesis #cdd346 +M00145 NAD(P)H:quinone oxidoreductase, chloroplasts and cyanobacteria ATP synthesis #cdd346 +M00146 NADH dehydrogenase (ubiquinone) 1 alpha subcomplex ATP synthesis #cdd346 +M00147 NADH dehydrogenase (ubiquinone) 1 beta subcomplex ATP synthesis #cdd346 +M00148 Succinate dehydrogenase (ubiquinone) ATP synthesis #cdd346 +M00149 Succinate dehydrogenase, prokaryotes ATP synthesis #cdd346 +M00150 Fumarate reductase, prokaryotes ATP synthesis #cdd346 +M00151 Cytochrome bc1 complex respiratory unit ATP synthesis #cdd346 +M00152 Cytochrome bc1 complex ATP synthesis #cdd346 +M00153 Cytochrome bd ubiquinol oxidase ATP synthesis #cdd346 +M00154 Cytochrome c oxidase ATP synthesis #cdd346 +M00155 Cytochrome c oxidase, prokaryotes ATP synthesis #cdd346 +M00156 Cytochrome c oxidase, cbb3-type ATP synthesis #cdd346 +M00157 F-type ATPase, prokaryotes and chloroplasts ATP synthesis #cdd346 +M00158 F-type ATPase, eukaryotes ATP synthesis #cdd346 +M00159 V-type ATPase, prokaryotes ATP synthesis #cdd346 +M00160 V-type ATPase, eukaryotes ATP synthesis #cdd346 +M00162 Cytochrome b6f complex ATP synthesis #cdd346 +M00416 Cytochrome aa3-600 menaquinol oxidase ATP synthesis #cdd346 +M00417 Cytochrome o ubiquinol oxidase ATP synthesis #cdd346 +M00672 Penicillin biosynthesis,
aminoadipate + cysteine + valine => penicillin Beta-Lactam biosynthesis #3b2882 +M00673 Cephamycin C biosynthesis, aminoadipate + cysteine + valine => cephamycin C Beta-Lactam biosynthesis #3b2882 +M00674 Clavaminate biosynthesis, arginine + glyceraldehyde-3P => clavaminate Beta-Lactam biosynthesis #3b2882 +M00675 Carbapenem-3-carboxylate biosynthesis, pyrroline-5-carboxylate + malonyl-CoA => carbapenem-3-carboxylate Beta-Lactam biosynthesis #3b2882 +M00736 Nocardicin A biosynthesis, L-pHPG + arginine + serine => nocardicin A Beta-Lactam biosynthesis #3b2882 +M00039 Monolignol biosynthesis, phenylalanine--tyrosine => monolignol Biosynthesis of other secondary metabolites #cbde82 +M00137 Flavanone biosynthesis, phenylalanine => naringenin Biosynthesis of other secondary metabolites #cbde82 +M00138 Flavonoid biosynthesis, naringenin => pelargonidin Biosynthesis of other secondary metabolites #cbde82 +M00370 Glucosinolate biosynthesis, tryptophan => glucobrassicin Biosynthesis of other secondary metabolites #cbde82 +M00661 Paspaline biosynthesis, geranylgeranyl-PP + indoleglycerol phosphate => paspaline Biosynthesis of other secondary metabolites #cbde82 +M00785 Cycloserine biosynthesis, arginine--serine => cycloserine Biosynthesis of other secondary metabolites #cbde82 +M00786 Fumitremorgin alkaloid biosynthesis, tryptophan + proline => fumitremorgin C--A Biosynthesis of other secondary metabolites #cbde82 +M00787 Bacilysin biosynthesis, prephenate => bacilysin Biosynthesis of other secondary metabolites #cbde82 +M00788 Terpentecin biosynthesis, GGAP => terpentecin Biosynthesis of other secondary metabolites #cbde82 +M00789 Rebeccamycin biosynthesis, tryptophan => rebeccamycin Biosynthesis of other secondary metabolites #cbde82 +M00790 Pyrrolnitrin biosynthesis, tryptophan => pyrrolnitrin Biosynthesis of other secondary metabolites #cbde82 +M00805 Staurosporine biosynthesis, tryptophan => staurosporine Biosynthesis of other secondary metabolites #cbde82 +M00808 Violacein biosynthesis, tryptophan => violacein Biosynthesis of other secondary metabolites #cbde82 +M00814 Acarbose biosynthesis, sedoheptulopyranose-7P => acarbose Biosynthesis of other secondary metabolites #cbde82 +M00815 Validamycin A biosynthesis, sedoheptulopyranose-7P => validamycin A Biosynthesis of other secondary metabolites #cbde82 +M00819 Pentalenolactone biosynthesis, farnesyl-PP => pentalenolactone Biosynthesis of other secondary metabolites #cbde82 +M00835 Pyocyanine biosynthesis, chorismate => pyocyanine Biosynthesis of other secondary metabolites #cbde82 +M00837 Prodigiosin biosynthesis, L-proline => prodigiosin Biosynthesis of other secondary metabolites #cbde82 +M00838 Undecylprodigiosin biosynthesis, L-proline => undecylprodigiosin Biosynthesis of other secondary metabolites #cbde82 +M00848 Aurachin biosynthesis, anthranilate => aurachin A Biosynthesis of other secondary metabolites #cbde82 +M00875 Staphyloferrin B biosynthesis, L-serine => staphyloferrin B Biosynthesis of other secondary metabolites #cbde82 +M00876 Staphyloferrin A biosynthesis, L-ornithine => staphyloferrin A Biosynthesis of other secondary metabolites #cbde82 +M00877 Kanosamine biosynthesis, glucose 6-phosphate => kanosamine Biosynthesis of other secondary metabolites #cbde82 +M00019 Valine--isoleucine biosynthesis, pyruvate => valine -- 2-oxobutanoate => isoleucine Branched-chain amino acid metabolism #656cdb +M00036 Leucine degradation, leucine => acetoacetate + acetyl-CoA Branched-chain amino acid metabolism #656cdb +M00432 Leucine
biosynthesis, 2-oxoisovalerate => 2-oxoisocaproate Branched-chain amino acid metabolism #656cdb +M00535 Isoleucine biosynthesis, pyruvate => 2-oxobutanoate Branched-chain amino acid metabolism #656cdb +M00570 Isoleucine biosynthesis, threonine => 2-oxobutanoate => isoleucine Branched-chain amino acid metabolism #656cdb +M00165 Reductive pentose phosphate cycle (Calvin cycle) Carbon fixation #408937 +M00166 Reductive pentose phosphate cycle, ribulose-5P => glyceraldehyde-3P Carbon fixation #408937 +M00167 Reductive pentose phosphate cycle, glyceraldehyde-3P => ribulose-5P Carbon fixation #408937 +M00168 CAM (Crassulacean acid metabolism), dark Carbon fixation #408937 +M00169 CAM (Crassulacean acid metabolism), light Carbon fixation #408937 +M00170 C4-dicarboxylic acid cycle, phosphoenolpyruvate carboxykinase type Carbon fixation #408937 +M00171 C4-dicarboxylic acid cycle, NAD - malic enzyme type Carbon fixation #408937 +M00172 C4-dicarboxylic acid cycle, NADP - malic enzyme type Carbon fixation #408937 +M00173 Reductive citrate cycle (Arnon-Buchanan cycle) Carbon fixation #408937 +M00374 Dicarboxylate-hydroxybutyrate cycle Carbon fixation #408937 +M00375 Hydroxypropionate-hydroxybutylate cycle Carbon fixation #408937 +M00376 3-Hydroxypropionate bi-cycle Carbon fixation #408937 +M00377 Reductive acetyl-CoA pathway (Wood-Ljungdahl pathway) Carbon fixation #408937 +M00579 Phosphate acetyltransferase-acetate kinase pathway, acetyl-CoA => acetate Carbon fixation #408937 +M00620 Incomplete reductive citrate cycle, acetyl-CoA => oxoglutarate Carbon fixation #408937 +M00001 Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate Central carbohydrate metabolism #c644a5 +M00002 Glycolysis, core module involving three-carbon compounds Central carbohydrate metabolism #c644a5 +M00003 Gluconeogenesis, oxaloacetate => fructose-6P Central carbohydrate metabolism #c644a5 +M00004 Pentose phosphate pathway (Pentose phosphate cycle) Central carbohydrate metabolism #c644a5 +M00005 PRPP biosynthesis, ribose 5P => PRPP Central carbohydrate metabolism #c644a5 +M00006 Pentose phosphate pathway, oxidative phase, glucose 6P => ribulose 5P Central carbohydrate metabolism #c644a5 +M00007 Pentose phosphate pathway, non-oxidative phase, fructose 6P => ribose 5P Central carbohydrate metabolism #c644a5 +M00008 Entner-Doudoroff pathway, glucose-6P => glyceraldehyde-3P + pyruvate Central carbohydrate metabolism #c644a5 +M00009 Citrate cycle (TCA cycle, Krebs cycle) Central carbohydrate metabolism #c644a5 +M00010 Citrate cycle, first carbon oxidation, oxaloacetate => 2-oxoglutarate Central carbohydrate metabolism #c644a5 +M00011 Citrate cycle, second carbon oxidation, 2-oxoglutarate => oxaloacetate Central carbohydrate metabolism #c644a5 +M00307 Pyruvate oxidation, pyruvate => acetyl-CoA Central carbohydrate metabolism #c644a5 +M00308 Semi-phosphorylative Entner-Doudoroff pathway, gluconate => glycerate-3P Central carbohydrate metabolism #c644a5 +M00309 Non-phosphorylative Entner-Doudoroff pathway, gluconate--galactonate => glycerate Central carbohydrate metabolism #c644a5 +M00580 Pentose phosphate pathway, archaea, fructose 6P => ribose 5P Central carbohydrate metabolism #c644a5 +M00633 Semi-phosphorylative Entner-Doudoroff pathway, gluconate--galactonate => glycerate-3P Central carbohydrate metabolism #c644a5 +M00112 Tocopherol--tocotrienol biosynthesis Cofactor and vitamin metabolism #5fda98 +M00115 NAD biosynthesis, aspartate => NAD Cofactor and vitamin metabolism #5fda98 +M00116 Menaquinone biosynthesis, chorismate =>
menaquinol Cofactor and vitamin metabolism #5fda98 +M00117 Ubiquinone biosynthesis, prokaryotes, chorismate => ubiquinone Cofactor and vitamin metabolism #5fda98 +M00119 Pantothenate biosynthesis, valine--L-aspartate => pantothenate Cofactor and vitamin metabolism #5fda98 +M00120 Coenzyme A biosynthesis, pantothenate => CoA Cofactor and vitamin metabolism #5fda98 +M00121 Heme biosynthesis, plants and bacteria, glutamate => heme Cofactor and vitamin metabolism #5fda98 +M00122 Cobalamin biosynthesis, cobinamide => cobalamin Cofactor and vitamin metabolism #5fda98 +M00123 Biotin biosynthesis, pimeloyl-ACP--CoA => biotin Cofactor and vitamin metabolism #5fda98 +M00124 Pyridoxal biosynthesis, erythrose-4P => pyridoxal-5P Cofactor and vitamin metabolism #5fda98 +M00125 Riboflavin biosynthesis, GTP => riboflavin--FMN--FAD Cofactor and vitamin metabolism #5fda98 +M00126 Tetrahydrofolate biosynthesis, GTP => THF Cofactor and vitamin metabolism #5fda98 +M00127 Thiamine biosynthesis, AIR => thiamine-P--thiamine-2P Cofactor and vitamin metabolism #5fda98 +M00128 Ubiquinone biosynthesis, eukaryotes, 4-hydroxybenzoate => ubiquinone Cofactor and vitamin metabolism #5fda98 +M00140 C1-unit interconversion, prokaryotes Cofactor and vitamin metabolism #5fda98 +M00141 C1-unit interconversion, eukaryotes Cofactor and vitamin metabolism #5fda98 +M00572 Pimeloyl-ACP biosynthesis, BioC-BioH pathway, malonyl-ACP => pimeloyl-ACP Cofactor and vitamin metabolism #5fda98 +M00573 Biotin biosynthesis, BioI pathway, long-chain-acyl-ACP => pimeloyl-ACP => biotin Cofactor and vitamin metabolism #5fda98 +M00577 Biotin biosynthesis, BioW pathway, pimelate => pimeloyl-CoA => biotin Cofactor and vitamin metabolism #5fda98 +M00622 Nicotinate degradation, nicotinate => fumarate Cofactor and vitamin metabolism #5fda98 +M00810 Nicotine degradation, pyridine pathway, nicotine => 2,6-dihydroxypyridine--succinate semialdehyde Cofactor and vitamin metabolism #5fda98 +M00811 Nicotine degradation, pyrrolidine pathway, nicotine => succinate semialdehyde Cofactor and vitamin metabolism #5fda98 +M00836 Coenzyme F430 biosynthesis, sirohydrochlorin => coenzyme F430 Cofactor and vitamin metabolism #5fda98 +M00840 Tetrahydrofolate biosynthesis, mediated by ribA and trpF, GTP => THF Cofactor and vitamin metabolism #5fda98 +M00841 Tetrahydrofolate biosynthesis, mediated by PTPS, GTP => THF Cofactor and vitamin metabolism #5fda98 +M00842 Tetrahydrobiopterin biosynthesis, GTP => BH4 Cofactor and vitamin metabolism #5fda98 +M00843 L-threo-Tetrahydrobiopterin biosynthesis, GTP => L-threo-BH4 Cofactor and vitamin metabolism #5fda98 +M00846 Siroheme biosynthesis, glutamate => siroheme Cofactor and vitamin metabolism #5fda98 +M00847 Heme biosynthesis, archaea, siroheme => heme Cofactor and vitamin metabolism #5fda98 +M00868 Heme biosynthesis, animals and fungi, glycine => heme Cofactor and vitamin metabolism #5fda98 +M00880 Molybdenum cofactor biosynthesis, GTP => molybdenum cofactor Cofactor and vitamin metabolism #5fda98 +M00017 Methionine biosynthesis, apartate => homoserine => methionine Cysteine and methionine metabolism #782975 +M00021 Cysteine biosynthesis, serine => cysteine Cysteine and methionine metabolism #782975 +M00034 Methionine salvage pathway Cysteine and methionine metabolism #782975 +M00035 Methionine degradation Cysteine and methionine metabolism #782975 +M00338 Cysteine biosynthesis, homocysteine + serine => cysteine Cysteine and methionine metabolism #782975 +M00368 Ethylene biosynthesis, methionine => ethylene Cysteine and 
methionine metabolism #782975 +M00609 Cysteine biosynthesis, methionine => cysteine Cysteine and methionine metabolism #782975 +M00625 Methicillin resistance Drug resistance #869534 +M00627 beta-Lactam resistance, Bla system Drug resistance #869534 +M00639 Multidrug resistance, efflux pump MexCD-OprJ Drug resistance #869534 +M00641 Multidrug resistance, efflux pump MexEF-OprN Drug resistance #869534 +M00642 Multidrug resistance, efflux pump MexJK-OprM Drug resistance #869534 +M00643 Multidrug resistance, efflux pump MexXY-OprM Drug resistance #869534 +M00649 Multidrug resistance, efflux pump AdeABC Drug resistance #869534 +M00651 Vancomycin resistance, D-Ala-D-Lac type Drug resistance #869534 +M00652 Vancomycin resistance, D-Ala-D-Ser type Drug resistance #869534 +M00696 Multidrug resistance, efflux pump AcrEF-TolC Drug resistance #869534 +M00697 Multidrug resistance, efflux pump MdtEF-TolC Drug resistance #869534 +M00698 Multidrug resistance, efflux pump BpeEF-OprC Drug resistance #869534 +M00700 Multidrug resistance, efflux pump AbcA Drug resistance #869534 +M00702 Multidrug resistance, efflux pump NorB Drug resistance #869534 +M00704 Tetracycline resistance, efflux pump Tet38 Drug resistance #869534 +M00705 Multidrug resistance, efflux pump MepA Drug resistance #869534 +M00714 Multidrug resistance, efflux pump QacA Drug resistance #869534 +M00718 Multidrug resistance, efflux pump MexAB-OprM Drug resistance #869534 +M00725 Cationic antimicrobial peptide (CAMP) resistance, dltABCD operon Drug resistance #869534 +M00726 Cationic antimicrobial peptide (CAMP) resistance, lysyl-phosphatidylglycerol (L-PG) synthase MprF Drug resistance #869534 +M00730 Cationic antimicrobial peptide (CAMP) resistance, VraFG transporter Drug resistance #869534 +M00744 Cationic antimicrobial peptide (CAMP) resistance, protease PgtE Drug resistance #869534 +M00745 Imipenem resistance, repression of porin OprD Drug resistance #869534 +M00746 Multidrug resistance, repression of porin OmpF Drug resistance #869534 +M00769 Multidrug resistance, efflux pump MexPQ-OpmE Drug resistance #869534 +M00851 Carbapenem resistance Drug resistance #869534 +M00824 9-membered enediyne core biosynthesis, malonyl-CoA => 3-hydroxyhexadeca-4,6,8,10,12,14-hexaenoyl-ACP => 9-membered enediyne core Enediyne biosynthesis #d27bde +M00825 10-membered enediyne core biosynthesis, malonyl-CoA => 3-hydroxyhexadeca-4,6,8,10,12,14-hexaenoyl-ACP => 10-membered enediyne core Enediyne biosynthesis #d27bde +M00826 C-1027 benzoxazolinate moiety biosynthesis, chorismate => benzoxazolinyl-CoA Enediyne biosynthesis #d27bde +M00827 C-1027 beta-amino acid moiety biosynthesis, tyrosine => 3-chloro-4,5-dihydroxy-beta-phenylalanyl-PCP Enediyne biosynthesis #d27bde +M00828 Maduropeptin beta-hydroxy acid moiety biosynthesis, tyrosine => 3-(4-hydroxyphenyl)-3-oxopropanoyl-PCP Enediyne biosynthesis #d27bde +M00829 3,6-Dimethylsalicylyl-CoA biosynthesis, malonyl-CoA => 6-methylsalicylate => 3,6-dimethylsalicylyl-CoA Enediyne biosynthesis #d27bde +M00830 Neocarzinostatin naphthoate moiety biosynthesis, malonyl-CoA => 2-hydroxy-5-methyl-1-naphthoate => 2-hydroxy-7-methoxy-5-methyl-1-naphthoyl-CoA Enediyne biosynthesis #d27bde +M00831 Kedarcidin 2-hydroxynaphthoate moiety biosynthesis, malonyl-CoA => 3,6,8-trihydroxy-2-naphthoate => 3-hydroxy-7,8-dimethoxy-6-isopropoxy-2-naphthoyl-CoA Enediyne biosynthesis #d27bde +M00832 Kedarcidin 2-aza-3-chloro-beta-tyrosine moiety biosynthesis, azatyrosine => 2-aza-3-chloro-beta-tyrosyl-PCP Enediyne biosynthesis #d27bde +M00833 
Calicheamicin biosynthesis, calicheamicinone => calicheamicin Enediyne biosynthesis #d27bde +M00834 Calicheamicin orsellinate moiety biosynthesis, malonyl-CoA => orsellinate-ACP => 5-iodo-2,3-dimethoxyorsellinate-ACP Enediyne biosynthesis #d27bde +M00082 Fatty acid biosynthesis, initiation Fatty acid metabolism #d9a344 +M00083 Fatty acid biosynthesis, elongation Fatty acid metabolism #d9a344 +M00085 Fatty acid elongation in mitochondria Fatty acid metabolism #d9a344 +M00086 beta-Oxidation, acyl-CoA synthesis Fatty acid metabolism #d9a344 +M00087 beta-Oxidation Fatty acid metabolism #d9a344 +M00415 Fatty acid elongation in endoplasmic reticulum Fatty acid metabolism #d9a344 +M00861 beta-Oxidation, peroxisome, VLCFA Fatty acid metabolism #d9a344 +M00873 Fatty acid biosynthesis in mitochondria, animals Fatty acid metabolism #d9a344 +M00874 Fatty acid biosynthesis in mitochondria, fungi Fatty acid metabolism #d9a344 +M00055 N-glycan precursor biosynthesis Glycan biosynthesis #588cd6 +M00056 O-glycan biosynthesis, mucin type core Glycan biosynthesis #588cd6 +M00065 GPI-anchor biosynthesis, core oligosaccharide Glycan biosynthesis #588cd6 +M00068 Glycosphingolipid biosynthesis, globo-series, LacCer => Gb4Cer Glycan biosynthesis #588cd6 +M00069 Glycosphingolipid biosynthesis, ganglio series, LacCer => GT3 Glycan biosynthesis #588cd6 +M00070 Glycosphingolipid biosynthesis, lacto-series, LacCer => Lc4Cer Glycan biosynthesis #588cd6 +M00071 Glycosphingolipid biosynthesis, neolacto-series, LacCer => nLc4Cer Glycan biosynthesis #588cd6 +M00072 N-glycosylation by oligosaccharyltransferase Glycan biosynthesis #588cd6 +M00073 N-glycan precursor trimming Glycan biosynthesis #588cd6 +M00074 N-glycan biosynthesis, high-mannose type Glycan biosynthesis #588cd6 +M00075 N-glycan biosynthesis, complex type Glycan biosynthesis #588cd6 +M00872 O-glycan biosynthesis, mannose type (core M3) Glycan biosynthesis #588cd6 +M00057 Glycosaminoglycan biosynthesis, linkage tetrasaccharide Glycosaminoglycan metabolism #d66432 +M00058 Glycosaminoglycan biosynthesis, chondroitin sulfate backbone Glycosaminoglycan metabolism #d66432 +M00059 Glycosaminoglycan biosynthesis, heparan sulfate backbone Glycosaminoglycan metabolism #d66432 +M00076 Dermatan sulfate degradation Glycosaminoglycan metabolism #d66432 +M00077 Chondroitin sulfate degradation Glycosaminoglycan metabolism #d66432 +M00078 Heparan sulfate degradation Glycosaminoglycan metabolism #d66432 +M00079 Keratan sulfate degradation Glycosaminoglycan metabolism #d66432 +M00026 Histidine biosynthesis, PRPP => histidine Histidine metabolism #66d7bf +M00045 Histidine degradation, histidine => N-formiminoglutamate => glutamate Histidine metabolism #66d7bf +M00066 Lactosylceramide biosynthesis Lipid metabolism #d53e55 +M00067 Sulfoglycolipids biosynthesis, ceramide--1-alkyl-2-acylglycerol => sulfatide--seminolipid Lipid metabolism #d53e55 +M00088 Ketone body biosynthesis, acetyl-CoA => acetoacetate--3-hydroxybutyrate--acetone Lipid metabolism #d53e55 +M00089 Triacylglycerol biosynthesis Lipid metabolism #d53e55 +M00090 Phosphatidylcholine (PC) biosynthesis, choline => PC Lipid metabolism #d53e55 +M00091 Phosphatidylcholine (PC) biosynthesis, PE => PC Lipid metabolism #d53e55 +M00092 Phosphatidylethanolamine (PE) biosynthesis, ethanolamine => PE Lipid metabolism #d53e55 +M00093 Phosphatidylethanolamine (PE) biosynthesis, PA => PS => PE Lipid metabolism #d53e55 +M00094 Ceramide biosynthesis Lipid metabolism #d53e55 +M00098 Acylglycerol degradation Lipid metabolism #d53e55 
+M00099 Sphingosine biosynthesis Lipid metabolism #d53e55 +M00100 Sphingosine degradation Lipid metabolism #d53e55 +M00113 Jasmonic acid biosynthesis Lipid metabolism #d53e55 +M00060 KDO2-lipid A biosynthesis, Raetz pathway, LpxL-LpxM type Lipopolysaccharide metabolism #83d2de +M00063 CMP-KDO biosynthesis Lipopolysaccharide metabolism #83d2de +M00064 ADP-L-glycero-D-manno-heptose biosynthesis Lipopolysaccharide metabolism #83d2de +M00866 KDO2-lipid A biosynthesis, Raetz pathway, non-LpxL-LpxM type Lipopolysaccharide metabolism #83d2de +M00867 KDO2-lipid A modification pathway Lipopolysaccharide metabolism #83d2de +M00016 Lysine biosynthesis, succinyl-DAP pathway, aspartate => lysine Lysine metabolism #d84e8b +M00030 Lysine biosynthesis, AAA pathway, 2-oxoglutarate => 2-aminoadipate => lysine Lysine metabolism #d84e8b +M00031 Lysine biosynthesis, mediated by LysW, 2-aminoadipate => lysine Lysine metabolism #d84e8b +M00032 Lysine degradation, lysine => saccharopine => acetoacetyl-CoA Lysine metabolism #d84e8b +M00433 Lysine biosynthesis, 2-oxoglutarate => 2-oxoadipate Lysine metabolism #d84e8b +M00525 Lysine biosynthesis, acetyl-DAP pathway, aspartate => lysine Lysine metabolism #d84e8b +M00526 Lysine biosynthesis, DAP dehydrogenase pathway, aspartate => lysine Lysine metabolism #d84e8b +M00527 Lysine biosynthesis, DAP aminotransferase pathway, aspartate => lysine Lysine metabolism #d84e8b +M00773 Tylosin biosynthesis, methylmalonyl-CoA + malonyl-CoA => tylactone => tylosin Macrolide biosynthesis #2e4b26 +M00774 Erythromycin biosynthesis, propanoyl-CoA + methylmalonyl-CoA => deoxyerythronolide B => erythromycin A--B Macrolide biosynthesis #2e4b26 +M00775 Oleandomycin biosynthesis, malonyl-CoA + methylmalonyl-CoA => 8,8a-deoxyoleandolide => oleandomycin Macrolide biosynthesis #2e4b26 +M00776 Pikromycin--methymycin biosynthesis, methylmalonyl-CoA + malonyl-CoA => narbonolide--10-deoxymethynolide => pikromycin--methymycin Macrolide biosynthesis #2e4b26 +M00777 Avermectin biosynthesis, 2-methylbutanoyl-CoA--isobutyryl-CoA => 6,8a-Seco-6,8a-deoxy-5-oxoavermectin 1a--1b aglycone => avermectin A1a--B1a--A1b--B1b Macrolide biosynthesis #2e4b26 +M00611 Oxygenic photosynthesis in plants and cyanobacteria Metabolic capacity #9378c3 +M00612 Anoxygenic photosynthesis in purple bacteria Metabolic capacity #9378c3 +M00613 Anoxygenic photosynthesis in green nonsulfur bacteria Metabolic capacity #9378c3 +M00614 Anoxygenic photosynthesis in green sulfur bacteria Metabolic capacity #9378c3 +M00615 Nitrate assimilation Metabolic capacity #9378c3 +M00616 Sulfate-sulfur assimilation Metabolic capacity #9378c3 +M00617 Methanogen Metabolic capacity #9378c3 +M00618 Acetogen Metabolic capacity #9378c3 +M00174 Methane oxidation, methanotroph, methane => formaldehyde Methane metabolism #9e7336 +M00344 Formaldehyde assimilation, xylulose monophosphate pathway Methane metabolism #9e7336 +M00345 Formaldehyde assimilation, ribulose monophosphate pathway Methane metabolism #9e7336 +M00346 Formaldehyde assimilation, serine pathway Methane metabolism #9e7336 +M00356 Methanogenesis, methanol => methane Methane metabolism #9e7336 +M00357 Methanogenesis, acetate => methane Methane metabolism #9e7336 +M00358 Coenzyme M biosynthesis Methane metabolism #9e7336 +M00378 F420 biosynthesis Methane metabolism #9e7336 +M00422 Acetyl-CoA pathway, CO2 => acetyl-CoA Methane metabolism #9e7336 +M00563 Methanogenesis, methylamine--dimethylamine--trimethylamine => methane Methane metabolism #9e7336 +M00567 Methanogenesis, CO2 => methane 
Methane metabolism #9e7336 +M00608 2-Oxocarboxylic acid chain extension, 2-oxoglutarate => 2-oxoadipate => 2-oxopimelate => 2-oxosuberate Methane metabolism #9e7336 +M00175 Nitrogen fixation, nitrogen => ammonia Nitrogen metabolism #2c2351 +M00528 Nitrification, ammonia => nitrite Nitrogen metabolism #2c2351 +M00529 Denitrification, nitrate => nitrogen Nitrogen metabolism #2c2351 +M00530 Dissimilatory nitrate reduction, nitrate => ammonia Nitrogen metabolism #2c2351 +M00531 Assimilatory nitrate reduction, nitrate => ammonia Nitrogen metabolism #2c2351 +M00804 Complete nitrification, comammox, ammonia => nitrite => nitrate Nitrogen metabolism #2c2351 +M00027 GABA (gamma-Aminobutyrate) shunt Other amino acid metabolism #c5d7a9 +M00118 Glutathione biosynthesis, glutamate => glutathione Other amino acid metabolism #c5d7a9 +M00369 Cyanogenic glycoside biosynthesis, tyrosine => dhurrin Other amino acid metabolism #c5d7a9 +M00012 Glyoxylate cycle Other carbohydrate metabolism #872b4e +M00013 Malonate semialdehyde pathway, propanoyl-CoA => acetyl-CoA Other carbohydrate metabolism #872b4e +M00014 Glucuronate pathway (uronate pathway) Other carbohydrate metabolism #872b4e +M00061 D-Glucuronate degradation, D-glucuronate => pyruvate + D-glyceraldehyde 3P Other carbohydrate metabolism #872b4e +M00081 Pectin degradation Other carbohydrate metabolism #872b4e +M00114 Ascorbate biosynthesis, plants, glucose-6P => ascorbate Other carbohydrate metabolism #872b4e +M00129 Ascorbate biosynthesis, animals, glucose-1P => ascorbate Other carbohydrate metabolism #872b4e +M00130 Inositol phosphate metabolism, PI=> PIP2 => Ins(1,4,5)P3 => Ins(1,3,4,5)P4 Other carbohydrate metabolism #872b4e +M00131 Inositol phosphate metabolism, Ins(1,3,4,5)P4 => Ins(1,3,4)P3 => myo-inositol Other carbohydrate metabolism #872b4e +M00132 Inositol phosphate metabolism, Ins(1,3,4)P3 => phytate Other carbohydrate metabolism #872b4e +M00373 Ethylmalonyl pathway Other carbohydrate metabolism #872b4e +M00532 Photorespiration Other carbohydrate metabolism #872b4e +M00549 Nucleotide sugar biosynthesis, glucose => UDP-glucose Other carbohydrate metabolism #872b4e +M00550 Ascorbate degradation, ascorbate => D-xylulose-5P Other carbohydrate metabolism #872b4e +M00552 D-galactonate degradation, De Ley-Doudoroff pathway, D-galactonate => glycerate-3P Other carbohydrate metabolism #872b4e +M00554 Nucleotide sugar biosynthesis, galactose => UDP-galactose Other carbohydrate metabolism #872b4e +M00565 Trehalose biosynthesis, D-glucose 1P => trehalose Other carbohydrate metabolism #872b4e +M00630 D-Galacturonate degradation (fungi), D-galacturonate => glycerol Other carbohydrate metabolism #872b4e +M00631 D-Galacturonate degradation (bacteria), D-galacturonate => pyruvate + D-glyceraldehyde 3P Other carbohydrate metabolism #872b4e +M00632 Galactose degradation, Leloir pathway, galactose => alpha-D-glucose-1P Other carbohydrate metabolism #872b4e +M00740 Methylaspartate cycle Other carbohydrate metabolism #872b4e +M00741 Propanoyl-CoA metabolism, propanoyl-CoA => succinyl-CoA Other carbohydrate metabolism #872b4e +M00761 Undecaprenylphosphate alpha-L-Ara4N biosynthesis, UDP-GlcA => undecaprenyl phosphate alpha-L-Ara4N Other carbohydrate metabolism #872b4e +M00854 Glycogen biosynthesis, glucose-1P => glycogen--starch Other carbohydrate metabolism #872b4e +M00855 Glycogen degradation, glycogen => glucose-6P Other carbohydrate metabolism #872b4e +M00097 beta-Carotene biosynthesis, GGAP => beta-carotene Other terpenoid biosynthesis #6e9368 +M00371 
Castasterone biosynthesis, campesterol => castasterone Other terpenoid biosynthesis #6e9368 +M00372 Abscisic acid biosynthesis, beta-carotene => abscisic acid Other terpenoid biosynthesis #6e9368 +M00363 EHEC pathogenicity signature, Shiga toxin Pathogenicity #66406d +M00542 EHEC--EPEC pathogenicity signature, T3SS and effectors Pathogenicity #66406d +M00564 Helicobacter pylori pathogenicity signature, cagA pathogenicity island Pathogenicity #66406d +M00574 Pertussis pathogenicity signature, pertussis toxin Pathogenicity #66406d +M00575 Pertussis pathogenicity signature, T1SS Pathogenicity #66406d +M00576 ETEC pathogenicity signature, heat-labile and heat-stable enterotoxins Pathogenicity #66406d +M00850 Vibrio cholerae pathogenicity signature, cholera toxins Pathogenicity #66406d +M00852 Vibrio cholerae pathogenicity signature, toxin coregulated pilus Pathogenicity #66406d +M00853 ETEC pathogenicity signature, colonization factors Pathogenicity #66406d +M00856 Salmonella enterica pathogenicity signature, typhoid toxin Pathogenicity #66406d +M00857 Salmonella enterica pathogenicity signature, Vi antigen Pathogenicity #66406d +M00859 Bacillus anthracis pathogenicity signature, anthrax toxin Pathogenicity #66406d +M00860 Bacillus anthracis pathogenicity signature, polyglutamic acid capsule biosynthesis Pathogenicity #66406d +M00161 Photosystem II Photosynthesis #cfa68a +M00163 Photosystem I Photosynthesis #cfa68a +M00597 Anoxygenic photosystem II [BR:ko00194] Photosynthesis #cfa68a +M00598 Anoxygenic photosystem I [BR:ko00194] Photosynthesis #cfa68a +M00660 Xanthomonas spp. pathogenicity signature, T3SS and effectors Plant pathogenicity #461d27 +M00133 Polyamine biosynthesis, arginine => agmatine => putrescine => spermidine Polyamine biosynthesis #a5b3da +M00134 Polyamine biosynthesis, arginine => ornithine => putrescine Polyamine biosynthesis #a5b3da +M00135 GABA biosynthesis, eukaryotes, putrescine => GABA Polyamine biosynthesis #a5b3da +M00136 GABA biosynthesis, prokaryotes, putrescine => GABA Polyamine biosynthesis #a5b3da +M00793 dTDP-L-rhamnose biosynthesis Polyketide sugar unit biosynthesis #5c4f24 +M00794 dTDP-6-deoxy-D-allose biosynthesis Polyketide sugar unit biosynthesis #5c4f24 +M00795 dTDP-beta-L-noviose biosynthesis Polyketide sugar unit biosynthesis #5c4f24 +M00796 dTDP-D-mycaminose biosynthesis Polyketide sugar unit biosynthesis #5c4f24 +M00797 dTDP-D-desosamine biosynthesis Polyketide sugar unit biosynthesis #5c4f24 +M00798 dTDP-L-mycarose biosynthesis Polyketide sugar unit biosynthesis #5c4f24 +M00799 dTDP-L-oleandrose biosynthesis Polyketide sugar unit biosynthesis #5c4f24 +M00800 dTDP-L-megosamine biosynthesis Polyketide sugar unit biosynthesis #5c4f24 +M00801 dTDP-L-olivose biosynthesis Polyketide sugar unit biosynthesis #5c4f24 +M00802 dTDP-D-forosamine biosynthesis Polyketide sugar unit biosynthesis #5c4f24 +M00803 dTDP-D-angolosamine biosynthesis Polyketide sugar unit biosynthesis #5c4f24 +M00048 Inosine monophosphate biosynthesis, PRPP + glutamine => IMP Purine metabolism #e0a7d2 +M00049 Adenine ribonucleotide biosynthesis, IMP => ADP,ATP Purine metabolism #e0a7d2 +M00050 Guanine ribonucleotide biosynthesis IMP => GDP,GTP Purine metabolism #e0a7d2 +M00546 Purine degradation, xanthine => urea Purine metabolism #e0a7d2 +M00046 Pyrimidine degradation, uracil => beta-alanine, thymine => 3-aminoisobutanoate Pyrimidine metabolism #25585e +M00051 Uridine monophosphate biosynthesis, glutamine (+ PRPP) => UMP Pyrimidine metabolism #25585e +M00052 Pyrimidine ribonucleotide 
biosynthesis, UMP => UDP--UTP,CDP--CTP Pyrimidine metabolism #25585e +M00053 Pyrimidine deoxyribonuleotide biosynthesis, CDP--CTP => dCDP--dCTP,dTDP--dTTP Pyrimidine metabolism #25585e +M00018 Threonine biosynthesis, aspartate => homoserine => threonine Serine and threonine metabolism #de7d78 +M00020 Serine biosynthesis, glycerate-3P => serine Serine and threonine metabolism #de7d78 +M00033 Ectoine biosynthesis, aspartate => ectoine Serine and threonine metabolism #de7d78 +M00555 Betaine biosynthesis, choline => betaine Serine and threonine metabolism #de7d78 +M00101 Cholesterol biosynthesis, squalene 2,3-epoxide => cholesterol Sterol biosynthesis #4e96a2 +M00102 Ergocalciferol biosynthesis Sterol biosynthesis #4e96a2 +M00103 Cholecalciferol biosynthesis Sterol biosynthesis #4e96a2 +M00104 Bile acid biosynthesis, cholesterol => cholate--chenodeoxycholate Sterol biosynthesis #4e96a2 +M00106 Conjugated bile acid biosynthesis, cholate => taurocholate--glycocholate Sterol biosynthesis #4e96a2 +M00107 Steroid hormone biosynthesis, cholesterol => prognenolone => progesterone Sterol biosynthesis #4e96a2 +M00108 C21-Steroid hormone biosynthesis, progesterone => corticosterone--aldosterone Sterol biosynthesis #4e96a2 +M00109 C21-Steroid hormone biosynthesis, progesterone => cortisol--cortisone Sterol biosynthesis #4e96a2 +M00110 C19--C18-Steroid hormone biosynthesis, pregnenolone => androstenedione => estrone Sterol biosynthesis #4e96a2 +M00862 beta-Oxidation, peroxisome, tri--dihydroxycholestanoyl-CoA => choloyl--chenodeoxycholoyl-CoA Sterol biosynthesis #4e96a2 +M00176 Assimilatory sulfate reduction, sulfate => H2S Sulfur metabolism #4e96a2 +M00595 Thiosulfate oxidation by SOX complex, thiosulfate => sulfate Sulfur metabolism #4e96a2 +M00596 Dissimilatory sulfate reduction, sulfate => H2S Sulfur metabolism #4e96a2 +M00664 Nodulation Symbiosis #88574e +M00095 C5 isoprenoid biosynthesis, mevalonate pathway Terpenoid backbone biosynthesis #4e6089 +M00096 C5 isoprenoid biosynthesis, non-mevalonate pathway Terpenoid backbone biosynthesis #4e6089 +M00364 C10-C20 isoprenoid biosynthesis, bacteria Terpenoid backbone biosynthesis #4e6089 +M00365 C10-C20 isoprenoid biosynthesis, archaea Terpenoid backbone biosynthesis #4e6089 +M00366 C10-C20 isoprenoid biosynthesis, plants Terpenoid backbone biosynthesis #4e6089 +M00367 C10-C20 isoprenoid biosynthesis, non-plant eukaryotes Terpenoid backbone biosynthesis #4e6089 +M00849 C5 isoprenoid biosynthesis, mevalonate pathway, archaea Terpenoid backbone biosynthesis #4e6089 +M00778 Type II polyketide backbone biosynthesis, acyl-CoA + malonyl-CoA => polyketide Type II polyketide biosynthesis #af7194 +M00779 Dihydrokalafungin biosynthesis, octaketide => dihydrokalafungin Type II polyketide biosynthesis #af7194 +M00780 Tetracycline--oxytetracycline biosynthesis, pretetramide => tetracycline--oxytetracycline Type II polyketide biosynthesis #af7194 +M00781 Nogalavinone--aklavinone biosynthesis, deoxynogalonate--deoxyaklanonate => nogalavinone--aklavinone Type II polyketide biosynthesis #af7194 +M00782 Mithramycin biosynthesis, 4-demethylpremithramycinone => mithramycin Type II polyketide biosynthesis #af7194 +M00783 Tetracenomycin C--8-demethyltetracenomycin C biosynthesis, tetracenomycin F2 => tetracenomycin C--8-demethyltetracenomycin C Type II polyketide biosynthesis #af7194 +M00784 Elloramycin biosynthesis, 8-demethyltetracenomycin C => elloramycin A Type II polyketide biosynthesis #af7194 +M00823 Chlortetracycline biosynthesis, pretetramide => chlortetracycline Type 
\ No newline at end of file
diff --git a/data/MicrobeAnnotator_KEGG/KEGG_Regular_Module_Information.pkl b/data/MicrobeAnnotator_KEGG/KEGG_Regular_Module_Information.pkl
new file mode 100644
index 0000000..c2ff119
Binary files /dev/null and b/data/MicrobeAnnotator_KEGG/KEGG_Regular_Module_Information.pkl differ
diff --git a/data/MicrobeAnnotator_KEGG/KEGG_Structural_Module_Information.pkl b/data/MicrobeAnnotator_KEGG/KEGG_Structural_Module_Information.pkl
new file mode 100644
index 0000000..ba85377
Binary files /dev/null and b/data/MicrobeAnnotator_KEGG/KEGG_Structural_Module_Information.pkl differ
diff --git a/data/MicrobeAnnotator_KEGG/MicrobeAnnotator-KEGG.tar.gz b/data/MicrobeAnnotator_KEGG/MicrobeAnnotator-KEGG.tar.gz
new file mode 100644
index 0000000..8c3f1d8
Binary files /dev/null and b/data/MicrobeAnnotator_KEGG/MicrobeAnnotator-KEGG.tar.gz differ
diff --git a/data/MicrobeAnnotator_KEGG/MicrobeAnnotator-KEGG.tar.gz.md5 b/data/MicrobeAnnotator_KEGG/MicrobeAnnotator-KEGG.tar.gz.md5
new file mode 100644
index 0000000..12fdf2c
--- /dev/null
+++ b/data/MicrobeAnnotator_KEGG/MicrobeAnnotator-KEGG.tar.gz.md5
@@ -0,0 +1 @@
+7207b9efe0124c6e9781cf4cf4fa24de MicrobeAnnotator-KEGG.tar.gz
diff --git a/data/MicrobeAnnotator_KEGG/README.md b/data/MicrobeAnnotator_KEGG/README.md
new file mode 100644
index 0000000..3c1a62d
--- /dev/null
+++ b/data/MicrobeAnnotator_KEGG/README.md
@@ -0,0 +1,69 @@
+# MicrobeAnnotator-KEGG
+
+**If this is used in any way, please cite the source publication:**
+
+Ruiz-Perez, C.A., Conrad, R.E. & Konstantinidis, K.T. MicrobeAnnotator: a user-friendly, comprehensive functional annotation pipeline for microbial genomes. BMC Bioinformatics 22, 11 (2021). https://doi.org/10.1186/s12859-020-03940-5
+
+**This data has been incorporated from the following source:**
+
+https://github.com/cruizperez/MicrobeAnnotator/tree/master/microbeannotator/data
+
+**File Descriptions:**
+
+* `KEGG_Regular_Module_Information.pkl` - Python dictionary of regular modules from `MicrobeAnnotator` of `{id_module:structured_kegg_orthologs}`
+* `KEGG_Bifurcating_Module_Information.pkl` - Python dictionary of bifurcating modules from `MicrobeAnnotator` of `{id_module:structured_kegg_orthologs}`
+* `KEGG_Structural_Module_Information.pkl` - Python dictionary of structural modules from `MicrobeAnnotator` of `{id_module:structured_kegg_orthologs}`
+* `KEGG_Module_Information.txt` - Table containing each KEGG module, its higher-level categories, and its module color
+* `KEGG_Module-KOs.pkl` - Flattened dictionary which includes `{id_module: {KO_1, KO_2, ..., KO_M}}`. Note: This is not structured and should be used cautiously as KEGG modules and completion calculations are complex. Generated with the Python code below:
+
+```python
+import pickle, glob, os
+from collections import defaultdict
+
+kegg_directory = "{}/MicrobeAnnotator_KEGG/".format(os.environ["VEBA_DATABASE"])
+
+delimiters = [",","_","-","+"]
+
+# Load MicrobeAnnotator KEGG dictionaries
+module_to_kos__unprocessed = defaultdict(set)
+for fp in glob.glob(os.path.join(kegg_directory, "*.pkl")):
+    with open(fp, "rb") as f:
+        d = pickle.load(f)
+
+    for id_module, v1 in d.items():
+        if isinstance(v1, list):
+            try:
+                module_to_kos__unprocessed[id_module].update(v1)
+            except TypeError:
+                for v2 in v1:
+                    module_to_kos__unprocessed[id_module].update(v2)
+        else:
+            for k2, v2 in v1.items():
+                if isinstance(v2, list):
+                    try:
+                        module_to_kos__unprocessed[id_module].update(v2)
+                    except TypeError:
+                        for v3 in v2:
+                            module_to_kos__unprocessed[id_module].update(v3)
+
+# Flatten the KEGG orthologs (split composite identifiers on the delimiters above)
+module_to_kos__processed = dict()
+for id_module, kos_unprocessed in module_to_kos__unprocessed.items():
+    kos_processed = set()
+    for id_ko in kos_unprocessed:
+        composite = False
+        for sep in delimiters:
+            if sep in id_ko:
+                id_ko = id_ko.replace(sep, ";")
+                composite = True
+        if composite:
+            kos_composite = set(map(str.strip, filter(bool, id_ko.split(";"))))
+            kos_processed.update(kos_composite)
+        else:
+            kos_processed.add(id_ko)
+    module_to_kos__processed[id_module] = kos_processed
+
+# Write
+with open(os.path.join(kegg_directory, "KEGG_Module-KOs.pkl"), "wb") as f:
+    pickle.dump(module_to_kos__processed, f)
+```
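+
+As a quick sanity check, the flattened dictionary can be loaded back and intersected with a set of detected KEGG orthologs. This is only a naive coverage sketch (the KO identifiers below are a hypothetical example set); true module completion depends on each module's internal structure, not just KO overlap:
+
+```python
+import os, pickle
+
+kegg_directory = "{}/MicrobeAnnotator_KEGG/".format(os.environ["VEBA_DATABASE"])
+
+with open(os.path.join(kegg_directory, "KEGG_Module-KOs.pkl"), "rb") as f:
+    module_to_kos = pickle.load(f)
+
+# Hypothetical set of KO identifiers detected in your annotations
+detected_kos = {"K00018", "K00058", "K00831"}
+
+for id_module, kos in sorted(module_to_kos.items()):
+    if not kos:
+        continue
+    fraction_detected = len(kos & detected_kos) / len(kos)
+    if fraction_detected > 0:
+        print(id_module, "{:.1%}".format(fraction_detected))
+```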
\ No newline at end of file
diff --git a/data/README.md b/data/README.md
index 10b3b4b..6525e90 100644
--- a/data/README.md
+++ b/data/README.md
@@ -9,4 +9,13 @@ The following fastq files are subsets of the original SRA sequences designed for
| S3 | SRR17458630 | FASTQ | DNA | 2389989 | 75 | 150.4 | 151 | 56.38 |
| S4 | SRR17458638 | FASTQ | DNA | 3142566 | 75 | 150.5 | 151 | 46.34 |
-[**Download**](https://zenodo.org/record/7946802#.ZGVSpuzMKDU)
\ No newline at end of file
+Also includes the following:
+
+* Metagenomic assemblies using metaSPAdes with sorted BAM files from Bowtie2
+* Genomes, gene models, etc.
+* Taxonomy classifications at the genome and genome cluster level
+* Annotations for genes and protein clusters
+* Biosynthetic gene clusters
+* Clusters for genomes and proteins
+
+[**Download**](https://zenodo.org/records/10094990)
\ No newline at end of file
diff --git a/install/README.md b/install/README.md
index d7d56b0..f92e986 100644
--- a/install/README.md
+++ b/install/README.md
@@ -3,16 +3,18 @@ ____________________________________________________________
#### Software installation
One issue with having large-scale pipeline suites with open-source software is the issue of dependencies. One solution for this is to have a modular software structure where each module has its own `conda` environment. This allows for minimizing dependency constraints as this software suite uses an array of diverse packages from different developers.
-The basis for these environments is creating a separate environment for each module with the `VEBA-` prefix and `_env` as the suffix. For example `VEBA-assembly_env` or `VEBA-binning-prokaryotic_env`. Because of this, `VEBA` is currently not available as a `conda` package but each module will be in the near future. In the meantime, please use the `veba/install/install_veba.sh` script which installs each environment from the yaml files in `veba/install/environments/`. After installing the environments, use the `veba/install/download_databases.sh` script to download and configure the databases while also adding the environment variables to the activate/deactivate scripts in each environment. To install anything manually, just read the scripts as they are well documented and refer to different URL and paths for specific installation options.
+The basis for these environments is creating a separate environment for each module with the `VEBA-` prefix and `_env` as the suffix. For example, `VEBA-assembly_env` or `VEBA-binning-prokaryotic_env`. Because of this, `VEBA` is currently not available as a `conda` package, but each module will be in the near future. In the meantime, please use the `veba/install/install.sh` script, which installs each environment from the yaml files in `veba/install/environments/`. After installing the environments, use the `veba/install/download_databases.sh` script to download and configure the databases while also adding the environment variables to the activate/deactivate scripts in each environment. To install anything manually, just read the scripts; they are well documented and refer to the different URLs and paths for specific installation options.
-The majority of the time taken to build database is downloading/decompressing large archives, `Diamond` database creation of `UniRef`, and `MMSEQS2` database creation of microeukaryotic protein database.
+The majority of the time taken to build the database is spent downloading/decompressing large archives (e.g., `UniRef` & `GTDB`), creating the `Diamond` database of `UniRef`, and creating the `MMSEQS2` databases of the `MicroEuk` database.
Total size is `243 GB` but if you have certain databases installed already then you can just symlink them so the `VEBA_DATABASE` path has the correct structure. Note, the exact size may vary as Pfam and UniRef are updated regularly.
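+
+For example, a minimal sketch of symlinking a pre-existing database into the expected structure (the source path is a placeholder; the subdirectory names must match what `download_databases.sh` creates, e.g., `Classify/NCBITaxonomy`):
+
+```bash
+# Hypothetical example: reuse an existing NCBI Taxonomy download instead of re-fetching it
+mkdir -p ${VEBA_DATABASE}/Classify
+ln -s /path/to/existing/NCBITaxonomy ${VEBA_DATABASE}/Classify/NCBITaxonomy
+```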
Each major version will be packaged as a [release](https://github.com/jolespin/veba/releases) which will include a log of module and script versions.

-**Download Anaconda:**
-[https://www.anaconda.com/products/distribution](https://www.anaconda.com/products/distribution)
+**Download Miniconda (or Anaconda):**
+
+* [https://docs.conda.io/projects/miniconda/en/latest/](https://docs.conda.io/projects/miniconda/en/latest/) (Recommended)
+* [https://www.anaconda.com/products/distribution](https://www.anaconda.com/products/distribution)

____________________________________________________________

@@ -33,7 +35,7 @@ Currently, **Conda environments for VEBA are ONLY configured for Linux** and, du
* Download/configure databases

-**0. Clean up your conda installation [Optional, but recommended]**
+**0. Clean up your conda installation [Optional, but highly recommended]**

The `VEBA` installation is going to configure some `conda` environments for you and some of them have quite a few packages. To minimize the likelihood of [weird errors](https://forum.qiime2.org/t/valueerror-unsupported-format-character-t-0x54-at-index-3312-when-creating-environment-from-environment-file/25237), it's recommended to do the following:
@@ -83,7 +85,7 @@ The `VEBA` installation is going to configure some `conda` environments for you
```
# For stable version, download and decompress the tarball:
-VERSION="1.3.0"
+VERSION="1.4.0"

wget https://github.com/jolespin/veba/archive/refs/tags/v${VERSION}.tar.gz
tar -xvf v${VERSION}.tar.gz && mv veba-${VERSION} veba
@@ -106,14 +108,16 @@ cd veba/install
The update from `CheckM1` -> `CheckM2` and installation of `antiSMASH` require more memory and may require grid access if the head node is limited.

```
-bash install_veba.sh
+bash install.sh
```

**3. Activate the database conda environment, download, and configure databases**

**Recommended resource allocation:** 48 GB memory (time is dependent on I/O of database repositories)

-⚠️ **This step should use ~48 GB memory** and should be run using a compute grid via SLURM or SunGridEngine. If this command is run on the head node it will likely fail or timeout if a connection is interrupted. The most computationally intensive steps are creating a `Diamond` database of `UniRef` and a `MMSEQS2` database of the microeukaryotic protein database. Note the duration will depend on several factors including your internet connection speed and the I/O of public repositories.
+⚠️ **This step should use ~48 GB memory** and should be run using a compute grid via `SLURM` or `SunGridEngine`. **If this command is run on the head node it will likely fail or timeout if a connection is interrupted.** The most computationally intensive steps are creating a `Diamond` database of `UniRef` and a `MMSEQS2` database of the `MicroEuk100/90/50`.
+
+Note the duration will depend on several factors including your internet connection speed and the I/O of public repositories.

**Future releases will split the downloading and configuration to better make use of resources.**
@@ -163,7 +167,7 @@ qsub -o logs/${N}.o -e logs/${N}.e -cwd -N ${N} -j y -pe threaded ${N_JOBS} "${CMD}"
PARTITION=[partition name]
ACCOUNT=[account name]

-sbatch -A ${ACCOUNT} -p ${PARTITION} -J ${N} -N 1 -c ${N_JOBS} --ntasks-per-node=1 -o logs/${N}.o -e logs/${N}.e --export=ALL -t 12:00:00 --mem=64G --wrap="${CMD}"
+sbatch -A ${ACCOUNT} -p ${PARTITION} -J ${N} -N 1 -c ${N_JOBS} --ntasks-per-node=1 -o logs/${N}.o -e logs/${N}.e --export=ALL -t 16:00:00 --mem=24G --wrap="${CMD}"
```

Now, you should have the following environments:
@@ -183,6 +187,7 @@ VEBA-phylogeny_env
VEBA-preprocess_env
VEBA-profile_env
```
+
All the environments should have the `VEBA_DATABASE` environment variable set. If not, then add it manually to ~/.bash_profile: `export VEBA_DATABASE=/path/to/veba_database`.

You can check to make sure the `conda` environments were created and all of the environment variables were created using the following command:
@@ -218,7 +223,7 @@ ____________________________________________________________
```
# Remove conda environments
-bash uninstall_veba.sh
+bash uninstall.sh

# Remove VEBA database
rm -rfv /path/to/veba_database
@@ -230,6 +235,6 @@ ____________________________________________________________
There are currently 2 ways to update veba:

1. Basic uninstall reinstall - You can uninstall and reinstall using the scripts in the `veba/install/` directory. It's recommended to do a fresh reinstall when updating from `v1.0.x` → `v1.2.x`.
-2. Patching existing installation - Complete reinstalls of *VEBA* environments and databases is time consuming so [we've detailed how to do specific patches **for advanced users**](PATCHES.md). If you don't feel comfortable running these commands, then just do a fresh install if you would like to update.
+2. Patching existing installation - TBD: a guide for updating specific modules in an existing installation.
diff --git a/install/PATCHES.md b/install/deprecated/PATCHES.md
similarity index 100%
rename from install/PATCHES.md
rename to install/deprecated/PATCHES.md
diff --git a/install/download_databases.sh b/install/download_databases.sh
index 12833fd..06c4d48 100644
--- a/install/download_databases.sh
+++ b/install/download_databases.sh
@@ -1,11 +1,12 @@
#!/bin/bash
-# __version__ = "2023.10.23"
-# VEBA_DATABASE_VERSION = "VDB_v5.2"
-# MICROEUKAYROTIC_DATABASE_VERSION = "VDB-Microeukaryotic_v2.1"
+# __version__ = "2023.12.11"
+# VEBA_DATABASE_VERSION = "VDB_v6"
+# MICROEUKARYOTIC_DATABASE_VERSION = "MicroEuk_v3"

# Create database
DATABASE_DIRECTORY=${1:-"."}
REALPATH_DATABASE_DIRECTORY=$(realpath $DATABASE_DIRECTORY)
+SCRIPT_DIRECTORY=$(dirname "$0")

# N_JOBS=$(2:-"1")
@@ -28,7 +29,7 @@ echo ". .. ... ..... ........ ............."
echo "i * Processing NCBITaxonomy"
echo ". .. ... ..... ........ ............."
mkdir -v -p ${DATABASE_DIRECTORY}/Classify/NCBITaxonomy
-wget -v -P ${DATABASE_DIRECTORY}/Classify/NCBITaxonomy https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.FULL.gz
+# wget -v -P ${DATABASE_DIRECTORY}/Classify/NCBITaxonomy https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.FULL.gz
wget -v -P ${DATABASE_DIRECTORY} https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
# python -c 'import sys; from ete3 import NCBITaxa; NCBITaxa(taxdump_file="%s/taxdump.tar.gz"%(sys.argv[1]), dbfile="%s/Classify/NCBITaxonomy/taxa.sqlite"%(sys.argv[1]))' $DATABASE_DIRECTORY
tar xzfv ${DATABASE_DIRECTORY}/taxdump.tar.gz -C ${DATABASE_DIRECTORY}/Classify/NCBITaxonomy/
@@ -86,18 +87,56 @@ echo ". .. ... ..... ........ ............."
echo "v * Processing Microeukaryotic MMSEQS2 database"
echo ". .. ... ..... ........ ............."
-# Download v2.1 from Zenodo
-wget -v -O ${DATABASE_DIRECTORY}/Microeukaryotic.tar.gz https://zenodo.org/record/7485114/files/VDB-Microeukaryotic_v2.tar.gz?download=1
-mkdir -p ${DATABASE_DIRECTORY}/Classify/Microeukaryotic && tar -xvzf ${DATABASE_DIRECTORY}/Microeukaryotic.tar.gz -C ${DATABASE_DIRECTORY}/Classify/Microeukaryotic --strip-components=1
-mmseqs createdb ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.faa.gz ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/microeukaryotic
-rm -rf ${DATABASE_DIRECTORY}/Microeukaryotic.tar.gz
+## Download v2.1 from Zenodo
+# wget -v -O ${DATABASE_DIRECTORY}/Microeukaryotic.tar.gz https://zenodo.org/record/7485114/files/VDB-Microeukaryotic_v2.tar.gz?download=1
+# mkdir -p ${DATABASE_DIRECTORY}/Classify/Microeukaryotic && tar -xvzf ${DATABASE_DIRECTORY}/Microeukaryotic.tar.gz -C ${DATABASE_DIRECTORY}/Classify/Microeukaryotic --strip-components=1
+# mmseqs createdb --compressed 1 ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.faa.gz ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/microeukaryotic
+# rm -rf ${DATABASE_DIRECTORY}/Microeukaryotic.tar.gz

-# eukaryota_odb10 subset of Microeukaryotic Protein Database
-wget -v -O ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.eukaryota_odb10.list https://zenodo.org/record/7485114/files/reference.eukaryota_odb10.list?download=1
-seqkit grep -f ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.eukaryota_odb10.list ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.faa.gz > ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.eukaryota_odb10.faa
-mmseqs createdb ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.eukaryota_odb10.faa ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/microeukaryotic.eukaryota_odb10
-rm -rf ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.eukaryota_odb10.faa
-rm -rf ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.faa.gz # Comment this out if you want to keep the actual protein sequences
+# # eukaryota_odb10 subset of Microeukaryotic Protein Database
+# wget -v -O ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.eukaryota_odb10.list https://zenodo.org/record/7485114/files/reference.eukaryota_odb10.list?download=1
+# seqkit grep -f ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.eukaryota_odb10.list ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.faa.gz > ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.eukaryota_odb10.faa
+# mmseqs createdb --compressed 1 ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.eukaryota_odb10.faa ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/microeukaryotic.eukaryota_odb10
+# rm -rf ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.eukaryota_odb10.faa
+# rm -rf ${DATABASE_DIRECTORY}/Classify/Microeukaryotic/reference.faa.gz # Comment this out if you want to keep the actual protein sequences
+
+# Download MicroEuk_v3 from Zenodo
+wget -v -O ${DATABASE_DIRECTORY}/MicroEuk_v3.tar.gz https://zenodo.org/records/10139451/files/MicroEuk_v3.tar.gz?download=1
+tar xvzf ${DATABASE_DIRECTORY}/MicroEuk_v3.tar.gz -C ${DATABASE_DIRECTORY}
+mkdir -p ${DATABASE_DIRECTORY}/Classify/MicroEuk
+
+# Source Taxonomy
+cp -rf ${DATABASE_DIRECTORY}/MicroEuk_v3/source_taxonomy.tsv.gz ${DATABASE_DIRECTORY}/Classify/MicroEuk
+
+# MicroEuk100
+gzip -d ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk100.faa.gz
+mmseqs createdb --compressed 1 ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk100.faa ${DATABASE_DIRECTORY}/Classify/MicroEuk/MicroEuk100
+
+# MicroEuk100.eukaryota_odb10
+gzip -d ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk100.eukaryota_odb10.list.gz
+seqkit grep -f ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk100.eukaryota_odb10.list ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk100.faa | mmseqs createdb --compressed 1 stdin ${DATABASE_DIRECTORY}/Classify/MicroEuk/MicroEuk100.eukaryota_odb10
+
+# MicroEuk90
+gzip -d -c ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk90_clusters.tsv.gz | cut -f1 | sort -u > ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk90.list
+seqkit grep -f ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk90.list ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk100.faa | mmseqs createdb --compressed 1 stdin ${DATABASE_DIRECTORY}/Classify/MicroEuk/MicroEuk90
+
+# MicroEuk50
+gzip -d -c ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk50_clusters.tsv.gz | cut -f1 | sort -u > ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk50.list
+seqkit grep -f ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk50.list ${DATABASE_DIRECTORY}/MicroEuk_v3/MicroEuk100.faa | mmseqs createdb --compressed 1 stdin ${DATABASE_DIRECTORY}/Classify/MicroEuk/MicroEuk50
+
+# source_to_lineage.dict.pkl.gz
+build_source_to_lineage_dictionary.py -i ${DATABASE_DIRECTORY}/MicroEuk_v3/source_taxonomy.tsv.gz -o ${DATABASE_DIRECTORY}/Classify/MicroEuk/source_to_lineage.dict.pkl.gz
+
+# target_to_source.dict.pkl.gz
+build_target_to_source_dictionary.py -i ${DATABASE_DIRECTORY}/MicroEuk_v3/identifier_mapping.proteins.tsv.gz -o ${DATABASE_DIRECTORY}/Classify/MicroEuk/target_to_source.dict.pkl.gz
+
+# Remove intermediate files
+rm -rf ${DATABASE_DIRECTORY}/MicroEuk_v3/
+rm -rf ${DATABASE_DIRECTORY}/MicroEuk_v3.tar.gz

# MarkerSets
echo ". .. ... ..... ........ ............."
@@ -213,11 +252,17 @@ rm -rf ${DATABASE_DIRECTORY}/Contamination/AntiFam/*.seed
mkdir -v -p ${DATABASE_DIRECTORY}/Contamination/kmers
wget -v -O ${DATABASE_DIRECTORY}/Contamination/kmers/ribokmers.fa.gz https://figshare.com/ndownloader/files/36220587

-# Replacing GRCh38 with CHM13v2.0 in v2022.10.18
+# T2T-CHM13v2.0
+# Bowtie2 Index
wget -v -P ${DATABASE_DIRECTORY} https://genome-idx.s3.amazonaws.com/bt/chm13v2.0.zip
unzip -d ${DATABASE_DIRECTORY}/Contamination/ ${DATABASE_DIRECTORY}/chm13v2.0.zip
rm -rf ${DATABASE_DIRECTORY}/chm13v2.0.zip
+
+# # MiniMap2 Index (Uncomment if you plan on using long reads (7.1 GB))
+# wget -v -P ${DATABASE_DIRECTORY} https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz
+# minimap2 -d ${DATABASE_DIRECTORY}/Contamination/chm13v2.0/chm13v2.0.mmi ${DATABASE_DIRECTORY}/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz
+# rm -rf ${DATABASE_DIRECTORY}/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz
+
echo ". .. ... ..... ........ ............."
echo "xii * Adding the following environment variable to VEBA environments: export VEBA_DATABASE=${REALPATH_DATABASE_DIRECTORY}" # CONDA_BASE=$(which conda | python -c "import sys; print('/'.join(sys.stdin.read().split('/')[:-2]))") diff --git a/install/environments/VEBA-assembly_env.yml b/install/environments/VEBA-assembly_env.yml index 692c79e..6d5a013 100644 --- a/install/environments/VEBA-assembly_env.yml +++ b/install/environments/VEBA-assembly_env.yml @@ -1,4 +1,4 @@ -name: VEBA-assembly_env__2023.5.15 +name: VEBA-assembly_env__2023.11.30 channels: - conda-forge - bioconda @@ -16,15 +16,16 @@ dependencies: - bz2file=0.98=py_0 - bzip2=1.0.8=h7f98852_4 - c-ares=1.18.1=h7f98852_0 - - ca-certificates=2022.12.7=ha878542_0 + - ca-certificates=2023.11.17=hbcca054_0 - cairo=1.16.0=ha61ee94_1014 - - certifi=2022.12.7=pyhd8ed1ab_0 + - certifi=2023.11.17=pyhd8ed1ab_0 - cffi=1.15.1=py39he91dace_2 - charset-normalizer=2.1.1=pyhd8ed1ab_0 - colorama=0.4.6=pyhd8ed1ab_0 - coreutils=9.3=h0b41bf4_0 - - cryptography=38.0.4=py39hd97740a_0 + - cryptography=41.0.7=py39hd4f0224_0 - expat=2.5.0=h27087fc_0 + - flye=2.9.3=py39hd65a603_0 - font-ttf-dejavu-sans-mono=2.37=hab24e00_0 - font-ttf-inconsolata=3.000=h77eed37_0 - font-ttf-source-code-pro=2.038=h77eed37_0 @@ -38,21 +39,22 @@ dependencies: - giflib=5.2.1=h36c2ea0_2 - graphite2=1.3.13=h58526e2_1001 - harfbuzz=5.3.0=h418a68e_0 - - htslib=1.16=h6bc39ce_0 + - htslib=1.18=h81da01d_0 - icu=70.1=h27087fc_0 - idna=3.4=pyhd8ed1ab_0 - jpeg=9e=h166bdaf_2 + - k8=0.2.5=hdcf5f25_4 - kernel-headers_linux-64=3.10.0=h4a8ded7_13 - keyutils=1.6.1=h166bdaf_0 - - krb5=1.19.3=h3790be6_0 - - lcms2=2.14=h6ed2654_0 + - krb5=1.21.2=h659d440_0 + - lcms2=2.12=hddcbb42_0 - ld_impl_linux-64=2.39=hcc3a1bd_1 - lerc=4.0.0=h27087fc_0 - libblas=3.9.0=16_linux64_openblas - libcblas=3.9.0=16_linux64_openblas - - libcups=2.3.3=h3e49a29_2 - - libcurl=7.86.0=h7bff187_1 - - libdeflate=1.13=h166bdaf_0 + - libcups=2.3.3=h4637d8d_4 + - libcurl=8.2.1=hca28451_0 + - libdeflate=1.19=hd590300_0 - libedit=3.1.20191231=he28a2e2_2 - libev=4.33=h516909a_1 - libffi=3.4.2=h7f98852_5 @@ -64,14 +66,14 @@ dependencies: - libhwloc=2.8.0=h32351e8_1 - libiconv=1.17=h166bdaf_0 - liblapack=3.9.0=16_linux64_openblas - - libnghttp2=1.47.0=hdcd2b5c_1 + - libnghttp2=1.52.0=h61bc06f_0 - libnsl=2.0.0=h7f98852_0 - libopenblas=0.3.21=pthreads_h78a6416_3 - libpng=1.6.39=h753d276_0 - libsqlite=3.40.0=h753d276_0 - - libssh2=1.10.0=haa6b8db_3 + - libssh2=1.11.0=h0841786_0 - libstdcxx-ng=12.2.0=h46fd767_19 - - libtiff=4.4.0=h0e0dad5_3 + - libtiff=4.2.0=hf544144_3 - libuuid=2.32.1=h7f98852_1000 - libwebp-base=1.2.4=h166bdaf_0 - libxcb=1.13=h7f98852_1004 @@ -79,11 +81,12 @@ dependencies: - libzlib=1.2.13=h166bdaf_4 - llvm-openmp=8.0.1=hc9558a2_0 - megahit=1.2.9=h2e03b76_1 + - minimap2=2.26=he4a0461_2 - ncurses=6.3=h27087fc_1 - numpy=1.23.5=py39h3d75532_0 - - openjdk=17.0.3=hafdced1_4 + - openjdk=11.0.1=h516909a_1016 - openmp=8.0.1=0 - - openssl=1.1.1t=h0b41bf4_0 + - openssl=3.2.0=hd590300_1 - pandas=1.5.2=py39h4661b88_0 - pathlib2=2.3.7.post1=py39hf3d152e_2 - pbzip2=1.1.13=0 @@ -93,9 +96,9 @@ dependencies: - pixman=0.40.0=h36c2ea0_0 - pthread-stubs=0.4=h36c2ea0_1001 - pycparser=2.21=pyhd8ed1ab_0 - - pyopenssl=22.1.0=pyhd8ed1ab_0 + - pyopenssl=23.3.0=pyhd8ed1ab_0 - pysocks=1.7.1=pyha2e5f31_6 - - python=3.9.15=h47a2c10_0_cpython + - python=3.9.16=h2782a2a_0_cpython - python-dateutil=2.8.2=pyhd8ed1ab_0 - python-tzdata=2022.7=pyhd8ed1ab_0 - python_abi=3.9=3_cp39 diff --git a/install/environments/VEBA-cluster_env.yml 
b/install/environments/VEBA-cluster_env.yml index 2f2d189..b9ff294 100644 --- a/install/environments/VEBA-cluster_env.yml +++ b/install/environments/VEBA-cluster_env.yml @@ -1,4 +1,4 @@ -name: VEBA-cluster_env__v2023.5.15 +name: VEBA-cluster_env__v2023.12.8 channels: - conda-forge - bioconda @@ -9,27 +9,36 @@ dependencies: - _openmp_mutex=4.5=2_gnu - aria2=1.36.0=h1e4e653_3 - biopython=1.80=py311hd4cff14_0 + - blast=2.14.1=pl5321h6f7f691_0 - brotlipy=0.7.0=py311hd4cff14_1005 - bz2file=0.98=py_0 - bzip2=1.0.8=h7f98852_4 - c-ares=1.18.1=h7f98852_0 - - ca-certificates=2022.12.7=ha878542_0 - - certifi=2022.12.7=pyhd8ed1ab_0 + - ca-certificates=2023.11.17=hbcca054_0 + - certifi=2023.11.17=pyhd8ed1ab_0 - cffi=1.15.1=py311h409f033_3 - charset-normalizer=2.1.1=pyhd8ed1ab_0 - colorama=0.4.6=pyhd8ed1ab_0 - coreutils=9.3=h0b41bf4_0 - cryptography=39.0.0=py311h9b4c7bb_0 - - fastani=1.33=h0fdf51a_1 + - curl=8.1.2=h409715c_0 + - diamond=2.1.8=h43eeafb_0 + - entrez-direct=16.2=he881be0_1 + - fastani=1.34=h4dfc31f_1 - gawk=5.1.0=h7f98852_0 - genopype=2023.5.15=py_0 - gettext=0.21.1=h27087fc_0 - gsl=2.7=he838d99_0 - icu=70.1=h27087fc_0 - idna=3.4=pyhd8ed1ab_0 + - keyutils=1.6.1=h166bdaf_0 + - krb5=1.20.1=h81ceb04_0 - ld_impl_linux-64=2.40=h41732ed_0 - libblas=3.9.0=16_linux64_openblas - libcblas=3.9.0=16_linux64_openblas + - libcurl=8.1.2=h409715c_0 + - libedit=3.1.20191231=he28a2e2_2 + - libev=4.33=h516909a_1 - libffi=3.4.2=h7f98852_5 - libgcc-ng=12.2.0=h65d4601_19 - libgfortran-ng=12.2.0=h69a702a_19 @@ -38,6 +47,7 @@ dependencies: - libiconv=1.17=h166bdaf_0 - libidn2=2.3.4=h166bdaf_0 - liblapack=3.9.0=16_linux64_openblas + - libnghttp2=1.52.0=h61bc06f_0 - libnsl=2.0.0=h7f98852_0 - libopenblas=0.3.21=pthreads_h78a6416_3 - libsqlite=3.40.0=h753d276_0 @@ -48,13 +58,34 @@ dependencies: - libxml2=2.10.3=h7463322_0 - libzlib=1.2.13=h166bdaf_4 - mmseqs2=14.7e284=pl5321hf1761c0_0 + - ncbi-vdb=3.0.0=pl5321h87f3376_0 - ncurses=6.3=h27087fc_1 - networkx=3.0=pyhd8ed1ab_0 - numpy=1.24.1=py311h8e6699e_0 - - openssl=3.0.8=h0b41bf4_0 + - openssl=3.2.0=hd590300_1 - pandas=1.5.3=py311h2872171_0 - pathlib2=2.3.7.post1=py311h38be061_2 + - pcre=8.45=h9c3ff4c_0 - perl=5.32.1=2_h7f98852_perl5 + - perl-archive-tar=2.40=pl5321hdfd78af_0 + - perl-carp=1.38=pl5321hdfd78af_4 + - perl-common-sense=3.75=pl5321hdfd78af_0 + - perl-compress-raw-bzip2=2.201=pl5321h87f3376_1 + - perl-compress-raw-zlib=2.105=pl5321h87f3376_0 + - perl-encode=3.19=pl5321hec16e2b_1 + - perl-exporter=5.72=pl5321hdfd78af_2 + - perl-exporter-tiny=1.002002=pl5321hdfd78af_0 + - perl-extutils-makemaker=7.70=pl5321hd8ed1ab_0 + - perl-io-compress=2.201=pl5321hdbdd923_2 + - perl-io-zlib=1.14=pl5321hdfd78af_0 + - perl-json=4.10=pl5321hdfd78af_0 + - perl-json-xs=2.34=pl5321h4ac6f70_6 + - perl-list-moreutils=0.430=pl5321hdfd78af_0 + - perl-list-moreutils-xs=0.430=pl5321h031d066_2 + - perl-parent=0.236=pl5321hdfd78af_2 + - perl-pathtools=3.75=pl5321hec16e2b_3 + - perl-scalar-list-utils=1.62=pl5321hec16e2b_1 + - perl-types-serialiser=1.01=pl5321hdfd78af_0 - pip=23.0=pyhd8ed1ab_0 - pycparser=2.21=pyhd8ed1ab_0 - pyopenssl=23.0.0=pyhd8ed1ab_0 @@ -71,6 +102,7 @@ dependencies: - seqkit=2.3.1=h9ee0642_0 - setuptools=66.1.1=pyhd8ed1ab_0 - six=1.16.0=pyh6c4a22f_0 + - skani=0.2.1=h4ac6f70_0 - soothsayer_utils=2022.6.24=py_0 - tk=8.6.12=h27826a3_0 - tqdm=4.64.1=pyhd8ed1ab_0 @@ -80,4 +112,5 @@ dependencies: - wget=1.20.3=ha35d2d1_1 - wheel=0.38.4=pyhd8ed1ab_0 - xz=5.2.6=h166bdaf_0 - - zlib=1.2.13=h166bdaf_4 \ No newline at end of file + - zlib=1.2.13=h166bdaf_4 + - 
zstd=1.5.5=hfc55251_0 \ No newline at end of file diff --git a/install/environments/VEBA-database_env.yml b/install/environments/VEBA-database_env.yml index 8e56e1c..f78e9c4 100644 --- a/install/environments/VEBA-database_env.yml +++ b/install/environments/VEBA-database_env.yml @@ -1,4 +1,4 @@ -name: VEBA-database_env__v2023.6.20 +name: VEBA-database_env__v2023.11.30 channels: - conda-forge - bioconda @@ -14,8 +14,8 @@ dependencies: - bz2file=0.98=py_0 - bzip2=1.0.8=h7f98852_4 - c-ares=1.19.1=hd590300_0 - - ca-certificates=2023.5.7=hbcca054_0 - - certifi=2023.5.7=pyhd8ed1ab_0 + - ca-certificates=2023.11.17=hbcca054_0 + - certifi=2023.11.17=pyhd8ed1ab_0 - charset-normalizer=3.1.0=pyhd8ed1ab_0 - colorama=0.4.6=pyhd8ed1ab_0 - coreutils=9.3=h0b41bf4_0 @@ -27,6 +27,7 @@ dependencies: - gettext=0.21.1=h27087fc_0 - icu=72.1=hcb278e6_0 - idna=3.4=pyhd8ed1ab_0 + - k8=0.2.5=hdcf5f25_4 - keyutils=1.6.1=h166bdaf_0 - krb5=1.20.1=h81ceb04_0 - ld_impl_linux-64=2.40=h41732ed_0 @@ -57,10 +58,11 @@ dependencies: - libuuid=2.38.1=h0b41bf4_0 - libxml2=2.11.4=h0d562d8_0 - libzlib=1.2.13=hd590300_5 + - minimap2=2.26=he4a0461_2 - mmseqs2=14.7e284=pl5321h6a68c12_2 - ncurses=6.4=hcb278e6_0 - numpy=1.25.0=py311h64a7726_0 - - openssl=3.1.1=hd590300_1 + - openssl=3.2.0=hd590300_1 - pandas=2.0.2=py311h320fe9a_0 - pathlib2=2.3.7.post1=py311h38be061_2 - pcre=8.45=h9c3ff4c_0 diff --git a/install/environments/VEBA-mapping_env.yml b/install/environments/VEBA-mapping_env.yml index 5af32f1..feb6918 100644 --- a/install/environments/VEBA-mapping_env.yml +++ b/install/environments/VEBA-mapping_env.yml @@ -1,106 +1,84 @@ -name: VEBA-mapping_env__v2023.7.25 +name: VEBA-mapping_env__v2023.11.17 channels: - conda-forge - bioconda - jolespin - defaults + - qiime2 dependencies: - _libgcc_mutex=0.1=conda_forge - - _openmp_mutex=4.5=1_gnu - - anndata=0.9.0=pyhd8ed1ab_0 - - bbmap=38.95=h5c4e2a8_1 - - biom-format=2.1.14=py39h72bdee0_2 - - biopython=1.79=py39h3811e60_1 - - bowtie2=2.5.1=py39h6fed5c7_2 - - brotlipy=0.7.0=py39h3811e60_1003 + - _openmp_mutex=4.5=2_gnu + - biopython=1.81=py310h2372a71_1 + - bowtie2=2.5.2=py310ha0a81b8_0 + - brotli-python=1.1.0=py310hc6cd4ac_1 - bz2file=0.98=py_0 - - bzip2=1.0.8=h7f98852_4 - - c-ares=1.18.1=h7f98852_0 + - bzip2=1.0.8=hd590300_5 + - c-ares=1.21.0=hd590300_0 - ca-certificates=2023.7.22=hbcca054_0 - - cached-property=1.5.2=hd8ed1ab_1 - - cached_property=1.5.2=pyha770c72_1 - certifi=2023.7.22=pyhd8ed1ab_0 - - cffi=1.15.0=py39h4bc2ebd_0 - - charset-normalizer=2.0.12=pyhd8ed1ab_0 - - click=8.1.3=unix_pyhd8ed1ab_2 - - colorama=0.4.4=pyh9f0ad1d_0 - - coreutils=9.3=h0b41bf4_0 - - cryptography=41.0.2=py39hd4f0224_0 + - charset-normalizer=3.3.2=pyhd8ed1ab_0 + - colorama=0.4.6=pyhd8ed1ab_0 + - coreutils=9.4=hd590300_0 - genopype=2023.5.15=py_0 - - h5py=3.7.0=nompi_py39h63b1161_100 - - hdf5=1.12.1=nompi_h4df4325_104 - - htslib=1.17=h81da01d_2 - - icu=72.1=hcb278e6_0 - - idna=3.3=pyhd8ed1ab_0 - - importlib-metadata=6.3.0=pyha770c72_0 - - importlib_metadata=6.3.0=hd8ed1ab_0 + - htslib=1.18=h81da01d_0 + - icu=73.2=h59595ed_0 + - idna=3.4=pyhd8ed1ab_0 - keyutils=1.6.1=h166bdaf_0 - - krb5=1.21.1=h659d440_0 - - ld_impl_linux-64=2.36.1=hea4e1c9_2 - - libblas=3.9.0=13_linux64_openblas - - libcblas=3.9.0=13_linux64_openblas - - libcurl=8.2.0=hca28451_0 - - libdeflate=1.18=h0b41bf4_0 + - krb5=1.21.2=h659d440_0 + - ld_impl_linux-64=2.40=h41732ed_0 + - libblas=3.9.0=19_linux64_openblas + - libcblas=3.9.0=19_linux64_openblas + - libcurl=8.4.0=hca28451_0 + - libdeflate=1.19=hd590300_0 - libedit=3.1.20191231=he28a2e2_2 
- libev=4.33=h516909a_1 - libffi=3.4.2=h7f98852_5 - - libgcc-ng=12.2.0=h65d4601_19 - - libgfortran-ng=11.2.0=h69a702a_12 - - libgfortran5=11.2.0=h5c6108e_12 - - libgomp=12.2.0=h65d4601_19 - - libhwloc=2.9.1=nocuda_h7313eea_6 + - libgcc-ng=13.2.0=h807b86a_3 + - libgfortran-ng=13.2.0=h69a702a_3 + - libgfortran5=13.2.0=ha4646dd_3 + - libgomp=13.2.0=h807b86a_3 + - libhwloc=2.9.3=default_h554bfaf_1009 - libiconv=1.17=h166bdaf_0 - - liblapack=3.9.0=13_linux64_openblas - - libnghttp2=1.52.0=h61bc06f_0 - - libnsl=2.0.0=h7f98852_0 - - libopenblas=0.3.18=pthreads_h8fe5266_0 - - libsqlite=3.42.0=h2797004_0 + - liblapack=3.9.0=19_linux64_openblas + - libnghttp2=1.58.0=h47da74e_0 + - libnsl=2.0.1=hd590300_0 + - libopenblas=0.3.24=pthreads_h413a1c8_0 + - libsqlite=3.44.0=h2797004_0 - libssh2=1.11.0=h0841786_0 - - libstdcxx-ng=12.2.0=h46fd767_19 - - libuuid=2.32.1=h7f98852_1000 - - libxml2=2.11.4=h0d562d8_0 - - libzlib=1.2.13=h166bdaf_4 - - lz4-c=1.9.3=h9c3ff4c_1 - - natsort=8.3.1=pyhd8ed1ab_0 - - ncurses=6.3=h9c3ff4c_0 - - numpy=1.24.2=py39h7360e5f_0 - - openjdk=8.0.312=h7f98852_0 - - openssl=3.1.1=hd590300_1 - - packaging=23.0=pyhd8ed1ab_0 - - pandas=1.4.1=py39hde0f152_0 - - pathlib2=2.3.7.post1=py39hf3d152e_0 - - pbzip2=1.1.13=0 - - perl=5.32.1=2_h7f98852_perl5 - - pip=22.0.3=pyhd8ed1ab_0 - - pycparser=2.21=pyhd8ed1ab_0 - - pyopenssl=23.2.0=pyhd8ed1ab_1 - - pysocks=1.7.1=py39hf3d152e_4 - - python=3.9.16=h2782a2a_0_cpython + - libstdcxx-ng=13.2.0=h7e041cc_3 + - libuuid=2.38.1=h0b41bf4_0 + - libxml2=2.11.5=h232c23b_1 + - libzlib=1.2.13=hd590300_5 + - ncurses=6.4=h59595ed_2 + - numpy=1.26.0=py310hb13e2d6_0 + - openssl=3.1.4=hd590300_0 + - pandas=2.1.3=py310hcc13569_0 + - pathlib2=2.3.7.post1=py310hff52083_3 + - perl=5.32.1=4_hd590300_perl5 + - pip=23.3.1=pyhd8ed1ab_0 + - pysocks=1.7.1=pyha2e5f31_6 + - python=3.10.13=hd12c33a_0_cpython - python-dateutil=2.8.2=pyhd8ed1ab_0 - - python-tzdata=2021.5=pyhd8ed1ab_0 - - python_abi=3.9=2_cp39 - - pytz=2021.3=pyhd8ed1ab_0 - - pytz-deprecation-shim=0.1.0.post0=py39hf3d152e_1 + - python-tzdata=2023.3=pyhd8ed1ab_0 + - python_abi=3.10=4_cp310 + - pytz=2023.3.post1=pyhd8ed1ab_0 - readline=8.2=h8228510_1 - - requests=2.27.1=pyhd8ed1ab_0 - - samtools=1.17=hd87286a_1 - - scandir=1.10.0=py39h3811e60_4 - - scipy=1.9.3=py39hddc5342_2 - - setuptools=60.9.3=py39hf3d152e_0 + - requests=2.31.0=pyhd8ed1ab_0 + - salmon=0.8.1=0 + - samtools=1.18=h50ea8bc_1 + - scandir=1.10.0=py310h2372a71_7 + - seqkit=2.6.0=h9ee0642_0 + - setuptools=68.2.2=pyhd8ed1ab_0 - six=1.16.0=pyh6c4a22f_0 - soothsayer_utils=2022.6.24=py_0 - - sqlite=3.37.0=h9cd32fc_0 - - star=2.7.10a=h9ee0642_0 - - subread=2.0.3=h7132678_1 - - tbb=2021.9.0=hf52228f_0 - - tk=8.6.12=h27826a3_0 - - tqdm=4.62.3=pyhd8ed1ab_0 - - typing_extensions=4.5.0=pyha770c72_0 - - tzdata=2021e=he74cb21_0 - - tzlocal=4.1=py39hf3d152e_1 - - urllib3=1.26.8=pyhd8ed1ab_1 - - wheel=0.37.1=pyhd8ed1ab_0 + - subread=2.0.6=he4a0461_0 + - tbb=2021.10.0=h00ab1b0_2 + - tk=8.6.13=noxft_h4845f30_101 + - tqdm=4.66.1=pyhd8ed1ab_0 + - tzdata=2023c=h71feb2d_0 + - tzlocal=5.2=py310hff52083_0 + - urllib3=2.1.0=pyhd8ed1ab_0 + - wheel=0.41.3=pyhd8ed1ab_0 - xz=5.2.6=h166bdaf_0 - - zipp=3.15.0=pyhd8ed1ab_0 - - zlib=1.2.13=h166bdaf_4 - - zstd=1.5.2=ha95c52a_0 + - zlib=1.2.13=hd590300_5 + - zstd=1.5.5=hfc55251_0 \ No newline at end of file diff --git a/install/environments/VEBA-preprocess_env.yml b/install/environments/VEBA-preprocess_env.yml index d2f59b2..a7f174b 100644 --- a/install/environments/VEBA-preprocess_env.yml +++ 
b/install/environments/VEBA-preprocess_env.yml @@ -1,4 +1,4 @@ -name: VEBA-preprocess_env__v2023.8.21 +name: VEBA-preprocess_env__v2023.12.12 channels: - conda-forge - bioconda @@ -7,46 +7,50 @@ channels: dependencies: - _libgcc_mutex=0.1=conda_forge - _openmp_mutex=4.5=2_gnu - - alsa-lib=1.2.8=h166bdaf_0 + - alsa-lib=1.2.7.2=h166bdaf_0 - argparse-manpage-birdtools=1.7.0=pyhd8ed1ab_0 - - aria2=1.36.0=h8b6cd97_3 - - arrow-cpp=10.0.1=h3e2b116_1_cpu - - aws-c-auth=0.6.21=h3cb7b9d_0 - - aws-c-cal=0.5.20=hd3b2fe5_3 - - aws-c-common=0.8.5=h166bdaf_0 - - aws-c-compression=0.2.16=hf5f93bc_0 - - aws-c-event-stream=0.2.15=h2c1f3d0_11 - - aws-c-http=0.6.27=hb11a807_3 - - aws-c-io=0.13.11=hf1b0a34_1 - - aws-c-mqtt=0.7.13=h93e60df_9 - - aws-c-s3=0.1.51=h1222a00_14 - - aws-c-sdkutils=0.1.7=hf5f93bc_0 - - aws-checksums=0.1.13=hf5f93bc_5 - - aws-crt-cpp=0.18.16=hb1454fd_1 - - aws-sdk-cpp=1.9.379=hdc6349a_5 + - aria2=1.36.0=h1e4e653_3 + - arrow-cpp=12.0.0=ha770c72_1_cpu + - aws-c-auth=0.6.26=h2c7c9e7_6 + - aws-c-cal=0.5.26=h71eb795_0 + - aws-c-common=0.8.17=hd590300_0 + - aws-c-compression=0.2.16=h4f47f36_6 + - aws-c-event-stream=0.2.20=h69ce273_6 + - aws-c-http=0.7.7=h7b8353a_3 + - aws-c-io=0.13.21=h2c99d58_4 + - aws-c-mqtt=0.8.6=h3a1964a_15 + - aws-c-s3=0.2.8=h0933b68_4 + - aws-c-sdkutils=0.1.9=h4f47f36_1 + - aws-checksums=0.1.14=h4f47f36_6 + - aws-crt-cpp=0.19.9=h85076f6_5 + - aws-sdk-cpp=1.10.57=hf40e4db_10 - awscli=1.27.23=py39hf3d152e_0 - bbmap=39.01=h5c4e2a8_0 + - binutils_impl_linux-64=2.39=he00db2b_1 - bird_tool_utils_python=0.4.1=pyhdfd78af_0 - botocore=1.29.23=pyhd8ed1ab_0 - bowtie2=2.5.1=py39h3321a2d_0 - brotlipy=0.7.0=py39hb9d737c_1005 - bz2file=0.98=py_0 - bzip2=1.0.8=h7f98852_4 - - c-ares=1.18.1=h7f98852_0 - - ca-certificates=2023.7.22=hbcca054_0 + - c-ares=1.22.1=hd590300_0 + - ca-certificates=2023.11.17=hbcca054_0 - cairo=1.16.0=ha61ee94_1014 - - certifi=2023.7.22=pyhd8ed1ab_0 + - certifi=2023.11.17=pyhd8ed1ab_0 - cffi=1.15.1=py39he91dace_2 - charset-normalizer=2.1.1=pyhd8ed1ab_0 + - chopper=0.7.0=hdcf5f25_0 + - clang=15.0.3=ha770c72_0 + - clang-15=15.0.3=default_h2e3cab8_0 - colorama=0.4.4=pyh9f0ad1d_0 - coreutils=9.3=h0b41bf4_0 - - cryptography=38.0.4=py39hd97740a_0 - - curl=7.86.0=h7bff187_1 + - cryptography=41.0.7=py39hd4f0224_0 + - curl=8.4.0=hca28451_0 - docutils=0.16=py39hf3d152e_3 - expat=2.5.0=h27087fc_0 - extern=0.4.1=py_0 - fastp=0.23.4=h5f740d0_0 - - fastq_preprocessor=2023.7.24=py_0 + - fastq_preprocessor=2023.12.12=py_0 - font-ttf-dejavu-sans-mono=2.37=hab24e00_0 - font-ttf-inconsolata=3.000=h77eed37_0 - font-ttf-source-code-pro=2.038=h77eed37_0 @@ -55,6 +59,7 @@ dependencies: - fonts-conda-ecosystem=1=0 - fonts-conda-forge=1=0 - freetype=2.12.1=hca18f0e_1 + - gcc_impl_linux-64=12.2.0=hcc96c02_19 - genopype=2023.5.15=py_0 - gettext=0.21.1=h27087fc_0 - gflags=2.2.2=he1b5a44_1004 @@ -62,54 +67,62 @@ dependencies: - glog=0.6.0=h6f12383_0 - graphite2=1.3.13=h58526e2_1001 - harfbuzz=5.3.0=h418a68e_0 - - hdf5=1.12.1=nompi_h2386368_104 - - htslib=1.16=h6bc39ce_0 + - hdf5=1.14.2=nompi_h4f84152_100 + - htslib=1.18=h81da01d_0 - icu=70.1=h27087fc_0 - idna=3.4=pyhd8ed1ab_0 - isa-l=2.30.0=ha770c72_4 - jmespath=1.0.1=pyhd8ed1ab_0 - jpeg=9e=h166bdaf_2 + - k8=0.2.5=hdcf5f25_4 + - kernel-headers_linux-64=2.6.32=he073ed8_16 - keyutils=1.6.1=h166bdaf_0 - kingfisher=0.1.0=pyh7cba7a3_1 - - krb5=1.19.3=h3790be6_0 - - lcms2=2.14=h6ed2654_0 + - krb5=1.21.2=h659d440_0 + - lcms2=2.12=hddcbb42_0 - ld_impl_linux-64=2.39=hcc3a1bd_1 - lerc=4.0.0=h27087fc_0 - - libabseil=20220623.0=cxx17_h48a1fff_5 - - 
libarrow=10.0.1=hcf5dfb8_1_cpu + - libabseil=20230125.0=cxx17_hcb278e6_1 + - libaec=1.1.2=h59595ed_1 + - libarrow=12.0.0=h1cdf7b0_1_cpu - libblas=3.9.0=16_linux64_openblas - libbrotlicommon=1.0.9=h166bdaf_8 - libbrotlidec=1.0.9=h166bdaf_8 - libbrotlienc=1.0.9=h166bdaf_8 - libcblas=3.9.0=16_linux64_openblas + - libclang-cpp15=15.0.3=default_h2e3cab8_0 - libcrc32c=1.1.2=h9c3ff4c_0 - - libcups=2.3.3=h3e49a29_2 - - libcurl=7.86.0=h7bff187_1 - - libdeflate=1.13=h166bdaf_0 + - libcups=2.3.3=h4637d8d_4 + - libcurl=8.4.0=hca28451_0 + - libdeflate=1.19=hd590300_0 - libedit=3.1.20191231=he28a2e2_2 - libev=4.33=h516909a_1 - - libevent=2.1.10=h9b69904_4 + - libevent=2.1.12=hf998b51_1 - libffi=3.4.2=h7f98852_5 + - libgcc-devel_linux-64=12.2.0=h3b97bd3_19 - libgcc-ng=12.2.0=h65d4601_19 - - libgfortran-ng=12.2.0=h69a702a_19 - - libgfortran5=12.2.0=h337968e_19 + - libgfortran-ng=13.2.0=h69a702a_0 + - libgfortran5=13.2.0=ha4646dd_0 - libglib=2.74.1=h606061b_1 - libgomp=12.2.0=h65d4601_19 - - libgoogle-cloud=2.5.0=hcb5eced_0 - - libgrpc=1.49.1=h05bd8bd_1 + - libgoogle-cloud=2.10.0=hac9eb74_0 + - libgrpc=1.54.2=hcf146ea_0 - libhwloc=2.8.0=h32351e8_1 - libiconv=1.17=h166bdaf_0 - liblapack=3.9.0=16_linux64_openblas - - libnghttp2=1.47.0=hdcd2b5c_1 + - libllvm15=15.0.3=h503ea73_0 + - libnghttp2=1.58.0=h47da74e_0 - libnsl=2.0.0=h7f98852_0 + - libnuma=2.0.16=h0b41bf4_1 - libopenblas=0.3.21=pthreads_h78a6416_3 - libpng=1.6.39=h753d276_0 - - libprotobuf=3.21.10=h6239696_0 + - libprotobuf=3.21.12=hfc55251_2 + - libsanitizer=12.2.0=h46fd767_19 - libsqlite=3.40.0=h753d276_0 - - libssh2=1.10.0=haa6b8db_3 + - libssh2=1.11.0=h0841786_0 - libstdcxx-ng=12.2.0=h46fd767_19 - - libthrift=0.16.0=h491838f_2 - - libtiff=4.4.0=h0e0dad5_3 + - libthrift=0.18.1=h8fd135c_2 + - libtiff=4.2.0=hf544144_3 - libutf8proc=2.8.0=h166bdaf_0 - libuuid=2.32.1=h7f98852_1000 - libwebp-base=1.2.4=h166bdaf_0 @@ -117,12 +130,14 @@ dependencies: - libxml2=2.9.14=h22db469_4 - libzlib=1.2.13=h166bdaf_4 - lz4-c=1.9.3=h9c3ff4c_1 + - minimap2=2.26=he4a0461_2 - ncbi-ngs-sdk=2.9.0=0 + - ncbi-vdb=3.0.9=hdbdd923_0 - ncurses=6.3=h27087fc_1 - numpy=1.23.5=py39h3d75532_0 - - openjdk=17.0.3=hafdced1_4 - - openssl=1.1.1u=hd590300_0 - - orc=1.8.0=h09e0d61_0 + - openjdk=17.0.3=hea3dc9f_3 + - openssl=3.2.0=hd590300_1 + - orc=1.8.3=h2f23424_1 - ossuuid=1.6.2=hf484d3e_1000 - pandas=1.5.2=py39h4661b88_0 - parquet-cpp=1.5.1=2 @@ -164,38 +179,41 @@ dependencies: - pip=22.3.1=pyhd8ed1ab_0 - pixman=0.40.0=h36c2ea0_0 - pthread-stubs=0.4=h36c2ea0_1001 - - pyarrow=10.0.1=py39h33d4778_1_cpu + - pyarrow=12.0.0=py39he4327e9_1_cpu - pyasn1=0.4.8=py_0 - pycparser=2.21=pyhd8ed1ab_0 - - pyopenssl=22.1.0=pyhd8ed1ab_0 + - pyopenssl=23.3.0=pyhd8ed1ab_0 - pysocks=1.7.1=pyha2e5f31_6 - - python=3.9.15=h47a2c10_0_cpython + - python=3.9.16=h2782a2a_0_cpython - python-dateutil=2.8.2=pyhd8ed1ab_0 - python-tzdata=2022.7=pyhd8ed1ab_0 - python_abi=3.9=3_cp39 - pytz=2022.6=pyhd8ed1ab_0 - pytz-deprecation-shim=0.1.0.post0=py39hf3d152e_3 - pyyaml=5.4.1=py39hb9d737c_4 - - re2=2022.06.01=h27087fc_1 + - rdma-core=28.9=h59595ed_1 + - re2=2023.02.02=hcb278e6_0 - readline=8.1.2=h0f457ee_0 - requests=2.28.1=pyhd8ed1ab_1 - rsa=4.7.2=pyh44b312d_0 - - s2n=1.3.28=h8d01263_0 + - s2n=1.3.44=h06160fa_0 - s3transfer=0.6.0=pyhd8ed1ab_0 - samtools=1.16.1=h6899075_1 - scandir=1.10.0=py39hb9d737c_6 - seqkit=2.3.1=h9ee0642_0 - setuptools=65.5.1=pyhd8ed1ab_0 - six=1.16.0=pyh6c4a22f_0 - - snappy=1.1.9=hbd366e4_2 + - snappy=1.1.10=h9fff704_0 - soothsayer_utils=2022.6.24=py_0 - - sra-tools=3.0.0=pl5321hd0d85c6_1 + - 
sra-tools=3.0.9=h9f5acd7_0 + - sracat=0.2=h9f5acd7_1 + - sysroot_linux-64=2.12=he073ed8_16 - tbb=2021.7.0=h924138e_1 - tk=8.6.12=h27826a3_0 - tqdm=4.64.1=pyhd8ed1ab_0 - tzdata=2022g=h191b570_0 - tzlocal=4.2=py39hf3d152e_2 + - ucx=1.14.1=h64cca9d_5 - urllib3=1.26.13=pyhd8ed1ab_0 - wheel=0.38.4=pyhd8ed1ab_0 - xorg-fixesproto=5.0=h7f98852_1002 @@ -218,4 +236,4 @@ dependencies: - xz=5.2.6=h166bdaf_0 - yaml=0.2.5=h7f98852_2 - zlib=1.2.13=h166bdaf_4 - - zstd=1.5.2=h6239696_4 \ No newline at end of file + - zstd=1.5.5=hfc55251_0 \ No newline at end of file diff --git a/install/environments/VEBA-profile_env.yml b/install/environments/VEBA-profile_env.yml index bccbda2..f6f3fab 100644 --- a/install/environments/VEBA-profile_env.yml +++ b/install/environments/VEBA-profile_env.yml @@ -1,4 +1,4 @@ -name: VEBA-profile_env__v2023.10.16 +name: VEBA-profile_env__v2023.12.14 channels: - conda-forge - bioconda @@ -21,12 +21,12 @@ dependencies: - bz2file=0.98=py_0 - bzip2=1.0.8=h7f98852_4 - c-ares=1.20.1=hd590300_0 - - ca-certificates=2023.7.22=hbcca054_0 + - ca-certificates=2023.11.17=hbcca054_0 - cached-property=1.5.2=hd8ed1ab_1 - cached_property=1.5.2=pyha770c72_1 - cairo=1.16.0=hb05425b_5 - capnproto=0.9.1=ha19adfc_4 - - certifi=2023.7.22=pyhd8ed1ab_0 + - certifi=2023.11.17=pyhd8ed1ab_0 - charset-normalizer=3.3.0=pyhd8ed1ab_0 - click=8.1.7=unix_pyh707e725_0 - cmseq=1.0.4=pyhb7b1952_0 @@ -119,7 +119,7 @@ dependencies: - numpy=1.26.0=py310hb13e2d6_0 - openjdk=17.0.3=h4335b31_6 - openjpeg=2.5.0=h488ebb8_3 - - openssl=3.1.3=hd590300_0 + - openssl=3.2.0=hd590300_1 - ossuuid=1.6.2=hf484d3e_1000 - packaging=23.2=pyhd8ed1ab_0 - pandas=2.1.1=py310hcc13569_1 @@ -198,6 +198,7 @@ dependencies: - six=1.16.0=pyh6c4a22f_0 - soothsayer_utils=2022.6.24=py_0 - statsmodels=0.14.0=py310h1f7b6fc_2 + - sylph=0.4.1=h4ac6f70_0 - tbb=2021.7.0=h924138e_1 - tk=8.6.13=h2797004_0 - tqdm=4.66.1=pyhd8ed1ab_0 diff --git a/install/install_veba.sh b/install/install.sh similarity index 57% rename from install/install_veba.sh rename to install/install.sh index 8c8fa6d..ac81638 100644 --- a/install/install_veba.sh +++ b/install/install.sh @@ -1,12 +1,14 @@ #!/bin/bash -# __version__ = "2023.3.27" +# __version__ = "2023.12.19" SCRIPT_PATH=$(realpath $0) PREFIX=$(echo $SCRIPT_PATH | python -c "import sys; print('/'.join(sys.stdin.read().split('/')[:-1]))") -CONDA_BASE=$(conda run -n base bash -c "echo \${CONDA_PREFIX}") +# CONDA_BASE=$(conda run -n base bash -c "echo \${CONDA_PREFIX}") +CONDA_BASE=$(conda info --base) # Update permissions echo "Updating permissions for scripts in ${PREFIX}/../src" +chmod 755 ${PREFIX}/../src/veba chmod 755 ${PREFIX}/../src/*.py chmod 755 ${PREFIX}/../src/scripts/* @@ -15,12 +17,34 @@ conda install -c conda-forge mamba -y # conda update mamba -y # Recommended # Environments +# Main environment +echo "Creating VEBA main environment" + +ENV_NAME="VEBA" +mamba create -y -n $ENV_NAME -c conda-forge -c bioconda -c jolespin seqkit genopype networkx biopython biom-format anndata || (echo "Error when creating main VEBA environment" ; exit 1) &> ${PREFIX}/environments/VEBA.log + +# Copy main executable +echo -e "\t*Copying main VEBA executable into ${ENV_NAME} environment path" +cp -r ${PREFIX}/../src/veba ${CONDA_BASE}/envs/${ENV_NAME}/bin/ +# Copy over files to environment bin/ +echo -e "\t*Copying VEBA modules into ${ENV_NAME} environment path" +cp -r ${PREFIX}/../src/*.py ${CONDA_BASE}/envs/${ENV_NAME}/bin/ +echo -e "\t*Copying VEBA utility scripts into ${ENV_NAME} environment path" +cp -r 
${PREFIX}/../src/scripts/ ${CONDA_BASE}/envs/${ENV_NAME}/bin/ +# Symlink the utility scripts to bin/ +echo -e "\t*Symlinking VEBA utility scripts into ${ENV_NAME} environment path" +ln -sf ${CONDA_BASE}/envs/${ENV_NAME}/bin/scripts/* ${CONDA_BASE}/envs/${ENV_NAME}/bin/ + +# Version +cp -rf ${PREFIX}/../VERSION ${CONDA_BASE}/envs/${ENV_NAME}/bin/VEBA_VERSION + +# Module environments for ENV_YAML in ${PREFIX}/environments/VEBA*.yml; do # Get environment name ENV_NAME=$(basename $ENV_YAML .yml) # Create conda environment - echo "Creating ${ENV_NAME} environment" + echo "Creating ${ENV_NAME} module environment" mamba env create -n $ENV_NAME -f $ENV_YAML || (echo "Error when creating VEBA environment: ${ENV_YAML}" ; exit 1) &> ${ENV_YAML}.log # Copy over files to environment bin/ @@ -32,6 +56,9 @@ for ENV_YAML in ${PREFIX}/environments/VEBA*.yml; do echo -e "\t*Symlinking VEBA utility scripts into ${ENV_NAME} environment path" ln -sf ${CONDA_BASE}/envs/${ENV_NAME}/bin/scripts/* ${CONDA_BASE}/envs/${ENV_NAME}/bin/ + # Version + cp -rf ${PREFIX}/../VERSION ${CONDA_BASE}/envs/${ENV_NAME}/bin/VEBA_VERSION + done echo -e " _ _ _______ ______ _______\n \ / |______ |_____] |_____|\n \/ |______ |_____] | |" diff --git a/install/uninstall_veba.sh b/install/uninstall.sh similarity index 100% rename from install/uninstall_veba.sh rename to install/uninstall.sh diff --git a/install/update_environment_scripts.sh b/install/update_environment_scripts.sh index 2c98bc7..59a98b0 100644 --- a/install/update_environment_scripts.sh +++ b/install/update_environment_scripts.sh @@ -1,5 +1,5 @@ #!/usr/bin/env bash -# __version__ = "2023.01.05" +# __version__ = "2023.12.18" # Usage: git clone https://github.com/jolespin/veba && update_environment_scripts.sh /path/to/veba_repository echo "-----------------------------------------------------------------------------------------------------" @@ -17,13 +17,14 @@ if [ $# -eq 0 ]; then chmod 775 ${VEBA_REPOSITORY_DIRECTORY}/src/* chmod 775 ${VEBA_REPOSITORY_DIRECTORY}/src/scripts/* - else + else VEBA_REPOSITORY_DIRECTORY=$1 fi -CONDA_BASE=$(conda run -n base bash -c "echo \${CONDA_PREFIX}") +# CONDA_BASE=$(conda run -n base bash -c "echo \${CONDA_PREFIX}") +CONDA_BASE=$(conda info --base) echo "-----------------------------------------------------------------------------------------------------" echo " * Source VEBA: ${VEBA_REPOSITORY_DIRECTORY}" @@ -31,9 +32,10 @@ echo " * Destination VEBA environments CONDA_BASE: ${CONDA_BASE}" echo "-----------------------------------------------------------------------------------------------------" # Environments -for ENV_PREFIX in ${CONDA_BASE}/envs/VEBA-*; do +for ENV_PREFIX in ${CONDA_BASE}/envs/VEBA ${CONDA_BASE}/envs/VEBA-*; +do echo $ENV_PREFIX cp ${VEBA_REPOSITORY_DIRECTORY}/src/*.py ${ENV_PREFIX}/bin/ cp -r ${VEBA_REPOSITORY_DIRECTORY}/src/scripts/ ${ENV_PREFIX}/bin/ ln -sf ${ENV_PREFIX}/bin/scripts/* ${ENV_PREFIX}/bin/ - done +done diff --git a/src/MODULE_RESOURCES b/src/MODULE_RESOURCES deleted file mode 100644 index a30553e..0000000 --- a/src/MODULE_RESOURCES +++ /dev/null @@ -1,18 +0,0 @@ -Status Environment Module Resources Recommended Threads Description -Stable VEBA-preprocess_env preprocess.py 4GB-16GB 4 Fastq quality trimming, adapter removal, decontamination, and read statistics calculations -Stable VEBA-assembly_env assembly.py 32GB-128GB+ 16 Assemble reads, align reads to assembly, and count mapped reads -Stable VEBA-assembly_env coverage.py 24GB 16 Align reads to (concatenated) reference and counts mapped reads 
-Stable VEBA-binning-prokaryotic_env binning-prokaryotic.py 16GB 4 Iterative consensus binning for recovering prokaryotic genomes with lineage-specific quality assessment -Stable VEBA-binning-eukaryotic_env binning-eukaryotic.py 128GB 4 Binning for recovering eukaryotic genomes with exon-aware gene modeling and lineage-specific quality assessment -Stable VEBA-binning-viral_env binning-viral.py 16GB 4 Detection of viral genomes and quality assessment -Stable VEBA-classify_env classify-prokaryotic.py 64GB 32 Taxonomic classification of prokaryotic genomes  -Stable VEBA-classify_env classify-eukaryotic.py 32GB 1 Taxonomic classification of eukaryotic genomes -Stable VEBA-classify_env classify-viral.py 16GB 4 Taxonomic classification of viral genomes -Stable VEBA-cluster_env cluster.py 32GB+ 32 Species-level clustering of genomes and lineage-specific orthogroup detection -Stable VEBA-annotate_env annotate.py 64GB 32 Annotates translated gene calls against NR, Pfam, and KOFAM -Stable VEBA-phylogeny_env phylogeny.py 16GB+ 32 Constructs phylogenetic trees given a marker set -Stable VEBA-mapping_env index.py 16GB 4 Builds local or global index for alignment to genomes -Stable VEBA-mapping_env mapping.py 16GB 4 Aligns reads to local or global index of genomes -Stable VEBA-biosynthetic_env biosynthetic.py 16GB 16 Identify biosynthetic gene clusters in prokaryotes and fungi -Developmental VEBA-assembly_env assembly-sequential.py 32GB-128GB+ 16 Assemble metagenomes sequentially -Developmental VEBA-amplicon_env amplicon.py 96GB 16 Automated read trim position detection, DADA2 ASV detection, taxonomic classification, and file conversion \ No newline at end of file diff --git a/src/README.md b/src/README.md index 574149f..790091e 100755 --- a/src/README.md +++ b/src/README.md @@ -3,25 +3,29 @@ # Modules [![Schematic](../images/Schematic.png)](../images/Schematic.pdf) -| Status | Environment | Module | Resources | Recommended Threads | Description | -|---------------|------------------------------|-------------------------|-------------|---------------------|-----------------------------------------------------------------------------------------------------------------| -| Stable | [VEBA-preprocess_env](https://github.com/jolespin/veba/blob/main/install/environments/VEBA-preprocess_env.yml) | [preprocess.py](https://github.com/jolespin/veba/tree/main/src#preprocesspy) | 4GB-16GB | 4 | Fastq quality trimming, adapter removal, decontamination, and read statistics calculations | -| Stable | [VEBA-assembly_env](https://github.com/jolespin/veba/blob/main/install/environments/VEBA-assembly_env.yml) | [assembly.py](https://github.com/jolespin/veba/tree/main/src#assemblypy) | 32GB-128GB+ | 4-16 | Assemble reads, align reads to assembly, and count mapped reads | -| Stable | [VEBA-assembly_env](https://github.com/jolespin/veba/blob/main/install/environments/VEBA-assembly_env.yml) | [coverage.py](https://github.com/jolespin/veba/tree/main/src#coveragepy) | 24GB | 16 | Align reads to (concatenated) reference and counts mapped reads | -| Stable | [VEBA-binning-prokaryotic_env](https://github.com/jolespin/veba/blob/main/install/environments/VEBA-binning-prokaryotic_env.yml) | [binning-prokaryotic.py](https://github.com/jolespin/veba/tree/main/src#binning-prokaryoticpy) | 16GB | 4 | Iterative consensus binning for recovering prokaryotic genomes with lineage-specific quality assessment | -| Stable | 
[VEBA-binning-eukaryotic_env](https://github.com/jolespin/veba/blob/main/install/environments/VEBA-binning-eukaryotic_env.yml) | [binning-eukaryotic.py](https://github.com/jolespin/veba/tree/main/src#binning-eukaryoticpy) | 128GB | 4 | Binning for recovering eukaryotic genomes with exon-aware gene modeling and lineage-specific quality assessment | -| Stable | [VEBA-binning-viral_env](https://github.com/jolespin/veba/blob/main/install/environments/VEBA-binning-viral_env.yml) | [binning-viral.py](https://github.com/jolespin/veba/tree/main/src#binning-viralpy) | 16GB | 4 | Detection of viral genomes and quality assessment | -| Stable | [VEBA-classify_env](https://github.com/jolespin/veba/blob/main/install/environments/VEBA-classify_env.yml) | [classify-prokaryotic.py](https://github.com/jolespin/veba/tree/main/src#classify-prokaryoticpy) | 72GB | 32 | Taxonomic classification of prokaryotic genomes | -| Stable | [VEBA-classify_env](https://github.com/jolespin/veba/blob/main/install/environments/VEBA-classify_env.yml) | [classify-eukaryotic.py](https://github.com/jolespin/veba/tree/main/src#classify-eukaryoticpy) | 32GB | 1 | Taxonomic classification of eukaryotic genomes | -| Stable | [VEBA-classify_env](https://github.com/jolespin/veba/blob/main/install/environments/VEBA-classify_env.yml) | [classify-viral.py](https://github.com/jolespin/veba/tree/main/src#classify-viralpy) | 16GB | 4 | Taxonomic classification of viral genomes | -| Stable | [VEBA-cluster_env](https://github.com/jolespin/veba/blob/main/install/environments/[VEBA-cluster_env.yml) | [cluster.py](https://github.com/jolespin/veba/tree/main/src#clusterpy) | 32GB+ | 32 | Species-level clustering of genomes and lineage-specific orthogroup detection | -| Stable | [VEBA-annotate_env](https://github.com/jolespin/veba/blob/main/install/environments/VEBA-annotate_env.yml) | [annotate.py](https://github.com/jolespin/veba/tree/main/src#annotatepy) | 64GB | 32 | Annotates translated gene calls against UniRef, MiBIG, VFDB, Pfam, AntiFam, and KOFAM | -| Stable | [VEBA-phylogeny_env](https://github.com/jolespin/veba/blob/main/install/environments/VEBA-phylogeny_env.yml) | [phylogeny.py](https://github.com/jolespin/veba/tree/main/src#phylogenypy) | 16GB+ | 32 | Constructs phylogenetic trees given a marker set | -| Stable | [VEBA-mapping_env](https://github.com/jolespin/veba/blob/main/install/environments/VEBA-mapping_env.yml) | [index.py](https://github.com/jolespin/veba/tree/main/src#indexpy) | 16GB | 4 | Builds local or global index for alignment to genomes | -| Stable | [VEBA-mapping_env](https://github.com/jolespin/veba/blob/main/install/environments/VEBA-mapping_env.yml) | [mapping.py](https://github.com/jolespin/veba/tree/main/src#mappingpy) | 16GB | 4 | Aligns reads to local or global index of genomes | -| Stable | [VEBA-biosynthetic_env](https://github.com/jolespin/veba/blob/main/install/environments/VEBA-biosynthetic_env.yml) | [biosynthetic.py](https://github.com/jolespin/veba/tree/main/src#biosyntheticpy) | 16GB | 16 | Identify biosynthetic gene clusters in prokaryotes and fungi | -| Developmental | [VEBA-assembly_env](https://github.com/jolespin/veba/blob/main/install/environments/VEBA-assembly_env.yml) | [assembly-sequential.py](https://github.com/jolespin/veba/tree/main/src#assembly-sequentialpy) | 32GB-128GB+ | 16 | Assemble metagenomes sequentially | -| Developmental | [VEBA-amplicon_env](https://github.com/jolespin/veba/blob/main/install/environments/devel/VEBA-amplicon_env.yml) | 
[amplicon.py](https://github.com/jolespin/veba/tree/main/src#ampliconpy) | 96GB | 16 | Automated read trim position detection, DADA2 ASV detection, taxonomic classification, and file conversion | +| Status | Module | Environment | Executable | Resources | Recommended Threads | Description | +|---------------|----------------------|------------------------------|-------------------------|-------------|---------------------|-------------------------------------------------------------------------------------------------------------------| +| Stable | preprocess | VEBA-preprocess_env | preprocess.py | 4GB-16GB | 4 | Fastq quality trimming, adapter removal, decontamination, and read statistics calculations (Short Reads) | +| Stable | preprocess-long | VEBA-preprocess_env | preprocess-long.py | 4GB-16GB | 4 | Fastq quality trimming, adapter removal, decontamination, and read statistics calculations (Long Reads) | +| Stable | assembly | VEBA-assembly_env | assembly.py | 32GB-128GB+ | 16 | Assemble short reads, align reads to assembly, and count mapped reads | +| Stable | assembly-long | VEBA-assembly_env | assembly-long.py | 32GB-128GB+ | 16 | Assemble long reads, align reads to assembly, and count mapped reads | +| Stable | coverage | VEBA-assembly_env | coverage.py | 24GB | 16 | Align short reads to (concatenated) reference and counts mapped reads | +| Stable | coverage-long | VEBA-assembly_env | coverage-long.py | 24GB | 16 | Align long reads to (concatenated) reference and counts mapped reads | +| Stable | binning-prokaryotic | VEBA-binning-prokaryotic_env | binning-prokaryotic.py | 16GB | 4 | Iterative consensus binning for recovering prokaryotic genomes with lineage-specific quality assessment | +| Stable | binning-eukaryotic | VEBA-binning-eukaryotic_env | binning-eukaryotic.py | 128GB | 4 | Binning for recovering eukaryotic genomes with exon-aware gene modeling and lineage-specific quality assessment | +| Stable | binning-viral | VEBA-binning-viral_env | binning-viral.py | 16GB | 4 | Detection of viral genomes and quality assessment | +| Stable | classify-prokaryotic | VEBA-classify_env | classify-prokaryotic.py | 64GB | 32 | Taxonomic classification of prokaryotic genomes | +| Stable | classify-eukaryotic | VEBA-classify_env | classify-eukaryotic.py | 32GB | 1 | Taxonomic classification of eukaryotic genomes | +| Stable | classify-viral | VEBA-classify_env | classify-viral.py | 16GB | 4 | Taxonomic classification of viral genomes | +| Stable | cluster | VEBA-cluster_env | cluster.py | 32GB+ | 32 | Species-level clustering of genomes and lineage-specific orthogroup detection | +| Stable | annotate | VEBA-annotate_env | annotate.py | 64GB | 32 | Annotates translated gene calls against NR, Pfam, and KOFAM | +| Stable | phylogeny | VEBA-phylogeny_env | phylogeny.py | 16GB+ | 32 | Constructs phylogenetic trees given a marker set | +| Stable | index | VEBA-mapping_env | index.py | 16GB | 4 | Builds local or global index for alignment to genomes | +| Stable | mapping | VEBA-mapping_env | mapping.py | 16GB | 4 | Aligns reads to local or global index of genomes | +| Stable | biosynthetic | VEBA-biosynthetic_env | biosynthetic.py | 16GB | 16 | Identify biosynthetic gene clusters in prokaryotes and fungi | +| Stable | profile-pathway | VEBA-profile_env | profile-pathway.py | 16GB | 4 | Pathway profiling of de novo genomes | +| Deprecated | assembly-sequential | VEBA-assembly_env | assembly-sequential.py | 32GB-128GB+ | 16 | Assemble metagenomes sequentially | +| Developmental | amplicon | 
VEBA-amplicon_env | amplicon.py | 96GB | 16 | Automated read trim position detection, DADA2 ASV detection, taxonomic classification, and file conversion |
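The table above pairs every module with its Conda environment, executable, and resource footprint. To make the new long-read path concrete, below is a minimal invocation sketch for `assembly-long.py` built only from the flags defined in its argument parser further down in this diff; the read file, sample name, and thread count are hypothetical placeholders, not values from the repository.

    # Hedged sketch: assemble Nanopore reads with metaFlye via the long-read assembly module.
    # Assumes the module was installed into VEBA-assembly_env; all paths are placeholders.
    conda activate VEBA-assembly_env
    assembly-long.py \
        -i sample_1.nanopore.fq.gz \
        -n sample_1 \
        -o veba_output/assembly \
        -P metaflye \
        -t nano-hq \
        -m 1000 \
        -p 16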


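Several modules below shell out to the same minimap2/samtools chain; `get_alignment_cmd` in the new `src/assembly-long.py` further down is the clearest instance. As a reading aid, this is roughly the standalone equivalent of that step, assuming a finished `assembly.fasta` and long reads in `reads.fq.gz` (both placeholders):

    # Rough standalone equivalent of get_alignment_cmd in src/assembly-long.py (paths are placeholders)
    THREADS=16
    minimap2 -t ${THREADS} -d assembly.fasta.mmi assembly.fasta   # build the minimap2 index
    minimap2 -a -t ${THREADS} -x map-ont assembly.fasta.mmi reads.fq.gz \
        | samtools view -b -h -F 4 \
        | samtools sort --threads ${THREADS} --reference assembly.fasta -T tmp/samtools_sort \
        > mapped.sorted.bam   # drop unmapped reads (-F 4), then coordinate-sort
    samtools index -@ ${THREADS} mapped.sorted.bam   # create the .bai index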
diff --git a/src/amplicon.py b/src/amplicon.py index c673ee7..f66abf9 100755 --- a/src/amplicon.py +++ b/src/amplicon.py @@ -14,7 +14,7 @@ pd.options.display.max_colwidth = 100 # from tqdm import tqdm __program__ = os.path.split(sys.argv[0])[-1] -__version__ = "2023.10.16" +__version__ = "2023.11.30" # Reads archive def get_reads_cmd( input_filepaths, output_filepaths, output_directory, directories, opts): @@ -626,6 +626,7 @@ def main(args=None): print("Script version:", __version__, file=sys.stdout) print("Moment:", get_timestamp(), file=sys.stdout) print("Directory:", os.getcwd(), file=sys.stdout) + if "TMPDIR" in os.environ: print(os.environ["TMPDIR"], file=sys.stdout) print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout) configure_parameters(opts, directories) sys.stdout.flush() diff --git a/src/annotate.py b/src/annotate.py index c050e86..eda3cf5 100755 --- a/src/annotate.py +++ b/src/annotate.py @@ -15,7 +15,7 @@ pd.options.display.max_colwidth = 100 # from tqdm import tqdm __program__ = os.path.split(sys.argv[0])[-1] -__version__ = "2023.10.25" +__version__ = "2023.11.30" def get_preprocess_cmd( input_filepaths, output_filepaths, output_directory, directories, opts, program): cmd = [ @@ -880,6 +880,7 @@ def main(args=None): print("VEBA Database:", opts.veba_database, file=sys.stdout) print("Moment:", get_timestamp(), file=sys.stdout) print("Directory:", os.getcwd(), file=sys.stdout) + if "TMPDIR" in os.environ: print(os.environ["TMPDIR"], file=sys.stdout) print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout) configure_parameters(opts, directories) diff --git a/src/assembly-long.py b/src/assembly-long.py new file mode 100755 index 0000000..0c35cc2 --- /dev/null +++ b/src/assembly-long.py @@ -0,0 +1,627 @@ +#!/usr/bin/env python +from __future__ import print_function, division +import sys, os, argparse, glob +from collections import OrderedDict, defaultdict + +import pandas as pd + +# Soothsayer Ecosystem +from genopype import * +from genopype import __version__ as genopype_version +from soothsayer_utils import * + +pd.options.display.max_colwidth = 100 +# from tqdm import tqdm +__program__ = os.path.split(sys.argv[0])[-1] +__version__ = "2023.12.14" + +# Assembly +def get_assembly_cmd( input_filepaths, output_filepaths, output_directory, directories, opts): + # Command + cmd = [ + os.environ["flye"], + "--{} {}".format(opts.reads_type, input_filepaths[0]), + "-g {}".format(opts.estimated_assembly_size) if opts.estimated_assembly_size else "", + "-o {}".format(output_directory), + "-t {}".format(opts.n_jobs), + "--deterministic" if not opts.no_deterministic else "", + "--meta" if opts.program == "metaflye" else "", + opts.assembler_options, + + # Get failed length cutoff fasta + "&&", + + "mv", + os.path.join(output_directory, "assembly.fasta"), + os.path.join(output_directory, "assembly_original.fasta"), + + "&&", + + "cat", + os.path.join(output_directory, "assembly_original.fasta"), + "|", + os.environ["seqkit"], + "seq", + "-M {}".format(max(opts.minimum_contig_length - 1, 1)), + "|", + "gzip", + ">", + os.path.join(output_directory, "assembly_failed_length_cutoff.fasta.gz"), + + # Filter out small scaffolds and add prefix if applicable + "&&", + + "cat", + os.path.join(output_directory, "assembly_original.fasta"), + "|", + os.environ["seqkit"], + "seq", + "-m {}".format(opts.minimum_contig_length), + "|", + os.environ["seqkit"], + "replace", + "-r {}".format(opts.scaffold_prefix), + "-p '^'", + ">", + os.path.join(output_directory, 
"assembly.fasta"), + + "&&", + + "rm -rf", + os.path.join(output_directory, "assembly_original.fasta"), + + "&&", + + os.environ["fasta_to_saf.py"], + "-i", + os.path.join(output_directory, "assembly.fasta"), + ">", + os.path.join(output_directory, "assembly.fasta.saf"), + ] + + + + # files_to_remove = [ + # ] + + # for fn in files_to_remove: + # cmd += [ + # "&&", + # "rm -rf {}".format(os.path.join(output_directory, fn)), + # ] + return cmd + +# Bowtie2 +def get_alignment_cmd(input_filepaths, output_filepaths, output_directory, directories, opts): + + # Command + cmd = [ + # Clear temporary directory just in case + "rm -rf {}".format(os.path.join(directories["tmp"], "*")), + "&&", + + # MiniMap2 Index + "(", + os.environ["minimap2"], + "-t {}".format(opts.n_jobs), + "-d {}".format(output_filepaths[0]), # Index + opts.minimap2_index_options, + input_filepaths[1], # Reference + ")", + + "&&", + + # MiniMap2 + "(", + os.environ["minimap2"], + "-a", + "-t {}".format(opts.n_jobs), + "-x {}".format(opts.minimap2_preset), + opts.minimap2_options, + output_filepaths[0], + input_filepaths[0], + + + + # Convert to sorted BAM + "|", + + os.environ["samtools"], + "view", + "-b", + "-h", + "-F 4", + + "|", + + os.environ["samtools"], + "sort", + "--threads {}".format(opts.n_jobs), + "--reference {}".format(input_filepaths[1]), + "-T {}".format(os.path.join(directories["tmp"], "samtools_sort")), + ">", + output_filepaths[1], + ")", + + "&&", + + "(", + os.environ["samtools"], + "index", + "-@ {}".format(opts.n_jobs), + output_filepaths[1], + ")", + ] + + return cmd + + +# featureCounts +def get_featurecounts_cmd(input_filepaths, output_filepaths, output_directory, directories, opts): + + # Command + + # ORF-Level Counts + cmd = [ + "mkdir -p {}".format(os.path.join(directories["tmp"], "featurecounts")), + "&&", + "(", + os.environ["featureCounts"], + # "-G {}".format(input_filepaths[0]), + "-a {}".format(input_filepaths[1]), + "-o {}".format(os.path.join(output_directory, "featurecounts.tsv")), + "-F SAF", + "-L", + "--tmpDir {}".format(os.path.join(directories["tmp"], "featurecounts")), + "-T {}".format(opts.n_jobs), + opts.featurecounts_options, + input_filepaths[2], + ")", + "&&", + "gzip -f {}".format(os.path.join(output_directory, "featurecounts.tsv")), + ] + return cmd + +# seqkit +def get_seqkit_cmd(input_filepaths, output_filepaths, output_directory, directories, opts): + + # Command + + # ORF-Level Counts + cmd = [ + + os.environ["seqkit"], + "stats", + "-a", + "-j {}".format(opts.n_jobs), + "-T", + "-b", + os.path.join(directories[("intermediate","1__assembly")], "*.fasta"), + "|", + "gzip", + ">", + output_filepaths[0], + ] + return cmd + +# Symlink +def get_symlink_cmd(input_filepaths, output_filepaths, output_directory, directories, opts): + # Command + cmd = [ + "DST={}; (for SRC in {}; do SRC=$(realpath --relative-to $DST $SRC); ln -sf $SRC $DST; done)".format( + output_directory, + " ".join(input_filepaths), + ) + ] + return cmd + +# ============ +# Run Pipeline +# ============ +# Set environment variables +def add_executables_to_environment(opts): + """ + Adapted from Soothsayer: https://github.com/jolespin/soothsayer + """ + accessory_scripts = { + "fasta_to_saf.py", + } + + required_executables={ + "flye", + "minimap2", + "samtools", + "featureCounts", + "seqkit", + } | accessory_scripts + + if opts.path_config == "CONDA_PREFIX": + executables = dict() + for name in required_executables: + executables[name] = os.path.join(os.environ["CONDA_PREFIX"], "bin", name) + else: + if 
opts.path_config is None: + opts.path_config = os.path.join(opts.script_directory, "veba_config.tsv") + opts.path_config = format_path(opts.path_config) + assert os.path.exists(opts.path_config), "config file does not exist. Have you created one in the following directory?\n{}\nIf not, either create one, check this filepath:{}, or give the path to a proper config file using --path_config".format(opts.script_directory, opts.path_config) + assert os.stat(opts.path_config).st_size > 1, "config file seems to be empty. Please add 'name' and 'executable' columns for the following program names: {}".format(required_executables) + df_config = pd.read_csv(opts.path_config, sep="\t") + assert {"name", "executable"} <= set(df_config.columns), "config must have `name` and `executable` columns. Please adjust file: {}".format(opts.path_config) + df_config = df_config.loc[:,["name", "executable"]].dropna(how="any", axis=0).applymap(str) + # Get executable paths + executables = OrderedDict(zip(df_config["name"], df_config["executable"])) + assert required_executables <= set(list(executables.keys())), "config must have the required executables for this run. Please adjust file: {}\nIn particular, add info for the following: {}".format(opts.path_config, required_executables - set(list(executables.keys()))) + + # Display + for name in sorted(accessory_scripts): + executables[name] = "'{}'".format(os.path.join(opts.script_directory, "scripts", name)) # Can handle spaces in path + + print(format_header( "Adding executables to path from the following source: {}".format(opts.path_config), "-"), file=sys.stdout) + for name, executable in executables.items(): + if name in required_executables: + print(name, executable, sep = " --> ", file=sys.stdout) + os.environ[name] = executable.strip() + print("", file=sys.stdout) + + +# Pipeline +def create_pipeline(opts, directories, f_cmds): + + # ................................................................. + # Primordial + # ................................................................. 
+ # Commands file + pipeline = ExecutablePipeline(name=__program__, description=opts.name, f_cmds=f_cmds, checkpoint_directory=directories["checkpoints"], log_directory=directories["log"]) + + # ========== + # Assembly + # ========== + + step = 1 + + # Info + program = "assembly" + program_label = "{}__{}".format(step, program) + description = "Assembling long reads via {}".format(opts.program.capitalize()) + + # Add to directories + output_directory = directories[("intermediate", program_label)] = create_directory(os.path.join(directories["intermediate"], program_label)) + + + # i/o + input_filepaths = [opts.reads] + output_filenames = ["assembly.fasta", "assembly.fasta.saf"] + output_filepaths = list(map(lambda filename: os.path.join(output_directory, filename), output_filenames)) + + params = { + "input_filepaths":input_filepaths, + "output_filepaths":output_filepaths, + "output_directory":output_directory, + "opts":opts, + "directories":directories, + } + + cmd = get_assembly_cmd(**params) + pipeline.add_step( + id=program_label, + description = description, + step=step, + cmd=cmd, + input_filepaths = input_filepaths, + output_filepaths = output_filepaths, + validate_inputs=True, + validate_outputs=True, + log_prefix=program_label, + + ) + + # ========== + # Alignment + # ========== + + step = 2 + + # Info + program = "alignment" + program_label = "{}__{}".format(step, program) + description = "Aligning reads to assembly" + + # Add to directories + output_directory = directories[("intermediate", program_label)] = create_directory(os.path.join(directories["intermediate"], program_label)) + + + # i/o + input_filepaths = [ + opts.reads, + os.path.join(directories[("intermediate", "1__assembly")], "assembly.fasta"), + ] + + output_filepaths = [ + os.path.join(directories[("intermediate", "1__assembly")], "assembly.fasta.mmi"), + os.path.join(output_directory, "mapped.sorted.bam"), + ] + + params = { + "input_filepaths":input_filepaths, + "output_filepaths":output_filepaths, + "output_directory":output_directory, + "opts":opts, + "directories":directories, + } + + cmd = get_alignment_cmd(**params) + pipeline.add_step( + id=program_label, + description = description, + step=step, + cmd=cmd, + input_filepaths = input_filepaths, + output_filepaths = output_filepaths, + validate_inputs=True, + validate_outputs=True, + log_prefix=program_label, + + ) + + + + # ========== + # featureCounts + # ========== + step = 3 + + # Info + program = "featurecounts" + program_label = "{}__{}".format(step, program) + description = "Counting reads" + + # Add to directories + output_directory = directories[("intermediate", program_label)] = create_directory(os.path.join(directories["intermediate"], program_label)) + + # i/o + + input_filepaths = [ + os.path.join(directories[("intermediate", "1__assembly")], "assembly.fasta"), + os.path.join(directories[("intermediate", "1__assembly")], "assembly.fasta.saf"), + os.path.join(directories[("intermediate", "2__alignment")], "mapped.sorted.bam"), + ] + + output_filenames = ["featurecounts.tsv.gz"] + output_filepaths = list(map(lambda filename: os.path.join(output_directory, filename), output_filenames)) + + params = { + "input_filepaths":input_filepaths, + "output_filepaths":output_filepaths, + "output_directory":output_directory, + "opts":opts, + "directories":directories, + } + + cmd = get_featurecounts_cmd(**params) + pipeline.add_step( + id=program_label, + description = description, + step=step, + cmd=cmd, + input_filepaths = input_filepaths, + output_filepaths = 
output_filepaths, + validate_inputs=True, + validate_outputs=True, + log_prefix=program_label, + + ) + + # ========== + # stats + # ========== + + step = 4 + + # Info + program = "seqkit" + program_label = "{}__{}".format(step, program) + description = "Assembly statistics" + + # Add to directories + output_directory = directories[("intermediate", program_label)] = create_directory(os.path.join(directories["intermediate"], program_label)) + + + # i/o + input_filepaths = [ + os.path.join(directories[("intermediate", "1__assembly")], "*.fasta"), + + ] + + output_filenames = ["seqkit_stats.tsv.gz"] + output_filepaths = list(map(lambda filename: os.path.join(output_directory, filename), output_filenames)) + + params = { + "input_filepaths":input_filepaths, + "output_filepaths":output_filepaths, + "output_directory":output_directory, + "opts":opts, + "directories":directories, + } + + cmd = get_seqkit_cmd(**params) + pipeline.add_step( + id=program_label, + description = description, + step=step, + cmd=cmd, + input_filepaths = input_filepaths, + output_filepaths = output_filepaths, + validate_inputs=True, + validate_outputs=True, + log_prefix=program_label, + + ) + + + # ============= + # Symlink + # ============= + step = 5 + + # Info + program = "symlink" + program_label = "{}__{}".format(step, program) + description = "Symlinking relevant output files" + + # Add to directories + output_directory = directories["output"] + + # i/o + + input_filepaths = [ + os.path.join(directories[("intermediate", "1__assembly")], "assembly.fasta"), + os.path.join(directories[("intermediate", "1__assembly")], "assembly.fasta.mmi"), + os.path.join(directories[("intermediate", "2__alignment")], "mapped.sorted.bam"), + os.path.join(directories[("intermediate", "2__alignment")], "mapped.sorted.bam.bai"), + os.path.join(directories[("intermediate", "3__featurecounts")], "featurecounts.tsv.gz"), + os.path.join(directories[("intermediate", "4__seqkit")], "seqkit_stats.tsv.gz"), + ] + + output_filenames = map(lambda fp: fp.split("/")[-1], input_filepaths) + output_filepaths = list(map(lambda fn:os.path.join(directories["output"], fn), output_filenames)) + + params = { + "input_filepaths":input_filepaths, + "output_filepaths":output_filepaths, + "output_directory":output_directory, + "opts":opts, + "directories":directories, + } + + cmd = get_symlink_cmd(**params) + pipeline.add_step( + id=program_label, + description = description, + step=step, + cmd=cmd, + input_filepaths = input_filepaths, + output_filepaths = output_filepaths, + validate_inputs=True, + validate_outputs=True, + log_prefix=program_label, + + ) + + return pipeline + +# Configure parameters +def configure_parameters(opts, directories): + # os.environ[] + + # Scaffold prefix + if opts.scaffold_prefix == "NONE": + opts.scaffold_prefix = "" + else: + if "NAME" in opts.scaffold_prefix: + opts.scaffold_prefix = opts.scaffold_prefix.replace("NAME", opts.name) + print("Using the following prefix for all {} scaffolds: {}".format(opts.program, opts.scaffold_prefix), file=sys.stdout) + + # Set environment variables + add_executables_to_environment(opts=opts) + +def main(args=None): + # Path info + script_directory = os.path.dirname(os.path.abspath( __file__ )) + script_filename = __program__ + # Path info + description = """ + Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable) + usage = "{} -i <reads.fq> -n <name> -g <estimated_assembly_size> -o <output_directory>".format(__program__) + epilog = "Copyright 2021 Josh L. 
Espinoza (jespinoz@jcvi.org)" + + # Parser + parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter) + # Pipeline + parser_io = parser.add_argument_group('Required I/O arguments') + parser_io.add_argument("-i","--reads", type=str, required=True, help = "path/to/reads.fq[.gz]") + parser_io.add_argument("-n", "--name", type=str, required=True, help="Name of sample") + parser_io.add_argument("-o","--project_directory", type=str, default="veba_output/assembly", help = "path/to/project_directory [Default: veba_output/assembly]") + + # Utility + parser_utility = parser.add_argument_group('Utility arguments') + parser_utility.add_argument("--path_config", type=str, default="CONDA_PREFIX", help="path/to/config.tsv [Default: CONDA_PREFIX]") #site-packages in future + parser_utility.add_argument("-p", "--n_jobs", type=int, default=1, help = "Number of threads [Default: 1]") + parser_utility.add_argument("--random_state", type=int, default=0, help = "Random state [Default: 0]") + parser_utility.add_argument("--restart_from_checkpoint", type=str, default=None, help = "Restart from a particular checkpoint [Default: None]") + parser_utility.add_argument("-v", "--version", action='version', version="{} v{}".format(__program__, __version__)) + parser_utility.add_argument("--tmpdir", type=str, help="Set temporary directory") #site-packages in future + + # Assembler + parser_assembler = parser.add_argument_group('Assembler arguments') + parser_assembler.add_argument("-P", "--program", type=str, default="flye", choices={"flye", "metaflye"}, help="Assembler | {flye, metaflye} [Default: 'flye']") + parser_assembler.add_argument("-s", "--scaffold_prefix", type=str, default="NAME__", help="Assembler | Special options: Use NAME to use --name. Use NONE to not include a prefix. [Default: 'NAME__']") + parser_assembler.add_argument("-m", "--minimum_contig_length", type=int, default=1, help="Minimum contig length. Should be lenient here because longer thresholds can be used for binning downstream. Recommended for metagenomes to use 1000 here. [Default: 1] ") + parser_assembler.add_argument("-t", "--reads_type", type=str, default="nano-hq", choices={"nano-hq", "nano-corr", "nano-raw", "pacbio-hifi", "pacbio-corr", "pacbio-raw"}, help="Reads type for (meta)flye. {nano-hq, nano-corr, nano-raw, pacbio-hifi, pacbio-corr, pacbio-raw} [Default: nano-hq] ") + parser_assembler.add_argument("-g", "--estimated_assembly_size", type=str, help="Estimated assembly size (e.g., 5m, 2.6g)") + parser_assembler.add_argument("--no_deterministic", action="store_true", help="Do not use deterministic mode. This will result in a faster assembly since it will be threaded but can get different assemblies upon rerunning") + parser_assembler.add_argument("--assembler_options", type=str, default="", help="Assembler options for Flye-based programs (e.g. --arg 1 ) [Default: '']") + + # Aligner + parser_aligner = parser.add_argument_group('MiniMap2 arguments') + parser_aligner.add_argument("--minimap2_preset", type=str, default="map-ont", help="MiniMap2 | MiniMap2 preset {map-pb, map-ont, map-hifi} [Default: map-ont]") + # parser_aligner.add_argument("--no_create_index", action="store_true", help="Do not create a MiniMap2 index") + parser_aligner.add_argument("--minimap2_index_options", type=str, default="", help="MiniMap2 | More options (e.g. 
--arg 1 ) [Default: '']\nhttps://github.com/lh3/minimap2") + parser_aligner.add_argument("--minimap2_options", type=str, default="", help="MiniMap2 | More options (e.g. --arg 1 ) [Default: '']\nhttps://github.com/lh3/minimap2") + + # featureCounts + parser_featurecounts = parser.add_argument_group('featureCounts arguments') + parser_featurecounts.add_argument("--featurecounts_options", type=str, default="", help="featureCounts | More options (e.g. --arg 1 ) [Default: ''] | http://bioinf.wehi.edu.au/featureCounts/") + + + # Options + opts = parser.parse_args() + opts.script_directory = script_directory + opts.script_filename = script_filename + + # Threads + if opts.n_jobs == -1: + from multiprocessing import cpu_count + opts.n_jobs = cpu_count() + assert opts.n_jobs >= 1, "--n_jobs must be ≥ 1. To select all available threads, use -1." + + + # Directories + directories = dict() + directories["project"] = create_directory(opts.project_directory) + directories["sample"] = create_directory(os.path.join(directories["project"], opts.name)) + directories["output"] = create_directory(os.path.join(directories["sample"], "output")) + directories["log"] = create_directory(os.path.join(directories["sample"], "log")) + if not opts.tmpdir: + opts.tmpdir = os.path.join(directories["sample"], "tmp") + directories["tmp"] = create_directory(opts.tmpdir) + directories["checkpoints"] = create_directory(os.path.join(directories["sample"], "checkpoints")) + directories["intermediate"] = create_directory(os.path.join(directories["sample"], "intermediate")) + # os.environ["TMPDIR"] = directories["tmp"] + + # Info + print(format_header(__program__, "="), file=sys.stdout) + print(format_header("Configuration:", "-"), file=sys.stdout) + print(format_header("Name: {}".format(opts.name), "."), file=sys.stdout) + print("Python version:", sys.version.replace("\n"," "), file=sys.stdout) + print("Python path:", sys.executable, file=sys.stdout) #sys.path[2] + print("GenoPype version:", genopype_version, file=sys.stdout) #sys.path[2] + print("Script version:", __version__, file=sys.stdout) + print("Moment:", get_timestamp(), file=sys.stdout) + print("Directory:", os.getcwd(), file=sys.stdout) + if "TMPDIR" in os.environ: print(os.environ["TMPDIR"], file=sys.stdout) + print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout) + configure_parameters(opts, directories) + sys.stdout.flush() + + # Run pipeline + with open(os.path.join(directories["sample"], "commands.sh"), "w") as f_cmds: + pipeline = create_pipeline( + opts=opts, + directories=directories, + f_cmds=f_cmds, + ) + pipeline.compile() + pipeline.execute(restart_from_checkpoint=opts.restart_from_checkpoint) + +if __name__ == "__main__": + main() diff --git a/src/assembly.py b/src/assembly.py index 5156eff..32fc4fd 100755 --- a/src/assembly.py +++ b/src/assembly.py @@ -13,7 +13,7 @@ pd.options.display.max_colwidth = 100 # from tqdm import tqdm __program__ = os.path.split(sys.argv[0])[-1] -__version__ = "2023.10.16" +__version__ = "2023.11.30" # Assembly def get_assembly_cmd( input_filepaths, output_filepaths, output_directory, directories, opts): @@ -683,8 +683,8 @@ def main(args=None): parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter) # Pipeline parser_io = parser.add_argument_group('Required I/O arguments') - parser_io.add_argument("-1","--forward_reads", type=str, help = "path/to/forward_reads.fq") - parser_io.add_argument("-2","--reverse_reads", type=str, help = 
"path/to/reverse_reads.fq") + parser_io.add_argument("-1","--forward_reads", type=str, help = "path/to/forward_reads.fq[.gz]") + parser_io.add_argument("-2","--reverse_reads", type=str, help = "path/to/reverse_reads.fq[.gz]") parser_io.add_argument("-n", "--name", type=str, help="Name of sample", required=True) parser_io.add_argument("-o","--project_directory", type=str, default="veba_output/assembly", help = "path/to/project_directory [Default: veba_output/assembly]") @@ -758,6 +758,7 @@ def main(args=None): print("Script version:", __version__, file=sys.stdout) print("Moment:", get_timestamp(), file=sys.stdout) print("Directory:", os.getcwd(), file=sys.stdout) + if "TMPDIR" in os.environ: print(os.environ["TMPDIR"], file=sys.stdout) print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout) configure_parameters(opts, directories) sys.stdout.flush() diff --git a/src/binning-eukaryotic.py b/src/binning-eukaryotic.py index f8cfaf2..9fdc054 100755 --- a/src/binning-eukaryotic.py +++ b/src/binning-eukaryotic.py @@ -14,7 +14,7 @@ pd.options.display.max_colwidth = 100 # from tqdm import tqdm __program__ = os.path.split(sys.argv[0])[-1] -__version__ = "2023.10.16" +__version__ = "2023.12.2" # DATABASE_METAEUK="/usr/local/scratch/CORE/jespinoz/db/veba/v1.0/Classify/Eukaryotic/eukaryotic" @@ -310,11 +310,13 @@ def get_eukaryotic_gene_modeling_cmd(input_filepaths, output_filepaths, output_d # Run Eukaryotic Gene Modeling "&&", + os.environ["eukaryotic_gene_modeling_wrapper.py"], "--fasta {}".format(os.path.join(directories["tmp"], "scaffolds.binned.eukaryotic.fasta")), "--scaffolds_to_bins {}".format(input_filepaths[1]), "--tiara_results {}".format(input_filepaths[2]), "--metaeuk_database {}".format(opts.metaeuk_database), + "--metaeuk_split_memory_limit {}".format(opts.metaeuk_split_memory_limit), "-o {}".format(output_directory), "-p {}".format(opts.n_jobs), @@ -1016,8 +1018,10 @@ def main(args=None): # MetaEuk parser_metaeuk = parser.add_argument_group('MetaEuk arguments') + parser_metaeuk.add_argument("-M", "--microeuk_database", type=str, choices={"MicroEuk100", "MicroEuk90", "MicroEuk50"}, default="MicroEuk50", help="MicroEuk database {MicroEuk100, MicroEuk90, MicroEuk50} [Default: MicroEuk50]") parser_metaeuk.add_argument("--metaeuk_sensitivity", type=float, default=4.0, help="MetaEuk | Sensitivity: 1.0 faster; 4.0 fast; 7.5 sensitive [Default: 4.0]") parser_metaeuk.add_argument("--metaeuk_evalue", type=float, default=0.01, help="MetaEuk | List matches below this E-value (range 0.0-inf) [Default: 0.01]") + parser_metaeuk.add_argument("--metaeuk_split_memory_limit", type=str, default="36G", help="MetaEuk | Set max memory per split. E.g. 800B, 5K, 10M, 1G. Use 0 to use all available system memory. (Default value is experimental) [Default: 36G]") parser_metaeuk.add_argument("--metaeuk_options", type=str, default="", help="MetaEuk | More options (e.g. 
--arg 1 ) [Default: ''] https://github.com/soedinglab/metaeuk") # --split-memory-limit 70G: https://github.com/soedinglab/metaeuk/issues/59 @@ -1071,7 +1075,7 @@ def main(args=None): if opts.veba_database is None: assert "VEBA_DATABASE" in os.environ, "Please set the following environment variable 'export VEBA_DATABASE=/path/to/veba_database' or provide path to --veba_database" opts.veba_database = os.environ["VEBA_DATABASE"] - opts.metaeuk_database = os.path.join(opts.veba_database, "Classify", "Microeukaryotic", "microeukaryotic") + opts.metaeuk_database = os.path.join(opts.veba_database, "Classify", "MicroEuk", opts.microeuk_database) # Directories @@ -1097,6 +1101,7 @@ def main(args=None): print("VEBA Database:", opts.veba_database, file=sys.stdout) print("Moment:", get_timestamp(), file=sys.stdout) print("Directory:", os.getcwd(), file=sys.stdout) + if "TMPDIR" in os.environ: print(os.environ["TMPDIR"], file=sys.stdout) print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout) configure_parameters(opts, directories) sys.stdout.flush() diff --git a/src/binning-prokaryotic.py b/src/binning-prokaryotic.py index 29f80c9..a52eb54 100755 --- a/src/binning-prokaryotic.py +++ b/src/binning-prokaryotic.py @@ -13,7 +13,7 @@ pd.options.display.max_colwidth = 100 # from tqdm import tqdm __program__ = os.path.split(sys.argv[0])[-1] -__version__ = "2023.10.16" +__version__ = "2023.11.30" # Assembly def get_coverage_cmd( input_filepaths, output_filepaths, output_directory, directories, opts): @@ -1683,6 +1683,7 @@ def main(args=None): print("VEBA Database:", opts.veba_database, file=sys.stdout) print("Moment:", get_timestamp(), file=sys.stdout) print("Directory:", os.getcwd(), file=sys.stdout) + if "TMPDIR" in os.environ: print(os.environ["TMPDIR"], file=sys.stdout) print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout) configure_parameters(opts, directories) sys.stdout.flush() diff --git a/src/binning-viral.py b/src/binning-viral.py index f109b01..55f299e 100755 --- a/src/binning-viral.py +++ b/src/binning-viral.py @@ -14,7 +14,7 @@ pd.options.display.max_colwidth = 100 # from tqdm import tqdm __program__ = os.path.split(sys.argv[0])[-1] -__version__ = "2023.10.16" +__version__ = "2023.11.30" # geNomad def get_genomad_cmd(input_filepaths, output_filepaths, output_directory, directories, opts): @@ -953,6 +953,7 @@ def main(args=None): print("VEBA Database:", opts.veba_database, file=sys.stdout) print("Moment:", get_timestamp(), file=sys.stdout) print("Directory:", os.getcwd(), file=sys.stdout) + if "TMPDIR" in os.environ: print(os.environ["TMPDIR"], file=sys.stdout) print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout) configure_parameters(opts, directories) sys.stdout.flush() diff --git a/src/biosynthetic.py b/src/biosynthetic.py index 5c1cb77..9996c68 100755 --- a/src/biosynthetic.py +++ b/src/biosynthetic.py @@ -13,7 +13,7 @@ pd.options.display.max_colwidth = 100 # from tqdm import tqdm __program__ = os.path.split(sys.argv[0])[-1] -__version__ = "2023.10.16" +__version__ = "2023.12.18" # antiSMASH def get_antismash_cmd( input_filepaths, output_filepaths, output_directory, directories, opts): @@ -336,7 +336,7 @@ def get_mmseqs2_protein_cmd(input_filepaths, output_filepaths, output_directory, "&&", - os.environ["mmseqs2_wrapper.py"], + os.environ["clustering_wrapper.py"], "--fasta {}".format(os.path.join(directories["tmp"], "components.concatenated.faa")), "--output_directory {}".format(output_directory), "--no_singletons" if 
bool(opts.no_singletons) else "", @@ -415,7 +415,7 @@ def get_mmseqs2_nucleotide_cmd(input_filepaths, output_filepaths, output_directo "&&", - os.environ["mmseqs2_wrapper.py"], + os.environ["clustering_wrapper.py"], "--fasta {}".format(os.path.join(directories["tmp"], "bgcs.concatenated.fasta")), "--output_directory {}".format(output_directory), "--no_singletons" if bool(opts.no_singletons) else "", @@ -483,7 +483,7 @@ def add_executables_to_environment(opts): "concatenate_dataframes.py", "bgc_novelty_scorer.py", "compile_krona.py", - "mmseqs2_wrapper.py", + "clustering_wrapper.py", "compile_protein_cluster_prevalence_table.py", } @@ -860,7 +860,7 @@ def main(args=None): # antiSMASH parser_antismash = parser.add_argument_group('antiSMASH arguments') parser_antismash.add_argument("-t", "--taxon", type=str, default="bacteria", help="Taxonomic classification of input sequence {bacteria,fungi} [Default: bacteria]") - parser_antismash.add_argument("--minimum_contig_length", type=int, default=1500, help="Minimum contig length. [Default: 1500] ") + parser_antismash.add_argument("--minimum_contig_length", type=int, default=1, help="Minimum contig length. [Default: 1] ") parser_antismash.add_argument("-d", "--antismash_database", type=str, default=os.path.join(site.getsitepackages()[0], "antismash", "databases"), help="antiSMASH | Database directory path [Default: {}]".format(os.path.join(site.getsitepackages()[0], "antismash", "databases"))) parser_antismash.add_argument("-s", "--hmmdetection_strictness", type=str, default="relaxed", help="antiSMASH | Defines which level of strictness to use for HMM-based cluster detection {strict,relaxed,loose} [Default: relaxed] ") parser_antismash.add_argument("--tta_threshold", type=float, default=0.65, help="antiSMASH | Lowest GC content to annotate TTA codons at [Default: 0.65]") @@ -881,7 +881,7 @@ def main(args=None): # MMSEQS2 parser_mmseqs2 = parser.add_argument_group('MMSEQS2 arguments') - parser_mmseqs2.add_argument("-a", "--algorithm", type=str, default="easy-cluster", help="MMSEQS2 | {easy-cluster, easy-linclust} [Default: easy-cluster]") + parser_mmseqs2.add_argument("-a", "--algorithm", type=str, default="mmseqs-cluster", choices={"mmseqs-cluster", "mmseqs-linclust"}, help="MMSEQS2 | {mmseqs-cluster, mmseqs-linclust} [Default: mmseqs-cluster]") parser_mmseqs2.add_argument("-f","--representative_output_format", type=str, default="fasta", help = "Format of output for representative sequences: {table, fasta} [Default: fasta]") # Should fasta be the new default? 
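The biosynthetic.py hunks above swap `mmseqs2_wrapper.py` for `clustering_wrapper.py` and rename the MMSEQS2 algorithms to `mmseqs-cluster`/`mmseqs-linclust`. For orientation, here is a hedged sketch of the wrapper call as it is composed in `get_mmseqs2_protein_cmd` above, restricted to the flags actually visible in this diff; the input fasta and output directory are placeholders:

    # Sketch of the clustering_wrapper.py invocation composed in get_mmseqs2_protein_cmd;
    # only flags shown in this diff are used, and both paths are placeholders.
    clustering_wrapper.py \
        --fasta components.concatenated.faa \
        --output_directory veba_output/cluster \
        --no_singletons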
@@ -943,6 +943,7 @@ def main(args=None): print("Script version:", __version__, file=sys.stdout) print("Moment:", get_timestamp(), file=sys.stdout) print("Directory:", os.getcwd(), file=sys.stdout) + if "TMPDIR" in os.environ: print(os.environ["TMPDIR"], file=sys.stdout) print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout) configure_parameters(opts, directories) sys.stdout.flush() diff --git a/src/classify-eukaryotic.py b/src/classify-eukaryotic.py index 216c26c..a9bb93d 100755 --- a/src/classify-eukaryotic.py +++ b/src/classify-eukaryotic.py @@ -14,7 +14,7 @@ pd.options.display.max_colwidth = 100 # from tqdm import tqdm __program__ = os.path.split(sys.argv[0])[-1] -__version__ = "2023.10.16" +__version__ = "2023.11.30" # Assembly def get_concatenate_cmd( input_filepaths, output_filepaths, output_directory, directories, opts): @@ -160,7 +160,7 @@ def get_compile_cmd( input_filepaths, output_filepaths, output_directory, direct return cmd -def get_consensus_genome_classification_cmd( input_filepaths, output_filepaths, output_directory, directories, opts): +def get_consensus_genome_classification_ranked_cmd( input_filepaths, output_filepaths, output_directory, directories, opts): # Command cmd = [ @@ -172,7 +172,7 @@ def get_consensus_genome_classification_cmd( input_filepaths, output_filepaths, "|", "tail -n +2", "|", - os.environ["consensus_genome_classification.py"], + os.environ["consensus_genome_classification_ranked.py"], "--leniency {}".format(opts.leniency), "-o {}".format(output_filepaths[0]), "-r c__,o__,f__,g__,s__", @@ -224,7 +224,7 @@ def get_consensus_cluster_classification_cmd( input_filepaths, output_filepaths, "-n id_genome_cluster", "-i 0", "|", - os.environ["consensus_genome_classification.py"], + os.environ["consensus_genome_classification_ranked.py"], "--leniency {}".format(opts.leniency), "-o {}".format(output_filepaths[0]), "-r c__,o__,f__,g__,s__", @@ -252,7 +252,7 @@ def add_executables_to_environment(opts): "filter_hmmsearch_results.py", "subset_table.py", "compile_eukaryotic_classifications.py", - "consensus_genome_classification.py", + "consensus_genome_classification_ranked.py", "insert_column_to_table.py", "metaeuk_wrapper.py", "scaffolds_to_bins.py", @@ -481,7 +481,7 @@ def create_pipeline(opts, directories, f_cmds): # ========== step += 1 - program = "consensus_genome_classification" + program = "consensus_genome_classification_ranked" program_label = "{}__{}".format(step, program) # Add to directories output_directory = directories["output"]# = create_directory(os.path.join(directories["intermediate"], program_label)) @@ -504,7 +504,7 @@ def create_pipeline(opts, directories, f_cmds): "directories":directories, } - cmd = get_consensus_genome_classification_cmd(**params) + cmd = get_consensus_genome_classification_ranked_cmd(**params) pipeline.add_step( id=program, @@ -698,6 +698,7 @@ def main(args=None): print("VEBA Database:", opts.veba_database, file=sys.stdout) print("Moment:", get_timestamp(), file=sys.stdout) print("Directory:", os.getcwd(), file=sys.stdout) + if "TMPDIR" in os.environ: print(os.environ["TMPDIR"], file=sys.stdout) print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout) configure_parameters(opts, directories) sys.stdout.flush() diff --git a/src/classify-prokaryotic.py b/src/classify-prokaryotic.py index b5abb15..e6f1d47 100755 --- a/src/classify-prokaryotic.py +++ b/src/classify-prokaryotic.py @@ -15,7 +15,7 @@ pd.options.display.max_colwidth = 100 # from tqdm import tqdm __program__ = 
os.path.split(sys.argv[0])[-1] -__version__ = "2023.10.16" +__version__ = "2023.11.30" # GTDB-Tk def get_gtdbtk_cmd( input_filepaths, output_filepaths, output_directory, directories, opts): @@ -138,7 +138,7 @@ def get_consensus_cluster_classification_cmd( input_filepaths, output_filepaths, "-i {}".format(input_filepaths[0]), "-c {}".format(input_filepaths[1]), "|", - os.environ["consensus_genome_classification.py"], + os.environ["consensus_genome_classification_ranked.py"], "--leniency {}".format(opts.leniency), "-o {}".format(output_filepaths[0]), "-u 'Unclassified prokaryote'", @@ -158,7 +158,7 @@ def add_executables_to_environment(opts): "compile_prokaryotic_genome_cluster_classification_scores_table.py", # "cut_table_by_column_labels.py", "concatenate_dataframes.py", - "consensus_genome_classification.py", + "consensus_genome_classification_ranked.py", # "insert_column_to_table.py", "compile_krona.py", @@ -443,6 +443,7 @@ def main(args=None): print("VEBA Database:", opts.veba_database, file=sys.stdout) print("Moment:", get_timestamp(), file=sys.stdout) print("Directory:", os.getcwd(), file=sys.stdout) + if "TMPDIR" in os.environ: print(os.environ["TMPDIR"], file=sys.stdout) print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout) configure_parameters(opts, directories) sys.stdout.flush() diff --git a/src/classify-viral.py b/src/classify-viral.py index ed0da0f..50cca6b 100755 --- a/src/classify-viral.py +++ b/src/classify-viral.py @@ -14,7 +14,7 @@ pd.options.display.max_colwidth = 100 # from tqdm import tqdm __program__ = os.path.split(sys.argv[0])[-1] -__version__ = "2023.10.16" +__version__ = "2023.11.30" def get_concatenate_cmd( input_filepaths, output_filepaths, output_directory, directories, opts): @@ -359,6 +359,7 @@ def main(args=None): print("VEBA Database:", opts.veba_database, file=sys.stdout) print("Moment:", get_timestamp(), file=sys.stdout) print("Directory:", os.getcwd(), file=sys.stdout) + if "TMPDIR" in os.environ: print(os.environ["TMPDIR"], file=sys.stdout) print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout) configure_parameters(opts, directories) sys.stdout.flush() diff --git a/src/cluster.py b/src/cluster.py index 320ef00..e13263f 100755 --- a/src/cluster.py +++ b/src/cluster.py @@ -1,6 +1,6 @@ #!/usr/bin/env python from __future__ import print_function, division -import sys, os, argparse, glob +import sys, os, argparse, glob, warnings from collections import OrderedDict, defaultdict import pandas as pd @@ -13,7 +13,7 @@ pd.options.display.max_colwidth = 100 # from tqdm import tqdm __program__ = os.path.split(sys.argv[0])[-1] -__version__ = "2023.10.24" +__version__ = "2023.12.11" # Global clustering def get_global_clustering_cmd( input_filepaths, output_filepaths, output_directory, directories, opts): @@ -26,18 +26,35 @@ def get_global_clustering_cmd( input_filepaths, output_filepaths, output_directo # "--no_singletons" if bool(opts.no_singletons) else "", "-p {}".format(opts.n_jobs), + "--genome_clustering_algorithm {}".format(opts.genome_clustering_algorithm), "--ani_threshold {}".format(opts.ani_threshold), "--genome_cluster_prefix {}".format(opts.genome_cluster_prefix) if bool(opts.genome_cluster_prefix) else "", "--genome_cluster_suffix {}".format(opts.genome_cluster_suffix) if bool(opts.genome_cluster_suffix) else "", "--genome_cluster_prefix_zfill {}".format(opts.genome_cluster_prefix_zfill) if bool(opts.genome_cluster_prefix_zfill) else "", + "--skani_target_ani {}".format(opts.skani_target_ani), + 
"--skani_minimum_af {}".format(opts.skani_minimum_af), + "--skani_no_confidence_interval" if opts.skani_no_confidence_interval else "", + + "--skani_nonviral_preset {}".format(opts.skani_nonviral_preset), + "--skani_nonviral_compression_factor {}".format(opts.skani_nonviral_compression_factor), + "--skani_nonviral_marker_kmer_compression_factor {}".format(opts.skani_nonviral_marker_kmer_compression_factor), + "--skani_nonviral_options {}".format(opts.skani_nonviral_options) if bool(opts.skani_nonviral_options) else "", + + "--skani_viral_preset {}".format(opts.skani_viral_preset), + "--skani_viral_compression_factor {}".format(opts.skani_viral_compression_factor), + "--skani_viral_marker_kmer_compression_factor {}".format(opts.skani_viral_marker_kmer_compression_factor), + "--skani_viral_options {}".format(opts.skani_viral_options) if bool(opts.skani_viral_options) else "", + "--fastani_options {}".format(opts.fastani_options) if bool(opts.fastani_options) else "", - "--algorithm {}".format(opts.algorithm), + + "--protein_clustering_algorithm {}".format(opts.protein_clustering_algorithm), "--minimum_identity_threshold {}".format(opts.minimum_identity_threshold), "--minimum_coverage_threshold {}".format(opts.minimum_coverage_threshold), "--protein_cluster_prefix {}".format(opts.protein_cluster_prefix) if bool(opts.protein_cluster_prefix) else "", "--protein_cluster_suffix {}".format(opts.protein_cluster_suffix) if bool(opts.protein_cluster_suffix) else "", "--protein_cluster_prefix_zfill {}".format(opts.protein_cluster_prefix_zfill) if bool(opts.protein_cluster_prefix_zfill) else "", "--mmseqs2_options {}".format(opts.mmseqs2_options) if bool(opts.mmseqs2_options) else "", + "--diamond_options {}".format(opts.diamond_options) if bool(opts.diamond_options) else "", "--minimum_core_prevalence {}".format(opts.minimum_core_prevalence), "&&", @@ -60,18 +77,36 @@ def get_local_clustering_cmd( input_filepaths, output_filepaths, output_director "-o {}".format(output_directory), # "--no_singletons" if bool(opts.no_singletons) else "", "-p {}".format(opts.n_jobs), + + "--genome_clustering_algorithm {}".format(opts.genome_clustering_algorithm), "--ani_threshold {}".format(opts.ani_threshold), "--genome_cluster_prefix {}".format(opts.genome_cluster_prefix) if bool(opts.genome_cluster_prefix) else "", "--genome_cluster_suffix {}".format(opts.genome_cluster_suffix) if bool(opts.genome_cluster_suffix) else "", "--genome_cluster_prefix_zfill {}".format(opts.genome_cluster_prefix_zfill) if bool(opts.genome_cluster_prefix_zfill) else "", + "--skani_target_ani {}".format(opts.skani_target_ani), + "--skani_minimum_af {}".format(opts.skani_minimum_af), + "--skani_no_confidence_interval" if opts.skani_no_confidence_interval else "", + + "--skani_nonviral_preset {}".format(opts.skani_nonviral_preset), + "--skani_nonviral_compression_factor {}".format(opts.skani_nonviral_compression_factor), + "--skani_nonviral_marker_kmer_compression_factor {}".format(opts.skani_nonviral_marker_kmer_compression_factor), + "--skani_nonviral_options {}".format(opts.skani_nonviral_options) if bool(opts.skani_nonviral_options) else "", + + "--skani_viral_preset {}".format(opts.skani_viral_preset), + "--skani_viral_compression_factor {}".format(opts.skani_viral_compression_factor), + "--skani_viral_marker_kmer_compression_factor {}".format(opts.skani_viral_marker_kmer_compression_factor), + "--skani_viral_options {}".format(opts.skani_viral_options) if bool(opts.skani_viral_options) else "", + "--fastani_options 
{}".format(opts.fastani_options) if bool(opts.fastani_options) else "", - "--algorithm {}".format(opts.algorithm), + + "--protein_clustering_algorithm {}".format(opts.protein_clustering_algorithm), "--minimum_identity_threshold {}".format(opts.minimum_identity_threshold), "--minimum_coverage_threshold {}".format(opts.minimum_coverage_threshold), "--protein_cluster_prefix {}".format(opts.protein_cluster_prefix) if bool(opts.protein_cluster_prefix) else "", "--protein_cluster_suffix {}".format(opts.protein_cluster_suffix) if bool(opts.protein_cluster_suffix) else "", "--protein_cluster_prefix_zfill {}".format(opts.protein_cluster_prefix_zfill) if bool(opts.protein_cluster_prefix_zfill) else "", "--mmseqs2_options {}".format(opts.mmseqs2_options) if bool(opts.mmseqs2_options) else "", + "--diamond_options {}".format(opts.diamond_options) if bool(opts.diamond_options) else "", "--minimum_core_prevalence {}".format(opts.minimum_core_prevalence), "&&", @@ -107,8 +142,10 @@ def add_executables_to_environment(opts): required_executables={ # 1 + "skani", "fastANI", "mmseqs", + "diamond", } | accessory_scripts if opts.path_config == "CONDA_PREFIX": @@ -142,6 +179,21 @@ def add_executables_to_environment(opts): # Pipeline def create_pipeline(opts, directories, f_cmds): + + # Genome clustering algorithm + GENOME_CLUSTERING_ALGORITHM = opts.genome_clustering_algorithm.lower() + if GENOME_CLUSTERING_ALGORITHM == "fastani": + GENOME_CLUSTERING_ALGORITHM = "FastANI" + if GENOME_CLUSTERING_ALGORITHM == "skani": + GENOME_CLUSTERING_ALGORITHM = "skani" + + # Protein clustering algorithm + PROTEIN_CLUSTERING_ALGORITHM = opts.protein_clustering_algorithm.split("-")[0].lower() + if PROTEIN_CLUSTERING_ALGORITHM == "mmseqs": + PROTEIN_CLUSTERING_ALGORITHM = PROTEIN_CLUSTERING_ALGORITHM.upper() + if PROTEIN_CLUSTERING_ALGORITHM == "diamond": + PROTEIN_CLUSTERING_ALGORITHM = PROTEIN_CLUSTERING_ALGORITHM.capitalize() + # ................................................................. # Primordial # ................................................................. 
@@ -159,7 +211,7 @@ def create_pipeline(opts, directories, f_cmds): output_directory = directories[("intermediate", program_label)] = create_directory(os.path.join(directories["intermediate"], program_label)) # Info - description = "Global clustering of genomes (FastANI) and proteins (MMSEQS2)" + description = "Global clustering of genomes ({}) and proteins ({})".format(GENOME_CLUSTERING_ALGORITHM, PROTEIN_CLUSTERING_ALGORITHM) # i/o input_filepaths = [opts.genomes_table] @@ -206,7 +258,7 @@ def create_pipeline(opts, directories, f_cmds): output_directory = directories[("intermediate", program_label)] = create_directory(os.path.join(directories["intermediate"], program_label)) # Info - description = "Local clustering of genomes (FastANI) and proteins (MMSEQS2)" + description = "Local clustering of genomes ({}) and proteins ({})".format(GENOME_CLUSTERING_ALGORITHM, PROTEIN_CLUSTERING_ALGORITHM) # i/o input_filepaths = [opts.genomes_table] @@ -245,8 +297,20 @@ def create_pipeline(opts, directories, f_cmds): # Configure parameters def configure_parameters(opts, directories): - assert_acceptable_arguments(opts.algorithm, {"easy-cluster", "easy-linclust"}) + + assert_acceptable_arguments(opts.protein_clustering_algorithm, {"easy-cluster", "easy-linclust", "mmseqs-cluster", "mmseqs-linclust", "diamond-cluster", "diamond-linclust"}) + if opts.protein_clustering_algorithm in {"easy-cluster", "easy-linclust"}: + d = {"easy-cluster":"mmseqs-cluster", "easy-linclust":"mmseqs-linclust"} + warnings.warn("\n\nPlease use `{}` instead of `{}` for MMSEQS2 clustering.".format(d[opts.protein_clustering_algorithm], opts.protein_clustering_algorithm)) + opts.protein_clustering_algorithm = d[opts.protein_clustering_algorithm] + if opts.skani_nonviral_preset.lower() == "none": + opts.skani_nonviral_preset = None + + if opts.skani_viral_preset.lower() == "none": + opts.skani_viral_preset = None + + assert 0 < opts.minimum_core_prevalence <= 1.0, "--minimum_core_prevalence must be a float between (0.0,1.0])" # Set environment variables add_executables_to_environment(opts=opts) @@ -257,7 +321,7 @@ def main(args=None): # Path info description = """ Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable) - usage = "{} -i -o -A 95 -a easy-cluster".format(__program__) + usage = "{} -i -o -A 95 -a mmseqs-cluster".format(__program__) epilog = "Copyright 2021 Josh L. Espinoza (jespinoz@jcvi.org)" # Parser @@ -276,24 +340,45 @@ def main(args=None): parser_utility.add_argument("--restart_from_checkpoint", type=str, default=None, help = "Restart from a particular checkpoint [Default: None]") parser_utility.add_argument("-v", "--version", action='version', version="{} v{}".format(__program__, __version__)) - # FastANI + # ANI + parser_genome_clustering = parser.add_argument_group('Genome clustering arguments') + parser_genome_clustering.add_argument("-G", "--genome_clustering_algorithm", type=str, choices={"fastani", "skani"}, default="skani", help="Program to use for ANI calculations. `skani` is faster and more memory efficient. For v1.0.0 - v1.3.x behavior, use `fastani`. 
[Default: skani]") + parser_genome_clustering.add_argument("-A", "--ani_threshold", type=float, default=95.0, help="Species-level cluster (SLC) ANI threshold (Range (0.0, 100.0]) [Default: 95.0]") + parser_genome_clustering.add_argument("--genome_cluster_prefix", type=str, default="SLC-", help="Cluster prefix [Default: 'SLC-") + parser_genome_clustering.add_argument("--genome_cluster_suffix", type=str, default="", help="Cluster suffix [Default: '") + parser_genome_clustering.add_argument("--genome_cluster_prefix_zfill", type=int, default=0, help="Cluster prefix zfill. Use 7 to match identifiers from OrthoFinder. Use 0 to add no zfill. [Default: 0]") #7 + + parser_skani = parser.add_argument_group('Skani triangle arguments') + parser_skani.add_argument("--skani_target_ani", type=float, default=80, help="skani | If you set --skani_target_ani to --ani_threshold, you may screen out genomes ANI ≥ --ani_threshold [Default: 80]") + parser_skani.add_argument("--skani_minimum_af", type=float, default=15, help="skani | Minimum aligned fraction greater than this value [Default: 15]") + parser_skani.add_argument("--skani_no_confidence_interval", action="store_true", help="skani | Output [5,95] ANI confidence intervals using percentile bootstrap on the putative ANI distribution") + # parser_skani.add_argument("--skani_low_memory", action="store_true", help="Skani | More options (e.g. --arg 1 ) https://github.com/bluenote-1577/skani [Default: '']") + + parser_skani = parser.add_argument_group('[Prokaryotic & Eukaryotic] Skani triangle arguments') + parser_skani.add_argument("--skani_nonviral_preset", type=str, default="medium", choices={"fast", "medium", "slow", "none"}, help="skani [Prokaryotic & Eukaryotic]| Use `none` if you are setting skani -c (compression factor) {fast, medium, slow, none} [Default: medium]") + parser_skani.add_argument("--skani_nonviral_compression_factor", type=int, default=125, help="skani [Prokaryotic & Eukaryotic]| Compression factor (k-mer subsampling rate). [Default: 125]") + parser_skani.add_argument("--skani_nonviral_marker_kmer_compression_factor", type=int, default=1000, help="skani [Prokaryotic & Eukaryotic] | Marker k-mer compression factor. Markers are used for filtering. [Default: 1000]") + parser_skani.add_argument("--skani_nonviral_options", type=str, default="", help="skani [Prokaryotic & Eukaryotic] | More options for `skani triangle` (e.g. --arg 1 ) [Default: '']") + + parser_skani = parser.add_argument_group('[Viral] Skani triangle arguments') + parser_skani.add_argument("--skani_viral_preset", type=str, default="slow", choices={"fast", "medium", "slow", "none"}, help="skani | Use `none` if you are setting skani -c (compression factor) {fast, medium, slow, none} [Default: slow]") + parser_skani.add_argument("--skani_viral_compression_factor", type=int, default=30, help="skani [Viral] | Compression factor (k-mer subsampling rate). [Default: 30]") + parser_skani.add_argument("--skani_viral_marker_kmer_compression_factor", type=int, default=200, help="skani [Viral] | Marker k-mer compression factor. Markers are used for filtering. Consider decreasing to ~200-300 if working with small genomes (e.g. plasmids or viruses). [Default: 200]") + parser_skani.add_argument("--skani_viral_options", type=str, default="", help="skani [Viral] | More options for `skani triangle` (e.g. 
--arg 1 ) [Default: '']") + parser_fastani = parser.add_argument_group('FastANI arguments') - parser_fastani.add_argument("-A", "--ani_threshold", type=float, default=95.0, help="FastANI | Species-level cluster (SLC) ANI threshold (Range (0.0, 100.0]) [Default: 95.0]") - parser_fastani.add_argument("--genome_cluster_prefix", type=str, default="SLC-", help="Cluster prefix [Default: 'SLC-") - parser_fastani.add_argument("--genome_cluster_suffix", type=str, default="", help="Cluster suffix [Default: '") - parser_fastani.add_argument("--genome_cluster_prefix_zfill", type=int, default=0, help="Cluster prefix zfill. Use 7 to match identifiers from OrthoFinder. Use 0 to add no zfill. [Default: 0]") #7 parser_fastani.add_argument("--fastani_options", type=str, default="", help="FastANI | More options (e.g. --arg 1 ) [Default: '']") - - # MMSEQS2 - parser_mmseqs2 = parser.add_argument_group('MMSEQS2 arguments') - parser_mmseqs2.add_argument("-a", "--algorithm", type=str, default="easy-cluster", help="MMSEQS2 | {easy-cluster, easy-linclust} [Default: easy-cluster]") - parser_mmseqs2.add_argument("-t", "--minimum_identity_threshold", type=float, default=50.0, help="MMSEQS2 | SLC-Specific Protein Cluster (SSPC, previously referred to as SSO) percent identity threshold (Range (0.0, 100.0]) [Default: 50.0]") - parser_mmseqs2.add_argument("-c", "--minimum_coverage_threshold", type=float, default=0.8, help="MMSEQS2 | SSPC coverage threshold (Range (0.0, 1.0]) [Default: 0.8]") - parser_mmseqs2.add_argument("--protein_cluster_prefix", type=str, default="SSPC-", help="Cluster prefix [Default: 'SSPC-") - parser_mmseqs2.add_argument("--protein_cluster_suffix", type=str, default="", help="Cluster suffix [Default: '") - parser_mmseqs2.add_argument("--protein_cluster_prefix_zfill", type=int, default=0, help="Cluster prefix zfill. Use 7 to match identifiers from OrthoFinder. Use 0 to add no zfill. [Default: 0]") #7 - parser_mmseqs2.add_argument("--mmseqs2_options", type=str, default="", help="MMSEQS2 | More options (e.g. --arg 1 ) [Default: '']") + # Clustering + parser_protein_clustering = parser.add_argument_group('Protein clustering arguments') + parser_protein_clustering.add_argument("-P", "--protein_clustering_algorithm", type=str, choices={"mmseqs-cluster", "mmseqs-linclust", "diamond-cluster", "diamond-linclust"}, default="mmseqs-cluster", help="Clustering algorithm | Diamond can only be used for clustering proteins {mmseqs-cluster, mmseqs-linclust, diamond-cluster, mmseqs-linclust} [Default: mmseqs-cluster]") + parser_protein_clustering.add_argument("-t", "--minimum_identity_threshold", type=float, default=50.0, help="Clustering | Percent identity threshold (Range (0.0, 100.0]) [Default: 50.0]") + parser_protein_clustering.add_argument("-c", "--minimum_coverage_threshold", type=float, default=0.8, help="Clustering | Coverage threshold (Range (0.0, 1.0]) [Default: 0.8]") + parser_protein_clustering.add_argument("--protein_cluster_prefix", type=str, default="SSPC-", help="Cluster prefix [Default: 'SSPC-") + parser_protein_clustering.add_argument("--protein_cluster_suffix", type=str, default="", help="Cluster suffix [Default: '") + parser_protein_clustering.add_argument("--protein_cluster_prefix_zfill", type=int, default=0, help="Cluster prefix zfill. Use 7 to match identifiers from OrthoFinder. Use 0 to add no zfill. [Default: 0]") #7 + parser_protein_clustering.add_argument("--mmseqs2_options", type=str, default="", help="MMSEQS2 | More options (e.g. 
--arg 1 ) [Default: '']") + parser_protein_clustering.add_argument("--diamond_options", type=str, default="", help="Diamond | More options (e.g. --arg 1 ) [Default: '']") # Pangenome parser_pangenome = parser.add_argument_group('Pangenome arguments') @@ -329,6 +414,7 @@ def main(args=None): print("Script version:", __version__, file=sys.stdout) print("Moment:", get_timestamp(), file=sys.stdout) print("Directory:", os.getcwd(), file=sys.stdout) + if "TMPDIR" in os.environ: print(os.environ["TMPDIR"], file=sys.stdout) print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout) configure_parameters(opts, directories) sys.stdout.flush() diff --git a/src/coverage-long.py b/src/coverage-long.py new file mode 100755 index 0000000..d282754 --- /dev/null +++ b/src/coverage-long.py @@ -0,0 +1,587 @@ +#!/usr/bin/env python +from __future__ import print_function, division +import sys, os, argparse, glob +from collections import OrderedDict, defaultdict + +import pandas as pd + +# Soothsayer Ecosystem +from genopype import * +from genopype import __version__ as genopype_version +from soothsayer_utils import * + +pd.options.display.max_colwidth = 100 +# from tqdm import tqdm +__program__ = os.path.split(sys.argv[0])[-1] +__version__ = "2023.12.18" + +# Assembly +def get_index_cmd( input_filepaths, output_filepaths, output_directory, directories, opts): + + # Command + cmd = [ + # Filtering out small contigs + "cat", + opts.fasta, + "|", + os.environ["seqkit"], + "seq", + "-m {}".format(opts.minimum_contig_length), + "-j {}".format(opts.n_jobs), + opts.seqkit_seq_options, + ">", + output_filepaths[0], + + # Create SAF file + "&&", + os.environ["fasta_to_saf.py"], + "-i {}".format(output_filepaths[0]), + ">", + output_filepaths[1], + + "&&", + + # Minimap2 Index + os.environ["minimap2"], + "-t {}".format(opts.n_jobs), + # "--seed {}".format(opts.random_state), + opts.minimap2_index_options, + "-d {}".format(output_filepaths[3]), # Index + output_filepaths[0], # Reference + + # Get stats for reference + "&&", + os.environ["seqkit"], + "stats", + "-a", + "-j {}".format(opts.n_jobs), + "-T", + "-b", + output_filepaths[0], + ">", + output_filepaths[2], + ] + + return cmd + + +# # Bowtie2 +# def get_alignment_gnuparallel_cmd(input_filepaths, output_filepaths, output_directory, directories, opts): + +# # Command +# cmd = [ + +# # MAKE THIS A FOR LOOP WITH MAX THREADS FOR EACH ONE. 
THE REASON FOR THIS IS THAT IF THERE IS A SMALL SAMPLE IT WILL BE DONE QUICK BUT THE LARGER SAMPLES ARE GOING TO BE STUCK WITH ONE THREAD STILL +# """ +# # Clear temporary directory just in case + +# rm -rf %s + +# # Minimap2 +# %s --jobs %d -a %s -C "\t" "mkdir -p %s && %s -x %s -1 {2} -2 {3} --threads 1 --seed %d --no-unal %s | %s sort --threads 1 --reference %s -T %s > %s && %s index -@ 1 %s" + +# """%( +# os.path.join(directories["tmp"], "*"), + +# # Parallel +# os.environ["parallel"], +# opts.n_jobs, +# input_filepaths[0], + +# # Make directory +# os.path.join(output_directory, "{1}"), + +# # Bowtie2 +# os.environ["minimap2"], +# input_filepaths[1], +# opts.random_state, +# opts.bowtie2_options, + +# # Samtools sort +# os.environ["samtools"], +# input_filepaths[0], +# os.path.join(directories["tmp"], "samtools_sort_{1}"), +# os.path.join(output_directory, "{1}", "mapped.sorted.bam"), + +# # Samtools index +# os.environ["samtools"], +# os.path.join(output_directory, "{1}", "mapped.sorted.bam"), + +# ), + + +# ] + +# return cmd + +def get_alignment_cmd(input_filepaths, output_filepaths, output_directory, directories, opts): + + # Command + cmd = [ + +""" + # Clear temporary directory just in case +rm -rf %s + +# Read lines +READ_TABLE=%s + +while IFS= read -r LINE +do echo $LINE + # Split fields + ID_SAMPLE=$(echo $LINE | cut -f1 -d " ") + READS=$(echo $LINE | cut -f2 -d " ") + + # Create subdirectory + mkdir -p %s + + OUTPUT_BAM="%s" + + # Minimap2 + if [[ -e "$OUTPUT_BAM" && -s "$OUTPUT_BAM" ]]; then + echo "[Skipping (Exists)] [Minimap2] [$ID_SAMPLE]" + else + echo "[Running] [Minimap2] [$ID_SAMPLE]" + %s -a -x %s -t %d %s %s $READS | %s view -h -b -F 4 | %s sort -@ %d --reference %s -T %s > $OUTPUT_BAM && %s index -@ %d $OUTPUT_BAM + fi +done < $READ_TABLE + +"""%( + # Clear temporary directory just in case + os.path.join(directories["tmp"], "*"), + + # Read lines + input_filepaths[0], + + # Make directory + os.path.join(output_directory, "${ID_SAMPLE}"), + + # Output BAM + os.path.join(output_directory, "${ID_SAMPLE}", "mapped.sorted.bam"), + + + # Bowtie2 + os.environ["minimap2"], + opts.minimap2_preset, + opts.n_jobs, + opts.minimap2_options, + input_filepaths[2], + + + # Samtools view + os.environ["samtools"], + + + # Samtools sort + os.environ["samtools"], + opts.n_jobs, + input_filepaths[1], + os.path.join(directories["tmp"], "samtools_sort_${ID_SAMPLE}"), + # os.path.join(output_directory, "${ID_SAMPLE}", "mapped.sorted.bam"), + + # Samtools index + os.environ["samtools"], + opts.n_jobs, + # os.path.join(output_directory, "${ID_SAMPLE}", "mapped.sorted.bam"), + ), + + ] + + return cmd + + +# featureCounts +def get_featurecounts_cmd(input_filepaths, output_filepaths, output_directory, directories, opts): + + # Command + + # ORF-Level Counts + cmd = [ + "mkdir -p {}".format(os.path.join(directories["tmp"], "featurecounts")), + "&&", + "(", + os.environ["featureCounts"], + # "-G {}".format(input_filepaths[0]), + "-a {}".format(input_filepaths[0]), + "-o {}".format(os.path.join(output_directory, "featurecounts.tsv")), + "-F SAF", + "--tmpDir {}".format(os.path.join(directories["tmp"], "featurecounts")), + "-T {}".format(opts.n_jobs), + "-L", + opts.featurecounts_options, + *input_filepaths[1:], + ")", + "&&", + "gzip -f {}".format(os.path.join(output_directory, "featurecounts.tsv")), + ] + return cmd + + + +# Symlink +def get_symlink_cmd(input_filepaths, output_filepaths, output_directory, directories, opts): + # Command + cmd = [ + "DST={}; (for SRC in {}; do SRC=$(realpath 
--relative-to $DST $SRC); ln -sf $SRC $DST; done)".format( + output_directory, + " ".join(input_filepaths), + ) + ] + return cmd + +# ============ +# Run Pipeline +# ============ +# Set environment variables +def add_executables_to_environment(opts): + """ + Adapted from Soothsayer: https://github.com/jolespin/soothsayer + """ + accessory_scripts = { + "fasta_to_saf.py" + } + + required_executables={ + "minimap2", + "samtools", + "featureCounts", + "seqkit", + # "parallel", + } | accessory_scripts + + if opts.path_config == "CONDA_PREFIX": + executables = dict() + for name in required_executables: + executables[name] = os.path.join(os.environ["CONDA_PREFIX"], "bin", name) + else: + if opts.path_config is None: + opts.path_config = os.path.join(opts.script_directory, "veba_config.tsv") + opts.path_config = format_path(opts.path_config) + assert os.path.exists(opts.path_config), "config file does not exist. Have you created one in the following directory?\n{}\nIf not, either create one, check this filepath:{}, or give the path to a proper config file using --path_config".format(opts.script_directory, opts.path_config) + assert os.stat(opts.path_config).st_size > 1, "config file seems to be empty. Please add 'name' and 'executable' columns for the following program names: {}".format(required_executables) + df_config = pd.read_csv(opts.path_config, sep="\t") + assert {"name", "executable"} <= set(df_config.columns), "config must have `name` and `executable` columns. Please adjust file: {}".format(opts.path_config) + df_config = df_config.loc[:,["name", "executable"]].dropna(how="any", axis=0).applymap(str) + # Get executable paths + executables = OrderedDict(zip(df_config["name"], df_config["executable"])) + assert required_executables <= set(list(executables.keys())), "config must have the required executables for this run. Please adjust file: {}\nIn particular, add info for the following: {}".format(opts.path_config, required_executables - set(list(executables.keys()))) + + # Display + for name in sorted(accessory_scripts): + executables[name] = "'{}'".format(os.path.join(opts.script_directory, "scripts", name)) # Can handle spaces in path + print(format_header( "Adding executables to path from the following source: {}".format(opts.path_config), "-"), file=sys.stdout) + for name, executable in executables.items(): + if name in required_executables: + print(name, executable, sep = " --> ", file=sys.stdout) + os.environ[name] = executable.strip() + print("", file=sys.stdout) + +# Pipeline +def create_pipeline(opts, directories, f_cmds): + + # ................................................................. + # Primordial + # ................................................................. 
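The `get_symlink_cmd` above shells out to `realpath --relative-to` so that the links placed in the output directory stay valid if the project directory is moved. A pure-Python sketch of the equivalent logic, with hypothetical paths:

```python
# Pure-Python sketch of the relative-symlink step (the module itself emits
# `realpath --relative-to` shell commands instead). Paths are hypothetical.
import os

def symlink_relative(source_paths, output_directory):
    os.makedirs(output_directory, exist_ok=True)
    for src in source_paths:
        relative_src = os.path.relpath(src, start=output_directory)
        dst = os.path.join(output_directory, os.path.basename(src))
        if os.path.lexists(dst):
            os.remove(dst)
        os.symlink(relative_src, dst)  # relative link survives moving the project root

symlink_relative(
    ["veba_output/coverage/intermediate/3__featurecounts/featurecounts.tsv.gz"],
    "veba_output/coverage/output",
)
```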
+ # Commands file + pipeline = ExecutablePipeline(name=__program__, description="Coverage", f_cmds=f_cmds, checkpoint_directory=directories["checkpoints"], log_directory=directories["log"]) + + # ========== + # Assembly + # ========== + + step = 1 + + # Info + program = "index" + program_label = "{}__{}".format(step, program) + description = "Preprocess fasta file and build Bowtie2 index" + + # Add to directories + output_directory = directories[("intermediate", program_label)] = create_directory(os.path.join(directories["intermediate"], program_label)) + + + # i/o + input_filepaths = [opts.fasta] + output_filenames = ["reference.fasta", "reference.fasta.saf", "seqkit_stats.tsv", "reference.mmi"] + + + output_filepaths = list(map(lambda filename: os.path.join(output_directory, filename), output_filenames)) + + params = { + "input_filepaths":input_filepaths, + "output_filepaths":output_filepaths, + "output_directory":output_directory, + "opts":opts, + "directories":directories, + } + + cmd = get_index_cmd(**params) + pipeline.add_step( + id=program_label, + description = description, + step=step, + cmd=cmd, + input_filepaths = input_filepaths, + output_filepaths = output_filepaths, + validate_inputs=True, + validate_outputs=True, + log_prefix=program_label, + + ) + + # ========== + # Alignment + # ========== + + step = 2 + + # Info + program = "alignment" + program_label = "{}__{}".format(step, program) + description = "Aligning reads to reference" + + # Add to directories + output_directory = directories[("intermediate", program_label)] = create_directory(os.path.join(directories["intermediate"], program_label)) + + + # i/o + input_filepaths = [ + opts.reads, + os.path.join(directories[("intermediate", "1__index")], "reference.fasta"), + os.path.join(directories[("intermediate", "1__index")], "reference.mmi"), + ] + + + + output_filenames = ["*/mapped.sorted.bam"] + output_filepaths = list(map(lambda filename: os.path.join(output_directory, filename), output_filenames)) + + + params = { + "input_filepaths":input_filepaths, + "output_filepaths":output_filepaths, + "output_directory":output_directory, + "opts":opts, + "directories":directories, + } + + # if not opts.one_task_per_cpu: + cmd = get_alignment_cmd(**params) + # else: + # cmd = get_alignment_gnuparallel_cmd(**params) + pipeline.add_step( + id=program_label, + description = description, + step=step, + cmd=cmd, + input_filepaths = input_filepaths, + output_filepaths = output_filepaths, + validate_inputs=True, + validate_outputs=True, + log_prefix=program_label, + + ) + + # ========== + # featureCounts + # ========== + step = 3 + + # Info + program = "featurecounts" + program_label = "{}__{}".format(step, program) + description = "Counting reads" + + # Add to directories + output_directory = directories[("intermediate", program_label)] = create_directory(os.path.join(directories["intermediate"], program_label)) + + # i/o + + input_filepaths = [ + os.path.join(directories[("intermediate", "1__index")], "reference.fasta.saf"), + os.path.join(directories[("intermediate", "2__alignment")], "*", "mapped.sorted.bam"), + ] + + output_filenames = ["featurecounts.tsv.gz"] + output_filepaths = list(map(lambda filename: os.path.join(output_directory, filename), output_filenames)) + + params = { + "input_filepaths":input_filepaths, + "output_filepaths":output_filepaths, + "output_directory":output_directory, + "opts":opts, + "directories":directories, + } + + cmd = get_featurecounts_cmd(**params) + pipeline.add_step( + id=program_label, + 
description = description, + step=step, + cmd=cmd, + input_filepaths = input_filepaths, + output_filepaths = output_filepaths, + validate_inputs=True, + validate_outputs=True, + log_prefix=program_label, + + ) + + + + + # ============= + # Symlink + # ============= + step = 4 + + # Info + program = "symlink" + program_label = "{}__{}".format(step, program) + description = "Symlinking relevant output files" + + # Add to directories + output_directory = directories["output"] + + # i/o + + input_filepaths = [ + os.path.join(directories[("intermediate", "1__index")], "reference.fasta"), + os.path.join(directories[("intermediate", "1__index")], "reference.fasta.saf"), + os.path.join(directories[("intermediate", "1__index")], "seqkit_stats.tsv"), + os.path.join(directories[("intermediate", "2__alignment")], "*"), + os.path.join(directories[("intermediate", "3__featurecounts")], "featurecounts.tsv.gz"), + ] + + output_filenames = map(lambda fp: fp.split("/")[-1], input_filepaths) + output_filepaths = list(map(lambda fn:os.path.join(directories["output"], fn), output_filenames)) + + params = { + "input_filepaths":input_filepaths, + "output_filepaths":output_filepaths, + "output_directory":output_directory, + "opts":opts, + "directories":directories, + } + + cmd = get_symlink_cmd(**params) + pipeline.add_step( + id=program_label, + description = description, + step=step, + cmd=cmd, + input_filepaths = input_filepaths, + output_filepaths = output_filepaths, + validate_inputs=True, + validate_outputs=True, + log_prefix=program_label, + + ) + + return pipeline + +# Configure parameters +def configure_parameters(opts, directories): + # os.environ[] + + # assert not bool(opts.unpaired_reads), "Cannot have --unpaired_reads if --forward_reads. Note, this behavior may be changed in the future but it's an adaptation of interleaved reads." + df = pd.read_csv(opts.reads, sep="\t", header=None) + n, m = df.shape + assert m == 2, "--reads must be a 2 column table seperated by tabs and no header. Currently there are {} columns".format(m) + # Set environment variables + add_executables_to_environment(opts=opts) + +def main(args=None): + # Path info + script_directory = os.path.dirname(os.path.abspath( __file__ )) + script_filename = __program__ + # Path info + description = """ + Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable) + usage = "{} -f -r -o ".format(__program__) + epilog = "Copyright 2021 Josh L. Espinoza (jespinoz@jcvi.org)" + + # Parser + parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter) + # Pipeline + parser_io = parser.add_argument_group('Required I/O arguments') + parser_io.add_argument("-f","--fasta", type=str, required=True, help = "path/to/reference.fasta. Recommended usage is for merging unbinned contigs. 
[Required]") + parser_io.add_argument("-r","--reads", type=str, required = True, help = "path/to/reads_table.tsv with the following format: [id_sample][path/to/reads.fastq.gz], No header") + parser_io.add_argument("-o","--output_directory", type=str, default="veba_output/assembly/multisample", help = "path/to/project_directory [Default: veba_output/assembly/multisample]") + + # Utility + parser_utility = parser.add_argument_group('Utility arguments') + parser_utility.add_argument("--path_config", type=str, default="CONDA_PREFIX", help="path/to/config.tsv [Default: CONDA_PREFIX]") #site-packges in future + parser_utility.add_argument("-p", "--n_jobs", type=int, default=1, help = "Number of threads [Default: 1]") + parser_utility.add_argument("--random_state", type=int, default=0, help = "Random state [Default: 0]") + parser_utility.add_argument("--restart_from_checkpoint", type=str, default=None, help = "Restart from a particular checkpoint [Default: None]") + parser_utility.add_argument("-v", "--version", action='version', version="{} v{}".format(__program__, __version__)) + parser_utility.add_argument("--tmpdir", type=str, help="Set temporary directory") #site-packges in future + + # Aligner + parser_seqkit = parser.add_argument_group('SeqKit seq arguments') + parser_seqkit.add_argument("-m", "--minimum_contig_length", type=int, default=1, help="seqkit seq | Minimum contig length [Default: 1]") + parser_seqkit.add_argument("--seqkit_seq_options", type=str, default="", help="seqkit seq | More options (e.g. --arg 1 ) [Default: '']") + + + # Aligner + parser_aligner = parser.add_argument_group('Minmap2 arguments') + parser_aligner.add_argument("--minimap2_preset", type=str, default="map-ont", help="MiniMap2 | MiniMap2 preset {map-pb, map-ont, map-hifi} [Default: map-ont]") + parser_aligner.add_argument("--minimap2_index_options", type=str, default="", help="Minimap2 | More options (e.g. --arg 1 ) [Default: '']") + # parser_aligner.add_argument("--one_task_per_cpu", action="store_true", help="Use GNU parallel to run GNU parallel with 1 task per CPU. Useful if all samples are roughly the same size but inefficient if depth varies.") + parser_aligner.add_argument("--minimap2_options", type=str, default="", help="Minimap2 | More options (e.g. --arg 1 ) [Default: '']") + + # featureCounts + parser_featurecounts = parser.add_argument_group('featureCounts arguments') + parser_featurecounts.add_argument("--featurecounts_options", type=str, default="", help="featureCounts | More options (e.g. --arg 1 ) [Default: ''] | http://bioinf.wehi.edu.au/featureCounts/") + + + # Options + opts = parser.parse_args() + opts.script_directory = script_directory + opts.script_filename = script_filename + + # Threads + if opts.n_jobs == -1: + from multiprocessing import cpu_count + opts.n_jobs = cpu_count() + assert opts.n_jobs >= 1, "--n_jobs must be ≥ 1. To select all available threads, use -1." 
+ + # Directories + directories = dict() + directories["project"] = create_directory(opts.output_directory) + directories["output"] = create_directory(os.path.join(directories["project"], "output")) + directories["log"] = create_directory(os.path.join(directories["project"], "log")) + if not opts.tmpdir: + opts.tmpdir = os.path.join(directories["project"], "tmp") + directories["tmp"] = create_directory(opts.tmpdir) + directories["checkpoints"] = create_directory(os.path.join(directories["project"], "checkpoints")) + directories["intermediate"] = create_directory(os.path.join(directories["project"], "intermediate")) + os.environ["TMPDIR"] = directories["tmp"] + + # Info + print(format_header(__program__, "="), file=sys.stdout) + print(format_header("Configuration:", "-"), file=sys.stdout) + print("Python version:", sys.version.replace("\n"," "), file=sys.stdout) + print("Python path:", sys.executable, file=sys.stdout) #sys.path[2] + print("GenoPype version:", genopype_version, file=sys.stdout) #sys.path[2] + print("Script version:", __version__, file=sys.stdout) + print("Moment:", get_timestamp(), file=sys.stdout) + print("Directory:", os.getcwd(), file=sys.stdout) + if "TMPDIR" in os.environ: print(os.environ["TMPDIR"], file=sys.stdout) + print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout) + configure_parameters(opts, directories) + sys.stdout.flush() + + # Run pipeline + with open(os.path.join(directories["project"], "commands.sh"), "w") as f_cmds: + pipeline = create_pipeline( + opts=opts, + directories=directories, + f_cmds=f_cmds, + ) + pipeline.compile() + pipeline.execute(restart_from_checkpoint=opts.restart_from_checkpoint) + +if __name__ == "__main__": + main() diff --git a/src/coverage.py b/src/coverage.py index 77c0131..b7b331f 100755 --- a/src/coverage.py +++ b/src/coverage.py @@ -13,7 +13,7 @@ pd.options.display.max_colwidth = 100 # from tqdm import tqdm __program__ = os.path.split(sys.argv[0])[-1] -__version__ = "2023.10.16" +__version__ = "2023.11.30" # ............................................................................. # Notes @@ -525,7 +525,7 @@ def main(args=None): # Aligner parser_seqkit = parser.add_argument_group('SeqKit seq arguments') - parser_seqkit.add_argument("-m", "--minimum_contig_length", type=int, default=1500, help="seqkit seq | Minimum contig length [Default: 1500]") + parser_seqkit.add_argument("-m", "--minimum_contig_length", type=int, default=1, help="seqkit seq | Minimum contig length [Default: 1]") parser_seqkit.add_argument("--seqkit_seq_options", type=str, default="", help="seqkit seq | More options (e.g. 
--arg 1 ) [Default: '']") @@ -572,6 +572,7 @@ def main(args=None): print("Script version:", __version__, file=sys.stdout) print("Moment:", get_timestamp(), file=sys.stdout) print("Directory:", os.getcwd(), file=sys.stdout) + if "TMPDIR" in os.environ: print(os.environ["TMPDIR"], file=sys.stdout) print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout) configure_parameters(opts, directories) sys.stdout.flush() diff --git a/src/deprecated/preprocess.py b/src/deprecated/preprocess.py new file mode 100755 index 0000000..73adac7 --- /dev/null +++ b/src/deprecated/preprocess.py @@ -0,0 +1,151 @@ +#!/usr/bin/env python +from __future__ import print_function, division +import sys, os, argparse, glob +from collections import OrderedDict + +import pandas as pd + +# Soothsayer Ecosystem +from genopype import * +from genopype import __version__ as genopype_version + +from soothsayer_utils import * +import fastq_preprocessor + + +__program__ = os.path.split(sys.argv[0])[-1] +__version__ = "2023.11.28" + +# ============ +# Run Pipeline +# ============ +# Set environment variables +def add_executables_to_environment(opts): + """ + Adapted from Soothsayer: https://github.com/jolespin/soothsayer + """ + accessory_scripts = set([]) + + required_executables={ + "repair.sh", + "bbduk.sh", + "bowtie2", + "fastp", + "seqkit", + "fastq_preprocessor", + "minimap2", + "pigz", + "chopper", + } | accessory_scripts + + if opts.path_config == "CONDA_PREFIX": + executables = dict() + for name in required_executables: + executables[name] = os.path.join(os.environ["CONDA_PREFIX"], "bin", name) + else: + opts.path_config = format_path(opts.path_config) + assert os.path.exists(opts.path_config), "config file does not exist. Have you created one in the following directory?\n{}\nIf not, either create one, check this filepath:{}, or give the path to a proper config file using --path_config".format(opts.script_directory, opts.path_config) + assert os.stat(opts.path_config).st_size > 1, "config file seems to be empty. Please add 'name' and 'executable' columns for the following program names: {}".format(required_executables) + df_config = pd.read_csv(opts.path_config, sep="\t") + assert {"name", "executable"} <= set(df_config.columns), "config must have `name` and `executable` columns. Please adjust file: {}".format(opts.path_config) + df_config = df_config.loc[:,["name", "executable"]].dropna(how="any", axis=0).applymap(str) + # Get executable paths + executables = OrderedDict(zip(df_config["name"], df_config["executable"])) + assert required_executables <= set(list(executables.keys())), "config must have the required executables for this run. 
Please adjust file: {}\nIn particular, add info for the following: {}".format(opts.path_config, required_executables - set(list(executables.keys()))) + + # Display + for name in sorted(accessory_scripts): + executables[name] = "'{}'".format(os.path.join(opts.script_directory, "scripts", name)) # Can handle spaces in path + print(format_header( "Adding executables to path from the following source: {}".format(opts.path_config), "-"), file=sys.stdout) + for name, executable in executables.items(): + if name in required_executables: + print(name, executable, sep = " --> ", file=sys.stdout) + os.environ[name] = executable.strip() + print("", file=sys.stdout) + + +# Configure parameters +def configure_parameters(opts, directories): + + assert opts.forward_reads != opts.reverse_reads, "You probably mislabeled the input files because `r1` should not be the same as `r2`: {}".format(opts.forward_reads) + assert_acceptable_arguments(opts.retain_trimmed_reads, {0,1}) + assert_acceptable_arguments(opts.retain_decontaminated_reads, {0,1}) + + # Set environment variables + add_executables_to_environment(opts=opts) + +def main(args=None): + # Path info + script_directory = os.path.dirname(os.path.abspath( __file__ )) + script_filename = __program__ + # Path info + description = """ + Wrapper around github.com/jolespin/fastq_preprocessor + Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable) + usage = "{} -1 -2 -n -o |Optional| -x -k ".format(__program__) + epilog = "Copyright 2022 Josh L. Espinoza (jespinoz@jcvi.org)" + + # Parser + parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter) + # Pipeline + parser_io = parser.add_argument_group('Required I/O arguments') + parser_io.add_argument("-1","--forward_reads", type=str, help = "path/to/reads_1.fastq") + parser_io.add_argument("-2","--reverse_reads", type=str, help = "path/to/reads_2.fastq") + parser_io.add_argument("-n", "--name", type=str, help="Name of sample", required=True) + parser_io.add_argument("-o","--project_directory", type=str, default="veba_output/preprocess", help = "path/to/project_directory [Default: veba_output/preprocess]") + + # Utility + parser_utility = parser.add_argument_group('Utility arguments') + parser_utility.add_argument("--path_config", type=str, default="CONDA_PREFIX", help="path/to/config.tsv. Must have at least 2 columns [name, executable] [Default: CONDA_PREFIX]") #site-packges in future + parser_utility.add_argument("-p", "--n_jobs", type=int, default=1, help = "Number of threads [Default: 1]") + parser_utility.add_argument("--random_state", type=int, default=0, help = "Random state [Default: 0]") + parser_utility.add_argument("--restart_from_checkpoint", type=int, help = "Restart from a particular checkpoint") + parser_utility.add_argument("-v", "--version", action='version', version="{} v{}".format(__program__, __version__)) + + # Fastp + parser_fastp = parser.add_argument_group('Fastp arguments') + parser_fastp.add_argument("-m", "--minimum_read_length", type=int, default=75, help="Fastp | Minimum read length [Default: 75]") + parser_fastp.add_argument("-a", "--adapters", type=str, default="detect", help="Fastp | path/to/adapters.fasta [Default: detect]") + parser_fastp.add_argument("--fastp_options", type=str, default="", help="Fastp | More options (e.g. 
--arg 1 ) [Default: '']") + + # Bowtie + parser_bowtie2 = parser.add_argument_group('Bowtie2 arguments') + parser_bowtie2.add_argument("-x", "--contamination_index", type=str, help="Bowtie2 | path/to/contamination_index\n(e.g., Human T2T CHM13 v2 in $VEBA_DATABASE/Contamination/chm13v2.0/chm13v2.0)") + parser_bowtie2.add_argument("--retain_trimmed_reads", default=0, type=int, help = "Retain fastp trimmed fastq after decontamination. 0=No, 1=yes [Default: 0]") + parser_bowtie2.add_argument("--retain_contaminated_reads", default=0, type=int, help = "Retain contaminated fastq after decontamination. 0=No, 1=yes [Default: 0]") + parser_bowtie2.add_argument("--bowtie2_options", type=str, default="", help="Bowtie2 | More options (e.g. --arg 1 ) [Default: '']\nhttp://bowtie-bio.sourceforge.net/bowtie2/manual.shtml") + + # BBDuk + parser_bbduk = parser.add_argument_group('BBDuk arguments') + parser_bbduk.add_argument("-k","--kmer_database", type=str, help="BBDuk | path/to/kmer_database\n(e.g., Ribokmers in $VEBA_DATABASE/Contamination/kmers/ribokmers.fa.gz)") + parser_bbduk.add_argument("--kmer_size", type=int, default=31, help="BBDuk | k-mer size [Default: 31]") + parser_bbduk.add_argument("--retain_kmer_hits", default=0, type=int, help = "Retain reads that map to k-mer database. 0=No, 1=yes [Default: 0]") + parser_bbduk.add_argument("--retain_non_kmer_hits", default=0, type=int, help = "Retain reads that do not map to k-mer database. 0=No, 1=yes [Default: 0]") + parser_bbduk.add_argument("--bbduk_options", type=str, default="", help="BBDuk | More options (e.g., --arg 1) [Default: '']") + + # Options + opts = parser.parse_args() + # opts.script_directory = script_directory + # opts.script_filename = script_filename + + # Threads + if opts.n_jobs == -1: + from multiprocessing import cpu_count + opts.n_jobs = cpu_count() + assert opts.n_jobs >= 1, "--n_jobs must be ≥ 1. To select all available threads, use -1." 
+ + #Get arguments + args = list() + for k,v in opts.__dict__.items(): + if v is not None: + args += ["--{}".format(k), str(v)] + # args = flatten(map(lambda item: ("--{}".format(item[0]), item[1]), opts.__dict__.items())) + sys.argv = [sys.argv[0]] + args + + # Wrapper + fastq_preprocessor.main(args) + + + +if __name__ == "__main__": + main() diff --git a/src/index.py b/src/index.py index 8f532d4..5c10154 100755 --- a/src/index.py +++ b/src/index.py @@ -7,7 +7,7 @@ from soothsayer_utils import * __program__ = os.path.split(sys.argv[0])[-1] -__version__ = "2023.10.16" +__version__ = "2023.12.12" # ============== # Agostic commands @@ -22,11 +22,22 @@ def get_concatenate_fasta_cmd( input_filepaths, output_filepaths, output_directo "-i {}".format(input_filepaths[0]), "-o {}".format(output_directory), "-m {}".format(opts.minimum_contig_length), - "-x {}".format("fa.gz"), + "-x {}".format("fa.gz" if opts.reference_gzipped else "fa"), "-b reference", "-M {}".format(opts.mode), - + "&&", + + "cat", + os.path.join(output_directory, "reference.fa.gz" if opts.reference_gzipped else "reference.fa"), + "|", + os.environ["seqkit"], + "fx2tab", + "-i", + "-s", + "-n", + ">", + os.path.join(output_directory, "reference.id_to_hash.tsv"), ] return cmd @@ -51,22 +62,25 @@ def get_concatenate_gff_cmd( input_filepaths, output_filepaths, output_directory def get_bowtie2_local_cmd(input_filepaths, output_filepaths, output_directory, directories, opts): os.environ["TMPDIR"] = directories["tmp"] # Command + cmd = [ """ - +OUTPUT_DIRECTORY=%s +FASTA_FILENAME=%s for ID_SAMPLE in $(cut -f1 %s); - do %s --threads %d --seed %d %s/${ID_SAMPLE}/reference.fa.gz %s/${ID_SAMPLE}/reference.fa.gz + do %s --threads %d --seed %d ${OUTPUT_DIRECTORY}/${ID_SAMPLE}/${FASTA_FILENAME} ${OUTPUT_DIRECTORY}/${ID_SAMPLE}/${FASTA_FILENAME} done """%( + output_directory, + "reference.fa.gz" if opts.reference_gzipped else "reference.fa", opts.references, os.environ["bowtie2-build"], opts.n_jobs, opts.random_state, - output_directory, - output_directory, ), ] + return cmd # ============== @@ -115,10 +129,10 @@ def create_local_pipeline(opts, directories, f_cmds): ] output_filenames = [ - "*/reference.fa.gz", + "*/reference.fa.gz" if opts.reference_gzipped else "*/reference.fa", "*/reference.saf", - - ] + "*/reference.id_to_hash.tsv", + ] output_filepaths = list(map(lambda filename: os.path.join(output_directory, filename), output_filenames)) params = { @@ -207,8 +221,9 @@ def create_local_pipeline(opts, directories, f_cmds): # Info description = "Build mapping index" # i/o + input_filepaths = list( - map(lambda id_sample: os.path.join(directories["output"], id_sample, "reference.fa.gz"), + map(lambda id_sample: os.path.join(directories["output"], id_sample, "reference.fa.gz" if opts.reference_gzipped else "reference.fa"), opts.samples, ), ) @@ -273,8 +288,10 @@ def create_global_pipeline(opts, directories, f_cmds): ] output_filenames = [ - "reference.fa.gz", + "reference.fa.gz" if opts.reference_gzipped else "reference.fa", "reference.saf", + "reference.id_to_hash.tsv", + ] output_filepaths = list(map(lambda filename: os.path.join(output_directory, filename), output_filenames)) @@ -365,13 +382,22 @@ def create_global_pipeline(opts, directories, f_cmds): # Info description = "Build mapping index" # i/o - input_filepaths = [ - os.path.join(directories["output"], "reference.fa.gz"), - ] + if opts.reference_gzipped: + input_filepaths = [ + os.path.join(directories["output"], "reference.fa.gz"), + ] + + output_filenames = [ + 
"reference.fa.gz.*.bt2", + ] + else: + input_filepaths = [ + os.path.join(directories["output"], "reference.fa"), + ] - output_filenames = [ - "reference.fa.gz.*.bt2", - ] + output_filenames = [ + "reference.fa.*.bt2", + ] output_filepaths = list(map(lambda filename: os.path.join(output_directory, filename), output_filenames)) params = { @@ -417,7 +443,8 @@ def add_executables_to_environment(opts): required_executables = set([ - "bowtie2-build", + "seqkit", + "bowtie2-build", ])| accessory_scripts if opts.path_config == "CONDA_PREFIX": @@ -509,8 +536,9 @@ def main(args=None): parser_io.add_argument("-r","--references", type=str, required=True, help = "local mode: [id_sample][path/to/reference.fa] and global mode: [path/to/reference.fa]") parser_io.add_argument("-g","--gene_models", type=str, required=True, help = "local mode: [id_sample][path/to/reference.gff] and global mode: [path/to/reference.gff]") parser_io.add_argument("-o","--output_directory", type=str, default="veba_output/index", help = "path/to/project_directory [Default: veba_output/index]") - parser_io.add_argument("-m", "--minimum_contig_length", type=int, default=1500, help="Minimum contig length [Default: 1500]") + parser_io.add_argument("-m", "--minimum_contig_length", type=int, default=1, help="Minimum contig length [Default: 1]") parser_io.add_argument("-M", "--mode", type=str, default="infer", help="Concatenate all references with global and build index or build index for each reference {global, local, infer}") + parser_io.add_argument("-z", "--reference_gzipped",action="store_true", help="Gzip the reference to generate `reference.fa.gz` instead of `reference.fa`") # parser_io.add_argument("-c", "--copy_files", action="store_true", help="Copy files instead of symlinking. Only applies to global.") # Utility @@ -559,6 +587,7 @@ def main(args=None): print("Script version:", __version__, file=sys.stdout) print("Moment:", get_timestamp(), file=sys.stdout) print("Directory:", os.getcwd(), file=sys.stdout) + if "TMPDIR" in os.environ: print(os.environ["TMPDIR"], file=sys.stdout) print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout) configure_parameters(opts, directories) sys.stdout.flush() diff --git a/src/mapping.py b/src/mapping.py index 8db61cc..b06175c 100755 --- a/src/mapping.py +++ b/src/mapping.py @@ -13,7 +13,7 @@ pd.options.display.max_colwidth = 100 # from tqdm import tqdm __program__ = os.path.split(sys.argv[0])[-1] -__version__ = "2023.10.16" +__version__ = "2023.12.12" # Bowtie2 @@ -451,9 +451,12 @@ def configure_parameters(opts, directories): assert os.path.isdir(opts.reference_index), "If --reference_saf is not provided, then --reference_index must be provided as a directory containing a file 'reference.saf'" opts.reference_saf = os.path.join(opts.reference_index, "reference.saf") - # Check if --reference_index is a directory, if it is then set reference.fa.gz as the directory + # Check if --reference_index is a directory, if it is then set reference.fa as the directory if os.path.isdir(opts.reference_index): - opts.reference_index = os.path.join(opts.reference_index, "reference.fa.gz") + if opts.reference_gzipped: + opts.reference_index = os.path.join(opts.reference_index, "reference.fa.gz") + else: + opts.reference_index = os.path.join(opts.reference_index, "reference.fa") # If --reference_fasta isn't provided then set it to the --reference_index if opts.reference_fasta is None: @@ -491,10 +494,11 @@ def main(args=None): parser_io.add_argument("-o","--project_directory", type=str, 
default="veba_output/mapping", help = "path/to/project_directory [Default: veba_output/mapping]") parser_reference = parser.add_argument_group('Reference arguments') - parser_reference.add_argument("-x", "--reference_index",type=str, required=True, help="path/to/bowtie2_index. Either a file or directory. If directory, then it assumes the index is named `reference.fa.gz`") + parser_reference.add_argument("-x", "--reference_index",type=str, required=True, help="path/to/bowtie2_index. Either a file or directory. If directory, then it assumes the index is named `reference.fa`") parser_reference.add_argument("-r", "--reference_fasta", type=str, required=False, help = "path/to/reference.fasta. If not provided then it is set to the --reference_index" ) # ; or (2) a directory of fasta files [Must all have the same extension. Use `query_ext` argument] parser_reference.add_argument("-a", "--reference_gff",type=str, required=False, help="path/to/reference.gff. If not provided then --reference_index must be a directory that contains the file: 'reference.gff'") parser_reference.add_argument("-s", "--reference_saf",type=str, required=False, help="path/to/reference.saf. If not provided then --reference_index must be a directory that contains the file: 'reference.saf'") + parser_reference.add_argument("-z", "--reference_gzipped",action="store_true", help="If --reference_index directory, then it assumes the index is named `reference.fa.gz` instead of `reference.fa`") # parser_io.add_argument("-S","--scaffold_identifier_mapping", type=str, required=False, help = "path/to/scaffold_identifiers.tsv, Format: [id_scaffold][id_mag][id_cluster], No header") # parser_io.add_argument("-O","--orf_identifier_mapping", type=str, required=False, help = "path/to/scaffold_identifiers.tsv, Format: [id_scaffold][id_mag][id_cluster], No header") @@ -558,6 +562,7 @@ def main(args=None): print("Script version:", __version__, file=sys.stdout) print("Moment:", get_timestamp(), file=sys.stdout) print("Directory:", os.getcwd(), file=sys.stdout) + if "TMPDIR" in os.environ: print(os.environ["TMPDIR"], file=sys.stdout) print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout) configure_parameters(opts, directories) sys.stdout.flush() diff --git a/src/phylogeny.py b/src/phylogeny.py index 002cce6..0730b4e 100755 --- a/src/phylogeny.py +++ b/src/phylogeny.py @@ -14,7 +14,7 @@ pd.options.display.max_colwidth = 100 # from tqdm import tqdm __program__ = os.path.split(sys.argv[0])[-1] -__version__ = "2023.10.27" +__version__ = "2023.11.30" # Assembly def preprocess( input_filepaths, output_filepaths, output_directory, directories, opts): @@ -650,6 +650,7 @@ def main(args=None): print("Script version:", __version__, file=sys.stdout) print("Moment:", get_timestamp(), file=sys.stdout) print("Directory:", os.getcwd(), file=sys.stdout) + if "TMPDIR" in os.environ: print(os.environ["TMPDIR"], file=sys.stdout) print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout) configure_parameters(opts, directories) sys.stdout.flush() diff --git a/src/preprocess-long.py b/src/preprocess-long.py new file mode 100755 index 0000000..fe1e58a --- /dev/null +++ b/src/preprocess-long.py @@ -0,0 +1,21 @@ +#!/usr/bin/env python +from __future__ import print_function, division +import sys, os, argparse +from soothsayer_utils import format_header, read_script_as_module + +script_directory = os.path.dirname(os.path.abspath( __file__ )) + +try: + from fastq_preprocessor import fastq_preprocessor_long +except ImportError: + 
fastq_preprocessor_long = read_script_as_module("fastq_preprocessor_long", os.path.join(script_directory, "fastq_preprocessor_long.py")) + +__program__ = os.path.split(sys.argv[0])[-1] +__version__ = "2023.11.29" + +if __name__ == "__main__": + print(format_header("VEBA Preprocessing Wrapper (fastq_preprocessor v{})".format(fastq_preprocessor_long.__version__)), file=sys.stderr) + label = "Mode: Long Nanopore and PacBio reads" + print(label, file=sys.stderr) + print(len(label)*"-", file=sys.stderr) + fastq_preprocessor_long.main(sys.argv[1:]) diff --git a/src/preprocess.py b/src/preprocess.py index d28ccc2..146b03c 100755 --- a/src/preprocess.py +++ b/src/preprocess.py @@ -1,148 +1,21 @@ #!/usr/bin/env python from __future__ import print_function, division -import sys, os, argparse, glob -from collections import OrderedDict - -import pandas as pd - -# Soothsayer Ecosystem -from genopype import * -from genopype import __version__ as genopype_version - -from soothsayer_utils import * -import fastq_preprocessor +import sys, os, argparse +from soothsayer_utils import format_header, read_script_as_module +script_directory = os.path.dirname(os.path.abspath( __file__ )) +try: + from fastq_preprocessor import fastq_preprocessor_short +except ImportError: + fastq_preprocessor_short = read_script_as_module("fastq_preprocessor_short", os.path.join(script_directory, "fastq_preprocessor_short.py")) + __program__ = os.path.split(sys.argv[0])[-1] -__version__ = "2023.10.16" - -# ============ -# Run Pipeline -# ============ -# Set environment variables -def add_executables_to_environment(opts): - """ - Adapted from Soothsayer: https://github.com/jolespin/soothsayer - """ - accessory_scripts = set([]) - - required_executables={ - "repair.sh", - "bbduk.sh", - "bowtie2", - "fastp", - "seqkit", - "fastq_preprocessor", - } | accessory_scripts - - if opts.path_config == "CONDA_PREFIX": - executables = dict() - for name in required_executables: - executables[name] = os.path.join(os.environ["CONDA_PREFIX"], "bin", name) - else: - opts.path_config = format_path(opts.path_config) - assert os.path.exists(opts.path_config), "config file does not exist. Have you created one in the following directory?\n{}\nIf not, either create one, check this filepath:{}, or give the path to a proper config file using --path_config".format(opts.script_directory, opts.path_config) - assert os.stat(opts.path_config).st_size > 1, "config file seems to be empty. Please add 'name' and 'executable' columns for the following program names: {}".format(required_executables) - df_config = pd.read_csv(opts.path_config, sep="\t") - assert {"name", "executable"} <= set(df_config.columns), "config must have `name` and `executable` columns. Please adjust file: {}".format(opts.path_config) - df_config = df_config.loc[:,["name", "executable"]].dropna(how="any", axis=0).applymap(str) - # Get executable paths - executables = OrderedDict(zip(df_config["name"], df_config["executable"])) - assert required_executables <= set(list(executables.keys())), "config must have the required executables for this run. 
Please adjust file: {}\nIn particular, add info for the following: {}".format(opts.path_config, required_executables - set(list(executables.keys()))) - - # Display - for name in sorted(accessory_scripts): - executables[name] = "'{}'".format(os.path.join(opts.script_directory, "scripts", name)) # Can handle spaces in path - print(format_header( "Adding executables to path from the following source: {}".format(opts.path_config), "-"), file=sys.stdout) - for name, executable in executables.items(): - if name in required_executables: - print(name, executable, sep = " --> ", file=sys.stdout) - os.environ[name] = executable.strip() - print("", file=sys.stdout) - - -# Configure parameters -def configure_parameters(opts, directories): - - assert opts.forward_reads != opts.reverse_reads, "You probably mislabeled the input files because `r1` should not be the same as `r2`: {}".format(opts.forward_reads) - assert_acceptable_arguments(opts.retain_trimmed_reads, {0,1}) - assert_acceptable_arguments(opts.retain_decontaminated_reads, {0,1}) - - # Set environment variables - add_executables_to_environment(opts=opts) - -def main(args=None): - # Path info - script_directory = os.path.dirname(os.path.abspath( __file__ )) - script_filename = __program__ - # Path info - description = """ - Wrapper around github.com/jolespin/fastq_preprocessor - Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable) - usage = "{} -1 -2 -n -o |Optional| -x -k ".format(__program__) - epilog = "Copyright 2022 Josh L. Espinoza (jespinoz@jcvi.org)" - - # Parser - parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter) - # Pipeline - parser_io = parser.add_argument_group('Required I/O arguments') - parser_io.add_argument("-1","--forward_reads", type=str, help = "path/to/reads_1.fastq") - parser_io.add_argument("-2","--reverse_reads", type=str, help = "path/to/reads_2.fastq") - parser_io.add_argument("-n", "--name", type=str, help="Name of sample", required=True) - parser_io.add_argument("-o","--project_directory", type=str, default="veba_output/preprocess", help = "path/to/project_directory [Default: veba_output/preprocess]") - - # Utility - parser_utility = parser.add_argument_group('Utility arguments') - parser_utility.add_argument("--path_config", type=str, default="CONDA_PREFIX", help="path/to/config.tsv. Must have at least 2 columns [name, executable] [Default: CONDA_PREFIX]") #site-packges in future - parser_utility.add_argument("-p", "--n_jobs", type=int, default=1, help = "Number of threads [Default: 1]") - parser_utility.add_argument("--random_state", type=int, default=0, help = "Random state [Default: 0]") - parser_utility.add_argument("--restart_from_checkpoint", type=int, help = "Restart from a particular checkpoint") - parser_utility.add_argument("-v", "--version", action='version', version="{} v{}".format(__program__, __version__)) - - # Fastp - parser_fastp = parser.add_argument_group('Fastp arguments') - parser_fastp.add_argument("-m", "--minimum_read_length", type=int, default=75, help="Fastp | Minimum read length [Default: 75]") - parser_fastp.add_argument("-a", "--adapters", type=str, default="detect", help="Fastp | path/to/adapters.fasta [Default: detect]") - parser_fastp.add_argument("--fastp_options", type=str, default="", help="Fastp | More options (e.g. 
--arg 1 ) [Default: '']") - - # Bowtie - parser_bowtie2 = parser.add_argument_group('Bowtie2 arguments') - parser_bowtie2.add_argument("-x", "--contamination_index", type=str, help="Bowtie2 | path/to/contamination_index\n(e.g., Human T2T CHM13 v2 in $VEBA_DATABASE/Contamination/chm13v2.0/chm13v2.0)") - parser_bowtie2.add_argument("--retain_trimmed_reads", default=0, type=int, help = "Retain fastp trimmed fastq after decontamination. 0=No, 1=yes [Default: 0]") - parser_bowtie2.add_argument("--retain_contaminated_reads", default=0, type=int, help = "Retain contaminated fastq after decontamination. 0=No, 1=yes [Default: 0]") - parser_bowtie2.add_argument("--bowtie2_options", type=str, default="", help="Bowtie2 | More options (e.g. --arg 1 ) [Default: '']\nhttp://bowtie-bio.sourceforge.net/bowtie2/manual.shtml") - - # BBDuk - parser_bbduk = parser.add_argument_group('BBDuk arguments') - parser_bbduk.add_argument("-k","--kmer_database", type=str, help="BBDuk | path/to/kmer_database\n(e.g., Ribokmers in $VEBA_DATABASE/Contamination/kmers/ribokmers.fa.gz)") - parser_bbduk.add_argument("--kmer_size", type=int, default=31, help="BBDuk | k-mer size [Default: 31]") - parser_bbduk.add_argument("--retain_kmer_hits", default=0, type=int, help = "Retain reads that map to k-mer database. 0=No, 1=yes [Default: 0]") - parser_bbduk.add_argument("--retain_non_kmer_hits", default=0, type=int, help = "Retain reads that do not map to k-mer database. 0=No, 1=yes [Default: 0]") - parser_bbduk.add_argument("--bbduk_options", type=str, default="", help="BBDuk | More options (e.g., --arg 1) [Default: '']") - - # Options - opts = parser.parse_args() - # opts.script_directory = script_directory - # opts.script_filename = script_filename - - # Threads - if opts.n_jobs == -1: - from multiprocessing import cpu_count - opts.n_jobs = cpu_count() - assert opts.n_jobs >= 1, "--n_jobs must be ≥ 1. To select all available threads, use -1." 
- - #Get arguments - args = list() - for k,v in opts.__dict__.items(): - if v is not None: - args += ["--{}".format(k), str(v)] - # args = flatten(map(lambda item: ("--{}".format(item[0]), item[1]), opts.__dict__.items())) - sys.argv = [sys.argv[0]] + args - - # Wrapper - fastq_preprocessor.main(args) - - +__version__ = "2023.11.29" if __name__ == "__main__": - main() + print(format_header("VEBA Preprocessing Wrapper (fastq_preprocessor v{})".format(fastq_preprocessor_short.__version__)), file=sys.stderr) + label = "Mode: Paired Illumina Reads" + print(label, file=sys.stderr) + print(len(label)*"-", file=sys.stderr) + fastq_preprocessor_short.main(sys.argv[1:]) \ No newline at end of file diff --git a/src/profile-pathway.py b/src/profile-pathway.py index 3f674f4..d84738a 100755 --- a/src/profile-pathway.py +++ b/src/profile-pathway.py @@ -13,7 +13,7 @@ pd.options.display.max_colwidth = 100 # from tqdm import tqdm __program__ = os.path.split(sys.argv[0])[-1] -__version__ = "2023.10.16" +__version__ = "2023.11.30" DIAMOND_DATABASE_SUFFIX = "_v201901b.dmnd" @@ -625,6 +625,7 @@ def main(args=None): print("Script version:", __version__, file=sys.stdout) print("Moment:", get_timestamp(), file=sys.stdout) print("Directory:", os.getcwd(), file=sys.stdout) + if "TMPDIR" in os.environ: print(os.environ["TMPDIR"], file=sys.stdout) print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout) configure_parameters(opts, directories) sys.stdout.flush() diff --git a/src/profile-taxonomy.py b/src/profile-taxonomy.py new file mode 100755 index 0000000..2aa4db0 --- /dev/null +++ b/src/profile-taxonomy.py @@ -0,0 +1,357 @@ +#!/usr/bin/env python +from __future__ import print_function, division +import sys, os, argparse, glob, gzip +from collections import OrderedDict, defaultdict + +import pandas as pd + +# Soothsayer Ecosystem +from genopype import * +from genopype import __version__ as genopype_version +from soothsayer_utils import * + +pd.options.display.max_colwidth = 100 +# from tqdm import tqdm +__program__ = os.path.split(sys.argv[0])[-1] +__version__ = "2023.12.19" + +# Preprocess reads +def get_sylph_sketch_cmd( input_filepaths, output_filepaths, output_directory, directories, opts): + cmd = [ + os.environ["sylph"], + "sketch", + "-t {}".format(opts.n_jobs), + "-c {}".format(opts.sylph_sketch_subsampling_rate), + "-k {}".format(opts.sylph_sketch_k), + "--min-spacing {}".format(opts.sylph_sketch_minimum_spacing), + "-1 {}".format(opts.forward_reads), + "-2 {}".format(opts.reverse_reads), + "-d {}".format(output_directory), + + "&&", + + "mv", + "-v", + os.path.join(output_directory, "{}.paired.sylsp".format(os.path.split(opts.forward_reads)[1])), + os.path.join(output_directory, "reads.sylsp"), + ] + + return cmd + +def get_sylph_profile_cmd( input_filepaths, output_filepaths, output_directory, directories, opts): + # Command + cmd = [ + os.environ["sylph"], + "profile", + "-t {}".format(opts.n_jobs), + "--minimum-ani {}".format(opts.sylph_profile_minimum_ani), + "--min-number-kmers {}".format(opts.sylph_profile_minimum_number_kmers), + "--min-count-correct {}".format(opts.sylph_profile_minimum_count_correct), + opts.sylph_profile_options, + " ".join(input_filepaths), + "|", + "gzip", + ">", + os.path.join(output_directory, "sylph_profile.tsv.gz"), + + "&&", + + os.environ["reformat_sylph_profile_single_sample_output.py"], + "-i {}".format(os.path.join(output_directory, "sylph_profile.tsv.gz")), + "-o {}".format(output_directory), + "-c {}".format(opts.genome_clusters) if 
opts.genome_clusters else "", + "-f Taxonomic_abundance", + "-x {}".format(opts.extension), + "--header" if opts.header else "", + ] + + return cmd + + + +# ============ +# Run Pipeline +# ============ +# Set environment variables +def add_executables_to_environment(opts): + """ + Adapted from Soothsayer: https://github.com/jolespin/soothsayer + """ + accessory_scripts = set([ + "reformat_sylph_profile_single_sample_output.py", + ] + ) + + required_executables={ + "sylph", + # "seqkit", + + } | accessory_scripts + + if opts.path_config == "CONDA_PREFIX": + executables = dict() + for name in required_executables: + executables[name] = os.path.join(os.environ["CONDA_PREFIX"], "bin", name) + else: + if opts.path_config is None: + opts.path_config = os.path.join(opts.script_directory, "veba_config.tsv") + opts.path_config = format_path(opts.path_config) + assert os.path.exists(opts.path_config), "config file does not exist. Have you created one in the following directory?\n{}\nIf not, either create one, check this filepath:{}, or give the path to a proper config file using --path_config".format(opts.script_directory, opts.path_config) + assert os.stat(opts.path_config).st_size > 1, "config file seems to be empty. Please add 'name' and 'executable' columns for the following program names: {}".format(required_executables) + df_config = pd.read_csv(opts.path_config, sep="\t") + assert {"name", "executable"} <= set(df_config.columns), "config must have `name` and `executable` columns. Please adjust file: {}".format(opts.path_config) + df_config = df_config.loc[:,["name", "executable"]].dropna(how="any", axis=0).applymap(str) + # Get executable paths + executables = OrderedDict(zip(df_config["name"], df_config["executable"])) + assert required_executables <= set(list(executables.keys())), "config must have the required executables for this run. Please adjust file: {}\nIn particular, add info for the following: {}".format(opts.path_config, required_executables - set(list(executables.keys()))) + + # Display + for name in sorted(accessory_scripts): + executables[name] = "'{}'".format(os.path.join(opts.script_directory, "scripts", name)) # Can handle spaces in path + + print(format_header( "Adding executables to path from the following source: {}".format(opts.path_config), "-"), file=sys.stdout) + for name, executable in executables.items(): + if name in required_executables: + print(name, executable, sep = " --> ", file=sys.stdout) + os.environ[name] = executable.strip() + print("", file=sys.stdout) + + +# Pipeline +def create_pipeline(opts, directories, f_cmds): + + # ................................................................. + # Primordial + # ................................................................. 
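For orientation before the pipeline is assembled below: `sylph profile` emits a TSV with one row per detected genome, and the accessory script wired into the command above reshapes it into a per-genome abundance table. Here is a minimal pandas sketch of that reshaping, assuming sylph's documented `Genome_file` and `Taxonomic_abundance` columns; the function is illustrative, not the actual `reformat_sylph_profile_single_sample_output.py`.

```python
import os
import pandas as pd

def reformat_profile(profile_tsv, extension="fa"):
    # sylph writes one row per detected genome
    df = pd.read_csv(profile_tsv, sep="\t")
    # Recover the genome identifier by stripping the directory and ".<extension>" suffix
    genome_ids = df["Genome_file"].map(
        lambda fp: os.path.basename(fp)[: -(len(extension) + 1)]
    )
    return pd.Series(df["Taxonomic_abundance"].values, index=genome_ids)

# abundances = reformat_profile("sylph_profile.tsv.gz", extension="fa")
# abundances.to_csv("taxonomic_abundance.tsv.gz", sep="\t", header=False)
```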
+ # Commands file + pipeline = ExecutablePipeline(name=__program__, description=opts.name, f_cmds=f_cmds, checkpoint_directory=directories["checkpoints"], log_directory=directories["log"]) + + # ========== + # Preprocess reads + # ========== + + if opts.input_reads_format == "paired": + + step = 0 + + # Info + program = "sylph_sketch" + program_label = "{}__{}".format(step, program) + description = "Sketch input reads" + + # Add to directories + output_directory = directories["output"] + # i/o + input_filepaths = [opts.forward_reads, opts.reverse_reads] + output_filepaths = [ + os.path.join(output_directory, "reads.sylsp"), + ] + + params = { + "input_filepaths":input_filepaths, + "output_filepaths":output_filepaths, + "output_directory":output_directory, + "opts":opts, + "directories":directories, + } + + cmd = get_sylph_sketch_cmd(**params) + pipeline.add_step( + id=program_label, + description = description, + step=step, + cmd=cmd, + input_filepaths = input_filepaths, + output_filepaths = output_filepaths, + validate_inputs=True, + validate_outputs=True, + log_prefix=program_label, + ) + else: + output_filepaths = [opts.reads_sketch] + + + # ========== + # Profile + # ========== + + step = 1 + + # Info + program = "sylph_profile" + program_label = "{}__{}".format(step, program) + description = "Profile genome databases" + + # Add to directories + output_directory = directories["output"] + + # i/o + input_filepaths = output_filepaths + opts.sylph_databases + + + output_filepaths = [ + os.path.join(output_directory, "sylph_profile.tsv.gz"), + os.path.join(output_directory, "taxonomic_abundance.tsv.gz"), + ] + if opts.genome_clusters: + input_filepaths += [ + opts.genome_clusters, + ] + output_filepaths += [ + os.path.join(output_directory, "taxonomic_abundance.clusters.tsv.gz"), + ] + + params = { + "input_filepaths":input_filepaths, + "output_filepaths":output_filepaths, + "output_directory":output_directory, + "opts":opts, + "directories":directories, + } + + cmd = get_sylph_profile_cmd(**params) + pipeline.add_step( + id=program_label, + description = description, + step=step, + cmd=cmd, + input_filepaths = input_filepaths, + output_filepaths = output_filepaths, + validate_inputs=True, + validate_outputs=True, + log_prefix=program_label, + + ) + + + + return pipeline + +# Configure parameters +def configure_parameters(opts, directories): + + for db in opts.sylph_databases: + assert db.endswith(".syldb"), "{} must have .syldb file extension".format(db) + + # --input_reads_format + assert_acceptable_arguments(opts.input_reads_format, {"paired", "sketch", "auto"}) + if opts.input_reads_format == "auto": + if any([opts.forward_reads, opts.reverse_reads]): + assert opts.forward_reads != opts.reverse_reads, "You probably mislabeled the input files because `forward_reads` should not be the same as `reverse_reads`: {}".format(opts.forward_reads) + assert opts.forward_reads is not None, "If running in --input_reads_format paired mode, --forward_reads and --reverse_reads are needed." + assert opts.reverse_reads is not None, "If running in --input_reads_format paired mode, --forward_reads and --reverse_reads are needed." 
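The branch that follows completes the auto-detection: raw paired fastq files and a prebuilt `.sylsp` sketch are mutually exclusive inputs. The same logic, distilled into a standalone function for clarity (a restatement, not code from this module):

```python
def detect_reads_format(forward_reads, reverse_reads, reads_sketch):
    """Infer whether the input is raw paired fastq or a prebuilt sylph sketch."""
    if reads_sketch is not None:
        if forward_reads or reverse_reads:
            raise ValueError("Provide either a sketch or paired reads, not both")
        return "sketch"
    if not (forward_reads and reverse_reads):
        raise ValueError("Paired mode requires both --forward_reads and --reverse_reads")
    if forward_reads == reverse_reads:
        raise ValueError("--forward_reads and --reverse_reads must be different files")
    return "paired"

# detect_reads_format("r1.fq.gz", "r2.fq.gz", None)  # -> "paired"
# detect_reads_format(None, None, "reads.sylsp")     # -> "sketch"
```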
+ opts.input_reads_format = "paired"
+ if opts.reads_sketch is not None:
+ assert opts.forward_reads is None, "If running in --input_reads_format sketch mode, you cannot provide --forward_reads, --reverse_reads"
+ assert opts.reverse_reads is None, "If running in --input_reads_format sketch mode, you cannot provide --forward_reads, --reverse_reads"
+ opts.input_reads_format = "sketch"
+
+ print("Auto detecting reads format: {}".format(opts.input_reads_format), file=sys.stdout)
+ assert_acceptable_arguments(opts.input_reads_format, {"paired", "sketch"})
+
+ # Set environment variables
+ add_executables_to_environment(opts=opts)
+
+def main(args=None):
+ # Path info
+ script_directory = os.path.dirname(os.path.abspath( __file__ ))
+ script_filename = __program__
+ # Path info
+ description = """
+ Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable)
+ usage = "{} -1 -2 |-s -n -o -d ".format(__program__)
+ epilog = "Copyright 2021 Josh L. Espinoza (jespinoz@jcvi.org)"
+
+
+ # Parser
+ parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter)
+
+ # Pipeline
+ parser_io = parser.add_argument_group('Required I/O arguments')
+ parser_io.add_argument("-1","--forward_reads", type=str, help = "path/to/forward_reads.fq[.gz]")
+ parser_io.add_argument("-2","--reverse_reads", type=str, help = "path/to/reverse_reads.fq[.gz]")
+ parser_io.add_argument("-s","--reads_sketch", type=str, help = "path/to/reads_sketch.sylsp (e.g., sylph sketch output) (Cannot be used with --forward_reads and --reverse_reads)")
+ parser_io.add_argument("-n", "--name", type=str, required=True, help="Name of sample")
+ parser_io.add_argument("-d","--sylph_databases", type=str, nargs="+", required=True, help = "Sylph database(s) with all genomes. Can be multiple databases delimited by spaces. Use compile_custom_sylph_sketch_database_from_genomes.py to build database.")
+ parser_io.add_argument("-o","--project_directory", type=str, default="veba_output/profiling/taxonomy", help = "path/to/project_directory [Default: veba_output/profiling/taxonomy]")
+ parser_io.add_argument("-c","--genome_clusters", type=str, help = "path/to/mags_to_slcs.tsv. [id_genome][id_genome-cluster], No header. Aggregates counts for genome clusters.")
+ parser_io.add_argument("-F", "--input_reads_format", choices={"paired", "sketch"}, type=str, default="auto", help = "Input reads format {paired, sketch} [Default: auto]")
+ parser_io.add_argument("-x","--extension", type=str, default="fa", help = "Fasta file extension for bins. Assumes all genomes have the same file extension. [Default: fa]")
+
+
+ # Utility
+ parser_utility = parser.add_argument_group('Utility arguments')
+ parser_utility.add_argument("--path_config", type=str, default="CONDA_PREFIX", help="path/to/config.tsv [Default: CONDA_PREFIX]") #site-packages in future
+ parser_utility.add_argument("-p", "--n_jobs", type=int, default=1, help = "Number of threads [Default: 1]")
+ parser_utility.add_argument("--random_state", type=int, default=0, help = "Random state [Default: 0]")
+ parser_utility.add_argument("--restart_from_checkpoint", type=str, default=None, help = "Restart from a particular checkpoint [Default: None]")
+ parser_utility.add_argument("-v", "--version", action='version', version="{} v{}".format(__program__, __version__))
+ parser_utility.add_argument("--tmpdir", type=str, help="Set temporary directory")
+
+ # Sylph
+ parser_sylph_sketch = parser.add_argument_group('Sylph sketch arguments (Fastq)')
+ parser_sylph_sketch.add_argument("--sylph_sketch_k", type=int, choices={21,31}, default=31, help="Sylph sketch [Fastq] | Value of k. Only k = 21, 31 are currently supported. [Default: 31]")
+ parser_sylph_sketch.add_argument("--sylph_sketch_minimum_spacing", type=int, default=30, help="Sylph sketch [Fastq] | Minimum spacing between selected k-mers on the genomes [Default: 30]")
+ parser_sylph_sketch.add_argument("--sylph_sketch_subsampling_rate", type=int, default=100, help="Sylph sketch [Fastq] | Subsampling rate. sylph runs without issues if the -c for all genomes is ≥ the -c for reads. [Default: 100]")
+ parser_sylph_sketch.add_argument("--sylph_sketch_options", type=str, default="", help="Sylph sketch [Fastq] | More options for `sylph sketch` (e.g. --arg 1 ) [Default: '']")
+
+ parser_sylph_profile = parser.add_argument_group('Sylph profile arguments')
+ parser_sylph_profile.add_argument("--sylph_profile_minimum_ani", type=float, default=95, help="Sylph profile | Minimum adjusted ANI to consider (0-100). [Default: 95]")
+ parser_sylph_profile.add_argument("--sylph_profile_minimum_number_kmers", type=int, default=20, help="Sylph profile | Exclude genomes with fewer than this number of sampled k-mers. Default is 50 in Sylph but lowering to 20 accounts for viruses and small CPR genomes. [Default: 20]")
+ parser_sylph_profile.add_argument("--sylph_profile_minimum_count_correct", type=int, default=3, help="Sylph profile | Minimum k-mer multiplicity needed for coverage correction. Higher values give more precision but lower sensitivity [Default: 3]")
+ parser_sylph_profile.add_argument("--sylph_profile_options", type=str, default="", help="Sylph profile | More options for `sylph profile` (e.g. --arg 1 ) [Default: '']")
+ parser_sylph_profile.add_argument("--header", action="store_true", help = "Include header in taxonomic abundance tables")
+
+ # Options
+ opts = parser.parse_args()
+ opts.script_directory = script_directory
+ opts.script_filename = script_filename
+
+ # Threads
+ if opts.n_jobs == -1:
+ from multiprocessing import cpu_count
+ opts.n_jobs = cpu_count()
+ assert opts.n_jobs >= 1, "--n_jobs must be ≥ 1. To select all available threads, use -1."
+ + + # Directories + directories = dict() + directories["project"] = create_directory(opts.project_directory) + directories["sample"] = create_directory(os.path.join(directories["project"], opts.name)) + directories["output"] = create_directory(os.path.join(directories["sample"], "output")) + + directories["log"] = create_directory(os.path.join(directories["sample"], "log")) + if not opts.tmpdir: + opts.tmpdir = os.path.join(directories["sample"], "tmp") + directories["tmp"] = create_directory(opts.tmpdir) + directories["checkpoints"] = create_directory(os.path.join(directories["sample"], "checkpoints")) + directories["intermediate"] = create_directory(os.path.join(directories["sample"], "intermediate")) + os.environ["TMPDIR"] = directories["tmp"] + + # Info + print(format_header(__program__, "="), file=sys.stdout) + print(format_header("Configuration:", "-"), file=sys.stdout) + print(format_header("Name: {}".format(opts.name), "."), file=sys.stdout) + print("Python version:", sys.version.replace("\n"," "), file=sys.stdout) + print("Python path:", sys.executable, file=sys.stdout) #sys.path[2] + print("GenoPype version:", genopype_version, file=sys.stdout) #sys.path[2] + print("Script version:", __version__, file=sys.stdout) + print("Moment:", get_timestamp(), file=sys.stdout) + print("Directory:", os.getcwd(), file=sys.stdout) + if "TMPDIR" in os.environ: print(os.environ["TMPDIR"], file=sys.stdout) + print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout) + configure_parameters(opts, directories) + sys.stdout.flush() + + # Run pipeline + with open(os.path.join(directories["sample"], "commands.sh"), "w") as f_cmds: + pipeline = create_pipeline( + opts=opts, + directories=directories, + f_cmds=f_cmds, + ) + pipeline.compile() + pipeline.execute(restart_from_checkpoint=opts.restart_from_checkpoint) + +if __name__ == "__main__": + main() diff --git a/src/scripts/binning_wrapper.py b/src/scripts/binning_wrapper.py index cfd1c0e..a9904e6 100755 --- a/src/scripts/binning_wrapper.py +++ b/src/scripts/binning_wrapper.py @@ -12,7 +12,7 @@ # from tqdm import tqdm __program__ = os.path.split(sys.argv[0])[-1] -__version__ = "2023.5.8" +__version__ = "2023.12.4" def get_maxbin2_cmd( input_filepaths, output_filepaths, output_directory, directories, opts): # Create dummy scaffolds_to_bins.tsv to overwrite later. This makes DAS_Tool easier to run @@ -740,6 +740,9 @@ def add_executables_to_environment(opts): "merge_cutup_clustering.py", "extract_fasta_bins.py", } + # if opts.algorithm == "vrhyme": + # required_executables |= {"vRhyme"} + # if opts.algorithm == "metacoag": # required_executables |= {"MetaCoAG"} @@ -845,7 +848,7 @@ def main(argv=None): # Binning parser_binning = parser.add_argument_group('Binning arguments') - parser_binning.add_argument("-a", "--algorithm", type=str, default="metabat2", help="Binning algorithm: {concoct, metabat2, maxbin2} Future: {metacoag, vamb} [Default: metabat2] ") + parser_binning.add_argument("-a", "--algorithm", type=str, default="metabat2", help="Binning algorithm: {concoct, metabat2, maxbin2} Future: {vrhyme} [Default: metabat2] ") parser_binning.add_argument("-m", "--minimum_contig_length", type=int, default=1500, help="Minimum contig length. [Default: 1500] ") parser_binning.add_argument("-s", "--minimum_genome_length", type=int, default=150000, help="Minimum genome length. [Default: 150000] ") parser_binning.add_argument("-P","--bin_prefix", type=str, default="DEFAULT", help = "Prefix for bin names. 
Special strings include: 1) --bin_prefix NONE which does not include a bin prefix; and 2) --bin_prefix DEFAULT then prefix is [ALGORITHM_UPPERCASE]__") @@ -870,8 +873,8 @@ def main(argv=None): # parser_metacoag = parser.add_argument_group('MetaCoAG arguments') # parser_metacoag.add_argument("--metacoag_options", type=str, default="", help="MetaCoAG | More options (e.g. --arg 1 ) [Default: '']") - # parser_vamb = parser.add_argument_group('VAMB arguments') - # parser_vamb.add_argument("--vamb_options", type=str, default="", help="VAMB | More options (e.g. --arg 1 ) [Default: '']") + # parser_vrhyme = parser.add_argument_group('vRhyme arguments') + # parser_vrhyme.add_argument("--vrhyme_options", type=str, default="", help="vRhyme | More options (e.g. --arg 1 ) [Default: '']") # Options opts = parser.parse_args(argv) diff --git a/src/scripts/build_source_to_lineage_dictionary.py b/src/scripts/build_source_to_lineage_dictionary.py new file mode 100755 index 0000000..e593068 --- /dev/null +++ b/src/scripts/build_source_to_lineage_dictionary.py @@ -0,0 +1,69 @@ +#!/usr/bin/env python +from __future__ import print_function, division +import sys, os, argparse, gzip, pickle +from tqdm import tqdm +import pandas as pd + +# from tqdm import tqdm +__program__ = os.path.split(sys.argv[0])[-1] +__version__ = "2023.11.13" + +def main(args=None): + # Path info + script_directory = os.path.dirname(os.path.abspath( __file__ )) + script_filename = __program__ + + # Path info + description = """ + Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable) + usage = "{} -i -o ".format(__program__) + epilog = "Copyright 2021 Josh L. Espinoza (jespinoz@jcvi.org)" + + # Parser + parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter) + # Pipeline + parser.add_argument("-i","--input", default="stdin", type=str, help = "Path to table [id_source][class][order][family][genus][species], with header. Can include more columns but the first column must be `id_source`. [Default: stdin]") + parser.add_argument("-o","--output", required=True, type=str, help = "Path to dictionary pickle object. Can be gzipped. 
(Recommended name: source_to_lineage.dict.pkl.gz)") + parser.add_argument("--separator", default=";", type=str, help = "Separator field for taxonomy [Default: ; ]") + + # Options + opts = parser.parse_args() + opts.script_directory = script_directory + opts.script_filename = script_filename + + # Input + if opts.input == "stdin": + opts.input = sys.stdin + + + print(" * Reading identifier mappings from the following file: {}".format(opts.input), file=sys.stderr) + source_to_lineage = dict() + df_input = pd.read_csv(opts.input, sep="\t", index_col=0) + for id_source, row in tqdm(df_input.loc[:,["class", "order", "family", "genus", "species"]].iterrows(), total=df_input.shape[0]): + lineage = list() + for level, taxon in row.items(): + v = level[0] + "__" + if pd.notnull(taxon): + v += taxon + lineage.append(v) + source_to_lineage[id_source] = opts.separator.join(lineage) + + + + print(" * Writing Python dictionary: {}".format(opts.output), file=sys.stderr) + f_out = None + if opts.output.endswith((".gz", ".pgz")): + f_out = gzip.open(opts.output, "wb") + else: + f_out = open(opts.output, "wb") + assert f_out is not None, "Unrecognized file format: {}".format(opts.output) + pickle.dump(source_to_lineage, f_out) + + + + + + + +if __name__ == "__main__": + main() diff --git a/src/scripts/build_target_to_source_dictionary.py b/src/scripts/build_target_to_source_dictionary.py new file mode 100755 index 0000000..367bc94 --- /dev/null +++ b/src/scripts/build_target_to_source_dictionary.py @@ -0,0 +1,77 @@ +#!/usr/bin/env python +from __future__ import print_function, division +import sys, os, argparse, gzip, pickle + +# from tqdm import tqdm +__program__ = os.path.split(sys.argv[0])[-1] +__version__ = "2023.11.15" + +def main(args=None): + # Path info + script_directory = os.path.dirname(os.path.abspath( __file__ )) + script_filename = __program__ + + # Path info + description = """ + Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable) + usage = "{} -i -o ".format(__program__) + epilog = "Copyright 2021 Josh L. Espinoza (jespinoz@jcvi.org)" + + # Parser + parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter) + # Pipeline + parser.add_argument("-i","--input", default="stdin", type=str, help = "Path to identifier mapping table [id_database][id_source][id_protein][id_hash], No header. [Default: stdin]") + parser.add_argument("-o","--output", required=True, type=str, help = "Path to dictionary pickle object. Can be gzipped. (Recommended name: target_to_source.dict.pkl.gz)") + parser.add_argument("-n","--number_of_sequences", type=int, help = "Number of sequences. 
If used, the tqdm is required.") + + # Options + opts = parser.parse_args() + opts.script_directory = script_directory + opts.script_filename = script_filename + + # Input + f_in = None + if opts.input == "stdin": + f_in = sys.stdin + else: + if opts.input.endswith(".gz"): + f_in = gzip.open(opts.input, "rt") + else: + f_in = open(opts.input, "r") + assert f_in is not None, "Unrecognized file format: {}".format(opts.input) + + if opts.number_of_sequences is not None: + from tqdm import tqdm + input_iterable = tqdm(f_in, total=opts.number_of_sequences, unit=" sequences") + else: + input_iterable = f_in + + print(" * Reading identifier mappings from the following file: {}".format(f_in), file=sys.stderr) + target_to_source = dict() + for line in input_iterable: + line = line.strip() + if line: + fields = line.split("\t") + id_hash = fields[3] + id_source = fields[1] + target_to_source[id_hash] = id_source + if f_in != sys.stdin: + f_in.close() + + print(" * Writing Python dictionary: {}".format(opts.output), file=sys.stderr) + f_out = None + if opts.output.endswith((".gz", ".pgz")): + f_out = gzip.open(opts.output, "wb") + else: + f_out = open(opts.output, "wb") + assert f_out is not None, "Unrecognized file format: {}".format(opts.output) + pickle.dump(target_to_source, f_out) + + + + + + + +if __name__ == "__main__": + main() diff --git a/src/scripts/check_fasta_duplicates.py b/src/scripts/check_fasta_duplicates.py index b4ca8dd..527b508 100755 --- a/src/scripts/check_fasta_duplicates.py +++ b/src/scripts/check_fasta_duplicates.py @@ -3,7 +3,7 @@ from tqdm import tqdm __program__ = os.path.split(sys.argv[0])[-1] -__version__ = "2023.4.17" +__version__ = "2023.11.10" def main(args=None): # Path info @@ -30,13 +30,15 @@ def main(args=None): if not opts.input: identifiers = set() duplicates = set() - for line in tqdm(sys.stdin, "stdin"): + for i, line in tqdm(enumerate(sys.stdin), "stdin"): if line.startswith(">"): id = line[1:].split(" ")[0].strip() if id not in identifiers: identifiers.add(id) else: duplicates.add(id) + else: + assert ">" not in line, "Line={} has a '>' character in the sequence which will cause an error. This can arise from concatenating fasta files where a record is missing a final linebreak".format(i+1) if duplicates: print("# Duplicates:", *sorted(duplicates), file=sys.stdout, sep="\n", end=None) sys.exit(1) @@ -48,13 +50,16 @@ def main(args=None): identifiers = set() duplicates = set() f = {True:gzip.open(fp, "rt"), False:open(fp, "r")}[fp.endswith(".gz")] - for line in tqdm(f, fp): + for i,line in tqdm(enumerate(f), fp): if line.startswith(">"): id = line[1:].split(" ")[0] if id not in identifiers: identifiers.add(id) else: duplicates.add(id) + else: + assert ">" not in line, "Line={} has a '>' character in the sequence which will cause an error. 
This can arise from concatenating fasta files where a record is missing a final linebreak".format(i+1) + if duplicates: files_with_duplicates.add(fp) print(f"[Fail] {fp}", file=sys.stdout) diff --git a/src/scripts/clean_fasta.py b/src/scripts/clean_fasta.py new file mode 100755 index 0000000..92192c6 --- /dev/null +++ b/src/scripts/clean_fasta.py @@ -0,0 +1,124 @@ +#!/usr/bin/env python +import sys, os, argparse, gzip +from Bio.SeqIO.FastaIO import SimpleFastaParser +from tqdm import tqdm + +__program__ = os.path.split(sys.argv[0])[-1] +__version__ = "2023.11.10" + +def main(args=None): + # Path info + script_directory = os.path.dirname(os.path.abspath( __file__ )) + script_filename = __program__ + + # Path info + description = """ + Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable) + usage = "{} -i -o )".format(__program__) + epilog = "Copyright 2021 Josh L. Espinoza (jespinoz@jcvi.org)" + + # Parser + parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter) + + # Pipeline + parser.add_argument("-i","--input", default="stdin", type=str, help = "Input fasta file") + parser.add_argument("-o","--output", default="stdout", type=str, help = "Output fasta file") + parser.add_argument("-r","--retain_description", action="store_true", help = "Retain description") + parser.add_argument("-s","--retain_stop_codon", action="store_true", help = "Retain stop codon character (if one exists)") + parser.add_argument("-m","--minimum_sequence_length", default=1, type=int, help = "Minimum sequence length accepted [Default: 1]") + parser.add_argument("--stop_codon_character", default="*", type=str, help = "Stop codon character [Default: *] ") + # parser.add_argument("-t","--molecule_type", help = "Comma-separated list of names for the --scaffolds_to_bins") + + + # Options + opts = parser.parse_args() + opts.script_directory = script_directory + opts.script_filename = script_filename + + assert opts.minimum_sequence_length > 0 + + # Input + f_in = None + if opts.input == "stdin": + f_in = sys.stdin + else: + if opts.input.endswith(".gz"): + f_in = gzip.open(opts.input, "rt") + else: + f_in = open(opts.input, "r") + assert f_in is not None + + # Output + f_out = None + if opts.output == "stdout": + f_out = sys.stdout + else: + if opts.output.endswith(".gz"): + f_out = gzip.open(opts.output, "wt") + else: + f_out = open(opts.output, "w") + assert f_out is not None + + # retain_description=True + # retain_stop_codon=True + if all([ + opts.retain_description, + opts.retain_stop_codon, + ]): + for header, seq in tqdm(SimpleFastaParser(f_in), "Reading fasta input"): + header = header.strip() + if len(seq) >= opts.minimum_sequence_length: + assert ">" not in seq, "`{}` has a '>' character in the sequence which will cause an error. This can arise from concatenating fasta files where a record is missing a final linebreak".format(header) + print(">{}\n{}".format(header,seq), file=f_out) + + # retain_description=False + # retain_stop_codon=True + if all([ + not opts.retain_description, + opts.retain_stop_codon, + ]): + for header, seq in tqdm(SimpleFastaParser(f_in), "Reading fasta input"): + id = header.split(" ")[0].strip() + if len(seq) >= opts.minimum_sequence_length: + assert ">" not in seq, "`{}` has a '>' character in the sequence which will cause an error. 
This can arise from concatenating fasta files where a record is missing a final linebreak".format(header) + print(">{}\n{}".format(id,seq), file=f_out) + + # retain_description=True + # retain_stop_codon=False + if all([ + opts.retain_description, + not opts.retain_stop_codon, + ]): + for header, seq in tqdm(SimpleFastaParser(f_in), "Reading fasta input"): + header = header.strip() + if seq.endswith(opts.stop_codon_character): + seq = seq[:-1] + if len(seq) >= opts.minimum_sequence_length: + assert ">" not in seq, "`{}` has a '>' character in the sequence which will cause an error. This can arise from concatenating fasta files where a record is missing a final linebreak".format(header) + print(">{}\n{}".format(header,seq), file=f_out) + + # retain_description=False + # retain_stop_codon=False + if all([ + not opts.retain_description, + not opts.retain_stop_codon, + ]): + for header, seq in tqdm(SimpleFastaParser(f_in), "Reading fasta input"): + id = header.split(" ")[0].strip() + if seq.endswith(opts.stop_codon_character): + seq = seq[:-1] + if len(seq) >= opts.minimum_sequence_length: + assert ">" not in seq, "`{}` has a '>' character in the sequence which will cause an error. This can arise from concatenating fasta files where a record is missing a final linebreak".format(header) + print(">{}\n{}".format(id,seq), file=f_out) + + # Close + if f_in != sys.stdin: + f_in.close() + if f_out != sys.stdout: + f_out.close() + +if __name__ == "__main__": + main() + + + diff --git a/src/scripts/clustering_wrapper.py b/src/scripts/clustering_wrapper.py new file mode 100755 index 0000000..b8eddbb --- /dev/null +++ b/src/scripts/clustering_wrapper.py @@ -0,0 +1,439 @@ +#!/usr/bin/env python +from __future__ import print_function, division +import sys, os, argparse, glob, shutil, time, warnings +from multiprocessing import cpu_count +from collections import OrderedDict, defaultdict + +import pandas as pd + +# Soothsayer Ecosystem +from genopype import * +from soothsayer_utils import * + +# from tqdm import tqdm +__program__ = os.path.split(sys.argv[0])[-1] +__version__ = "2023.11.10" + +# Check +def get_check_cmd(input_filepaths, output_filepaths, output_directory, directories, opts): + # Command + + # Command + cmd = [ + os.environ["check_fasta_duplicates.py"], + opts.fasta, + ] + + return cmd + +def get_mmseqs2_cmd( input_filepaths, output_filepaths, output_directory, directories, opts): + + # Command + cmd = [ + os.environ["mmseqs"], + "easy-{}".format(opts.algorithm.split("-")[1]), + opts.fasta, + os.path.join(output_directory, "mmseqs2"), + directories["tmp"], + "--threads {}".format(opts.n_jobs), + "--min-seq-id {}".format(opts.minimum_identity_threshold/100), + "-c {}".format(opts.minimum_coverage_threshold), + "--cov-mode 1", + opts.mmseqs2_options, + + "&&", + + "mv", + os.path.join(output_directory, "mmseqs2_cluster.tsv"), + os.path.join(output_directory, "clusters.tsv"), + + "&&", + + "mv", + os.path.join(output_directory, "mmseqs2_rep_seq.fasta"), + os.path.join(output_directory, "representatives.fasta"), + + "&&", + + "gzip", + os.path.join(output_directory, "representatives.fasta"), + + "&&", + + "rm -rf", + os.path.join(output_directory, "mmseqs2_all_seqs.fasta"), + os.path.join(directories["tmp"], "*"), + ] + + return cmd + +def get_diamond_cmd( input_filepaths, output_filepaths, output_directory, directories, opts): + + # Command + cmd = [ + os.environ["diamond"], + {"diamond-cluster":"cluster", "diamond-linclust":"linclust"}[opts.algorithm], + "--db", + opts.fasta, + "--out", + 
os.path.join(output_directory, "clusters.tsv"), + "--tmpdir", + directories["tmp"], + "--threads {}".format(opts.n_jobs), + "--approx-id {}".format(opts.minimum_identity_threshold), + "--member-cover {}".format(opts.minimum_coverage_threshold*100), + opts.diamond_options, + + "&&", + + "cut -f1", + os.path.join(output_directory, "clusters.tsv"), + "|", + "sort -u", + ">", + os.path.join(output_directory, "representatives.list"), + + "&&", + + os.environ["seqkit"], + "grep", + "-w 0", + "-f", + os.path.join(output_directory, "representatives.list"), + opts.fasta, + "|", + "gzip", + ">", + os.path.join(output_directory, "representatives.fasta.gz"), + + "&&", + + "rm -rf", + os.path.join(directories["tmp"], "*"), + ] + + return cmd + +# Compile +def get_compile_cmd(input_filepaths, output_filepaths, output_directory, directories, opts): + + # Command + cmd = [ + + os.environ["edgelist_to_clusters.py"], + "-i {}".format(input_filepaths[0]), + "--no_singletons" if bool(opts.no_singletons) else "", + "--cluster_prefix {}".format(opts.cluster_prefix) if bool(opts.cluster_prefix) else "", + "--cluster_suffix {}".format(opts.cluster_suffix) if bool(opts.cluster_suffix) else "", + "--cluster_prefix_zfill {}".format(opts.cluster_prefix_zfill), + "-o {}".format(os.path.join(output_directory, "{}.tsv".format(opts.basename))), + # "-g {}".format(os.path.join(output_directory, "{}.networkx_graph.pkl".format(opts.basename))), + # "-d {}".format(os.path.join(output_directory, "{}.dict.pkl".format(opts.basename))), + "--identifiers {}".format(opts.identifiers) if bool(opts.identifiers) else "", + + "&&", + + os.environ["reformat_representative_sequences.py"], + "-c {}".format(os.path.join(output_directory, "{}.tsv".format(opts.basename))), + "-i {}".format(input_filepaths[1]), + "-f {}".format(opts.representative_output_format), + "-o {}".format(output_filepaths[1]), + ] + + if opts.no_sequences_and_header: + cmd += [ + "--no_sequences", + "--no_header", + ] + + return cmd + +# ============ +# Run Pipeline +# ============ +# Set environment variables +def add_executables_to_environment(opts): + """ + Adapted from Soothsayer: https://github.com/jolespin/soothsayer + """ + accessory_scripts = set([ + "check_fasta_duplicates.py", + "edgelist_to_clusters.py", + "reformat_representative_sequences.py", + ]) + + required_executables={ + "mmseqs", + "diamond", + "seqkit", + + } | accessory_scripts + + if opts.path_config == "CONDA_PREFIX": + executables = dict() + for name in required_executables: + executables[name] = os.path.join(os.environ["CONDA_PREFIX"], "bin", name) + else: + if opts.path_config is None: + opts.path_config = os.path.join(opts.script_directory, "veba_config.tsv") + opts.path_config = format_path(opts.path_config) + assert os.path.exists(opts.path_config), "config file does not exist. Have you created one in the following directory?\n{}\nIf not, either create one, check this filepath:{}, or give the path to a proper config file using --path_config".format(opts.script_directory, opts.path_config) + assert os.stat(opts.path_config).st_size > 1, "config file seems to be empty. Please add 'name' and 'executable' columns for the following program names: {}".format(required_executables) + df_config = pd.read_csv(opts.path_config, sep="\t") + assert {"name", "executable"} <= set(df_config.columns), "config must have `name` and `executable` columns. 
Please adjust file: {}".format(opts.path_config) + df_config = df_config.loc[:,["name", "executable"]].dropna(how="any", axis=0).applymap(str) + # Get executable paths + executables = OrderedDict(zip(df_config["name"], df_config["executable"])) + assert required_executables <= set(list(executables.keys())), "config must have the required executables for this run. Please adjust file: {}\nIn particular, add info for the following: {}".format(opts.path_config, required_executables - set(list(executables.keys()))) + + # Display + for name in sorted(accessory_scripts): + executables[name] = "'{}'".format(os.path.join(opts.script_directory, name)) # Can handle spaces in path + + print(format_header( "Adding executables to path from the following source: {}".format(opts.path_config), "-"), file=sys.stdout) + for name, executable in executables.items(): + if name in required_executables: + print(name, executable, sep = " --> ", file=sys.stdout) + os.environ[name] = executable.strip() + print("", file=sys.stdout) + +# Pipeline +def create_pipeline(opts, directories, f_cmds): + + # ................................................................. + # Primordial + # ................................................................. + # Commands file + pipeline = ExecutablePipeline(name=__program__, f_cmds=f_cmds, checkpoint_directory=directories["checkpoints"], log_directory=directories["log"]) + + + # ========== + # Preprocessing + # ========== + + program = "check" + # Add to directories + output_directory = directories["tmp"] + + # Info + step = 0 + description = "Check sequences for duplicates" + + # i/o + input_filepaths = [opts.fasta] + output_filepaths = [ + ] + + params = { + "input_filepaths":input_filepaths, + "output_filepaths":output_filepaths, + "output_directory":output_directory, + "opts":opts, + "directories":directories, + } + + cmd = get_check_cmd(**params) + + pipeline.add_step( + id=program, + description = description, + step=step, + cmd=cmd, + input_filepaths = input_filepaths, + output_filepaths = output_filepaths, + validate_inputs=True, + validate_outputs=False, + ) + + # ========== + # Clustering + # ========== + step = 1 + + # i/o + output_directory = directories["intermediate"] + + input_filepaths = [opts.fasta] + output_filenames = [ + "clusters.tsv", + "representatives.fasta.gz", + ] + output_filepaths = list(map(lambda filename: os.path.join(output_directory, filename), output_filenames)) + + params = { + "input_filepaths":input_filepaths, + "output_filepaths":output_filepaths, + "output_directory":output_directory, + "opts":opts, + "directories":directories, + } + + if opts.algorithm.split("-")[0] == "mmseqs": + program = "mmseqs2" + # Info + description = "Cluster sequences via MMSEQS2" + cmd = get_mmseqs2_cmd(**params) + + if opts.algorithm.split("-")[0] == "diamond": + program = "diamond" + description = "Cluster sequences via Diamond" + cmd = get_diamond_cmd(**params) + + pipeline.add_step( + id=program, + description = description, + step=step, + cmd=cmd, + input_filepaths = input_filepaths, + output_filepaths = output_filepaths, + validate_inputs=True, + validate_outputs=True, + ) + + # ========== + # Compile + # ========== + + program = "compile" + # Add to directories + output_directory = directories["output"] + + # Info + step = 2 + description = "Compile clustering results" + + # i/o + input_filepaths = output_filepaths + output_filenames = [ + "{}.tsv".format(opts.basename), + ] + if opts.representative_output_format == "table": + output_filenames += 
["representative_sequences.tsv.gz"] + if opts.representative_output_format == "fasta": + output_filenames += ["representative_sequences.fasta.gz"] + output_filepaths = list(map(lambda filename: os.path.join(output_directory, filename), output_filenames)) + + params = { + "input_filepaths":input_filepaths, + "output_filepaths":output_filepaths, + "output_directory":output_directory, + "opts":opts, + "directories":directories, + } + + cmd = get_compile_cmd(**params) + + pipeline.add_step( + id=program, + description = description, + step=step, + cmd=cmd, + input_filepaths = input_filepaths, + output_filepaths = output_filepaths, + validate_inputs=True, + validate_outputs=True, + ) + + return pipeline + +# Configure parameters +def configure_parameters(opts, directories): + + assert_acceptable_arguments(opts.algorithm, {"easy-cluster", "easy-linclust", "mmseqs-cluster", "mmseqs-linclust", "diamond-cluster", "diamond-linclust"}) + if opts.algorithm in {"easy-cluster", "easy-linclust"}: + d = {"easy-cluster":"mmseqs-cluster", "easy-linclust":"mmseqs-linclust"} + warnings.warn("\n\nPlease use `{}` instead of `{}` for MMSEQS2 clustering.".format(d[opts.algorithm], opts.algorithm)) + opts.algorithm = d[opts.algorithm] + assert_acceptable_arguments(opts.representative_output_format, {"table", "fasta"}) + # Set environment variables + add_executables_to_environment(opts=opts) + +def main(args=None): + # Path info + script_directory = os.path.dirname(os.path.abspath( __file__ )) + script_filename = __program__ + # Path info + description = """ + Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable) + usage = "{} -i -o ".format(__program__) + + epilog = "Copyright 2021 Josh L. Espinoza (jespinoz@jcvi.org)" + + # Parser + parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter) + # Pipeline + parser_io = parser.add_argument_group('Required I/O arguments') + parser_io.add_argument("-i","--fasta", type=str, help = "Fasta file") + parser_io.add_argument("-o","--output_directory", type=str, default="clustering_output", help = "path/to/project_directory [Default: clustering_output]") + parser_io.add_argument("-e", "--no_singletons", action="store_true", help="Exclude singletons") + parser_io.add_argument("-b", "--basename", type=str, default="clusters", help="Basename for clustering files [Default: clusters]") + + # Utility + parser_utility = parser.add_argument_group('Utility arguments') + parser_utility.add_argument("--path_config", type=str, default="CONDA_PREFIX", help="path/to/config.tsv [Default: CONDA_PREFIX]") #site-packges in future + parser_utility.add_argument("-p", "--n_jobs", type=int, default=1, help = "Number of threads [Default: 1]") + parser_utility.add_argument("--restart_from_checkpoint", type=str, default=None, help = "Restart from a particular checkpoint [Default: None]") + parser_utility.add_argument("-v", "--version", action='version', version="{} v{}".format(__program__, __version__)) + # parser_utility.add_argument("--verbose", action='store_true') + + # Clustering + parser_clustering = parser.add_argument_group('Clustering arguments') + parser_clustering.add_argument("-a", "--algorithm", type=str, default="mmseqs-cluster", help="Clustering algorithm | Diamond can only be used for clustering proteins {mmseqs-cluster, mmseqs-linclust, diamond-cluster, mmseqs-linclust} [Default: mmseqs-cluster]") + parser_clustering.add_argument("-t", 
"--minimum_identity_threshold", type=float, default=50.0, help="Clustering | Percent identity threshold (Range (0.0, 100.0]) [Default: 50.0]") + parser_clustering.add_argument("-c", "--minimum_coverage_threshold", type=float, default=0.8, help="Clustering | Coverage threshold (Range (0.0, 1.0]) [Default: 0.8]") + parser_clustering.add_argument("--cluster_prefix", type=str, default="SC-", help="Sequence cluster prefix [Default: 'SC-]") + parser_clustering.add_argument("--cluster_suffix", type=str, default="", help="Sequence cluster suffix [Default: '']") + parser_clustering.add_argument("--cluster_prefix_zfill", type=int, default=0, help="Sequence cluster prefix zfill. Use 7 to match identifiers from OrthoFinder. Use 0 to add no zfill. [Default: 0]") #7 + parser_clustering.add_argument("--mmseqs2_options", type=str, default="", help="MMSEQS2 | More options (e.g. --arg 1 ) [Default: '']") + parser_clustering.add_argument("--diamond_options", type=str, default="", help="Diamond | More options (e.g. --arg 1 ) [Default: '']") + parser_clustering.add_argument("--identifiers", type=str, help = "Identifiers to include for `edgelist_to_clusters.py`. If missing identifiers and singletons are allowed, then they will be included as singleton clusters with weight of np.inf") + parser_clustering.add_argument("--no_sequences_and_header", action="store_true", help = "Don't include sequences or header in table. Useful for concatenation and reduced redundancy of sequences") + parser_clustering.add_argument("-f","--representative_output_format", type=str, default="fasta", help = "Format of output for representative sequences: {table, fasta} [Default: fasta]") # Should fasta be the new default? + + # Options + opts = parser.parse_args() + + opts.script_directory = script_directory + opts.script_filename = script_filename + + # Threads + if opts.n_jobs == -1: + opts.n_jobs = cpu_count() + assert opts.n_jobs >= 1, "--n_jobs must be ≥ 1 (or -1 to use all available threads)" + + # Directories + directories = dict() + directories["project"] = create_directory(opts.output_directory) + directories["output"] = create_directory(os.path.join(directories["project"], "output")) + directories["log"] = create_directory(os.path.join(directories["project"], "log")) + directories["tmp"] = create_directory(os.path.join(directories["project"], "tmp")) + directories["checkpoints"] = create_directory(os.path.join(directories["project"], "checkpoints")) + directories["intermediate"] = create_directory(os.path.join(directories["project"], "intermediate")) + os.environ["TMPDIR"] = directories["tmp"] + + # Info + print(format_header(__program__, "="), file=sys.stdout) + print(format_header("Configuration:", "-"), file=sys.stdout) + print("Python version:", sys.version.replace("\n"," "), file=sys.stdout) + print("Python path:", sys.executable, file=sys.stdout) #sys.path[2] + print("Script version:", __version__, file=sys.stdout) + print("Moment:", get_timestamp(), file=sys.stdout) + print("Directory:", os.getcwd(), file=sys.stdout) + print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout) + configure_parameters(opts, directories) + sys.stdout.flush() + + # Run pipeline + with open(os.path.join(directories["project"], "commands.sh"), "w") as f_cmds: + pipeline = create_pipeline( + opts=opts, + directories=directories, + f_cmds=f_cmds, + ) + pipeline.compile() + pipeline.execute(restart_from_checkpoint=opts.restart_from_checkpoint) + +if __name__ == "__main__": + main(sys.argv[1:]) + + diff --git 
a/src/scripts/compile_custom_humann_database_from_annotations.py b/src/scripts/compile_custom_humann_database_from_annotations.py index a644bb5..6604413 100755 --- a/src/scripts/compile_custom_humann_database_from_annotations.py +++ b/src/scripts/compile_custom_humann_database_from_annotations.py @@ -11,7 +11,7 @@ pd.options.display.max_colwidth = 100 # from tqdm import tqdm __program__ = os.path.split(sys.argv[0])[-1] -__version__ = "2023.10.11" +__version__ = "2023.12.15" def main(args=None): @@ -31,7 +31,7 @@ def main(args=None): parser.add_argument("-a","--annotations", type=str, required=True, help = "path/to/annotations.tsv[.gz] Output from annotations.py. Multi-level header that contains (UniRef, sseqid)") parser.add_argument("-t","--taxonomy", type=str, required=True, help = "path/to/taxonomy.tsv[.gz] [id_genome][classification] (No header). Use output from `merge_taxonomy_classifications.py` with --no_header and --no_domain") parser.add_argument("-s","--sequences", type=str, required=True, help = "path/to/proteins.fasta[.gz]") - parser.add_argument("-o","--output", type=str, default="stdout", help = "path/to/humann_uniref_annotations.tsv[.gz] [Default: stdout]") + parser.add_argument("-o","--output", type=str, default="stdout", help = "path/to/humann_uniref_annotations.tsv[.gz] (veba_output/profiling/databases/) is recommended [Default: stdout]") parser.add_argument("--sep", default=";", help = "Separator for taxonomic levels [Default: ;]") # parser.add_argument("--mandatory_taxonomy_prefixes", help = "Comma-separated values for mandatory prefix levels. (e.g., 'c__,f__,g__,s__')") # parser.add_argument("--discarded_file", help = "Proteins that have been discarded due to incomplete lineage") diff --git a/src/scripts/compile_custom_sylph_sketch_database_from_genomes.py b/src/scripts/compile_custom_sylph_sketch_database_from_genomes.py new file mode 100755 index 0000000..9c25424 --- /dev/null +++ b/src/scripts/compile_custom_sylph_sketch_database_from_genomes.py @@ -0,0 +1,239 @@ +#!/usr/bin/env python +from __future__ import print_function, division +import sys, os, argparse, glob, shutil, time, warnings +from multiprocessing import cpu_count +from collections import OrderedDict, defaultdict + +import pandas as pd + +# Soothsayer Ecosystem +from genopype import * +from genopype import __version__ as genopype_version +from soothsayer_utils import * + +# from tqdm import tqdm +__program__ = os.path.split(sys.argv[0])[-1] +__version__ = "2023.12.15" + +# ============ +# Run Pipeline +# ============ +# Set environment variables +def add_executables_to_environment(opts): + """ + Adapted from Soothsayer: https://github.com/jolespin/soothsayer + """ + accessory_scripts = set([ + + ]) + + required_executables={ + "sylph", + + } | accessory_scripts + + if opts.path_config == "CONDA_PREFIX": + executables = dict() + for name in required_executables: + executables[name] = os.path.join(os.environ["CONDA_PREFIX"], "bin", name) + else: + if opts.path_config is None: + opts.path_config = os.path.join(opts.script_directory, "veba_config.tsv") + opts.path_config = format_path(opts.path_config) + assert os.path.exists(opts.path_config), "config file does not exist. Have you created one in the following directory?\n{}\nIf not, either create one, check this filepath:{}, or give the path to a proper config file using --path_config".format(opts.script_directory, opts.path_config) + assert os.stat(opts.path_config).st_size > 1, "config file seems to be empty. 
Please add 'name' and 'executable' columns for the following program names: {}".format(required_executables) + df_config = pd.read_csv(opts.path_config, sep="\t") + assert {"name", "executable"} <= set(df_config.columns), "config must have `name` and `executable` columns. Please adjust file: {}".format(opts.path_config) + df_config = df_config.loc[:,["name", "executable"]].dropna(how="any", axis=0).applymap(str) + # Get executable paths + executables = OrderedDict(zip(df_config["name"], df_config["executable"])) + assert required_executables <= set(list(executables.keys())), "config must have the required executables for this run. Please adjust file: {}\nIn particular, add info for the following: {}".format(opts.path_config, required_executables - set(list(executables.keys()))) + + # Display + for name in sorted(accessory_scripts): + executables[name] = "'{}'".format(os.path.join(opts.script_directory, name)) # Can handle spaces in path + + print(format_header( "Adding executables to path from the following source: {}".format(opts.path_config), "-"), file=sys.stdout) + for name, executable in executables.items(): + if name in required_executables: + print(name, executable, sep = " --> ", file=sys.stdout) + os.environ[name] = executable.strip() + print("", file=sys.stdout) + + +# Configure parameters +def configure_parameters(opts, directories): + + + # Set environment variables + add_executables_to_environment(opts=opts) + +def main(args=None): + # Path info + script_directory = os.path.dirname(os.path.abspath( __file__ )) + script_filename = __program__ + # Path info + description = """ + Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable) + usage = "{} -i -o ".format(__program__) + + epilog = "Copyright 2021 Josh L. Espinoza (jespinoz@jcvi.org)" + + # Parser + parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter) + # Pipeline + parser_io = parser.add_argument_group('Required I/O arguments') + parser_io.add_argument("-i","--input", type=str, default="stdin", help = "path/to/input.tsv. Format: Must include the following columns (No header)[organism_type][path/to/genome.fa]. You can get this from `cut -f1,4 veba_output/misc/genomes_table.tsv` [Default: stdin]") + parser_io.add_argument("-o","--output_directory", type=str, default="veba_output/profiling/databases", help = "path/to/output_directory for databases [Default: veba_output/profiling/databases]") + parser_io.add_argument("--viral_tag", type=str, default="viral", help = "[Not case sensitive] Tag/Label of viral organisms in first column of --input (e.g., viral, virus, viron) [Default: viral]") + + + # Utility + parser_utility = parser.add_argument_group('Utility arguments') + parser_utility.add_argument("--path_config", type=str, default="CONDA_PREFIX", help="path/to/config.tsv [Default: CONDA_PREFIX]") #site-packges in future + parser_utility.add_argument("-p", "--n_jobs", type=int, default=1, help = "Number of threads [Default: 1]") + parser_utility.add_argument("-v", "--version", action='version', version="{} v{}".format(__program__, __version__)) + # parser_utility.add_argument("--verbose", action='store_true') + + # Sylph + parser_sylph = parser.add_argument_group('Sylph sketch arguments') + parser_sylph.add_argument("-k", "--sylph_k", type=int, choices={21,31}, default=31, help="Sylph | Value of k. Only k = 21, 31 are currently supported. 
[Default: 31]") + parser_sylph.add_argument("-s", "--sylph_minimum_spacing", type=int, default=30, help="Sylph | Minimum spacing between selected k-mers on the genomes [Default: 30]") + + parser_sylph_nonviral = parser.add_argument_group('[Prokaryotic & Eukaryotic] Sylph sketch arguments') + parser_sylph_nonviral.add_argument("--sylph_nonviral_subsampling_rate", type=int, default=200, help="Sylph [Prokaryotic & Eukaryotic]| Subsampling rate. [Default: 200]") + parser_sylph_nonviral.add_argument("--sylph_nonviral_options", type=str, default="", help="Sylph [Prokaryotic & Eukaryotic] | More options for `sylph sketch` (e.g. --arg 1 ) [Default: '']") + + parser_sylph_viral = parser.add_argument_group('[Viral] Sylph sketch arguments') + parser_sylph_viral.add_argument("--sylph_viral_subsampling_rate", type=int, default=100, help="Sylph [Viral]| Subsampling rate. [Default: 100]") + parser_sylph_viral.add_argument("--sylph_viral_options", type=str, default="", help="Sylph [Viral] | More options for `sylph sketch` (e.g. --arg 1 ) [Default: '']") + + # Options + opts = parser.parse_args() + + opts.script_directory = script_directory + opts.script_filename = script_filename + + # Threads + if opts.n_jobs == -1: + opts.n_jobs = cpu_count() + assert opts.n_jobs >= 1, "--n_jobs must be ≥ 1 (or -1 to use all available threads)" + + # Directories + directories = dict() + directories["output"] = create_directory(opts.output_directory) + directories["intermediate"] = create_directory(os.path.join(directories["output"], "intermediate")) + directories["log"] = create_directory(os.path.join(directories["intermediate"], "log")) + directories["checkpoints"] = create_directory(os.path.join(directories["intermediate"], "checkpoints")) + + # Info + print(format_header(__program__, "="), file=sys.stdout) + print(format_header("Configuration:", "-"), file=sys.stdout) + print("Python version:", sys.version.replace("\n"," "), file=sys.stdout) + print("Python path:", sys.executable, file=sys.stdout) #sys.path[2] + print("Script version:", __version__, file=sys.stdout) + print("GenoPype version:", genopype_version, file=sys.stdout) #sys.path[2] + print("Moment:", get_timestamp(), file=sys.stdout) + print("Directory:", os.getcwd(), file=sys.stdout) + print("Commands:", list(filter(bool,sys.argv)), sep="\n", file=sys.stdout) + configure_parameters(opts, directories) + sys.stdout.flush() + + # Make directories + t0 = time.time() + # print(format_header("* ({}) Creating directories:".format(format_duration(t0)), opts.output_directory), file=sys.stdout) + # os.makedirs(opts.output_directory, exist_ok=True) + + # Load input + if opts.input == "stdin": + opts.input = sys.stdin + df_genomes = pd.read_csv(opts.input, sep="\t", header=None) + assert df_genomes.shape[1] == 2, "Must include the follow columns (No header) [organism_type][genome]). Suggested input is from `compile_genomes_table.py` script using `cut -f1,4` to get the necessary columns." 
+ df_genomes.columns = ["organism_type", "genome"] + + opts.viral_tag = opts.viral_tag.lower() + + print(format_header("* ({}) Organizing genomes by organism_type".format(format_duration(t0))), file=sys.stdout) + organism_to_genomes = defaultdict(set) + for i, (organism_type, genome_filepath) in pv(df_genomes.iterrows(), unit="genomes ", total=df_genomes.shape[0]): + organism_type = organism_type.lower() + if organism_type == opts.viral_tag: + organism_to_genomes["viral"].add(genome_filepath) + else: + organism_to_genomes["nonviral"].add(genome_filepath) + # del df_genomes + + # Commands + f_cmds = open(os.path.join(directories["intermediate"], "commands.sh"), "w") + + for organism_type, filepaths in organism_to_genomes.items(): + # Write genomes to file + print(format_header("* ({}) Creating genome database: (N={}) for organism_type='{}'".format(format_duration(t0),len(filepaths), organism_type)), file=sys.stdout) + + genome_filepaths_list = os.path.join(directories["intermediate"], "{}_genomes.list".format(organism_type)) + with open(genome_filepaths_list, "w") as f: + for fp in sorted(filepaths): + print(fp, file=f) + + name = "sylph__{}".format(organism_type) + description = "[Program = sylph sketch] [Organism_Type = {}]".format(organism_type) + + arguments = [ + os.environ["sylph"], + "sketch", + "-t {}".format(opts.n_jobs), + "--gl {}".format(genome_filepaths_list), + "-o {}".format(os.path.join(opts.output_directory, "genome_database-{}".format(organism_type))), + "-k {}".format(opts.sylph_k), + "--min-spacing {}".format(opts.sylph_minimum_spacing), + ] + + if organism_type == "nonviral": + arguments += [ + "-c {}".format(opts.sylph_nonviral_subsampling_rate), + opts.sylph_nonviral_options, + ] + + else: + arguments += [ + "-c {}".format(opts.sylph_viral_subsampling_rate), + opts.sylph_viral_options, + ] + print(arguments, file=sys.stdout) + cmd = Command( + arguments, + name=name, + f_cmds=f_cmds, + ) + + + # Run command + cmd.run( + checkpoint_message_notexists="[Running ({})] | {}".format(format_duration(t0), description), + checkpoint_message_exists="[Loading Checkpoint ({})] | {}".format(format_duration(t0), description), + write_stdout=os.path.join(directories["log"], "{}.o".format(name)), + write_stderr=os.path.join(directories["log"], "{}.e".format(name)), + write_returncode=os.path.join(directories["log"], "{}.returncode".format(name)), + checkpoint=os.path.join(directories["checkpoints"], name), + ) + + if hasattr(cmd, "returncode_"): + if cmd.returncode_ != 0: + print("[Error] | {}".format(description), file=sys.stdout) + print("Check the following files:\ncat {}".format(os.path.join(directories["log"], "{}.*".format(name))), file=sys.stdout) + sys.exit(cmd.returncode_) + else: + output_filepath = os.path.join(opts.output_directory, "genome_database-{}.syldb".format(organism_type)) + size_bytes = os.path.getsize(output_filepath) + size_mb = size_bytes >> 20 + if size_mb < 1: + print("Output Database:", output_filepath, "({} bytes)".format(size_bytes), file=sys.stdout) + else: + print("Output Database:", output_filepath, "({} MB)".format(size_mb), file=sys.stdout) + + f_cmds.close() + +if __name__ == "__main__": + main(sys.argv[1:]) + + diff --git a/src/scripts/compile_eukaryotic_classifications.py b/src/scripts/compile_eukaryotic_classifications.py index 609526e..4841d85 100755 --- a/src/scripts/compile_eukaryotic_classifications.py +++ b/src/scripts/compile_eukaryotic_classifications.py @@ -6,7 +6,7 @@ from tqdm import tqdm __program__ = os.path.split(sys.argv[0])[-1] 
-__version__ = "2023.3.20" +__version__ = "2023.12.14" def main(args=None): @@ -16,20 +16,23 @@ def main(args=None): # Path info description = """ Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable) - usage = "{} -i -s -c -o ".format(__program__) + usage = "{} -i -s -c -o ".format(__program__) epilog = "Copyright 2021 Josh L. Espinoza (jespinoz@jcvi.org)" # Parser parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter) # Pipeline parser.add_argument("-i","--metaeuk_identifier_mapping", type=str, required=True, help = "path/to/identifier_mapping.metaeuk.tsv") - parser.add_argument("-s","--scaffolds_to_bins", type=str, required=True, help = "path/to/scaffolds_to_bins.tsv") - parser.add_argument("-c","--clusters", type=str, help = "path/to/clusters.tsv, Format: [id_mag][id_cluster], No header [Optional]") - parser.add_argument("-o","--output", type=str, default="stdout", help = "path/to/output.tsv [Default: stdout]") - parser.add_argument("--eukaryotic_database", type=str, default=None, required=True, help="path/to/eukaryotic_database (e.g. --arg 1 )") + parser.add_argument("-s","--scaffolds_to_bins", type=str, required=False, help = "path/to/scaffolds_to_bins.tsv") + # parser.add_argument("-g","--genes_to_contigs", type=str, required=False, help = "path/to/genes_to_contigs.tsv cannot be used with --scaffolds_to_bins") + parser.add_argument("-c","--clusters", type=str, help = "path/to/clusters.tsv, Format: [id_genome][id_cluster], No header [Optional]") + parser.add_argument("-o","--output", type=str, default="stdout", help = "path/to/gene-source_lineage.tsv [Default: stdout]") + parser.add_argument("-d", "--eukaryotic_database", type=str, default=None, required=True, help="path/to/eukaryotic_database directory (e.g. --arg 1 )") # parser.add_argument("--veba_database", type=str, default=None, help=f"VEBA database location. 
[Default: $VEBA_DATABASE environment variable]") parser.add_argument("--header", type=int, default=1, help="Include header in output {0=No, 1=Yes) [Default: 1]") parser.add_argument("--debug", action="store_true") + parser.add_argument("--remove_genes_with_missing_values", action="store_true") + parser.add_argument("--use_original_metaeuk_gene_identifiers", action="store_true") # Options opts = parser.parse_args() @@ -44,20 +47,15 @@ def main(args=None): # opts.eukaryotic_database = os.path.join(opts.veba_database, "Classify", "Microeukaryotic") # I/O - # Scaffolds -> Bins - fp = opts.scaffolds_to_bins - print("* Reading scaffolds to bins table {}".format(fp), file=sys.stderr) - scaffold_to_bin = pd.read_csv(fp, sep="\t", index_col=0, header=None).iloc[:,0] - if opts.debug: - print(fp, file=sys.stderr) - scaffold_to_bin.head().to_csv(sys.stderr, sep="\t", header=None) - print("\n", file=sys.stderr) + # SourceID -> Taxonomy fp = os.path.join(opts.eukaryotic_database,"source_taxonomy.tsv.gz") print("* Reading source taxonomy table {}".format(fp), file=sys.stderr) df_source_taxonomy = pd.read_csv(fp, sep="\t", index_col=0) df_source_taxonomy.index = df_source_taxonomy.index.map(str) + df_source_taxonomy = pd.DataFrame(df_source_taxonomy.to_dict()) # Hack for duplicate entries that will be resolved in MicroEuk_v3.1 + if opts.debug: print(fp, file=sys.stderr) df_source_taxonomy.head().to_csv(sys.stderr, sep="\t") @@ -65,7 +63,7 @@ def main(args=None): # VEBA -> SourceID fp = os.path.join(opts.eukaryotic_database,"target_to_source.dict.pkl.gz") - print("* Reading target to source mapping {}".format(fp), file=sys.stderr) + print("* Reading target to source mapping {} (Note: This one takes a little longer to load...)".format(fp), file=sys.stderr) with gzip.open(fp, "rb") as f: target_to_source = pickle.load(f) #target_to_source = pd.read_csv(fp, sep="\t", index_col=0, dtype=str, usecols=["id_veba", "id_source"], squeeze=True)#.iloc[:,0] @@ -83,32 +81,44 @@ def main(args=None): df_metaeuk.head().to_csv(sys.stderr, sep="\t") print("\n", file=sys.stderr) - orf_to_bitscore = df_metaeuk["bitscore"].map(float) - orf_to_scaffold = df_metaeuk["C_acc"].map(str) - orf_to_mag = orf_to_scaffold.map(lambda id_scaffold: scaffold_to_bin[id_scaffold]) - - orf_to_target = df_metaeuk["T_acc"] - orf_to_source = orf_to_target.map(lambda id_target: target_to_source.get(id_target,np.nan)) - if np.any(pd.isnull(orf_to_source)): + gene_to_bitscore = df_metaeuk["bitscore"].map(float) + gene_to_scaffold = df_metaeuk["C_acc"].map(str) + gene_to_genome = pd.Series([np.nan]*df_metaeuk.shape[0], index=df_metaeuk.index) + gene_to_target = df_metaeuk["T_acc"] + gene_to_source = gene_to_target.map(lambda id_target: target_to_source.get(id_target,np.nan)) + + if opts.scaffolds_to_bins: + # Scaffolds -> Bins + fp = opts.scaffolds_to_bins + print("* Reading scaffolds to bins table {}".format(fp), file=sys.stderr) + scaffold_to_bin = pd.read_csv(fp, sep="\t", index_col=0, header=None).iloc[:,0] + if opts.debug: + print(fp, file=sys.stderr) + scaffold_to_bin.head().to_csv(sys.stderr, sep="\t", header=None) + print("\n", file=sys.stderr) + gene_to_genome = gene_to_scaffold.map(lambda id_scaffold: scaffold_to_bin[id_scaffold]) + + if np.any(pd.isnull(gene_to_source)): warnings.warn("The following gene - target identifiers are not in the database file: {}".format( os.path.join(opts.eukaryotic_database,"target_to_source.dict.pkl.gz"), ), ) - orf_to_target[orf_to_source[orf_to_source.isnull()].index].to_frame().to_csv(sys.stderr, sep="\t", 
header=None) - orf_to_source = orf_to_source.dropna() + gene_to_target[gene_to_source[gene_to_source.isnull()].index].to_frame().to_csv(sys.stderr, sep="\t", header=None) + gene_to_source = gene_to_source.dropna() # Lineage - orf_to_lineage = OrderedDict() + gene_to_lineage = OrderedDict() missing_lineage = list() - for id_orf, id_source in tqdm(orf_to_source.items(), desc="Retrieving lineage", unit = " ORFs"): + for id_gene, id_source in tqdm(gene_to_source.items(), desc="Retrieving lineage", unit = " genes"): if id_source in df_source_taxonomy.index: lineage = df_source_taxonomy.loc[id_source, ["class", "order", "family", "genus", "species"]] # class order family genus species + lineage = lineage.fillna("") lineage = ";".join(map(lambda items: "".join(items), zip(["c__", "o__", "f__", "g__", "s__"], lineage))) - orf_to_lineage[id_orf] = lineage + gene_to_lineage[id_gene] = lineage else: missing_lineage.append(id_source) - orf_to_lineage = pd.Series(orf_to_lineage) + gene_to_lineage = pd.Series(gene_to_lineage) if len(missing_lineage): warnings.warn("The following source identifiers are not in the database file: {}\n{}`".format( @@ -118,31 +128,47 @@ def main(args=None): ) # Output - # ["id_orf", "id_mag", "bitscore", "lineage"] - df_orf_classifications = pd.concat([ - orf_to_scaffold.to_frame("id_scaffold"), - orf_to_mag.to_frame("id_mag"), - orf_to_target.to_frame("id_target"), - orf_to_source.to_frame("id_source"), - orf_to_lineage.to_frame("lineage"), - orf_to_bitscore.to_frame("bitscore"), - ], - axis=1) - df_orf_classifications.index.name = "id_gene" + df_gene_classifications = pd.DataFrame({ + "id_scaffold":gene_to_scaffold, + "id_genome":gene_to_genome, + "id_target":gene_to_target, + "id_source":gene_to_source, + "lineage":gene_to_lineage, + "bitscore":gene_to_bitscore, + }) + df_gene_classifications.index.name = "id_gene" + + + # df_gene_classifications = pd.concat([ + # gene_to_scaffold.to_frame("id_scaffold"), + # gene_to_genome.to_frame("id_genome"), + # gene_to_target.to_frame("id_target"), + # gene_to_source.to_frame("id_source"), + # gene_to_lineage.to_frame("lineage"), + # gene_to_bitscore.to_frame("bitscore"), + # ], + # axis=1) + # df_gene_classifications.index.name = "id_gene" # Add clusters if provided if opts.clusters: if opts.clusters != "None": # Hack for when called internally - mag_to_cluster = pd.read_csv(opts.clusters, sep="\t", index_col=0, header=None).iloc[:,0] - orf_to_cluster = orf_to_mag.map(lambda id_orf: mag_to_cluster[id_orf]) - df_orf_classifications.insert(loc=2, column="id_cluster", value=orf_to_cluster) + genome_to_cluster = pd.read_csv(opts.clusters, sep="\t", index_col=0, header=None).iloc[:,0] + gene_to_cluster = gene_to_genome.map(lambda id_gene: genome_to_cluster[id_gene]) + df_gene_classifications.insert(loc=2, column="id_cluster", value=gene_to_cluster) # Output if opts.output == "stdout": opts.output = sys.stdout - df_orf_classifications = df_orf_classifications.dropna(how="any", axis=0) - df_orf_classifications.to_csv(opts.output, sep="\t", header=bool(opts.header)) + if opts.remove_genes_with_missing_values: + df_gene_classifications = df_gene_classifications.dropna(how="any", axis=0) + + if not opts.use_original_metaeuk_gene_identifiers: + metaeuk_to_gene = df_metaeuk["gene_id"].to_dict() + df_gene_classifications.index = df_gene_classifications.index.map(lambda x: metaeuk_to_gene[x]) + + df_gene_classifications.to_csv(opts.output, sep="\t", header=bool(opts.header)) diff --git 
a/src/scripts/compile_prokaryotic_genome_cluster_classification_scores_table.py b/src/scripts/compile_prokaryotic_genome_cluster_classification_scores_table.py index 1e4a031..3ee1147 100755 --- a/src/scripts/compile_prokaryotic_genome_cluster_classification_scores_table.py +++ b/src/scripts/compile_prokaryotic_genome_cluster_classification_scores_table.py @@ -30,7 +30,6 @@ def main(argv=None): parser_io.add_argument("--fill_missing_weight", type=float, help = "Fill missing weight between [0, 100.0]. [Default is to throw error if value is missing]") parser_io.add_argument("--header", action="store_true", help = "Include header") - # Options opts = parser.parse_args() opts.script_directory = script_directory diff --git a/src/scripts/compile_reads_table.py b/src/scripts/compile_reads_table.py index 3b3edd8..8075113 100755 --- a/src/scripts/compile_reads_table.py +++ b/src/scripts/compile_reads_table.py @@ -7,7 +7,7 @@ pd.options.display.max_colwidth = 100 # from tqdm import tqdm __program__ = os.path.split(sys.argv[0])[-1] -__version__ = "2023.8.28" +__version__ = "2023.12.18" def parse_basename(query: str, naming_scheme: str): """ @@ -43,6 +43,7 @@ def main(args=None): parser_preprocess_directory = parser.add_argument_group('[Mode 1] Preprocess Directory arguments') parser_preprocess_directory.add_argument("-i","--preprocess_directory", type=str, help = "path/to/preprocess directory (e.g., veba_output/preprocess) [Cannot be used with --fastq_directory]") parser_preprocess_directory.add_argument("-b","--basename", default="cleaned", type=str, help = "File basename to search VEBA preprocess directory [preprocess_directory]/[id_sample]/[output]/[basename]_1/2.fastq.gz [Default: cleaned]") + parser_preprocess_directory.add_argument("-L","--long", action="store_true", help = "Use if reads are ONT or PacBio") parser_fastq_directory = parser.add_argument_group('[Mode 2] Fastq Directory arguments') parser_fastq_directory.add_argument("-f","--fastq_directory", type=str, help = "path/to/fastq_directory [Cannot be used with --preprocess_directory]") @@ -55,6 +56,7 @@ def main(args=None): parser_output.add_argument("-0", "--sample_label", default="sample-id", type=str, help = "Sample ID column label [Reverse: sample-id]") parser_output.add_argument("-1", "--forward_label", default="forward-absolute-filepath", type=str, help = "Forward filepath column label [Default: forward-absolute-filepath]") parser_output.add_argument("-2", "--reverse_label", default="reverse-absolute-filepath", type=str, help = "Reverse filepath column label [Default: reverse-absolute-filepath]") + parser_output.add_argument("-3", "--long_label", default="reads-filepath", type=str, help = "Long reads filepath column label [Default: reads-filepath]") parser_output.add_argument("--header", action="store_true", help = "Write header") parser_output.add_argument("--volume_prefix", type=str, help = "Docker container prefix to volume path") @@ -69,27 +71,41 @@ def main(args=None): output = defaultdict(dict) # Build table from preprocess directory if opts.preprocess_directory: - for fp in glob.glob(os.path.join(opts.preprocess_directory, "*", "output", "{}_1.fastq.gz".format(opts.basename))): - id_sample = fp.split("/")[-3] - output[id_sample][opts.forward_label] = fp - for fp in glob.glob(os.path.join(opts.preprocess_directory, "*", "output", "{}_2.fastq.gz".format(opts.basename))): - id_sample = fp.split("/")[-3] - output[id_sample][opts.reverse_label] = fp - # Build table from fastq directory - if opts.fastq_directory: - for fp in 
glob.glob(os.path.join(opts.fastq_directory, "*.{}".format(opts.extension))): - basename = fp.split("/")[-1] - id_sample, direction = parse_basename(basename, naming_scheme=opts.naming_scheme) - # id_sample = "_R".join(basename.split("_R")[:-1]) - if direction == "1": + if not opts.long: + for fp in glob.glob(os.path.join(opts.preprocess_directory, "*", "output", "{}_1.fastq.gz".format(opts.basename))): + id_sample = fp.split("/")[-3] output[id_sample][opts.forward_label] = fp - if direction == "2": + for fp in glob.glob(os.path.join(opts.preprocess_directory, "*", "output", "{}_2.fastq.gz".format(opts.basename))): + id_sample = fp.split("/")[-3] output[id_sample][opts.reverse_label] = fp - df_output = pd.DataFrame(output).T.sort_index().loc[:,[opts.forward_label, opts.reverse_label]] + else: + for fp in glob.glob(os.path.join(opts.preprocess_directory, "*", "output", "{}.fastq.gz".format(opts.basename))): + id_sample = fp.split("/")[-3] + output[id_sample][opts.long_label] = fp + + # Build table from fastq directory + if opts.fastq_directory: + if not opts.long: + for fp in glob.glob(os.path.join(opts.fastq_directory, "*.{}".format(opts.extension))): + basename = fp.split("/")[-1] + id_sample, direction = parse_basename(basename, naming_scheme=opts.naming_scheme) + # id_sample = "_R".join(basename.split("_R")[:-1]) + if direction == "1": + output[id_sample][opts.forward_label] = fp + if direction == "2": + output[id_sample][opts.reverse_label] = fp + else: + print("Long reads support with -L is currently only available with --preprocess_directory and not --fastq_directory", file=sys.stderr) + sys.exit(1) + + if not opts.long: + df_output = pd.DataFrame(output).T.sort_index().loc[:,[opts.forward_label, opts.reverse_label]] + else: + df_output = pd.DataFrame(output).T.sort_index().loc[:,[opts.long_label]] df_output.index.name = opts.sample_label # Check missing values - missing_values = df_output.notnull().sum(axis=1)[lambda x: x < 2].index + missing_values = df_output.notnull().sum(axis=1)[lambda x: x < df_output.shape[1]].index assert missing_values.size == 0, "Missing fastq for the following samples: {}".format(list(missing_values)) # Absolute paths @@ -97,10 +113,14 @@ def main(args=None): df_output = df_output.applymap(lambda fp: os.path.abspath(fp)) else: if opts.header: - if "absolute" in opts.forward_label.lower(): - print("You've selected --relative and may want to either not use a header or remove 'absolute' from the --forward_label: {}".format(opts.forward_label), file=sys.stderr) - if "absolute" in opts.reverse_label.lower(): - print("You've selected --relative and may want to either not use a header or remove 'absolute' from the --reverse_label: {}".format(opts.reverse_label), file=sys.stderr) + if not opts.long: + if "absolute" in opts.forward_label.lower(): + print("You've selected --relative and may want to either not use a header or remove 'absolute' from the --forward_label: {}".format(opts.forward_label), file=sys.stderr) + if "absolute" in opts.reverse_label.lower(): + print("You've selected --relative and may want to either not use a header or remove 'absolute' from the --reverse_label: {}".format(opts.reverse_label), file=sys.stderr) + else: + if "absolute" in opts.long_label.lower(): + print("You've selected --relative and may want to either not use a header or remove 'absolute' from the --long_label: {}".format(opts.long_label), file=sys.stderr) # Docker volume prefix if opts.volume_prefix: diff --git a/src/scripts/concatenate_assembly.py 
b/src/scripts/concatenate_assembly.py new file mode 100755 index 0000000..bcec4ff --- /dev/null +++ b/src/scripts/concatenate_assembly.py @@ -0,0 +1,99 @@ +#!/usr/bin/env python +import sys, os, argparse, gzip +from Bio.SeqIO.FastaIO import SimpleFastaParser +from tqdm import tqdm + +__program__ = os.path.split(sys.argv[0])[-1] +__version__ = "2023.12.18" + +def main(args=None): + # Path info + script_directory = os.path.dirname(os.path.abspath( __file__ )) + script_filename = __program__ + + # Path info + description = """ + Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable) + usage = "{} -i -o )".format(__program__) + epilog = "Copyright 2021 Josh L. Espinoza (jespinoz@jcvi.org)" + + # Parser + parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter) + + # Pipeline + parser.add_argument("-i","--input", default="stdin", type=str, help = "Input fasta file") + parser.add_argument("-o","--output", default="stdout", type=str, help = "Output fasta file") + parser.add_argument("-n", "--name", type=str, required=True, help = "Name to use for pseudo-scaffold") + parser.add_argument("-N", "--pad", type=int, default=100, help = "Number of N to use for joining contigs") + parser.add_argument("-d", "--description", type=str, help = "Description to use [Default: Input filepath]") + parser.add_argument("-m","--minimum_sequence_length", default=1, type=int, help = "Minimum sequence length accepted [Default: 1]") + parser.add_argument("-w","--wrap", default=1000, type=int, help = "Wrap fasta. Use 0 for no wrapping [Default: 1000]") + + # Options + opts = parser.parse_args() + opts.script_directory = script_directory + opts.script_filename = script_filename + + assert opts.minimum_sequence_length > 0 + assert opts.pad >= 0 + + # Input + f_in = None + if opts.input == "stdin": + f_in = sys.stdin + else: + if opts.input.endswith(".gz"): + f_in = gzip.open(opts.input, "rt") + else: + f_in = open(opts.input, "r") + assert f_in is not None + + # Output + f_out = None + if opts.output == "stdout": + f_out = sys.stdout + else: + if opts.output.endswith(".gz"): + f_out = gzip.open(opts.output, "wt") + else: + f_out = open(opts.output, "w") + assert f_out is not None + + # Concatenated assembly + + if not opts.description: + opts.description = "assembly_filepath: {}".format(opts.input) + else: + if opts.description == "NONE": + opts.description = "" + pseudoscaffold_header = "{} {}".format(opts.name, opts.description) + + print(">{}".format(pseudoscaffold_header), file=f_out) + sequences = list() + for header, seq in tqdm(SimpleFastaParser(f_in), "Reading fasta input"): + if len(seq) >= opts.minimum_sequence_length: + sequences.append(seq) + number_of_sequences = len(sequences) + sequences = ("N"*opts.pad).join(sequences) + + # Open output file + if opts.wrap > 0: + for i in range(0, len(sequences), opts.wrap): + wrapped_sequence = sequences[i:i+opts.wrap] + # Write header and wrapped sequence + print(wrapped_sequence, file=f_out) + else: + print(sequences, file=f_out) + + + # Close + if f_in != sys.stdin: + f_in.close() + if f_out != sys.stdout: + f_out.close() + +if __name__ == "__main__": + main() + + + diff --git a/src/scripts/concatenate_fasta.py b/src/scripts/concatenate_fasta.py index 0977942..38d12ae 100755 --- a/src/scripts/concatenate_fasta.py +++ b/src/scripts/concatenate_fasta.py @@ -1,6 +1,6 @@ #!/usr/bin/env python from __future__ import print_function, division 
-import sys, os, argparse +import sys, os, argparse, hashlib import pandas as pd from Bio.SeqIO.FastaIO import SimpleFastaParser @@ -12,45 +12,7 @@ pd.options.display.max_colwidth = 100 # from tqdm import tqdm __program__ = os.path.split(sys.argv[0])[-1] -__version__ = "2022.02.17" - - - -def fasta_to_saf(path, compression="infer"): - """ - # GeneID Chr Start End Strand - # http://bioinf.wehi.edu.au/featureCounts/ - - # Useful: - import re - record_id = "lcl|NC_018632.1_cds_WP_039228897.1_1 [gene=dnaA] [locus_tag=MASE_RS00005] [protein=chromosomal replication initiator protein DnaA] [protein_id=WP_039228897.1] [location=410..2065] [gbkey=CDS]" - re.search("\[locus_tag=(\w+)\]", record_id).group(1) - # 'MASE_RS00005' - - """ - - - saf_data = list() - - if path == "stdin": - f = sys.stdin - else: - f = get_file_object(path, mode="read", compression=compression, verbose=False) - - for id_record, seq in pv(SimpleFastaParser(f), "Reading sequences [{}]".format(path)): - id_record = id_record.split(" ")[0] - fields = [ - id_record, - id_record, - 1, - len(seq), - "+", - ] - saf_data.append(fields) - if f is not sys.stdin: - f.close() - return pd.DataFrame(saf_data, columns=["GeneID", "Chr", "Start", "End", "Strand"]) - +__version__ = "2023.12.13" def main(args=None): # Path info @@ -120,7 +82,15 @@ def main(args=None): safe_mode=False, verbose=False, ) + saf_filepath = os.path.join(opts.output_directory, "{}.saf".format(id_sample)) + + f_duplicates = get_file_object( + path=os.path.join(opts.output_directory, "{}.duplicates_removed.list".format(id_sample)), + mode="write", + safe_mode=False, + verbose=False, + ) else: os.makedirs(os.path.join(opts.output_directory, id_sample), exist_ok=True) @@ -130,29 +100,43 @@ def main(args=None): safe_mode=False, verbose=False, ) + saf_filepath = os.path.join(opts.output_directory, id_sample, "{}.saf".format(opts.basename)) + f_duplicates = get_file_object( + path=os.path.join(opts.output_directory, id_sample, "{}.duplicates_removed.list".format(opts.basename)), + mode="write", + safe_mode=False, + verbose=False, + ) # Read input fasta, filter out short sequences, and write to concatenated file + sequence_hashes = set() saf_data = list() for fp in pv(filepaths, description=id_sample, unit= " files"): f_query = get_file_object(fp, mode="read", verbose=False) for id, seq in SimpleFastaParser(f_query): if len(seq) >= opts.minimum_contig_length: - print(">{}\n{}".format(id, seq), file=f_out) + id_hash = hashlib.md5(seq.upper().encode()).hexdigest() id_record = id.split(" ")[0] - fields = [ - id_record, - id_record, - 1, - len(seq), - "+", - ] - saf_data.append(fields) + if id_hash not in sequence_hashes: + print(">{}\n{}".format(id, seq), file=f_out) + fields = [ + id_record, + id_record, + 1, + len(seq), + "+", + ] + saf_data.append(fields) + sequence_hashes.add(id_hash) + else: + print(id_record, file=f_duplicates) f_query.close() f_out.close() + f_duplicates.close() df_saf = pd.DataFrame(saf_data, columns=["GeneID", "Chr", "Start", "End", "Strand"]) df_saf.to_csv(saf_filepath, sep="\t", index=None) @@ -173,26 +157,39 @@ def main(args=None): saf_filepath = os.path.join(opts.output_directory, "{}.saf".format(opts.basename)) + f_duplicates = get_file_object( + path=os.path.join(opts.output_directory, "{}.duplicates_removed.list".format(opts.basename)), + mode="write", + safe_mode=False, + verbose=False, + ) + # Read input fasta, filter out short sequences, and write to concatenated file + sequence_hashes = set() saf_data = list() for fp in pv(filepaths, unit= " 
files"): f_query = get_file_object(fp, mode="read", verbose=False) for id, seq in SimpleFastaParser(f_query): if len(seq) >= opts.minimum_contig_length: - print(">{}\n{}".format(id, seq), file=f_out) + id_hash = hashlib.md5(seq.upper().encode()).hexdigest() id_record = id.split(" ")[0] - fields = [ - id_record, - id_record, - 1, - len(seq), - "+", - ] - saf_data.append(fields) - + if id_hash not in sequence_hashes: + print(">{}\n{}".format(id, seq), file=f_out) + fields = [ + id_record, + id_record, + 1, + len(seq), + "+", + ] + saf_data.append(fields) + else: + print(id_record, file=f_duplicates) f_query.close() f_out.close() + f_duplicates.close() + df_saf = pd.DataFrame(saf_data, columns=["GeneID", "Chr", "Start", "End", "Strand"]) df_saf.to_csv(saf_filepath, sep="\t", index=None) diff --git a/src/scripts/consensus_genome_classification_ranked.py b/src/scripts/consensus_genome_classification_ranked.py new file mode 100755 index 0000000..2c190fa --- /dev/null +++ b/src/scripts/consensus_genome_classification_ranked.py @@ -0,0 +1,222 @@ +#!/usr/bin/env python +from __future__ import print_function, division +import sys, os, argparse +from collections import OrderedDict, defaultdict +import pandas as pd +import numpy as np + + +pd.options.display.max_colwidth = 100 +# from tqdm import tqdm +__program__ = os.path.split(sys.argv[0])[-1] +__version__ = "2023.11.3" + +# RANK_TO_PREFIX="superkingdom:d__,phylum:p__,class:c__,order:o__,family:f__,genus:g__,species:s__" + +RANK_PREFIXES="d__,p__,c__,o__,f__,g__,s__" + +# Fill empty taxonomic levels for consensus classification +def fill_lower_taxonomy_levels( + classifications:pd.Series, + rank_prefixes:list, + delimiter:str=";", + ): + + rank_prefixes = list(rank_prefixes) + number_of_taxonomic_levels = len(rank_prefixes) + classifications_ = dict() + for id_genome, classification in pd.Series(classifications).items(): + taxonomy = classification.split(delimiter) + classifications_[id_genome] = delimiter.join(taxonomy + rank_prefixes[len(taxonomy):]) + return pd.Series(classifications_)[classifications.index] + +# Get consensus classification +def get_consensus_classification( + classification:pd.Series, + classification_weights:pd.Series, + genome_to_genomecluster:pd.Series, + rank_prefixes:list, + number_of_taxonomic_levels="infer", + delimiter=";", + leniency:float=1.382, + ): + # Assertions + assert np.all(classification.notnull()) + assert np.all(classification_weights.notnull()) + assert np.all(genome_to_genomecluster.notnull()) + + # Set and index overlap + a = set(classification.index) + b = set(classification_weights.index) + c = set(genome_to_genomecluster.index) + assert a == b, "`classification` and `classification_weights` must have the same keys in the index" + assert a <= c, "`classification` and `classification_weights` must be a subset (or equal) to the keys in `genome_to_genomecluster` index" + index_genomes = pd.Index(sorted(a & b & c )) + classification = classification[index_genomes] + classification_weights = classification_weights[index_genomes] + genome_to_genomecluster = genome_to_genomecluster[index_genomes] + + # Taxonomic levels + taxonomic_levels = classification.map(lambda x: x.count(delimiter)).unique() + if len(taxonomic_levels): + assert len(taxonomic_levels) == 1, "Taxonomic levels in `classification` should all have the same number of delimiters" #! 
Might need to change this to allow for missing taxonomic levels + else: + number_of_taxonomic_levels = 1 + + if number_of_taxonomic_levels == "infer": + number_of_taxonomic_levels = taxonomic_levels[0] + 1 + + # Scaling factors + scaling_factors = np.arange(1, number_of_taxonomic_levels + 1) # d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Actinomycetales;f__Dermabacteraceae;g__Brachybacterium + scaling_factors = np.power(scaling_factors, leniency) + + # Get container for scores [SLC -> Taxonomy -> Score] + # + # For example the following MAG: + # CLASSIFICATION=d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Mycobacteriales;f__Mycobacteriaceae;g__Corynebacterium;s__Corynebacterium aurimucosum_E + # MSA_PERCENT=80.0 + # + # Would be stored and appended for its corresponding SLC: + # d__Bacteria += 80.0 + # d__Bacteria;p__Actinobacteriota += 80.0 + # d__Bacteria;p__Actinobacteriota;c__Actinomycetia += 80.0 + # d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Mycobacteriales += 80.0 + # d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Mycobacteriales;f__Mycobacteriaceae += 80.0 + # d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Mycobacteriales;f__Mycobacteriaceae;g__Corynebacterium += 80.0 + # d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Mycobacteriales;f__Mycobacteriaceae;g__Corynebacterium;s__Corynebacterium aurimucosum_E += 80.0 + genomecluster_taxa_scores = defaultdict(lambda: defaultdict(float)) + + # Iterate through MAG, classification, and score + df = pd.concat([genome_to_genomecluster.to_frame("id"), classification.to_frame("classification"), classification_weights.to_frame("weight")], axis=1) + genomecluster_to_genomes = defaultdict(list) + for id_genome, (id_genome_cluster, classification, w) in df.iterrows(): + genomecluster_to_genomes[id_genome_cluster].append(id_genome) + # Split the taxonomy classification by levels + levels = classification.split(delimiter) + # Remove the empty taxonomy levels (e.g., g__Corynebacterium;s__ --> g__Corynebacterium) + # levels = list(filter(lambda x:x not in rank_prefixes, levels)) + number_of_query_levels = len(levels) + # Iterate through each level, scale score by the leniency weights, and add to running sum + for i in range(1, number_of_query_levels + 1): + taxon_at_level = levels[i-1] + taxon_level_is_missing = taxon_at_level in rank_prefixes + if taxon_level_is_missing: + weighted_score = 0.0 + print("`{}` is missing taxonomic level `{}`".format(id_genome, taxon_at_level), file=sys.stderr) + + else: + weighted_score = float(w) * scaling_factors[i-1] + genomecluster_taxa_scores[id_genome_cluster][tuple(levels[:i])] += weighted_score + genomecluster_to_genomes = pd.Series(genomecluster_to_genomes)
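+ # Worked example of the scaling above (hypothetical weights): with the default leniency
+ # of 1.382, scaling_factors for 7 ranks are [1**1.382, 2**1.382, ..., 7**1.382], i.e.,
+ # approximately [1.0, 2.6, 4.6, 6.8, 9.2, 11.9, 14.7]. A genome classified to species with
+ # weight 80.0 therefore adds 80.0*1.0 to its domain path, 80.0*2.6 to its domain;phylum
+ # path, and so on, so deeper paths shared across a cluster accumulate disproportionately
+ # higher scores before the selection in the dataframe below.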
df_consensus_classification["classifications"] = df["classification"].groupby(genome_to_genomecluster).apply(lambda x: list(x)) + df_consensus_classification["weights"] = df["weight"].groupby(genome_to_genomecluster).apply(lambda x: list(x)) + df_consensus_classification.index.name = "id" + + # Homogeneity + slc_taxa_homogeneity = defaultdict(lambda: defaultdict(float)) + for id_genome_cluster, (classifications, weights) in df_consensus_classification[["classifications", "weights"]].iterrows(): + for (c, w) in zip(classifications, weights): + slc_taxa_homogeneity[id_genome_cluster][c] += w + df_consensus_classification["homogeneity"] = pd.DataFrame(slc_taxa_homogeneity).T.apply(lambda x: np.nanmax(x)/np.nansum(x), axis=1) + + fields = [ + "consensus_classification", + "homogeneity", + "number_of_unique_classifications", + "number_of_components", + "components", + "classifications", + "weights", + "score", + ] + return df_consensus_classification.loc[:,fields] + + + +def main(args=None): + # Path info + script_directory = os.path.dirname(os.path.abspath( __file__ )) + script_filename = __program__ + # Path info + description = """ + Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable) + usage = "{} -i -o ".format(__program__) + epilog = "Copyright 2022 Josh L. Espinoza (jespinoz@jcvi.org)" + + # Parser + parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter) + # Pipeline + parser.add_argument("-i","--input", default="stdin", type=str, help = "path/to/genome_to_classification.tsv [id_genome][id_genome_cluster][classification][weight]; No header. [Default: stdin]") + parser.add_argument("-o","--output", type=str, default="stdout", help = "Output table with consensus classification [Default: stdout]") + parser.add_argument("-l","--leniency", default=1.382, type=float, help = "Leniency parameter. Lower value means more conservative weighting. A value of 1 indiciates no weight bias. A value greater than 1 puts higher weight on higher level taxonomic assignments. A value less than 1 puts lower weights on higher level taxonomic assignments. [Default: 1.382]") + parser.add_argument("-r", "--rank_prefixes", type=str, default=RANK_PREFIXES, help = "Rank prefixes separated by , delimiter'\n[Default: {}]".format(RANK_PREFIXES)) + parser.add_argument("-d", "--delimiter", type=str, default=";", help = "Taxonomic delimiter [Default: ; ]") + parser.add_argument("-s", "--simple", action="store_true", help = "Simple classification that does not use lineage information from --rank_prefixes") + # parser.add_argument("--assert_resolved_taxonomy", action="store_true", help = "Do not allow missing taxonomic levels. (e.g., d__Eukaryota;p__;c__Pelagophyceae;o__Pelagomonadales;f__;g__Aureococcus;s__Aureococcus anophagefferens is missing phylum)") + parser.add_argument("--remove_missing_classifications", action="store_true", help = "Remove all classifications and weights that are null. 
+ +def main(args=None): + # Path info + script_directory = os.path.dirname(os.path.abspath( __file__ )) + script_filename = __program__ + # Path info + description = """ + Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable) + usage = "{} -i <input.tsv> -o <output.tsv>".format(__program__) + epilog = "Copyright 2022 Josh L. Espinoza (jespinoz@jcvi.org)" + + # Parser + parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter) + # Pipeline + parser.add_argument("-i","--input", default="stdin", type=str, help = "path/to/genome_to_classification.tsv [id_genome][id_genome_cluster][classification][weight]; No header. [Default: stdin]") + parser.add_argument("-o","--output", type=str, default="stdout", help = "Output table with consensus classification [Default: stdout]") + parser.add_argument("-l","--leniency", default=1.382, type=float, help = "Leniency parameter. Lower value means more conservative weighting. A value of 1 indicates no weight bias. A value greater than 1 puts higher weight on higher level taxonomic assignments. A value less than 1 puts lower weights on higher level taxonomic assignments. [Default: 1.382]") + parser.add_argument("-r", "--rank_prefixes", type=str, default=RANK_PREFIXES, help = "Rank prefixes separated by ',' delimiter\n[Default: {}]".format(RANK_PREFIXES)) + parser.add_argument("-d", "--delimiter", type=str, default=";", help = "Taxonomic delimiter [Default: ; ]") + parser.add_argument("-s", "--simple", action="store_true", help = "Simple classification that does not use lineage information from --rank_prefixes") + # parser.add_argument("--assert_resolved_taxonomy", action="store_true", help = "Do not allow missing taxonomic levels. (e.g., d__Eukaryota;p__;c__Pelagophyceae;o__Pelagomonadales;f__;g__Aureococcus;s__Aureococcus anophagefferens is missing phylum)") + parser.add_argument("--remove_missing_classifications", action="store_true", help = "Remove all classifications and weights that are null. For viruses this could cause an error if this isn't selected.") + parser.add_argument("-u", "--unclassified_label", default="Unclassified", type=str, help = "Unclassified label [Default: Unclassified]") + parser.add_argument("-w", "--unclassified_weight", default=100.0, type=float, help = "Unclassified label weight [Default: 100.0]") + + + # Options + opts = parser.parse_args() + opts.script_directory = script_directory + opts.script_filename = script_filename + + # I/O + if opts.input == "stdin": + opts.input = sys.stdin + + if opts.output == "stdout": + opts.output = sys.stdout + + # Leniency + assert opts.leniency > 0, "--leniency must be > 0" + # Format rank to lineage + opts.rank_prefixes = opts.rank_prefixes.strip().split(",") + + # Classifications + df_input = pd.read_csv(opts.input, sep="\t", index_col=0, header=None) + genome_to_genomecluster = df_input.iloc[:,0] + genome_to_classification = df_input.iloc[:,1].reindex(genome_to_genomecluster.index) + genome_to_weights = df_input.iloc[:,2].reindex(genome_to_genomecluster.index) + if opts.remove_missing_classifications: + genome_to_weights = genome_to_weights.dropna() + genome_to_classification = genome_to_classification[genome_to_weights.index] + else: + mask = genome_to_weights.isnull() + genome_to_classification[mask] = ";".join(map(lambda x: f"{x}{opts.unclassified_label}", opts.rank_prefixes)) # Rank prefixes already end with "__" + genome_to_weights[mask] = opts.unclassified_weight + + + # Consensus classification + df_consensus_classification = get_consensus_classification( + classification=genome_to_classification, + classification_weights=genome_to_weights, + genome_to_genomecluster=genome_to_genomecluster, + rank_prefixes=opts.rank_prefixes, + number_of_taxonomic_levels="infer", + delimiter=opts.delimiter, + leniency=opts.leniency, + ) + + if not opts.simple: + # Fill empty taxonomy levels + df_consensus_classification["consensus_classification"] = fill_lower_taxonomy_levels( + classifications=df_consensus_classification["consensus_classification"], + rank_prefixes=opts.rank_prefixes, + delimiter=opts.delimiter, + ) + + df_consensus_classification.to_csv(opts.output, sep="\t") + +if __name__ == "__main__": + main()
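A minimal sketch of driving the script above (hypothetical filenames; the input is the four-column table described for --input and the flags are the ones defined in the parser):

consensus_genome_classification_ranked.py -i genome_to_classification.tsv -o consensus_classification.tsv -l 1.382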
{}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable) + usage = "{} -i -l genome -o ".format(__program__) + epilog = "Copyright 2021 Josh L. Espinoza (jespinoz@jcvi.org)" + + # Parser + parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter) + # Pipeline + parser.add_argument("-i","--annotation_results", type=str, default = "stdin", help = "path/to/annotation.tsv from annotate.py [Default: stdin]") + parser.add_argument("-X","--counts", type=str, required = True, help = "path/to/X_orfs.tsv[.gz] from mapping.py at the ORF/gene/protein level. Rows=Samples, Columns=Genes") + parser.add_argument("-g","--genes", type=str, help = "path/to/genes.ffn[.gz] fasta used for scaling-factors") + parser.add_argument("-o","--output_directory", type=str, default="phylogenomic_functional_categories", help = "path/to/output_directory [Default: phylogenomic_functional_categories]") + parser.add_argument("-l","--level", type=str, default="genome_cluster", help = "level {genome, genome_cluster} [Default: genome_cluster]") + parser.add_argument("--minimum_count", type=float, default=1.0, help = "Minimum count to include gene [Default: 1 ]") + parser.add_argument("--veba_database", type=str, help = "VEBA Database [Default: $VEBA_DATABASE environment variable]") + + # parser.add_argument("-p", "--include_protein_identifiers", action="store_true", help = "Write protein identifiers") + # parser.add_argument("--header", action="store_true", help = "Write header") + + # Options + opts = parser.parse_args() + opts.script_directory = script_directory + opts.script_filename = script_filename + + assert opts.level in {"genome", "genome_cluster"}, "--level must be either {genome, genome_cluster}" + + if opts.level == "genome": + level_field = ("Identifiers", "id_genome") + if opts.level == "genome_cluster": + level_field = ("Identifiers", "id_genome_cluster") + + if not opts.veba_database: + opts.veba_database = os.environ["VEBA_DATABASE"] + + os.makedirs(opts.output_directory, exist_ok=True) + # os.makedirs(os.path.join(opts.output_directory, opts.level), exist_ok=True) + + # Read annotations + if opts.annotation_results == "stdin": + opts.annotation_results = sys.stdin + df_annotations = pd.read_csv(opts.annotation_results, sep="\t", index_col=0, header=[0,1]) + protein_to_organism = df_annotations[level_field] + + # KEGG Database + delimiters = [",","_","-","+"] + + # Load MicrobeAnnotator KEGG dictionaries + module_to_kos__unprocessed = defaultdict(set) + for fp in glob.glob(os.path.join(opts.veba_database, , "*.pkl")): + with open(fp, "rb") as f: + d = pickle.load(f) + + for id_module, v1 in d.items(): + if isinstance(v1, list): + try: + module_to_kos__unprocessed[id_module].update(v1) + except TypeError: + for v2 in v1: + module_to_kos__unprocessed[id_module].update(v2) + else: + for k2, v2 in v1.items(): + if isinstance(v2, list): + try: + module_to_kos__unprocessed[id_module].update(v2) + except TypeError: + for v3 in v2: + module_to_kos__unprocessed[id_module].update(v3) + + # Flatten the KEGG orthologs + module_to_kos = dict() + for id_module, kos_unprocessed in module_to_kos__unprocessed.items(): + kos_processed = set() + for id_ko in kos: + composite=False + for sep in delimiters: + if sep in id_ko: + id_ko = id_ko.replace(sep,";") + composite = True + if composite: + kos_composite = set(map(str.strip, filter(bool, id_ko.split(";")))) + kos_processed.update(kos_composite) + else: + kos_processed.add(id_ko) + 
module_to_kos[id_module] = kos_processed + + # Read counts + X_counts = pd.read_csv(opts.counts, sep="\t", index_col=0) + + # Organisms + organisms = df_annotations[level_field].unique() + + # Organizing KOs + organism_to_kos = defaultdict(set) + protein_to_kos = dict() + kos_global = list() + for id_protein, (id_organism, ko_ids) in tqdm(df_annotations.loc[:,[level_field, ("KOFAM", "ids")]].iterrows(), "Compiling KO identifiers", total=df_annotations.shape[0]): + ko_ids = eval(ko_ids) + if len(ko_ids): + ko_ids = set(ko_ids) + protein_to_kos[id_protein] = ko_ids + organism_to_kos[id_organism].update(ko_ids) + for id_ko in ko_ids: + kos_global.append([id_protein, id_organism, id_ko]) + df_kos_global = pd.DataFrame(kos_global, columns=["id_protein", level_field[1], "id_kegg-ortholog"]) + del kos_global + df_kos_global.to_csv(os.path.join(opts.output_directory, "kos.{}s.tsv".format(opts.level)), sep="\t", index=False) + + # Sample -> Organisms -> KOs + sample_to_organism_to_kos = defaultdict(lambda: defaultdict(set)) + for id_sample, row in X_counts.iterrows(): + for id_protein, count in tqdm(row.items(), total=X_counts.shape[1]): + if id_protein in protein_to_kos: + if count >= opts.minimum_count: + id_organism = protein_to_organism[id_protein] + kos = protein_to_kos[id_protein] + sample_to_organism_to_kos[id_sample][id_organism].update(kos) + + + + + + + + + +if __name__ == "__main__": + main() diff --git a/src/scripts/devel/representative_genome_from_networkx_graph.py b/src/scripts/devel/representative_genome_from_networkx_graph.py new file mode 100755 index 0000000..4a82f64 --- /dev/null +++ b/src/scripts/devel/representative_genome_from_networkx_graph.py @@ -0,0 +1,52 @@ +#!/usr/bin/env python +import sys, os, argparse, gzip +from Bio.SeqIO.FastaIO import SimpleFastaParser +from tqdm import tqdm + +__program__ = os.path.split(sys.argv[0])[-1] +__version__ = "2023.11.10" + +def main(args=None): + # Path info + script_directory = os.path.dirname(os.path.abspath( __file__ )) + script_filename = __program__ + + # Path info + description = """ + Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable) + usage = "{} -i -o )".format(__program__) + epilog = "Copyright 2021 Josh L. 
Espinoza (jespinoz@jcvi.org)" + + # Parser + parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter) + + # Pipeline + parser.add_argument("-c","--genome_to_cluster", type=str, help = "Input fasta file") + parser.add_argument("-g","--graph", type=str, help = "Input fasta file") + parser.add_argument("-o","--output", default="stdout", type=str, help = "Output fasta file") + parser.add_argument("-m","--maximum_weight", default=100, type=str, help = "Output fasta file") + parser.add_argument("--genome_statistics", type=str, help = "Output fasta file") + parser.add_argument("--sort_by", type=str, help = "Output fasta file") + + # Options + opts = parser.parse_args() + opts.script_directory = script_directory + opts.script_filename = script_filename + + + + +G = nx.path_graph(4) # or DiGraph, MultiGraph, MultiDiGraph, etc +H = G.subgraph([0, 1, 2]) +list(H.edges) +[(0, 1), (1, 2)] + + + + + +if __name__ == "__main__": + main() + + + diff --git a/src/scripts/edgelist_to_clusters.py b/src/scripts/edgelist_to_clusters.py index 061b60e..30650b8 100755 --- a/src/scripts/edgelist_to_clusters.py +++ b/src/scripts/edgelist_to_clusters.py @@ -8,7 +8,7 @@ from Bio.SeqIO.FastaIO import SimpleFastaParser __program__ = os.path.split(sys.argv[0])[-1] -__version__ = "2023.4.17" +__version__ = "2023.12.11" def main(args=None): # Path info @@ -26,7 +26,7 @@ def main(args=None): parser.add_argument("-i","--input", type=str, default="stdin", help = "path/to/edgelist.tsv, No header. [id_1][id_2] or [id_1][id_2][weight] [Default: stdin]") # parser.add_argument("-o","--output", type=str, default="stdout", help = "path/to/clusters.tsv [Default: stdout]") - parser.add_argument("-t","--threshold", type=float, default=0.5, help = "Minimum weight threshold. [Default: 0.5]") + parser.add_argument("-t","--threshold", type=float, default=0.0, help = "Minimum weight threshold. [Default: 0.0]") parser.add_argument("-n", "--no_singletons", action="store_true", help = "Don't include self-interactions. Self-interactions will ensure unclustered genomes make it into the output") parser.add_argument("-b", "--basename", action="store_true", help = "Removes filepath prefix and extension. Support for gzipped filepaths.") parser.add_argument("--identifiers", type=str, help = "Identifiers to include. If missing identifiers and singletons are allowed, then they will be included as singleton clusters with weight of np.inf") @@ -53,6 +53,7 @@ def main(args=None): opts.script_directory = script_directory opts.script_filename = script_filename + # Input if opts.input == "stdin": opts.input = sys.stdin @@ -62,7 +63,11 @@ def main(args=None): opts.output = sys.stdout # Edge list - df_edgelist = pd.read_csv(opts.input, sep="\t", header=None) + try: + df_edgelist = pd.read_csv(opts.input, sep="\t", header=None) + except pd.errors.EmptyDataError: + df_edgelist = pd.DataFrame(columns=["query", "reference"]) + assert df_edgelist.shape[1] in {2,3}, "Must have 2 or 3 columns. 
{} provided.".format(df_edgelist.shape[1]) if opts.basename: def get_basename(x): @@ -72,9 +77,13 @@ def get_basename(x): return ".".join(fn.split(".")[:-1]) df_edgelist.iloc[:,:2] = df_edgelist.iloc[:,:2].applymap(get_basename) - edgelist = df_edgelist.iloc[:,:2].values.tolist() - - identifiers = set.union(*map(set, edgelist)) + # Identifiers from edgelist + if not df_edgelist.empty: + edgelist = df_edgelist.iloc[:,:2].values.tolist() + identifiers = set.union(*map(set, edgelist)) + else: + edgelist = list() + identifiers = set() all_identifiers = identifiers if opts.identifiers: @@ -84,6 +93,7 @@ def get_basename(x): id = line.strip() all_identifiers.add(id) + # Read in fasta if opts.fasta: id_to_sequence = dict() if opts.fasta.endswith(".gz"): diff --git a/src/scripts/eukaryotic_gene_modeling_wrapper.py b/src/scripts/eukaryotic_gene_modeling_wrapper.py index 83591e7..cd15774 100755 --- a/src/scripts/eukaryotic_gene_modeling_wrapper.py +++ b/src/scripts/eukaryotic_gene_modeling_wrapper.py @@ -13,7 +13,7 @@ # from tqdm import tqdm __program__ = os.path.split(sys.argv[0])[-1] -__version__ = "2023.10.16" +__version__ = "2023.11.13" # Tiara def get_tiara_cmd(input_filepaths, output_filepaths, output_directory, directories, opts): @@ -131,6 +131,7 @@ def get_metaeuk_cmd(input_filepaths, output_filepaths, output_directory, directo "--threads {}".format(opts.n_jobs), "-s {}".format(opts.metaeuk_sensitivity), "-e {}".format(opts.metaeuk_evalue), + "--split-memory-limit {}".format(opts.metaeuk_split_memory_limit), opts.metaeuk_options, os.path.join(directories["tmp"], "tmp.fasta"), opts.metaeuk_database, # db @@ -1380,6 +1381,7 @@ def main(args=None): parser_metaeuk = parser.add_argument_group('MetaEuk arguments') parser_metaeuk.add_argument("--metaeuk_sensitivity", type=float, default=4.0, help="MetaEuk | Sensitivity: 1.0 faster; 4.0 fast; 7.5 sensitive [Default: 4.0]") parser_metaeuk.add_argument("--metaeuk_evalue", type=float, default=0.01, help="MetaEuk | List matches below this E-value (range 0.0-inf) [Default: 0.01]") + parser_metaeuk.add_argument("--metaeuk_split_memory_limit", type=str, default="36G", help="MetaEuk | Set max memory per split. E.g. 800B, 5K, 10M, 1G. Use 0 to use all available system memory. (Default value is experimental) [Default: 36G]") parser_metaeuk.add_argument("--metaeuk_options", type=str, default="", help="MetaEuk | More options (e.g. --arg 1 ) [Default: ''] https://github.com/soedinglab/metaeuk") # Pyrodigal diff --git a/src/scripts/filter_spades_assembly.py b/src/scripts/filter_spades_assembly.py new file mode 100755 index 0000000..08351a0 --- /dev/null +++ b/src/scripts/filter_spades_assembly.py @@ -0,0 +1,100 @@ +#!/usr/bin/env python +import sys, os, argparse, gzip +from Bio.SeqIO.FastaIO import SimpleFastaParser +from tqdm import tqdm + +__program__ = os.path.split(sys.argv[0])[-1] +__version__ = "2023.12.5" + +def main(args=None): + # Path info + script_directory = os.path.dirname(os.path.abspath( __file__ )) + script_filename = __program__ + + # Path info + description = """ + Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable) + usage = "{} -i -o )".format(__program__) + epilog = "Copyright 2021 Josh L. 
Espinoza (jespinoz@jcvi.org)"
+
+    # Parser
+    parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter)
+
+    # Pipeline
+    parser.add_argument("-i","--input", default="stdin", type=str, help = "Input fasta file")
+    parser.add_argument("-o","--output", default="stdout", type=str, help = "Output fasta file")
+    parser.add_argument("-r","--retain_description", action="store_true", help = "Retain description")
+    parser.add_argument("-c","--minimum_coverage", default=0.0, type=float, help = "Minimum coverage accepted [Default: 0.0]")
+    parser.add_argument("-m","--minimum_sequence_length", default=1, type=int, help = "Minimum sequence length accepted [Default: 1]")
+
+    # Options
+    opts = parser.parse_args()
+    opts.script_directory = script_directory
+    opts.script_filename = script_filename
+
+    assert opts.minimum_sequence_length > 0
+
+    # Input
+    f_in = None
+    if opts.input == "stdin":
+        f_in = sys.stdin
+    else:
+        if opts.input.endswith(".gz"):
+            f_in = gzip.open(opts.input, "rt")
+        else:
+            f_in = open(opts.input, "r")
+    assert f_in is not None
+
+    # Output
+    f_out = None
+    if opts.output == "stdout":
+        f_out = sys.stdout
+    else:
+        if opts.output.endswith(".gz"):
+            f_out = gzip.open(opts.output, "wt")
+        else:
+            f_out = open(opts.output, "w")
+    assert f_out is not None
+
+    if opts.retain_description:
+        for header, seq in tqdm(SimpleFastaParser(f_in), "Reading fasta input"):
+            id = header.split(" ")[0].strip()
+            fields = id.split("_")
+            try:
+                length_index = fields.index("length")
+                coverage_index = fields.index("cov")
+            except ValueError:
+                raise ValueError("Your fasta identifiers do not look like they are from SPAdes: {}".format(id))
+            assert ">" not in seq, "`{}` has a '>' character in the sequence which will cause an error. This can arise from concatenating fasta files where a record is missing a final linebreak".format(header)
+            coverage = float(fields[coverage_index + 1])
+            length = int(fields[length_index + 1])
+            if all([coverage >= opts.minimum_coverage, length >= opts.minimum_sequence_length]):
+                print(">{}\n{}".format(header,seq), file=f_out)
+    else:
+        for header, seq in tqdm(SimpleFastaParser(f_in), "Reading fasta input"):
+            id = header.split(" ")[0].strip()
+            fields = id.split("_")
+            try:
+                length_index = fields.index("length")
+                coverage_index = fields.index("cov")
+            except ValueError:
+                raise ValueError("Your fasta identifiers do not look like they are from SPAdes: {}".format(id))
+            assert ">" not in seq, "`{}` has a '>' character in the sequence which will cause an error. This can arise from concatenating fasta files where a record is missing a final linebreak".format(header)
+            coverage = float(fields[coverage_index + 1])
+            length = int(fields[length_index + 1])
+            if all([coverage >= opts.minimum_coverage, length >= opts.minimum_sequence_length]):
+                print(">{}\n{}".format(id,seq), file=f_out)
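+
+    # --- Editor's note: hedged sketch of the header convention assumed above ---
+    # SPAdes names contigs like "NODE_1_length_114_cov_7.5"; splitting on "_"
+    # puts each value one field after its "length"/"cov" token, e.g.:
+    #   fields = "NODE_1_length_114_cov_7.5".split("_")
+    #   length = int(fields[fields.index("length") + 1])     # 114
+    #   coverage = float(fields[fields.index("cov") + 1])    # 7.5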
+
+    # Close
+    if f_in != sys.stdin:
+        f_in.close()
+    if f_out != sys.stdout:
+        f_out.close()
+
+if __name__ == "__main__":
+    main()
diff --git a/src/scripts/global_clustering.py b/src/scripts/global_clustering.py
index 4783794..8e8b6f3 100755
--- a/src/scripts/global_clustering.py
+++ b/src/scripts/global_clustering.py
@@ -15,7 +15,7 @@
# from tqdm import tqdm

__program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2023.10.24"
+__version__ = "2023.12.8"

def get_basename(x):
    _, fn = os.path.split(x)
@@ -50,13 +50,15 @@
    """
    accessory_scripts = {
        "edgelist_to_clusters.py",
-        "mmseqs2_wrapper.py",
+        "clustering_wrapper.py",
        # "table_to_fasta.py",
    }

    required_executables={
+        "skani",
        "fastANI",
        "mmseqs",
+        "diamond",
    }

@@ -97,7 +99,18 @@
# Configure parameters
def configure_parameters(opts, directories):
-    assert_acceptable_arguments(opts.algorithm, {"easy-cluster", "easy-linclust"})
+    assert_acceptable_arguments(opts.protein_clustering_algorithm, {"easy-cluster", "easy-linclust", "mmseqs-cluster", "mmseqs-linclust", "diamond-cluster", "diamond-linclust"})
+    if opts.protein_clustering_algorithm in {"easy-cluster", "easy-linclust"}:
+        d = {"easy-cluster":"mmseqs-cluster", "easy-linclust":"mmseqs-linclust"}
+        warnings.warn("\n\nPlease use `{}` instead of `{}` for MMSEQS2 clustering.".format(d[opts.protein_clustering_algorithm], opts.protein_clustering_algorithm))
+        opts.protein_clustering_algorithm = d[opts.protein_clustering_algorithm]
+
+    if opts.skani_nonviral_preset.lower() == "none":
+        opts.skani_nonviral_preset = None
+
+    if opts.skani_viral_preset.lower() == "none":
+        opts.skani_viral_preset = None
+
    assert 0 < opts.minimum_core_prevalence <= 1.0, "--minimum_core_prevalence must be a float between (0.0,1.0])"
    # Set environment variables
    add_executables_to_environment(opts=opts)
@@ -119,10 +132,10 @@
    parser_io = parser.add_argument_group('Required I/O arguments')
    parser_io.add_argument("-i", "--genomes_table", type=str, default="stdin", help = "path/to/genomes_table.tsv, Format: Must include the following columns (No header) [organism_type][id_sample][id_mag][genome][proteins][cds] but can include additional columns to the right (e.g., [gene_models]). Suggested input is from `compile_genomes_table.py` script.
[Default: stdin]") parser_io.add_argument("-o","--output_directory", type=str, default="global_clustering_output", help = "path/to/project_directory [Default: global_clustering_output]") - parser_io.add_argument("-e", "--no_singletons", action="store_true", help="Exclude singletons") #isPSLC-1_SSPC-3345__SRR178126 - parser_io.add_argument("-R", "--no_representative_sequences", action="store_true", help="Do not write representative sequences to fasta") #isPSLC-1_SSPC-3345__SRR178126 - parser_io.add_argument("-C", "--no_core_sequences", action="store_true", help="Do not write core pagenome sequences to fasta") #isPSLC-1_SSPC-3345__SRR178126 - # parser_io.add_argument("-M", "--no_marker_sequences", action="store_true", help="Do not write core pagenome sequences to fasta") #isPSLC-1_SSPC-3345__SRR178126 + parser_io.add_argument("-e", "--no_singletons", action="store_true", help="Exclude singletons") + parser_io.add_argument("-R", "--no_representative_sequences", action="store_true", help="Do not write representative sequences to fasta") + parser_io.add_argument("-C", "--no_core_sequences", action="store_true", help="Do not write core pagenome sequences to fasta") + # parser_io.add_argument("-M", "--no_marker_sequences", action="store_true", help="Do not write core pagenome sequences to fasta") # Utility parser_utility = parser.add_argument_group('Utility arguments') @@ -132,29 +145,50 @@ def main(args=None): parser_utility.add_argument("-v", "--version", action='version', version="{} v{}".format(__program__, __version__)) # parser_utility.add_argument("--verbose", action='store_true') - # FastANI + # ANI + parser_genome_clustering = parser.add_argument_group('Genome clustering arguments') + parser_genome_clustering.add_argument("-G", "--genome_clustering_algorithm", type=str, choices={"fastani", "skani"}, default="skani", help="Program to use for ANI calculations. `skani` is faster and more memory efficient. For v1.0.0 - v1.3.x behavior, use `fastani`. [Default: skani]") + parser_genome_clustering.add_argument("-A", "--ani_threshold", type=float, default=95.0, help="Species-level cluster (SLC) ANI threshold (Range (0.0, 100.0]) [Default: 95.0]") + parser_genome_clustering.add_argument("--genome_cluster_prefix", type=str, default="SLC-", help="Cluster prefix [Default: 'SLC-") + parser_genome_clustering.add_argument("--genome_cluster_suffix", type=str, default="", help="Cluster suffix [Default: '") + parser_genome_clustering.add_argument("--genome_cluster_prefix_zfill", type=int, default=0, help="Cluster prefix zfill. Use 7 to match identifiers from OrthoFinder. Use 0 to add no zfill. [Default: 0]") #7 + + parser_skani = parser.add_argument_group('Skani triangle arguments') + parser_skani.add_argument("--skani_target_ani", type=float, default=80, help="skani | If you set --skani_target_ani to --ani_threshold, you may screen out genomes ANI ≥ --ani_threshold [Default: 80]") + parser_skani.add_argument("--skani_minimum_af", type=float, default=15, help="skani | Minimum aligned fraction greater than this value [Default: 15]") + parser_skani.add_argument("--skani_no_confidence_interval", action="store_true", help="skani | Output [5,95] ANI confidence intervals using percentile bootstrap on the putative ANI distribution") + # parser_skani.add_argument("--skani_low_memory", action="store_true", help="Skani | More options (e.g. 
--arg 1 ) https://github.com/bluenote-1577/skani [Default: '']") + + parser_skani = parser.add_argument_group('[Prokaryotic & Eukaryotic] Skani triangle arguments') + parser_skani.add_argument("--skani_nonviral_preset", type=str, default="medium", choices={"fast", "medium", "slow", "none"}, help="skani [Prokaryotic & Eukaryotic]| Use `none` if you are setting skani -c (compression factor) {fast, medium, slow, none} [Default: medium]") + parser_skani.add_argument("--skani_nonviral_compression_factor", type=int, default=125, help="skani [Prokaryotic & Eukaryotic]| Compression factor (k-mer subsampling rate). [Default: 125]") + parser_skani.add_argument("--skani_nonviral_marker_kmer_compression_factor", type=int, default=1000, help="skani [Prokaryotic & Eukaryotic] | Marker k-mer compression factor. Markers are used for filtering. [Default: 1000]") + parser_skani.add_argument("--skani_nonviral_options", type=str, default="", help="skani [Prokaryotic & Eukaryotic] | More options for `skani triangle` (e.g. --arg 1 ) [Default: '']") + + parser_skani = parser.add_argument_group('[Viral] Skani triangle arguments') + parser_skani.add_argument("--skani_viral_preset", type=str, default="slow", choices={"fast", "medium", "slow", "none"}, help="skani | Use `none` if you are setting skani -c (compression factor) {fast, medium, slow, none} [Default: slow]") + parser_skani.add_argument("--skani_viral_compression_factor", type=int, default=30, help="skani [Viral] | Compression factor (k-mer subsampling rate). [Default: 30]") + parser_skani.add_argument("--skani_viral_marker_kmer_compression_factor", type=int, default=200, help="skani [Viral] | Marker k-mer compression factor. Markers are used for filtering. Consider decreasing to ~200-300 if working with small genomes (e.g. plasmids or viruses). [Default: 200]") + parser_skani.add_argument("--skani_viral_options", type=str, default="", help="skani [Viral] | More options for `skani triangle` (e.g. --arg 1 ) [Default: '']") + parser_fastani = parser.add_argument_group('FastANI arguments') - parser_fastani.add_argument("-A", "--ani_threshold", type=float, default=95.0, help="FastANI | Species-level cluster (SLC) ANI threshold (Range (0.0, 100.0]) [Default: 95.0]") - parser_fastani.add_argument("--genome_cluster_prefix", type=str, default="SLC-", help="Cluster prefix [Default: 'SLC-") - parser_fastani.add_argument("--genome_cluster_suffix", type=str, default="", help="Cluster suffix [Default: '") - parser_fastani.add_argument("--genome_cluster_prefix_zfill", type=int, default=0, help="Cluster prefix zfill. Use 7 to match identifiers from OrthoFinder. Use 0 to add no zfill. [Default: 0]") #7 parser_fastani.add_argument("--fastani_options", type=str, default="", help="FastANI | More options (e.g. 
--arg 1 ) [Default: '']")

-    # MMSEQS2
-    parser_mmseqs2 = parser.add_argument_group('MMSEQS2 arguments')
-    parser_mmseqs2.add_argument("-a", "--algorithm", type=str, default="easy-cluster", help="MMSEQS2 | {easy-cluster, easy-linclust} [Default: easy-cluster]")
-    parser_mmseqs2.add_argument("-t", "--minimum_identity_threshold", type=float, default=50.0, help="MMSEQS2 | SLC-Specific Protein Cluster (SSPC, previously referred to as SSO) percent identity threshold (Range (0.0, 100.0]) [Default: 50.0]")
-    parser_mmseqs2.add_argument("-c", "--minimum_coverage_threshold", type=float, default=0.8, help="MMSEQS2 | SSPC coverage threshold (Range (0.0, 1.0]) [Default: 0.8]")
-    parser_mmseqs2.add_argument("--protein_cluster_prefix", type=str, default="SSPC-", help="Cluster prefix [Default: 'SSPC-")
-    parser_mmseqs2.add_argument("--protein_cluster_suffix", type=str, default="", help="Cluster suffix [Default: '")
-    parser_mmseqs2.add_argument("--protein_cluster_prefix_zfill", type=int, default=0, help="Cluster prefix zfill. Use 7 to match identifiers from OrthoFinder. Use 0 to add no zfill. [Default: 0]") #7
-    parser_mmseqs2.add_argument("--mmseqs2_options", type=str, default="", help="MMSEQS2 | More options (e.g. --arg 1 ) [Default: '']")
+    # Clustering
+    parser_protein_clustering = parser.add_argument_group('Protein clustering arguments')
+    parser_protein_clustering.add_argument("-P", "--protein_clustering_algorithm", type=str, choices={"mmseqs-cluster", "mmseqs-linclust", "diamond-cluster", "diamond-linclust"}, default="mmseqs-cluster", help="Clustering algorithm | Diamond can only be used for clustering proteins {mmseqs-cluster, mmseqs-linclust, diamond-cluster, diamond-linclust} [Default: mmseqs-cluster]")
+    parser_protein_clustering.add_argument("-t", "--minimum_identity_threshold", type=float, default=50.0, help="Clustering | Percent identity threshold (Range (0.0, 100.0]) [Default: 50.0]")
+    parser_protein_clustering.add_argument("-c", "--minimum_coverage_threshold", type=float, default=0.8, help="Clustering | Coverage threshold (Range (0.0, 1.0]) [Default: 0.8]")
+    parser_protein_clustering.add_argument("--protein_cluster_prefix", type=str, default="SSPC-", help="Cluster prefix [Default: 'SSPC-']")
+    parser_protein_clustering.add_argument("--protein_cluster_suffix", type=str, default="", help="Cluster suffix [Default: '']")
+    parser_protein_clustering.add_argument("--protein_cluster_prefix_zfill", type=int, default=0, help="Cluster prefix zfill. Use 7 to match identifiers from OrthoFinder. Use 0 to add no zfill. [Default: 0]") #7
+    parser_protein_clustering.add_argument("--mmseqs2_options", type=str, default="", help="MMSEQS2 | More options (e.g. --arg 1 ) [Default: '']")
+    parser_protein_clustering.add_argument("--diamond_options", type=str, default="", help="Diamond | More options (e.g.
--arg 1 ) [Default: '']") # Pangenome parser_pangenome = parser.add_argument_group('Pangenome arguments') parser_pangenome.add_argument("--minimum_core_prevalence", type=float, default=1.0, help="Minimum ratio of genomes detected in a SLC for a SSPC to be considered core (Range (0.0, 1.0]) [Default: 1.0]") - # Options opts = parser.parse_args() @@ -196,6 +230,19 @@ def main(args=None): configure_parameters(opts, directories) sys.stdout.flush() + # Genome clustering algorithm + GENOME_CLUSTERING_ALGORITHM = opts.genome_clustering_algorithm.lower() + if GENOME_CLUSTERING_ALGORITHM == "fastani": + GENOME_CLUSTERING_ALGORITHM = "FastANI" + if GENOME_CLUSTERING_ALGORITHM == "skani": + GENOME_CLUSTERING_ALGORITHM = "skani" + + # Protein clustering algorithm + PROTEIN_CLUSTERING_ALGORITHM = opts.protein_clustering_algorithm.split("-")[0].lower() + if PROTEIN_CLUSTERING_ALGORITHM == "mmseqs": + PROTEIN_CLUSTERING_ALGORITHM = PROTEIN_CLUSTERING_ALGORITHM.upper() + if PROTEIN_CLUSTERING_ALGORITHM == "diamond": + PROTEIN_CLUSTERING_ALGORITHM = PROTEIN_CLUSTERING_ALGORITHM.capitalize() # Make directories t0 = time.time() @@ -279,50 +326,125 @@ def main(args=None): # Commands f_cmds = open(os.path.join(opts.output_directory, "commands.sh"), "w") - # FastANI - print(format_header("* ({}) Running FastANI:".format(format_duration(t0))), file=sys.stdout) - for fp in pv(glob.glob(os.path.join(directories["intermediate"], "*", "genomes.list")), "Running FastANI"): + # Pairwise ANI + print(format_header("* ({}) Running {}:".format(format_duration(t0), GENOME_CLUSTERING_ALGORITHM)), file=sys.stdout) + for fp in pv(glob.glob(os.path.join(directories["intermediate"], "*", "genomes.list")), "Running pairwise ANI"): fields = fp.split("/") organism_type = fields[-2] output_directory = os.path.split(fp)[0] + + if opts.genome_clustering_algorithm == "skani": + name = "skani__{}".format(organism_type) + description = "[Program = skani] [Organism_Type = {}]".format(organism_type) + + arguments = list() + + if organism_type.lower() in {"viral", "virus", "virion"}: + arguments += [ + os.environ["skani"], + "triangle", + "--sparse", + "-t {}".format(opts.n_jobs), + "-l {}".format(fp), + "-o {}".format(os.path.join(output_directory, "skani_output.tsv")), + "--ci" if not opts.skani_no_confidence_interval else "", + "--min-af {}".format(opts.skani_minimum_af), + "-s {}".format(opts.skani_target_ani), + "-c {}".format(opts.skani_viral_compression_factor), + "-m {}".format(opts.skani_viral_marker_kmer_compression_factor), + "--{}".format(opts.skani_viral_preset) if opts.skani_viral_preset else "", + opts.skani_viral_options, + ] + + else: + arguments += [ + os.environ["skani"], + "triangle", + "--sparse", + "-t {}".format(opts.n_jobs), + "-l {}".format(fp), + "-o {}".format(os.path.join(output_directory, "skani_output.tsv")), + "--ci" if not opts.skani_no_confidence_interval else "", + "--min-af {}".format(opts.skani_minimum_af), + "-s {}".format(opts.skani_target_ani), + "-c {}".format(opts.skani_nonviral_compression_factor), + "-m {}".format(opts.skani_nonviral_marker_kmer_compression_factor), + "--{}".format(opts.skani_nonviral_preset) if opts.skani_nonviral_preset else "", + opts.skani_nonviral_options, + ] + + arguments += [ + "&&", + + "cat", + os.path.join(output_directory, "skani_output.tsv"), + "|", + "cut -f1-3", + "|", + "tail -n +2", + "|", + os.environ["edgelist_to_clusters.py"], + "--basename", + "-t {}".format(opts.ani_threshold), + "--no_singletons" if bool(opts.no_singletons) else "", + "--cluster_prefix 
{}{}".format(organism_type[0].upper(), opts.genome_cluster_prefix), + "--cluster_suffix {}".format(opts.genome_cluster_suffix) if bool(opts.genome_cluster_suffix) else "", + "--cluster_prefix_zfill {}".format(opts.genome_cluster_prefix_zfill), + "-o {}".format(os.path.join(output_directory, "genome_clusters.tsv")), + "--identifiers {}".format(os.path.join(directories["intermediate"], organism_type, "genome_identifiers.list")), + "--export_graph {}".format(os.path.join(directories["serialization"], f"{organism_type}.networkx_graph.pkl")), + "--export_dict {}".format(os.path.join(directories["serialization"], f"{organism_type}.dict.pkl")), + + "&&", + + "rm -rf {}".format(os.path.join(directories["tmp"], "*")), + + ] + + cmd = Command( + arguments, + name=name, + f_cmds=f_cmds, + ) - name = "fastani__{}".format(organism_type) - description = "[Program = FastANI] [Organism_Type = {}]".format(organism_type) - cmd = Command([ - os.environ["fastANI"], - "-t {}".format(opts.n_jobs), - "--rl {}".format(fp), - "--ql {}".format(fp), - "-o {}".format(os.path.join(output_directory, "fastani_output.tsv")), - opts.fastani_options, - - "&&", - - "cat", - os.path.join(output_directory, "fastani_output.tsv"), - "|", - "cut -f1-3", - "|", - os.environ["edgelist_to_clusters.py"], - "--basename", - "-t {}".format(opts.ani_threshold), - "--no_singletons" if bool(opts.no_singletons) else "", - "--cluster_prefix {}{}".format(organism_type[0].upper(), opts.genome_cluster_prefix), - "--cluster_suffix {}".format(opts.genome_cluster_suffix) if bool(opts.genome_cluster_suffix) else "", - "--cluster_prefix_zfill {}".format(opts.genome_cluster_prefix_zfill), - "-o {}".format(os.path.join(output_directory, "genome_clusters.tsv")), - "--identifiers {}".format(os.path.join(directories["intermediate"], organism_type, "genome_identifiers.list")), - "--export_graph {}".format(os.path.join(directories["serialization"], f"{organism_type}.networkx_graph.pkl")), - "--export_dict {}".format(os.path.join(directories["serialization"], f"{organism_type}.dict.pkl")), - - "&&", - - "rm -rf {}".format(os.path.join(directories["tmp"], "*")), - - ], - name=name, - f_cmds=f_cmds, - ) + if opts.genome_clustering_algorithm == "fastani": + name = "fastani__{}".format(organism_type) + description = "[Program = FastANI] [Organism_Type = {}]".format(organism_type) + cmd = Command([ + os.environ["fastANI"], + "-t {}".format(opts.n_jobs), + "--rl {}".format(fp), + "--ql {}".format(fp), + "-o {}".format(os.path.join(output_directory, "fastani_output.tsv")), + opts.fastani_options, + + "&&", + + "cat", + os.path.join(output_directory, "fastani_output.tsv"), + "|", + "cut -f1-3", + "|", + os.environ["edgelist_to_clusters.py"], + "--basename", + "-t {}".format(opts.ani_threshold), + "--no_singletons" if bool(opts.no_singletons) else "", + "--cluster_prefix {}{}".format(organism_type[0].upper(), opts.genome_cluster_prefix), + "--cluster_suffix {}".format(opts.genome_cluster_suffix) if bool(opts.genome_cluster_suffix) else "", + "--cluster_prefix_zfill {}".format(opts.genome_cluster_prefix_zfill), + "-o {}".format(os.path.join(output_directory, "genome_clusters.tsv")), + "--identifiers {}".format(os.path.join(directories["intermediate"], organism_type, "genome_identifiers.list")), + "--export_graph {}".format(os.path.join(directories["serialization"], f"{organism_type}.networkx_graph.pkl")), + "--export_dict {}".format(os.path.join(directories["serialization"], f"{organism_type}.dict.pkl")), + + "&&", + + "rm -rf 
{}".format(os.path.join(directories["tmp"], "*")), + + ], + name=name, + f_cmds=f_cmds, + ) # Run command cmd.run( @@ -339,11 +461,11 @@ def main(args=None): print("Check the following files:\ncat {}".format(os.path.join(directories["log"], "{}.*".format(name))), file=sys.stdout) sys.exit(cmd.returncode_) - # MMSEQS2 - print(format_header(" * ({}) Running MMSEQS2:".format(format_duration(t0))), file=sys.stdout) + # Protein Clustering + print(format_header(" * ({}) Running {}:".format(format_duration(t0), PROTEIN_CLUSTERING_ALGORITHM)), file=sys.stdout) mag_to_genomecluster = dict() protein_to_proteincluster = dict() - for fp in pv(glob.glob(os.path.join(directories["intermediate"], "*", "genome_clusters.tsv")), "Running MMSEQS2"): + for fp in pv(glob.glob(os.path.join(directories["intermediate"], "*", "genome_clusters.tsv")), "Running {}".format(PROTEIN_CLUSTERING_ALGORITHM)): fields = fp.split("/") organism_type = fields[-2] @@ -363,20 +485,21 @@ def main(args=None): print(*proteins, sep="\n", file=f) write_fasta(protein_to_sequence[proteins], os.path.join(genomecluster_directory, "proteins.faa" )) - # Run MMSEQS2 - name = "mmseqs2__{}__{}".format(organism_type, id_genomecluster) - description = "[Program = MMSEQS2] [Organism_Type = {}] [Genome_Cluster = {}]".format(organism_type, id_genomecluster) + # Run Clustering + name = "{}__{}__{}".format(PROTEIN_CLUSTERING_ALGORITHM.lower(), organism_type, id_genomecluster) + description = "[Program = {}] [Organism_Type = {}] [Genome_Cluster = {}]".format(PROTEIN_CLUSTERING_ALGORITHM, organism_type, id_genomecluster) cmd = Command([ - os.environ["mmseqs2_wrapper.py"], + os.environ["clustering_wrapper.py"], "--fasta {}".format(os.path.join(genomecluster_directory, "proteins.faa" )), "--output_directory {}".format(genomecluster_directory), "--no_singletons" if bool(opts.no_singletons) else "", - "--algorithm {}".format(opts.algorithm), + "--algorithm {}".format(opts.protein_clustering_algorithm), "--n_jobs {}".format(opts.n_jobs), "--minimum_identity_threshold {}".format(opts.minimum_identity_threshold), "--minimum_coverage_threshold {}".format(opts.minimum_coverage_threshold), "--mmseqs2_options='{}'" if bool(opts.mmseqs2_options) else "", + "--diamond_options='{}'" if bool(opts.diamond_options) else "", "--cluster_prefix {}_{}".format(id_genomecluster, opts.protein_cluster_prefix), "--cluster_suffix {}".format(opts.protein_cluster_suffix) if bool(opts.protein_cluster_suffix) else "", "--cluster_prefix_zfill {}".format(opts.protein_cluster_prefix_zfill), diff --git a/src/scripts/local_clustering.py b/src/scripts/local_clustering.py index 198f8c9..e35c27f 100755 --- a/src/scripts/local_clustering.py +++ b/src/scripts/local_clustering.py @@ -15,7 +15,7 @@ # from tqdm import tqdm __program__ = os.path.split(sys.argv[0])[-1] -__version__ = "2023.10.24" +__version__ = "2023.12.11" def get_basename(x): _, fn = os.path.split(x) @@ -50,12 +50,14 @@ def add_executables_to_environment(opts): """ accessory_scripts = { "edgelist_to_clusters.py", - "mmseqs2_wrapper.py", + "clustering_wrapper.py", } required_executables={ + "skani", "fastANI", "mmseqs", + "diamond", } required_executables |= accessory_scripts @@ -84,7 +86,6 @@ def add_executables_to_environment(opts): executables[name] = "'{}'".format(os.path.join(opts.script_directory, name)) # Can handle spaces in path - print(format_header( "Adding executables to path from the following source: {}".format(opts.path_config), "-"), file=sys.stdout) for name, executable in executables.items(): if name in 
diff --git a/src/scripts/local_clustering.py b/src/scripts/local_clustering.py
index 198f8c9..e35c27f 100755
--- a/src/scripts/local_clustering.py
+++ b/src/scripts/local_clustering.py
@@ -15,7 +15,7 @@
# from tqdm import tqdm

__program__ = os.path.split(sys.argv[0])[-1]
-__version__ = "2023.10.24"
+__version__ = "2023.12.11"

def get_basename(x):
    _, fn = os.path.split(x)
@@ -50,12 +50,14 @@
    """
    accessory_scripts = {
        "edgelist_to_clusters.py",
-        "mmseqs2_wrapper.py",
+        "clustering_wrapper.py",
    }

    required_executables={
+        "skani",
        "fastANI",
        "mmseqs",
+        "diamond",
    }

    required_executables |= accessory_scripts
@@ -84,7 +86,6 @@
        executables[name] = "'{}'".format(os.path.join(opts.script_directory, name)) # Can handle spaces in path
-    print(format_header( "Adding executables to path from the following source: {}".format(opts.path_config), "-"), file=sys.stdout)
    for name, executable in executables.items():
        if name in required_executables:
@@ -95,9 +96,20 @@
# Configure parameters
def configure_parameters(opts, directories):
-    assert_acceptable_arguments(opts.algorithm, {"easy-cluster", "easy-linclust"})
-    assert 0 < opts.minimum_core_prevalence <= 1.0, "--minimum_core_prevalence must be a float between (0.0,1.0])"
+
+    assert_acceptable_arguments(opts.protein_clustering_algorithm, {"easy-cluster", "easy-linclust", "mmseqs-cluster", "mmseqs-linclust", "diamond-cluster", "diamond-linclust"})
+    if opts.protein_clustering_algorithm in {"easy-cluster", "easy-linclust"}:
+        d = {"easy-cluster":"mmseqs-cluster", "easy-linclust":"mmseqs-linclust"}
+        warnings.warn("\n\nPlease use `{}` instead of `{}` for MMSEQS2 clustering.".format(d[opts.protein_clustering_algorithm], opts.protein_clustering_algorithm))
+        opts.protein_clustering_algorithm = d[opts.protein_clustering_algorithm]
+
+    if opts.skani_nonviral_preset.lower() == "none":
+        opts.skani_nonviral_preset = None
+
+    if opts.skani_viral_preset.lower() == "none":
+        opts.skani_viral_preset = None
+
+    assert 0 < opts.minimum_core_prevalence <= 1.0, "--minimum_core_prevalence must be a float between (0.0, 1.0]"
    # Set environment variables
    add_executables_to_environment(opts=opts)
@@ -130,23 +142,45 @@
    parser_utility.add_argument("-v", "--version", action='version', version="{} v{}".format(__program__, __version__))
    # parser_utility.add_argument("--verbose", action='store_true')

-    # FastANI
+    # ANI
+    parser_genome_clustering = parser.add_argument_group('Genome clustering arguments')
+    parser_genome_clustering.add_argument("-G", "--genome_clustering_algorithm", type=str, choices={"fastani", "skani"}, default="skani", help="Program to use for ANI calculations. `skani` is faster and more memory efficient. For v1.0.0 - v1.3.x behavior, use `fastani`. [Default: skani]")
+    parser_genome_clustering.add_argument("-A", "--ani_threshold", type=float, default=95.0, help="Species-level cluster (SLC) ANI threshold (Range (0.0, 100.0]) [Default: 95.0]")
+    parser_genome_clustering.add_argument("--genome_cluster_prefix", type=str, default="SLC-", help="Cluster prefix [Default: 'SLC-']")
+    parser_genome_clustering.add_argument("--genome_cluster_suffix", type=str, default="", help="Cluster suffix [Default: '']")
+    parser_genome_clustering.add_argument("--genome_cluster_prefix_zfill", type=int, default=0, help="Cluster prefix zfill. Use 7 to match identifiers from OrthoFinder. Use 0 to add no zfill. [Default: 0]") #7
+
+    parser_skani = parser.add_argument_group('Skani triangle arguments')
+    parser_skani.add_argument("--skani_target_ani", type=float, default=80, help="skani | If you set --skani_target_ani to --ani_threshold, you may screen out genomes ANI ≥ --ani_threshold [Default: 80]")
+    parser_skani.add_argument("--skani_minimum_af", type=float, default=15, help="skani | Minimum aligned fraction greater than this value [Default: 15]")
+    parser_skani.add_argument("--skani_no_confidence_interval", action="store_true", help="skani | Output [5,95] ANI confidence intervals using percentile bootstrap on the putative ANI distribution")
+    # parser_skani.add_argument("--skani_low_memory", action="store_true", help="Skani | More options (e.g.
--arg 1 ) https://github.com/bluenote-1577/skani [Default: '']") + + parser_skani = parser.add_argument_group('[Prokaryotic & Eukaryotic] Skani triangle arguments') + parser_skani.add_argument("--skani_nonviral_preset", type=str, default="medium", choices={"fast", "medium", "slow", "none"}, help="skani [Prokaryotic & Eukaryotic]| Use `none` if you are setting skani -c (compression factor) {fast, medium, slow, none} [Default: medium]") + parser_skani.add_argument("--skani_nonviral_compression_factor", type=int, default=125, help="skani [Prokaryotic & Eukaryotic]| Compression factor (k-mer subsampling rate). [Default: 125]") + parser_skani.add_argument("--skani_nonviral_marker_kmer_compression_factor", type=int, default=1000, help="skani [Prokaryotic & Eukaryotic] | Marker k-mer compression factor. Markers are used for filtering. [Default: 1000]") + parser_skani.add_argument("--skani_nonviral_options", type=str, default="", help="skani [Prokaryotic & Eukaryotic] | More options for `skani triangle` (e.g. --arg 1 ) [Default: '']") + + parser_skani = parser.add_argument_group('[Viral] Skani triangle arguments') + parser_skani.add_argument("--skani_viral_preset", type=str, default="slow", choices={"fast", "medium", "slow", "none"}, help="skani | Use `none` if you are setting skani -c (compression factor) {fast, medium, slow, none} [Default: slow]") + parser_skani.add_argument("--skani_viral_compression_factor", type=int, default=30, help="skani [Viral] | Compression factor (k-mer subsampling rate). [Default: 30]") + parser_skani.add_argument("--skani_viral_marker_kmer_compression_factor", type=int, default=200, help="skani [Viral] | Marker k-mer compression factor. Markers are used for filtering. Consider decreasing to ~200-300 if working with small genomes (e.g. plasmids or viruses). [Default: 200]") + parser_skani.add_argument("--skani_viral_options", type=str, default="", help="skani [Viral] | More options for `skani triangle` (e.g. --arg 1 ) [Default: '']") + parser_fastani = parser.add_argument_group('FastANI arguments') - parser_fastani.add_argument("-A", "--ani_threshold", type=float, default=95.0, help="FastANI | Species-level cluster (SLC) ANI threshold (Range (0.0, 100.0]) [Default: 95.0]") - parser_fastani.add_argument("--genome_cluster_prefix", type=str, default="SLC-", help="Cluster prefix [Default: 'SLC-") - parser_fastani.add_argument("--genome_cluster_suffix", type=str, default="", help="Cluster suffix [Default: '") - parser_fastani.add_argument("--genome_cluster_prefix_zfill", type=int, default=0, help="Cluster prefix zfill. Use 7 to match identifiers from OrthoFinder. Use 0 to add no zfill. [Default: 0]") #7 parser_fastani.add_argument("--fastani_options", type=str, default="", help="FastANI | More options (e.g. 
--arg 1 ) [Default: '']")

-    # MMSEQS2
-    parser_mmseqs2 = parser.add_argument_group('MMSEQS2 arguments')
-    parser_mmseqs2.add_argument("-a", "--algorithm", type=str, default="easy-cluster", help="MMSEQS2 | {easy-cluster, easy-linclust} [Default: easy-cluster]")
-    parser_mmseqs2.add_argument("-t", "--minimum_identity_threshold", type=float, default=50.0, help="MMSEQS2 | SLC-Specific Protein Cluster (SSPC, previously referred to as SSO) percent identity threshold (Range (0.0, 100.0]) [Default: 50.0]")
-    parser_mmseqs2.add_argument("-c", "--minimum_coverage_threshold", type=float, default=0.8, help="MMSEQS2 | SSPC coverage threshold (Range (0.0, 1.0]) [Default: 0.8]")
-    parser_mmseqs2.add_argument("--protein_cluster_prefix", type=str, default="SSPC-", help="Cluster prefix [Default: 'SSPC-")
-    parser_mmseqs2.add_argument("--protein_cluster_suffix", type=str, default="", help="Cluster suffix [Default: '")
-    parser_mmseqs2.add_argument("--protein_cluster_prefix_zfill", type=int, default=0, help="Cluster prefix zfill. Use 7 to match identifiers from OrthoFinder. Use 0 to add no zfill. [Default: 0]") #7
-    parser_mmseqs2.add_argument("--mmseqs2_options", type=str, default="", help="MMSEQS2 | More options (e.g. --arg 1 ) [Default: '']")
+    # Clustering
+    parser_protein_clustering = parser.add_argument_group('Protein clustering arguments')
+    parser_protein_clustering.add_argument("-P", "--protein_clustering_algorithm", type=str, choices={"mmseqs-cluster", "mmseqs-linclust", "diamond-cluster", "diamond-linclust"}, default="mmseqs-cluster", help="Clustering algorithm | Diamond can only be used for clustering proteins {mmseqs-cluster, mmseqs-linclust, diamond-cluster, diamond-linclust} [Default: mmseqs-cluster]")
+    parser_protein_clustering.add_argument("-t", "--minimum_identity_threshold", type=float, default=50.0, help="Clustering | Percent identity threshold (Range (0.0, 100.0]) [Default: 50.0]")
+    parser_protein_clustering.add_argument("-c", "--minimum_coverage_threshold", type=float, default=0.8, help="Clustering | Coverage threshold (Range (0.0, 1.0]) [Default: 0.8]")
+    parser_protein_clustering.add_argument("--protein_cluster_prefix", type=str, default="SSPC-", help="Cluster prefix [Default: 'SSPC-']")
+    parser_protein_clustering.add_argument("--protein_cluster_suffix", type=str, default="", help="Cluster suffix [Default: '']")
+    parser_protein_clustering.add_argument("--protein_cluster_prefix_zfill", type=int, default=0, help="Cluster prefix zfill. Use 7 to match identifiers from OrthoFinder. Use 0 to add no zfill. [Default: 0]") #7
+    parser_protein_clustering.add_argument("--mmseqs2_options", type=str, default="", help="MMSEQS2 | More options (e.g. --arg 1 ) [Default: '']")
+    parser_protein_clustering.add_argument("--diamond_options", type=str, default="", help="Diamond | More options (e.g.
--arg 1 ) [Default: '']") # Pangenome parser_pangenome = parser.add_argument_group('Pangenome arguments') @@ -191,6 +225,20 @@ def main(args=None): configure_parameters(opts, directories) sys.stdout.flush() + # Genome clustering algorithm + GENOME_CLUSTERING_ALGORITHM = opts.genome_clustering_algorithm.lower() + if GENOME_CLUSTERING_ALGORITHM == "fastani": + GENOME_CLUSTERING_ALGORITHM = "FastANI" + if GENOME_CLUSTERING_ALGORITHM == "skani": + GENOME_CLUSTERING_ALGORITHM = "skani" + + # Protein clustering algorithm + PROTEIN_CLUSTERING_ALGORITHM = opts.protein_clustering_algorithm.split("-")[0].lower() + if PROTEIN_CLUSTERING_ALGORITHM == "mmseqs": + PROTEIN_CLUSTERING_ALGORITHM = PROTEIN_CLUSTERING_ALGORITHM.upper() + if PROTEIN_CLUSTERING_ALGORITHM == "diamond": + PROTEIN_CLUSTERING_ALGORITHM = PROTEIN_CLUSTERING_ALGORITHM.capitalize() + # Make directories t0 = time.time() print(format_header(" " .join(["* ({}) Creating directories:".format(format_duration(t0)), directories["intermediate"]])), file=sys.stdout) @@ -278,50 +326,127 @@ def main(args=None): # Commands f_cmds = open(os.path.join(opts.output_directory, "commands.sh"), "w") - # FastANI - print(format_header("* ({}) Running FastANI:".format(format_duration(t0))), file=sys.stdout) - for fp in pv(glob.glob(os.path.join(directories["intermediate"], "*", "*", "genomes.list")), "Running FastANI"): + # Pairwise ANI + print(format_header("* ({}) Running {}:".format(format_duration(t0), GENOME_CLUSTERING_ALGORITHM)), file=sys.stdout) + for fp in pv(glob.glob(os.path.join(directories["intermediate"], "*", "*", "genomes.list")), "Running pairwise ANI"): fields = fp.split("/") organism_type = fields[-3] id_sample = fields[-2] + + os.makedirs(os.path.join(directories["serialization"], id_sample), exist_ok=True) + + if opts.genome_clustering_algorithm == "skani": + name = "skani__{}__{}".format(organism_type, id_sample) + description = "[Program = skani] [Organism_Type = {}] [Sample_ID = {}]".format(organism_type, id_sample) + + arguments = list() + + if organism_type.lower() in {"viral", "virus", "virion"}: + arguments += [ + os.environ["skani"], + "triangle", + "--sparse", + "-t {}".format(opts.n_jobs), + "-l {}".format(fp), + "-o {}".format(os.path.join(os.path.split(fp)[0], "skani_output.tsv")), + "--ci" if not opts.skani_no_confidence_interval else "", + "--min-af {}".format(opts.skani_minimum_af), + "-s {}".format(opts.skani_target_ani), + "-c {}".format(opts.skani_viral_compression_factor), + "-m {}".format(opts.skani_viral_marker_kmer_compression_factor), + "--{}".format(opts.skani_viral_preset) if opts.skani_viral_preset else "", + opts.skani_viral_options, + ] + + else: + arguments += [ + os.environ["skani"], + "triangle", + "--sparse", + "-t {}".format(opts.n_jobs), + "-l {}".format(fp), + "-o {}".format(os.path.join(os.path.split(fp)[0], "skani_output.tsv")), + "--ci" if not opts.skani_no_confidence_interval else "", + "--min-af {}".format(opts.skani_minimum_af), + "-s {}".format(opts.skani_target_ani), + "-c {}".format(opts.skani_nonviral_compression_factor), + "-m {}".format(opts.skani_nonviral_marker_kmer_compression_factor), + "--{}".format(opts.skani_nonviral_preset) if opts.skani_nonviral_preset else "", + opts.skani_nonviral_options, + ] + + arguments += [ + "&&", + + "cat", + os.path.join(os.path.split(fp)[0], "skani_output.tsv"), + "|", + "cut -f1-3", + "|", + "tail -n +2", + "|", + os.environ["edgelist_to_clusters.py"], + "--basename", + "-t {}".format(opts.ani_threshold), + "--no_singletons" if bool(opts.no_singletons) 
else "", + "--cluster_prefix {}__{}{}".format(id_sample, organism_type[0].upper(), opts.genome_cluster_prefix), + "--cluster_suffix {}".format(opts.genome_cluster_suffix) if bool(opts.genome_cluster_suffix) else "", + "--cluster_prefix_zfill {}".format(opts.genome_cluster_prefix_zfill), + "-o {}".format(os.path.join(os.path.split(fp)[0], "genome_clusters.tsv")), + "--identifiers {}".format(os.path.join(directories["intermediate"], organism_type, id_sample, "genome_identifiers.list")), + "--export_graph {}".format(os.path.join(directories["serialization"], id_sample, f"{organism_type}.networkx_graph.pkl")), + "--export_dict {}".format(os.path.join(directories["serialization"], id_sample, f"{organism_type}.dict.pkl")), + + "&&", + + "rm -rf {}".format(os.path.join(directories["tmp"], "*")), + + ] + + cmd = Command( + arguments, + name=name, + f_cmds=f_cmds, + ) - name = "fastani__{}__{}".format(organism_type, id_sample) - description = "[Program = FastANI] [Organism_Type = {}] [Sample_ID = {}]".format(organism_type, id_sample) - cmd = Command([ - os.environ["fastANI"], - "-t {}".format(opts.n_jobs), - "--rl {}".format(fp), - "--ql {}".format(fp), - "-o {}".format(os.path.join(os.path.split(fp)[0], "fastani_output.tsv")), - opts.fastani_options, - - "&&", - - "cat", - os.path.join(os.path.split(fp)[0], "fastani_output.tsv"), - "|", - "cut -f1-3", - "|", - os.environ["edgelist_to_clusters.py"], - "--basename", - "-t {}".format(opts.ani_threshold), - "--no_singletons" if bool(opts.no_singletons) else "", - "--cluster_prefix {}__{}{}".format(id_sample, organism_type[0].upper(), opts.genome_cluster_prefix), - "--cluster_suffix {}".format(opts.genome_cluster_suffix) if bool(opts.genome_cluster_suffix) else "", - "--cluster_prefix_zfill {}".format(opts.genome_cluster_prefix_zfill), - "-o {}".format(os.path.join(os.path.split(fp)[0], "genome_clusters.tsv")), - "--identifiers {}".format(os.path.join(directories["intermediate"], organism_type, id_sample, "genome_identifiers.list")), - "--export_graph {}".format(os.path.join(directories["serialization"], f"{organism_type}.networkx_graph.pkl")), - "--export_dict {}".format(os.path.join(directories["serialization"], f"{organism_type}.dict.pkl")), - - "&&", - - "rm -rf {}".format(os.path.join(directories["tmp"], "*")), - - ], - name=name, - f_cmds=f_cmds, - ) + if opts.genome_clustering_algorithm == "fastani": + name = "fastani__{}__{}".format(organism_type, id_sample) + description = "[Program = FastANI] [Organism_Type = {}] [Sample_ID = {}]".format(organism_type, id_sample) + cmd = Command([ + os.environ["fastANI"], + "-t {}".format(opts.n_jobs), + "--rl {}".format(fp), + "--ql {}".format(fp), + "-o {}".format(os.path.join(os.path.split(fp)[0], "fastani_output.tsv")), + opts.fastani_options, + + "&&", + + "cat", + os.path.join(os.path.split(fp)[0], "fastani_output.tsv"), + "|", + "cut -f1-3", + "|", + os.environ["edgelist_to_clusters.py"], + "--basename", + "-t {}".format(opts.ani_threshold), + "--no_singletons" if bool(opts.no_singletons) else "", + "--cluster_prefix {}__{}{}".format(id_sample, organism_type[0].upper(), opts.genome_cluster_prefix), + "--cluster_suffix {}".format(opts.genome_cluster_suffix) if bool(opts.genome_cluster_suffix) else "", + "--cluster_prefix_zfill {}".format(opts.genome_cluster_prefix_zfill), + "-o {}".format(os.path.join(os.path.split(fp)[0], "genome_clusters.tsv")), + "--identifiers {}".format(os.path.join(directories["intermediate"], organism_type, id_sample, "genome_identifiers.list")), + "--export_graph 
{}".format(os.path.join(directories["serialization"], id_sample, f"{organism_type}.networkx_graph.pkl")), + "--export_dict {}".format(os.path.join(directories["serialization"], id_sample, f"{organism_type}.dict.pkl")), + + "&&", + + "rm -rf {}".format(os.path.join(directories["tmp"], "*")), + + ], + name=name, + f_cmds=f_cmds, + ) # Run command cmd.run( @@ -338,11 +463,11 @@ def main(args=None): print("Check the following files:\ncat {}".format(os.path.join(directories["log"], "{}.*".format(name))), file=sys.stdout) sys.exit(cmd.returncode_) - # MMSEQS2 - print(format_header(" * ({}) Running MMSEQS2:".format(format_duration(t0))), file=sys.stdout) + # Clustering + print(format_header(" * ({}) Running {}:".format(format_duration(t0), PROTEIN_CLUSTERING_ALGORITHM)), file=sys.stdout) mag_to_genomecluster = dict() protein_to_proteincluster = dict() - for fp in pv(glob.glob(os.path.join(directories["intermediate"], "*", "*", "genome_clusters.tsv")), "Running MMSEQS2"): + for fp in pv(glob.glob(os.path.join(directories["intermediate"], "*", "*", "genome_clusters.tsv")), "Running {}".format(PROTEIN_CLUSTERING_ALGORITHM)): fields = fp.split("/") organism_type = fields[-3] id_sample = fields[-2] @@ -364,20 +489,21 @@ def main(args=None): print(*proteins, sep="\n", file=f) write_fasta(protein_to_sequence[proteins], os.path.join(genomecluster_directory, "proteins.faa" )) - # Run MMSEQS2 - name = "mmseqs2__{}__{}".format(organism_type, id_genomecluster) - description = "[Program = MMSEQS2] [Organism_Type = {}] [Sample_ID = {}] [Genome_Cluster = {}]".format(organism_type, id_sample, id_genomecluster) + # Run Clustering + name = "{}__{}__{}".format(PROTEIN_CLUSTERING_ALGORITHM.lower(), organism_type, id_genomecluster) + description = "[Program = {}] [Organism_Type = {}] [Sample_ID = {}] [Genome_Cluster = {}]".format(PROTEIN_CLUSTERING_ALGORITHM, organism_type, id_sample, id_genomecluster) cmd = Command([ - os.environ["mmseqs2_wrapper.py"], + os.environ["clustering_wrapper.py"], "--fasta {}".format(os.path.join(genomecluster_directory, "proteins.faa" )), "--output_directory {}".format(genomecluster_directory), "--no_singletons" if bool(opts.no_singletons) else "", - "--algorithm {}".format(opts.algorithm), + "--algorithm {}".format(opts.protein_clustering_algorithm), "--n_jobs {}".format(opts.n_jobs), "--minimum_identity_threshold {}".format(opts.minimum_identity_threshold), "--minimum_coverage_threshold {}".format(opts.minimum_coverage_threshold), "--mmseqs2_options='{}'" if bool(opts.mmseqs2_options) else "", + "--diamond_options='{}'" if bool(opts.diamond_options) else "", "--cluster_prefix {}_{}".format(id_genomecluster, opts.protein_cluster_prefix), "--cluster_suffix {}".format(opts.protein_cluster_suffix) if bool(opts.protein_cluster_suffix) else "", "--cluster_prefix_zfill {}".format(opts.protein_cluster_prefix_zfill), @@ -599,6 +725,5 @@ def main(args=None): df_proteins["id_protein_cluster"].to_frame().dropna(how="any", axis=0).to_csv(os.path.join(directories["output"], "proteins_to_orthogroups.tsv"), sep="\t", header=None) # Change labels? 
print(*map(lambda fp: " * {}".format(fp), glob.glob(os.path.join(directories["output"],"*.tsv")) + glob.glob(os.path.join(directories["output"],"*.faa"))), sep="\n", file=sys.stdout ) - if __name__ == "__main__": main(sys.argv[1:]) diff --git a/src/scripts/merge_annotations.py b/src/scripts/merge_annotations.py index 50515fd..821c534 100755 --- a/src/scripts/merge_annotations.py +++ b/src/scripts/merge_annotations.py @@ -1,12 +1,12 @@ #!/usr/bin/env python -import sys, os, argparse, re +import sys, os, argparse, re, gzip from collections import defaultdict, OrderedDict import pandas as pd import numpy as np from soothsayer_utils import read_hmmer, pv, get_file_object, assert_acceptable_arguments, format_header, flatten __program__ = os.path.split(sys.argv[0])[-1] -__version__ = "2023.10.26" +__version__ = "2023.11.24" # disclaimer = format_header("DISCLAIMER: Lineage predictions are NOT robust and DO NOT USE CORE MARKERS. Please only use for exploratory suggestions.") @@ -72,72 +72,78 @@ def compile_identifiers(df, id_protein_cluster): if len(organism_types) == 1: organism_types = list(organism_types)[0] - # Genomes - genomes = set(df["id_genome"]) + # # Genomes + # genomes = set(df["id_genome"]) - # Samples - samples = set(df["sample_of_origin"]) + # # Samples + # samples = set(df["sample_of_origin"]) # Genome clusters genome_clusters = set(df["id_genome_cluster"]) if len(genome_clusters) == 1: genome_clusters = list(genome_clusters)[0] - data = OrderedDict([ - ("id_genome_cluster", genome_clusters), - ("organism_type", organism_types), - ("genomes", genomes), - ("samples_of_origin", samples), - ], - ) - data = pd.Series(data, name=id_protein_cluster) - data.index = data.index.map(lambda x: ("Identifiers", x)) + # data = OrderedDict([ + # ("id_genome_cluster", genome_clusters), + # ("organism_type", organism_types), + # ("genomes", genomes), + # ("samples_of_origin", samples), + # ], + # ) + # data = pd.Series(data, name=id_protein_cluster) + # data.index = data.index.map(lambda x: ("Identifiers", x)) + data = [genome_clusters, organism_types]#, genomes, samples] return data def compile_uniref(df, id_protein_cluster): df = df.dropna(how="all", axis=0) - unique_identifiers = set(df["sseqid"].unique()) - data = OrderedDict([ - ("number_of_proteins", df.shape[0]), - ("number_of_unique_hits", len(unique_identifiers)), - ("ids", unique_identifiers), - ("names", set(df["product"].unique())), - ], - ) - data = pd.Series(data, name=id_protein_cluster) - data.index = data.index.map(lambda x: ("UniRef", x)) + unique_identifiers = list(df["sseqid"].unique()) + # data = OrderedDict([ + # ("number_of_proteins", df.shape[0]), + # ("number_of_unique_hits", len(unique_identifiers)), + # ("ids", unique_identifiers), + # ("names", set(df["product"].unique())), + # ], + # ) + # data = pd.Series(data, name=id_protein_cluster) + # data.index = data.index.map(lambda x: ("UniRef", x)) + + data = [df.shape[0], len(unique_identifiers), unique_identifiers, list(df["product"].unique())] return data def compile_nonuniref_diamond(df, id_protein_cluster, label): df = df.dropna(how="all", axis=0) unique_identifiers = set(df["sseqid"].unique()) - data = OrderedDict( - [ - ("number_of_proteins", df.shape[0]), - ("number_of_unique_hits", len(unique_identifiers)), - ("ids", unique_identifiers), - ("names", np.nan), - ], - ) - data = pd.Series(data, name=id_protein_cluster) - data.index = data.index.map(lambda x: (label, x)) + # data = OrderedDict( + # [ + # ("number_of_proteins", df.shape[0]), + # ("number_of_unique_hits", 
len(unique_identifiers)), + # ("ids", unique_identifiers), + # ("names", np.nan), + # ], + # ) + # data = pd.Series(data, name=id_protein_cluster) + # data.index = data.index.map(lambda x: (label, x)) + data = [df.shape[0], len(unique_identifiers), list(unique_identifiers)] + return data def compile_hmmsearch(df, id_protein_cluster, label): df = df.dropna(how="all", axis=0).query("number_of_hits > 0") - unique_identifiers = flatten(df["ids"], into=set) - unique_names = flatten(df["names"], into=set) + unique_identifiers = flatten(df["ids"], into=list, unique=True) + unique_names = flatten(df["names"], unique=True) - data = OrderedDict( - [ - ("number_of_proteins", df.shape[0]), - ("number_of_unique_hits", len(unique_identifiers)), - ("ids", unique_identifiers), - ("names", unique_names), - ], - ) - data = pd.Series(data, name=id_protein_cluster) - data.index = data.index.map(lambda x: (label, x)) + # data = OrderedDict( + # [ + # ("number_of_proteins", df.shape[0]), + # ("number_of_unique_hits", len(unique_identifiers)), + # ("ids", unique_identifiers), + # ("names", unique_names), + # ], + # ) + # data = pd.Series(data, name=id_protein_cluster) + # data.index = data.index.map(lambda x: (label, x)) + data = [df.shape[0], len(unique_identifiers), unique_identifiers, unique_names] return data @@ -487,63 +493,112 @@ def main(args=None): df_annotations.to_csv(os.path.join(opts.output_directory, "annotations.proteins.tsv.gz"), sep="\t") if opts.identifier_mapping: - # Protein clusters - protein_to_proteincluster = df_annotations[("Identifiers", "id_protein_cluster")] - protein_cluster_annotations = list() - for id_protein_cluster, df in pv(df_annotations.groupby(protein_to_proteincluster), description="Compiling consensus annotations for protein clusters", total=protein_to_proteincluster.nunique(), unit=" Protein Clusters"): - # Identifiers - data_identifiers = compile_identifiers(df["Identifiers"], id_protein_cluster) - - # UniRef - data_uniref = compile_uniref(df["UniRef"], id_protein_cluster) - - # MIBiG - data_mibig = compile_nonuniref_diamond(df["MIBiG"], id_protein_cluster, "MIBiG") - - # VFDB - data_vfdb = compile_nonuniref_diamond(df["VFDB"], id_protein_cluster, "VFDB") - - # CAZy - data_cazy = compile_nonuniref_diamond(df["CAZy"], id_protein_cluster, "CAZy") - - # Pfam - data_pfam = compile_hmmsearch(df["Pfam"], id_protein_cluster, "Pfam") - - # NCBIfam-AMR - data_amr = compile_hmmsearch(df["NCBIfam-AMR"], id_protein_cluster, "NCBIfam-AMR") - - # KOFAM - data_kofam = compile_hmmsearch(df["KOFAM"], id_protein_cluster, "KOFAM") - - # AntiFam - data_antifam = compile_hmmsearch(df["AntiFam"], id_protein_cluster, "AntiFam") - - # Composite name - composite_name = list() - composite_name += list(data_uniref[("UniRef","names")]) - composite_name += list(data_kofam[("KOFAM", "names")]) - composite_name += list(data_pfam[("Pfam","names")]) - composite_name = opts.composite_name_joiner.join(composite_name) - data_consensus = pd.Series(composite_name, index=[("Consensus", "composite_name")]) - - # Concatenate - data_concatenated = pd.concat([ - data_identifiers, - data_consensus, - data_uniref, - data_mibig, - data_vfdb, - data_cazy, - data_pfam, - data_amr, - data_kofam, - data_antifam, - ]) - data_concatenated.name = id_protein_cluster - protein_cluster_annotations.append(data_concatenated) - - df_annotations_proteinclusters = pd.DataFrame(protein_cluster_annotations) - df_annotations_proteinclusters.to_csv(os.path.join(opts.output_directory, "annotations.protein_clusters.tsv.gz"), sep="\t") + 
with gzip.open(os.path.join(opts.output_directory, "annotations.protein_clusters.tsv.gz"), "wt") as f:
+            print("",
+                *["Identifiers"]*2,
+                *["Consensus"]*1,
+
+                *["UniRef"]*4,
+                *["MIBiG"]*3,
+                *["VFDB"]*3,
+                *["CAZy"]*3,
+                *["Pfam"]*4,
+                *["NCBIfam-AMR"]*4,
+                *["KOFAM"]*4,
+                *["AntiFam"]*4,
+                sep="\t", file=f)
+
+            print(
+                "id_protein_cluster",
+                *["id_genome_cluster", "organism_type"], #, "genomes", "samples_of_origin"], # Identifiers
+                *["composite_name"], # Consensus
+                *["number_of_proteins", "number_of_unique_hits", "ids","names"], # UniRef
+                *["number_of_proteins", "number_of_unique_hits", "ids"], # MIBiG
+                *["number_of_proteins", "number_of_unique_hits", "ids"], # VFDB
+                *["number_of_proteins", "number_of_unique_hits", "ids"], # CAZy
+                *["number_of_proteins", "number_of_unique_hits", "ids","names"], # Pfam
+                *["number_of_proteins", "number_of_unique_hits", "ids","names"], # NCBIfam-AMR
+                *["number_of_proteins", "number_of_unique_hits", "ids","names"], # KOFAM
+                *["number_of_proteins", "number_of_unique_hits", "ids","names"], # AntiFam
+                sep="\t",
+                file=f,
+            )
+
+            # Protein clusters
+            protein_to_proteincluster = df_annotations[("Identifiers", "id_protein_cluster")]
+            protein_cluster_annotations = list()
+            for id_protein_cluster, df in pv(df_annotations.groupby(protein_to_proteincluster), description="Compiling consensus annotations for protein clusters", total=protein_to_proteincluster.nunique(), unit=" Protein Clusters"):
+                # Identifiers
+                data_identifiers = compile_identifiers(df["Identifiers"], id_protein_cluster)
+
+                # UniRef
+                data_uniref = compile_uniref(df["UniRef"], id_protein_cluster)
+
+                # MIBiG
+                data_mibig = compile_nonuniref_diamond(df["MIBiG"], id_protein_cluster, "MIBiG")
+
+                # VFDB
+                data_vfdb = compile_nonuniref_diamond(df["VFDB"], id_protein_cluster, "VFDB")
+
+                # CAZy
+                data_cazy = compile_nonuniref_diamond(df["CAZy"], id_protein_cluster, "CAZy")
+
+                # Pfam
+                data_pfam = compile_hmmsearch(df["Pfam"], id_protein_cluster, "Pfam")
+
+                # NCBIfam-AMR
+                data_amr = compile_hmmsearch(df["NCBIfam-AMR"], id_protein_cluster, "NCBIfam-AMR")
+
+                # KOFAM
+                data_kofam = compile_hmmsearch(df["KOFAM"], id_protein_cluster, "KOFAM")
+
+                # AntiFam
+                data_antifam = compile_hmmsearch(df["AntiFam"], id_protein_cluster, "AntiFam")
+
+                # Composite name
+                composite_name = list()
+                composite_name += list(data_uniref[-1])
+                composite_name += list(data_kofam[-1])
+                composite_name += list(data_pfam[-1])
+                composite_name = list(filter(lambda x: isinstance(x, str), composite_name))
+                if len(composite_name) > 0:
+                    composite_name = opts.composite_name_joiner.join(composite_name)
+                else:
+                    composite_name = np.nan
+
+                print(
+                    id_protein_cluster,
+                    *data_identifiers,
+                    composite_name,
+                    *data_uniref,
+                    *data_mibig,
+                    *data_vfdb,
+                    *data_cazy,
+                    *data_pfam,
+                    *data_amr,
+                    *data_kofam,
+                    *data_antifam,
+                    sep="\t",
+                    file=f,
+                )
+
+                # data_consensus = pd.Series(composite_name, index=[("Consensus", "composite_name")])
+                # # Concatenate
+                # data_concatenated = pd.concat([
+                #     data_identifiers,
+                #     data_consensus,
+                #     data_uniref,
+                #     data_mibig,
+                #     data_vfdb,
+                #     data_cazy,
+                #     data_pfam,
+                #     data_amr,
+                #     data_kofam,
+                #     data_antifam,
+                # ])
+                # data_concatenated.name = id_protein_cluster
+                # protein_cluster_annotations.append(data_concatenated)
+            # df_annotations_proteinclusters = pd.DataFrame(protein_cluster_annotations)
+            # df_annotations_proteinclusters.to_csv(os.path.join(opts.output_directory, "annotations.protein_clusters.tsv.gz"), sep="\t")
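For context, the replacement block above streams one row per protein cluster directly into a gzipped TSV instead of accumulating per-cluster `pd.Series` objects and concatenating them into a DataFrame at the end (the commented-out code), which keeps memory flat regardless of cluster count. A minimal sketch of the pattern, with illustrative column names and rows:

```python
import gzip

# Stream rows as they are computed; nothing accumulates in memory.
with gzip.open("annotations.protein_clusters.tsv.gz", "wt") as f:
    print("id_protein_cluster", "number_of_proteins", sep="\t", file=f)
    for id_cluster, n_proteins in [("PSLC-0_SSPC-1", 12), ("PSLC-0_SSPC-2", 3)]:
        print(id_cluster, n_proteins, sep="\t", file=f)
```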
a/src/scripts/merge_genome_quality_assessments.py b/src/scripts/merge_genome_quality_assessments.py index a9a8be7..e20a4dd 100755 --- a/src/scripts/merge_genome_quality_assessments.py +++ b/src/scripts/merge_genome_quality_assessments.py @@ -4,7 +4,7 @@ import pandas as pd __program__ = os.path.split(sys.argv[0])[-1] -__version__ = "2021.10.16" +__version__ = "2021.11.9" def get_prokaryotic_description(x, fields=["Completeness_Model_Used", "Additional_Notes"]): output = list() @@ -93,7 +93,7 @@ def main(args=None): print("Could not find any prokaryotic genome assessment tables from CheckM2 in the following directory: {}".format(opts.binning_directory), file=sys.stdout) # Viral - viral_genome_quality_files = glob.glob(os.path.join(opts.binning_directory, opts.viral_subdirectory_name, "*", "output", "checkmv_results.filtered.tsv")) + viral_genome_quality_files = glob.glob(os.path.join(opts.binning_directory, opts.viral_subdirectory_name, "*", "output", "checkv_results.filtered.tsv")) if viral_genome_quality_files: print("* Compiling viral genome quality from following files:", *viral_genome_quality_files, sep="\n ", file=sys.stdout) diff --git a/src/scripts/merge_taxonomy_classifications.py b/src/scripts/merge_taxonomy_classifications.py index 46a2c9d..cb0f784 100755 --- a/src/scripts/merge_taxonomy_classifications.py +++ b/src/scripts/merge_taxonomy_classifications.py @@ -5,7 +5,7 @@ from tqdm import tqdm __program__ = os.path.split(sys.argv[0])[-1] -__version__ = "2021.10.11" +__version__ = "2021.12.11" def main(args=None): # Path info @@ -62,7 +62,7 @@ def main(args=None): for fp in genome_taxonomy: id_domain = fp.split("/")[-3] df = pd.read_csv(fp, sep="\t", index_col=0) - if id_domain.lower() in {"viral", "virus"}: + if id_domain.lower() in {"viral", "virus", "virion"}: for id_genome, taxonomy in df["lineage"].items(): genome_to_data[id_genome] = {"domain":id_domain, "taxonomy_classification":taxonomy} if id_domain.lower() in {"prokaryotic", "prokaryotes", "prokarya", "bacteria", "archaea", "bacterial","archael", "prok", "proks"}: diff --git a/src/scripts/module_completion_ratios.py b/src/scripts/module_completion_ratios.py index a209a57..7ba04d3 100755 --- a/src/scripts/module_completion_ratios.py +++ b/src/scripts/module_completion_ratios.py @@ -29,7 +29,7 @@ from collections import OrderedDict, defaultdict import pandas as pd -__version__ = "2023.10.23" +__version__ = "2023.12.1" __program__ = os.path.split(sys.argv[0])[-1] ################################################################################ @@ -469,7 +469,7 @@ def main(): parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter) - parser.add_argument('-i', '--ko_table', help='path/to/ko_table.tsv following [id_genome][id_ko], No header. Cannot be used with --ko_lists') + parser.add_argument('-i', '--ko_table', default="stdin", help='path/to/ko_table.tsv following [id_genome][id_ko], No header. Cannot be used with --ko_lists [Default: stdin]') parser.add_argument('-k', '--ko_lists', nargs='+', help='Space-delimited list of filepaths where each file represents a genome and each line in the file is a KO id. 
Cannot be used with --ko_table') parser.add_argument('-o', '--output', default="stdout", help='Output file for module completion ratios [Default: stdout]') parser.add_argument("-d", '--database_directory', required=True, help='path/to/database_directory with pickle files') @@ -482,6 +482,9 @@ def main(): opts = parser.parse_args() + if opts.ko_lists is not None: + if opts.ko_table == "stdin": + opts.ko_table = None assert bool(opts.ko_table) != bool(opts.ko_lists), "Must provide KOs as either a tsv table (--ko_table) or a list of KO ids in different files (--ko_lists)" if opts.ko_table == "stdin": opts.ko_table = sys.stdin diff --git a/src/scripts/partition_unbinned.py b/src/scripts/partition_unbinned.py index d938507..bf4a2eb 100755 --- a/src/scripts/partition_unbinned.py +++ b/src/scripts/partition_unbinned.py @@ -4,7 +4,7 @@ from Bio.SeqIO.FastaIO import SimpleFastaParser __program__ = os.path.split(sys.argv[0])[-1] -__version__ = "2021.08.05" +__version__ = "2023.12.18" def main(args=None): # Path info @@ -24,7 +24,7 @@ def main(args=None): parser.add_argument("-b","--bins", type=str, required=True, help = "path/to/bins.list, No header") parser.add_argument("-f","--fasta", type=str, required=True, help = "path/to/fasta") parser.add_argument("-o","--output", type=str, default="stdout", help = "Output fasta file [Default: stdout]") - parser.add_argument("-m", "--minimum_contig_length", type=int, default=1000, help="Minimum contig length. [Default: 1000] ") + parser.add_argument("-m", "--minimum_contig_length", type=int, default=1, help="Minimum contig length. [Default: 1] ") parser.add_argument("--mode", type=str, default="unbinned", help="Get 'unbinned' or 'binned' contigs [Default: 'unbinned'] ") diff --git a/src/scripts/reformat_sylph_profile_single_sample_output.py b/src/scripts/reformat_sylph_profile_single_sample_output.py new file mode 100755 index 0000000..be7a531 --- /dev/null +++ b/src/scripts/reformat_sylph_profile_single_sample_output.py @@ -0,0 +1,68 @@ +#!/usr/bin/env python +import sys, os, argparse, gzip +import pandas as pd +from tqdm import tqdm + +__program__ = os.path.split(sys.argv[0])[-1] +__version__ = "2023.11.10" +def filepath_to_genome(fp, extension): + assert fp.endswith(extension) + fn = os.path.split(fp)[1] + return fn[:-(len(extension) + 1)] + +def main(args=None): + # Path info + script_directory = os.path.dirname(os.path.abspath( __file__ )) + script_filename = __program__ + + # Path info + description = """ + Running: {} v{} via Python v{} | {}""".format(__program__, __version__, sys.version.split(" ")[0], sys.executable) + usage = "{} -i <input.tsv> -o <output_directory>".format(__program__) + epilog = "Copyright 2021 Josh L. Espinoza (jespinoz@jcvi.org)" + + # Parser + parser = argparse.ArgumentParser(description=description, usage=usage, epilog=epilog, formatter_class=argparse.RawTextHelpFormatter) + + # Pipeline + parser.add_argument("-i","--input", default="stdin", type=str, help = "Input sylph profile table [Default: stdin]") + parser.add_argument("-o","--output_directory", required=True, type=str, help = "Output directory to write output files") + # parser.add_argument("-n", "--name", type=str, required=False, help="Name of sample") + parser.add_argument("-c","--genome_clusters", type=str, help = "path/to/mags_to_slcs.tsv. 
[id_genome][id_genome-cluster], No header.") + parser.add_argument("-f","--field", type=str, default="Taxonomic_abundance", help = "Field to use for reformatting [Default: Taxonomic_abundance]") + parser.add_argument("-x","--extension", type=str, default="fa", help = "Fasta file extension for bins [Default: fa]") + parser.add_argument("--header", action="store_true", help = "Include header in output. Doesn't apply to unstacked dataframe.") + + # Options + opts = parser.parse_args() + opts.script_directory = script_directory + opts.script_filename = script_filename + + # Input + if opts.input == "stdin": + opts.input = sys.stdin + + # Output + os.makedirs(opts.output_directory, exist_ok=True) + + # Process + df_sylph = pd.read_csv(opts.input, sep="\t") + assert opts.field in df_sylph.columns, "--field {} not in --input columns: {}".format(opts.field, ", ".join(df_sylph.columns)) + + genome_to_value = df_sylph.set_index("Genome_file")[opts.field] + genome_to_value.index = genome_to_value.index.map(lambda fp: filepath_to_genome(fp, opts.extension)) + + # Output genome values + genome_to_value.to_frame(opts.field.lower()).to_csv(os.path.join(opts.output_directory, "{}.tsv.gz".format(opts.field.lower())), sep="\t", header=bool(opts.header)) + + if opts.genome_clusters: + genome_to_slc = pd.read_csv(opts.genome_clusters, sep="\t", index_col=0).iloc[:,0] + slc_to_value = genome_to_value.groupby(genome_to_slc).sum() + slc_to_value.to_frame(opts.field.lower()).to_csv(os.path.join(opts.output_directory, "{}.clusters.tsv.gz".format(opts.field.lower())), sep="\t", header=bool(opts.header)) + +if __name__ == "__main__": + main() + + + diff --git a/src/veba b/src/veba new file mode 100755 index 0000000..132127a --- /dev/null +++ b/src/veba @@ -0,0 +1,194 @@ +#!/bin/bash +# v2023.12.18 + +# Define available modules +AVAILABLE_MODULES=( +"annotate" +"assembly-long" +"assembly" +"binning-eukaryotic" +"binning-prokaryotic" +"binning-viral" +"biosynthetic" +"classify-eukaryotic" +"classify-prokaryotic" +"classify-viral" +"cluster" +"coverage-long" +"coverage" +"index" +"mapping" +"phylogeny" +"preprocess-long" +"preprocess" +"profile-pathway" +"profile-taxonomy" +) + +# Conda base +CONDA_BASE=$(conda info --base) + +# Script directory +SCRIPT_DIRECTORY=$(dirname $0) + +# Function to display script usage +show_help() { + echo -e "-------------------------------" + echo " " + echo -e " _ _ _______ ______ _______\n \ / |______ |_____] |_____|\n \/ |______ |_____] | |" + echo " " + echo -e "-------------------------------" + echo "Usage: $0 [-m <module>] [-p <params>] [-v|--version] [-h|--help]" + echo -e "Example: veba --module preprocess --params \"-1 S1_1.fq.gz -2 S1_2.fq.gz -n S1 -o veba_output/preprocess\"" + echo -e "GitHub: https://github.com/jolespin" + echo -e "Developer: Josh L. Espinoza, PhD (ORCiD: 0000-0003-3447-3845)" + echo " " + echo "Options:" + echo " -m, --module Specify the module. Available modules: ${AVAILABLE_MODULES[*]}" + echo " -p, --params Specify parameters to give to each module" + echo " -v, --version Display the version information" + echo " -h, --help Display this help message" + exit 0 +} + +# Parse command-line arguments +ARGS=$(getopt -o m:p:vh --long module:,params:,version,help -n "$0" -- "$@") + +# Exit if getopt encounters an error +if [ $?
-ne 0 ]; then + exit 1 +fi + +eval set -- "$ARGS" + +# Default values +MODULE="" +PARAMS="-h" + +# Process command-line options +while true; do + case "$1" in + -m|--module) + MODULE="$2" + shift 2 + ;; + -p|--params) + PARAMS="$2" + shift 2 + ;; + -v|--version) + echo "VEBA Version:" + cat "${SCRIPT_DIRECTORY}/VEBA_VERSION" + exit 0 + ;; + -h|--help) + show_help + ;; + --) + shift + break + ;; + *) + echo "Unknown option: $1" + exit 1 + ;; + esac +done + +# Validate required arguments +if [ -z "$MODULE" ]; then + echo "Module is required. Use --module." + exit 1 +fi + + +# Check if the specified module is valid +if [[ ! " ${AVAILABLE_MODULES[@]} " =~ " $MODULE " ]]; then + echo "Invalid module. Must be one of: ${AVAILABLE_MODULES[*]}" + exit 1 +fi + +# Perform tasks based on the specified module +case $MODULE in + "annotate") + source "${CONDA_BASE}/bin/activate" VEBA-annotate_env + annotate.py $PARAMS + ;; + "assembly-long") + source "${CONDA_BASE}/bin/activate" VEBA-assembly_env + assembly-long.py $PARAMS + ;; + "assembly") + source "${CONDA_BASE}/bin/activate" VEBA-assembly_env + assembly.py $PARAMS + ;; + "binning-eukaryotic") + source "${CONDA_BASE}/bin/activate" VEBA-binning-eukaryotic_env + binning-eukaryotic.py $PARAMS + ;; + "binning-prokaryotic") + source "${CONDA_BASE}/bin/activate" VEBA-binning-prokaryotic_env + binning-prokaryotic.py $PARAMS + ;; + "binning-viral") + source "${CONDA_BASE}/bin/activate" VEBA-binning-viral_env + binning-viral.py $PARAMS + ;; + "biosynthetic") + source "${CONDA_BASE}/bin/activate" VEBA-biosynthetic_env + biosynthetic.py $PARAMS + ;; + "classify-eukaryotic") + source "${CONDA_BASE}/bin/activate" VEBA-classify_env + classify-eukaryotic.py $PARAMS + ;; + "classify-prokaryotic") + source "${CONDA_BASE}/bin/activate" VEBA-classify_env + classify-prokaryotic.py $PARAMS + ;; + "classify-viral") + source "${CONDA_BASE}/bin/activate" VEBA-classify_env + classify-viral.py $PARAMS + ;; + "cluster") + source "${CONDA_BASE}/bin/activate" VEBA-cluster_env + cluster.py $PARAMS + ;; + "coverage-long") + source "${CONDA_BASE}/bin/activate" VEBA-assembly_env + coverage-long.py $PARAMS + ;; + "coverage") + source "${CONDA_BASE}/bin/activate" VEBA-assembly_env + coverage.py $PARAMS + ;; + "index") + source "${CONDA_BASE}/bin/activate" VEBA-mapping_env + index.py $PARAMS + ;; + "mapping") + source "${CONDA_BASE}/bin/activate" VEBA-mapping_env + mapping.py $PARAMS + ;; + "phylogeny") + source "${CONDA_BASE}/bin/activate" VEBA-phylogeny_env + phylogeny.py $PARAMS + ;; + "preprocess-long") + source "${CONDA_BASE}/bin/activate" VEBA-preprocess_env + preprocess-long.py $PARAMS + ;; + "preprocess") + source "${CONDA_BASE}/bin/activate" VEBA-preprocess_env + preprocess.py $PARAMS + ;; + "profile-pathway") + source "${CONDA_BASE}/bin/activate" VEBA-profile_env + profile-pathway.py $PARAMS + ;; + "profile-taxonomy") + source "${CONDA_BASE}/bin/activate" VEBA-profile_env + profile-taxonomy.py $PARAMS + ;; +esac \ No newline at end of file diff --git a/src/get_script_versions.sh b/src/veba_versions.sh similarity index 100% rename from src/get_script_versions.sh rename to src/veba_versions.sh diff --git a/walkthroughs/README.md b/walkthroughs/README.md index 3586d83..2aecec6 100644 --- a/walkthroughs/README.md +++ b/walkthroughs/README.md @@ -31,29 +31,43 @@ sbatch -J ${N} -N 1 -c ${N_JOBS} --ntasks-per-node=1 -o logs/${N}.o -e logs/${N} #### Available walkthroughs: +##### Accessing SRA: + * **[Downloading and preprocessing fastq files](download_and_preprocess_reads.md)** - 
Explains how to download reads from NCBI and run *VEBA's* `preprocess.py` module to decontaminate metagenomic and/or metatranscriptomic reads. + +##### End-to-end workflows: + * **[Complete end-to-end metagenomics analysis](end-to-end_metagenomics.md)** - Goes through assembling metagenomic reads, binning, clustering, classification, and annotation. We also show how to use the unbinned contigs in a pseudo-coassembly with guidelines on when it's a good idea to go this route. * **[Recovering viruses from metatranscriptomics](recovering_viruses_from_metatranscriptomics.md)** - Goes through assembling metatranscriptomic reads, viral binning, clustering, and classification. -* **[Read mapping and counts tables](read_mapping_and_counts_tables.md)** - Read mapping and generating counts tables at the contig, MAG, SLC, ORF, and SSO levels. -* **[Phylogenetic inference](phylogenetic_inference.md)** - Phylogenetic inference of eukaryotic diatoms. * **[Setting up *bona fide* coassemblies for metagenomics or metatranscriptomics](setting_up_coassemblies.md)** - In the case where all samples are of low depth, it may be useful to use coassembly instead of sample-specific approaches. This walkthrough goes through concatenating reads, creating a reads table, coassembly of concatenated reads, aligning sample-specific reads to the coassembly for multiple sorted BAM files, and mapping reads for scaffold/transcript-level counts. Please note that a coassembly differs from the pseudo-coassembly concept introduced in the VEBA publication. For more information regarding the differences between *bona fide* coassembly and pseudo-coassembly, please refer to [*23. What's the difference between a coassembly and a pseudo-coassembly?*](https://github.com/jolespin/veba/blob/main/FAQ.md#23-whats-the-difference-between-a-coassembly-and-a-pseudo-coassembly). + +##### Phylogenetics: + +* **[Phylogenetic inference](phylogenetic_inference.md)** - Phylogenetic inference of eukaryotic diatoms. + +##### Bioprospecting: + +* **[Bioprospecting for biosynthetic gene clusters](bioprospecting_for_biosynthetic_gene_clusters.md)** - Detecting biosynthetic gene clusters (BGCs) with `antiSMASH` and scoring the novelty of BGCs. + +##### Mapping reads and rapid profiling: + +* **[Read mapping and counts tables](read_mapping_and_counts_tables.md)** - Read mapping and generating counts tables at the contig, MAG, SLC, ORF, and SSO levels. +* **[Taxonomic profiling of *de novo* genomes](taxonomic_profiling_de-novo_genomes.md)** - Explains how to build custom `Sylph` databases from *de novo* genomes and profile reads against them. +* **[Pathway profiling of *de novo* genomes](pathway_profiling_de-novo_genomes.md)** - Explains how to build and align reads to custom `HUMAnN` databases from *de novo* genomes and annotations. * **[Converting counts tables](converting_counts_tables.md)** - Convert your counts table (with or without metadata) to [anndata](https://anndata.readthedocs.io/en/latest/index.html) or [biom](https://biom-format.org/) format. Also supports [Pandas pickle](https://pandas.pydata.org/docs/reference/api/pandas.read_pickle.html) format. + +##### Containerization and AWS: + +* **[Adapting commands for Docker](adapting_commands_for_docker.md)** - Explains how to download and use Docker for running VEBA. * **[Adapting commands for AWS](adapting_commands_for_aws.md)** - Explains how to download and use Docker for running VEBA specifically on AWS. 
-* **[Metabolic Profiling *de novo* genomes](metabolic_profiling_de-novo_genomes.md)** - Explains how to build and align reads to custom `HUMAnN` databases from *de novo* genomes and annotations. - ___________________________________________ **Coming Soon:** * Workflow for low-depth samples with no bins -* Workflow for ASV detection from short-read amplicons -* Workflows for integrating 3rd party software with *VEBA*: - * Using [EukHeist](https://github.com/AlexanderLabWHOI/EukHeist) for eukaryotic binning followed by *VEBA* for mapping and annotation. - * Using [EukMetaSanity](https://github.com/cjneely10/EukMetaSanity) for modeling genes for eukaryotic genomes recovered with *VEBA*. - +* Assigning eukaryotic taxonomy to unbinned contigs +* Bioprospecting using [`PlasticDB` database](https://plasticdb.org/) ___________________________________________ ##### Notes: diff --git a/walkthroughs/adapting_commands_for_aws.md b/walkthroughs/adapting_commands_for_aws.md index bc6acb8..091fe36 100644 --- a/walkthroughs/adapting_commands_for_aws.md +++ b/walkthroughs/adapting_commands_for_aws.md @@ -38,7 +38,7 @@ This job definition pulls the [jolespin/veba_preprocess](https://hub.docker.com/ "jobDefinitionName": "preprocess__S1", "type": "container", "containerProperties": { - "image": "jolespin/veba_preprocess:1.3.0", + "image": "jolespin/veba_preprocess:1.4.0", "command": [ "preprocess.py", "-1", diff --git a/walkthroughs/adapting_commands_for_docker.md b/walkthroughs/adapting_commands_for_docker.md index 939780a..b6d4899 100644 --- a/walkthroughs/adapting_commands_for_docker.md +++ b/walkthroughs/adapting_commands_for_docker.md @@ -24,7 +24,7 @@ Refer to the [Docker documentation](https://docs.docker.com/engine/install/). Let's say you wanted to use the `preprocess` module. Download the Docker image as so: ``` -VERSION=1.3.0 +VERSION=1.4.0 docker image pull jolespin/veba_preprocess:${VERSION} ``` @@ -36,7 +36,7 @@ For example, here's how we would run the `preprocess.py` module. First let's ju ```bash # Version -VERSION=1.2.0 +VERSION=1.4.0 # Image DOCKER_IMAGE="jolespin/veba_preprocess:${VERSION}" @@ -90,7 +90,7 @@ CMD="preprocess.py -1 ${CONTAINER_INPUT_DIRECTORY}/${R1} -2 ${CONTAINER_INPUT_DI # Docker # Version -VERSION=1.2.0 +VERSION=1.4.0 # Image DOCKER_IMAGE="jolespin/veba_preprocess:${VERSION}" diff --git a/walkthroughs/bioprospecting_for_biosynthetic_gene_clusters.md b/walkthroughs/bioprospecting_for_biosynthetic_gene_clusters.md index 23be35e..54f6187 100644 --- a/walkthroughs/bioprospecting_for_biosynthetic_gene_clusters.md +++ b/walkthroughs/bioprospecting_for_biosynthetic_gene_clusters.md @@ -12,6 +12,8 @@ _____________________________________________________ 1. Compile table of genomes and gene models 2. Identify biosynthetic gene clusters and score novelty +**Conda Environment:** `conda activate VEBA`. Use this for intermediate scripts. + _____________________________________________________ @@ -37,8 +39,6 @@ We only need the `[id_genome] [path/to/genome.fasta] [path/to/gene_models.gff]` Now that we have our genome table formatted so it is `[id_genome] [path/to/genome.fasta] [path/to/gene_models.gff]` without headers, we can run the `biosynthetic.py` module to identify biosynthetic gene clusters via `antiSMASH` and detect homology of components to the `MIBiG` database. 
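If you need to assemble that three-column table manually instead, a minimal sketch follows (the `veba_output/binning/prokaryotic/*/output/genomes/` layout and the `.fa`/`.gff` extensions are assumptions based on VEBA's default output structure; adjust the glob to your own paths):

```
# Pair each genome fasta with the GFF of the same basename to build the
# headerless [id_genome][genome.fasta][gene_models.gff] table (paths assumed)
mkdir -p veba_output/misc
rm -f veba_output/misc/genomes_gene-models.tsv
for FASTA in veba_output/binning/prokaryotic/*/output/genomes/*.fa; do
    ID=$(basename ${FASTA} .fa)
    GFF=$(dirname ${FASTA})/${ID}.gff
    echo -e "${ID}\t${FASTA}\t${GFF}" >> veba_output/misc/genomes_gene-models.tsv
done
```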
-**Conda Environment:** `conda activate VEBA-biosynthetic_env` - ``` # Set the number of threads @@ -55,7 +55,7 @@ GENOMES=veba_output/misc/genomes_gene-models.tsv OUT_DIR=veba_output/biosynthetic/prokaryotic # Directory -CMD="source activate VEBA-biosynthetic_env && biosynthetic.py -i ${GENOMES} -o ${OUT_DIR} -p ${N_JOBS} -t bacteria" +CMD="source activate VEBA && veba --module biosynthetic --params \"-i ${GENOMES} -o ${OUT_DIR} -p ${N_JOBS} -t bacteria\"" # Either run this command or use SunGridEnginge/SLURM ``` diff --git a/walkthroughs/converting_counts_tables.md b/walkthroughs/converting_counts_tables.md index 13c2721..1e8a7b6 100644 --- a/walkthroughs/converting_counts_tables.md +++ b/walkthroughs/converting_counts_tables.md @@ -15,9 +15,9 @@ _____________________________________________________ 2. Provide a counts table and sample metadata 3. Provide a counts table, sample metadata, and -_____________________________________________________ +**Conda Environment:** `conda activate VEBA`. Use this for intermediate scripts. -**Conda Environment:** `conda activate VEBA-mapping_env` +_____________________________________________________ #### 1. Let's convert to a Python pickle object without any metadata diff --git a/walkthroughs/download_and_preprocess_reads.md b/walkthroughs/download_and_preprocess_reads.md index 2da95a5..6219160 100644 --- a/walkthroughs/download_and_preprocess_reads.md +++ b/walkthroughs/download_and_preprocess_reads.md @@ -11,7 +11,7 @@ If you want to either remove human contamination or count ribosomal reads then m ``` echo $VEBA_DATABASE -/expanse/projects/jcl110/db/veba/VDB_v4 +/expanse/projects/jcl110/db/veba/VDB_v6 # ^_^ Yours will be different obviously # ``` @@ -111,28 +111,12 @@ Here we are going to count the reads for the human contamination and ribosomal r * ⚠️ If your host is not human then you will need to use a different contamination reference. See item #22 in the [FAQ](https://github.com/jolespin/veba/blob/main/FAQ.md). -* ⚠️ As of 2022.10.18 *VEBA* has switched from using the "GRCh38 no alt analysis set" to the "CHM13v2.0 telomore-to-telomere" build for human. If you've installed *VEBA* before this date or are using `v1.0.0` release from [Espinoza et al. 
2022](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04973-8) then you can update with the following code: - -``` -conda activate VEBA-database_env -wget -v -P ${VEBA_DATABASE} https://genome-idx.s3.amazonaws.com/bt/chm13v2.0.zip -unzip -d ${VEBA_DATABASE}/Contamination/ ${VEBA_DATABASE}/chm13v2.0.zip -rm -rf ${VEBA_DATABASE}/chm13v2.0.zip - -# Use this if you want to remove the previous GRCh38 index -rm -rf ${VEBA_DATABASE}/Contamination/grch38/ -``` - -Continuing with the tutorial...just make note of the human index here and swap out GRCh38 for CHM13v2.0 if you decided to update: ``` N_JOBS=4 -# Human Bowtie2 index -HUMAN_INDEX=${VEBA_DATABASE}/Contamination/grch38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bowtie_index - -# or use this if you have updated from GRCh38 to CHM13v2.0 -# HUMAN_INDEX=${VEBA_DATABASE}/Contamination/chm13v2.0/chm13v2.0 +# CHM13v2.0 +HUMAN_INDEX=${VEBA_DATABASE}/Contamination/chm13v2.0/chm13v2.0 # Ribosomal k-mer fasta RIBOSOMAL_KMERS=${VEBA_DATABASE}/Contamination/kmers/ribokmers.fa.gz @@ -151,7 +135,7 @@ for ID in $(cat identifiers.list); do rm -f logs/${N}.* # Set up the command (use source from base environment instead of conda because of the `init` issues) - CMD="source activate VEBA-preprocess_env && preprocess.py -n ${ID} -1 ${R1} -2 ${R2} -p ${N_JOBS} -x ${HUMAN_INDEX} -k ${RIBOSOMAL_KMERS} --retain_contaminated_reads 0 --retain_kmer_hits 0 --retain_non_kmer_hits 0 -o veba_output/preprocess" + CMD="source activate VEBA && veba --module preprocess --params \"-n ${ID} -1 ${R1} -2 ${R2} -p ${N_JOBS} -x ${HUMAN_INDEX} -k ${RIBOSOMAL_KMERS} --retain_contaminated_reads 0 --retain_kmer_hits 0 --retain_non_kmer_hits 0 -o veba_output/preprocess\"" # If you have SunGrid engine, do something like this: # qsub -o logs/${N}.o -e logs/${N}.e -cwd -N ${N} -j y -pe threaded ${N_JOBS} "${CMD}" @@ -161,7 +145,7 @@ for ID in $(cat identifiers.list); do done ``` -Note: `preprocess.py` is a wrapper around `fastq_preprocessor` which takes in 0 and 1 as False and True, respectively. The reasoning for this is that I was able to keep the prefix `retain` while setting defaults easier. +Note: `preprocess` is a wrapper around `fastq_preprocessor`. It creates the following directory structure where each sample is its own subdirectory, which makes globbing much easier: diff --git a/walkthroughs/end-to-end_metagenomics.md b/walkthroughs/end-to-end_metagenomics.md index baa22b8..c2592c8 100644 --- a/walkthroughs/end-to-end_metagenomics.md +++ b/walkthroughs/end-to-end_metagenomics.md @@ -23,12 +23,12 @@ _____________________________________________________ 12. Classify eukaryotic genomes 13. Annotate proteins +**Conda Environment:** `conda activate VEBA`. Use this for intermediate scripts. + _____________________________________________________ #### 1. Preprocess reads and get directory set up -**Conda Environment:** `conda activate VEBA-preprocess_env` - Refer to the [downloading and preprocessing reads walkthrough](download_and_preprocess_reads.md). At this point, it's assumed you have the following: * A file with each of your identifiers on a separate line (e.g., `identifiers.list` but you can call it whatever you want) @@ -41,8 +41,6 @@ Here we are going to assemble all of the reads using `metaSPAdes`. If you have **Recommended memory request:** For this *Plastisphere* dataset, I requested `64GB` of memory from my HPC, though this will change depending on how deeply your samples are sequenced. 
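Before launching the assembly loop below, it can be worth confirming that every identifier has its pair of cleaned read files; a minimal sketch, assuming the default `veba_output/preprocess/${ID}/output/` layout described above:

```
# Report any missing or empty cleaned read files before submitting jobs
for ID in $(cat identifiers.list); do
    for FILE in veba_output/preprocess/${ID}/output/cleaned_1.fastq.gz veba_output/preprocess/${ID}/output/cleaned_2.fastq.gz; do
        if [ ! -s ${FILE} ]; then
            echo "Missing or empty: ${FILE}"
        fi
    done
done
```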
-**Conda Environment:** `conda activate VEBA-assembly_env` - ``` # Set the number of threads to use for each sample. Let's use 4 N_JOBS=4 @@ -65,7 +63,7 @@ for ID in $(cat identifiers.list); do R2=veba_output/preprocess/${ID}/output/cleaned_2.fastq.gz # Set up command - CMD="source activate VEBA-assembly_env && assembly.py -1 ${R1} -2 ${R2} -n ${ID} -o ${OUT_DIR} -p ${N_JOBS} -P metaspades.py" + CMD="source activate VEBA && veba --module assembly --params \"-1 ${R1} -2 ${R2} -n ${ID} -o ${OUT_DIR} -p ${N_JOBS} -P metaspades.py\"" # Either run this command or use SunGridEngine/SLURM @@ -98,7 +96,6 @@ Let's start the binning with viruses since this is performed on a per-contig bas **Recommended memory request:** `16 GB` -**Conda Environment:** `conda activate VEBA-binning-viral_env` ``` N_JOBS=4 @@ -108,7 +105,7 @@ for ID in $(cat identifiers.list); rm -f logs/${N}.* FASTA=veba_output/assembly/${ID}/output/scaffolds.fasta BAM=veba_output/assembly/${ID}/output/mapped.sorted.bam - CMD="source activate VEBA-binning-viral_env && binning-viral.py -f ${FASTA} -b ${BAM} -n ${ID} -p ${N_JOBS} -m 1500 -o veba_output/binning/viral" + CMD="source activate VEBA && veba --module binning-viral --params \"-f ${FASTA} -b ${BAM} -n ${ID} -p ${N_JOBS} -m 1500 -o veba_output/binning/viral\"" # Either run this command or use SunGridEngine/SLURM done @@ -140,11 +137,8 @@ Here we are going to perform iterative prokaryotic binning. It's difficult to s If you have a lot of samples and a lot of contigs then use the `--skip_maxbin2` flag because it takes MUCH longer to run. For the *Plastisphere* it was going to take 40 hours per `MaxBin2` run (there are 2 `MaxBin2` runs) per iteration. `Metabat2` and `CONCOCT` can do the heavy lifting much faster and often with better results so it's recommended to skip `MaxBin2` for larger datasets. -**Recommended memory request:** `10GB` - -*Versions prior to `v1.1.0` were reliant on `GTDB-Tk` which needed at least `60GB`. `GTDB-Tk` is no longer required with the update of `CheckM` to `CheckM2`.* +**Recommended memory request:** `16GB` -**Conda Environment:** `conda activate VEBA-binning-prokaryotic_env` ``` N_JOBS=4 @@ -161,7 +155,7 @@ for ID in $(cat identifiers.list); do FASTA=veba_output/binning/viral/${ID}/output/unbinned.fasta BAM=veba_output/assembly/${ID}/output/mapped.sorted.bam - CMD="source activate VEBA-binning-prokaryotic_env && binning-prokaryotic.py -f ${FASTA} -b ${BAM} -n ${ID} -p ${N_JOBS} -o ${OUT_DIR} -m 1500 -I ${N_ITER}" + CMD="source activate VEBA && veba --module binning-prokaryotic --params \"-f ${FASTA} -b ${BAM} -n ${ID} -p ${N_JOBS} -o ${OUT_DIR} -m 1500 -I ${N_ITER}\"" # Either run this command or use SunGridEngine/SLURM @@ -194,9 +188,7 @@ for ID in $(cat identifiers.list); do #### 5. Recover eukaryotes from metagenomic assemblies Let's take the unbinned contigs from the prokaryotic binning and recover eukaryotic genomes. Unfortunately, we aren't going to do iterative binning here because there aren't any tools that can handle consensus genome binning as there are with prokaryotes (e.g., *DAS Tool*). We have the option to use either *Metabat2* or *CONCOCT*. In our experience, *Metabat2* works better for recovering eukaryotic genomes from metagenomes and it's also faster. 
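As a quick sanity check before eukaryotic binning, you can count how many contigs are left unbinned per sample (a sketch; assumes the default `unbinned.fasta` paths used below and standard `>` fasta headers):

```
# Count unbinned contigs per sample after prokaryotic binning (sketch)
for ID in $(cat identifiers.list); do
    N_UNBINNED=$(grep -c "^>" veba_output/binning/prokaryotic/${ID}/output/unbinned.fasta)
    echo -e "${ID}\t${N_UNBINNED}"
done
```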
-**Recommended memory request:** `128GB` - -**Conda Environment:** `conda activate VEBA-binning-eukaryotic_env` +**Recommended memory request:** `48GB` ``` N_JOBS=4 @@ -209,7 +201,7 @@ for ID in $(cat identifiers.list); do rm -f logs/${N}.* FASTA=veba_output/binning/prokaryotic/${ID}/output/unbinned.fasta BAM=veba_output/assembly/${ID}/output/mapped.sorted.bam - CMD="source activate VEBA-binning-eukaryotic_env && binning-eukaryotic.py -f ${FASTA} -b ${BAM} -n ${ID} -p ${N_JOBS} -m 1500 -a metabat2 -o ${OUT_DIR}" + CMD="source activate VEBA && veba --module binning-eukaryotic --params \"-f ${FASTA} -b ${BAM} -n ${ID} -p ${N_JOBS} -m 1500 -a metabat2 -o ${OUT_DIR}\"" # Either run this command or use SunGridEnginge/SLURM @@ -257,8 +249,6 @@ That said, if you decide to move forward with the multi-sample approach then the **Recommended memory request:** `24 GB` -**Conda Environment:** `conda activate VEBA-assembly_env` - ``` @@ -270,8 +260,6 @@ mkdir -p veba_output/misc # I recommend having this light-weight program in base environment # if you do a lot of fasta manipulation. -conda activate VEBA-preprocess_env - # -------------------------------------------------------------------- # Method 1) Shortcut @@ -302,11 +290,11 @@ compile_reads_table.py -i veba_output/preprocess/ -r > veba_output/misc/reads_ta # Now let's map all the reads to the pseudo-coassembly (i.e., all_sample_specific_mags.unbinned_contigs.gt1500.fasta) -N=pseudo-coassembly +N=Multisample N_JOBS=16 # Let's use more threads here because we are going to be handling multiple samples at once -CMD="source activate VEBA-assembly_env && coverage.py -f veba_output/misc/all_sample_specific_mags.unbinned_contigs.gt1500.fasta -r veba_output/misc/reads_table.tsv -p ${N_JOBS} -o veba_output/assembly/pseudo-coassembly -m 1500" +CMD="source activate VEBA && veba --module coverage --params \"-f veba_output/misc/all_sample_specific_mags.unbinned_contigs.gt1500.fasta -r veba_output/misc/reads_table.tsv -p ${N_JOBS} -o veba_output/assembly/Multisample -m 1500\"" # Either run this command or use SunGridEnginge/SLURM ``` @@ -327,8 +315,6 @@ Let's try to recover some prokaryotes using the concatenated unbinned contigs. **Recommended memory request:** `10 - 24GB` -**Conda Environment:** `conda activate VEBA-binning-prokaryotic_env` - ``` # Setting more threads since we are only running this once N_JOBS=32 @@ -337,14 +323,14 @@ N_JOBS=32 N_ITER=5 # Set up filepaths and names -NAME="pseudo-coassembly" +NAME="Multisample" N="binning-prokaryotic__${NAME}" rm -f logs/${N}.* -FASTA=veba_output/assembly/pseudo-coassembly/output/reference.fasta -BAMS=veba_output/assembly/pseudo-coassembly/output/*/mapped.sorted.bam +FASTA=veba_output/assembly/${NAME}/output/reference.fasta +BAMS=veba_output/assembly/${NAME}/output/*/mapped.sorted.bam # Set up command -CMD="source activate VEBA-binning-prokaryotic_env && binning-prokaryotic.py -f ${FASTA} -b ${BAMS} -n ${NAME} -p ${N_JOBS} -m 1500 -I ${N_ITER} --skip_maxbin2" +CMD="source activate VEBA && veba --module binning-prokaryotic --params \"-f ${FASTA} -b ${BAMS} -n ${NAME} -p ${N_JOBS} -m 1500 -I ${N_ITER} --skip_maxbin2\"" # Either run this command or use SunGridEnginge/SLURM @@ -356,9 +342,8 @@ Check Step 4 for the output file descriptions. #### ⚠️ 8. Recover eukaryotes from pseudo-coassembly [Optional] Let's try to recover some eukaryotes using the updated concatenated unbinned contigs. 
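Since this step passes a glob of sample-specific BAM files, a quick check that the glob expands to one BAM per sample can save a failed submission (a sketch; the `Multisample` path matches the naming used below):

```
# These two counts should match (sketch)
ls veba_output/assembly/Multisample/output/*/mapped.sorted.bam | wc -l
wc -l < identifiers.list
```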
-**Recommended memory request:** `128GB` +**Recommended memory request:** `48 GB` -**Conda Environment:** `conda activate VEBA-binning-eukaryotic_env` ``` @@ -366,14 +351,14 @@ Let's try to recover some eukaryotes using the updated concatenated unbinned con N_JOBS=32 # Set up filepaths and names -NAME="pseudo-coassembly" +NAME="Multisample" N="binning-eukaryotic__${NAME}" rm -f logs/${N}.* FASTA=veba_output/binning/prokaryotic/${NAME}/output/unbinned.fasta BAMS=veba_output/assembly/${NAME}/output/*/mapped.sorted.bam # Set up command -CMD="source activate VEBA-binning-eukaryotic_env && binning-eukaryotic.py -f ${FASTA} -b ${BAMS} -n ${NAME} -p ${N_JOBS} -m 1500 -a metabat2 -o veba_output/binning/eukaryotic" +CMD="source activate VEBA && veba --module binning-eukaryotic --params \"-f ${FASTA} -b ${BAMS} -n ${NAME} -p ${N_JOBS} -m 1500 -a metabat2 -o veba_output/binning/eukaryotic\"" # Either run this command or use SunGridEnginge/SLURM @@ -389,8 +374,6 @@ To analyze these data, we are going to generate some counts tables and we want a **Recommended memory request:** `24 GB` should work for most datasets but you may need to increase for much larger datasets. -**Conda Environment:** `conda activate VEBA-cluster_env` - ``` # We need to generate a table with the following fields: @@ -403,7 +386,7 @@ compile_genomes_table.py -i veba_output/binning/ > veba_output/misc/genomes_tabl N_JOBS=12 # Set up command -CMD="source activate VEBA-cluster_env && cluster.py -i veba_output/misc/genomes_table.tsv -o veba_output/cluster -p ${N_JOBS}" +CMD="source activate VEBA && veba --module cluster --params \"-i veba_output/misc/genomes_table.tsv -o veba_output/cluster -p ${N_JOBS}\"" # Either run this command or use SunGridEnginge/SLURM @@ -433,15 +416,13 @@ CMD="source activate VEBA-cluster_env && cluster.py -i veba_output/misc/genomes_ * global/pangenome_tables/*.tsv.gz - Pangenome tables for each SLC with prevalence values * global/serialization/*.dict.pkl - Python dictionaries for clusters * global/serialization/*.networkx_graph.pkl - NetworkX graphs for clusters -* local/* - If `--no_local_clustering` is not selected then all of the files are generated for local clustering +* local/* - If `--local_clustering` is selected then all of the files are generated for local clustering #### 10. Classify viral genomes Viral classification is performed using `geNomad`. Classification can be performed using the intermediate binning results which is much quicker. Alternatively, if you have viruses identified elsewhere you can still classify using the `--genomes` argument instead. -**Recommended memory request:** `1 GB` will work if you've performed viral binning via *VEBA*. If not, these use `16 GB` for external genomes. - -**Conda Environment:** `conda activate VEBA-classify_env` +**Recommended memory request:** `1 GB` should work if you've performed viral binning via *VEBA*. If not, these use `16 GB` for external genomes. ``` N=classify-viral @@ -458,7 +439,7 @@ CLUSTERS=veba_output/cluster/output/global/mags_to_slcs.tsv rm -rf logs/${N}.* # Set up the command -CMD="source activate VEBA-classify_env && classify-viral.py -i ${BINNING_DIRECTORY} -c ${CLUSTERS} -o veba_output/classify/viral -p ${N_JOBS}" +CMD="source activate VEBA && veba --module classify-viral --params \"-i ${BINNING_DIRECTORY} -c ${CLUSTERS} -o veba_output/classify/viral -p ${N_JOBS}\"" # Either run this command or use SunGridEnginge/SLURM @@ -474,7 +455,6 @@ Prokaryotic classification is performed using `GTDB-Tk`. 
Classification can be **Recommended memory request:** `72 GB` -**Conda Environment:** `conda activate VEBA-classify_env` ``` N_JOBS=16 @@ -490,7 +470,7 @@ BINNING_DIRECTORY=veba_output/binning/prokaryotic CLUSTERS=veba_output/cluster/output/global/mags_to_slcs.tsv # Set up the command -CMD="source activate VEBA-classify_env && classify-prokaryotic.py -i ${BINNING_DIRECTORY} -c ${CLUSTERS} -p ${N_JOBS} -o veba_output/classify/prokaryotic" +CMD="source activate VEBA && veba --module classify-prokaryotic --params \"-i ${BINNING_DIRECTORY} -c ${CLUSTERS} -p ${N_JOBS} -o veba_output/classify/prokaryotic\"" # Either run this command or use SunGridEnginge/SLURM @@ -502,11 +482,10 @@ The following output files will produced: * taxonomy.clusters.tsv - Prokaryotic cluster classification (If --clusters are provided) #### 12. Classify eukaryotic genomes -*VEBA* is going to use the *MetaEuk/MMSEQS2* protein alignments based on [*VEBA's* microeukaryotic protein database](https://doi.org/10.6084/m9.figshare.19668855.v1). The default is to use [BUSCO's eukaryota_odb10](https://busco-data.ezlab.org/v5/data/lineages/eukaryota_odb10.2020-09-10.tar.gz) marker set but you can use the annotations from all proteins if you want by providing the `--include_all_genes` flag. The former will take a little bit longer since it needs to run *hmmsearch* but it's more robust and doesn't take that much longer. +*VEBA* is going to use the *MetaEuk/MMSEQS2* protein alignments based on [*VEBA's* MicroEuk100](https://zenodo.org/records/10139451). The default is to use [BUSCO's eukaryota_odb10](https://busco-data.ezlab.org/v5/data/lineages/eukaryota_odb10.2020-09-10.tar.gz) marker set but you can use the annotations from all proteins if you want by providing the `--include_all_genes` flag but that's not recommended for classification. **Recommended memory request:** `12 GB` -**Conda Environment:** `conda activate VEBA-classify_env` ``` # This is threaded if you use the default (i.e., core marker detection) @@ -522,7 +501,7 @@ BINNING_DIRECTORY=veba_output/binning/eukaryotic CLUSTERS=veba_output/cluster/output/global/mags_to_slcs.tsv # Set up the command -CMD="source activate VEBA-classify_env && classify-eukaryotic.py -i ${BINNING_DIRECTORY} -c ${CLUSTERS} -o veba_output/classify/eukaryotic -p ${N_JOBS}" +CMD="source activate VEBA && veba --module classify-eukaryotic --params \"-i ${BINNING_DIRECTORY} -c ${CLUSTERS} -o veba_output/classify/eukaryotic -p ${N_JOBS}\"" # Either run this command or use SunGridEnginge/SLURM @@ -539,9 +518,6 @@ Instead of having 3 separate classification tables, it would be much more useful **Recommended memory request:** `1 GB` - -**Conda Environment:** `conda activate VEBA-classify_env` - ``` merge_taxonomy_classifications.py -i veba_output/classify -o veba_output/classify ``` @@ -554,8 +530,6 @@ The following output files will produced: #### 14. Annotate proteins Now that all of the MAGs are recovered and classified, let's annotate the proteins using best-hit against UniRef,MiBIG,VFDB,CAZy Pfam, AntiFam, AMRFinder, and KOFAM. HMMSearch will fail with sequences ≥ 100k so we need to remove any that are that long (there probably aren't but just to be safe). -**Conda Environment:** `conda activate VEBA-annotate_env` - ``` # Let's merge all of the proteins. 
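# A hedged sketch of the >= 100k length filter described above, assuming seqkit
# is installed in your environment (VEBA may provide its own helper for this);
# the .faa glob below is also an assumption based on the default output layout:
# cat veba_output/binning/*/*/output/genomes/*.faa > veba_output/misc/all_genomes.all_proteins.faa
# seqkit seq -M 99999 veba_output/misc/all_genomes.all_proteins.faa > veba_output/misc/all_genomes.all_proteins.lt100k.faa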
@@ -582,7 +556,7 @@ PROTEINS=veba_output/misc/all_genomes.all_proteins.lt100k.faa IDENTIFIER_MAPPING=veba_output/cluster/output/global/identifier_mapping.proteins.tsv.gz # Command -CMD="source activate VEBA-annotate_env && annotate.py -a ${PROTEINS} -i ${IDENTIFIER_MAPPING} -o veba_output/annotation -p ${N_JOBS} -u uniref50" +CMD="source activate VEBA && veba --module annotate --params \"-a ${PROTEINS} -i ${IDENTIFIER_MAPPING} -o veba_output/annotation -p ${N_JOBS} -u uniref50\"" # Either run this command or use SunGridEnginge/SLURM @@ -607,7 +581,7 @@ If you are restricted by resources or time you may want to do just annotate the PROTEINS=veba_output/cluster/output/global/representative_sequences.faa # Command -CMD="source activate VEBA-annotate_env && annotate.py -a ${PROTEINS} -o veba_output/annotation -p ${N_JOBS} -u uniref50" +CMD="source activate VEBA && veba --module annotate --params \"-a ${PROTEINS} -o veba_output/annotation -p ${N_JOBS} -u uniref50\"" ``` @@ -641,7 +615,7 @@ for i in $(seq -f "%03g" 1 ${N_PARTITIONS}); do N="annotate-${i}" rm -f logs/${N}.* FAA=${PARTITION_DIRECTORY}/stdin.part_${i}.fasta - CMD="source activate VEBA-annotate_env && annotate.py -a ${FAA} -o ${OUT_DIR}/${i} -p ${N_JOBS} -u uniref50" + CMD="source activate VEBA && veba --module annotate --params \"-a ${FAA} -o ${OUT_DIR}/${i} -p ${N_JOBS} -u uniref50\"" # Either run this command or use SunGridEnginge/SLURM diff --git a/walkthroughs/metabolic_profiling_de-novo_genomes.md b/walkthroughs/pathway_profiling_de-novo_genomes.md similarity index 91% rename from walkthroughs/metabolic_profiling_de-novo_genomes.md rename to walkthroughs/pathway_profiling_de-novo_genomes.md index fc10d50..52ce9c0 100644 --- a/walkthroughs/metabolic_profiling_de-novo_genomes.md +++ b/walkthroughs/pathway_profiling_de-novo_genomes.md @@ -1,4 +1,4 @@ -### Metabolic profiling of *de novo* genomes +### Pathway profiling of *de novo* genomes If you build a comprehensive database, you may want to use a read-based approach to functionally profile a large set of samples. This tutorial will show you how to build a custom HUMAnN database from your annotations and how to profile your samples where there is full accounting of reads and your genomes. What you'll end up with at the end of this is a merged taxonomy table, a custom HUMAnN annotation table, and HUMAnN profiles. @@ -14,8 +14,8 @@ _____________________________________________________ 3. Functional profiling using `HUMAnN` of custom database 4. Merge the tables -**Conda Environment:** `conda activate VEBA-profile_env` - +**Conda Environment:** `conda activate VEBA`. Use this for intermediate scripts. +_______________________________________________________ #### 1. Merge taxonomy from all domains @@ -64,7 +64,7 @@ do rm -f logs/${N}.* R1=veba_output/preprocess/${ID}/output/cleaned_1.fastq.gz R2=veba_output/preprocess/${ID}/output/cleaned_2.fastq.gz - CMD="source activate VEBA-profile_env && profile-pathway.py -1 ${R1} -2 ${R2} -n ${ID} -o ${OUT_DIR} -p ${N_JOBS} -i ${UNIREF_ANNOTATIONS} -f ${FASTA}" + CMD="source activate VEBA && veba --module profile-pathway --params \"-1 ${R1} -2 ${R2} -n ${ID} -o ${OUT_DIR} -p ${N_JOBS} -i ${UNIREF_ANNOTATIONS} -f ${FASTA}\"" # Either run this command or use SunGridEnginge/SLURM @@ -86,7 +86,7 @@ The following output files will produced for each sample: #### 4. Merge the tables ``` -merge_generalized_mapping.py -o veba_output/profiling/pathways/merged. 
humann_pathcoverage.tsv veba_output/profiling/pathways/*/output/humann_pathcoverage.tsv +merge_generalized_mapping.py -o veba_output/profiling/pathways/merged.humann_pathcoverage.tsv veba_output/profiling/pathways/*/output/humann_pathcoverage.tsv merge_generalized_mapping.py -o veba_output/profiling/pathways/merged.humann_pathabundance.tsv veba_output/profiling/pathways/*/output/humann_pathabundance.tsv diff --git a/walkthroughs/phylogenetic_inference.md b/walkthroughs/phylogenetic_inference.md index 6e7f4d7..7fbbe5b 100644 --- a/walkthroughs/phylogenetic_inference.md +++ b/walkthroughs/phylogenetic_inference.md @@ -12,6 +12,8 @@ _____________________________________________________ 1. Download the proteomes of similar organisms 2. Perform phylogenetic inference on proteomes +**Conda Environment:** `conda activate VEBA`. Use this for intermediate scripts. + _____________________________________________________ @@ -235,8 +237,6 @@ diatoms/SRR17458638__METABAT2__E.1__bin.3.faa Now that we have all of the files we need, we can perform phylogenetic inference using BUSCO's eukaryota_odb10 markers and score cutoffs. For eukaryotes, it's advised that you use the eukaryota_odb10 marker set because this is the core marker set used for classification. This isn't the case for prokaryotes and viruses. If you don't have enough resources to run maximum likelihood trees via *IQTREE2* then use `--no_iqtree`. -**Conda Environment:** `conda activate VEBA-phylogeny_env` - ``` # Set the number of threads @@ -260,7 +260,7 @@ MINIMUM_GENOMES_ALIGNED_RATIO=0.95 OUT_DIR=veba_output/phylogeny/diatoms # Directory -CMD="source activate VEBA-phylogeny_env && phylogeny.py -a ${PROTEINS} -o ${OUT_DIR} -p ${N_JOBS} -f name --no_iqtree -d ${HMM} -s ${SCORES} --minimum_genomes_aligned_ratio ${MINIMUM_GENOMES_ALIGNED_RATIO} +CMD="source activate VEBA && veba --module phylogeny --params \"-a ${PROTEINS} -o ${OUT_DIR} -p ${N_JOBS} -f name --no_iqtree -d ${HMM} -s ${SCORES} --minimum_genomes_aligned_ratio ${MINIMUM_GENOMES_ALIGNED_RATIO}\"" # Either run this command or use SunGridEnginge/SLURM ``` diff --git a/walkthroughs/read_mapping_and_counts_tables.md b/walkthroughs/read_mapping_and_counts_tables.md index 29ab766..a17f404 100644 --- a/walkthroughs/read_mapping_and_counts_tables.md +++ b/walkthroughs/read_mapping_and_counts_tables.md @@ -15,6 +15,8 @@ _____________________________________________________ 2. Map reads to global reference and create base counts tables 3. Merge the counts tables for all the samples +**Conda Environment:** `conda activate VEBA`. Use this for intermediate scripts. + _____________________________________________________ @@ -22,7 +24,6 @@ _____________________________________________________ Here we are going to concatenate all of the binned contigs (i.e., MAGs) and their respective gene models (i.e., GFF files) then index using `Bowtie2`. 
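Before indexing, one quick check is that the number of genome fastas matches the number of gene-model GFFs (a sketch; the `.fa` extension is an assumption, so adjust if your bins use a different extension):

```
# These two counts should match before building the global index (sketch)
ls veba_output/binning/*/*/output/genomes/*.fa | wc -l
ls veba_output/binning/*/*/output/genomes/*.gff | wc -l
```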
-**Conda Environment:** `conda activate VEBA-mapping_env` ``` @@ -44,7 +45,7 @@ ls veba_output/binning/*/*/output/genomes/*.gff > veba_output/misc/gene_models.list GENE_MODELS=veba_output/misc/gene_models.list # Set up command -CMD="source activate VEBA-mapping_env && index.py -r ${GENOMES} -g ${GENE_MODELS} -o veba_output/index/global/ -p ${N_JOBS}" +CMD="source activate VEBA && veba --module index --params \"-r ${GENOMES} -g ${GENE_MODELS} -o veba_output/index/global/ -p ${N_JOBS}\"" # Either run this command or use SunGridEngine/SLURM ``` @@ -63,9 +64,18 @@ Here we are going to map all of the reads to the global reference and create base counts **Note:** Versions prior to v1.1.2 require the output directory to include the sample name (e.g., `-o veba_output/mapping/global/${ID}` where `-n` is not used). In v1.1.2+, the output directory is automatic (e.g., `veba_output/mapping/global/` and `-n ${ID}` are used) -**Conda Environment:** `conda activate VEBA-mapping_env` ``` +# If you have run the cluster.py module you can use this: +SCAFFOLDS_TO_MAGS=veba_output/cluster/output/global/scaffolds_to_mags.tsv +SCAFFOLDS_TO_SLCS=veba_output/cluster/output/global/scaffolds_to_slcs.tsv +PROTEINS_TO_ORTHOGROUPS=veba_output/cluster/output/global/proteins_to_orthogroups.tsv +MAGS_TO_SLCS=veba_output/cluster/output/global/mags_to_slcs.tsv + +# If you skipped the clustering, you can concatenate all of the scaffolds to bins from all of the domains +cat veba_output/binning/*/*/output/scaffolds_to_bins.tsv > veba_output/misc/all_genomes.scaffolds_to_mags.tsv +SCAFFOLDS_TO_MAGS=veba_output/misc/all_genomes.scaffolds_to_mags.tsv + # Set a lower number of threads since we are running for each sample N_JOBS=2 @@ -84,7 +94,7 @@ for ID in $(cat identifiers.list); do OUT_DIR=veba_output/mapping/global # Set up command - CMD="source activate VEBA-mapping_env && mapping.py -1 ${R1} -2 ${R2} -n ${ID} -o ${OUT_DIR} -p ${N_JOBS} -x ${INDEX_DIRECTORY}" + CMD="source activate VEBA && veba --module mapping --params \"-1 ${R1} -2 ${R2} -n ${ID} -o ${OUT_DIR} -p ${N_JOBS} -x ${INDEX_DIRECTORY} --scaffolds_to_bins ${SCAFFOLDS_TO_MAGS}\"" #--scaffolds_to_clusters ${SCAFFOLDS_TO_SLCS} --proteins_to_orthogroups ${PROTEINS_TO_ORTHOGROUPS} # Either run this command or use SunGridEngine/SLURM @@ -111,16 +121,6 @@ MAPPING_DIRECTORY=veba_output/mapping/global # Set output directory (this is default) OUT_DIR=veba_output/counts -# If you have run the cluster.py module you can use this: -SCAFFOLDS_TO_MAGS=veba_output/cluster/output/global/scaffolds_to_mags.tsv -SCAFFOLDS_TO_SLCS=veba_output/cluster/output/global/scaffolds_to_slcs.tsv -#MAGS_TO_SLCS=veba_output/cluster/output/global/mags_to_slcs.tsv -PROTEINS_TO_ORTHOGROUPS=veba_output/cluster/output/global/proteins_to_orthogroups.tsv - -# If you skipped the clustering, you can oncatenate all of the scaffolds to bins from all of the domains -cat veba_output/binning/*/*/output/scaffolds_to_bins.tsv > veba_output/misc/all_genomes.scaffolds_to_mags.tsv -SCAFFOLDS_TO_MAGS=veba_output/misc/all_genomes.scaffolds_to_mags.tsv - # Merge contig-level counts (excu merge_contig_mapping.py -m ${MAPPING_DIRECTORY} -c ${MAGS_TO_SLCS} -i ${SCAFFOLDS_TO_MAGS} -o ${OUT_DIR} diff --git a/walkthroughs/recovering_viruses_from_metatranscriptomics.md b/walkthroughs/recovering_viruses_from_metatranscriptomics.md index 9890813..81a2035 100644 --- a/walkthroughs/recovering_viruses_from_metatranscriptomics.md +++ b/walkthroughs/recovering_viruses_from_metatranscriptomics.md @@ -15,9 +15,9 @@ 
_____________________________________________________ 4. Cluster genomes and proteins 5. Classify viral genomes -#### 1. Preprocess reads and get directory set up +**Conda Environment:** `conda activate VEBA`. Use this for intermediate scripts. -**Conda Environment:** `conda activate VEBA-preprocess_env` +#### 1. Preprocess reads and get directory set up Refer to the [downloading and preprocessing reads workflow](download_and_preprocess_reads.md). At this point, it's assumed you have the following: @@ -29,8 +29,6 @@ Refer to the [downloading and preprocessing reads workflow](download_and_preproc Here we are going to assemble all of the reads using `rnaSPAdes`. -**Conda Environment:** `conda activate VEBA-assembly_env` - ``` # Set the number of threads to use for each sample. Let's use 4 N_JOBS=4 @@ -53,7 +51,7 @@ for ID in $(cat identifiers.list); do R2=veba_output/preprocess/${ID}/output/cleaned_2.fastq.gz # Set up command - CMD="source activate VEBA-assembly_env && assembly.py -1 ${R1} -2 ${R2} -n ${ID} -o ${OUT_DIR} -p ${N_JOBS} -P rnaspades.py" + CMD="source activate VEBA && veba --module assembly --params \"-1 ${R1} -2 ${R2} -n ${ID} -o ${OUT_DIR} -p ${N_JOBS} -P rnaspades.py\"" # Either run this command or use SunGridEnginge/SLURM @@ -83,8 +81,6 @@ Where `g0` refers to the predicted gene and `i0` refers to the isoform transcrip #### 3. Recover viruses from metatranscriptomic assemblies We use a similar approach to the metagenomics with *geNomad* and *CheckV* but using the assembled transcripts instead. Again, the criteria for high-quality viral genomes are described by the [*CheckV* author](https://scholar.google.com/citations?user=gmKnjNQAAAAJ&hl=en) [here in this Bitbucket Issue (#38)](https://bitbucket.org/berkeleylab/checkv/issues/38/recommended-cutoffs-for-analyzing-checkv). -**Conda Environment:** `conda activate VEBA-binning-viral_env` - ``` N_JOBS=4 @@ -93,7 +89,7 @@ for ID in $(cat identifiers.list); rm -f logs/${N}.* FASTA=veba_output/transcript_assembly/${ID}/output/transcripts.fasta BAM=veba_output/transcript_assembly/${ID}/output/mapped.sorted.bam - CMD="source activate VEBA-binning-viral_env && binning-viral.py -f ${FASTA} -b ${BAM} -n ${ID} -p ${N_JOBS} -m 1500 -o veba_output/binning/viral -a genomad" + CMD="source activate VEBA && veba --module binning-viral --params \"-f ${FASTA} -b ${BAM} -n ${ID} -p ${N_JOBS} -m 1500 -o veba_output/binning/viral -a genomad\"" # Either run this command or use SunGridEnginge/SLURM @@ -123,8 +119,6 @@ for ID in $(cat identifiers.list); #### 4. Cluster genomes and proteins To analyze these data, we are going to generate some counts tables and we want a single set of features to compare across all samples. To achieve this, we are going to cluster the genomes into species-level clusters (SLC) and the proteins into SLC-specific protein clusters (SSPC). Further, this clustering is dual purpose as it alleviates some of the bias from [the curse(s) of dimensionality](https://www.nature.com/articles/s41592-018-0019-x) with dimensionality reduction via feature compression - [a type of feature engineering](https://towardsdatascience.com/what-is-feature-engineering-importance-tools-and-techniques-for-machine-learning-2080b0269f10). 
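If you want to eyeball the genomes table before clustering, a minimal sketch follows (the exact columns depend on your `compile_genomes_table.py` version, so treat the layout as an assumption):

```
# Preview the first two rows of the headerless genomes table (sketch)
compile_genomes_table.py -i veba_output/binning/ | head -n 2
```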
-**Conda Environment:** `conda activate VEBA-cluster_env` - ``` # We need to generate a table with the following fields: @@ -137,7 +131,7 @@ compile_genomes_table.py -i veba_output/binning/ > veba_output/misc/genomes_tabl N_JOBS=12 # Set up command -CMD="source activate VEBA-cluster_env && cluster.py -i veba_output/misc/genomes_table.tsv -o veba_output/cluster -p ${N_JOBS}" +CMD="source activate VEBA && veba --module cluster --params \"-i veba_output/misc/genomes_table.tsv -o veba_output/cluster -p ${N_JOBS}\"" # Either run this command or use SunGridEnginge/SLURM ``` @@ -173,9 +167,6 @@ CMD="source activate VEBA-cluster_env && cluster.py -i veba_output/misc/genomes_ #### 5. Classify viral genomes Viral classification is performed using `geNomad`. Classification can be performed using the intermediate binning results which is much quicker. Alternatively, if you have viruses identified elsewhere you can still classify using the `--genomes` argument instead. -**Conda Environment:** `conda activate VEBA-classify_env` - - ``` N=classify-viral @@ -188,7 +179,7 @@ CLUSTERS=veba_output/cluster/viral/output/clusters.tsv rm -rf logs/${N}.* # Set up the command -CMD="source activate VEBA-classify_env && classify-viral.py -i ${BINNING_DIRECTORY} -c ${CLUSTERS} -o veba_output/classify/viral" +CMD="source activate VEBA && veba --module classify-viral --params \"-i ${BINNING_DIRECTORY} -c ${CLUSTERS} -o veba_output/classify/viral\"" # Either run this command or use SunGridEnginge/SLURM diff --git a/walkthroughs/setting_up_coassemblies.md b/walkthroughs/setting_up_coassemblies.md index ebf7a8b..b922e9c 100644 --- a/walkthroughs/setting_up_coassemblies.md +++ b/walkthroughs/setting_up_coassemblies.md @@ -14,6 +14,9 @@ _____________________________________________________ 3. Coassembly using assembly.py 4. Align reads from each sample to the coassembly to create sorted BAM files that will be used for binning and counts tables. +**Conda Environment:** `conda activate VEBA`. Use this for intermediate scripts. +______________________________________________________ + #### 1. Concatenate forward and reverse reads separately Refer to the [downloading and preprocessing reads workflow](download_and_preprocess_reads.md). At this point, it's assumed you have the following: @@ -43,8 +46,6 @@ cat veba_output/preprocess/*/output/cleaned_2.fastq.gz > veba_output/misc/concat Here we are going to coassemble all of the reads using `metaSPAdes` which is default but if you are using metatranscriptomics then use `-P rnaSPAdes.py`. -**Conda Environment:** `conda activate VEBA-assembly_env` - ``` # Set the number of threads to use for each sample. Let's use 4 N_JOBS=4 @@ -67,10 +68,10 @@ R1=veba_output/misc/concatenated_1.fastq.gz R2=veba_output/misc/concatenated_2.fastq.gz # Set up command -CMD="source activate VEBA-assembly_env && assembly.py -1 ${R1} -2 ${R2} -n ${ID} -o ${OUT_DIR} -p ${N_JOBS}" +CMD="source activate VEBA && veba --module assembly --params \"-1 ${R1} -2 ${R2} -n ${ID} -o ${OUT_DIR} -p ${N_JOBS}\"" # Use this for metatranscriptomics -# CMD="source activate VEBA-assembly_env && assembly.py -1 ${R1} -2 ${R2} -n ${ID} -o ${OUT_DIR} -p ${N_JOBS} -P rnaspades.py" +# CMD="source activate VEBA && veba --module assembly --params \"-1 ${R1} -2 ${R2} -n ${ID} -o ${OUT_DIR} -p ${N_JOBS} -P rnaspades.py\"" # Either run this command or use SunGridEnginge/SLURM @@ -91,10 +92,6 @@ The main one we need is `scaffolds.fasta` which we will use for binning. Note #### 3. 
Align sample-specific reads to the coassembly - - -**Conda Environment:** `conda activate VEBA-assembly_env` - ``` N_JOBS=4 @@ -103,7 +100,7 @@ N="coverage__${ID}"; rm -f logs/${N}.* FASTA=veba_output/assembly/${ID}/output/scaffolds.fasta READS=veba_output/misc/reads_table.tsv -CMD="source activate VEBA-assembly_env && coverage.py -f ${FASTA} -r ${READS} -p ${N_JOBS} -m 1500 -o veba_output/coverage/${ID}" +CMD="source activate VEBA && veba --module coverage --params \"-f ${FASTA} -r ${READS} -p ${N_JOBS} -m 1500 -o veba_output/coverage/${ID}\"" # Either run this command or use SunGridEngine/SLURM @@ -125,7 +122,7 @@ _____________________________________________________ Now that you have a coassembly and multiple sorted BAM files, it's time for binning. Start at step 3 of the [end-to-end metagenomics](end-to-end_metagenomics.md) or [recovering viruses from metatranscriptomics](recovering_viruses_from_metatranscriptomics.md) workflows depending on whether you have metagenomic or metatranscriptomic data, respectively. -**Please do not forget to adapt the BAM argument in the `binning-prokaryotic.py` command to include all the sample-specific sorted BAM files and not the concatenated sorted BAM.** +**Please do not forget to adapt the BAM argument in the `binning-prokaryotic` command to include all the sample-specific sorted BAM files and not the concatenated sorted BAM.** More specifically, use `BAM="veba_output/coverage/coassembly/output/*/mapped.sorted.bam"` and not `BAM="veba_output/assembly/coassembly/output/mapped.sorted.bam"`. diff --git a/walkthroughs/taxonomic_profiling_de-novo_genomes.md b/walkthroughs/taxonomic_profiling_de-novo_genomes.md new file mode 100644 index 0000000..a2ae3d3 --- /dev/null +++ b/walkthroughs/taxonomic_profiling_de-novo_genomes.md @@ -0,0 +1,90 @@ +### Taxonomic profiling of *de novo* genomes +If you build a comprehensive database, you may want to use a read-based approach to taxonomically profile a large set of samples. This tutorial will show you how to build a custom `Sylph` database from your genomes and how to profile your samples for taxonomic abundance. + +What you'll end up with at the end of this is a `Sylph` database and taxonomic abundance profiles. + +Please refer to the [end-to-end metagenomics](end-to-end_metagenomics.md) or [recovering viruses from metatranscriptomics](recovering_viruses_from_metatranscriptomics.md) workflows for details on binning, clustering, and annotation. + +_____________________________________________________ + +#### Steps: +1. Compile custom `Sylph` database from *de novo* genomes +2. Taxonomic profiling using `Sylph` of custom database +3. Merge the tables + +**Conda Environment:** `conda activate VEBA`. Use this for intermediate scripts. +_______________________________________________________ + +#### 1. Compile custom `Sylph` database from *de novo* genomes + +At this point, it's assumed you have the following: + +* Clustering results from the `cluster.py` module +* A directory of preprocessed reads: `veba_output/preprocess/${ID}/output/cleaned_1.fastq.gz` and `veba_output/preprocess/${ID}/output/cleaned_2.fastq.gz` where `${ID}` represents the identifiers in `identifiers.list`. +* Genome assemblies. These can be MAGs binned with VEBA, genomes binned elsewhere, or even reference genomes you downloaded. + + +Here we are going to build 2 databases, one for viral genomes and one for non-viral genomes (i.e., prokaryotes and eukaryotes). 
The reason for 2 separate databases is that the presets used for small genomes differ from those used for medium-to-large genomes. We need a table that has `[organism_type][path/to/genome.fa]` with no headers. We already have some version of this with the `veba_output/misc/genomes_table.tsv` we made for clustering. We can pipe this into stdin for the database build script: + + +``` +cat veba_output/misc/genomes_table.tsv | cut -f1,4 | compile_custom_sylph_sketch_database_from_genomes.py -o veba_output/profiling/databases +``` + +This generates 2 `Sylph` databases (assuming you have viruses and non-viruses): + +* `veba_output/profiling/databases/genome_database-nonviral.syldb` +* `veba_output/profiling/databases/genome_database-viral.syldb` + + +#### 2. Taxonomic profiling using `Sylph` of custom database + +Now it's time to profile the reads against the `Sylph` databases. Since `Sylph` takes in a sketch of reads, we can either use a precomputed reads sketch with `-s` or provide paired-end reads (`-1` and `-2`) to compute the sketch in the backend. + +``` +N_JOBS=4 +OUT_DIR=veba_output/profiling/taxonomy +DATABASES=veba_output/profiling/databases/*.syldb +MAGS_TO_SLCS=veba_output/cluster/output/global/mags_to_slcs.tsv # Assuming you have clustering results + +mkdir -p logs + +for ID in $(cat identifiers.list); +do + N="profile-taxonomy__${ID}"; + rm -f logs/${N}.* + R1=veba_output/preprocess/${ID}/output/cleaned_1.fastq.gz + R2=veba_output/preprocess/${ID}/output/cleaned_2.fastq.gz + CMD="source activate VEBA && veba --module profile-taxonomy --params \"-1 ${R1} -2 ${R2} -n ${ID} -o ${OUT_DIR} -p ${N_JOBS} -d ${DATABASES} -c ${MAGS_TO_SLCS}\"" + + # Either run this command or use SunGridEngine/SLURM + +done + +``` + +The following output files will be produced for each sample: + +* reads.sylsp - Reads sketch if paired-end reads were provided +* sylph\_profile.tsv.gz - Output of `sylph profile` +* taxonomic_abundance.tsv.gz - Genome-level taxonomic abundance (No header) +* taxonomic_abundance.clusters.tsv.gz - SLC-level taxonomic abundance (No header) + +#### 3. Merge the tables + +``` +merge_generalized_mapping.py -o veba_output/profiling/taxonomy/merged.taxonomic_abundance.tsv.gz veba_output/profiling/taxonomy/*/output/taxonomic_abundance.tsv.gz + +merge_generalized_mapping.py -o veba_output/profiling/taxonomy/merged.taxonomic_abundance.clusters.tsv.gz veba_output/profiling/taxonomy/*/output/taxonomic_abundance.clusters.tsv.gz +``` + +The following output files will be produced: + +* merged.taxonomic\_abundance.tsv.gz - Merged genome-level taxonomic abundance matrix +* merged.taxonomic\_abundance.clusters.tsv.gz - Merged SLC-level taxonomic abundance matrix + +_____________________________________________________ + +#### Next steps: + +Subset stratified tables by their respective levels.
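For a quick look at the merged matrices before any downstream subsetting, something like the following works (a sketch; assumes the merged gzipped TSVs produced above):

```
# Peek at the first rows/columns of the merged genome-level abundance matrix (sketch)
zcat veba_output/profiling/taxonomy/merged.taxonomic_abundance.tsv.gz | head -n 5 | cut -f 1-5
```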