monarch-initiative · ielis · May 22, 2024 · Jan 14, 2024 · Feb 2, 2024 · Feb 9, 2024
diff --git a/.github/workflows/generate_phenopackets.yml b/.github/workflows/generate_phenopackets.yml
diff --git a/docs/developers/developers.md b/docs/developers/developers.md
@@ -1,5 +1,35 @@
 # For developers
 
+## Local Installation
+
+We recommend creating a local virtual environment:
+
+```bash
+python3 -m venv venv
+source venv/bin/activate
+```
+
+and updating Python's `pip` tool:
+
+```bash
+python3 -m pip install --upgrade pip
+```
+
+You can then do a local/editable install:
+
+
+```bash
+python3 -m pip install --editable ".[test]"
+```
+
+After installation, you should be able to run the test suite:
+
+```bash
+pytest
+```
+
+
+## Creating Phenopackets
 
 pyphetools provides two main ways of creating phenopackets.
 

diff --git a/docs/img/deletion_error.png b/docs/img/deletion_error.png
diff --git a/docs/user-guide/python_notebook.md b/docs/user-guide/python_notebook.md
@@ -15,27 +15,53 @@ import pyphetools
 print(f"Using pyphetools version {pyphetools.__version__}")
 ```
 
-Import the [Human Phenotype Ontology (HPO)](https://hpo.jax.org/app/) hp.json file. Note that here we show code that assumes that the file is available in the enclosing directory. Update the ORCID identifier to your own [ORCID](https://orcid.org/){:target="_blank"}  id. Indicate
-the location of the template file.
+### Set paths and identifiers
+Update the ORCID identifier to your own [ORCID](https://orcid.org/){:target="_blank"} id.
+Update the path to the template file.
 
 ```python
 template = "input/BRD4_individuals.xlsx"
-hp_json = "../hp.json"
 created_by = "0000-0002-0736-9199"
 ```
 
-import the template file. The code returns the pyphetools Individual objects, each of which contains all of the information needed to create a phenopacket and which here can be used if desired for debugging or further analysis. The cvalidator object is used to display quality assessment information.
+### Import the template file
+The code returns the pyphetools Individual objects, each of which contains all of the information needed to create a phenopacket; they can also be used for debugging or further analysis. The cvalidator object is used to display quality assessment information.
+Optionally, you can pass the location of the hp.json file using the ``hp_json`` argument. If no argument is provided, the hpo-toolkit library will download the latest version of
+hp.json to your user directory (.hpotk folder).
 
-```
-timporter = TemplateImporter(template=template, hp_json=hp_json, created_by=created_by)
+```python
+timporter = TemplateImporter(template=template, created_by=created_by)
 individual_list, cvalidator = timporter.import_phenopackets_from_template()
 ```
-Display quality assessment data.
+
+### Structural variants
+pyphetools will automatically retrieve information about small variants coded as HGVS strings using the
+[VariantValidator](https://variantvalidator.org/) API. Until very recently, it was challenging to determine the exact positions of larger structural variants, and for this reason, publications often described them
+using phrases such as "whole gene deletion" or "EX9-12DEL". If such a string is found in the template file,
+pyphetools will emit an error such as the following.
+
+<figure markdown>
+![Validation results](../img/deletion_error.png){ width="1000" }
+<figcaption>Validation Results.
+</figcaption>
+</figure>
+
+This can be fixed by passing an argument with a set of all strings that represent deletions (as in the following example), duplications, or inversions.
+
+```python title="Specifying structural variants"
+del_set = {"EX9-12DEL"}
+timporter = TemplateImporter(template=template, created_by=created_by)
+individual_list, cvalidator = timporter.import_phenopackets_from_template(deletions=del_set)
+```
+
+### Display quality assessment data
 ```
 qc = QcVisualizer(cohort_validator=cvalidator)
 display(HTML(qc.to_summary_html()))
 ```
-Display summaries of each phenopacket. The command ``cvalidator.get_error_free_individual_list()``returns versions of the Individual objects
+### Display summaries of each phenopacket
+
+The command ``cvalidator.get_error_free_individual_list()`` returns versions of the Individual objects
 in which errors such as redundancies have been removed; this is the data that gets transformed into phenopackets.
 
 

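When a template contains many free-text variant descriptions, the set passed as `deletions` can be assembled by scanning the variant strings for entries that do not look like HGVS expressions. The following is a minimal sketch under stated assumptions: `collect_non_hgvs` is a hypothetical helper that is not part of pyphetools, and the regular expression is only a rough heuristic (an optional accession prefix followed by `c.`, `g.`, `n.`, or `m.`). The collected strings still need to be sorted by hand into deletions, duplications, and inversions.

```python
import re
from typing import Iterable, Set

# Rough heuristic (an assumption, not the pyphetools logic):
# an optional "<accession>:" prefix followed by c., g., n., or m. and the change.
_HGVS_LIKE = re.compile(r"(?:\S+:)?[cgnm]\.\S+")


def collect_non_hgvs(variant_strings: Iterable[str]) -> Set[str]:
    """Return the non-empty strings that do not look like HGVS expressions."""
    cleaned = (v.strip() for v in variant_strings)
    return {v for v in cleaned if v and not _HGVS_LIKE.fullmatch(v)}
```

After manual review, the strings that describe deletions could be passed as `deletions=...` in the call shown in the structural-variants example.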
diff --git a/docs/user-guide/template.md b/docs/user-guide/template.md
@@ -1,7 +1,7 @@
 # Data-Entry Template
 
 pyphetools offers two main ways to encode clinical data as phenopackets. The library provides various functions to encode data found in
-typical supplementary materials of publications about cohorts. This option, which is covered in more detail in TODO is intended for those
+typical supplementary materials of publications about cohorts. This option, which is covered in more detail [here](../developers/developers.md), is intended for those
 with skills in scripting with Python. Additionally, pyphetools can ingest data encoded in an Excel template that can be used without additional scripting.
 The template can be ingested using a standardized notebook. Alternatively, users are invited to work with the HPO team to enter the data into the HPO database.
 
@@ -57,4 +57,15 @@ tcreator.create_template(disease_id=disease_id,
                          transcript=ofd1_transcript)
 ```
 
+The following snippet can be used as a "starter" by pasting it into the notebook.
+
+```python
+tcreator.create_template(disease_id="",
+                         disease_label="",
+                         gene_symbol="",
+                         HGNC_id="",
+                         transcript="")
+```
+
+
 The script creates a file that can be opened in Excel and used for curation. Add additional HPO terms as necessary and remove terms that are not needed.
diff --git a/src/pyphetools/__init__.py b/src/pyphetools/__init__.py
@@ -4,7 +4,7 @@
 from . import visualization
 from . import validation
 
-__version__ = "0.9.77"
+__version__ = "0.9.85"
 
 __all__ = [
     "creation",

diff --git a/src/pyphetools/creation/case_template_encoder.py b/src/pyphetools/creation/case_template_encoder.py
@@ -388,7 +388,7 @@ def _parse_individual(self, row:pd.Series):
         elif sex == "U":
             sex = Constants.UNKNOWN_SEX_SYMBOL
         else:
-            raise ValueError(f"Unrecognized sex symbol: {sex}")
+            raise ValueError(f"Unrecognized sex symbol: {sex} for individual \"{individual_id}\"")
         onset_age = data_items.get(AGE_OF_ONSET_FIELDNAME)
         if onset_age is not None and isinstance(onset_age, str):
             onset_age = PyPheToolsAge.get_age(onset_age)
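The hunk above adds the individual identifier to the error message so that the offending template row can be located. A minimal sketch of the same pattern follows. It uses a hypothetical `parse_sex` helper and placeholder return values rather than the library's `Constants`; only the `U` symbol and the error text appear in the diff, and the remaining symbols are assumptions.

```python
# Placeholder mapping; the symbols other than "U" are assumed for illustration.
_SEX_SYMBOLS = {
    "M": "MALE",
    "F": "FEMALE",
    "O": "OTHER_SEX",
    "U": "UNKNOWN_SEX",
}


def parse_sex(symbol: str, individual_id: str) -> str:
    """Map a one-letter sex symbol to a placeholder constant.

    The error names the individual so the template row is easy to find.
    """
    try:
        return _SEX_SYMBOLS[symbol]
    except KeyError:
        raise ValueError(
            f'Unrecognized sex symbol: {symbol} for individual "{individual_id}"'
        ) from None
```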

diff --git a/src/pyphetools/creation/create_template.py b/src/pyphetools/creation/create_template.py
@@ -1,20 +1,31 @@
 import os
+import typing
+
 import pandas as pd
 from collections import defaultdict
 from .hpo_parser  import HpoParser
+from .hp_term import HpTerm
 from typing import List
 import hpotk
 from .case_template_encoder import REQUIRED_H1_FIELDS, REQUIRED_H2_FIELDS
 
 class TemplateCreator:
 
-    def __init__(self, hp_json:str, hp_cr_index:str=None) -> None:
-        if not os.path.isfile(hp_json):
+    def __init__(
+            self,
+            hp_json: typing.Optional[str] = None,
+            hp_cr_index: typing.Optional[str] = None,
+    ) -> None:
+        if hp_json is None:
+            parser = HpoParser()
+        elif not os.path.isfile(hp_json):
             raise FileNotFoundError(f"Could not find hp.json file at {hp_json}")
-        if hp_cr_index:
+        else:
+            parser = HpoParser(hpo_json_file=hp_json)
+        if hp_cr_index is not None:
             if not os.path.isfile(hp_cr_index):
                 raise FileNotFoundError(f"Could not find the FastHPOCR index file at {hp_cr_index}")
-        parser = HpoParser(hpo_json_file=hp_json)
+
         self._hpo_cr = parser.get_hpo_concept_recognizer(hp_cr_index=hp_cr_index)
         self._hpo_ontology = parser.get_ontology()
         self._all_added_hp_term_set = set()
@@ -60,7 +71,7 @@ def arrange_terms(self) -> List[hpotk.model.TermId]:
         return hp_term_list
 
 
-    def create_template(self, disease_id:str, disease_label:str, HGNC_id:str, gene_symbol:str, transcript:str, append=False):
+    def create_template(self, disease_id:str, disease_label:str, HGNC_id:str, gene_symbol:str, transcript:str):
         """Create an Excel file that can be used to enter data as a pyphetools template
 
         :param disease_id: an OMIM, MONDO, or other similar CURIE identifier
@@ -107,17 +118,56 @@ def create_template(self, disease_id:str, disease_label:str, HGNC_id:str, gene_s
             df.loc[len(df)] = new_row
         ## Output as excel
         fname = disease_id.replace(":", "_") + "_individuals.xlsx"
-
         if os.path.isfile(fname):
-            if not append:
-                raise FileExistsError(f"Excel file '{fname}' already exists. Use 'append=True' \
-                                        to append HPO terms to the existing file.")
-            else:
-                print(f"[WARNING] Appending to existing file '{fname}'. This might lead to duplicate HPO terms. \
-                        It's recommended to create a new file instead.")
-
+            raise FileExistsError(f"Excel file '{fname}' already exists.")
         df.to_excel(fname, index=False)
-        print(f"Write excel pyphetools template file to {fname}")
-
-
+        print(f"Wrote Excel pyphetools template file to {fname}")
 
+    def create_from_phenopacket(self, ppkt):
+        """
+        Create a pyphetools template from an individual phenopacket.
+        This function is intended to accelerate the process of converting the LIRICAL phenopackets
+        to our current format and generally should not be used for new cases.
+        """
+        id_to_observed = set()
+        id_to_excluded = set()
+
+        for pf in ppkt.phenotypic_features:
+            hpt = HpTerm(hpo_id=pf.type.id, label=pf.type.label)
+            self._all_added_hp_term_set.add(hpt)
+            if pf.excluded:
+                id_to_excluded.add(pf.type.label)
+            else:
+                id_to_observed.add(pf.type.label)
+        H1_Headers = list(REQUIRED_H1_FIELDS)  # copy so the module-level lists are not mutated by append
+        H2_Headers = list(REQUIRED_H2_FIELDS)
+        if len(H1_Headers) != len(H2_Headers):
+            raise ValueError("Header lists must have same length")
+        EMPTY_STRING = ""
+        hp_term_list = self.arrange_terms()
+        for hpt in hp_term_list:
+            H1_Headers.append(hpt.label)
+            H2_Headers.append(hpt.id)
+        df = pd.DataFrame(columns=H1_Headers)
+        new_row = dict()
+        for i in range(len(H1_Headers)):
+            new_row[H1_Headers[i]] = H2_Headers[i]
+        df.loc[len(df)] = new_row
+        # add one row with some of the data from the phenopacket
+        new_row = dict()
+        for i in range(len(H1_Headers)):
+            header_field = H1_Headers[i]
+            if header_field == "HPO":
+                new_row[header_field] = "na"
+            elif header_field in id_to_observed:
+                new_row[header_field] = "observed"
+            elif header_field in id_to_excluded:
+                new_row[header_field] = "excluded"
+            else:
+                new_row[header_field] = "?"
+        df.loc[len(df)] = new_row
+        ## Output as excel
+        ppkt_id = "".join(e for e in ppkt.id if e.isalnum())
+        fname = ppkt_id + "_phenopacket_template.xlsx"
+        df.to_excel(fname, index=False)
+        print(f"Wrote Excel pyphetools template file to {fname}")
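`create_from_phenopacket` extends the header lists with HPO term labels and identifiers. When such lists originate from module-level constants, binding a new name to them does not copy them, so appended items persist across calls. A minimal sketch of the difference, using hypothetical function names and sample header values that are not taken from pyphetools:

```python
from typing import List


def extend_aliased(base: List[str], labels: List[str]) -> List[str]:
    """Bind a new name to the same list object; `base` is modified in place."""
    headers = base
    headers.extend(labels)
    return headers


def extend_copied(base: List[str], labels: List[str]) -> List[str]:
    """Work on a shallow copy; `base` is left untouched."""
    headers = list(base)
    headers.extend(labels)
    return headers
```

With the aliased version, a second call on the same base list would see the labels added by the first call, which would show up as stale HPO columns in a later template.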