Commit

version bump
pnrobinson committed May 28, 2024
2 parents 7951e26 + b7ac7bb commit 7c3cb10
Showing 21 changed files with 448 additions and 127 deletions.
65 changes: 0 additions & 65 deletions .github/workflows/generate_phenopackets.yml

This file was deleted.

30 changes: 30 additions & 0 deletions docs/developers/developers.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,35 @@
# For developers

## Local Installation

We recommend creating a local environment:

```bash
python3 -m venv venv
source venv/bin/activate
```

and updating Python's `pip` tool:

```bash
python3 -m pip install --upgrade pip
```

You can then do a local/editable install:


```bash
python3 -m pip install --editable ".[test]"
```

After installation you should be able to run the test suite:

```bash
pytest
```


## Creating Phenopackets

pyphetools provides two main ways of creating phenopackets.

Binary file added docs/img/deletion_error.png
42 changes: 34 additions & 8 deletions docs/user-guide/python_notebook.md
@@ -15,27 +15,53 @@ import pyphetools
print(f"Using pyphetools version {pyphetools.__version__}")
```

Import the [Human Phenotype Ontology (HPO)](https://hpo.jax.org/app/) hp.json file. Note that here we show code that assumes that the file is available in the enclosing directory. Update the ORCID identifier to your own [ORCID](https://orcid.org/){:target="_blank"} id. Indicate
the location of the template file.
### Set paths and identifiers
Update the ORCID identifier to your own [ORCID](https://orcid.org/){:target="_blank"} id.
Update the path to the template file.

```python
template = "input/BRD4_individuals.xlsx"
hp_json = "../hp.json"
created_by = "0000-0002-0736-9199"
```

import the template file. The code returns the pyphetools Individual objects, each of which contains all of the information needed to create a phenopacket and which here can be used if desired for debugging or further analysis. The cvalidator object is used to display quality assessment information.
### Import the template file.
The code returns the pyphetools Individual objects, each of which contains all of the information needed to create a phenopacket and which here can be used if desired for debugging or further analysis. The cvalidator object is used to display quality assessment information.
Optionally, you can specify the location of the hp.json file with the ``hp_json`` argument. If it is omitted, the hpo-toolkit library downloads the latest version of
hp.json to your user directory (the .hpotk folder).

```
timporter = TemplateImporter(template=template, hp_json=hp_json, created_by=created_by)
```python
timporter = TemplateImporter(template=template, created_by=created_by)
individual_list, cvalidator = timporter.import_phenopackets_from_template()
```
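
The optional `hp_json` handling can be pictured with the following minimal sketch. It mirrors the checks that this commit adds to `TemplateCreator.__init__`, but the helper name `resolve_hp_json` is hypothetical and is not part of pyphetools; the actual download is left to hpo-toolkit and is not shown.

```python
import os
from typing import Optional


def resolve_hp_json(hp_json: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper (not a pyphetools function).

    Returns None when no path is given, in which case the caller would
    fall back to a downloaded copy of hp.json. Returns the path unchanged
    when it points to an existing file, and raises otherwise.
    """
    if hp_json is None:
        return None
    if not os.path.isfile(hp_json):
        raise FileNotFoundError(f"Could not find hp.json file at {hp_json}")
    return hp_json
```

Failing early on a wrong path avoids a confusing error later, when the ontology is first used.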
Display quality assessment data.

### Structural variants
pyphetools will automatically retrieve information about small variants coded as HGVS strings using the
[VariantValidator](https://variantvalidator.org/) API. Until very recently, it was challenging to determine the exact positions of larger structural variants, and for this reason, publications often described them
using phrases such as "whole gene deletion" or "EX9-12DEL". If such a string is found in the template file,
pyphetools will emit an error such as the following.

<figure markdown>
![Validation results](../img/deletion_error.png){ width="1000" }
<figcaption>Validation Results.
</figcaption>
</figure>

This can be fixed by passing an argument with a set of all strings that represent deletions (as in the following example), duplications, or inversions.

```python title="Specifying structural variants"
del_set = {"EX9-12DEL"}
timporter = TemplateImporter(template=template, created_by=created_by)
individual_list, cvalidator = timporter.import_phenopackets_from_template(deletions=del_set)
```
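
To illustrate the idea behind these sets, the sketch below labels an allele string by looking it up in user-supplied sets. It is only an assumption about the general approach: the function `classify_structural_variant` and its `duplications` and `inversions` parameters are hypothetical names chosen for this example, and only the `deletions` argument appears in the documentation above.

```python
from typing import Optional, Set


def classify_structural_variant(
    allele: str,
    deletions: Optional[Set[str]] = None,
    duplications: Optional[Set[str]] = None,
    inversions: Optional[Set[str]] = None,
) -> str:
    """Hypothetical helper: label a free-text allele string from a template."""
    if allele in (deletions or set()):
        return "deletion"
    if allele in (duplications or set()):
        return "duplication"
    if allele in (inversions or set()):
        return "inversion"
    # Assumption: anything not listed is treated as an HGVS string.
    return "hgvs"


del_set = {"EX9-12DEL"}
print(classify_structural_variant("EX9-12DEL", deletions=del_set))
print(classify_structural_variant("c.100A>G", deletions=del_set))
```

Exact string matching is used here, so every distinct spelling found in the template has to be listed in the corresponding set.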

### Display quality assessment data.
```
qc = QcVisualizer(cohort_validator=cvalidator)
display(HTML(qc.to_summary_html()))
```
Display summaries of each phenopacket. The command ``cvalidator.get_error_free_individual_list()``returns versions of the Individual objects
### Display summaries of each phenopacket.

The command ``cvalidator.get_error_free_individual_list()`` returns versions of the Individual objects
in which errors such as redundancies have been removed; this is the data that gets transformed into phenopackets.


13 changes: 12 additions & 1 deletion docs/user-guide/template.md
@@ -1,7 +1,7 @@
# Data-Entry Template

pyphetools offers two main ways to encode clinical data as phenopackets. The library provides various functions to encode data found in
typical supplementary materials of publications about cohorts. This option, which is covered in more detail in TODO is intended for those
typical supplementary materials of publications about cohorts. This option, which is covered in more detail [here](../developers/developers.md) is intended for those
with skills in scripting with Python. Additionally, pyphetools can ingest data encoded in an Excel template that can be used without additional scripting.
The template can be ingested using a standardized notebook. Alternatively, users are invited to work with the HPO team to enter the data into the HPO database.

@@ -57,4 +57,15 @@ tcreator.create_template(disease_id=disease_id,
transcript=ofd1_transcript)
```

The following snippet can be used as a "starter" by pasting it into the notebook.

```python
tc.create_template(disease_id="",
disease_label="",
gene_symbol="",
HGNC_id="",
transcript="")
```


The script creates a file that can be opened in Excel and used for curation. Add additional HPO terms as necessary and remove terms that are not needed.
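
The name of the output file is derived from the disease identifier. The sketch below reproduces that naming rule and the existing-file check from the `create_template` code in this commit; the helper names `template_filename` and `ensure_new_file` and the identifier `OMIM:123456` are hypothetical and used only for illustration.

```python
import os


def template_filename(disease_id: str) -> str:
    """Hypothetical helper: build the template file name from a CURIE."""
    return disease_id.replace(":", "_") + "_individuals.xlsx"


def ensure_new_file(fname: str) -> None:
    """Hypothetical helper: refuse to overwrite an existing template."""
    if os.path.isfile(fname):
        raise FileExistsError(f"Excel file '{fname}' already exists.")


fname = template_filename("OMIM:123456")  # placeholder identifier
print(fname)
```

If the file already exists, move or rename it before creating the template again.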
2 changes: 2 additions & 0 deletions src/pyphetools/__init__.py
@@ -4,8 +4,10 @@
from . import visualization
from . import validation


__version__ = "0.9.88"


__all__ = [
"creation",
"pp",
2 changes: 2 additions & 0 deletions src/pyphetools/creation/__init__.py
@@ -8,6 +8,7 @@
from .column_mapper import ColumnMapper
from .constant_column_mapper import ConstantColumnMapper
from .create_template import TemplateCreator
from .discombulator import Discombobulator
from .disease import Disease
from .disease_id_column_mapper import DiseaseIdColumnMapper
from .hgvs_variant import HgvsVariant
@@ -46,6 +47,7 @@
"ColumnMapper",
"ConstantColumnMapper",
"ColumnMapper",
"Discombobulator",
"Disease",
"DiseaseIdColumnMapper",
"HgvsVariant",
2 changes: 1 addition & 1 deletion src/pyphetools/creation/case_template_encoder.py
@@ -388,7 +388,7 @@ def _parse_individual(self, row:pd.Series):
elif sex == "U":
sex = Constants.UNKNOWN_SEX_SYMBOL
else:
raise ValueError(f"Unrecognized sex symbol: {sex}")
raise ValueError(f"Unrecognized sex symbol: {sex} for individual \"{individual_id}\"")
onset_age = data_items.get(AGE_OF_ONSET_FIELDNAME)
if onset_age is not None and isinstance(onset_age, str):
onset_age = PyPheToolsAge.get_age(onset_age)
82 changes: 66 additions & 16 deletions src/pyphetools/creation/create_template.py
@@ -1,20 +1,31 @@
import os
import typing

import pandas as pd
from collections import defaultdict
from .hpo_parser import HpoParser
from .hp_term import HpTerm
from typing import List
import hpotk
from .case_template_encoder import REQUIRED_H1_FIELDS, REQUIRED_H2_FIELDS

class TemplateCreator:

def __init__(self, hp_json:str, hp_cr_index:str=None) -> None:
if not os.path.isfile(hp_json):
def __init__(
self,
hp_json: typing.Optional[str] = None,
hp_cr_index: typing.Optional[str] = None,
) -> None:
if hp_json is None:
parser = HpoParser()
elif not os.path.isfile(hp_json):
raise FileNotFoundError(f"Could not find hp.json file at {hp_json}")
if hp_cr_index:
else:
parser = HpoParser(hpo_json_file=hp_json)
if hp_cr_index is not None:
if not os.path.isfile(hp_cr_index):
raise FileNotFoundError(f"Could not find the FastHPOCR index file at {hp_cr_index}")
parser = HpoParser(hpo_json_file=hp_json)

self._hpo_cr = parser.get_hpo_concept_recognizer(hp_cr_index=hp_cr_index)
self._hpo_ontology = parser.get_ontology()
self._all_added_hp_term_set = set()
@@ -60,7 +71,7 @@ def arrange_terms(self) -> List[hpotk.model.TermId]:
return hp_term_list


def create_template(self, disease_id:str, disease_label:str, HGNC_id:str, gene_symbol:str, transcript:str, append=False):
def create_template(self, disease_id:str, disease_label:str, HGNC_id:str, gene_symbol:str, transcript:str):
"""Create an Excel file that can be used to enter data as a pyphetools template
:param disease_id: an OMIM, MONDO, or other similar CURIE identifier
@@ -107,17 +118,56 @@ def create_template(self, disease_id:str, disease_label:str, HGNC_id:str, gene_s
df.loc[len(df)] = new_row
## Output as excel
fname = disease_id.replace(":", "_") + "_individuals.xlsx"

if os.path.isfile(fname):
if not append:
raise FileExistsError(f"Excel file '{fname}' already exists. Use 'append=True' \
to append HPO terms to the existing file.")
else:
print(f"[WARNING] Appending to existing file '{fname}'. This might lead to duplicate HPO terms. \
It's recommended to create a new file instead.")

raise FileExistsError(f"Excel file '{fname}' already exists.")
df.to_excel(fname, index=False)
print(f"Write excel pyphetools template file to {fname}")


print(f"Wrote Excel pyphetools template file to {fname}")

def create_from_phenopacket(self, ppkt):
"""
create pyphetools templates from an individual phenopacket.
This function is intended to accelerate the process of converting the LIRICAL phenopackets
to our current format and generally should not be used for new cases
"""
id_to_observed = set()
id_to_excluded = set()

for pf in ppkt.phenotypic_features:
hpt = HpTerm(hpo_id=pf.type.id, label=pf.type.label)
self._all_added_hp_term_set.add(hpt)
if pf.excluded:
id_to_excluded.add(pf.type.label)
else:
id_to_observed.add(pf.type.label)
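# Copy the required fields so that appending HPO columns does not mutate the module-level lists
H1_Headers = list(REQUIRED_H1_FIELDS)
H2_Headers = list(REQUIRED_H2_FIELDS)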
if len(H1_Headers) != len(H2_Headers):
raise ValueError("Header lists must have same length")
EMPTY_STRING = ""
hp_term_list = self.arrange_terms()
for hpt in hp_term_list:
H1_Headers.append(hpt.label)
H2_Headers.append(hpt.id)
df = pd.DataFrame(columns=H1_Headers)
new_row = dict()
for i in range(len(H1_Headers)):
new_row[H1_Headers[i]] = H2_Headers[i]
df.loc[len(df)] = new_row
# add one row with some of the data from the phenopacket
new_row = dict()
for i in range(len(H1_Headers)):
header_field = H1_Headers[i]
if header_field == "HPO":
new_row[header_field] = "na"
elif header_field in id_to_observed:
new_row[header_field] = "observed"
elif header_field in id_to_excluded:
new_row[header_field] = "excluded"
else:
new_row[header_field] = "?"
df.loc[len(df)] = new_row
## Output as excel
ppkt_id = "".join(e for e in ppkt.id if e.isalnum())
fname = ppkt_id + "_phenopacket_template.xlsx"
df.to_excel(fname, index=False)
print(f"Wrote excel pyphetools template file to {fname}")
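
The row-filling logic of `create_from_phenopacket` can be tried in isolation with the sketch below. It is a simplified stand-in, not the library code: `build_row` and `sanitize_id` are hypothetical helper names, the headers and labels are made-up examples, and plain sets of labels replace the phenopacket object.

```python
from typing import Dict, List, Set


def build_row(headers: List[str], observed: Set[str], excluded: Set[str]) -> Dict[str, str]:
    """Hypothetical helper: fill one template row from observed/excluded labels."""
    row = {}
    for header in headers:
        if header == "HPO":
            row[header] = "na"  # separator column between metadata and HPO terms
        elif header in observed:
            row[header] = "observed"
        elif header in excluded:
            row[header] = "excluded"
        else:
            row[header] = "?"  # left for the curator to fill in
    return row


def sanitize_id(ppkt_id: str) -> str:
    """Hypothetical helper: keep only alphanumeric characters for the file name."""
    return "".join(ch for ch in ppkt_id if ch.isalnum())


headers = ["PMID", "HPO", "Seizure", "Ataxia"]  # illustrative column names
print(build_row(headers, observed={"Seizure"}, excluded={"Ataxia"}))
print(sanitize_id("PMID_123-A") + "_phenopacket_template.xlsx")
```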