Fix incorrectly merged PR for issue 114 #118

Merged
merged 25 commits into from
May 22, 2024
25 commits
5c3ed6b
Merge pull request #73 from monarch-initiative/develop
pnrobinson Jan 14, 2024
403254a
Merge pull request #75 from monarch-initiative/develop
pnrobinson Feb 2, 2024
f882105
Merge pull request #78 from monarch-initiative/develop
pnrobinson Feb 9, 2024
7feacc8
Merge pull request #84 from monarch-initiative/develop
pnrobinson Mar 1, 2024
05a398d
Merge pull request #94 from monarch-initiative/develop
pnrobinson Mar 22, 2024
7333bdb
Merge pull request #96 from monarch-initiative/develop
pnrobinson Mar 24, 2024
872588e
Merge pull request #103 from monarch-initiative/develop
pnrobinson Mar 29, 2024
5c5cb10
Merge pull request #104 from monarch-initiative/develop
pnrobinson Mar 30, 2024
a69b7de
updates
pnrobinson Apr 23, 2024
dd176f6
refactor template to use hpotk
pnrobinson Apr 28, 2024
9f5f161
adding image
pnrobinson Apr 28, 2024
a459f0b
improved id generation
pnrobinson Apr 29, 2024
e78ef38
date for biocurator
pnrobinson Apr 29, 2024
8b4a037
lenient MOI for diseases with more than one MOI
pnrobinson May 2, 2024
56d9df5
update docs
pnrobinson May 4, 2024
384bfe1
update docs
pnrobinson May 4, 2024
f097e6a
create template from phenopacket
pnrobinson May 5, 2024
df8f443
Adding installation guide.
cmungall May 7, 2024
44728b7
Put virtual environment into `venv`.
ielis May 8, 2024
885c0fb
Suggest updating `pip` before installation.
ielis May 8, 2024
c322938
Generate FBN1 phenopackets using today's `phenopacket-store` state.
ielis May 8, 2024
d79c275
Update *FBN1* notebook names.
ielis May 8, 2024
ee0ebc8
triggered rebuild
pnrobinson May 21, 2024
42c5846
Drop the GitHub action that runs selected `phenopacket-store` notebooks.
ielis May 21, 2024
f5a7c56
Merge branch 'refs/heads/develop' into issue-114-fix
ielis May 22, 2024
65 changes: 0 additions & 65 deletions .github/workflows/generate_phenopackets.yml

This file was deleted.

30 changes: 30 additions & 0 deletions docs/developers/developers.md
@@ -1,5 +1,35 @@
# For developers

## Local Installation

We recommend creating a local environment:

```bash
python3 -m venv venv
source venv/bin/activate
```

and updating Python's `pip` tool:

```bash
python3 -m pip install --upgrade pip
```

You can then do a local/editable install:


```bash
python3 -m pip install --editable ".[test]"
```

After installation you should be able to run the test suite:

```bash
pytest
```
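
To confirm from Python that the editable install is visible in the active virtual environment, something like the following can be used. This is a minimal sketch: the helper name `installed_version` is ours and not part of pyphetools, and it assumes the distribution is registered under the name `pyphetools`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return None


if __name__ == "__main__":
    found = installed_version("pyphetools")  # distribution name is an assumption
    print(f"pyphetools: {found if found is not None else 'not installed'}")
```

If this prints `not installed`, the virtual environment is probably not activated in the shell that runs `pytest`.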


## Creating Phenopackets

pyphetools provides two main ways of creating phenopackets.

Binary file added docs/img/deletion_error.png
42 changes: 34 additions & 8 deletions docs/user-guide/python_notebook.md
@@ -15,27 +15,53 @@ import pyphetools
print(f"Using pyphetools version {pyphetools.__version__}")
```

Import the [Human Phenotype Ontology (HPO)](https://hpo.jax.org/app/) hp.json file. Note that here we show code that assumes that the file is available in the enclosing directory. Update the ORCID identifier to your own [ORCID](https://orcid.org/){:target="_blank"} id. Indicate
the location of the template file.
### Set paths and identifiers
Update the ORCID identifier to your own [ORCID](https://orcid.org/){:target="_blank"} id.
Update the path to the template file.

```python
template = "input/BRD4_individuals.xlsx"
hp_json = "../hp.json"
created_by = "0000-0002-0736-9199"
```
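
Because the ORCID identifier is typed by hand, a quick sanity check before running the import can catch typos. The sketch below validates the format and the ISO 7064 MOD 11-2 check character that ORCID iDs carry. The helper `is_valid_orcid` is hypothetical and not part of pyphetools, and nothing in this change indicates that pyphetools performs such a check itself.

```python
import re

# Four hyphen-separated groups of four characters; the last character may be "X".
ORCID_PATTERN = re.compile(r"^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]$")


def is_valid_orcid(orcid: str) -> bool:
    """Check the format and the ISO 7064 MOD 11-2 check character of an ORCID iD."""
    if not ORCID_PATTERN.match(orcid):
        return False
    chars = orcid.replace("-", "")
    total = 0
    for char in chars[:-1]:  # the first 15 digits feed the checksum
        total = (total + int(char)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)
    return chars[-1] == expected


created_by = "0000-0002-0736-9199"
if not is_valid_orcid(created_by):
    raise ValueError(f"Malformed ORCID iD: {created_by}")
```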

import the template file. The code returns the pyphetools Individual objects, each of which contains all of the information needed to create a phenopacket and which here can be used if desired for debugging or further analysis. The cvalidator object is used to display quality assessment information.
### Import the template file.
The code returns the pyphetools Individual objects, each of which contains all of the information needed to create a phenopacket; they can also be used for debugging or further analysis. The cvalidator object is used to display quality assessment information.
Optionally, you can provide the location of the hp.json file using the ``hp_json`` argument. If no argument is provided, the hpo-toolkit library will download the latest version of
hp.json to your user directory (.hpotk folder).

```
timporter = TemplateImporter(template=template, hp_json=hp_json, created_by=created_by)
```python
timporter = TemplateImporter(template=template, created_by=created_by)
individual_list, cvalidator = timporter.import_phenopackets_from_template()
```
Display quality assessment data.

### Structural variants
pyphetools will automatically retrieve information about small variants coded as HGVS strings using the
[VariantValidator](https://variantvalidator.org/) API. Until very recently, it was challenging to determine the exact positions of larger structural variants, and for this reason, publications often described them
using phrases such as "whole gene deletion" or "EX9-12DEL". If such a string is found in the template file,
pyphetools will emit an error such as the following.

<figure markdown>
![Validation results](../img/deletion_error.png){ width="1000" }
<figcaption>Validation Results.
</figcaption>
</figure>

This can be fixed by passing an argument with a set of all strings that represent deletions (as in the following example), duplications, or inversions.

```python title="Specifying structural variants"
del_set = {"EX9-12DEL"}
timporter = TemplateImporter(template=template, created_by=created_by)
individual_list, cvalidator = timporter.import_phenopackets_from_template(deletions=del_set)
```
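
The routing idea behind the `deletions` argument can be sketched as follows: strings listed in the user-supplied set are treated as structural variants, and everything else is either taken to be HGVS or reported as unrecognized. This is a simplified illustration under our own assumptions, not pyphetools' actual implementation; the function `partition_variants` and the `c.`/`n.` prefix test are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def partition_variants(
    variant_strings: Iterable[str],
    deletions: Set[str],
) -> Tuple[List[str], List[str], List[str]]:
    """Split variant strings into (hgvs, deletions, unrecognized)."""
    hgvs: List[str] = []
    dels: List[str] = []
    unrecognized: List[str] = []
    for raw in variant_strings:
        variant = raw.strip()
        if variant in deletions:
            # Explicitly declared by the curator as a deletion.
            dels.append(variant)
        elif variant.startswith(("c.", "n.")):
            # Crude stand-in for "looks like an HGVS expression".
            hgvs.append(variant)
        else:
            # Would correspond to a validation error shown to the curator.
            unrecognized.append(variant)
    return hgvs, dels, unrecognized


hgvs, dels, bad = partition_variants(
    ["c.1A>G", "EX9-12DEL", "whole gene deletion"],
    deletions={"EX9-12DEL"},
)
print(hgvs, dels, bad)
```

In this sketch, adding "whole gene deletion" to the set moves it from the unrecognized list to the deletions list, which mirrors how the error in the figure is resolved.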

### Display quality assessment data.
```
qc = QcVisualizer(cohort_validator=cvalidator)
display(HTML(qc.to_summary_html()))
```
Display summaries of each phenopacket. The command ``cvalidator.get_error_free_individual_list()``returns versions of the Individual objects
### Display summaries of each phenopacket.

The command ``cvalidator.get_error_free_individual_list()`` returns versions of the Individual objects
in which errors such as redundancies have been removed; this is the data that gets transformed into phenopackets.


13 changes: 12 additions & 1 deletion docs/user-guide/template.md
@@ -1,7 +1,7 @@
# Data-Entry Template

pyphetools offers two main ways to encode clinical data as phenopackets. The library provides various functions to encode data found in
typical supplementary materials of publications about cohorts. This option, which is covered in more detail in TODO is intended for those
typical supplementary materials of publications about cohorts. This option, which is covered in more detail [here](../developers/developers.md), is intended for those
with skills in scripting with Python. Additionally, pyphetools can ingest data encoded in an Excel template that can be used without additional scripting.
The template can be ingested using a standardized notebook. Alternatively, users are invited to work with the HPO team to enter the data into the HPO database.

@@ -57,4 +57,15 @@ tcreator.create_template(disease_id=disease_id,
transcript=ofd1_transcript)
```

The following snippet can be used as a "starter" by pasting it into the notebook.

```python
tcreator.create_template(disease_id="",
disease_label="",
gene_symbol="",
HGNC_id="",
transcript="")
```


The script creates a file that can be opened in Excel and used for curation. Add additional HPO terms as necessary and remove terms that are not needed.
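
The name of the output file is derived from the disease identifier, and the `create_template.py` change in this pull request refuses to overwrite an existing file. A minimal sketch of that behavior follows. The helper names `template_filename` and `ensure_new_file` are ours, for illustration only; the replacement rule and the error message follow the diff.

```python
import os


def template_filename(disease_id: str) -> str:
    """Turn a CURIE such as 'OMIM:154700' into a template file name."""
    return disease_id.replace(":", "_") + "_individuals.xlsx"


def ensure_new_file(fname: str) -> str:
    """Return fname unchanged, or raise if a file with that name already exists."""
    if os.path.isfile(fname):
        raise FileExistsError(f"Excel file '{fname}' already exists.")
    return fname


print(template_filename("OMIM:154700"))
```

In practice this means an existing template has to be moved or renamed before `create_template` is run again for the same disease identifier.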
2 changes: 1 addition & 1 deletion src/pyphetools/__init__.py
@@ -4,7 +4,7 @@
from . import visualization
from . import validation

__version__ = "0.9.77"
__version__ = "0.9.85"

__all__ = [
"creation",
2 changes: 1 addition & 1 deletion src/pyphetools/creation/case_template_encoder.py
@@ -388,7 +388,7 @@ def _parse_individual(self, row:pd.Series):
elif sex == "U":
sex = Constants.UNKNOWN_SEX_SYMBOL
else:
raise ValueError(f"Unrecognized sex symbol: {sex}")
raise ValueError(f"Unrecognized sex symbol: {sex} for individual \"{individual_id}\"")
onset_age = data_items.get(AGE_OF_ONSET_FIELDNAME)
if onset_age is not None and isinstance(onset_age, str):
onset_age = PyPheToolsAge.get_age(onset_age)
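
The point of the changed error message is that a curator can find the offending row without searching the whole spreadsheet. A self-contained sketch of the same pattern is shown below. Only the `"U"` symbol and the wording of the message come from the diff; the `Sex` enum, the other symbols, and `parse_sex` are assumptions made for illustration.

```python
from enum import Enum


class Sex(Enum):
    MALE = "MALE"
    FEMALE = "FEMALE"
    OTHER = "OTHER"
    UNKNOWN = "UNKNOWN"


# Mapping of template symbols to values; all symbols except "U" are assumed.
SEX_SYMBOLS = {"M": Sex.MALE, "F": Sex.FEMALE, "O": Sex.OTHER, "U": Sex.UNKNOWN}


def parse_sex(symbol: str, individual_id: str) -> Sex:
    """Translate a one-letter symbol, naming the individual if it is invalid."""
    try:
        return SEX_SYMBOLS[symbol]
    except KeyError:
        raise ValueError(
            f'Unrecognized sex symbol: {symbol} for individual "{individual_id}"'
        ) from None


print(parse_sex("U", "individual 1"))
```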
82 changes: 66 additions & 16 deletions src/pyphetools/creation/create_template.py
@@ -1,20 +1,31 @@
import os
import typing

import pandas as pd
from collections import defaultdict
from .hpo_parser import HpoParser
from .hp_term import HpTerm
from typing import List
import hpotk
from .case_template_encoder import REQUIRED_H1_FIELDS, REQUIRED_H2_FIELDS

class TemplateCreator:

def __init__(self, hp_json:str, hp_cr_index:str=None) -> None:
if not os.path.isfile(hp_json):
def __init__(
self,
hp_json: typing.Optional[str] = None,
hp_cr_index: typing.Optional[str] = None,
) -> None:
if hp_json is None:
parser = HpoParser()
elif not os.path.isfile(hp_json):
raise FileNotFoundError(f"Could not find hp.json file at {hp_json}")
if hp_cr_index:
else:
parser = HpoParser(hpo_json_file=hp_json)
if hp_cr_index is not None:
if not os.path.isfile(hp_cr_index):
raise FileNotFoundError(f"Could not find the FastHPOCR index file at {hp_cr_index}")
parser = HpoParser(hpo_json_file=hp_json)

self._hpo_cr = parser.get_hpo_concept_recognizer(hp_cr_index=hp_cr_index)
self._hpo_ontology = parser.get_ontology()
self._all_added_hp_term_set = set()
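
The constructor now treats both paths as optional: `None` means "fall back to the default", while a path that is given must point to an existing file. That rule can be sketched in isolation as below. The helper `check_optional_file` is hypothetical and is not part of pyphetools; it only illustrates the three-way decision.

```python
import os
from typing import Optional


def check_optional_file(path: Optional[str], description: str) -> Optional[str]:
    """Return None for a missing argument, the path if it exists, else raise."""
    if path is None:
        # Caller falls back to a default, e.g. letting the ontology be downloaded.
        return None
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Could not find {description} at {path}")
    return path


print(check_optional_file(None, "hp.json file"))
```

Comparing against `None` rather than relying on truthiness means that an empty string is reported as a missing file instead of being silently treated as "not provided".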
@@ -60,7 +71,7 @@ def arrange_terms(self) -> List[hpotk.model.TermId]:
return hp_term_list


def create_template(self, disease_id:str, disease_label:str, HGNC_id:str, gene_symbol:str, transcript:str, append=False):
def create_template(self, disease_id:str, disease_label:str, HGNC_id:str, gene_symbol:str, transcript:str):
"""Create an Excel file that can be used to enter data as a pyphetools template

:param disease_id: an OMIM, MONDO, or other similar CURIE identifier
@@ -107,17 +118,56 @@ def create_template(self, disease_id:str, disease_label:str, HGNC_id:str, gene_s
df.loc[len(df)] = new_row
## Output as excel
fname = disease_id.replace(":", "_") + "_individuals.xlsx"

if os.path.isfile(fname):
if not append:
raise FileExistsError(f"Excel file '{fname}' already exists. Use 'append=True' \
to append HPO terms to the existing file.")
else:
print(f"[WARNING] Appending to existing file '{fname}'. This might lead to duplicate HPO terms. \
It's recommended to create a new file instead.")

raise FileExistsError(f"Excel file '{fname}' already exists.")
df.to_excel(fname, index=False)
print(f"Write excel pyphetools template file to {fname}")


print(f"Wrote Excel pyphetools template file to {fname}")

    def create_from_phenopacket(self, ppkt):
        """
        Create a pyphetools template from an individual phenopacket.
        This function is intended to accelerate the process of converting the LIRICAL phenopackets
        to our current format and generally should not be used for new cases.
        """
        id_to_observed = set()
        id_to_excluded = set()

        for pf in ppkt.phenotypic_features:
            hpt = HpTerm(hpo_id=pf.type.id, label=pf.type.label)
            self._all_added_hp_term_set.add(hpt)
            if pf.excluded:
                id_to_excluded.add(pf.type.label)
            else:
                id_to_observed.add(pf.type.label)
        # Copy the required headers so that appending HPO columns does not mutate the module-level lists
        H1_Headers = list(REQUIRED_H1_FIELDS)
        H2_Headers = list(REQUIRED_H2_FIELDS)
        if len(H1_Headers) != len(H2_Headers):
            raise ValueError("Header lists must have same length")
        EMPTY_STRING = ""
        hp_term_list = self.arrange_terms()
        for hpt in hp_term_list:
            H1_Headers.append(hpt.label)
            H2_Headers.append(hpt.id)
        df = pd.DataFrame(columns=H1_Headers)
        new_row = dict()
        for i in range(len(H1_Headers)):
            new_row[H1_Headers[i]] = H2_Headers[i]
        df.loc[len(df)] = new_row
        # add one row with some of the data from the phenopacket
        new_row = dict()
        for i in range(len(H1_Headers)):
            header_field = H1_Headers[i]
            if header_field == "HPO":
                new_row[header_field] = "na"
            elif header_field in id_to_observed:
                new_row[header_field] = "observed"
            elif header_field in id_to_excluded:
                new_row[header_field] = "excluded"
            else:
                new_row[header_field] = "?"
        df.loc[len(df)] = new_row
        ## Output as excel
        ppkt_id = "".join(e for e in ppkt.id if e.isalnum())
        fname = ppkt_id + "_phenopacket_template.xlsx"
        df.to_excel(fname, index=False)
        print(f"Wrote excel pyphetools template file to {fname}")
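
Two small steps of `create_from_phenopacket` can be tried without pandas or a phenopacket at hand: reducing the phenopacket id to a safe file name, and filling the data row from the observed and excluded labels. The sketch below reproduces that logic with plain Python types. The helper names `sanitize_id` and `build_row` and the sample headers are ours, chosen for illustration.

```python
from typing import Dict, Iterable, Set


def sanitize_id(ppkt_id: str) -> str:
    """Keep only alphanumeric characters so the id can be used in a file name."""
    return "".join(ch for ch in ppkt_id if ch.isalnum())


def build_row(headers: Iterable[str], observed: Set[str], excluded: Set[str]) -> Dict[str, str]:
    """Fill one template row: 'na' for the HPO separator column, '?' for unknowns."""
    row: Dict[str, str] = {}
    for header in headers:
        if header == "HPO":
            row[header] = "na"
        elif header in observed:
            row[header] = "observed"
        elif header in excluded:
            row[header] = "excluded"
        else:
            row[header] = "?"
    return row


headers = ["individual_id", "HPO", "Seizure", "Ataxia"]  # sample headers, assumed
row = build_row(headers, observed={"Seizure"}, excluded={"Ataxia"})
fname = sanitize_id("PMID_123-individual 4") + "_phenopacket_template.xlsx"
print(row, fname)
```

Note that every non-HPO column that is not a phenotype label, such as the identifier column here, receives `?` and has to be completed by the curator.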