fix header emitted from gpad writer to match 2.0 specifications #663

Closed
wants to merge 49 commits into from
Commits
4b3db7d
update header to be gpad vs gpa
sierra-moxon Dec 19, 2023
3971529
add generated_by and date_generated to header information of gpad 2.0…
sierra-moxon Dec 19, 2023
1b8df3a
use gpad 2.0
sierra-moxon Dec 20, 2023
df4b831
Merge branch 'master' into gopreprocess-gpad20
sierra-moxon Jan 29, 2024
1a39ba3
update poetry lock
sierra-moxon Jan 31, 2024
3637ad6
add back in changes so far
sierra-moxon Jan 31, 2024
85711c5
add back in changes so far
sierra-moxon Jan 31, 2024
ebf8d4c
fix pyproject.toml to run the command
sierra-moxon Jan 31, 2024
8d65cd8
add groups to gitignore
sierra-moxon Jan 31, 2024
c6abcd4
remove go-basic.json
sierra-moxon Jan 31, 2024
71f24b9
progress on GPAD writing outside of make_products
sierra-moxon Feb 1, 2024
dcdab0e
updating click.echo to reflect gpad vs. gaf
sierra-moxon Feb 1, 2024
76fdb9d
null safe noctua file
sierra-moxon Feb 1, 2024
e6f356f
fix make_gpads routine to only generate from noctua when noctua file …
sierra-moxon Feb 5, 2024
1f3a9d6
fix failing tests
sierra-moxon Feb 5, 2024
3c5b0b3
fix formatting of tests, add test for noctua metadata
sierra-moxon Feb 6, 2024
cb75c2d
fix readme
sierra-moxon Feb 6, 2024
49a7efe
fixing line count to reflect gaf or gpad as appropriate
sierra-moxon Feb 7, 2024
6d30a42
add tools.gzips
sierra-moxon Feb 9, 2024
c676a1c
add return of file instead of stream
sierra-moxon Feb 9, 2024
2779fde
add return of file instead of stream
sierra-moxon Feb 9, 2024
2a7a6b0
refactor methods to make it easier to debug file handle open/close/etc.
sierra-moxon Feb 9, 2024
5b05902
remove extra _gpad
sierra-moxon Feb 14, 2024
d298942
remove redundant noctua GPAD combination
sierra-moxon Feb 21, 2024
9d8c131
more tweaking order of annotation production
sierra-moxon Feb 22, 2024
bfca866
attempt to reconcile double .gaf
sierra-moxon Feb 23, 2024
321ea9c
move gaf processing out of mixin a dataset for gpad generation
sierra-moxon Feb 23, 2024
4e1b5fd
fix GPADWriter.version
sierra-moxon Feb 24, 2024
a96a41a
just typo fixing in the docstrings
sierra-moxon Feb 27, 2024
0635b13
stashing changes
sierra-moxon Mar 6, 2024
1c5f283
no gaf files
sierra-moxon Mar 6, 2024
4b7ce38
Merge branch 'master' into gopreprocess-gpad20
sierra-moxon Mar 6, 2024
1c4120b
remove start on protein swap
sierra-moxon Mar 6, 2024
b347f61
output isoform if avaiable as subject
sierra-moxon Mar 13, 2024
04841c3
remove debug statement
sierra-moxon Mar 14, 2024
04e78b4
remove debugging
sierra-moxon Mar 14, 2024
8e65a88
explicitly set ruleset on Gaf and GPAD parsers
sierra-moxon Mar 20, 2024
dd6ec2e
add GPI processing into gaf production
sierra-moxon May 1, 2024
f8c6acb
merge master
sierra-moxon May 1, 2024
55aad22
small migration to support GPI 2.0 output format isntead of GPI 1.2 o…
sierra-moxon May 1, 2024
374d5ed
add extra columns for null values in GPI 2.0 format derived from GAF
sierra-moxon May 1, 2024
267e48c
add test for parsing GPI 2.0 files
sierra-moxon May 2, 2024
7a08057
stash formatting changes
sierra-moxon May 13, 2024
e77556b
add flag to output gpad/gpi in both 1.2 and 2.0 formats
sierra-moxon May 13, 2024
c324923
stashing work towards supporting both GPI 1.2 and GPI 2.0
sierra-moxon May 13, 2024
7843455
add switch for producing GPAD/GPI 1.2 and 2.0 format
sierra-moxon May 13, 2024
7631465
add tests for 1.2 and 2.0
sierra-moxon May 13, 2024
7a54170
fix tests
sierra-moxon May 13, 2024
4a549af
formatting
sierra-moxon May 13, 2024
3 changes: 3 additions & 0 deletions .gitignore
Expand Up @@ -7,13 +7,16 @@ __pycache__/
*.so

# Distribution / packaging
go-basic.jso*
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
mgi-*
groups/
.eggs/
lib/
lib64/
Expand Down
25 changes: 18 additions & 7 deletions Makefile
Expand Up @@ -24,13 +24,24 @@ foo:

# only run local tests
travis_test:
pytest tests/test_*local*.py tests/test_*parse*.py tests/test*writer*.py tests/test_qc.py \
tests/test_rdfgen.py tests/test_phenosim_engine.py tests/test_ontol.py \
tests/test_validation_rules.py tests/unit/test_annotation_scorer.py \
tests/test_goassociation_model.py tests/test_relations.py \
tests/unit/test_golr_search_query.py tests/unit/test_owlsim2_api.py \
tests/test_collections.py \
tests/test_gocamgen.py
@if [ -d ".venv" ] && [ -f "pyproject.toml" ]; then \
echo "Running tests in Poetry environment..."; \
poetry run pytest tests/test_*local*.py tests/test_*parse*.py tests/test*writer*.py tests/test_qc.py \
tests/test_rdfgen.py tests/test_phenosim_engine.py tests/test_ontol.py \
tests/test_validation_rules.py tests/unit/test_annotation_scorer.py \
tests/test_goassociation_model.py tests/test_relations.py \
tests/unit/test_golr_search_query.py tests/unit/test_owlsim2_api.py \
tests/test_collections.py \
tests/test_gocamgen.py; \
else \
pytest tests/test_*local*.py tests/test_*parse*.py tests/test*writer*.py tests/test_qc.py \
tests/test_rdfgen.py tests/test_phenosim_engine.py tests/test_ontol.py \
tests/test_validation_rules.py tests/unit/test_annotation_scorer.py \
tests/test_goassociation_model.py tests/test_relations.py \
tests/unit/test_golr_search_query.py tests/unit/test_owlsim2_api.py \
tests/test_collections.py \
tests/test_gocamgen.py; \
fi

cleandist:
rm dist/* || true
Expand Down
16 changes: 16 additions & 0 deletions bin/README.md
@@ -1 +1,17 @@
See [command line docs](http://ontobio.readthedocs.io/en/latest/commandline.html#commandline) on ReadTheDocs

To test the `produce` command of validate.py, which generates the final GPADs in the pipeline during the "mega make"
stage (aka the "produces GAFs, GPADs, ttl" stage), on a particular source:

```bash
poetry install
poetry run validate produce -m ../go-site/metadata --gpad -t . -o go-basic.json --base-download-url "http://skyhook.berkeleybop.org/[PIPELINE_BRANCH_NAME]/" --only-dataset mgi MGI
poetry run validate produce -m ../go-site/metadata --gpad -t . -o go-basic.json --base-download-url "http://skyhook.berkeleybop.org/[PIPELINE_BRANCH_NAME]/" --only-dataset goa_chicken goa
```


To test whether a GAF file is valid (passes all the GORules):
```bash
poetry install
poetry run python3 ontobio-parse-assocs.py --file [path_to_file.gaf] --format GAF -o mgi_valid.gaf --report-md mgi.report.md -r [path_to_go.json] -l all validate
```
291 changes: 215 additions & 76 deletions bin/validate.py

Large diffs are not rendered by default.

9 changes: 6 additions & 3 deletions ontobio/io/assocparser.py
Expand Up @@ -534,7 +534,10 @@ def association_generator(self, file, skipheader=False, outfile=None) -> Dict:
file = self._ensure_file(file)
for line in file:
parsed_result = self.parse_line(line)
self.report.report_parsed_result(parsed_result, outfile, self.config.filtered_evidence_file, self.config.filter_out_evidence)
self.report.report_parsed_result(parsed_result,
outfile,
self.config.filtered_evidence_file,
self.config.filter_out_evidence)
for association in parsed_result.associations:
# yield association if we don't care if it's a header or if it's definitely a real gaf line
if not skipheader or not isinstance(association, dict):
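The `skipheader` condition in this hunk can be tried in isolation. The following is a minimal sketch, assuming (as the comment in the hunk suggests) that header lines come back as plain `dict` objects while real annotation lines are some other type. `FakeAssociation` and `filter_associations` are hypothetical names used only for illustration; they are not part of ontobio.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Union


@dataclass
class FakeAssociation:
    """Hypothetical stand-in for a parsed annotation line."""
    subject: str


def filter_associations(items: Iterable[Union[dict, FakeAssociation]],
                        skipheader: bool = False) -> Iterator[Union[dict, FakeAssociation]]:
    """Yield every item, or only the non-dict items when skipheader is True."""
    for item in items:
        # Same condition as in the hunk: keep the item if headers are wanted,
        # or if the item is not a header dict.
        if not skipheader or not isinstance(item, dict):
            yield item


parsed = [{"gaf-version": "2.2"}, FakeAssociation("MGI:MGI:1")]
with_headers = list(filter_associations(parsed))
without_headers = list(filter_associations(parsed, skipheader=True))
```

With `skipheader=True` only the `FakeAssociation` survives; with the default, both items are yielded.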
Expand Down Expand Up @@ -962,6 +965,7 @@ def parse_date(date: str, report: Report, line: List) -> Optional[association.Da

return d


def parse_iso_date(date: str, report: Report, line: List) -> Optional[association.Date]:

def parse_with_dateutil(date: str, repot: Report, line: List) -> Optional[association.Date]:
Expand All @@ -978,15 +982,14 @@ def parse_with_dateutil(date: str, repot: Report, line: List) -> Optional[associ
day="{:02d}".format(parsed.day),
time=parsed.time().isoformat())


if date == "":
report.error(line, Report.INVALID_DATE, "\'\'", "GORULE:0000001: empty", rule=1)
return None

d = None
if len(date) >= 10:
# For ISO style date, should be YYYY-MM-DD all as digits and
# a well formed date string here will be at least 10 characters long.
# a well-formed date string here will be at least 10 characters long.
# Optionally, there could be an appended THH:MM
year = date[0:4]
month = date[5:7]
Expand Down
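The slicing approach described in the comments of `parse_iso_date` (a `YYYY-MM-DD` prefix of at least 10 characters, optionally followed by `THH:MM`) can be sketched as below. This is an illustration under stated assumptions, not the ontobio implementation: `SimpleDate` is a hypothetical stand-in for `association.Date`, error reporting through `Report` is left out, and the `dateutil` fallback seen in the diff is replaced by a plain standard-library validity check.

```python
import datetime
from typing import NamedTuple, Optional


class SimpleDate(NamedTuple):
    """Hypothetical stand-in for ontobio's association.Date."""
    year: str
    month: str
    day: str
    time: str


def parse_iso_date_sketch(date: str) -> Optional[SimpleDate]:
    """Parse YYYY-MM-DD with an optional THH:MM suffix; return None if invalid."""
    if len(date) < 10:
        return None
    if date[4] != "-" or date[7] != "-":
        return None
    year, month, day = date[0:4], date[5:7], date[8:10]
    if not (year.isdigit() and month.isdigit() and day.isdigit()):
        return None
    # Anything after a "T" separator is kept verbatim as the time part.
    time = date[11:] if len(date) > 10 and date[10] == "T" else ""
    try:
        # Let the standard library reject impossible dates such as 2024-02-30.
        datetime.date(int(year), int(month), int(day))
    except ValueError:
        return None
    return SimpleDate(year, month, day, time)
```

Returning `None` for every failure mirrors the `Optional[association.Date]` return type in the diff; the real function additionally records a GORULE:0000001 error on the report.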
17 changes: 11 additions & 6 deletions ontobio/io/assocwriter.py
Expand Up @@ -6,6 +6,7 @@
import datetime
import json
import logging
import click

from typing import List, Union

Expand Down Expand Up @@ -108,18 +109,23 @@ def write(self, assocs, meta=None):
GPAD_2_0 = "2.0"
GPAD_1_2 = "1.2"


class GpadWriter(AssocWriter):
"""
Writes Associations in GPAD format
"""
def __init__(self, file=None, version=GPAD_1_2):
def __init__(self, file=None, version=None):
self.file = file
click.echo("Requested GPAD version: {}".format(version))
if version in [GPAD_1_2, GPAD_2_0]:
self.version = version
else:
self.version = GPAD_1_2

self._write("!gpa-version: {}\n".format(self.version))
self._write("!gpad-version: {}\n".format(self.version))
click.echo("Writing GPAD version: {}".format(self.version))
self._write("!generated-by: {}\n".format("GO Central"))
self._write("!date-generated: {}\n".format(str(datetime.datetime.now().strftime("%Y-%m-%dT%H:%M"))))
self.ecomap = ecomap.EcoMap()

def as_tsv(self, assoc: Union[association.GoAssociation, dict]):
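The header logic added to `GpadWriter.__init__` can be exercised without the rest of the writer. The sketch below assumes only what the hunk shows: unknown or missing versions fall back to 1.2, and three header lines are written. `write_gpad_header` and its `now` parameter are hypothetical, introduced here so the timestamp is reproducible; the `click.echo` calls are omitted.

```python
import datetime
import io

GPAD_2_0 = "2.0"
GPAD_1_2 = "1.2"


def write_gpad_header(out, version=None, now=None):
    """Write the GPAD header lines to `out` and return the version actually used."""
    # Fall back to 1.2 for None or any unrecognised value, as in the diff.
    if version not in (GPAD_1_2, GPAD_2_0):
        version = GPAD_1_2
    now = now or datetime.datetime.now()
    out.write("!gpad-version: {}\n".format(version))
    out.write("!generated-by: {}\n".format("GO Central"))
    out.write("!date-generated: {}\n".format(now.strftime("%Y-%m-%dT%H:%M")))
    return version


buffer = io.StringIO()
used = write_gpad_header(buffer, version="2.0",
                         now=datetime.datetime(2024, 5, 13, 9, 30))
header_lines = buffer.getvalue().splitlines()
```

Passing the clock in as an argument is a design choice of this sketch only; it makes the `!date-generated` line testable, which the version in the diff (calling `datetime.datetime.now()` inline) is not.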
Expand All @@ -136,7 +142,6 @@ def as_tsv(self, assoc: Union[association.GoAssociation, dict]):
return assoc.to_gpad_1_2_tsv()



class GafWriter(AssocWriter):
"""
Writes Associations in GAF format.
Expand All @@ -151,13 +156,13 @@ class GafWriter(AssocWriter):

The only difference in 2.1 and 2.2 are how qualifiers (column 4) are handled.
GAF 2.1 allows empty or only `NOT` qualifier values, and only allows
`colocalizes_with` and `contributes_to` as qualifer values. However in 2.2
`colocalizes_with` and `contributes_to` as qualifier values. However, in 2.2
qualifier must *not* be empty and cannot have only `NOT` as it's a modifier
on existing qualifers. The set of allowed qualifiers in 2.2 is also expanded.
on existing qualifiers. The set of allowed qualifiers in 2.2 is also expanded.

So if there's a mismatch between converting from an annotation and a GAF
version then that annotation is just skipped and not written out with an
error message displayed. Mismatch occurances of this kind would appear if
error message displayed. Mismatch occurrences of this kind would appear if
the incoming annotation has a qualifier in the 2.2 set, but 2.1 is being
written out, or if the qualifier is empty and 2.2 is being written.
"""
Expand Down
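The qualifier rule in the `GafWriter` docstring can be written down as a small predicate. This is a sketch under assumptions: `can_write` is a hypothetical helper, and `GAF_2_2_QUALIFIERS` is only an illustrative subset of the expanded 2.2 list, not the full set.

```python
from typing import List

GAF_2_1_QUALIFIERS = {"colocalizes_with", "contributes_to"}
# Illustrative subset only; the full GAF 2.2 qualifier list is longer.
GAF_2_2_QUALIFIERS = GAF_2_1_QUALIFIERS | {
    "enables", "involved_in", "part_of", "located_in",
}


def can_write(qualifiers: List[str], version: str) -> bool:
    """Return True if an annotation with these qualifiers fits the GAF version."""
    # NOT is a modifier, so look at the remaining relation qualifiers.
    relations = [q for q in qualifiers if q != "NOT"]
    if version == "2.1":
        # Empty or NOT-only is fine; anything else must be in the small 2.1 set.
        return all(q in GAF_2_1_QUALIFIERS for q in relations)
    if version == "2.2":
        # At least one real qualifier is required, drawn from the expanded set.
        return len(relations) > 0 and all(q in GAF_2_2_QUALIFIERS for q in relations)
    raise ValueError("unsupported GAF version: {}".format(version))
```

An annotation for which `can_write` returns `False` corresponds to the mismatch case described in the docstring, where the line is skipped and an error message is shown.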
1 change: 1 addition & 0 deletions ontobio/io/entityparser.py
Expand Up @@ -100,6 +100,7 @@ def list_field(self, field: str) -> List:
# If there is no config file path, return None
# return None


class GpiParser(EntityParser):

def __init__(self, config=None):
Expand Down
94 changes: 75 additions & 19 deletions ontobio/io/entitywriter.py
Expand Up @@ -71,11 +71,16 @@ def write(self, entities, meta=None):
for e in entities:
self.write_entity(e)


class GpiWriter(EntityWriter):
"""
Writes entities in GPI format
Writes entities in GPI 1.2 or 2.0 (https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-2-0.md) format

:param file: file
:param version: str

Takes an "entity" dictionary generated typically from a GoAssociation object

Takes an entity dictionary:
{
'id': id, (String)
'label': db_object_symbol, (String)
Expand All @@ -89,29 +94,80 @@ class GpiWriter(EntityWriter):
}
}
"""
def __init__(self, file=None):
def __init__(self, file=None, version=None):
self.file = file
self.version = version
if self.file:
self.file.write("!gpi-version: 1.2\n")
if self.version == "2.0":
self.file.write("!gpi-version: 2.0\n")
else:
self.file.write("!gpi-version: 1.2\n")

def write_entity(self, entity):
"""
Write a single entity to a line in the output file

:param entity: dict ; typically a dictionary representing an instance of a GoAssociation object
Note: the output version is not a parameter here; it is read from self.version ("2.0" selects GPI 2.0, any other value GPI 1.2)
:return: None

GPI 2.0 spec <-- entity attributes

1. DB_Object_ID <-- entity.id (CURIE format)
2. DB_Object_symbol <-- entity.label
3. DB_Object_Name <-- entity.full_name
4. DB_Object_Synonyms <-- entity.synonyms
5. DB_Object_Type <-- entity.type
6. DB_Object_Taxon <-- entity.taxon
7. Encoded_by <-- does not appear in GAF file, this is optional in GPI
8. Parent_Protein <-- entity.parents # unclear if this is a list or a single value
9. Protein_Containing_Complex_Members <-- does not appear in GAF file, this is optional in GPI
10. DB_Xrefs <-- entity.xrefs
11. Gene_Product_Properties <-- entity.properties

GPI 1.2 spec <-- entity attributes

1. DB <-- entity.id.prefix
2. DB_Object_ID <-- entity.id.local_id
3. DB_Object_Symbol <-- entity.label
4. DB_Object_Name <-- entity.full_name
5. DB_Object_Synonym(s) <-- entity.synonyms
6. DB_Object_Type <-- entity.type
7. Taxon <-- entity.taxon
8. Parent_Object_ID <-- entity.parents # unclear if this is a list or a single value
9. DB_Xref(s) <-- entity.xrefs
10. Properties <-- entity.properties

"""
db, db_object_id = self._split_prefix(entity)
taxon = normalize_taxon(entity["taxon"]["id"])

vals = [
db,
db_object_id,
entity.get('label'),
entity.get('full_name'),
entity.get('synonyms'),
entity.get('type'),
taxon,
entity.get('parents'),
entity.get('xrefs'),
entity.get('properties')
]

if self.version == "2.0":
vals = [
entity.get('id'), # DB_Object_ID
entity.get('label'), # DB_Object_symbol
entity.get('full_name'), # DB_Object_Name
entity.get('synonyms'), # DB_Object_Synonyms
entity.get('type'), # DB_Object_Type
normalize_taxon(entity.get("taxon").get("id")), # DB_Object_Taxon
"", # Encoded_by
entity.get('parents'), # Parent_Protein
"", # Protein_Containing_Complex_Members
entity.get('xrefs'), # DB_Xrefs
entity.get('properties') # Gene_Product_Properties
]
else:
prefix, local_id = self._split_prefix(entity)
vals = [
prefix, # DB
local_id, # DB_Object_ID
entity.get('label'), # DB_Object_Symbol
entity.get('full_name'), # DB_Object_Name
entity.get('synonyms'), # DB_Object_Synonym(s)
entity.get('type'), # DB_Object_Type
normalize_taxon(entity.get("taxon").get("id")), # taxon
entity.get('parents'), # Parent_Object_ID
entity.get('xrefs'), # DB_Xref(s)
entity.get('properties') # Properties
]

self._write_row(vals)
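The two column layouts listed in the `write_entity` docstring differ mainly in how the identifier is carried: GPI 2.0 keeps the full CURIE in column 1 and has 11 columns, while GPI 1.2 splits it into DB and DB_Object_ID and has 10. A minimal sketch of that mapping follows; `gpi_row` is a hypothetical function, it assumes the entity's `id` is a full CURIE string, and it skips the `normalize_taxon` call used in the diff.

```python
from typing import List


def gpi_row(entity: dict, version: str = "1.2") -> List:
    """Build the list of column values for one GPI line."""
    taxon = entity["taxon"]["id"]
    shared = [
        entity.get("label"),      # DB_Object_Symbol
        entity.get("full_name"),  # DB_Object_Name
        entity.get("synonyms"),   # DB_Object_Synonym(s)
        entity.get("type"),       # DB_Object_Type
        taxon,                    # taxon
    ]
    if version == "2.0":
        return ([entity.get("id")] + shared + [
            "",                        # Encoded_by
            entity.get("parents"),     # Parent_Protein
            "",                        # Protein_Containing_Complex_Members
            entity.get("xrefs"),       # DB_Xrefs
            entity.get("properties"),  # Gene_Product_Properties
        ])
    # GPI 1.2: split only on the first colon so the local ID stays intact.
    prefix, local_id = entity["id"].split(":", 1)
    return [prefix, local_id] + shared + [
        entity.get("parents"),     # Parent_Object_ID
        entity.get("xrefs"),       # DB_Xref(s)
        entity.get("properties"),  # Properties
    ]


example = {
    "id": "UniProtKB:P12345",
    "label": "SYMBOL",
    "full_name": "example protein",
    "synonyms": [],
    "type": ["protein"],
    "taxon": {"id": "NCBITaxon:9606"},
}
row_2_0 = gpi_row(example, "2.0")
row_1_2 = gpi_row(example, "1.2")
```

The example entity is made-up data for illustration, not taken from a real GPI file.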
51 changes: 38 additions & 13 deletions ontobio/io/gafgpibridge.py
Expand Up @@ -13,18 +13,18 @@ def __hash__(self):
return hash(d)


class GafGpiBridge(object):
def convert_association(association, gpad_gpi_output_version="2.0") -> Entity:
"""
'id' is already `join`ed in both the Association and the Entity,
so we don't have to worry about what that looks like. We assume
it's correct.

def __init__(self):
self.cache = []

def convert_association(self, association) -> Entity:
"""
'id' is already `join`ed in both the Association and the Entity,
so we don't have to worry about what that looks like. We assume
it's correct.
"""
if isinstance(association, GoAssociation):
:param association: GoAssociation
:param gpad_gpi_output_version: str value of the GPAD/GPI version to write - either 2.0 or 1.2
:return: Entity
"""
if isinstance(association, GoAssociation):
if gpad_gpi_output_version == "2.0":
# print(json.dumps(association, indent=4))
gpi_obj = {
'id': str(association.subject.id),
Expand All @@ -36,11 +36,36 @@ def convert_association(self, association) -> Entity:
'xrefs': "", # GAF does not have this field, but it's optional in GPI
'taxon': {
'id': str(association.subject.taxon)
}
},
'encoded_by': "" # GAF does not have this field, but it's optional in GPI

}
return Entity(gpi_obj)
else:
gpi_obj = {
'db': str(association.subject.id).split(":", 1)[0],
'id': str(association.subject.id).split(":", 1)[1],
'label': association.subject.label, # db_object_symbol,
'full_name': association.subject.fullname, # db_object_name,
'synonyms': association.subject.synonyms,
'type': [gp_type_label_to_curie(association.subject.type[0])], # db_object_type,
'parents': "", # GAF does not have this field, but it's optional in GPI
'xrefs': "", # GAF does not have this field, but it's optional in GPI
'taxon': {
'id': str(association.subject.taxon)
},
'encoded_by': "" # GAF does not have this field, but it's optional in GPI

}
return Entity(gpi_obj)

return None
return None


class GafGpiBridge(object):

def __init__(self):
self.cache = []

def entities(self) -> List[Entity]:
return list(self.cache)
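The version switch in `convert_association` comes down to how the subject identifier is stored in the entity dictionary. The sketch below shows that difference with a hypothetical `FakeSubject` dataclass in place of a real `GoAssociation` subject, and a hypothetical `subject_to_entity` function; the `type` field and `gp_type_label_to_curie` are left out. It assumes identifiers are CURIE strings such as `MGI:MGI:98834`, where the local part itself contains a colon.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeSubject:
    """Hypothetical stand-in for the subject of a GoAssociation."""
    id: str
    label: str
    fullname: str
    taxon: str
    synonyms: List[str] = field(default_factory=list)


def subject_to_entity(subject: FakeSubject, version: str = "2.0") -> dict:
    """Build a GPI entity dict from an annotation subject."""
    entity = {
        "label": subject.label,
        "full_name": subject.fullname,
        "synonyms": subject.synonyms,
        "parents": "",     # not present in GAF, optional in GPI
        "xrefs": "",       # not present in GAF, optional in GPI
        "taxon": {"id": subject.taxon},
        "encoded_by": "",  # not present in GAF, optional in GPI
    }
    if version == "2.0":
        # GPI 2.0 keeps the whole CURIE in one column.
        entity["id"] = subject.id
    else:
        # GPI 1.2 splits prefix and local ID; split once so that
        # "MGI:MGI:98834" yields ("MGI", "MGI:98834").
        prefix, local_id = subject.id.split(":", 1)
        entity["db"] = prefix
        entity["id"] = local_id
    return entity


subject = FakeSubject("MGI:MGI:98834", "Trp53", "transformation related protein 53",
                      "NCBITaxon:10090")
entity_2_0 = subject_to_entity(subject, "2.0")
entity_1_2 = subject_to_entity(subject, "1.2")
```

Splitting with a maximum of one split matters for prefixes like MGI: a plain `split(":")[1]` on `MGI:MGI:98834` would return `MGI` and drop the numeric part.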
1 change: 0 additions & 1 deletion ontobio/io/qc.py
Expand Up @@ -919,7 +919,6 @@ def test(self, annotation: association.GoAssociation, config: assocparser.AssocP
evidence = str(annotation.evidence.type)
withfrom = annotation.evidence.with_support_from


if evidence in [iss_eco, isa_eco, iso_eco] and (withfrom is None or len(withfrom) == 0):
return self._result(False)

Expand Down
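The condition visible in this `qc.py` hunk fails an annotation when its evidence is one of three sequence-similarity types and the with/from field is empty. A stripped-down sketch of that check is below. It uses the short GO evidence codes as plain strings instead of the ECO identifiers held in `iss_eco`, `isa_eco` and `iso_eco`, `passes_with_from_rule` is a hypothetical name, and the pass-by-default behaviour for all other cases is an assumption, since the rest of the method is not shown in the diff.

```python
from typing import List, Optional

# Short evidence codes used here for readability; the diff compares ECO IDs.
REQUIRES_WITH_FROM = {"ISS", "ISA", "ISO"}


def passes_with_from_rule(evidence: str, with_from: Optional[List[str]]) -> bool:
    """Return False when a similarity-based evidence code has no with/from value."""
    if evidence in REQUIRES_WITH_FROM and (with_from is None or len(with_from) == 0):
        return False
    # Assumed default: everything else passes this particular check.
    return True
```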