update edge provenance info to comply with Translator Standard -- July 1 #208

Closed
andrewsu opened this issue Jun 22, 2021 · 8 comments · Fixed by biothings/bte_trapi_query_graph_handler#39

Comments

@andrewsu
Member

The parent ticket is here: NCATSTranslator/TranslatorArchitecture#48

This is an example edge with provenance:

            "edges": {
                "CHEBI:41423-biolink:metabolic_processing_affected_by-NCBIGene:1576": {
                    "predicate": "biolink:metabolic_processing_affected_by",
                    "subject": "CHEBI:41423",
                    "object": "NCBIGene:1576",
                    "attributes": [
                        {
                            "attribute_type_id": "provided_by",
                            "value": [
                                "drugbank"
                            ],
                            "value_type_id": "biolink:provided_by"
                        },
                        {
                            "attribute_type_id": "api",
                            "value": [
                                "MyChem.info API"
                            ],
                            "value_type_id": "bts:api"
                        },
                        {
                            "attribute_type_id": "publications",
                            "value": [
                                "PMID:22336956"
                            ],
                            "value_type_id": "biolink:publication"
                        },
                        {
                            "attribute_type_id": "action",
                            "value": "substrate",
                            "value_type_id": "bts:action"
                        },
                        {
                            "attribute_type_id": "function",
                            "value": "Vitamin d3 25-hydroxylase activity",
                            "value_type_id": "bts:function"
                        }
                    ]
                },

These are the desired edge properties (copied from the parent ticket):

  primary knowledge source:
    is_a: knowledge source
    description: >-
      The most upstream source of the knowledge expressed in an Association that an
      implementer can identify (may or may not be the 'original' source).
    range: information resource
    multivalued: false

  original knowledge source:
    is_a: primary knowledge source
    description: >-
      The Information Resource that created the original record of the knowledge expressed
      in an Association (e.g. via curation of the knowledge from the literature, or
      generation of the knowledge de novo through computation, reasoning, inference over
      data).
    range: information resource
    multivalued: false

  aggregator knowledge source:
    is_a: knowledge source
    description: >-
      An intermediate aggregator resource from which knowledge expressed in an Association was
      retrieved downstream of the original source, on its path to its current serialized form.
    range: information resource
    multivalued: true
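
To make the `multivalued` flags above concrete, here is a minimal Python sketch that checks an edge's attributes array against them. It is only an illustration: the function name `check_source_cardinality` is hypothetical, the `biolink:`-prefixed slot spellings are assumed from their use later in this thread, and treating `multivalued: false` as "at most one attribute object with at most one value" is one reading of the definitions, not something the Biolink tooling is known to enforce this way.

```python
from collections import Counter

# Slots declared `multivalued: false` above.
# The CURIE spellings are an assumption made for this sketch.
SINGLE_VALUED = (
    "biolink:primary_knowledge_source",
    "biolink:original_knowledge_source",
)


def check_source_cardinality(attributes):
    """Return a list of problems; an empty list means the rules hold."""
    problems = []
    counts = Counter(attr["attribute_type_id"] for attr in attributes)

    # A single-valued slot should not be repeated across attribute objects.
    for type_id in SINGLE_VALUED:
        if counts[type_id] > 1:
            problems.append(f"{type_id} appears in {counts[type_id]} attribute objects")

    # A single-valued slot should not carry a list with several values either.
    for attr in attributes:
        value = attr.get("value")
        if (
            attr["attribute_type_id"] in SINGLE_VALUED
            and isinstance(value, list)
            and len(value) > 1
        ):
            problems.append(f"{attr['attribute_type_id']} lists {len(value)} values")

    return problems
```

Aggregator knowledge sources are deliberately not checked, since that slot is `multivalued: true` and may legitimately appear several times on one edge.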
@colleenXu
Collaborator

colleenXu commented Jun 23, 2021

[updated 7/21 to reflect the discussion in the 7/20 lab call; rearrange to put what we're doing first at the top]


Situation B: BTE uses x-bte to get the edge, and the API it calls counts as an "aggregator". This is the situation for most APIs BTE uses (see Situations A and C for the exceptions).

Update notes:

  • What if this field doesn't exist in the SmartAPI registry entry? We could take the API name (same behavior as before), or we could map the API name to the corresponding infores ID using a hard-coded mapping.
  • @andrewsu has noted that some APIs outside of Translator may not count as aggregators - and may count as "primary"/"original" sources of knowledge instead. Like MGI, RGD, Litvar, EBI Proteins? This involves discussion...
  • The 7/13 decision was to implement this behavior for all cases where BTE uses x-bte, and then add exceptions / different behavior for the Situation C APIs.
  • Another decision was to NOT transform the values we get from the APIs themselves (where they say their source is) - currently under "bts:source", see examples below. This is a potentially difficult task since each API uses different strings / lists of strings and those would need to be mapped to infores IDs with code (we'd probably set the attribute_type_id to "primary" to make it easier for us).
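
The fallback described in the first note could look roughly like the Python sketch below. The field path `info.x-translator.infores-curie` comes from this thread; the function name `resolve_api_identifier` and the `API_NAME_TO_INFORES` table are hypothetical, and the single mapping entry simply reuses the example values shown in this issue.

```python
# Hypothetical hard-coded fallback (API name -> infores ID).
# The one entry mirrors the example values used in this thread.
API_NAME_TO_INFORES = {
    "BioLink API": "infores:biolink-api",
}


def resolve_api_identifier(smartapi_doc, api_name):
    """Pick the value to report for the API that BTE called.

    Order of preference:
      1. info.x-translator.infores-curie from the SmartAPI registry entry
      2. the hard-coded name-to-infores mapping
      3. the plain API name (the previous behavior)
    """
    info = smartapi_doc.get("info") or {}
    x_translator = info.get("x-translator") or {}
    curie = x_translator.get("infores-curie")
    if curie:
        return curie
    return API_NAME_TO_INFORES.get(api_name, api_name)
```

Falling back to the plain name keeps the old output for unregistered APIs instead of dropping the attribute, at the cost of mixing infores IDs and free-text names in the same field.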

What to do:

BTE has to add all the source-related information to the edge attributes array:

  1. BTE should say it is an aggregator: add a hard-coded object @ariutta. This is needed for all 3 situations.
  2. the API BTE called to get the edge is an aggregator. Currently this info is the "attribute_type_id":"api" object.
    • code @ariutta :
      1. CHANGE "attribute_type_id":"api" to "attribute_type_id": "biolink:aggregator_knowledge_source".
      2. CHANGE where BTE gets the "attribute_type_id":"api" value to get it from the SmartAPI registry file's info.x-translator.infores-curie field.
      3. CHANGE the "value_type_id":"bts:api" to "value_type_id":"biolink:InformationResource"
    • CX: All APIs BTE uses x-bte with have been updated; they have info.x-translator.infores-curie property
  3. the x-bte annotates "where the edge is from" using a hard-coded "source" field. This can count as a "primary" knowledge source (where BTE thinks this info is from). Currently this info is the "attribute_type_id":"provided_by" object.
    • code @ariutta :
      1. CHANGE "attribute_type_id":"provided_by" to "attribute_type_id": "biolink:primary_knowledge_source".
      2. CHANGE the "value_type_id":"biolink:provided_by" to "value_type_id":"biolink:InformationResource"
    • CX: All APIs BTE uses x-bte with have been updated; so the hard-coded source is set to the corresponding infores ID OR IS ABSENT
  4. Currently, don't do anything to the "bts:source" attribute. This comes from the response-mapped field of the API (API telling us where it thinks the info is from).

A Current-and-Desired example:

Current (the source-related attribute objects for an edge):


                    "attributes": [
                        {
                            "attribute_type_id": "api",
                            "value": [
                                "BioLink API"
                            ],
                            "value_type_id": "bts:api"
                        },
                        {
                            "attribute_type_id": "provided_by",
                            "value": [
                                "Monarch Initiative"
                            ],
                            "value_type_id": "biolink:provided_by"
                        },
                        {
                            "attribute_type_id": "source",
                            "value": [
                                "https://archive.monarchinitiative.org/#omim"
                            ],
                            "value_type_id": "bts:source"
                        },
                        ......
                     ]

Desired (comments as //):

                       {  // add this
                            "attribute_type_id": "biolink:aggregator_knowledge_source",
                            "value": ["infores:translator-biothings-explorer"],
                            "value_type_id": "biolink:InformationResource"
                        },
                       { // corresponds to the "api" object above
                            "attribute_type_id": "biolink:aggregator_knowledge_source",
                            "value": ["infores:biolink-api"],
                            "value_type_id": "biolink:InformationResource"
                        },
                       { // corresponds to the "provided_by" object above
                            "attribute_type_id": "biolink:primary_knowledge_source",
                            "value": ["infores:monarchinitiative"],
                            "value_type_id": "biolink:InformationResource"
                        },
                        { // no change to the "source" object above
                            "attribute_type_id": "source",
                            "value": [
                                "https://archive.monarchinitiative.org/#omim"
                            ],
                            "value_type_id": "bts:source"
                        },
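
The Current-to-Desired change above can be sketched in Python as follows. This is only an illustration of the steps in "What to do", not BTE's code: the function name `update_situation_b` is hypothetical, and the sketch assumes the "api" and "provided_by" values already hold infores IDs (from the SmartAPI registry entry and the x-bte annotation).

```python
import copy

# Step 1: hard-coded object saying BTE itself is an aggregator.
BTE_AGGREGATOR = {
    "attribute_type_id": "biolink:aggregator_knowledge_source",
    "value": ["infores:translator-biothings-explorer"],
    "value_type_id": "biolink:InformationResource",
}

# Steps 2 and 3: old attribute_type_id -> new attribute_type_id.
RENAMES = {
    "api": "biolink:aggregator_knowledge_source",
    "provided_by": "biolink:primary_knowledge_source",
}


def update_situation_b(attributes):
    """Return a new attributes list with the Situation B changes applied."""
    updated = [copy.deepcopy(BTE_AGGREGATOR)]
    for attr in attributes:
        new_type = RENAMES.get(attr["attribute_type_id"])
        if new_type is None:
            # Step 4: "source" (and anything else) is passed through untouched.
            updated.append(attr)
        else:
            updated.append(
                {
                    **attr,
                    "attribute_type_id": new_type,
                    "value_type_id": "biolink:InformationResource",
                }
            )
    return updated
```

If an API has no hard-coded source, there is simply no "provided_by" object to rename, so the output has no primary knowledge source object for that edge.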

Here are some thoughts on how to update provenance. The situations (A, B, C) are based on which API BTE called to get the edge.

Important notes to read first:

  • The infores IDs are here in column F. We can make "pseudo" values if needed, and ask about adding them later.
  • Matt Brush's examples and the newer example from COHD have this setup: in one edge's attributes, multiple attribute objects may have the SAME attribute_type_id but different values (like aggregator_knowledge_source). This may result in unexpected/wonky behavior in BTE, since BTE currently merges edge attributes into lists to some degree...
  • Despite my worry in the previous bullet, I'll pretend this isn't an issue and outline behavior + examples below where there will be multiple attribute objects with the same attribute type id in one edge...
  • in a perfect world, some of what I'm going to assign as "primary" would actually be the "original" source of the assertion (like SEMMED). But that involves even more logic/work to figure out, so I'm ignoring that for now.
  • I don't think the CTD API from this BTE list of APIs is in the metaKG... can we track down what's going on here? Issue here: [quick fix] is CTD api really used by BTE #215
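
The worry about repeated attribute_type_id values can be illustrated with a toy merge in Python. `merge_by_type` below is hypothetical and is not BTE's actual merging code; it only shows what happens if attributes are collapsed on attribute_type_id.

```python
def merge_by_type(attributes):
    """Toy merge: collapse attributes that share an attribute_type_id into
    one object whose value is a de-duplicated list."""
    merged = {}
    for attr in attributes:
        value = attr["value"]
        values = value if isinstance(value, list) else [value]
        # The first object seen for a type supplies the other fields.
        slot = merged.setdefault(attr["attribute_type_id"], {**attr, "value": []})
        for item in values:
            if item not in slot["value"]:
                slot["value"].append(item)
    return list(merged.values())


attributes = [
    {
        "attribute_type_id": "biolink:aggregator_knowledge_source",
        "value": ["infores:translator-biothings-explorer"],
        "value_type_id": "biolink:InformationResource",
    },
    {
        "attribute_type_id": "biolink:aggregator_knowledge_source",
        "value": ["infores:biolink-api"],
        "value_type_id": "biolink:InformationResource",
    },
]

merged = merge_by_type(attributes)
print(len(attributes), "objects in,", len(merged), "object out")
```

With a merge like this, the two aggregator objects become one object holding both values, so any per-object field that differs between them (a `value_url`, for instance) would survive only from the first object. Keeping the objects separate avoids that.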

Situation A: BTE ingests edge from a TRAPI API

Currently BTE ingests these TRAPI APIs:

  • 'Automat IntAct',
  • 'Automat Cord19 Scibite',
  • 'Automat Gtopdb',
  • 'Automat KEGG',
  • 'Automat Cord19 Scigraph',
  • 'Automat Uberongraph',
  • 'Automat Human GOA',
  • 'Automat HGNC',
  • 'Automat HMDB',
  • 'Automat Hetio',
  • 'Automat Panther',
  • 'Automat Pharos',
  • 'Automat Chembio',
  • 'Automat Foodb',

What to do:

  • Keep the edge attribute arrays from that API's response (see [bug] missing Automat publications (and other attributes) #209). We should assume that it already has all the source-related objects in the edges' attribute arrays.
  • Add an object like this to EVERY edge's attributes array, to cover BTE itself as a source (acting as a "pseudo-KP" / "ARA"):
                       {
                            "attribute_type_id": "biolink:aggregator_knowledge_source",
                            "value": ["infores:translator-biothings-explorer"],
                            "value_type_id": "biolink:InformationResource"
                        }
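
For Situation A, the two bullets above amount to "keep what the TRAPI API sent, then append BTE". A minimal Python sketch, assuming `edges` is a TRAPI knowledge-graph edges mapping (edge ID to edge object) and using the list form of `value` from the Situation B example; the function name `add_bte_source` is hypothetical:

```python
import copy

BTE_SOURCE = {
    "attribute_type_id": "biolink:aggregator_knowledge_source",
    "value": ["infores:translator-biothings-explorer"],
    "value_type_id": "biolink:InformationResource",
}


def add_bte_source(edges):
    """Append the BTE aggregator object to every edge, in place.

    Attributes received from the upstream TRAPI API are kept as they are.
    """
    for edge in edges.values():
        # "attributes" may be missing or null on an edge.
        attributes = edge.get("attributes") or []
        edge["attributes"] = attributes
        # Skip edges that already carry the BTE object, so repeated calls are safe.
        if BTE_SOURCE not in attributes:
            attributes.append(copy.deepcopy(BTE_SOURCE))
    return edges
```

The membership check makes the function idempotent, which matters if the same edge passes through this step more than once.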

Situation C: BTE uses x-bte to get the edge, the API we call counts as "primary".

This is the situation for APIs from the multiomics and text mining providers, since they create knowledge from their analysis of data/publications... and perhaps for some external APIs that we bring in.

The APIs BTE ingests right now that fit this are:

  • 'Clinical Risk KP API'
  • 'Text Mining Targeted Association API'
  • 'Multiomics Wellness KP API'

Other APIs that fit this (but BTE doesn't ingest right now):

  • 'Drug Response KP API'
  • 'Text Mining Co-occurrence API'
  • 'TCGA Mutation Frequency API'

What to do:

BTE has to add all the source-related information to the edge attributes array:

  1. Talk to those teams. They should have their own ideas of how to model their source-related info in TRAPI, and may have source info that isn't currently in the APIs but that they want to add.
  2. BTE should say it is an aggregator (same as the other two situations).
  3. The API BTE called to get the edge is a primary source. Currently this info is the "attribute_type_id":"api" object. Do the same as Situation B above, but set attribute_type_id to "biolink:primary_knowledge_source".
  4. If the team can describe the data source it used to make its knowledge, we could put that in as a hard-coded x-bte source; perhaps this is a "supporting data source". We could then treat it like the corresponding step in Situation B, except setting the attribute_type_id to "biolink:supporting_data_source".
  5. It would be awesome, but maybe a reach, if we could add the URL for more info on the KP APIs (see the desired example's primary_knowledge_source object).

An example:

Ideally from clinical risk kp api (the source-related attribute objects for an edge) - doesn't exist right now:

                    "attributes": [
                        {
                            "attribute_type_id": "api",
                            "value": [
                                "Clinical Risk KP API"
                            ],
                            "value_type_id": "bts:api"
                        },
                        {
                            "attribute_type_id": "provided_by",
                            "value": [
                                "clinical-records-washington-2018"
                            ],
                            "value_type_id": "biolink:provided_by"
                        },
                        {
                            "attribute_type_id": "provenance",
                            "value": "https://github.com/NCATSTranslator/Translator-All/wiki/EHR-Risk-KP",
                            "value_type_id": "bts:provenance"
                        }
                        ......
                     ]

Desired (comments as //):
Notice that the URL the Clinical Risk KP API gave was moved under the primary knowledge source. Also, I made up the supporting data source below since I don't know what it is; it's not in the info above.

                       {  // added
                            "attribute_type_id": "biolink:aggregator_knowledge_source",
                            "value": ["infores:translator-biothings-explorer"],
                            "value_type_id": "biolink:InformationResource"
                        },
                       {  // was "api" object above
                            "attribute_type_id": "biolink:primary_knowledge_source",
                            "value": ["infores:biothings-multiomics-clinical-risk"],
                            "value_url": "https://github.com/NCATSTranslator/Translator-All/wiki/EHR-Risk-KP",
                            "value_type_id": "biolink:InformationResource"
                        },
                       {  // was "provided_by" object above
                            "attribute_type_id": "biolink:supporting_data_source",
                            "value": ["infores:clinical-records-washington-2018"],
                            "value_type_id": "biolink:InformationResource"
                        },
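
A rough Python sketch of the Situation C mapping shown above, under the same caveats as the example itself: the "biolink:supporting_data_source" type is only a proposal in this comment, the data source ID was made up, and the input attributes are assumed to already hold infores IDs. The function name `update_situation_c` is hypothetical.

```python
import copy

BTE_AGGREGATOR = {
    "attribute_type_id": "biolink:aggregator_knowledge_source",
    "value": ["infores:translator-biothings-explorer"],
    "value_type_id": "biolink:InformationResource",
}


def update_situation_c(attributes):
    """Return a new attributes list with the Situation C changes applied."""
    by_type = {attr["attribute_type_id"]: attr for attr in attributes}
    updated = [copy.deepcopy(BTE_AGGREGATOR)]
    consumed = set()

    api = by_type.get("api")
    if api is not None:
        # The API that BTE called is reported as the primary knowledge source.
        primary = {
            "attribute_type_id": "biolink:primary_knowledge_source",
            "value": api["value"],
            "value_type_id": "biolink:InformationResource",
        }
        consumed.add("api")
        provenance = by_type.get("provenance")
        if provenance is not None:
            # The URL the API gave moves under the primary knowledge source.
            primary["value_url"] = provenance["value"]
            consumed.add("provenance")
        updated.append(primary)

    provided_by = by_type.get("provided_by")
    if provided_by is not None:
        updated.append(
            {
                "attribute_type_id": "biolink:supporting_data_source",
                "value": provided_by["value"],
                "value_type_id": "biolink:InformationResource",
            }
        )
        consumed.add("provided_by")

    # Everything that was not remapped is passed through unchanged.
    updated.extend(a for a in attributes if a["attribute_type_id"] not in consumed)
    return updated
```

A "provenance" object is only folded into `value_url` when there is an "api" object to attach it to; otherwise it is passed through so the URL is not lost.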

Additional reference:

  • Notes from the 6/15 Translator EPC call:
    • KPs model provenance back as far as they want to. AKA if we want to drop "bts:source" info for KPs from us (Situation B above), we can.
    • "primary" is what you use if you can't tell if it's the "original" or not, but that's the "upstream" place you got stuff from.
  • Matt Brush's examples: show that value_type_id = biolink:InformationResource, and that multiple attributes can have the same attribute_type_id
  • where the source-related attributes are in biolink model
  • Attribute object definition from TRAPI

@AlexanderPico
Collaborator

Screen Shot 2021-07-06 at 10 59 42 AM

@colleenXu colleenXu added the enhancement New feature or request label Jul 9, 2021
@andrewsu
Member Author

As a very quick recap of today's discussion, @ariutta will take the lead on modifying the structure of the JSON output in the edge attributes, and @colleenXu will take the lead on updating the SmartAPI records for where most of those values are drawn. There undoubtedly will be other details and edge cases to fix later, but let's start with that...

@colleenXu
Collaborator

I have edited my post above to reflect today's call. @andrewsu and @ariutta, please review at minimum the section under "Situation B" and confirm whether these tasks/decisions correctly reflect today's decisions.

colleenXu added a commit to NCATS-Tangerine/translator-api-registry that referenced this issue Jul 27, 2021
as example for source provenance update for situation B/C here: biothings/biothings_explorer#208 (comment)
@andrewsu
Member Author

Quick note that the ARAX results viewer for Translator now has a nice visualization for the edge provenance info. For example, from https://arax.ncats.io/?source=ARS&id=a7af1e97-eae3-430d-b570-4da271ea56c7

image

colleenXu referenced this issue in NCATS-Tangerine/translator-api-registry Aug 3, 2021
…res curies

also note: 3 apis don't have the hard-coded source field now:
- text mining co-occurrence api
- text mining targeted association api
- drug response kp api

also note: one api fits situation C of https://github.com/biothings/BioThings_Explorer_TRAPI/issues/208#issuecomment-866620512: tcga mutational freq api
@colleenXu
Collaborator

colleenXu commented Aug 3, 2021

@ariutta All APIs with yamls in the registry update here have been updated to address this issue. Note that 3 APIs don't have the "hard-coded" source field anymore; this is fine - they just won't have the corresponding attribute object in their attributes array.

once the 2 multiomics api yamls have their PRs merged / smartapi registry entries updated, they may also not have the "hard-coded" source field anymore

@colleenXu
Collaborator

Note that Provenance situation A may be dealt with once this PR is merged.

I notice that this PR seems to add the BTE provenance object mentioned above and included below:

                       {  // add this
                            "attribute_type_id": "biolink:aggregator_knowledge_source",
                            "value": ["infores:translator-biothings-explorer"],
                            "value_type_id": "biolink:InformationResource"
                        },

@colleenXu
Collaborator

I'm okay with closing this issue for now, and opening it again to deal with Provenance situation C related issues as that comes up...

This is going to happen with text mining targeted association soon where the plan is to ingest the edge attributes field from records and preserve its structure...
