Replies: 5 comments 27 replies
-
Thanks @cbizon, agree with the above; I think the scope of the question makes sense. Note that the question of what the deductive rules are (e.g. which predicates are inverses of one another) is in the scope of DM; see for example biolink/biolink-model#624. These are largely straightforward entailments using common OWL constructs (symmetry, transitivity, reflexivity, inverses). It would be good to survey KPs and see which ones are implementing which rules. I know @balhoff's CAM provider performs inferences.
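To make those entailments concrete, here is an illustrative sketch (not any KP's actual code) of applying the three OWL-style rule types named above to a tiny triple set. The rule tables are hypothetical stand-ins for what the Biolink Model actually declares:

```python
# Hypothetical rule tables; in practice these would come from Biolink Model
# annotations (symmetric, transitive, inverse_of).
SYMMETRIC = {"biolink:related_to"}
TRANSITIVE = {"biolink:part_of"}
INVERSES = {"biolink:has_part": "biolink:part_of",
            "biolink:part_of": "biolink:has_part"}

def entail(triples):
    """Compute the deductive closure of (subject, predicate, object)
    triples under symmetry, inverse, and transitivity rules."""
    closed = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in closed:
            if p in SYMMETRIC:
                new.add((o, p, s))                      # symmetry
            if p in INVERSES:
                new.add((o, INVERSES[p], s))            # inverse
            if p in TRANSITIVE:
                for s2, p2, o2 in closed:
                    if p2 == p and s2 == o:
                        new.add((s, p, o2))             # transitivity
        if not new <= closed:
            closed |= new
            changed = True
    return closed

kg = {("A", "biolink:part_of", "B"), ("B", "biolink:part_of", "C")}
inferred = entail(kg)
# Adds ("A", "biolink:part_of", "C") by transitivity and
# ("B", "biolink:has_part", "A"), ("C", "biolink:has_part", "B") by inversion.
```

A survey of KPs would then amount to asking which of these rule types each one applies at ingest time versus query time.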
-
So given this scope, the question becomes where in the architecture stack this kind of deduction occurs. Options I see:

1. The ARA performs the deduction.
2. The registry/metaKG handles it by expanding queries.
3. The KP performs the deduction when answering queries.

Other variations combine these.
Of the three, I would say we've been leaning toward having the registry handle a lot of this, but our experience in the prototype didn't fill me with confidence in that approach. Furthermore, the registry is not going to be able to help with deduction over specific entities (one of our use cases above). When you think through how a named_thing or related_to query will actually work, doing anything other than pushing this deduction down to the KP is going to generate a huge number of messages for little gain. So my strong preference is that we require KPs to do this. (I expect that to be a controversial statement :) )
-
Chris, may I respectfully ask a meta-question about the broader context of why this issue of DEDUCTION arises now?

(i) We at imPROVING agent believe that, following the old discussion in #2 (mostly promoted by Andrew Su), ARAs should accept the highest-granularity information (the most specific term) from KPs and deal with it internally. This made sense to us because the space of ARAs is likely finite (or very slowly growing), while the space of KPs and their content (notably when dealing with "raw data sources" such as EHRs) is growing ad infinitum and can get messy. For instance, there could soon be several subtypes of Type 2 diabetes, based on molecular profiling, that show up in EHRs under different names. The relative granularity of a term is thus dynamic, and for mapping to Biolink (for backwards compatibility) it would indeed make more sense if this were done by the ARA "centrally" and under human supervision (aka manual curation), also considering how a given ARA performs its relevance analytics.

(ii) There is thus the broader issue of the two types of ARAs that we rarely explicitly acknowledge: (a) ARAs with well-curated internal big KGs that ingest KPs and do their own data modeling on a per-KP basis, i.e. mostly manually (creating some import filter), vs. (b) ARAs that build KGs on the fly using information from KPs, identified e.g. via the registry. The former makes moot a lot of the issues we discuss, which arise solely because we seek to automate something (as is the case for type (b) ARAs).
-
Could we consider a simpler approach that (at least for now) will not require the ARA or KP to perform reasoning?
-
Based on yesterday's architecture call, let's break this down a little into chunks. Imagine a KP that stores relationships as (ChemicalSubstance)-[related_to]->(Gene). That is what is in its /predicates (or /knowledge_map) endpoint, and what gets into the metaKG. Now, an ARA has a query (ChemicalSubstance CHEBI:50730)-[related_to]->(NamedThing). My bias is that the ARA should be able to send that as a single TRAPI query (which I can write out here if it makes things clearer) to that KP and get back any related_to edges between CHEBI:50730 and Genes. Again, let's ignore any edges that are not exact predicate matches for the moment. So KP reps: Is this something you already do? Is it something you would be able to do, or do you think this is the wrong direction to go? Are there tools that (if they existed) would make this easier and that you would be willing to incorporate?
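For concreteness, the single TRAPI query described above could look roughly like the following. This is a sketch, not a normative example: exact field names ("ids" vs "id", "predicates" vs "predicate") vary across TRAPI versions.

```python
import json

# Hedged sketch of the TRAPI query graph described above: find anything
# related_to the ChemicalSubstance CHEBI:50730.
query = {
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {"ids": ["CHEBI:50730"],
                       "categories": ["biolink:ChemicalSubstance"]},
                "n1": {"categories": ["biolink:NamedThing"]},
            },
            "edges": {
                "e0": {"subject": "n0", "object": "n1",
                       "predicates": ["biolink:related_to"]},
            },
        }
    }
}
print(json.dumps(query, indent=2))
```

The KP answering this would bind n1 to its Gene nodes, since Gene is a subclass of NamedThing.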
-
Users will formulate questions, and KPs will contain a representation of knowledge. We don't want a query to fail (miss results) just because the KP's representation is logically equivalent to the user's query but expressed differently.
For example, if a user asks, "What chemicals are related to diabetes?", and a KP knows that Pioglitazone treats diabetes, then Translator should recognize that this is an instance of being related to (treats is a subproperty of related_to) and return this knowledge; returning only edges directly annotated with 'related to' is probably not the user's intention.
There are two kinds of these mismatches that we might worry about: mismatches that can be resolved logically using deductive reasoning, and mismatches that cannot. For example, consider a query like (biolink:ChemicalSubstance)-[biolink:affects_activity_of]->(biolink:Gene). If I have a result that a chemical [increases_activity_of] a gene, that is logically guaranteed to satisfy the query, because increases_activity_of is a subproperty of affects_activity_of. Note that the reverse is not true. If the query is (biolink:ChemicalSubstance)-[biolink:increases_activity_of]->(biolink:Gene), and all I know is that a chemical affects the activity of a gene, it may or may not be a correct edge to return: not every "affects" is an increase, it may also be a decrease.
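That asymmetry can be made concrete: expanding a query predicate to its subproperties (descendants) is sound, while expanding to superproperties is not. Here is a minimal sketch over a hand-written, hypothetical slice of the predicate hierarchy; a real implementation would read the hierarchy from the Biolink Model rather than hard-coding it.

```python
# Hypothetical slice of the Biolink predicate hierarchy: child -> parent.
PARENT = {
    "biolink:increases_activity_of": "biolink:affects_activity_of",
    "biolink:decreases_activity_of": "biolink:affects_activity_of",
    "biolink:affects_activity_of": "biolink:related_to",
    "biolink:treats": "biolink:related_to",
}

def descendants(predicate):
    """Predicates whose edges logically satisfy a query for `predicate`:
    the predicate itself plus everything below it in the hierarchy."""
    result = {predicate}
    changed = True
    while changed:
        changed = False
        for child, parent in PARENT.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result

# Sound direction: a query for affects_activity_of may be answered
# with increases_activity_of edges.
assert "biolink:increases_activity_of" in descendants("biolink:affects_activity_of")
# Unsound direction: affects_activity_of is NOT in the expansion of
# increases_activity_of, since an "affects" edge might be a decrease.
assert "biolink:affects_activity_of" not in descendants("biolink:increases_activity_of")
```

Wherever this expansion lives (ARA, registry, or KP), only the descendant direction can be applied automatically without risking wrong answers.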
So what is the set of inferences that we are talking about?
The set of inferences that we might want to worry about but that cannot be logically deduced consists of these same moves in the other direction: generalizing rather than specializing. The more general result may or may not match the more specific query. For example, if I ask about type 2 diabetes and a KP has information at the superclass level (diabetes), then we're not concerned with that at the moment.
The question that we have been discussing, and which has come into even clearer focus this week, is: where in Translator does this inference occur?
Does the scope of this question make sense?