Mappings to abstractions in the gene/gene-product hierarchy #81

Open
bill-baumgartner opened this issue Mar 30, 2021 · 1 comment
Labels
incoming status - This issue has been submitted and is awaiting approval/triage

Comments

@bill-baumgartner
Collaborator

Protein mentions that the Text Mining Provider infrastructure automatically identifies in text are typically annotated to a species-non-specific class from the Protein Ontology (when one is available). Mapping to a more abstract concept has been shown to greatly improve inter-annotator agreement for the manual annotation task, because determining the correct species for a protein mention is often difficult (even for humans). However, when put into practice, e.g. through the text-mined assertion KG provided by the Text Mining Provider, it has become evident that these abstractions create a disconnect between the contents of the text-mined assertion KG and the rest of the Translator ecosystem, which uses species-specific identifiers. This problem needs to be addressed.

A related problem involves mapping from a gene in a query to the protein encoded by the gene. Distinguishing between gene and protein mentions in text is also a difficult task (even for humans). It is often unclear whether the author is referring to the gene or the protein. The text-mined assertion KG conflates the two concepts, and although it makes use of identifiers from the Protein Ontology, the mentions should be considered as representing the biolink:GeneOrGeneProduct class. Note: This issue may be addressed by a fix-it session in the upcoming May relay.

Both of the issues described above contribute to the zero hits returned for the query described in NCATSTranslator/testing#28. To successfully mine assertions from the text-mined assertion KG for the query Chemical substances that "down regulate" STK11, the following mappings are required:

  • HGNC:11389 needs to be mapped to the protein that the gene encodes: PR:Q15831
  • PR:Q15831 then needs to be mapped to its species-non-specific abstraction by climbing the subclass hierarchy of the Protein Ontology: PR:000015740

In short, replacing HGNC:11389 with PR:000015740 in the query should result in a non-empty result set from the text-mined assertion KG.
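
The two-step mapping above can be sketched roughly as follows. This is only an illustration under stated assumptions: the lookup tables, the `SPECIES_SPECIFIC` set, and the function name are hypothetical, and only the three identifiers come from this issue. In particular, the sketch treats PR:000015740 as the direct parent of PR:Q15831, whereas the real Protein Ontology hierarchy may contain intermediate classes.

```python
# Hypothetical lookup tables for illustration; only the identifiers
# HGNC:11389, PR:Q15831 and PR:000015740 are taken from the issue.
GENE_TO_PROTEIN = {"HGNC:11389": "PR:Q15831"}

# child -> parent subclass edges (assumed to be a single hop here)
PR_SUBCLASS_OF = {"PR:Q15831": "PR:000015740"}

# classes known to be species-specific (assumed to be available somehow)
SPECIES_SPECIFIC = {"PR:Q15831"}


def gene_to_abstract_protein(gene_curie):
    """Map a gene to the species-non-specific class of its protein.

    Returns None when either mapping step has no entry.
    """
    # Step 1: gene -> the protein it encodes
    protein = GENE_TO_PROTEIN.get(gene_curie)
    if protein is None:
        return None

    # Step 2: climb the subclass hierarchy until the class is no longer
    # species-specific
    current = protein
    while current in SPECIES_SPECIFIC:
        parent = PR_SUBCLASS_OF.get(current)
        if parent is None:
            return None  # no species-non-specific abstraction available
        current = parent
    return current


abstract_id = gene_to_abstract_protein("HGNC:11389")
```

With these tables, `abstract_id` is the identifier that would replace HGNC:11389 in the query. A real implementation would presumably draw the gene-to-protein and subclass edges from an ontology or mapping service rather than from hard-coded dictionaries.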

@mikebada

mikebada commented Apr 2, 2021

We need to somehow make sure that the queriers are aware that this is being done, i.e., that we'll be returning results for both genes and gene products corresponding to the inputted entity and also for potentially all 1:1 orthologs of the inputted species-specific entity.
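
One possible way to surface this to queriers is to attach a short provenance record to every identifier that was substituted during query expansion. The sketch below is an assumption, not an existing Translator convention: the `ExpandedId` type, its field names, and the reason strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpandedId:
    """An identifier used in place of the one the querier supplied."""
    curie: str         # identifier actually searched
    category: str      # how its mentions should be interpreted
    derived_from: str  # identifier it was derived from
    reason: str        # human-readable explanation for the querier


def expand_with_provenance(input_curie, protein_curie, abstract_curie):
    """Record each substitution so it can be returned alongside results."""
    return [
        ExpandedId(
            protein_curie,
            "biolink:GeneOrGeneProduct",
            input_curie,
            "gene product of the input gene",
        ),
        ExpandedId(
            abstract_curie,
            "biolink:GeneOrGeneProduct",
            protein_curie,
            "species-non-specific superclass; may cover orthologs",
        ),
    ]


expansion = expand_with_provenance("HGNC:11389", "PR:Q15831", "PR:000015740")
```

Returning `expansion` with the result set would let a querier see that hits may refer to either the gene or its product, and possibly to orthologs in other species.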
