Add attribute to indicate agreement with SemMedDB for targeted assertions #96

Open
bill-baumgartner opened this issue Jul 13, 2022 · 0 comments
Labels
incoming status - This issue has been submitted and is awaiting approval/triage

Background

At the June 2022 Relay, consortium members expressed interest in having the Text Mining Provider KP (TMKP) indicate when its targeted assertions match those provided by SemMedDB. Agreement between the two resources can be considered at varying levels. The two resources might agree at the sentence level, i.e., both resources mined an assertion from the same sentence of a given document. They might agree at the document level, i.e., both resources mined an assertion from the same document, but not necessarily from the same sentence in that document. Finally, the resources might agree only at the assertion level, i.e., one resource mined an assertion from one document while the other resource mined the same assertion from a different document altogether.
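The three levels of agreement described above can be sketched as a small classifier. This is only an illustration, not TMKP code: the names `Agreement` and `agreement_level` are hypothetical, and it assumes each resource's evidence for one assertion is available as a set of `(document_id, sentence_id)` pairs with sentence identifiers that are comparable across the two resources.

```python
from enum import Enum


class Agreement(Enum):
    """Strongest level at which two resources agree on one assertion."""
    SENTENCE = "sentence"
    DOCUMENT = "document"
    ASSERTION = "assertion"
    NONE = "none"


def agreement_level(tmkp_evidence, semmed_evidence):
    """Classify agreement for a single assertion.

    Each argument is a set of (document_id, sentence_id) pairs from which
    the resource mined the assertion (hypothetical representation).
    """
    if not tmkp_evidence or not semmed_evidence:
        # At least one resource does not report the assertion at all.
        return Agreement.NONE
    if tmkp_evidence & semmed_evidence:
        # Same sentence of the same document.
        return Agreement.SENTENCE
    tmkp_docs = {doc for doc, _ in tmkp_evidence}
    semmed_docs = {doc for doc, _ in semmed_evidence}
    if tmkp_docs & semmed_docs:
        # Same document, different sentences.
        return Agreement.DOCUMENT
    # Same assertion, mined from different documents altogether.
    return Agreement.ASSERTION
```

The function returns the strongest level that applies, so an assertion that agrees at the sentence level is not also reported as document-level agreement.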

Some relevant links:

Proposed EPC metadata

We will return agreement information in the form of an attribute in the EPC metadata for each targeted assertion. The attribute will be at the assertion level, at least initially, and will have nested fields to indicate the following:

  • [boolean] true if SemMedDB contains this type of assertion, i.e., if one might expect this kind of assertion to also appear in SemMedDB; false otherwise. If false, many of the counts below will be zero.
  • [integer] count of this assertion reported by TMKP (will include PubMed and other sources)
  • [integer] count of this assertion reported by SemMedDB
  • [integer] count of PubMed records in TMKP from which this assertion was mined
  • [integer] count of PubMed records in SemMedDB from which this assertion was mined
  • [integer or %] number of PubMed records from which both TMKP and SemMedDB mined this assertion
  • [integer] count of sentences in PubMed records in TMKP from which this assertion was mined
  • [integer] count of sentences in PubMed records in SemMedDB from which this assertion was mined
  • [integer or %] number of sentences in PubMed records from which both TMKP and SemMedDB mined this assertion

Note that the proposed EPC fields above are PubMed-centric because SemMedDB consists of assertions mined from PubMed. TMKP contains assertions mined from PubMed as well as other sources.
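As a rough sketch of how the nested fields listed above might be computed for one assertion, consider the function below. Everything in it is an assumption for illustration: the function name, the dictionary keys, the representation of evidence as `(pmid, sentence_id)` pairs (with `pmid` set to `None` for TMKP evidence from non-PubMed sources), and the reading of "count of this assertion" as the number of evidence rows. It also assumes sentence identifiers can be compared across the two resources, and it reports the overlap fields as integers rather than percentages.

```python
def agreement_attribute(tmkp_rows, semmed_rows, semmed_has_assertion_type):
    """Build a hypothetical agreement attribute for a single assertion.

    tmkp_rows / semmed_rows: iterables of (pmid, sentence_id) pairs.
    A pmid of None marks TMKP evidence that did not come from PubMed.
    """
    tmkp_rows = list(tmkp_rows)
    semmed_rows = list(semmed_rows)

    # Restrict TMKP evidence to PubMed for the PubMed-centric fields.
    tmkp_pubmed = {(pmid, sid) for pmid, sid in tmkp_rows if pmid is not None}
    semmed_pubmed = set(semmed_rows)

    tmkp_pmids = {pmid for pmid, _ in tmkp_pubmed}
    semmed_pmids = {pmid for pmid, _ in semmed_pubmed}

    return {
        "semmeddb_has_assertion_type": semmed_has_assertion_type,
        "tmkp_assertion_count": len(tmkp_rows),        # PubMed and other sources
        "semmeddb_assertion_count": len(semmed_rows),
        "tmkp_pubmed_record_count": len(tmkp_pmids),
        "semmeddb_pubmed_record_count": len(semmed_pmids),
        "shared_pubmed_record_count": len(tmkp_pmids & semmed_pmids),
        "tmkp_sentence_count": len(tmkp_pubmed),
        "semmeddb_sentence_count": len(semmed_pubmed),
        "shared_sentence_count": len(tmkp_pubmed & semmed_pubmed),
    }
```

When `semmed_has_assertion_type` is false, `semmed_rows` would normally be empty, so every SemMedDB and shared count comes out as zero, matching the remark in the first bullet.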

Processing SemMedDB

To populate the proposed EPC metadata above, we will develop a pipeline to process SemMedDB. Expected challenges include mapping from the entity namespace used by SemMedDB (UMLS, I believe) to the OBO namespace used by TMKP. There have been previous efforts within the Translator community to align SemMedDB with Biolink that may be of help. This notebook illustrates many of the modeling decisions that need to be made in order to make use of SemMedDB within the Translator ecosystem.

Output of the pipeline will be a database table with the following fields:
| PMID | sentence_id | subject CURIE | predicate | object CURIE |
where,

  • PMID = PubMed ID
  • sentence_id is a hash of the sentence. Currently we use a SHA256 hash of documentId + documentZone + entityId1 + entitySpan1 + entityId2 + entitySpan2 + sentenceText. We may need to reconsider this based on the information available in SemMedDB.
  • subject CURIE = the CURIE of the subject entity in the OBO namespace
  • predicate = the predicate in the Biolink namespace
  • object CURIE = the CURIE of the object entity in the OBO namespace
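A minimal sketch of the output row and the sentence hash described above follows. The names `SemMedRow` and `make_sentence_id` are hypothetical, and the details of how each component is rendered as a string before concatenation (for example, the format of an entity span) are assumptions; the issue only states that the components are concatenated and hashed with SHA256.

```python
import hashlib
from typing import NamedTuple


class SemMedRow(NamedTuple):
    """One row of the proposed output table (hypothetical type name)."""
    pmid: str
    sentence_id: str
    subject_curie: str   # OBO namespace
    predicate: str       # Biolink namespace
    object_curie: str    # OBO namespace


def make_sentence_id(document_id, document_zone,
                     entity_id1, entity_span1,
                     entity_id2, entity_span2,
                     sentence_text):
    """SHA256 hex digest of the concatenated components.

    Assumes every component is already a string and that plain
    concatenation without a separator is intended.
    """
    combined = (document_id + document_zone
                + entity_id1 + entity_span1
                + entity_id2 + entity_span2
                + sentence_text)
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()
```

Because the hash includes the entity identifiers and spans, two resources would produce the same `sentence_id` only if they also agree on entity normalization and span boundaries. That is one reason the hash inputs may need to be reconsidered for SemMedDB.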

We note that SemMedDB has periodic releases, so the pipeline will be run whenever a new release is made available.
