Targeted Biolink Association Knowledge Graphs

The Text Mining Provider provides KGs consisting of explicitly targeted Biolink Associations extracted from sentences in the scientific literature. The association extraction pipeline uses concept annotations created by the same concept recognition system used for other aspects of the Text Mining Provider. Associations are classified by custom-tuned BERT models, one model per association type.

Extracted Associations

To date, there is a single extracted association type available:

Associations under construction

A note about concept recognition using the Protein Ontology

When using the cooccurrence KGs, note that the concept recognition pipeline for Protein Ontology (PRO) concepts tends to annotate to higher-level, species-non-specific PRO concepts. This is a result of how the training data were annotated: it is often very difficult to determine the species of a given protein in text (even for human annotators), so our tools use the higher-level, species-non-specific concepts. Additional inference steps may therefore be required to traverse a knowledge graph from a species-specific entity to the species-non-specific PRO concept used in the annotation. To aid this traversal, we have built a knowledge graph consisting of the PRO subsumption and relation hierarchies.

Available KGs

| Version | Association type(s) | Format | Location |
| --- | --- | --- | --- |
| SepRelay | ChemicalToGeneAssociation (CHEBI up/down-regulates PR) | KGX | nodes.tsv / edges.tsv |
| SepRelay | ChemicalToGeneAssociation (CHEBI up/down-regulates PR) | BioThings API | API |
| SepRelay | ChemicalToGeneAssociation (CHEBI up/down-regulates PR) | TRAPI v1.0 | API |

Requesting extraction of Biolink associations

There is potential to generate associations between concepts from the following ontologies:

Please feel free to submit requests (as GitHub issues) for new concepts and/or associations to be mined from the scientific literature. For details on concept and association systems that are in development, please see the relevant GitHub issues for new concept requests and new association requests.

TRAPI v1.0

This KG consists of Biolink associations that have been extracted from sentences in the literature. For each text-mined Biolink association, the sentence(s) that were observed to assert the association are included as evidence/provenance/confidence (EPC) information. Specifically, each extracted Biolink association is accompanied by the following EPC information:

  • The sentence from which the assertion was mined
  • An identifier for the document that contains the sentence, e.g. the PubMed identifier
  • The character offsets (relative to the sentence) for the text mentions of the subject and object concept of the assertion
  • A confidence score for this specific text-mined assertion (currently the score reported by the classifier that identified the sentence)
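
As a minimal sketch of how the character offsets relate to the sentence, the Python snippet below slices the subject and object mentions out of the first example sentence shown later in this document. It assumes the offsets are zero-based with an exclusive end; that reading is consistent with the example but is not stated explicitly by the source. The helper name `mention` is hypothetical.

```python
def mention(sentence: str, start: int, end: int) -> str:
    """Return the text covered by [start, end) in the sentence.

    Assumption: offsets are zero-based and the end offset is exclusive.
    """
    return sentence[start:end]


sentence = (
    "The administration of 50 µg/ml bupivacaine promoted maximum breast cancer "
    "cell invasion, and suppressed LRRC3B mRNA expression in cells."
)

subject_text = mention(sentence, 31, 42)    # subject span from the example packet
object_text = mention(sentence, 104, 110)   # object span from the example packet
print(subject_text, object_text)
```

Checking that the sliced text matches the expected concept label is a cheap way to detect offset drift, for example when a sentence has been re-encoded.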

Sample query

Find biolink:GeneOrGeneProduct entities that bupivacaine (CHEBI:3215) negatively regulates.

curl --location --request POST 'https://api.bte.ncats.io/v1/smartapi/978fe380a147a8641caf72320862697b/query/' \
--header 'Content-Type: application/json' \
--data-raw '{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "category": "biolink:ChemicalSubstance",
                    "id": "CHEBI:3215"
                },
                "n1": {
                    "category": "biolink:GeneOrGeneProduct"
                }
            },
            "edges": {
                "e00": {
                    "subject": "n0",
                    "object": "n1",
                    "predicate": "biolink:negatively_regulates"
                }
            }
        }
    }
}'
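
For clients that prefer Python over curl, the same request could be sketched with the standard library as follows. The query body mirrors the curl example above. The function names `build_query` and `post_query` are hypothetical, and the endpoint is simply the URL from the curl command, so its availability is an assumption.

```python
import json
import urllib.request

# Endpoint copied from the curl example; it may no longer be live.
TRAPI_URL = "https://api.bte.ncats.io/v1/smartapi/978fe380a147a8641caf72320862697b/query/"


def build_query(chemical_id: str,
                predicate: str = "biolink:negatively_regulates") -> dict:
    """Build a one-hop TRAPI v1.0 query: chemical --predicate--> gene or gene product."""
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": {"category": "biolink:ChemicalSubstance", "id": chemical_id},
                    "n1": {"category": "biolink:GeneOrGeneProduct"},
                },
                "edges": {
                    "e00": {"subject": "n0", "object": "n1", "predicate": predicate},
                },
            }
        }
    }


def post_query(query: dict, url: str = TRAPI_URL, timeout: float = 30.0) -> dict:
    """POST the query as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Building the payload needs no network access; call post_query(query) to send it.
query = build_query("CHEBI:3215")
print(json.dumps(query, indent=2))
```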

Evidence/Provenance/Confidence structure

Currently, each packet of EPC information (one per sentence that was observed to assert the association) is stored as an edge attribute in the TRAPI knowledge representation model. Because more than one sentence may assert a single association, each kind of EPC value is stored in its own array, and the arrays are parallel: the values at a given index across all arrays belong to the same sentence. The example below demonstrates the TRAPI representation of edge attributes for an extracted Biolink association supported by two sentences in the literature.

Two example EPC packets describing sentences that assert bupivacaine --downregulates--> LRRC3B
# This assertion is supported by two sentences in the literature
      {
        'publication': 'PMID:29085514', 
        'score': '0.99956816', 
        'sentence': 'The administration of 50 µg/ml bupivacaine promoted maximum breast cancer cell invasion, and suppressed LRRC3B mRNA expression in cells.', 
        'subject_spans': 'start: 31, end: 42', 
        'object_spans': 'start: 104, end: 110', 
        'provided_by': 'TMProvider'
      }

      {
        'publication': 'PMID:12345678', 
        'score': '0.876', 
        'sentence': 'This is a second sentence indicating that bupivacaine negatively regulates LRRC3B.', 
        'subject_spans': 'start: 42, end: 53', 
        'object_spans': 'start: 75, end: 81', 
        'provided_by': 'TMProvider'
      }
TRAPI v1.0 representation of the two EPC packets shown above
edges:
  - id: 9445e98f72ada21aa572559e303e4d5ac414650f
    predicate: biolink:negatively_regulates
    subject: CHEBI:3215          # bupivacaine
    object: PR:000031567       # LRRC3B
    attributes:
      - type: biolink:provided_by
        name: provided_by
        value: Text Mining KP
      - type: bts:api
        name: api
        value: Text Mining Targeted Association API
      - type: bts:score
        name: score
        value: 
          - 0.99956816
          - 0.876
      - type: bts:sentence
        name: sentence
        value: 
          - "The administration of 50 µg/ml bupivacaine promoted maximum breast cancer cell invasion, and suppressed LRRC3B mRNA expression in cells."
          - "This is a second sentence indicating that bupivacaine negatively regulates LRRC3B."
      - type: bts:subject_spans
        name: subject_spans
        value: 
          - "31|42"
          - "42|53"
      - type: bts:object_spans
        name: object_spans
        value: 
          - "104|110"
          - "75|81"
      - type: bts:publications
        name: publications
        value: 
          - PMID:29085514
          - PMID:12345678  
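
Because the EPC values are stored as parallel arrays, a consumer usually has to regroup them by index to recover one packet per sentence. The Python sketch below shows one way to do this. It assumes the edge attributes have already been parsed into a list of dicts with the `name` and `value` keys shown above, and `regroup_epc` is a hypothetical helper name.

```python
def regroup_epc(attributes: list) -> list:
    """Turn parallel per-sentence arrays into one dict per supporting sentence.

    Attributes whose value is not a list (e.g. provided_by) describe the whole
    edge and are skipped here.
    """
    per_sentence = {
        attr["name"]: attr["value"]
        for attr in attributes
        if isinstance(attr["value"], list)
    }
    lengths = {len(values) for values in per_sentence.values()}
    if len(lengths) > 1:
        raise ValueError("EPC arrays differ in length, so indices cannot be aligned")
    names = list(per_sentence)
    # zip(*...) walks all arrays in lockstep, one index (sentence) at a time.
    return [dict(zip(names, row)) for row in zip(*per_sentence.values())]


attributes = [
    {"type": "biolink:provided_by", "name": "provided_by", "value": "Text Mining KP"},
    {"type": "bts:score", "name": "score", "value": [0.99956816, 0.876]},
    {"type": "bts:subject_spans", "name": "subject_spans", "value": ["31|42", "42|53"]},
    {"type": "bts:object_spans", "name": "object_spans", "value": ["104|110", "75|81"]},
    {"type": "bts:publications", "name": "publications",
     "value": ["PMID:29085514", "PMID:12345678"]},
]

packets = regroup_epc(attributes)
for packet in packets:
    print(packet)
```

Raising an error on mismatched array lengths is a deliberate choice in this sketch: silently truncating to the shortest array would pair values with the wrong sentence.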

KGX format

The KGX format for the targeted association KG makes use of two kinds of nodes:

  1. Entity nodes (similar to the other KGs produced by the Text Mining Provider)
  2. Evidence nodes - these nodes are referenced by edges and include the sentence and other related information used to assert a given association. The score in the evidence node is provided by the classifier that asserted the association.

Nodes

| id | name | category | publications | score | sentence | subject_spans | relation_spans | object_spans | provided_by |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CHEBI:3215 | bupivacaine | biolink:ChemicalSubstance | | | | | | | |
| PR:000031567 | leucine-rich repeat-containing protein 3B | biolink:GeneOrGeneProduct | | | | | | | |
| 02T60dgTsntC9kC4rqtr5lIN8n0 | Evidence: CHEBI:3215 -neg-reg-> PR:000031567 | biolink:InformationContentEntity | PMID:29085514 | 0.99956816 | The administration of 50 µg/ml bupivacaine promoted maximum breast cancer cell invasion, and suppressed LRRC3B mRNA expression in cells. | start: 31, end: 42 | | start: 104, end: 110 | TMProvider |

Edges

| subject | edge_label | object | relation | id | association_type | evidence_count | has_evidence |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CHEBI:3215 | biolink:negatively_regulates_entity_to_entity | PR:000031567 | RO:0002212 | IjbFtUdgNQk-HHlsBju-I_jpSnA | biolink:ChemicalToGeneAssociation | 1 | 02T60dgTsntC9kC4rqtr5lIN8n0 |
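
To illustrate how edges and evidence nodes fit together, the Python sketch below joins each edge to its evidence node(s) through the `has_evidence` column. The sample rows reuse the values shown above, with the node columns trimmed to the first five for brevity. Two things are assumptions rather than documented behavior: that the files are tab-separated, and that multiple evidence ids in `has_evidence` would be separated by `|`. The helper names `read_tsv` and `evidence_for_edges` are hypothetical.

```python
import csv
import io

# Trimmed sample of nodes.tsv (first five columns only).
NODES_TSV = (
    "id\tname\tcategory\tpublications\tscore\n"
    "CHEBI:3215\tbupivacaine\tbiolink:ChemicalSubstance\t\t\n"
    "PR:000031567\tleucine-rich repeat-containing protein 3B\tbiolink:GeneOrGeneProduct\t\t\n"
    "02T60dgTsntC9kC4rqtr5lIN8n0\tEvidence: CHEBI:3215 -neg-reg-> PR:000031567\t"
    "biolink:InformationContentEntity\tPMID:29085514\t0.99956816\n"
)

EDGES_TSV = (
    "subject\tedge_label\tobject\trelation\tid\tassociation_type\tevidence_count\thas_evidence\n"
    "CHEBI:3215\tbiolink:negatively_regulates_entity_to_entity\tPR:000031567\tRO:0002212\t"
    "IjbFtUdgNQk-HHlsBju-I_jpSnA\tbiolink:ChemicalToGeneAssociation\t1\t"
    "02T60dgTsntC9kC4rqtr5lIN8n0\n"
)


def read_tsv(text: str) -> list:
    """Parse tab-separated text into a list of row dicts keyed by the header."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def evidence_for_edges(nodes: list, edges: list, separator: str = "|") -> dict:
    """Map each edge id to the evidence node rows listed in its has_evidence column.

    Assumption: multiple evidence ids, if present, are joined with `separator`.
    """
    nodes_by_id = {node["id"]: node for node in nodes}
    linked = {}
    for edge in edges:
        evidence_ids = [i for i in edge["has_evidence"].split(separator) if i]
        linked[edge["id"]] = [nodes_by_id[i] for i in evidence_ids if i in nodes_by_id]
    return linked


linked = evidence_for_edges(read_tsv(NODES_TSV), read_tsv(EDGES_TSV))
for edge_id, evidence_nodes in linked.items():
    for node in evidence_nodes:
        print(edge_id, node["publications"], node["score"])
```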