The Text Mining Provider provides KGs consisting of explicitly targeted Biolink Associations extracted from sentences in the scientific literature. The association extraction pipeline makes use of concept annotations created by the same concept recognition system used for other aspects of the Text Mining Provider. Association classification is facilitated via a custom-tuned BERT model, one model per association type.
To date, there is a single extracted association type available: biolink:ChemicalToGeneAssociation (CHEBI up/down-regulates PR).
Associations under construction:
- biolink:GeneRegulatoryRelationship
- biolink:ChemicalToDiseaseOrPhenotypicFeatureAssociation
  - CHEBI biolink:treats MONDO
- biolink:GeneToDiseaseAssociation
  - PR biolink:contributes_to MONDO
- biolink:DiseaseToPhenotypicFeatureAssociation
  - MONDO biolink:has_symptom HP
- biolink:GeneToExpressionSiteAssociation
  - PR biolink:expressed_in UBERON
When using the cooccurrence KGs, note that the concept recognition pipeline for Protein Ontology (PRO) concepts tends to annotate to higher-level, species-non-specific PRO concepts. This is a result of how the training data has been annotated: it is often very difficult to determine the species of a given protein in text (difficult even for human annotators), so our tools make use of the higher-level, species-non-specific concepts. Additional inference steps may therefore be required to traverse a knowledge graph from a species-specific entity to the species-non-specific PRO concept used in the annotation. To aid in this traversal, we have built a knowledge graph consisting of the PRO subsumption and relation hierarchies.
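The traversal described above can be sketched as a walk up the subsumption hierarchy. This is a minimal illustration, not the Text Mining Provider's implementation: the `SUBCLASS_OF` mapping and every identifier in it are hypothetical placeholders rather than real PRO terms, and in practice the edges would come from the PRO hierarchy knowledge graph.

```python
from collections import deque

# Hypothetical subclass edges (child -> parents). These identifiers are
# placeholders for illustration only, not real PRO terms.
SUBCLASS_OF = {
    "PR:EXAMPLE_human": ["PR:EXAMPLE_generic"],
    "PR:EXAMPLE_mouse": ["PR:EXAMPLE_generic"],
    "PR:EXAMPLE_generic": ["PR:EXAMPLE_family"],
}


def ancestors(term, subclass_of):
    """Return every ancestor of `term`, nearest first (breadth-first)."""
    seen, order = set(), []
    queue = deque(subclass_of.get(term, []))
    while queue:
        parent = queue.popleft()
        if parent in seen:
            continue
        seen.add(parent)
        order.append(parent)
        queue.extend(subclass_of.get(parent, []))
    return order


def find_annotated_ancestor(term, subclass_of, annotated):
    """Return the nearest term (the term itself included) found in `annotated`."""
    for candidate in [term, *ancestors(term, subclass_of)]:
        if candidate in annotated:
            return candidate
    return None
```

Given a species-specific identifier and the set of concepts that appear in the text-mined KG, `find_annotated_ancestor` returns the closest species-non-specific concept that can be used to join the two graphs, or `None` if there is no such concept.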
Version | Association type(s) | Format | Location |
---|---|---|---|
SepRelay | ChemicalToGeneAssociation (CHEBI up/down-regulates PR) | KGX | nodes.tsv / edges.tsv |
SepRelay | ChemicalToGeneAssociation (CHEBI up/down-regulates PR) | BioThings API | API |
SepRelay | ChemicalToGeneAssociation (CHEBI up/down-regulates PR) | TRAPI v1.0 | API |
There is potential to generate associations between concepts from the following ontologies:
- Chemical Entities of Biological Interest (CHEBI)
- Cell Ontology (CL)
- Gene Ontology Biological Process (GO_BP)
- Gene Ontology Cellular Component (GO_CC)
- Gene Ontology Molecular Function (GO_MF)
- Human Phenotype Ontology (HP)
- Molecular Process Ontology (MOP)
- Monarch Disease Ontology (MONDO)
- NCBI Taxonomy (NCBITaxon)
- Protein Ontology (PRO)
- Sequence Ontology (SO)
- Uberon multi-species anatomy ontology (UBERON)
Please feel free to submit requests (as GitHub issues) for new concepts and/or associations to be mined from the scientific literature. For details on concept and association systems that are in development, please see the relevant GitHub issues for new concept requests and new association requests.
This KG consists of Biolink associations that have been extracted from sentences in the literature. For each text-mined Biolink association, the sentence(s) that were observed to assert the association are included as evidence/provenance/confidence (EPC) information. Specifically, each extracted Biolink association is accompanied by the following EPC information:
- The sentence from which the assertion was mined
- An identifier for the document that contains the sentence, e.g. the PubMed identifier
- The character offsets (relative to the sentence) for the text mentions of the subject and object concept of the assertion
- A confidence score for this specific text-mined assertion (right now this is the score reported by the classifier that identified the sentence)
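One way to picture a single packet of EPC information is as a small record whose offsets can be checked against the sentence. The sketch below is illustrative only: the `Evidence` class and its field names are hypothetical, and it assumes that offsets count characters from the start of the sentence with an exclusive end, which is consistent with the bupivacaine example used elsewhere on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Hypothetical container for the EPC fields listed above."""
    sentence: str
    document_id: str      # e.g. a PubMed identifier
    subject_span: tuple   # (start, end) character offsets, end exclusive (assumed)
    object_span: tuple
    score: float          # classifier score for this sentence

    def mention(self, span):
        """Return the text covered by `span`, validating the offsets first."""
        start, end = span
        if not 0 <= start < end <= len(self.sentence):
            raise ValueError(f"span {span} falls outside the sentence")
        return self.sentence[start:end]


example = Evidence(
    sentence=(
        "The administration of 50 µg/ml bupivacaine promoted maximum breast "
        "cancer cell invasion, and suppressed LRRC3B mRNA expression in cells."
    ),
    document_id="PMID:29085514",
    subject_span=(31, 42),
    object_span=(104, 110),
    score=0.99956816,
)
```

Calling `example.mention(example.subject_span)` recovers the subject mention, which is a cheap sanity check when consuming the offsets downstream.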
Find biolink:GeneOrGeneProduct entities that bupivacaine (CHEBI:3215) negatively regulates.
    curl --location --request POST 'https://api.bte.ncats.io/v1/smartapi/978fe380a147a8641caf72320862697b/query/' \
      --header 'Content-Type: application/json' \
      --data-raw '{
        "message": {
          "query_graph": {
            "nodes": {
              "n0": {
                "category": "biolink:ChemicalSubstance",
                "id": "CHEBI:3215"
              },
              "n1": {
                "category": "biolink:GeneOrGeneProduct"
              }
            },
            "edges": {
              "e00": {
                "subject": "n0",
                "object": "n1",
                "predicate": "biolink:negatively_regulates"
              }
            }
          }
        }
      }'
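The same request can be assembled programmatically. The following is a minimal sketch using only the Python standard library. It reuses the URL and query graph from the curl command above, while the helper names `build_query` and `build_request` are hypothetical. It only constructs the request and does not check whether the endpoint is currently reachable.

```python
import json
import urllib.request

# URL copied from the curl example above.
URL = "https://api.bte.ncats.io/v1/smartapi/978fe380a147a8641caf72320862697b/query/"


def build_query(chemical_id, predicate="biolink:negatively_regulates"):
    """Build the one-hop query graph: chemical -> gene or gene product."""
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": {"category": "biolink:ChemicalSubstance", "id": chemical_id},
                    "n1": {"category": "biolink:GeneOrGeneProduct"},
                },
                "edges": {
                    "e00": {"subject": "n0", "object": "n1", "predicate": predicate}
                },
            }
        }
    }


def build_request(chemical_id):
    """Wrap the query in a POST request with a JSON body (nothing is sent yet)."""
    body = json.dumps(build_query(chemical_id)).encode("utf-8")
    return urllib.request.Request(
        URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To actually send the query, pass the result of `build_request("CHEBI:3215")` to `urllib.request.urlopen` and decode the response body with `json.load`.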
Currently, each packet of EPC information (one per sentence that was observed to assert the association) is stored as an edge attribute in the TRAPI knowledge representation model. Because more than one sentence may assert a single association, each kind of EPC value is stored in its own array, and values at the same index across the arrays belong to the same sentence. The example below demonstrates the TRAPI representation of edge attributes for an extracted Biolink association supported by two sentences in the literature.
    # This assertion is supported by two sentences in the literature
    {
        'publication': 'PMID:29085514',
        'score': '0.99956816',
        'sentence': 'The administration of 50 µg/ml bupivacaine promoted maximum breast cancer cell invasion, and suppressed LRRC3B mRNA expression in cells.',
        'subject_spans': 'start: 31, end: 42',
        'object_spans': 'start: 104, end: 110',
        'provided_by': 'TMProvider'
    }
    {
        'publication': 'PMID:12345678',
        'score': '0.876',
        'sentence': 'This is a second sentence indicating that bupivacaine negatively regulates LRRC3B.',
        'subject_spans': 'start: 42, end: 53',
        'object_spans': 'start: 75, end: 81',
        'provided_by': 'TMProvider'
    }
    edges:
      - id: 9445e98f72ada21aa572559e303e4d5ac414650f
        predicate: biolink:negatively_regulates
        subject: CHEBI:3215 # bupivacaine
        object: PR:000031567 # LRRC3B
        attributes:
          - type: biolink:provided_by
            name: provided_by
            value: Text Mining KP
          - type: bts:api
            name: api
            value: Text Mining Targeted Association API
          - type: bts:score
            name: score
            value:
              - 0.99956816
              - 0.876
          - type: bts:sentence
            name: sentence
            value:
              - "The administration of 50 µg/ml bupivacaine promoted maximum breast cancer cell invasion, and suppressed LRRC3B mRNA expression in cells."
              - "This is a second sentence indicating that bupivacaine negatively regulates LRRC3B."
          - type: bts:subject_spans
            name: subject_spans
            value:
              - "31|42"
              - "42|53"
          - type: bts:object_spans
            name: object_spans
            value:
              - "104|110"
              - "75|81"
          - type: bts:publications
            name: publications
            value:
              - PMID:29085514
              - PMID:12345678
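Because the EPC values are spread across index-aligned arrays, a consumer usually needs to regroup them into one record per sentence. The sketch below shows one way to do that, assuming the attributes have already been parsed into a list of dictionaries with the `name` and `value` keys shown above. The function names and the output field names are hypothetical.

```python
def parse_span(text):
    """Turn a 'start|end' string such as '31|42' into an (int, int) tuple."""
    start, end = text.split("|")
    return int(start), int(end)


def regroup_evidence(attributes):
    """Zip the index-aligned EPC arrays into one dict per supporting sentence."""
    by_name = {attr["name"]: attr["value"] for attr in attributes}
    columns = ["publications", "score", "sentence", "subject_spans", "object_spans"]
    # The arrays are only meaningful if they all have the same length.
    if len({len(by_name[c]) for c in columns}) != 1:
        raise ValueError("EPC arrays have different lengths")
    records = []
    for pub, score, sentence, subj, obj in zip(*(by_name[c] for c in columns)):
        records.append({
            "publication": pub,
            "score": float(score),
            "sentence": sentence,
            "subject_span": parse_span(subj),
            "object_span": parse_span(obj),
        })
    return records


# Attribute values taken from the example edge above.
attributes = [
    {"name": "score", "value": [0.99956816, 0.876]},
    {"name": "sentence", "value": [
        "The administration of 50 µg/ml bupivacaine promoted maximum breast "
        "cancer cell invasion, and suppressed LRRC3B mRNA expression in cells.",
        "This is a second sentence indicating that bupivacaine negatively "
        "regulates LRRC3B.",
    ]},
    {"name": "subject_spans", "value": ["31|42", "42|53"]},
    {"name": "object_spans", "value": ["104|110", "75|81"]},
    {"name": "publications", "value": ["PMID:29085514", "PMID:12345678"]},
]

records = regroup_evidence(attributes)
```

Checking the array lengths before zipping matters because `zip` silently truncates to the shortest input, which would otherwise hide a malformed edge.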
The KGX format for the targeted association KG makes use of two kinds of nodes:
- Entity nodes (similar to the other KGs produced by the Text Mining Provider)
- Evidence nodes - these nodes are referenced by edges and include the sentence and other related information used to assert a given association. The score in the evidence node is provided by the classifier that asserted the association.
Nodes
id | name | category | publications | score | sentence | subject_spans | relation_spans | object_spans | provided_by |
---|---|---|---|---|---|---|---|---|---|
CHEBI:3215 | bupivacaine | biolink:ChemicalSubstance | |||||||
PR:000031567 | leucine-rich repeat-containing protein 3B | biolink:GeneOrGeneProduct | |||||||
02T60dgTsntC9kC4rqtr5lIN8n0 | Evidence: CHEBI:3215 -neg-reg-> PR:000031567 | biolink:InformationContentEntity | PMID:29085514 | 0.99956816 | The administration of 50 µg/ml bupivacaine promoted maximum breast cancer cell invasion, and suppressed LRRC3B mRNA expression in cells. | start: 31, end: 42 | | start: 104, end: 110 | TMProvider |
Edges
subject | edge_label | object | relation | id | association_type | evidence_count | has_evidence |
---|---|---|---|---|---|---|---|
CHEBI:3215 | biolink:negatively_regulates_entity_to_entity | PR:000031567 | RO:0002212 | IjbFtUdgNQk-HHlsBju-I_jpSnA | biolink:ChemicalToGeneAssociation | 1 | 02T60dgTsntC9kC4rqtr5lIN8n0 |
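To recover the supporting sentences for an edge in the KGX files, the `has_evidence` value has to be resolved against the evidence nodes. The sketch below illustrates that join on in-memory sample data. It assumes the files are tab-separated with a header row using the column names shown above, and that multiple evidence identifiers in `has_evidence` would be separated by `|`. Neither detail is confirmed by this page, and the sample rows are trimmed to a few columns for brevity.

```python
import csv
import io


def read_tsv(handle):
    """Read a tab-separated KGX file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))


def attach_evidence(nodes, edges, separator="|"):
    """Pair each edge with the evidence nodes named in its has_evidence column."""
    by_id = {node["id"]: node for node in nodes}
    pairs = []
    for edge in edges:
        ids = [i for i in edge["has_evidence"].split(separator) if i]
        pairs.append((edge, [by_id[i] for i in ids if i in by_id]))
    return pairs


# Sample rows based on the tables above (columns trimmed for brevity).
NODES_TSV = "\n".join("\t".join(row) for row in [
    ["id", "name", "category", "publications", "score"],
    ["CHEBI:3215", "bupivacaine", "biolink:ChemicalSubstance", "", ""],
    ["PR:000031567", "leucine-rich repeat-containing protein 3B",
     "biolink:GeneOrGeneProduct", "", ""],
    ["02T60dgTsntC9kC4rqtr5lIN8n0", "Evidence: CHEBI:3215 -neg-reg-> PR:000031567",
     "biolink:InformationContentEntity", "PMID:29085514", "0.99956816"],
])
EDGES_TSV = "\n".join("\t".join(row) for row in [
    ["subject", "edge_label", "object", "has_evidence"],
    ["CHEBI:3215", "biolink:negatively_regulates_entity_to_entity",
     "PR:000031567", "02T60dgTsntC9kC4rqtr5lIN8n0"],
])

pairs = attach_evidence(
    read_tsv(io.StringIO(NODES_TSV)),
    read_tsv(io.StringIO(EDGES_TSV)),
)
```

For the real files, replace the `io.StringIO` objects with handles opened on `nodes.tsv` and `edges.tsv`. Quoting is disabled so that quotation marks inside sentences are read literally.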