Develop KGX serialization component for text-mined assertions #92
Labels
accepted
status - this issue has been approved/accepted and will be addressed
Text-mined assertions and accompanying metadata are stored in a Cloud SQL DB. This task involves the serialization of the assertions and metadata using the KGX format to generate files that will be shared with other components within the Translator consortium.
Two flavors of KG will be serialized to KGX
Note that we will produce two versions of the text-mined targeted Biolink association KG. The versions will differ in how protein/gene nodes are modeled. By default our text mining pipelines link mentions of proteins in the text to species-non-specific Protein Ontology concepts (when possible). Our manual annotation work has demonstrated increased inter-annotator agreement using this strategy as it is often difficult to determine the precise species of a gene/protein mention in text. In the Protein Ontology, these species-non-specific concepts are ancestors of the species-specific concepts. One version of the serialized KG will use the default species-non-specific Protein Ontology concepts to represent gene/protein nodes.
The species-non-specific protein concepts, however, are not generally used by other Translator components. In order to better integrate with the Translator ecosystem we will produce an alternate KG that maps the species-non-specific protein concepts to human UniProt identifiers (when possible). This strategy has been discussed and, although imperfect, has been agreed upon as a way to initially overcome the gene/protein species issue. The implementation of this approach will require the addition of a mapping table to the underlying database that stores the assertions and associated metadata, e.g. text-mined sentences supporting the assertions. This mapping table will map from species-non-specific Protein Ontology identifiers to an appropriate human UniProt identifier, e.g.
Proposed solution
A Docker container that can interface with the text-mined assertion DB and output data to file in the provisional KGX format that is described in this issue. When invoked, the container will query the DB for all text-mined assertions (excluding all metadata sentences that have been flagged as erroneous and any text-mined assertion that is supported only by erroneous sentences), write the assertions to file using the KGX format, and upload the KGX files to a user-specified GCP bucket. Two sets of KGX files will be generated, one for each of the KG flavors described above.
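The exclusion rule described above (drop sentences flagged as erroneous, then drop any assertion left with no supporting sentence) can be sketched as follows. This is only an illustration under stated assumptions: the `Sentence` and `Assertion` classes, their field names, and `filter_for_export` are hypothetical and do not reflect the actual Cloud SQL schema. The sketch also assumes that an assertion with no sentences at all is excluded.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sentence:
    # Hypothetical stand-in for one row of sentence-level metadata.
    text: str
    erroneous: bool = False


@dataclass
class Assertion:
    # Hypothetical stand-in for one text-mined assertion and its evidence.
    assertion_id: str
    sentences: List[Sentence] = field(default_factory=list)


def filter_for_export(assertions: List[Assertion]) -> List[Assertion]:
    """Keep only non-erroneous sentences; drop assertions left without support."""
    kept = []
    for assertion in assertions:
        valid = [s for s in assertion.sentences if not s.erroneous]
        if valid:  # assumption: assertions with no valid evidence are not exported
            kept.append(Assertion(assertion.assertion_id, valid))
    return kept


example = [
    Assertion("a1", [Sentence("good evidence"), Sentence("bad evidence", erroneous=True)]),
    Assertion("a2", [Sentence("bad evidence", erroneous=True)]),
]
exported = filter_for_export(example)
```

In this example only `a1` would be exported, carrying its single valid sentence. In practice the same filter could be expressed directly in the SQL query rather than in application code.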
Input parameters should include:
Additional context
`_attributes` field that is relevant to this particular feature request. `id` --> `label`. The category selected should be the most specific category listed in the array of categories returned by the SRI Node Normalizer service (the first element of the `type` array). Names and categories, once retrieved from the SRI Node Normalizer service, will be cached in the Cloud SQL DB to speed up future processing and avoid redundant calls to the service. It is possible there will be identifiers that are not recognized by the SRI Node Normalizer service. In such cases, we will flag these identifiers for later inspection and will use `UNKNOWN_NAME` and `biolink:NamedThing` as placeholders for the name and category fields, respectively.

Output from the SRI Node Normalizer service is shown below for the input identifier `CHEBI:17824`. Note that we will make use of the canonical label `Isopropyl alcohol` even though the official label of the CHEBI concept is different (`propan-2-ol`). The category selected is the first in the `type` array, so `biolink:SmallMolecule` in this case. Also note that it is possible to batch multiple identifiers in the request to the SRI Node Normalizer (see the example provided here for details).
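The selection rules above (name from `id` --> `label`, category from the first element of `type`, placeholders for unrecognized identifiers) can be sketched in Python. This is a minimal sketch, not the actual implementation: `name_and_category` is a hypothetical helper, `sample_response` is a hand-written, abridged payload rather than real service output, `EXAMPLE:0000` is a made-up identifier, and the sketch assumes an unrecognized identifier appears as a missing or null entry in the response.

```python
from typing import Tuple

UNKNOWN_NAME = "UNKNOWN_NAME"
DEFAULT_CATEGORY = "biolink:NamedThing"


def name_and_category(curie: str, response: dict) -> Tuple[str, str, bool]:
    """Return (name, category, recognized) for one identifier.

    `response` is assumed to be the parsed JSON keyed by input identifier.
    """
    entry = response.get(curie)
    if not entry:
        # Unrecognized identifier: use placeholders and report it for flagging.
        return UNKNOWN_NAME, DEFAULT_CATEGORY, False
    name = (entry.get("id") or {}).get("label") or UNKNOWN_NAME
    types = entry.get("type") or []
    # The first element of the type array is treated as the most specific category.
    category = types[0] if types else DEFAULT_CATEGORY
    return name, category, True


# Hand-written, abridged payload for illustration only.
sample_response = {
    "CHEBI:17824": {
        "id": {"identifier": "CHEBI:17824", "label": "Isopropyl alcohol"},
        "type": ["biolink:SmallMolecule", "biolink:NamedThing"],
    },
    "EXAMPLE:0000": None,  # made-up identifier standing in for an unrecognized one
}

known = name_and_category("CHEBI:17824", sample_response)
unknown = name_and_category("EXAMPLE:0000", sample_response)
```

The third element of the returned tuple lets the caller flag unrecognized identifiers for later inspection before the name and category are cached.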