rdfs:subClassOf relationships missing from MeSH RDF #153
Comments
Behind the UI, we use Virtuoso, the open-source version. As you've seen, it is really a quadstore, so that it stores tuples of the form
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT *
WHERE {
  GRAPH ?g {
    ?s rdfs:subClassOf ?p .
  }
}
Does this answer your questions? I'm really glad you are benefiting from it - it does not get as much manual traffic as you might think, even though there is a lot of API usage of the system.
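To make the quadstore point concrete, here is a minimal toy sketch in plain Python (not Virtuoso or rdflib). It assumes the subclass statements live in a different named graph than the instance data; the graph names and the sample descriptor are hypothetical placeholders, not confirmed NLM graph URIs.

```python
# Toy quad store: each statement is (graph, subject, predicate, object).
# Graph names are hypothetical placeholders for illustration only.
DATA_GRAPH = "http://example.org/graph/data"
VOCAB_GRAPH = "http://example.org/graph/vocabulary"
SUBCLASS_OF = "rdfs:subClassOf"

QUADS = [
    (VOCAB_GRAPH, "meshv:TopicalDescriptor", SUBCLASS_OF, "meshv:Descriptor"),
    (VOCAB_GRAPH, "meshv:CheckTag", SUBCLASS_OF, "meshv:Descriptor"),
    (DATA_GRAPH, "mesh:D000001", "rdf:type", "meshv:TopicalDescriptor"),
]


def match(quads, predicate, graph=None):
    """Return (graph, subject, object) for quads with the given predicate.

    graph=None mimics `GRAPH ?g { ... }` (search every named graph);
    a concrete graph mimics restricting the query with `FROM <graph>`.
    """
    return [
        (g, s, o)
        for g, s, p, o in quads
        if p == predicate and (graph is None or g == graph)
    ]


all_graphs = match(QUADS, SUBCLASS_OF)
data_only = match(QUADS, SUBCLASS_OF, graph=DATA_GRAPH)
print(len(all_graphs), "subClassOf matches across all graphs")
print(len(data_only), "subClassOf matches in the data graph alone")
```

Under that assumption, restricting a query to the data graph would hide the subclass statements even though they are present in the store.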
I see! I was initially looking at https://hhs.github.io/meshrdf/descriptors and I assumed all visualized nodes were from the same graph. So if I want to query a MeSH release with SPARQL, but serve the database locally, I would need to load both of these files from the ftp site?
Would it be okay to load both of these files into a single rdflib merged graph? My goal is to write queries that can access the
Ah good to know. Pasting the results from that query below as a reference:
Thanks! My current goal is to load MeSH into a Python networkx directed graph (using nxontology). Basically, I want a single directed acyclic graph of concepts. I'm thinking that means I want to add meshv:Descriptor and meshv:SupplementaryConceptRecord records as nodes. Feel free to point me to any complementary resources or efforts.
You can certainly do that. How you make use of the vocabulary depends a lot on how your triple store does inference, and on your research need for inference, e.g. whether you need it at all. I've used rdflib for little things, but never for the full model, and so I don't feel like I am the expert to tell you what to do. I can however expand a bit on inference.
Inference makes a property statement such as "?d a meshv:Descriptor" work. Without it, you must be very explicit, maybe using SPARQL UNION queries. So, in general, you can always rewrite queries to get around a lack of inference in a bespoke system, but it limits things if you are for instance implementing a question answering system.
Different triple stores do inference differently. Virtuoso uses separate graphs as a set of rules (and only does RDFS inference). Oracle Spatial and Graph calculates an "entailment", which is the full set of inferred triples; those are then loaded into another graph, and you define a union graph with some sort of aliasing. A quick web search finds https://github.com/RDFLib/OWL-RL, which does limited OWL inferencing as well as RDFS inferencing. So, that would be enough, but I'm not sure whether this is the leading way to do inferencing with rdflib, or whether you need inferencing.
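As a rough illustration of what an "entailment" means here, the following plain-Python sketch materializes inferred rdf:type statements from rdfs:subClassOf links. It is not how Virtuoso, Oracle, or OWL-RL implement inference; the function names and the sample descriptor identifier are made up for the example.

```python
# Minimal sketch of RDFS subclass entailment over in-memory triples.
# The subclass links mirror the MeSH vocabulary classes named in this thread;
# the instance identifier is a hypothetical sample.
SUBCLASS_OF = {
    "meshv:TopicalDescriptor": {"meshv:Descriptor"},
    "meshv:GeographicalDescriptor": {"meshv:Descriptor"},
    "meshv:PublicationType": {"meshv:Descriptor"},
    "meshv:CheckTag": {"meshv:Descriptor"},
}

ASSERTED_TYPES = {("mesh:D000001", "meshv:TopicalDescriptor")}


def superclasses(cls, subclass_of):
    """Return every class reachable from cls via subClassOf (excluding cls)."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def entail_types(type_triples, subclass_of):
    """Return asserted plus inferred (subject, class) pairs."""
    inferred = set(type_triples)
    for subject, cls in type_triples:
        for parent in superclasses(cls, subclass_of):
            inferred.add((subject, parent))
    return inferred


entailed = entail_types(ASSERTED_TYPES, SUBCLASS_OF)
# Only after entailment does "?d a meshv:Descriptor" match the sample record.
print(("mesh:D000001", "meshv:Descriptor") in ASSERTED_TYPES)
print(("mesh:D000001", "meshv:Descriptor") in entailed)
```

Without the materialized pairs, a query for the parent class finds nothing, which is why the alternative is to enumerate the concrete subclasses explicitly.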
Since you explicitly want to compute the extra nodes you need and take them into a DAG system such as networkx, you can ignore the vocabulary file and create your own "entailment", adding the triples you need by doing something like this:
SELECT ?d FROM <http://id.nlm.nih.gov/mesh>
WHERE {
  { ?d a meshv:TopicalDescriptor }
  UNION { ?d a meshv:GeographicalDescriptor }
  UNION { ?d a meshv:PublicationType }
  UNION { ?d a meshv:CheckTag }
}
Use the results to generate the new nodes you need and insert them into your graph. You can do a similar thing with other relationships you need. I caution that networkx will certainly scale to MeSH RDF, but if you are thinking of adding something bigger such as PubChem RDF or SNOMED CT, you may want to think about a graph database such as neo4j. Using a system like that will give you hosting options if you are going beyond research to a production system.
One more comment - the reason we have our own vocabulary rather than using something like OWL is that MeSH cannot be properly represented as a DAG. You should find the motivating paper by Olivier Bodenreider before proceeding to "flatten" it into a DAG. It may of course work for a specific purpose, but our goal is to fully represent MeSH in RDF without loss of semantic richness.
I misspoke below. MeSH RDF cannot be represented as a tree, but should be able to be represented as a DAG.
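The tree versus DAG distinction can be checked mechanically. The sketch below uses made-up node names rather than real MeSH identifiers: a node with two parents rules out a tree, while the absence of cycles (checked here with Kahn's algorithm) is what a DAG requires.

```python
from collections import defaultdict, deque


def is_acyclic(edges):
    """Return True if the directed graph given as (parent, child) edges has no cycle."""
    children = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
        nodes.update((parent, child))
    # Kahn's algorithm: repeatedly remove nodes that have no remaining parents.
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return visited == len(nodes)


def multi_parent_nodes(edges):
    """Return nodes with more than one distinct parent (impossible in a tree)."""
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
    return {node for node, ps in parents.items() if len(ps) > 1}


# Hypothetical hierarchy: "c" sits under both "a" and "b".
EDGES = [("root", "a"), ("root", "b"), ("a", "c"), ("b", "c")]
print(is_acyclic(EDGES))          # no cycle, so this is a DAG
print(multi_parent_nodes(EDGES))  # "c" has two parents, so it is not a tree
```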
This is probably the easiest solution, since we can list all classes we're interested in. Then there are a few ways to structure the SPARQL query. We really only need two queries: one for nodes and one for relationships. But rdflib is struggling here: it runs indefinitely on queries for which https://id.nlm.nih.gov/mesh/query returns results within seconds. So it might be nice to query a more performant database. You mentioned Virtuoso and neo4j. My main goals are SPARQL support and ease-of-setup. I like neo4j, but it probably isn't the right tool as it's not a native triplestore. I'd also be fine running our queries on the NLM Virtuoso instance, but I couldn't figure out how to access the full results when there were over 1000 results: see #150.
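One generic way to work around a per-request row cap is to page through results with the standard SPARQL LIMIT and OFFSET modifiers. The sketch below assumes the endpoint honors them, which this thread does not confirm for the NLM service; `run_query` is a hypothetical callable standing in for whatever client submits the query, and the fake rows are placeholders.

```python
import re


def with_page(base_query, limit, offset):
    """Append LIMIT/OFFSET to a query that already has a stable ORDER BY."""
    return f"{base_query.rstrip()}\nLIMIT {limit}\nOFFSET {offset}"


def fetch_all(run_query, base_query, page_size=1000):
    """Collect rows page by page until a short (or empty) page is returned."""
    rows = []
    offset = 0
    while True:
        page = run_query(with_page(base_query, page_size, offset))
        rows.extend(page)
        if len(page) < page_size:
            return rows
        offset += page_size


# Stand-in for a real endpoint: slices a fake result list of placeholder IDs.
FAKE_ROWS = [f"mesh:D{i:06d}" for i in range(2500)]


def fake_run_query(query):
    limit = int(re.search(r"LIMIT (\d+)", query).group(1))
    offset = int(re.search(r"OFFSET (\d+)", query).group(1))
    return FAKE_ROWS[offset:offset + limit]


NODE_QUERY = """
SELECT ?d
WHERE { ?d a meshv:TopicalDescriptor }
ORDER BY ?d
"""

all_rows = fetch_all(fake_run_query, NODE_QUERY)
print(len(all_rows))
```

The ORDER BY matters: without a stable ordering, successive pages are not guaranteed to partition the result set.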
Okay, the following papers look relevant. Will review:
rdfs:subClassOf graph
I loaded
Also available as SVG at https://bit.ly/36W5up9.
python source & output graphviz dot
python source
import fsspec
import networkx as nx
import pandas as pd
import rdflib
from networkx.drawing.nx_pydot import write_dot


def sparql_results_to_df(results):
    # Helper not shown in the original post; assumed here to convert an
    # rdflib SPARQL result into a DataFrame with one column per variable.
    return pd.DataFrame(
        data=([None if x is None else x.toPython() for x in row] for row in results),
        columns=[str(x) for x in results.vars],
    )


rdf = rdflib.Graph()
# load MeSH vocabulary
url = "ftp://ftp.nlm.nih.gov/online/mesh/rdf/2020/vocabulary_1.0.0.ttl"
with fsspec.open(url, "rt") as src:
    # https://github.com/HHS/meshrdf/issues/153
    rdf.parse(source=src, format="n3")

# extract every rdfs:subClassOf pair, keeping only the fragment after "#"
query = '''
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?subject_suffix ?object_suffix
WHERE {
  ?subject rdfs:subClassOf ?object .
  BIND( STRAFTER(STR(?subject), "#") AS ?subject_suffix)
  BIND( STRAFTER(STR(?object), "#") AS ?object_suffix)
}
ORDER BY ?subject_suffix ?object_suffix
'''
results = rdf.query(query)
subclass_df = sparql_results_to_df(results)
subclass_df.head(2)

# build a directed graph with edges pointing from superclass to subclass
graph = nx.DiGraph()
for row in subclass_df.itertuples():
    graph.add_edge(row.object_suffix, row.subject_suffix)
write_dot(graph, "mesh-subclassof.dot")
graphviz source
# Medical Subject Headings (MeSH) Vocabulary rdfs:subClassOf graph
digraph {
DescriptorQualifierPair;
AllowedDescriptorQualifierPair;
Descriptor;
CheckTag;
Thing;
Concept;
DisallowedDescriptorQualifierPair;
GeographicalDescriptor;
PublicationType;
Qualifier;
SupplementaryConceptRecord;
SCR_Chemical;
SCR_Disease;
SCR_Organism;
SCR_Protocol;
Term;
TopicalDescriptor;
TreeNumber;
DescriptorQualifierPair -> AllowedDescriptorQualifierPair;
DescriptorQualifierPair -> DisallowedDescriptorQualifierPair;
Descriptor -> CheckTag;
Descriptor -> GeographicalDescriptor;
Descriptor -> PublicationType;
Descriptor -> TopicalDescriptor;
Thing -> Concept;
Thing -> Descriptor;
Thing -> DescriptorQualifierPair;
Thing -> Qualifier;
Thing -> SupplementaryConceptRecord;
Thing -> Term;
Thing -> Thing;
Thing -> TreeNumber;
SupplementaryConceptRecord -> SCR_Chemical;
SupplementaryConceptRecord -> SCR_Disease;
SupplementaryConceptRecord -> SCR_Organism;
SupplementaryConceptRecord -> SCR_Protocol;
}
I am going to close this issue since my original question has been answered. But happy to continue discussion on my subsequent questions.
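Given the subclass edges in the graphviz output above, the explicit UNION query for a parent class can be generated rather than typed by hand. This is a minimal sketch: the edge list is copied from the Descriptor and SupplementaryConceptRecord branches of that output, while the helper names `descendants` and `union_query` are invented for the example.

```python
# (superclass, subclass) pairs copied from the graphviz output above.
SUBCLASS_EDGES = [
    ("Descriptor", "CheckTag"),
    ("Descriptor", "GeographicalDescriptor"),
    ("Descriptor", "PublicationType"),
    ("Descriptor", "TopicalDescriptor"),
    ("SupplementaryConceptRecord", "SCR_Chemical"),
    ("SupplementaryConceptRecord", "SCR_Disease"),
    ("SupplementaryConceptRecord", "SCR_Organism"),
    ("SupplementaryConceptRecord", "SCR_Protocol"),
]


def descendants(root, edges):
    """Return all direct and indirect subclasses of root, sorted by name."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
    found = []
    seen = {root}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                found.append(child)
                stack.append(child)
    return sorted(found)


def union_query(var, classes, prefix="meshv"):
    """Build a SELECT with one UNION branch per concrete class."""
    branches = [f"{{ ?{var} a {prefix}:{cls} }}" for cls in classes]
    body = "\n  UNION ".join(branches)
    return f"SELECT ?{var}\nWHERE {{\n  {body}\n}}"


descriptor_classes = descendants("Descriptor", SUBCLASS_EDGES)
query = union_query("d", descriptor_classes)
print(query)
```

The generated text omits PREFIX declarations and any FROM clause, so those would need to be added to suit the store being queried.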
Very cool - when they ask why we need "the software architect" maintaining this software, I may point to this discussion and ask whether they'd rather have a "principal investigator" from the group that does the science. Feel free to open an issue just to report back how it worked out.
This query returns results (online explorer):
This query returns no results (online explorer):
The difference is that the latter query specifies FROM <http://id.nlm.nih.gov/mesh>. Using FROM <http://id.nlm.nih.gov/mesh/2020> also returns no results.
The original query run via rdflib after loading ftp://ftp.nlm.nih.gov/online/mesh/rdf/2020/mesh2020.nt also returns no results.
I think this is the same issue as #65, but it wasn't clear to me why this is or how to get rdfs:subClassOf relationships.
Thanks for the help... am new to accessing MeSH via SPARQL / RDF.