RDF-Star: Some biological database use cases #19
Comments
If I understand you correctly, the closing two example queries should actually be
SELECT * WHERE { << <http://purl.uniprot.org/uniprot/X> ?p ?o >> ?p1 ?o1 }
and
SELECT * WHERE { << <http://identifiers.org/uniprot/X> ?p ?o >> ?p1 ?o1 }
Sorry for the delay in getting this use case organized. I don't fully understand the part of your use case where you say:
I have started creating a wiki page for your use case, that will eventually contain a clean description of the use case when we finish teasing out all its aspects. Please take a look at https://github.com/w3c/rdf-ucr/wiki/RDF-star-for-explanation-and-provenance-in-biological-data
What can the pattern implicit in the initial example represent that a pattern relying on named graphs cannot?
This offers the advantage that it can compactly annotate related triples.
We would end up with multiple graph membership patterns in practice, and that will not be so easy to query for either. For example, one use case we have is loading a number of UniProt releases into different named graphs. At that point it becomes really hard to query for an "attribution" that we put into release 1 and that is not present in release 2.
If you are actually distinguishing quads, rather than triples, and your statement ids incorporate four terms rather than just three, why should it be difficult to distinguish the attributions?
@lisp the query becomes something like this when using named graphs, which is not really nicer than the reification we have now ;)
SELECT *
WHERE {
GRAPH release:1 {
?p up:fullName "3-beta-hydroxysterol Delta (14)-reductase" .
?attr up:manual true ;
up:evidence ECO:0000303 ;
up:source citation:16784888 .
}
GRAPH ?attr {
?p up:fullName "3-beta-hydroxysterol Delta (14)-reductase" .
?attr up:manual true ;
up:evidence ECO:0000303 ;
up:source citation:16784888 .
}
}
This selects the attributions of a certain kind that were in a specific release.
For named graphs to be used in this case, wouldn't the attribution facts still be in the "release" graph? But indeed, the "quoted triple" could be expressed as a singleton graph (or indeed, multiple triples if that's preferred). So this would be in one specific UniProt release:
graph release:1 {
<Q14739#SIP9E6E0C5B850FBF4F> up:fullName "3-beta-hydroxysterol Delta (14)-reductase" .
# This annotates the above triple, which is quoted in the uniquely named graph (further down):
<urn:tdb:2014:urn:md5:8e08a975b841666a8ff0b7e42e73275a> up:attribution <Q14739#attribution-XX> .
<Q14739#attribution-XX> up:manual true ;
up:evidence ECO:0000303 ;
up:source citation:16784888 .
}
And this would be the same in all releases where the triple is annotated (or merely quoted if it is talked about but not asserted in some specific release):
# The triple, quoted by keeping it in an "existential" singleton named graph,
# uniquely named by a checksum of its NQuads representation:
graph <urn:tdb:2014:urn:md5:8e08a975b841666a8ff0b7e42e73275a> {
<Q14739#SIP9E6E0C5B850FBF4F> up:fullName "3-beta-hydroxysterol Delta (14)-reductase" .
}
Selecting attributions of a certain kind in a release would then be:
select * {
graph ?triple {
?prot up:fullName "3-beta-hydroxysterol Delta (14)-reductase" .
}
graph release:1 {
# This is the only thing ensuring that ?triple isn't bound to any
# other named graphs (release) where it is asserted:
?triple up:attribution ?attr .
# The attribution criteria:
?attr up:manual true ;
up:evidence ECO:0000303 ;
up:source citation:16784888 .
# Needed if the triple must also be asserted in this release:
?prot up:fullName "3-beta-hydroxysterol Delta (14)-reductase" .
}
}
Of course, this is a "hack", with non-standardized, manual requirements:
These can be quite challenging (and some could prove a no-go depending on backend), so this pattern is arguably not good enough as a recommended practice. I do, however, wonder how much further quoted triples need to go beyond supporting it. (See https://lists.w3.org/Archives/Public/public-rdf-star-wg/2023May/0063.html for more on this idea.) A careful reading of https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3199260, p. 6-7, is also advisable when considering this pattern and possible semantics thereof. (For one, it speculates on a regime where the above could entail
See also rdf-concepts#46.
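To make the join in the last query above concrete, here is a minimal Python sketch that evaluates the same pattern over an in-memory list of quads. It is illustrative only: the `QUADS` list, the `select_attributions` helper, the plain-string terms, and the `release:2` graph are assumptions made for this sketch, not part of UniProt or of any SPARQL engine.

```python
# Quads are (subject, predicate, object, graph) tuples of plain strings.
TRIPLE_GRAPH = "urn:tdb:2014:urn:md5:8e08a975b841666a8ff0b7e42e73275a"
NAME = '"3-beta-hydroxysterol Delta (14)-reductase"'
PROT = "Q14739#SIP9E6E0C5B850FBF4F"
ATTR = "Q14739#attribution-XX"

QUADS = [
    # Singleton graph holding the quoted triple.
    (PROT, "up:fullName", NAME, TRIPLE_GRAPH),
    # Release 1 asserts the triple and annotates it.
    (PROT, "up:fullName", NAME, "release:1"),
    (TRIPLE_GRAPH, "up:attribution", ATTR, "release:1"),
    (ATTR, "up:manual", "true", "release:1"),
    (ATTR, "up:evidence", "ECO:0000303", "release:1"),
    (ATTR, "up:source", "citation:16784888", "release:1"),
    # Release 2 (hypothetical) asserts the triple without annotating it.
    (PROT, "up:fullName", NAME, "release:2"),
]

CRITERIA = [
    ("up:manual", "true"),
    ("up:evidence", "ECO:0000303"),
    ("up:source", "citation:16784888"),
]


def select_attributions(quads, name, release, criteria):
    """Return (triple_graph, protein, attribution) rows for one release."""
    quad_set = set(quads)
    rows = []
    for s, p, o, g in quads:
        # graph ?triple { ?prot up:fullName name }
        if p != "up:fullName" or o != name:
            continue
        for s2, p2, attr, g2 in quads:
            # graph release { ?triple up:attribution ?attr }
            if (s2, p2, g2) != (g, "up:attribution", release):
                continue
            # The attribution criteria, all inside the release graph.
            if not all((attr, cp, co, release) in quad_set for cp, co in criteria):
                continue
            # The triple must also be asserted in this release.
            if (s, p, o, release) in quad_set:
                rows.append((g, s, attr))
    return rows


release_1 = select_attributions(QUADS, NAME, "release:1", CRITERIA)
release_2 = select_attributions(QUADS, NAME, "release:2", CRITERIA)
```

In this toy data only release 1 yields a row, because the `up:attribution` statement is the only thing tying the singleton graph name to a release.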
See https://github.com/w3c/rdf-ucr/wiki/RDF-star-for-explanation-and-provenance-in-biological-data for a clean version of this use case
** Contact information
** Brief Description of your use case:
In UniProt we want to refer to triples to explain or "attribute" why they were added to the UniProtKB graphs.
These triples are always asserted and we might have multiple explanations/attributions or none at all.
The explanations and attributions are themselves complicated resources named by an IRI.
At this moment we use RDF reification with consistent IRIs for each triple.
This syntax is inconvenient and also hard to optimize in general. This is important when the RDF graphs are 100+ billion triples in size.
The example above is evidence to support why a certain protein is described with a specific name.
As the data is extremely large, we cannot afford to maintain mappings that depend on the order of visitation inside a single file to derive a temporary IRI (e.g. in RDF/XML the rdf:ID uniqueness constraint is violated, and expensive to check for in UniProt, when using it for reification quads). In other words, the identity function for deriving an id for a triple should be stateless and safe to invoke multiple times; we should not be forced to gather all triples that use a triple reference into one co-located set.
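As a rough illustration of what a stateless identity function could look like, the Python sketch below hashes the three terms of a triple. Everything in it is an assumption for illustration: the `triple_id` name, the `urn:sha256:` prefix, the example predicate IRI, and the choice of SHA-256 are not taken from UniProt or from any RDF-star draft, and the sketch ignores blank nodes and literal normalization.

```python
import hashlib


def triple_id(subject: str, predicate: str, obj: str) -> str:
    """Derive an identifier from the triple's terms alone.

    Terms are expected in N-Triples form, e.g. '<http://example.org/s>'.
    No counter, file position, or lookup table is involved, so the
    function can be called any number of times, in any process.
    """
    line = f"{subject} {predicate} {obj} ."
    digest = hashlib.sha256(line.encode("utf-8")).hexdigest()
    return f"urn:sha256:{digest}"


# Hypothetical example terms.
S = "<http://purl.uniprot.org/uniprot/X>"
P = "<http://example.org/fullName>"
O = '"3-beta-hydroxysterol Delta (14)-reductase"'

first = triple_id(S, P, O)   # e.g. computed while writing one file
second = triple_id(S, P, O)  # e.g. computed later, in another file
```

Because the id depends only on the terms, two writers that never see each other's output still agree on it, which is the property the paragraph above asks for.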
Our use cases for un-asserted triples are extremely rare and would preferably be described explicitly as "inversions" of the normal case, or explicit non-membership of a class, e.g. something like the following
For other databases we might want to do things like .
and then use the "star" syntax for quickly selecting the triples we have a high confidence for.
*** What you want to be able to do:
Talk about why triples are added to the dataset and how confident our users should be in trusting them.
*** What is the role of RDF-star quoted triples in your use case:
Quoted triples (or content-identified triples) would replace the use case for RDF reification by allowing a more convenient and clearer way to talk about "edges" in an RDF graph.
*** Why it is hard or impossible to do what you want to do without quoted triples:
Reification is not only a lot of typing to get right; it is also difficult to optimize in the general case for SPARQL engines.
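To give a feel for the typing overhead, here is a small Python sketch that expands one triple into the standard RDF reification vocabulary (rdf:Statement, rdf:subject, rdf:predicate, rdf:object). The `reify` helper and the `ex:statement-1` IRI are hypothetical names used only for this illustration.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def reify(statement_iri, subject, predicate, obj):
    """Return the four standard reification triples for one triple."""
    return [
        (statement_iri, RDF + "type", RDF + "Statement"),
        (statement_iri, RDF + "subject", subject),
        (statement_iri, RDF + "predicate", predicate),
        (statement_iri, RDF + "object", obj),
    ]


# Hypothetical statement IRI; the other terms come from the thread above.
reified = reify(
    "ex:statement-1",
    "Q14739#SIP9E6E0C5B850FBF4F",
    "up:fullName",
    '"3-beta-hydroxysterol Delta (14)-reductase"',
)

# The annotation itself is one more triple on top of the four above.
annotation = ("ex:statement-1", "up:attribution", "Q14739#attribution-XX")
extra_triples = len(reified) + 1
```

So each annotated triple costs five additional triples under reification, and a query has to join on all four reification properties to get back to the original triple.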
*** How you want quoted triples to behave in your use case:
(For example, do you want the precise syntax of subjects, predicates, and objects in quoted triples to be important?)
They must be transparent for OWL reasoning. UniProt is re-used and re-mixed in many different end-user databases. In these they might use different identifiers and map them with owl:sameAs, e.g. often
<http://identifiers.org/uniprot/X> owl:sameAs <http://purl.uniprot.org/uniprot/X> .
Given that sameAs relation, all queries should be able to use either of these identifiers and get the "same" result.
and
must return the same results in an owl:sameAs aware setting.
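One possible reading of this requirement, sketched in Python: map every IRI to a representative of its owl:sameAs equivalence class before comparing the subject of a quoted triple. The `canonicalizer` and `match_quoted` helpers, the tuple encoding of quoted triples, and the sample annotation are assumptions for illustration; real engines may implement sameAs handling quite differently.

```python
def canonicalizer(same_as_pairs):
    """Build a function mapping each IRI to one representative of its
    owl:sameAs equivalence class (a small union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the lexicographically smaller IRI as representative so
            # the result does not depend on the order of the pairs.
            small, large = sorted((ra, rb))
            parent[large] = small
    return find


def match_quoted(annotations, pattern_subject, canon):
    """Return annotations whose quoted triple has a subject equivalent
    to pattern_subject under owl:sameAs."""
    target = canon(pattern_subject)
    return [
        (quoted, p1, o1)
        for quoted, p1, o1 in annotations
        if canon(quoted[0]) == target
    ]


PURL = "http://purl.uniprot.org/uniprot/X"
IDORG = "http://identifiers.org/uniprot/X"
canon = canonicalizer([(IDORG, PURL)])

# Hypothetical data: one annotation on a quoted triple about PURL.
ANNOTATIONS = [
    ((PURL, "up:fullName", '"some name"'), "up:attribution", "attr-1"),
]

via_purl = match_quoted(ANNOTATIONS, PURL, canon)
via_idorg = match_quoted(ANNOTATIONS, IDORG, canon)
```

Both lookups return the same annotation here, which is the behaviour asked for; a syntactically opaque treatment of quoted triples would return nothing for the identifiers.org IRI.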