Change the subject/object source type to something like prefix #126
Comments
This is a really thorny issue. Just look at the number of IRIs there are for a UniProt entry. I like the idea of Option 1, but agree that there is a need to capture what the prefix means within the file itself (it may be used a few years down the line and the external registry you rely on may not be available or may have updated its record). What may help is to have some known alternatives included, e.g.
That is exactly right, and a good argument towards Option 1. Thank you @AlasdairGray
Even for OBO, which is well-controlled, there are multiple valid PURLs for different formats, e.g.
We can imagine people getting confused and using a mapping tool that downloads OBO format, hence wanting to put in the OBO URL, despite the fact that for mapping purposes it is semantically identical to the canonical OWL. There are also different products, e.g.
These 3 will give the same mapping results provided one filters out CL etc., but again there is room for confusion. There is also a valid use case of wanting to align some subset of an ontology.
We would want to capture somewhere the fact that a mapping was only done on a particular subset and/or version, but I don't think the source field is intended for this? I had assumed that id->source would be a static functional mapping regardless of version/subset.
On top of that, we have some ontologies and ontology translations that have alternative PURLs and no canonical source, e.g. ncbitaxon, fma. But bioregistry at least gives us a shot at a standard for prefix.
Despite all this, I think we can make Option 2 work for OBO, provided we give very clear documentation and possibly some checks in sssom-py, to the effect that the canonical OWL PURL should always be used. But even here, as @AlasdairGray points out, we need to come up with clear conventions for non-OBO sources. That is why I favor Option 1.
Of course we still need to resolve biopragmatics/bioregistry#170 to avoid people needing to do an extra normalization step (or we could just mandate lowercase as convention here).
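A check of the kind suggested for sssom-py could look roughly like the following. This is a minimal sketch in plain Python, not an existing sssom-py function: the name `is_canonical_obo_owl_purl` and the rule it encodes (the OBO PURL base followed by a single lowercase ontology ID and `.owl`) are assumptions made for illustration.

```python
import re

# Assumed convention: the canonical source is the top-level OWL PURL,
# i.e. http://purl.obolibrary.org/obo/<lowercase id>.owl
CANONICAL_OWL_PURL = re.compile(r"http://purl\.obolibrary\.org/obo/[a-z0-9_]+\.owl")


def is_canonical_obo_owl_purl(source: str) -> bool:
    """Return True only for a top-level, lowercase OBO OWL PURL (hypothetical rule)."""
    return CANONICAL_OWL_PURL.fullmatch(source) is not None


# The canonical OWL PURL passes; the OBO-format PURL for the same ontology does not.
print(is_canonical_obo_owl_purl("http://purl.obolibrary.org/obo/uberon.owl"))
print(is_canonical_obo_owl_purl("http://purl.obolibrary.org/obo/uberon.obo"))
```

A check like this only helps for OBO sources; it says nothing about how non-OBO sources should be identified.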
@cmungall how do you suggest we deal with the issue of documenting the prefix? If we refer to IDs, we get:
Now that is different from saying the source is UBERON. How would you prefer to document the actual source? In an embedded fashion, I mean? This seems... difficult.
If you want to deal with prefixes, why not deal with them as frontmatter in the mapping (you know, like an RDF file would)? I believe it's fair to say the W3C semantic community has defined just that one standard for mapping prefixes to IRIs (namely, prefixes named in the actual artifact), and another well-defined method for content resolution. So the idea that multiple services (owl, json, obo) or multiple notations (http, https, with/without closing slash) force us to accept multiple identifiers for the same semantic artifact is weird to me.
That's exactly my problem though: you will need two different prefixes with the exact same string. One prefix is UBERON, which is used for the IRIs:
So I can say: UBERON:123. The other is UBERON the resource:
That UBERON is the same string, but means something different, i.e. the Uberon ontology. We could encourage something like Wikidata IRIs, but what @cmungall wants, it seems, is to keep the SSSOM table simple. Right now I just don't understand how he wants to encode what
means in terms of RDF...
A general problem with the OBO schema is that URIs such as
So I appreciate that the RDF-level representation is a little unsatisfactory. We could leave them as literals, but I think it's bad practice to model entities as literals. And the OBO base PURL is awkward.
Sorry to do this at this stage, but I think that a CURIE in this field is best. I think Wikidata CURIEs are a good suggestion actually. It doesn't violate simplicity, but it does add some opaqueness (in that a lookup is required).
Another option is to use the
Or we could use CURIEs with a recommendation to use the bioregistry prefix, e.g. bioregistry:uberon --> https://bioregistry.io/bioregistry:uberon
Note that this issue of how we refer to a resource has been a longstanding issue on a number of projects...
@matentzn Umm, do I correctly understand that you are representing two different things: (1) the namespace prefix (to which the unique fragment gets appended) and (2) the actual ontology identifier? If that is true then they are two different entities, and they should be two different prefixes (if people are that attached to using the acronyms). The RDF is then clear. Sorry if I am missing something obvious.
If they are the same strings, the same prefix makes sense. If they are not the same strings, the same prefix does not make sense, and wanting to use the same prefix in two different contexts in that different-strings case is not simplifying anything.
Alright then, shall we stick with subject/object source being IRIs/CURIEs, and recommend using Wikidata to identify sources? I bet many people will want to simply add a URL, so we should probably allow this as well?
I've reached the limit of my competence. Treat the rest of this post with some skepticism.
I agree I don't know how to generate an RDF equivalent for "object_source: UBERON_ONT", except that SSSOM can 'just know' that UBERON_ONT needs to be expanded to the string defined for that prefix ("http://purl.obolibrary.org/obo/uberon.owl", say).
A URL is an IRI, so I'm not sure I understand you. If you stick with IRIs/CURIEs (I assume that means "IRIs or CURIEs", yes?) you will be entirely consistent with RDF, which I think is super-great! Of course it doesn't prevent people from using different strings to define the same thing. But nothing really does. What comes closest is for the originator of the semantic artifact to define (a) the identifier for that artifact and (b) the namespace for that artifact (ideally different strings, so they can be distinguished). Then everyone knows the 'right' answer, and even if it has to change because http gets replaced with https, everyone can follow along over time and update their references to match the latest reality.
If all of this discussion is about the case where the authors didn't specify the identifier in the first place, then recommending a practice for determining the identifier is great. But don't make the software look it up: whoever creates the mapping file has to put in the prefix-string mapping explicitly. If you don't do it that way, and instead do tricky things with looking up external resources from your SSSOM tool chain, you are building an edifice that depends on ALL implementing software doing external lookups (and on those external lookups continuing to work). I understand the desire to add a level of indirection, but it is not Simple in my eyes.
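To make the "put the prefix-string mapping in the file explicitly" point concrete, here is a minimal sketch in plain Python. It is not SSSOM or sssom-py code: the function names are hypothetical, `UBERON_ONT` is the illustrative source prefix from the comment above, and the `UBERON` namespace shown is the commonly used OBO PURL base, included here as an assumption. Everything is resolved from a prefix map carried with the mapping set, with no external lookup.

```python
# Prefix map as it might be declared in the mapping file's own metadata.
# One entry is the term namespace, the other identifies the ontology itself.
PREFIX_MAP = {
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_",
    "UBERON_ONT": "http://purl.obolibrary.org/obo/uberon.owl",
}


def expand_curie(curie: str, prefix_map: dict[str, str]) -> str:
    """Expand prefix:local_id using only the supplied prefix map."""
    prefix, sep, local_id = curie.partition(":")
    if not sep:
        raise ValueError(f"not a CURIE: {curie!r}")
    if prefix not in prefix_map:
        raise KeyError(f"prefix {prefix!r} is not declared in the prefix map")
    return prefix_map[prefix] + local_id


def resolve_source(source_prefix: str, prefix_map: dict[str, str]) -> str:
    """Resolve a bare source prefix (e.g. an object_source value) to its IRI."""
    if source_prefix not in prefix_map:
        raise KeyError(f"source {source_prefix!r} is not declared in the prefix map")
    return prefix_map[source_prefix]


print(expand_curie("UBERON:123", PREFIX_MAP))
print(resolve_source("UBERON_ONT", PREFIX_MAP))
```

Because the two entries use different prefix strings, the term namespace and the ontology identifier stay distinguishable, and an undeclared prefix fails loudly instead of triggering a registry lookup.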
Yeah sorry, I meant: allow an IRI in the table format, which we have designed to require entities in CURIE format (so rather than
I think the suggestion here is to use
I agree. You may need to look up what
Can I get everyone to sign off on #177?
The concern is that the variability in how URLs can be written is too large (http, https, etc.). @cmungall believes we should prefer a clear prefix-based scheme for source type, rather than URLs. I can see the appeal of that, but I am a bit on the fence here: sometimes it just seems easier to identify a resource by its resource URL.
@cmungall's main concern is that if you merge mapping sets, you may not be able to accurately group on subject and object source due to the above risk of variance. For example, you may refer to SNOMED as a source like this: https://www.snomed.org/, or this: http://www.snomed.org/, or this: https://www.snomed.org (don't get hung up on the fact that some ontologies have PURLs; many don't).
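The grouping problem can be illustrated with a few lines of plain Python. This is only a sketch: the three URL spellings are the ones quoted above, while the `SNOMED` prefix and the helper name `group_counts` are assumptions for illustration.

```python
from collections import Counter

# object_source values from three mapping sets being merged,
# each spelling the SNOMED source URL slightly differently.
url_sources = [
    "https://www.snomed.org/",
    "http://www.snomed.org/",
    "https://www.snomed.org",
]

# The same rows if an agreed prefix were used instead (hypothetical prefix).
prefix_sources = ["SNOMED", "SNOMED", "SNOMED"]


def group_counts(sources):
    """Count rows per distinct source string, as a naive group-by would."""
    return Counter(sources)


url_groups = group_counts(url_sources)
prefix_groups = group_counts(prefix_sources)

# The URL spellings fall into separate groups; the prefix yields a single group.
print(len(url_groups), len(prefix_groups))
```

URL normalization could reduce the number of spurious groups, but it would have to be applied identically by every tool that merges mapping sets.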
Request for comment:
Option 1: object_source: UBERON. The context will provide an entry for UBERON to ensure it is unambiguously clear what that refers to, i.e. http://purl.obolibrary.org/obo/uberon.owl. Note that versions etc. have nothing to do with this: they are captured by a different field.
Option 2: object_source: http://purl.obolibrary.org/obo/uberon.owl.
I don't mind Option 1 personally, but I would like some clarity on how UBERON is linked to http://purl.obolibrary.org/obo/uberon.owl. Just answering with "bioregistry" is not sufficient. I would insist on this information being encoded somewhere in the metadata, for example the prefix map.