Change the subject/object source type to something like prefix #126

Closed
matentzn opened this issue Dec 8, 2021 · 14 comments · Fixed by #177

@matentzn
Collaborator

matentzn commented Dec 8, 2021

The concern is that there is too much variability in how URLs are written (http vs. https, etc.). @cmungall believes we should prefer a clear prefix-based scheme for source type, rather than URLs. I can see the appeal of that, but I am a bit on the fence here: sometimes it just seems easier to identify a resource by its resource URL.

@cmungall's main concern is that if you merge mapping sets, you may not be able to accurately group on subject and object source because of this variance. For example, you may refer to SNOMED as a source like this: https://www.snomed.org/, or this: http://www.snomed.org/, or this: https://www.snomed.org (don't get hung up on the fact that some ontologies have PURLs; many don't).
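
To make the grouping risk concrete, here is a minimal Python sketch using only the standard library. The list of values and the `normalize_source` helper are hypothetical illustrations, not part of SSSOM or sssom-py, and the normalization rule (ignore scheme and trailing slash) is just one assumed convention.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical subject_source values from three merged mapping sets,
# all intended to mean "SNOMED" (URLs taken from the example above).
sources = [
    "https://www.snomed.org/",
    "http://www.snomed.org/",
    "https://www.snomed.org",
]

# Grouping on the raw strings yields three separate groups.
raw_groups = Counter(sources)


def normalize_source(url: str) -> str:
    """Assumed convention: ignore the scheme and any trailing slash."""
    parts = urlsplit(url)
    return parts.netloc.lower() + parts.path.rstrip("/")


normalized_groups = Counter(normalize_source(s) for s in sources)

print(len(raw_groups))         # 3
print(len(normalized_groups))  # 1
```

Any such normalization is a heuristic that every consumer would have to apply identically, which is part of the argument for a prefix-based scheme.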

Request for comment:

  • Option 1: Use prefixes rather than URLs to denote sources. Example: object_source: UBERON. The context will provide an entry for UBERON to make it unambiguously clear what that refers to, i.e. http://purl.obolibrary.org/obo/uberon.owl. Note that versions etc. are irrelevant here: they are captured by a different field.
  • Option 2: Use URLs, preferring PURLs to denote sources (current solution): object_source: http://purl.obolibrary.org/obo/uberon.owl.

I don't mind Option 1 personally, but I would like some clarity on how UBERON is linked to http://purl.obolibrary.org/obo/uberon.owl. Just answering with "bioregistry" is not sufficient - I would insist on this information being encoded somewhere in the metadata, for example the prefix map.
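
As a rough sketch of what "encoded somewhere in the metadata" could look like for Option 1, the Python below resolves a source prefix against a map carried in the file itself. The key name `source_map` and the `resolve_source` helper are assumptions made for illustration; neither is an SSSOM field or an sssom-py function.

```python
# Hypothetical metadata block of a mapping set. The key "source_map"
# is an assumption for illustration, not an SSSOM field.
metadata = {
    "object_source": "UBERON",
    "source_map": {
        "UBERON": "http://purl.obolibrary.org/obo/uberon.owl",
    },
}


def resolve_source(metadata: dict, field: str) -> str:
    """Return the IRI that the file itself declares for the prefix in `field`."""
    prefix = metadata[field]
    try:
        return metadata["source_map"][prefix]
    except KeyError:
        raise ValueError(
            f"source prefix {prefix!r} is not declared in the metadata"
        ) from None


print(resolve_source(metadata, "object_source"))
# http://purl.obolibrary.org/obo/uberon.owl
```

The point of the sketch is that no external registry is consulted: an undeclared prefix is an error in the file.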

@AlasdairGray

This is a really thorny issue. Just look at the number of IRIs there are for a UniProt entry.

I like the idea of Option 1, but agree that there is a need to capture what the prefix means within the file itself (it may be used a few years down the line and the external registry you rely on may not be available or may have updated its record).

What may help is to have some known alternatives included, e.g.

prefix: UNIPROT
- http://purl.uniprot.org/
- https://purl.uniprot.org/
- http://www.uniprot.org/
- https://www.uniprot.org/
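
One way such a list of known alternatives might be used is sketched below in Python. The `prefix_for` helper and the example entry IRI are hypothetical; only the four UNIPROT base IRIs come from the list above.

```python
from typing import Dict, List, Optional

# Known alternative base IRIs per prefix, copied from the UNIPROT example.
ALTERNATIVES: Dict[str, List[str]] = {
    "UNIPROT": [
        "http://purl.uniprot.org/",
        "https://purl.uniprot.org/",
        "http://www.uniprot.org/",
        "https://www.uniprot.org/",
    ],
}


def prefix_for(iri: str, alternatives: Dict[str, List[str]]) -> Optional[str]:
    """Return the prefix whose listed base IRIs match the start of `iri`."""
    for prefix, bases in alternatives.items():
        if any(iri.startswith(base) for base in bases):
            return prefix
    return None


# The entry path below is made up for illustration.
print(prefix_for("https://www.uniprot.org/uniprot/P12345", ALTERNATIVES))  # UNIPROT
print(prefix_for("https://example.org/other", ALTERNATIVES))               # None
```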

@matentzn
Collaborator Author

matentzn commented Dec 8, 2021

That is exactly right, and a good argument towards Option 1. Thank you @AlasdairGray

@cmungall
Contributor

cmungall commented Dec 8, 2021

Even for OBO, which is well-controlled, there are multiple valid PURLs for different formats, e.g.

We can imagine people getting confused and using a mapping tool that downloads OBO format, hence wanting to put in the OBO URL, despite the fact that for mapping purposes it is semantically identical to the canonical OWL.

and there are different products, e.g.

these 3 will give the same mapping results provided one filters out CL etc but again there is room for confusion

there is also a valid use case of wanting to align some subset of an ontology

We would want to capture somewhere the fact that a mapping was only done on a particular subset and/or version but I don't think the source field is intended for this? I had assumed that id->source would be a static functional mapping regardless of version/subset

On top of that we have some ontologies and ontology translations that have alternative PURLs and no canonical source - e.g. ncbitaxon, fma. But bioregistry at least gives us a shot at a standard for prefix.

Despite all this, I think we can make 2 work for OBO, provided we give very clear documentation and possibly some checks in sssom-py, to the effect that the canonical owl purl should always be used. But even here, as @AlasdairGray points out we need to come up with clear conventions for non-OBO sources.

That is why I favor 1. Of course we still need to resolve biopragmatics/bioregistry#170 to avoid people needing to do an extra normalization step (or we could just mandate lowercase as convention here)

@matentzn
Collaborator Author

matentzn commented Dec 8, 2021

@cmungall how do you suggest we deal with the issue of documenting the prefix? If we refer to IDs, we get:

curie_map:
    UBERON: http://purl.obolibrary.org/obo/UBERON_

Now that is different from saying the source is UBERON. How would you prefer to document the actual source? In an embedded fashion, I mean? This seems... difficult.

@graybeal

graybeal commented Dec 8, 2021

if you want to deal with prefixes, why not deal with them as frontmatter in the mapping? (you know, like an RDF file would)
why would you need a secondary system for resolving prefixes to particular identifiers? Define prefixes in the frontmatter and then the subject/object source type can be whatever the user wants, and still be rigorous and easily consistent with RDF itself.

I believe it's fair to say the W3C semantic community has defined just that one standard for mapping prefixes to IRIs (namely, prefixes named in the actual artifact), and another well-defined method for content resolution. So the idea that multiple services (owl, json, obo) or multiple notations (http, https, with/out closing slash) force us to accept multiple identifiers for the same semantic artifact is weird to me.

@matentzn
Collaborator Author

matentzn commented Dec 9, 2021

That's exactly my problem though: you will need two different prefixes with the exact same string:

one prefix is UBERON which is used for the IRIs:

UBERON: http://purl.obolibrary.org/obo/UBERON_

So I can say: UBERON:123.

The other is UBERON the resource:

object_source: UBERON

That UBERON is the same string, but means something different, i.e. the Uberon ontology. We could encourage something like Wikidata IRIs, but what @cmungall wants, it seems, is to keep the SSSOM table simple. I just right now don't understand how he wants to encode what

object_source: UBERON

means in terms of RDF...
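
A minimal Python sketch of the collision described above: the same string UBERON, expanded through the CURIE map, yields the term namespace, not the identifier of the ontology. The `expand` helper is hypothetical; the IRIs are the ones quoted in this thread.

```python
curie_map = {"UBERON": "http://purl.obolibrary.org/obo/UBERON_"}


def expand(curie: str, prefix_map: dict) -> str:
    """Expand PREFIX:LOCAL_ID using the prefix map."""
    prefix, local_id = curie.split(":", 1)
    return prefix_map[prefix] + local_id


# Role 1: UBERON as the namespace prefix of term identifiers.
term_iri = expand("UBERON:123", curie_map)

# Role 2: UBERON as the resource. The namespace the prefix expands to
# is not the ontology's own identifier.
namespace = curie_map["UBERON"]
ontology_iri = "http://purl.obolibrary.org/obo/uberon.owl"

print(term_iri)                   # http://purl.obolibrary.org/obo/UBERON_123
print(namespace == ontology_iri)  # False
```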

@cmungall
Contributor

cmungall commented Dec 9, 2021

A general problem with the OBO schema is that URIs such as http://purl.obolibrary.org/obo/UBERON_ are not resolvable (by convention - we could fix this on a per ontology basis) and just non-idiomatic.

So I appreciate that the RDF-level representation is a little unsatisfactory. We could leave as literals but I think it's bad practice to model entities as literals. And the OBO base PURL is awkward.

Sorry to do this at this stage, but I think that a CURIE in this field is best.

I think Wikidata CURIEs are a good suggestion actually. They don't violate simplicity, but they do add some opaqueness (in that a lookup is required).

Another option is to use the infores prefix we are using in Translator (cc @sierra-moxon, @mbrush; NCATSTranslator/TranslatorArchitecture#59 (comment) -- btw we need to register this in bioregistry)

Or we could use CURIEs with a recommendation to use bioregistry prefix, e.g. bioregistry:uberon --> https://bioregistry.io/bioregistry:uberon

Note this issue of how do we refer to a resource has been a longstanding issue on a number of projects...

@graybeal

graybeal commented Dec 9, 2021

@matentzn Umm, do I correctly understand that you are representing two different things: (1) the namespace prefix (to which the unique fragment gets appended) and (2) the actual ontology identifier? If that is true then they are two different entities, and they should be two different prefixes (if people are that attached to using the acronyms). The RDF is then clear. Sorry if I am missing something obvious.

@matentzn
Collaborator Author

matentzn commented Dec 9, 2021

@graybeal, no, you are exactly describing my problem :P

But @cmungall's suggestions above could work.

@graybeal

If they are the same strings, the same prefix makes sense. If they are not the same strings, the same prefix does not make sense, and wanting to use the same prefix in two different contexts in that different-strings case is not simplifying anything.
I don't think it matters whether you call it CURIEs or prefixes, it's fine with me if they are in there for simplification, as long as there isn't any inconsistency in their translation to strings—that needs to be unambiguously defined in the CSV.

matentzn self-assigned this Apr 11, 2022
matentzn added this to the 1.0.0 milestone Apr 15, 2022
@matentzn
Collaborator Author

matentzn commented May 7, 2022

Alright then, shall we stick with subject/object source being IRIs/CURIEs, and recommend using Wikidata to identify sources? I bet many people will want to simply add a URL, so we should probably allow this as well?

@graybeal

graybeal commented May 7, 2022

I've reached the limit of my competence. Treat the rest of this post with some skepticism.

I agree I don't know how to generate an RDF equivalent for "object_source: UBERON_ONT", except that SSSOM can 'just know' that UBERON_ONT needs to be expanded to the string defined for that prefix ("http://purl.obolibrary.org/obo/uberon.owl", say).

A URL is an IRI, so not sure I understand you. If you stick with IRIs/CURIEs (I assume that means "IRIs or CURIEs", yes?) you will be entirely consistent with RDF, which I think is super-great!

Of course it doesn't prevent people from using different strings to define the same thing. But nothing really does. What comes closest is for the originator of the semantic artifact to define (a) the identifier for that artifact and (b) the namespace for that artifact (ideally different strings, so they can be distinguished). Then everyone knows the 'right' answer, and even if it has to change because http gets replaced with https, everyone can follow along over time and update their references to match the latest reality.

If all of this discussion is about the case where the authors didn't specify the identifier in the first place, then recommending a practice for determining the identifier is great. But don't make the software look it up—whoever creates the mapping file has to put in the prefix-string mapping explicitly. If you don't do it that way—if instead you do tricky things with looking up external resources from your SSSOM tool chain—you are building an edifice that is dependent on ALL implementing software to do external lookups (and that those external lookups continue to work). I understand the desire to add a level of indirection, but it is not Simple in my eyes.

@matentzn
Collaborator Author

matentzn commented May 7, 2022

> A URL is an IRI, so not sure I understand you. If you stick with IRIs/CURIEs (I assume that means "IRIs or CURIEs", yes?) you will be entirely consistent with RDF, which I think is super-great!

Yeah sorry, I meant: allow an IRI in the table format, which we have designed to require entities in CURIE format (so rather than obo:uberon.owl we should allow http://purl.obolibrary.org/obo/uberon.owl as a value as well). In RDF both would be the same anyways, this is just a syntactic question on the table format.

> Of course it doesn't prevent people from using different strings to define the same thing.

I think the suggestion here is to use UBERON: http://purl.obolibrary.org/obo/UBERON_ to denote the namespace of a term and wikidata:Q7876491 or <https://www.wikidata.org/wiki/Q7876491> to refer to the source. It is not clear that this is really much better than <http://purl.obolibrary.org/obo/uberon.owl>, which is the official PURL, but according to @cmungall the OBO PURL is sort of sub-optimal. Another option is obo:uberon. This even resolves to something, in this case, but we should probably let this resolve to the OBO Foundry website pages for all ontologies if we wanted to go that way.
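
A small Python sketch of accepting either form in the table: a value in a source field is either a full IRI or a CURIE, and both end up as the same IRI. The `to_iri` helper and the two prefix expansions are assumptions chosen to reproduce the IRIs quoted in this thread; they are not taken from the SSSOM specification.

```python
# Assumed prefix map; the expansions mirror the IRIs quoted in this thread.
prefix_map = {
    "obo": "http://purl.obolibrary.org/obo/",
    "wikidata": "https://www.wikidata.org/wiki/",
}


def to_iri(value: str, prefix_map: dict) -> str:
    """Accept either a full IRI or a CURIE and return a full IRI."""
    if value.startswith(("http://", "https://")):
        return value  # already an IRI, pass through unchanged
    prefix, local_id = value.split(":", 1)
    return prefix_map[prefix] + local_id


print(to_iri("obo:uberon.owl", prefix_map))
# http://purl.obolibrary.org/obo/uberon.owl
print(to_iri("http://purl.obolibrary.org/obo/uberon.owl", prefix_map))
# http://purl.obolibrary.org/obo/uberon.owl
print(to_iri("wikidata:Q7876491", prefix_map))
# https://www.wikidata.org/wiki/Q7876491
```

Because the prefix map travels with the file, the expansion needs no external lookup.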

> But don't make the software look it up

I agree. A human may need to look up what wikidata:Q7876491 means, but a machine will not care.

@matentzn
Collaborator Author

matentzn commented May 20, 2022

Can I get everyone to sign off on #177?


4 participants