-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JSON serialisation format is woefully underspecified #321
Comments
Thank you @gouttegd, very, very welcome nudge. We really need to take a closer look at the other serialisations. IMO RDF is in an even worse state right night, not that it makes this situation much better. Parsing issue:the fact the sssom-py does not parse the JSON again is IMO a sssom-py issue. The problem is that the context is somehow passed wrongly to the conversion function. @hrshdhgd can you create an isolated sssom-py issue that ensures that TSV -> JSON -> TSV always works, and add a few tests that involve custom prefixes?
I think CURIES should be at the very least allowed in that format. I am a bit on the fence of making them required though. If a JSON-LD parser can parse both a document with expanded IRIs and CURIEs into the same internal representation, we should probably permit both. The question is if JSON-LD is really the format most people would be looking for, rather then some more classic, compact "normal" JSON. In any case, what we call JSON right now is JSON-LD, and therefore should be treated as such - this wont change.
I thought so too.. @hrshdhgd can you look into this?
It has not been at the forefront of importance so far, but it is becoming more important now that we are writing our first APIs, cc @anitacaron @udp |
Sure, but is it an issue of
Emphasis on custom prefixes. The problem is made worse by the fact that For example, if you replace |
So in this specific case, how do we handle case-sensitiveness of prefixes?
I understand standardizing prefixes is a huge ask (and an issue greater than us ) but what would be a practical solution? |
I am not sure I understand your question. The problem here has nothing to do with case-sensitiveness. In the provided example, the prefix is declared as But to answer your question, I think prefixes should be treated as case-sensitive and if there is a discrepancy between the Curie map and the actual Curies used in the mapping set (e.g., the Curie map contains a declaration for |
Thanks for clarifying Damien! So in that case, should |
and I use it as such in the rest of the metadata:
This is 100% correct. And this works just as expected as long as long as I am only using the TSV format. But once the mapping set is converted into JSON (using And I still don’t know whether this is because
And the reason I don’t know that is because the spec does not specify the expected behaviour. This is what should be fixed. The spec MUST say explicitly where and how the Curie map should be stored in the JSON serialisation. Only then will it be possible to fix the implementation. |
Aah! I see your point. Apologies for not grasping this from the original issue. |
I would like to revisit this issue immediately after the big sssom-py refactor now. @hrshdhgd can you resurface this in one of our next meetings? |
Re-opening because the root issue has not been addressed at all. The root issue is not the round-tripping bug in SSSOM-Py. It’s the complete absence of an actual specification for the JSON serialisation format. I don’t understand how is that so difficult to grasp. I‘ve created the issue on the repo holding the specification, not the Python implementation. The title of the issue explicitly says: “JSON serialisation format is woefully underspecified”. I’ve explained in the initial comment that I have been trying to understand what the format is supposed to look like, and that doing so is basically impossible given the current state of the spec. I’ve asked precise questions about the format, none of which have been answered. I’ve re-stated again in later comments that “the spec should provide the authoritative answer to that question (by saying what a valid file should look like), and the fact that it does not is (for me, as someone trying to implement the spec) the real issue here” and that “the reason I don’t know that is because the spec does not specify the expected behaviour – this is what should be fixed.” How can I be more clear that this issue is a about a specification problem, not a bug in SSSOM-Py? The round-tripping bug in SSSOM-Py only made the problem worse by making it impossible to infer the JSON format by reverse-engineering SSSOM-Py. Glad that this is now seemingly fixed, but the problem remains that if I want to support the JSON format in SSSOM-Java, I still need to reverse-engineer the reference implementation, which – do I even need to say it ? – is very bad for something aiming to be a standard exchange format! Right now, I feel that the JSON format is solely an internal format that is only supposed to be used by SSSOM-Py, and not intended to be interoperable. If that’s the case, then fine, but then the spec does not even need to mention that format at all. |
In my view we should move to properly define the JSON format, as the main alternative to TSV. The starting point should be https://github.com/mapping-commons/sssom/blob/master/project/jsonschema/sssom_schema.schema.json which is the JSON schema, but I don't know enough about JSON-LD to suggest any concrete steps to a full documentation of the format. @gouttegd do you have a suggestion on how we should start documenting the format properly? What exactly is missing in between the linkml docs, the json schema export we need to document carefully? Do you have a suggestion on how to document? |
The first thing would be to define the expectations of the format. Funnily, the JSON generated by Now SSSOM-Py 0.4.3 generates what seems to be fully valid JSON-LD. I suppose this is the intention? Then first questions the spec MUST answer:
For example, according to the JSON-LD spec it should be possible to do funny things like this: {
"mapping_set_id": "http://purl.obolibrary.org/obo/fbbt/fbbt-mappings.sssom.tsv",
"license": "https://creativecommons.org/licenses/by/4.0/",
"mappings": [
{
"subject_id": "FBbt:00000001",
"predicate_id": "semapv:crossSpeciesExactMatch",
"object_id": "UBERON:0000468",
"mapping_justification": "semapv:ManualMappingCuration"
},
{
"subject_id": "FBbt:00000002",
"predicate_id": "semapv:crossSpeciesExactMatch",
"object_id": "UB:6000002",
"mapping_justification": "semapv:ManualMappingCuration"
"@context": {
"UB": "http://purl.obolibrary.org/obo/UBERON_"
}
}
],
"@type": "MappingSet",
"@context": {
"@vocab": "https://w3id.org/sssom/",
/* Fragments of context skipped for brevity */
"FBbt": "http://purl.obolibrary.org/obo/FBbt_",
"UBERON": "http://purl.obolibrary.org/obo/UBERON_"
}
} where the second mapping uses an embedded context to define another prefix ( I’m not suggesting that using embedded contexts to allow redefining prefixes throughout a mapping set would be a good idea – it would be, in fact, a very bad idea in my opinion. But the JSON-LD spec allows it, for better or for worse. If the SSSOM JSON is in fact built on top of JSON-LD, then people would expect to be able to whatever the JSON-LD spec allows. Likewise, it should also be possible (again, according to the JSON-LD spec), to do this: {
"mapping_set_id": "http://purl.obolibrary.org/obo/fbbt/fbbt-mappings.sssom.tsv",
"license": "https://creativecommons.org/licenses/by/4.0/",
"mappings": [
{
"my_subject": "FBbt:00000001",
"my_predicate": "semapv:crossSpeciesExactMatch",
"my_object": "UBERON:0000468",
"my_justification": "semapv:ManualMappingCuration"
}
],
"@type": "MappingSet",
"@context": {
"@vocab": "https://w3id.org/sssom/",
/* Fragments of context skipped for brevity */
"my_subject": {
"@type": "rdfs:Resource",
"@id": "owl:annotatedSource"
},
"my_object": {
"@type": "rdfs:Resource",
"@id": "owl:annotatedTarget"
},
"my_predicate": {
"@type": "rdfs:Resource",
"@id": "owl:annotatedProperty"
},
"my_justification": {
"@type": "rdfs:Resource",
"@id": "sssom:mapping_justification"
},
}
} where I use custom keys ( Is that supposed to be allowed? Again, not saying this is necessarily a good idea, but JSON-LD does allow that.
(This is tightly related to question 1, because if all features of JSON-LD are supposed to be allowed, then implementations have no choice: they MUST send the file through JSON-LD expansion before they can do anything with it.) In other words: Is the JSON format strictly built on top of JSON-LD, or is it merely a custom JSON format that happens to re-use some (not all) features of JSON-LD? It is important to note that after JSON-LD expansion, the JSON Schema you linked becomes useless, because the keys in the expanded dictionary no longer match what the schema is describing (for example, a tuple Other questions I can think of:
3.a. Whether identifiers are expected to be written in full-length form only, CURIEfied form only, or if both forms are allowed. 3.b. Where and how the CURIE map should be stored (if CURIEfied forms are allowed; of course this is moot if we don’t allow CURIEfied identifiers). Based from the output of
{
"@context": {
"UBERON": "http://purl.obolibrary.org/obo/UBERON_"
}
} is not a valid prefix for compacting IRIs, because it ends with an underscore (valid prefixes intended to be used for compacting IRIs must end with one of those characters: Using the JSON-LD context to store prefix declarations that will not be used by JSON-LD processing and with prefixes that in many cases will not even be valid seems very weird to me. If that is really how the spec wants the Curie map to be stored, it must be very explicit about it – and I would argue that it must explain the rationale behind that choice. |
Sorry, but what the hell is that: "object_type": {
"@context": {
"@vocab": "@null",
"text": "skos:notation",
"description": "skos:prefLabel",
"meaning": "@id"
}
}, |
Ok, all of these are good questions, and I know the answers to exactly none of them. I will see that I schedule a working meeting about this soon. |
My suggestion is to have separate json and jsonld exports. Anything regarding prefixes will be a separate concern from the json export. The json export will be isomorphic to the TSV, the cell values will essentially be the values in the json (I am not providing a spec here, this is just the flavor). All of the things mentioned in this issue that relate to prefix expansion will no be relevant to the json export. I can commit to providing specification (in the form of a stable json-schema), associated docs, and tooling in the short term. jsonld is a more complicated beast to specify, for many of the reasons alluding to in this issue. I also explain the reasons for this here: https://linkml.io/linkml/howtos/using-jsonld.html. tl;dr it's not clear what a spec for a jsonld of a model would look like in the general case, so I don't want to commit to that. However, we can commit in the short term to providing a separate jsonld serialization that will effectively be identical to the json serialization plus injected context. But I would rather separate the discussion of these, as discussion of jsonld always brings in separate concerns that are irrelevant to most json consumers. |
Ok, I am not opposed to that (keeping JSON-LD and JSON side by side, and separate). Does this mean we should focus our standardisation effort on JSON (rather than JSON-LD)? |
I have no opinion on that (well, not quite: I am not that fond of JSON-LD personally, but I can keep that aside), but please note that right now what SSSOM-Py calls “JSON” is really JSON-LD (
What are people using the JSON output for, and do they actually need JSON-LD over plain JSON? |
I need the JSON-LD for the SSSOM-API. |
Resolves [#330] - [x] `docs/` have been added/updated if necessary - [x] `make test` has been run locally - [ ] tests have been added/updated (not applicable) - [ ] [CHANGELOG.md](https://github.com/mapping-commons/sssom/blob/master/CHANGELOG.md) has been updated (not applicable) This PR reorganises the documentation, especially the specification part, as suggested in #330. More precisely: * Several bits of informations that were scattered throughout the website are now regrouped on the index page (e.g. “contacts” and “credits”). * The home page gets a “SSSOM at a glance” section to give an immediate glimpse of what the standard is about. * The specification is broken down as follows: * general introduction * specification of the data model * introduction and notes complementary to the LinkML-generated documentation * LinkML-generated documentation * specification of the serialisation formats * SSSOM/TSV format (completely rewritten) * OWL/RDF format (taken from the pre-existing `spec.md` document) * JSON format (currently merely a placeholder, since the JSON format is unspecified #321) The “resources for users” section is left untouched for now. The urgent part was reorganising the _specification_, so that we can start enriching it to make it ready for 1.0.
The spec basically says, “the JSON serialisation format is whatever LinkML’s dumper classes produce”. That may be enough for people working in languages well supported by LinkML (so, Python and… what else?), but for the others, this is completely inadequate.
As a result, I’m trying to figure out what the JSON serialisation should look like and what
sssom-py
(as the reference implementation) expects, but it is not helped by the fact that even a JSON file produced bysssom convert
(so, presumably, correct… right?) is not accepted as it is bysssom parse
!Consider the following minimal TSV file:
It is accepted without any issue by
sssom convert
to produce a JSON file:And yet
sssom parse -I json
barfs when trying to read the resulting JSON:The JSON file does contain a declaration for the JSON prefix in its JSON-LD context, though:
And yet those prefix declarations seem completely ignored. The only way I got
sssom parse
to accept that JSON file was by specifying an external metadata file containing the Curie map (option-m
).So, I have the following questions about the JSON format:
1) Are (or should) CURIEs be allowed in that format? Obviously
sssom convert
behaves as if they should (since it does not write identifiers in their expanded form). But one could argue that, given the inherent verbosity of the JSON output (compared to the TSV format), the rationale for using CURIEs in that format is tenuous.The spec does hint at the fact that CURIEs may only be intended to be used in the TSV format, based on this bit (emphasis mine):
and on the fact that the notion of Curie map (which is indissociable of the use of CURIEs) is only ever mentioned in the section about the TSV format.
So, maybe JSON output should only use expanded identifiers? In which case this is a problem of
sssom convert
not doing the expansion it should perform when writing JSON.If CURIEs are intended to be used in JSON output, then:
2) How should the Curie map be stored when using JSON output? I thought it was in the JSON-LD context (that’s apparently what
sssom convert
thinks, too), but as illustrated abovesssom parse
does not expect it there (it completely ignores whatever is in that context).Is
sssom parse
wrong to ignore the JSON-LD context, or is intended that the Curie map should always be provided in a separate metadata file? Either way, the spec should clearly state what is expected here.Finally:
3) How important is the JSON format for SSSOM? I can’t help but feel that the JSON format is completely neglected by the spec, so I am seriously considering giving up on trying to support it in SSSOM-Java, unless there is a compelling reason to do so.
The text was updated successfully, but these errors were encountered: