Assorted feedback: order, equality, reversibility and RDF #9
Comments
Hi Emmanuel, thanks for the feedback! My intent is to keep data "value", "type", and "meaning" mostly separate in the Concise Encoding universe. I want the "value" format (CBE and CTE) to concern itself with providing the minimum amount of information necessary to (in theory) efficiently rebuild the original data structure with correct values after transmission (while not complicating it with data compression techniques). I've left the data format deliberately vague on types because I believe that a moderately sophisticated schema should be able to handle this seldom-changing information, such that it doesn't have to be transmitted with every payload, and can also be ignored by untyped ingestors depending on your use case. A decent schema could also specify ordering and equivalence rules, with sensible defaults for implementations where tight definitions are not so important. The equivalence section is intended more as a default starting point for schemaless designs in order to keep the confusion levels down. I intend to bake more control into the schema format so that equivalence, ordering, etc can be explicitly defined when it's important. I've been reading up on RDF over the weekend, and I like the idea of recording relationship data, but I'm having a hard time wrapping my head around how I'd encode quads (triples would be easy). Specifically, I'd like the encoding to be able to model this sort of thing:
Quads such as
But this only supports triples, not quads... If I could find a way to encode quads into the key-value structure of map entries, that would be ideal. Something along the lines of:
... except I'd want something better, because the above looks messy and hard to read. Also, the meta-relationships end up placed ad-hoc in some other section of the document, which data ingestors would need to understand how to find. This gets tricky fast... One other thing that came from this exercise is IRIs (which I had no idea existed). I think I'll just rename the |
Hey Karl, I'm glad you are digging into RDF! You describe "reification", which is not needed as often as one would imagine. For instance, one could have stand-alone
Have you heard of RDF*? It is a newish reification spec. There's an intro here, a WIP spec, and a note describing a "hacky" implementation (through a form:
One thing that makes RDF serializations a lot more readable is IRI prefixes; if you could have something like that it would really help. There are formats like microdata and RDFa (embedding RDF in HTML), but I think it is better to start with the data and then embed that data into a document. I'm thinking of how to do this with my website (expressed in Turtle). I have a little proof-of-concept of rendering an RDF+markup template too. Maybe concise-encoding could express this sort of thing better than Turtle 🙂 . |
OK, I've been thinking on-and-off about this in my spare time, and I think I've got my head wrapped around most of it now. Please tell me if I've got this right:
Tying the information theory to data communications theory:
Schemas are currently used to enforce structure on data so that foreign systems can safely validate and ingest it. Since schemas don't change much, it seems that the schema would be a good place to codify mappings to concepts. You'd then have two mapping levels in any information system:
The schema would also set out the limitations of what relationships can be used where (e.g. male people can have father_of relationships, but female people cannot). Data classes and such...
Then the transmitted data would only contain a reference to the schema, which the receiver would consult if it needs to derive meaning from the data's relationships, or wants to validate types/structure/format/whatever. In Concise Encoding, this could be done using a metadata map containing a pointer to the schema.
As a side note, it seems that relationships can't be restricted to map-like structures where the map itself is the implied subject (like in JSON-LD). An information system could hold canonical relationship information about data it does not itself control (for example,
Also, since Concise Encoding is merely the transmission format, it doesn't need to concern itself with addressability of the relationships contained in the data; that's what the universal resources (IRIs) are for. For internally referencing such data, the existing marker/reference types are sufficient.
Am I understanding this correctly? Is there anything I've missed? |
OK I think I've got it... My above comments should handle semantic content within the bounds of data defined by a schema, but they don't deal with referencing semantic data. CE would need some changes in order to make things less cumbersome.
Prefixes & Concatenation
Unaided, semantic references become a madness of endless repetition of the same base IRIs with slightly different endings, hence IRI prefixes in Turtle etc. Since prefixes are special definition operations that don't represent actual data but rather references to IRI partials that will be used elsewhere in the document:
CE could accomplish something similar using metadata maps and markers:
To produce this effect, I've made a metadata map containing a list of marked resources. The name "prefixes" is purely arbitrary and could be anything. In fact, the entire structure of the metadata map is arbitrary, and won't affect how the actual data is processed in this case since it's just being used to store the definitions that will be referenced elsewhere in the real data section. Since the metadata is "outside" of the data, we now in effect have reference definitions for "local" and "pub".
To use these definitions, CE requires a new type to represent the concatenation of a resource and a string. An actual "concatenation" operator type would probably work best here, with the restriction that it can only concatenate a string onto a resource. This complicates parsing a little bit by requiring a lookahead, but overall I don't think it's too terrible.
For CBE, I can simply add a new type code for "concatenate". For CTE, I'll need to modify the markup a little bit. The least disruptive, most recognizable approach would be to use
Putting it all together:
In Turtle:
In Concise Encoding:
Relationship data with ad-hoc subjects
Maps are basically sets of relationships where the map itself is the implied subject for each relationship (denoted by key-value pairs). For ad-hoc relationship data where this is not the case, CE would need another type to represent the [subject predicate object] container. I could overload
I think this covers everything? |
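To make the prefix idea concrete, here is a minimal Python sketch of what a decoder might do when it meets a concatenation: look up the marked resource and append the string. This is not CTE or CBE syntax. The `PREFIXES` table, the `concat` helper, and the example.org IRIs are all hypothetical; only the names "local" and "pub" come from the discussion above.

```python
# Hypothetical marker table, standing in for the marked resources kept in a
# metadata map. The IRIs are made up for illustration.
PREFIXES = {
    "local": "http://example.org/local/",
    "pub": "http://example.org/pub/",
}


def concat(prefixes, marker, suffix):
    """Concatenate a string suffix onto the resource that `marker` refers to."""
    # Mirror the proposed restriction: only a string may be concatenated.
    if not isinstance(suffix, str):
        raise TypeError("only a string may be concatenated onto a resource")
    if marker not in prefixes:
        raise KeyError(f"undefined marker: {marker!r}")
    return prefixes[marker] + suffix


# Resolve a prefixed reference into a full IRI.
print(concat(PREFIXES, "pub", "books/1234"))
```

Keeping the table separate from the data mirrors the point that prefix definitions live "outside" the data and are only ever referenced from it.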
To test these ideas, I've converted the examples from https://www.w3.org/TR/turtle to Concise Encoding. The following modifications to CE seem to suffice:
Example 1:
Turtle:
Concise Encoding: There's no
Example 2:
Turtle:
Concise Encoding:
Example 3:
Turtle:
Concise Encoding:
Example 4:
Turtle:
Concise Encoding:
Example 5:
Turtle:
Concise Encoding: The
Example 7:
Turtle:
Concise Encoding:
Example 12:
Turtle:
Concise Encoding:
Example 14:
Turtle:
Concise Encoding: This one is a little cumbersome. I'm not really sure what is the best way to encode this data. You could do something like this:
Or perhaps to keep to only the specific data, put the blank nodes in with the metadata?
Example 15:
Turtle:
Concise Encoding: An anonymous blank node could be represented by
Or just use map notation:
Example 16:
Turtle:
Concise Encoding:
Example 17:
Turtle:
Concise Encoding: I'm putting the blank nodes in the metadata again to keep the focus on the relationship data.
|
Tagged string literals still bother me... They feel like arbitrary constructs that aren't extensible or composable in any way. Here is the standard example in Turtle:
This data represents the statement
Let's first fix this so that the language tagging is consistent:
This is actually two statements with the same subject and predicate:
But this breaks the subject-predicate-object model (there are four pieces of information per statement here). What you actually have is a relationship to an object with multiple properties:
Tagged literals feel like a convenience when language tagging, but they obscure the actual relationship graph. They also completely lack any provenance for the language codes (they just magically mean something), which means that this tagging scheme cannot be extended to anything else (due to the implicit knowledge that must be added), and any additions to this implicit "language code" knowledge would require a re-issuance of the spec or risk incompatibilities between implementations. |
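The rewrite argued for above, where the language becomes ordinary relationship data instead of a tag on the literal, can be sketched in Python. The `untag` helper, the `ex:` predicate names, and the blank-node label are hypothetical, and `"Hello"@en` is used only as a stand-in literal.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def untag(subject: str, predicate: str, text: str, lang: str, node: str) -> List[Triple]:
    """Rewrite one language-tagged literal as plain triples about a node.

    The original statement points at the node, and the text and its language
    become two ordinary properties of that node.
    """
    return [
        Triple(subject, predicate, node),
        Triple(node, "ex:text", text),       # hypothetical predicate
        Triple(node, "ex:language", lang),   # hypothetical predicate
    ]


# "Hello"@en expressed without a tagged literal.
for triple in untag("ex:doc", "ex:greeting", "Hello", "en", "_:b0"):
    print(triple)
```

Every statement here has exactly three parts, and the language predicate is an ordinary resource that could carry its own provenance.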
Karl: I still need to sit down and read your updates attentively, but a few points about string literals. Since this was discussed recently in Clojure's RDF Chat, maybe it will be useful to provide pointers:
Quoting @quoll:
I had the same feelings you are expressing about redundancy, but put in that light I think it makes sense. In any case, note that any value is susceptible to becoming "stringly typed", since you can specify any user-defined type in a string literal: <http://ex.com/subject> <http://ex.com/predicate> "1607253502"^^random:definition-of/unix-timestamp . I'm not well versed in the details of how these things work, but I think those user-defined types follow the conventions defined by the XSD schema language... although in practice it seems a typed string value can be just about anything and the type just any URL you want ... a deserializer would then read that Turtle tag and "hydrate" the string to whatever makes sense in the environment of the programming language: <http://ex.com/subject> <http://ex.com/predicate> "c1 { ... }"^^<https://concise-encoding/...> . 😛 |
Yeah sorry :P I'm coming from zero experience or knowledge here, so this is mostly my journey into the wild world of knowledge systems and semantic data, while at the same time trying not to hobble the Concise Encoding implementation ;-) I can see why they chose to add string tags, but I dunno... I'm not really convinced that the trade-off was worth it. It feels too much like magic for a system that's supposed to rely on formalized descriptions and allow no assumptive knowledge to creep in. The example "chat" was a complete surprise to me. I didn't realize that it's an indicator of a complete semantic meaning. I'd just assumed that it was a generic "This text is in language X" marker ( |
WIP description:
Relationship
A relationship is a container-like structure for making statements about resources in the form of subject-predicate-object triples (like in RDF). Relationships form edges between nodes (resources or values) to build a semantic graph. Local resources are anonymous by default, but can be made addressable by marking them. A relationship is composed of the following three components (in order):
Maps as Relationships
Maps can also be used to represent relationships because they are natural relationship structures (where the map itself is the subject, the key is the predicate, and the value is the object). In Concise Encoding, the key-value pairs of a map are only considered relationships if their types match the requirements for the predicate and object of a relationship. Using maps to represent relationships can make the document more concise and the graph structure easier to follow, but the relationships expressed as key-value pairs cannot be made addressable (and thus cannot be used as resources). This is generally not a problem because few relationships actually need to be used as resources in real-world applications.
Resource
A resource is one of:
Examples: At their most basic, relationships are simply 3-component statements containing a subject, a predicate, and an object:
Using the full URI is tedious, but we can use markers and the concatenation operator to make things more manageable. In the following example, the marked resource pointers are placed in a list (arbitrarily named "rdf") in a top-level metadata map so that they themselves don't constitute data, but can still be referenced from the data.
We can also use map syntax to model most relationships, which often makes the graph more clear to a human reader:
With map syntax, relationships can't be marked. When relationship marking is needed, they must be written using standard relationship statements:
Note: If the previous document were published at
Technically, these would also be accessible, although they would only resolve to resource pointers:
|
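The correspondence between map syntax and explicit relationship statements described in this comment can be sketched in Python (this is not CE syntax). `map_to_triples`, the marker name `stmt1`, and the `ex:` names are hypothetical. The point is only that each key-value pair expands to one subject-predicate-object statement, and that only explicit statements have a handle to hang a marker on.

```python
def map_to_triples(subject, mapping):
    """Expand a map into (subject, predicate, object) triples, one per entry.

    The map's owner is the implied subject, each key is a predicate,
    and each value is an object.
    """
    return [(subject, predicate, obj) for predicate, obj in mapping.items()]


# Map form: concise, but the individual relationships have no identity.
person = {"ex:name": "Alice", "ex:knows": "ex:bob"}
triples = map_to_triples("ex:alice", person)
print(triples)

# Explicit form: a statement stored under a (hypothetical) marker can itself
# be the subject of another statement.
marked = {"stmt1": ("ex:alice", "ex:knows", "ex:bob")}
about_statement = ("stmt1", "ex:assertedBy", "ex:carol")
print(about_statement)
```

The same relationship appears in both forms, but only the marked one can be referred to elsewhere.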
OK, I've read through all of the semantic web and RDF literature on w3.org and I think I've got the important bits now.
Quick description: https://concise-encoding.org/index.html#relationships
Long description: https://github.com/kstenerud/concise-encoding/blob/master/ce-structure.md#relationship
No matter how many ways I look at it, string tags just feel like a mistake. Language is a property (a predicate in fact), and should be recorded as relationship data, not directly into the literal itself. |
Equality
There was an interesting thread going on on HN recently about type definition languages.
Two type definition languages presented these ideas:
serialized . deserialize = identity
I saw that C/E covers some equivalence considerations, but I was wondering whether there's a way, perhaps by picking a subset of the spec, that order, equality and reversibility could be baked in.
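As a rough illustration of what "reversibility" and a defined order would mean in practice, here is a small Python sketch. It uses the standard library's JSON codec purely as a stand-in (no Concise Encoding library is assumed), and the helper names `round_trips` and `canonical` are made up for this example.

```python
import json


def round_trips(value):
    """True if decoding the encoded value gives back an equal value."""
    return json.loads(json.dumps(value)) == value


def canonical(value):
    """Encode with sorted keys so equal maps always produce identical text."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"))


print(round_trips({"a": [1, 2, 3]}))  # True
print(round_trips({1: "x"}))          # False: the integer key comes back as "1"
print(round_trips((1, 2)))            # False: the tuple comes back as a list

# With a fixed key order, equality of values implies equality of encodings.
print(canonical({"b": 1, "a": 2}) == canonical({"a": 2, "b": 1}))  # True
```

JSON only round-trips a subset of Python values, which is the kind of subset the question above has in mind: a restricted profile of the format where the round-trip law holds and ordering is fixed.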
RDF
The design is clearly attempting to cover the use cases of JSON and XML. I thought one last leg to be all-encompassing would be to cover RDF, being able to express triples in as nice a way as Turtle does. I think mapping to RDF would be doable with the current design (say, xsd:string maps to a C/E string, etc.). One thing that is not as easy to express is tagged string literals: "Hello"@en.