Map keys not checked for uniqueness (must they be?) #28

Open
kengruven opened this issue Jun 12, 2021 · 6 comments

Comments

@kengruven

The CE spec says, in various places:

"By default, a map is unordered and does not allow duplicate keys."
"If a container disallows duplicates, duplicate entries are structural errors."
"A decoder MUST stop processing and issue a diagnostic when a structural error occurs."

Yet there's also a section "Key Collisions" which says:

"Key collisions occur when two or more identical keys are present in the same encoded map. Systems could potentially deal with this situation in a number of different ways…"

and goes on to list four possible solutions. The fourth is "abort processing", but the first three seem non-conformant.

Anyway, enctool seems to pick a fifth option: pass duplicate keys through without error or warning.

$ echo 'c1 {1 = 1 1 = 2}' | ./enctool convert -df cbe | hexdump -C
00000000  03 00 79 01 01 01 02 7b                           |..y....{|
00000008
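
For reference, here's a minimal sketch of the check I'd expect a conforming decoder to perform (illustrative Go only, not the go-concise-encoding API):

package main

import (
	"errors"
	"fmt"
)

// checkMapKeys scans decoded key/value pairs and fails on the first repeated
// key, which is what the spec's "structural error" rule appears to require.
// Purely a sketch; keys are stringified here for brevity.
func checkMapKeys(pairs [][2]string) error {
	seen := make(map[string]struct{}, len(pairs))
	for _, kv := range pairs {
		if _, dup := seen[kv[0]]; dup {
			return errors.New("structural error: duplicate map key " + kv[0])
		}
		seen[kv[0]] = struct{}{}
	}
	return nil
}

func main() {
	// Mirrors c1 {1 = 1 1 = 2} from the example above.
	fmt.Println(checkMapKeys([][2]string{{"1", "1"}, {"1", "2"}}))
}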
@kstenerud
Owner

kstenerud commented Jun 13, 2021

Hmm, the descriptions for this need to be tightened up, but basically:

  • Whether duplicates are allowed will be determined by the schema (if any). By default, lists are ordered and allow duplicate entries, and maps are unordered and do not allow duplicate keys. The idea is that a schema can change the ordering and duplicate rules (to turn a list into a set, for example), and can relax the map key rule if the incoming data is expected to be dirty, so that the receiving end can decide how best to deal with it. Basically: sane defaults that can be changed if you know what you're doing (roughly sketched after this list).
  • There are a number of (bad) ways to handle key collisions, which are listed for comparison, in case someone wants to change the defaults in their schema (to give an idea of what they'll be getting into).
  • The only secure way to handle duplicate keys is to abort processing, which is what CE decoders must do by default.
  • go-concise-encoding (and by extension enctool) doesn't have any of that implemented yet :P
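
In decoder terms, I imagine the defaults shaking out to roughly this (a hypothetical sketch; none of these names exist in go-concise-encoding yet):

package main

import "fmt"

// ContainerRules sketches the schema-overridable defaults described above.
type ContainerRules struct {
	ListsOrdered           bool
	ListsAllowDuplicates   bool
	MapsAllowDuplicateKeys bool
}

// DefaultContainerRules returns the "sane defaults": ordered lists that allow
// duplicates, and unordered maps that treat a duplicate key as a structural error.
func DefaultContainerRules() ContainerRules {
	return ContainerRules{
		ListsOrdered:           true,
		ListsAllowDuplicates:   true,
		MapsAllowDuplicateKeys: false,
	}
}

func main() {
	fmt.Printf("%+v\n", DefaultContainerRules())
}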

@kstenerud
Owner

Spec clarified here: 2690775

@kengruven
Author

kengruven commented Jun 13, 2021

Ah, now I see what that section was meant to say.

I'm still unclear on this schema concept. It seems to be a combination of a traditional schema and a way to change the fundamental rules of parsing. CE already has ways to define "custom" types, but the schema will also let you repurpose existing data types for new semantics.

Normally I would just ignore a schema system, since I've seen them for XML/JSON/CSV/… and never found them to be worth the trouble. But with CE it looks like I may be physically unable to parse some CE data unless I implement schemas. Right? That's a little weird.

@kengruven
Author

Hmm, I just noticed this rule:

"References CAN refer to objects marked later in the document (forward references)."

And since references to keyable types are themselves keyable, that means I have to keep the full key set of every map that contains a reference, so that when I reach the end of the document and the references finally become resolvable, I can tell whether those earlier maps were valid, and what's in them.

I'd hoped I could process an arbitrary-size document using only O(max_nesting_depth) memory, but I don't think that's possible. Technically, it wasn't possible before either -- I needed to keep the current map's worth of keys around to check it -- but at least then "max object count in a container" would limit that.

This looks like a bigger DoS risk to me. All someone has to do is make a bunch of maps, each with a lot of long keys and one forward reference as a key, and the parser will have to store essentially the entire document in memory until the end.

(Or write a multi-pass parser, but if the data isn't coming from random-access storage, that might not be feasible.)

c1 [
  {
    "reallyreallyreallyreallyreallyreallyreallyreallylong_1_1" = 1
    "reallyreallyreallyreallyreallyreallyreallyreallylong_1_2" = 1
    "reallyreallyreallyreallyreallyreallyreallyreallylong_1_3" = 1
    ...  // 1000 keys
    $you_dont_know_what_i_am = 1
  }
  {
    "reallyreallyreallyreallyreallyreallyreallyreallylong_2_1" = 1
    "reallyreallyreallyreallyreallyreallyreallyreallylong_2_2" = 1
    "reallyreallyreallyreallyreallyreallyreallyreallylong_2_3" = 1
    ...
    $maybe_a_duplicate_key_who_knows = 1
  }
  ...  // 1000 maps
]

With these two features, it seems easy to build a 5 GB document which would require a parser to hold all of its data in memory at once, until the very end (using way more than 5 GB of memory, since hash tables are more expensive than raw data), and without hitting any of the mandatory limits (no more than 1000 anything in an anything).

I don't know when I'd ever use forward refs. I don't know why I'd ever use refs for keys. The intersection of these two features seems like more trouble than it's worth. Personally, I'd rather have a more streaming-friendly format.

Maybe I'll give my implementation a "max map key buffer size" parameter, and then set it to a tiny value by default. I don't want to block an actual feature, but I also can't imagine it'll be used much for legitimate data.
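
Something like this is roughly what I have in mind (hypothetical names; just a sketch of the knob, not real code from my implementation):

package main

import "fmt"

// keyBuffer sketches a "max map key buffer size" option: keys are buffered
// for duplicate detection, but only up to a configured byte budget.
type keyBuffer struct {
	maxBytes  int
	usedBytes int
	seen      map[string]struct{}
}

func newKeyBuffer(maxBytes int) *keyBuffer {
	return &keyBuffer{maxBytes: maxBytes, seen: make(map[string]struct{})}
}

// add records one map key, reporting either a duplicate or a blown budget.
func (b *keyBuffer) add(key string) error {
	if _, dup := b.seen[key]; dup {
		return fmt.Errorf("duplicate map key %q", key)
	}
	b.usedBytes += len(key)
	if b.usedBytes > b.maxBytes {
		return fmt.Errorf("key buffer exceeded %d bytes; refusing to keep buffering", b.maxBytes)
	}
	b.seen[key] = struct{}{}
	return nil
}

func main() {
	b := newKeyBuffer(64) // a deliberately tiny default, as suggested above
	fmt.Println(b.add("reallyreallyreallyreallyreallyreallyreallyreallylong_1_1"))
	fmt.Println(b.add("reallyreallyreallyreallyreallyreallyreallyreallylong_1_2")) // exceeds the budget
}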

@kstenerud
Owner

The only type that cannot be parsed without a schema is the constant (which is only available in CTE, so only during human editing, or if viewed on a system where such a schema is available). Everything else is parseable and gets interpreted according to the defaults when there's no schema to say otherwise. I still haven't defined what the schema even is, although JSON-Schema looks good as a starting point.

Regarding memory usage, you'll need to keep all keys of all maps leading up to the current depth. So if you had 1000-entry maps, 50 containers deep, you'd have to keep on average 50*1000/2 entries in memory in order to check for duplicates, with a worst case of 50*1000. But yes, if you have a forward ref to a key, you'd need to keep map keys around longer than the current depth... potentially for the whole document. Maybe it's better to just not allow references as keys at all. Not sure they add enough usefulness anyway...
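
For what it's worth, the per-depth bookkeeping could look roughly like this (a sketch, not what go-concise-encoding actually does): one key set per open map, pushed on map begin and popped on map end. Forward references used as keys would break the pop-on-end assumption and force keys to be retained longer.

package main

import "fmt"

// dupTracker keeps one key set per currently open map.
type dupTracker struct {
	stack []map[string]bool
}

func (t *dupTracker) beginMap() { t.stack = append(t.stack, map[string]bool{}) }
func (t *dupTracker) endMap()   { t.stack = t.stack[:len(t.stack)-1] }

// onKey reports whether the key repeats within the innermost open map.
func (t *dupTracker) onKey(key string) bool {
	top := t.stack[len(t.stack)-1]
	if top[key] {
		return true
	}
	top[key] = true
	return false
}

func main() {
	var t dupTracker
	t.beginMap()
	fmt.Println(t.onKey("a"), t.onKey("a")) // false true
	t.endMap()
}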

With references, an amplification attack is a real danger, which is why I'd double-count all objects and sub-objects that a reference represents. So if you have a marked map containing 1000 entries within its depths, every reference to that map "adds" 1000 objects towards the maximum objects allowed in a document. I think I put that in the spec but it may not be very clear...
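
Roughly how I picture that accounting working (again just a sketch, with made-up names):

package main

import "fmt"

// objectBudget sketches the double-counting idea: every reference is charged
// the full object count of the marked subtree it points to, against the
// document-wide object limit.
type objectBudget struct {
	limit       int
	used        int
	markedSizes map[string]int // object count of each marked subtree
}

func (b *objectBudget) spend(n int) error {
	b.used += n
	if b.used > b.limit {
		return fmt.Errorf("object limit exceeded: %d > %d", b.used, b.limit)
	}
	return nil
}

// onReference charges the budget as if the referenced subtree were inlined,
// so referencing a 1000-entry marked map "adds" 1000 objects each time.
func (b *objectBudget) onReference(markerID string) error {
	return b.spend(b.markedSizes[markerID])
}

func main() {
	b := &objectBudget{limit: 2500, markedSizes: map[string]int{"m": 1000}}
	fmt.Println(b.onReference("m"), b.onReference("m"), b.onReference("m")) // the third reference blows the limit
}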

@kengruven
Author

I would think that if you had the document c1 { "a" = 1 "a" = 2 } and an accompanying schema that said "btw it's a multi-map", you would not be able to parse that with a schema-agnostic parser.


The question of double-counting all referenced objects depends on how the parser reports them to the application. I haven't really given that much thought yet, and I don't know that it's ever spelled out in the spec. I'm starting to think that it would make sense, depending on how the application intends to use the document, to let it request any of the following (roughly sketched after this list):

  • "give me the markers/references exactly as they appear in the document"
  • "replace all references with copies of the marked data they refer to"
  • or maybe even "replace all references with the referenced data, but guarantee that they all refer to the same value in memory" (if that's feasible in that programming language's data model)
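
As a sketch of what that option surface might look like (purely hypothetical names; no such API exists):

package main

import "fmt"

// ReferenceMode is a hypothetical decoder option covering the three behaviours above.
type ReferenceMode int

const (
	// Hand markers and references to the application exactly as they appear.
	ReferencesVerbatim ReferenceMode = iota
	// Replace each reference with a deep copy of the marked data.
	ReferencesCopied
	// Replace each reference with the marked data, sharing one in-memory value.
	ReferencesShared
)

func main() {
	fmt.Println(ReferencesVerbatim, ReferencesCopied, ReferencesShared) // 0 1 2
}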

I tried converting a marker/reference from CE to formats that don't have this feature:

$ echo 'c1 { "A" = &1:[1 2 3 4 5] "B" = $1 }' | ./enctool convert -df xml
Cannot convert map to XML
$ echo 'c1 { "A" = &1:[1 2 3 4 5] "B" = $1 }' | ./enctool convert -df json
{"A":[1,2,3,4,5]}

I was hoping this would either duplicate the data or give a "references not supported" error, to give a hint as to the intent of that feature, but it dropped them, which I suppose just means it's not implemented yet either way.
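
For comparison, if the converter were to expand references by copying, I'd expect the JSON conversion to produce something along the lines of {"A":[1,2,3,4,5],"B":[1,2,3,4,5]} (my guess at the intended behavior, not actual enctool output).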
