Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disallow integers in keys #592

Closed
xoofx opened this issue Feb 2, 2019 · 26 comments
Closed

Disallow integers in keys #592

xoofx opened this issue Feb 2, 2019 · 26 comments

Comments

@xoofx
Copy link

xoofx commented Feb 2, 2019

Hey,

While implementing the lexer for TOML, I have found that allowing integers in a key makes the lexer parser state dependent. It is pretty uncommon to have this in simple languages as it happens usually in languages that have complicated context dependent parsing (e.g preprocessor macros in a C like language). Imo, it's an unfortunate choice for a simple language like TOML to incur contextual lexers.

Another weird side effect of this is that you can have in a key something that looks like a float, or an integer hexa

[table]
1.0 = 2.0
0x155 = 0x155

I don't know if this is something that could be revisited before 1.0 and change a key to [a-zA-Z\-_][a-zA-Z\-_0-9]*?

@pradyunsg
Copy link
Member

ASCII letters, ASCII digits, underscores, and dashes (A-Za-z0-9_-)

Base keys can not contain .. The first key is thus not valid TOML.

As for the alternative notations, I don't think it's really that big an issue.

@xoofx
Copy link
Author

xoofx commented Feb 2, 2019

Base keys can not contain .. The first key is thus not valid TOML.

Sorry, I used the term key as a bag for base key and dot key. So 1.0 is a valid dot key right?

As for the alternative notations, I don't think it's really that big an issue.

hm, I don't understand which alternatives you are referring to? 😅

I'm just pointing a double issue:

  • the base key/dot key allow to parse things that look like integers, floating points while of course the specs says that it would be converted to a string, but imo this is confusing
  • it makes the lexer state dependent, which is not a great choice for a language that is supposed to be simple

@bd82
Copy link

bd82 commented Feb 2, 2019

Interesting.

I have recently implemented a Toml lexer & parser but did not notice this edge case.

Perhaps it can be resolved by lexer states that are independent of the parser state?
Example:

  • Start in KEY mode (no floating point allowed)
  • Once an "=" token is encountered we are in VALUE mode and floating points are allowed.
  • Finding a newline in VALUE mode would move us back to KEY mode.

This obviously is naive and simplified (no handling of inline-tables) but it is only meant to demonstrate the concept.

btw if we can differentiate between TOP Level context (key) and value mode
that will also resolve the "[[" vs "[" ambiguity (table vs table of tables vs array of arrays)

@xoofx
Copy link
Author

xoofx commented Feb 2, 2019

Perhaps it can be resolved by lexer states that are independent of the parser state?

That's a bit playing with the concepts! 😅 As it's basically bringing parser knowledge into the lexer... and also as you said, inline tables are going to require more knowledge to correctly do it.

In my case, my lexer has something like this:

if (State == LexerSate.Key)
{
    NextTokenForKey();
}
else
{
    NextTokenForValue();
}

and the parser is feeding the lexer with the correct state, and I'm "fine" with this solution. But this is annoying, because typically, I cannot test my lexer separately on a full TOML document and it is even barely unit testable on slices, as it would require my unit tests to have this parser knowledge...

But I'm worried that my original issue dragged the discussion to some "implementation details" while I should have put an emphasis on the user experience ambiguity first (e.g 1.0 = 2.0), implying the side effect of an unfortunate parser/lexer state coupling

@bd82
Copy link

bd82 commented Feb 2, 2019

Back to the high level concepts.

I agree that from a simplicity perspective that it would be best to decouple the lexing and parsing state completely.

If only due to the fact some language tools have completely separate lexing and parsing
phases so more tools could be used to build (correct) parsers for Toml without "advanced mitigation logic"

Some References to a similar (but more complex) scenario:

There is another concern that when creating a fault tolerant parser,
for example to be used in an IDE.
It is much easier to naively recover from lexing errors by simply ignoring unidentified
character sequences when the lexer has only a single mode.

@xoofx
Copy link
Author

xoofx commented Feb 2, 2019

It is much easier to naively recover from lexing errors by simply ignoring unidentified
character sequences when the lexer has only a single mode.

Yes, and specially, in the case of TOML, we are lucky that the grammar doesn't require token previews

@Jezza
Copy link

Jezza commented Feb 2, 2019

So, I just thought I'd add my two cents, having already taken a stab at implementing a toml 0.5 parser.

Parser Code

Basically, if it's lexed as number, it needs to be a base 10 integer, otherwise it'll reject it.

I've actually just taken a glance at the spec, and I realise I've actually done some things incorrectly.
First of all, I should definitely accept 0x2 as a valid key.
It's not a number, but still a valid key.

However, floats an issue does come up.
JFlex does nom as much input as it can, and so it'll take 1.0 as a float, over 1 . 0.
I can perform a work around in code, and honestly, it's quite easy, but yeah, just thought
I'd chime in and say maybe the spec should inform us a bit about the precedence.

Reading the actual spec, I would 100% convert 1.0 into a dotted key with 1 and 0.
I think I'll go change my parser to do that.

@Jezza
Copy link

Jezza commented Feb 2, 2019

Ok, just went and changed it:

10 = "asd"
1.0 = "asd"
1.00.0 = "asd"
0x10 = "asd"
0o10 = "asd"
0b10 = "asd"

[1.00.00]
value = "asd"

is now parsed as:

{
    10 = "asd",
    1 = {
        0 = "asd",
        00 = {
            0 = "asd",
            00 = {
                value = "asd",
            },
        },
    },
    0x10 = "asd",
    0o10 = "asd",
    0b10 = "asd",
}

@xoofx
Copy link
Author

xoofx commented Feb 3, 2019

JFlex does nom as much input as it can, and so it'll take 1.0 as a float, over 1 . 0.
I can perform a work around in code, and honestly, it's quite easy, but yeah, just thought
I'd chime in and say maybe the spec should inform us a bit about the precedence.

And that's another case to remove these integers in keys in TOML, as I'm pretty sure that many implementations got actually that wrong.

That's why you can't use a bare lexer, you need to maintain a state, otherwise going through "oh I just parsed a float but it is actually not " is pretty fragile/clunky to recover from (specially with the dot, as the parser is usually in a very different state when it handles a dotted key)

@pradyunsg
Copy link
Member

pradyunsg commented Feb 3, 2019

So 1.0 is a valid dot key right?

Yes.

This is not going to change in TOML 1.0. We could revisit this post TOML 1.0, if someone's motivated enough to bring this up then. :)


Closing since there's nothing actionable here.

@Jezza
Copy link

Jezza commented Feb 3, 2019

Personally, I think that integer keys are a bit weird anyway, but I might just not have the use-case for it.

The best action might be to clarify that 1.0 = should be parsed as a dotted key 1 0.

It's implicit in the spec, but as this issue is showing, a couple of people already missed it, and it's not even version 1.0.

@pradyunsg
Copy link
Member

The best action might be to clarify that 1.0 = should be parsed as a dotted key 1 0.

Happy to accept a PR for this. :)

@xoofx
Copy link
Author

xoofx commented Feb 3, 2019

This is not going to change in TOML 1.0. We could revisit this post TOML 1.0, if someone's motivated enough to bring this up then. :)

Is this because you don't want to break compatibility (while I thought that 0.x could allow this) or you think that the cost of the ambiguity is worth the price that allowing integers in keys is bringing? I could run some stats on all TOML documents on github to find out how many are using integers in their keys if that's something that could balance a decision....
Otherwise, I don't think TOML should revisit later, it is either today or never.

It's implicit in the spec, but as this issue is showing, a couple of people already missed it, and it's not even version 1.0.

Yes, and I'm going to send a PR to https://github.com/BurntSushi/toml-test to add these edge cases... pretty sure that it is going to break a few libraries that were assuming full 0.5 compatibility.

@pradyunsg
Copy link
Member

pradyunsg commented Feb 3, 2019

Is this because you don't want to break compatibility (while I thought that 0.x could allow this)

This -- the README has a note about 0.5.0 being "extremely stable".

@xoofx
Copy link
Author

xoofx commented Feb 3, 2019

This -- the README has a note about 0.5.0 being "extremely stable".

You are going finicky. 😛 My turn: you have omitted to mention the following phrase:

The goal is for version 1.0.0 to be backwards compatible (as much as humanly possible) with version 0.5.0

So the README doesn't say that it should stay stable at all cost 😉

But fair enough, I will find the time to send a PR to make the specs more explicit.

@pradyunsg
Copy link
Member

As of version 0.5.0, TOML should be considered extremely stable. The goal is for version 1.0.0 to be backwards compatible (as much as humanly possible) with version 0.5.0. All implementations are strongly encouraged to become 0.5.0 compatible so that the transition to 1.0.0 will be simple when that happens.

@xoofx I'm not sure what you mean by finicky here. I assume you meant it in good faith. :)

@xoofx
Copy link
Author

xoofx commented Feb 3, 2019

@xoofx I'm not sure what you mean by finicky here. I assume you meant it in good faith. :)

I was just teasing a bit, as I found the responses a bit terse regarding the problem I'm bringing here 😊

@philippgl
Copy link

philippgl commented Mar 7, 2019

Reading the spec, if would think, that integers (!= string with digits) are not allowed as keys and

[table]
1.0 = 2.0
0x155 = 0x155

is parsed as

[table]
"1"."0" = 2.0
"0x155" = 0x155

and

1.0 = 2.0
"1"."0" = "whatever"

is forbidden due to the duplicate key. (github output looks strange)

@xoofx
Copy link
Author

xoofx commented Mar 8, 2019

Reading the spec, if would think, that integers (!= string with digits) are not allowed as keys and
... is forbidden due to the duplicate key. (github output looks strange)

Indeed. The point of the discussion here was just to highlight that "integers like" keys can be actually confusing and it makes the lexer dependent on the state of the parser, which is quite annoying.

@LongTengDao
Copy link
Contributor

(github output looks strange)

maybe github only support toml earlier version which not support dotted keys...

@pradyunsg
Copy link
Member

pradyunsg commented May 6, 2019

(slight OT) @xoofx I usually try not to be terse or come across as dismissive but apologies if it came across as such. Sarcasm isn't something that gets communicated well unless folks share the same context which I didn't in this case.


The ship has sailed for this. For now, it's better to not break compatibility with TOML 0.5 -> 1.0.

In terms of actionable items here, I am splitting this ticket into 2.

  • Add a small note to the dotted keys section that 1.0 = true gets parsed into {"1": {"0": true"}}.
  • Have a post-1.0 discussion for actually disallowing this in the next major version bump.

@sevmeyer
Copy link
Contributor

sevmeyer commented May 7, 2019

Please excuse my ignorance, but doesn't this also affect keywords? Even with the proposed rule [a-zA-Z\-_][a-zA-Z\-_0-9]*, bare keys remain context-sensitive:

true = false
nan  = inf
-42  = +42

I don't think there is a simple way to avoid this ambiguity.

@eksortso
Copy link
Contributor

eksortso commented May 7, 2019

Devil's advocate says: It's not hard to tell what's on the left side of the equals sign and what's on the right side. Yes you can confuse yourself by naming your keys the same as Boolean values. But there's no good reason to do that.

@ChristianSi
Copy link
Contributor

ChristianSi commented May 8, 2019

Regarding the post-1.0 discussion, I'd argue that integer-like bare keys should not be forbidden since they might actually be useful in some cases. Maybe someone has a file birthdays.toml:

[December]
3 = "myself"
9 = ["Kirk Douglas", "John Malkovich"]
24 = "Jesus"

That's just an example, but use cases like this should not be ruled out merely to make the lives of parser writers easier. Nor do I see a good reason to force people to use quotation marks in such cases.

@pradyunsg
Copy link
Member

I was open to ideas/reasons to change the behavior. @ChristianSi's comment from the future (see end of post) has convinced me that it's not worth changing this behavior. There are valid use cases for this and we explicitly call this behavior allowed in the current specification.

We won't be disallowing integers in keys in TOML. A clarification will be added to explain the dotted keys edge case (this is being tracked #616).

Thanks for filing this issue @xoofx and to everyone who's contributed to this discussion! :)


Screenshot 2019-05-08 at 10 23 42 AM

Hi @ChristianSi from the future! :P

@ChristianSi
Copy link
Contributor

ChristianSi commented May 8, 2019

Well, post-1.0 is still a long way of, so the future seemed the right place to make that comment 😁

Also, this is the first time someone has reacted to a comment of mine before I even made it. Cheers, @pradyunsg !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants