Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement "cut" (as in Prolog) for comitting to a choice #35

Open
osa1 opened this issue Nov 23, 2021 · 1 comment
Open

Implement "cut" (as in Prolog) for comitting to a choice #35

osa1 opened this issue Nov 23, 2021 · 1 comment
Labels
design feature New feature or request

Comments

@osa1
Copy link
Owner

osa1 commented Nov 23, 2021

Suppose I'm trying to lex this invalid Rust code: b"\xa". The problem here is \x needs to be followed by two hex digits, not one.

If I run this with rustc I get an "invalid escape" error, as expected.

If I run this with lexgen_rust, I get an id b first, then an error.

The problem is with backtracking. The lexgen-generated lexer records the successful match for b as an identifier and continues lexing, to be able to return the longest match. When it fails to match the rest of the token, it returns b as an identifier.

Instead what we want to do is, when we see b" we want to "commit" to the byte string rule, i.e. no backtracking from that point. If the rest of the token is not a valid byte string then we don't return b as an id and fail.

This is trivial to implement once we come up with a syntax: just reset the last_match when we make a transitions with a "cut" (or "commit") annotation.

Currently the workaround is to have a lexer state for lexing the string body. So instead of this:

rule Init {
    ...

    "b\"" ($ascii_for_string | $byte_escape | $string_continue | "\r\n")* '"' => |lexer| {
        let match_ = lexer.match_();
        lexer.return_(Token::Lit(Lit::ByteString(match_)))
    },
}

We need something like:

rule Init {
    ...

    "b\"" => |lexer| lexer.switch(LexerRule::ByteString),
}

rule ByteString {
    ($ascii_for_string | $byte_escape | $string_continue | "\r\n")* '"' => |lexer| {
        let match_ = lexer.match_();
        lexer.switch_and_return(LexerRule::Init, Token::Lit(Lit::ByteString(match_)))
    },
}

Since the idea is similar to Prolog's "cut", I suggest a similar syntax:

rule Init {
    ...

    "b\"" ! ($ascii_for_string | $byte_escape | $string_continue | "\r\n")* '"' => |lexer| {
        let match_ = lexer.match_();
        lexer.return_(Token::Lit(Lit::ByteString(match_)))
    },
}

That ! above is "cut" (or "commit"), meaning once b" is matched there is no backtracking, we either match rest of the string according to the current rule, or fail with an error pointing to the character b.

I wonder if other lexer generators have a syntax for this kind of thing?

@osa1
Copy link
Owner Author

osa1 commented Sep 13, 2022

I just came across this similar feature in a peg parser library and they also call it "cut": https://tatsu.readthedocs.io/en/stable/syntax.html#id20. Copying here in case the page gets updated:

~

The cut expression. Commit to the current active option and prevent other options from being considered even if what follows fails to parse.

In this example, other options won’t be considered if a parenthesis is parsed:

atom
    =
    | '(' ~ @:expre ')'
    | int
    | bool
    ;

There are also options in optional expressions, because [foo] is equivalent to (foo|()).

There are options also in closures, because of a similar equivalency, so the following rule will fail if expression is not parsed after an = is parsed, while the version without the ~ would succeed over a partial parse of the name '=' expression ahead in the input:

parameters
    =
    ','.{name '=' ~ expression}
    ;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
design feature New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant