Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rust highlighting in 11.11.0 very broken #4190

Open
steffahn opened this issue Dec 20, 2024 · 11 comments
Open

Rust highlighting in 11.11.0 very broken #4190

steffahn opened this issue Dec 20, 2024 · 11 comments
Labels
bug help welcome Could use help from community language

Comments

@steffahn
Copy link

This must be from #4156; I’m opening an issue to properly track the problem.

I’m not quite familiar with grammar format here, but it turns out this change seems to be very breaking for a lot of Rust code:

Screenshot_20241220_070444

Screenshot_20241220_070504

code example from Rust by Example:

use std::fmt::Debug; // Trait to bound with.

#[derive(Debug)]
struct Ref<'a, T: 'a>(&'a T);
// `Ref` contains a reference to a generic type `T` that has
// some lifetime `'a` unknown by `Ref`. `T` is bounded such that any
// *references* in `T` must outlive `'a`. Additionally, the lifetime
// of `Ref` may not exceed `'a`.

// A generic function which prints using the `Debug` trait.
fn print<T>(t: T) where
    T: Debug {
    println!("`print`: t is {:?}", t);
}

// Here a reference to `T` is taken where `T` implements
// `Debug` and all *references* in `T` outlive `'a`. In
// addition, `'a` must outlive the function.
fn print_ref<'a, T>(t: &'a T) where
    T: Debug + 'a {
    println!("`print_ref`: t is {:?}", t);
}

fn main() {
    let x = 7;
    let ref_x = Ref(&x);

    print_ref(&ref_x);
    print(ref_x);
}

screenshot made with the jsfiddle I saw in #3933; adapted into: https://jsfiddle.net/vwo97jnt/

@steffahn steffahn added bug help welcome Could use help from community language labels Dec 20, 2024
@steffahn steffahn changed the title Rust highlighting in 11.11.0 very *VERY* broken Rust highlighting in 11.11.0 very broken Dec 20, 2024
@joshgoebel
Copy link
Member

Ok, telling the difference between lifetime parameters and single quoted strings without a full parser seems more than a bit difficult. Any ideas on context clues we could perhaps use?

@steffahn
Copy link
Author

steffahn commented Dec 21, 2024

Single-quote literals are not string literals but character literals. So they can't contain very much before they must end. If there's more than either a single character, or a single escape sequence (of various forms) before the matching ' then it can't be a character literal. You can find grammar for the lexical structure of character literals here.

If you want a simplified heuristic that captures all character literals but no lifetimes, you could say that after the ' for a char literal there must be either:

  • one single character that isn't \ or ' or a line break (or carriage return or tab, but who cares about those..)
    • directly followed by the closing '
  • or a two-character escaped sequence (\n, \r, \t, \\ \0, \', \" – though for simplicity you could arguably just accept \ followed by any character that isn't x or u)
    • directly followed by the closing '
  • or one of the longer escape sequences:
    • after \x follow exactly two more characters [technically, one is an octal, one is a hex digit; if you want the same logic for byte literals, two hex digits are also a thing], for simplicity you could arguably just accept 2 arbitrary characters
      • directly followed by the closing '
    • after \u follows {, then 1 to 6 hex digits, arbitrarily interspersed with arbitrarily many _ characters, followed by }, for simplicity you could just accept anything terminated by the matching }
      • directly followed by the closing '

Technically, the grammar also specifies a suffix identifier that can be part of the literal. However, these are lexical features not part of any actual language syntax; they can technically be used by macros that only operate on a lexical level, but to be honest, I've never seen those used ever anywhere, so if 'x'foo gets highlighted as if 'x' and foo are two separate things, that's totally fine.

@joshgoebel
Copy link
Member

For our pattern matching uses chars are just a variant of string as far as I can tell.

I was asking about lifetime paramaters (what chatGPT called it, I'm not a Rust person)... such as:

fn print_ref<'a, T>(t: &'a T) where

To the parser when we see a ' it seems very hard to know if that's the beginning of a lifetime parameter (which does not need a closing ' and a string, which does.

@steffahn
Copy link
Author

steffahn commented Dec 21, 2024

You could just go back to the prior style of handling via a single regex that tries to match the whole literal at once, couldn't you?

The main remaining issue with emoji seems to be support for unicode outside of the basic plane… however, as far as I understand, most browsers should support /u suffixed regex nowadays?

Here's a simple attempt to match all kinds of char literals, including somewhat broken ones, but certainly no lifetimes:

  • accepts optional b prefix, then ', then
    • one (not ', \) character followed by '
    • or \ followed by one more character (not x or u), then '
    • or \ followed by
      • x then up to two more (non ') characters, then ', or
      • u then {, any number of characters, then }, but any ' that comes early terminates the literal, too
        • (i.e. I've made the { optional, the characters all exclude ', the } is also optional)

I've also excluded line breaks everywhere.

Any additional syntax that this characterizes as "character literal" incorrectly is actually something the rust compiler would error about. So all legal Rust programs are interpreted correctly, and IMO broken ones use reasonable fallbacks

/b?'([^\\'\n]|\\([^xu\n]|x[^'\n]?[^'\n]?|u(\{[^'\}\n]*\}?)?)?)'/u

If you don't want the /u, then the result with transpilation via https://mothereff.in/regexpu would be

/b?'((?:[\0-\t\x0B-&\(-\[\]-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])|\\((?:[\0-\t\x0B-tvwy-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])|x(?:[\0-\t\x0B-&\(-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])?(?:[\0-\t\x0B-&\(-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])?|u(\{(?:[\0-\t\x0B-&\(-\|~-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])*\}?)?)?)'/

@steffahn
Copy link
Author

steffahn commented Dec 21, 2024

If you want strict regexes instead that match nothing beyond legal lexical char literal tokens in Rust, that could look like

/'([^'\\\n\r\t]|\\(['"nrt\\0]|x[0-7][0-9a-fA-F]|u\{([0-9a-fA-F]_*){1,6}\}))'/u

which could transpile to

/'((?:[\0-\x08\x0B\f\x0E-&\(-\[\]-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])|\\(["'0\\nrt]|x[0-7][0-9A-Fa-f]|u\{([0-9A-Fa-f]_*){1,6}\}))'/

and for byte literals:

/b'([\0-\x08\x0B\f\x0E-!#-&\(-\x7F]|\\(["'0\\nrt]|x[0-9A-Fa-f]{2}))'/

(also I'm now noticing that all groups here and in the previous reply could be non-capturing)

@joshgoebel
Copy link
Member

We'd prefer not one large regex since that prevents us from highlighting child elements, such as the character escapes. Take a look at the latest main and see what you think. It seems to work well on all the test cases I've seen so far... just add a negative look-ahead guard to the un-terminated ' elements to prevent them from matching the beginning of strings.

Thoughts?

@steffahn
Copy link
Author

What's the easiest way to test out the main branch?

@steffahn
Copy link
Author

Ah, I guess I can follow these instructions https://highlightjs.readthedocs.io/en/latest/building-testing.html#basic-testing

@steffahn
Copy link
Author

The issue seems fixed for most Rust code. Do note however that technically lifetimes (as any Rust identifiers) support all kinds of unicode symbols, not just english letters (see Identifiers).

Regarding matching of the whole thing, could perhaps the following work? Match the whole char literal at once for all unescaped ones; only use a bracketing approach for char literals that start with '\…? Or would that preclude highlighting the \ part of an escape if escapes are to get special highlighting?

Another thing that seems currently broken on main is handling of '\'', e.g.

fn test() {
    let x = '\'';
}

@steffahn
Copy link
Author

Here’s a (perhaps contrived) example with non-English (even non-latin-character) identifiers in lifetimes:

fn foo<'ライフ>(input: &'ライフ str) -> &'ライフ str {
    let スペシャル = "special characters!";
    println!("wow! {スペシャル}");
    return input;
}

@steffahn
Copy link
Author

Now that I know how to test locally, I think I might try a PR myself tomorrow :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug help welcome Could use help from community language
Projects
None yet
Development

No branches or pull requests

2 participants