Rust highlighting in 11.11.0 very broken #4190
Comments
OK, telling the difference between lifetime parameters and single-quoted strings without a full parser seems more than a bit difficult. Any ideas on context clues we could perhaps use? |
Single-quote literals are not string literals but character literals. So they can't contain very much before they must end. If there's more than either a single character, or a single escape sequence (of various forms) before the matching If you want a simplified heuristic that captures all character literals but no lifetimes, you could say that after the
Technically, the grammar also specifies a suffix identifier that can be part of the literal. However, these are lexical features not part of any actual language syntax; they can technically be used by macros that only operate on a lexical level, but to be honest, I've never seen those used ever anywhere, so if |
For our pattern-matching purposes, chars are just a variant of string as far as I can tell. I was asking about lifetime parameters (what ChatGPT called them; I'm not a Rust person)... such as:
To the parser when we see a |
You could just go back to the prior style of handling via a single regex that tries to match the whole literal at once, couldn't you? The main remaining issue with emoji seems to be support for unicode outside of the basic plane… however, as far as I understand, most browsers should support Here's a simple attempt to match all kinds of char literals, including somewhat broken ones, but certainly no lifetimes:
I've also excluded line breaks everywhere. Anything this incorrectly classifies as a "character literal" is something the Rust compiler would reject anyway. So all legal Rust programs are interpreted correctly, and IMO broken ones get reasonable fallbacks:
/b?'([^\\'\n]|\\([^xu\n]|x[^'\n]?[^'\n]?|u(\{[^'\}\n]*\}?)?)?)'/u
If you don't want the
/b?'((?:[\0-\t\x0B-&\(-\[\]-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])|\\((?:[\0-\t\x0B-tvwy-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])|x(?:[\0-\t\x0B-&\(-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])?(?:[\0-\t\x0B-&\(-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])?|u(\{(?:[\0-\t\x0B-&\(-\|~-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])*\}?)?)?)'/ |
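To make the claim above easier to check, here is a minimal sketch that runs the lenient `/u` pattern (copied verbatim from the comment) against a few inputs. The helper `isWholeMatch` and the sample strings are illustrative assumptions, not part of highlight.js or the thread.

```javascript
// Lenient char-literal pattern, copied verbatim from the comment above.
const LENIENT_CHAR = /b?'([^\\'\n]|\\([^xu\n]|x[^'\n]?[^'\n]?|u(\{[^'\}\n]*\}?)?)?)'/u;

// Hypothetical helper: true when the pattern matches the entire input string.
function isWholeMatch(re, text) {
  const m = re.exec(text);
  return m !== null && m[0] === text;
}

// JavaScript string escapes: "'\\''" is the four-character Rust literal '\''.
const charLiterals = ["'a'", "'\\n'", "'\\''", "'\\x41'", "'\\u{1F600}'", "b'a'", "'😀'"];
// Snippets containing lifetimes only; none of them should match anywhere.
const lifetimeSnippets = ["&'a str", "fn foo<'a, 'b>(x: &'a str)", "&'static str"];

for (const s of charLiterals) {
  console.log(s, "whole match:", isWholeMatch(LENIENT_CHAR, s));
}
for (const s of lifetimeSnippets) {
  console.log(s, "any match:", LENIENT_CHAR.test(s));
}
```

With these inputs, every char literal should match as a whole and none of the lifetime snippets should match at all, because a lifetime is never closed by a second `'` right after one character or escape.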
If you want strict regexes instead that match nothing beyond legal lexical char literal tokens in Rust, that could look like
/'([^'\\\n\r\t]|\\(['"nrt\\0]|x[0-7][0-9a-fA-F]|u\{([0-9a-fA-F]_*){1,6}\}))'/u
which could transpile to
/'((?:[\0-\x08\x0B\f\x0E-&\(-\[\]-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])|\\(["'0\\nrt]|x[0-7][0-9A-Fa-f]|u\{([0-9A-Fa-f]_*){1,6}\}))'/
and for byte literals:
/b'([\0-\x08\x0B\f\x0E-!#-&\(-\x7F]|\\(["'0\\nrt]|x[0-9A-Fa-f]{2}))'/
(also I'm now noticing that all groups here and in the previous reply could be non-capturing) |
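As a rough illustration of how the strict variant differs from a lenient one, the sketch below runs the strict `/u` char-literal pattern (copied verbatim from the comment above) over a handful of inputs. The `classify` helper and the test inputs are assumptions made for this example only.

```javascript
// Strict char-literal pattern, copied verbatim from the comment above.
const STRICT_CHAR = /'([^'\\\n\r\t]|\\(['"nrt\\0]|x[0-7][0-9a-fA-F]|u\{([0-9a-fA-F]_*){1,6}\}))'/u;

// Hypothetical helper: accept only when the whole input is one char literal.
function classify(text) {
  const m = STRICT_CHAR.exec(text);
  return m !== null && m[0] === text ? "char literal" : "rejected";
}

const samples = [
  "'a'",            // plain character
  "'\\''",          // escaped single quote: '\''
  "'\\x7F'",        // \x escape within the 7-bit range
  "'\\xFF'",        // \x escape above 0x7F, not allowed by this pattern
  "'\\u{10_FFFF}'", // unicode escape with an underscore separator
  "'ab'",           // two characters
  "&'a str",        // lifetime
];

for (const s of samples) {
  console.log(s, "->", classify(s));
}
```

Here `'\xFF'`, `'ab'`, and the lifetime are rejected, while the other four are accepted. A highlighter would probably still want a fallback for the rejected forms, which is the trade-off between the strict and lenient variants discussed in this thread.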
We'd prefer not one large regex since that prevents us from highlighting child elements, such as the character escapes. Take a look at the latest Thoughts? |
What's the easiest way to test out the main branch? |
Ah, I guess I can follow these instructions https://highlightjs.readthedocs.io/en/latest/building-testing.html#basic-testing |
The issue seems fixed for most Rust code. Do note, however, that technically lifetimes (like any Rust identifiers) support all kinds of Unicode symbols, not just English letters (see Identifiers). Regarding matching of the whole thing, could perhaps the following work? Match the whole char literal at once for all unescaped ones; only use a bracketing approach for char literals that start with Another thing that seems currently broken on main is handling of
fn test() {
    let x = '\'';
} |
Here’s a (perhaps contrived) example with non-English (even non-Latin-character) identifiers in lifetimes:
fn foo<'ライフ>(input: &'ライフ str) -> &'ライフ str {
    let スペシャル = "special characters!";
    println!("wow! {スペシャル}");
    return input;
} |
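One possible way to recognize such lifetimes is with Unicode property escapes, sketched below. This is an assumption-laden illustration, not the pattern highlight.js uses: `LIFETIME` and `findLifetimes` are hypothetical names, the pattern relies on the `XID_Start` and `XID_Continue` properties available under the `u` flag, and it ignores finer points such as raw identifiers and keywords. It would also match loop labels, which share the same lexical form.

```javascript
// Sketch of a lifetime pattern: a quote, an identifier start (or underscore),
// any number of identifier-continue characters, and then NOT another
// identifier character or a closing quote (which would make it a char literal).
const LIFETIME = /'(?:\p{XID_Start}|_)\p{XID_Continue}*(?![\p{XID_Continue}'])/gu;

// Hypothetical helper: list every lifetime-like token found in the source text.
function findLifetimes(source) {
  return Array.from(source.matchAll(LIFETIME), (m) => m[0]);
}

const signature = "fn foo<'ライフ>(input: &'ライフ str) -> &'ライフ str {";
console.log(findLifetimes(signature));       // three occurrences of 'ライフ
console.log(findLifetimes("let x = '\\'';")); // escaped quote literal, no lifetimes
console.log(findLifetimes("let c = 'é';"));   // non-ASCII char literal, no lifetimes
```

The negative lookahead is what keeps `'é'` from being read as a lifetime `'é` followed by a stray quote.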
Now that I know how to test locally, I think I might try a PR myself tomorrow :-) |
This must be from #4156; I’m opening an issue to properly track the problem.
I’m not very familiar with the grammar format here, but this change turns out to break a lot of Rust code:
code example from Rust by Example:
screenshot made with the jsfiddle I saw in #3933; adapted into: https://jsfiddle.net/vwo97jnt/