Rust highlighting in 11.11.0 very broken #4190

steffahn · 2024-12-20T06:22:17Z

This must be from #4156; I’m opening an issue to properly track the problem.

I’m not quite familiar with grammar format here, but it turns out this change seems to be very breaking for a lot of Rust code:

code example from Rust by Example:

use std::fmt::Debug; // Trait to bound with.

#[derive(Debug)]
struct Ref<'a, T: 'a>(&'a T);
// `Ref` contains a reference to a generic type `T` that has
// some lifetime `'a` unknown by `Ref`. `T` is bounded such that any
// *references* in `T` must outlive `'a`. Additionally, the lifetime
// of `Ref` may not exceed `'a`.

// A generic function which prints using the `Debug` trait.
fn print<T>(t: T) where
    T: Debug {
    println!("`print`: t is {:?}", t);
}

// Here a reference to `T` is taken where `T` implements
// `Debug` and all *references* in `T` outlive `'a`. In
// addition, `'a` must outlive the function.
fn print_ref<'a, T>(t: &'a T) where
    T: Debug + 'a {
    println!("`print_ref`: t is {:?}", t);
}

fn main() {
    let x = 7;
    let ref_x = Ref(&x);

    print_ref(&ref_x);
    print(ref_x);
}

screenshot made with the jsfiddle I saw in #3933; adapted into: https://jsfiddle.net/vwo97jnt/

joshgoebel · 2024-12-21T16:19:31Z

Ok, telling the difference between lifetime parameters and single quoted strings without a full parser seems more than a bit difficult. Any ideas on context clues we could perhaps use?

steffahn · 2024-12-21T16:35:56Z

Single-quote literals are not string literals but character literals. So they can't contain very much before they must end. If there's more than either a single character, or a single escape sequence (of various forms) before the matching ' then it can't be a character literal. You can find grammar for the lexical structure of character literals here.

If you want a simplified heuristic that captures all character literals but no lifetimes, you could say that after the ' for a char literal there must be either:

one single character that isn't \ or ' or a line break (or carriage return or tab, but who cares about those..)
- directly followed by the closing '
or a two-character escaped sequence (\n, \r, \t, \\ \0, \', \" – though for simplicity you could arguably just accept \ followed by any character that isn't x or u)
- directly followed by the closing '
or one of the longer escape sequences:
- after \x follow exactly two more characters [technically, one is an octal, one is a hex digit; if you want the same logic for byte literals, two hex digits are also a thing], for simplicity you could arguably just accept 2 arbitrary characters
  - directly followed by the closing '
- after \u follows {, then 1 to 6 hex digits, arbitrarily interspersed with arbitrarily many _ characters, followed by }, for simplicity you could just accept anything terminated by the matching }
  - directly followed by the closing '

Technically, the grammar also specifies a suffix identifier that can be part of the literal. However, these are lexical features not part of any actual language syntax; they can technically be used by macros that only operate on a lexical level, but to be honest, I've never seen those used ever anywhere, so if 'x'foo gets highlighted as if 'x' and foo are two separate things, that's totally fine.

joshgoebel · 2024-12-21T16:42:30Z

For our pattern matching uses chars are just a variant of string as far as I can tell.

I was asking about lifetime paramaters (what chatGPT called it, I'm not a Rust person)... such as:

fn print_ref<'a, T>(t: &'a T) where

To the parser when we see a ' it seems very hard to know if that's the beginning of a lifetime parameter (which does not need a closing ' and a string, which does.

steffahn · 2024-12-21T17:25:19Z

You could just go back to the prior style of handling via a single regex that tries to match the whole literal at once, couldn't you?

The main remaining issue with emoji seems to be support for unicode outside of the basic plane… however, as far as I understand, most browsers should support /u suffixed regex nowadays?

Here's a simple attempt to match all kinds of char literals, including somewhat broken ones, but certainly no lifetimes:

accepts optional b prefix, then ', then
- one (not ', \) character followed by '
- or \ followed by one more character (not x or u), then '
- or \ followed by
  - x then up to two more (non ') characters, then ', or
  - u then {, any number of characters, then }, but any ' that comes early terminates the literal, too
    - (i.e. I've made the { optional, the characters all exclude ', the } is also optional)

I've also excluded line breaks everywhere.

Any additional syntax that this characterizes as "character literal" incorrectly is actually something the rust compiler would error about. So all legal Rust programs are interpreted correctly, and IMO broken ones use reasonable fallbacks

/b?'([^\\'\n]|\\([^xu\n]|x[^'\n]?[^'\n]?|u(\{[^'\}\n]*\}?)?)?)'/u

If you don't want the /u, then the result with transpilation via https://mothereff.in/regexpu would be

/b?'((?:[\0-\t\x0B-&\(-\[\]-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])|\\((?:[\0-\t\x0B-tvwy-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])|x(?:[\0-\t\x0B-&\(-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])?(?:[\0-\t\x0B-&\(-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])?|u(\{(?:[\0-\t\x0B-&\(-\|~-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])*\}?)?)?)'/

steffahn · 2024-12-21T17:53:41Z

If you want strict regexes instead that match nothing beyond legal lexical char literal tokens in Rust, that could look like

/'([^'\\\n\r\t]|\\(['"nrt\\0]|x[0-7][0-9a-fA-F]|u\{([0-9a-fA-F]_*){1,6}\}))'/u

which could transpile to

/'((?:[\0-\x08\x0B\f\x0E-&\(-\[\]-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])|\\(["'0\\nrt]|x[0-7][0-9A-Fa-f]|u\{([0-9A-Fa-f]_*){1,6}\}))'/

and for byte literals:

/b'([\0-\x08\x0B\f\x0E-!#-&\(-\x7F]|\\(["'0\\nrt]|x[0-9A-Fa-f]{2}))'/

(also I'm now noticing that all groups here and in the previous reply could be non-capturing)

joshgoebel · 2024-12-22T20:46:06Z

We'd prefer not one large regex since that prevents us from highlighting child elements, such as the character escapes. Take a look at the latest main and see what you think. It seems to work well on all the test cases I've seen so far... just add a negative look-ahead guard to the un-terminated ' elements to prevent them from matching the beginning of strings.

Thoughts?

steffahn · 2024-12-22T20:58:14Z

What's the easiest way to test out the main branch?

steffahn · 2024-12-22T21:00:41Z

Ah, I guess I can follow these instructions https://highlightjs.readthedocs.io/en/latest/building-testing.html#basic-testing

steffahn · 2024-12-22T21:21:01Z

The issue seems fixed for most Rust code. Do note however that technically lifetimes (as any Rust identifiers) support all kinds of unicode symbols, not just english letters (see Identifiers).

Regarding matching of the whole thing, could perhaps the following work? Match the whole char literal at once for all unescaped ones; only use a bracketing approach for char literals that start with '\…? Or would that preclude highlighting the \ part of an escape if escapes are to get special highlighting?

Another thing that seems currently broken on main is handling of '\'', e.g.

fn test() {
    let x = '\'';
}

steffahn · 2024-12-22T21:24:39Z

Here’s a (perhaps contrived) example with non-English (even non-latin-character) identifiers in lifetimes:

fn foo<'ライフ>(input: &'ライフ str) -> &'ライフ str {
    let スペシャル = "special characters!";
    println!("wow! {スペシャル}");
    return input;
}

steffahn · 2024-12-22T22:46:12Z

Now that I know how to test locally, I think I might try a PR myself tomorrow :-)

steffahn added bug help welcome Could use help from community language labels Dec 20, 2024

steffahn mentioned this issue Dec 20, 2024

Build(deps): Bump highlight.js from 11.10.0 to 11.11.0 discourse/discourse#30288

Merged

steffahn changed the title ~~Rust highlighting in 11.11.0 very *VERY* broken~~ Rust highlighting in 11.11.0 very broken Dec 20, 2024

D-Sketon mentioned this issue Dec 24, 2024

Codeblock highlight incorrectly for rust 'lifetime specifier'. hexojs/hexo#5595

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rust highlighting in 11.11.0 very broken #4190

Rust highlighting in 11.11.0 very broken #4190

steffahn commented Dec 20, 2024

joshgoebel commented Dec 21, 2024

steffahn commented Dec 21, 2024 •

edited

Loading

joshgoebel commented Dec 21, 2024

steffahn commented Dec 21, 2024 •

edited

Loading

steffahn commented Dec 21, 2024 •

edited

Loading

joshgoebel commented Dec 22, 2024

steffahn commented Dec 22, 2024

steffahn commented Dec 22, 2024

steffahn commented Dec 22, 2024

steffahn commented Dec 22, 2024

steffahn commented Dec 22, 2024

Rust highlighting in 11.11.0 very broken #4190

Rust highlighting in 11.11.0 very broken #4190

Comments

steffahn commented Dec 20, 2024

joshgoebel commented Dec 21, 2024

steffahn commented Dec 21, 2024 • edited Loading

joshgoebel commented Dec 21, 2024

steffahn commented Dec 21, 2024 • edited Loading

steffahn commented Dec 21, 2024 • edited Loading

joshgoebel commented Dec 22, 2024

steffahn commented Dec 22, 2024

steffahn commented Dec 22, 2024

steffahn commented Dec 22, 2024

steffahn commented Dec 22, 2024

steffahn commented Dec 22, 2024

steffahn commented Dec 21, 2024 •

edited

Loading

steffahn commented Dec 21, 2024 •

edited

Loading

steffahn commented Dec 21, 2024 •

edited

Loading