Attach input location to tokens (add spans feature) #10

not-my-profile · 2021-11-30T11:46:18Z

Hey, thanks for this library ... it looks really promising :) I am working on an HTML linter for which I require the spans of parser errors, tag names, attribute names and attribute values. These spans would ideally be reported as core::ops::Range<usize>, so that I can pass them directly to the codespan_reporting library (codespan_reporting::diagnostic::Label::range in particular). Since span tracking of course adds overhead, it would sit behind an off-by-default feature flag.

I recently implemented this in my fork of the html5ever tokenizer ... which I frankly would love to abandon for a more sound library :) If you are interested in this I can probably implement it.
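To illustrate why byte-offset spans of type Range<usize> are convenient for diagnostics, here is a minimal std-only sketch. It does not use codespan_reporting; the `underline` helper is hypothetical and only handles a single-line source whose span falls on character boundaries.

```rust
use std::ops::Range;

/// Hypothetical helper (not part of any library): render `source` followed by
/// a line of carets underneath the bytes covered by `span`.
/// Assumes `source` is a single line and `span` lies on char boundaries;
/// slicing panics otherwise.
fn underline(source: &str, span: Range<usize>) -> String {
    // Count characters (not bytes) so the carets line up for non-ASCII text.
    let prefix_width = source[..span.start].chars().count();
    // Always draw at least one caret, even for an empty span.
    let span_width = source[span.clone()].chars().count().max(1);
    format!(
        "{}\n{}{}",
        source,
        " ".repeat(prefix_width),
        "^".repeat(span_width)
    )
}
```

With a span, the whole attribute name can be highlighted; with only a position, a single caret is the best one can do.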

untitaker · 2021-11-30T11:56:55Z

I think the easiest way to hack this in without adding overhead to the common case is to wrap the Reader in one that keeps track of how much input has been read so far. Then you can get the current file position and attach it to every token you retrieve from the tokenizer. A file position is not really a span, but maybe close enough?
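A rough sketch of that wrapper idea, under stated assumptions: the `CharReader` trait below is a hypothetical stand-in and not the library's actual Reader trait, and `StrReader` and `PositionReader` are illustrative names.

```rust
/// Hypothetical stand-in for the tokenizer's reader abstraction.
trait CharReader {
    fn read_char(&mut self) -> Option<char>;
}

/// Simple reader over a string slice, used here only for illustration.
struct StrReader<'a> {
    chars: std::str::Chars<'a>,
}

impl<'a> StrReader<'a> {
    fn new(input: &'a str) -> Self {
        Self { chars: input.chars() }
    }
}

impl CharReader for StrReader<'_> {
    fn read_char(&mut self) -> Option<char> {
        self.chars.next()
    }
}

/// Wrapper that counts how many bytes the inner reader has handed out.
struct PositionReader<R> {
    inner: R,
    position: usize,
}

impl<R: CharReader> PositionReader<R> {
    fn new(inner: R) -> Self {
        Self { inner, position: 0 }
    }

    /// Byte offset just past the most recently read character.
    fn position(&self) -> usize {
        self.position
    }
}

impl<R: CharReader> CharReader for PositionReader<R> {
    fn read_char(&mut self) -> Option<char> {
        let c = self.inner.read_char()?;
        // Track byte offsets so positions can index into the source string.
        self.position += c.len_utf8();
        Some(c)
    }
}
```

Querying `position()` after each token gives roughly the end of that token, which is why this yields a position rather than a full span.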

I'm interested in this feature, but I have no idea how it could be done without adding overhead. Probably by using a generic that resolves to () for the common case?

not-my-profile · 2021-11-30T12:15:20Z

> File position is not really a span but maybe close enough?

I want to output the nicest error messages possible ... for which I need spans ... not just the current position.

> I have no idea how it could be done without adding overhead

I was thinking of a bunch of #[cfg(feature = "spans")]{ ... } in the giant (unreadable) match block.
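As a minimal sketch of the feature-flag approach (assuming a Cargo feature named "spans"; the `StartTag` struct and `emit_start_tag` function are simplified, hypothetical stand-ins for the real tokenizer code):

```rust
#[cfg(feature = "spans")]
use std::ops::Range;

#[derive(Debug, PartialEq)]
struct StartTag {
    name: String,
    // The field only exists when the crate is built with `--features spans`.
    #[cfg(feature = "spans")]
    span: Range<usize>,
}

/// Hypothetical emit step: `start` and `end` are byte offsets of the tag.
fn emit_start_tag(name: &str, start: usize, end: usize) -> StartTag {
    // Without the feature the offsets are unused; silence the warnings.
    #[cfg(not(feature = "spans"))]
    let _ = (start, end);

    StartTag {
        name: name.to_owned(),
        #[cfg(feature = "spans")]
        span: start..end,
    }
}
```

The drawback is that the choice is fixed at compile time for the whole crate, so comparing both variants means building twice.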

untitaker · 2021-11-30T12:26:39Z

Yes, a compile-time feature probably works. Ideally it would be configurable per tokenizer instance for testability reasons, though. Perhaps it can be done via generics.

not-my-profile · 2021-11-30T12:41:40Z

I am afraid I don't see how this could be nicely done with generics. What are these testability reasons you speak of?

untitaker · 2021-11-30T12:49:40Z

If I want to, for example, test once with spans and once without, or benchmark with and without spans, I would have to recompile everything each time.

generics idea:

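use std::collections::BTreeMap;
use std::ops::Range;

enum Token<S> {
    StartTag(StartTag<S>),
    // ...
}

struct StartTag<S> {
    name: String,
    attributes: BTreeMap<String, String>,
    // Either `()` (no tracking) or a byte range into the input.
    span: S,
}

// Byte offsets, matching the Range<usize> requested above.
type Span = Range<usize>;
type TokenWithoutSpans = Token<()>;
type TokenWithSpans = Token<Span>;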

and make DefaultEmitter also generic over S.

Your entire tokenizer already has the emitter as a type parameter, so now it can be Tokenizer<DefaultEmitter<()>, _> for running without spans, and Tokenizer<DefaultEmitter<Span>, _> for running with spans.

So during monomorphization you get two separately optimized code paths.
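To make the generics idea a bit more concrete, here is a small self-contained sketch. The `SpanBound` trait and the `emit_start_tag` method are assumptions made for illustration; the types are pared down from the outline above and do not reflect the library's actual API.

```rust
use std::ops::Range;

/// Hypothetical trait describing how a span type is built from byte offsets.
trait SpanBound: Sized {
    fn from_offsets(start: usize, end: usize) -> Self;
}

/// No tracking: the offsets are discarded and the span is zero-sized.
impl SpanBound for () {
    fn from_offsets(_start: usize, _end: usize) -> Self {}
}

/// Tracking: keep the offsets as a byte range.
impl SpanBound for Range<usize> {
    fn from_offsets(start: usize, end: usize) -> Self {
        start..end
    }
}

#[derive(Debug, PartialEq)]
struct StartTag<S> {
    name: String,
    span: S,
}

#[derive(Debug, PartialEq)]
enum Token<S> {
    StartTag(StartTag<S>),
}

/// Simplified emitter that is generic over the span type.
struct DefaultEmitter<S> {
    tokens: Vec<Token<S>>,
}

impl<S: SpanBound> DefaultEmitter<S> {
    fn new() -> Self {
        Self { tokens: Vec::new() }
    }

    /// Each choice of `S` gets its own monomorphized copy of this method.
    fn emit_start_tag(&mut self, name: &str, start: usize, end: usize) {
        self.tokens.push(Token::StartTag(StartTag {
            name: name.to_owned(),
            span: S::from_offsets(start, end),
        }));
    }
}
```

Because `()` is zero-sized, the span-free instantiation stores nothing extra per token, and both variants can live side by side in one test or benchmark binary.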

untitaker added the enhancement New feature or request label Nov 30, 2021

not-my-profile mentioned this issue Nov 30, 2021

Implement Spans via generics #14

Closed

untitaker changed the title ~~Add optional spans feature to report source code spans~~ Attach input location to tokens (add spans feature) Dec 5, 2021

mre mentioned this issue Feb 1, 2022

Add html5gum as alternative link extractor lycheeverse/lychee#480

Merged

mre mentioned this issue Nov 25, 2023

Feature request: Line numbers and columns in output lycheeverse/lychee#1304

Closed
