Tokenizer line numbers incorrect with Windows line endings #560

Open
cedws opened this issue Nov 30, 2024 · 1 comment
Labels
bug Something isn't working

cedws commented Nov 30, 2024

When source containing Windows line endings is tokenized, the line numbers can be incorrect.

Given this file:

# a
# b
# c

This list of tokens is generated:

- [TYPE]:"Comment" [CHARTYPE]:"Indicator" [INDICATOR]:"Comment" [VALUE]:" a" [ORG]:"# a\r" [POS(line:column:level:offset)]: 1:1:0:1
- [TYPE]:"Comment" [CHARTYPE]:"Indicator" [INDICATOR]:"Comment" [VALUE]:" b" [ORG]:"\n# b\r" [POS(line:column:level:offset)]: 3:1:0:5
- [TYPE]:"Comment" [CHARTYPE]:"Indicator" [INDICATOR]:"Comment" [VALUE]:" c" [ORG]:"\n# c\r" [POS(line:column:level:offset)]: 5:1:0:9

Note that the three lines are numbered 1, 3, and 5 instead of 1, 2, and 3.

With Windows line endings replaced with Unix line endings, the lines are numbered correctly:

- [TYPE]:"Comment" [CHARTYPE]:"Indicator" [INDICATOR]:"Comment" [VALUE]:" a" [ORG]:"# a\n" [POS(line:column:level:offset)]: 1:1:0:1
- [TYPE]:"Comment" [CHARTYPE]:"Indicator" [INDICATOR]:"Comment" [VALUE]:" b" [ORG]:"# b\n" [POS(line:column:level:offset)]: 2:1:0:4
- [TYPE]:"Comment" [CHARTYPE]:"Indicator" [INDICATOR]:"Comment" [VALUE]:" c" [ORG]:"# c\n" [POS(line:column:level:offset)]: 3:1:0:7

You can add Windows line endings by running this sed expression against the input file:

sed -i -e 's/$/\r/' file

And remove them with this:

sed -i -e 's/\r$//' file
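
If sed is not available, the same conversion can be sketched in Go. The helper names `toLF` and `toCRLF` are hypothetical and not part of go-yaml. Unlike the sed expression above, `toCRLF` normalizes first, so running it on input that already has CRLF endings does not add a second `\r`.

```go
package main

import (
	"fmt"
	"strings"
)

// toLF converts CRLF line endings to LF, similar to `sed 's/\r$//'`.
func toLF(s string) string {
	return strings.ReplaceAll(s, "\r\n", "\n")
}

// toCRLF normalizes to LF first so existing CRLF pairs are not doubled,
// then converts every LF to CRLF.
func toCRLF(s string) string {
	return strings.ReplaceAll(toLF(s), "\n", "\r\n")
}

func main() {
	src := "# a\n# b\n# c\n"
	fmt.Printf("%q\n", toCRLF(src))       // "# a\r\n# b\r\n# c\r\n"
	fmt.Printf("%q\n", toLF(toCRLF(src))) // "# a\n# b\n# c\n"
}
```

The CRLF string produced here can be passed to the reproduction code below in place of a file read from disk.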

Code to reproduce:

tokens := lexer.Tokenize(src)
tokens.Dump()

cedws added the bug label on Nov 30, 2024

cedws commented Dec 1, 2024

Upon further investigation it looks like this only occurs due to the way comments are scanned.

go-yaml/scanner/scanner.go

Lines 660 to 675 in f4ccce9

for idx, c := range ctx.src[ctx.idx:] {
	ctx.addOriginBuf(c)
	switch c {
	case '\n', '\r':
		if ctx.previousChar() == '\\' {
			continue
		}
		value := ctx.source(ctx.idx, ctx.idx+idx)
		progress := len([]rune(value))
		ctx.addToken(token.Comment(value, string(ctx.obuf), s.pos()))
		s.progressColumn(ctx, progress)
		s.progressLine(ctx)
		ctx.clear()
		return true
	}
}

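When a \r rune is encountered, it is immediately treated as a newline and the line is progressed. The \n that follows belongs to the same CRLF sequence, but it is then counted as a second line break, so every CRLF advances the line number by two. Removing '\r' from the case fixes the line numbers, although the trailing \r then remains in the comment value: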

- [TYPE]:"Comment" [CHARTYPE]:"Indicator" [INDICATOR]:"Comment" [VALUE]:" a\r" [ORG]:"# a\r\n" [POS(line:column:level:offset)]: 1:1:0:1
- [TYPE]:"Comment" [CHARTYPE]:"Indicator" [INDICATOR]:"Comment" [VALUE]:" b\r" [ORG]:"# b\r\n" [POS(line:column:level:offset)]: 2:1:0:5
- [TYPE]:"Comment" [CHARTYPE]:"Indicator" [INDICATOR]:"Comment" [VALUE]:" c\r" [ORG]:"# c\r\n" [POS(line:column:level:offset)]: 3:1:0:9
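
The double count can be illustrated with a small standalone model. This is a sketch only: `lineNumbers` is a hypothetical function, not the go-yaml scanner, and the CRLF-aware branch is an alternative to the change described above (it consumes `\r\n` as one break instead of dropping `\r` from the case), assuming a lone `\r` should still end a line.

```go
package main

import "fmt"

// lineNumbers returns the 1-based line number on which each non-empty
// line of src starts. With crlfAware false, every '\r' and every '\n'
// advances the counter, which mimics the double count from the issue.
// With crlfAware true, a '\r' followed by '\n' counts as one line break.
func lineNumbers(src string, crlfAware bool) []int {
	var nums []int
	line := 1
	atStart := true
	for i := 0; i < len(src); i++ {
		switch src[i] {
		case '\r':
			if crlfAware && i+1 < len(src) && src[i+1] == '\n' {
				i++ // consume the '\n' of the CRLF pair
			}
			line++
			atStart = true
		case '\n':
			line++
			atStart = true
		default:
			if atStart {
				nums = append(nums, line)
				atStart = false
			}
		}
	}
	return nums
}

func main() {
	src := "# a\r\n# b\r\n# c\r\n"
	fmt.Println(lineNumbers(src, false)) // [1 3 5]
	fmt.Println(lineNumbers(src, true))  // [1 2 3]
}
```

Consuming the pair as a unit would also keep the `\r` out of the comment value, at the cost of a lookahead in the scanner loop.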
