Take into account encoding of source file for syntax error #124188

Open
serhiy-storchaka opened this issue Sep 17, 2024 · 0 comments
Labels
3.12 bugs and security fixes 3.13 bugs and security fixes 3.14 new features, bugs and security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-C-API

Comments

@serhiy-storchaka
Member

serhiy-storchaka commented Sep 17, 2024

Currently, most syntax errors raised in the compiler (except those raised in the parser) use PyErr_ProgramTextObject() to get the line of code. It does not know the encoding of the source file and interprets it as UTF-8 (failing if it contains non-UTF-8 sequences). The parser uses _PyErr_ProgramDecodedTextObject().
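To make the failure concrete, here is a minimal Python sketch of what goes wrong when a line from a Latin-1 encoded source file is read as UTF-8. It only models the decoding step; it does not call the C functions named above, and the helper names `decode_as_utf8` and `decode_with_encoding` are made up for illustration.

```python
# A source line as it would be stored on disk in a Latin-1 encoded file.
raw_line = 'x = "café" +\n'.encode("latin-1")


def decode_as_utf8(raw):
    """Model a reader that assumes UTF-8; return None if decoding fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


def decode_with_encoding(raw, encoding):
    """Model a reader that knows the real source encoding."""
    return raw.decode(encoding, errors="replace")


utf8_text = decode_as_utf8(raw_line)                      # decoding fails
latin1_text = decode_with_encoding(raw_line, "latin-1")   # line is recovered
```

The byte 0xE9 (`é` in Latin-1) is not a valid UTF-8 sequence when followed by an ASCII quote, so the UTF-8 reader has no line to show.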

There are two ways to solve this issue:

  • Pass the source file encoding from the parser to the code generator. This may require changing some data structures, but it is more efficient.
  • Detect the encoding in PyErr_ProgramTextObject(). Since the latter is in the public C API, this can also affect third-party code.
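The second option can be sketched in Python with the standard library's `tokenize.detect_encoding()`, which looks for a BOM or a coding cookie in the first two lines. This is an illustration of the idea only, not the C implementation; the function `read_source_line` and its in-memory `source` argument are assumptions made to keep the example self-contained.

```python
import io
import tokenize


def read_source_line(source, lineno):
    """Return line `lineno` (1-based) of `source` bytes, decoded with the
    encoding declared in the source itself, or None if there is no such line.
    """
    buffer = io.BytesIO(source)
    # Returns the encoding name plus the lines it consumed while looking.
    encoding, _ = tokenize.detect_encoding(buffer.readline)
    buffer.seek(0)
    for number, raw in enumerate(buffer, start=1):
        if number == lineno:
            # "replace" keeps the line usable even if some bytes are invalid.
            return raw.decode(encoding, errors="replace")
    return None
```

The trade-off mentioned above is visible here: the file has to be re-read from the start to find the encoding, whereas the parser already knows it.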

There are other issues with PyErr_ProgramTextObject():

  • It leaves the BOM in the first line if the line contains one. This is not consistent with offsets.
  • For very long lines, it returns the tail of the line that exceeds 1000 bytes. The tail can be short, it can start with an invalid character, and it is not consistent with offsets. If an incomplete line is returned, it is better to return the head.

This all applies to PyErr_ProgramText() as well.
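The two behaviors suggested above (drop the BOM, keep the head of an overlong line) could look roughly like this in Python. It is a sketch of the intended result, not of the C code; `clean_line` is a hypothetical helper and `MAX_LINE_BYTES` simply reuses the 1000-byte figure from the description.

```python
import codecs

MAX_LINE_BYTES = 1000  # assumed limit, taken from the figure quoted above


def clean_line(raw, is_first_line, limit=MAX_LINE_BYTES):
    """Strip a leading UTF-8 BOM from the first line and keep only the head
    of a line that is longer than `limit` bytes.
    """
    if is_first_line and raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    # Keeping the head means column offsets counted from the start of the
    # line still point into the returned text.
    return raw[:limit]
```

Cutting at a byte boundary can split a multi-byte character at the very end of the head, so the result would still need to be decoded with a lenient error handler such as "replace".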

Linked PRs

@serhiy-storchaka serhiy-storchaka added interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-C-API 3.12 bugs and security fixes 3.13 bugs and security fixes 3.14 new features, bugs and security fixes labels Sep 17, 2024
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Sep 17, 2024
* Detect source file encoding.
* Use the "replace" error handler even for UTF-8 (default) encoding.
* Remove the BOM.
* Fix detection of too long lines if they contain NUL.
* Return the head rather than the tail for truncated long lines.
serhiy-storchaka added a commit that referenced this issue Sep 24, 2024
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Sep 24, 2024
(cherry picked from commit e2f7107)

Co-authored-by: Serhiy Storchaka <[email protected]>
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Sep 24, 2024
(cherry picked from commit e2f7107)

Co-authored-by: Serhiy Storchaka <[email protected]>
serhiy-storchaka added a commit that referenced this issue Sep 24, 2024
(cherry picked from commit e2f7107)
serhiy-storchaka added a commit that referenced this issue Oct 7, 2024
(cherry picked from commit e2f7107)

Co-authored-by: Serhiy Storchaka <[email protected]>