Detect encoding per yaml spec (fix #238) #240
base: master
Thanks! Is about what I had in mind as #238 (comment) option 4.
See a few inline comments with suggestions. Also may be possible to avoid opening the file twice with different modes, but it's probably fiddly and not worth the hassle.
Sorry for the delay!
Thanks @spyoungtech and @bz2 for your help on this. This solution looks good, I wasn't aware that the YAML spec states that UTF-8 is the default.
In this case, we need to add a test for a non-UTF encoding. Could you add a fake file (e.g. non-ascii/iso-8859-1 containing éàöî) encoded in ISO-8859-1, and expect the UnicodeDecodeError?
@bz2 @spyoungtech do you know any easy solution to accept custom user encodings, depending on the system yamllint is run from? E.g. reading the LC_ALL environment var?
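The requested ISO-8859-1 check could be sketched roughly as below. This is only an illustration, not the test that ended up in the PR: the helper names (make_latin1_fixture, read_as_utf8, utf8_read_fails) and the file name are hypothetical, and the file is read directly with io.open() rather than through yamllint.

```python
import io
import os


def make_latin1_fixture(directory):
    """Write a small YAML file encoded in ISO-8859-1 (hypothetical helper)."""
    path = os.path.join(directory, 'iso-8859-1.yaml')
    # u'\xe9\xe0\xf6\xee' is the string "éàöî".
    content = u'key: \xe9\xe0\xf6\xee\n'
    with io.open(path, 'wb') as f:
        f.write(content.encode('iso-8859-1'))
    return path


def read_as_utf8(path):
    """Read the file strictly as UTF-8, the default named by the YAML spec."""
    with io.open(path, encoding='utf-8', newline='') as f:
        return f.read()


def utf8_read_fails(directory):
    """Return True if reading the ISO-8859-1 fixture as UTF-8 raises."""
    path = make_latin1_fixture(directory)
    try:
        read_as_utf8(path)
    except UnicodeDecodeError:
        return True
    return False
```

The bytes E9 E0 F6 EE are not a valid UTF-8 sequence, so a strict UTF-8 read is expected to raise UnicodeDecodeError instead of returning mangled text.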
Just wanted to say I'm still working on this, but have had some things come up. I should get around to fixing up the review items this weekend. Maintainers are also free to commit directly to my branch :-)
Thanks @spyoungtech, no problem. Sure, I could fix the few issues myself, but before merging I just want to be sure that:
@adrienverge I think the right thing on non-Unicode encodings, from a user perspective, is to add a new error type but otherwise parse as well as possible. We could implement that either by falling back to parsing via the locale encoding, and/or by replacing non-UTF-8 byte sequences on parse.
Agree. Perhaps it would be reasonable then to use something like It may also be worth noting that the spec draft of YAML 1.2 makes UTF-32 support mandatory.
Okay, I went ahead and committed a new change to create a new function
Simply put, UTF-8 is the most sensible default; it's also the default of the spec. I had considered using The solution I put forward fixes the bug in #238 and also allows for an explicit encoding to be provided, should you ever decide on a mechanism to provide that encoding (perhaps a command-line argument?). Another option would be to pass an appropriate argument into the
Hello, thanks for the update.
I'm very much against adding dependencies (like chardet).
Moreover, this last version complicates the code, and I'm pretty sure we'll get other problems on some platforms. Also, the yamlopen() wrapper isn't really needed, I think.
Thinking about the problem, yamllint has always used open() to read input files, and since commit 91763f5 it uses io.open(). On Python 3 both these functions use locale.getpreferredencoding() as the default encoding if none is provided. This seems like the correct approach, and it's standard.
Problem: it just doesn't work with io.open() on Python 2. Since Python 2 will eventually be dropped, I propose to simply:
if sys.version_info.major < 3:
    # On Python 2, io.open() causes decoding problems:
    opener = open(file)
else:
    opener = io.open(file, newline='')
with opener as f:
@spyoungtech @bz2 what do you think?
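To see what the proposal above relies on, a small sketch can report which encoding io.open() picks when none is given, next to what locale.getpreferredencoding(False) returns. The function name default_open_encoding is hypothetical, and no claim is made here about which values a given platform or Python version returns.

```python
import io
import locale
import os
import tempfile


def default_open_encoding():
    """Return (encoding chosen by io.open(), locale preferred encoding)."""
    fd, path = tempfile.mkstemp(suffix='.yaml')
    os.close(fd)
    try:
        # No encoding argument: the text layer falls back to its default.
        with io.open(path, newline='') as f:
            return f.encoding, locale.getpreferredencoding(False)
    finally:
        os.remove(path)


if __name__ == '__main__':
    opened, preferred = default_open_encoding()
    print('io.open() default: %s' % opened)
    print('locale preferred:  %s' % preferred)
```

Running this under different LC_ALL or LANG settings is one way to check how a locale-dependent default would behave for users whose files are not UTF-8.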
About tests:
- Again, could you add a test with a fake file non-ascii/iso-8859-1 containing éàöî encoded in ISO-8859-1?
- Can you run yamllint on encoded files (not just with cli.yamlopen(path)), to check there isn't any crash from end to end?
path = os.path.join(self.wd, 'a.yaml')
with cli.yamlopen(path, encoding='windows-1252') as yaml_file:
    yaml_file.read()
self.assertEqual(yaml_file.encoding, 'windows-1252')
This does not test anything because it was already opened with encoding='windows-1252', does it?
It tests that yamlopen correctly accepts an explicit encoding, instead of detecting the encoding with chardet.
There are tests that run the linter on all the new non-ascii files, except for the file that is made up of random bytes that cannot be decoded and exists just to be sure UTF-8 is used as a default if chardet can't detect any encoding.
I understand not wanting to add dependencies. If it's any consolation chardet itself is pure python with no external dependencies. I can also go back and revert to just checking the BOM, per the yaml spec and expecting failure or malformed data in any other case.
The motivation for using chardet was the comment from bz2 about making a best-effort to parse files even if they're not an acceptable encoding.
The yamlopen context manager was also implemented per request on review. I can remove that, too, if you prefer not to use it.
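The BOM-only fallback mentioned above could look roughly like the following. It is a minimal sketch, not the code in this PR: the names detect_bom_encoding and decode_yaml_bytes are hypothetical, and it only looks at byte order marks, so a UTF-16 or UTF-32 file without a BOM would be treated as UTF-8.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = (
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
)


def detect_bom_encoding(head, default='utf-8'):
    """Pick a codec name from the leading bytes; fall back to UTF-8."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default


def decode_yaml_bytes(data):
    """Decode raw bytes using the BOM if present, else strict UTF-8."""
    # The 'utf-16', 'utf-32' and 'utf-8-sig' codecs consume the BOM, so it
    # does not appear in the decoded text.
    return data.decode(detect_bom_encoding(data[:4]))
```

With this approach, a file in ISO-8859-1 or windows-1252 that contains non-ASCII bytes would usually fail with UnicodeDecodeError rather than being guessed at, which is the "expecting failure or malformed data" trade-off described in the comment.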
It tests that yamlopen correctly accepts an explicit encoding, instead of detecting the encoding with chardet.
OK, thanks for the explanation!
There are tests that run the linter on all the new non-ascii files, except for the file that is made up of random bytes that cannot be decoded and exists just to be sure UTF-8 is used as a default if chardet can't detect any encoding.
Sure. But there are still no tests for an ISO-8859-1-encoded file, unless I missed something?
I understand not wanting to add dependencies. If it's any consolation chardet itself is pure python with no external dependencies. I can also go back and revert to just checking the BOM, per the yaml spec and expecting failure or malformed data in any other case.
The motivation for using chardet was the comment from bz2 about making a best-effort to parse files even if they're not an acceptable encoding.
Sure, I understood that. But I'm very against adding new dependencies (pure Python or not).
Have you seen my proposal about a temporary Python-2-only solution? What do you think of it?
PS: I opened this related PR: #249
The yamlopen context manager was also implemented per request on review. I can remove that, too, if you prefer not to use it.
Yes I've seen it, but I don't think it helps readability.
Resolves #238