-
Notifications
You must be signed in to change notification settings - Fork 701
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ignore invalid Unicode in pkg-config descriptions #9609
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work!
pkgList <- lines <$> getProgramOutput verbosity pkgConfig ["--list-all"] | ||
-- To prevent malformed Unicode in the descriptions from crashing cabal, | ||
-- read without interpreting any encoding first. (#9608) | ||
pkgList <- LBS.split 10 <$> getProgramInvocationLBS verbosity (programInvocation pkgConfig ["--list-all"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest we avoid the literal 10
, it confused me initially
pkgList <- LBS.split 10 <$> getProgramInvocationLBS verbosity (programInvocation pkgConfig ["--list-all"]) | |
pkgList <- LBS.split (fromEnum '\n') <$> getProgramInvocationLBS verbosity (programInvocation pkgConfig ["--list-all"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be a Word8
, so that'd become fromIntegral (ord '\n')
which is quite verbose. But fair point, maybe one shouldn't assume extensive familiarity with ASCII from Cabal contributors :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose you could also just add -- 10 is ASCII for '\n'
, or -- split on newlines (10 == fromEnum '\n')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made it the verbose version, and just added a newline ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
-- The output of @pkg-config --list-all@ also includes a description | ||
-- for each package, which we do not need. | ||
let pkgNames = map (takeWhile (not . isSpace)) pkgList | ||
let pkgNamesLBS = map (LBS.takeWhile (not . isSpace . chr . fromIntegral)) pkgList |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this work if you have multi-point unicode characters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will not! You are correct, isSpace
also recognises non-breaking spaces (U+00A0) which, when considered mod 256, have the high bit set and can thus occur in UTF-8 text as part of a multi-byte unicode codepoint. I had naively expected there to be no space characters in the Unicode range between 127 and 256, which would have made this work due to the UTF-8 encoding.
What do you suggest? Something like not . isAsciiSpace
where isAsciiSpace n = n `elem` map (fromIntegral . ord) " \t"
?
pkgList <- lines <$> getProgramOutput verbosity pkgConfig ["--list-all"] | ||
-- To prevent malformed Unicode in the descriptions from crashing cabal, | ||
-- read without interpreting any encoding first. (#9608) | ||
pkgList <- LBS.split 10 <$> getProgramInvocationLBS verbosity (programInvocation pkgConfig ["--list-all"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this magic number 10?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's an ASCII newline. This has been changed on alt-romes' suggestion to fromIntegral (ord '\n')
in a later commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's already spotted by @alt-romes and fixed by the author.
BTW, would it be possible to add a small test with sample invalid characters preceded by valid ones? |
let pkgNames = map (takeWhile (not . isSpace)) pkgList | ||
let pkgNamesLBS = map (LBS.takeWhile (not . isSpace . chr . fromIntegral)) pkgList | ||
-- Now decode as UTF8 and convert to String, dropping any that fail decoding. | ||
let pkgNames = rights $ map (fmap T.unpack . T.decodeUtf8' . LBS.toStrict) pkgNamesLBS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dropping things without giving any output saying that you're doing that seems likely to lead to very hard-to-debug issues later. Can we log (at high verbosity) saying that we're ignoring an entry because it has invalid unicode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! You're right. I'm now logging something at Verbose
level, which seems to be -v2
(but not 100% sure). The test also checks this.
CI:
for a 9.2.7 buid. |
CI still grumpy. This time likely related to the changes themselves. |
@Mikolaj My latest commit fixes the failure in PackageTests/ExtraProgPath that I saw in the CI. It's an ugly fix, though, and you (plural) will probably have opinions on how to do this in cabal. Feel free to edit however you wish; from now on I'll have very little time work work on this until Monday 2024-01-22. EDIT: Just to summarise, the original |
Let me trigger CI again for the unlikely case the failure was a fluke... Edit: it broke the same. |
@mergify rebase |
Previously, if any of the pkg-config packages on the system had invalid Unicode in their description fields (like the Intel vpl package has at the time of writing, 2024-01-11, see haskell#9608), cabal would crash because it tried to interpret the entire `pkg-config --list-all` output as Unicode. This change, as suggested by gbaz in haskell#9608 (comment) switches to using a lazy ByteString for reading in the output, splitting on the first space in byte land, and then parsing only the package _name_ to a String. For further future-proofing, package names that don't parse as valid Unicode don't crash Cabal, but are instead ignored.
✅ Branch has been successfully rebased |
893fa00
to
2900ecd
Compare
let pkgNames = map (takeWhile (not . isSpace)) pkgList | ||
-- To prevent malformed Unicode in the descriptions from crashing cabal, | ||
-- read without interpreting any encoding first. (#9608) | ||
pkgList <- LBS.split (fromIntegral (ord '\n')) <$> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation doesn't seem very nice to me, it at least should be abstracted into it's own function.
- How about using
getProgramInvocationsIODataAndErrors
? Which doesn't crash when theexitCode
is notExitSuccess
- Could you use a parser abstraction to parse the input (you can use
Parsec
interface incabal-install-solver
)
In any case I don't think all this low level logic should be inlined here, so how about a new top level function getPkgConfigPkgList
to encapsulate it all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using parsec to split a package list on newlines is extreme overkill. Additionally, I don't think this is a lot of logic to inline, just well documented logic. And finally its actually the bulk of the function -- the existing top level function is already the encapsulation.
review is cosmentic and blocking merge
Let me restart the CI. The current failure starts with the following on macos:
|
MacOS doesn't have /usr/bin/sh, and /bin/sh is the standard (for a POSIX shell) anyway
@mergify backport 3.10 |
✅ Backports have been created
|
* Ignore invalid Unicode in pkg-config descriptions Previously, if any of the pkg-config packages on the system had invalid Unicode in their description fields (like the Intel vpl package has at the time of writing, 2024-01-11, see #9608), cabal would crash because it tried to interpret the entire `pkg-config --list-all` output as Unicode. This change, as suggested by gbaz in #9608 (comment) switches to using a lazy ByteString for reading in the output, splitting on the first space in byte land, and then parsing only the package _name_ to a String. For further future-proofing, package names that don't parse as valid Unicode don't crash Cabal, but are instead ignored. * Add changelog entry * cabal-install-solver: Add bounds on 'text' * No literal ASCII values, use 'ord' * Address review comments re invalid unicode from pkg-config * Add test for invalid unicode from pkg-config * Compatibility with text-1.2.5.0 * Align imports * Handle different exception type * Use only POSIX shell syntax * Add invalid-input handler in pkg-config shim This is to appease shellcheck * Actually implement all required stuff in the pkg-config shim * Less exception dance * Fix shebang lines MacOS doesn't have /usr/bin/sh, and /bin/sh is the standard (for a POSIX shell) anyway * Don't expect a particular representation of invalid characters --------- Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> (cherry picked from commit 0b34b4e) # Conflicts: # Cabal/src/Distribution/Simple/Program/Run.hs
Thanks everyone for the reviews! And also thanks everyone for the help, in particular @Mikolaj :) |
* Ignore invalid Unicode in pkg-config descriptions Previously, if any of the pkg-config packages on the system had invalid Unicode in their description fields (like the Intel vpl package has at the time of writing, 2024-01-11, see #9608), cabal would crash because it tried to interpret the entire `pkg-config --list-all` output as Unicode. This change, as suggested by gbaz in #9608 (comment) switches to using a lazy ByteString for reading in the output, splitting on the first space in byte land, and then parsing only the package _name_ to a String. For further future-proofing, package names that don't parse as valid Unicode don't crash Cabal, but are instead ignored. * Add changelog entry * cabal-install-solver: Add bounds on 'text' * No literal ASCII values, use 'ord' * Address review comments re invalid unicode from pkg-config * Add test for invalid unicode from pkg-config * Compatibility with text-1.2.5.0 * Align imports * Handle different exception type * Use only POSIX shell syntax * Add invalid-input handler in pkg-config shim This is to appease shellcheck * Actually implement all required stuff in the pkg-config shim * Less exception dance * Fix shebang lines MacOS doesn't have /usr/bin/sh, and /bin/sh is the standard (for a POSIX shell) anyway * Don't expect a particular representation of invalid characters --------- Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> (cherry picked from commit 0b34b4e)
Ignore invalid Unicode in pkg-config descriptions (backport #9609)
* Ignore invalid Unicode in pkg-config descriptions Previously, if any of the pkg-config packages on the system had invalid Unicode in their description fields (like the Intel vpl package has at the time of writing, 2024-01-11, see haskell#9608), cabal would crash because it tried to interpret the entire `pkg-config --list-all` output as Unicode. This change, as suggested by gbaz in haskell#9608 (comment) switches to using a lazy ByteString for reading in the output, splitting on the first space in byte land, and then parsing only the package _name_ to a String. For further future-proofing, package names that don't parse as valid Unicode don't crash Cabal, but are instead ignored. * Add changelog entry * cabal-install-solver: Add bounds on 'text' * No literal ASCII values, use 'ord' * Address review comments re invalid unicode from pkg-config * Add test for invalid unicode from pkg-config * Compatibility with text-1.2.5.0 * Align imports * Handle different exception type * Use only POSIX shell syntax * Add invalid-input handler in pkg-config shim This is to appease shellcheck * Actually implement all required stuff in the pkg-config shim * Less exception dance * Fix shebang lines MacOS doesn't have /usr/bin/sh, and /bin/sh is the standard (for a POSIX shell) anyway * Don't expect a particular representation of invalid characters --------- Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Previously, if any of the pkg-config packages on the system had invalid Unicode in their description fields (like the Intel vpl package has at the time of writing, 2024-01-11, see #9608), cabal would crash because it tried to interpret the entire
pkg-config --list-all
output as Unicode.This change, as suggested by gbaz in #9608 (comment), switches to using a lazy ByteString for reading in the output, splitting on the first space in byte land, and then parsing only the package name to a String.
For further future-proofing, package names that don't parse as valid Unicode don't crash Cabal, but are instead ignored.
Fixes #9608.