Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore invalid Unicode in pkg-config descriptions #9609

Merged
merged 16 commits into from
Jan 25, 2024

Conversation

tomsmeding
Copy link
Collaborator

@tomsmeding tomsmeding commented Jan 11, 2024

Previously, if any of the pkg-config packages on the system had invalid Unicode in their description fields (like the Intel vpl package has at the time of writing, 2024-01-11, see #9608), cabal would crash because it tried to interpret the entire pkg-config --list-all output as Unicode.

This change, as suggested by gbaz in #9608 (comment), switches to using a lazy ByteString for reading in the output, splitting on the first space in byte land, and then parsing only the package name to a String.

For further future-proofing, package names that don't parse as valid Unicode don't crash Cabal, but are instead ignored.

Fixes #9608.

  • Patches conform to the coding conventions.
  • Any changes that could be relevant to users have been recorded in the changelog.
  • The documentation has been updated, if necessary. (unnecessary, bug fix)
  • Manual QA notes have been included. (unnecessary, no CLI change)
  • Tests have been added. (Ask for help if you don’t know how to write them! Ask for an exemption if tests are too complex for too little coverage!)

Copy link
Collaborator

@alt-romes alt-romes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work!

pkgList <- lines <$> getProgramOutput verbosity pkgConfig ["--list-all"]
-- To prevent malformed Unicode in the descriptions from crashing cabal,
-- read without interpreting any encoding first. (#9608)
pkgList <- LBS.split 10 <$> getProgramInvocationLBS verbosity (programInvocation pkgConfig ["--list-all"])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we avoid the literal 10, it confused me initially

Suggested change
pkgList <- LBS.split 10 <$> getProgramInvocationLBS verbosity (programInvocation pkgConfig ["--list-all"])
pkgList <- LBS.split (fromEnum '\n') <$> getProgramInvocationLBS verbosity (programInvocation pkgConfig ["--list-all"])

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be a Word8, so that'd become fromIntegral (ord '\n') which is quite verbose. But fair point, maybe one shouldn't assume extensive familiarity with ASCII from Cabal contributors :D

Copy link
Collaborator

@alt-romes alt-romes Jan 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose you could also just add -- 10 is ASCII for '\n', or -- split on newlines (10 == fromEnum '\n')

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made it the verbose version, and just added a newline ;)

Copy link
Member

@Mikolaj Mikolaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

-- The output of @pkg-config --list-all@ also includes a description
-- for each package, which we do not need.
let pkgNames = map (takeWhile (not . isSpace)) pkgList
let pkgNamesLBS = map (LBS.takeWhile (not . isSpace . chr . fromIntegral)) pkgList
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this work if you have multi-point unicode characters?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will not! You are correct, isSpace also recognises non-breaking spaces (U+00A0) which, when considered mod 256, have the high bit set and can thus occur in UTF-8 text as part of a multi-byte unicode codepoint. I had naively expected there to be no space characters in the Unicode range between 127 and 256, which would have made this work due to the UTF-8 encoding.

What do you suggest? Something like not . isAsciiSpace where isAsciiSpace n = n `elem` map (fromIntegral . ord) " \t"?

pkgList <- lines <$> getProgramOutput verbosity pkgConfig ["--list-all"]
-- To prevent malformed Unicode in the descriptions from crashing cabal,
-- read without interpreting any encoding first. (#9608)
pkgList <- LBS.split 10 <$> getProgramInvocationLBS verbosity (programInvocation pkgConfig ["--list-all"])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this magic number 10?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an ASCII newline. This has been changed on alt-romes' suggestion to fromIntegral (ord '\n') in a later commit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's already spotted by @alt-romes and fixed by the author.

@Mikolaj
Copy link
Member

Mikolaj commented Jan 11, 2024

BTW, would it be possible to add a small test with sample invalid characters preceded by valid ones?

let pkgNames = map (takeWhile (not . isSpace)) pkgList
let pkgNamesLBS = map (LBS.takeWhile (not . isSpace . chr . fromIntegral)) pkgList
-- Now decode as UTF8 and convert to String, dropping any that fail decoding.
let pkgNames = rights $ map (fmap T.unpack . T.decodeUtf8' . LBS.toStrict) pkgNamesLBS
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropping things without giving any output saying that you're doing that seems likely to lead to very hard-to-debug issues later. Can we log (at high verbosity) saying that we're ignoring an entry because it has invalid unicode?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! You're right. I'm now logging something at Verbose level, which seems to be -v2 (but not 100% sure). The test also checks this.

@Mikolaj
Copy link
Member

Mikolaj commented Jan 11, 2024

CI:

src/Distribution/Solver/Types/PkgConfigDb.hs:87:55: error:
    Not in scope: ‘T.decodeUtf8Lenient’
    Neither ‘Data.Text’ nor ‘Data.Text.Encoding’ exports ‘decodeUtf8Lenient’.
   |
87 |                             Left _ -> Left (T.unpack (T.decodeUtf8Lenient sbsname))

for a 9.2.7 buid.

@tomsmeding tomsmeding requested a review from Mikolaj January 11, 2024 15:16
@tomsmeding tomsmeding added the squash+merge me Tell Mergify Bot to squash-merge label Jan 11, 2024
@Mikolaj
Copy link
Member

Mikolaj commented Jan 12, 2024

CI still grumpy. This time likely related to the changes themselves.

@tomsmeding
Copy link
Collaborator Author

tomsmeding commented Jan 12, 2024

@Mikolaj My latest commit fixes the failure in PackageTests/ExtraProgPath that I saw in the CI. It's an ugly fix, though, and you (plural) will probably have opinions on how to do this in cabal. Feel free to edit however you wish; from now on I'll have very little time work work on this until Monday 2024-01-22.

EDIT: Just to summarise, the original getProgramOutput throw an IO exception when the invoked program (pkg-config here) returned a non-zero exit code, but getProgramOutputLBS has been upgraded to use CabalExceptions and thus its non-zero-exit-code exception did not get caught by the handler at the top of readPkgConfigDb. My fix is to convert it back to an IO exception. You will probably want to do something different here.

@mergify mergify bot added the merge delay passed Applied (usually by Mergify) when PR approved and received no updates for 2 days label Jan 14, 2024
@Mikolaj
Copy link
Member

Mikolaj commented Jan 15, 2024

Let me trigger CI again for the unlikely case the failure was a fluke...

Edit: it broke the same.

@Mikolaj
Copy link
Member

Mikolaj commented Jan 15, 2024

@mergify rebase

Previously, if any of the pkg-config packages on the system had invalid
Unicode in their description fields (like the Intel vpl package has at
the time of writing, 2024-01-11, see haskell#9608), cabal would crash because
it tried to interpret the entire `pkg-config --list-all` output as
Unicode.

This change, as suggested by gbaz in
  haskell#9608 (comment)
switches to using a lazy ByteString for reading in the output, splitting
on the first space in byte land, and then parsing only the package
_name_ to a String.

For further future-proofing, package names that don't parse as valid
Unicode don't crash Cabal, but are instead ignored.
Copy link
Contributor

mergify bot commented Jan 15, 2024

rebase

✅ Branch has been successfully rebased

@Mikolaj Mikolaj force-pushed the pkg-config-badunicode branch from 893fa00 to 2900ecd Compare January 15, 2024 11:30
let pkgNames = map (takeWhile (not . isSpace)) pkgList
-- To prevent malformed Unicode in the descriptions from crashing cabal,
-- read without interpreting any encoding first. (#9608)
pkgList <- LBS.split (fromIntegral (ord '\n')) <$>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation doesn't seem very nice to me, it at least should be abstracted into it's own function.

  • How about using getProgramInvocationsIODataAndErrors? Which doesn't crash when the exitCode is not ExitSuccess
  • Could you use a parser abstraction to parse the input (you can use Parsec interface in cabal-install-solver)

In any case I don't think all this low level logic should be inlined here, so how about a new top level function getPkgConfigPkgList to encapsulate it all.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using parsec to split a package list on newlines is extreme overkill. Additionally, I don't think this is a lot of logic to inline, just well documented logic. And finally its actually the bulk of the function -- the existing top level function is already the encapsulation.

@Mikolaj Mikolaj removed the merge delay passed Applied (usually by Mergify) when PR approved and received no updates for 2 days label Jan 15, 2024
@mergify mergify bot added the merge delay passed Applied (usually by Mergify) when PR approved and received no updates for 2 days label Jan 17, 2024
@gbaz gbaz dismissed mpickering’s stale review January 23, 2024 21:24

review is cosmentic and blocking merge

@Mikolaj
Copy link
Member

Mikolaj commented Jan 24, 2024

Let me restart the CI. The current failure starts with the following on macos:

Running: /Users/runner/work/cabal/cabal/cabal-testsuite/PackageTests/PkgConfigParse/./pkg-config --list-all
Failed to query pkg-config, Cabal will continue without solving for pkg-config
constraints:
/Users/runner/work/cabal/cabal/cabal-testsuite/PackageTests/PkgConfigParse/./pkg-config:
createProcess: posix_spawnp: does not exist (No such file or directory)

MacOS doesn't have /usr/bin/sh, and /bin/sh is the standard (for a POSIX
shell) anyway
@mergify mergify bot merged commit 0b34b4e into haskell:master Jan 25, 2024
50 checks passed
@Mikolaj
Copy link
Member

Mikolaj commented Jan 25, 2024

@mergify backport 3.10

Copy link
Contributor

mergify bot commented Jan 25, 2024

backport 3.10

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Jan 25, 2024
* Ignore invalid Unicode in pkg-config descriptions

Previously, if any of the pkg-config packages on the system had invalid
Unicode in their description fields (like the Intel vpl package has at
the time of writing, 2024-01-11, see #9608), cabal would crash because
it tried to interpret the entire `pkg-config --list-all` output as
Unicode.

This change, as suggested by gbaz in
  #9608 (comment)
switches to using a lazy ByteString for reading in the output, splitting
on the first space in byte land, and then parsing only the package
_name_ to a String.

For further future-proofing, package names that don't parse as valid
Unicode don't crash Cabal, but are instead ignored.

* Add changelog entry

* cabal-install-solver: Add bounds on 'text'

* No literal ASCII values, use 'ord'

* Address review comments re invalid unicode from pkg-config

* Add test for invalid unicode from pkg-config

* Compatibility with text-1.2.5.0

* Align imports

* Handle different exception type

* Use only POSIX shell syntax

* Add invalid-input handler in pkg-config shim

This is to appease shellcheck

* Actually implement all required stuff in the pkg-config shim

* Less exception dance

* Fix shebang lines

MacOS doesn't have /usr/bin/sh, and /bin/sh is the standard (for a POSIX
shell) anyway

* Don't expect a particular representation of invalid characters

---------

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
(cherry picked from commit 0b34b4e)

# Conflicts:
#	Cabal/src/Distribution/Simple/Program/Run.hs
@tomsmeding
Copy link
Collaborator Author

Thanks everyone for the reviews! And also thanks everyone for the help, in particular @Mikolaj :)

Mikolaj pushed a commit that referenced this pull request Jan 25, 2024
* Ignore invalid Unicode in pkg-config descriptions

Previously, if any of the pkg-config packages on the system had invalid
Unicode in their description fields (like the Intel vpl package has at
the time of writing, 2024-01-11, see #9608), cabal would crash because
it tried to interpret the entire `pkg-config --list-all` output as
Unicode.

This change, as suggested by gbaz in
  #9608 (comment)
switches to using a lazy ByteString for reading in the output, splitting
on the first space in byte land, and then parsing only the package
_name_ to a String.

For further future-proofing, package names that don't parse as valid
Unicode don't crash Cabal, but are instead ignored.

* Add changelog entry

* cabal-install-solver: Add bounds on 'text'

* No literal ASCII values, use 'ord'

* Address review comments re invalid unicode from pkg-config

* Add test for invalid unicode from pkg-config

* Compatibility with text-1.2.5.0

* Align imports

* Handle different exception type

* Use only POSIX shell syntax

* Add invalid-input handler in pkg-config shim

This is to appease shellcheck

* Actually implement all required stuff in the pkg-config shim

* Less exception dance

* Fix shebang lines

MacOS doesn't have /usr/bin/sh, and /bin/sh is the standard (for a POSIX
shell) anyway

* Don't expect a particular representation of invalid characters

---------

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
(cherry picked from commit 0b34b4e)
mergify bot added a commit that referenced this pull request Jan 25, 2024
Ignore invalid Unicode in pkg-config descriptions (backport #9609)
erikd pushed a commit to erikd/cabal that referenced this pull request Apr 22, 2024
* Ignore invalid Unicode in pkg-config descriptions

Previously, if any of the pkg-config packages on the system had invalid
Unicode in their description fields (like the Intel vpl package has at
the time of writing, 2024-01-11, see haskell#9608), cabal would crash because
it tried to interpret the entire `pkg-config --list-all` output as
Unicode.

This change, as suggested by gbaz in
  haskell#9608 (comment)
switches to using a lazy ByteString for reading in the output, splitting
on the first space in byte land, and then parsing only the package
_name_ to a String.

For further future-proofing, package names that don't parse as valid
Unicode don't crash Cabal, but are instead ignored.

* Add changelog entry

* cabal-install-solver: Add bounds on 'text'

* No literal ASCII values, use 'ord'

* Address review comments re invalid unicode from pkg-config

* Add test for invalid unicode from pkg-config

* Compatibility with text-1.2.5.0

* Align imports

* Handle different exception type

* Use only POSIX shell syntax

* Add invalid-input handler in pkg-config shim

This is to appease shellcheck

* Actually implement all required stuff in the pkg-config shim

* Less exception dance

* Fix shebang lines

MacOS doesn't have /usr/bin/sh, and /bin/sh is the standard (for a POSIX
shell) anyway

* Don't expect a particular representation of invalid characters

---------

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
attention: needs-review merge delay passed Applied (usually by Mergify) when PR approved and received no updates for 2 days squash+merge me Tell Mergify Bot to squash-merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cabal 3.10.2.1 cannot parse pkg-config: hGetContents: invalid argument (invalid byte sequence)
7 participants