Paths with unicode characters in them? #6

Closed
edsko opened this issue Nov 19, 2015 · 25 comments

@edsko
Contributor

edsko commented Nov 19, 2015

Happened to see this, don't know if it actually matters or not. But StringTable.construct calls ByteString.Char8.pack, which throws away a lot of information. Paths with unicode characters will probably break?

@mrkkrp

mrkkrp commented Dec 24, 2015

Yes, they do break! When the package is created, entries (or headers, I don't know the proper terminology) look like gibberish. In fact, tar doesn't think it's a proper tar archive at all. I spent about an hour looking for where my app was doing something wrong, but it turns out that the library has bugs.

Here is an example:

~/Downloads $ tar -xvf foo.tar
/usr/bin/tar: This does not look like a tar archive
/usr/bin/tar: Skipping to next header
/usr/bin/tar: Exiting with failure status due to previous errors

And with the help of Emacs I can see:

-rw-r--r--       0/0       82492644 01 �>65 E@0=8 :>@>;O!.flac

Here � is an unprintable character. This must be fixed ASAP.
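
The garbled listing is consistent with every character being cut down to its lowest 8 bits, which is what Data.ByteString.Char8.pack does. A minimal sketch of that effect follows. The Cyrillic file name in it is only a guess that happens to reproduce the bytes shown above (it is not stated anywhere in this thread), and packedBytes, guessedName and truncatedView are illustrative names.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import Data.Word (Word8)

-- Bytes that Char8.pack stores for a String: every Char is truncated
-- to its low 8 bits, so code points above U+00FF lose information.
packedBytes :: String -> [Word8]
packedBytes = BS.unpack . BC.pack

-- Hypothetical original name (an assumption), written with numeric
-- escapes so the example does not depend on the source file encoding:
-- "01 ", then Cyrillic text, then "!.flac".
guessedName :: String
guessedName =
  "01 \1041\1086\1078\1077 \1093\1088\1072\1085\1080 \1082\1086\1088\1086\1083\1103!.flac"

-- The truncated bytes read back one byte per character (Latin-1 view),
-- roughly what a listing of the broken archive displays.
truncatedView :: String
truncatedView = BC.unpack (BC.pack guessedName)
```

With this guess, U+0411 becomes the control byte 0x11 and U+043E becomes 0x3E, which is the ASCII character '>'.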

@mrkkrp

mrkkrp commented Dec 24, 2015

@dcoutts, is a PR desirable, or can you fix it yourself?

@dcoutts
Contributor

dcoutts commented Jan 4, 2016

@mrkkrp this isn't a new problem right? It's never done unicode.

Yes, a comprehensive fix would be welcome, but this isn't easy. It still has to work with arbitrary unix files which are not necessarily unicode.

@mrkkrp

mrkkrp commented Jan 4, 2016

@dcoutts, I didn't know it's not supposed to work with Unicode. But well, it's 2016, Unicode is everywhere. And there are a lot of countries that use non-Latin scripts, so once you choose to work with tar archives in Haskell and have to deal with a non-Latin script, you have this problem.

Oh, OK. Can you describe why exactly Unicode is so hard? All the tools for ByteString encoding/decoding are available, and UTF-8 is the same as ASCII if the text doesn't contain non-ASCII characters.

Also, where should I look if I want to fix this properly? (I now either need to fix it or call an external tar application instead, which is not very pretty.)

@mrkkrp

mrkkrp commented Jan 4, 2016

Can't we just use utf8-string for example and replace some calls to pack/unpack from Data.ByteString.Char8 with calls to fromString / toString from Data.ByteString.UTF8? That should work for file paths without Unicode characters as well as for those with Unicode characters in them. Am I missing something important here?

@edsko
Contributor Author

edsko commented Jan 4, 2016

How do you know the paths are UTF8 encoded, and not something else?

@mrkkrp

mrkkrp commented Jan 4, 2016

I don't see any problems here. We're talking about FilePath, which is a synonym for String, a list of Chars. A Char is not a byte, but a value that can already represent any Unicode code point.

Now if we take just UTF-8, it's designed to be backward compatible with ASCII. This means that a ByteString representing a UTF-8-encoded string is the same as a ByteString representing an ASCII string (one byte per character, which is how it currently works, as I understand it), provided the string contains only ASCII characters. So no regression will happen if we switch: for this limited set of characters, everything stays the same.

As Wikipedia puts it:

  • Backward compatibility: One-byte codes are used only for the ASCII values 0 through 127. In this case the UTF-8 code has the same value as the ASCII code. The high-order bit of these codes is always 0. This means that ASCII text is valid UTF-8, and UTF-8 can be used for parsers expecting 8-bit extended ASCII even if they are not designed for UTF-8.

Non-ASCII characters, however, cannot be represented using only one byte per character, so there will be a difference: Unicode paths will be represented by longer ByteStrings. I don't see any problem here either; just put that sequence of bytes into the string table and decode it as a UTF-8 string when extracting.
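
The claim above can be checked with a short sketch using the text package. It is only an illustration of the UTF-8 idea, not code from the tar library; encodeUtf8Path, sameAsChar8 and roundTrips are made-up names.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Encode a path as UTF-8 bytes. Note that T.pack replaces lone
-- surrogate code points with U+FFFD before encoding.
encodeUtf8Path :: FilePath -> BS.ByteString
encodeUtf8Path = TE.encodeUtf8 . T.pack

-- True when the UTF-8 bytes equal what Char8.pack produces, which
-- holds for pure ASCII input.
sameAsChar8 :: FilePath -> Bool
sameAsChar8 p = encodeUtf8Path p == BC.pack p

-- True when decoding the UTF-8 bytes gives back the original path.
roundTrips :: FilePath -> Bool
roundTrips p = case TE.decodeUtf8' (encodeUtf8Path p) of
  Left _  -> False
  Right t -> T.unpack t == p
```

For an ASCII path such as "foo/bar.txt" both encodings agree byte for byte, while a single Cyrillic letter takes two bytes in UTF-8 and no longer matches the truncated Char8 form.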

I see the following problems, however:

  • I don't know if there is any standard for which encoding should be used. I mean, is this OS-dependent? Linux uses UTF-8 almost everywhere, but Windows does not. How should a tar application know how to interpret file names? I don't know. I guess going with UTF-8 will be much better than truncated characters anyway.
  • As I understand from a quick reading of the source code, file paths are limited in length. If this limitation comes from the tar format specification, then Unicode paths that fit into a tar archive will be shorter than non-Unicode ones.
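
To make the second point concrete: the ustar header stores the name in a 100-byte field (with a separate 155-byte prefix field), and the limit counts bytes, not characters. The sketch below assumes UTF-8 encoding and only looks at the name field; utf8Length and fitsInNameField are illustrative names, not part of the library.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Size of the ustar "name" header field, in bytes.
ustarNameFieldBytes :: Int
ustarNameFieldBytes = 100

-- Number of bytes a path needs once encoded as UTF-8.
utf8Length :: FilePath -> Int
utf8Length = BS.length . TE.encodeUtf8 . T.pack

-- Whether the encoded path fits in the name field alone
-- (this sketch ignores the separate prefix field).
fitsInNameField :: FilePath -> Bool
fitsInNameField p = utf8Length p <= ustarNameFieldBytes
```

A name of 60 Cyrillic letters needs 120 bytes and does not fit, while 60 ASCII letters need only 60 bytes.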

Anyway, this change is a must, because otherwise use of this library is very limited (as Haskell community's tool to be used as part of applications used mostly by programmers).

Even if you deal with Latin alphabet only, there are various characters that can be in paths, like quotes: “” (note, they are different from "", which cannot be in paths in Windows, although they are in ASCII range, but “” on the other hand can, and they are proper punctuation to be used anyway), there are copyright signs ©, a lot of punctuation that is not in ASCII range.

I can imagine you don't use these things in names of source files, but this doesn't mean other (possibly non-technical) people don't put Unicode in names of files, and they may be direct users of some Haskell program that uses this library.

@mrkkrp

mrkkrp commented Jan 4, 2016

Hmm, TAR officially doesn't support non-ASCII characters. Too bad, but I think I've seen tar archives that contain paths with Unicode in them. Strange; I'll need to read more about the workarounds and how it's generally done.

@mrkkrp

mrkkrp commented Jan 4, 2016

Anyway, since the tar specification specifies the ASCII range explicitly, and UTF-8 and ASCII are the same in that range, I think the idea with UTF-8 should be perfectly OK.

@mrkkrp

mrkkrp commented Jan 4, 2016

I'm waiting for @dcoutts's opinion. Perhaps I should just use a more modern archive format. It's unbelievable that it doesn't support anything but ASCII, what a flaw…


So if it's the specification that's broken, then I suggest we close the issue, because this library implements the specification well. I'll just switch to zip; it will also be more familiar to my non-techy users. Sorry for the prolonged discussion.

@dcoutts
Contributor

dcoutts commented Jan 10, 2016

I'm not opposed to following whatever convention other tar impls use when it comes to unicode. But note that it isn't a trivial matter of sticking in a few to/fromUTF8 calls (remember that not all unix files are unicode but all windows/osx ones are). See for example https://docs.python.org/2/library/tarfile.html#tar-unicode

I think a good time to tackle this problem is when we add pax support (issue #1). The POSIX pax standard explicitly supports file name encodings, and UTF-8 in particular.

@ezyang

ezyang commented Sep 2, 2016

I'm surprised by the discussion here. There is a very simple solution which is unambiguously the right thing to do: use withFilePath from System.Posix.Internals (in base) to encode a FilePath into the OS-specific encoding, and then blast that straight into the tarball. The point is that people expect tar to work like how an invocation of the tar program on the filesystem would work, and the convention is that you just preserve the raw encoding of the data directly.

EDIT: OK, I'll retract this. If you followed my suggestion, then if you used tar on Windows, all of the files would be blasted into the tarball using UTF-16 encoding. Which will totally do the right thing on Windows (Unicode will be supported properly) and also totally miss the point, if you were hoping to pass the tarball on to someone else. Ouch.

@23Skidoo
Member

23Skidoo commented Sep 2, 2016

Can't we make this case an error instead of silently accepting? Current behaviour causes problems for users:

haskell/cabal#3758
commercialhaskell/stack#2557

@ezyang

ezyang commented Sep 2, 2016

I support erroring. The truncation from Char8.pack is basically never right, IMO.
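
A rough sketch of what erroring instead of truncating could look like is given below. It is an illustration only, not the library's code; packAsciiPath is a hypothetical helper name.

```haskell
import qualified Data.ByteString.Char8 as BC
import Data.Char (isAscii)

-- Pack a path only if every character is ASCII; otherwise report the
-- first offending character instead of silently truncating it.
packAsciiPath :: FilePath -> Either String BC.ByteString
packAsciiPath path = case filter (not . isAscii) path of
  []      -> Right (BC.pack path)
  (c : _) -> Left ("non-ASCII character " ++ show c ++ " in path " ++ show path)
```

Callers then have to handle the Left case explicitly, so a path with Unicode characters fails loudly at archive creation time rather than producing a corrupt entry.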

@ezyang

ezyang commented Sep 2, 2016

Also, is there an interface for passing tar direct ByteString encodings of the desired file paths? This would at least let end users make a decision what encoding they want.

@hasufell
Member

hasufell commented Jan 18, 2020

What's the status of this? The current implementation is breaking filenames. All filepaths should be ByteString (aka RawFilePath). This is a low-level library, if someone wants to add a String or Text interface on top, that's fine.

EDIT: afais gnu tar specifies:

The name, linkname, magic, uname, and gname are null-terminated character strings. All other fields are zero-filled octal numbers in ASCII.

But this probably isn't portable for Mac OS and windows...

EDIT2: I think I'll create a tar-bytestring fork that is specifically targeted for POSIX platforms. At least that fixes half of the problem.

EDIT3: https://hackage.haskell.org/package/tar-bytestring

@hasufell
Member

This is what tar-conduit does: https://github.com/snoyberg/tar-conduit/blob/81283887aaa9771c0f2db53cb4e86700da4c2d9e/src/Data/Conduit/Tar/Types.hs#L151

It encodes and decodes as UTF-8. I'd say that's a pretty good bet. For unpacking, we could provide a version that allows setting the encoding... or we could make use of something like https://hackage.haskell.org/package/charsetdetect-ae
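
One possible shape for an unpacking function with a selectable encoding is sketched below. This is an assumption about how such an interface might look, not what tar-conduit or this library does; PathEncoding, decodePathWith and decodePathGuess are hypothetical names.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical set of encodings a caller may choose for entry names.
data PathEncoding = Utf8 | Latin1

-- Decode raw entry-name bytes with the encoding the caller selected.
decodePathWith :: PathEncoding -> BS.ByteString -> Either String FilePath
decodePathWith Latin1 bs = Right (BC.unpack bs) -- one byte per Char, never fails
decodePathWith Utf8 bs = case TE.decodeUtf8' bs of
  Left err -> Left (show err)
  Right t  -> Right (T.unpack t)

-- Try UTF-8 first and fall back to Latin-1 when the bytes are not
-- valid UTF-8 (a crude stand-in for real charset detection).
decodePathGuess :: BS.ByteString -> FilePath
decodePathGuess bs = either (const (BC.unpack bs)) id (decodePathWith Utf8 bs)
```

The fallback never fails, but it can still guess wrong for archives written in other legacy encodings, which is where a detection library would come in.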

@Bodigrim
Contributor

I pushed 423e6af, prohibiting non-ASCII file names. At the very least, we should not silently corrupt Unicode data. A strategic solution would be to migrate to PosixPath and leave encoding questions to clients.

@hasufell
Member

A strategic solution would be to migrate to PosixPath and leave encoding questions to clients.

There are some non-trivial parts there, because although the tar spec demands unix semantics, the library also works on windows (see toTarPath). Since we use the FilePath representation currently, we don't have to convert the filenames between the platforms (just the separators are changed). With OsPath, it seems we would need a way to convert between PosixPath and WindowsPath. So we kinda have to assume utf8 here too at least on windows?

@Bodigrim
Contributor

Yes, I'd assume UTF-8 on Windows.

@mpilgrem

mpilgrem commented Dec 10, 2023

Would it be possible for Codec.Archive.Tar.Entry to export the data constructor of TarPath?

I've written something for Stack that works around fromTarPath using BS.Char8.unpack (Stack needs that to be (T.unpack . T.decodeUtf8Lenient)), but the code needs access to the data constructor.

EDIT: In the interim, I've realised I can convert the FilePath back into a ByteString, and start again:

fromTarPath :: TarPath -> FilePath
fromTarPath = T.unpack . T.decodeUtf8Lenient . BS.Char8.pack . Tar.fromTarPath

@Bodigrim
Contributor

@mpilgrem I recommend against T.unpack . T.decodeUtf8Lenient . BS.Char8.pack . Tar.fromTarPath: if tar ever learns to support Unicode so that Tar.fromTarPath returns a Unicode-enabled String, then BS.Char8.pack can turn a seemingly innocent path without any dots or slashes into something like ../../Windows/System32/Kernel.dll and corrupt your system files.
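
The hazard can be reproduced in a few lines. The sketch assumes, as in the warning above, a future Tar.fromTarPath that already returns fully decoded Unicode; redecode and innocentLooking are illustrative names, and decodeUtf8With lenientDecode is used as the long-standing equivalent of decodeUtf8Lenient.

```haskell
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Encoding.Error as TEE

-- The re-encoding step from the workaround: Char8.pack truncates each
-- Char to 8 bits, then the bytes are decoded leniently as UTF-8.
redecode :: String -> String
redecode = T.unpack . TE.decodeUtf8With TEE.lenientDecode . BC.pack

-- A name containing no '.' or '/' characters: U+012E and U+012F have
-- the low bytes 0x2E ('.') and 0x2F ('/').
innocentLooking :: String
innocentLooking = "\302\302\303etc"
```

Here redecode innocentLooking yields "../etc", so a check for dots and slashes performed on the original String would be bypassed.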

#78 is a way forward.

@mpilgrem

@Bodigrim, thanks for the warning. My second attempt below makes use of isUTF8Encoded from the utf8-string package:

fromTarPath :: TarPath -> FilePath
fromTarPath tp = if isUTF8Encoded rawFilePath
  then
    T.unpack $ T.decodeUtf8Lenient $ BS.Char8.pack rawFilePath
  else
    -- A future version of Tar.fromTarPath may itself assume that 'TarPath' is
    -- UTF8 encoded.
    rawFilePath
 where
  rawFilePath = Tar.fromTarPath tp

@hasufell
Member

PR here: #88

@Bodigrim
Contributor

Unicode filenames should work now, after aa683b0. I switched TarPath to PosixString; since it's not exposed, this is not a breaking change.
