Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zip extraction issues #112

Open
ghost opened this issue Jan 9, 2021 · 33 comments · May be fixed by #232
Open

Zip extraction issues #112

ghost opened this issue Jan 9, 2021 · 33 comments · May be fixed by #232

Comments

@ghost
Copy link

ghost commented Jan 9, 2021

Reported by @tonurics on Arch Linux bugtracker (FS#69253).

To clarify, assume you have zip file called "foo.zip" containing one item: "readme.txt"
Taking the same file above and instead using switch e [to extract to the working directory, i.e 7z e foo.zip], one of two things will happen [depending on the zip file]: sometimes 7z creates a file named "IBM437.so" with the contents of "readme.txt".

Seeing IBM437 makes me think of c104127...

Note: Arch Linux already switched to your p7zip fork and is providing 17.03 in the official repository.

@tonurics
Copy link

tonurics commented Jan 9, 2021

I've added some additional comments to the ticket on the Arch Linux bugtracker, along with a zip file you can use to reproduce the issues.

@tonurics
Copy link

I've updated the Arch Linux bugtracker with test results from releases and builds created from commits.

I skipped c3c8629 as it was marked as failing. Having said that jinfeihan57@c104127 is the first commit where the issue appears; which is also the one @jloqfjgk hinted at above.

@unxed
Copy link
Contributor

unxed commented Jan 10, 2021

Is this issue Arch-specific or it is reproduced on other distros using p7zip built from the same source tree?

PS: Attaching sample file from Arch issue:

I found a small zip file that can be used to reproduce the issue in 17.03-1.
Both the x and e switch options will result in the issues I described above. In the case of the e switch: the "IBM437.so" file will appear, and not the error trying to replace the working directory.
ComicInfo.zip

@unxed
Copy link
Contributor

unxed commented Jan 10, 2021

UPD: Built jinfeihan57/p7zip master on my Mint 20, issue is not reproducing on attached sample. Should I do it on Arch instead? What exact distro version should I use?

@unxed
Copy link
Contributor

unxed commented Jan 10, 2021

I tested the zip file attached above with the three 7z release binaries posted at: https://github.com/jinfeihan57/p7zip/releases

p7zip 17.01 = No Issue
p7zip 17.02 = No Issue
p7zip 17.03 = Broken

Tested that 17.03 binary on Mint 20, works fine on attached sample. Need to know exact Arch distro version to test on it.

@unxed
Copy link
Contributor

unxed commented Jan 10, 2021

Also, Arch Linux is rolling release so there is no version number.

Thanks! Will try to reproduce

@unxed
Copy link
Contributor

unxed commented Jan 10, 2021

If I pass env LANG=C then the bug can be reproduced.

Still not reproducting on my Mint even with this trick. Do you use sample attached above, ComicInfo.zip?

Btw, can anyone please help with QUICK way to install Arch in VirtualBox? Tried zen installer, it completed, but system stays unbootable. Maybe anyone can provide ready to run virtualbox image?

@unxed
Copy link
Contributor

unxed commented Jan 10, 2021

Try disabling UEFI in VirtualBox

Already disabled.

follow the official installation guide

It looks like it should take an hour or more. Can't afford it, alas. Isn't there a faster way?

@unxed
Copy link
Contributor

unxed commented Jan 10, 2021

Will try preinstalled version from here https://www.osboxes.org/arch-linux/

@unxed
Copy link
Contributor

unxed commented Jan 10, 2021

Will try preinstalled version from here https://www.osboxes.org/arch-linux/
Remember to do pacman -Syu so that all packages will be updated to the latest version.

Damn, it refuses to work also

VirtualBox_arch_10_01_2021_16_14_57

@unxed
Copy link
Contributor

unxed commented Jan 10, 2021

Had to do the following:

  1. from https://bbs.archlinux.org/viewtopic.php?pid=1653252#p1653252
    edit /etc/pacman.d/gnupg/gpg.conf and change the keyserver line to:
    keyserver hkp://keyserver.ubuntu.com

  2. from https://wiki.archlinux.org/index.php/Pacman/Package_signing#Cannot_import_keys
    sudo pacman-key --init
    sudo pacman-key --populate
    sudo pacman-key --refresh-keys
    sudo pacman -Sy archlinux-keyring

sudo pacman -Syu

For full upgrade to complete. That Arch is a bit tricky thing :)

@unxed
Copy link
Contributor

unxed commented Jan 10, 2021

Trying to unzip attached sample looks a bit buggy (is askes to overwrite non-existent file), but extraction itself works ok no .so file is created.
VirtualBox_arch_10_01_2021_18_45_20
VirtualBox_arch_10_01_2021_18_48_12

@unxed
Copy link
Contributor

unxed commented Jan 10, 2021

Maybe I'm doing something wrong or need some additional steps to reproduce IBM437.so creation?
Exact image I used: https://nav.dl.sourceforge.net/project/osboxes/v/vb/4-Ar---c-x/2019.05/KDE/201905-kde-64bit.7z

@tonurics
Copy link

tonurics commented Jan 10, 2021

For completeness, here are my localization settings:
/etc/locale.gen
en_US.UTF-8 UTF-8
en_US ISO-8859-1
[If you change this: run locale-gen and restart.]

/etc/locale.conf
LANG=en_US.UTF-8
[If you change this: restart.]

@unxed You appear to only be using the x [eXtract files with full paths] command; which creates superfluous directories, where instead, it should be should be creating files.

The e [Extract files from archive (without using directory names)] command on the other hand does one of two things [depending on the zip archive]. It will either: try to replace the working directory [and fail] or create a file named "IBM437.so" with the contents of one the the file entries in the zip archive.

You can read my ticket on the Arch Linux bugtracker here: https://bugs.archlinux.org/task/69253?project=1&string=p7zip

@tonurics
Copy link

tonurics commented Jan 10, 2021

As for your Arch installation, I'm not sure you have a fully working system. If you continue to run into problems reproducing the issue:
*) Revert the changes you made to: /etc/pacman.d/gnupg/gpg.conf
*) Re-initialize the pacman keyring: pacman-key --init
*) Fetch the archlinux signing keys: pacman-key --populate archlinux
*) Clear out any junk: pacman-key --refresh-keys
*) Force update of repos and packages [note "yy"]: pacman -Syyu

As Arch is a rolling release, if you have an image that hasn't been updated for some time: it can get stuck where new packages can't be updated and the best option is to install a fresh copy [which is generally pretty quick, if you follow along with the Installation guide; you might even have just been able to use the install media as a sort of CLI only live CD]. Having said all that, things for taking time to get Arch running for testing!

@txtsd
Copy link

txtsd commented Jan 10, 2021

I can confirm this is happening on Arch with 17.03

Arch doesn't do anything strange to build it though: make OPTFLAGS="$CPPFLAGS $CFLAGS" 7z 7zr 7za
See the PKGBUILD here: https://github.com/archlinux/svntogit-packages/blob/packages/p7zip/trunk/PKGBUILD

@foutrelis
Copy link

foutrelis commented Jan 10, 2021

At the end of the new code added in c104127, res seems to contain:

ComicInfo.xml(NUL)(NUL)(NUL): gconv_end(NUL)/usr/lib/gconv/IBM437.so

So the IBM437.so filename originates from garbage at the end of the output string. Perhaps s_utf8 can be trimmed to the appropriate length thus avoiding ConvertUTF8ToUnicode carrying over the unwanted characters to the output string?

The following approach correctly extracts ComicInfo.xml but my understanding of the code is very limited so it could be going about it the wrong way. (Also removed the + 1 from the buffer allocation because the *_SetEnd methods already handle null termination from what I gathered.)

diff --git a/CPP/7zip/Archive/Zip/ZipItem.cpp b/CPP/7zip/Archive/Zip/ZipItem.cpp
index 353e895..4d48296 100644
--- a/CPP/7zip/Archive/Zip/ZipItem.cpp
+++ b/CPP/7zip/Archive/Zip/ZipItem.cpp
@@ -423,10 +423,10 @@ void CItem::GetUnicodeString(UString &res, const AString &s, bool isComment, boo
       const char* src = s.Ptr();
       size_t slen = s.Len();
       size_t dlen = slen * 4;
-      const char* dest = s_utf8.GetBuf_SetEnd(dlen + 1); // (source length * 4) + null termination
+      const char* dest = s_utf8.GetBuf_SetEnd(dlen); // (source length * 4)
 
       size_t done = iconv(cd, (char**)&src, &slen, (char**)&dest, &dlen);
-      bzero((size_t*)dest + done, 1);
+      s_utf8.ReleaseBuf_SetEnd(s_utf8.Len() - dlen);
 
       iconv_close(cd);
 

@jinfeihan57
Copy link
Contributor

I don't work on arch linux.
Anybody solve this issue?
OR does arch linux have a compile macro(like APPLE),so we can skipe the code.

@biopsin
Copy link

biopsin commented Jan 15, 2021

For me 17.03 testing outputs
$ 7z x -y ComicInfo.zip => ComicInfo.xml
but
$ 7z e ComicInfo.zip => IBM437.s

$ file ComicInfo.xml
ComicInfo.xml: XML 1.0 document, ASCII text
and
$ file IBM437.s
IBM437.s: XML 1.0 document, ASCII text

@tonurics
Copy link

@biopsin I assume you are seeing still seeing the erroneous folder creation problem with the x [eXtract files with full paths] command, even though you didn't give it a new output path; i.e. 7z x -o/tmp/1 foo.zip. Your use of -y [Assume Yes on all queries] would have 7z silently replace the folders with files.

I would be cautious of using the -y switch with e [Extract files from archive (without using directory names)] command, if a zip archive has multiple file entries: they will all want to overwrite each other into the one "IBM437.so" file.

@biopsin
Copy link

biopsin commented Jan 15, 2021

Yes I see it; extracting with 7z x $1
but not with 7z e $1

@foutrelis
Copy link

OR does arch linux have a compile macro(like APPLE),so we can skipe the code.

I wouldn't consider this to be an Arch Linux issue. More likely this is related to newer libstdc++ or some other core library.

For what it's worth, this is also reproducible on Fedora 32 and Fedora 33 (7z e ComicInfo.zip creates IBM437.s). I would conclude that it happens with GCC 10's libstdc++ but Ubuntu 20.10 appears to be unaffected.

@unxed: Could you test on Fedora 33 (or Fedora 32) which works as a live disc and doesn't require installation/configuration?

@tonurics
Copy link

tonurics commented Jan 15, 2021

I just tested the following distros livecds in VirtualBox [except Arch] using default settings and the compiled p7zip 17.03 release:

Distro Based on Result e Filename glibc Version
Sparky Linux Debian Testing Pass N/A 2.31
Arch Linux Independent Fail IBM437.so 2.32
Clear Linux Independent Fail IBM437.s 2.31
PCLinuxOS Independent Fail IBM437.s 2.31
Solus Independent Fail IBM437.s 2.29

Note: Seeing that Sparky Linux passed [which is Debian Testing], I would assume: that all other Debian distros based would likely pass. So stuck to testing independent upstream source distros with recent releases.

@foutrelis
Copy link

It doesn't really matter why Debian doesn't repro this exact bug. Increasing the length/size of s_utf8 to something like 1024 bytes results in incorrect filename on Debian 10:

root@debian:~# ls
'ComicInfo>'$'\n''PK'$'\001\002''?'   ComicInfo.zip   p7zip

And on second run:

root@debian:~# ./p7zip/bin/7z e ComicInfo.zip
...
Would you like to replace the existing file:
  Path:     ./ComicInfo>
PK?
  Size:     590 bytes (1 KiB)
  Modified: 2001-01-01 09:01:00
with the file from archive:
  Path:     ComicInfo.xml
...

Either a ReleaseBuf_SetEnd call is needed to trim s_utf8 (see https://github.com/jinfeihan57/p7zip/issues/112#issuecomment-757566333) or other code needs to be adjusted to ignore the random data in the string buffer after the first NUL character.

@tonurics
Copy link

tonurics commented Jan 16, 2021

Can this bug be reproduced on musl-based distributions, such as Alpine Linux?

@jloqfjgk [All your comments appear to have been deleted.] For edification: I was actually curious about that myself and did try to test it. But the 17.03 release binary didn't run. And not knowing enough about Alpine/musl: didn't want to try compiling it myself and create any red herrings or false feedback.

@ghost ghost mentioned this issue Jan 16, 2021
@buzztaiki
Copy link

What is the status of this issue?
Do you want me to create a Dockerfile to reproduce this issue?

@spvkgn
Copy link

spvkgn commented May 28, 2021

On Ubuntu 20.04 I have this issue with Zip extraction.

$ wget -q https://github.com/jinfeihan57/p7zip/archive/refs/heads/master.zip
$ ls -la
итого 7352
drwxrwxr-x  2 pavel pavel      60 мая 28 20:15 .
drwxrwxrwt 27 root  root      800 мая 28 20:09 ..
-rw-rw-r--  1 pavel pavel 7528082 мая 28 19:52 master.zip
$ 7z x master.zip 

7-Zip [64] 17.04 : Copyright (c) 1999-2021 Igor Pavlov : 2017-08-28
p7zip Version 17.04 (locale=ru_RU.UTF-8,Utf16=on,HugeFiles=on,64 bits,4 CPUs Intel(R) Xeon(R) CPU           L5420  @ 2.50GHz (1067A),ASM)

Scanning the drive for archives:
1 file, 7528082 bytes (7352 KiB)

Extracting archive: master.zip
--
Path = master.zip
Type = zip
Physical Size = 7528082
Comment = eb1bbb0d0327a103850fec519015986e72a1ebf0

    
Would you like to replace the existing file:
  Path:     ./p7zip-master/.github/workflows/macos-build.yml
  Size:     0 bytes
  Modified: 2021-05-28 19:52:47
with the file from archive:
  Path:     p7zip-master/.github/workflows/macos-build.yml
  Size:     3024 bytes (3 KiB)
  Modified: 2021-05-17 13:21:36
? (Y)es / (N)o / (A)lways / (S)kip all / A(u)to rename all / (Q)uit? q

Archives with Errors: 1



Break signaled
$ 7za x master.zip 

7-Zip (a) [64] 17.04 : Copyright (c) 1999-2021 Igor Pavlov : 2017-08-28
p7zip Version 17.04 (locale=ru_RU.UTF-8,Utf16=on,HugeFiles=on,64 bits,4 CPUs Intel(R) Xeon(R) CPU           L5420  @ 2.50GHz (1067A),ASM)

Scanning the drive for archives:
1 file, 7528082 bytes (7352 KiB)

Extracting archive: master.zip
--
Path = master.zip
Type = zip
Physical Size = 7528082
Comment = eb1bbb0d0327a103850fec519015986e72a1ebf0

    
Would you like to replace the existing file:
  Path:     ./p7zip-master/.github/workflows/linux- build.yml
  Size:     2970 bytes (3 KiB)
  Modified: 2021-05-17 13:21:36
with the file from archive:
  Path:     p7zip-master/.github/workflows/linux- build.yml
  Size:     2970 bytes (3 KiB)
  Modified: 2021-05-17 13:21:36
? (Y)es / (N)o / (A)lways / (S)kip all / A(u)to rename all / (Q)uit? q

Archives with Errors: 1



Break signaled

@spvkgn
Copy link

spvkgn commented May 29, 2021

On Ubuntu 20.04 I have this issue with Zip extraction.

I've tried to revert commit c104127 and that issue has gone.

@peazip
Copy link

peazip commented Oct 2, 2021

Hi, I don't know if it is related with the same issue: I've tested that some ZIP archives are not extracted correctly with p7zip 17.04, but they are correctly extracted using the vanilla p7zip 16.02,

I've tested the issue on Ubuntu 18.04 and 20.04 x86_64.

Some example of the archives showing the issue are:
https://github.com/pkoutoupis/rapiddisk/archive/refs/tags/7.2.1.zip
https://www.tosecdev.org/downloads/category/53-2021-08-08?download=105:tosec-dat-pack-complete-3246-tosec-v2021-08-08

Extraction ends successfully but part of the content is not extracted using p7zip 17.04 - e.g. some files are extracted as empty folders, some files are not extracted at all - and I can't find anything about the error(s) in the log of the operation.

Extracting same archives with the vanilla p7zip 16.02, or with other archive managers, produces the correct number of files and folders as output.

Listing and testing the archives with p7zip 17.04 is successful and no issues (e.g. data errors) are reported for the archives.

@biopsin
Copy link

biopsin commented Oct 2, 2021

@peazip just tested extracting 7.2.1.zip and du -sh rapiddisk-7.2.1/ = yealds 464K with 17.04, but I'm however currently testing with https://github.com/jinfeihan57/p7zip/issues/112#issuecomment-757566333

and the other one 357M - does this sound correct to you?

@peazip
Copy link

peazip commented Oct 2, 2021

@peazip just tested extracting 7.2.1.zip and du -sh rapiddisk-7.2.1/ = yealds 464K with 17.04, but I'm however currently testing with #112 (comment)

and the other one 357M - does this sound correct to you?

Correct, the archives extracted with p7zip 16.02 are 464KB and 357MB respectively, same result extracting them with other archive managers.
But extracting them with p7zip 17.04 (from last published release, https://github.com/jinfeihan57/p7zip/releases/tag/v17.04 ) they are extracted to 412KB and 310MB (also, the number of files and dirs is not what expected).
I'm glad to hear it is fixed in the code and I hope the fix will be available soon in a new release.

@libTorrentUser
Copy link

libTorrentUser commented Apr 15, 2023

foutrelis already found the source of the bug. CItem::GetUnicodeString has an extremely serious bug that makes code that calls it think the length of the string it receives is 4x longer than the string's actual length.

What then happens is that the function that splits file paths gets confused and thinks the file path does not end at the file name. For instance, it thinks that something like

"/d1/d2/file.txt" 

is

"/d1/d2/file.txt(randomGarbageThatKeepsGoingAndGoing)"

Because of that, depending on the position of the moon, if it is raining and if you are wearing matching socks, when that path is split, sometimes the rest of the code will think that "file.txt" is not the last part of the path, but yet another directory. And then it will create a directory named "file.txt" and continue with the file extraction. And it will see that it is supposed to extract a file named "file.txt" inside that last directory it has just created. A directory that happens to be named... "file.txt"

And that is why I and the people posting here in this issue keep seeing 7z saying that there is already a file named "xxx" inside the directory you are extracting the archive and keeps asking if you want to overwrite it or not.

A quick fix is to simply set the length of s_utf8 to the correct value before passing it to ConvertUTF8ToUnicode here is this line

if (ConvertUTF8ToUnicode(s_utf8, res) || ignore_Utf8_Errors)

The code inside that function also seems to be doing something weird with the return of the ìconv call. I don't think the return value of that call does what the code thinks it does. One thing it should be doing instead would be to compare how much further dest moved from the original position and use that to set the actual length of the converted string. That was how I fixed the bug.

This bug is still active in the latest release (v17.05) and, although I didn't check, I wouldn't be surprised if something similar happens in other parts of the code. That being said, it seems the newer commits changed CItem::GetUnicodeString so perhaps the bug might have been fixed.

While we wait, here is a patch that should fix the issue

--- old/CPP/7zip/Archive/Zip/ZipItem.cpp	2023-03-03 12:16:28.000000000 +0000
+++ new/CPP/7zip/Archive/Zip/ZipItem.cpp	2023-04-15 18:03:22.996666664 +0000
@@ -424,13 +424,16 @@
       size_t slen = s.Len();
       size_t dlen = slen * 4;
       const char* dest = s_utf8.GetBuf_SetEnd(dlen + 1); // (source length * 4) + null termination
+      const char* destStart = dest;
 
       size_t done = iconv(cd, (char**)&src, &slen, (char**)&dest, &dlen);
       bzero((size_t*)dest + done, 1);
 
       iconv_close(cd);
 
-      if (ConvertUTF8ToUnicode(s_utf8, res) || ignore_Utf8_Errors)
+      AString s_utf8_correctLength;
+      s_utf8_correctLength.SetFrom(s_utf8, dest - destStart);
+      if (ConvertUTF8ToUnicode(s_utf8_correctLength, res) || ignore_Utf8_Errors)
         return;
     }    
   }

zip-fix.string.length.patch

libTorrentUser pushed a commit to libTorrentUser/build that referenced this issue Apr 22, 2023
latest.sh can now retrieve the latest commit on Master in github
repositories

download.sh correctly deals with .zip files now. Also switch from
p7zip to unzip when dealing with .zip because the current version
of p7zip (17.05) has a very dangerous bug in the zip file handler
p7zip-project/p7zip#112
unxed added a commit to unxed/p7zip that referenced this issue May 22, 2024
@unxed
Copy link
Contributor

unxed commented May 22, 2024

Managed to reproduce bug on my Mint 21.3. Followed this instruction:
#112 (comment)

Bug do not reproduces in far2l terminal, but reproduces in GNOME Terminal.

As for ComicInfo.zip, bug reproduces or not depending on current folder name, and only if binary was built on Alpine Linux (it was built by @spvkgn in alpine:latest container, probably it's the current stable alpine version). For example, it reproduces if archive and archiver are inside ~/Проверка or ~/test test folders.

After applying some fixes to #232 it stopped reproducing for me. Thanks @libTorrentUser !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants