
Auto-detect archive format fails if the file extension does not match the actual file format #236

dhananjay-gune opened this issue Aug 20, 2024 · 17 comments

@dhananjay-gune

I have a real use case where my legacy application produces an archive with 7z format but gives it a .zip extension!
Sorry, but this is beyond my control :-(

Observations:

  1. If I set the archive format as bit7z::BitFormat::Auto, then the BitArchiveReader is not able to read items from this file.
    auto entries = reader.items(); fails.
  2. If the file's extension is changed to something unknown e.g. .xyz, then it works!

It appears that the Reader tries to find the format from the file extension (.zip) and then gets confused along the way!
❓ Would it be possible for the Reader to solely use the actual signature of the file format instead of the extension?

Here is a sample archive file with actual format of 7z, but extension .zip:
TestMismatchOfFileExtensionAndActualFormat.7z.zip

In some cases, such files (actual format = 7z, extension = .zip) do get opened, but the wrong entries are returned. This could be related to issue #235.

@rikyoz rikyoz self-assigned this Aug 20, 2024
@rikyoz
Owner

rikyoz commented Aug 20, 2024

Hi!

This is strange, and I cannot seem to replicate this issue 🤔.

I'm using the same archive you attached, with the following simple code:

try {
    Bit7zLibrary lib{R"(C:\Program Files\7-Zip\7z.dll)"};
    BitArchiveReader reader{lib, R"(E:\Downloads\TestMismatchOfFileExtensionAndActualFormat.7z.zip)"};
    std::println("Detected format: {}", reader.detectedFormat().value());
    for (const auto& entry : reader) {
        std::println(" - {}", entry.path());
    }
} catch (const BitException& ex) {
    std::println("\n{}", ex.what());
}

Output:

Detected format: 7 <- BitFormat::SevenZip, the actual format of the archive.
 - Sub Folder 2
 - SubFolder1
 - Root File 2.txt
 - Root File 3.7z
 - RootFile1.txt
 - Sub Folder 2\Sub File 2.2.txt
 - Sub Folder 2\Sub File 2.3.tar
 - Sub Folder 2\SubFile2.1.txt
 - SubFolder1\Sub File 1.2.txt
 - SubFolder1\SubFile1.1.txt
 - SubFolder1\SubFile1.3.zip

If I set the archive format as bit7z::BitFormat::Auto, then the BitArchiveReader is not able to read items from this file.
auto entries = reader.items(); fails.

What is the value of reader.detectedFormat().value() just after the constructor?

It appears that the Reader tries to find the format from the file extension (.zip) and then gets confused along the way!
❓ Would it be possible for the Reader to solely use the actual signature of the file format instead of the extension?

This is currently not possible, and it should not be needed.
If the format detected from the extension differs from the actual format, the subsequent attempt to open the archive with the wrong format should fail.
In that case, bit7z knows that it detected the format only from the extension (which is error-prone), so it tries to detect the format again, this time from the file signature (which is more reliable).
If a format is detected, bit7z tries to open the archive again using the newly found format, and this second attempt is usually successful.
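As an illustration only (a hypothetical sketch, not bit7z's actual code), signature-based detection boils down to comparing the file's leading bytes against known magic numbers; the 7z and zip signatures below are the well-known ones:

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical sketch of signature-based format detection (not bit7z's code).
// Compares the file's first bytes against well-known archive magic numbers.
std::string detect_format_by_signature(const unsigned char* data, std::size_t size) {
    static const unsigned char k7zMagic[]  = {0x37, 0x7A, 0xBC, 0xAF, 0x27, 0x1C}; // "7z" signature
    static const unsigned char kZipMagic[] = {0x50, 0x4B, 0x03, 0x04};             // "PK\3\4"
    if (size >= sizeof(k7zMagic) && std::memcmp(data, k7zMagic, sizeof(k7zMagic)) == 0) {
        return "7z";
    }
    if (size >= sizeof(kZipMagic) && std::memcmp(data, kZipMagic, sizeof(kZipMagic)) == 0) {
        return "zip";
    }
    return "unknown";
}
```

A detector like this only looks at the content, never at the file name, which is why it serves as the fallback when the extension-based guess turns out to be wrong.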

I suppose I could add a new build option that disables detection by extension, but I'm a bit hesitant because this library already has many build options, and testing them all is becoming increasingly complex.

I'll continue to investigate the issue, but without being able to replicate it, it is a bit difficult.

@dhananjay-gune
Author

I am using bit7z v4.0.0 and 7z.dll v18.05
I am welded to this version of 7z.dll due to legacy reasons - impossible to change.

Here is my code:

try
{
    const BitInFormat& archiveFormat = GetInputArchiveFormat(); // this returns 👉  bit7z::BitFormat::Auto
    Bit7zLibrary lib(this->m_7ZipDllPath);
    BitArchiveReader reader(lib, archivePath, archiveFormat); // throws an exception for a .zip extn. 👉 Failed to detect the format of the file: No known signature found.   error_code = 15;
    auto detectedFormat = reader.detectedFormat().value(); // After changing the extension to .xyz, the value is 👉 7!
    entries = reader.items();
}
catch (const std::exception& ex)
{
    return OpResult(ex); // failure
}
return OpResult(); // success

Sorry, I mistakenly said that reader.items() fails; it is actually the constructor that throws the exception.

When I change the file extension from .zip to .xyz then it works!

Thanks for your prompt replies 🙏

@rikyoz
Owner

rikyoz commented Aug 21, 2024

Thanks for the further details!
I've just tested the same program as before, again using bit7z v4.0.8, but this time with 7z.dll v18.05.
It works fine as before, same output, so I can't reproduce the issue 🤔.

Failed to detect the format of the file: No known signature found. error_code = 15;

This is really strange. If the file is indeed a 7z archive, bit7z should find the signature. The code for detecting the archive format is quite old now and has been tested many times, not just by me.
I wouldn't rule out the possibility of a weird bug, of course, but not being able to reproduce the problem makes it hard to find it.

Which compiler and architecture are you using?

When I change the file extension from .zip to .xyz then it works!

This is even stranger, because what does not seem to work is the signature format detection 🤔.

@dhananjay-gune
Author

dhananjay-gune commented Aug 22, 2024

My Env:

  • Win10 64bit

  • Visual Studio 2022 64bit v17.11.1

    • Platform Toolset = Visual Studio 2022 (v143) (latest)
    • C++ Language standard = ISO C++20 Standard (/std:c++20)
    • C Language standard = Default (Legacy MSVC)
  • bit7z

    • Platform = 64bit (Linker = MachineX64 (/MACHINE:X64))
    • Pre-Processor Macros:

      WIN32
      _WINDOWS
      UNICODE
      _UNICODE
      WIN64
      CMAKE_INTDIR="Debug"
      BIT7Z_USE_NATIVE_STRING
      BIT7Z_AUTO_FORMAT
      BIT7Z_AUTO_PREFIX_LONG_PATHS="ON"

See if this gives you any clues.

I have also attached a test app / project that will help in reproducing the problem Bit7Issue_FileExtensionAndFormatMismatch.zip

  • I have included the version of bit7z that I am using.
  • It also has the test data for failure and success (same file with .zip and .xyz extension)
    • Have a look at the folder Bit7Issue_FileExtensionAndFormatMismatch\Bit7zTestApp\bin\x64\Debug

Thanks again for your support! 🙏

@rikyoz
Owner

rikyoz commented Aug 22, 2024

Thanks again for your support! 🙏

No problem at all! Thanks for all the details and test project you provided, it really helped me troubleshoot! 🙏

I have also attached a test app / project that will help in reproducing the problem

I have good news: thanks to your test project, I was finally able to reproduce the problem.

And I also found the cause: you are using an old version of bit7z, specifically v4.0.0-rc.
This version had a bug where detection from the signature didn't work in some cases (#134), and yours is one of them.
This bug was due to the fact that the first time bit7z tries to open the archive, 7-Zip might seek the file stream past the beginning of the archive file, making the subsequent file signature check done by bit7z invalid (as it would not be reading the file signature, but other bytes of the file).
This bug has been fixed in the stable release v4.0.0.
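The class of bug described here can be illustrated with a minimal sketch (hypothetical code, not bit7z's actual implementation): if the stream position is not reset before reading the signature, the check reads arbitrary bytes instead of the file's start.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Sketch of the fix: always rewind to the file start before reading the
// signature, so that a previous consumer (e.g., 7-Zip seeking through the
// stream) cannot invalidate the check. Hypothetical code, not bit7z's.
std::string read_signature(std::istream& stream, std::size_t length) {
    stream.clear();                 // clear any EOF/fail flags from earlier reads
    stream.seekg(0, std::ios::beg); // rewind: without this, we'd read wrong bytes
    std::string signature(length, '\0');
    stream.read(&signature[0], static_cast<std::streamsize>(length));
    return signature;
}
```

Without the seekg() call, a caller that had already consumed part of the stream would make the signature check read from the middle of the file, which is exactly the failure mode described above.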

This also explains why I could not replicate the problem: I assumed you were using v4.0.0 or later, and in my tests I was using the latest v4.0.8.

If you replace your version of bit7z with the latest v4.0.8, the test program works fine:

[screenshot]

I strongly recommend using the latest stable version v4.0.8, as it contains the mentioned bugfix and many other improvements.

@dhananjay-gune
Author

Thanks a ton... 🤗 I will try with the latest version v4.0.8 and give you feedback.

Quick question: How to identify from the source files, which exact version one is using?
I checked CMakeLists.txt which shows v4.0.0 and not v4.0.0-rc

@rikyoz
Owner

rikyoz commented Aug 22, 2024

Quick question: How to identify from the source files, which exact version one is using?

I checked CMakeLists.txt which shows v4.0.0 and not v4.0.0-rc

You can usually check it in the CMakeLists.txt. The v4.0.0-rc was actually an exception, as the -rc suffix caused some problems in CMake, at least at the time, so I just used 4.0.0. In this case, I simply recognized that the code was using an old code style that more recent versions are not using anymore.

As a side note, it has happened in the past that I have simply forgotten to update the version in the CMakeLists.txt on some patch releases. This shouldn't happen anymore, though, as I'm trying to be consistent with updating it on every release (and I'm also looking into implementing some automatic checks to avoid such missed updates).

@dhananjay-gune
Author

OK, I verified with v4.0.8 that my test program works! :-)

Then I wanted to use it in my application for which I am trying to (re)build bit7z with <LanguageStandard>stdcpp20</LanguageStandard> but it gives a compilation error!
If I use stdcpp17 it compiles! How come? The previous version v4.0.0-rc that I am using is compiling successfully with stdcpp20.

[screenshot]

[screenshot]

PS: I have manually modified the bit7z.vcxproj to suit my project - attached herewith.
Notably, I have removed the <CustomBuild> tag that was using CMake.
Modified_bit7z_vcxproj.zip

I might have missed or messed something.

Is it ok to use stdcpp17 for bit7z and stdcpp20 for my DLL? i.e. will this version difference create any problem at runtime?

Thanks for your support!

@rikyoz
Owner

rikyoz commented Aug 26, 2024

Sorry for the late reply!

OK - I verified with v4.0.8 that my test program works ! :-)

Perfect!

Then I wanted to use it in my application for which I am trying to (re)build bit7z with stdcpp20 but it gives a compilation error!
If I use stdcpp17 it compiles! How come? The previous version v4.0.0-rc that I am using is compiling successfully with stdcpp20.

Yeah, v4.0.0-rc had some problems handling internal conversions between std::filesystem::path and std::string, at least on MSVC. Basically, bit7z uses UTF-8 encoded strings, while MSVC uses the system codepage encoding for these conversions.
Later versions of the library fix this by using the path::u8string() method wherever there is a conversion to std::string.
But there's a catch: that method returns a std::string only in C++17; in C++20 the same method returns a std::u8string object. Basically, the C++20 standard broke compatibility with the old code, as std::u8string cannot be easily converted to std::string: they're two different and incompatible string types.
This change makes it impossible to build bit7z with C++20, at least for now.
I worked on a patch some time ago, but haven't had time to test it yet, so for now the library can only be built with C++17.
I'll definitely fix this, possibly in v4.0.9.
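The incompatibility can be sketched as follows; path_to_utf8 is a hypothetical helper written for this example, not a bit7z function:

```cpp
#include <filesystem>
#include <string>

// Hypothetical helper illustrating the C++17/C++20 difference described above.
// In C++17, path::u8string() returns std::string; in C++20 it returns
// std::u8string, whose char8_t elements must be converted back to char.
std::string path_to_utf8(const std::filesystem::path& p) {
#if defined(__cpp_lib_char8_t)
    const std::u8string u8 = p.u8string();    // C++20: std::u8string
    return std::string(u8.begin(), u8.end()); // char8_t -> char, element-wise
#else
    return p.u8string();                      // C++17: already a std::string
#endif
}
```

Code written against the C++17 signature (assigning the result of u8string() directly to a std::string) stops compiling under C++20, which is exactly the breakage described above.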

Is it ok to use stdcpp17 for bit7z and stdcpp20 for my DLL? i.e. will this version difference create any problem at runtime?

That is a good question. I have been looking for an answer for a long time.
In general, the answer depends on the compiler and standard library you are using.
For example, according to this Stack Overflow answer, there shouldn't be a problem using GCC.
For MSVC I couldn't find a definitive answer, but from my tests there is no problem at all.
I even added a unit test application that checks for exactly this scenario (on the develop branch). Basically, the bit7z library is built in C++17 and linked to the unit test application, which is built in C++20.
It always works fine:
[screenshot]

@dhananjay-gune
Author

dhananjay-gune commented Aug 27, 2024

Thanks for your reply and detailed explanation - it helps.
To be on the safe side, I will build my DLL in C++17. Once bit7z becomes compatible with C++20, I will migrate :-)

@dhananjay-gune
Author

Today, one of our testers reported that if the archive contains a file name with special characters e.g. CO₂.txt (subscript 2) then my version i.e. v4.0.0-rc throws an exception:

"Failed to get the stream: Operation aborted" ErrorCode = -2147467260

The callstack shows that the error is thrown by should_format_long_path():

[screenshot]

I tested with v4.0.8 and it seems to work fine!
I observed that the should_format_long_path() implementation in 4.0.0-rc calls auto path_str = path.string(); whereas 4.0.8 calls const auto& pathStr = path.native(); and fortunately native() returns a wstring instead of a string on MSVC (I don't know if C++17 and C++20 have different implementations of native()).

So it appears that I will have to migrate to this version! But I have some questions / doubts:

  • In v4.0.8 could there be any other issues related to special-chars-in-file-name?
  • Is it supporting the full set of Unicode chars?
  • Are there any languages or characters that are NOT supported? or could have troubles in conversion e.g. from wstring to UTF-8 and vice versa?
  • When bit7z is compiled with the UNICODE and _UNICODE flags, would it be possible for all strings, especially file names, to be handled solely in Unicode (wstring), without any conversion to UTF-8 or another format that could lead to a potential loss of information or failures in processing? Just as an example, the screenshot in the prior post shows that read_sym_link_as_string() returns a u8string or string: could this potentially result in a loss of information? (I don't know in what situation this function is called; sorry if I totally misunderstood it.) Or, say there is an exception such as Access Denied: std::exception::what() returns a char* instead of a wchar_t*, and I don't know what it will return if the message text from the OS is non-Latin. I know this is coming from the STL and not from bit7z; sorry again if I misunderstood due to my lack of knowledge.

Sorry for these questions, but I am really worried now, as our application needs to process files from the users all over the world in any language! And any failure in zipping/unzipping becomes a blocker!

Thanks again for your continuing support! 🙏 🙏

@rikyoz
Owner

rikyoz commented Aug 28, 2024

I tested with v4.0.8 and it seems to work fine!
I observed that should_format_long_path() impl in 4.0.0-rc calls auto path_str = path.string(); whereas 4.0.8 calls const auto& pathStr = path.native(); and fortunately the native() returns a wstring instead of string on MSVC (I don't know if C++17 and C++20 have diff impl of native()).

This is both a fix and an improvement I made some time ago.

First, it fixes problems like the one your tester found in the old v4.0.0-rc: MSVC's implementation of path::string() uses the system codepage encoding for the converted string, instead of UTF-8 (which is the default encoding used by bit7z). System codepage encodings, unlike UTF-8, may not support all Unicode codepoints, hence the conversion error.

Also, this improves the performance of the code: the string() function always performs a string conversion (on Windows), while native() simply returns a reference to the string stored in the path object; in the specific case of should_format_long_path(), this conversion wasn't really needed, and using native() made it faster (as it didn't allocate a new string anymore).

and fortunately the native() returns a wstring instead of string on MSVC (I don't know if C++17 and C++20 have diff impl of native()).

On Windows, the native char type is wchar_t, so the native string type is always a std::wstring. This doesn't change between different C++ standards.

So it appears that I will have to migrate to this version! But I have some questions / doubts:

I think I need to clarify some details: since you are using bit7z with the BIT7Z_USE_NATIVE_STRING build option, the library does not perform any string encoding conversion in any part of the code, except in read_sym_link_as_string().

This is because bit7z, 7-Zip, and MSVC's std::filesystem::path all use wide strings in this case. Since they all use the same string type, no conversion is needed.
As a side note, wide strings are UTF-16 encoded on Windows, so they support the full set of Unicode codepoints.

The call to u8string() in read_sym_link_as_string() is the only exception: this function is used to read the path pointed to by a symbolic link on the file system so that it can be stored in an archive; symbolic links are basically stored as text files containing the path to the target file of the link.

On Windows, symbolic links are not so common, so I'm not sure if this applies to your use cases. In any case, u8string() will do a conversion from UTF-16 to UTF-8, and as far as I know such a conversion should not fail.

I think I've answered most of your questions already, but I'll address each of them anyway.

In v4.0.8 could there be any other issues related to special-chars-in-file-name?

As I said, when using BIT7Z_USE_NATIVE_STRING no string conversion is performed, so there should not be any such issues.

Is it supporting the full set of Unicode chars?

Are there any languages or characters that are NOT supported? or could have troubles in conversion e.g. from wstring to UTF-8 and vice versa?

Wide strings are UTF-16 encoded on Windows, so they support the full set of Unicode codepoints.

Note, however, that Windows limits the set of characters that can be used for filenames and paths.

If your program only needs to handle archives created on Windows, this is not a problem.
On the other hand, archives created on other operating systems may contain items with unsupported characters. If this is the case, you may want to consider building bit7z with the BIT7Z_PATH_SANITIZATION option enabled: it makes bit7z replace all unsupported characters and sequences with a '_' character.

Without the BIT7Z_PATH_SANITIZATION option enabled, bit7z will fail to create files whose names contain unsupported characters; there is not much bit7z can do in this case, as it is a limitation of the operating system itself.

Other than that, there are no other restrictions.
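The idea behind that option can be sketched like this (a hypothetical illustration; sanitize_filename is not bit7z's actual function, and the real option may handle additional cases, such as reserved device names):

```cpp
#include <string>

// Hypothetical sketch of path sanitization: replace characters that Windows
// forbids in file names (plus control characters) with '_'.
std::wstring sanitize_filename(std::wstring name) {
    const std::wstring forbidden = L"<>:\"/\\|?*";
    for (wchar_t& ch : name) {
        if (forbidden.find(ch) != std::wstring::npos || ch < 0x20) {
            ch = L'_';
        }
    }
    return name;
}
```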

When bit7z is compiled using UNICODE, _UNICODE flags, would it be possible that all the strings, especially file names, are handled solely in Unicode (wstring) without any conversion to UTF-8 or whatever format that could lead to potential loss of information or failures in processing?

just for the sake of example, in the #236 (comment), it is seen that read_sym_link_as_string() returns a u8string or string - could it potentially result in loss of information?

The UNICODE, _UNICODE flags are needed for bit7z to work properly with 7-Zip, so I would not remove them. As for converting to UTF-8, I already replied above, but there should be no loss of information as far as I know. MSVC's path::u8string() throws an exception if it can't convert the string to UTF-8.

or say if there is an exception e.g. Access Denied - std::exception::what() returns a char * instead of wchar_t * - I don't know what it will return if the message text from OS is non-Latin. I know this is coming from stl and not from bit7z;

std::exception::what() always returns a const char* string.
This string is likely encoded using the system codepage encoding, at least when using MSVC.

Sorry for these questions, but I am really worried now, as our application needs to process files from the users all over the world in any language! And any failure in zipping/unzipping becomes a blocker!

No problem! I hope I have clarified the issue and reassured you on this matter.

And just to be clear, even without the BIT7Z_USE_NATIVE_STRING option enabled, there are no known problems with string encoding handling.
The only difference in this case is that bit7z will only use UTF-8 encoded std::strings in its API instead of std::wstrings, and it will have to internally convert to the latter when communicating with 7-Zip and the Windows APIs; such conversions are kept to the bare minimum required.
But this is not your case anyway.

Thanks again for your continuing support! 🙏 🙏

You're welcome!

@dhananjay-gune
Author

Thank you very much for these clarifications and reassurance! 🙏 🙏
I will use BIT7Z_PATH_SANITIZATION to be on the safer side!

@rikyoz
Owner

rikyoz commented Aug 28, 2024

You're welcome! 😄
Thank you for using bit7z! 🙏

@rikyoz rikyoz closed this as completed Aug 28, 2024
@dhananjay-gune
Author

With v4.0.8, although the auto-detect-format seemed to work on the simple test data that I manually prepared, our tester reported that it still fails on some real data.

I have attached a sample of such data here that contains the same file with different extensions.
TestData.zip

My test app reports the following:

{
EXTRACTING    D:\Bit7Issue_FileExtensionAndFormatMismatch\Bit7zTestApp\bin\x64\Debug\TestData\bit7z_v4.0.8_FailsToExtractFileWithZipExtensionBut7zFormat.zip
destDir = "D:\Bit7Issue_FileExtensionAndFormatMismatch\Bit7zTestApp\bin\x64\Debug\Output\bit7z_v4.0.8_FailsToExtractFileWithZipExtensionBut7zFormat.zip.extracted"
detected format = 1  ❌👈 wrong format
num entries = 4 reader.items()  SUCCESS :-)    ❌👈  wrong number of entries
ERROR :-(  BitException:  Failed to extract the archive: CRC failed  errorCode = 3    ❌👈

⚠️ please see my comments below
}

{
EXTRACTING    D:\Bit7Issue_FileExtensionAndFormatMismatch\Bit7zTestApp\bin\x64\Debug\TestData\bit7z_v4.0.8_FailsToExtractFileWithZipExtensionBut7zFormat_CorrectExtension.zip.7z
destDir = "D:\Bit7Issue_FileExtensionAndFormatMismatch\Bit7zTestApp\bin\x64\Debug\Output\bit7z_v4.0.8_FailsToExtractFileWithZipExtensionBut7zFormat_CorrectExtension.zip.7z.extracted"
detected format = 7
num entries = 2 reader.items()  SUCCESS :-) 
extractTo()  SUCCESS :-) ✔️ 
}

{
EXTRACTING    D:\Bit7Issue_FileExtensionAndFormatMismatch\Bit7zTestApp\bin\x64\Debug\TestData\bit7z_v4.0.8_FailsToExtractFileWithZipExtensionBut7zFormat_UnknownExtension.xyz
destDir = "D:\Bit7Issue_FileExtensionAndFormatMismatch\Bit7zTestApp\bin\x64\Debug\Output\bit7z_v4.0.8_FailsToExtractFileWithZipExtensionBut7zFormat_UnknownExtension.xyz.extracted"
detected format = 7
num entries = 2 reader.items()  SUCCESS :-)
extractTo()  SUCCESS :-) ✔️ 
}

⚠️ Despite the error, it manages to extract some items from the archive entry finMiscComponents.pod_raw, which is a zip file.
However, these items become corrupt - see the screenshot below.
[screenshot]

I request you to please reopen this issue.

I sincerely request you to please provide a way of always using the file's signature to determine the archive format, instead of the file's extension; otherwise it produces disastrous results.

Thank you 🙏

@rikyoz rikyoz reopened this Aug 29, 2024
@rikyoz
Owner

rikyoz commented Aug 29, 2024

Hi!
The problem is always the same as in #235: 7-Zip tries to find the beginning of a zip archive inside the 7z file, and it manages to find it, which is why it doesn't throw immediately.

[screenshot]

In the picture, I have highlighted the signature of the 7z archive in red, and that of the zip archive in blue. Since bit7z recognised the zip format from the file extension, it tries to open it using such format. Then, 7-Zip searches for the archive start and detects the signature highlighted in blue, confirming that it has found a zip archive, and opening it successfully (for some reason).

As for why the extraction fails and some of the extracted files are corrupted, my hunch is that it fails because the parent 7z file is compressed (unlike that other tar archive): basically, the zip file is compressed twice (once zip and then 7z), but 7-Zip doesn't know this and just tries to extract it using the zip format.

As I said in the other issue, this is not a problem with format detection by extension per se, but rather the default behaviour of 7-Zip. This can be turned off, and I'm currently working on an API that will do just that.
The API will probably look something like this:

enum struct ArchiveStart : std::uint8_t {
    ScanFile, ///< Search the whole input file for the archive's start (if the file format supports it).
    FileStart ///< Check only the file start for the archive's start.
};
//...
// Throwing an error if the file is not a zip archive from the start.
BitArchiveReader( lib, "path-to-archive.zip", ArchiveStart::FileStart, BitFormat::Zip );

To keep backward compatibility with older versions of bit7z, the default behavior will still be ArchiveStart::ScanFile:

// Searching for the archive's start by seeking through the input file.
BitArchiveReader( lib, "path-to-archive.zip", BitFormat::Zip );

I'll keep this issue open, and close it once this feature ships.
After testing it, I plan to release it in the next patch version, v4.0.9.

@dhananjay-gune
Author

Thanks for your detailed explanation.
As long as bit7z and 7-Zip together are somehow able to detect the actual archive format irrespective of the extension, it is ok 😊

Please also test using the file I gave you 😊

Thanks!

@rikyoz rikyoz added this to the v4.0.9 milestone Nov 2, 2024