Skip to content

Conversation

@mgautierfr
Copy link
Contributor

@mgautierfr mgautierfr commented Feb 5, 2025

Fix #800

This PR needs testing files from openzim/zim-testing-suite#13
(and a release of zim-testing-suite project to be correctly used here) and so, CI is failing (as expected).
The simpler for locally testing this PR is to copy https://github.com/openzim/zim-testing-suite/tree/no_v0/data/nov0 into you local zim-testing-suite v0.6.0 directory.

An important change is that listing/titleOrdered/v1 is now created all the time, even if there is no front article.

This means that there will be no fallback to "all entries" (random, suggestion without xapian) when creator do not properly set front articles.
This case is really rare as we automatically mark item as front article if mimetype starts with "text/html" when no hint is given.

mgautierfr added a commit to openzim/zim-testing-suite that referenced this pull request Feb 5, 2025
Generated files are using the commands:
```
mkdir data/nov0
scripts/create_test_zimfiles --all src/small_zimfile_data/ data/nov0
zimrecreate data/withns/wikipedia_en_climate_change_mini_2024-06.zim data/nov0/wikipedia_en_climate_change_mini_2024-06.zim
zimrecreate data/withns/wikibooks_be_all_nopic_2017-02.zim data/nov0/wikibooks_be_all_nopic_2017-02.zim
zimsplit data/nov0/wikibooks_be_all_nopic_2017-02.zim --prefix data/nov0/wikibooks_be_all_nopic_2017-02_splitted.zim --size 51200
```

`zimwriterfs`, `zimrecreate`, `zimsplit` are localy build using
- commit b5127fc9 of libzim (commit in PR openzim/libzim#949)
- commit 482a6708 of zimtools (head of PR openzim/zim-tools#452)

Using non released version of libzim was necessary as we need test files
to test the PR and without the PR merged, we cannot release a libzim
generating such kind of files.
@mgautierfr mgautierfr marked this pull request as ready for review February 5, 2025 15:59
src/fileimpl.cpp Outdated
Comment on lines 238 to 239
if (header.getMinorVersion() >= Fileheader::zimMinorVersionWithoutV0) {
throw ZimFileFormatError("Zim file doesn't contain a title ordered index");
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reiterating the concern raised during the review of an earlier PR when ZIM file minor version was bumped - such a check is not future proof and will need to be updated when the major version changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would answer you the same answer then 🙂 and updating the major version implies adapt the code of the reader anyway.

Then I would agree that we can update the check to header.getMajorVersion() == Fileheader::zimMajorVersion && header.getMinorVersion() >= Fileheader::zimMinorVersionWithoutV0 to be explicit and help maintainer to update the code at next major version.

And finally, I could tell that we can simply don't use the minor version and just check if titleIdxPos is 0. And this is how I have updated the code.

mgautierfr added a commit to openzim/zim-testing-suite that referenced this pull request Feb 11, 2025
Generated files are using the commands:
```
mkdir data/noTitleListingV0
scripts/create_test_zimfiles --all src/small_zimfile_data/ data/noTitleListingV0
zimrecreate data/withns/wikipedia_en_climate_change_mini_2024-06.zim data/noTitleListingV0/wikipedia_en_climate_change_mini_2024-06.zim
zimrecreate data/withns/wikibooks_be_all_nopic_2017-02.zim data/noTitleListingV0/wikibooks_be_all_nopic_2017-02.zim
zimsplit data/noTitleListingV0/wikibooks_be_all_nopic_2017-02.zim --prefix data/noTitleListingV0/wikibooks_be_all_nopic_2017-02_splitted.zim --size 51200
```

`zimwriterfs`, `zimrecreate`, `zimsplit` are localy build using
- commit b5127fc9 of libzim (commit in PR openzim/libzim#949)
- commit 482a6708 of zimtools (head of PR openzim/zim-tools#452)

Using non released version of libzim was necessary as we need test files
to test the PR and without the PR merged, we cannot release a libzim
generating such kind of files.
Copy link
Collaborator

@veloman-yunkan veloman-yunkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks mostly fine though I've identified a few things that can be slightly improved. Please rebase&fixup for the final iteration.

Comment on lines 101 to 105
try {
if(bool(hints.at(FRONT_ARTICLE))) {
m_hasFrontArticles = true;
dirent->setFrontArticle();
m_handledDirents.push_back(dirent);
}
} catch(std::out_of_range&) {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a while to figure out why the try/catch was needed here. I think that the following helper function (which can be reused in xapianHandler.cpp) will make the code a little easier to comprehend:

bool isFrontArticle(const Dirent& dirent, const Hints& hints)
{
  // By definition, dirent not in `C` namespace are not FRONT_ARTICLE
  if (dirent.getNamespace() != NS::C) {
    return false;
  }
  try {
    return bool(hints.at(FRONT_ARTICLE));
  } catch(std::out_of_range&) {
    return false;
  }
}

Then it looks like some code can be moved from zim::writer::Item::getAmendedHints() into this new function, upon which the prospects for zim::writer::Item::getAmendedHints() to stay on the stage start to dim since what remains from it can be replaced in a similar fashion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done for isFrontArticle helper.
However, I'm not sure about getAmendedHints. It needs an item to amend the hints and at this level we don't have one (and we may not have one at all when adding redirect)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done for isFrontArticle helper.

Don't you want to immediately reuse it in xapianHandler.cpp?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Commit "Introduce helper function isFrontArticle" directly modified

Copy link
Collaborator

@veloman-yunkan veloman-yunkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, and can be merged as is. However if we want this PR to shine as a perfect diamond in the global history of treatment of title-ordered indices, there are a couple of improvements listed below:

src/fileheader.h Outdated
static const uint16_t zimOldMajorVersion;
static const uint16_t zimMajorVersion;
static const uint16_t zimMinorVersion;
static const uint16_t zimMinorVersionWithoutV0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This remains unused

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (the order of the commit has bit changed as this commit use zimMinorVersionWithoutV0 and its used in removed in next commits. New order introduce commit bumping minor version after commit not loading v0)

This means that there will be no fallback to "all entries"
(random, suggestion without xapian) when creator do not properly set
front articles.
This case is really rare as we automatically mark item as front article
if mimetype starts with "text/html" when no hint is given.
This has few consequences on the creation process:
- TitleListingHandle generate only the v1 title listing.
- We don't have to track titleListingHandle as a special handler and we
  don't need to store `m_titleListBlobOffset`.
- We set title idx pos to 0 in the header.
  So reader part (fast check, getMimeListEndUpperLimit) has to adapt.
- The tricky part is that we still need to create the handlers dirents
  before setting the entry indexes. This is done through the
 `handler->getDirents()` call.
Now we don't generate v0 (full) title index, we don't need to store all
dirents. We can store only the ones we need and we avoid a filter when
writing the title index.
It is useless as we don't need to filter the stored dirents.
This seems not relevant right now as we have only one category but it will
be when we'll have more.
PR openzim/zim-testing-suite#11 changes the zim
file wikipedia_en_climate_change to a more recent version.

However, no release of zim-testing-suite has been done and no adaptation
on libzim side neither.

It is time to do so. Especially as we will update our testing content.
PR openzim/zim-testing-suite#13 adds a new "noTitlelistingV0" test
category, ie zim files without v0 title index.

We have to adapt our tests to this.
Comparing the minorVersion "all the time" is enough.
Copy link
Collaborator

@veloman-yunkan veloman-yunkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LPTM. Thanks for playing perfectionist!

@kelson42 kelson42 merged commit e67b598 into main Feb 12, 2025
27 of 28 checks passed
@kelson42 kelson42 deleted the no_v0 branch February 12, 2025 13:48
veloman-yunkan added a commit that referenced this pull request Apr 16, 2025
Absence of the title index v0 is now signaled via the titleIdxPos set to
0xffffffffffffffff instead of 0 since the latter resulted in ZIM files
incompatible with older versions of libzim. Several such ZIM files have
found home in zim-testing-suite (v 0.7.0) and have to be replaced. In
theory, each of those ZIM files could be patched with a few bytes in two
locations (titleIdxPos in the ZIM file header and the checksum at the
end of the file). An easier solution is to recreate them, but I don't
want it to be done with a locally built version of zim-tools (like it
was done in openzim/zim-testing-suite#13 which references a libzim
commit that was eventually edited/replaced during the review/rebase of
PR#949 and is no longer part of the main branch).  That's why I preserve
the two changes brought by #949 related to the handling of `titleIdxPos
== 0` (marked with temporary comments introduced by this commit). Once
this PR is merged and we have an official CI build we can use it to
update zim-testing-suite, whereupon the marked changes can be rolled
back (in a separate PR).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Remove old (v0) titleindex (title pointer list)

4 participants