-
-
Notifications
You must be signed in to change notification settings - Fork 64
Remove listing/titleOrdered/v0. #949
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Generated files are using the commands: ``` mkdir data/nov0 scripts/create_test_zimfiles --all src/small_zimfile_data/ data/nov0 zimrecreate data/withns/wikipedia_en_climate_change_mini_2024-06.zim data/nov0/wikipedia_en_climate_change_mini_2024-06.zim zimrecreate data/withns/wikibooks_be_all_nopic_2017-02.zim data/nov0/wikibooks_be_all_nopic_2017-02.zim zimsplit data/nov0/wikibooks_be_all_nopic_2017-02.zim --prefix data/nov0/wikibooks_be_all_nopic_2017-02_splitted.zim --size 51200 ``` `zimwriterfs`, `zimrecreate`, `zimsplit` are localy build using - commit b5127fc9 of libzim (commit in PR openzim/libzim#949) - commit 482a6708 of zimtools (head of PR openzim/zim-tools#452) Using non released version of libzim was necessary as we need test files to test the PR and without the PR merged, we cannot release a libzim generating such kind of files.
src/fileimpl.cpp
Outdated
| if (header.getMinorVersion() >= Fileheader::zimMinorVersionWithoutV0) { | ||
| throw ZimFileFormatError("Zim file doesn't contain a title ordered index"); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reiterating the concern raised during the review of an earlier PR when ZIM file minor version was bumped - such a check is not future proof and will need to be updated when the major version changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would answer you the same answer then 🙂 and updating the major version implies adapt the code of the reader anyway.
Then I would agree that we can update the check to header.getMajorVersion() == Fileheader::zimMajorVersion && header.getMinorVersion() >= Fileheader::zimMinorVersionWithoutV0 to be explicit and help maintainer to update the code at next major version.
And finally, I could tell that we can simply don't use the minor version and just check if titleIdxPos is 0. And this is how I have updated the code.
Generated files are using the commands: ``` mkdir data/noTitleListingV0 scripts/create_test_zimfiles --all src/small_zimfile_data/ data/noTitleListingV0 zimrecreate data/withns/wikipedia_en_climate_change_mini_2024-06.zim data/noTitleListingV0/wikipedia_en_climate_change_mini_2024-06.zim zimrecreate data/withns/wikibooks_be_all_nopic_2017-02.zim data/noTitleListingV0/wikibooks_be_all_nopic_2017-02.zim zimsplit data/noTitleListingV0/wikibooks_be_all_nopic_2017-02.zim --prefix data/noTitleListingV0/wikibooks_be_all_nopic_2017-02_splitted.zim --size 51200 ``` `zimwriterfs`, `zimrecreate`, `zimsplit` are localy build using - commit b5127fc9 of libzim (commit in PR openzim/libzim#949) - commit 482a6708 of zimtools (head of PR openzim/zim-tools#452) Using non released version of libzim was necessary as we need test files to test the PR and without the PR merged, we cannot release a libzim generating such kind of files.
veloman-yunkan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks mostly fine though I've identified a few things that can be slightly improved. Please rebase&fixup for the final iteration.
src/writer/titleListingHandler.cpp
Outdated
| try { | ||
| if(bool(hints.at(FRONT_ARTICLE))) { | ||
| m_hasFrontArticles = true; | ||
| dirent->setFrontArticle(); | ||
| m_handledDirents.push_back(dirent); | ||
| } | ||
| } catch(std::out_of_range&) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took me a while to figure out why the try/catch was needed here. I think that the following helper function (which can be reused in xapianHandler.cpp) will make the code a little easier to comprehend:
bool isFrontArticle(const Dirent& dirent, const Hints& hints)
{
// By definition, dirent not in `C` namespace are not FRONT_ARTICLE
if (dirent.getNamespace() != NS::C) {
return false;
}
try {
return bool(hints.at(FRONT_ARTICLE));
} catch(std::out_of_range&) {
return false;
}
}Then it looks like some code can be moved from zim::writer::Item::getAmendedHints() into this new function, upon which the prospects for zim::writer::Item::getAmendedHints() to stay on the stage start to dim since what remains from it can be replaced in a similar fashion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done for isFrontArticle helper.
However, I'm not sure about getAmendedHints. It needs an item to amend the hints and at this level we don't have one (and we may not have one at all when adding redirect)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done for
isFrontArticlehelper.
Don't you want to immediately reuse it in xapianHandler.cpp?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Commit "Introduce helper function isFrontArticle" directly modified
350c412 to
178f436
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, and can be merged as is. However if we want this PR to shine as a perfect diamond in the global history of treatment of title-ordered indices, there are a couple of improvements listed below:
src/fileheader.h
Outdated
| static const uint16_t zimOldMajorVersion; | ||
| static const uint16_t zimMajorVersion; | ||
| static const uint16_t zimMinorVersion; | ||
| static const uint16_t zimMinorVersionWithoutV0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This remains unused
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done (the order of the commit has bit changed as this commit use zimMinorVersionWithoutV0 and its used in removed in next commits. New order introduce commit bumping minor version after commit not loading v0)
This means that there will be no fallback to "all entries" (random, suggestion without xapian) when creator do not properly set front articles. This case is really rare as we automatically mark item as front article if mimetype starts with "text/html" when no hint is given.
This has few consequences on the creation process: - TitleListingHandle generate only the v1 title listing. - We don't have to track titleListingHandle as a special handler and we don't need to store `m_titleListBlobOffset`. - We set title idx pos to 0 in the header. So reader part (fast check, getMimeListEndUpperLimit) has to adapt. - The tricky part is that we still need to create the handlers dirents before setting the entry indexes. This is done through the `handler->getDirents()` call.
Now we don't generate v0 (full) title index, we don't need to store all dirents. We can store only the ones we need and we avoid a filter when writing the title index.
It is useless as we don't need to filter the stored dirents.
This seems not relevant right now as we have only one category but it will be when we'll have more.
PR openzim/zim-testing-suite#11 changes the zim file wikipedia_en_climate_change to a more recent version. However, no release of zim-testing-suite has been done and no adaptation on libzim side neither. It is time to do so. Especially as we will update our testing content.
PR openzim/zim-testing-suite#13 adds a new "noTitlelistingV0" test category, ie zim files without v0 title index. We have to adapt our tests to this.
Comparing the minorVersion "all the time" is enough.
veloman-yunkan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LPTM. Thanks for playing perfectionist!
Absence of the title index v0 is now signaled via the titleIdxPos set to 0xffffffffffffffff instead of 0 since the latter resulted in ZIM files incompatible with older versions of libzim. Several such ZIM files have found home in zim-testing-suite (v 0.7.0) and have to be replaced. In theory, each of those ZIM files could be patched with a few bytes in two locations (titleIdxPos in the ZIM file header and the checksum at the end of the file). An easier solution is to recreate them, but I don't want it to be done with a locally built version of zim-tools (like it was done in openzim/zim-testing-suite#13 which references a libzim commit that was eventually edited/replaced during the review/rebase of PR#949 and is no longer part of the main branch). That's why I preserve the two changes brought by #949 related to the handling of `titleIdxPos == 0` (marked with temporary comments introduced by this commit). Once this PR is merged and we have an official CI build we can use it to update zim-testing-suite, whereupon the marked changes can be rolled back (in a separate PR).
Fix #800
This PR needs testing files from openzim/zim-testing-suite#13
(and a release of zim-testing-suite project to be correctly used here) and so, CI is failing (as expected).
The simpler for locally testing this PR is to copy https://github.com/openzim/zim-testing-suite/tree/no_v0/data/nov0 into you local zim-testing-suite v0.6.0 directory.
An important change is that
listing/titleOrdered/v1is now created all the time, even if there is no front article.This means that there will be no fallback to "all entries" (random, suggestion without xapian) when creator do not properly set front articles.
This case is really rare as we automatically mark item as front article if mimetype starts with "text/html" when no hint is given.