Skip to content

Conversation

@veloman-yunkan
Copy link
Collaborator

Fixes #992

Xapian's parser discards words/terms longer than a certain limit (default: 64). However our workaround for handling titles that are made fully of non-word characters has created a loophole for words of arbitrary length to slip in. That leads to crashes if Xapian's
hard limit on the length of a term (max 245 bytes) is exceeded.

This PR closes the described loophole and validates the fix with a new unit test.

Note that an alternative fix is possible (indexing the truncated title instead of discarding it completel) but that would complicate the verification of the bug fix (because the potentially dangerous title would remain in the expected output of the unit test and the unit test would have to be converted into a less explicit crash/no-crash variant).

@codecov
Copy link

codecov bot commented Sep 14, 2025

Codecov Report

❌ Patch coverage is 38.46154% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.16%. Comparing base (d13bfdc) to head (2f9654e).

Files with missing lines Patch % Lines
src/writer/xapianIndexer.cpp 38.46% 0 Missing and 8 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1004      +/-   ##
==========================================
- Coverage   56.22%   56.16%   -0.06%     
==========================================
  Files         101      101              
  Lines        4989     5001      +12     
  Branches     2170     2178       +8     
==========================================
+ Hits         2805     2809       +4     
- Misses        738      739       +1     
- Partials     1446     1453       +7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

@kelson42 kelson42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@veloman-yunkan Thank you for tackling this issue. If I get it right your fix allows to ignore silently the entry requested to be indexed and moves forward.

If confirmed, this is not the right approach I believe. The correct approach IMO is to trigger an exception and then the software using the libzim should decide what to do with it:

  • Either ignore (like you have implemented at the low level)
  • Truncate (like @benoit74 did)
  • Stop and request the software user to rename the entry he wants to index

Doing so will allow, in addition, to the scraper developer/user to write/get a custom error message.

I understand that this brings complexity one level higher on the software stack, but I don't want us to just ignore things deep in the code and let the user (libzim user) believing that everything was OK.

Copy link

@benoit74 benoit74 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some feedbacks:

  • completely ignoring too long words is probably going to be deceptive to our users ; at least the case we have encountered in mwoffliner is an WPAR (arabic) article with only one single very long arabic word as title ; if we ignore this title, then it means this article will not be searchable at all in suggestions, which I'm quite sure will be reported as a bug by users ; truncating seems more appealing and easy to understand for users from my PoV, and the reason you've mentioned (testing complexity) for not implementing this is not really convincing me
  • I don't see any explanation of why we stick with Xapian default 64 bytes limit while we could support 245 bytes ; 64 bytes is not that much especially for alphabets needing 2, 3 or even 4 bytes per character (and again in these languages we tend to often have single "words" titles, think e.g. about Chinese)
  • Tests miss cases where redirects titles are too long (quite important to check this works too should handling of titles become different for redirects)
  • It was explained in issue that this problem was caused by glass (default) Xapian backend ; I do not see any issue or code comment mentioning we can remove this limit on word length once we change Xapian backend, and I think it is mostly mandatory since this limit has not been pushed by any usability concern but only by a technical limitation

@benoit74
Copy link

Just saw kelson42 comment, I'm quite aligned with the idea of pushing the problem to the libzim "user" with a clear exception. We should mention this somewhere in the documentation (not saying that users will easily find this documentation, but at least we will be able to point to this documentation whenever needed).

I like the fact that we do not take decision at a low level for a problem which might have different solutions depending on the context (truncating vs dropping vs altering too long word).

I'm a bit concerned about the fact that we will start to push technical Xapian limitations to the "user". It means that it will be harder to maintain in the long run, e.g. once Xapian technical changes, we will have to modify all scrapers. But I feel like it is an acceptable price to pay for flexibility.

@veloman-yunkan
Copy link
Collaborator Author

If confirmed, this is not the right approach I believe. The correct approach IMO is to trigger an exception and then the software using the libzim should decide what to do with it:

@kelson42 I can agree with such an approach but we won't be able to implement it (at least, as easily) for the case of titles consisting of two or more words. Then our implementation will be inconsistent - long words inside multiword titles will be silently ignored, whereas in single-word titles the user will be in control.

@kelson42
Copy link
Contributor

If confirmed, this is not the right approach I believe. The correct approach IMO is to trigger an exception and then the software using the libzim should decide what to do with it:

@kelson42 I can agree with such an approach but we won't be able to implement it (at least, as easily) for the case of titles consisting of two or more words. Then our implementation will be inconsistent - long words inside multiword titles will be silently ignored, whereas in single-word titles the user will be in control.

@veloman-yunkan I don't understand why " for the case of titles consisting of two or more words" they will be silently ignored. Sorry.

@veloman-yunkan
Copy link
Collaborator Author

veloman-yunkan commented Sep 22, 2025

If confirmed, this is not the right approach I believe. The correct approach IMO is to trigger an exception and then the software using the libzim should decide what to do with it:

@kelson42 I can agree with such an approach but we won't be able to implement it (at least, as easily) for the case of titles consisting of two or more words. Then our implementation will be inconsistent - long words inside multiword titles will be silently ignored, whereas in single-word titles the user will be in control.

@veloman-yunkan I don't understand why " for the case of titles consisting of two or more words" they will be silently ignored. Sorry.

We treat certain titles as a special case (see #765) and it's only in our handling of such titles that we can intervene. For titles failing to meet our special case criteria Xapian is fully in charge.

Revisiting what the special case is I see that it is more complex than a single-word title. The general formulation is "text that appears empty to the Xapian parser", which happens when the text is made up of exclusively the following types of words:

  • "pseudo-words" consisting of punctuation only
  • stopwords (Update: stopwords are not used for title indexing)
  • words exceeding the max length limit

Thus a multi-word title all of the words in which are too long will also pass through our special handling, but dealing with it will be more difficult because we will have to parse it!

Having written all this I now realize that my fix works under the wrong assumption (and #765 is also affected by it).

@kelson42
Copy link
Contributor

kelson42 commented Sep 22, 2025

Thank you for the explanation, but I still don't understand why "we won't be able to implement" a detection (in the special case treatment) of the words which are too long and trigger an exception when at least one of them is too long.

@veloman-yunkan
Copy link
Collaborator Author

veloman-yunkan commented Sep 25, 2025

We are able to implement anything within the current limits of the public knowledge and technology. But reimplementing parts of libzim's dependencies in order to fix some of their shortcomings is suboptimal. In this case we will have to tinker with parsing the text (something that Xapian did fully on its own).

@kelson42
Copy link
Contributor

kelson42 commented Sep 26, 2025

@veloman-yunkan If you think that Xapian does not fully does the job, why we don't create an upstream issue then?

@veloman-yunkan veloman-yunkan force-pushed the protection-against-too-long-single-word-titles branch from e5b844b to be83e09 Compare October 26, 2025 14:41
@veloman-yunkan veloman-yunkan force-pushed the protection-against-too-long-single-word-titles branch from be83e09 to 4669964 Compare December 12, 2025 11:44
@veloman-yunkan
Copy link
Collaborator Author

It turned out that, contrary to my earlier comment, indexing of titles doesn't use stopwords. I think that the current fix can be considered an acceptable remedy for #992.

  • I don't see any explanation of why we stick with Xapian default 64 bytes limit while we could support 245 bytes ; 64 bytes is not that much especially for alphabets needing 2, 3 or even 4 bytes per character (and again in these languages we tend to often have single "words" titles, think e.g. about Chinese)

Agree. I increased the length limit of indexable words to 240 bytes.

@veloman-yunkan If you think that Xapian does not fully does the job, why we don't create an upstream issue then?

IMHO, the current implementation is within reasonable assumptions about content that we may encounter in practice (based on a quick skim through https://en.wikipedia.org/wiki/Longest_words and https://en.wikipedia.org/wiki/Longest_word_in_English). One shouldn't disturb maintainers of open-source projects with issues that apply only to imaginary scenarios or intentional attempts to break commonsense norms/rules. Let's return to this discussion when we encounter a real-world example of a word longer than 240 bytes in a title that we'd absolutely want to NOT be ignored while indexing.

@benoit74
Copy link

One real world situation (which is at the root of the #992 issue) is this article

This article title is longer than the 240 bytes you propose in this PR, but this article title is indeed composed of one single word.

In mwoffliner we use the article title to populate the suggestion search index.

Reading the tests I'm a bit confused.

Do I get it correctly that with this PR we will still be able to find this article in the suggestion search?

@veloman-yunkan
Copy link
Collaborator Author

One real world situation (which is at the root of the #992 issue) is this article

This article title is longer than the 240 bytes you propose in this PR, but this article title is indeed composed of one single word.

In mwoffliner we use the article title to populate the suggestion search index.

Reading the tests I'm a bit confused.

Do I get it correctly that with this PR we will still be able to find this article in the suggestion search?

No, that word will be dropped. An alternative would be to truncate it, but as I already pointed out, that would be inconsistent since the word would be dropped if it was accompanied by another one.

@benoit74
Copy link

Wouldn't this be a technical issue to have a search entry without any word?

@veloman-yunkan
Copy link
Collaborator Author

Wouldn't this be a technical issue to have a search entry without any word?

I cannot think of any technical issue to be caused by such an entry. We had a similar situation before with titles consisting exclusively of non-word characters (#763) and the only complaint was that articles with such titles cannot be found.

@kelson42
Copy link
Contributor

kelson42 commented Dec 13, 2025

@veloman-yunkan Reading all the comments, I'm still a bit lost about the situation. What is sure is that XapianIndexer::indexTitle() needs a clear documentation for the users also explaining the edge cases... and hopeful that will also enlight me.

Regarding the handling of very long strings, it seems we still have scenarios where the title (or part of the title) won't silently be indexed. To me this is NOK and this is the role to the libzim's client to decide what to do on this (based on the linzim triggered exception). I still don't really understand why, but from your comment, you seem to imply that reimplementing the Xapian tokenizer on our side can not be the right approach. I agree on this, but we can rely on the Xapian tokenizer to detect the tokens over the size limit... before sending to the whole title to the indexation. The Tokenizer API is open and available AFAIK?!

Xapian's parser (TermGenerator) discards words/terms longer than a
certain limit (default: 64). However terms can be added to the indexed
documents directly via Xapian::Document::add_term(). We do that in order
to index titles that are made fully of non-word characters but our
implementation opens up a loophole for words of arbitrary length to slip
in (when the title is a single word). That leads to crashes if Xapian's
hard limit on the length of a term (max 245 characters) is exceeded.

The new unit test demonstrates the existence of a loophole.
Titles consisting of a single word exceeding the max word length limit
are not indexed (similarly to how that long word would be handled if it
were not the only word in the title).

An alternative might be to index a truncated version of the title but
that would complicate the verification of the bug fix (because the
potentially dangerous title would remain in the expected output of the
unit test).
@veloman-yunkan veloman-yunkan force-pushed the protection-against-too-long-single-word-titles branch from 4669964 to 192df20 Compare December 14, 2025 15:15
Xapian may silently omit words exceeding a certain size limit during
indexing and there is no API to detect that it happened. This commit
is a hacky way of detecting such omissions at the cost of false
positives trigerred by titles containing too much whitespace and/or
punctuation.

Also decreased the limit on the max term size because of crashes
in the newly extracted test-case Suggestion.handlingOfTooLongWords.
@veloman-yunkan veloman-yunkan force-pushed the protection-against-too-long-single-word-titles branch from 192df20 to 2f9654e Compare December 14, 2025 15:22
@veloman-yunkan
Copy link
Collaborator Author

@veloman-yunkan Reading all the comments, I'm still a bit lost about the situation. What is sure is that XapianIndexer::indexTitle() needs a clear documentation for the users also explaining the edge cases... and hopeful that will also enlight me.

Regarding the handling of very long strings, it seems we still have scenarios where the title (or part of the title) won't silently be indexed. To me this is NOK and this is the role to the libzim's client to decide what to do on this (based on the linzim triggered exception). I still don't really understand why, but from your comment, you seem to imply that reimplementing the Xapian tokenizer on our side can not be the right approach. I agree on this, but we can rely on the Xapian tokenizer to detect the tokens over the size limit... before sending to the whole title to the indexation. The Tokenizer API is open and available AFAIK?!

@kelson42 XapianIndexer::indexTitle() is not part of the public API, however I added a note to its doc comment.

Xapian doesn't have a separate tokenizer. The closest is TermGenerator that indexes given text (by parsing it, and directly adding parsed words/terms to the document object).

I have come up with a hack that detects situations when long words have been omitted while indexing titles however the heuristic would also be triggered by excessive whitespace and/or punctuation. BTW, while implementing it I ran against a situation when the increased limit on the size of the longest indexable word resulted in crashes on the toy example so I restored the default value of 64 bytes.

@kelson42
Copy link
Contributor

@kelson42 XapianIndexer::indexTitle() is not part of the public API, however I added a note to its doc comment.

@veloman-yunkan Thank you, but I can not see the documentation in readthedocs... either I just look at the wrong place or something is wrong somewhere.

@veloman-yunkan
Copy link
Collaborator Author

@kelson42 XapianIndexer::indexTitle() is not part of the public API, however I added a note to its doc comment.

@veloman-yunkan Thank you, but I can not see the documentation in readthedocs... either I just look at the wrong place or something is wrong somewhere.

As I had written, XapianIndexer::indexTitle() (and, moreover, XapianIndexer in its entirety) is not part of the libzim's public API. That's why it cannot be found on https://libzim.readthedocs.io/. The (updated) doc comment can be seen in this PR.

@kelson42
Copy link
Contributor

@kelson42 XapianIndexer::indexTitle() is not part of the public API, however I added a note to its doc comment.

@veloman-yunkan Thank you, but I can not see the documentation in readthedocs... either I just look at the wrong place or something is wrong somewhere.

As I had written, XapianIndexer::indexTitle() (and, moreover, XapianIndexer in its entirety) is not part of the libzim's public API. That's why it cannot be found on https://libzim.readthedocs.io/. The (updated) doc comment can be seen in this PR.

@veloman-yunkan It has to be visible for the library users/developers. It should not be only an internal documentation. Remember, @benoit74 has open the issue because he had crash and wanted to understand what/how this is working. The RTD documentation is computed for each PR, for this PR this is at https://libzim--1004.org.readthedocs.build/en/1004/. If XapianIndexer::indexTitle() is not the right place, we should put it somewhere else, but I'm not the right person to decide where (but my opinion is that it should be put at the public part of the Indexing API).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Xapian issue when title is too "long"

4 participants