Commit b2941cd

Protection against too long single-word titles
Titles consisting of a single word that exceeds the maximum word length are not indexed, which matches how such a long word is handled when it is not the only word in the title. An alternative would be to index a truncated version of the title, but that would complicate verification of the bug fix, because the potentially dangerous title would remain in the expected output of the unit test.
1 parent 754f9f9 commit b2941cd

2 files changed: 8 additions, 4 deletions

src/writer/xapianIndexer.cpp

Lines changed: 6 additions & 1 deletion
@@ -130,9 +130,12 @@ size_t getTermCount(const Xapian::Document& d)
 
 void XapianIndexer::indexTitle(const std::string& path, const std::string& title, const std::string& targetPath)
 {
+  const size_t MAX_WORD_LENGTH = 64;
+
   assert(indexingMode == IndexingMode::TITLE);
   Xapian::Stem stemmer;
   Xapian::TermGenerator indexer;
+  indexer.set_max_word_length(MAX_WORD_LENGTH);
   indexer.set_flags(Xapian::TermGenerator::FLAG_CJK_NGRAM);
   try {
     stemmer = Xapian::Stem(stemmer_language);
@@ -162,7 +165,9 @@ void XapianIndexer::indexTitle(const std::string& path, const std::string& title
       // only ANCHOR_TERM was added, hence unaccentedTitle is made solely of
       // non-word characters. Then add entire title as a single term.
       currentDocument.remove_term(*currentDocument.termlist_begin());
-      currentDocument.add_term(unaccentedTitle);
+      if ( unaccentedTitle.size() <= MAX_WORD_LENGTH ) {
+        currentDocument.add_term(unaccentedTitle);
+      }
     }
   }
 
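The guard added in this file can be illustrated in isolation. The following is only a minimal sketch, not code from the commit: `kMaxWordLength`, `fitsWordLimit` and `makeWord` are hypothetical names introduced here, and the way the test suite actually builds its `w64` and `w65` words is an assumption. It shows that a 64-byte single-word title passes the size check while a 65-byte one is rejected.

```cpp
#include <cstddef>
#include <string>

// Mirrors the value of MAX_WORD_LENGTH from the change above.
constexpr std::size_t kMaxWordLength = 64;

// Hypothetical helper (not part of the commit): returns true when a title
// that would be added as one whole term stays within the limit.
// Note that std::string::size() counts bytes, not characters.
bool fitsWordLimit(const std::string& unaccentedTitle)
{
  return unaccentedTitle.size() <= kMaxWordLength;
}

// Hypothetical helper: builds a word of exactly `length` bytes that starts
// with "awordthatis" and is padded with 'x'. Lengths below 11 truncate the
// prefix. The real w64/w65 test words may be constructed differently.
std::string makeWord(std::size_t length)
{
  std::string word = "awordthatis";
  word.resize(length, 'x');
  return word;
}
```

With these helpers, `fitsWordLimit(makeWord(64))` holds and `fitsWordLimit(makeWord(65))` does not, which matches the boundary the updated unit test checks.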

test/suggestion.cpp

Lines changed: 2 additions & 3 deletions
@@ -753,11 +753,10 @@ TEST(Suggestion, titleEdgeCases) {
   );
 
   EXPECT_SUGGESTED_TITLES(archive, "awordthatis",
-    w65, // a very long word slips in when it is the only word of a title
     w64,
     "Is " + w64 + " too long?"
-    // "Is " + w65 + " too long?" isn't included because w65 has been ignored
-    // during indexing
+    // w65 and "Is " + w65 + " too long?" aren't included because w65 has
+    // been ignored during indexing
   );
 }
 
