use correct tokenizer when building termsetquery #4983
base: main
Conversation
I think we want this to behave such that:
should behave exactly like
@@ -42,26 +43,43 @@ impl From<TermQuery> for QueryAst {
}
nitpick
fn get_field(field_path: &str, schema: &TantivySchema) -> Option<Field> {
    if let Some((field, _field_path)) = schema.find_field(field_path) {
        return Some(field);
    }
    schema.get_field(DYNAMIC_FIELD_NAME).ok()
}
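The fallback semantics of the suggested helper can be modeled in isolation. Below is a hedged, self-contained sketch: the `HashMap` schema, the `u32` `Field` alias, and the `demo_schema` helper are stand-ins invented for illustration, not quickwit's or tantivy's actual types.

```rust
use std::collections::HashMap;

const DYNAMIC_FIELD_NAME: &str = "_dynamic";

// Toy stand-in: a real tantivy Field is an opaque id; u32 is enough here.
type Field = u32;

// Resolve a field path, falling back to the catch-all dynamic field when
// the path is not declared in the schema.
fn get_field(field_path: &str, schema: &HashMap<&str, Field>) -> Option<Field> {
    if let Some(&field) = schema.get(field_path) {
        return Some(field);
    }
    schema.get(DYNAMIC_FIELD_NAME).copied()
}

// Made-up helper for this sketch only.
fn demo_schema() -> HashMap<&'static str, Field> {
    HashMap::from([("title", 1), (DYNAMIC_FIELD_NAME, 99)])
}

fn main() {
    // a declared field resolves directly
    assert_eq!(get_field("title", &demo_schema()), Some(1));
    // an unknown path falls back to the dynamic field
    assert_eq!(get_field("body.some.path", &demo_schema()), Some(99));
}
```

If the schema has no dynamic field either, the lookup correctly yields `None` and the `?` in the caller propagates the miss.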
let field = schema
    .find_field(&self.field)
    .or_else(|| schema.find_field(DYNAMIC_FIELD_NAME))?
    .0;
Suggested change:
-    let field = schema
-        .find_field(&self.field)
-        .or_else(|| schema.find_field(DYNAMIC_FIELD_NAME))?
-        .0;
+    let field = get_field(&self.field, schema)?;
self.build_ast(
    schema,
    tokenizer_manager,
    self.get_tokenizer(schema)
        .or(Some("raw"))
        .map(ToString::to_string),
)
Suggested change:
-    self.build_ast(
-        schema,
-        tokenizer_manager,
-        self.get_tokenizer(schema)
-            .or(Some("raw"))
-            .map(ToString::to_string),
-    )
+    let tokenizer: String = self.get_tokenizer(schema)
+        .unwrap_or("raw")
+        .to_string();
+    self.build_ast(
+        schema,
+        tokenizer_manager,
+        tokenizer,
+    )
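The suggested rewrite is behavior-preserving on the tokenizer argument. A minimal standalone check (plain Rust, no quickwit types; `via_or_map` and `via_unwrap_or` are made-up names for this sketch):

```rust
// The original chain produces Option<String>, the suggested one String;
// the payload string is the same in both the Some and None cases.
fn via_or_map(tokenizer: Option<&str>) -> Option<String> {
    tokenizer.or(Some("raw")).map(ToString::to_string)
}

fn via_unwrap_or(tokenizer: Option<&str>) -> String {
    tokenizer.unwrap_or("raw").to_string()
}

fn main() {
    // a tokenizer found on the field is passed through unchanged
    assert_eq!(
        via_or_map(Some("en_stem")).unwrap(),
        via_unwrap_or(Some("en_stem"))
    );
    // no tokenizer on the field: both default to "raw"
    assert_eq!(via_or_map(None).unwrap(), "raw");
    assert_eq!(via_unwrap_or(None), "raw");
}
```

Since `.or(Some(...))` can never yield `None`, the `Option` wrapper carries no information, which is the reviewer's point below about the parameter type.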
&self,
schema: &TantivySchema,
tokenizer_manager: &TokenizerManager,
tokenizer: Option<String>,
I think this is never called with None
_search_fields: &[String],
_with_validation: bool,
) -> Result<TantivyQueryAst, InvalidQuery> {
self.build_ast(schema, tokenizer_manager, Some("raw".to_string()))
Suggested change:
-    self.build_ast(schema, tokenizer_manager, Some("raw".to_string()))
+    self.build_ast(schema, tokenizer_manager, "raw".to_string())
use super::{BuildTantivyAst, QueryAst};
use crate::query_ast::utils::DYNAMIC_FIELD_NAME;
term_query.rs should not have been modified at all. This PR obscures what it does with logic that only concerns term_set_query. As far as I could tell, the only thing being factored out is build_ast, and it is definitely not worth it.
@@ -56,15 +63,34 @@ impl TermSetQuery {
field: full_path.to_string(),
value: value.to_string(),
};
-let ast =
-    term_query.build_tantivy_ast_call(schema, tokenizer_manager, &[], false)?;
+let ast = term_query.ast_for_term_extraction(schema, tokenizer_manager)?;
Let's not build a TermQuery at all; instead, add a function:
build_ast_for_term(field_path: &str, value: &str, schema: &Schema, tokenizer_manager: &TokenizerManager)
let str_term_count = terms.iter().filter(|term| is_term_str(term)).count();
if str_term_count <= 1 {
    for term in terms {
        all_terms.insert(term);
    }
} else {
    // we have a string, and it got split into multiple tokens, so we want an
    // intersection of them
    let mut phrase = Vec::with_capacity(terms.len());
    for term in terms {
        if is_term_str(&term) {
            phrase.push(term);
        } else {
            all_terms.insert(term);
        }
    }
    intersections.push(phrase);
}
The logic is rather odd here. You implicitly rely on the fact that you know the query AST is a disjunction over different fields, e.g. (text query with possibly more than one term OR numerical query). If you want to take that kind of shortcut (after having considered other solutions), add comments to explain why it works. Ideally, assert that the query is indeed what you think it is.
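To make the branching under review concrete, here is a hedged, self-contained model of it. The `Term` enum, the `partition` function name, and the `BTreeSet` are toy stand-ins invented for this sketch; tantivy's real `Term` is an opaque byte-level representation.

```rust
use std::collections::BTreeSet;

// Toy stand-in for tantivy's Term, invented for this sketch.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Term {
    Str(String),
    U64(u64),
}

fn is_term_str(term: &Term) -> bool {
    matches!(term, Term::Str(_))
}

// Mirrors the snippet above: zero or one string token goes straight into
// the flat term set; several string tokens (one input value split by the
// tokenizer) become an intersection, while non-string terms still go into
// the flat set.
fn partition(
    terms: Vec<Term>,
    all_terms: &mut BTreeSet<Term>,
    intersections: &mut Vec<Vec<Term>>,
) {
    let str_term_count = terms.iter().filter(|t| is_term_str(t)).count();
    if str_term_count <= 1 {
        all_terms.extend(terms);
    } else {
        let mut phrase = Vec::with_capacity(terms.len());
        for term in terms {
            if is_term_str(&term) {
                phrase.push(term);
            } else {
                all_terms.insert(term);
            }
        }
        intersections.push(phrase);
    }
}

fn main() {
    let mut all_terms = BTreeSet::new();
    let mut intersections = Vec::new();
    // "hello world" tokenized into two tokens, plus the numeric leg of the
    // disjunction the reviewer mentions
    partition(
        vec![
            Term::Str("hello".into()),
            Term::Str("world".into()),
            Term::U64(42),
        ],
        &mut all_terms,
        &mut intersections,
    );
    // the two string tokens form one intersection; the number stays flat
    assert_eq!(intersections.len(), 1);
    assert_eq!(intersections[0].len(), 2);
    assert!(all_terms.contains(&Term::U64(42)));
}
```

The sketch also shows why the reviewer's concern matters: the `<= 1` branch silently assumes all string tokens in one `terms` batch came from the same field, which only holds if the AST has the disjunction shape described above.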
Ok(term_set_query.into())
use tantivy::query::{BooleanQuery, Query, TermQuery, TermSetQuery};
let (terms_it, intersections) = self.make_term_iterator(schema, tokenizer_manager)?;
we probably need a comment here.
let typ = val.json_path_type().unwrap_or_else(|| val.typ());
typ == Type::Str
}
impl TermSetQuery {
    fn make_term_iterator(
This function has a very non-trivial purpose and return type. It needs a comment.
See comments in PR. As far as I can tell, there is no reason to put code in the term_query.rs file.
Description
fix #4945
How was this PR tested?
Added an integration test (also fixed an integration test that was not cleaning up after itself).