Question: how can you escape double quotes in search queries? #185
Hi @JLHasson, thanks for the report! Could you make a tiny self-contained example we can run to reproduce the error? A simple approach could be to adapt one of the unit tests. Please also report the version of tantivy you're testing with. I found problems with escaping quotes back in version 0.19.2, but I don't think I've seen them since. A simple reproducer we can just copy-paste and run would save time.
Hey @cjrh! Sure thing:
$ poetry show | grep tantivy
tantivy 0.21.0
Code:
import tantivy
query = '36"'
doc = {"title": 'some title with 36"'}
schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("title", stored=True)
_schema = schema_builder.build()
_index = tantivy.Index(_schema)
writer = _index.writer()
writer.add_document(tantivy.Document(**doc))
writer.commit()
_index.reload()
query = _index.parse_query(query) # Search all fields
# ---> 13 query = _index.parse_query(query)  # Search all fields
#
# ValueError: Syntax Error: 36"
Thank you for the reproducer, that helps. It seems that the issue is the
import tantivy
query = r'title:"36\""'
doc = {"title": r'some title with 36\"'}
schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("title", stored=True, tokenizer_name="en_stem")
_schema = schema_builder.build()
_index = tantivy.Index(_schema)
writer = _index.writer()
writer.add_document(tantivy.Document(**doc))
writer.commit()
_index.reload()
query = _index.parse_query(query) # Search all fields
print(query)
searcher = _index.searcher()
results = searcher.search(query, 3).hits
print(results)
Note that the
You can see that the escaped quote in the query is lost after parsing. The This seems like a bug in the
For your specific use-case, it may be necessary to make your own tokenizer.
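One possible workaround that avoids a custom tokenizer, sketched here as an assumption rather than anything the library provides: rewrite the inch mark into a plain word before the text reaches the index, and apply the same rewrite to the query string. The helper name `normalize_inches` and the replacement word `inch` are hypothetical choices for illustration.

```python
import re

# A number (optionally with a decimal part) followed by a double quote,
# e.g. 36" or 12.5 ". The pattern is an illustrative assumption.
_INCH_MARK = re.compile(r'(\d+(?:\.\d+)?)\s*"')


def normalize_inches(text: str) -> str:
    """Replace an inch mark after a number with the word 'inch'."""
    return _INCH_MARK.sub(r"\1 inch", text)


# Apply the same rewrite to document text and to the query text.
doc_title = normalize_inches('some title with 36"')
query_text = normalize_inches('36"')
print(doc_title)
print(query_text)
```

The normalized strings contain no double quote, so they can be handed to `add_document` and `parse_query` without any escaping. The rewrite would also mangle text where a number is wrapped in quotes for other reasons, so it only suits fields where `"` after a digit reliably means inches.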
Hmm, I was wrong about the
import tantivy
query = r'title:"36\""'
doc = {"title": r'some title with 36"'}
schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("title", stored=True, tokenizer_name="default")
schema = schema_builder.build()
index = tantivy.Index(schema)
writer = index.writer()
writer.add_document(tantivy.Document(**doc))
writer.commit()
index.reload()
query = index.parse_query(query) # Search all fields
print(query)
print(index.searcher().search(query, 1).hits)
Output:
Again though, the extra
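The query form used in the snippets in this thread, a field name followed by a quoted phrase with the inner quote backslash-escaped, can be built from arbitrary user input with a small helper. This is only a sketch: `phrase_query_string` is a hypothetical name, and escaping backslashes in addition to quotes is an assumption about the query grammar that this thread does not confirm.

```python
def phrase_query_string(field: str, text: str) -> str:
    """Build a `field:"..."` query string with inner quotes escaped."""
    # Escape backslashes first so the backslashes added for the quotes
    # in the next step are not themselves doubled.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


# Same shape as the hand-written r'title:"36\""' used in the thread.
print(phrase_query_string("title", '36"'))
```

The returned string can then be passed to `index.parse_query`. As noted in this thread, the escaped quote is lost after parsing, so this avoids the syntax error without making the quote character itself searchable.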
I'm going to leave this issue open until I have a chance to add documentation about this specifically. Also, I really need to find some time to address #25 to make it easier to customize tokenization.
I have a query containing double quotes, e.g. a measurement in inches: '36"'. Right now I'm using
tantivy.Index.parse_query(query)
to parse the query. I have two questions:
SyntaxError, is this a bug?
The approaches I've tried, all of which result in the same error above:
Escape the double quote: 36\"
Double or triple escape the quote: 36\\\" or 36\\"
Encode: '36"'.encode('utf-8')
Unicode from here: "36\\u{FF02}"
Raw Python string: r'36"'
Raw Python string w/ escapes from above: r'36\"'
...
What should I do to get Tantivy to interpret this not as a field search, but as a literal double quote?