Question: how can you escape double quotes in search queries? #185

JLHasson · 2024-01-10T23:09:20Z

I have a query containing double quotes, e.g. a measurement in inches: '36"'. Right now I'm using tantivy.Index.parse_query(query) to parse the query. I have two questions:

1. What is the default behavior of the analyzer in this case for double quotes? (I can probably answer this myself with a bit more digging.)
2. I haven't been able to find a way to parse this without hitting a SyntaxError. Is this a bug?

---> 21     query = self._index.parse_query(q.query)  # Search all fields
     22     search_results = self._tantivy_results_to_docs(
     23         self._searcher.search(query, q.limit).hits
     24     )
     25     return [
     26         SearchResultFromSystem(
     27             result=SearchResult(query=q, result=self._tantivy_doc_to_dict(doc)),
   (...)
     32         for idx, (score, doc) in enumerate(search_results)
     33     ]

ValueError: Syntax Error: fawkes 36" blue vanity

The approaches I've tried, all of which result in the same error above:
Escape the double quote: 36\"
Double or triple escape the quote: 36\\\" or 36\\"
Encode: '36"'.encode('utf-8')
Unicode from here: "36\\u{FF02}"
Raw python string: r'36"'
Raw python string w/ escapes from above: r'36\"'...

What should I do to get Tantivy to interpret this not as a field search, but as a literal double quote?
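
As a side check on the attempts listed above, it can help to look at what string Python actually hands to the parser. The sketch below assumes each attempt was written as an ordinary Python string literal (the `attempts` dict and its keys are illustrative names, not part of the original report). It only inspects the strings and does not call tantivy.

```python
# What each attempted spelling evaluates to before it ever reaches tantivy.
# The dict keys are made-up labels for the attempts described above.
attempts = {
    "plain": '36"',
    "escaped": '36\"',               # \" inside a normal literal is just "
    "double_escaped": '36\\"',       # \\ is one backslash, then "
    "triple_escaped": '36\\\"',      # \\ is one backslash, \" is "
    "raw": r'36"',
    "raw_escaped": r'36\"',          # raw strings keep the backslash
    "unicode_brace": "36\\u{FF02}",  # literal text, not a Unicode escape
}

for name, value in attempts.items():
    print(f"{name:15} len={len(value):2} {value}")

# encode() produces bytes rather than a differently escaped str.
encoded = '36"'.encode("utf-8")
print(type(encoded).__name__, encoded)
```

Under that assumption the plain, escaped, and raw spellings are the same three-character string, and the double-escaped, triple-escaped, and raw-escaped spellings are the same four-character string, so the list above covers only two distinct `str` inputs plus a `bytes` object and a literal `\u{FF02}` text.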


cjrh · 2024-01-10T23:27:16Z

Hi @JLHasson, thanks for the report!

Could you make a tiny self-contained example we can run to reproduce the error? A simple approach could be to adapt one of the unit tests. Please also report the version of tantivy you're using. I found problems with escaping quotes back in version 0.19.2, but I don't think I've seen those problems since. A simple reproducer we can just copy-paste and run would save time.

JLHasson · 2024-01-10T23:39:49Z

Hey @cjrh ! Sure thing:

$ poetry show | grep tantivy
tantivy                   0.21.0

Code:

import tantivy
query = '36"'
doc = {"title": 'some title with 36"'}
schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("title", stored=True)
_schema = schema_builder.build()
_index = tantivy.Index(_schema)
writer = _index.writer()
writer.add_document(tantivy.Document(**doc))
writer.commit()

_index.reload()
query = _index.parse_query(query)  # Search all fields
# ---> 13 query = _index.parse_query(query)  # Search all fields
# 
# ValueError: Syntax Error: 36"
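
One way a caller could guard against this ValueError is to retry with the text wrapped as an escaped phrase on a single field. This is only a sketch under stated assumptions: `parse_or_phrase` and its `field` default are hypothetical names, the index object is assumed to expose `parse_query` as in the reproducer, and the escaped-phrase form is the one that replies in this thread report as parsing under 0.21.0.

```python
def parse_or_phrase(index, text, field="title"):
    """Try the text as-is; on a syntax error, retry as an escaped phrase.

    `index` is anything with a parse_query(str) method, such as the
    tantivy.Index built in the reproducer. Hypothetical helper.
    """
    try:
        return index.parse_query(text)
    except ValueError:
        # Escape only double quotes. Input that already contains
        # backslashes is not handled by this sketch.
        escaped = text.replace('"', '\\"')
        return index.parse_query(f'{field}:"{escaped}"')
```

The fallback changes the meaning of the query: it becomes a phrase search on one field instead of a search across all fields, which may or may not be acceptable for a given application.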

cjrh · 2024-01-11T10:24:23Z

Thank you for the reproducer, that helps.

It seems that the issue is the default tokenizer being used for the text field. If you change the tokenizer to raw or en_stem, the parse_query() call no longer fails. raw is not super useful for general text fields, but hopefully you can work with en_stem.

import tantivy
query = r'title:"36\""'
doc = {"title": r'some title with 36\"'}


schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("title", stored=True, tokenizer_name="en_stem")
_schema = schema_builder.build()
_index = tantivy.Index(_schema)
writer = _index.writer()
writer.add_document(tantivy.Document(**doc))
writer.commit()


_index.reload()
query = _index.parse_query(query)  # Search all fields
print(query)


searcher = _index.searcher()
results = searcher.search(query, 3).hits
print(results)

Note that the en_stem tokenizer will be stemming the tokens, which may not be what you want. This is the output that is produced running the above:

$ python main.py 
Query(TermQuery(Term(field=0, type=Str, "36")))
[(0.28768211603164673, <tantivy.DocAddress object at 0x71eb645fdf70>)]

You can see that the escaped quote in the query is lost after parsing. The r"36\"" becomes just 36.
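
The query string used in this example can also be built from arbitrary input rather than typed by hand. The helper below is a minimal sketch: `phrase_query` is a hypothetical name, and it escapes only the double quote, since that is the only escape this thread shows working.

```python
def phrase_query(field, text):
    """Build a field-scoped phrase query with inner double quotes escaped.

    Hypothetical helper. Backslashes already present in `text` are left
    untouched, so such input may still fail to parse.
    """
    escaped = text.replace('"', '\\"')
    return f'{field}:"{escaped}"'


print(phrase_query("title", '36"'))  # title:"36\""
```

The result is the same string as the raw literal `r'title:"36\""'` used above, so it can be passed to `parse_query()` in the same way.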

This seems like a bug in the default tokenizer.

cjrh · 2024-01-11T10:25:01Z

For your specific use-case, it may be necessary to make your own tokenizer.
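
Short of a custom tokenizer, one possible workaround is to rewrite the inch mark into a word before both indexing and querying, so the measurement survives tokenization as text. This is a hypothetical sketch, not something the library provides: `normalize_inches` and the `inch` suffix are made-up choices, and it assumes the tokenizer keeps a run of digits and letters such as `36inch` together as one token.

```python
import re

# A run of digits immediately followed by a double quote, e.g. 36"
_INCH_MARK = re.compile(r'(\d+)"')


def normalize_inches(text):
    """Rewrite 36" as 36inch. Apply to documents and queries alike."""
    return _INCH_MARK.sub(r"\g<1>inch", text)


print(normalize_inches('fawkes 36" blue vanity'))  # fawkes 36inch blue vanity
```

The pattern is deliberately narrow. It would also rewrite a closing quote that happens to follow a digit, as in a quoted phrase ending in a number, so it only suits input where double quotes are not used for phrase search.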

cjrh · 2024-01-11T10:38:44Z

Hmm, I was wrong about the default tokenizer having a bug. The same escaping that I described above in my en_stem example also works with the default tokenizer:

import tantivy
query = r'title:"36\""'
doc = {"title": r'some title with 36"'}

schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("title", stored=True, tokenizer_name="default")
schema = schema_builder.build()
index = tantivy.Index(schema)
writer = index.writer()
writer.add_document(tantivy.Document(**doc))
writer.commit()

index.reload()
query = index.parse_query(query)  # Search all fields
print(query)
print(index.searcher().search(query, 1).hits)

Output:

$ python main2.py 
Query(TermQuery(Term(field=0, type=Str, "36")))
[(0.28768211603164673, <tantivy.DocAddress object at 0x783eb87fdf50>)]

Again though, the extra " is lost after parsing because the tokenizer rules in default drop it.
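
To see why the quote disappears, here is a rough pure-Python approximation of that behavior. It assumes the default tokenizer splits on every character that is not a letter or digit and then lowercases the tokens. Any length filtering is left out, and `approx_default_tokens` is an illustrative name, not a tantivy function.

```python
def approx_default_tokens(text):
    """Approximate tokenization: split on non-alphanumerics, then lowercase."""
    tokens, current = [], []
    for ch in text:
        if ch.isalnum():
            current.append(ch)
        elif current:
            # Any other character, including ", ends the current token
            # and is itself discarded.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return [token.lower() for token in tokens]


print(approx_default_tokens('some title with 36"'))
```

Under this approximation the document and the query both reduce to the token `36`, which is consistent with the `TermQuery` printed above: the quote never becomes part of a term, so it cannot be matched literally on a field tokenized this way.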

cjrh · 2024-01-11T10:53:53Z

I'm going to leave this issue open until I have a chance to add documentation about this specifically. Also, I really need to find some time to address #25 to make it easier to customize tokenization.

cjrh added the query Issues related to the tantivy query engine label Jan 20, 2024

cjrh closed this as completed in eba0d55 Jan 21, 2024
