Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ., ", and unicode characters search error in subfields in map score modifiers #1008

Open
wants to merge 19 commits into
base: mainline
Choose a base branch
from

Conversation

RaynorChavez
Copy link
Member

@RaynorChavez RaynorChavez commented Oct 16, 2024

  • What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)
    Bug fix

  • What is the current behavior? (You can also link to an open issue here)

Scenario 1:

  • Say we have the document: {"_id": "1_map", "text_field": "a photo of a cat", "map_score_mods_float": {"a.subsubfield": 0.5}},
  • And we attempt to access the score modifier in a query through: "add_to_score": [{"field_name": "map_score_mods_float.a.subsubfield", "weight": 2}]
  • Then we will get an error:
status_code: 500
ValueError: too many values to unpack (expected 2)

Scenario 2:

  • with the document {"_id": "1_map", "text_field": "a photo of a cat", "map_score_mods_float": {"a"subsubfield": 0.5}},
  • we will get an error: Failed for special character '"': MarqoWebError: MarqoWebError Error message: {'message': "Marqo encountered an unexpected internal error.\nPlease create an issue on Marqo's GitHub repo (https://github.com/marqo-ai/marqo/issues) if this problem persists.", 'code': None, 'type': None, 'link': ''} status_code: 500, type: None, code: None, link:
  • In marqo, the error is: marqo.vespa.exceptions.VespaStatusError: 400: {"root":{"id":"toplevel","relevance":1.0,"fields":{"totalCount":0},"errors":[{"code":3,"summary":"Illegal query","message":"Could not set 'ranking.features.query(marqo__add_weights_tensor)' to '{\"map_score_mods_float.a\\\"subsubfield\": 2.0}': Could not parse '{\"map_score_mods_float.a\\\"subsubfield\": 2.0}' as a tensor of type tensor(p{}): At value position 26: Expected a ':' but got 's'"}]}}

Scenario 3:

  • for all subfields unicode characters and \\
  • users will be able to index, but will get a silent error when searching because of the discrepancy between the encoding format for these characters between add_docs and the query - thus, search will be possible, but the correct score_modifier will not be matched and applied
  • What is the new behavior (if this is a feature change)?
  • For score modifiers with ., expected behavior is that we no longer get an internal error saying too many values to unpack and that the score is correctly applied.
  • For score modifiers with ", expected behavior is that we no longer have the 500 internal error when searching and that the score modifier is applied to the score As discussed, we will let this silent error like other unicode characters until we hear back from the vespa team. Currently, we will encode " in unicode. I have confirmed that this will not match with documents explicitly containing the \u0022 encoding.
  • For all other unicode characters, we will let these silent fail for now.

The main changes to enable this were:

  • For . we simply split the query along the first . character, resulting in only 2 partitions, instead of in all ., which may result in multiple partitions if the query score modifier has more that one .
  • For ", Vespa seems to have a bug wherein it attempts to escape " in the queries by transforming it to \\\", leading to an invalid query. To solve this without modifying vespa code, we switch to url encoding (where " is stored as %22) in both documents and queries. Note that this only applies to the subfields of score modifier fields in both unstructured and structured indexes . Use re.sub(r'(")', lambda m: r'\u0022', s) .
  • Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)
  • Need someone to check
  • Have unit tests been run against this PR? (Has there also been any additional testing?)

  • Related Python client changes (link commit/PR here)

  • Related documentation changes (link commit/PR here)

  • Other information:

  • Please check if the PR fulfills these requirements

  • The commit message follows our guidelines
  • Tests for the changes have been added (for bug fixes/features)
  • Docs have been added / updated (for bug fixes / features)

@RaynorChavez RaynorChavez changed the title Fix . search error when subfields in map score modifiers contain . Fix . and " search error when subfields in map score modifiers contain . Oct 17, 2024
@RaynorChavez RaynorChavez changed the title Fix . and " search error when subfields in map score modifiers contain . Fix ., ", and unicode characters search error when subfields in map score modifiers contain . Oct 17, 2024
@RaynorChavez RaynorChavez changed the title Fix ., ", and unicode characters search error when subfields in map score modifiers contain . Fix ., ", and unicode characters search error in subfields in map score modifiers Oct 17, 2024
Copy link
Collaborator

@wanliAlex wanliAlex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants