Support limits on records loaded from Lucene index #10298
What does this PR do?
Allows the records (rids) retrieved from the Lucene search to be limited, where it is known that the remainder of the query does not require the entire set to be loaded. This is useful when the underlying Lucene query returns many results, but the query overall is only intended to return a small number of them (and in the ranked order from Lucene).
This mode is opt-in, enabled by providing a `limit` metadata element to the Lucene search function. A value of `select` uses the skip/limit in the `SELECT` statement to determine the max hits, and an integral value specifies an explicit max hits (e.g. for a safety margin where subsequent query filter/order operations are desired).
Motivation
All of our Lucene index queries apply every filtering criterion inside the Lucene query itself, and we have some pathological scenarios where those criteria are ranking-only (non-mandatory) and very general.
In the worst case this resulted in millions of RIDs being loaded from the Lucene index, when we would only want the top 100.
This causes high memory pressure (and often out of memory errors), with some of the RID arrays loaded being 800MB.
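To make the opt-in behaviour described under "What does this PR do?" concrete, here is a minimal sketch of how a `limit` metadata value could be turned into a cap on Lucene hits. It is not the code in this PR: the class `MaxHitsResolver`, the method `resolveMaxHits`, and the convention that a negative `skip`/`limit` means "not set" are all hypothetical, and treating `select` as `skip + limit` is an assumption about how the skip/limit would be combined.

```java
import java.util.OptionalInt;

// Hypothetical helper, for illustration only (not part of this PR).
final class MaxHitsResolver {
    private MaxHitsResolver() {}

    /**
     * @param limitMetadata value of the "limit" metadata element, or null if absent
     * @param skip  SKIP of the enclosing SELECT, negative if not set (assumed convention)
     * @param limit LIMIT of the enclosing SELECT, negative if not set (assumed convention)
     * @return the cap on Lucene hits, or empty if all hits should be loaded
     */
    static OptionalInt resolveMaxHits(Object limitMetadata, int skip, int limit) {
        if (limitMetadata == null) {
            // The mode is opt-in: without the metadata element nothing is capped.
            return OptionalInt.empty();
        }
        if (limitMetadata instanceof Number) {
            // An integral value is an explicit max hits.
            int explicit = ((Number) limitMetadata).intValue();
            if (explicit <= 0) {
                throw new IllegalArgumentException("limit must be positive: " + explicit);
            }
            return OptionalInt.of(explicit);
        }
        if ("select".equals(limitMetadata)) {
            if (limit < 0) {
                // No LIMIT on the SELECT, so there is nothing to derive a cap from.
                return OptionalInt.empty();
            }
            // Assumption: skipped records must still be fetched, so cap at skip + limit.
            long hits = (long) Math.max(skip, 0) + limit;
            return OptionalInt.of((int) Math.min(hits, Integer.MAX_VALUE));
        }
        throw new IllegalArgumentException("Unsupported limit value: " + limitMetadata);
    }
}
```

Under these assumptions, `resolveMaxHits("select", 20, 100)` yields a cap of 120 and `resolveMaxHits(500, 0, 100)` yields 500, while a missing metadata element leaves the existing load-everything behaviour in place.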
Related issues
Neo4J has a similar capability: https://community.neo4j.com/t/full-text-search-skip-and-limit/58773
Additional Notes
Checklist
[x] I have run the build using the `mvn clean package` command
[x] My unit tests cover both failure and success scenarios