
add DictionaryEncodedValueIndex.getValueIterator and use it for ExpressionPredicateIndexSupplier #19023

Merged
clintropolis merged 4 commits into apache:master from clintropolis:use-dictionary-iterator-for-expression-predicate-index-supplier
Feb 18, 2026


@clintropolis
Member

changes:

  • Added `getValueIterator` method to `DictionaryEncodedValueIndex` to give an easy way for consumers to iterate the dictionary values in order
  • `ExpressionPredicateIndexSupplier` now uses `getValueIterator` to scan the dictionary values, offering a performance improvement, particularly when using front-coding
  • fixed a few other places that were iterating the dictionary using `get` to use the iterator instead
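
To illustrate why an ordered iterator can beat repeated `get` calls on a front-coded dictionary, here is a minimal sketch. `ToyFrontCodedDictionary` is a hypothetical stand-in written for this example; it is not Druid's front-coded implementation, and its layout (each value stored as a shared-prefix length plus suffix relative to the previous value in the bucket) is an assumption chosen for simplicity. The `decodeSteps` counter exists only to make the cost visible.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical toy front-coded dictionary; illustration only, not Druid code. */
final class ToyFrontCodedDictionary implements Iterable<String>
{
  private final int bucketSize;
  private final int[] prefixLengths; // chars shared with the previous value (0 at bucket start)
  private final String[] suffixes;   // remainder of each value after the shared prefix
  long decodeSteps = 0;              // counts how many entries were decoded

  ToyFrontCodedDictionary(List<String> sortedValues, int bucketSize)
  {
    this.bucketSize = bucketSize;
    this.prefixLengths = new int[sortedValues.size()];
    this.suffixes = new String[sortedValues.size()];
    for (int i = 0; i < sortedValues.size(); i++) {
      String value = sortedValues.get(i);
      int shared = 0;
      if (i % bucketSize != 0) {
        String previous = sortedValues.get(i - 1);
        int max = Math.min(previous.length(), value.length());
        while (shared < max && previous.charAt(shared) == value.charAt(shared)) {
          shared++;
        }
      }
      prefixLengths[i] = shared;
      suffixes[i] = value.substring(shared);
    }
  }

  int size()
  {
    return suffixes.length;
  }

  /** Random access: must replay the bucket from its first entry up to the requested index. */
  String get(int index)
  {
    int bucketStart = index - (index % bucketSize);
    String current = "";
    for (int i = bucketStart; i <= index; i++) {
      current = current.substring(0, prefixLengths[i]) + suffixes[i];
      decodeSteps++;
    }
    return current;
  }

  /** Ordered iteration: each entry is decoded once, reusing the previously decoded value. */
  @Override
  public Iterator<String> iterator()
  {
    return new Iterator<String>()
    {
      private int next = 0;
      private String previous = "";

      @Override
      public boolean hasNext()
      {
        return next < suffixes.length;
      }

      @Override
      public String next()
      {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        previous = previous.substring(0, prefixLengths[next]) + suffixes[next];
        next++;
        decodeSteps++;
        return previous;
      }
    };
  }

  public static void main(String[] args)
  {
    ToyFrontCodedDictionary dictionary = new ToyFrontCodedDictionary(
        Arrays.asList("druid", "druidic", "drum", "drummer", "dry", "dryer", "duck", "duct"),
        4
    );
    for (int i = 0; i < dictionary.size(); i++) {
      dictionary.get(i);
    }
    System.out.println("decode steps with get(i): " + dictionary.decodeSteps);
    dictionary.decodeSteps = 0;
    for (String value : dictionary) {
      // consume the value; a real caller would evaluate a predicate here
    }
    System.out.println("decode steps with iterator: " + dictionary.decodeSteps);
  }
}
```

In this toy layout the cost of `get(i)` grows with the position inside the bucket, so scanning the whole dictionary with `get` does more work as buckets get larger, while the iterator decodes each entry exactly once.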

Credit to #19004 for the added benchmark query and for bringing this issue to attention: when using front-coding, computing the indexes was slower than just doing a full scan (at least in some cases, such as this query).

before:

Benchmark                        (complexCompression)  (deferExpressionDimensions)  (jsonObjectStorageEncoding)  (query)  (rowsPerSegment)  (schemaType)  (storageType)   (stringEncoding)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlExpressionBenchmark.querySql                  NONE                 singleString                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        false  avgt    5  522.387 ± 22.942  ms/op
SqlExpressionBenchmark.querySql                  NONE                 singleString                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        force  avgt    5  501.122 ± 17.559  ms/op
SqlExpressionBenchmark.querySql                  NONE                   fixedWidth                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        false  avgt    5  547.506 ± 15.055  ms/op
SqlExpressionBenchmark.querySql                  NONE                   fixedWidth                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        force  avgt    5  446.650 ±  5.308  ms/op
SqlExpressionBenchmark.querySql                  NONE         fixedWidthNonNumeric                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        false  avgt    5  572.099 ± 67.823  ms/op
SqlExpressionBenchmark.querySql                  NONE         fixedWidthNonNumeric                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        force  avgt    5  499.534 ± 19.926  ms/op
SqlExpressionBenchmark.querySql                  NONE                       always                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        false  avgt    5  549.607 ± 25.846  ms/op
SqlExpressionBenchmark.querySql                  NONE                       always                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        force  avgt    5  496.660 ± 16.439  ms/op

after:

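Benchmark                        (complexCompression)  (deferExpressionDimensions)  (jsonObjectStorageEncoding)  (query)  (rowsPerSegment)  (schemaType)  (storageType)   (stringEncoding)  (vectorize)  Mode  Cnt    Score     Error  Units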
SqlExpressionBenchmark.querySql                  NONE                 singleString                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        false  avgt    5  428.333 ±  14.320  ms/op
SqlExpressionBenchmark.querySql                  NONE                 singleString                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        force  avgt    5  364.073 ±   5.671  ms/op
SqlExpressionBenchmark.querySql                  NONE                   fixedWidth                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        false  avgt    5  423.951 ±  12.710  ms/op
SqlExpressionBenchmark.querySql                  NONE                   fixedWidth                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        force  avgt    5  371.926 ±   5.133  ms/op
SqlExpressionBenchmark.querySql                  NONE         fixedWidthNonNumeric                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        false  avgt    5  424.357 ±  10.445  ms/op
SqlExpressionBenchmark.querySql                  NONE         fixedWidthNonNumeric                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        force  avgt    5  419.708 ±  71.678  ms/op
SqlExpressionBenchmark.querySql                  NONE                       always                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        false  avgt    5  444.724 ± 112.962  ms/op
SqlExpressionBenchmark.querySql                  NONE                       always                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        force  avgt    5  373.843 ±   8.409  ms/op
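
To make the two tables easier to compare, the sketch below computes the relative reduction in ms/op for each row and flags rows whose `Score ± Error` ranges do not overlap. The class and method names are hypothetical. The non-overlap check is a deliberately conservative heuristic, not a proper significance test, and it assumes the JMH `Error` column can be read as the half-width of a confidence interval.

```java
import java.util.Locale;

/** Hypothetical helper for comparing the before/after JMH rows quoted above. */
final class BenchmarkDelta
{
  /** Relative reduction in ms/op, as a fraction of the "before" score. */
  static double reduction(double beforeMs, double afterMs)
  {
    return (beforeMs - afterMs) / beforeMs;
  }

  /** True when the "after" range lies entirely below the "before" range. */
  static boolean clearlyFaster(double beforeMs, double beforeErr, double afterMs, double afterErr)
  {
    return afterMs + afterErr < beforeMs - beforeErr;
  }

  public static void main(String[] args)
  {
    // deferExpressionDimensions / vectorize, in the same order as the tables
    String[] labels = {
        "singleString/false", "singleString/force",
        "fixedWidth/false", "fixedWidth/force",
        "fixedWidthNonNumeric/false", "fixedWidthNonNumeric/force",
        "always/false", "always/force"
    };
    // before score, before error, after score, after error (ms/op), copied from the tables
    double[][] rows = {
        {522.387, 22.942, 428.333, 14.320},
        {501.122, 17.559, 364.073, 5.671},
        {547.506, 15.055, 423.951, 12.710},
        {446.650, 5.308, 371.926, 5.133},
        {572.099, 67.823, 424.357, 10.445},
        {499.534, 19.926, 419.708, 71.678},
        {549.607, 25.846, 444.724, 112.962},
        {496.660, 16.439, 373.843, 8.409}
    };
    for (int i = 0; i < rows.length; i++) {
      double[] r = rows[i];
      System.out.println(String.format(
          Locale.ROOT,
          "%-28s reduction=%.1f%% rangesDisjoint=%b",
          labels[i],
          100 * reduction(r[0], r[2]),
          clearlyFaster(r[0], r[1], r[2], r[3])
      ));
    }
  }
}
```

Rows with a large error on the "after" side (for example `always` with `vectorize=false`) will not pass the disjoint-range check even though their mean score improved, which is a reminder that those particular runs were noisy.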

I also considered adding a getBitmapsIterator to DictionaryEncodedValueIndex, but ultimately decided against it: most of the bitmap get methods coerce null values to empty bitmaps, so they can't just use the underlying Indexed iterator directly, which sounded a bit more tedious than I wanted to deal with. This could be a follow-up, so that the places that iterate the dictionary and collect the corresponding bitmaps can use iterators for both instead of keeping a counter, or use some convenient structure that iterates both at the same time so they never need to be kept in sync.
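
As a rough sketch of what such a combined structure might look like (purely hypothetical, nothing like this is part of the PR), the iterator below walks a value iterator and a bitmap iterator in lockstep and coerces null bitmaps to empty ones. `java.util.BitSet` stands in for Druid's bitmap type, and the class name `ValueAndBitmapIterator` is invented for this example.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.BitSet;
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;

/** Hypothetical lockstep iterator over dictionary values and their bitmaps. */
final class ValueAndBitmapIterator implements Iterator<Map.Entry<String, BitSet>>
{
  private final Iterator<String> values;
  private final Iterator<BitSet> bitmaps;

  ValueAndBitmapIterator(Iterator<String> values, Iterator<BitSet> bitmaps)
  {
    this.values = values;
    this.bitmaps = bitmaps;
  }

  @Override
  public boolean hasNext()
  {
    // Both sides are expected to have the same length; stop when either runs out.
    return values.hasNext() && bitmaps.hasNext();
  }

  @Override
  public Map.Entry<String, BitSet> next()
  {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    String value = values.next();
    BitSet bitmap = bitmaps.next();
    // Coerce a missing bitmap to an empty one, so callers never see null.
    return new SimpleImmutableEntry<>(value, bitmap == null ? new BitSet() : bitmap);
  }
}
```

A caller would then loop over the entries without keeping a separate dictionary id counter, since the pairing of value and bitmap is handled in one place.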

@jtuglu1
Contributor

jtuglu1 commented Feb 14, 2026

I wonder if it's worth adding some sort of perf section to the CI. Assuming we mandate that all perf-sensitive changes add a corresponding benchmark, presumably we can create a unit test that checks for statistically significant regressions in most core areas of the code? This might help catch more issues before they hit a release. Just would need to limit the execution time of said benchmarks.

@clintropolis
Member Author

> I wonder if it's worth adding some sort of perf section to the CI. Assuming we mandate that all perf-sensitive changes add a corresponding benchmark, presumably we can create a unit test that checks for statistically significant regressions in most core areas of the code? This might help catch more issues before they hit a release. Just would need to limit the execution time of said benchmarks.

I think that is a lot easier said than done. We'd have to be very strategic about what we run if we are talking about benchmarks like the one added here, since running them all for all combinations takes an absurd amount of time.

Just running SqlExpressionBenchmark with no specific parameters, I see:

# Run progress: 0.00% complete, ETA 43 days, 09:04:00

which, to be fair, is quite an overestimate since many combinations are not run together (for example, we only test compression stuff on MMAP segments), so it wouldn't really be that long, but it would still be a very long time.

If I restrict it to one set of combinations (and still run both non-vectorized and vectorized), e.g. `-p storageType=MMAP -p schemaType=explicit -p jsonObjectStorageEncoding=NONE -p complexCompression=lz4 -p stringEncoding=UTF8`, it still estimates almost 11 hours:

# Run progress: 0.00% complete, ETA 10:50:40

And this is just 1 benchmark. There are a ton of other files, many with just as many combinations.

This is also ignoring that these benchmarks are only as good as their coverage. The thing this PR improves basically only applied to front-coding with sufficiently large buckets, along with queries using enough virtual columns in filters, so that using the indexes became slower than not using them. There were many cases where using the indexes was an improvement, and even this query was faster when not using front-coding, so we would need to have the right combinations to catch this kind of thing ahead of time.

Contributor

@capistrant capistrant left a comment

Left one comment that is maybe more of a teaching moment for me when it comes to code in the query path.

Comment on lines 404 to 410
Contributor

As this logic is the same as `getValue`, would it make sense to have something like `private String getUsingGlobalIndex(int index)` that both can share? I'm not as used to the hot query path, though: would that risk perf, or would the compiler inline it and result in no runtime difference?

Member Author

Ah, they probably could use a shared method; I could measure just to be sure. This sort of code is copied in a lot of places all across the nested column stuff, since quite a lot of places have to do a similar translation from local field ids to global ids. I had a dream that someday I would try to consolidate them, but haven't really gotten around to it yet 😅
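
For readers unfamiliar with the pattern under discussion, here is a minimal sketch of a shared local-to-global lookup helper. Everything in it is hypothetical: `ToyNestedFieldDictionary` and its fields are invented for illustration, and only the helper name `getUsingGlobalIndex` comes from the review comment. It does not reflect how the actual nested column classes are laid out.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical field dictionary that maps local ids to entries of a shared global dictionary. */
final class ToyNestedFieldDictionary
{
  private final String[] globalDictionary; // global id -> value
  private final int[] localToGlobal;       // local field id -> global id

  ToyNestedFieldDictionary(String[] globalDictionary, int[] localToGlobal)
  {
    this.globalDictionary = globalDictionary;
    this.localToGlobal = localToGlobal;
  }

  String getValue(int localId)
  {
    return getUsingGlobalIndex(localId);
  }

  Iterator<String> getValueIterator()
  {
    return new Iterator<String>()
    {
      private int next = 0;

      @Override
      public boolean hasNext()
      {
        return next < localToGlobal.length;
      }

      @Override
      public String next()
      {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        return getUsingGlobalIndex(next++);
      }
    };
  }

  /** Single place for the local-to-global translation shared by both access paths. */
  private String getUsingGlobalIndex(int localId)
  {
    return globalDictionary[localToGlobal[localId]];
  }
}
```

Small private methods like this are typically candidates for inlining by the JIT compiler once they are hot, but as noted in the reply, measuring is the only way to be sure in a given code path.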

@clintropolis
Member Author

clintropolis commented Feb 18, 2026

CI failure seems unrelated (but it is also failing a lot; I've retried many times)

@clintropolis clintropolis merged commit b8f62f4 into apache:master Feb 18, 2026
111 of 115 checks passed
@clintropolis clintropolis deleted the use-dictionary-iterator-for-expression-predicate-index-supplier branch February 18, 2026 21:20