add DictionaryEncodedValueIndex.getValueIterator and use it for ExpressionPredicateIndexSupplier by clintropolis · Pull Request #19023 · apache/druid

clintropolis · 2026-02-13T23:25:30Z

changes:

* Added `getValueIterator` method to `DictionaryEncodedValueIndex` to give consumers an easy way to iterate the dictionary values in order
* `ExpressionPredicateIndexSupplier` now uses `getValueIterator` to scan the dictionary values, offering a performance improvement, particularly when using front-coding
* Fixed a few other places that were iterating the dictionary with `get` so that they use the iterator instead
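
To illustrate why iterating can beat repeated `get` calls on a front-coded dictionary, here is a minimal sketch. `FrontCodedSketch` is a hypothetical, simplified structure written only for this example; it is not Druid's front-coded implementation, and the `decodeSteps` counter exists only to make the cost visible. The assumption is that random access has to re-decode a bucket from its first entry, while an iterator can carry the previously decoded value forward.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical, simplified front-coded dictionary (NOT Druid's implementation).
final class FrontCodedSketch implements Iterable<String>
{
  private final int bucketSize;
  private final int[] prefixLengths; // chars shared with the previous value in the same bucket
  private final String[] suffixes;   // remainder of each value after the shared prefix
  long decodeSteps = 0;              // number of entries reconstructed so far

  FrontCodedSketch(List<String> sortedValues, int bucketSize)
  {
    this.bucketSize = bucketSize;
    this.prefixLengths = new int[sortedValues.size()];
    this.suffixes = new String[sortedValues.size()];
    for (int i = 0; i < sortedValues.size(); i++) {
      String value = sortedValues.get(i);
      int shared = 0;
      if (i % bucketSize != 0) { // the first entry of each bucket is stored whole
        String prev = sortedValues.get(i - 1);
        int max = Math.min(prev.length(), value.length());
        while (shared < max && prev.charAt(shared) == value.charAt(shared)) {
          shared++;
        }
      }
      prefixLengths[i] = shared;
      suffixes[i] = value.substring(shared);
    }
  }

  int size()
  {
    return suffixes.length;
  }

  // Random access: decode every entry from the start of the bucket up to index.
  String get(int index)
  {
    String current = null;
    for (int i = index - (index % bucketSize); i <= index; i++) {
      current = decode(current, i);
    }
    return current;
  }

  // Sequential access: each entry is decoded once from the previously returned value.
  @Override
  public Iterator<String> iterator()
  {
    return new Iterator<String>()
    {
      private int position = 0;
      private String previous = null;

      @Override
      public boolean hasNext()
      {
        return position < suffixes.length;
      }

      @Override
      public String next()
      {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        previous = decode(previous, position++);
        return previous;
      }
    };
  }

  private String decode(String previous, int i)
  {
    decodeSteps++;
    return prefixLengths[i] == 0 ? suffixes[i] : previous.substring(0, prefixLengths[i]) + suffixes[i];
  }

  public static void main(String[] args)
  {
    List<String> values = new ArrayList<>();
    for (int i = 0; i < 32; i++) {
      values.add(String.format("value%02d", i));
    }
    FrontCodedSketch dict = new FrontCodedSketch(values, 16);
    for (int i = 0; i < dict.size(); i++) {
      dict.get(i);
    }
    System.out.println("decode steps via get(i): " + dict.decodeSteps);
    dict.decodeSteps = 0;
    for (String ignored : dict) {
      // iteration only; each value is decoded exactly once
    }
    System.out.println("decode steps via iterator: " + dict.decodeSteps);
  }
}
```

In this toy model, scanning 32 values in buckets of 16 costs 272 decode steps through `get(i)` and 32 through the iterator, so the gap grows with bucket size.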

Credit to #19004 for the added benchmark query and for bringing this issue to attention: with front-coding, computing the indexes was slower than just doing a full scan (at least in some cases, such as this query).

before:

Benchmark                        (complexCompression)  (deferExpressionDimensions)  (jsonObjectStorageEncoding)  (query)  (rowsPerSegment)  (schemaType)  (storageType)   (stringEncoding)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlExpressionBenchmark.querySql                  NONE                 singleString                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        false  avgt    5  522.387 ± 22.942  ms/op
SqlExpressionBenchmark.querySql                  NONE                 singleString                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        force  avgt    5  501.122 ± 17.559  ms/op
SqlExpressionBenchmark.querySql                  NONE                   fixedWidth                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        false  avgt    5  547.506 ± 15.055  ms/op
SqlExpressionBenchmark.querySql                  NONE                   fixedWidth                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        force  avgt    5  446.650 ±  5.308  ms/op
SqlExpressionBenchmark.querySql                  NONE         fixedWidthNonNumeric                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        false  avgt    5  572.099 ± 67.823  ms/op
SqlExpressionBenchmark.querySql                  NONE         fixedWidthNonNumeric                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        force  avgt    5  499.534 ± 19.926  ms/op
SqlExpressionBenchmark.querySql                  NONE                       always                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        false  avgt    5  549.607 ± 25.846  ms/op
SqlExpressionBenchmark.querySql                  NONE                       always                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        force  avgt    5  496.660 ± 16.439  ms/op

after:

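Benchmark                        (complexCompression)  (deferExpressionDimensions)  (jsonObjectStorageEncoding)  (query)  (rowsPerSegment)  (schemaType)  (storageType)   (stringEncoding)  (vectorize)  Mode  Cnt    Score     Error  Units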
SqlExpressionBenchmark.querySql                  NONE                 singleString                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        false  avgt    5  428.333 ±  14.320  ms/op
SqlExpressionBenchmark.querySql                  NONE                 singleString                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        force  avgt    5  364.073 ±   5.671  ms/op
SqlExpressionBenchmark.querySql                  NONE                   fixedWidth                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        false  avgt    5  423.951 ±  12.710  ms/op
SqlExpressionBenchmark.querySql                  NONE                   fixedWidth                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        force  avgt    5  371.926 ±   5.133  ms/op
SqlExpressionBenchmark.querySql                  NONE         fixedWidthNonNumeric                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        false  avgt    5  424.357 ±  10.445  ms/op
SqlExpressionBenchmark.querySql                  NONE         fixedWidthNonNumeric                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        force  avgt    5  419.708 ±  71.678  ms/op
SqlExpressionBenchmark.querySql                  NONE                       always                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        false  avgt    5  444.724 ± 112.962  ms/op
SqlExpressionBenchmark.querySql                  NONE                       always                        SMILE       61           1500000      explicit           MMAP  FRONT_CODED_16_V1        force  avgt    5  373.843 ±   8.409  ms/op

I also considered adding a getBitmapsIterator to DictionaryEncodedValueIndex, but ultimately decided against it because most of the bitmap get methods coerce null values to empty bitmaps, so they can't just use the underlying Indexed iterator directly, which sounded a bit more tedious than I wanted to deal with. Perhaps this can be a follow-up, so that the places that iterate dictionaries and collect the corresponding bitmaps can use iterators for both instead of keeping a counter, or we could make some convenient structure to iterate both things at the same time so we don't even need to keep them in sync.
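
One possible shape for such a "convenient structure" is sketched below. Everything in it is an assumption made for illustration: `ValueAndBitmapIteratorSketch` and `zip` are hypothetical names, `java.util.BitSet` stands in for a real bitmap type, and the null-to-empty coercion mirrors the behavior described above rather than any existing Druid method.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.BitSet;
import java.util.Iterator;
import java.util.Map;

// Hypothetical helper: walks dictionary values and their bitmaps in lockstep,
// so callers do not need to keep a separate counter in sync.
final class ValueAndBitmapIteratorSketch
{
  static Iterator<Map.Entry<String, BitSet>> zip(Iterable<String> values, Iterable<BitSet> bitmaps)
  {
    final Iterator<String> valueIt = values.iterator();
    final Iterator<BitSet> bitmapIt = bitmaps.iterator();
    return new Iterator<Map.Entry<String, BitSet>>()
    {
      @Override
      public boolean hasNext()
      {
        return valueIt.hasNext() && bitmapIt.hasNext();
      }

      @Override
      public Map.Entry<String, BitSet> next()
      {
        String value = valueIt.next();
        BitSet bitmap = bitmapIt.next();
        // Coerce a missing bitmap to an empty one, as the get-style methods are described as doing.
        return new SimpleImmutableEntry<>(value, bitmap == null ? new BitSet() : bitmap);
      }
    };
  }
}
```

A caller would then loop over the entries and read `getKey()` and `getValue()` together, with no index bookkeeping.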

jtuglu1 · 2026-02-14T18:36:05Z

I wonder if it's worth adding some sort of perf section to the CI. Assuming we mandate that all perf-sensitive changes add a corresponding benchmark, presumably we can create a unit test that checks for statistically significant regressions in most core areas of the code? This might help catch more issues before they hit a release. Just would need to limit the execution time of said benchmarks.
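
As a rough idea of what such a check could look like, the sketch below compares two benchmark results given as score ± error in ms/op (lower is better) and only reports a change when the two intervals do not overlap. `BenchmarkRegressionSketch` and its methods are hypothetical, and non-overlapping intervals is a conservative heuristic assumed here for simplicity, not a formal statistical test.

```java
// Hypothetical CI helper: flags a change only when the score intervals are disjoint.
final class BenchmarkRegressionSketch
{
  // True when the new result is slower even after allowing for both error margins.
  static boolean isSignificantRegression(double baseScore, double baseError, double newScore, double newError)
  {
    return newScore - newError > baseScore + baseError;
  }

  // True when the new result is faster even after allowing for both error margins.
  static boolean isSignificantImprovement(double baseScore, double baseError, double newScore, double newError)
  {
    return newScore + newError < baseScore - baseError;
  }

  public static void main(String[] args)
  {
    // singleString, vectorize=false, from the before/after tables in this PR
    System.out.println(isSignificantImprovement(522.387, 22.942, 428.333, 14.320));
    // always, vectorize=false: the large error on the "after" run makes the intervals overlap
    System.out.println(isSignificantImprovement(549.607, 25.846, 444.724, 112.962));
  }
}
```

With the numbers from the description, the `singleString` non-vectorized run counts as a clear improvement, while the `always` non-vectorized run does not because of its ± 112.962 error, which shows how noisy runs would limit what such a gate can catch.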

clintropolis · 2026-02-17T19:48:01Z

> I wonder if it's worth adding some sort of perf section to the CI. Assuming we mandate that all perf-sensitive changes add a corresponding benchmark, presumably we can create a unit test that checks for statistically significant regressions in most core areas of the code? This might help catch more issues before they hit a release. Just would need to limit the execution time of said benchmarks.

I think that is a lot easier said than done. We'd have to be very strategic about what we run if we are talking about benchmarks like the one added here, since running them all for all combinations takes an absurd amount of time.

Just running SqlExpressionBenchmark with no specific parameters, I see:

# Run progress: 0.00% complete, ETA 43 days, 09:04:00

which, to be fair, is quite an overestimate since many combinations are not run together (for example, we only test compression on MMAP segments), so it wouldn't really take that long, but it would still take a very long time.

If I restrict it to 1 set of combinations (and still run both non-vectorized and vectorized), e.g. -p storageType=MMAP -p schemaType=explicit -p jsonObjectStorageEncoding=NONE -p complexCompression=lz4 -p stringEncoding=UTF8

it still estimates almost 11 hours:

# Run progress: 0.00% complete, ETA 10:50:40

And this is just 1 benchmark. There are a ton of other files, many with just as many combinations.

This is also ignoring that these benchmarks are only as good as their coverage. The thing this PR improves basically only applied to front-coding with sufficiently large buckets, along with queries using enough virtual columns in filters that it became slower than not using the indexes. There were very many cases where using the indexes was an improvement, and even this query was faster when not using front-coding, so we would need to have the right combinations to catch this kind of thing beforehand.

capistrant

left one comment that is maybe more of a teaching moment for me when it comes to code in the query path

capistrant · 2026-02-17T20:20:27Z

processing/src/main/java/org/apache/druid/segment/nested/NestedFieldColumnIndexSupplier.java

As this logic is the same as getValue, is there sense in having something like private String getUsingGlobalIndex(int index) that both can share? I'm not as used to the hot query path, though: would that risk perf, or would the compiler inline it and result in no runtime difference?

Ah, they probably could use a shared method; I could measure just to be sure. This sort of code is copied in a lot of places all across the nested column stuff, since quite a lot of places have to do a similar translation from local field ids to global ids. I had a dream that someday I would try to consolidate them, but haven't really gotten around to it yet 😅
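
The refactor under discussion can be pictured roughly as follows. This is a hedged sketch, not the code in NestedFieldColumnIndexSupplier: `LocalToGlobalLookupSketch`, its fields, and the single flat global dictionary are assumptions made for illustration, and only the helper name `getUsingGlobalIndex` comes from the review comment. The point is that both the random-access path and the iterator path delegate to one small private method.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a field-level index that maps local ids to global dictionary ids.
final class LocalToGlobalLookupSketch
{
  private final int[] localToGlobal;           // local field id -> global dictionary id
  private final List<String> globalDictionary; // assumed single global dictionary

  LocalToGlobalLookupSketch(int[] localToGlobal, List<String> globalDictionary)
  {
    this.localToGlobal = localToGlobal;
    this.globalDictionary = globalDictionary;
  }

  String getValue(int localId)
  {
    return getUsingGlobalIndex(localId);
  }

  Iterator<String> getValueIterator()
  {
    return new Iterator<String>()
    {
      private int position = 0;

      @Override
      public boolean hasNext()
      {
        return position < localToGlobal.length;
      }

      @Override
      public String next()
      {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        return getUsingGlobalIndex(position++);
      }
    };
  }

  // Shared translation step; small private methods like this are typical JIT inlining candidates.
  private String getUsingGlobalIndex(int localId)
  {
    return globalDictionary.get(localToGlobal[localId]);
  }
}
```

Whether the JIT actually inlines the helper on a given hot path is something a benchmark run would have to confirm, as suggested in the reply above.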

clintropolis · 2026-02-18T20:53:27Z

CI failure seems unrelated (but it is also failing a lot; I've retried many times)

github-actions bot added the Area - Segment Format and Ser/De label Feb 13, 2026

clintropolis mentioned this pull request Feb 13, 2026

Configurable index disabling for virtual columns #19004

Open

9 tasks

fix test

e189af8

Merge remote-tracking branch 'upstream/master' into use-dictionary-it…

e11bd20

…erator-for-expression-predicate-index-supplier

gianm approved these changes Feb 17, 2026

capistrant approved these changes Feb 17, 2026

share method

76c2f41

clintropolis merged commit b8f62f4 into apache:master Feb 18, 2026
111 of 115 checks passed

clintropolis deleted the use-dictionary-iterator-for-expression-predicate-index-supplier branch February 18, 2026 21:20

clintropolis merged 4 commits into apache:master from clintropolis:use-dictionary-iterator-for-expression-predicate-index-supplier
