[Feature](agg) add agg function entropy #60833
Pull request overview
This pull request adds an entropy aggregate function to Apache Doris, modeled on the DuckDB function of the same name. The function computes Shannon entropy by building a frequency map of the input values, deriving the empirical probability distribution from the counts, and summing the per-value terms. The result is measured in bits (base-2 logarithm).
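The computation described above can be sketched in a few lines of standalone C++. This is a minimal illustration, not the PR's implementation: `shannon_entropy_bits` is a hypothetical name, `std::unordered_map` stands in for whatever hash map the backend uses, and returning 0.0 for empty input is a convention chosen only for this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not from the PR): Shannon entropy, in bits, of the
// empirical distribution of `values`.
template <typename T>
double shannon_entropy_bits(const std::vector<T>& values) {
    if (values.empty()) {
        return 0.0; // convention for this sketch only
    }

    // Step 1: frequency map, value -> number of occurrences.
    std::unordered_map<T, std::uint64_t> freq;
    for (const auto& v : values) {
        ++freq[v];
    }
    assert(!freq.empty());

    // Step 2: H = -sum(p * log2(p)) with p = count / total.
    const double total = static_cast<double>(values.size());
    double h = 0.0;
    for (const auto& entry : freq) {
        const double p = static_cast<double>(entry.second) / total;
        h -= p * std::log2(p);
    }
    return h;
}
```

For the input {1, 1, 2, 2}, each of the two distinct values has probability 0.5, so the result is 1 bit. Four distinct values with equal counts give 2 bits, and a column holding a single repeated value gives 0.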
Changes:
- Added backend (C++) implementation for entropy calculation using hash maps for frequency tracking
- Added frontend (Java) function definition and registration in Nereids planner
- Added comprehensive regression tests and unit tests covering various data types and edge cases
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| be/src/vec/aggregate_functions/aggregate_function_entropy.h | Core implementation of entropy calculation logic with support for numeric, string, and generic data types |
| be/src/vec/aggregate_functions/aggregate_function_entropy.cpp | Factory registration for the entropy aggregate function with type dispatching |
| be/src/vec/aggregate_functions/aggregate_function_simple_factory.cpp | Registered entropy function in the aggregate function factory |
| be/test/vec/aggregate_functions/agg_entropy_test.cpp | Unit tests covering numeric, string, generic, nullable, and empty input cases |
| be/test/vec/aggregate_functions/agg_function_test.h | Fixed empty block handling in deserialization tests |
| be/test/testutil/column_helper.h | Enhanced helper to support creating blocks with different column types |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/agg/Entropy.java | Frontend function definition extending NullableAggregateFunction |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/visitor/AggregateFunctionVisitor.java | Added visitor method for entropy function |
| fe/fe-core/src/main/java/org/apache/doris/catalog/BuiltinAggregateFunctions.java | Registered entropy in builtin aggregate functions catalog |
| regression-test/suites/query_p0/aggregate/aggregate_function_entropy.groovy | Comprehensive regression tests covering all data types, NULL handling, window functions, and edge cases |
| regression-test/data/query_p0/aggregate/aggregate_function_entropy.out | Expected output for regression tests |
    // entropy of empty set = 0
    execute(block, ColumnHelper::create_column_with_name<DataTypeFloat64>({0.0}));
The expected behavior for empty input is inconsistent. The unit test expects the entropy of an empty set to be 0.0 (line 112), but the regression test expects NULL (line 114 of the .out file). This needs to be resolved. Because the regression test expects NULL and the function extends NullableAggregateFunction, NULL is most likely the intended result for empty input, in which case the unit test expectation at line 112 is incorrect.
Suggested change:

    - // entropy of empty set = 0
    - execute(block, ColumnHelper::create_column_with_name<DataTypeFloat64>({0.0}));
    + // entropy of empty set = NULL
    + execute(block, ColumnHelper::create_nullable_column_with_name<DataTypeFloat64>({0.0}, {1}));
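To make the NULL-on-empty semantics concrete, here is a small standalone sketch. It is an assumption-laden illustration rather than Doris code: `entropy_or_null` is a hypothetical helper, and `std::optional<double>` stands in for a nullable Float64 result, with `std::nullopt` playing the role of NULL.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not from the PR): returns std::nullopt for empty
// input, mirroring the NULL result that the regression test expects.
std::optional<double> entropy_or_null(const std::vector<std::int64_t>& values) {
    if (values.empty()) {
        return std::nullopt; // empty input -> "NULL", not 0.0
    }

    std::unordered_map<std::int64_t, std::uint64_t> freq;
    for (std::int64_t v : values) {
        ++freq[v];
    }

    const double total = static_cast<double>(values.size());
    assert(total > 0.0);

    double h = 0.0;
    for (const auto& entry : freq) {
        const double p = static_cast<double>(entry.second) / total;
        h -= p * std::log2(p);
    }
    return h;
}
```

Under this convention a caller can tell "no rows" apart from "all rows equal": the first yields no value, while the second yields an entropy of 0.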
What problem does this PR solve?
Add a new aggregate function, entropy.
Issue Number: #48203
Related PR: #xxx
Problem Summary:
Release note
None