[Feature](agg) add agg function entropy#60833

Open
wrlcke wants to merge 1 commit into apache:master from wrlcke:functions/entropy

wrlcke (Contributor) commented Feb 25, 2026

What problem does this PR solve?

Add a new aggregate function, entropy.

Issue Number: #48203

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Thearas (Contributor) commented Feb 25, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Copilot AI left a comment

Pull request overview

This pull request adds an entropy aggregate function to Apache Doris, modeled on the one in DuckDB. The function computes Shannon entropy by building a frequency map of the input values and evaluating the resulting empirical distribution, with entropy measured in bits (base-2 logarithm).
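
For readers unfamiliar with the calculation, here is a minimal standalone sketch of Shannon entropy over a frequency map, H = -Σ p_i · log2(p_i) with p_i = count_i / total. It is not the code from this PR: `count_values` and `shannon_entropy_bits` are hypothetical names chosen for illustration, and the sketch ignores NULL handling, serialization, and merging of partial states.

```cpp
#include <cmath>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not from the PR): count how often each value occurs.
template <typename T>
std::unordered_map<T, std::uint64_t> count_values(const std::vector<T>& values) {
    std::unordered_map<T, std::uint64_t> freq;
    for (const T& v : values) {
        ++freq[v];
    }
    return freq;
}

// Hypothetical helper (not from the PR): Shannon entropy in bits of the
// empirical distribution described by the counts in `freq`.
// For an empty map the loop body never runs and the bare sum is 0.0;
// whether an aggregate should report that as 0 or NULL is a separate decision.
template <typename T>
double shannon_entropy_bits(const std::unordered_map<T, std::uint64_t>& freq) {
    std::uint64_t total = 0;
    for (const auto& entry : freq) {
        total += entry.second;
    }
    double h = 0.0;
    for (const auto& entry : freq) {
        // Every stored count is at least 1, so p > 0 and log2(p) is finite.
        const double p = static_cast<double>(entry.second) / static_cast<double>(total);
        h -= p * std::log2(p);
    }
    return h;
}
```

With two equally frequent values the result is 1 bit, and with a single repeated value it is 0 bits, which matches the usual reading of entropy as the uncertainty of the distribution.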

Changes:

  • Added backend (C++) implementation for entropy calculation using hash maps for frequency tracking
  • Added frontend (Java) function definition and registration in Nereids planner
  • Added comprehensive regression tests and unit tests covering various data types and edge cases

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| be/src/vec/aggregate_functions/aggregate_function_entropy.h | Core implementation of entropy calculation logic with support for numeric, string, and generic data types |
| be/src/vec/aggregate_functions/aggregate_function_entropy.cpp | Factory registration for the entropy aggregate function with type dispatching |
| be/src/vec/aggregate_functions/aggregate_function_simple_factory.cpp | Registered entropy function in the aggregate function factory |
| be/test/vec/aggregate_functions/agg_entropy_test.cpp | Unit tests covering numeric, string, generic, nullable, and empty input cases |
| be/test/vec/aggregate_functions/agg_function_test.h | Fixed empty block handling in deserialization tests |
| be/test/testutil/column_helper.h | Enhanced helper to support creating blocks with different column types |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/agg/Entropy.java | Frontend function definition extending NullableAggregateFunction |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/visitor/AggregateFunctionVisitor.java | Added visitor method for entropy function |
| fe/fe-core/src/main/java/org/apache/doris/catalog/BuiltinAggregateFunctions.java | Registered entropy in builtin aggregate functions catalog |
| regression-test/suites/query_p0/aggregate/aggregate_function_entropy.groovy | Comprehensive regression tests covering all data types, NULL handling, window functions, and edge cases |
| regression-test/data/query_p0/aggregate/aggregate_function_entropy.out | Expected output for regression tests |

Comment on lines +111 to +112
// entropy of empty set = 0
execute(block, ColumnHelper::create_column_with_name<DataTypeFloat64>({0.0}));
Copilot AI Feb 25, 2026

There is a discrepancy in the expected behavior for empty input. The unit test expects entropy of an empty set to be 0.0 (line 112), but the regression test expects NULL (line 114 of the .out file). This inconsistency needs to be resolved. Based on the regression test and the fact that the function inherits from NullableAggregateFunction, the correct behavior should likely be NULL for empty input, which means the unit test expectation at line 112 is incorrect.

Suggested change
- // entropy of empty set = 0
- execute(block, ColumnHelper::create_column_with_name<DataTypeFloat64>({0.0}));
+ // entropy of empty set = NULL
+ execute(block, ColumnHelper::create_nullable_column_with_name<DataTypeFloat64>({0.0}, {1}));
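
To make the two candidate behaviours concrete, the following is a minimal sketch of the "NULL for empty input" option, assuming C++17 and using `std::optional` as a stand-in for a nullable result. `entropy_or_null` is a hypothetical name for illustration only; it is not a function from this PR or from the Doris codebase, and it takes pre-computed counts rather than a column.

```cpp
#include <cmath>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical sketch (not the PR's code): std::nullopt models SQL NULL.
// `counts` holds the number of occurrences of each distinct value.
std::optional<double> entropy_or_null(const std::vector<std::uint64_t>& counts) {
    std::uint64_t total = 0;
    for (std::uint64_t c : counts) {
        total += c;
    }
    if (total == 0) {
        // No rows were aggregated: report NULL instead of 0.0.
        return std::nullopt;
    }
    double h = 0.0;
    for (std::uint64_t c : counts) {
        if (c == 0) {
            continue;  // by convention, a zero-probability term contributes 0
        }
        const double p = static_cast<double>(c) / static_cast<double>(total);
        h -= p * std::log2(p);
    }
    return h;
}
```

Under this convention a result of 0.0 always means "exactly one distinct value was seen", while an empty input stays distinguishable as NULL. Returning 0.0 for empty input would instead follow the bare formula, where the sum over no terms is zero.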

