Perf: Allow User defined functions to potentially reuse their argument arrays (to avoid new allocations) #13516

Open
alamb opened this issue Nov 21, 2024 · 0 comments · May be fixed by #13507
Labels
enhancement New feature or request

alamb commented Nov 21, 2024

Is your feature request related to a problem or challenge?

Arrow arrays are designed to be immutable and use shared references extensively, but in some cases the underlying buffer can be reused when there are no other references to it (see the arrow unary_mut kernel for an example).
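To illustrate the "reuse only when there are no other references" idea without pulling in arrow, here is a minimal std-only sketch. `Arc<Vec<i64>>` is a stand-in for an Arrow buffer and `map_in_place` is a hypothetical helper, not an arrow API. As I understand it, unary_mut similarly hands the original array back when it is shared, which is the shape the `Result` mimics here.

```rust
use std::sync::Arc;

/// Hypothetical helper: applies `op` to every value in place if `values` is
/// the only reference to the buffer; otherwise returns the input untouched.
fn map_in_place(
    mut values: Arc<Vec<i64>>,
    op: impl Fn(i64) -> i64,
) -> Result<Arc<Vec<i64>>, Arc<Vec<i64>>> {
    if let Some(buf) = Arc::get_mut(&mut values) {
        // Unique owner: safe to overwrite, no new allocation.
        for v in buf.iter_mut() {
            *v = op(*v);
        }
        Ok(values)
    } else {
        // Shared: someone else can still read the data, so do not mutate it.
        Err(values)
    }
}
```

On `Err`, a caller would fall back to the usual path of allocating a fresh output buffer.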

At the time of writing, DataFusion scalar functions (ScalarFunctionImpl) must always allocate a new array when generating output. They cannot reuse the existing underlying memory, even if the source array will never be used again.

This is because the invoke signature receives its arguments by reference (as a slice of ColumnarValue) rather than by ownership:

fn invoke_batch(
    &self,
    args: &[ColumnarValue],
    number_rows: usize,
) -> Result<ColumnarValue, DataFusionError>
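To make the by-reference vs. by-ownership difference concrete, here is a small std-only sketch. `Column`, `negate_borrowed` and `negate_owned` are hypothetical names for illustration and are not DataFusion APIs. With only a shared slice, the function gets `&Arc<..>` and cannot obtain mutable access to the buffer; with owned arguments it can check whether it is the sole owner.

```rust
use std::sync::Arc;

// Hypothetical stand-in for an Arrow array: a reference-counted buffer.
type Column = Arc<Vec<i64>>;

/// By-reference signature (analogous to `args: &[ColumnarValue]`):
/// only `&Arc` is available, so the output must be a new allocation.
fn negate_borrowed(args: &[Column]) -> Column {
    Arc::new(args[0].iter().map(|v| -v).collect())
}

/// By-ownership signature: the function consumes its arguments and may
/// reuse a buffer when nothing else references it.
/// Sketch only: panics if `args` is empty.
fn negate_owned(mut args: Vec<Column>) -> Column {
    let mut col = args.swap_remove(0);
    if let Some(buf) = Arc::get_mut(&mut col) {
        // Sole owner: overwrite in place.
        buf.iter_mut().for_each(|v| *v = -*v);
        return col;
    }
    // Still shared elsewhere: fall back to allocating.
    Arc::new(col.iter().map(|v| -v).collect())
}
```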

For example, an expression like (a + b) + c will be evaluated like this:

  • a + b --> temp_array
  • temp_array + c --> result_array

This results in two new allocations.

Describe the solution you'd like

It would be really nice if it were possible to evaluate (a + b) + c like this (with no new allocations):

  • a + b --> a (write output to a, reusing allocation)
  • a + c --> a (now add c, also reusing allocation)

The result would be a new array that reuses the original allocation of the a array.
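The two steps above can be sketched with plain std types. This is only an illustration of the intended behavior, assuming a reference-counted `Vec<i64>` in place of an Arrow array; `Column`, `add_reusing` and `eval` are hypothetical names.

```rust
use std::sync::Arc;

// Hypothetical stand-in for an Arrow array.
type Column = Arc<Vec<i64>>;

/// Element-wise `lhs + rhs`. Writes into `lhs`'s buffer when `lhs` is the
/// only reference to it, otherwise allocates a new output buffer.
fn add_reusing(mut lhs: Column, rhs: &Column) -> Column {
    assert_eq!(lhs.len(), rhs.len(), "columns must have equal length");
    match Arc::get_mut(&mut lhs) {
        Some(buf) => {
            for (l, r) in buf.iter_mut().zip(rhs.iter()) {
                *l += *r;
            }
            lhs
        }
        None => Arc::new(lhs.iter().zip(rhs.iter()).map(|(l, r)| l + r).collect()),
    }
}

/// Evaluates (a + b) + c, threading ownership of `a` through both steps.
fn eval(a: Column, b: &Column, c: &Column) -> Column {
    let a = add_reusing(a, b); // a + b --> a
    add_reusing(a, c) // a + c --> a
}
```

If the caller passes a uniquely owned `a`, both additions write into `a`'s buffer and the returned column points at the same memory; if `a` is shared, the first step allocates once and the second step reuses that temporary.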

Describe alternatives you've considered

Now that this is merged

I think we can make it possible in the future to reuse allocations by changing what is passed into ScalarFunctionArgs

Since we haven't yet released a version with ScalarFunctionArgs, we can change its signature without breaking APIs until DataFusion 44 is released.

Additional context

I have a draft of the basic idea here:
