GenericByteViewArray: support finding total length of all strings #9435

@neilconway

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

It would be useful to provide a method to determine the total length of all the strings in a StringViewArray. For example, in the implementation of concat() and concat_ws() UDFs in DataFusion, this would be useful to pre-allocate the output buffers for the result of the UDF.

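When the input is a StringViewArray, we currently use total_buffer_bytes_used() as a capacity hint, but this underestimates when the array contains short strings: values of 12 bytes or fewer are stored inline in the view rather than in a data buffer, so they are not counted (discussion).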

Describe the solution you'd like

A method on GenericByteViewArray like so:

/// Returns the total byte length of all strings in the array.
pub fn total_bytes_len(&self) -> usize {
    self.views()
        .iter()
        .map(|v| (*v as u32) as usize)
        .sum()
}

Describe alternatives you've considered

total_buffer_bytes_used() is better than nothing but it underestimates.

We could iterate over the views and take the lowest 32 bits of each ourselves, but depending on Arrow implementation details like that seems unfortunate.
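
To make the underestimate concrete, here is a rough standalone sketch that works on raw u128 views instead of a real GenericByteViewArray. It assumes the Arrow view layout (length in the lowest 32 bits, values of up to 12 bytes stored inline). The helpers inline_view, buffer_view and buffer_bytes_only are hypothetical names made up for illustration and are not arrow-rs APIs; buffer_bytes_only only approximates what I understand total_buffer_bytes_used() to count.

```rust
/// Values of this many bytes or fewer are stored inline in the view
/// (assumption taken from the Arrow view layout).
const MAX_INLINE_LEN: u32 = 12;

/// Hypothetical helper: pack a short value (<= 12 bytes) into an inline view.
fn inline_view(data: &[u8]) -> u128 {
    assert!(data.len() <= MAX_INLINE_LEN as usize);
    let mut bytes = [0u8; 16];
    // Bytes 0..4: length. Bytes 4..16: the value itself, zero padded.
    bytes[..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    bytes[4..4 + data.len()].copy_from_slice(data);
    u128::from_le_bytes(bytes)
}

/// Hypothetical helper: pack the view for a long value that lives in a data buffer.
fn buffer_view(data: &[u8], buffer_index: u32, offset: u32) -> u128 {
    assert!(data.len() > MAX_INLINE_LEN as usize);
    let mut bytes = [0u8; 16];
    // Bytes 0..4: length, 4..8: prefix, 8..12: buffer index, 12..16: offset.
    bytes[..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    bytes[4..8].copy_from_slice(&data[..4]);
    bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
    bytes[12..].copy_from_slice(&offset.to_le_bytes());
    u128::from_le_bytes(bytes)
}

/// The proposed computation: sum the length field of every view.
fn total_bytes_len(views: &[u128]) -> usize {
    views.iter().map(|v| (*v as u32) as usize).sum()
}

/// Approximation of the current capacity hint: only values that are not
/// stored inline contribute to the sum.
fn buffer_bytes_only(views: &[u128]) -> usize {
    views
        .iter()
        .map(|v| *v as u32)
        .filter(|len| *len > MAX_INLINE_LEN)
        .map(|len| len as usize)
        .sum()
}
```

For an array made up mostly of short values, buffer_bytes_only returns close to zero while total_bytes_len returns the real output size, which is the gap that matters when pre-allocating in concat() and concat_ws().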
