Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
It would be useful to provide a method to determine the total length of all the strings in a StringViewArray. For example, in the implementation of concat() and concat_ws() UDFs in DataFusion, this would be useful to pre-allocate the output buffers for the result of the UDF.
When the input is a StringViewArray, we currently use total_buffer_bytes_used() as a capacity hint, but this underestimates when the array contains short strings, because strings of 12 bytes or fewer are stored inline in the view rather than in a data buffer (discussion).
Describe the solution you'd like
A method on GenericByteViewArray like so:
/// Returns the total byte length of all strings in the array.
pub fn total_bytes_len(&self) -> usize {
    self.views()
        .iter()
        // The length is stored in the lowest 32 bits of each u128 view.
        .map(|v| (*v as u32) as usize)
        .sum()
}
Describe alternatives you've considered
total_buffer_bytes_used() is better than nothing, but it underestimates because it ignores inlined strings.
We could iterate over the views and take the lowest 32 bits ourselves, but depending on Arrow implementation details like that seems unfortunate.
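To illustrate the gap between the two estimates, here is a minimal, standalone sketch that does not use the arrow crate. It assumes the view layout from the Arrow columnar format (length in the low 32 bits; strings of up to 12 bytes stored inline; longer strings carrying a 4-byte prefix, buffer index, and offset). `make_view` and `buffer_bytes_used` are hypothetical helpers written for this example; the latter only approximates what I understand total_buffer_bytes_used() to do and is not the arrow-rs implementation.

```rust
/// Strings up to this many bytes are stored inline in the 16-byte view
/// (assumed from the Arrow variable-size binary view layout).
const MAX_INLINE_LEN: usize = 12;

/// Hypothetical helper: encodes `s` as a little-endian u128 view.
/// Long strings are appended to `buffer`, which stands in for data buffer 0.
fn make_view(s: &[u8], buffer: &mut Vec<u8>) -> u128 {
    let mut bytes = [0u8; 16];
    // Bytes 0..4: the string length.
    bytes[0..4].copy_from_slice(&(s.len() as u32).to_le_bytes());
    if s.len() <= MAX_INLINE_LEN {
        // Bytes 4..16: the string data itself, zero-padded.
        bytes[4..4 + s.len()].copy_from_slice(s);
    } else {
        let offset = buffer.len() as u32;
        // Bytes 4..8: prefix; 8..12: buffer index (left as 0); 12..16: offset.
        bytes[4..8].copy_from_slice(&s[..4]);
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
        buffer.extend_from_slice(s);
    }
    u128::from_le_bytes(bytes)
}

/// The proposed computation: sum the low 32 bits of every view.
fn total_bytes_len(views: &[u128]) -> usize {
    views.iter().map(|v| (*v as u32) as usize).sum()
}

/// Approximation of the current capacity hint: only counts strings
/// that are too long to be inlined, so short strings contribute nothing.
fn buffer_bytes_used(views: &[u128]) -> usize {
    views
        .iter()
        .map(|v| (*v as u32) as usize)
        .filter(|len| *len > MAX_INLINE_LEN)
        .sum()
}
```

For an array made up entirely of short strings, `buffer_bytes_used` returns 0 while `total_bytes_len` returns the real output size, which is the case where the current capacity hint is least useful.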
Additional context