Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion datafusion/functions/src/string/concat.rs
Original file line number Diff line number Diff line change
Expand Up @@ -204,7 +204,9 @@ impl ScalarUDFImpl for ConcatFunc {
DataType::Utf8View => {
let string_array = as_string_view_array(array)?;

data_size += string_array.len();
// This is an estimate; in particular, it will
// undercount arrays of short strings (<= 12 bytes).
data_size += string_array.total_buffer_bytes_used();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not include the inlined strings (shorter than 12 bytes). An array of only short strings will report 0.
Is this a problem ?!

If YES then let's use something like:

self.views().iter().sum().min(usize::MAX) as usize

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point.

On your suggested fix: summing the entire 128 bit view seems wrong (we don't want to add the buffer indexes and offsets, etc.). I think it would work to just use the low 32 bits of each view, but I'm leery of depending on an Arrow implementation detail like that.

We could iterate over the strings and sum their length, but that seems like overkill for a buffer sizing hint.

How about:

  1. Use total_buffer_bytes_used() for now; this is just a capacity hint, so an estimate is okay
  2. Add a comment for the short-strings case
  3. Open an issue in Arrow to have them add a helper for this use-case, as it seems reasonably useful in general

Copy link
Member

@martin-g martin-g Feb 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion is similar to the impl at https://github.com/apache/arrow-rs/blob/2f40f78e4feae3aee261d9608cede9535e1429e0/arrow-array/src/array/byte_view_array.rs#L685. It just does not filter out the small strings and uses u128 for the sum before the final cast to usize.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code in Arrow casts *v to u32, which takes the low 32 bits. Summing all 128 bit values and then taking the min of that value and usize::MAX does not seem to do the right thing, unless I'm misunderstanding completely.

In any case, I'd prefer to not depend on Arrow implementation details, if we can avoid it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I filed apache/arrow-rs#9435 for this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right!

let column = if array.is_nullable() {
ColumnarValueRef::NullableStringViewArray(string_array)
} else {
Expand Down
2 changes: 2 additions & 0 deletions datafusion/functions/src/string/concat_ws.rs
Original file line number Diff line number Diff line change
Expand Up @@ -247,6 +247,8 @@ impl ScalarUDFImpl for ConcatWsFunc {
DataType::Utf8View => {
let string_array = as_string_view_array(array)?;

// This is an estimate; in particular, it will
// undercount arrays of short strings (<= 12 bytes).
data_size += string_array.total_buffer_bytes_used();
let column = if array.is_nullable() {
ColumnarValueRef::NullableStringViewArray(string_array)
Expand Down
23 changes: 7 additions & 16 deletions datafusion/functions/src/strings.rs
Original file line number Diff line number Diff line change
Expand Up @@ -152,43 +152,34 @@ impl StringViewArrayBuilder {
}
ColumnarValueRef::NullableArray(array) => {
if !CHECK_VALID || array.is_valid(i) {
self.block.push_str(
std::str::from_utf8(array.value(i).as_bytes()).unwrap(),
);
self.block.push_str(array.value(i));
}
}
ColumnarValueRef::NullableLargeStringArray(array) => {
if !CHECK_VALID || array.is_valid(i) {
self.block.push_str(
std::str::from_utf8(array.value(i).as_bytes()).unwrap(),
);
self.block.push_str(array.value(i));
}
}
ColumnarValueRef::NullableStringViewArray(array) => {
if !CHECK_VALID || array.is_valid(i) {
self.block.push_str(
std::str::from_utf8(array.value(i).as_bytes()).unwrap(),
);
self.block.push_str(array.value(i));
}
}
ColumnarValueRef::NonNullableArray(array) => {
self.block
.push_str(std::str::from_utf8(array.value(i).as_bytes()).unwrap());
self.block.push_str(array.value(i));
}
ColumnarValueRef::NonNullableLargeStringArray(array) => {
self.block
.push_str(std::str::from_utf8(array.value(i).as_bytes()).unwrap());
self.block.push_str(array.value(i));
}
ColumnarValueRef::NonNullableStringViewArray(array) => {
self.block
.push_str(std::str::from_utf8(array.value(i).as_bytes()).unwrap());
self.block.push_str(array.value(i));
}
}
}

pub fn append_offset(&mut self) {
self.builder.append_value(&self.block);
self.block = String::new();
self.block.clear();
}

pub fn finish(mut self) -> StringViewArray {
Expand Down