An implementation is in progress.
Implemented in #2306
Background
One feature missing from Velox is a zero-copy vector view. Given an existing vector of any type, a user who needs only a contiguous subrange of it must currently allocate a new vector and copy the underlying data, which is very expensive.
Interface
The following method will be added to BaseVector:
Some invariants:
Performance characteristics:
Implementation
FlatVector
For FlatVector::rawValues() and FlatVector::values(), there are two cases. For bit-typed (boolean) values, we allocate a new buffer, copy the selected bits into the values_ buffer, and return the start of that buffer as the result. For all other types, we create a BufferView in the values_ field, taking offset and size into consideration, and return the start of that buffer, which is already offsetted.
For the nulls buffer, we treat it the same as a boolean-type value buffer.
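The two treatments described above can be sketched outside of Velox. This is a minimal illustration, not the Velox implementation: ValueViewSketch, sliceBits, and bitAt are hypothetical names, std::vector stands in for the real buffer types, and bits are assumed to be stored least-significant-bit first within each byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for a view over non-bit values: it shares ownership
// of the parent storage and exposes a pointer that is already offset.
template <typename T>
class ValueViewSketch {
 public:
  ValueViewSketch(std::shared_ptr<const std::vector<T>> parent,
                  std::size_t offset, std::size_t size)
      : parent_(std::move(parent)),
        data_(checkedStart(*parent_, offset, size)),
        size_(size) {}

  // No copy: the returned address points into the parent's storage.
  const T* rawValues() const { return data_; }
  std::size_t size() const { return size_; }

 private:
  static const T* checkedStart(const std::vector<T>& v, std::size_t offset,
                               std::size_t size) {
    if (offset > v.size() || size > v.size() - offset) {
      throw std::out_of_range("slice out of range");
    }
    return v.data() + offset;
  }

  std::shared_ptr<const std::vector<T>> parent_;  // keeps the parent alive
  const T* data_;
  std::size_t size_;
};

// Reads bit i of a packed bitmap (assumed LSB-first within each byte).
inline bool bitAt(const std::vector<std::uint8_t>& bits, std::size_t i) {
  return (bits[i / 8] >> (i % 8)) & 1;
}

// Bit buffers cannot be offset by pointer arithmetic unless the offset is a
// multiple of 8, so this sketch copies the selected bits into a new buffer
// in which bit 0 is the first bit of the slice.
inline std::vector<std::uint8_t> sliceBits(const std::vector<std::uint8_t>& bits,
                                           std::size_t offset,
                                           std::size_t size) {
  const std::size_t total = bits.size() * 8;
  if (offset > total || size > total - offset) {
    throw std::out_of_range("bit slice out of range");
  }
  std::vector<std::uint8_t> out((size + 7) / 8, 0);
  for (std::size_t i = 0; i < size; ++i) {
    if (bitAt(bits, offset + i)) {
      out[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
    }
  }
  return out;
}
```

With this shape, a caller reads both values and nulls starting at index 0 of the slice and never has to apply the offset itself.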
The semantics of non-bit types are exactly the same as Array::raw_values() in Arrow, which returns an offset-applied address from the underlying buffer. The different treatment of a buffer representing bits is where we diverge from Arrow: in Arrow, Array::null_bitmap() is always unoffsetted, and it is up to the user to left-shift the result by Array::offset() bits. This is error-prone and inconsistent, since Array::raw_values() returns an offsetted address but Array::null_bitmap() returns an unoffsetted one.
Do we want to support slicing OPAQUE vectors? Currently Buffer::as<std::shared_ptr<void>>() throws an exception.
Encodings
The nulls buffer will be sliced the same way as in FlatVector.
For different encodings:
Peeling an offset dictionary produces the selected indices into base data. When rewrapping the result of the dictionary, the resulting DictionaryVector has the same offset as the original if the indices are reused. If the indices are new, then the resulting vector has no offset.
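A rough sketch of the offset behavior described above, under the assumption that a dictionary slice shares both the indices and the base and only records an offset and size. DictSketch, rewrapReusingIndices, and rewrapWithNewIndices are hypothetical names for illustration and do not correspond to Velox classes or functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

using BasePtr = std::shared_ptr<const std::vector<std::int64_t>>;
using IndicesPtr = std::shared_ptr<const std::vector<std::int32_t>>;

// Hypothetical, simplified dictionary encoding:
// logical row i maps to base[indices[offset + i]].
struct DictSketch {
  BasePtr base;
  IndicesPtr indices;
  std::size_t offset;
  std::size_t size;

  std::int64_t valueAt(std::size_t row) const {
    return (*base)[static_cast<std::size_t>((*indices)[offset + row])];
  }

  // Zero-copy slice: base and indices are shared, only offset and size change.
  DictSketch slice(std::size_t sliceOffset, std::size_t length) const {
    if (sliceOffset > size || length > size - sliceOffset) {
      throw std::out_of_range("slice out of range");
    }
    return DictSketch{base, indices, offset + sliceOffset, length};
  }

  // Peeling: the indices into the base selected by this (possibly offset)
  // dictionary.
  std::vector<std::int32_t> peeledIndices() const {
    auto first = indices->begin() + static_cast<std::ptrdiff_t>(offset);
    return std::vector<std::int32_t>(first,
                                     first + static_cast<std::ptrdiff_t>(size));
  }
};

// Rewrapping with the original indices keeps the original offset.
inline DictSketch rewrapReusingIndices(const DictSketch& original,
                                       BasePtr newBase) {
  return DictSketch{std::move(newBase), original.indices, original.offset,
                    original.size};
}

// Rewrapping with freshly built indices starts at offset 0.
inline DictSketch rewrapWithNewIndices(BasePtr newBase, IndicesPtr newIndices) {
  const std::size_t n = newIndices->size();
  return DictSketch{std::move(newBase), std::move(newIndices), 0, n};
}
```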
For LazyVector, since it is not possible to pass rows and a hook to the slice, we decided to disable taking a slice of an unloaded vector. An alternative is to force a full load when slice is called, but that is tricky because we need to make sure that any potential wrappings (dictionary) over the vector are updated during the loading. See DictionaryVector::loadedVector.
Test & Benchmark
Some dedicated aspects should be tested in unit tests:
In addition to the normal unit tests, the fuzzer will be enhanced to generate some slices. The share of sliced vectors does not need to be large, as this change should not affect evaluation heavily.
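As an illustration of what a randomized slice check could look like, here is a small differential test that compares a zero-copy slice against an explicit copy of the same subrange. It is only a sketch of the idea: SliceSketch, makeSlice, and fuzzSlices are hypothetical names, and the real fuzzer operates on Velox vectors rather than std::vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical zero-copy slice: a pointer and a length into parent storage.
struct SliceSketch {
  const std::int32_t* data;
  std::size_t size;
};

inline SliceSketch makeSlice(const std::vector<std::int32_t>& parent,
                             std::size_t offset, std::size_t length) {
  assert(offset <= parent.size() && length <= parent.size() - offset);
  return SliceSketch{parent.data() + offset, length};
}

// Returns true if, for every randomly chosen (offset, length) pair, the
// slice holds exactly the same elements as a copied subrange.
inline bool fuzzSlices(std::uint32_t seed, std::size_t iterations) {
  std::mt19937 rng(seed);
  for (std::size_t it = 0; it < iterations; ++it) {
    const std::size_t n = std::uniform_int_distribution<std::size_t>(0, 64)(rng);
    std::vector<std::int32_t> parent(n);
    for (auto& v : parent) {
      v = static_cast<std::int32_t>(rng());
    }
    const std::size_t offset =
        std::uniform_int_distribution<std::size_t>(0, n)(rng);
    const std::size_t length =
        std::uniform_int_distribution<std::size_t>(0, n - offset)(rng);

    const SliceSketch slice = makeSlice(parent, offset, length);
    const auto first = parent.begin() + static_cast<std::ptrdiff_t>(offset);
    const std::vector<std::int32_t> copy(
        first, first + static_cast<std::ptrdiff_t>(length));

    if (slice.size != copy.size()) {
      return false;
    }
    for (std::size_t i = 0; i < copy.size(); ++i) {
      if (slice.data[i] != copy[i]) {
        return false;
      }
    }
  }
  return true;
}
```

The random ranges deliberately include empty slices and slices that end at the last element, since those boundaries are where offset arithmetic tends to go wrong.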
Benchmarks will be taken for:
Q & A
Will allocating & copying the bits buffer affect performance negatively?
The allocation and copy of a bit buffer is not a concern here since
How is mutability handled for the slice?
We have 2 cases:
Buffer::mutableRawValues() will check whether the buffer is uniquely owned, but for a BufferView the reference count is 1, so it might return rawValues_ directly. We need to add an extra check here to block this path. A similar check also needs to be added to Buffer::mutableNulls() and BaseVector::ensureWritable.
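The extra check can be illustrated with a small model. This is a sketch under simplifying assumptions, not Velox code: BufferSketch, makeOwned, makeView, and the free function mutableRawValues are hypothetical, and the view is modeled as a buffer that holds a reference to its parent.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical buffer model: either owns its storage or is a view into a
// parent buffer.
struct BufferSketch {
  std::vector<std::int32_t> owned;       // storage when the buffer owns data
  std::shared_ptr<BufferSketch> parent;  // non-null when this is a view
  std::size_t offset = 0;
  std::size_t size = 0;

  bool isView() const { return parent != nullptr; }
  const std::int32_t* raw() const {
    return isView() ? parent->raw() + offset : owned.data();
  }
};

inline std::shared_ptr<BufferSketch> makeOwned(std::vector<std::int32_t> values) {
  auto buf = std::make_shared<BufferSketch>();
  buf->size = values.size();
  buf->owned = std::move(values);
  return buf;
}

inline std::shared_ptr<BufferSketch> makeView(std::shared_ptr<BufferSketch> parent,
                                              std::size_t offset,
                                              std::size_t size) {
  assert(offset <= parent->size && size <= parent->size - offset);
  auto buf = std::make_shared<BufferSketch>();
  buf->parent = std::move(parent);
  buf->offset = offset;
  buf->size = size;
  return buf;
}

// A reference count of 1 is not enough: a view is itself uniquely held but
// still aliases its parent. So the buffer is written in place only when it
// is unique AND not a view; otherwise it is replaced by a private copy.
inline std::int32_t* mutableRawValues(std::shared_ptr<BufferSketch>& buf) {
  if (buf.use_count() != 1 || buf->isView()) {
    auto copy = std::make_shared<BufferSketch>();
    copy->owned.assign(buf->raw(), buf->raw() + buf->size);
    copy->size = buf->size;
    buf = std::move(copy);
  }
  return buf->owned.data();
}
```

Writing through the returned pointer then leaves the parent untouched, while a uniquely owned, non-view buffer is still modified in place without a copy.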
How will the offset change the expression evaluation?
It should not change any behavior in expression evaluation, except possibly dictionary rewrapping. With this new design, we no longer leak the concept of an offset to the user, including expression evaluation, so all vectors, whether offsetted or not, should appear exactly the same. This also saves us a lot of work in testing the feature.