-
Notifications
You must be signed in to change notification settings - Fork 1k
[Variant] Support Shredded Objects in variant_get #8166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Thank you @carpecodeum -- I will try and review this tomorrow first thing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for tackling this! Left a bunch of comments to get things started, but probably somebody else should review as well, because I am not a neutral reviewer of this particular code :P
#[allow(unused)] | ||
pub(crate) trait OutputBuilder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we planning to just delete output builder-related code at some point?
(tho I guess that could be a follow-on PR?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right that we could potentially simplify this in future PRs
parquet-variant/src/path.rs
Outdated
/// Create from &str with support for dot notation | ||
impl<'a> From<&'a str> for VariantPath<'a> { | ||
fn from(path: &'a str) -> Self { | ||
VariantPath::new(vec![path.into()]) | ||
// Support dot notation: "a.x" -> ["a", "x"] | ||
if path.contains('.') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is dangerous without some well-defined way to escape values. Otherwise, a column name that contains a dot is perfectly legal but would produce an incorrect path.
I believe Iceberg uses canonical jsonpath as the string path format, but I'm not sure if we want to take a stance in such low-level code?
delta-kernel-rs faced a similar dilemma and ended up defining macro magic that takes string literals in and parses them after verifying there are no special characters. By calling the macro, caller affirms it is a "simple" column path where splitting on dots is correct.
See this derive macro and the companion proc macro that wraps it up nicely.
I don't know if we want to go to such lengths here tho. Probably better to just use the unambiguous long-form, or define a test helper for benefit of unit tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the long-form more, but I just added this just as a proof of concept, to see what opinion everyone has of this, can fallback to the original approach, maybe @alamb can also give his opinion?
parquet-variant/src/path.rs
Outdated
let elements: Vec<VariantPathElement<'a>> = path | ||
.split('.') | ||
.map(|part| part.into()) | ||
.collect(); | ||
VariantPath::new(elements) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'm guessing the type annotation is unnecessary, because VariantPath::new
already constrains it to a vec of path elements?
let elements: Vec<VariantPathElement<'a>> = path | |
.split('.') | |
.map(|part| part.into()) | |
.collect(); | |
VariantPath::new(elements) | |
VariantPath::new(path.split('.').map(Into::into).collect()) |
match &self.shredding_state { | ||
ShreddingState::Unshredded { metadata, value } => { | ||
Variant::new(metadata.value(index), value.value(index)) | ||
ShreddingState::Unshredded { value } => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: This change will conflict with
We may want to be intentional about which PR merges first? (AFAIK, this PR should "just work" even if the other PR merges later... but it's probably more convenient to let the other PR merge first).
/// additional fields), or NULL (`v:a` was an object containing only the single expected field `b`). | ||
/// | ||
/// Finally, `v.typed_value.a.typed_value.b.value` is either NULL (`v:a.b` was an integer) or else a | ||
/// variant value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// variant value. | |
/// variant value (which could be `Variant::Null`). |
|
||
#[allow(unused)] | ||
pub(crate) fn make_shredding_row_builder<'a>( | ||
//metadata: &BinaryViewArray, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll eventually need this parameter, for shredding and unshredding operations that manipulate object fields.
#[allow(unused)] | ||
pub(crate) fn make_shredding_row_builder<'a>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this used by shredded_get_path
which is used by variant_get
?
#[allow(unused)] | ||
struct PrimitiveVariantShreddingRowBuilder<T: ArrowPrimitiveType> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another spurious annotation?
/// Used for actual shredding of binary variant values into shredded variant values | ||
#[allow(unused)] | ||
struct ShreddedVariantRowBuilder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we just out leave this struct for now, and remove the code and types that support it?
/// Like VariantShreddingRowBuilder, but for (partially shredded) structs which need special | ||
/// handling on a per-field basis. | ||
#[allow(unused)] | ||
struct VariantShreddingStructRowBuilder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, probably best added by a follow-on PR that actually uses it?
thanks for the review @scovich , i am still resolving most of the comments |
f8c455d
to
f519bb4
Compare
f519bb4
to
2326b55
Compare
Just as a heads up I will be out this upcoming week, so I will be much slower on reviews / merging Please just let me know if anything needs my attention and I'll get to it as quickyl as I can |
Thats fine @alamb ! whenever you are available, I think you can provide one final review of this, and we can merge this |
Thanks @carpecodeum THere seems to be some CI failures, can you please resolve them? ![]() @scovich what are your thoughts about merging this PR in and iterating? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@scovich what are your thoughts about merging this PR in and iterating?
Sorry for the delays. My second review pass was a bit disorienting:
- force-merge makes it really hard to tell visually what actually changed
- The PR must have originally been stacked on some other PR, and it looks like several of my first-round comments were actually for the other PR, since they're no longer part of the diff (see below)
- the shredding spec "null" questions and sucked up a lot of time as well
Overall, we're close to merge. There are still some unit test issues that should be quick/easy to address.
My biggest concerns are about NULL handling, nested null masks and logical_nulls
, e.g.
... and so naturally, because they're important, they're all hidden/invisible because "outdated". We could probably tackle this as a follow-up item, but it's a super important one to resolve quickly and I don't think we fully know what the correct approach should be?
Also, some specific questions for @alamb are still pending:
Comments that apparently applied to a stacked PR's parent?
let value = if let Some(value_col) = inner.column_by_name("value") { | ||
if let Some(binary_view) = value_col.as_binary_view_opt() { | ||
Some(binary_view.clone()) | ||
} else { | ||
return Err(ArrowError::NotYetImplemented(format!( | ||
"VariantArray 'value' field must be BinaryView, got {}", | ||
value_col.data_type() | ||
))); | ||
} | ||
} else { | ||
None | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tiny nit?
let value = if let Some(value_col) = inner.column_by_name("value") { | |
if let Some(binary_view) = value_col.as_binary_view_opt() { | |
Some(binary_view.clone()) | |
} else { | |
return Err(ArrowError::NotYetImplemented(format!( | |
"VariantArray 'value' field must be BinaryView, got {}", | |
value_col.data_type() | |
))); | |
} | |
} else { | |
None | |
}; | |
let value = match inner.column_by_name("value") { | |
Some(value_col) => match value_col.as_binary_view_opt() { | |
Some(binary_view) => Some(binary_view.clone()), | |
None => return Err(ArrowError::NotYetImplemented(format!( | |
"VariantArray 'value' field must be BinaryView, got {}", | |
value_col.data_type() | |
))), | |
} | |
None => None, | |
}) |
ShreddingState::PartiallyShredded { value, typed_value, .. } => { | ||
// PartiallyShredded case (formerly ImperfectlyShredded) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is incorrect naming. The term "partially shredded" is specifically defined in the variant shredding spec, and that definition does not agree with how the current code uses the name:
An object is partially shredded when the
value
is an object and thetyped_value
is a shredded object.
Partial shredding is just one specific kind of imperfect shredding -- any type can shred imperfectly, producing both value
and typed_value
columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize this PR didn't make the naming decisions, but the wording in the PR is not consistent with these namings. I would personally prefer to fix the naming here to be consistent with how the PR uses it, but we could also try to adjust the PR wording to match these enum variant names.
ShreddingState::Typed { typed_value, .. } => { | ||
// Typed case (formerly PerfectlyShredded) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a fan of Typed
as the name here. It doesn't match any language in the variant shredding spec, and IMO it doesn't really self-describe either. At a minimum we should consider StronglyTyped
? (in contrast to variant normally being weakly typed)?
But I'm curious why "perfectly shredded" and "imperfectly shredded" are unclear or otherwise unwelcome? (see comment below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah! I think I figured out the confusion. The spec says this:
Both
value
andtyped_value
are optional fields used together to encode a single value.
Values in the two fields must be interpreted according to the following table:
value
typed_value
Meaning null null The value is missing; only valid for shredded object fields non-null null The value is present and may be any type, including null null non-null The value is present and is the shredded type non-null non-null The value is present and is a partially shredded object An object is partially shredded when the
value
is an object and thetyped_value
is a shredded object.
Writers must not produce data where bothvalue
andtyped_value
are non-null, unless the Variant value is an object.
... but those are row-wise concepts -- assuming that both columns are physically present -- and this code is dealing with column-wise concepts where one or both columns could be physically missing. If we produced that table, it might look something like this:
The
value
andtyped_value
columns are optional columns that together encode columns of variant values. Either or both columns may be physically missing, which can be interpreted according to the following table:
value
typed_value
Meaning missing missing All values are missing; only valid for shredded object fields present missing All values are unshredded, and can be any type, including null missing present All values are present and the shredded type present present At lease some values are present and can be any type, including null A shredded object field is perfectly shredded when the
typed_value
column is present and thevalue
column is either all-null or missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if typed_value.is_null(index) { | ||
Variant::new(metadata.value(index), value.value(index)) | ||
Variant::new(self.metadata.value(index), value.value(index)) | ||
} else { | ||
typed_value_to_variant(typed_value, index) | ||
} | ||
} | ||
ShreddingState::AllNull { .. } => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another naming nit: This should probably be called ShreddingState::Missing
, to match terminology of the shredding spec?
/// a: SHREDDED_VARIANT_FIELD { | ||
/// value: BINARY, | ||
/// typed_value: STRUCT { | ||
/// a: SHREDDED_VARIANT_FIELD { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// a: SHREDDED_VARIANT_FIELD { | |
/// b: SHREDDED_VARIANT_FIELD { |
Ok(Box::new(VariantPathRowBuilder { | ||
builder: $inner_builder, | ||
path, | ||
}) as Box<dyn VariantShreddingRowBuilder + 'a>) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the cast actually needed? The function's return value should constrain it already?
fn nulls(&self) -> Option<&NullBuffer> { | ||
// According to the shredding spec, ShreddedVariantFieldArray should be | ||
// physically non-nullable - SQL NULL is inferred by both value and | ||
// typed_value being physically NULL | ||
None | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rescuing from github oblivion:
We may need to override the logical_nulls method.
If nulls
is hard-wired to return None
then we definitely need to provide logical_nulls
?
Otherwise people have to manually dig into the guts of the array?
let (metadata, y_field_value) = { | ||
let mut builder = parquet_variant::VariantBuilder::new(); | ||
let mut obj = builder.new_object(); | ||
obj.insert("x", Variant::Int32(42)); | ||
obj.insert("y", Variant::from("foo")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's illegal for a partially shredded object to mention the same field in both the value
and typed_value
columns. To add "x" to the dictionary, manually just use the VariantBuilder::with_field_names
method?
))), | ||
} | ||
} | ||
pub(crate) mod row_builder; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't cargo fmt have fixed this?
@@ -95,10 +95,10 @@ impl<'a> From<Vec<VariantPathElement<'a>>> for VariantPath<'a> { | |||
} | |||
} | |||
|
|||
/// Create from &str | |||
/// Create from &str with support for dot notation | |||
impl<'a> From<&'a str> for VariantPath<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rescuing #8166 (comment) from github oblivion:
This is dangerous without some well-defined way to escape values. Otherwise, a column name that contains a dot is perfectly legal but would produce an incorrect path.
I agree with the long-form more, but I just added this just as a proof of concept, to see what opinion everyone has of this, can fallback to the original approach, maybe @alamb can also give his opinion?
(impl From makes it part of the public API, and I'm nervious about giving people a massive footgun)
if let Some(typed_value) = typed_value.clone() { | ||
builder = builder.with_field("typed_value", typed_value); | ||
} | ||
if let Some(nulls) = nulls { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value
and typed_value
both do copy here; do we need to align null
with them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those other two are present in both inner
and shredding_state
, hence theclone
calls. The nulls
are only used once, so no clone
needed.
} | ||
|
||
#[allow(unused)] | ||
pub(crate) fn from_parts( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to add some doc for this?
// If the requested path element is not present in `typed_value`, and `value` is missing, then | ||
// we know it does not exist; it, and all paths under it, are all-NULL. | ||
let missing_path_step = || { | ||
let Some(_value_field) = shredding_state.value_field() else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We just want to return different results for different value_field
, does match
make it cleaner here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean this?
let missing_path_step = || match shredding_state.value_field().is_some() {
true => ShreddedPathStep::NotShredded,
false => ShreddedPathStep::Missing,
};
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I thought the following before, but these two are the same
match shredding_state.value_field() {
Some(_) => ShreddedPathStep::Missing,
None => ShreddedPathStep::NotShredded,
};
|
||
/// A thin wrapper whose only job is to extract a specific path from a variant value and pass the | ||
/// result to a nested builder. | ||
struct VariantPathRowBuilder<'a, T: VariantShreddingRowBuilder> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The difference between VariantPathRowBuilder
and PrimitiveVariantShreddingRowBuilder
is that VariantPathRowBuilder
contains a path that is used to retrieve the value in append_value
. They can be merged if we move the value extraction logic to the caller. Will this become cleaner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand? The caller already followed shredded path steps as far as possible before creating any builder. This path builder is used to extract the remaining path steps, on a row-by-row basis, from the values of a binary variant column the caller encountered before the path was exhausted. Not sure how that could be moved to the caller?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for not describing it clearly, the caller
I described before is the caller of the builder(the logic in src/variant_get/mod.rs#shredded_get_path
), I commented this because the VariantPathRowBuilder
contains a more path
than the PrimitiveVariantShreddingRowBuilder
, but the remaining seems the same, the path
in VariantPathRowBuilder
will only used in append_value
, this can be handled in the caller --shredded_get_path
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah. Let's keep an eye out for that -- I originally thought there would be multiple call sites (once we support the other scenarios besides just primitives) and that it would be helpful to factor out the pathing. But if it turns out that there really is just one callsite, then I agree we could eliminate the extra layer of abstraction.
// First, try to downcast to StructArray | ||
let Some(struct_array) = typed_value.as_any().downcast_ref::<StructArray>() else { | ||
// Downcast failure - if strict cast options are enabled, this should be an error | ||
if !cast_options.safe { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If typed_value
can't be cast to StructArray
, and cast_options.safe
is false
, then we will return an Err
. Will there be some chance the path located in the value
, if it is yes, return an Err
here is expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. This scenario could happen if e.g. we asked for v:a.b.c::INT
and v.typed_value.a.typed_value
is shredded as something other than struct. That's not an error at all, there's no rule that the path we asked for has to match the shredding of the underlying data. It just means we need to fetch the value from v.typed_value.a.value
instead.
|
||
/// Simple test to check if nested paths are supported by current implementation | ||
#[test] | ||
fn test_simple_nested_path_support() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to assert the expected behavior in this test function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes!
|
||
#[test] | ||
fn test_null_buffer_union_for_shredded_paths() { | ||
use arrow::compute::CastOptions; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to add another row where path a.x
does not exist
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this contribution @carpecodeum @mprammer, @scovich and @klion26
I went through the various comments from @klion26 and @scovich and I think they can all be done in follow on PRs.
Given this PR's size, number of comments, and relative age, I would like to propose we try to accelerate progress by unblock this PR and parallelize the work by:
- Merge this PR in
- Work on comments as follow on PRs
I am happy to file follow on tickets to track the additional work
I have begun making some PRs to help get this PR ready for merge
- Fix Clippy: cmu-db#4
Finally, I wanted to apologize for my absence -- I have been away but am now back and hopefully we'll accelerate things a bit
use std::sync::Arc; | ||
|
||
pub(crate) fn make_shredding_row_builder<'a>( | ||
//metadata: &BinaryViewArray, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could be removed
//metadata: &BinaryViewArray, |
|
||
/// Simple test to check if nested paths are supported by current implementation | ||
#[test] | ||
fn test_simple_nested_path_support() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes!
@@ -388,43 +525,720 @@ mod test { | |||
VariantArray::try_new(Arc::new(struct_array)).expect("should create variant array"), | |||
) | |||
} | |||
/// This test manually constructs a shredded variant array representing objects | |||
/// like {"x": 1, "y": "foo"} and {"x": 42} and tests extracting the "x" field |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great set of ideas, but perhaps one we can do as a follow on PR
@carpecodeum , @scovich points out that this PR now also has merge conflicts, likely due to merging this
Would you like help doing this? I can make another PR (targeting your branch to do so). |
Hi @alamb @scovich My apologies, I had been quite busy since the past week, I would like to help with the merge conflict resolution, I realise with so many changes going in right now, stuff has become hard to track, including with this PR as well |
I had some boring meetings today where I could guide a surprisingly helpful AI assistant through the merge process and subsequent fixups. PTAL? (I tried making cmu-db#5 against your branch, but I must have done it completely wrong... either that, or the merge commit threw things off; maybe somebody smarter than me can figure that one out?) |
Attn @alamb @carpecodeum @liamzwbao :
I noticed that I spent some time today, trying to refactor it to be row-oriented (while keeping a similar code structure), but late in the process I realized that several array types break it because they need to do some holistic column-level transformation, which cannot be captured efficiently by a row-wise approach:
I'm pretty sure the actual solution will be to merge the new variant row builder infrastructure in this PR, and then rework In other words, this PR is probably the most important (= biggest bottleneck) variant PR currently open. NOTE: Even tho Thoughts? |
Yes, this sounds like a very plausible approach
This is my feeling too. @carpecodeum if you don't have time to work on this PR in the next day or two, perhaps @scovich or @liamzwbao could open a new PR (starting with the code from this PR) that we can finish up?
👍 |
@alamb My apologies, I have been very occupied with some other things lately, I think it would be best to let @scovich & @liamzwbao to make a new PR with the changes from this PR to speed things up, I would be happy to be an active reviewer or take up any follow up tickets from that PR as things calm down next week. |
Sounds like a plan to me |
UPDATE: I started doing some pathfinding on a row-oriented |
|
Which issue does this PR close?
variant_get
: typed path access (STEP 1) #8150Rationale for this change
Support variant_get for any input (shredded or otherwise), any depth of object field path steps, and casting to one primitive data type, eg. Some(DataType::Int32). This is build on top of changes in #8122
What changes are included in this PR?
Add support for extracting fields from both shredded and non-shredded variant arrays at any depth (like "x", "a.x", "a.b.x") and casting them to Int32 with proper NULL handling for type mismatches.
Are these changes tested?
Yes, tests are added for non-shredded vs shredded inputs at depths 0, 1, and 2
Are there any user-facing changes?
not yet
Thanks to @mprammer 🥷