
Conversation

carpecodeum
Contributor

Which issue does this PR close?

Rationale for this change

Support variant_get for any input (shredded or otherwise), any depth of object field path steps, and casting to a single primitive data type, e.g. Some(DataType::Int32). This is built on top of the changes in #8122.

What changes are included in this PR?

Add support for extracting fields from both shredded and non-shredded variant arrays at any depth (like "x", "a.x", "a.b.x") and casting them to Int32 with proper NULL handling for type mismatches.
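To make the intended semantics concrete, here is a small self-contained sketch; the Toy enum and helper functions below are hypothetical illustrations, not the crate's VariantArray / variant_get API. It shows the contract described above: follow object-field path steps as far as they exist, then cast the leaf to Int32, producing NULL on a type mismatch.

// Toy illustration only -- `Toy`, `get_path`, and `as_int32` are hypothetical and
// stand in for the real VariantArray / variant_get machinery.
enum Toy {
    Int(i32),
    Str(String),
    Obj(Vec<(String, Toy)>),
}

// Follow object-field path steps ("a", "x", ...) as far as they exist.
fn get_path<'a>(v: &'a Toy, path: &[&str]) -> Option<&'a Toy> {
    path.iter().try_fold(v, |cur, step| match cur {
        Toy::Obj(fields) => fields.iter().find(|(k, _)| k.as_str() == *step).map(|(_, v)| v),
        _ => None,
    })
}

// Cast the leaf to Int32; a type mismatch becomes None (i.e. NULL in the output).
fn as_int32(v: &Toy) -> Option<i32> {
    match v {
        Toy::Int(i) => Some(*i),
        _ => None,
    }
}

fn main() {
    let row = Toy::Obj(vec![
        ("a".into(), Toy::Obj(vec![("x".into(), Toy::Int(42))])),
        ("y".into(), Toy::Str("foo".into())),
    ]);
    assert_eq!(get_path(&row, &["a", "x"]).and_then(as_int32), Some(42));
    assert_eq!(get_path(&row, &["y"]).and_then(as_int32), None); // mismatch -> NULL
}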

Are these changes tested?

Yes, tests are added for non-shredded vs shredded inputs at depths 0, 1, and 2

Are there any user-facing changes?

not yet

Thanks to @mprammer 🥷

@carpecodeum carpecodeum changed the title Support Shredded Objects in variant_get [Variant] Support Shredded Objects in variant_get Aug 18, 2025
@github-actions github-actions bot added the parquet-variant parquet-variant* crates label Aug 18, 2025
@carpecodeum
Contributor Author

cc - @alamb @scovich

@alamb
Contributor

alamb commented Aug 19, 2025

Thank you @carpecodeum -- I will try and review this tomorrow first thing

Contributor

@scovich scovich left a comment

Thanks for tackling this! Left a bunch of comments to get things started, but probably somebody else should review as well, because I am not a neutral reviewer of this particular code :P

Comment on lines 37 to 38
#[allow(unused)]
pub(crate) trait OutputBuilder {
Contributor

Are we planning to just delete output builder-related code at some point?
(tho I guess that could be a follow-on PR?)

Contributor Author

You're right that we could potentially simplify this in future PRs

Comment on lines 98 to 102
/// Create from &str with support for dot notation
impl<'a> From<&'a str> for VariantPath<'a> {
fn from(path: &'a str) -> Self {
VariantPath::new(vec![path.into()])
// Support dot notation: "a.x" -> ["a", "x"]
if path.contains('.') {
Contributor

This is dangerous without some well-defined way to escape values. Otherwise, a column name that contains a dot is perfectly legal but would produce an incorrect path.

I believe Iceberg uses canonical jsonpath as the string path format, but I'm not sure if we want to take a stance in such low-level code?

delta-kernel-rs faced a similar dilemma and ended up defining macro magic that takes string literals in and parses them after verifying there are no special characters. By calling the macro, caller affirms it is a "simple" column path where splitting on dots is correct.

See this derive macro and the companion proc macro that wraps it up nicely.

I don't know if we want to go to such lengths here tho. Probably better to just use the unambiguous long-form, or define a test helper for benefit of unit tests?
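For concreteness, a minimal self-contained sketch of the ambiguity (hypothetical example, not the crate's code): a single field literally named "a.x" is indistinguishable from the two-step path ["a", "x"] once you split on dots.

// Hypothetical illustration: naive dot-splitting silently turns one legal field
// name containing '.' into a two-step path.
fn main() {
    let field_named_a_dot_x = "a.x";
    let split: Vec<&str> = field_named_a_dot_x.split('.').collect();
    assert_eq!(split, vec!["a", "x"]); // now looks like the path a -> x
    // An explicit long-form constructor (building from a Vec of path elements)
    // avoids the guess entirely.
}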

Contributor Author

I agree more with the long-form, but I added this just as a proof of concept to see what opinion everyone has; we can fall back to the original approach. Maybe @alamb can also give his opinion?

Comment on lines 103 to 107
let elements: Vec<VariantPathElement<'a>> = path
    .split('.')
    .map(|part| part.into())
    .collect();
VariantPath::new(elements)
Contributor

nit: I'm guessing the type annotation is unnecessary, because VariantPath::new already constrains it to a vec of path elements?

Suggested change
let elements: Vec<VariantPathElement<'a>> = path
    .split('.')
    .map(|part| part.into())
    .collect();
VariantPath::new(elements)
VariantPath::new(path.split('.').map(Into::into).collect())

Comment on lines 182 to 183
match &self.shredding_state {
ShreddingState::Unshredded { metadata, value } => {
Variant::new(metadata.value(index), value.value(index))
ShreddingState::Unshredded { value } => {
Contributor

NOTE: This change will conflict with

We may want to be intentional about which PR merges first? (AFAIK, this PR should "just work" even if the other PR merges later... but it's probably more convenient to let the other PR merge first).

/// additional fields), or NULL (`v:a` was an object containing only the single expected field `b`).
///
/// Finally, `v.typed_value.a.typed_value.b.value` is either NULL (`v:a.b` was an integer) or else a
/// variant value.
Contributor

Suggested change
/// variant value.
/// variant value (which could be `Variant::Null`).


#[allow(unused)]
pub(crate) fn make_shredding_row_builder<'a>(
//metadata: &BinaryViewArray,
Contributor

We'll eventually need this parameter, for shredding and unshredding operations that manipulate object fields.

Comment on lines 28 to 29
#[allow(unused)]
pub(crate) fn make_shredding_row_builder<'a>(
Contributor

Isn't this used by shredded_get_path which is used by variant_get?

Comment on lines 150 to 151
#[allow(unused)]
struct PrimitiveVariantShreddingRowBuilder<T: ArrowPrimitiveType> {
Contributor

another spurious annotation?

Comment on lines 302 to 304
/// Used for actual shredding of binary variant values into shredded variant values
#[allow(unused)]
struct ShreddedVariantRowBuilder {
Contributor

Should we just leave this struct out for now, and remove the code and types that support it?

Comment on lines 346 to 349
/// Like VariantShreddingRowBuilder, but for (partially shredded) structs which need special
/// handling on a per-field basis.
#[allow(unused)]
struct VariantShreddingStructRowBuilder {
Contributor

As above, probably best added by a follow-on PR that actually uses it?

@carpecodeum
Contributor Author

Thanks for the review @scovich, I am still resolving most of the comments.

@carpecodeum carpecodeum force-pushed the shredding-variant-part1 branch from f8c455d to f519bb4 Compare August 22, 2025 18:50
@carpecodeum carpecodeum force-pushed the shredding-variant-part1 branch from f519bb4 to 2326b55 Compare August 22, 2025 19:01
@alamb
Contributor

alamb commented Aug 23, 2025

Just as a heads up I will be out this upcoming week, so I will be much slower on reviews / merging

Please just let me know if anything needs my attention and I'll get to it as quickly as I can

@carpecodeum
Contributor Author

Just as a heads up I will be out this upcoming week, so I will be much slower on reviews / merging

Please just let me know if anything needs my attention and I'll get to it as quickly as I can

That's fine @alamb! Whenever you are available, I think you can provide one final review of this, and we can merge it.

@alamb
Contributor

alamb commented Aug 26, 2025

Thanks @carpecodeum

There seem to be some CI failures, can you please resolve them?

(Screenshot of the CI failures, 2025-08-26)

@scovich what are your thoughts about merging this PR in and iterating?

Contributor

@scovich scovich left a comment

@scovich what are your thoughts about merging this PR in and iterating?

Sorry for the delays. My second review pass was a bit disorienting:

  • the force-push makes it really hard to tell visually what actually changed
  • The PR must have originally been stacked on some other PR, and it looks like several of my first-round comments were actually for the other PR, since they're no longer part of the diff (see below)
  • the shredding spec "null" questions sucked up a lot of time as well

Overall, we're close to merge. There are still some unit test issues that should be quick/easy to address.

My biggest concerns are about NULL handling, nested null masks and logical_nulls, e.g.

... and so naturally, because they're important, they're all hidden/invisible because "outdated". We could probably tackle this as a follow-up item, but it's a super important one to resolve quickly and I don't think we fully know what the correct approach should be?

Also, some specific questions for @alamb are still pending:

Comments that apparently applied to a stacked PR's parent?

Comment on lines +109 to +120
let value = if let Some(value_col) = inner.column_by_name("value") {
    if let Some(binary_view) = value_col.as_binary_view_opt() {
        Some(binary_view.clone())
    } else {
        return Err(ArrowError::NotYetImplemented(format!(
            "VariantArray 'value' field must be BinaryView, got {}",
            value_col.data_type()
        )));
    }
} else {
    None
};
Contributor

tiny nit?

Suggested change
let value = if let Some(value_col) = inner.column_by_name("value") {
    if let Some(binary_view) = value_col.as_binary_view_opt() {
        Some(binary_view.clone())
    } else {
        return Err(ArrowError::NotYetImplemented(format!(
            "VariantArray 'value' field must be BinaryView, got {}",
            value_col.data_type()
        )));
    }
} else {
    None
};
let value = match inner.column_by_name("value") {
    Some(value_col) => match value_col.as_binary_view_opt() {
        Some(binary_view) => Some(binary_view.clone()),
        None => {
            return Err(ArrowError::NotYetImplemented(format!(
                "VariantArray 'value' field must be BinaryView, got {}",
                value_col.data_type()
            )))
        }
    },
    None => None,
};

Comment on lines +210 to +211
ShreddingState::PartiallyShredded { value, typed_value, .. } => {
// PartiallyShredded case (formerly ImperfectlyShredded)
Contributor

This is incorrect naming. The term "partially shredded" is specifically defined in the variant shredding spec, and that definition does not agree with how the current code uses the name:

An object is partially shredded when the value is an object and the typed_value is a shredded object.

Partial shredding is just one specific kind of imperfect shredding -- any type can shred imperfectly, producing both value and typed_value columns.

Contributor

I realize this PR didn't make the naming decisions, but the wording in the PR is not consistent with these namings. I would personally prefer to fix the naming here to be consistent with how the PR uses it, but we could also try to adjust the PR wording to match these enum variant names.

Comment on lines 202 to +203
ShreddingState::Typed { typed_value, .. } => {
// Typed case (formerly PerfectlyShredded)
Contributor

Not a fan of Typed as the name here. It doesn't match any language in the variant shredding spec, and IMO it doesn't really self-describe either. At a minimum we should consider StronglyTyped? (in contrast to variant normally being weakly typed)?

But I'm curious why "perfectly shredded" and "imperfectly shredded" are unclear or otherwise unwelcome? (see comment below)

Contributor

Ah! I think I figured out the confusion. The spec says this:

Both value and typed_value are optional fields used together to encode a single value.
Values in the two fields must be interpreted according to the following table:

value     typed_value  Meaning
null      null         The value is missing; only valid for shredded object fields
non-null  null         The value is present and may be any type, including null
null      non-null     The value is present and is the shredded type
non-null  non-null     The value is present and is a partially shredded object

An object is partially shredded when the value is an object and the typed_value is a shredded object.
Writers must not produce data where both value and typed_value are non-null, unless the Variant value is an object.

... but those are row-wise concepts -- assuming that both columns are physically present -- and this code is dealing with column-wise concepts where one or both columns could be physically missing. If we produced that table, it might look something like this:

The value and typed_value columns are optional columns that together encode columns of variant values. Either or both columns may be physically missing, which can be interpreted according to the following table:

value    typed_value  Meaning
missing  missing      All values are missing; only valid for shredded object fields
present  missing      All values are unshredded, and can be any type, including null
missing  present      All values are present and are the shredded type
present  present      At least some values are present and can be any type, including null

A shredded object field is perfectly shredded when the typed_value column is present and the value column is either all-null or missing.
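To illustrate the four column-level cases in the table above, here is one hedged way they could be modeled; the ColumnShredding enum and its variant names below are hypothetical and do not match the crate's actual ShreddingState.

// Hypothetical sketch of the four column-level cases from the table above; the
// crate's real ShreddingState enum uses different variant names.
#[allow(dead_code)]
enum ColumnShredding<V, T> {
    Missing,                                          // neither column present
    Unshredded { value: V },                          // only `value` present
    PerfectlyShredded { typed_value: T },             // only `typed_value` present
    ImperfectlyShredded { value: V, typed_value: T }, // both columns present
}

fn main() {
    // e.g. a perfectly shredded Int32 field carries only the typed column
    let _state: ColumnShredding<Vec<u8>, Vec<i32>> =
        ColumnShredding::PerfectlyShredded { typed_value: vec![1, 2, 3] };
}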

Contributor

if typed_value.is_null(index) {
Variant::new(metadata.value(index), value.value(index))
Variant::new(self.metadata.value(index), value.value(index))
} else {
typed_value_to_variant(typed_value, index)
}
}
ShreddingState::AllNull { .. } => {
Contributor

Another naming nit: This should probably be called ShreddingState::Missing, to match terminology of the shredding spec?
(Screenshot of the relevant table from the variant shredding spec)

/// a: SHREDDED_VARIANT_FIELD {
/// value: BINARY,
/// typed_value: STRUCT {
/// a: SHREDDED_VARIANT_FIELD {
Contributor

Suggested change
/// a: SHREDDED_VARIANT_FIELD {
/// b: SHREDDED_VARIANT_FIELD {

Ok(Box::new(VariantPathRowBuilder {
    builder: $inner_builder,
    path,
}) as Box<dyn VariantShreddingRowBuilder + 'a>)
Contributor

Is the cast actually needed? The function's return value should constrain it already?

Comment on lines +389 to +394
fn nulls(&self) -> Option<&NullBuffer> {
    // According to the shredding spec, ShreddedVariantFieldArray should be
    // physically non-nullable - SQL NULL is inferred by both value and
    // typed_value being physically NULL
    None
}
Contributor

Rescuing from github oblivion:

We may need to override the logical_nulls method.

If nulls is hard-wired to return None then we definitely need to provide logical_nulls?
Otherwise people have to manually dig into the guts of the array?
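To make that concrete, a small self-contained sketch (plain bool vectors standing in for Arrow null buffers, and a hypothetical helper name) of how a logical_nulls-style override could derive SQL NULLs when both value and typed_value are physically null:

// Hypothetical sketch: a row is logically non-null if either the value column or
// the typed_value column is physically valid at that row.
fn logical_validity(value_valid: &[bool], typed_value_valid: &[bool]) -> Vec<bool> {
    value_valid
        .iter()
        .zip(typed_value_valid)
        .map(|(v, t)| *v || *t)
        .collect()
}

fn main() {
    let value_valid = [true, false, false];
    let typed_valid = [false, true, false];
    // Only the last row (both sides physically null) is a SQL NULL.
    assert_eq!(logical_validity(&value_valid, &typed_valid), vec![true, true, false]);
}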

Comment on lines +720 to +724
let (metadata, y_field_value) = {
    let mut builder = parquet_variant::VariantBuilder::new();
    let mut obj = builder.new_object();
    obj.insert("x", Variant::Int32(42));
    obj.insert("y", Variant::from("foo"));
Contributor

It's illegal for a partially shredded object to mention the same field in both the value and typed_value columns. To add "x" to the dictionary manually, just use the VariantBuilder::with_field_names method?

))),
}
}
pub(crate) mod row_builder;
Contributor

Shouldn't cargo fmt have fixed this?

@@ -95,10 +95,10 @@ impl<'a> From<Vec<VariantPathElement<'a>>> for VariantPath<'a> {
}
}

/// Create from &str
/// Create from &str with support for dot notation
impl<'a> From<&'a str> for VariantPath<'a> {
Contributor

Rescuing #8166 (comment) from github oblivion:

This is dangerous without some well-defined way to escape values. Otherwise, a column name that contains a dot is perfectly legal but would produce an incorrect path.

I agree more with the long-form, but I added this just as a proof of concept to see what opinion everyone has; we can fall back to the original approach. Maybe @alamb can also give his opinion?

(impl From makes it part of the public API, and I'm nervous about giving people a massive footgun)

if let Some(typed_value) = typed_value.clone() {
    builder = builder.with_field("typed_value", typed_value);
}
if let Some(nulls) = nulls {
Member

value and typed_value both get cloned here; do we need to do the same for nulls?

Contributor

Those other two are present in both inner and shredding_state, hence the clone calls. The nulls are only used once, so no clone is needed.

}

#[allow(unused)]
pub(crate) fn from_parts(
Member

Do we need to add some doc for this?

// If the requested path element is not present in `typed_value`, and `value` is missing, then
// we know it does not exist; it, and all paths under it, are all-NULL.
let missing_path_step = || {
let Some(_value_field) = shredding_state.value_field() else {
Member

We just want to return different results depending on value_field; would a match make it cleaner here?

Contributor

Do you mean this?

let missing_path_step = || match shredding_state.value_field().is_some() {
    true => ShreddedPathStep::NotShredded,
    false => ShreddedPathStep::Missing,
};

Member

@klion26 klion26 Sep 2, 2025

Yes, I thought the following before, but these two are the same

match shredding_state.value_field() {
    Some(_) => ShreddedPathStep::NotShredded,
    None => ShreddedPathStep::Missing,
};


/// A thin wrapper whose only job is to extract a specific path from a variant value and pass the
/// result to a nested builder.
struct VariantPathRowBuilder<'a, T: VariantShreddingRowBuilder> {
Member

The difference between VariantPathRowBuilder and PrimitiveVariantShreddingRowBuilder is that VariantPathRowBuilder contains a path that is used to retrieve the value in append_value. They can be merged if we move the value extraction logic to the caller. Will this become cleaner?

Contributor

Not sure I understand? The caller already followed shredded path steps as far as possible before creating any builder. This path builder is used to extract the remaining path steps, on a row-by-row basis, from the values of a binary variant column the caller encountered before the path was exhausted. Not sure how that could be moved to the caller?

Member

Sorry for not describing it clearly. The caller I described before is the caller of the builder (the logic in src/variant_get/mod.rs#shredded_get_path). I commented because VariantPathRowBuilder contains an extra path compared to PrimitiveVariantShreddingRowBuilder, but the rest seems the same; the path in VariantPathRowBuilder is only used in append_value, so it could be handled by the caller, shredded_get_path.

Contributor

Ah. Let's keep an eye out for that -- I originally thought there would be multiple call sites (once we support the other scenarios besides just primitives) and that it would be helpful to factor out the pathing. But if it turns out that there really is just one callsite, then I agree we could eliminate the extra layer of abstraction.

// First, try to downcast to StructArray
let Some(struct_array) = typed_value.as_any().downcast_ref::<StructArray>() else {
    // Downcast failure - if strict cast options are enabled, this should be an error
    if !cast_options.safe {
Member

If typed_value can't be downcast to StructArray and cast_options.safe is false, then we return an Err. Is there some chance the path is actually located in value? If so, is returning an Err here the expected behavior?

Contributor

Good catch. This scenario could happen if e.g. we asked for v:a.b.c::INT and v.typed_value.a.typed_value is shredded as something other than struct. That's not an error at all, there's no rule that the path we asked for has to match the shredding of the underlying data. It just means we need to fetch the value from v.typed_value.a.value instead.


/// Simple test to check if nested paths are supported by current implementation
#[test]
fn test_simple_nested_path_support() {
Member

Do we need to assert the expected behavior in this test function?

Contributor

yes!


#[test]
fn test_null_buffer_union_for_shredded_paths() {
use arrow::compute::CastOptions;
Member

Do we need to add another row where the path a.x does not exist?

@carpecodeum
Contributor Author

Thank you for your reviews @klion26 and @scovich, I will work on the suggestions.

Contributor

@alamb alamb left a comment

Thank you for this contribution @carpecodeum @mprammer, @scovich and @klion26

I went through the various comments from @klion26 and @scovich and I think they can all be done in follow on PRs.

Given this PR's size, number of comments, and relative age, I would like to propose we try to accelerate progress by unblocking this PR and parallelizing the work:

  1. Merge this PR in
  2. Work on comments as follow on PRs

I am happy to file follow on tickets to track the additional work

I have begun making some PRs to help get this PR ready for merge

Finally, I wanted to apologize for my absence -- I have been away but am now back and hopefully we'll accelerate things a bit

use std::sync::Arc;

pub(crate) fn make_shredding_row_builder<'a>(
//metadata: &BinaryViewArray,
Contributor

could be removed

Suggested change
//metadata: &BinaryViewArray,


/// Simple test to check if nested paths are supported by current implementation
#[test]
fn test_simple_nested_path_support() {
Contributor

yes!

@@ -388,43 +525,720 @@ mod test {
VariantArray::try_new(Arc::new(struct_array)).expect("should create variant array"),
)
}
/// This test manually constructs a shredded variant array representing objects
/// like {"x": 1, "y": "foo"} and {"x": 42} and tests extracting the "x" field
Contributor

This is a great set of ideas, but perhaps one we can do as a follow on PR

@alamb
Contributor

alamb commented Sep 4, 2025

@carpecodeum , @scovich points out that this PR now also has merge conflicts, likely due to merging this

the rebase will have to figure out how to move the newly merged output builder functionality to an appropriate row builder equivalent.

Would you like help doing this? I can make another PR (targeting your branch to do so).

@carpecodeum
Contributor Author

@carpecodeum , @scovich points out that this PR now also has merge conflicts, likely due to merging this

the rebase will have to figure out how to move the newly merged output builder functionality to an appropriate row builder equivalent.

Would you like help doing this? I can make another PR (targeting your branch to do so).

Hi @alamb @scovich, my apologies, I have been quite busy over the past week. I would like to help with the merge conflict resolution. I realise that with so many changes going in right now, things have become hard to track, including in this PR.

@scovich
Contributor

scovich commented Sep 4, 2025

@carpecodeum , @scovich points out that this PR now also has merge conflicts, likely due to merging this

the rebase will have to figure out how to move the newly merged output builder functionality to an appropriate row builder equivalent.

Would you like help doing this? I can make another PR (targeting your branch to do so).

Hi @alamb @scovich, my apologies, I have been quite busy over the past week. I would like to help with the merge conflict resolution. I realise that with so many changes going in right now, things have become hard to track, including in this PR.

I had some boring meetings today where I could guide a surprisingly helpful AI assistant through the merge process and subsequent fixups. PTAL?

(I tried making cmu-db#5 against your branch, but I must have done it completely wrong... either that, or the merge commit threw things off; maybe somebody smarter than me can figure that one out?)

@scovich
Contributor

scovich commented Sep 6, 2025

Attn @alamb @carpecodeum @liamzwbao :

I noticed that cast_to_variant follows the same problematic column-oriented approach as the output builder this PR replaces with a row builder. In particular, it converts entire columns to variant, which means repeatedly re-encoding deeply nested variant values, transferring their bytes to new arrays, and recreating incomplete metadata columns that immediately get thrown away.

I spent some time today trying to refactor it to be row-oriented (while keeping a similar code structure), but late in the process I realized that several array types break it because they need to do some holistic column-level transformation, which cannot be captured efficiently by a row-wise approach:

  • maps - cast the key column to string
  • dictionaries - convert the dictionary value column and then clone values from it as needed
  • run-end encoding - similar to dictionary, but also need to convert the incoming logical index to physical run index, which is a linear search in row-oriented code (= quadratic cost!)

I'm pretty sure the actual solution will be to merge the new variant row builder infrastructure in this PR, and then rework cast_to_variant to use it. That way, the row builder's constructor can do any column-level transformations that might be needed, before row-oriented visiting begins.

In other words, this PR is probably the most important (= biggest bottleneck) variant PR currently open.

NOTE: Even tho variant_get and cast_to_variant would both use the new row builder infra, they are not doing the same thing. The former (when unshredding) needs to use a read-only metadata builder because the metadata column already exists; the latter would use a normal metadata builder. So cast_to_variant may actually be lower hanging fruit for converting "stuff" to binary variant, with variant_get support for the same landing later when we sort out the read-only metadata thing.

Thoughts?
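As an illustration of the row-builder shape being discussed, here is a hedged sketch; the trait and names are hypothetical, not the crate's API. The point is that any column-level preparation happens once when the builder is constructed, after which rows are appended one at a time.

// Hypothetical sketch only; the real variant row builders have richer signatures.
trait RowBuilder {
    fn append_row(&mut self, row: usize);
    fn finish(self) -> Vec<Option<i32>>;
}

struct Int32RowBuilder<'a> {
    // Column-level preparation (decoding a dictionary, resolving run ends,
    // casting map keys) would happen once, before any rows are visited.
    source: &'a [Option<i32>],
    out: Vec<Option<i32>>,
}

impl<'a> RowBuilder for Int32RowBuilder<'a> {
    fn append_row(&mut self, row: usize) {
        self.out.push(self.source.get(row).copied().flatten());
    }
    fn finish(self) -> Vec<Option<i32>> {
        self.out
    }
}

fn main() {
    let col = [Some(1), None, Some(3)];
    let mut b = Int32RowBuilder { source: &col, out: Vec::new() };
    for row in 0..col.len() {
        b.append_row(row);
    }
    assert_eq!(b.finish(), vec![Some(1), None, Some(3)]);
}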

@alamb
Contributor

alamb commented Sep 6, 2025

I'm pretty sure the actual solution will be to merge the new variant row builder infrastructure in this PR, and then rework cast_to_variant to use it. That way, the row builder's constructor can do any column-level transformations that might be needed, before row-oriented visiting begins.

Yes, this sounds like a very plausible approach

In other words, this PR is probably the most important (= biggest bottleneck) variant PR currently open.

This is my feeling too. @carpecodeum if you don't have time to work on this PR in the next day or two, perhaps @scovich or @liamzwbao could open a new PR (starting with the code from this PR) that we can finish up?

NOTE: Even tho variant_get and cast_to_variant would both use the new row builder infra, they are not doing the same thing. The former (when unshredding) needs to use a read-only metadata builder because the metadata column already exists; the latter would use a normal metadata builder. So cast_to_variant may actually be lower hanging fruit for converting "stuff" to binary variant, with variant_get support for the same landing later when we sort out the read-only metadata thing.

👍

@carpecodeum
Contributor Author

I'm pretty sure the actual solution will be to merge the new variant row builder infrastructure in this PR, and then rework cast_to_variant to use it. That way, the row builder's constructor can do any column-level transformations that might be needed, before row-oriented visiting begins.

Yes, this sounds like a very plausible approach

In other words, this PR is probably the most important (= biggest bottleneck) variant PR currently open.

This is my feeling too. @carpecodeum if you don't have time to work on this PR in the next day or two, perhaps @scovich or @liamzwbao could open a new PR (starting with the code from this PR) that we can finish up?

NOTE: Even tho variant_get and cast_to_variant would both use the new row builder infra, they are not doing the same thing. The former (when unshredding) needs to use a read-only metadata builder because the metadata column already exists; the latter would use a normal metadata builder. So cast_to_variant may actually be lower hanging fruit for converting "stuff" to binary variant, with variant_get support for the same landing later when we sort out the read-only metadata thing.

👍

@alamb My apologies, I have been very occupied with some other things lately. I think it would be best to let @scovich & @liamzwbao make a new PR with the changes from this PR to speed things up. I would be happy to be an active reviewer or take up any follow-up tickets from that PR as things calm down next week.

@alamb
Contributor

alamb commented Sep 8, 2025

@alamb My apologies, I have been very occupied with some other things lately. I think it would be best to let @scovich & @liamzwbao make a new PR with the changes from this PR to speed things up. I would be happy to be an active reviewer or take up any follow-up tickets from that PR as things calm down next week.

Sounds like a plan to me

@scovich
Contributor

scovich commented Sep 8, 2025

I'm pretty sure the actual solution will be to merge the new variant row builder infrastructure in this PR, and then rework cast_to_variant to use it. That way, the row builder's constructor can do any column-level transformations that might be needed, before row-oriented visiting begins.

Yes, this sounds like a very plausible approach

In other words, this PR is probably the most important (= biggest bottleneck) variant PR currently open.

This is my feeling too. @carpecodeum if you don't have time to work on this PR in the next day or two, perhaps @scovich or @liamzwbao could open a new PR (starting with the code from this PR) that we can finish up?

NOTE: Even tho variant_get and cast_to_variant would both use the new row builder infra, they are not doing the same thing. The former (when unshredding) needs to use a read-only metadata builder because the metadata column already exists; the latter would use a normal metadata builder. So cast_to_variant may actually be lower hanging fruit for converting "stuff" to binary variant, with variant_get support for the same landing later when we sort out the read-only metadata thing.

👍

UPDATE: I started doing some pathfinding on a row-oriented cast_to_variant, and it turns out not to need this PR. The approach is the same, but shredding needs variant-to-arrow row builders while the casting needs arrow-to-variant row builders. I'll try to post a PR soon that takes a stab at the latter.

@alamb alamb closed this in fb7d02e Sep 8, 2025
@scovich
Contributor

scovich commented Sep 9, 2025

UPDATE: I started doing some pathfinding on a row-oriented cast_to_variant, and it turns out not to need this PR. The approach is the same, but shredding needs variant-to-arrow row builders while the casting needs arrow-to-variant row builders. I'll try to post a PR soon that takes a stab at the latter.

Successfully merging this pull request may close these issues.

[Variant] Support Shredded Objects in variant_get: typed path access (STEP 1)