[Variant] Support Shredded Objects in variant_get #8166

carpecodeum · 2025-08-18T18:50:57Z

Which issue does this PR close?

Closes [Variant] Support Shredded Objects in variant_get: typed path access (STEP 1) #8150

Rationale for this change

Support variant_get for any input (shredded or otherwise), any depth of object field path steps, and casting to one primitive data type, e.g. Some(DataType::Int32). This is built on top of the changes in #8122.

What changes are included in this PR?

Add support for extracting fields from both shredded and non-shredded variant arrays at any depth (like "x", "a.x", "a.b.x") and casting them to Int32 with proper NULL handling for type mismatches.
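A toy sketch of the behaviour described above: follow object field names to any depth, then cast to Int32, with a missing path or a type mismatch producing NULL. `ToyVariant`, `get_path`, `get_i32` and `sample` are hypothetical names used only for illustration; they are not the parquet-variant API, and the real code works on Arrow arrays rather than one value at a time.

```rust
/// Toy stand-in for a variant value; NOT the `parquet_variant::Variant` type.
#[derive(Debug, Clone, PartialEq)]
enum ToyVariant {
    Null,
    Int32(i32),
    Str(String),
    Object(Vec<(String, ToyVariant)>),
}

/// Follow object field names one step at a time; `None` means the path is absent.
fn get_path<'v>(root: &'v ToyVariant, path: &[&str]) -> Option<&'v ToyVariant> {
    let mut current = root;
    for step in path {
        match current {
            ToyVariant::Object(fields) => {
                current = fields
                    .iter()
                    .find(|(name, _)| name.as_str() == *step)
                    .map(|(_, v)| v)?;
            }
            // Cannot step into a non-object value.
            _ => return None,
        }
    }
    Some(current)
}

/// Extract the value at `path` as an i32; a missing path or a type mismatch yields `None` (NULL).
fn get_i32(root: &ToyVariant, path: &[&str]) -> Option<i32> {
    match get_path(root, path)? {
        ToyVariant::Int32(v) => Some(*v),
        _ => None,
    }
}

/// Build the sample object {"x": 1, "n": null, "a": {"x": "foo", "b": {"x": 42}}}.
fn sample() -> ToyVariant {
    let b = ToyVariant::Object(vec![("x".to_string(), ToyVariant::Int32(42))]);
    let a = ToyVariant::Object(vec![
        ("x".to_string(), ToyVariant::Str("foo".to_string())),
        ("b".to_string(), b),
    ]);
    ToyVariant::Object(vec![
        ("x".to_string(), ToyVariant::Int32(1)),
        ("n".to_string(), ToyVariant::Null),
        ("a".to_string(), a),
    ])
}
```

With this sample, `get_i32(&sample(), &["a", "b", "x"])` finds an integer two levels down, while `get_i32(&sample(), &["a", "x"])` hits a string and so returns `None` instead of an error.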

Are these changes tested?

Yes, tests are added for non-shredded vs shredded inputs at depths 0, 1, and 2

Are there any user-facing changes?

not yet

Thanks to @mprammer 🥷

carpecodeum · 2025-08-19T18:55:59Z

cc - @alamb @scovich

alamb · 2025-08-19T19:00:12Z

Thank you @carpecodeum -- I will try and review this tomorrow first thing

scovich

Thanks for tackling this! Left a bunch of comments to get things started, but probably somebody else should review as well, because I am not a neutral reviewer of this particular code :P

scovich · 2025-08-19T20:15:17Z

parquet-variant-compute/src/variant_get/output/mod.rs

+#[allow(unused)]
 pub(crate) trait OutputBuilder {


Are we planning to just delete output builder-related code at some point?
(tho I guess that could be a follow-on PR?)

You're right that we could potentially simplify this in future PRs

scovich · 2025-08-19T20:23:11Z

parquet-variant/src/path.rs

+/// Create from &str with support for dot notation
 impl<'a> From<&'a str> for VariantPath<'a> {
    fn from(path: &'a str) -> Self {
-        VariantPath::new(vec![path.into()])
+        // Support dot notation: "a.x" -> ["a", "x"]
+        if path.contains('.') {


This is dangerous without some well-defined way to escape values. Otherwise, a column name that contains a dot is perfectly legal but would produce an incorrect path.

I believe Iceberg uses canonical jsonpath as the string path format, but I'm not sure if we want to take a stance in such low-level code?

delta-kernel-rs faced a similar dilemma and ended up defining macro magic that takes string literals in and parses them after verifying there are no special characters. By calling the macro, caller affirms it is a "simple" column path where splitting on dots is correct.

See this derive macro and the companion proc macro that wraps it up nicely.

I don't know if we want to go to such lengths here tho. Probably better to just use the unambiguous long-form, or define a test helper for benefit of unit tests?

I prefer the long form too. I only added this as a proof of concept to see what everyone thinks, and I can fall back to the original approach. Maybe @alamb can also give his opinion?
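To make the concern concrete, here is a minimal sketch using plain strings instead of `VariantPath`. `split_on_dots` mirrors the proof-of-concept splitting; `parse_simple_path` is a hypothetical runtime guard, loosely inspired by the "caller affirms it is a simple path" idea mentioned above. It is not the delta-kernel-rs macro and not a proposed API.

```rust
/// Naive splitting, as in the proof of concept: every '.' starts a new path step.
fn split_on_dots(path: &str) -> Vec<&str> {
    path.split('.').collect()
}

/// Hypothetical guarded variant: only accept "simple" paths where every step
/// is a non-empty run of ASCII alphanumerics or underscores.
fn parse_simple_path(path: &str) -> Result<Vec<&str>, String> {
    let steps: Vec<&str> = path.split('.').collect();
    for step in &steps {
        let ok = !step.is_empty()
            && step.chars().all(|c| c.is_ascii_alphanumeric() || c == '_');
        if !ok {
            return Err(format!("not a simple path step: {step:?}"));
        }
    }
    Ok(steps)
}
```

A single field that is literally named `user.name` becomes two steps under `split_on_dots`, which is the silent mis-parse being warned about. The guard rejects inputs such as `a..x` or `a[0].x`, but it still cannot tell a dotted field name from a nested path, so the unambiguous long form remains the only safe way to address such a field.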

scovich · 2025-08-19T20:25:23Z

parquet-variant/src/path.rs

+            let elements: Vec<VariantPathElement<'a>> = path
+                .split('.')
+                .map(|part| part.into())
+                .collect();
+            VariantPath::new(elements)


nit: I'm guessing the type annotation is unnecessary, because VariantPath::new already constrains it to a vec of path elements?

Suggested change

-            let elements: Vec<VariantPathElement<'a>> = path
-                .split('.')
-                .map(|part| part.into())
-                .collect();
-            VariantPath::new(elements)
+            VariantPath::new(path.split('.').map(Into::into).collect())

scovich · 2025-08-19T20:28:58Z

parquet-variant-compute/src/variant_array.rs

        match &self.shredding_state {
-            ShreddingState::Unshredded { metadata, value } => {
-                Variant::new(metadata.value(index), value.value(index))
+            ShreddingState::Unshredded { value } => {


NOTE: This change will conflict with

[Varint] Implement ShreddingState::AllNull variant #8093

We may want to be intentional about which PR merges first? (AFAIK, this PR should "just work" even if the other PR merges later... but it's probably more convenient to let the other PR merge first).

scovich · 2025-08-19T20:30:44Z

parquet-variant-compute/src/variant_array.rs

+/// additional fields), or NULL (`v:a` was an object containing only the single expected field `b`).
+///
+/// Finally, `v.typed_value.a.typed_value.b.value` is either NULL (`v:a.b` was an integer) or else a
+/// variant value.


Suggested change

-/// variant value.
+/// variant value (which could be `Variant::Null`).

scovich · 2025-08-19T21:59:00Z

parquet-variant-compute/src/variant_get/output/struct_output.rs

+
+#[allow(unused)]
+pub(crate) fn make_shredding_row_builder<'a>(
+    //metadata: &BinaryViewArray,


We'll eventually need this parameter, for shredding and unshredding operations that manipulate object fields.

scovich · 2025-08-19T21:59:38Z

parquet-variant-compute/src/variant_get/output/struct_output.rs

+#[allow(unused)]
+pub(crate) fn make_shredding_row_builder<'a>(


Isn't this used by shredded_get_path which is used by variant_get?

scovich · 2025-08-19T22:01:08Z

parquet-variant-compute/src/variant_get/output/struct_output.rs

+#[allow(unused)]
+struct PrimitiveVariantShreddingRowBuilder<T: ArrowPrimitiveType> {


another spurious annotation?

scovich · 2025-08-19T22:03:25Z

parquet-variant-compute/src/variant_get/output/struct_output.rs

+/// Used for actual shredding of binary variant values into shredded variant values
+#[allow(unused)]
+struct ShreddedVariantRowBuilder {


Should we just leave this struct out for now, and remove the code and types that support it?

scovich · 2025-08-19T22:04:10Z

parquet-variant-compute/src/variant_get/output/struct_output.rs

+/// Like VariantShreddingRowBuilder, but for (partially shredded) structs which need special
+/// handling on a per-field basis.
+#[allow(unused)]
+struct VariantShreddingStructRowBuilder {


As above, probably best added by a follow-on PR that actually uses it?

carpecodeum · 2025-08-21T17:28:04Z

thanks for the review @scovich , i am still resolving most of the comments

alamb · 2025-08-23T10:36:27Z

Just as a heads up I will be out this upcoming week, so I will be much slower on reviews / merging

Please just let me know if anything needs my attention and I'll get to it as quickly as I can

carpecodeum · 2025-08-24T01:11:17Z

That's fine, @alamb! Whenever you are available, I think you can give this one final review, and then we can merge it.

alamb · 2025-08-26T10:19:02Z

Thanks @carpecodeum

There seem to be some CI failures, can you please resolve them?

@scovich what are your thoughts about merging this PR in and iterating?

scovich

@scovich what are your thoughts about merging this PR in and iterating?

Sorry for the delays. My second review pass was a bit disorienting:

force-merge makes it really hard to tell visually what actually changed
The PR must have originally been stacked on some other PR, and it looks like several of my first-round comments were actually for the other PR, since they're no longer part of the diff (see below)
the shredding spec "null" questions sucked up a lot of time as well

Overall, we're close to merge. There are still some unit test issues that should be quick/easy to address.

My biggest concerns are about NULL handling, nested null masks and logical_nulls, e.g.

... and so naturally, because they're important, they're all hidden/invisible because "outdated". We could probably tackle this as a follow-up item, but it's a super important one to resolve quickly and I don't think we fully know what the correct approach should be?

Also, some specific questions for @alamb are still pending:

Comments that apparently applied to a stacked PR's parent?

scovich · 2025-08-25T14:48:14Z

parquet-variant-compute/src/variant_array.rs

+        let value = if let Some(value_col) = inner.column_by_name("value") {
+            if let Some(binary_view) = value_col.as_binary_view_opt() {
+                Some(binary_view.clone())
+            } else {
+                return Err(ArrowError::NotYetImplemented(format!(
+                    "VariantArray 'value' field must be BinaryView, got {}",
+                    value_col.data_type()
+                )));
+            }
+        } else {
+            None
+        };


tiny nit?

Suggested change

-        let value = if let Some(value_col) = inner.column_by_name("value") {
-            if let Some(binary_view) = value_col.as_binary_view_opt() {
-                Some(binary_view.clone())
-            } else {
-                return Err(ArrowError::NotYetImplemented(format!(
-                    "VariantArray 'value' field must be BinaryView, got {}",
-                    value_col.data_type()
-                )));
-            }
-        } else {
-            None
-        };
+        let value = match inner.column_by_name("value") {
+            Some(value_col) => match value_col.as_binary_view_opt() {
+                Some(binary_view) => Some(binary_view.clone()),
+                None => return Err(ArrowError::NotYetImplemented(format!(
+                    "VariantArray 'value' field must be BinaryView, got {}",
+                    value_col.data_type()
+                ))),
+            }
+            None => None,
+        };

scovich · 2025-08-25T14:54:49Z

parquet-variant-compute/src/variant_array.rs

+            ShreddingState::PartiallyShredded { value, typed_value, .. } => {
+                // PartiallyShredded case (formerly ImperfectlyShredded)


This is incorrect naming. The term "partially shredded" is specifically defined in the variant shredding spec, and that definition does not agree with how the current code uses the name:

An object is partially shredded when the value is an object and the typed_value is a shredded object.

Partial shredding is just one specific kind of imperfect shredding -- any type can shred imperfectly, producing both value and typed_value columns.

I realize this PR didn't make the naming decisions, but the wording in the PR is not consistent with these namings. I would personally prefer to fix the naming here to be consistent with how the PR uses it, but we could also try to adjust the PR wording to match these enum variant names.

scovich · 2025-08-25T14:57:02Z

parquet-variant-compute/src/variant_array.rs

            ShreddingState::Typed { typed_value, .. } => {
+                // Typed case (formerly PerfectlyShredded)


Not a fan of Typed as the name here. It doesn't match any language in the variant shredding spec, and IMO it doesn't really self-describe either. At a minimum we should consider StronglyTyped? (in contrast to variant normally being weakly typed)?

But I'm curious why "perfectly shredded" and "imperfectly shredded" are unclear or otherwise unwelcome? (see comment below)

Ah! I think I figured out the confusion. The spec says this:

Both value and typed_value are optional fields used together to encode a single value.
Values in the two fields must be interpreted according to the following table:

| value | typed_value | Meaning |
|---|---|---|
| null | null | The value is missing; only valid for shredded object fields |
| non-null | null | The value is present and may be any type, including null |
| null | non-null | The value is present and is the shredded type |
| non-null | non-null | The value is present and is a partially shredded object |

An object is partially shredded when the value is an object and the typed_value is a shredded object.
Writers must not produce data where both value and typed_value are non-null, unless the Variant value is an object.

... but those are row-wise concepts -- assuming that both columns are physically present -- and this code is dealing with column-wise concepts where one or both columns could be physically missing. If we produced that table, it might look something like this:

The value and typed_value columns are optional columns that together encode columns of variant values. Either or both columns may be physically missing, which can be interpreted according to the following table:

| value | typed_value | Meaning |
|---|---|---|
| missing | missing | All values are missing; only valid for shredded object fields |
| present | missing | All values are unshredded, and can be any type, including null |
| missing | present | All values are present and are the shredded type |
| present | present | At least some values are present and can be any type, including null |

A shredded object field is perfectly shredded when the typed_value column is present and the value column is either all-null or missing.
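The column-wise reading can be sketched as a small classifier. The enum and function names below are hypothetical and follow the wording of this comment rather than the `ShreddingState` variants in the PR. The columns are modelled as plain slices, and treating an all-NULL `value` column like a missing one when `typed_value` is absent is an assumption made here for symmetry.

```rust
/// Hypothetical names for the column-level combinations discussed above.
#[derive(Debug, PartialEq)]
enum ColumnShredding {
    /// Neither column carries any data.
    Missing,
    /// Only the binary `value` column carries data.
    Unshredded,
    /// `typed_value` exists and `value` is missing or all-NULL.
    PerfectlyShredded,
    /// Both columns carry data.
    ImperfectlyShredded,
}

/// Classify a (value, typed_value) column pair; `None` means the column is physically missing.
fn classify_columns(
    value: Option<&[Option<Vec<u8>>]>,
    typed_value: Option<&[Option<i32>]>,
) -> ColumnShredding {
    // A `value` column that exists but holds only NULLs carries no information.
    let value_is_empty = value.map_or(true, |col| col.iter().all(Option::is_none));
    match (value_is_empty, typed_value.is_some()) {
        (true, false) => ColumnShredding::Missing,
        (false, false) => ColumnShredding::Unshredded,
        (true, true) => ColumnShredding::PerfectlyShredded,
        (false, true) => ColumnShredding::ImperfectlyShredded,
    }
}
```

The point of the sketch is that the classification depends on which columns physically exist, which is a different question from the row-wise null/non-null table in the spec.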

Aside:

GH-519: [Variant] Disambiguate SQL NULL (missing) from Variant null parquet-format#520

scovich · 2025-08-25T14:59:12Z

parquet-variant-compute/src/variant_array.rs

                if typed_value.is_null(index) {
-                    Variant::new(metadata.value(index), value.value(index))
+                    Variant::new(self.metadata.value(index), value.value(index))
                } else {
                    typed_value_to_variant(typed_value, index)
                }
            }
            ShreddingState::AllNull { .. } => {


Another naming nit: This should probably be called ShreddingState::Missing, to match terminology of the shredding spec?

scovich · 2025-08-26T12:36:28Z

parquet-variant-compute/src/variant_array.rs

+///         a: SHREDDED_VARIANT_FIELD {
+///             value: BINARY,
+///             typed_value: STRUCT {
+///                 a: SHREDDED_VARIANT_FIELD {


Suggested change

-///                 a: SHREDDED_VARIANT_FIELD {
+///                 b: SHREDDED_VARIANT_FIELD {

scovich · 2025-08-26T12:44:31Z

parquet-variant-compute/src/variant_get/output/row_builder.rs

+            Ok(Box::new(VariantPathRowBuilder {
+                builder: $inner_builder,
+                path,
+            }) as Box<dyn VariantShreddingRowBuilder + 'a>)


Is the cast actually needed? The function's return value should constrain it already?

scovich · 2025-08-26T12:50:53Z

parquet-variant-compute/src/variant_array.rs

+    fn nulls(&self) -> Option<&NullBuffer> {
+        // According to the shredding spec, ShreddedVariantFieldArray should be 
+        // physically non-nullable - SQL NULL is inferred by both value and 
+        // typed_value being physically NULL
+        None
+    }


Rescuing from github oblivion:

[Variant] Support Shredded Objects in variant_get #8166 (comment)

We may need to override the logical_nulls method.

If nulls is hard-wired to return None then we definitely need to provide logical_nulls?
Otherwise people have to manually dig into the guts of the array?
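A minimal sketch of what such a logical null mask could compute, assuming the rule from the code comment above (a row is SQL NULL only when both `value` and `typed_value` are NULL). The function name and the `Vec<bool>` validity representation are illustrative stand-ins; an actual override would presumably return an Arrow `NullBuffer`.

```rust
/// Compute per-row validity for a shredded field: a row is valid if either
/// `value` or `typed_value` holds something. Returns `None` when every row is
/// valid, loosely mirroring how an optional null buffer is usually reported.
fn logical_validity(
    value: &[Option<Vec<u8>>],
    typed_value: &[Option<i32>],
) -> Option<Vec<bool>> {
    assert_eq!(value.len(), typed_value.len(), "columns must have the same length");
    let validity: Vec<bool> = value
        .iter()
        .zip(typed_value)
        .map(|(v, t)| v.is_some() || t.is_some())
        .collect();
    if validity.iter().all(|&valid| valid) {
        None
    } else {
        Some(validity)
    }
}
```

Callers could then ask the array for this mask instead of inspecting the two child columns themselves.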

scovich · 2025-08-26T13:02:11Z

parquet-variant-compute/src/variant_get/mod.rs

+        let (metadata, y_field_value) = {
+            let mut builder = parquet_variant::VariantBuilder::new();
+            let mut obj = builder.new_object();
+            obj.insert("x", Variant::Int32(42));
+            obj.insert("y", Variant::from("foo"));


It's illegal for a partially shredded object to mention the same field in both the value and typed_value columns. To add "x" to the metadata dictionary manually, just use the VariantBuilder::with_field_names method?

scovich · 2025-08-26T13:03:21Z

parquet-variant-compute/src/variant_get/output/mod.rs

-        ))),
-    }
-}
+pub(crate) mod row_builder;


Shouldn't cargo fmt have fixed this?

scovich · 2025-08-26T13:07:52Z

parquet-variant/src/path.rs

@@ -95,10 +95,10 @@ impl<'a> From<Vec<VariantPathElement<'a>>> for VariantPath<'a> {
    }
 }

-/// Create from &str
+/// Create from &str with support for dot notation
 impl<'a> From<&'a str> for VariantPath<'a> {


Rescuing #8166 (comment) from github oblivion:

This is dangerous without some well-defined way to escape values. Otherwise, a column name that contains a dot is perfectly legal but would produce an incorrect path.

I prefer the long form too. I only added this as a proof of concept to see what everyone thinks, and I can fall back to the original approach. Maybe @alamb can also give his opinion?

(impl From makes it part of the public API, and I'm nervous about giving people a massive footgun)

klion26 · 2025-08-29T07:37:18Z

parquet-variant-compute/src/variant_array.rs

+        if let Some(typed_value) = typed_value.clone() {
+            builder = builder.with_field("typed_value", typed_value);
+        }
+        if let Some(nulls) = nulls {


value and typed_value are both cloned here; do we need to do the same for nulls?

Those other two are present in both inner and shredding_state, hence the clone calls. The nulls are only used once, so no clone is needed.

klion26 · 2025-08-29T07:38:12Z

parquet-variant-compute/src/variant_array.rs

+    }
+
+    #[allow(unused)]
+    pub(crate) fn from_parts(


Do we need to add some doc for this?

klion26 · 2025-08-29T08:04:01Z

parquet-variant-compute/src/variant_get/mod.rs

+    // If the requested path element is not present in `typed_value`, and `value` is missing, then
+    // we know it does not exist; it, and all paths under it, are all-NULL.
+    let missing_path_step = || {
+        let Some(_value_field) = shredding_state.value_field() else {


We just want to return a different result depending on whether value_field is present. Would a match be cleaner here?

Do you mean this?

    let missing_path_step = || match shredding_state.value_field().is_some() {
        true => ShreddedPathStep::NotShredded,
        false => ShreddedPathStep::Missing,
    };

Yes. I originally had the following in mind, but the two are equivalent:

    match shredding_state.value_field() {
        Some(_) => ShreddedPathStep::NotShredded,
        None => ShreddedPathStep::Missing,
    };
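For reference, a self-contained sketch of the two spellings, following the code comment quoted above (if `value` is missing, the path does not exist). `PathStep` is a simplified, hypothetical stand-in for `ShreddedPathStep`, and the `value` field is modelled as an `Option<&str>`.

```rust
/// Simplified stand-in for the two outcomes discussed here.
#[derive(Debug, PartialEq)]
enum PathStep {
    /// No `value` column to fall back on: the path cannot exist.
    Missing,
    /// A `value` column exists: the path may still be found in the binary variant.
    NotShredded,
}

/// Spelling 1: match on a boolean.
fn missing_path_step_bool(value_field: Option<&str>) -> PathStep {
    match value_field.is_some() {
        true => PathStep::NotShredded,
        false => PathStep::Missing,
    }
}

/// Spelling 2: match on the Option directly.
fn missing_path_step_match(value_field: Option<&str>) -> PathStep {
    match value_field {
        Some(_) => PathStep::NotShredded,
        None => PathStep::Missing,
    }
}
```

Both functions agree on every input; matching on the `Option` directly avoids the intermediate boolean.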

klion26 · 2025-08-30T10:55:10Z

parquet-variant-compute/src/variant_get/output/row_builder.rs

+
+/// A thin wrapper whose only job is to extract a specific path from a variant value and pass the
+/// result to a nested builder.
+struct VariantPathRowBuilder<'a, T: VariantShreddingRowBuilder> {


The only difference between VariantPathRowBuilder and PrimitiveVariantShreddingRowBuilder is that VariantPathRowBuilder holds a path, which it uses to retrieve the value in append_value. The two could be merged if we moved the value extraction logic to the caller. Would that be cleaner?

Not sure I understand? The caller already followed shredded path steps as far as possible before creating any builder. This path builder is used to extract the remaining path steps, on a row-by-row basis, from the values of a binary variant column the caller encountered before the path was exhausted. Not sure how that could be moved to the caller?

Sorry for not describing it clearly. The caller I meant is the caller of the builder (the logic in src/variant_get/mod.rs#shredded_get_path). I commented because VariantPathRowBuilder has one more field (the path) than PrimitiveVariantShreddingRowBuilder, but the rest seems the same. The path in VariantPathRowBuilder is only used in append_value, so it could be handled in the caller, shredded_get_path.

Ah. Let's keep an eye out for that -- I originally thought there would be multiple call sites (once we support the other scenarios besides just primitives) and that it would be helpful to factor out the pathing. But if it turns out that there really is just one callsite, then I agree we could eliminate the extra layer of abstraction.

klion26 · 2025-08-30T14:27:39Z

parquet-variant-compute/src/variant_get/mod.rs

+            // First, try to downcast to StructArray
+            let Some(struct_array) = typed_value.as_any().downcast_ref::<StructArray>() else {
+                // Downcast failure - if strict cast options are enabled, this should be an error
+                if !cast_options.safe {


If typed_value can't be downcast to StructArray and cast_options.safe is false, we return an Err. Is there a chance that the path could still be found in value? If so, is returning an Err here really the expected behavior?

Good catch. This scenario could happen if e.g. we asked for v:a.b.c::INT and v.typed_value.a.typed_value is shredded as something other than struct. That's not an error at all, there's no rule that the path we asked for has to match the shredding of the underlying data. It just means we need to fetch the value from v.typed_value.a.value instead.
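A rough sketch of that fallback, using `std::any::Any` in place of Arrow's array downcasting. `ToyStruct`, `ToyInts`, `NextStep` and `next_step` are hypothetical names for illustration only; the real logic lives in shredded_get_path and operates on Arrow arrays.

```rust
use std::any::Any;

/// Stand-in for a struct-typed `typed_value` column; holds its field names.
struct ToyStruct {
    fields: Vec<String>,
}

/// Stand-in for a `typed_value` column shredded as something other than a struct.
struct ToyInts(Vec<i32>);

#[derive(Debug, PartialEq)]
enum NextStep {
    /// The field is shredded: keep following the path inside `typed_value`.
    Descend,
    /// The field is not shredded here: extract the rest of the path from `value`.
    UseValueColumn,
    /// No `value` column either: the path cannot exist.
    Missing,
}

fn next_step(typed_value: &dyn Any, field: &str, has_value_column: bool) -> NextStep {
    let fallback = if has_value_column {
        NextStep::UseValueColumn
    } else {
        NextStep::Missing
    };
    // A failed downcast is not an error: the data is simply shredded as a
    // different type than the requested path assumes.
    let Some(struct_col) = typed_value.downcast_ref::<ToyStruct>() else {
        return fallback;
    };
    if struct_col.fields.iter().any(|f| f.as_str() == field) {
        NextStep::Descend
    } else {
        fallback
    }
}
```

Under this sketch, strict cast options would only matter later, when the value finally extracted cannot be converted to the requested type.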

klion26 · 2025-08-31T05:00:05Z

parquet-variant-compute/src/variant_get/mod.rs

+
+    /// Simple test to check if nested paths are supported by current implementation
+    #[test]  
+    fn test_simple_nested_path_support() {


Do we need to assert the expected behavior in this test function?

klion26 · 2025-08-31T07:32:35Z

parquet-variant-compute/src/variant_get/mod.rs

+
+    #[test]
+    fn test_null_buffer_union_for_shredded_paths() {
+        use arrow::compute::CastOptions;


Do we need to add another row where path a.x does not exist?

carpecodeum · 2025-09-01T16:22:59Z

Thank you for your reviews @klion26 and @scovich I will work on the suggestions

alamb

Thank you for this contribution @carpecodeum @mprammer, @scovich and @klion26

I went through the various comments from @klion26 and @scovich and I think they can all be done in follow on PRs.

Given this PR's size, number of comments, and relative age, I would like to propose we try to accelerate progress by unblocking this PR and parallelizing the work:

Merge this PR in
Work on comments as follow on PRs

I am happy to file follow on tickets to track the additional work

I have begun making some PRs to help get this PR ready for merge

Fix Clippy: cmu-db#4

Finally, I wanted to apologize for my absence -- I have been away but am now back and hopefully we'll accelerate things a bit

alamb · 2025-09-04T19:57:28Z

parquet-variant-compute/src/variant_get/output/row_builder.rs

+use std::sync::Arc;
+
+pub(crate) fn make_shredding_row_builder<'a>(
+    //metadata: &BinaryViewArray,


could be removed

Suggested change

-    //metadata: &BinaryViewArray,

alamb · 2025-09-04T20:00:24Z

parquet-variant-compute/src/variant_get/mod.rs

+
+    /// Simple test to check if nested paths are supported by current implementation
+    #[test]  
+    fn test_simple_nested_path_support() {


alamb · 2025-09-04T20:17:04Z

parquet-variant-compute/src/variant_get/mod.rs

@@ -388,43 +525,720 @@ mod test {
            VariantArray::try_new(Arc::new(struct_array)).expect("should create variant array"),
        )
    }
+    /// This test manually constructs a shredded variant array representing objects 
+    /// like {"x": 1, "y": "foo"} and {"x": 42} and tests extracting the "x" field


This is a great set of ideas, but perhaps one we can do as a follow on PR

alamb · 2025-09-04T20:41:58Z

@carpecodeum , @scovich points out that this PR now also has merge conflicts, likely due to merging this

[Variant] Support typed access for numeric types in variant_get #8179

the rebase will have to figure out how to move the newly merged output builder functionality to an appropriate row builder equivalent.

Would you like help doing this? I can make another PR (targeting your branch to do so).

carpecodeum · 2025-09-04T23:40:29Z

Hi @alamb @scovich, my apologies, I have been quite busy this past week. I would like to help with the merge conflict resolution. I realise that with so many changes going in right now, things have become hard to track, this PR included.

scovich · 2025-09-04T23:44:47Z

I had some boring meetings today where I could guide a surprisingly helpful AI assistant through the merge process and subsequent fixups. PTAL?

[Variant] Support Shredded Objects in variant_get (take 2) #8280

(I tried making cmu-db#5 against your branch, but I must have done it completely wrong... either that, or the merge commit threw things off; maybe somebody smarter than me can figure that one out?)

scovich · 2025-09-06T03:42:39Z

Attn @alamb @carpecodeum @liamzwbao :

while making an early pass over [Variant]: Implement DataType::ListView/LargeListView support for cast_to_variant kernel #8241
and re-reviewing [Variant] Refactor cast_to_variant #8235

I noticed that cast_to_variant follows the same problematic column-oriented approach as the output builder this PR replaces with a row builder. In particular, it converts entire columns to variant, which means repeatedly re-encoding deeply nested variant values, transferring their bytes to new arrays, and recreating incomplete metadata columns that immediately get thrown away.

I spent some time today, trying to refactor it to be row-oriented (while keeping a similar code structure), but late in the process I realized that several array types break it because they need to do some holistic column-level transformation, which cannot be captured efficiently by a row-wise approach:

maps - cast the key column to string
dictionaries - convert the dictionary value column and then clone values from it as needed
run-end encoding - similar to dictionary, but also need to convert the incoming logical index to physical run index, which is a linear search in row-oriented code (= quadratic cost!)

I'm pretty sure the actual solution will be to merge the new variant row builder infrastructure in this PR, and then rework cast_to_variant to use it. That way, the row builder's constructor can do any column-level transformations that might be needed, before row-oriented visiting begins.
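As a sketch of why constructor-time setup helps with run-end encoding: a builder that is created once per column can keep a cursor into the runs, so visiting rows in order costs amortized constant time per row instead of a fresh linear search. `RunEndCursor` and `physical_index` are hypothetical helpers over plain slices, not part of arrow-rs or of this PR.

```rust
/// Toy run-end-encoded column: `run_ends[i]` is the exclusive logical end of run `i`.
struct RunEndCursor<'a> {
    run_ends: &'a [usize],
    values: &'a [&'a str],
    /// Physical index of the run containing the most recently visited row.
    run: usize,
}

impl<'a> RunEndCursor<'a> {
    /// Column-level setup happens once, before any row is visited.
    fn new(run_ends: &'a [usize], values: &'a [&'a str]) -> Self {
        assert_eq!(run_ends.len(), values.len());
        Self { run_ends, values, run: 0 }
    }

    /// Look up a logical row. Rows must be visited in non-decreasing order,
    /// so the cursor only ever moves forward.
    fn value_at(&mut self, row: usize) -> Option<&'a str> {
        while self.run < self.run_ends.len() && row >= self.run_ends[self.run] {
            self.run += 1;
        }
        self.values.get(self.run).copied()
    }
}

/// Stateless alternative for random access: binary search over the run ends.
fn physical_index(run_ends: &[usize], row: usize) -> usize {
    run_ends.partition_point(|&end| end <= row)
}
```

The same pattern would apply to the other two cases listed above: the constructor could cast map keys or convert dictionary values once, and the per-row method would only index into the prepared result.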

In other words, this PR is probably the most important (= biggest bottleneck) variant PR currently open.

NOTE: Even tho variant_get and cast_to_variant would both use the new row builder infra, they are not doing the same thing. The former (when unshredding) needs to use a read-only metadata builder because the metadata column already exists; the latter would use a normal metadata builder. So cast_to_variant may actually be lower hanging fruit for converting "stuff" to binary variant, with variant_get support for the same landing later when we sort out the read-only metadata thing.

Thoughts?

alamb · 2025-09-06T09:39:23Z

I'm pretty sure the actual solution will be to merge the new variant row builder infrastructure in this PR, and then rework cast_to_variant to use it. That way, the row builder's constructor can do any column-level transformations that might be needed, before row-oriented visiting begins.

Yes, this sounds like a very plausible approach

In other words, this PR is probably the most important (= biggest bottleneck) variant PR currently open.

This is my feeling too. @carpecodeum if you don't have time to work on this PR in the next day or two, perhaps @scovich or @liamzwbao could open a new PR (starting with the code from this PR) that we can finish up?

NOTE: Even tho variant_get and cast_to_variant would both use the new row builder infra, they are not doing the same thing. The former (when unshredding) needs to use a read-only metadata builder because the metadata column already exists; the latter would use a normal metadata builder. So cast_to_variant may actually be lower hanging fruit for converting "stuff" to binary variant, with variant_get support for the same landing later when we sort out the read-only metadata thing.

👍

carpecodeum · 2025-09-06T15:32:16Z

@alamb My apologies, I have been very occupied with other things lately. I think it would be best to let @scovich and @liamzwbao make a new PR with the changes from this PR to speed things up. I would be happy to be an active reviewer or to take up any follow-up tickets from that PR as things calm down next week.

alamb · 2025-09-08T10:50:36Z

@alamb My apologies, I have been very occupied with some other things lately, I think it would be best to let @scovich & @liamzwbao to make a new PR with the changes from this PR to speed things up, I would be happy to be an active reviewer or take up any follow up tickets from that PR as things calm down next week.

Sounds like a plan to me

scovich · 2025-09-08T11:45:16Z

UPDATE: I started doing some pathfinding on a row-oriented cast_to_variant, and it turns out not to need this PR. The approach is the same, but shredding needs variant-to-arrow row builders while the casting needs arrow-to-variant row builders. I'll try to post a PR soon that takes a stab at the latter.

scovich · 2025-09-09T01:41:04Z

UPDATE: I started doing some pathfinding on a row-oriented cast_to_variant, and it turns out not to need this PR. The approach is the same, but shredding needs variant-to-arrow row builders while the casting needs arrow-to-variant row builders. I'll try to post a PR soon that takes a stab at the latter.

[Variant] Implement row builders for cast_to_variant #8299

carpecodeum changed the title ~~Support Shredded Objects in variant_get~~ [Variant] Support Shredded Objects in variant_get Aug 18, 2025

github-actions bot added the parquet-variant parquet-variant* crates label Aug 18, 2025

scovich reviewed Aug 19, 2025

This was referenced Aug 20, 2025

[Variant] Implement VariantArray::value for shredded variants #8105

Merged

[Variant] Support typed access for numeric types in variant_get #8179

Merged

[Variant] Very rough pathfinding for variant get/shredding

81270f1

carpecodeum force-pushed the shredding-variant-part1 branch from f8c455d to f519bb4 Compare August 22, 2025 18:50

[ADD] add shredding support for variant objects

2326b55

carpecodeum force-pushed the shredding-variant-part1 branch from f519bb4 to 2326b55 Compare August 22, 2025 19:01

[FIX] remove unused annotation

882aa4d

scovich reviewed Aug 26, 2025

klion26 reviewed Aug 31, 2025

alamb mentioned this pull request Sep 4, 2025

Fix Clippy in shredding part 1: cmu-db/arrow-rs#4

Open

alamb reviewed Sep 4, 2025

This was referenced Sep 4, 2025

Merge arrow-rs/main into shredding-variant-part1 cmu-db/arrow-rs#5

Open

[Variant] Support Shredded Objects in variant_get (take 2) #8280

Merged

alamb closed this in fb7d02e Sep 8, 2025

	/// variant value.
	/// variant value (which could be `Variant::Null`).

		#[allow(unused)]
		pub(crate) fn make_shredding_row_builder<'a>(

		#[allow(unused)]
		struct PrimitiveVariantShreddingRowBuilder<T: ArrowPrimitiveType> {

		ShreddingState::PartiallyShredded { value, typed_value, .. } => {
		// PartiallyShredded case (formerly ImperfectlyShredded)

		ShreddingState::Typed { typed_value, .. } => {
		// Typed case (formerly PerfectlyShredded)

`value`	`typed_value`	Meaning
null	null	The value is missing; only valid for shredded object fields
non-null	null	The value is present and may be any type, including null
null	non-null	The value is present and is the shredded type
non-null	non-null	The value is present and is a partially shredded object

`value`	`typed_value`	Meaning
missing	missing	All values are missing; only valid for shredded object fields
present	missing	All values are unshredded, and can be any type, including null
missing	present	All values are present and the shredded type
present	present	At least some values are present and can be any type, including null
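The per-row table above (the first of the two) maps directly onto a four-way match. The sketch below is only an illustration under stated assumptions: `RowState` and `classify_row` are hypothetical names, not part of arrow-rs, and `Option` stands in for Arrow nullability of the `value` and `typed_value` columns at a given row.

```rust
/// Hypothetical per-row interpretation of a (`value`, `typed_value`)
/// pair, following the per-row table in the discussion above.
#[derive(Debug, PartialEq)]
enum RowState<V, T> {
    /// null / null: the value is missing (only valid for shredded object fields).
    Missing,
    /// non-null / null: the value is present and may be any type, including null.
    Unshredded(V),
    /// null / non-null: the value is present and is the shredded type.
    Shredded(T),
    /// non-null / non-null: the value is a partially shredded object.
    PartiallyShreddedObject { value: V, typed_value: T },
}

/// Classify one row. `None` models an Arrow null in the corresponding column.
fn classify_row<V, T>(value: Option<V>, typed_value: Option<T>) -> RowState<V, T> {
    match (value, typed_value) {
        (None, None) => RowState::Missing,
        (Some(v), None) => RowState::Unshredded(v),
        (None, Some(t)) => RowState::Shredded(t),
        (Some(v), Some(t)) => RowState::PartiallyShreddedObject {
            value: v,
            typed_value: t,
        },
    }
}
```

The column-level table has the same shape, with "missing" and "present" describing whether the whole column exists rather than whether a single row is null.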

	/// a: SHREDDED_VARIANT_FIELD {
	/// b: SHREDDED_VARIANT_FIELD {
