Conversation


@DrakeLin DrakeLin commented Jan 8, 2026

Review only the most recent commit. b9172ea

What changes are proposed in this pull request?

Implements checkpoint writing with stats_parsed support, allowing checkpoints to include structured statistics alongside or instead of JSON statistics based on table properties:

  • delta.checkpoint.writeStatsAsJson
  • delta.checkpoint.writeStatsAsStruct

New: stats_transform.rs

  • StatsTransformConfig: Configuration from table properties
  • build_stats_transform(): Builds transform expression handling all 4 scenarios:
| writeStatsAsJson | writeStatsAsStruct | stats | stats_parsed |
| --- | --- | --- | --- |
| true | false | COALESCE(stats, ToJson(stats_parsed)) | drop |
| true | true | COALESCE(stats, ToJson(stats_parsed)) | COALESCE(stats_parsed, ParseJson(stats)) |
| false | true | drop | COALESCE(stats_parsed, ParseJson(stats)) |
| false | false | drop | drop |
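As a value-level sketch, the table above reduces to two independent decisions, one per stats column. The names below are illustrative; the real `build_stats_transform()` builds kernel `Expression` trees rather than returning an enum:

```rust
/// Illustrative model of the transform table: each column's fate depends
/// on exactly one of the two table properties.
#[derive(Debug, PartialEq)]
enum ColumnAction {
    /// Keep the column, filling gaps from the other representation.
    Coalesce,
    /// Drop the column from the checkpoint schema.
    Drop,
}

fn stats_action(write_stats_as_json: bool) -> ColumnAction {
    // stats column: COALESCE(stats, ToJson(stats_parsed)) when JSON stats
    // are requested, otherwise dropped.
    if write_stats_as_json {
        ColumnAction::Coalesce
    } else {
        ColumnAction::Drop
    }
}

fn stats_parsed_action(write_stats_as_struct: bool) -> ColumnAction {
    // stats_parsed column: COALESCE(stats_parsed, ParseJson(stats)) when
    // struct stats are requested, otherwise dropped.
    if write_stats_as_struct {
        ColumnAction::Coalesce
    } else {
        ColumnAction::Drop
    }
}

fn main() {
    // Row 1 of the table: writeStatsAsJson=true, writeStatsAsStruct=false.
    assert_eq!(stats_action(true), ColumnAction::Coalesce);
    assert_eq!(stats_parsed_action(false), ColumnAction::Drop);
}
```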

Modified: checkpoint/mod.rs

  • CheckpointDataResult: Returns iterator + output schema
  • TransformingCheckpointIterator: Applies stats transform internally, handles CheckpointMetadata separately (different schema)
  • checkpoint_data(): Orchestrates schema building, transform creation, and iterator construction

Fixed: evaluate_expression.rs

  • Preserve source struct's null bitmap when building output for nested transforms
  • Fixes "unmasked nulls for non-nullable field" errors when transforming batches with null Add rows

This PR affects the following public APIs

checkpoint_data() return type changed:

  • Before: DeltaResult<ActionReconciliationIterator>
  • After: DeltaResult<TransformingCheckpointIterator>

finalize() parameter changed:

  • Before: checkpoint_data: ActionReconciliationIterator
  • After: checkpoint_data: TransformingCheckpointIterator

New public types:

  • TransformingCheckpointIterator - wraps the action iterator and applies stats transforms based on table properties
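The wrapper's described behavior can be sketched as a std-only iterator adapter: transform every action batch, then yield the checkpoint-metadata batch unchanged at the end, since it has a different schema. `TransformingIter` and the `String` stand-ins are illustrative, not the kernel's types:

```rust
/// Illustrative sketch (not the kernel's API): an adapter that applies a
/// stats transform to every action batch, then yields the checkpoint
/// metadata batch unchanged at the end.
struct TransformingIter<I, F> {
    inner: I,
    transform: F,
    /// Stands in for the optional CheckpointMetadata batch.
    metadata: Option<String>,
}

impl<I, F> Iterator for TransformingIter<I, F>
where
    I: Iterator<Item = String>, // String stands in for an action batch
    F: FnMut(String) -> String,
{
    type Item = String;

    fn next(&mut self) -> Option<String> {
        match self.inner.next() {
            Some(batch) => Some((self.transform)(batch)),
            // After all actions, yield the metadata batch (if any) unchanged.
            None => self.metadata.take(),
        }
    }
}

fn main() {
    let it = TransformingIter {
        inner: vec!["add".to_string(), "remove".to_string()].into_iter(),
        transform: |batch: String| format!("{batch}+stats"),
        metadata: Some("checkpointMetadata".to_string()),
    };
    let out: Vec<String> = it.collect();
    assert_eq!(out, ["add+stats", "remove+stats", "checkpointMetadata"]);
}
```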

How was this change tested?

  • All existing checkpoint tests pass
  • New unit tests for StatsTransformConfig and schema building
  • test_checkpoint_data_struct_enabled - verifies stats_parsed in output schema
  • test_checkpoint_data_default_settings - verifies default behavior
  • test_checkpoint_stats_iteration - verifies transform application
  • test_all_stats_config_combinations - tests all 16 combinations of settings across two checkpoints

DrakeLin and others added 2 commits January 8, 2026 19:44
Signed-off-by: Robert Pack <robstar.pack@gmail.com>
@DrakeLin DrakeLin changed the title [WIP] Checkpoint write [WIP] Checkpoint write parsed stats Jan 8, 2026
@github-actions github-actions bot added the breaking-change Change that require a major version bump label Jan 8, 2026
@DrakeLin DrakeLin force-pushed the drake-lin_data/checkpoint-write branch from f729a52 to 98458a9 Compare January 8, 2026 21:46

codecov bot commented Jan 8, 2026

Codecov Report

❌ Patch coverage is 93.33333% with 69 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.35%. Comparing base (d8e24f5) to head (b9172ea).
⚠️ Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/expressions/transforms.rs 84.31% 12 Missing and 4 partials ⚠️
ffi/src/expressions/engine_visitor.rs 0.00% 14 Missing ⚠️
kernel/src/scan/data_skipping/stats_schema.rs 96.26% 9 Missing and 5 partials ⚠️
kernel/src/checkpoint/mod.rs 91.66% 1 Missing and 5 partials ⚠️
kernel/src/checkpoint/stats_transform.rs 97.55% 6 Missing ⚠️
...src/engine/arrow_expression/evaluate_expression.rs 96.90% 2 Missing and 4 partials ⚠️
kernel/src/expressions/mod.rs 64.28% 5 Missing ⚠️
kernel/src/table_configuration.rs 85.71% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1594      +/-   ##
==========================================
+ Coverage   84.04%   84.35%   +0.31%     
==========================================
  Files         118      120       +2     
  Lines       32168    33634    +1466     
  Branches    32168    33634    +1466     
==========================================
+ Hits        27035    28373    +1338     
- Misses       3789     3879      +90     
- Partials     1344     1382      +38     


@DrakeLin DrakeLin force-pushed the drake-lin_data/checkpoint-write branch 5 times, most recently from d21b024 to 87a1bd5 Compare January 9, 2026 01:09
@DrakeLin DrakeLin changed the title [WIP] Checkpoint write parsed stats feat: checkpoint write supports parsed stats Jan 9, 2026
@DrakeLin DrakeLin force-pushed the drake-lin_data/checkpoint-write branch from 87a1bd5 to d4c3f43 Compare January 9, 2026 01:20
@DrakeLin DrakeLin marked this pull request as ready for review January 9, 2026 01:20
@DrakeLin DrakeLin force-pushed the drake-lin_data/checkpoint-write branch from d4c3f43 to 1029b40 Compare January 9, 2026 01:21
@DrakeLin DrakeLin force-pushed the drake-lin_data/checkpoint-write branch from 1029b40 to b9172ea Compare January 9, 2026 01:24
}

// After all actions, yield the checkpoint metadata batch (if any) unchanged
self.checkpoint_metadata.take().map(Ok)
Collaborator:

Based on comments above, this would have a different schema though?

Collaborator Author:

Good point, created 2nd checkpoint schema for V2

///
/// # Engine Usage
///
/// # Returns: [`ActionReconciliationIterator`] containing the checkpoint data
Collaborator:

This seems like useful information? Why is it being dropped?

output_schema.clone().into(),
)?;

// Create action reconciliation iterator (without checkpoint metadata)
Collaborator:

Why is "without checkpoint metadata" an important point here?

Collaborator Author:

Fixed by adding checkpoint metadata into v2 checkpoint schema

/// let mut checkpoint_data = writer.checkpoint_data(&engine)?;
/// let output_schema = checkpoint_data.output_schema().clone();
/// while let Some(batch) = checkpoint_data.next() {
/// let data = batch?.apply_selection_vector()?;
Collaborator:

did we not modify parquet to take in FilteredEngineData as well?

Collaborator Author:

I don't think this was ever done

/// full checkpoint batch is produced with the modified Add action.
pub(crate) fn build_stats_transform(
config: &StatsTransformConfig,
stats_schema: SchemaRef,
Collaborator:

we need to clarify what stats_schema should look like.

let config = StatsTransformConfig::from_table_properties(self.snapshot.table_properties());

// Get stats schema from table configuration.
// This already excludes partition columns and applies column mapping.
Collaborator:

we do need to handle partitionValues_parsed appropriately:

partitionValues_parsed: In this struct, the column names correspond to the partition columns and the values are stored in their corresponding data type. This is a required field when the table is partitioned and the table property delta.checkpoint.writeStatsAsStruct is set to true.

Collaborator Author:

I'd like to do this in a followup PR

nullable: field.nullable,
metadata: field.metadata.clone(),
};
}
Collaborator:

nit: this if statement should never fail if there is an add action? Can we add a check to ensure it doesn't? (We should also validate that stats isn't already included in the Add struct.)

Collaborator:

could we separate out the engine change into its own PR?

Collaborator:

This might be a stack -- PR description said not to review this commit?

Collaborator:

Apologies, I missed "Review only the most recent commit. b9172ea".


let source_data: &dyn ProvidesColumnByName = match source_data {
// For nested transforms, get the source struct's null bitmap to preserve null rows
let source_null_buffer = source_array.as_ref().and_then(|arr| {
Collaborator:

It seems this should be collapsed into the match, so we have let (source_data, source_null_buffer). Otherwise this could panic on the downcast for RecordBatch?

})
.collect();
let data = StructArray::try_new(output_fields.into(), output_cols, None)?;
let data = StructArray::try_new(output_fields.into(), output_cols, source_null_buffer)?;
Collaborator:

Thank you for finding this. I suspect this might enable some code cleanup for changes I made where a lot of columns had to be converted to nullable.
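The fix above passes the source struct's validity bitmap through to the rebuilt output struct. A std-only model of why that matters (`rebuild_struct` and its types are hypothetical, not arrow-rs): rows that were null in the source struct must stay null, even though the child arrays carry placeholder values at those positions.

```rust
/// Hypothetical std-only model of the null-bitmap fix: rebuild a struct
/// column from a transformed child column, optionally preserving the
/// source struct's validity bitmap.
fn rebuild_struct(child: Vec<i64>, source_validity: Option<Vec<bool>>) -> Vec<Option<i64>> {
    match source_validity {
        // Preserve the source null bitmap (the fixed behavior).
        Some(valid) => child
            .into_iter()
            .zip(valid)
            .map(|(v, is_valid)| is_valid.then_some(v))
            .collect(),
        // Dropping the bitmap (the pre-fix behavior) "unmasks" the
        // placeholder values sitting under null rows.
        None => child.into_iter().map(Some).collect(),
    }
}

fn main() {
    // Row 1 is a null Add row; its child value (99) must stay masked.
    let fixed = rebuild_struct(vec![10, 99, 30], Some(vec![true, false, true]));
    assert_eq!(fixed, [Some(10), None, Some(30)]);

    let broken = rebuild_struct(vec![10, 99, 30], None);
    assert_eq!(broken, [Some(10), Some(99), Some(30)]); // 99 leaked
}
```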

/// maxValues: <derived min/max schema>,
/// }
/// ```
pub(crate) fn expected_stats_schema(
Collaborator:

nit: on naming, parsed_stats_schema_from_config, or something else to indicate this is producing the derived parse stats schema from configuration?

Collaborator Author:

I think we should keep as is, since writing stats also uses this

| &PrimitiveType::Timestamp
| &PrimitiveType::TimestampNtz
| &PrimitiveType::String
// | &PrimitiveType::Boolean
Collaborator:

why is bool commented out?


// Get stats schema from table configuration.
// This already excludes partition columns and applies column mapping.
let stats_schema = self
Collaborator:

One question I have is whether we want this coupling to TableConfiguration (and automatic derivation of the stats schema) here, or if the schema should be taken as a parameter. I thought we were trying to keep kernel code decoupled from config in general. CC @nicklan

Collaborator:

Good point. The engine should be free to fetch those confs (or override them), while kernel should just do whatever the engine said. This does seem like reasonable default behavior though, so probably we just need to split out the "with defaults" version that engines can use if they don't want to figure it out themselves?

let physical_schema = StructType::try_new(
self.schema()
.fields()
.filter(|field| !partition_columns.contains(field.name()))
Collaborator:

add a comment on why partition columns are filtered out.

/// schema for statistics based on the table configuration. Oftentimes the configuration
/// is based on operator experience or automated systems as to what statistics are most
/// useful for a given table.
pub fn expected_stats_schema(&self) -> DeltaResult<SchemaRef> {
Collaborator:

Same comment on naming: "expected_stats_schema" doesn't quite make sense to me; stats_schema or parsed_stats_schema would make more sense. It would be good to document where this is expected to be used. (It can't be used within add actions yet, right? So does it need to be public at this point?)

It would also be good to document how column mapping mode affects the schema.

If this will be a public function long term, then copying/moving the details about which configuration options impact it (instead of delegating) is important, including a basic example of the stats schema returned here given a specific input schema.

Another open question is whether, as a public-facing API, the returned schemas/columns here should be column major (or stat major, as they currently are). I think keeping stat major is probably OK.

@emkornfield (Collaborator) left a comment:

I'll leave it up to other reviewers but something to keep in mind for future PRs, is it seems this could easily be divided into 3 more focused PRs:

  1. Engine changes to support ParseJSON
  2. Schema construction changes.
  3. the checkpoint changes

I think better docs, at least initially, are the main blocker for me for merging, along with understanding some of the iterator changes.

@scovich (Collaborator) left a comment:

These stats schema questions probably belong on a different PR but I couldn't easily find it so leaving them here instead...

.unwrap_or(true);

if !should_include {
self.path.pop();
Collaborator:

A stray ? could very easily break this path stack protocol.
Can we create a newtype wrapper with appropriate impl Drop instead?

Collaborator:

Ah, but that would break the recursive self.transform call, because the &mut self.path would make the borrow checker unhappy. Hmm.

@scovich (Collaborator) left a comment:

Very nice start! I don't see anything worrisome in the approach.

Comment on lines +105 to +106
Arc::new(Expression::variadic(
VariadicExpressionOp::Coalesce,
Collaborator:

It's probably time to define an Expression::coalesce helper method, since there are already three call sites like this upstream, and now we're adding two new ones with this PR?
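For context, COALESCE evaluates to its first non-null argument. A value-level sketch of the semantics such a helper would wrap (this `coalesce` function is illustrative, not the kernel's `Expression` API):

```rust
/// Value-level sketch of COALESCE semantics: return the first non-null
/// argument, or null if every argument is null.
fn coalesce<T>(args: impl IntoIterator<Item = Option<T>>) -> Option<T> {
    args.into_iter().flatten().next()
}

fn main() {
    // COALESCE(stats, ToJson(stats_parsed)): existing JSON stats win.
    assert_eq!(coalesce([Some("json"), Some("from_struct")]), Some("json"));
    // Fall back to the converted value when stats is null.
    assert_eq!(coalesce([None, Some("from_struct")]), Some("from_struct"));
    // All-null input stays null.
    assert_eq!(coalesce::<&str>([None, None]), None);
}
```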

Comment on lines +107 to +113
vec![
Expression::column([ADD_NAME, STATS_FIELD]),
Expression::unary(
UnaryExpressionOp::ToJson,
Expression::column([ADD_NAME, STATS_PARSED_FIELD]),
),
],
Collaborator:

This is fully static. We could potentially define it as a static init instead?

// Insert stats_parsed right after stats
fields.push(StructField::nullable(
STATS_PARSED_FIELD,
DataType::Struct(Box::new(stats_schema.clone())),
Collaborator:

I think we have a TODO somewhere to use Arc instead of Box for DataType::Struct. This might be a good reason to address the TODO pronto -- wide stats schemas will pay a high cost at every query right now.

Comment on lines +178 to +196
let fields: Vec<StructField> = base_schema
.fields()
.map(|field| {
if field.name == ADD_NAME {
if let DataType::Struct(add_struct) = &field.data_type {
let modified_add = build_add_output_schema(config, add_struct, stats_schema);
return StructField {
name: field.name.clone(),
data_type: DataType::Struct(Box::new(modified_add)),
nullable: field.nullable,
metadata: field.metadata.clone(),
};
}
}
field.clone()
})
.collect();

Arc::new(StructType::new_unchecked(fields))
Collaborator:

This is nearly identical to build_checkpoint_read_schema_with_stats. Is there a way to factor out the common code?

Comment on lines +169 to +171
arr.as_any()
.downcast_ref::<StructArray>()
.and_then(|s| s.nulls().cloned())
Collaborator:

Why do we need to cast here when Array::nulls is available?
Are we intentionally relying on a failed cast to not always propagate the null array?

let result = parse_json_impl(json_strings, arrow_schema)?;

// Return as StructArray
Ok(Arc::new(StructArray::from(result)) as ArrayRef)
Collaborator:

I'm a bit surprised the upcast is needed here?

DrakeLin added a commit that referenced this pull request Jan 23, 2026
## 🥞 Stacked PR
Use this
[link](https://github.com/delta-io/delta-kernel-rs/pull/1642/files) to
review incremental changes.
-
[**stack/stats-schema**](#1642)
[[Files
changed](https://github.com/delta-io/delta-kernel-rs/pull/1642/files)]
-
[stack/write-stats-stack](#1656)
[[Files
changed](https://github.com/delta-io/delta-kernel-rs/pull/1656/files/f218eef7dafc67390b5fa3de7c4beea9e18acff4..38d6f4f61025af9adc10cd19737c3a6fb724350f)]

---------
## What changes are proposed in this pull request?

Follow-ups on the stats_schema PR
#1594


@DrakeLin DrakeLin closed this Jan 23, 2026