
Conversation

@huan233usc (Collaborator) commented Jan 5, 2026

🥞 Stacked PR

Use this link to review incremental changes.


Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

Description

Add basic deletion vector (DV) read support for the Spark V2 connector using row-based filtering.

Changes:

  • DvSchemaContext: POJO that holds the DV schema context (column indices, output schema); a sketch follows below
  • DeletionVectorReadFunction: Wraps the base reader to filter out deleted rows and project out the DV column
  • PartitionUtils: Creates a DV-aware PartitionReaderFactory with DeltaParquetFileFormatV2
  • Add serializeToBase64() to Kernel's DeletionVectorDescriptor
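
For illustration, here is a minimal Scala sketch of what a schema-context helper like this could look like. The column name `__delta_internal_is_row_deleted` comes from this PR; the class shape and member names are assumptions, not the actual code:

```scala
import org.apache.spark.sql.types.StructType

// Hypothetical sketch of a DV schema context; the real DvSchemaContext in
// this PR may differ in shape and naming.
case class DvSchemaContext(readSchema: StructType) {
  private val dvColumnName = "__delta_internal_is_row_deleted"

  // Index of the internal DV column within the read schema
  // (fieldIndex throws if the column is absent).
  val dvColumnIndex: Int = readSchema.fieldIndex(dvColumnName)

  // Output schema with the internal DV column projected out.
  val outputSchema: StructType =
    StructType(readSchema.fields.filterNot(_.name == dvColumnName))
}
```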

How it works:

  1. Add the __delta_internal_is_row_deleted column to the read schema
  2. Filter out rows whose DV column value != 0 (i.e., deleted rows)
  3. Project the DV column out of the output (see the sketch after this list)
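
A rough sketch of steps 2 and 3 (row-based filtering, then projection). The names here are illustrative rather than the PR's actual DeletionVectorReadFunction, and it assumes the DV flag is materialized as a ByteType column:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
import org.apache.spark.sql.types.StructType

// Drop rows flagged as deleted, then copy every field except the DV column
// into a new row so downstream operators never see the internal column.
def filterAndProject(
    rows: Iterator[InternalRow],
    readSchema: StructType,
    dvIdx: Int): Iterator[InternalRow] = {
  val fieldTypes = readSchema.fields.map(_.dataType)
  rows
    // Step 2: keep only rows whose DV flag is 0 ("not deleted").
    .filter(_.getByte(dvIdx) == 0)
    // Step 3: project out the DV column.
    .map { row =>
      val values = new Array[Any](fieldTypes.length - 1)
      var i = 0
      var j = 0
      while (i < fieldTypes.length) {
        if (i != dvIdx) {
          values(j) = row.get(i, fieldTypes(i))
          j += 1
        }
        i += 1
      }
      new GenericInternalRow(values)
    }
}
```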

How was this patch tested?

  • DvSchemaContextTest: Unit tests for schema manipulation
  • DeletionVectorReadFunctionTest: Unit tests for row filtering and projection
  • Golden table tests with DV tables pass

Does this PR introduce any user-facing changes?

No

harperjiang and others added 11 commits January 21, 2026 18:22
…-io#5813)


#### Which Delta project/connector is this regarding?

- [ ] Spark
- [ ] Standalone
- [x] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description
This PR introduces basic build support in `build.sbt` for the Flink
connector. It also introduces multiple interface classes that the Flink
connector can use to access Delta Kernel.
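
For a rough idea of what such build support looks like, here is a hypothetical `build.sbt` fragment. The module name, the `kernelApi` sub-project reference, the dependency, and the version are all assumptions, not the exact definitions this PR adds:

```scala
// Hypothetical build.sbt sketch; names and versions are illustrative.
val flinkVersion = "1.20.0"

lazy val flink = (project in file("flink"))
  .dependsOn(kernelApi) // let the connector access Delta Kernel APIs
  .settings(
    name := "delta-flink",
    libraryDependencies ++= Seq(
      "org.apache.flink" % "flink-core" % flinkVersion % "provided"
    )
  )
```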

## How was this patch tested?
Unit tests.

## Does this PR introduce _any_ user-facing changes?
No
delta-io#5813)" (delta-io#5868)

This reverts commit 52bc9d2.

#### Which Delta project/connector is this regarding?
- [ ] Spark
- [ ] Standalone
- [X] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

Reverting 52bc9d2. We want it re-merged with all the right approvals.

## How was this patch tested?

Just a revert.

## Does this PR introduce _any_ user-facing changes?

No.
This PR implements Phase 1 of DV support in the V2 connector with the following features:

1. DV metadata in PartitionedFile:
   - Convert Kernel DeletionVectorDescriptor to Spark format
   - Store serialized DV in otherConstantMetadataColumnValues

2. Schema augmentation:
   - Add __delta_internal_is_row_deleted column when table has DVs
   - DeltaParquetFileFormat generates this column using DV bitmap

3. DV filtering in connector:
   - DVFilteringIterator filters rows where is_row_deleted != 0
   - Projects out the DV column from output (clean data to Spark)

4. Phase 1 limitations (to be addressed in later phases):
   - Vectorized reader disabled (Phase 2 will add ColumnarBatch support)
   - File splitting disabled (Phase 3 will use _metadata.row_index)

The connector returns clean, already-filtered data; Spark plans never see DV columns.
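
To illustrate point 1 above, here is a hedged sketch of attaching a serialized DV to a `PartitionedFile` via `otherConstantMetadataColumnValues` (a real `PartitionedFile` field in Spark 3.4+). The metadata key and helper name are hypothetical, not this PR's exact code:

```scala
import org.apache.spark.paths.SparkPath
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.PartitionedFile

// Hypothetical helper: carry the serialized DV descriptor alongside the file
// split so the file format can materialize the is-row-deleted column.
def withDvMetadata(
    filePath: SparkPath,
    start: Long,
    length: Long,
    serializedDv: String): PartitionedFile = {
  PartitionedFile(
    partitionValues = InternalRow.empty,
    filePath = filePath,
    start = start,
    length = length,
    otherConstantMetadataColumnValues =
      Map("__delta_dv_descriptor" -> serializedDv) // hypothetical key name
  )
}
```
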
@huan233usc force-pushed the stack/dv_pr2_phase1_basic_read branch 7 times, most recently from d1528af to e299735, on January 22, 2026 01:34
@huan233usc force-pushed the stack/dv_pr2_phase1_basic_read branch from e299735 to 78715ca on January 22, 2026 03:54
@huan233usc marked this pull request as ready for review on January 22, 2026 05:05
@huan233usc requested review from gengliangwang, tdas, and zikangh, and removed the review requests for gengliangwang and zikangh, on January 22, 2026 05:05
@huan233usc (Collaborator, Author) commented:
@juliuszsompolski Can you also review this PR (and the stack)? All are DV related.

@juliuszsompolski (Contributor) commented:
Thanks, I will take a look! @andreaschat-db could you also review?
