
Conversation

@huan233usc (Collaborator) commented Jan 5, 2026

🥞 Stacked PR

Use this link to review incremental changes.


Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

Description

Add basic deletion vector (DV) read support for the Spark V2 connector using row-based filtering.

Changes:

  • DvSchemaContext: POJO that holds the DV schema context (column indices, output schema); a sketch follows below
  • DeletionVectorReadFunction: Wraps the base reader to filter out deleted rows and project out the DV column
  • PartitionUtils: Creates a DV-aware PartitionReaderFactory with DeltaParquetFileFormatV2
  • Add serializeToBase64() to Kernel's DeletionVectorDescriptor
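
For illustration, here is a minimal Scala sketch of what a schema-context helper like this could look like. The column name `__delta_internal_is_row_deleted` comes from this PR; the class shape and member names are assumptions, not the actual code:

```scala
import org.apache.spark.sql.types.StructType

// Hypothetical sketch of a DV schema context; the real DvSchemaContext in
// this PR may differ in shape and naming.
case class DvSchemaContext(readSchema: StructType) {
  private val dvColumnName = "__delta_internal_is_row_deleted"

  // Index of the internal DV column within the read schema
  // (fieldIndex throws if the column is absent).
  val dvColumnIndex: Int = readSchema.fieldIndex(dvColumnName)

  // Output schema with the internal DV column projected out.
  val outputSchema: StructType =
    StructType(readSchema.fields.filterNot(_.name == dvColumnName))
}
```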

How it works:

  1. Add the __delta_internal_is_row_deleted column to the read schema
  2. Filter out rows whose DV column value != 0 (i.e., deleted rows)
  3. Project the DV column out of the output (see the sketch after this list)
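
A rough sketch of steps 2 and 3 (row-based filtering, then projection). The names here are illustrative rather than the PR's actual DeletionVectorReadFunction, and it assumes the DV flag is materialized as a ByteType column:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
import org.apache.spark.sql.types.StructType

// Drop rows flagged as deleted, then copy every field except the DV column
// into a new row so downstream operators never see the internal column.
def filterAndProject(
    rows: Iterator[InternalRow],
    readSchema: StructType,
    dvIdx: Int): Iterator[InternalRow] = {
  val fieldTypes = readSchema.fields.map(_.dataType)
  rows
    // Step 2: keep only rows whose DV flag is 0 ("not deleted").
    .filter(_.getByte(dvIdx) == 0)
    // Step 3: project out the DV column.
    .map { row =>
      val values = new Array[Any](fieldTypes.length - 1)
      var i = 0
      var j = 0
      while (i < fieldTypes.length) {
        if (i != dvIdx) {
          values(j) = row.get(i, fieldTypes(i))
          j += 1
        }
        i += 1
      }
      new GenericInternalRow(values)
    }
}
```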

How was this patch tested?

  • DvSchemaContextTest: Unit tests for schema manipulation
  • DeletionVectorReadFunctionTest: Unit tests for row filtering and projection
  • Golden table tests with DV tables pass

Does this PR introduce any user-facing changes?

No

harperjiang and others added 11 commits January 21, 2026 18:22
…-io#5813)


#### Which Delta project/connector is this regarding?

- [ ] Spark
- [ ] Standalone
- [x] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description
This PR introduces basic build support in `build.sbt` for the Flink
connector. It also introduces multiple interface classes that the Flink
connector can use to access Delta Kernel.
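
For a rough idea of what such build support looks like, here is a hypothetical `build.sbt` fragment. The module name, the `kernelApi` sub-project reference, the dependency, and the version are all assumptions, not the exact definitions this PR adds:

```scala
// Hypothetical build.sbt sketch; names and versions are illustrative.
val flinkVersion = "1.20.0"

lazy val flink = (project in file("flink"))
  .dependsOn(kernelApi) // let the connector access Delta Kernel APIs
  .settings(
    name := "delta-flink",
    libraryDependencies ++= Seq(
      "org.apache.flink" % "flink-core" % flinkVersion % "provided"
    )
  )
```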

## How was this patch tested?
Unit tests.

## Does this PR introduce _any_ user-facing changes?
No
delta-io#5813)" (delta-io#5868)

This reverts commit 52bc9d2.

#### Which Delta project/connector is this regarding?
- [ ] Spark
- [ ] Standalone
- [X] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

Reverting 52bc9d2. We want it re-merged with all the right approvals.

## How was this patch tested?

Just a revert.

## Does this PR introduce _any_ user-facing changes?

No.
This PR implements Phase 1 of DV support in the V2 connector with the following features:

1. DV metadata in PartitionedFile:
   - Convert Kernel DeletionVectorDescriptor to Spark format
   - Store serialized DV in otherConstantMetadataColumnValues

2. Schema augmentation:
   - Add __delta_internal_is_row_deleted column when table has DVs
   - DeltaParquetFileFormat generates this column using DV bitmap

3. DV filtering in connector:
   - DVFilteringIterator filters rows where is_row_deleted != 0
   - Projects out the DV column from output (clean data to Spark)

4. Phase 1 limitations (to be addressed in later phases):
   - Vectorized reader disabled (Phase 2 will add ColumnarBatch support)
   - File splitting disabled (Phase 3 will use _metadata.row_index)

The connector returns clean, already-filtered data; Spark plans never see DV columns.
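
To illustrate point 1 above, here is a hedged sketch of attaching a serialized DV to a `PartitionedFile` via `otherConstantMetadataColumnValues` (a real `PartitionedFile` field in Spark 3.4+). The metadata key and helper name are hypothetical, not this PR's exact code:

```scala
import org.apache.spark.paths.SparkPath
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.PartitionedFile

// Hypothetical helper: carry the serialized DV descriptor alongside the file
// split so the file format can materialize the is-row-deleted column.
def withDvMetadata(
    filePath: SparkPath,
    start: Long,
    length: Long,
    serializedDv: String): PartitionedFile = {
  PartitionedFile(
    partitionValues = InternalRow.empty,
    filePath = filePath,
    start = start,
    length = length,
    otherConstantMetadataColumnValues =
      Map("__delta_dv_descriptor" -> serializedDv) // hypothetical key name
  )
}
```
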
@huan233usc force-pushed the stack/dv_pr2_phase1_basic_read branch 7 times, most recently from d1528af to e299735, on January 22, 2026 01:34
@huan233usc force-pushed the stack/dv_pr2_phase1_basic_read branch from e299735 to 78715ca on January 22, 2026 03:54
@huan233usc marked this pull request as ready for review on January 22, 2026 05:05
@huan233usc requested review from gengliangwang, tdas, and zikangh, and removed the review requests for gengliangwang and zikangh, on January 22, 2026 05:05
@huan233usc (Collaborator, Author) commented:
@juliuszsompolski Can you also review this PR (and the stack)? All are DV related.

@juliuszsompolski (Contributor) commented:
Thanks, I will take a look! @andreaschat-db could you also review?
