Flink Merge On Read Behavior? Equality & Positional Deletes #11535

Open
FranMorilloAWS opened this issue Nov 13, 2024 · 5 comments
Labels
question Further information is requested

Comments

@FranMorilloAWS

Query engine

Apache Flink

Question

Can somebody explain how delete files are implemented with Apache Flink? Spark only makes use of positional deletes, but Apache Flink seems to use both. Why would Flink need to use both?

@FranMorilloAWS FranMorilloAWS added the question Further information is requested label Nov 13, 2024
@pvary
Contributor

pvary commented Nov 13, 2024

The discussion here might help you: #10935 (comment)

@FranMorilloAWS
Author

Hi @pvary, it is still not clear in which scenarios Flink decides to write a positional delete or an equality delete when committing the snapshot. Also, I have seen snapshot commits that contain both. There is an Alibaba blog post that mentions this is done to avoid inconsistencies, but again it is not clear and not documented anywhere: https://www.alibabacloud.com/blog/how-to-analyze-cdc-data-in-iceberg-data-lake-using-flink_597838

@pvary
Contributor

pvary commented Nov 14, 2024

Equality delete:

  • Written when the ID is deleted for the first time during a checkpoint

Positional delete:

  • Written when a record with a given ID is inserted and then deleted during the same checkpoint
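
The two rules above can be sketched as a toy model of a single writer during one checkpoint. This is only an illustration of the rules as described, not Iceberg's or Flink's actual code: the class `CheckpointWriter`, its fields, and the file name are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CheckpointWriter:
    """Toy model of one writer during a single checkpoint (hypothetical, not Iceberg's API)."""
    data_file: str
    rows: list = field(default_factory=list)                # rows appended to the data file
    inserted_pos: dict = field(default_factory=dict)        # id -> position written in this checkpoint
    equality_deletes: list = field(default_factory=list)    # ids to delete by key
    positional_deletes: list = field(default_factory=list)  # (file, position) pairs

    def insert(self, row_id, value):
        # Remember where this ID was written so a later delete can point at it.
        self.inserted_pos[row_id] = len(self.rows)
        self.rows.append((row_id, value))

    def delete(self, row_id):
        pos = self.inserted_pos.pop(row_id, None)
        if pos is not None:
            # The row was written earlier in this checkpoint: its location is known.
            self.positional_deletes.append((self.data_file, pos))
        else:
            # The row lives in some older file that the writer has not read: delete by key.
            self.equality_deletes.append(row_id)


# An ID that already exists in the table is updated twice in one checkpoint.
# Each update is modelled as delete + insert.
writer = CheckpointWriter("data-1.parquet")
writer.delete(7)
writer.insert(7, "b")
writer.delete(7)
writer.insert(7, "c")
print(writer.equality_deletes)    # [7]
print(writer.positional_deletes)  # [('data-1.parquet', 0)]
```

In this model the first update produces an equality delete (the old row's location is unknown), and the second produces a positional delete (the row was just written, so its file and position are already in memory).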

@FranMorilloAWS
Author

Why do we need to use both? Is there an example scenario we can go over? Thanks in advance for answering :)

@pvary
Contributor

pvary commented Nov 14, 2024

Imagine a scenario where a specific ID is updated twice. An equality delete alone is not enough in this case to remove the outdated first record and keep the second record.
A positional delete is not enough on its own either, since we would need to find the data file and the specific row number to delete the record. In edge cases this would require a full table scan for every record...
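
To make the scenario concrete, here is a minimal read-side sketch. It assumes the rule from the Iceberg table spec that an equality delete only applies to data files with a strictly lower sequence number than the delete file, while a positional delete targets one exact (file, position). The function `read_table`, the file names, and the values are hypothetical.

```python
def read_table(data_files, equality_deletes, positional_deletes):
    """Return the live rows after applying delete files (toy model).

    data_files:         {file_name: (sequence_number, [(id, value), ...])}
    equality_deletes:   [(sequence_number, id), ...]
    positional_deletes: {(file_name, position), ...}
    """
    live = []
    for name, (file_seq, rows) in data_files.items():
        for pos, (row_id, value) in enumerate(rows):
            if (name, pos) in positional_deletes:
                continue  # removed by exact file/position
            if any(row_id == del_id and file_seq < del_seq
                   for del_seq, del_id in equality_deletes):
                continue  # removed by key, only in strictly older data files
            live.append((row_id, value))
    return live


# ID 7 was written as "a" in an earlier snapshot (sequence 1), then updated
# twice within one checkpoint (sequence 2): first to "b", then to "c".
data_files = {
    "old.parquet": (1, [(7, "a")]),
    "new.parquet": (2, [(7, "b"), (7, "c")]),
}
eq = [(2, 7)]                # equality delete on id 7, written in sequence 2
pos = {("new.parquet", 0)}   # positional delete for the intermediate "b"

print(read_table(data_files, eq, pos))    # [(7, 'c')]
print(read_table(data_files, eq, set()))  # [(7, 'b'), (7, 'c')]
print(read_table(data_files, [], pos))    # [(7, 'a'), (7, 'c')]
```

With only the equality delete, the intermediate "b" survives, because it sits in the same commit as the delete. With only the positional delete, the old "a" survives unless the writer first scans the table to locate it. Only the combination leaves the single row "c".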
