
Spark: Write DVs for V3 MoR tables #11561

Open
wants to merge 1 commit into base: main
Conversation

amogh-jahagirdar (Contributor)
No description provided.

@amogh-jahagirdar (Contributor Author)

There are still some failing tests, and I'm figuring out a good pattern to extend the existing Delete/Merge/Update tests to run against DVs. Something also worth thinking about for V3 is preventing partition delete granularity from being set, since it's at odds with DVs.
We could also take the stance that when an upgrade to V3 is performed, we invalidate the delete granularity property completely, since in V3 we should only be writing DVs and not position deletes (the granularity property is really only relevant for V2 position deletes).
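That invalidation stance could be sketched as follows. This is a hypothetical helper, not actual Iceberg code, and the string values stand in for Iceberg's delete granularity settings:

```java
// Hypothetical helper (not actual Iceberg code): once a table reaches format
// version 3, any configured delete granularity is ignored, since the writers
// only produce DVs, which are inherently file-scoped.
class DeleteGranularityResolver {
  static String effectiveGranularity(int formatVersion, String configured) {
    return formatVersion >= 3 ? "file" : configured;
  }
}
```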

/**
 * PartitioningDVWriter is a PartitioningWriter implementation which writes DVs for a given file
 * position.
 */
public class PartitioningDVWriter<T>
Contributor Author

I wrote this to avoid changing quite a bit of machinery in SparkPositionDeltaWrite, which relies on PartitioningWriter. With this adapter we can just return this writer at a single point, and the existing delegate logic in SparkPositionDeltaWrite just works (line 495 in SparkPositionDeltaWrite). Another advantage is that PartitioningWriter is a long-standing interface which users' existing code could have been written against, so this implementation can just fit that interface.
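To make the adapter idea concrete, here is a minimal self-contained sketch. The PositionDelete record and PartitioningWriter interface below are simplified stand-ins for illustration only; the real Iceberg types (in org.apache.iceberg.deletes and org.apache.iceberg.io) have richer signatures.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Simplified stand-in types for illustration; not the real Iceberg interfaces.
record PositionDelete(String path, long pos) {}

interface PartitioningWriter<T> {
  void write(T row, String partition);
}

// Adapter: exposes the PartitioningWriter interface that the delta-write
// machinery delegates to, but internally aggregates deleted row positions per
// data file (a deletion vector) rather than writing partition-scoped delete files.
class PartitioningDVWriter implements PartitioningWriter<PositionDelete> {
  private final Map<String, Set<Long>> dvByFile = new HashMap<>();

  @Override
  public void write(PositionDelete delete, String partition) {
    // the partition argument is ignored: DVs are keyed by data file, not partition
    dvByFile.computeIfAbsent(delete.path(), p -> new TreeSet<>()).add(delete.pos());
  }

  Map<String, Set<Long>> result() {
    return dvByFile;
  }
}
```

Because the adapter satisfies the same interface the delta write already delegates to, no caller has to branch on format version.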

Contributor Author

Alternatively, we'd have to update SparkPositionDeltaWrite's DeleteOnlyDataWriter and DataAndDeleteWriter to take in a function as a delegate (instead of a writer as a delegate), where the function either uses the existing DV writer for V3 tables or goes to the fanout writers. But this just makes it more complicated.
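For contrast, a rough sketch of that rejected alternative, with hypothetical names that do not match the actual Iceberg classes: the writer receives a factory function and has to resolve the delegate itself.

```java
import java.util.function.IntFunction;

// Hypothetical sketch of the rejected alternative (names are illustrative,
// not the actual Iceberg classes): instead of being handed a concrete writer,
// the delta writer takes a function and resolves its delegate per format version.
class DeleteOnlyDataWriter<T> {
  interface Writer<T> {
    void write(T row);
  }

  private final Writer<T> delegate;

  // the function picks a DV writer for V3+ tables, or a fanout
  // position-delete writer otherwise
  DeleteOnlyDataWriter(IntFunction<Writer<T>> writerForVersion, int formatVersion) {
    this.delegate = writerForVersion.apply(formatVersion);
  }

  void write(T row) {
    delegate.write(row);
  }
}
```

This pushes version-awareness into every writer class, which is the extra complication the adapter avoids.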

Contributor

I think having this class here makes a lot of sense, and I've reused it for #11545.

@github-actions github-actions bot removed the MR label Nov 20, 2024
@amogh-jahagirdar amogh-jahagirdar force-pushed the spark-write-dv-files branch 2 times, most recently from 5e8e727 to 0f00dfc on November 20, 2024 18:03
@amogh-jahagirdar amogh-jahagirdar changed the title (WIP) Write DVs in Spark for V3 tables Spark: Write DVs for V3 tables Nov 20, 2024
@amogh-jahagirdar amogh-jahagirdar changed the title Spark: Write DVs for V3 tables Spark: Write DVs for V3 MoR tables Nov 20, 2024
@amogh-jahagirdar amogh-jahagirdar marked this pull request as ready for review November 20, 2024 18:42
@@ -521,7 +528,10 @@ public void deleteSingleRecordProducesDeleteOperation() throws NoSuchTableExcept
} else {
// this is a RowDelta that produces a "delete" instead of "overwrite"
validateMergeOnRead(currentSnapshot, "1", "1", null);
validateProperty(currentSnapshot, ADD_POS_DELETE_FILES_PROP, "1");
TableOperations ops = ((HasTableOperations) table).operations();
Contributor

I don't think we need ops to get the version. The superclass already has formatVersion defined, so we can just examine that field.

validateProperty(currentSnapshot, ADD_POS_DELETE_FILES_PROP, "1");
TableOperations ops = ((HasTableOperations) table).operations();
String property =
ops.current().formatVersion() >= 3 ? ADDED_DVS_PROP : ADD_POS_DELETE_FILES_PROP;
Contributor

Suggested change:
-    ops.current().formatVersion() >= 3 ? ADDED_DVS_PROP : ADD_POS_DELETE_FILES_PROP;
+    formatVersion >= 3 ? ADDED_DVS_PROP : ADD_POS_DELETE_FILES_PROP;

@@ -397,4 +450,9 @@ protected void assertAllBatchScansVectorized(SparkPlan plan) {
List<SparkPlan> batchScans = SparkPlanUtil.collectBatchScans(plan);
assertThat(batchScans).hasSizeGreaterThan(0).allMatch(SparkPlan::supportsColumnar);
}

protected int formatVersion(Table table) {
Contributor

I don't think this is necessary anymore.
