
Spark: Write DVs for V3 MoR tables #11561

Open
wants to merge 1 commit into base: main
Conversation

amogh-jahagirdar (Contributor)
No description provided.

@amogh-jahagirdar (Contributor Author)

There are still some failing tests, and I'm figuring out a good pattern to extend the existing Delete/Merge/Update tests to run against DVs. Something also worth thinking about for V3 is preventing partition delete granularity from being set, since it's at odds with DVs.
We could also take the stance that when an upgrade to V3 is performed, we invalidate the delete granularity property completely, since in V3 we should only be writing DVs and not position deletes (the granularity property is really only relevant for V2 position deletes).
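That invalidation stance could be sketched as follows. This is a hypothetical helper, not actual Iceberg code, and the string values stand in for Iceberg's delete granularity settings:

```java
// Hypothetical helper (not actual Iceberg code): once a table reaches format
// version 3, any configured delete granularity is ignored, since the writers
// only produce DVs, which are inherently file-scoped.
class DeleteGranularityResolver {
  static String effectiveGranularity(int formatVersion, String configured) {
    return formatVersion >= 3 ? "file" : configured;
  }
}
```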

/**
 * PartitioningDVWriter is a PartitioningWriter implementation which writes DVs for a given file
 * position.
 */
public class PartitioningDVWriter<T>
Contributor Author

I wrote this to avoid changing quite a bit of machinery in SparkPositionDeltaWrite, which relies on PartitioningWriter. With this adapter we can just return this writer at a single point, and the existing delegate logic in SparkPositionDeltaWrite just works (line 495 in SparkPositionDeltaWrite). Another advantage is that PartitioningWriter is a long-standing interface which users' existing code could have been written against, so this implementation can just fit that interface.
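To make the adapter idea concrete, here is a minimal self-contained sketch. The PositionDelete record and PartitioningWriter interface below are simplified stand-ins for illustration only; the real Iceberg types (in org.apache.iceberg.deletes and org.apache.iceberg.io) have richer signatures.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Simplified stand-in types for illustration; not the real Iceberg interfaces.
record PositionDelete(String path, long pos) {}

interface PartitioningWriter<T> {
  void write(T row, String partition);
}

// Adapter: exposes the PartitioningWriter interface that the delta-write
// machinery delegates to, but internally aggregates deleted row positions per
// data file (a deletion vector) rather than writing partition-scoped delete files.
class PartitioningDVWriter implements PartitioningWriter<PositionDelete> {
  private final Map<String, Set<Long>> dvByFile = new HashMap<>();

  @Override
  public void write(PositionDelete delete, String partition) {
    // the partition argument is ignored: DVs are keyed by data file, not partition
    dvByFile.computeIfAbsent(delete.path(), p -> new TreeSet<>()).add(delete.pos());
  }

  Map<String, Set<Long>> result() {
    return dvByFile;
  }
}
```

Because the adapter satisfies the same interface the delta write already delegates to, no caller has to branch on format version.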

Contributor Author

Alternatively, we'd have to update SparkPositionDeltaWrite's DeleteOnlyDataWriter and DataAndDeleteWriter to take in a function as a delegate (instead of a writer as a delegate), where the function either uses the existing DV writer for V3 tables or goes to the fanout writers. But this just makes it more complicated.
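For contrast, a rough sketch of that rejected alternative, with hypothetical names that do not match the actual Iceberg classes: the writer receives a factory function and has to resolve the delegate itself.

```java
import java.util.function.IntFunction;

// Hypothetical sketch of the rejected alternative (names are illustrative,
// not the actual Iceberg classes): instead of being handed a concrete writer,
// the delta writer takes a function and resolves its delegate per format version.
class DeleteOnlyDataWriter<T> {
  interface Writer<T> {
    void write(T row);
  }

  private final Writer<T> delegate;

  // the function picks a DV writer for V3+ tables, or a fanout
  // position-delete writer otherwise
  DeleteOnlyDataWriter(IntFunction<Writer<T>> writerForVersion, int formatVersion) {
    this.delegate = writerForVersion.apply(formatVersion);
  }

  void write(T row) {
    delegate.write(row);
  }
}
```

This pushes version-awareness into every writer class, which is the extra complication the adapter avoids.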

Contributor

I think having this class here makes a lot of sense, and I've reused it for #11545.

@github-actions github-actions bot removed the MR label Nov 20, 2024
@amogh-jahagirdar amogh-jahagirdar force-pushed the spark-write-dv-files branch 2 times, most recently from 5e8e727 to 0f00dfc on November 20, 2024 18:03
@amogh-jahagirdar amogh-jahagirdar changed the title (WIP) Write DVs in Spark for V3 tables Spark: Write DVs for V3 tables Nov 20, 2024
@amogh-jahagirdar amogh-jahagirdar changed the title Spark: Write DVs for V3 tables Spark: Write DVs for V3 MoR tables Nov 20, 2024
@amogh-jahagirdar amogh-jahagirdar marked this pull request as ready for review November 20, 2024 18:42
@@ -521,7 +528,10 @@ public void deleteSingleRecordProducesDeleteOperation() throws NoSuchTableExcept
} else {
// this is a RowDelta that produces a "delete" instead of "overwrite"
validateMergeOnRead(currentSnapshot, "1", "1", null);
validateProperty(currentSnapshot, ADD_POS_DELETE_FILES_PROP, "1");
TableOperations ops = ((HasTableOperations) table).operations();
Contributor

I don't think we need ops to get the version. The superclass already has formatVersion defined, so we can just examine that field.

validateProperty(currentSnapshot, ADD_POS_DELETE_FILES_PROP, "1");
TableOperations ops = ((HasTableOperations) table).operations();
String property =
ops.current().formatVersion() >= 3 ? ADDED_DVS_PROP : ADD_POS_DELETE_FILES_PROP;
Contributor

Suggested change:
-    ops.current().formatVersion() >= 3 ? ADDED_DVS_PROP : ADD_POS_DELETE_FILES_PROP;
+    formatVersion >= 3 ? ADDED_DVS_PROP : ADD_POS_DELETE_FILES_PROP;

@@ -397,4 +450,9 @@ protected void assertAllBatchScansVectorized(SparkPlan plan) {
List<SparkPlan> batchScans = SparkPlanUtil.collectBatchScans(plan);
assertThat(batchScans).hasSizeGreaterThan(0).allMatch(SparkPlan::supportsColumnar);
}

protected int formatVersion(Table table) {
Contributor

I don't think this is necessary anymore.
