Spark: Remove extra columns for ColumnBatch #11551
base: main
Conversation
@@ -622,6 +624,41 @@ public void testPosDeletesOnParquetFileWithMultipleRowGroups() throws IOException
    assertThat(rowSet(tblName, tbl, "*")).hasSize(193);
  }

  @TestTemplate
  public void testEqualityDeleteWithDifferentScanAndDeleteColumns() throws IOException {
This test is expected to pass even without the fix provided by this PR: currently the extra columns returned to Spark do not cause any problems. However, with Comet native execution this test would fail without the fix, because Comet allocates arrays in a pre-allocated list and relies on the requested schema to determine the number of columns in the batch.
// is 2. Since when creating the DeleteFilter, we append these extra columns in the end of the
// requested schema, we can just remove them from the end of the ColumnVector.
[doubt] Is it possible to fix this at the place where these extra columns are appended to the end of the requested schema? That would probably help us avoid the extra memory in the first place and the expensive copy of the ColumnarBatch.
Thanks for your comment!
The extra columns are appended to the requested schema in DeleteFilter.fileProjection. The values of these extra columns are read in ColumnarBatchReader and used to identify which rows are deleted in applyEqDelete. I remove the extra columns right after calling applyEqDelete.
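For illustration, a minimal sketch of that last step, assuming the equality-delete columns are always appended after the requested columns; the helper name and its exact integration with ColumnarBatchReader/applyEqDelete are assumptions, not the actual code in this PR:

import org.apache.spark.sql.vectorized.ColumnVector;
import org.apache.spark.sql.vectorized.ColumnarBatch;

class TrimSketch {
  // Keep only the first numRequestedColumns vectors; the trailing vectors are the
  // equality-delete filter columns appended by DeleteFilter.fileProjection.
  static ColumnarBatch removeExtraColumns(ColumnVector[] vectors, int numRequestedColumns, int numRows) {
    ColumnVector[] requested = new ColumnVector[numRequestedColumns];
    System.arraycopy(vectors, 0, requested, 0, numRequestedColumns);
    ColumnarBatch batch = new ColumnarBatch(requested);
    batch.setNumRows(numRows);
    return batch;
  }
}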
Thank you for the response!
[doubt] Considering that applyEqDelete anyway does another projection on top of the schema returned from DeleteFilter.fileProjection:
Schema deleteSchema = TypeUtil.select(requiredSchema, ids);
can we not add another param to fileProjection, like we did here to include additional fields based on the boolean flag:
boolean needRowPosCol) {
so that we get only the columns we actually need in the first place, and avoid removing extra columns after filter evaluation?
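To make the suggestion concrete, a hypothetical sketch of what such a flag could look like; the method name, parameters, and the includeEqDeleteCols flag are assumptions for illustration and do not reflect the actual DeleteFilter.fileProjection signature:

import java.util.Set;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.TypeUtil;

class ProjectionSketch {
  // Hypothetical: append the equality-delete columns only when the caller asks for them.
  // For simplicity, eqDeleteFieldIds are assumed to be disjoint from the requested columns.
  static Schema fileProjection(Schema tableSchema, Schema requestedSchema,
      Set<Integer> eqDeleteFieldIds, boolean includeEqDeleteCols) {
    if (!includeEqDeleteCols) {
      return requestedSchema; // only the columns the query selected
    }
    // Mirrors what happens today: the filter columns are joined onto the requested
    // schema so their values can be read.
    return TypeUtil.join(requestedSchema, TypeUtil.select(tableSchema, eqDeleteFieldIds));
  }
}

Note that, as the reply below explains, the filter columns still need to be read to evaluate the equality deletes, so the question is really where those columns get dropped.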
Thanks for taking a look!
We traverse the schema and build a VectorizedReader for each column in VectorizedReaderBuilder; this happens before DeleteFilter.fileProjection.
For equality deletes, we build the ColumnarBatchReader with the equality-delete filter columns included so that we can read their values and determine which rows are deleted. If these filter columns are not among the requested columns, they are considered extra and should be removed before returning the ColumnarBatch to Spark.
Suppose the table schema includes C1, C2, C3, C4, C5, the query is SELECT C5 FROM table, and the equality delete filter is on C3 and C4. We read the values of C3 and C4 to identify which rows are deleted, but we do not want to include those values in the ColumnarBatch that we return to Spark.
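A toy illustration of this example, assuming the layout described above (the two filter columns appended after the requested column); the class and column types are hypothetical and only show that the batch handed to Spark ends up with a single column:

import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.vectorized.ColumnVector;
import org.apache.spark.sql.vectorized.ColumnarBatch;

class TrimExample {
  public static void main(String[] args) {
    int numRows = 3;
    // Columns materialized by the reader for SELECT C5 with an equality delete on C3, C4:
    ColumnVector[] readColumns = new ColumnVector[] {
        new OnHeapColumnVector(numRows, DataTypes.LongType), // C5 (requested by the query)
        new OnHeapColumnVector(numRows, DataTypes.LongType), // C3 (eq-delete filter only)
        new OnHeapColumnVector(numRows, DataTypes.LongType)  // C4 (eq-delete filter only)
    };
    // After the delete filter has consumed C3/C4, only the requested column goes to Spark.
    ColumnarBatch toSpark = new ColumnarBatch(new ColumnVector[] {readColumns[0]});
    toSpark.setNumRows(numRows);
    System.out.println(toSpark.numCols()); // prints 1: only C5 is returned
  }
}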