
Spark: Remove extra columns for ColumnBatch #11551

Open · wants to merge 1 commit into main
Conversation

huaxingao (Contributor):

For equality deletes, we build a ColumnarBatchReader for the equality delete filter columns so we can read their values and determine which rows are deleted. If these filter columns are not among the requested columns, they are extra and should be removed before the ColumnarBatch is returned to Spark.

Suppose the table schema includes C1, C2, C3, C4, and C5, the query is SELECT C5 FROM table, and the equality delete filter is on C3 and C4. We read the values of C3 and C4 to identify which rows are deleted, but we do not want to include these columns in the ColumnarBatch returned to Spark.
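The removal described above can be sketched as follows. This is a minimal illustration in which a plain String array stands in for the batch's array of ColumnVectors; `trimExtraColumns` and the column layout are hypothetical, not Iceberg API:

```java
import java.util.Arrays;

public class TrimExtraColumnsSketch {
    // Keep only the first numRequested columns. Because the equality delete
    // filter columns were appended at the end of the requested schema,
    // dropping the tail removes exactly those extra columns.
    static String[] trimExtraColumns(String[] allColumns, int numRequested) {
        return Arrays.copyOf(allColumns, numRequested);
    }

    public static void main(String[] args) {
        // SELECT C5, with the delete filter columns C3 and C4 appended.
        String[] readColumns = {"C5", "C3", "C4"};
        String[] returned = trimExtraColumns(readColumns, 1);
        System.out.println(Arrays.toString(returned)); // prints [C5]
    }
}
```

The key assumption this relies on is stated in the review comment below: the extra columns always sit at the end of the schema, so a simple truncation suffices.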

@@ -622,6 +624,41 @@ public void testPosDeletesOnParquetFileWithMultipleRowGroups() throws IOExceptio
assertThat(rowSet(tblName, tbl, "*")).hasSize(193);
}

@TestTemplate
public void testEqualityDeleteWithDifferentScanAndDeleteColumns() throws IOException {
huaxingao (Contributor, Author):

This test is expected to pass even without the fix in this PR: currently, the extra columns returned to Spark do not cause any problems. However, with Comet native execution, Comet allocates arrays in a pre-allocated list and relies on the requested schema to determine the number of columns in the batch, so this test would fail without the fix proposed in this PR.

Comment on lines +56 to +57
// is 2. Since when creating the DeleteFilter, we append these extra columns in the end of the
// requested schema, we can just remove them from the end of the ColumnVector.
@singhpk234 (Contributor), Nov 14, 2024:


[doubt] Is it possible to fix this at the place where these extra columns are appended to the end of the requested schema? That would probably help us avoid the extra memory in the first place, as well as the expensive copy of the ColumnarBatch.

huaxingao (Contributor, Author):


Thanks for your comment!

The extra columns are appended to the requested schema in DeleteFilter.fileProjection. The values of these extra columns are read in ColumnarBatchReader and used in applyEqDelete to identify which rows are deleted. I remove the extra columns right after calling applyEqDelete.
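That ordering can be sketched end to end. The following is a toy model, not the Iceberg implementation: int arrays stand in for column vectors, and `applyEqDelete` is reduced to matching a single delete value on one filter column:

```java
import java.util.Arrays;

public class EqDeleteFlowSketch {
    // Toy applyEqDelete: a row is deleted when its filter-column value
    // matches the delete value. The real code compares rows against the
    // contents of equality delete files.
    static boolean[] applyEqDelete(int[] filterColumn, int deletedValue) {
        boolean[] isDeleted = new boolean[filterColumn.length];
        for (int row = 0; row < filterColumn.length; row++) {
            isDeleted[row] = filterColumn[row] == deletedValue;
        }
        return isDeleted;
    }

    public static void main(String[] args) {
        // Columns as read: requested column C5 first, filter column C3 last.
        int[][] columns = {{50, 51, 52}, {30, 31, 30}};
        boolean[] isDeleted = applyEqDelete(columns[1], 30);

        // Only after the filter has been evaluated can the extra column go.
        int[][] returned = Arrays.copyOf(columns, 1);
        System.out.println(Arrays.toString(isDeleted) + " "
            + Arrays.deepToString(returned));
        // prints [true, false, true] [[50, 51, 52]]
    }
}
```

The point of the sketch is the dependency: the filter column must stay in the batch until the deleted-row mask is computed, which is why the removal happens right after applyEqDelete rather than at read time.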

@singhpk234 (Contributor), Nov 18, 2024:


Thank you for the response!

[doubt] Considering that applyEqDelete anyway does another projection on top of the schema returned from DeleteFilter.fileProjection:

Schema deleteSchema = TypeUtil.select(requiredSchema, ids);

can we not add another parameter to fileProjection, like we did here, to include the additional fields based on a boolean flag, so that we get the columns we actually need in the first place and avoid removing the extra columns after filter evaluation?
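For reference, the projection quoted above picks fields by id. A toy analogue of that select-by-id semantics, assuming a schema reduced to an ordered id-to-name map (this is not Iceberg's TypeUtil, just an illustration):

```java
import java.util.LinkedHashMap;
import java.util.Set;

public class SelectByIdSketch {
    // Keep only the fields whose ids were requested, preserving schema order.
    static LinkedHashMap<Integer, String> select(
            LinkedHashMap<Integer, String> schema, Set<Integer> ids) {
        LinkedHashMap<Integer, String> projected = new LinkedHashMap<>();
        for (var entry : schema.entrySet()) {
            if (ids.contains(entry.getKey())) {
                projected.put(entry.getKey(), entry.getValue());
            }
        }
        return projected;
    }

    public static void main(String[] args) {
        LinkedHashMap<Integer, String> schema = new LinkedHashMap<>();
        schema.put(1, "C1");
        schema.put(3, "C3");
        schema.put(5, "C5");
        System.out.println(select(schema, Set.of(3, 5))); // prints {3=C3, 5=C5}
    }
}
```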

huaxingao (Contributor, Author):


Thanks for taking a look!
We traverse the schema and build a VectorizedReader for each column in VectorizedReaderBuilder, and this happens before DeleteFilter.fileProjection.

@huaxingao (Contributor, Author):

cc @flyrain @szehon-ho @viirya
