Spark: Remove extra columns for ColumnBatch #11551
base: main
Conversation
@@ -622,6 +624,41 @@ public void testPosDeletesOnParquetFileWithMultipleRowGroups() throws IOException
    assertThat(rowSet(tblName, tbl, "*")).hasSize(193);
  }

  @TestTemplate
  public void testEqualityDeleteWithDifferentScanAndDeleteColumns() throws IOException {
This test is expected to pass even without the fix provided by this PR: currently the extra columns returned to Spark do not cause any problems. However, with Comet native execution this test would fail without the fix, because Comet allocates arrays in a pre-allocated list and relies on the requested schema to determine the number of columns in the batch.
// is 2. Since when creating the DeleteFilter, we append these extra columns in the end of the
// requested schema, we can just remove them from the end of the ColumnVector.
[doubt] Is it possible to fix this at the place where these extra columns are appended to the end of the requested schema? That would probably help us avoid the extra memory in the first place and the expensive copy of the ColumnarBatch.
Thanks for your comment!
The extra columns are appended to the requested schema in DeleteFilter.fileProjection. The values of these extra columns are read in ColumnarBatchReader and used to identify which rows are deleted in applyEqDelete. I remove the extra columns right after calling applyEqDelete.
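For illustration, a minimal sketch of that last step, assuming the equality-delete columns are always appended after the requested columns; the helper name and its exact integration with ColumnarBatchReader/applyEqDelete are assumptions, not the actual code in this PR:

import org.apache.spark.sql.vectorized.ColumnVector;
import org.apache.spark.sql.vectorized.ColumnarBatch;

class TrimSketch {
  // Keep only the first numRequestedColumns vectors; the trailing vectors are the
  // equality-delete filter columns appended by DeleteFilter.fileProjection.
  static ColumnarBatch removeExtraColumns(ColumnVector[] vectors, int numRequestedColumns, int numRows) {
    ColumnVector[] requested = new ColumnVector[numRequestedColumns];
    System.arraycopy(vectors, 0, requested, 0, numRequestedColumns);
    ColumnarBatch batch = new ColumnarBatch(requested);
    batch.setNumRows(numRows);
    return batch;
  }
}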
Thank you for the response!
[doubt] Considering that applyEqDelete anyway does another projection on top of the schema returned from DeleteFilter.fileProjection:
Schema deleteSchema = TypeUtil.select(requiredSchema, ids);
can we not add another param to fileProjection, like we did here to include additional fields based on the boolean flag:
boolean needRowPosCol) {
so that we get only the columns we actually need in the first place, and avoid removing extra columns after filter evaluation?
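To make the suggestion concrete, a hypothetical sketch of what such a flag could look like; the method name, parameters, and the includeEqDeleteCols flag are assumptions for illustration and do not reflect the actual DeleteFilter.fileProjection signature:

import java.util.Set;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.TypeUtil;

class ProjectionSketch {
  // Hypothetical: append the equality-delete columns only when the caller asks for them.
  // For simplicity, eqDeleteFieldIds are assumed to be disjoint from the requested columns.
  static Schema fileProjection(Schema tableSchema, Schema requestedSchema,
      Set<Integer> eqDeleteFieldIds, boolean includeEqDeleteCols) {
    if (!includeEqDeleteCols) {
      return requestedSchema; // only the columns the query selected
    }
    // Mirrors what happens today: the filter columns are joined onto the requested
    // schema so their values can be read.
    return TypeUtil.join(requestedSchema, TypeUtil.select(tableSchema, eqDeleteFieldIds));
  }
}

Note that, as the reply below explains, the filter columns still need to be read to evaluate the equality deletes, so the question is really where those columns get dropped.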
Thanks for taking a look!
We traverse the schema and build a VectorizedReader for each column in VectorizedReaderBuilder; this happens before DeleteFilter.fileProjection.
For equality deletes, we build the ColumnarBatchReader with the equality-delete filter columns included so that we can read their values and determine which rows are deleted. If these filter columns are not among the requested columns, they are considered extra and should be removed before returning the ColumnarBatch to Spark.
Suppose the table schema includes C1, C2, C3, C4, C5, the query is SELECT C5 FROM table, and the equality delete filter is on C3 and C4. We read the values of C3 and C4 to identify which rows are deleted, but we do not want to include those values in the ColumnarBatch that we return to Spark.
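A toy illustration of this example, assuming the layout described above (the two filter columns appended after the requested column); the class and column types are hypothetical and only show that the batch handed to Spark ends up with a single column:

import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.vectorized.ColumnVector;
import org.apache.spark.sql.vectorized.ColumnarBatch;

class TrimExample {
  public static void main(String[] args) {
    int numRows = 3;
    // Columns materialized by the reader for SELECT C5 with an equality delete on C3, C4:
    ColumnVector[] readColumns = new ColumnVector[] {
        new OnHeapColumnVector(numRows, DataTypes.LongType), // C5 (requested by the query)
        new OnHeapColumnVector(numRows, DataTypes.LongType), // C3 (eq-delete filter only)
        new OnHeapColumnVector(numRows, DataTypes.LongType)  // C4 (eq-delete filter only)
    };
    // After the delete filter has consumed C3/C4, only the requested column goes to Spark.
    ColumnarBatch toSpark = new ColumnarBatch(new ColumnVector[] {readColumns[0]});
    toSpark.setNumRows(numRows);
    System.out.println(toSpark.numCols()); // prints 1: only C5 is returned
  }
}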