-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core: Add TableUtil to provide access to a table's format version #11620
base: main
Are you sure you want to change the base?
Conversation
@@ -120,7 +120,7 @@ private FileIO fileIO(Table table) { | |||
return table.io(); | |||
} | |||
|
|||
private Table lazyTable() { | |||
protected Table lazyTable() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
protected
is for test purpose. Should we add @VisibleForTesting
or similar (just as indication) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this method isn't used for test purposes, but rather to get access to the underlying table in the subclass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes, right. I misread. Good to me then.
* BaseMetadataTable} do not have a format version. The only exception to this is {@link | ||
* PositionDeletesTable}, where the format version is fetched from the underlying table. | ||
*/ | ||
public static int formatVersion(Table table) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with this approach. I just wonder if we should not define the format version just as table property (and set when using delete vectors for instance). It's more "hardcoded" but probably easier to evaluate (and fallback to V2 if the table property is not present).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would strongly prefer we keep it out of table properties. IMHO, Table properties are user facing hints or guides and the system should not rely on them for correct behavior. Having the correct format version is important for correct behavior so I would keep it out of properties.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, fair enough (good point about mutable/user facing aspect). I'm just a little "concerned" about the "complexity" of this method to get the format version (it remembers me the hack we did in JDBC Catalog 😄 ).
|
||
// being able to read the format version from the PositionDeletesTable is mainly needed in | ||
// SparkPositionDeletesRewrite when determining whether to rewrite V2 position deletes to DVs | ||
if (table instanceof BaseMetadataTable) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wish we had scala or even kotlin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the reasons I want scala is I really want to just enumerate cases here. I would recommend we just go through all of our cases narrower then broader if we have exceptions so
If (PositionDeleteTable)
return format version
else if (MetaTable) {
Sorry Brah
}
else if (HasTableOperations) {
return format version
}
else {
// Sorry Brah
}
Although I am curious why position delete table needs a format version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although I am curious why position delete table needs a format version?
@RussellSpitzer this is mainly for SparkPositionDeletesRewrite
(which operates against the position deletes table). Basically when rewriting existing position deletes we need to know whether we need to rewrite them to V2 position deletes or to DVs by looking at the underlying format version of the table. A table that was upgraded to V3 can still have V2 position deletes, meaning that these would then have to be rewritten as DVs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we just be looking at the underlying table then? Shouldn't the converter look at the base table rather than the metadata table?
Ie
formatVersion(metadataTable.baseTable)
Rather than
formatVersion(metadataTable)
Implicitly calling metadataTable.baseTable
but only sometimes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend we just go through all of our cases narrower then broader if we have exceptions so
I would prefer that too, but the fact that SerializableMetadataTable
is a subclass of SerializableTable
which in turn implements HasTableOperations
makes this more difficult and you still need to differentiate there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we just be looking at the underlying table then? Shouldn't the converter look at the base table rather than the metadata table?
I'm not fully sure I follow your comment. Do you mean the calling site should first check whether it's a metadata table before calling TableUtil.formatVersion(...)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. But I also don't think the caller should need to check. Shouldn't the caller know what it's doing? Like if it is compacting DeleteVectors it knows it has a DeleteVectorMetadataTable and therefor it uses the parent table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that would require changing a bunch of more places, because effectively we have a Broadcast<Table>
in SparkPositionDeletesRewrite
:
Lines 130 to 131 in 8351248
Broadcast<Table> tableBroadcast = | |
sparkContext.broadcast(SerializableTableWithSize.copyOf(table)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a an argument for ease of current implementation. But do you think it's the right decision going forward to have a format version for position deletes metadata table and none of the other metadata tables?
This is an alternative impl to #11587