Hive: Optimize tableExists API in hive catalog #11597
Skip creation of the Hive table operations when checking the existence of an Iceberg table in the Hive catalog
FYI @szehon-ho and @haizhou-zhao if you are interested
Thx @dramaticlly , nice work
@@ -386,6 +386,12 @@ public void testHiveTableAndIcebergTableWithSameName(TableType tableType)

assertThat(catalog.tableExists(TABLE_IDENTIFIER)).isTrue();
HIVE_METASTORE_EXTENSION.metastoreClient().dropTable(DB_NAME, hiveTableName);
We usually try to create a different test for every feature. In this case, if a test fails, it is easier to understand what went wrong.
I agree; however, the change I propose here does not introduce a new feature. I am trying to simplify the existing flow, and I found that the existing test coverage is sufficient to verify the existence check for an Iceberg table in HiveCatalog. If you have specific recommendations for an additional test, I am happy to create a new one!
I would prefer to have a separate test for the tableExists.
@Test
public void testTableExists() throws TException {
  [..]
  // create an iceberg table after hive table dropped
  catalog.createTable(identifier, SCHEMA, PartitionSpec.unpartitioned());
  assertThat(catalog.tableExists(identifier)).isTrue();
  catalog.dropTable(identifier, true);
  assertThat(catalog.tableExists(identifier)).isFalse();
  [..]
}
Yeah, I agree with you on having a different test for a new feature. But in this particular case, it is already part of the dropTable tests in
iceberg/core/src/test/java/org/apache/iceberg/catalog/CatalogTests.java
Lines 838 to 845 in a52afdc
assertThat(catalog.tableExists(TABLE)).as("Table should not exist before create").isFalse();
catalog.buildTable(TABLE, SCHEMA).create();
assertThat(catalog.tableExists(TABLE)).as("Table should exist after create").isTrue();
boolean dropped = catalog.dropTable(TABLE);
assertThat(dropped).as("Should drop a table that does exist").isTrue();
assertThat(catalog.tableExists(TABLE)).as("Table should not exist after drop").isFalse();
catalog.tableExists was tested pretty extensively in all other existing unit tests as well.
Quick question: Is this a behavioral change? Previously we failed when the metadata was corrupt. After this, we succeed. How do we handle corrupt metadata in other catalog implementations?
Thank you @pvary, I think this does indeed introduce a behavioral change. The majority of existing catalogs (except ECSCatalog) rely on the default implementation in the Catalog interface, where we try to load the table first and return true if the load is successful. I believe "table exists" here implies two things: the table entry exists in the catalog, and the latest table metadata.json is not corrupted. Personally, I think we can focus on the former only, and here's my thought process. There are roughly 3 places where
Thanks for the check @dramaticlly! Thanks,
Do we want to do the same optimization for the
Another minor behavioral change: earlier, if the user had access to both the HMS table and storage, the table existence check would pass.
@@ -412,6 +412,28 @@ private void validateTableIsIcebergTableOrView(
    }
  }

  @Override
  public boolean tableExists(TableIdentifier identifier) {
    if (!isValidIdentifier(identifier)) {
Actually, I realize that loadTable works for metadata tables. Can we check this?
      HiveOperationsBase.validateTableIsIceberg(table, fullTableName(name, identifier));
      return true;
    } catch (NoSuchTableException | NoSuchObjectException e) {
      return false;
What happens if a hive table with the same name exists in HMS? I think tableExists will return false, which is confusing.
I think the HMS call is getTable (line 423), not tableExists, which also confused me in https://github.com/apache/iceberg/pull/11597/files#r1853126846, so it should work iiuc
Skip creation of the Hive table operations when checking the existence of an Iceberg table in the Hive catalog.
Today, the table existence check relies on loading the table first and returns true if the table can be loaded:
iceberg/api/src/main/java/org/apache/iceberg/catalog/Catalog.java
Lines 279 to 286 in 3badfe0
I found an opportunity for improvement in the Hive catalog: we can skip instantiating HiveTableOperations and avoid reading the Iceberg metadata.json file by relying only on the record within the Hive catalog.
This is important for REST-based catalogs that delegate work to HiveCatalog, as API call volume can be high and this optimization can reduce API overhead and latency.
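To make the difference between the two strategies concrete, here is a minimal sketch using a hypothetical in-memory model. None of the names below (TableExistsSketch, register, existsByLoading, existsByEntry) are Iceberg or Hive APIs, and the "table_type" and "metadata_location" parameter keys are assumptions about how the metastore record marks an Iceberg table. The sketch only illustrates that a load-based check touches storage and fails on unreadable metadata, while an entry-based check needs a single metastore lookup.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model; none of these names are Iceberg or Hive APIs.
final class TableExistsSketch {
  // Metastore: table name -> table parameters (stand-in for an HMS table record).
  final Map<String, Map<String, String>> metastore = new HashMap<>();
  // Storage: metadata location -> file content (stand-in for metadata.json files).
  final Map<String, String> storage = new HashMap<>();
  // Counts storage reads so the two strategies can be compared.
  int storageReads = 0;

  void register(String name, String tableType, String metadataLocation) {
    Map<String, String> params = new HashMap<>();
    params.put("table_type", tableType); // assumed parameter key
    params.put("metadata_location", metadataLocation); // assumed parameter key
    metastore.put(name, params);
  }

  // Load-based check: look up the entry, then read the metadata file.
  // Throws when the entry exists but the metadata cannot be read.
  boolean existsByLoading(String name) {
    Map<String, String> params = metastore.get(name);
    if (params == null) {
      return false;
    }
    storageReads++;
    String metadata = storage.get(params.get("metadata_location"));
    if (metadata == null) {
      throw new IllegalStateException("Cannot read metadata for " + name);
    }
    return true;
  }

  // Entry-based check: one metastore lookup plus a table-type test, no storage read.
  boolean existsByEntry(String name) {
    Map<String, String> params = metastore.get(name);
    return params != null && "iceberg".equalsIgnoreCase(params.get("table_type"));
  }

  public static void main(String[] args) {
    TableExistsSketch sketch = new TableExistsSketch();
    sketch.register("db.good", "ICEBERG", "s3://bucket/good/v1.metadata.json");
    sketch.storage.put("s3://bucket/good/v1.metadata.json", "{}");
    // Entry exists in the metastore, but its metadata file is missing from storage.
    sketch.register("db.broken", "ICEBERG", "s3://bucket/broken/v1.metadata.json");

    System.out.println("load-based, good table:    " + sketch.existsByLoading("db.good"));
    System.out.println("entry-based, good table:   " + sketch.existsByEntry("db.good"));
    System.out.println("entry-based, broken table: " + sketch.existsByEntry("db.broken"));
    System.out.println("storage reads so far:      " + sketch.storageReads);
  }
}
```

In this model the entry-based check reports the table with unreadable metadata as existing, whereas the load-based check would throw for it.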
Why is this safe?
HiveOperationsBase.validateTableIsIceberg is also used in the catalog listTables API to differentiate Iceberg tables from Hive tables.
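The consistency argument can be sketched as follows: if listing and the existence check share one predicate over the metastore record, a table is reported as existing exactly when it would appear in the listing. This is an illustration only. TableTypeFilterSketch and its methods are hypothetical, and the assumption that an Iceberg table is marked by a "table_type" parameter equal to "iceberg" (case-insensitive) should be verified against HiveOperationsBase.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch; not the HiveCatalog implementation.
final class TableTypeFilterSketch {
  // Shared predicate: assumes Iceberg tables carry table_type=iceberg (any case).
  static boolean isIcebergEntry(Map<String, String> params) {
    return params != null && "iceberg".equalsIgnoreCase(params.get("table_type"));
  }

  // Listing keeps only the entries that pass the shared predicate.
  static List<String> listIcebergTables(Map<String, Map<String, String>> metastore) {
    return metastore.entrySet().stream()
        .filter(e -> isIcebergEntry(e.getValue()))
        .map(Map.Entry::getKey)
        .sorted()
        .collect(Collectors.toList());
  }

  // The existence check applies the same predicate to a single entry.
  static boolean tableExists(Map<String, Map<String, String>> metastore, String name) {
    return isIcebergEntry(metastore.get(name));
  }

  public static void main(String[] args) {
    Map<String, Map<String, String>> metastore = new HashMap<>();
    metastore.put("db.events", Map.of("table_type", "ICEBERG"));
    metastore.put("db.legacy", Map.of("table_type", "EXTERNAL_TABLE")); // plain Hive table
    metastore.put("db.plain", Map.of()); // no table_type parameter at all

    System.out.println(listIcebergTables(metastore));
    System.out.println(tableExists(metastore, "db.events"));
    System.out.println(tableExists(metastore, "db.legacy"));
  }
}
```

Under this model, a plain Hive table is neither listed nor reported as an existing Iceberg table.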