Hive: Optimize tableExists API in hive catalog #11597

Open
wants to merge 3 commits into
base: main

dramaticlly
Contributor

Skip creating HiveTableOperations when checking for the existence of an Iceberg table in the Hive catalog.

Today, the table existence check relies on loading the table first and returns true if the table can be loaded:

default boolean tableExists(TableIdentifier identifier) {
  try {
    loadTable(identifier);
    return true;
  } catch (NoSuchTableException e) {
    return false;
  }
}

I found an opportunity for improvement in the Hive catalog: we can skip instantiating HiveTableOperations and avoid reading the Iceberg metadata.json file by relying only on the record in the Hive catalog.

This is important for REST-based catalogs that delegate work to HiveCatalog, since API call volume can be high and this optimization can reduce API overhead and latency.

Why is this safe? HiveOperationsBase.validateTableIsIceberg is also used in the catalog listTables API to differentiate Iceberg tables from Hive tables.
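
To make the trade-off concrete, here is a minimal, self-contained Java sketch. It is a toy model rather than Iceberg code: the class `TableExistsSketch`, the in-memory map standing in for the Hive Metastore, and the `metadataReads` counter are all hypothetical, and treating a `table_type` parameter as the Iceberg marker is an assumption about what a validation such as validateTableIsIceberg inspects.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Toy model (not Iceberg code): contrasts a load-based existence check
// with a check that only inspects the catalog record.
class TableExistsSketch {
  // Hypothetical stand-in for the metastore: table name -> table parameters.
  private final Map<String, Map<String, String>> metastore = new HashMap<>();

  // Counts simulated metadata.json reads so the two strategies can be compared.
  int metadataReads = 0;

  void register(String name, String tableType) {
    metastore.put(name, Collections.singletonMap("table_type", tableType));
  }

  // Assumption: an Iceberg table is marked by a "table_type" parameter.
  private static boolean isIceberg(Map<String, String> parameters) {
    return "iceberg".equalsIgnoreCase(parameters.get("table_type"));
  }

  // Load-based check: fetch the record, then "read" the metadata file.
  boolean existsByLoading(String name) {
    Map<String, String> parameters = metastore.get(name);
    if (parameters == null || !isIceberg(parameters)) {
      return false;
    }
    metadataReads++; // stands in for reading metadata.json from storage
    return true;
  }

  // Record-only check: one metastore lookup, no storage access.
  boolean existsByRecord(String name) {
    Map<String, String> parameters = metastore.get(name);
    return parameters != null && isIceberg(parameters);
  }

  public static void main(String[] args) {
    TableExistsSketch catalog = new TableExistsSketch();
    catalog.register("db.events", "ICEBERG");
    catalog.register("db.legacy", "MANAGED_TABLE");

    System.out.println(catalog.existsByRecord("db.events"));  // Iceberg entry
    System.out.println(catalog.existsByRecord("db.legacy"));  // non-Iceberg entry
    System.out.println(catalog.existsByRecord("db.missing")); // no entry at all
    System.out.println("metadata reads: " + catalog.metadataReads);
  }
}
```

In this model the record-only check never increments `metadataReads`, which is the saving the description is after; a non-Iceberg entry with the same name is still reported as absent.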

Skip creation of hive table operation when check existence of iceberg table in hive catalog
@github-actions github-actions bot added the hive label Nov 20, 2024
@dramaticlly
Contributor Author

FYI @szehon-ho and @haizhou-zhao if you are interested

Contributor

@haizhou-zhao haizhou-zhao left a comment

Thx @dramaticlly, nice work

@@ -386,6 +386,12 @@ public void testHiveTableAndIcebergTableWithSameName(TableType tableType)

assertThat(catalog.tableExists(TABLE_IDENTIFIER)).isTrue();
HIVE_METASTORE_EXTENSION.metastoreClient().dropTable(DB_NAME, hiveTableName);

Contributor

We usually try to create a different test for every feature. In this case, if a test fails, it is easier to understand what went wrong.

Contributor Author

@dramaticlly dramaticlly Nov 20, 2024

I agree; however, the change I proposed here does not introduce a new feature. I am trying to simplify the existing flow, and I found that the existing test coverage is sufficient to verify the table existence check for Iceberg tables in HiveCatalog. If you have specific recommendations for an additional test, I am happy to create a new one!

Contributor

I would prefer to have a separate test for tableExists.

@Test
public void testTableExists() throws TException {
[..]
    // create an iceberg table after hive table dropped
    catalog.createTable(identifier, SCHEMA, PartitionSpec.unpartitioned());
    assertThat(catalog.tableExists(identifier)).isTrue();
    catalog.dropTable(identifier, true);
    assertThat(catalog.tableExists(identifier)).isFalse();
[..]
}

Contributor Author

Yeah, I agree with you on having a separate test for a new feature. But in this particular case, it is already covered by the dropTable tests:

assertThat(catalog.tableExists(TABLE)).as("Table should not exist before create").isFalse();
catalog.buildTable(TABLE, SCHEMA).create();
assertThat(catalog.tableExists(TABLE)).as("Table should exist after create").isTrue();
boolean dropped = catalog.dropTable(TABLE);
assertThat(dropped).as("Should drop a table that does exist").isTrue();
assertThat(catalog.tableExists(TABLE)).as("Table should not exist after drop").isFalse();

catalog.tableExists is also tested pretty extensively in the other existing unit tests.

@pvary
Contributor

pvary commented Nov 20, 2024

Quick question: Is this a behavioral change? Previously we failed when the metadata was corrupt. After this, we succeed.

How do we handle corrupt metadata in other catalog implementations?

@dramaticlly
Contributor Author

dramaticlly commented Nov 20, 2024

> Quick question: Is this a behavioral change? Previously we failed when the metadata was corrupt. After this, we succeed.
>
> How do we handle corrupt metadata in other catalog implementations?

Thank you @pvary. I think this does indeed introduce a behavioral change. The majority of existing catalogs (except ECSCatalog) rely on the default implementation in the Catalog interface, which tries to load the table first and returns true if the load succeeds. I believe "table exists" there implies two things: the table entry exists in the catalog, and the latest table metadata.json is not corrupted.

Personally, I think we can focus on the former only. Here is my thought process:

There are roughly 3 places where catalog.tableExists is used in the Iceberg code base:

  1. Check before a table can be registered in the registerTable API
  • I believe the behaviour change is acceptable here: as long as the entry exists in the catalog, registration should fail regardless of whether the files are corrupted
  2. Check before staged table creation; this is only used in the REST catalog handler
  • I believe the behaviour change is also acceptable here: as long as the entry exists in the catalog, staged creation should fail regardless of whether the version files are corrupted
  3. REST API to check for table existence:
    head:
      tags:
        - Catalog API
      summary: Check if a table exists
      operationId: tableExists
  • This is what I originally hoped to optimize: speeding up the existence check without relying on reading the metadata first. Sometimes an existence check is all we need, with no subsequent load table call
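
The behavioral difference under discussion can be illustrated with a small, self-contained Java sketch. Everything in it is hypothetical and only models the idea: the class `CorruptMetadataSketch`, the `TableNotFoundException` type, the two in-memory maps standing in for the catalog and for storage, and the example metadata location are not Iceberg code or real paths.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model (not Iceberg code) of how corrupt metadata affects the two checks.
class CorruptMetadataSketch {
  // Hypothetical exception standing in for a "no such table" error.
  static class TableNotFoundException extends RuntimeException {
    TableNotFoundException(String message) {
      super(message);
    }
  }

  // Catalog entries: table name -> metadata file location.
  final Map<String, String> catalogEntries = new HashMap<>();
  // Simulated storage: metadata file location -> file content.
  final Map<String, String> storage = new HashMap<>();

  // Simulated load: a missing entry and unreadable metadata fail differently.
  String loadTable(String name) {
    String location = catalogEntries.get(name);
    if (location == null) {
      throw new TableNotFoundException(name);
    }
    String content = storage.get(location);
    if (content == null || !content.startsWith("{")) {
      throw new IllegalStateException("Cannot parse metadata at " + location);
    }
    return content;
  }

  // Load-based check: only the "not found" case is mapped to false,
  // so a corrupt metadata file surfaces as an exception.
  boolean existsByLoading(String name) {
    try {
      loadTable(name);
      return true;
    } catch (TableNotFoundException e) {
      return false;
    }
  }

  // Entry-only check: never touches storage.
  boolean existsByEntry(String name) {
    return catalogEntries.containsKey(name);
  }

  public static void main(String[] args) {
    CorruptMetadataSketch catalog = new CorruptMetadataSketch();
    String location = "example://warehouse/db/broken/v1.metadata.json"; // made-up location
    catalog.catalogEntries.put("db.broken", location);
    catalog.storage.put(location, "not json");

    System.out.println("entry-only: " + catalog.existsByEntry("db.broken"));
    try {
      System.out.println("load-based: " + catalog.existsByLoading("db.broken"));
    } catch (IllegalStateException e) {
      System.out.println("load-based check failed: " + e.getMessage());
    }
  }
}
```

In this model the entry-only check reports the table as existing while the load-based check throws, which mirrors the "previously we failed, after this we succeed" observation.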

@pvary
Contributor

pvary commented Nov 21, 2024

Thanks for the check @dramaticlly!
I agree that this behavioural change is small, but I would like to raise awareness in the community about it. Could you please send a message to the dev list describing what is planned here? That way, anyone who is against the change can raise their voice.

Thanks,
Peter

@pvary
Contributor

pvary commented Nov 21, 2024

Do we want to do the same optimization for the viewExists method too?

@karuppayya
Contributor

Another minor behavioral change: earlier, tableExists would pass if the user had access to both the HMS table and storage.
With this change, would tableExists pass with access to the HMS table only?

@@ -412,6 +412,28 @@ private void validateTableIsIcebergTableOrView(
}
}

@Override
public boolean tableExists(TableIdentifier identifier) {
if (!isValidIdentifier(identifier)) {
Collaborator

Actually, I realize that loadTable also works for metadata tables. Can we check this?
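
For context, a metadata table is addressed by appending a name such as `files` or `snapshots` to the base table identifier, so an existence check that rejects such identifiers up front would disagree with loadTable. The sketch below shows one possible way to handle this in a toy model. It is not the PR's implementation: the class `MetadataTableExistsSketch`, the dot-separated string identifiers, and the small illustrative subset of metadata table names are all assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Toy model (not Iceberg code): resolve a metadata table identifier
// to its base table before checking the catalog entry.
class MetadataTableExistsSketch {
  // Illustrative subset of metadata table names, chosen for this sketch.
  static final Set<String> METADATA_TABLE_NAMES =
      new HashSet<>(Arrays.asList("files", "snapshots", "history", "manifests"));

  // Hypothetical catalog content: fully qualified "db.table" names.
  final Set<String> catalogEntries = new HashSet<>();

  boolean tableExists(String identifier) {
    String[] parts = identifier.split("\\.");
    if (parts.length == 2) {
      // Plain "db.table" identifier: look up the entry directly.
      return catalogEntries.contains(identifier);
    }
    if (parts.length == 3
        && METADATA_TABLE_NAMES.contains(parts[2].toLowerCase(Locale.ROOT))) {
      // "db.table.<metadata table>": existence follows the base table.
      return catalogEntries.contains(parts[0] + "." + parts[1]);
    }
    return false;
  }

  public static void main(String[] args) {
    MetadataTableExistsSketch catalog = new MetadataTableExistsSketch();
    catalog.catalogEntries.add("db.events");

    System.out.println(catalog.tableExists("db.events"));
    System.out.println(catalog.tableExists("db.events.files"));
    System.out.println(catalog.tableExists("db.events.unknown"));
    System.out.println(catalog.tableExists("db.other.files"));
  }
}
```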

Comment on lines +424 to +427
HiveOperationsBase.validateTableIsIceberg(table, fullTableName(name, identifier));
return true;
} catch (NoSuchTableException | NoSuchObjectException e) {
return false;
Contributor

What happens if a Hive table with the same name exists in HMS? I think tableExists will return false, which is confusing.

Collaborator

I think the HMS call is getTable (line 423), not tableExists (I was also confused about this in https://github.com/apache/iceberg/pull/11597/files#r1853126846), so it should work iiuc


6 participants