Hive: Optimize tableExists API in hive catalog #11597

dramaticlly · 2024-11-20T00:22:31Z

Skip creating HiveTableOperations when checking whether an Iceberg table exists in the Hive catalog.

Today, the table existence check relies on loading the table first and returns true if the table can be loaded:

iceberg/api/src/main/java/org/apache/iceberg/catalog/Catalog.java

Lines 279 to 286 in 3badfe0

    
    default boolean tableExists(TableIdentifier identifier) {
      try {
        loadTable(identifier);
        return true;
      } catch (NoSuchTableException e) {
        return false;
      }
    }

I found an opportunity for improvement in the Hive catalog: we can skip instantiating HiveTableOperations and avoid reading the Iceberg metadata.json file by relying only on the record in the Hive catalog.

This is important for REST-based catalogs that delegate work to HiveCatalog, because API call volume can be high and this optimization reduces API overhead and latency.

Why is this safe? HiveOperationsBase.validateTableIsIceberg is also used in the catalog listTables API to differentiate Iceberg tables from Hive tables.
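The idea described above can be illustrated with a minimal, self-contained sketch. It does not use the Hive or Iceberg APIs: `FakeMetastore`, `isIcebergRecord`, and the `table_type=ICEBERG` marker are assumptions made for illustration, standing in for the metastore client and for the validation that HiveOperationsBase.validateTableIsIceberg performs.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: an in-memory stand-in for a metastore, not the Hive or Iceberg API. */
final class MetastoreOnlyExistsSketch {

  /** Hypothetical metastore: maps "db.table" to that table's parameter map. */
  static final class FakeMetastore {
    private final Map<String, Map<String, String>> tables = new HashMap<>();

    void put(String db, String table, Map<String, String> parameters) {
      tables.put(db + "." + table, parameters);
    }

    /** Returns the table's parameters, or null when no record exists. */
    Map<String, String> getTable(String db, String table) {
      return tables.get(db + "." + table);
    }
  }

  /** Assumption: an Iceberg table is marked by table_type=ICEBERG, compared case-insensitively. */
  static boolean isIcebergRecord(Map<String, String> parameters) {
    return "ICEBERG".equalsIgnoreCase(parameters.get("table_type"));
  }

  /** One metastore lookup; no table operations object and no metadata.json read. */
  static boolean tableExists(FakeMetastore hms, String db, String table) {
    Map<String, String> parameters = hms.getTable(db, table);
    return parameters != null && isIcebergRecord(parameters);
  }
}
```

In this sketch, a record that carries the marker is reported as existing, while a plain Hive record with the same lookup path and a missing record both yield false.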


dramaticlly · 2024-11-20T00:22:57Z

FYI @szehon-ho and @haizhou-zhao if you are interested

haizhou-zhao

Thx @dramaticlly , nice work

pvary · 2024-11-20T06:32:49Z

hive-metastore/src/test/java/org/apache/iceberg/hive/HiveTableTest.java

@@ -386,6 +386,12 @@ public void testHiveTableAndIcebergTableWithSameName(TableType tableType)

    assertThat(catalog.tableExists(TABLE_IDENTIFIER)).isTrue();
    HIVE_METASTORE_EXTENSION.metastoreClient().dropTable(DB_NAME, hiveTableName);
+


We usually try to create a different test for every feature. In this case, if a test fails, it is easier to understand what went wrong.

I agree; however, the change I propose here does not introduce a new feature. I am trying to simplify the existing flow, and I found that the existing test coverage is sufficient to verify the existence check for Iceberg tables in HiveCatalog. If you have specific recommendations for an additional test, I am happy to create one!

I would prefer to have a separate test for the tableExists.

    @Test
    public void testTableExists() throws TException {
      [..]
      // create an iceberg table after hive table dropped
      catalog.createTable(identifier, SCHEMA, PartitionSpec.unpartitioned());
      assertThat(catalog.tableExists(identifier)).isTrue();
      catalog.dropTable(identifier, true);
      assertThat(catalog.tableExists(identifier)).isFalse();
      [..]
    }

Yeah, I agree with you on having separate tests for new features. But in this particular case, it is already covered by the dropTable tests in

iceberg/core/src/test/java/org/apache/iceberg/catalog/CatalogTests.java

Lines 838 to 845 in a52afdc

assertThat(catalog.tableExists(TABLE)).as("Table should not exist before create").isFalse();

catalog.buildTable(TABLE, SCHEMA).create();

assertThat(catalog.tableExists(TABLE)).as("Table should exist after create").isTrue();

boolean dropped = catalog.dropTable(TABLE);

assertThat(dropped).as("Should drop a table that does exist").isTrue();

assertThat(catalog.tableExists(TABLE)).as("Table should not exist after drop").isFalse();

and catalog.tableExists is tested pretty extensively in the other existing unit tests as well.

hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java

pvary · 2024-11-20T06:36:26Z

Quick question: Is this a behavioral change? Previously we failed when the metadata was corrupt. After this, we succeed.

How do we handle corrupt metadata in other catalog implementations?

dramaticlly · 2024-11-20T19:58:28Z

> Quick question: Is this a behavioral change? Previously we failed when the metadata was corrupt. After this, we succeed.
>
> How do we handle corrupt metadata in other catalog implementations?

Thank you @pvary. I think this does indeed introduce a behavioral change. The majority of existing catalogs (except ECSCatalog) rely on the default implementation in the Catalog interface, which tries to load the table first and returns true if the load succeeds. I believe "table exists" there implies two things: the table entry exists in the catalog, and the latest table metadata.json is not corrupted.

Personally, I think we can focus on the former only. Here's my thought process:

There are roughly 3 places where catalog.tableExists is used in the Iceberg code base:

Check before a table can be registered in the registerTable API:

iceberg/core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java

Line 82 in 3badfe0

if (tableExists(identifier)) {

I believe the behaviour change is acceptable here: as long as the entry exists in the catalog, registration should fail regardless of whether the files are corrupted.

Check before staged table creation; this is only used in the REST catalog handler:

iceberg/core/src/main/java/org/apache/iceberg/rest/CatalogHandlers.java

Line 228 in 3badfe0

if (catalog.tableExists(ident)) {

I believe the behaviour change is also acceptable here: as long as the entry exists in the catalog, staged creation should fail regardless of whether the version files are corrupted.

REST API to check for table existence:

iceberg/open-api/rest-catalog-open-api.yaml

Lines 1129 to 1133 in 3badfe0

    
    head:
      tags:
        - Catalog API
      summary: Check if a table exists
      operationId: tableExists

I think this is what I originally hoped to optimize: speeding up the existence check without relying on reading the metadata first. Sometimes an existence check is all we need, with no subsequent loadTable call.
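The behavioral difference discussed above can be made concrete with a small self-contained model. Everything here is hypothetical (the `ExistenceSemanticsSketch` class, its `register` method, and the `metadataReadable` flag are illustration-only names, not Iceberg APIs). It contrasts a load-based check, shaped like the default in the Catalog interface, with an entry-based check when the metadata behind an entry cannot be read.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: models a catalog as a map from table name to "is its metadata readable". */
final class ExistenceSemanticsSketch {

  /** Stand-in for NoSuchTableException in this sketch. */
  static final class NoSuchTable extends RuntimeException {
    NoSuchTable(String name) {
      super("No such table: " + name);
    }
  }

  private final Map<String, Boolean> metadataReadable = new HashMap<>();

  void register(String name, boolean readable) {
    metadataReadable.put(name, readable);
  }

  /** Models loadTable: requires the catalog entry and readable metadata. */
  void loadTable(String name) {
    Boolean readable = metadataReadable.get(name);
    if (readable == null) {
      throw new NoSuchTable(name);
    }
    if (!readable) {
      throw new IllegalStateException("Cannot read metadata for " + name);
    }
  }

  /** Load-based check: a corrupt table surfaces as an exception, not as true or false. */
  boolean existsByLoad(String name) {
    try {
      loadTable(name);
      return true;
    } catch (NoSuchTable e) {
      return false;
    }
  }

  /** Entry-based check: consults only the catalog record. */
  boolean existsByEntry(String name) {
    return metadataReadable.containsKey(name);
  }
}
```

For a healthy or a missing table the two checks agree. For an entry whose metadata is unreadable, `existsByLoad` propagates the failure while `existsByEntry` returns true, which is the change in semantics raised in the review.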

pvary · 2024-11-21T13:19:08Z

Thanks for the check @dramaticlly!
I agree that this behavioural change is small, but I would like to raise awareness of it in the community. Could you please send a message to the dev list describing what is planned here? That way, anyone who is against the change can raise their voice.

Thanks,
Peter

pvary · 2024-11-21T13:24:33Z

Do we want to do the same optimization for the viewExists method too?

karuppayya · 2024-11-21T23:07:09Z

Another minor behavioral change: previously, tableExists passed when the user had access to both the HMS table and the storage.
With this change, will tableExists pass with access to the HMS table only?

hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java

szehon-ho · 2024-11-22T01:40:14Z

hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java

@@ -412,6 +412,28 @@ private void validateTableIsIcebergTableOrView(
    }
  }

+  @Override
+  public boolean tableExists(TableIdentifier identifier) {
+    if (!isValidIdentifier(identifier)) {


Actually, I realize that loadTable also works for metadata tables. Can we check this case?
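One possible way to keep an entry-based check consistent with loadTable for metadata tables is to resolve a metadata table identifier to its base table before consulting the catalog. The sketch below is an assumption about how that could look, not necessarily what this PR does: `baseTable` is a hypothetical helper, identifiers are modeled as plain lists of name parts, and the set of metadata table names is an illustrative subset only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

/** Sketch only: identifiers are modeled as lists of parts, e.g. [db, table, files]. */
final class MetadataTableIdentifierSketch {

  /** Illustrative subset of metadata table names; not the complete list. */
  private static final Set<String> METADATA_TABLES =
      new HashSet<>(Arrays.asList("files", "history", "snapshots", "manifests", "partitions"));

  /**
   * Returns the [db, table] pair whose catalog record should be checked, or empty when the
   * identifier is neither a two-part table name nor a three-part metadata table name.
   */
  static Optional<List<String>> baseTable(List<String> parts) {
    if (parts.size() == 2) {
      return Optional.of(parts);
    }
    if (parts.size() == 3
        && METADATA_TABLES.contains(parts.get(2).toLowerCase(Locale.ROOT))) {
      return Optional.of(parts.subList(0, 2));
    }
    return Optional.empty();
  }
}
```

Under these assumptions, an existence check for `db.tbl.files` would look up the record for `db.tbl`, and an identifier such as `db.tbl.unknown` would be rejected without a metastore call.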

kevinjqliu · 2024-11-22T02:13:35Z

hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java

+      HiveOperationsBase.validateTableIsIceberg(table, fullTableName(name, identifier));
+      return true;
+    } catch (NoSuchTableException | NoSuchObjectException e) {
+      return false;


What happens if a Hive table with the same name exists in HMS? I think tableExists will return false, which is confusing.

I think the HMS call is getTable (line 423), not tableExists, which also confused me in https://github.com/apache/iceberg/pull/11597/files#r1853126846, so it should work iiuc.

Hive: Optimize tableExists API in hive catalog

600f292

Skip creation of hive table operation when check existence of iceberg table in hive catalog

github-actions bot added the hive label Nov 20, 2024

haizhou-zhao approved these changes Nov 20, 2024


pvary reviewed Nov 20, 2024


Add a newline after if/else

77d984c

szehon-ho reviewed Nov 22, 2024


Add current thread interrupt

47dc279
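The commit title above does not say where the interrupt is restored, so the following is only the general Java idiom it most likely refers to: when a blocking call throws InterruptedException, re-set the thread's interrupt flag before rethrowing as an unchecked exception. The class and method names are hypothetical.

```java
import java.util.concurrent.Callable;

/** Sketch only: the general "restore the interrupt flag" idiom, not code from this PR. */
final class InterruptHandlingSketch {

  /** Runs the action; on interruption, restores the flag and rethrows unchecked. */
  static <T> T call(Callable<T> action) {
    try {
      return action.call();
    } catch (InterruptedException e) {
      // Blocking methods clear the interrupt flag when they throw, so set it again
      // to let callers further up the stack observe the interruption.
      Thread.currentThread().interrupt();
      throw new RuntimeException("Interrupted while checking table existence", e);
    } catch (Exception e) {
      throw new RuntimeException("Failed to check table existence", e);
    }
  }
}
```

Without the `interrupt()` call, code that later polls `Thread.currentThread().isInterrupted()` would not see that the thread had been asked to stop.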

szehon-ho reviewed Nov 22, 2024


kevinjqliu reviewed Nov 22, 2024
