
Iceberg planning time significantly longer than Hive #26789

@yingsu00

Description


Example: TPC-H Q7, Hive vs. Iceberg

[Image: query detail comparison]

https://grafana.ibm.prestodb.dev/d/cc6a4ccf-2fc7-4220-8f4e-397b03061acf/query-detail-comparisons?orgId=1&var-query1=20251204_084902_00055_xhkuu&var-query2=20251204_084725_00052_xhkuu

TPC-H Q8, Hive vs. Iceberg

[Image: query detail comparison]

https://grafana.ibm.prestodb.dev/d/cc6a4ccf-2fc7-4220-8f4e-397b03061acf/query-detail-comparisons?orgId=1&var-query1=20251204_084213_00049_xhkuu&var-query2=20251204_084227_00050_xhkuu

I reran TPC-DS Q60 on a medium cluster; the query elapsed time was stable at 16s, with planning time at 13s. I took a JFR profile on the coordinator and saw:

  • The planning thread(s) repeatedly block in socket reads to the remote metastore service.
  • The total blocking time on those reads is ~11–12s.
  • The end-to-end planning time (~13s) lines up almost exactly with the cumulative socket-read time.

The hotspot was in IcebergHiveMetadata.getTableStatistics, going through ThriftHiveMetastore. This confirms that network I/O to the metastore, not local CPU, is the dominant cause of slow planning.

[Image: JFR profile of the coordinator]

By manually enabling the metastore cache in com/facebook/presto/hive/metastore/InMemoryCachingHiveMetastore.java,

```java
private InMemoryCachingHiveMetastore(...)
{
    ...
//        switch (metastoreCacheScope) {
//            case PARTITION:
//                partitionCacheExpiresAfterWriteMillis = expiresAfterWriteMillis;
//                partitionCacheRefreshMills = refreshMills;
//                partitionCacheMaxSize = maximumSize;
//                cacheExpiresAfterWriteMillis = OptionalLong.of(0);
//                cacheRefreshMills = OptionalLong.of(0);
//                cacheMaxSize = 0;
//                break;
//
//            case ALL:
//                partitionCacheExpiresAfterWriteMillis = expiresAfterWriteMillis;
//                partitionCacheRefreshMills = refreshMills;
//                partitionCacheMaxSize = maximumSize;
//                cacheExpiresAfterWriteMillis = expiresAfterWriteMillis;
//                cacheRefreshMills = refreshMills;
//                cacheMaxSize = maximumSize;
//                break;
//
//            default:
//                throw new IllegalArgumentException("Unknown metastore-cache-scope: " + metastoreCacheScope);
//        }

    partitionCacheExpiresAfterWriteMillis = OptionalLong.of(999999999999L);
    partitionCacheRefreshMills = OptionalLong.of(999999999999L);
    partitionCacheMaxSize = 999999999999L;
    cacheExpiresAfterWriteMillis = OptionalLong.of(999999999999L);
    cacheRefreshMills = OptionalLong.of(999999999999L);
    cacheMaxSize = 999999999999L;
    ...
}
```

The query planning time for TPC-DS Q65 and Q56 dropped from >13s to 1.6s. However, the corresponding config property is effectively disabled for Iceberg, because the cache TTL is hard-coded to 0.
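For context, this cache is normally driven by catalog configuration rather than a code change. The property names below are a sketch based on my recollection of the Presto Hive connector and should be verified against your version; with the TTL hard-coded to 0 for Iceberg, they currently have no effect there:

```properties
# Hypothetical catalog config sketch (e.g. etc/catalog/iceberg.properties).
# Property names assumed from the Presto Hive connector; ineffective for
# Iceberg today because the cache TTL is forced to 0 in code.
hive.metastore-cache-ttl=10m
hive.metastore-refresh-interval=5m
hive.metastore-cache-maximum-size=10000
```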

@agrawalreetika pointed out that Iceberg metadata caching was disabled in #24326. In that issue, @imjalpreet mentioned: "The ideal long-term solution would involve allowing caching of specific metadata, such as table statistics or the list of table names, which do not violate the ACID contract. However, as an immediate fix, we should disallow enabling the metastore cache when using the Iceberg connector." However, those specific metadata caches have not been implemented yet. @imjalpreet's #24376 adds caching for getTable(), but it doesn't cover getTableStatistics().
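The gap could in principle be closed by memoizing getTableStatistics() the same way #24376 memoizes getTable(). A minimal sketch of the idea, with hypothetical names (StatsLoader, CachingStats, and the long return type are illustrative stand-ins, not Presto APIs):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: memoize per-table statistics so repeated planner lookups
// hit the metastore once. Not Presto code; names are hypothetical.
public class CachingStatsExample
{
    interface StatsLoader
    {
        long load(String table); // stands in for a Thrift round-trip
    }

    static class CachingStats
    {
        private final Map<String, Long> cache = new ConcurrentHashMap<>();
        private final StatsLoader loader;

        CachingStats(StatsLoader loader)
        {
            this.loader = loader;
        }

        long getTableStatistics(String table)
        {
            // computeIfAbsent makes at most one remote call per table key
            return cache.computeIfAbsent(table, loader::load);
        }
    }

    public static void main(String[] args)
    {
        AtomicInteger remoteCalls = new AtomicInteger();
        CachingStats stats = new CachingStats(t -> {
            remoteCalls.incrementAndGet();
            return 42L;
        });
        stats.getTableStatistics("lineitem");
        stats.getTableStatistics("lineitem"); // served from cache
        System.out.println(remoteCalls.get()); // 1
    }
}
```

A real fix would also need a bounded size and TTL (as in the InMemoryCachingHiveMetastore hack above) so stale statistics eventually expire.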

@zacw7: When you did the runs earlier this year that showed similar performance between Iceberg and Hive, was that before #24326 was merged? I'm wondering why you didn't see this before.
