title: Iceberg REST catalog service
slug: /iceberg-rest-service
keywords: Iceberg REST catalog
license: This software is licensed under the Apache License version 2.

Background

The Apache Gravitino Iceberg REST Server follows the Apache Iceberg REST API specification and acts as an Iceberg REST catalog server.

Capabilities

  • Supports the Apache Iceberg REST API defined in Iceberg 1.3.1, including all namespace and table interfaces. The Token and Config interfaces aren't supported yet.
  • Works as a catalog proxy, supporting Hive and JDBC as catalog backends.
  • Provides a pluggable metrics store interface to store and delete Iceberg metrics.
  • When writing to HDFS, the Gravitino Iceberg REST catalog service can only operate as the specified HDFS user and doesn't support proxying to other HDFS users. See How to access Apache Hadoop for more details.

:::info
Built with Apache Iceberg 1.3.1. The Apache Iceberg table format version is 1 by default.
Built with Hadoop 2.10.x. There may be compatibility issues when accessing Hadoop 3.x clusters.
:::

Apache Gravitino Iceberg REST catalog service configuration

Assuming the Gravitino server is deployed in the GRAVITINO_HOME directory, you can find the configuration options in $GRAVITINO_HOME/conf/gravitino.conf. The configuration properties for the Iceberg REST catalog service fall into four categories:

  1. REST Catalog Server Configuration: HTTP server properties such as host and port.

  2. Gravitino Iceberg metrics store Configuration: you can implement a custom Iceberg metrics store and set the corresponding configuration.

  3. Gravitino Iceberg Catalog backend Configuration: set catalog-backend to either jdbc or hive.

  4. Other Iceberg Catalog Properties Defined by Apache Iceberg: additional properties defined by Apache Iceberg.

Please refer to the following sections for details.

REST catalog server configuration

| Configuration item | Description | Default value | Required | Since Version |
|---|---|---|---|---|
| `gravitino.auxService.names` | The auxiliary service name of the Gravitino Iceberg REST catalog service. Use `iceberg-rest`. | (none) | Yes | 0.2.0 |
| `gravitino.auxService.iceberg-rest.classpath` | The classpath of the Gravitino Iceberg REST catalog service; includes the directories containing jars and configuration. It supports both absolute and relative paths, for example, `catalogs/lakehouse-iceberg/libs, catalogs/lakehouse-iceberg/conf`. | (none) | Yes | 0.2.0 |
| `gravitino.auxService.iceberg-rest.host` | The host of the Gravitino Iceberg REST catalog service. | `0.0.0.0` | No | 0.2.0 |
| `gravitino.auxService.iceberg-rest.httpPort` | The port of the Gravitino Iceberg REST catalog service. | `9001` | No | 0.2.0 |
| `gravitino.auxService.iceberg-rest.minThreads` | The minimum number of threads in the thread pool used by the Jetty web server. `minThreads` is 8 if the value is less than 8. | `Math.max(Math.min(Runtime.getRuntime().availableProcessors() * 2, 100), 8)` | No | 0.2.0 |
| `gravitino.auxService.iceberg-rest.maxThreads` | The maximum number of threads in the thread pool used by the Jetty web server. `maxThreads` is 8 if the value is less than 8, and `maxThreads` must be greater than or equal to `minThreads`. | `Math.max(Runtime.getRuntime().availableProcessors() * 4, 400)` | No | 0.2.0 |
| `gravitino.auxService.iceberg-rest.threadPoolWorkQueueSize` | The size of the queue in the thread pool used by the Gravitino Iceberg REST catalog service. | `100` | No | 0.2.0 |
| `gravitino.auxService.iceberg-rest.stopTimeout` | The time in ms allowed for the Gravitino Iceberg REST catalog service to stop gracefully. For more information, see `org.eclipse.jetty.server.Server#setStopTimeout`. | `30000` | No | 0.2.0 |
| `gravitino.auxService.iceberg-rest.idleTimeout` | The timeout in ms of idle connections. | `30000` | No | 0.2.0 |
| `gravitino.auxService.iceberg-rest.requestHeaderSize` | The maximum size of an HTTP request header. | `131072` | No | 0.2.0 |
| `gravitino.auxService.iceberg-rest.responseHeaderSize` | The maximum size of an HTTP response header. | `131072` | No | 0.2.0 |
| `gravitino.auxService.iceberg-rest.customFilters` | Comma-separated list of filter class names to apply to the APIs. | (none) | No | 0.4.0 |

The filter in customFilters should be a standard javax servlet filter. You can also specify filter parameters by setting configuration entries in the style gravitino.auxService.iceberg-rest.<class name of filter>.param.<param name>=<value>.
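As a rough illustration of the server properties and the filter parameter pattern, the sketch below appends a minimal configuration to a file. The filter class com.example.AuditFilter and its logLevel parameter are hypothetical placeholders, and the CONF_FILE variable is an assumption of this sketch: it defaults to a scratch file so the snippet is safe to try, and would point at $GRAVITINO_HOME/conf/gravitino.conf in a real deployment.

```shell
# Sketch only: CONF_FILE defaults to a scratch file; point it at
# $GRAVITINO_HOME/conf/gravitino.conf for a real deployment.
CONF_FILE="${CONF_FILE:-$(mktemp)}"

cat >> "$CONF_FILE" <<'EOF'
# Enable the Iceberg REST auxiliary service and tell it where its jars and conf live.
gravitino.auxService.names = iceberg-rest
gravitino.auxService.iceberg-rest.classpath = catalogs/lakehouse-iceberg/libs, catalogs/lakehouse-iceberg/conf
gravitino.auxService.iceberg-rest.host = 0.0.0.0
gravitino.auxService.iceberg-rest.httpPort = 9001
# Hypothetical servlet filter and parameter, following the
# <class name of filter>.param.<param name> pattern.
gravitino.auxService.iceberg-rest.customFilters = com.example.AuditFilter
gravitino.auxService.iceberg-rest.com.example.AuditFilter.param.logLevel = INFO
EOF

echo "Wrote Iceberg REST server settings to $CONF_FILE"
```

The host and port shown are the documented defaults, so those two lines could be omitted; they are spelled out here only to show where they would go.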

Apache Iceberg metrics store configuration

Gravitino provides a pluggable metrics store interface to store and delete Iceberg metrics. You can develop a class that implements org.apache.gravitino.catalog.lakehouse.iceberg.web.metrics and add the corresponding jar file to the Iceberg REST service classpath directory.

| Configuration item | Description | Default value | Required | Since Version |
|---|---|---|---|---|
| `gravitino.auxService.iceberg-rest.metricsStore` | The Iceberg metrics storage class name. | (none) | No | 0.4.0 |
| `gravitino.auxService.iceberg-rest.metricsStoreRetainDays` | The number of days to retain Iceberg metrics in the store. A value not greater than 0 means the metrics are retained forever. | `-1` | No | 0.4.0 |
| `gravitino.auxService.iceberg-rest.metricsQueueCapacity` | The size of the queue that holds metrics temporarily before they are written to persistent storage. Metrics are dropped when the queue is full. | `1000` | No | 0.4.0 |
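A metrics store configuration might look like the following sketch. The class name com.example.MyMetricsStore is a hypothetical implementation, the 7-day retention is an arbitrary example value, and CONF_FILE is an assumption of this sketch that defaults to a scratch file rather than the real gravitino.conf.

```shell
# Sketch only: CONF_FILE defaults to a scratch file; point it at
# $GRAVITINO_HOME/conf/gravitino.conf for a real deployment.
CONF_FILE="${CONF_FILE:-$(mktemp)}"

cat >> "$CONF_FILE" <<'EOF'
# com.example.MyMetricsStore is a hypothetical implementation class whose jar
# would need to be on the Iceberg REST service classpath.
gravitino.auxService.iceberg-rest.metricsStore = com.example.MyMetricsStore
# Keep metrics for 7 days; a value of 0 or less would retain them forever.
gravitino.auxService.iceberg-rest.metricsStoreRetainDays = 7
# Documented default queue size; metrics are dropped once the queue is full.
gravitino.auxService.iceberg-rest.metricsQueueCapacity = 1000
EOF
```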

Apache Gravitino Iceberg catalog backend configuration

:::info
The Gravitino Iceberg REST catalog service uses the memory catalog backend by default. You can specify a Hive or JDBC catalog backend for production environments.
:::

Apache Hive backend configuration

| Configuration item | Description | Default value | Required | Since Version |
|---|---|---|---|---|
| `gravitino.auxService.iceberg-rest.catalog-backend` | The catalog backend of the Gravitino Iceberg REST catalog service. Use the value `hive` for a Hive catalog. | `memory` | Yes | 0.2.0 |
| `gravitino.auxService.iceberg-rest.uri` | The Hive Metastore address, such as `thrift://127.0.0.1:9083`. | (none) | Yes | 0.2.0 |
| `gravitino.auxService.iceberg-rest.warehouse` | The warehouse directory of the Hive catalog, such as `/user/hive/warehouse-hive/`. | (none) | Yes | 0.2.0 |
| `gravitino.auxService.iceberg-rest.catalog-backend-name` | The catalog backend name passed to the underlying Iceberg catalog backend. The catalog name in the JDBC backend is used to isolate namespaces and tables. | `hive` for the Hive backend, `jdbc` for the JDBC backend, `memory` for the memory backend | No | 0.5.2 |
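Putting the required Hive properties together, a configuration could be sketched as below. The URI and warehouse path are the example values from the table, not values to copy blindly, and CONF_FILE is an assumption of this sketch that defaults to a scratch file.

```shell
# Sketch only: CONF_FILE defaults to a scratch file; point it at
# $GRAVITINO_HOME/conf/gravitino.conf for a real deployment.
CONF_FILE="${CONF_FILE:-$(mktemp)}"

cat >> "$CONF_FILE" <<'EOF'
# Switch the backend from the default (memory) to Hive.
gravitino.auxService.iceberg-rest.catalog-backend = hive
# Example Hive Metastore address and warehouse directory from the table.
gravitino.auxService.iceberg-rest.uri = thrift://127.0.0.1:9083
gravitino.auxService.iceberg-rest.warehouse = /user/hive/warehouse-hive/
EOF
```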

JDBC backend configuration

| Configuration item | Description | Default value | Required | Since Version |
|---|---|---|---|---|
| `gravitino.auxService.iceberg-rest.catalog-backend` | The catalog backend of the Gravitino Iceberg REST catalog service. Use the value `jdbc` for a JDBC catalog. | `memory` | Yes | 0.2.0 |
| `gravitino.auxService.iceberg-rest.uri` | The JDBC connection address, such as `jdbc:postgresql://127.0.0.1:5432` for PostgreSQL, or `jdbc:mysql://127.0.0.1:3306/` for MySQL. | (none) | Yes | 0.2.0 |
| `gravitino.auxService.iceberg-rest.warehouse` | The warehouse directory of the JDBC catalog. Set the HDFS prefix if using HDFS, such as `hdfs://127.0.0.1:9000/user/hive/warehouse-jdbc`. | (none) | Yes | 0.2.0 |
| `gravitino.auxService.iceberg-rest.catalog-backend-name` | The catalog name passed to the underlying Iceberg catalog backend. The catalog name in the JDBC backend is used to isolate namespaces and tables. | `jdbc` for the JDBC backend | No | 0.5.2 |
| `gravitino.auxService.iceberg-rest.jdbc.user` | The username of the JDBC connection. | (none) | Yes | 0.2.0 |
| `gravitino.auxService.iceberg-rest.jdbc.password` | The password of the JDBC connection. | (none) | Yes | 0.2.0 |
| `gravitino.auxService.iceberg-rest.jdbc-initialize` | Whether to initialize the meta tables when creating the JDBC catalog. | `true` | No | 0.2.0 |
| `gravitino.auxService.iceberg-rest.jdbc-driver` | `com.mysql.jdbc.Driver` or `com.mysql.cj.jdbc.Driver` for MySQL, `org.postgresql.Driver` for PostgreSQL. | (none) | Yes | 0.3.0 |

If you already have a JDBC Iceberg catalog, you must set catalog-backend-name to match its catalog name so that you can operate on the existing namespaces and tables.

:::caution
You must download the corresponding JDBC driver to the catalogs/lakehouse-iceberg/libs directory.
:::
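For a PostgreSQL-backed catalog, the JDBC properties might be combined as in the sketch below. The database name iceberg_db, the user iceberg_user, and the password placeholder are hypothetical, as is CONF_FILE, which defaults to a scratch file. The warehouse path is the example value from the table.

```shell
# Sketch only: CONF_FILE defaults to a scratch file; point it at
# $GRAVITINO_HOME/conf/gravitino.conf for a real deployment.
CONF_FILE="${CONF_FILE:-$(mktemp)}"

cat >> "$CONF_FILE" <<'EOF'
gravitino.auxService.iceberg-rest.catalog-backend = jdbc
# iceberg_db, iceberg_user and the password are hypothetical placeholders.
gravitino.auxService.iceberg-rest.uri = jdbc:postgresql://127.0.0.1:5432/iceberg_db
gravitino.auxService.iceberg-rest.jdbc.user = iceberg_user
gravitino.auxService.iceberg-rest.jdbc.password = change-me
# The matching driver jar must already be in catalogs/lakehouse-iceberg/libs.
gravitino.auxService.iceberg-rest.jdbc-driver = org.postgresql.Driver
gravitino.auxService.iceberg-rest.jdbc-initialize = true
gravitino.auxService.iceberg-rest.warehouse = hdfs://127.0.0.1:9000/user/hive/warehouse-jdbc
# Set this to the existing catalog name when reusing a JDBC Iceberg catalog.
gravitino.auxService.iceberg-rest.catalog-backend-name = jdbc
EOF
```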

Other Apache Iceberg catalog properties

You can add other properties defined in Iceberg catalog properties. For example, the clients property:

| Configuration item | Description | Default value | Required |
|---|---|---|---|
| `gravitino.auxService.iceberg-rest.clients` | The client pool size of the catalog. | `2` | No |

:::info
catalog-impl has no effect.
:::

HDFS configuration

The Gravitino Iceberg REST catalog service adds the HDFS configuration files core-site.xml and hdfs-site.xml from the directory defined by gravitino.auxService.iceberg-rest.classpath, for example, catalogs/lakehouse-iceberg/conf, to the classpath.
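One way to get the two HDFS files into place is to copy them from an existing Hadoop client configuration directory. The helper below is a hypothetical sketch, not part of Gravitino; the source and destination paths in the example call are assumptions about a typical layout.

```shell
# install_hdfs_conf is a hypothetical helper: it copies core-site.xml and
# hdfs-site.xml from a Hadoop conf directory into a directory that is listed
# in gravitino.auxService.iceberg-rest.classpath.
install_hdfs_conf() {
  src_dir="$1"
  dst_dir="$2"
  # Fail early if either file is missing, so nothing is half-installed.
  for f in core-site.xml hdfs-site.xml; do
    if [ ! -f "$src_dir/$f" ]; then
      echo "missing $src_dir/$f" >&2
      return 1
    fi
  done
  mkdir -p "$dst_dir" &&
    cp "$src_dir/core-site.xml" "$src_dir/hdfs-site.xml" "$dst_dir/"
}

# Example call (both paths are assumptions):
# install_hdfs_conf "$HADOOP_CONF_DIR" "$GRAVITINO_HOME/catalogs/lakehouse-iceberg/conf"
```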

Starting the Apache Gravitino Iceberg REST catalog service

To start the service:

./bin/gravitino.sh start

To verify whether the service has started:

curl http://127.0.0.1:9001/iceberg/v1/config

You should see output like {"defaults":{},"overrides":{}}.
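Because the service may take a moment to come up, a startup script might poll the endpoint instead of calling it once. The helper below is a hypothetical sketch; its name, the default of 30 attempts, and the one-second delay are assumptions, and the URL assumes the default host and port.

```shell
# wait_for_iceberg_rest is a hypothetical helper: it polls the config endpoint
# until it answers successfully or the attempts run out.
# Usage: wait_for_iceberg_rest [url] [attempts] [delay_seconds]
wait_for_iceberg_rest() {
  url="${1:-http://127.0.0.1:9001/iceberg/v1/config}"
  attempts="${2:-30}"
  delay="${3:-1}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -s silences progress output, -f makes curl fail on HTTP error statuses.
    if curl -sf "$url" >/dev/null 2>&1; then
      echo "Iceberg REST service is up at $url"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "Iceberg REST service did not respond at $url" >&2
  return 1
}

# Example: ./bin/gravitino.sh start && wait_for_iceberg_rest
```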

Exploring the Apache Gravitino and Apache Iceberg REST catalog service with Apache Spark

Deploying Apache Spark with Apache Iceberg support

Follow the Spark Iceberg start guide to set up the Apache Spark and Apache Iceberg environment.

Starting the Apache Spark client with the Apache Iceberg REST catalog

| Configuration item | Description |
|---|---|
| `spark.sql.catalog.${catalog-name}.type` | The Spark catalog type; should be set to `rest`. |
| `spark.sql.catalog.${catalog-name}.uri` | The Spark Iceberg REST catalog URI, such as `http://127.0.0.1:9001/iceberg/`. |

For example, you can configure the Spark catalog options to use the Gravitino Iceberg REST catalog with the catalog name rest.

./bin/spark-sql -v \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.rest=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.rest.type=rest \
  --conf spark.sql.catalog.rest.uri=http://127.0.0.1:9001/iceberg/

You may need to adjust the Iceberg Spark runtime package coordinates (here iceberg-spark-runtime-3.4_2.12:1.3.1) to match the Spark, Scala, and Iceberg versions in your environment.

Exploring Apache Iceberg with Apache Spark SQL

-- First, switch to the `rest` catalog
USE rest;
CREATE DATABASE IF NOT EXISTS dml;
CREATE TABLE dml.test (id bigint COMMENT 'unique id') USING iceberg;
DESCRIBE TABLE EXTENDED dml.test;
INSERT INTO dml.test VALUES (1), (2);
SELECT * FROM dml.test;
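To cross-check the result outside Spark, the table can also be looked up through the REST interface itself. The sketch below assumes the list-tables route of the Iceberg REST specification (v1/namespaces/{namespace}/tables) served under the /iceberg/ prefix used earlier; the tables_url helper and the ICEBERG_REST_URI variable are hypothetical names introduced here.

```shell
# Base URI of the service; assumes the default host and port.
ICEBERG_REST_URI="${ICEBERG_REST_URI:-http://127.0.0.1:9001/iceberg}"

# tables_url is a hypothetical helper that builds the list-tables URL
# for a namespace, tolerating one trailing slash on the base URI.
tables_url() {
  printf '%s/v1/namespaces/%s/tables\n' "${ICEBERG_REST_URI%/}" "$1"
}

# With the service running, this would list the tables in the dml namespace:
# curl -s "$(tables_url dml)"
```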

You can try Spark with the Gravitino Iceberg REST catalog service in the playground.