Using the iceberg-connector for existing parquet files #19042
-
Does this need a "migration process"? Here is my situation: I have thousands of Parquet files, written uncompressed with Python's pyarrow.parquet writer, into partitioned date folders in an S3 bucket. My fundamental question is: should I be able to query these directly from Trino using the Iceberg connector, or do I first need some migration process to convert them into the Iceberg table format, i.e. metadata inside these Parquet files?

I have been running CREATE TABLE against the existing S3 location and then SELECT queries against that table, but this returns nothing for the existing Parquet files. I can, however, add new data through INSERT commands and query the inserted data. After the INSERT runs, there are /metadata and /data subfolders at the location I entered in the CREATE TABLE command. The /data folder contains the new Parquet files, and /metadata is a collection of stats, Avro files, etc., which I assume maps the table to its Parquet files so that distributed queries can run.

How would I create the /metadata for the existing Parquet files, or do I need to migrate the existing data via INSERT commands (or some Python library)? One caveat is that I'm not using Hive, as I am running everything on Kubernetes without a Hadoop cluster. I have Nessie set up as the metastore.
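Roughly what I have been running, with placeholder schema, table, column, and bucket names (not my real ones):

```sql
-- Iceberg table created over the existing partitioned Parquet location
CREATE TABLE iceberg.analytics.events (
    event_id   VARCHAR,
    payload    VARCHAR,
    event_date DATE
)
WITH (
    format = 'PARQUET',
    partitioning = ARRAY['event_date'],
    location = 's3://my-bucket/events/'
);

-- Returns nothing for the pre-existing Parquet files
SELECT count(*) FROM iceberg.analytics.events;

-- New rows land under /data (with /metadata created alongside) and can be queried back
INSERT INTO iceberg.analytics.events
VALUES ('evt-1', '{"k": "v"}', DATE '2023-09-01');

SELECT * FROM iceberg.analytics.events
WHERE event_date = DATE '2023-09-01';
```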
-
Iceberg != a bunch of Parquet files. Additional metadata is required for it to be an Iceberg table. There are multiple ways of generating that metadata:

- CREATE TABLE iceberg.default.iceberg_table AS SELECT * FROM hive.default.hive_table
- the migrate procedure in Trino: https://trino.io/docs/current/connector/iceberg.html#migrate-table

Also note that you don't need a "Hadoop cluster" for the Hive connector - you just need the Hive metastore (either the Hive standalone metastore or AWS Glue would work).
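For example, one way to do the Hive-table route end to end. Schema, table, column, and bucket names below are placeholders, and you should double-check the procedure arguments against the linked docs:

```sql
-- 1. Expose the existing Parquet files as an external Hive table
--    (the partition column must come last in the column list)
CREATE TABLE hive.analytics.events_raw (
    event_id   VARCHAR,
    payload    VARCHAR,
    event_date DATE
)
WITH (
    format = 'PARQUET',
    external_location = 's3://my-bucket/events/',
    partitioned_by = ARRAY['event_date']
);

-- Register the existing partition directories with the metastore
-- (assumes Hive-style directory names such as event_date=2023-09-01)
CALL hive.system.sync_partition_metadata('analytics', 'events_raw', 'ADD');

-- 2a. Copy the data into a new Iceberg table ...
CREATE TABLE iceberg.analytics.events AS
SELECT * FROM hive.analytics.events_raw;

-- 2b. ... or convert the Hive table in place with the migrate procedure
CALL iceberg.system.migrate(schema_name => 'analytics', table_name => 'events_raw');
```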
-
Thanks for your help! I'm going to pursue the Hive metastore options -- I wasn't aware it could be run independently of Hadoop.

Also, is the Iceberg connector my best option for querying the Parquet data in S3 from Trino, or is there another recommended approach (one not dependent on running in AWS, such as Glue)?

Another question: is the Iceberg metadata completely separate from the Parquet files, i.e. independent files in the /metadata folder, or does it add something to the Parquet files themselves? I'm not sure what other tools will be accessing these files, and I wouldn't want to lock them into using Iceberg.