Using the iceberg-connector for existing parquet files #19042
-
Does this need a "migration process"? Here is my situation: I have thousands of Parquet files, written uncompressed with Python's pyarrow.parquet writer, into partitioned date folders in an S3 bucket. My fundamental question is: should I be able to query these directly from Trino using the Iceberg connector, or do I first need some migration process to convert them into the Iceberg table format, i.e. metadata inside these Parquet files?

I have been running CREATE TABLE against the existing S3 location and then SELECT queries against that table, but this returns nothing for the existing Parquet files. I can, however, add new data through INSERT commands and query the inserted data. After the INSERT runs, there are /metadata and /data subfolders at the location I entered in the CREATE TABLE command. The /data folder contains the new Parquet files, and /metadata is a collection of stats, Avro files, etc., which I assume maps the table to its Parquet files so that distributed queries can run.

How would I create the /metadata for the existing Parquet files, or do I need to migrate the existing data via INSERT commands (or some Python library)? One caveat is that I'm not using Hive, as I am running everything on Kubernetes without a Hadoop cluster. I have Nessie set up as the metastore.
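Roughly what I have been running, with placeholder schema, table, column, and bucket names (not my real ones):

```sql
-- Iceberg table created over the existing partitioned Parquet location
CREATE TABLE iceberg.analytics.events (
    event_id   VARCHAR,
    payload    VARCHAR,
    event_date DATE
)
WITH (
    format = 'PARQUET',
    partitioning = ARRAY['event_date'],
    location = 's3://my-bucket/events/'
);

-- Returns nothing for the pre-existing Parquet files
SELECT count(*) FROM iceberg.analytics.events;

-- New rows land under /data (with /metadata created alongside) and can be queried back
INSERT INTO iceberg.analytics.events
VALUES ('evt-1', '{"k": "v"}', DATE '2023-09-01');

SELECT * FROM iceberg.analytics.events
WHERE event_date = DATE '2023-09-01';
```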
-
Iceberg != a bunch of Parquet files. Additional metadata is required for it to be an Iceberg table. There are multiple ways of generating that metadata:

- CREATE TABLE iceberg.default.iceberg_table AS SELECT * FROM hive.default.hive_table
- the migrate procedure in Trino: https://trino.io/docs/current/connector/iceberg.html#migrate-table

Also note that you don't need a "Hadoop cluster" for the Hive connector - you just need the Hive metastore (either the Hive standalone metastore or AWS Glue would work).
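For example, one way to do the Hive-table route end to end. Schema, table, column, and bucket names below are placeholders, and you should double-check the procedure arguments against the linked docs:

```sql
-- 1. Expose the existing Parquet files as an external Hive table
--    (the partition column must come last in the column list)
CREATE TABLE hive.analytics.events_raw (
    event_id   VARCHAR,
    payload    VARCHAR,
    event_date DATE
)
WITH (
    format = 'PARQUET',
    external_location = 's3://my-bucket/events/',
    partitioned_by = ARRAY['event_date']
);

-- Register the existing partition directories with the metastore
-- (assumes Hive-style directory names such as event_date=2023-09-01)
CALL hive.system.sync_partition_metadata('analytics', 'events_raw', 'ADD');

-- 2a. Copy the data into a new Iceberg table ...
CREATE TABLE iceberg.analytics.events AS
SELECT * FROM hive.analytics.events_raw;

-- 2b. ... or convert the Hive table in place with the migrate procedure
CALL iceberg.system.migrate(schema_name => 'analytics', table_name => 'events_raw');
```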
-
Thanks for your help! I'm going to pursue the Hive metastore options -- I wasn't aware it could be run independently of Hadoop.

Also, is the Iceberg connector my best option for querying the Parquet data in S3 from Trino, or is there another recommended approach (one not dependent on running in AWS, such as Glue)?

Another question: is the Iceberg metadata completely separate from the Parquet files, i.e. independent files in the /metadata folder, or does it add something to the Parquet files themselves? I'm not sure what other tools will be accessing these files, and I wouldn't want to lock them into using Iceberg.