Can't use the extension if my data catalog did not create a version-hint.text file #29
Comments
Sorry, to be a bit clearer: even if we fix the above, happy to help debug this if there's something we can quickly try out. |
I ran into a similar issue using AWS with Glue as the catalog for Iceberg. The metadata files stored in S3 follow this pattern: 00000-0b4430d2-fbee-4b0d-90c9-725f013d6f82.metadata.json. I suspect Glue holds the pointer to the current metadata. |
It does. You can see the current pointer in the table properties if you call Glue's DescribeTable.
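That Glue lookup can be sketched in Python with boto3 (a hedged sketch, not official extension code: the database/table names are placeholders, and it assumes the Glue table stores the Iceberg pointer under the `metadata_location` table parameter, as Glue-managed Iceberg tables conventionally do):

```python
def metadata_location_from_glue_response(response):
    """Extract the current Iceberg metadata pointer from a Glue GetTable response.

    Glue-managed Iceberg tables keep the pointer in the table's Parameters map.
    """
    return response["Table"]["Parameters"]["metadata_location"]


def fetch_metadata_location(database, table):
    """Fetch the current metadata location for a Glue Iceberg table.

    Assumes boto3 is installed and AWS credentials are configured.
    """
    import boto3  # imported lazily so the helper above stays dependency-free

    glue = boto3.client("glue")
    return metadata_location_from_glue_response(
        glue.get_table(DatabaseName=database, Name=table)
    )
```

The returned location can then be passed directly to `iceberg_scan` in place of the missing version hint.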
|
Currently, no Iceberg catalog implementations are available in the iceberg extension. Without a version hint, you will need to pass the direct path to the correct metadata file manually; check: |
@samansmink thanks, but the work-around does not seem to work, though: when I run SELECT PULocationID, DOLocationID, trip_miles, trip_time FROM iceberg_scan('s3://bucket/iceberg/taxi/metadata/aaa.metadata.json') WHERE pickup_datetime >= '2022-01-01T00:00:00-05:00' AND pickup_datetime < '2022-01-02T00:00:00-05:00' I still get a 404 on the version file.
As if it was trying to append the |
Small update - I needed to upgrade to 0.9.2 to scan a JSON file (posting here in case others stumble on this). The new error I get is
If I try with
Any way around all of this? |
Small update 2 - I think I know why the avro path resolution does not work, just by looking closely at:
A Nessie (written with Spark) file system uses s3a:// paths. A quick way to patch this would be to replace the Nessie prefix with the standard s3:// one for object storage paths (or allow a flag that somehow toggles that behavior, etc.). A longer-term fix would be to have Nessie return more general, non-Nessie-specific paths. What do you think could be a short-term work-around, @samansmink ? |
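The prefix swap suggested above can be sketched in a few lines (a client-side work-around, assuming the s3a/s3n schemes address the same objects as plain s3://, which is the Hadoop convention):

```python
def normalize_s3_scheme(path: str) -> str:
    """Rewrite Hadoop-style s3a:// or s3n:// URIs to the s3:// scheme DuckDB expects.

    s3a and s3n are Hadoop filesystem scheme names; the bucket/key portion is
    unchanged, so a simple prefix swap is usually sufficient.
    """
    for prefix in ("s3a://", "s3n://"):
        if path.startswith(prefix):
            return "s3://" + path[len(prefix):]
    return path
```

This could be applied to manifest paths before handing them to the filesystem layer; whether to do it behind a flag is the open design question.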
@jacopotagliabue s3a:// URLs are indeed not supported currently. If s3a:// URLs are interoperable with s3:// URLs, which, as far as I can tell from a quick look, seems to be the case, we could consider adding support to DuckDB, which would solve this issue. |
That would be great and the easiest fix - I'll reach out to the Nessie folks anyway to let them know about this, but if you could make the change in DuckDB, that would (presumably?) solve the current issue. |
For Java Iceberg users out there, I found a solution to retrieve the latest metadata without having to query the catalog directly. Once you load the table from the catalog, you can call the following method, which will return the latest metadata location.
I tested it on both Glue and Nessie. It should make things somewhat easier, but I still hope there will be a cleaner solution in the extension later on. |
hi @harel-e, just making sure I understand. If you pass the JSON you get back from a Nessie endpoint using the standard API for the table, and then issue something like:
you are able to get duckdb-iceberg working? |
Yes, DuckDB 0.9.2 with Iceberg is working for me on the following setups: a. AWS S3 + AWS Glue |
I was able to get this working by looking up the current metadata URL using the Glue API/CLI, then using that URL to query Iceberg.
Works for me at the moment. |
This appears to also be an issue with Iceberg tables created using the Iceberg quick start at https://iceberg.apache.org/spark-quickstart/#docker-compose (using DuckDB 0.10.0). There are a few other oddities and observations:
The prefixing of the |
Confirming that
still does not work with a Dremio-created table and a Nessie catalog. The error is: Any chance we could make the version hint optional, given that it is not part of the official Iceberg spec and many implementations seem to ignore it? |
Can confirm that this still does not work for Iceberg tables created with catalog.create_table(). PyIceberg work-around: load the Iceberg table using a PyIceberg catalog (I'm using Glue), then use the metadata_location field for the scan.
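A minimal sketch of that work-around (the catalog name and table identifier are placeholders; it assumes a configured PyIceberg catalog and the duckdb package, and uses PyIceberg's `metadata_location` table attribute):

```python
def iceberg_scan_query(metadata_location: str, columns: str = "*") -> str:
    """Build a DuckDB iceberg_scan query from an explicit metadata file location.

    Single quotes in the path are doubled, per SQL string-literal escaping rules.
    """
    escaped = metadata_location.replace("'", "''")
    return f"SELECT {columns} FROM iceberg_scan('{escaped}')"


# Hedged usage sketch (requires pyiceberg + duckdb and a configured catalog):
# from pyiceberg.catalog import load_catalog
# import duckdb
#
# table = load_catalog("glue").load_table("mydb.mytable")
# duckdb.sql(iceberg_scan_query(table.metadata_location)).show()
```

This sidesteps the version-hint lookup entirely, since the scan is pointed at the exact metadata file the catalog considers current.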
|
I'm using this work-around for a SQLite catalog:

```python
import shutil
import os
import sqlite3

def create_metadata_for_tables(tables):
    """
    Iterate through all tables and create metadata files.

    Parameters:
        tables (list): A list of dictionaries, each representing an Iceberg table with a 'metadata_location'.
    """
    for table in tables:
        metadata_location = table['metadata_location'].replace('file://', '')
        metadata_dir = os.path.dirname(metadata_location)
        new_metadata_file = os.path.join(metadata_dir, 'v1.metadata.json')
        version_hint_file = os.path.join(metadata_dir, 'version-hint.text')
        # Ensure the metadata directory exists
        os.makedirs(metadata_dir, exist_ok=True)
        # Copy the metadata file to v1.metadata.json
        shutil.copy(metadata_location, new_metadata_file)
        print(f"Copied metadata file to {new_metadata_file}")
        # Create the version-hint.text file with content "1"
        with open(version_hint_file, 'w') as f:
            f.write('1')
        print(f"Created {version_hint_file} with content '1'")

def get_iceberg_tables(database_path):
    """
    Connect to the SQLite database and retrieve the list of Iceberg tables.

    Parameters:
        database_path (str): The path to the SQLite database file.

    Returns:
        list: A list of dictionaries, each representing an Iceberg table.
    """
    # Connect to the SQLite database
    con_meta = sqlite3.connect(database_path)
    con_meta.row_factory = sqlite3.Row
    # Create a cursor object to execute SQL queries
    cursor = con_meta.cursor()
    # Query to list all tables in the database
    query = 'SELECT * FROM "iceberg_tables" ORDER BY "catalog_name", "table_namespace", "table_name";'
    # Execute the query
    cursor.execute(query)
    # Fetch all results
    results = cursor.fetchall()
    # Convert results to a list of dictionaries
    table_list = []
    for row in results:
        row_dict = {key: row[key] for key in row.keys()}
        table_list.append(row_dict)
    # Close the connection
    con_meta.close()
    return table_list
```

Usage:

```python
database_path = "/your/path"
# Retrieve the list of Iceberg tables
tables = get_iceberg_tables(database_path)
# Create metadata for each table
create_metadata_for_tables(tables)
# Print the final tables list
for table in tables:
    print(table)
```
|
I can confirm that the issue persists on DuckDB. Here's what I did in the DuckDB CLI:

```sql
INSTALL iceberg;
LOAD iceberg;
INSTALL httpfs;
LOAD httpfs;
CREATE SECRET secret1 (
    TYPE S3,
    KEY_ID 'key-here',
    SECRET 'secret-here',
    REGION 'us-east-1',
    ENDPOINT '127.0.0.1:9000',
    USE_SSL 'false'
);
SELECT * FROM iceberg_scan('s3://warehouse/nyc/taxis/metadata/00002-fc696445-7a22-4653-bbca-fc95d070b71a.metadata.json');
```

> HTTP Error: Unable to connect to URL "http://warehouse.minio.iceberg.orb.local:9000/nyc/taxis/metadata/00002-fc696445-7a22-4653-bbca-fc95d070b71a.metadata.json": 404 (Not Found) |
PyIceberg (as of version 0.6) does not create these version-hint.text files. This makes it difficult to use Iceberg with DuckDB unless you're reading files created with Spark or another engine that writes version-hint.text files. Being reliant on a Spark engine to do writes undermines one major advantage of DuckDB: that it can do more with less. PyIceberg is a lightweight Iceberg library that can be a great pairing with DuckDB and this extension. As noted in this issue, version-hint.text files don't appear to be part of the standard spec. For an example of a notebook creating an Iceberg table that cannot be read by DuckDB, see here. |
I'd be interested in taking a stab at resolving this issue but would want to get some agreement on what the appropriate logic flow should be. As I see it, there are at least a few questions:
|
|
@lamb-russell I agree. I was thinking about this a bit more and thinking we'd do more-or-less what you proposed but with a few adjustments:
|
FYI, I'll be working on this in #63. It may take me a little bit to get my dev environment up and running, but I hope to wrap it up soon. Suggestions for the possible solution (or changes in direction) welcome. |
This is in a working state now. It essentially adds two parameters to the iceberg functions:
|
Addresses #29: Support missing version-hint.txt and provide additional options
As this PR doesn't make it "just work" yet, are there any other changes we'd want to consider to support the absence of the version hint? It would be heavy handed, but we could glob out all the metadata files and pick the largest version (or most recent)? |
I just made a build out of the latest main branch (including your changes) via
It definitely finds the metadata file; otherwise it couldn't find the link to the avro path.

```
cat /home/ramon/Desktop/PROJECTS/personal/duckdb-iceberg-test/warehouse/default.db/test/metadata/snap-9095995686510048943-0-fdce8f11-65e4-4ae8-9c8e-e0f11cd48f2d.avro
Obj
snapshot-id&9095995686510048943$parent-snapshot-id$866746840554898102sequence-number20format-version2avro.schema�{"type": "record", "fields": [{"name": "manifest_path", "field-id": 500, "type": "string", "doc": "Location URI with FS scheme"}, {"name": "manifest_length", "field-id": 501, "type": "long
```

Also, to be certain that it is not some kind of OS-level permission issue, I put 777 on the exact avro file. |
I was able to reproduce this and also observed the same behavior when looking at the test suite files. I'm going to re-test on the original PR branch to see if I can trace it down. Oddly, it seems like the 1st filesystem action works and all subsequent actions fail. |
Upon further investigation, I suspect what you're observing is actually a different issue, having to do with the way pyiceberg represents relative paths being incompatible with how the duckdb_iceberg extension handles them. This should probably be tracked as a separate issue. It seems that we probably need the metadata file handling to support the file:// prefix. I don't know this part of the code well enough yet to make any other claims about it. I can show that it does seem to be related to paths with a few tests:
I suspect it's a bigger problem that pyiceberg is writing out the paths to files relative to the working directory (note |
At least I'm glad to hear that you're able to reproduce it; then it's not just me!
That's interesting: in the example that I tested, the paths that pyiceberg wrote were absolute and not relative to the working directory. Apparently it depends on how you configure pyiceberg. For example, if you would do this: UPDATE: Yes, you can.

```
D SELECT count(*) FROM iceberg_scan("/home/ramon/Desktop/PROJECTS/personal/duckdb-iceberg-test/warehouse/default.db/test", version="00002-49aec47d-117b-494a-a22e-6198661826b0");
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│   3066766    │
└──────────────┘
```

I think this could be a potential solution; at least you don't have to provide the version flag yourself anymore. However, I don't know the internals of Iceberg well enough to say whether this is a good solution. |
That's great that it works without the
Neither do I, but I'm not opposed to it. Perhaps something where version can also be provided as a pattern, in addition to a version or hint file? |
Actually, I took another look at this, and I think the issue is deeper than the iceberg extension. It seems DuckDB won't work with the file:// protocol. See this example:
|
Duckdb's While I agree that supporting |
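As a client-side stopgap for the file:// limitation discussed above, a file:// URI can be converted to a plain local path before handing it to DuckDB. A sketch using only the Python standard library:

```python
from urllib.parse import urlparse
from urllib.request import url2pathname


def strip_file_scheme(uri: str) -> str:
    """Convert a file:// URI to a plain local path DuckDB can open.

    Non-file URIs (s3://, http://, plain paths) are passed through unchanged.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        return uri
    # url2pathname also decodes percent-escapes like %20
    return url2pathname(parsed.path)
```

Applying this to the `metadata_location` reported by pyiceberg would let the existing extension code resolve the path without any DuckDB changes.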
Has anyone faced this error? This is what my duckdb setup looks like: `INSTALL iceberg; SET s3_access_key_id='key'; SELECT * |
That error is coming from the binder; I don't think it's related to reading the Iceberg table. |
Yeah, I realized the Iceberg extension cannot read tables that have had updates or deletes applied; it can only read tables that have been written once. So MoR (merge-on-read) tables cannot be read. |
@humaidkidwai I see, that's a good point. Do you mind opening a new issue on this repo to track that, and perhaps include your example above? It's another issue we should tackle. |
Any solution for this? |
Sad to see that while all the major libraries have started supporting Iceberg, DuckDB still doesn't fully support it due to this issue. |
Since demand for this issue/feature seems to be increasing, I was wondering if this could still be a good potential solution? I don't know the Iceberg format well enough to be really opinionated about this. If it is, I'm willing to put some time into this and hopefully come up with a PR that resolves the issue for scenarios without a version or hint file. |
I took a look at the file:// protocol and convinced myself that it should be handled in DuckDB directly. It's not that the Iceberg extension doesn't support it; it's that DuckDB doesn't at all. The upside is that other protocols are likely not broken; S3 works for me. The option to infer the latest version from file path patterns is something I've wanted to do for a while now. Maybe it's time. I held off from including this in the version PR, as I wasn't sure whether it would actually make things worse or more confusing by not following the expectation of having a data catalog provide that information. However, in the spirit of "just give me my data", maybe we should go for it. Given the number of possible implementations, perhaps we should map out a spec for how this would work in this thread? I know the details of the extension well enough now that I would be happy to advise/help with the implementation. |
I've taken some time this weekend to take a stab at finally finishing this. The new implementation adds one additional possible value you can provide to
If you had a directory with the following files:
Loading with:
Note that I haven't finished testing this yet or pushed it, but I have put the implementation in place. Once we finish testing, review, and documentation, it should be sufficient to close this issue, as no version-hint, version, or catalog will be required to load an Iceberg table the way we'd expect.
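The version-guessing behavior described above could look roughly like this (a sketch, not the extension's actual C++ code; it assumes the conventional `<NNNNN>-<uuid>.metadata.json` naming and picks the highest numeric prefix):

```python
import os
import re


def guess_latest_metadata(metadata_dir: str) -> str:
    """Pick the highest-versioned *.metadata.json file in a metadata directory.

    Iceberg metadata files are conventionally named '<NNNNN>-<uuid>.metadata.json',
    so sorting on the numeric prefix yields the most recently committed one.
    """
    pattern = re.compile(r"^(\d+)-.*\.metadata\.json$")
    candidates = []
    for name in os.listdir(metadata_dir):
        match = pattern.match(name)
        if match:
            candidates.append((int(match.group(1)), name))
    if not candidates:
        raise FileNotFoundError(f"no versioned metadata files in {metadata_dir}")
    # max() on (version, name) tuples picks the highest version number
    return os.path.join(metadata_dir, max(candidates)[1])
```

This ignores non-matching files such as version-hint.text and snapshot avro files, which is why it only works for tables that follow the standard naming convention.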
Hi @teaguesterling, thanks again for putting in this effort. Let me know if you want someone to test this functionality; I can build from source and test on a couple of different Iceberg tables I have locally (some created via pyiceberg and therefore lacking a version / version-hint file). It would be nice to have some of the main contributors involved to get a good, opinionated view on this implementation. Any chance one of you has time to take a look? @samansmink @carlopi |
Happy to review a PR implementing this, I think it makes sense! |
I think we can consider this closed with #79, maybe pending documentation. |
My s3 bucket with iceberg (picture below) cannot be queried with
iceberg_scan('s3://bucket/iceberg', ALLOW_MOVED_PATHS=true)
nor
iceberg_scan('s3://bucket/iceberg/*', ALLOW_MOVED_PATHS=true)
In particular the system is trying to find a very specific file (so the * pattern gets ignored):
duckdb.duckdb.Error: Invalid Error: HTTP Error: Unable to connect to URL https://bucket.s3.amazonaws.com/iceberg/metadata/version-hint.text
Unfortunately that file does not exist in my iceberg/ folder, nor in any of the iceberg/sub/metadata folders. Compared to the data zip in duckdb docs about iceberg, it is clear "my iceberg tables" are missing that file, which is important for the current implementation.
That said, the version hint seems to be something we do not really need, as that info could default to a version or be supplied as an additional parameter (instead of failing when the file is not found)?
Original discussion with @Alex-Monahan in dbt Slack is here: note that I originally got pointed to this as a possible cause, so perhaps reading a table that is formally Iceberg is not really independent from the data catalog it belongs to?