Can't use the extension if my data catalog did not create a version-hint.text file #29
Comments
Sorry, to be a bit clearer: even if we fix the above, happy to help debug this if there's something we can quickly try out. |
I ran into a similar issue using AWS with Glue as the catalog for Iceberg. The metadata files stored in S3 follow this pattern: 00000-0b4430d2-fbee-4b0d-90c9-725f013d6f82.metadata.json. I suspect Glue holds the pointer to the current metadata. |
It does. You can see the current pointer in the table properties if you call Glue's DescribeTable.
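That Glue lookup can be sketched in Python with boto3 (a hedged sketch, not official extension code: the database/table names are placeholders, and it assumes the Glue table stores the Iceberg pointer under the `metadata_location` table parameter, as Glue-managed Iceberg tables conventionally do):

```python
def metadata_location_from_glue_response(response):
    """Extract the current Iceberg metadata pointer from a Glue GetTable response.

    Glue-managed Iceberg tables keep the pointer in the table's Parameters map.
    """
    return response["Table"]["Parameters"]["metadata_location"]


def fetch_metadata_location(database, table):
    """Fetch the current metadata location for a Glue Iceberg table.

    Assumes boto3 is installed and AWS credentials are configured.
    """
    import boto3  # imported lazily so the helper above stays dependency-free

    glue = boto3.client("glue")
    return metadata_location_from_glue_response(
        glue.get_table(DatabaseName=database, Name=table)
    )
```

The returned location can then be passed directly to `iceberg_scan` in place of the missing version hint.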
|
Currently, no Iceberg catalog implementations are available in the iceberg extension. Without a version hint, you will need to pass the direct path to the correct metadata file manually; check: |
@samansmink thanks, but the work-around does not seem to work, though: when I run SELECT PULocationID, DOLocationID, trip_miles, trip_time FROM iceberg_scan('s3://bucket/iceberg/taxi/metadata/aaa.metadata.json') WHERE pickup_datetime >= '2022-01-01T00:00:00-05:00' AND pickup_datetime < '2022-01-02T00:00:00-05:00' I still get a 404 on the version file.
As if it was trying to append the |
Small update - I needed to upgrade to 0.9.2 to scan a JSON file (posting here in case others stumble on this). The new error I get is
If I try with
Any way around all of this? |
Small update 2 - I think I know why the avro path resolution does not work, just by looking closely at:
A Nessie (written with Spark) file system uses s3a:// paths. A quick way to patch this would be to replace the Nessie prefix with the standard s3:// one for object storage paths (or allow a flag that somehow toggles that behavior, etc.). A longer-term fix would be to have Nessie return more general, non-Nessie-specific paths. What do you think could be a short-term work-around, @samansmink ? |
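The prefix swap suggested above can be sketched in a few lines (a client-side work-around, assuming the s3a/s3n schemes address the same objects as plain s3://, which is the Hadoop convention):

```python
def normalize_s3_scheme(path: str) -> str:
    """Rewrite Hadoop-style s3a:// or s3n:// URIs to the s3:// scheme DuckDB expects.

    s3a and s3n are Hadoop filesystem scheme names; the bucket/key portion is
    unchanged, so a simple prefix swap is usually sufficient.
    """
    for prefix in ("s3a://", "s3n://"):
        if path.startswith(prefix):
            return "s3://" + path[len(prefix):]
    return path
```

This could be applied to manifest paths before handing them to the filesystem layer; whether to do it behind a flag is the open design question.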
@jacopotagliabue s3a:// URLs are indeed not supported currently. If s3a:// URLs are interoperable with s3:// URLs, which, as far as I can tell from a quick look, seems to be the case, we could consider adding support to DuckDB, which would solve this issue. |
That would be great and the easiest fix - I'll reach out to the Nessie folks anyway to let them know about this, but if you could make the change in DuckDB, that would (presumably?) solve the current issue. |
For Java Iceberg users out there, I found a solution to retrieve the latest metadata without having to query the catalog directly. Once you load the table from the catalog, you can call the following method, which will return the latest metadata location.
I tested it on both Glue and Nessie. It should make things somewhat easier, but I still hope there will be a cleaner solution in the extension later on. |
hi @harel-e, just making sure I understand. If you pass the JSON you get back from a Nessie endpoint using the standard API for the table, and then issue something like:
you are able to get duckdb-iceberg working? |
Yes, DuckDB 0.9.2 with Iceberg is working for me on the following setups: a. AWS S3 + AWS Glue |
I was able to get this working by looking up the current metadata URL using the Glue API/CLI, then using that URL to query Iceberg.
Works for me at the moment. |
This appears to also be an issue with Iceberg tables created using the Iceberg quick start at https://iceberg.apache.org/spark-quickstart/#docker-compose (using DuckDB 0.10.0). There are a few other oddities and observations:
The prefixing of the |
Confirming that
still does not work with a Dremio-created table and a Nessie catalog. The error is: Any chance we could make the version hint optional, given that it is not part of the official Iceberg spec and many implementations seem to ignore it? |
Can confirm that this still does not work for Iceberg tables created with catalog.create_table(). PyIceberg work-around: load the Iceberg table using a PyIceberg catalog (I'm using Glue), then use the metadata_location field for the scan.
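A minimal sketch of that work-around (the catalog name and table identifier are placeholders; it assumes a configured PyIceberg catalog and the duckdb package, and uses PyIceberg's `metadata_location` table attribute):

```python
def iceberg_scan_query(metadata_location: str, columns: str = "*") -> str:
    """Build a DuckDB iceberg_scan query from an explicit metadata file location.

    Single quotes in the path are doubled, per SQL string-literal escaping rules.
    """
    escaped = metadata_location.replace("'", "''")
    return f"SELECT {columns} FROM iceberg_scan('{escaped}')"


# Hedged usage sketch (requires pyiceberg + duckdb and a configured catalog):
# from pyiceberg.catalog import load_catalog
# import duckdb
#
# table = load_catalog("glue").load_table("mydb.mytable")
# duckdb.sql(iceberg_scan_query(table.metadata_location)).show()
```

This sidesteps the version-hint lookup entirely, since the scan is pointed at the exact metadata file the catalog considers current.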
|
I'm using this work-around for a SQLite catalog:

```python
import shutil
import os
import sqlite3

def create_metadata_for_tables(tables):
    """
    Iterate through all tables and create metadata files.

    Parameters:
        tables (list): A list of dictionaries, each representing an Iceberg table with a 'metadata_location'.
    """
    for table in tables:
        metadata_location = table['metadata_location'].replace('file://', '')
        metadata_dir = os.path.dirname(metadata_location)
        new_metadata_file = os.path.join(metadata_dir, 'v1.metadata.json')
        version_hint_file = os.path.join(metadata_dir, 'version-hint.text')
        # Ensure the metadata directory exists
        os.makedirs(metadata_dir, exist_ok=True)
        # Copy the metadata file to v1.metadata.json
        shutil.copy(metadata_location, new_metadata_file)
        print(f"Copied metadata file to {new_metadata_file}")
        # Create the version-hint.text file with content "1"
        with open(version_hint_file, 'w') as f:
            f.write('1')
        print(f"Created {version_hint_file} with content '1'")

def get_iceberg_tables(database_path):
    """
    Connect to the SQLite database and retrieve the list of Iceberg tables.

    Parameters:
        database_path (str): The path to the SQLite database file.

    Returns:
        list: A list of dictionaries, each representing an Iceberg table.
    """
    # Connect to the SQLite database
    con_meta = sqlite3.connect(database_path)
    con_meta.row_factory = sqlite3.Row
    # Create a cursor object to execute SQL queries
    cursor = con_meta.cursor()
    # Query to list all tables in the database
    query = 'SELECT * FROM "iceberg_tables" ORDER BY "catalog_name", "table_namespace", "table_name";'
    # Execute the query
    cursor.execute(query)
    # Fetch all results
    results = cursor.fetchall()
    # Convert results to a list of dictionaries
    table_list = []
    for row in results:
        row_dict = {key: row[key] for key in row.keys()}
        table_list.append(row_dict)
    # Close the connection
    con_meta.close()
    return table_list
```

Usage:

```python
database_path = "/your/path"
# Retrieve the list of Iceberg tables
tables = get_iceberg_tables(database_path)
# Create metadata for each table
create_metadata_for_tables(tables)
# Print the final tables list
for table in tables:
    print(table)
```
|
I can confirm that the issue persists on DuckDB. Here's what I did in the DuckDB CLI:

```sql
INSTALL iceberg;
LOAD iceberg;
INSTALL httpfs;
LOAD httpfs;
CREATE SECRET secret1 (
    TYPE S3,
    KEY_ID 'key-here',
    SECRET 'secret-here',
    REGION 'us-east-1',
    ENDPOINT '127.0.0.1:9000',
    USE_SSL 'false'
);
SELECT * FROM iceberg_scan('s3://warehouse/nyc/taxis/metadata/00002-fc696445-7a22-4653-bbca-fc95d070b71a.metadata.json');
```

> HTTP Error: Unable to connect to URL "http://warehouse.minio.iceberg.orb.local:9000/nyc/taxis/metadata/00002-fc696445-7a22-4653-bbca-fc95d070b71a.metadata.json": 404 (Not Found) |
PyIceberg (as of version 0.6) does not create these version-hint.text files. This makes it difficult to use Iceberg with DuckDB unless you're reading files created with Spark or another engine that writes version-hint.text files. Being reliant on a Spark engine to do writes undermines one major advantage of DuckDB: that it can do more with less. PyIceberg is a lightweight Iceberg library that can be a great pairing with DuckDB and this extension. As noted in this issue, version-hint.text files don't appear to be part of the standard spec. For an example of a notebook creating an Iceberg table that cannot be read by DuckDB, see here. |
I'd be interested in taking a stab at resolving this issue but would want to get some agreement on what the appropriate logic flow should be. As I see it, there are at least a few questions:
|
|
@lamb-russell I agree. I was thinking about this a bit more and thinking we'd do more-or-less what you proposed but with a few adjustments:
|
FYI, I'll be working on this in #63. It may take me a little bit to get my dev environment up and running, but I hope to wrap it up soon. Suggestions for the possible solution (or changes in direction) welcome. |
This is in a working state now. It essentially adds two parameters to the iceberg functions:
|
Addresses #29: Support missing version-hint.txt and provide additional options
As this PR doesn't make it "just work" yet, are there any other changes we'd want to consider to support the absence of the version hint? It would be heavy handed, but we could glob out all the metadata files and pick the largest version (or most recent)? |
I just made a build out of the latest main branch (including your changes) via
It definitely finds the metadata file; otherwise it couldn't find the link to the avro path.

```
cat /home/ramon/Desktop/PROJECTS/personal/duckdb-iceberg-test/warehouse/default.db/test/metadata/snap-9095995686510048943-0-fdce8f11-65e4-4ae8-9c8e-e0f11cd48f2d.avro
Obj
snapshot-id&9095995686510048943$parent-snapshot-id$866746840554898102sequence-number20format-version2avro.schema�{"type": "record", "fields": [{"name": "manifest_path", "field-id": 500, "type": "string", "doc": "Location URI with FS scheme"}, {"name": "manifest_length", "field-id": 501, "type": "long
```

Also, to be certain that it is not some kind of OS-level permission issue, I put 777 on the exact avro file. |
I was able to reproduce this and also observed the same behavior when looking at the test suite files. I'm going to re-test on the original PR branch to see if I can trace it down. Oddly, it seems like the 1st filesystem action works and all subsequent actions fail. |
Upon further investigation, I suspect what you're observing is actually a different issue, having to do with the way pyiceberg represents relative paths being incompatible with how the duckdb_iceberg extension handles them. This should probably be tracked as a separate issue. It seems that we probably need the metadata file handling to support the file:// prefix. I don't know this part of the code well enough yet to make any other claims about it. I can show that it does seem to be related to paths with a few tests:
I suspect it's a bigger problem that pyiceberg is writing out the paths to files relative to the working directory (note |
At least I'm glad to hear that you're able to reproduce it; then it's not just me!
That's interesting: in the example that I tested, the paths that pyiceberg wrote were absolute and not relative to the working directory. Apparently it depends on how you configure pyiceberg. For example, if you would do this: UPDATE: Yes, you can.

```
D SELECT count(*) FROM iceberg_scan("/home/ramon/Desktop/PROJECTS/personal/duckdb-iceberg-test/warehouse/default.db/test", version="00002-49aec47d-117b-494a-a22e-6198661826b0");
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│   3066766    │
└──────────────┘
```

I think this could be a potential solution; at least you don't have to provide the version flag yourself anymore. However, I don't know the internals of Iceberg well enough to say whether this is a good solution. |
That's great that it works without the
Neither do I, but I'm not opposed to it. Perhaps something where version can also be provided as a pattern, in addition to a version or hint file? |
Actually, I took another look at this, and I think the issue is deeper than the iceberg extension. It seems DuckDB won't work with the file:// protocol. See this example:
|
Duckdb's While I agree that supporting |
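As a client-side stopgap for the file:// limitation discussed above, a file:// URI can be converted to a plain local path before handing it to DuckDB. A sketch using only the Python standard library:

```python
from urllib.parse import urlparse
from urllib.request import url2pathname


def strip_file_scheme(uri: str) -> str:
    """Convert a file:// URI to a plain local path DuckDB can open.

    Non-file URIs (s3://, http://, plain paths) are passed through unchanged.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        return uri
    # url2pathname also decodes percent-escapes like %20
    return url2pathname(parsed.path)
```

Applying this to the `metadata_location` reported by pyiceberg would let the existing extension code resolve the path without any DuckDB changes.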
Has anyone faced this error? This is what my duckdb setup looks like: `INSTALL iceberg; SET s3_access_key_id='key'; SELECT * |
That error is coming from the binder; I don't think it's related to reading the Iceberg table. |
Yeah, I realized the Iceberg extension cannot read tables that have had updates or deletes applied; it can only read tables that have been written once. So MoR (merge-on-read) tables cannot be read. |
@humaidkidwai I see, that's a good point. Do you mind opening a new issue on this repo to track that, and perhaps include your example above? It's another issue we should tackle. |
Any solution for this? |
Sad to see that while all the major libraries have started supporting Iceberg, DuckDB still doesn't fully support it due to this issue. |
Since demand for this issue/feature seems to be increasing, I was wondering if this could still be a good potential solution? I don't know the Iceberg format well enough to be really opinionated about this. If it is, I'm willing to put some time into this and hopefully come up with a PR that resolves the issue for scenarios without a version or hint file. |
I took a look at the file:// protocol and convinced myself that it should be handled in DuckDB directly. It's not that the Iceberg extension doesn't support it; it's that DuckDB doesn't at all. The upside is that other protocols are likely not broken; S3 works for me. The option to infer the latest version from file path patterns is something I've wanted to do for a while now. Maybe it's time. I held off from including this in the version PR, as I wasn't sure whether it would actually make things worse or more confusing by not following the expectation of having a data catalog provide that information. However, in the spirit of "just give me my data", maybe we should go for it. Given the number of possible implementations, perhaps we should map out a spec for how this would work in this thread? I know the details of the extension well enough now that I would be happy to advise/help with the implementation. |
I've taken some time this weekend to take a stab at finally finishing this. The new implementation adds one additional possible value you can provide to
If you had a directory with the following files:
Loading with:
Note that I haven't finished testing this yet or pushed it, but I have put the implementation in place. Once we finish testing, review, and documentation, it should be sufficient to close this issue, as no version-hint, version, or catalog will be required to load an Iceberg table the way we'd expect.
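The version-guessing behavior described above could look roughly like this (a sketch, not the extension's actual C++ code; it assumes the conventional `<NNNNN>-<uuid>.metadata.json` naming and picks the highest numeric prefix):

```python
import os
import re


def guess_latest_metadata(metadata_dir: str) -> str:
    """Pick the highest-versioned *.metadata.json file in a metadata directory.

    Iceberg metadata files are conventionally named '<NNNNN>-<uuid>.metadata.json',
    so sorting on the numeric prefix yields the most recently committed one.
    """
    pattern = re.compile(r"^(\d+)-.*\.metadata\.json$")
    candidates = []
    for name in os.listdir(metadata_dir):
        match = pattern.match(name)
        if match:
            candidates.append((int(match.group(1)), name))
    if not candidates:
        raise FileNotFoundError(f"no versioned metadata files in {metadata_dir}")
    # max() on (version, name) tuples picks the highest version number
    return os.path.join(metadata_dir, max(candidates)[1])
```

This ignores non-matching files such as version-hint.text and snapshot avro files, which is why it only works for tables that follow the standard naming convention.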
Hi @teaguesterling, thanks again for putting in this effort. Let me know if you want someone to test this functionality; I can build from source and test on a couple of different Iceberg tables I have locally (some created via pyiceberg and therefore lacking a version / version-hint file). It would be nice to have some of the main contributors involved to get a good, opinionated view on this implementation. Any chance one of you has time to take a look? @samansmink @carlopi |
Happy to review a PR implementing this, I think it makes sense! |
I think we can consider this closed with #79, maybe pending documentation. |
My s3 bucket with iceberg (picture below) cannot be queried with
iceberg_scan('s3://bucket/iceberg', ALLOW_MOVED_PATHS=true)
nor
iceberg_scan('s3://bucket/iceberg/*', ALLOW_MOVED_PATHS=true)
In particular the system is trying to find a very specific file (so the * pattern gets ignored):
duckdb.duckdb.Error: Invalid Error: HTTP Error: Unable to connect to URL https://bucket.s3.amazonaws.com/iceberg/metadata/version-hint.text
Unfortunately that file does not exist in my iceberg/ folder, nor in any of the iceberg/sub/metadata folders. Compared to the data zip in duckdb docs about iceberg, it is clear "my iceberg tables" are missing that file, which is important for the current implementation.
That said, the version hint seems to be something we do not really need, as that info could default to a version or be supplied as an additional parameter (instead of failing when the file is not found)?
Original discussion with @Alex-Monahan in dbt Slack is here: note that I originally got pointed to this as a possible cause, so perhaps reading a table that is formally Iceberg is not really independent from the data catalog it belongs to?