register_filesystem has no effect after v1.1.0 upgrade #87

Open
ozen opened this issue Sep 10, 2024 · 4 comments

ozen commented Sep 10, 2024

Using fsspec Filesystems used to work for me when using delta_scan. Now it doesn't, and the reason appears to be the version upgrade.

This code works as expected:

import duckdb
from fsspec import filesystem

duckdb.register_filesystem(filesystem('gcs'))
duckdb.sql("SELECT * FROM read_csv('gcs:///bucket/file.csv')")

This code used to work, now raises an exception:

import duckdb
from fsspec import filesystem

duckdb.register_filesystem(filesystem('gcs'))
duckdb.sql("SELECT * FROM delta_scan('gcs:///bucket/table')")

The exception:

NotImplementedException: Not implemented Error: Can not scan a gcs:// gs:// or r2:// url without a secret providing its endpoint currently. Please create an R2 or GCS secret containing the credentials for this endpoint and try again.

I think register_filesystem should take priority over the built-in filesystems.

@samansmink
Collaborator

Hey @ozen, thanks for reporting. This is indeed slightly quirky at the moment. However, the error message hints at the workaround: creating a GCS type secret should make this work. Note that you should not need fsspec here.

import duckdb
duckdb.sql("CREATE SECRET gcs1 (TYPE GCS)")
duckdb.sql("SELECT * FROM read_csv('gcs://bucket/file.csv')")
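Since the original report concerns delta_scan rather than read_csv, the same workaround can be sketched for that call. This is only a sketch under stated assumptions: the helper name scan_delta_with_empty_gcs_secret is hypothetical, `con` is assumed to be a DuckDB connection (or the duckdb module itself) exposing `.sql(query)`, and the delta extension is assumed to be available.

```python
def scan_delta_with_empty_gcs_secret(con, table_url, secret_name="gcs1"):
    """Hypothetical helper: register an empty GCS secret, then scan a Delta table.

    `con` is assumed to be a DuckDB connection (anything exposing `.sql(query)`).
    """
    # An empty GCS secret is the workaround suggested by the error message.
    con.sql(f"CREATE OR REPLACE SECRET {secret_name} (TYPE GCS)")
    # Escape single quotes so the URL stays a valid SQL string literal.
    safe_url = table_url.replace("'", "''")
    return con.sql(f"SELECT * FROM delta_scan('{safe_url}')")
```

With a real connection this would be called as scan_delta_with_empty_gcs_secret(duckdb.connect(), 'gcs://bucket/table'). As noted below, this is only expected to help for data that needs no fsspec-side authentication.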

Also note that using fsspec with authentication will currently not work at all, because part of the IO is handled by the delta kernel using its internal cloud storage libraries, while the other part is handled through DuckDB. This means that any auth you configure through fsspec will not be propagated to the kernel.

Either way I will look into removing the need for the empty gcs secret here.

@ozen
Author

ozen commented Sep 11, 2024

@samansmink thanks for the detailed answer.

From an enterprise standpoint, there are considerable differences between using HMAC keys with the interoperability layer and using the standard methods of GCP authentication. I think not every user will simply be able to use HMAC keys. fsspec provides a way to use GCP authentication schemes.
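For reference, the HMAC route mentioned here corresponds to a GCS secret that carries a key id and a secret. A minimal sketch, assuming DuckDB's GCS secret type accepts KEY_ID and SECRET parameters; the helper name gcs_hmac_secret_sql and the default secret name are hypothetical, and the credentials are placeholders to be supplied by the caller.

```python
def gcs_hmac_secret_sql(key_id, secret, name="gcs_hmac"):
    """Hypothetical helper: build a CREATE SECRET statement for GCS HMAC keys."""

    def quote(value):
        # Double any single quotes so the value is a valid SQL string literal.
        return "'" + value.replace("'", "''") + "'"

    return (
        f"CREATE OR REPLACE SECRET {name} "
        f"(TYPE GCS, KEY_ID {quote(key_id)}, SECRET {quote(secret)})"
    )
```

The resulting string would be passed to duckdb.sql(...) before querying. This is exactly the credential type that is not available to every user, which is why fsspec-based authentication matters here.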

Is there any way to move the IO from the kernel to duckdb?

@samansmink
Collaborator

Well, I think it may have worked accidentally before, but only on public data. I don't really see how authentication would have worked there.

> Is there any way to move the IO from the kernel to duckdb?

Yes! This is actually what the peeps over at the delta-kernel-rs project are working on right now. Currently, DuckDB relies on the kernel to do IO for things like metadata reads, deletion vector reads, checkpoints, etc. However, the idea is that the kernel will support APIs in the future that let DuckDB do all IO itself. This will also allow us to remove the convoluted code in https://github.com/duckdb/duckdb_delta/blob/24d9b782b1da7676e4c8aae7b9d7650cb035276c/src/functions/delta_scan.cpp#L115 that we currently require.

With that, we will be able to support using fsspec for delta cleanly.

@ozen
Author

ozen commented Sep 14, 2024

@samansmink Thank you again for the detailed explanation. Great to hear that!
