[DISCUSSION] Make it easy and fast to query files on remote files (S3, iceberg, etc) #13456

Open
4 tasks
alamb opened this issue Nov 17, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Nov 17, 2024

Is your feature request related to a problem or challenge?

I personally think making it easy to use DataFusion with the "open data lake" stack is very important over the next few months.

@julienledem wrote up a very nice piece describing The advent of the Open Data Lake

The high-level idea is to make it really easy for people to build systems that query (quickly!) Parquet files stored on remote object stores, including files managed by table formats such as Apache Iceberg, Delta Lake, Hudi, etc.

You can already use DataFusion (and datafusion-cli) to query such data, but it takes nontrivial effort to configure and tune for good performance. My idea is to make this easier and to make DataFusion better out of the box.

With that as a building block, people could (and would) build applications and systems targeting specific use cases.

I don't yet fully understand where we currently stand on this goal, but I wanted to start the discussion.

Describe the solution you'd like

In my mind, the specific work this entails includes things like

Describe alternatives you've considered

One specific item, brought up by @MrPowers, would be to try DataFusion with the "10B row challenge" described in https://dataengineeringcentral.substack.com/p/10-billion-row-challenge-duckdb-vs .

I suspect the results would be less than ideal at first, but trying it to figure out what the challenges are would help us focus our efforts.

Additional context

No response

@alamb alamb added the enhancement New feature or request label Nov 17, 2024
@comphead
Contributor

What about support for remote HDFS files? We have a contrib project, https://github.com/datafusion-contrib/datafusion-objectstore-hdfs, which is supposed to support querying HDFS, but I'm not sure how far along it is.

@alamb
Contributor Author

alamb commented Nov 19, 2024

Yes, I think HDFS would be another good target.

Basically, I want to make sure that it is as easy as possible to use DataFusion to query data that lives on remote systems (that is, where the data is not on a local NVMe drive but must be accessed over the network).

@jonathanc-n
Contributor

jonathanc-n commented Nov 21, 2024

I forgot to mention this here: apache/iceberg-rust#700 (write support) is a really nice issue opened in iceberg-rust. Getting the Rust implementation of Iceberg up and running would probably help DataFusion a bit on the data lake side of things.

@alamb
Contributor Author

alamb commented Nov 21, 2024

> I forgot to mention this here: apache/iceberg-rust#700 (write support) is a really nice issue opened in iceberg-rust. Getting the Rust implementation of Iceberg up and running would probably help DataFusion a bit on the data lake side of things.

Yes, 100% -- one of my goals is to make it easy for this to "just work" with DataFusion. I think we are still some way from that at the moment.


3 participants