[DISCUSSION] 2024 Q4 / 2025 Q1 Roadmap #13274

Open
alamb opened this issue Nov 6, 2024 · 20 comments
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Nov 6, 2024

Is your feature request related to a problem or challenge?

The last roadmap discussion we had seems to have worked out well to galvanize and get us organized around some common goals

Describe the solution you'd like

Let's collect any projects that people think they are likely to spend time on or projects that the broader community would really like to see done and write them down!

Describe alternatives you've considered

No response

Additional context

No response

@alamb alamb added the enhancement New feature or request label Nov 6, 2024
@alamb alamb pinned this issue Nov 6, 2024
@alamb
Contributor Author

alamb commented Nov 6, 2024

BTW my personal plans over the next few months are likely going to focus on consolidating some of the gains / improvements we have made recently. That includes:

Improve the project's documentation

Performance wise I plan to

@matthewmturner
Contributor

I am not sure if this is the place for it but I have been putting a lot of work into dft and plan on doing a release before end of year.

@alamb
Contributor Author

alamb commented Nov 6, 2024

I am not sure if this is the place for it but I have been putting a lot of work into dft and plan on doing a release before end of year.

For anyone else following along, dft is https://github.com/datafusion-contrib/datafusion-dft

@jayzhan211
Contributor

jayzhan211 commented Nov 7, 2024

I am not sure if this is the place for it but I have been putting a lot of work into dft and plan on doing a release before end of year.

I may want to help with the delta / iceberg integration; I think they are quite important. But I will work on performance tasks first

@matthewmturner
Contributor

I may want to help with the delta / iceberg integration; I think they are quite important. But I will work on performance tasks first

@jayzhan211 I agree, they are very important. Unfortunately, we have been held up because of the crates using different versions of datafusion. The idea was to converge on 42, which iceberg and hudi currently use, but deltalake (which we already have an integration for) is on 41 and hasn't been able to upgrade yet. It looks like they are skipping version 42 now and will use 43, so hopefully this is resolved soon.

Here is some relevant work

@matthewmturner
Contributor

@jayzhan211 and to be more explicit on my release plans, I did not plan on releasing until iceberg and hudi were added.

@jayzhan211
Contributor

LogicalType is important too #12622

@alamb
Contributor Author

alamb commented Nov 7, 2024

I may want to help with the delta / iceberg integration; I think they are quite important. But I will work on performance tasks first

@jayzhan211 I agree, they are very important. Unfortunately, we have been held up because of the crates using different versions of datafusion.

I think another potentially very interesting approach here will be to use the FFI bindings from @timsaucer:

The idea there would be to wrap delta / iceberg in a stable ABI (aka the FFI bindings) so that dft could call a delta.rs / iceberg build that uses a different version of DataFusion than dft itself.
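
To make the "stable ABI" idea concrete, here is a minimal, dependency-free sketch of the general pattern: a `#[repr(C)]` struct holding an opaque pointer plus `extern "C"` function pointers, so the caller never depends on the Rust-level layout of the wrapped type. All names here (`RowSource`, `FfiRowSource`, `FixedRows`) are hypothetical and chosen for illustration; this is not the API of the FFI bindings mentioned above, and a real table provider exposes far more than a row count.

```rust
use std::ffi::c_void;

/// Hypothetical stand-in for a table-provider-like trait.
pub trait RowSource {
    fn row_count(&self) -> u64;
}

/// C-compatible handle: an opaque pointer plus function pointers.
/// Only this layout has to stay stable across the boundary.
#[repr(C)]
pub struct FfiRowSource {
    private_data: *mut c_void,
    row_count: unsafe extern "C" fn(*const c_void) -> u64,
    release: unsafe extern "C" fn(*mut c_void),
}

unsafe extern "C" fn row_count_impl<T: RowSource>(data: *const c_void) -> u64 {
    // SAFETY: `data` was created from a `Box<T>` in `FfiRowSource::new`.
    let source = unsafe { &*(data as *const T) };
    source.row_count()
}

unsafe extern "C" fn release_impl<T: RowSource>(data: *mut c_void) {
    // SAFETY: `data` came from `Box::into_raw` and is released exactly once.
    drop(unsafe { Box::from_raw(data as *mut T) });
}

impl FfiRowSource {
    /// Producer side: hide a concrete implementation behind the C layout.
    pub fn new<T: RowSource + 'static>(source: T) -> Self {
        Self {
            private_data: Box::into_raw(Box::new(source)) as *mut c_void,
            row_count: row_count_impl::<T>,
            release: release_impl::<T>,
        }
    }
}

/// Consumer side: the handle can be used through the same trait again.
impl RowSource for FfiRowSource {
    fn row_count(&self) -> u64 {
        unsafe { (self.row_count)(self.private_data) }
    }
}

impl Drop for FfiRowSource {
    fn drop(&mut self) {
        unsafe { (self.release)(self.private_data) }
    }
}

/// Trivial implementation used only to exercise the wrapper.
pub struct FixedRows(pub u64);

impl RowSource for FixedRows {
    fn row_count(&self) -> u64 {
        self.0
    }
}
```

Wrapping a value with `FfiRowSource::new(FixedRows(3))` and calling `row_count()` on the result goes through the function pointers rather than the concrete type, which is the property that lets the two sides be compiled against different library versions.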

@timsaucer
Contributor

On the python side, getting better integration with the python delta-rs package was the entire reason for pushing for the FFI bindings. I have branches ready to go for datafusion-python and delta-rs as soon as 43.0.0 releases. I also have tested it with a few of the other table providers in datafusion-contrib.

For the pure rust implementations, I think it would be best to not cross the unsafe FFI boundary if you don't have to. Unfortunately that does put additional dependencies on the other crates updating at a reasonable pace.

@alamb
Contributor Author

alamb commented Nov 7, 2024

I think as soon as DataFusion 43.0.0 is released we'll be able to test it out:

  1. Update dft to DataFusion 43
  2. Implement a crate binding (in dft, to a delta-rs built against an older DataFusion version)

It should be quite sweet

@timsaucer
Contributor

I don't know if this discussion is the place we want to track work in the other related projects, but my top goals for 2024 Q4 are:

@matthewmturner
Contributor

One thing that's not clear to me with the FFI approach is who the intended owner of the bindings is: should it be dft, as I don't want to worry about my deps being on different datafusion versions, or is it more for the table provider crates (iceberg, deltalake, hudi, etc)?

Of course, in the short term it could be prototyped in dft and contributed back to those repos, but I'm asking more about the target state and where the appropriate home would be.

@alamb
Contributor Author

alamb commented Nov 7, 2024

One thing that's not clear to me with the FFI approach is who the intended owner of the bindings is: should it be dft, as I don't want to worry about my deps being on different datafusion versions, or is it more for the table provider crates (iceberg, deltalake, hudi, etc)?

Of course, in the short term it could be prototyped in dft and contributed back to those repos, but I'm asking more about the target state and where the appropriate home would be.

The version of DataFusion used in the bindings has to match the client program (dft in this case) so I don't think they can go in the delta/iceberg crates

One thing we might be able to do is have a separate crate like datafusion-delta-table-provider that has different feature flags for different DataFusion versions 🤔 -- but I am now more or less wildly speculating
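
As a rough illustration of the feature-flag idea, the sketch below selects a value at compile time based on which (hypothetical) Cargo feature is enabled. The feature names `datafusion-42` and `datafusion-43` and the constant `DATAFUSION_MAJOR` are assumptions made up for this example; a real crate would also need matching optional dependencies in its `Cargo.toml`, and mutually exclusive features like these go against Cargo's expectation that features are additive, so this is only one possible shape.

```rust
/// Hypothetical: the DataFusion major version this build is wired to.
/// Exactly one of these three definitions survives conditional compilation.
#[cfg(feature = "datafusion-43")]
pub const DATAFUSION_MAJOR: Option<u32> = Some(43);

#[cfg(all(feature = "datafusion-42", not(feature = "datafusion-43")))]
pub const DATAFUSION_MAJOR: Option<u32> = Some(42);

#[cfg(not(any(feature = "datafusion-42", feature = "datafusion-43")))]
pub const DATAFUSION_MAJOR: Option<u32> = None;

/// Human-readable summary of the compile-time selection.
pub fn describe() -> String {
    match DATAFUSION_MAJOR {
        Some(v) => format!("built against DataFusion {v}"),
        None => "no DataFusion version feature enabled".to_string(),
    }
}
```

In a real crate the `cfg` attributes would gate whole modules (one per supported DataFusion version) rather than a single constant.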

@jonathanc-n
Contributor

@matthewmturner Do you have anything in mind moving forward for integrating the rest of the data lakes? (such as a list of what needs to be done moving forward)

@matthewmturner
Contributor

@matthewmturner Do you have anything in mind moving forward for integrating the rest of the data lakes? (such as a list of what needs to be done moving forward)

Yes, now that DataFusion v43 has been released I am hoping that the rust implementations of the three main data lake formats (Deltalake / Iceberg / Hudi) update to that version. Then I will:

  1. Update the existing Deltalake integration
  2. Refresh the current Hudi PR
  3. Add Iceberg integration which should be pretty easy now that it implements TableProviderFactory

I am interested in the FFI bindings but I don't anticipate working on that prior to the current release I am planning.

@alamb
Contributor Author

alamb commented Nov 10, 2024

I have had a few days to reflect, and I personally think making it easy to integrate DataFusion into the "open data lake" stack might be my top priority over the coming months

@julienledem wrote up a very nice piece describing this: The advent of the Open Data Lake

In my mind, the specific work this entails is stuff like

  • Making it easier to use iceberg/delta/hudi with DataFusion
  • Document different tokio runtimes
  • Make parquet reader in arrow-rs faster/better on remote object stores

More to come

@matthewmturner
Contributor

@alamb can you expand on the different runtimes point? Are you referring to having a dedicated tokio runtime for CPU bound work? I actually have a ticket open for that - if that is a very important item for you I can add it to the list to do before releasing.

@alamb
Contributor Author

alamb commented Nov 11, 2024

@alamb can you expand on the different runtimes point? Are you referring to having a dedicated tokio runtime for CPU bound work? I actually have a ticket open for that - if that is a very important item for you I can add it to the list to do before releasing.

This is what I had in mind:

Thanks for the link to the dft one. That is a good one

@matthewmturner
Contributor

@alamb I will work on that next. Will ping you when ready for review.

@alamb
Contributor Author

alamb commented Nov 17, 2024

More to come

I filed

to try and organize my thoughts here better

matthewmturner added a commit to datafusion-contrib/datafusion-dft that referenced this issue Nov 19, 2024
Adds a dedicated executor for running CPU bound work on the FlightSQL
server.

There is interest from the [DataFusion
community](apache/datafusion#13274 (comment))
for this, it was already on our
[roadmap](#197)
and I think the DFT FlightSQL server is a great place to have a
reference implementation.

Initial inspiration and context can be found
[here](https://thenewstack.io/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/).

Most of the initial implementation was copied from
[here](https://github.com/influxdata/influxdb3_core/blob/6fcbb004232738d55655f32f4ad2385523d10696/executor/src/lib.rs)
with some tweaks for our current setup. In particular we don't have
metrics yet in the FlightSQL server implementation (but it is on the
[roadmap](#210)). I expect to do a follow-on where metrics are integrated.
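
For readers unfamiliar with the pattern, the following is a minimal sketch of what a "dedicated executor for CPU bound work" means: CPU-heavy jobs are handed to their own pool of threads so they cannot starve the threads that service network I/O. To stay dependency-free it uses plain `std` threads and channels in place of a second tokio runtime, so it is not the dft or influxdb3_core implementation; the names `CpuExecutor` and `cpu-worker-N` are hypothetical.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical dedicated pool for CPU-bound work, kept separate from
/// whatever threads handle I/O.
pub struct CpuExecutor {
    sender: Option<mpsc::Sender<Job>>,
    workers: Vec<thread::JoinHandle<()>>,
}

impl CpuExecutor {
    pub fn new(num_threads: usize) -> Self {
        let (sender, receiver) = mpsc::channel::<Job>();
        let receiver = Arc::new(Mutex::new(receiver));
        let workers = (0..num_threads)
            .map(|i| {
                let receiver = Arc::clone(&receiver);
                thread::Builder::new()
                    .name(format!("cpu-worker-{i}"))
                    .spawn(move || loop {
                        // The lock is held only while waiting for a job,
                        // not while the job runs.
                        let job = receiver.lock().unwrap().recv();
                        match job {
                            Ok(job) => job(),
                            Err(_) => break, // channel closed: shut down
                        }
                    })
                    .expect("failed to spawn worker thread")
            })
            .collect();
        Self { sender: Some(sender), workers }
    }

    /// Runs `task` on the pool; the returned receiver yields its result.
    pub fn spawn<T, F>(&self, task: F) -> mpsc::Receiver<T>
    where
        F: FnOnce() -> T + Send + 'static,
        T: Send + 'static,
    {
        let (tx, rx) = mpsc::channel();
        let job: Job = Box::new(move || {
            // The caller may have dropped the receiver; ignore that case.
            let _ = tx.send(task());
        });
        self.sender
            .as_ref()
            .expect("executor already shut down")
            .send(job)
            .expect("worker threads have exited");
        rx
    }
}

impl Drop for CpuExecutor {
    fn drop(&mut self) {
        // Closing the channel lets each worker finish and exit its loop.
        drop(self.sender.take());
        for worker in self.workers.drain(..) {
            let _ = worker.join();
        }
    }
}
```

An I/O-facing handler would call `executor.spawn(|| expensive_computation())` and wait on the returned receiver; in an async server the waiting side would be an awaitable channel instead of a blocking `recv()`.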