[DISCUSS] Single Source ExecutionPlan Across All TableProviders
#13838
cc @alamb, @andygrove, @Dandandan, @jayzhan-synnada. Also @findepi, it'd be great if you could chime in too.
I think this would be a pretty major breaking change for all downstream consumers.
Is it easy to explain some of these scenarios? Rather than trying to use the same ExecutionPlan for all TableProviders, another thing to do might be to extend the ExecutionPlan trait. For example, instead of having to do something like

    if let Some(parquet_exec) = plan.as_any().downcast_ref::<ParquetExec>() {
        parquet_specific_function(parquet_exec)
    }

the other thing we could do is

    trait ExecutionPlan {
        // do specific thing
        fn specific_function(&self) {}
    }

and then the pass / analysis would be able to call plan.specific_function().

I may be jumping to conclusions, however, and misunderstanding what you are trying to do.
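To make the contrast concrete, here is a minimal, self-contained sketch of the two approaches. It does not use DataFusion: `Plan`, `ParquetSource`, `Filter` and `file_count` are hypothetical stand-ins invented for illustration, and the real `ExecutionPlan` trait has many more required methods.

```rust
use std::any::Any;

// Hypothetical stand-in for an execution plan trait (NOT DataFusion's real API).
trait Plan {
    fn as_any(&self) -> &dyn Any;

    // Trait-level hook with a default, so operators that do not care inherit `None`.
    fn file_count(&self) -> Option<usize> {
        None
    }
}

// Hypothetical source operator that knows how many files it scans.
struct ParquetSource {
    files: usize,
}

// Hypothetical non-source operator.
struct Filter;

impl Plan for ParquetSource {
    fn as_any(&self) -> &dyn Any {
        self
    }
    fn file_count(&self) -> Option<usize> {
        Some(self.files)
    }
}

impl Plan for Filter {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

// Approach 1: the pass must name every concrete type it wants to support.
fn files_via_downcast(plan: &dyn Plan) -> Option<usize> {
    plan.as_any()
        .downcast_ref::<ParquetSource>()
        .map(|source| source.files)
}

// Approach 2: the pass only talks to the trait and needs no concrete types.
fn files_via_trait(plan: &dyn Plan) -> Option<usize> {
    plan.file_count()
}
```

With the downcast version, every new source type means another `downcast_ref` branch in every pass. With the trait method, each source implements the hook once, which is exactly the per-source repetition this issue is concerned about.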
(I also changed the title of this ticket to refer to the struct names I think are being suggested; please let me know / correct it if I got that wrong)
In the simplest case, when we want to add a method to the ExecutionPlan API that applies uniformly to all sources, we have to repeat its implementation for each source. Simply checking the children count doesn't always resolve the issue. We could add another API, like is_source(), and consult that, but even this doesn't give a clean result. The asymmetric sink-source pattern also feels somewhat unusual.
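A small sketch of why both workarounds fall short. Everything here is hypothetical and only models the shape of the problem: `Plan`, `child_count`, `CsvScan`, `JsonScan`, `EmptyLeaf` and `Filter` are invented names, and the leaf that is not a data source is an assumed example of where a children-count check can mislead.

```rust
// Hypothetical miniature plan trait (NOT DataFusion's real API).
trait Plan {
    fn child_count(&self) -> usize;

    // Opt-in flag: defaults to false, so every source has to remember to override it.
    fn is_source(&self) -> bool {
        false
    }
}

struct CsvScan; // reads external data
struct JsonScan; // reads external data
struct EmptyLeaf; // has no children, but reads nothing either
struct Filter; // has one child

impl Plan for CsvScan {
    fn child_count(&self) -> usize {
        0
    }
    fn is_source(&self) -> bool {
        true
    }
}

// Same boilerplate again for the second source.
impl Plan for JsonScan {
    fn child_count(&self) -> usize {
        0
    }
    fn is_source(&self) -> bool {
        true
    }
}

impl Plan for EmptyLeaf {
    fn child_count(&self) -> usize {
        0
    }
}

impl Plan for Filter {
    fn child_count(&self) -> usize {
        1
    }
}

// Heuristic: treat any operator without children as a source.
fn looks_like_source(plan: &dyn Plan) -> bool {
    plan.child_count() == 0
}
```

In this sketch the heuristic classifies `EmptyLeaf` as a source even though it reads nothing, while the `is_source()` flag is accurate only as long as every source repeats the override.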
What example problems would this solve?
This comes up very often and hinders extensibility. Example situations we ran into include checkpointing support, watermark generation/handling, etc. Almost none of these things (nor other functionality that is already in upstream DataFusion) have anything to do with whether the source operator is reading CSV or JSON, but somehow we have separate operators like
Indeed. That's why we wanted to discuss and see how we can approach this as a community.
If we have a uniform operator on top of CSV or JSON, it will internally dispatch to a reader (CSV, JSON, etc.) but won't have any file-format-specific (or datasource-specific) logic. If we don't have a uniform operator, we can still add checkpointing on top with an additional operator sitting above CSVExec or JsonExec. Thus, at first sight, the two approaches look equally expressive, which means I am missing some important detail. What is it?
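The "uniform operator that dispatches to a reader" shape can be sketched as follows. This is an illustration under stated assumptions, not DataFusion code: `Reader`, `CsvReader`, `JsonReader`, `SourceExec` and the row-count "checkpoint" are all hypothetical, and rows are plain strings to keep the example short.

```rust
// Hypothetical format-specific reader: the only place format logic lives.
trait Reader {
    fn read(&self) -> Vec<String>;
}

struct CsvReader;
struct JsonReader;

impl Reader for CsvReader {
    fn read(&self) -> Vec<String> {
        vec!["csv-row".to_string()]
    }
}

impl Reader for JsonReader {
    fn read(&self) -> Vec<String> {
        vec!["json-row-1".to_string(), "json-row-2".to_string()]
    }
}

// One source operator for every format. Cross-cutting features
// (here a toy checkpoint counter) are implemented once.
struct SourceExec {
    reader: Box<dyn Reader>,
    rows_emitted: usize,
}

impl SourceExec {
    fn new(reader: Box<dyn Reader>) -> Self {
        SourceExec {
            reader,
            rows_emitted: 0,
        }
    }

    // Delegate the read to the reader, then do format-independent bookkeeping.
    fn execute(&mut self) -> Vec<String> {
        let rows = self.reader.read();
        self.rows_emitted += rows.len();
        rows
    }

    // Toy checkpoint: the number of rows emitted so far.
    fn checkpoint(&self) -> usize {
        self.rows_emitted
    }
}
```

A wrapper operator placed above per-format operators could do the same bookkeeping, which is why the two designs look equally expressive at this level. The difference only shows once the cross-cutting logic needs to influence how the reader itself performs IO.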
This is actually what we do in our fork. They are indeed equally expressive, but this starts to become a problem as we start talking about more features, e.g. watermarks, out-of-order handling, etc. Also, in case any of this logic has a bearing on IO (how things are read), it creates another set of problems. DataFusion is pull-based, so information flow is one-way by default unless you jump through extra hoops. We are lucky to be doing engineering in a field where most problems of this type have workarounds and solutions, but when they start piling up, IMO it is a good signal that some lower-level design was wrong. In this case, it seems like that is the non-uniformity on the source side.
So my "gut" feeling is that this change would basically push complexity around (make implementing One potential way to proceed with this idea would be to sketch out what this idea would look like in a PR and try to adapt some existing open source table providers and see the impact. These are some obvious candidates:
I think this is a great idea. We can see the impact clearly on both DF-core sources and external ones.
Is your feature request related to a problem or challenge?
I would like to revive #6339, particularly this comment: #6339 (comment).
While working on our fork-specific implementations, we have frequently encountered scenarios where it seems more appropriate to have a single exec for sources, similar to the approach used for sinks. This idea has been coming up a lot recently.
I'd like to gather the community's opinions on this and hear any counterarguments or opposing perspectives. While I understand that this change would require significant effort if approved, we are willing to contribute to making it happen.
Describe the solution you'd like
TableProvider's scan() returns the same Exec for all file formats
Describe alternatives you've considered
No response
Additional context
No response