Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DISCUSS] Single Source ExecutionPlan Across All TableProviders #13838

Open
berkaysynnada opened this issue Dec 19, 2024 · 10 comments
Open

[DISCUSS] Single Source ExecutionPlan Across All TableProviders #13838

berkaysynnada opened this issue Dec 19, 2024 · 10 comments
Labels
enhancement New feature or request

Comments

@berkaysynnada
Copy link
Contributor

berkaysynnada commented Dec 19, 2024

Is your feature request related to a problem or challenge?

I would like to revive #6339, particularly this comment: #6339 (comment).

While working on our fork-specific implementations, we have frequently encountered scenarios where it seems more appropriate to have a single exec for sources, similar to the approach used for sinks. This idea has been coming up a lot recently.

I'd like to gather the community's opinions on this and hear any counterarguments or opposing perspectives. While I understand that this change would require significant effort if approved, we are willing to contribute to making it happen.

Describe the solution you'd like

TableProvider's scan()'s the same Exec for the all file formats

Describe alternatives you've considered

No response

Additional context

No response

@berkaysynnada berkaysynnada added the enhancement New feature or request label Dec 19, 2024
@ozankabak
Copy link
Contributor

cc @alamb, @andygrove, @Dandandan, @jayzhan-synnada

Also @findepi it'd be great if you can also chime in

@alamb
Copy link
Contributor

alamb commented Dec 19, 2024

TableProvider's scan()'s the same Exec for the all file formats

I think this would be a pretty major breaking change for all downstream consumers.

we have frequently encountered scenarios where it seems more appropriate to have a single exec for sources, similar to the approach used for sinks. This idea has been coming up a lot recently.

Is it easy to explain some of these scenarios?

Rather than trying to use the same ExecutionPlan for all TableProviders, another thing to do might be to extend the ExecutionPlan trait with the common functionality.

For example instead of having to do something like

if let Some(parquet_exec) = plan.as_any().downcast_ref::<Parquetexec>() {
  parquet_specific_function(parquet_exec)
}

The other thing we could do is

trait ExecutionPlan {
  // do specific thing 
  fn specific_function(&self) {}
}

And then the pass / analysis would be able to use do

plan.specific_function()

I may be jumping to conclusions however and misguess what you are trying to do

@alamb alamb changed the title Single Source Exec Across All Providers [DISCUSS] Single Source ExecutionPlan Across All TableProviders Dec 19, 2024
@alamb
Copy link
Contributor

alamb commented Dec 19, 2024

(I also changed the title of this ticket to refer to the struct names I think is being suggested, please let me know / correct it if I got that incorrect)

@berkaysynnada
Copy link
Contributor Author

Is it easy to explain some of these scenarios?

In the simplest case, when we want to add a method to the ExecutionPlan API that applies uniformly to all sources, we have to repeat its implementation for each source. Simply checking the children count doesn’t always resolve the issue. Having another API, like is_source(), and consult that, but even this doesn’t result in perfect usage.

The asymmetric sink-source pattern also feels somewhat unusual.

@findepi
Copy link
Member

findepi commented Dec 19, 2024

What example problems would this solve?

@ozankabak
Copy link
Contributor

Is it easy to explain some of these scenarios?

In the simplest case, when we want to add a method to the ExecutionPlan API that applies uniformly to all sources, we have to repeat its implementation for each source.

This comes up very often and hinders extensibility. Example situations we ran into includes things like checkpointing support, watermark generation/handling, etc. Almost none of these things (and neither other functionality that is already in upstream DataFusion) have anything to do with source operator reading a CSV or a JSON, but somehow we have separate operators like CSVExec, JsonExec etc.

I think this would be a pretty major breaking change for all downstream consumers.

Indeed. That's why we wanted to discuss and see how we can approach this as a community.

@findepi
Copy link
Member

findepi commented Dec 19, 2024

If we have uniform operator on top of CSV or JSON, it will internally dispatch to a reader (CSV or JSON, etc.) but won't have any file-format (or datasource-) -specific logic.

If we don't have uniform operator, we still can add checkpointing no top of it with an additional operator sitting above CSVExec, JsonExec.

Thus at the first sight the two approaches look equally expressive, which means I am missing some important detail. What is it?

@ozankabak
Copy link
Contributor

ozankabak commented Dec 19, 2024

Thus at the first sight the two approaches look equally expressive, which means I am missing some important detail. What is it?

This is actually what we do in our fork. They are indeed equally expressive, but this starts to become a problem as we start talking about more features; i.e. watermarks, out-of-order handling etc.

Also, in case any of this logic has bearing on IO action (how things are read), it creates another set of problems. DataFusion is pull-based, so information flow is by default one-way unless you jump over extra hoops.

We are lucky to be doing engineering in a field where most problems of this type has workarounds and solutions, but when they start piling on, IMO it is a good signal that some lower-level design was wrong. In this case, it seems like that is the non-uniformity on the source side.

@alamb
Copy link
Contributor

alamb commented Dec 19, 2024

We are lucky to be doing engineering in a field where most problems of this type has workarounds and solutions, but when they start piling on, IMO it is a good signal that some lower-level design was wrong. In this case, it seems like that is the non-uniformity on the source side.

So my "gut" feeling is that this change would basically push complexity around (make implementing TableProviders outside the DataFusion core more complicated) but I don't think I have much more than that unsupported opinion to share

One potential way to proceed with this idea would be to sketch out what this idea would look like in a PR and try to adapt some existing open source table providers and see the impact

These are some obvious candidates:

@ozankabak
Copy link
Contributor

I think this is a great idea. We can see the impact clearly on both DF-core sources and external ones.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants