[DISCUSS] Single Source `ExecutionPlan` Across All `TableProviders` #13838

berkaysynnada · 2024-12-19T10:32:38Z

Is your feature request related to a problem or challenge?

I would like to revive #6339, particularly this comment: #6339 (comment).

While working on our fork-specific implementations, we have frequently encountered scenarios where it seems more appropriate to have a single exec for sources, similar to the approach used for sinks. This idea has been coming up a lot recently.

I'd like to gather the community's opinions on this and hear any counterarguments or opposing perspectives. While I understand that this change would require significant effort if approved, we are willing to contribute to making it happen.

Describe the solution you'd like

TableProvider's scan() returns the same Exec for all file formats.

Describe alternatives you've considered

No response

Additional context

No response

ozankabak · 2024-12-19T10:55:16Z

cc @alamb, @andygrove, @Dandandan, @jayzhan-synnada

Also, @findepi, it'd be great if you could chime in as well.

alamb · 2024-12-19T11:35:17Z

> TableProvider's scan() returns the same Exec for all file formats.

I think this would be a pretty major breaking change for all downstream consumers.

> we have frequently encountered scenarios where it seems more appropriate to have a single exec for sources, similar to the approach used for sinks. This idea has been coming up a lot recently.

Is it easy to explain some of these scenarios?

Rather than trying to use the same ExecutionPlan for all TableProviders, another thing to do might be to extend the ExecutionPlan trait with the common functionality.

For example, instead of having to do something like

if let Some(parquet_exec) = plan.as_any().downcast_ref::<ParquetExec>() {
  parquet_specific_function(parquet_exec)
}

The other thing we could do is

trait ExecutionPlan {
  // do specific thing 
  fn specific_function(&self) {}
}

And then the pass / analysis would be able to do

plan.specific_function()

I may be jumping to conclusions, however, and misjudging what you are trying to do.
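The two options above can be compared in a small, self-contained sketch. It uses a toy `Plan` trait rather than DataFusion's real `ExecutionPlan`; `ParquetLike`, `FilterLike`, `via_downcast`, and `via_trait` are hypothetical names made up for illustration, and `specific_function` is given a return value here only so the two paths can be compared.

```rust
use std::any::Any;

// Toy stand-in for DataFusion's `ExecutionPlan`; NOT the real trait.
trait Plan {
    fn as_any(&self) -> &dyn Any;

    // Hypothetical hook with a default body, so only plans that care override it.
    fn specific_function(&self) -> Option<String> {
        None
    }
}

// Hypothetical plan nodes.
struct ParquetLike {
    path: String,
}
struct FilterLike;

impl Plan for ParquetLike {
    fn as_any(&self) -> &dyn Any {
        self
    }
    fn specific_function(&self) -> Option<String> {
        Some(format!("parquet:{}", self.path))
    }
}

impl Plan for FilterLike {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

// Option 1: the pass has to know each concrete type and downcast to it.
fn via_downcast(plan: &dyn Plan) -> Option<String> {
    plan.as_any()
        .downcast_ref::<ParquetLike>()
        .map(|p| format!("parquet:{}", p.path))
}

// Option 2: the pass only talks to the trait.
fn via_trait(plan: &dyn Plan) -> Option<String> {
    plan.specific_function()
}

fn main() {
    let plans: Vec<Box<dyn Plan>> = vec![
        Box::new(ParquetLike { path: "a.parquet".into() }),
        Box::new(FilterLike),
    ];
    for plan in &plans {
        // Both options agree; only the second scales to new node types
        // without touching the pass.
        assert_eq!(via_downcast(plan.as_ref()), via_trait(plan.as_ref()));
    }
}
```

With option 2, adding another node type means overriding one method on that type; with option 1, every pass that cares needs a new downcast branch.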

alamb · 2024-12-19T11:36:27Z

(I also changed the title of this ticket to refer to the struct names I think are being suggested; please let me know / correct it if I got that wrong.)

berkaysynnada · 2024-12-19T11:53:46Z

> Is it easy to explain some of these scenarios?

In the simplest case, when we want to add a method to the ExecutionPlan API that applies uniformly to all sources, we have to repeat its implementation for each source. Simply checking the children count doesn’t always resolve the issue. We could add another API, like is_source(), and consult that, but even this doesn’t result in perfect usage.
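A minimal sketch of why the two checks can disagree, using a toy `Plan` trait and not DataFusion's real API. The node types (`CsvScan`, `EmptyRows`, `Filter`) are hypothetical, and the assumption that some leaf nodes are not really sources is one possible reading of the problem, not something spelled out above.

```rust
// Toy stand-in for a physical plan node; NOT DataFusion's `ExecutionPlan`.
trait Plan {
    fn children(&self) -> &[Box<dyn Plan>];

    // Hypothetical flag: defaults to false, so every source must remember
    // to override it.
    fn is_source(&self) -> bool {
        false
    }
}

struct CsvScan; // leaf that reads external data
struct EmptyRows; // leaf that reads nothing external (assumed example)
struct Filter {
    input: Box<dyn Plan>,
}

impl Plan for CsvScan {
    fn children(&self) -> &[Box<dyn Plan>] {
        &[]
    }
    fn is_source(&self) -> bool {
        true
    }
}

impl Plan for EmptyRows {
    fn children(&self) -> &[Box<dyn Plan>] {
        &[]
    }
}

impl Plan for Filter {
    fn children(&self) -> &[Box<dyn Plan>] {
        // View the single input as a one-element slice.
        std::slice::from_ref(&self.input)
    }
}

// Heuristic 1: "no children" is treated as "source".
fn source_by_children(plan: &dyn Plan) -> bool {
    plan.children().is_empty()
}

// Heuristic 2: ask the node itself.
fn source_by_flag(plan: &dyn Plan) -> bool {
    plan.is_source()
}

fn main() {
    let filter = Filter { input: Box::new(CsvScan) };
    // The two heuristics agree on the scan and the filter...
    assert!(source_by_children(&CsvScan) && source_by_flag(&CsvScan));
    assert!(!source_by_children(&filter) && !source_by_flag(&filter));
    // ...but disagree on a leaf that is not a data source.
    assert!(source_by_children(&EmptyRows) && !source_by_flag(&EmptyRows));
}
```

Even with the flag, each new source type has to opt in by hand, which is the repetition being described.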

The asymmetric sink-source pattern also feels somewhat unusual.

findepi · 2024-12-19T12:28:55Z

What example problems would this solve?

ozankabak · 2024-12-19T12:50:43Z

> > Is it easy to explain some of these scenarios?

> In the simplest case, when we want to add a method to the ExecutionPlan API that applies uniformly to all sources, we have to repeat its implementation for each source.

This comes up very often and hinders extensibility. Example situations we ran into include things like checkpointing support, watermark generation/handling, etc. Almost none of these things (nor other functionality that is already in upstream DataFusion) have anything to do with whether the source operator is reading a CSV or a JSON, but somehow we have separate operators like CSVExec, JsonExec, etc.

> I think this would be a pretty major breaking change for all downstream consumers.

Indeed. That's why we wanted to discuss and see how we can approach this as a community.

findepi · 2024-12-19T14:18:09Z

If we have a uniform operator on top of CSV or JSON, it will internally dispatch to a reader (CSV or JSON, etc.) but won't have any file-format-specific (or datasource-specific) logic.

If we don't have a uniform operator, we can still add checkpointing on top with an additional operator sitting above CSVExec or JsonExec.

Thus, at first sight, the two approaches look equally expressive, which means I am missing some important detail. What is it?
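To make the first option concrete, here is a minimal sketch of a uniform source operator that dispatches to a format-specific reader. Everything in it is hypothetical (`FormatReader`, `CsvReader`, `JsonLinesReader`, `SourceExec`, and the value counter standing in for checkpoint state); it is not how DataFusion is structured, only an illustration of where the shared logic would live.

```rust
// Hypothetical format-specific reader: the only place that knows about CSV vs JSON.
trait FormatReader {
    fn read_batch(&mut self) -> Option<Vec<String>>;
}

struct CsvReader {
    lines: std::vec::IntoIter<&'static str>,
}
struct JsonLinesReader {
    lines: std::vec::IntoIter<&'static str>,
}

impl FormatReader for CsvReader {
    // One batch per line, one value per comma-separated field.
    fn read_batch(&mut self) -> Option<Vec<String>> {
        let line = self.lines.next()?;
        Some(line.split(',').map(String::from).collect())
    }
}

impl FormatReader for JsonLinesReader {
    // One batch per line, holding the raw JSON text as a single value.
    fn read_batch(&mut self) -> Option<Vec<String>> {
        self.lines.next().map(|l| vec![l.to_string()])
    }
}

// Hypothetical uniform source operator: cross-cutting logic is written once.
struct SourceExec {
    reader: Box<dyn FormatReader>,
    values_read: usize, // stand-in for checkpoint state
}

impl SourceExec {
    fn new(reader: Box<dyn FormatReader>) -> Self {
        SourceExec { reader, values_read: 0 }
    }

    fn next_batch(&mut self) -> Option<Vec<String>> {
        let batch = self.reader.read_batch()?;
        self.values_read += batch.len();
        Some(batch)
    }

    fn checkpoint(&self) -> usize {
        self.values_read
    }
}

fn main() {
    let mut csv = SourceExec::new(Box::new(CsvReader {
        lines: vec!["a,b", "c"].into_iter(),
    }));
    while csv.next_batch().is_some() {}
    println!("csv values read: {}", csv.checkpoint());

    let mut json = SourceExec::new(Box::new(JsonLinesReader {
        lines: vec!["{}", "{}"].into_iter(),
    }));
    while json.next_batch().is_some() {}
    println!("json values read: {}", json.checkpoint());
}
```

The second option would instead wrap each existing source in a separate operator that does the counting, leaving the format-specific operators as they are.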

ozankabak · 2024-12-19T14:32:38Z

> Thus, at first sight, the two approaches look equally expressive, which means I am missing some important detail. What is it?

This is actually what we do in our fork. They are indeed equally expressive, but this starts to become a problem as we start talking about more features, e.g. watermarks, out-of-order handling, etc.

Also, in case any of this logic has bearing on IO actions (how things are read), it creates another set of problems. DataFusion is pull-based, so information flow is by default one-way unless you jump through extra hoops.

We are lucky to be doing engineering in a field where most problems of this type have workarounds and solutions, but when they start piling up, IMO it is a good signal that some lower-level design was wrong. In this case, it seems like that is the non-uniformity on the source side.

alamb · 2024-12-19T16:28:44Z

> We are lucky to be doing engineering in a field where most problems of this type have workarounds and solutions, but when they start piling up, IMO it is a good signal that some lower-level design was wrong. In this case, it seems like that is the non-uniformity on the source side.

So my "gut" feeling is that this change would basically push complexity around (make implementing TableProviders outside the DataFusion core more complicated), but I don't think I have much more than that unsupported opinion to share.

One potential way to proceed with this idea would be to sketch out what it would look like in a PR, try to adapt some existing open source table providers, and see the impact.

These are some obvious candidates:

ozankabak · 2024-12-19T16:44:38Z

I think this is a great idea. We can see the impact clearly on both DF-core sources and external ones.

berkaysynnada added the enhancement New feature or request label Dec 19, 2024

alamb changed the title ~~Single Source Exec Across All Providers~~ [DISCUSS] Single Source ExecutionPlan Across All TableProviders Dec 19, 2024
