[substrait] Add support for ExtensionTable #13771

Open
ccciudatu opened this issue Dec 13, 2024 · 0 comments · May be fixed by #13772
Labels
enhancement New feature or request

Comments

@ccciudatu
Contributor

Is your feature request related to a problem or challenge?

Custom TableProvider implementations cannot currently be encoded as ExtensionTables in Substrait. The Substrait plan retains only the table names, which are rarely enough to restore the table definition on the consumer side. For UDTFs in particular, the name is useless, as it is always tmp_table.

Describe the solution you'd like

Add two methods to SerializerRegistry for serializing and deserializing TableSource instances, and use them in to_substrait_plan and from_substrait_plan so that users can encode and decode custom table definitions as ExtensionTables.
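
To make the shape of the proposal concrete, here is a minimal, dependency-free sketch. It is not the DataFusion API: the `TableSource` trait below is a cut-down stand-in, and `TableSerializer`, `serialize_table`, `deserialize_table`, `RangeTable`, the type URL, and the byte layout are all hypothetical names chosen for illustration.

```rust
use std::sync::Arc;

/// Cut-down stand-in for DataFusion's `TableSource` (assumption: the real
/// trait looks different; only what the sketch needs is modeled here).
trait TableSource {
    /// Type tag that lets the consumer pick the right decoder.
    fn type_url(&self) -> &'static str;
    /// Opaque, implementation-defined encoding of the table definition.
    fn encode(&self) -> Vec<u8>;
}

/// Hypothetical registry extension: one method per direction.
trait TableSerializer {
    /// Returns `None` for tables this registry does not know, so a producer
    /// could fall back to emitting a plain named table.
    fn serialize_table(&self, table: &dyn TableSource) -> Option<(String, Vec<u8>)>;
    /// Rebuilds a table from the type tag and payload found in the plan.
    fn deserialize_table(&self, type_url: &str, bytes: &[u8]) -> Result<Arc<dyn TableSource>, String>;
}

/// Hypothetical custom table: yields the integers in `start..end`.
struct RangeTable {
    start: u64,
    end: u64,
}

impl TableSource for RangeTable {
    fn type_url(&self) -> &'static str {
        "example/RangeTable"
    }
    fn encode(&self) -> Vec<u8> {
        // 16 bytes: `start` then `end`, both little-endian.
        let mut out = self.start.to_le_bytes().to_vec();
        out.extend_from_slice(&self.end.to_le_bytes());
        out
    }
}

struct Registry;

impl TableSerializer for Registry {
    fn serialize_table(&self, table: &dyn TableSource) -> Option<(String, Vec<u8>)> {
        if table.type_url() == "example/RangeTable" {
            Some((table.type_url().to_string(), table.encode()))
        } else {
            None
        }
    }

    fn deserialize_table(&self, type_url: &str, bytes: &[u8]) -> Result<Arc<dyn TableSource>, String> {
        if type_url != "example/RangeTable" {
            return Err(format!("unknown table type: {}", type_url));
        }
        if bytes.len() != 16 {
            return Err(format!("expected 16 bytes, got {}", bytes.len()));
        }
        let mut start = [0u8; 8];
        let mut end = [0u8; 8];
        start.copy_from_slice(&bytes[..8]);
        end.copy_from_slice(&bytes[8..]);
        Ok(Arc::new(RangeTable {
            start: u64::from_le_bytes(start),
            end: u64::from_le_bytes(end),
        }))
    }
}
```

In this sketch the producer would call `serialize_table` for each table it encounters and store the returned pair in the plan, and the consumer would hand that pair back to `deserialize_table`. The table name plays no part in restoring the definition.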

Describe alternatives you've considered

For the cases where the user controls the table name, one hideous workaround would be to encode the whole table definition in the table name and register a custom schema provider to decode it (e.g. some_catalog.custom_schema."base64(proto_binary)"). This is a horrible hack as it requires using those names in SQL queries and it doesn't work for UDTFs.
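
For illustration only, the name-encoding hack could look roughly like the sketch below. Hex stands in for base64 to keep the example dependency-free, and the helper names `encode_table_name`, `decode_table_name`, and `qualified_name` are hypothetical. A custom schema provider would call something like `decode_table_name` when asked to resolve the table.

```rust
/// Encode an opaque table definition as a string that can be used as a
/// quoted table name. Hex is used here instead of base64 (assumption made
/// to avoid external crates).
fn encode_table_name(definition: &[u8]) -> String {
    definition.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Decode a table name produced by `encode_table_name`.
/// Returns `None` if the name is not an even-length hex string.
fn decode_table_name(name: &str) -> Option<Vec<u8>> {
    if name.len() % 2 != 0 || !name.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    (0..name.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&name[i..i + 2], 16).ok())
        .collect()
}

/// Build the fully qualified name that would have to appear in SQL text.
fn qualified_name(definition: &[u8]) -> String {
    format!(
        "some_catalog.custom_schema.\"{}\"",
        encode_table_name(definition)
    )
}
```

The output of `qualified_name` is what every SQL query would have to spell out, which shows why this approach scales poorly with the size of the table definition.
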
A far better alternative is to leverage the already supported Substrait extensions (in particular, ExtensionLeaf), by implementing the SerializerRegistry trait and forcing the table to fit into a UserDefinedLogicalNode.
However, this approach is both limited and unnatural:

  • Logical plans have to be preprocessed to replace TableScans with Extension nodes before converting to Substrait.
  • The logical plan resulting from decoding Substrait can only be executed if an ExtensionPlanner is registered to handle the user-defined nodes, and even then it would not benefit from the special treatment that tables get in DataFusion (projection and filter pushdown for scans, various knobs to instruct the engine about the table capabilities, etc.). Rewriting the decoded plan to convert the user-defined node back to a TableScan is the only way to benefit from all that.
  • Substrait also encodes projections only for tables (i.e. ReadRels), so an ExtensionLeaf can't make use of that. Rewriting the Substrait plan itself to overcome this limitation is way more tedious than it should be.
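
The round trip described in the list above can be sketched with a toy plan type. This is not DataFusion's `LogicalPlan`: the `Plan` enum and the functions `scans_to_extensions` and `extensions_to_scans` are hypothetical stand-ins. As a simplification, the toy rewrite drops the scan's projection when it becomes an extension leaf, to mirror the third point; a real rewrite would have to carry it some other way.

```rust
/// Toy logical plan (assumption: a stand-in for DataFusion's `LogicalPlan`).
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    /// A table scan with an optional projection (column indices).
    TableScan { table: String, projection: Option<Vec<usize>> },
    /// Stand-in for an `Extension` node wrapping a user-defined leaf.
    Extension { payload: String },
    /// A filter over a single input.
    Filter { predicate: String, input: Box<Plan> },
}

/// Producer side: replace every scan with an extension leaf before the
/// plan is converted to Substrait. The projection is lost in this sketch.
fn scans_to_extensions(plan: Plan) -> Plan {
    match plan {
        Plan::TableScan { table, .. } => Plan::Extension { payload: table },
        Plan::Filter { predicate, input } => Plan::Filter {
            predicate,
            input: Box::new(scans_to_extensions(*input)),
        },
        other => other,
    }
}

/// Consumer side: turn extension leaves back into scans so that the
/// engine can treat them as tables again.
fn extensions_to_scans(plan: Plan) -> Plan {
    match plan {
        Plan::Extension { payload } => Plan::TableScan {
            table: payload,
            projection: None,
        },
        Plan::Filter { predicate, input } => Plan::Filter {
            predicate,
            input: Box::new(extensions_to_scans(*input)),
        },
        other => other,
    }
}
```

Both passes are boilerplate that every user of this workaround has to write and keep in sync, which is the overhead the proposed ExtensionTable support is meant to remove.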

Additional context

This should be a child issue of #13318.

@ccciudatu ccciudatu added the enhancement New feature or request label Dec 13, 2024
@ccciudatu ccciudatu linked a pull request Dec 13, 2024 that will close this issue