Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support duplicate column aliases in queries #13489

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

findepi
Copy link
Member

@findepi findepi commented Nov 19, 2024

Which issue does this PR close?

Rationale for this change

In SQL, selecting single column multiple times is legal and most modern
databases support this. This commit adds such support to DataFusion too.

What changes are included in this PR?

Are these changes tested?

yes

Are there any user-facing changes?

yes, more valid queries are supported

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate proto Related to proto crate labels Nov 19, 2024
@findepi findepi force-pushed the findepi/support-duplicate-column-aliases-0bf339 branch 2 times, most recently from 7d63cee to bd2307e Compare November 19, 2024 14:03
@jonahgao
Copy link
Member

allow creation of schemas for duplicated names

I think it's a good idea, as it allows the schema of the top plan to include duplicate names, thereby resolving #6543.

We can delay the name ambiguity check until a real column reference occurs. But currently, it seems that this check is not sufficient. For example

DataFusion CLI v43.0.0
> select t.a from (select 1 as a, 2 as a) t;
+---+
| a |
+---+
| 1 |
+---+
1 row(s) fetched.

This query did not return an error like it does in PostgreSQL and before. Perhaps we should improve ambiguity check when searching for field names from schemas after removing check_names.

/// A struct with same fields as [`CreateExternalTable`] struct so that the DDL can be conveniently
/// destructed with validation that each field is handled, while still requiring that all
/// construction goes through the [`CreateExternalTable::new`] constructor or the builder.
pub struct CreateExternalTableFields {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think non_exhaustive discourages destructuring, but CreateExternalTableFields makes it possible again. CreateExternalTableFields and CreateExternalTable have the same fields, and I'm a bit worried that it introduces some code duplication 🤔.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it does introduce code duplication and so does the builder.
When handling the Create Table/View, deconstruction is valuable without .., as it guarantees that any new field will force the code to be revisited (rather than new fields being ignored).

However, in Rust deconstruction without .. is possible only when construction is possible, and direct construction being possible precludes construction-time checks, which is undesirable.

Alternatively to this, we could allow construction of ill-formed Create Table/View objects, and have check somewhere else (plan validator), but i would be worried that such a delayed check could be missed in some code flows. The field duplication isn't a problem from maintainability perspective after all.

In SQL, selecting single column multiple times is legal and most modern
databases support this. This commit adds such support to DataFusion too.
@findepi
Copy link
Member Author

findepi commented Nov 20, 2024

We can delay the name ambiguity check until a real column reference occurs. But currently, it seems that this check is not sufficient. For example

DataFusion CLI v43.0.0
> select t.a from (select 1 as a, 2 as a) t;

Good catch. This is easy to solve.

The less easy part is that

select * from (select 1 as a, 2 as a) t;

should work. However, the * gets expanded to Expr expressions and these expressions have no way to differentiate between the two columns from a. This is because schema is used for both initial query analysis as well as in logical plans. Relates to #1468.

@findepi findepi force-pushed the findepi/support-duplicate-column-aliases-0bf339 branch from bd2307e to e7e778c Compare November 20, 2024 14:20
@findepi findepi marked this pull request as draft November 20, 2024 14:20
@jonahgao
Copy link
Member

The less easy part is that

select * from (select 1 as a, 2 as a) t;

should work. However, the * gets expanded to Expr expressions and these expressions have no way to differentiate between the two columns from a. This is because schema is used for both initial query analysis as well as in logical plans. Relates to #1468.

We might need to introduce column index to differentiate them.
Since this case was not previously supported either, maybe we can handle it later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules proto Related to proto crate sql SQL Planner sqllogictest SQL Logic Tests (.slt) substrait
Projects
None yet
2 participants