-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support (order by / sort) for DataFrameWriteOptions #13874
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @zhuqi-lucas -- this PR is really nice in my mind -- it is well commented and well tested 🏅
|
||
assert_batches_eq!( | ||
&[ | ||
"+---+---+", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a good test 👍
.sort(options.sort_by)? | ||
.build()? | ||
}; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't LogicalPlanBuilder::copy_to
have the ordering info as well?
If we write the info to a new table, do we know avoid sorting the table again? We should expose the ordering info to the newly created table for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think copy_to inserts the data from whatever the input plan is (if there is a redundant sort the optimizer will remove it). Is that what you are asking?
It is an interesting question of how to communicate "this table is sorted" information for newly written files. Is this what you are talking about?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I mean that after this PR we seem we don't communicate "we are sorted" yet, which is I think would support usecases such as in the issue from @TheBuilderJR (write data sorted on a column, making following read query with order by
on the same column fast).
Which issue does this PR close?
Closes #13873
Rationale for this change
DataFrameWriteOptions is missing an order by / sort by like available in SQL.
For sql we have the option to sort, for example:
You can use the WITH ORDER clause of the CREATE EXTERNAL TABLE if your data is already ordered
https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table
But for writing my parquet or other format files, we don't support it
What changes are included in this PR?
Add the sort support.
Are these changes tested?
yes
Added unit testing.
Are there any user-facing changes?
Yes, we support new order option for DataFrameWriteOptions