Pyarrow writer not encoding correct URL for partitions in delta table #2978
Comments
I tried to install version 0.18.3, but it is reported as not available, so I installed 0.19.0 and tried to optimize. The table was written with version 0.17.4.
Please use the latest version; this is already resolved.
@ion-elgreco I am on deltalake = "0.21.0". I even tried to optimize without Z-order,
but I still see partitions like this after optimize/vacuum.
With optimize, I also do not see the number of partition files reduced; in fact, with the new partitions (with spaces), the file count has increased. Is this expected?
@ion-elgreco It only works sometimes. Because broken partitions have been created, optimize sometimes fails with the error below:
Try recreating the table with the latest version.
@ion-elgreco Yes. As I mentioned in the previous comment, both the write to the table and the optimize/vacuum are done with the latest version (0.21.0), and it still breaks because of spaces in the partitions. When I write to the table, there are no spaces in the partition paths, but after optimize the spaces appear. My partition column is DayHour, with values like (2024-1-09 21:00:00). Are the spaces created during optimize because of this? Should we avoid combining date and hour in a single partition column? Is there an alternative?
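One possible alternative to a single DayHour partition column, sketched here as an illustration only: derive two partition values that contain neither spaces nor colons, so the partition directory names need no escaping. The function name `split_day_hour` and the column names `Day` and `Hour` are hypothetical, and nothing in this thread confirms that the maintainers recommend this layout.

```python
from datetime import datetime


def split_day_hour(day_hour: str) -> dict:
    """Split a 'YYYY-M-DD HH:MM:SS' string into space-free partition values.

    Hypothetical helper: the 'Day' and 'Hour' column names are assumptions.
    """
    parsed = datetime.strptime(day_hour, "%Y-%m-%d %H:%M:%S")
    return {
        "Day": parsed.strftime("%Y-%m-%d"),  # zero-padded date, no spaces
        "Hour": parsed.strftime("%H"),       # two-digit hour, no colons
    }


parts = split_day_hour("2024-1-09 21:00:00")
print(parts)
```

The two derived values would then be added as columns before writing and used as the partition columns in place of DayHour.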
@gprashmi are you on Windows by any chance?
@thomasfrederikhoeck I have a Windows laptop, but I run these in a Kubeflow experiment on a Databricks cluster.
Okay. It was just because a similar issue (apache/arrow-rs#5592) has been fixed upstream but I don't think
Yeah
I guess if this PR is fixed, then it should also fix this: #2843
@thomasfrederikhoeck Thank you for the update. Can you please let me know when delta-rs will be updated to use object-store=0.10.2? @ion-elgreco Based on the comment from @thomasfrederikhoeck, it looks like this would be fixed once delta-rs uses the updated object-store=0.10.2 version. Can you please let me know whether updating delta-rs to the latest object-store version is planned?
Feel free to create a PR for it.
Maybe fixed by #2994 |
I'm not 100% sure this fixes this case, so maybe leave it open, @ion-elgreco?
Environment
Delta-rs version: 0.19.0
What happened:
We write data to a Delta table using delta-rs with the PyArrow engine, with DayHour as the partition column.
I ran the optimize command on the Delta table using the Spark SQL query below.
After optimize, it creates partitions with spaces and does not properly encode the partition URLs, as shown in the image below; i.e., it creates new partition URLs containing spaces (.zstd.parquet).
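To make the difference between an encoded and an unencoded partition path concrete, here is a minimal sketch using Python's standard `urllib.parse.quote`. It assumes plain percent-encoding of the partition value in a Hive-style `column=value` directory name; the helper `encode_partition_dir` is hypothetical, and the exact encoding rules applied by delta-rs or its object-store dependency may differ.

```python
from urllib.parse import quote, unquote


def encode_partition_dir(column: str, value: str) -> str:
    """Build a Hive-style 'column=value' directory name.

    Assumption: the value is percent-encoded with no characters exempted.
    """
    return f"{column}={quote(value, safe='')}"


raw_value = "2024-1-09 21:00:00"
encoded_dir = encode_partition_dir("DayHour", raw_value)
print(encoded_dir)  # the space becomes %20 and each colon becomes %3A

# Decoding the directory name recovers the original partition value.
decoded = unquote(encoded_dir.split("=", 1)[1])
print(decoded)
```

A directory name that still shows a literal space after optimize is the unencoded form described in this report.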
@ion-elgreco Can you please let me know how we can run optimize.compact without ending up with partitions that contain spaces?
A similar issue was raised in June (#2634), where it was mentioned that it is fixed in version 0.18.3, but I still see the same issue when I optimize now. To clarify, I use the PyArrow engine and not Rust, in case that is what causes the broken partitions.