>0.17.0 _delta_log gets corrupted after overwrite (log files grows and grows upto 350mb per file) #3006

Closed
TinoSM opened this issue Nov 19, 2024 · 1 comment
Labels
bug Something isn't working
Milestone

Comments

@TinoSM

TinoSM commented Nov 19, 2024

Environment

Delta-rs version:
0.21.0
0.20.0
0.19.0
I can't test with 0.18.0
In 0.17.0 it works fine

Binding:
Python

Environment:

  • Cloud provider: Local and S3
  • OS: macOS and Amazon Linux
  • Other:

Bug

What happened:
When overwriting a table, the entire schema gets rewritten (already reported in #2923). In addition, I think that because of how the JSON metadata is encoded/decoded, every \ character gets escaped again on each overwrite. These characters come from Spark comments/metadata, for example, or from my own comments.

The JSON log files of one of my "development" tables grew to 350 MB, and now Delta can't scan them anymore (Thrift buffer size limits :) )

What you expected to happen:

When metadata is rewritten, characters that are already escaped should not be escaped again.
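
To illustrate the suspected mechanism (not the actual delta-rs code path, which this report does not identify), here is a minimal sketch using only Python's standard `json` module. The `comment` value and the `reencode` helper are hypothetical. It shows that encoding an already-encoded string again, without decoding it first, makes the number of backslashes grow on every round, while a proper decode/encode round trip stays stable.

```python
import json

# Hypothetical column comment containing a single backslash,
# standing in for Spark-style comments/metadata.
comment = "path\\to"  # the actual string is: path\to


def reencode(value: str, rounds: int) -> list[int]:
    """Encode the string as JSON repeatedly WITHOUT decoding in between.

    Returns the number of backslashes present after each round.
    """
    counts = []
    current = value
    for _ in range(rounds):
        current = json.dumps(current)  # escapes every existing backslash again
        counts.append(current.count("\\"))
    return counts


counts = reencode(comment, 4)
print(counts)  # the count increases on every round

# Expected behaviour: decoding before encoding again leaves the text unchanged.
encoded_once = json.dumps(comment)
round_tripped = json.dumps(json.loads(encoded_once))
print(round_tripped == encoded_once)
```

If the metadata is re-serialized like the first case on every overwrite, the log entry grows with each commit, which would be consistent with the file sizes reported here.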

How to reproduce it:

I'm sorry but I can only test with polars :(

https://docs.pola.rs/api/python/stable/reference/api/polars.DataFrame.write_delta.html

import polars as pl

df = pl.DataFrame({
     "active": [1, 2, 3, 4, 5],
     "id": ["A", "B", "A", "B", "C"],
})

df.write_delta("./test_table", mode="overwrite")
df.write_delta("./test_table", mode="overwrite")
df.write_delta("./test_table", mode="overwrite")
df.write_delta("./test_table", mode="overwrite")
df.write_delta("./test_table", mode="overwrite")
df.write_delta("./test_table", mode="overwrite")
df.write_delta("./test_table", mode="overwrite")
# Passing delta_write_options={"engine": "pyarrow"} to write_delta avoids the issue
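
To check whether the commit files actually grow after running the overwrites above, a small standard-library sketch like the following can list the size of each JSON file under `_delta_log`. The `log_sizes` helper is hypothetical, and the `./test_table` path is taken from the reproduction script.

```python
from pathlib import Path


def log_sizes(table_path: str) -> list[tuple[str, int]]:
    """Return (file name, size in bytes) for each JSON commit file, in commit order."""
    log_dir = Path(table_path) / "_delta_log"
    if not log_dir.is_dir():
        return []  # table not written yet
    # Commit files are zero-padded, so sorting by name follows commit order.
    return [(p.name, p.stat().st_size) for p in sorted(log_dir.glob("*.json"))]


for name, size in log_sizes("./test_table"):
    print(f"{name}: {size} bytes")
```

On an affected version, the sizes would be expected to increase from one commit to the next even though the data is identical.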

More details:
test_table.zip contains the Delta table with the active and id columns, empty.
test_table_broken.zip contains the table with the many \\\ sequences.

[Image: output of cat on 00008.json and 0000.json, showing how the \\ sequences grew]

test_table_broken.zip
test_table.zip

@TinoSM TinoSM added the bug Something isn't working label Nov 19, 2024
@TinoSM TinoSM changed the title from "Delta Table written with rust" to "_delta_log of table written with rust engine+overwrite grows and grows (upto 350mb per file)" Nov 19, 2024
@TinoSM TinoSM changed the title from "_delta_log of table written with rust engine+overwrite grows and grows (upto 350mb per file)" to ">0.17.0 _delta_log of table written with rust engine+overwrite grows and grows (upto 350mb per file)" Nov 19, 2024
@TinoSM TinoSM changed the title from ">0.17.0 _delta_log of table written with rust engine+overwrite grows and grows (upto 350mb per file)" to ">0.17.0 _delta_log gets corrupted after overwrite (log files grows and grows upto 350mb per file)" Nov 20, 2024
@TinoSM

TinoSM commented Nov 24, 2024

Fixed in 0.22

@TinoSM TinoSM closed this as completed Nov 24, 2024
@rtyler rtyler added this to the v0.22 milestone Nov 24, 2024