feat: no longer load full table into ram in write #2265
Conversation
ACTION NEEDED delta-rs follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.
@aersam haven't checked yet, but are you streaming data to open file handles? Or do you close the files after writing chunks?
I just create many files 🙂 |
Streaming would be better, but way more complicated to do, mostly since DataFusion's MemoryExec does not take a Stream but a Vec |
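For context, a minimal sketch of that constraint, assuming a DataFusion version where MemoryExec::try_new takes already-materialized partitions (none of this is delta-rs code, and the function name is made up):

```rust
// Sketch only: MemoryExec wants every partition fully materialized as a
// Vec<RecordBatch>, so a lazy stream has to be collected before it can
// become part of a plan.
use datafusion::arrow::datatypes::SchemaRef;
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::Result;
use datafusion::physical_plan::memory::MemoryExec;

fn plan_from_batches(schema: SchemaRef, batches: Vec<RecordBatch>) -> Result<MemoryExec> {
    // All batches are already in RAM at this point -- there is no variant
    // of this constructor that accepts a RecordBatchStream.
    MemoryExec::try_new(&[batches], schema, None)
}
```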
@aersam in that case I don't think it's the right way; with small record batches you could potentially end up with many, many small parquet files, which would then require constantly running optimize afterwards to fix.
Ok, true. What if I'd count the bytes in a chunk and let it grow to a certain threshold? |
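Roughly this kind of thing (a sketch only; the threshold constant and helper name are placeholders, not delta-rs API):

```rust
// Sketch of the byte-threshold idea: accumulate batches until their
// estimated in-memory size crosses a limit, then hand them off as one chunk.
use datafusion::arrow::record_batch::RecordBatch;

const TARGET_CHUNK_BYTES: usize = 128 * 1024 * 1024; // arbitrary ~128 MiB threshold

fn chunk_by_bytes(batches: impl IntoIterator<Item = RecordBatch>) -> Vec<Vec<RecordBatch>> {
    let mut chunks = Vec::new();
    let mut current = Vec::new();
    let mut current_bytes = 0usize;

    for batch in batches {
        current_bytes += batch.get_array_memory_size();
        current.push(batch);
        if current_bytes >= TARGET_CHUNK_BYTES {
            chunks.push(std::mem::take(&mut current));
            current_bytes = 0;
        }
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```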
This is resolved now, it will now produce a few big files using streams
I also removed lots of duplicate code; the writer was dividing data into partitions twice. Column mapping in write would also be pretty straightforward now
I could implement From for the WriteData enum to make usage a bit simpler, if you want
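Roughly along these lines (a sketch; only the WriteData name comes from the discussion, the variants and types are guesses):

```rust
// Sketch of the From idea: callers can hand over either in-memory batches
// or a stream, and the write path accepts both via Into<WriteData>.
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::physical_plan::SendableRecordBatchStream;

pub enum WriteData {
    Batches(Vec<RecordBatch>),
    Stream(SendableRecordBatchStream),
}

impl From<Vec<RecordBatch>> for WriteData {
    fn from(batches: Vec<RecordBatch>) -> Self {
        WriteData::Batches(batches)
    }
}

impl From<SendableRecordBatchStream> for WriteData {
    fn from(stream: SendableRecordBatchStream) -> Self {
        WriteData::Stream(stream)
    }
}
```

A write builder that accepts impl Into<WriteData> could then take either form with a plain .into() at the call site.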
@aersam I'll try to take a look tonight!
So instead of passing plans over, you are now passing RecordBatchStreams?
Yep. I tried using a StreamingTable, but that one has to be Sync, which is an issue. The stream worked fine
Maybe do that and have the DeltaOps().write use an into
This is done now
One thing came to my mind, I'm not 100% sure, but is it a good idea to
On this I'm not entirely sure; it is a Python object and consuming a reader can only be done once, but @emcake's description of his PR (#2091) suggests it's safe? @wjones127 @emcake any insights or comments on this?
I think it can only be safe if pyarrow itself takes care of acquiring the GIL. I don't know if it does
If that's the case, then we need to apply the GilIterator change to merge_execute as well, since that one also wouldn't be safe anymore with your changes
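For reference, a GilIterator-style wrapper could look roughly like this with pyo3 (illustrative only, not the actual delta-rs code):

```rust
// Sketch: wrap a Python iterator (e.g. a pyarrow RecordBatchReader) so that
// every pull explicitly acquires the GIL before touching the Python object.
use pyo3::exceptions::PyStopIteration;
use pyo3::prelude::*;

struct GilIterator {
    inner: Py<PyAny>,
}

impl Iterator for GilIterator {
    type Item = PyResult<Py<PyAny>>;

    fn next(&mut self) -> Option<Self::Item> {
        Python::with_gil(|py| match self.inner.call_method0(py, "__next__") {
            Ok(obj) => Some(Ok(obj)),
            Err(err) if err.is_instance_of::<PyStopIteration>(py) => None,
            Err(err) => Some(Err(err)),
        })
    }
}
```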
I tried working my way through the pyarrow code to see if it does acquire the GIL, and it looks like it does: https://github.com/apache/arrow/blob/93816475f75d751067d4ff427fb9ae64e85acebe/python/pyarrow/src/arrow/python/ipc.cc#L39 So I'll revert the last two commits
The failure is not related to my changes, I guess?
Closing this in favor of #2289 which I'll keep up to date with the main branch |
Description
Well, I suffered quite a bit and am still not finished. But here's what I've learned so far:
Streams are Send, but not Sync, which complicates their usage in a future. To work around this limitation, I had to split the WriterBuilder into two structs, one for the data and one for the config.

About the implementation:
Instead of doing something proper, I wanted to first create this PR that basically just takes the iterable and breaks it into chunks to process. This is not ideal; it does not parallelize as well as using things like channels, but that would be a bigger change. Still a big win for large tables!
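To make the Send-but-not-Sync point above concrete, here is a very rough illustration of the two-struct split; all names and fields are invented for the sketch and do not mirror the actual code:

```rust
// Sketch: a config half that is plain data (Send + Sync, freely cloneable)
// and a data half that owns the stream and is moved into the write future
// exactly once instead of being shared.
use datafusion::physical_plan::SendableRecordBatchStream;

#[derive(Clone)]
pub struct WriterConfig {
    pub table_uri: String,
    pub partition_columns: Vec<String>,
    pub target_file_size: usize,
}

pub struct WriterData {
    pub stream: SendableRecordBatchStream,
}

pub async fn write(config: WriterConfig, data: WriterData) {
    // Consume data.stream batch by batch here, writing files per config.
    let _ = (config, data);
}
```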
Related Issue(s)
Fixes #2255
Documentation