Re-use PyArrow memory via PyCall #92
This looks super interesting. I've been working to find a way to reuse C++ memory in Julia, but I've been running into some roadblocks. Once we do this, it should be possible to use all of C++ Arrow's functionality, including its crazy fast Parquet reader and Gandiva, and then create a very performant Arrow-based DataFrame library on top of it. @Moelf Are you familiar with C++?
Sorry for the slow response here, but here's one way we could convert between the awkward array and a Julia array:
julia> off = Arrow.Offsets(UInt8[], arr.layout.offsets)
3-element Arrow.Offsets{Int64}:
(1, 3)
(4, 3)
(4, 5)
julia> @time list = Arrow.List{Vector{Int64}, Int64, typeof(arr.layout.content)}(UInt8[], Arrow.ValidityBitmap(UInt8[], 1, 0, 0), off, arr.layout.content, arr.__len__(), nothing)
0.000102 seconds (64 allocations: 3.422 KiB)
3-element Arrow.List{Vector{Int64}, Int64, Vector{Int64}}:
[1, 2, 3]
0-element view(::Vector{Int64}, 4:3) with eltype Int64
[4, 5]
I'm by no means a PyCall.jl expert, so it's unclear to me if, when we do
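To make the offsets-plus-content layout in the REPL output above concrete, here is a minimal pure-Python sketch. It is not Arrow.jl or PyArrow code; the 0-based `offsets` values and the `list_element` helper are assumptions chosen to reproduce the three elements shown above.

```python
from array import array

# Arrow-style variable-length list: one flat content buffer plus n+1 offsets.
# The 0-based offsets below are an assumption that mirrors the REPL output.
content = array("q", [1, 2, 3, 4, 5])   # flat int64 values
offsets = [0, 3, 3, 5]                  # element i spans content[offsets[i]:offsets[i+1]]

flat = memoryview(content)              # slicing a memoryview does not copy

def list_element(i):
    """Return element i of the list array as a zero-copy view (hypothetical helper)."""
    return flat[offsets[i]:offsets[i + 1]]

elements = [list_element(i).tolist() for i in range(len(offsets) - 1)]
print(elements)  # [[1, 2, 3], [], [4, 5]]
```

Because each element is only a view into `content`, building the list does not move the underlying values, which is the property the `Arrow.List` construction above relies on.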
In general though, it's going to be, IMO, practically impossible to try and re-use arrow memory at the individual column/array level. There are too many factors that would complicate things. On the other hand, re-using arrow memory at the IPC stream level (an entire table, if you will) is a main use case for the arrow format. So if you had an arrow IPC stream written to a memory buffer in C++ and were able to pass the pointer + len to Julia, we'd be able to do
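The pointer + len handoff described above can be sketched with `ctypes` from the Python standard library. This is only an illustration of wrapping existing memory without copying; the payload is placeholder bytes rather than a valid Arrow IPC stream, and the variable names are made up for the example.

```python
import ctypes

# Stand-in for an Arrow IPC stream already sitting in a memory buffer.
# These are placeholder bytes, not a valid IPC message.
payload = bytearray(b"placeholder-ipc-bytes")

# Producer side: expose the buffer as a raw address plus a length.
c_view = (ctypes.c_ubyte * len(payload)).from_buffer(payload)
ptr, length = ctypes.addressof(c_view), len(payload)

# Consumer side: wrap the same memory again from (ptr, length), no copy made.
wrapped = (ctypes.c_ubyte * length).from_address(ptr)

# Both names refer to the same bytes: a write through one shows up in the other.
wrapped[0] = ord("P")
print(payload[:11])  # bytearray(b'Placeholder')
```

As with `unsafe_wrap` on the Julia side, the consumer has to make sure the producer keeps the buffer alive for as long as the wrapped view is in use.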
Hey @quinnj, thanks for responding to the issue! I've actually looked into this a bit more, and it looks like implementing the C Data Interface would solve this for us, since that is its main use case: This link shows an example of reusing arrow memory in R and Python. I'm happy to contribute as well if I can get some pointers on where to start.
I've been able to create a Julia struct for the C Data Interface using Clang.jl, but I'm a bit confused about accessing the C Data Interface on the Python side. @Moelf would you know how to do this?
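For reference, a minimal sketch of what the `ArrowArray` struct from the C Data Interface looks like when mirrored with `ctypes` on the Python side. The field order follows the published C struct; the values filled in below are made up, and the struct is populated locally only to stand in for one that a real producer would fill.

```python
import ctypes

class ArrowArray(ctypes.Structure):
    pass  # fields are assigned below because the struct refers to itself

# Signature of the release callback: void (*release)(struct ArrowArray*)
RELEASE = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowArray))

ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", RELEASE),
    ("private_data", ctypes.c_void_p),
]

# Made-up values standing in for a struct that a producer would populate.
src = ArrowArray(length=3, null_count=0, offset=0, n_buffers=2, n_children=0)
address = ctypes.addressof(src)   # the raw pointer that would cross the language boundary

# Reinterpret the raw address as the struct, without copying.
seen = ArrowArray.from_address(address)
print(seen.length, seen.n_buffers)  # 3 2
```

The last step is the Python analogue of loading a struct from a C pointer in Julia: only an integer address is exchanged, and both sides must agree on the struct layout.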
I posted this snippet in Slack, but I think I'll post it here for posterity too. I've made some progress with the C Data Interface, but I'm stuck at converting the C pointer to a Julia struct. I can't quite figure out what I'm doing wrong, but I would appreciate it if someone had a look. These are the resources I used to build this snippet:
that would actually be useful for 99% of the cases. As a concrete example, if we can generally re-use a ugh.... the |
I have re-kindled hope for this, from https://arrow.apache.org/docs/python/ipc.html if we replace Arrow.Table(unsafe_wrap(Vector{UInt8}, ptr, len)) around? |
With the last gist I posted, it's practically viable, with one open question being how we can minimize the actual byte movement in this call:
julia> pywith(pa.ipc.new_stream(jl_sink, batch.schema)) do writer
    writer.write_batch(batch)
end;
Essentially, the question is whether it's possible to get the IPC bytes for a pyarrow batch:
without copying them through a Julia
@Moelf hey just curious, did you eventually figure out a more efficient way than the gist you posted? |
Nope, and there's not much going on in arrow-julia in general, unfortunately.
How about saving to a file on the Python side and then reading it via mmap on the Julia side? If I understand the source code correctly, Arrow.jl uses mmap when reading a local file.
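A rough sketch of the write-then-mmap idea, using only the Python standard library. The payload is placeholder bytes rather than real Arrow data, and the temporary file stands in for whatever path both sides would agree on.

```python
import mmap
import os
import tempfile

payload = b"placeholder-arrow-bytes"   # stand-in for an Arrow file's contents

# Writer side: put the bytes in a file.
fd, path = tempfile.mkstemp(suffix=".arrow")
with os.fdopen(fd, "wb") as f:
    f.write(payload)

# Reader side: map the file instead of reading it into a new buffer.
with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mapped)          # zero-copy view over the mapped pages
    header = bytes(view[:11])          # only this small slice is copied out
    view.release()                     # release the view before closing the map
    mapped.close()

os.remove(path)
print(header)  # b'placeholder'
```

The reader never materializes the whole file in its own buffer, but the writer still has to push the bytes through the filesystem first.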
If you have to save to disk, it's a lost battle. The whole point is there are packages like Also, Arrow.jl is terrible with compressed on-disk files since you can't mmap them, but usually when you're writing to disk, you almost always have compression on.
Hi @quinnj, thank you for being willing to even consider this wild attempt! The only packages you need to re-create this are PyArrow and awkward-1.0 on the Python side and PyCall.jl on the Julia side.
Create example arr:
Then you can get a pyarrow object via:
Currently the fastest / least-copy method of re-using it has been: