Re-use PyArrow memory via PyCall #92

Moelf · 2020-12-29T01:02:48Z

Hi @quinnj , thank you for just willing to consider this wild attempt! The only pkg you need to re-create is PyArrow and awkward-1.0 on the python side and PyCall.jl on Julia side.

Create example arr:

julia> using PyCall

julia> ak = pyimport("awkward");

julia> arr = ak.Array(py"[[1,2,3], [], [4,5]]")
PyObject <Array [[1, 2, 3], [], [4, 5]] type='3 * var * int64'>

julia> arr.layout
PyObject <ListOffsetArray64>
    <offsets><Index64 i="[0 3 3 5]" offset="0" length="4" at="0x0000037ff330"/></offsets>
    <content><NumpyArray format="l" shape="5" data="1 2 3 4 5" at="0x000003a256a0"/></content>
</ListOffsetArray64>

Then you can get an pyarrow object via:

julia> arr_arrow = ak.to_arrow(arr)
PyObject <pyarrow.lib.ListArray object at 0x7f2144343048>
...
..

julia> @time [Int64[x...] for x in arr]
  0.034800 seconds (38.72 k allocations: 2.031 MiB)
3-element Array{Array{Int64,1},1}:
 [1, 2, 3]
 []
 [4, 5]

Currently the fastest / least copy method of re-using as been:

function view_ak(arr)
    c = PyArray(arr.layout."content")
    o = PyArray(arr.layout."offsets")
    @views [c[o[i]+1:o[i+1]] for i in 1:length(o)-1]
end

julia> @time view_ak(arr)
  0.000089 seconds (37 allocations: 1.609 KiB)
3-element Array{SubArray{Int64,1,PyArray{Int64,1},Tuple{UnitRange{Int64}},false},1}:
 [1, 2, 3]
 0-element view(::PyArray{Int64,1}, 4:3) with eltype Int64
 [4, 5]

The text was updated successfully, but these errors were encountered:

sa- · 2020-12-29T15:16:32Z

This looks super interesting. I've been working to find a way to reuse c++ memory in Julia but I've been running into some roadblocks. Once we do this, it should be possible to use all of c++'s arrow functionality including their crazy fast parquet reader and gandiva, and then create a very performant arrow based DataFrame library on top of it.

@Moelf Are you familiar with c++?

quinnj · 2021-04-14T05:29:03Z

Sorry for the slow response here, but here's one way we could convert between the awkward array and a Julia array:

julia> off = Arrow.Offsets(UInt8[], arr.layout.offsets)
3-element Arrow.Offsets{Int64}:
 (1, 3)
 (4, 3)
 (4, 5)

julia> @time list = Arrow.List{Vector{Int64}, Int64, typeof(arr.layout.content)}(UInt8[], Arrow.ValidityBitmap(UInt8[], 1, 0, 0), off, arr.layout.content, arr.__len__(), nothing)
  0.000102 seconds (64 allocations: 3.422 KiB)
3-element Arrow.List{Vector{Int64}, Int64, Vector{Int64}}:
 [1, 2, 3]
 0-element view(::Vector{Int64}, 4:3) with eltype Int64
 [4, 5]

I'm by no means a PyCall.jl expert, so it's unclear to me if, when we do arr.layout.offsets, a copy of the data is made; judging from the allocations in the @time invocation, I'd guess that a copy is being made. There's perhaps a PyCall.jl incantation to avoid making a copy, which would be ideal. Otherwise, perhaps there's a way to get an actual pointer to the awkward offsets/content arrays which we could unsafe_wrap to avoid allocating.

quinnj · 2021-04-14T05:33:09Z

In general though, it's going to be, IMO, practically impossible to try and re-use arrow memory at the individual column/array level. There are two many factors that would complicate things. On the other hand, re-using arrow memory at the IPC stream level (an entire table, if you will), is a main use-case for the arrow format. So if you had an arrow IPC stream written to a memory buffer in c++ and were able to pass the pointer + len to Julia, we'd be able to do Arrow.Table(unsafe_wrap(Vector{UInt8}, ptr, len)) and that should work just fine.

sa- · 2021-04-15T06:38:24Z

Hey @quinnj ,

Thanks for responding to the issue! I've actually looked into this issue a bit more, it looks like implementing the C Data Interface would solve this for us, since that is its main use case:
https://arrow.apache.org/blog/2020/05/03/introducing-arrow-c-data-interface/

This link shows and example of reusing arrow memory in r and python.

I'm happy to contribute as well if I can get some pointers on where to start

sa- · 2021-04-15T06:41:00Z

I've been able to create Julia struct for the C Data Interface using Clang.jl, but I'm a bit confused about accessing the C Data Interface on the python side. @Moelf would you know how to do this?

sa- · 2021-04-15T23:46:29Z

I posted this snippet in Slack, but I think I'll post it here for posterity too. I've made some progress with the C Data Interface but I'm stuck at converting the C Pointer to a Julia struct. Can't quite figure out what I'm doing wrong, but I would appreciate it if someone had a look.

These are the resources I used to build this snippet:

https://arrow.apache.org/docs/format/CDataInterface.html
https://gist.github.com/wesm/d48908018c4b7a0d9789a31d10caf525 (click "Display the rendered blob" on the top right)

#ENV["PYTHON"] = "/usr/local/opt/[email protected]/bin/python3"
#using Pkg
#Pkg.build("PyCall")
using PyCall
pd = pyimport("pandas")
pa = pyimport("pyarrow")
ffi = pyimport("pyarrow.cffi").ffi
##
const ARROW_FLAG_DICTIONARY_ORDERED = 1
const ARROW_FLAG_NULLABLE = 2
const ARROW_FLAG_MAP_KEYS_SORTED = 4

struct ArrowSchema
    format::Cstring
    name::Cstring
    metadata::Cstring
    flags::Cint
    n_children::Cint
    children::Ptr{Ptr{ArrowSchema}}
    dictionary::Ptr{ArrowSchema}
    release::Ptr{Cvoid}
    private_data::Ptr{Cvoid}
end

struct ArrowArray
    length::Cint
    null_count::Cint
    offset::Cint
    n_buffers::Cint
    n_children::Cint
    buffers::Ptr{Ptr{Cvoid}}
    children::Ptr{Ptr{ArrowArray}}
    dictionary::Ptr{ArrowArray}
    release::Ptr{Cvoid}
    private_data::Ptr{Cvoid}
end
##

df = pd.DataFrame(py"""{'a': [1, 2, 3, 4, 5], 'b': ['a', 'b', 'c', 'd', 'e']}"""o)
rb = pa.record_batch(df)
##
c_schema_py = ffi.new("struct ArrowSchema*")
c_schema_ptr = ffi.cast("uintptr_t", c_schema_py).__int__()
c_batch_py = ffi.new("struct ArrowArray*")
c_batch_ptr = ffi.cast("uintptr_t", c_batch_py).__int__()
##
rb.schema._export_to_c(c_schema_ptr)
rb._export_to_c(c_batch_ptr)
##
c_schema_jl = unsafe_load(convert(Ptr{ArrowSchema}, c_schema_ptr))
@assert c_schema_jl.n_children == c_batch_py.n_children

Moelf · 2021-05-23T17:58:48Z

On the other hand, re-using arrow memory at the IPC stream level (an entire table, if you will)

that would actually be useful for 99% of the cases. As a concrete example, if we can generally re-use a pyarrow.Table construct, it would make 99% use case possible because almost everything interfaces to this object in python land.

ugh.... the unwrap_pyarrow_table is in Cython, so dumb

Moelf · 2022-08-31T01:50:36Z

I have re-kindled hope for this, from https://arrow.apache.org/docs/python/ipc.html

if we replace sink with a Julia buffer does that work? @quinnj I think this is the IPC stream chunk (of bytes) you mentioned earlier, that supposedly Julia can just

Arrow.Table(unsafe_wrap(Vector{UInt8}, ptr, len))

around?

Moelf · 2022-08-31T02:55:41Z

https://gist.github.com/Moelf/de9e6be8575ed5a0399e04637fb935cf

Moelf · 2023-01-22T21:10:02Z

with the last gist I posted, it's practically viable, with one open question being how can we minimize the actual bytes movement in this call

julia> pywith(pa.ipc.new_stream(jl_sink, batch.schema)) do writer
               writer.write_batch(batch)
           end;

essentially, the question is if it's possible to get the ipc bytes for a pyarrow batch:

Python RecordBatch:
pyarrow.RecordBatch

without copying them through a Julia IOBuffer

mzy2240 · 2024-10-15T14:33:02Z

@Moelf hey just curious, did you eventually figure out a more efficient way than the gist you posted?

Moelf · 2024-10-15T15:53:54Z

nope, and there's not much going on in arrow-julia in general unfortunatelly

mzy2240 · 2024-10-15T18:31:03Z

how about save to mmap in python side and then read mmap in julia side? If I understand the source code correctly, when reading a local file using Arrow.jl, mmap is used.

Moelf · 2024-10-15T19:10:32Z

if you have to save to disk it's lost battle. the whole point is there are packages like awkawrd from Python's side that can ingest a few different file format and internally have Arrow Table memory layout. If you have to go to disk, you can do any of other options.

Also, Arrow.jl is terrible with compressed on-disk file since you can't mmap it, but usually when you're writing to disk, you almost always have compression on.

Moelf closed this as completed Jan 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Re-use PyArrow memory via PyCall #92

Re-use PyArrow memory via PyCall #92

Moelf commented Dec 29, 2020 •

edited

Loading

sa- commented Dec 29, 2020 •

edited

Loading

quinnj commented Apr 14, 2021

quinnj commented Apr 14, 2021

sa- commented Apr 15, 2021

sa- commented Apr 15, 2021

sa- commented Apr 15, 2021 •

edited

Loading

Moelf commented May 23, 2021 •

edited

Loading

Moelf commented Aug 31, 2022

Moelf commented Aug 31, 2022

Moelf commented Jan 22, 2023

mzy2240 commented Oct 15, 2024

Moelf commented Oct 15, 2024

mzy2240 commented Oct 15, 2024

Moelf commented Oct 15, 2024

Re-use PyArrow memory via PyCall #92

Re-use PyArrow memory via PyCall #92

Comments

Moelf commented Dec 29, 2020 • edited Loading

sa- commented Dec 29, 2020 • edited Loading

quinnj commented Apr 14, 2021

quinnj commented Apr 14, 2021

sa- commented Apr 15, 2021

sa- commented Apr 15, 2021

sa- commented Apr 15, 2021 • edited Loading

Moelf commented May 23, 2021 • edited Loading

Moelf commented Aug 31, 2022

Moelf commented Aug 31, 2022

Moelf commented Jan 22, 2023

mzy2240 commented Oct 15, 2024

Moelf commented Oct 15, 2024

mzy2240 commented Oct 15, 2024

Moelf commented Oct 15, 2024

Moelf commented Dec 29, 2020 •

edited

Loading

sa- commented Dec 29, 2020 •

edited

Loading

sa- commented Apr 15, 2021 •

edited

Loading

Moelf commented May 23, 2021 •

edited

Loading