Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-use PyArrow memory via PyCall #92

Closed
Moelf opened this issue Dec 29, 2020 · 14 comments
Closed

Re-use PyArrow memory via PyCall #92

Moelf opened this issue Dec 29, 2020 · 14 comments

Comments

@Moelf
Copy link
Contributor

Moelf commented Dec 29, 2020

Hi @quinnj , thank you for just willing to consider this wild attempt! The only pkg you need to re-create is PyArrow and awkward-1.0 on the python side and PyCall.jl on Julia side.

Create example arr:

julia> using PyCall

julia> ak = pyimport("awkward");

julia> arr = ak.Array(py"[[1,2,3], [], [4,5]]")
PyObject <Array [[1, 2, 3], [], [4, 5]] type='3 * var * int64'>

julia> arr.layout
PyObject <ListOffsetArray64>
    <offsets><Index64 i="[0 3 3 5]" offset="0" length="4" at="0x0000037ff330"/></offsets>
    <content><NumpyArray format="l" shape="5" data="1 2 3 4 5" at="0x000003a256a0"/></content>
</ListOffsetArray64>

Then you can get an pyarrow object via:

julia> arr_arrow = ak.to_arrow(arr)
PyObject <pyarrow.lib.ListArray object at 0x7f2144343048>
...
..

julia> @time [Int64[x...] for x in arr]
  0.034800 seconds (38.72 k allocations: 2.031 MiB)
3-element Array{Array{Int64,1},1}:
 [1, 2, 3]
 []
 [4, 5]

Currently the fastest / least copy method of re-using as been:

function view_ak(arr)
    c = PyArray(arr.layout."content")
    o = PyArray(arr.layout."offsets")
    @views [c[o[i]+1:o[i+1]] for i in 1:length(o)-1]
end

julia> @time view_ak(arr)
  0.000089 seconds (37 allocations: 1.609 KiB)
3-element Array{SubArray{Int64,1,PyArray{Int64,1},Tuple{UnitRange{Int64}},false},1}:
 [1, 2, 3]
 0-element view(::PyArray{Int64,1}, 4:3) with eltype Int64
 [4, 5]
@sa-
Copy link

sa- commented Dec 29, 2020

This looks super interesting. I've been working to find a way to reuse c++ memory in Julia but I've been running into some roadblocks. Once we do this, it should be possible to use all of c++'s arrow functionality including their crazy fast parquet reader and gandiva, and then create a very performant arrow based DataFrame library on top of it.

@Moelf Are you familiar with c++?

@quinnj
Copy link
Member

quinnj commented Apr 14, 2021

Sorry for the slow response here, but here's one way we could convert between the awkward array and a Julia array:

julia> off = Arrow.Offsets(UInt8[], arr.layout.offsets)
3-element Arrow.Offsets{Int64}:
 (1, 3)
 (4, 3)
 (4, 5)

julia> @time list = Arrow.List{Vector{Int64}, Int64, typeof(arr.layout.content)}(UInt8[], Arrow.ValidityBitmap(UInt8[], 1, 0, 0), off, arr.layout.content, arr.__len__(), nothing)
  0.000102 seconds (64 allocations: 3.422 KiB)
3-element Arrow.List{Vector{Int64}, Int64, Vector{Int64}}:
 [1, 2, 3]
 0-element view(::Vector{Int64}, 4:3) with eltype Int64
 [4, 5]

I'm by no means a PyCall.jl expert, so it's unclear to me if, when we do arr.layout.offsets, a copy of the data is made; judging from the allocations in the @time invocation, I'd guess that a copy is being made. There's perhaps a PyCall.jl incantation to avoid making a copy, which would be ideal. Otherwise, perhaps there's a way to get an actual pointer to the awkward offsets/content arrays which we could unsafe_wrap to avoid allocating.

@quinnj
Copy link
Member

quinnj commented Apr 14, 2021

In general though, it's going to be, IMO, practically impossible to try and re-use arrow memory at the individual column/array level. There are two many factors that would complicate things. On the other hand, re-using arrow memory at the IPC stream level (an entire table, if you will), is a main use-case for the arrow format. So if you had an arrow IPC stream written to a memory buffer in c++ and were able to pass the pointer + len to Julia, we'd be able to do Arrow.Table(unsafe_wrap(Vector{UInt8}, ptr, len)) and that should work just fine.

@sa-
Copy link

sa- commented Apr 15, 2021

Hey @quinnj ,

Thanks for responding to the issue! I've actually looked into this issue a bit more, it looks like implementing the C Data Interface would solve this for us, since that is its main use case:
https://arrow.apache.org/blog/2020/05/03/introducing-arrow-c-data-interface/

This link shows and example of reusing arrow memory in r and python.

I'm happy to contribute as well if I can get some pointers on where to start

@sa-
Copy link

sa- commented Apr 15, 2021

I've been able to create Julia struct for the C Data Interface using Clang.jl, but I'm a bit confused about accessing the C Data Interface on the python side. @Moelf would you know how to do this?

@sa-
Copy link

sa- commented Apr 15, 2021

I posted this snippet in Slack, but I think I'll post it here for posterity too. I've made some progress with the C Data Interface but I'm stuck at converting the C Pointer to a Julia struct. Can't quite figure out what I'm doing wrong, but I would appreciate it if someone had a look.

These are the resources I used to build this snippet:

#ENV["PYTHON"] = "/usr/local/opt/[email protected]/bin/python3"
#using Pkg
#Pkg.build("PyCall")
using PyCall
pd = pyimport("pandas")
pa = pyimport("pyarrow")
ffi = pyimport("pyarrow.cffi").ffi
##
const ARROW_FLAG_DICTIONARY_ORDERED = 1
const ARROW_FLAG_NULLABLE = 2
const ARROW_FLAG_MAP_KEYS_SORTED = 4

struct ArrowSchema
    format::Cstring
    name::Cstring
    metadata::Cstring
    flags::Cint
    n_children::Cint
    children::Ptr{Ptr{ArrowSchema}}
    dictionary::Ptr{ArrowSchema}
    release::Ptr{Cvoid}
    private_data::Ptr{Cvoid}
end

struct ArrowArray
    length::Cint
    null_count::Cint
    offset::Cint
    n_buffers::Cint
    n_children::Cint
    buffers::Ptr{Ptr{Cvoid}}
    children::Ptr{Ptr{ArrowArray}}
    dictionary::Ptr{ArrowArray}
    release::Ptr{Cvoid}
    private_data::Ptr{Cvoid}
end
##

df = pd.DataFrame(py"""{'a': [1, 2, 3, 4, 5], 'b': ['a', 'b', 'c', 'd', 'e']}"""o)
rb = pa.record_batch(df)
##
c_schema_py = ffi.new("struct ArrowSchema*")
c_schema_ptr = ffi.cast("uintptr_t", c_schema_py).__int__()
c_batch_py = ffi.new("struct ArrowArray*")
c_batch_ptr = ffi.cast("uintptr_t", c_batch_py).__int__()
##
rb.schema._export_to_c(c_schema_ptr)
rb._export_to_c(c_batch_ptr)
##
c_schema_jl = unsafe_load(convert(Ptr{ArrowSchema}, c_schema_ptr))
@assert c_schema_jl.n_children == c_batch_py.n_children

@Moelf
Copy link
Contributor Author

Moelf commented May 23, 2021

On the other hand, re-using arrow memory at the IPC stream level (an entire table, if you will)

that would actually be useful for 99% of the cases. As a concrete example, if we can generally re-use a pyarrow.Table construct, it would make 99% use case possible because almost everything interfaces to this object in python land.

ugh.... the unwrap_pyarrow_table is in Cython, so dumb

@Moelf
Copy link
Contributor Author

Moelf commented Aug 31, 2022

I have re-kindled hope for this, from https://arrow.apache.org/docs/python/ipc.html

if we replace sink with a Julia buffer does that work? @quinnj I think this is the IPC stream chunk (of bytes) you mentioned earlier, that supposedly Julia can just

Arrow.Table(unsafe_wrap(Vector{UInt8}, ptr, len))

around?

@Moelf
Copy link
Contributor Author

Moelf commented Aug 31, 2022

@Moelf
Copy link
Contributor Author

Moelf commented Jan 22, 2023

with the last gist I posted, it's practically viable, with one open question being how can we minimize the actual bytes movement in this call

julia> pywith(pa.ipc.new_stream(jl_sink, batch.schema)) do writer
               writer.write_batch(batch)
           end;

essentially, the question is if it's possible to get the ipc bytes for a pyarrow batch:

Python RecordBatch:
pyarrow.RecordBatch

without copying them through a Julia IOBuffer

@Moelf Moelf closed this as completed Jan 22, 2023
@mzy2240
Copy link

mzy2240 commented Oct 15, 2024

@Moelf hey just curious, did you eventually figure out a more efficient way than the gist you posted?

@Moelf
Copy link
Contributor Author

Moelf commented Oct 15, 2024

nope, and there's not much going on in arrow-julia in general unfortunatelly

@mzy2240
Copy link

mzy2240 commented Oct 15, 2024

how about save to mmap in python side and then read mmap in julia side? If I understand the source code correctly, when reading a local file using Arrow.jl, mmap is used.

@Moelf
Copy link
Contributor Author

Moelf commented Oct 15, 2024

if you have to save to disk it's lost battle. the whole point is there are packages like awkawrd from Python's side that can ingest a few different file format and internally have Arrow Table memory layout. If you have to go to disk, you can do any of other options.

Also, Arrow.jl is terrible with compressed on-disk file since you can't mmap it, but usually when you're writing to disk, you almost always have compression on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants