Arrow python client read and write, did not output the expected metadata #833

liusitan · 2022-07-07T04:59:23Z

liusitan
Jul 7, 2022

This is the code for reading.
preader.py

import pyarrow as pa
# f = '/tmp/fib.arrow'
f = 'arraydata.arrow'
with pa.ipc.open_file(f) as reader:
    t = reader.read_all()
    print(t.to_string(show_metadata=True, preview_cols=2))

This is the code for writing.
pwriter.py

import numpy as np
import pyarrow as pa

arr = pa.array(np.arange(8))
schema = pa.schema([
    pa.field('nums', arr.type)
])
n_legs = pa.array([2, 4, 5, 100])
t = pa.Table.from_arrays([n_legs], names=["nums"],metadata = {"loc":"san diego"})
with pa.OSFile('arraydata.arrow', 'wb') as sink:
    with pa.ipc.new_file(sink, schema=schema) as writer:
        # batch = pa.record_batch([arr], schema=schema)
        writer.write_table(t)

Oddly, when I run the Python writer first and the reader second, the reader does not print the metadata I stored.

I am using macOS, if it matters.

liusitan · 2022-07-07T16:22:07Z

liusitan
Jul 7, 2022
Author

apache/arrow#13535
I also posted this to the Apache Arrow community, where David liu answered that in Arrow, table metadata is stored inside the schema. When a writer is created with a given schema, every table it writes must follow that schema. In this example, the metadata of the "schema" variable is None, while the metadata of t is {"loc":"san diego"}. Because the writer is initialized with "schema", the metadata of t is replaced by None no matter what t contains. I also found that the metadata can be read back with t.schema.metadata. Below is the updated example, which prints the extracted metadata.

In the writer, schema is replaced by t.schema:
pwriter.py

import pyarrow as pa

n_legs = pa.array([2, 4, 5, 100])
# Table-level metadata is stored on the table's schema.
t = pa.Table.from_arrays([n_legs], names=["nums"], metadata={"loc": "san diego"})
with pa.OSFile('arraydata.arrow', 'wb') as sink:
    # Pass t.schema instead of a separately built schema so the metadata is written.
    with pa.ipc.new_file(sink, schema=t.schema) as writer:
        writer.write_table(t)

Printing out the metadata:
preader.py

import pyarrow as pa

f = 'arraydata.arrow'
with pa.ipc.open_file(f) as reader:
    t = reader.read_all()
    # Table metadata lives on the schema; keys and values come back as bytes.
    print(t.schema.metadata)
