fixed size list type is not retained when writing to parquet #957

matko · 2024-11-25T19:44:37Z

When I create a parquet file from an arrow table with a fixed size array as one of the columns, then read back the resulting parquet, the column is no longer a fixed size array, but instead a dynamically sized array.

Example:

import datafusion as df
import pyarrow as pa

FILENAME = "/tmp/fixed_array_example.parquet"
ctx = df.SessionContext()

array = pa.array([[1.0, 2.0], [3.0, 4.0]], type=pa.list_(pa.float32(), 2))
table = pa.Table.from_pydict({"array": array})
df_table = ctx.from_arrow(table)
print("original schema:")
print(df_table.schema())

df_table.write_parquet(FILENAME)
print("roundtrip schema:")
print(ctx.read_parquet(FILENAME).schema())

Output:

original schema:
array: fixed_size_list<item: float>[2]
  child 0, item: float
roundtrip schema:
array: list<item: float>
  child 0, item: float

As the output demonstrates, the datafusion dataframe that is written out has the proper schema. Nevertheless, the file that is read back does not.

If instead of datafusion, I use pyarrow to write the parquet file, I do get the expected schema when I read it back using datafusion.

import datafusion as df
import pyarrow as pa
import pyarrow.parquet as pq

FILENAME = "/tmp/fixed_array_example_pyarrow.parquet"
ctx = df.SessionContext()

array = pa.array([[1.0, 2.0], [3.0, 4.0]], type=pa.list_(pa.float32(), 2))
table = pa.Table.from_pydict({"array": array})

print("original schema:")
print(table.schema)

pq.write_table(table, FILENAME)
print("roundtrip schema:")
print(ctx.read_parquet(FILENAME).schema())

output:

original schema:
array: fixed_size_list<item: float>[2]
  child 0, item: float
roundtrip schema:
array: fixed_size_list<element: float>[2]
  child 0, element: float

The text was updated successfully, but these errors were encountered:

kosiew · 2024-12-03T11:30:11Z

Research note:
currently FixedSizeList is stored as List in Parquet

kylebarron · 2024-12-17T20:44:46Z

I think you want to set skip_metadata=False

datafusion-python/python/datafusion/context.py

Line 717 in 79c22d6

skip_metadata: bool = True,

In my opinion that should be the default, otherwise no Arrow-only types will be retained I think.

kosiew · 2024-12-18T03:36:31Z

hi @kylebarron ,

I amended skip_metadata: bool = False and ran

import datafusion as df
import pyarrow as pa

FILENAME = "/tmp/fixed_array_example.parquet"
ctx = df.SessionContext()

array = pa.array([[1.0, 2.0], [3.0, 4.0]], type=pa.list_(pa.float32(), 2))
table = pa.Table.from_pydict({"array": array})
df_table = ctx.from_arrow(table)
print("original schema:")
print(df_table.schema())

df_table.write_parquet(FILENAME)

ctx.register_parquet("test_fixed_list", FILENAME)

print("register parquet schema:")
print(ctx.table("test_fixed_list").schema())

but still the fixed size list type was still retained:

original schema:
array: fixed_size_list<item: float>[2]
  child 0, item: float
==> register_parquet with skip_metadata:  False
register parquet schema:
array: list<item: float>
  child 0, item: float

kylebarron · 2024-12-19T22:10:24Z

In this case it's an issue with writing the Parquet file, not reading, which you can see if you try to read the file back with pyarrow:

In [23]: import pyarrow.parquet as pq

In [24]: pq.read_schema(FILENAME)
Out[24]:
array: list<item: float>
  child 0, item: float

In this case it's actually because the writing side doesn't correctly propagate the Arrow metadata either.

Here's how pyarrow.parquet correctly propagates the Arrow schema within the Parquet metadata:

In [32]: pq.write_table(table, "test.parquet")

In [33]: meta2 = pq.read_metadata('test.parquet')

In [34]: meta2.metadata
Out[34]: {b'ARROW:schema': b'/////6gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAEAAAAzP///wAAARAUAAAAIAAAAAQAAAABAAAALAAAAAUAAABhcnJheQAGAAgABAAGAAAAAgAAABAAFAAIAAYABwAMAAAAEAAQAAAAAAABAxAAAAAcAAAABAAAAAAAAAAEAAAAaXRlbQAABgAIAAYABgAAAAAAAQA='}

However there's no embedded Arrow schema in the Parquet file written by DataFusion:

In [35]: df_table.write_parquet(FILENAME)

In [36]: meta = pq.read_metadata(FILENAME)

In [37]: meta.metadata # None

kylebarron · 2024-12-19T22:32:22Z

IMO not writing the Arrow schema to Parquet is a big bug.

Trying to track this down...

datafusion-python/src/dataframe.rs

Lines 510 to 520 in 79c22d6

    
           let mut options = TableParquetOptions::default(); 
        
           options.global.compression = Some(compression_string); 
        
           wait_for_future( 
        
               py, 
        
               self.df.as_ref().clone().write_parquet( 
        
                   path, 
        
                   DataFrameWriteOptions::new(), 
        
                   Option::from(options), 
        
               ), 
        
           )?;

This just calls datafusion::dataframe::DataFrame::write_parquet with the default options.

I'm not sure where on the datafusion side this fails.

kylebarron · 2024-12-19T22:39:41Z

Looks like it's this bug: apache/datafusion#11770

matko added the bug Something isn't working label Nov 25, 2024

kylebarron mentioned this issue Dec 19, 2024

Arrow schema is missing from the parquet metadata, for files written by ParquetSink. apache/datafusion#11770

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fixed size list type is not retained when writing to parquet #957

fixed size list type is not retained when writing to parquet #957

matko commented Nov 25, 2024 •

edited

Loading

kosiew commented Dec 3, 2024 •

edited

Loading

kylebarron commented Dec 17, 2024 •

edited

Loading

kosiew commented Dec 18, 2024

kylebarron commented Dec 19, 2024

kylebarron commented Dec 19, 2024

kylebarron commented Dec 19, 2024

fixed size list type is not retained when writing to parquet #957

fixed size list type is not retained when writing to parquet #957

Comments

matko commented Nov 25, 2024 • edited Loading

kosiew commented Dec 3, 2024 • edited Loading

kylebarron commented Dec 17, 2024 • edited Loading

kosiew commented Dec 18, 2024

kylebarron commented Dec 19, 2024

kylebarron commented Dec 19, 2024

kylebarron commented Dec 19, 2024

matko commented Nov 25, 2024 •

edited

Loading

kosiew commented Dec 3, 2024 •

edited

Loading

kylebarron commented Dec 17, 2024 •

edited

Loading