Unable to read parquet file #585

Open
sarnikowski opened this issue Jun 3, 2024 · 2 comments

sarnikowski commented Jun 3, 2024

I am having trouble reading a parquet file with this library; the file was generated using pyarrow and pandas in Python. Any help would be appreciated.

example.zip
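
For reference, a parquet file with this schema and content can be generated with pandas and pyarrow along the following lines (a minimal sketch pieced together from the schema and column statistics shown further down; the script that actually produced example.zip may differ):

import pandas as pd

# One-row frame matching the schema reported by pqrs:
# OPTIONAL INT64 id, OPTIONAL BYTE_ARRAY name (STRING).
# Values are taken from the column statistics (id = 1, name = "Peter Parker").
df = pd.DataFrame({"id": [1], "name": ["Peter Parker"]})

# pandas delegates the write to pyarrow; compression="lz4" makes pyarrow
# emit LZ4_RAW-compressed column chunks, as reported in the metadata below.
df.to_parquet("example.parquet", engine="pyarrow", compression="lz4")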

With the provided example file (the file in the zip), the following code produces an empty string:

package main

import (
	"encoding/json"
	"fmt"

	"github.com/xitongsys/parquet-go-source/local"
	"github.com/xitongsys/parquet-go/reader"
)

func main() {
	// Open the example parquet file from the local filesystem.
	fr, err := local.NewLocalFileReader("path/to/file")
	if err != nil {
		panic(err)
	}
	defer fr.Close()

	// nil schema: let the reader derive the schema from the file footer.
	pr, err := reader.NewParquetReader(fr, nil, 1)
	if err != nil {
		panic(err)
	}
	defer pr.ReadStop()

	// Read all rows at once and dump them as JSON.
	num := int(pr.GetNumRows())
	res, err := pr.ReadByNumber(num)
	if err != nil {
		panic(err)
	}

	jsonB, err := json.Marshal(res)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s", string(jsonB))
}

Below is the metadata printed when using pqrs schema --detailed on the file:

version: 2
num of rows: 1
created by: parquet-cpp-arrow version 16.1.0
metadata:
  pandas: {"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 1, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "id", "field_name": "id", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}, {"name": "name", "field_name": "name", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}], "creator": {"library": "pyarrow", "version": "16.1.0"}, "pandas_version": "2.1.4"}
  ARROW:schema: //////gCAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAAEwCAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYAAABwYW5kYXMAABQCAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJhbmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAwLCAic3RvcCI6IDEsICJzdGVwIjogMX1dLCAiY29sdW1uX2luZGV4ZXMiOiBbeyJuYW1lIjogbnVsbCwgImZpZWxkX25hbWUiOiBudWxsLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IHsiZW5jb2RpbmciOiAiVVRGLTgifX1dLCAiY29sdW1ucyI6IFt7Im5hbWUiOiAiaWQiLCAiZmllbGRfbmFtZSI6ICJpZCIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJuYW1lIiwgImZpZWxkX25hbWUiOiAibmFtZSIsICJwYW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH1dLCAiY3JlYXRvciI6IHsibGlicmFyeSI6ICJweWFycm93IiwgInZlcnNpb24iOiAiMTYuMS4wIn0sICJwYW5kYXNfdmVyc2lvbiI6ICIyLjEuNCJ9AAAAAAIAAABEAAAABAAAANT///8AAAEFEAAAABwAAAAEAAAAAAAAAAQAAABuYW1lAAAAAAQABAAEAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAECEAAAABwAAAAEAAAAAAAAAAIAAABpZAAACAAMAAgABwAIAAAAAAAAAUAAAAAAAAAA
message schema {
  OPTIONAL INT64 id;
  OPTIONAL BYTE_ARRAY name (STRING);
}


num of row groups: 1
row groups:

row group 0:
--------------------------------------------------------------------------------
total byte size: 180
num of rows: 1

num of columns: 2
columns:

column 0:
--------------------------------------------------------------------------------
column type: INT64
column path: "id"
encodings: PLAIN RLE RLE_DICTIONARY
file path: N/A
file offset: 98
num of values: 1
compression: LZ4_RAW
total compressed size (in bytes): 94
total uncompressed size (in bytes): 92
data page offset: 27
index page offset: N/A
dictionary page offset: 4
statistics: {min: 1, max: 1, distinct_count: N/A, null_count: 0, min_max_deprecated: false}
bloom filter offset: N/A
offset index offset: N/A
offset index length: N/A
column index offset: N/A
column index length: N/A


column 1:
--------------------------------------------------------------------------------
column type: BYTE_ARRAY
column path: "name"
encodings: PLAIN RLE RLE_DICTIONARY
file path: N/A
file offset: 281
num of values: 1
compression: LZ4_RAW
total compressed size (in bytes): 91
total uncompressed size (in bytes): 88
data page offset: 222
index page offset: N/A
dictionary page offset: 190
statistics: {min: [80, 101, 116, 101, 114, 32, 80, 97, 114, 107, 101, 114], max: [80, 101, 116, 101, 114, 32, 80, 97, 114, 107, 101, 114], distinct_count: N/A, null_count: 0, min_max_deprecated: false}
bloom filter offset: N/A
offset index offset: N/A
offset index length: N/A
column index offset: N/A
column index length: N/A

hangxie commented Jul 6, 2024

Would you please help to try out #591?


byteplus-rec commented Sep 6, 2024

> Would you please help to try out #591?

Hello, I have encountered the same issue. I am currently using version 1.5.3. Upgrading to the latest version solves the problem, but it introduces some dependency conflicts in my service. Is it possible to support the new compressor in older versions of parquet-go? Alternatively, would it be feasible to open up compressor customization so that users can implement their own compressors?
