Unable to read parquet file #585

Open
sarnikowski opened this issue Jun 3, 2024 · 2 comments

sarnikowski commented Jun 3, 2024

I am having trouble reading a parquet file with this library; the file was generated using pyarrow and pandas in Python. Any help would be appreciated.

example.zip
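
For reference, a parquet file with this schema and content can be generated with pandas and pyarrow along the following lines (a minimal sketch pieced together from the schema and column statistics shown further down; the script that actually produced example.zip may differ):

import pandas as pd

# One-row frame matching the schema reported by pqrs:
# OPTIONAL INT64 id, OPTIONAL BYTE_ARRAY name (STRING).
# Values are taken from the column statistics (id = 1, name = "Peter Parker").
df = pd.DataFrame({"id": [1], "name": ["Peter Parker"]})

# pandas delegates the write to pyarrow; compression="lz4" makes pyarrow
# emit LZ4_RAW-compressed column chunks, as reported in the metadata below.
df.to_parquet("example.parquet", engine="pyarrow", compression="lz4")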

With the provided example file (the file in the zip), the following code produces an empty string:

package main

import (
	"encoding/json"
	"fmt"

	"github.com/xitongsys/parquet-go-source/local"
	"github.com/xitongsys/parquet-go/reader"
)

func main() {
	// Open the example parquet file from the local filesystem.
	fr, err := local.NewLocalFileReader("path/to/file")
	if err != nil {
		panic(err)
	}
	defer fr.Close()

	// nil schema: let the reader derive the schema from the file footer.
	pr, err := reader.NewParquetReader(fr, nil, 1)
	if err != nil {
		panic(err)
	}
	defer pr.ReadStop()

	// Read all rows at once and dump them as JSON.
	num := int(pr.GetNumRows())
	res, err := pr.ReadByNumber(num)
	if err != nil {
		panic(err)
	}

	jsonB, err := json.Marshal(res)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s", string(jsonB))
}

Below is the metadata printed when using pqrs schema --detailed on the file:

version: 2
num of rows: 1
created by: parquet-cpp-arrow version 16.1.0
metadata:
  pandas: {"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 1, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "id", "field_name": "id", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}, {"name": "name", "field_name": "name", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}], "creator": {"library": "pyarrow", "version": "16.1.0"}, "pandas_version": "2.1.4"}
  ARROW:schema: //////gCAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAAEwCAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYAAABwYW5kYXMAABQCAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJhbmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAwLCAic3RvcCI6IDEsICJzdGVwIjogMX1dLCAiY29sdW1uX2luZGV4ZXMiOiBbeyJuYW1lIjogbnVsbCwgImZpZWxkX25hbWUiOiBudWxsLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IHsiZW5jb2RpbmciOiAiVVRGLTgifX1dLCAiY29sdW1ucyI6IFt7Im5hbWUiOiAiaWQiLCAiZmllbGRfbmFtZSI6ICJpZCIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJuYW1lIiwgImZpZWxkX25hbWUiOiAibmFtZSIsICJwYW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH1dLCAiY3JlYXRvciI6IHsibGlicmFyeSI6ICJweWFycm93IiwgInZlcnNpb24iOiAiMTYuMS4wIn0sICJwYW5kYXNfdmVyc2lvbiI6ICIyLjEuNCJ9AAAAAAIAAABEAAAABAAAANT///8AAAEFEAAAABwAAAAEAAAAAAAAAAQAAABuYW1lAAAAAAQABAAEAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAECEAAAABwAAAAEAAAAAAAAAAIAAABpZAAACAAMAAgABwAIAAAAAAAAAUAAAAAAAAAA
message schema {
  OPTIONAL INT64 id;
  OPTIONAL BYTE_ARRAY name (STRING);
}


num of row groups: 1
row groups:

row group 0:
--------------------------------------------------------------------------------
total byte size: 180
num of rows: 1

num of columns: 2
columns:

column 0:
--------------------------------------------------------------------------------
column type: INT64
column path: "id"
encodings: PLAIN RLE RLE_DICTIONARY
file path: N/A
file offset: 98
num of values: 1
compression: LZ4_RAW
total compressed size (in bytes): 94
total uncompressed size (in bytes): 92
data page offset: 27
index page offset: N/A
dictionary page offset: 4
statistics: {min: 1, max: 1, distinct_count: N/A, null_count: 0, min_max_deprecated: false}
bloom filter offset: N/A
offset index offset: N/A
offset index length: N/A
column index offset: N/A
column index length: N/A


column 1:
--------------------------------------------------------------------------------
column type: BYTE_ARRAY
column path: "name"
encodings: PLAIN RLE RLE_DICTIONARY
file path: N/A
file offset: 281
num of values: 1
compression: LZ4_RAW
total compressed size (in bytes): 91
total uncompressed size (in bytes): 88
data page offset: 222
index page offset: N/A
dictionary page offset: 190
statistics: {min: [80, 101, 116, 101, 114, 32, 80, 97, 114, 107, 101, 114], max: [80, 101, 116, 101, 114, 32, 80, 97, 114, 107, 101, 114], distinct_count: N/A, null_count: 0, min_max_deprecated: false}
bloom filter offset: N/A
offset index offset: N/A
offset index length: N/A
column index offset: N/A
column index length: N/A

hangxie commented Jul 6, 2024

Would you please help to try out #591?


byteplus-rec commented Sep 6, 2024

> Would you please help to try out #591?

Hello, I have encountered the same issue. I am currently using version 1.5.3. Upgrading to the latest version solves the problem, but it introduces some dependency conflicts in my service. Is it possible to support the new compressor in older versions of parquet-go? Alternatively, would it be feasible to open up compressor customization so that users can implement their own compressors?
