array.Binary and array.String should use int64 offsets #195

Closed
tosinva-stripe opened this issue Nov 21, 2024 · 4 comments · Fixed by #197
Labels
Type: bug Something isn't working

Comments

@tosinva-stripe

Describe the bug, including details regarding any error messages, version, and platform.

LargeBinary and LargeString use int64 offsets; however, the Binary and String types use int32 offsets, which makes them susceptible to slice-index-out-of-bounds errors when the column/array is larger than ~2 GB (2^31 bytes).

To reproduce, try deserializing a Parquet file that is larger than 2.2 GB.

A workaround is to force the Go library to deserialize the field/column as LargeBinary instead of Binary.
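A minimal sketch of one such workaround, assuming you also control the writing side: declare the field as LargeBinary in an explicit Arrow schema and store that schema in the file, which `pqarrow.ReadTable` then honors on the read side. The field and file names below are placeholders.

```go
package main

import (
	"os"

	"github.com/apache/arrow/go/v17/arrow"
	"github.com/apache/arrow/go/v17/parquet"
	"github.com/apache/arrow/go/v17/parquet/pqarrow"
)

func main() {
	// Declare the column as LargeBinary (int64 offsets) up front.
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "payload", Type: arrow.BinaryTypes.LargeBinary},
	}, nil)

	f, err := os.Create("data.parquet")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// WithStoreSchema embeds the Arrow schema in the file metadata,
	// so pqarrow.ReadTable reproduces LargeBinary instead of Binary.
	w, err := pqarrow.NewFileWriter(schema, f,
		parquet.NewWriterProperties(),
		pqarrow.NewArrowWriterProperties(pqarrow.WithStoreSchema()))
	if err != nil {
		panic(err)
	}
	defer w.Close()
	// ... build records and call w.Write(rec) ...
}
```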

The error looks like:

```
panic: runtime error: slice bounds out of range [:-2147483014]

goroutine 95 [running]:
github.com/apache/arrow/go/v17/arrow/array.(*Binary).Value(...)
	/go/pkg/mod/github.com/apache/arrow/go/v17@v17.0.0/arrow/array/binary.go:59
github.com/apache/arrow/go/v17/arrow/array.(*Binary).ValueStr(0xc000178d20?, 0xc091402a00?)
	/go/pkg/mod/github.com/apache/arrow/go/v17@v17.0.0/arrow/array/binary.go:67 +0xfa
extractorvalidator/data.BootstrapRecordsFromParquet({0x1de1a40, 0xcc6a9775f0}, 0x0)
	/.../data/records.go:78 +0x582
main.validationWorker({0x1dccd90, 0x2c31840}, 0x0?, {0x0?}, 0xc0000315e0, 0xc000001de0, 0xc0000fe9c0)
	/.../command.go:428 +0x125
created by main.RunValidateCmd in goroutine 1
	/.../command.go:174 +0xb90
```

Version and platform

Arrow Version: github.com/apache/arrow/go/v17 v17.0.0
Platform: Linux 20.04.1-Ubuntu  x86_64 x86_64 x86_64 GNU/Linux

Component(s)

Parquet, Other

@zeroshade
Member

There are a couple of ways I can think of to address this, depending on your use case:

  1. We could add an option to the ArrowReadProperties to explicitly always use LargeString/LargeBinary for strings (eliminating the need for the workaround). This requires the user to know ahead of time that they need LargeString/LargeBinary, which may not be feasible or the best route.
  2. Are you using ReadTable, or are you streaming the records? We could check the size of the column data ahead of time and force a split into multiple records based on the column size, which would avoid this problem. That would require a bit of extra up-front work for the checking, but it allows seamless record reading without the user needing to know anything ahead of time. (The sketch after this comment contrasts the two read paths.)

Thoughts?
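For context, a minimal sketch of the two read paths in question, assuming a local file and a placeholder file name; ReadTable materializes the whole file as one table, while GetRecordReader streams record batches of at most BatchSize rows:

```go
package main

import (
	"context"
	"os"

	"github.com/apache/arrow/go/v17/arrow/memory"
	"github.com/apache/arrow/go/v17/parquet/file"
	"github.com/apache/arrow/go/v17/parquet/pqarrow"
)

func main() {
	f, err := os.Open("data.parquet")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	rdr, err := file.NewParquetReader(f)
	if err != nil {
		panic(err)
	}
	defer rdr.Close()

	props := pqarrow.ArrowReadProperties{BatchSize: 64 * 1024, Parallel: true}
	fr, err := pqarrow.NewFileReader(rdr, props, memory.DefaultAllocator)
	if err != nil {
		panic(err)
	}

	// Streaming path: nil, nil selects all columns and all row groups.
	rr, err := fr.GetRecordReader(context.Background(), nil, nil)
	if err != nil {
		panic(err)
	}
	defer rr.Release()
	for rr.Next() {
		rec := rr.Record()
		_ = rec // process one batch at a time
	}

	// Alternatively, materialize everything at once:
	// tbl, err := fr.ReadTable(context.Background())
}
```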

@tosinva-stripe
Author

tosinva-stripe commented Nov 21, 2024

Thanks for the response @zeroshade. I am using arrow/go/v17's pqarrow.ReadTable to read the records from Parquet files.

Both approaches seem reasonable; however, wouldn't using int64 offsets instead of int32 be the simplest approach?

@zeroshade
Member

I don't want to change the current default, as users may be relying on the existing behavior of defaulting to String and Binary rather than their Large variants.

Since both approaches are reasonable, I think I'll first simply add the option to force LargeString/LargeBinary, as that is much easier to do. I'll hold off on the automatic splitting until someone specifically requests it. I'll try to get to this in the next week or two.

kou changed the title from "[GO] array.Binary and array.String should use int64 offsets." to "array.Binary and array.String should use int64 offsets" on Nov 22, 2024
@tosinva-stripe
Author

Okay, this sounds good to me; thanks!

zeroshade added a commit that referenced this issue Dec 9, 2024
### Rationale for this change
closes #195

For Parquet files that contain more than 2 GB of data in a column, we
should allow a user to force using the LargeString/LargeBinary variants
without requiring a stored schema.

### What changes are included in this PR?
Adds `ForceLarge` option to `pqarrow.ArrowReadProperties` and enables it
to force usage of `LargeString` and `LargeBinary` data types.

### Are these changes tested?
Yes, a unit test is added.

### Are there any user-facing changes?
No breaking changes, only the addition of a new option.
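
A hedged sketch of how the new option might be used. The per-column SetForceLarge setter (modeled on the existing SetReadDict API) and the module path are assumptions, not details confirmed in this thread:

```go
package main

import (
	"context"
	"os"

	"github.com/apache/arrow-go/v18/arrow/memory"
	"github.com/apache/arrow-go/v18/parquet/file"
	"github.com/apache/arrow-go/v18/parquet/pqarrow"
)

func main() {
	f, err := os.Open("data.parquet")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	rdr, err := file.NewParquetReader(f)
	if err != nil {
		panic(err)
	}
	defer rdr.Close()

	props := pqarrow.ArrowReadProperties{}
	// Assumed per-column setter, modeled on SetReadDict: force column 0
	// to decode as LargeBinary/LargeString (int64 offsets).
	props.SetForceLarge(0, true)

	fr, err := pqarrow.NewFileReader(rdr, props, memory.DefaultAllocator)
	if err != nil {
		panic(err)
	}
	tbl, err := fr.ReadTable(context.Background())
	if err != nil {
		panic(err)
	}
	defer tbl.Release()
}
```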