array.Binary and array.String should use int64 offsets #195

Closed
tosinva-stripe opened this issue Nov 21, 2024 · 4 comments · Fixed by #197
Labels
Type: bug Something isn't working

Comments

@tosinva-stripe

Describe the bug, including details regarding any error messages, version, and platform.

LargeBinary and LargeString use int64 offsets; however, the Binary and String types use int32 offsets, which makes them susceptible to slice-index-out-of-bounds errors when the column/array is larger than ~2 GB (2^31 bytes).

To reproduce, try deserializing a Parquet file that is larger than 2.2 GB.

A workaround is to force the Go library to deserialize the field/column as LargeBinary instead of Binary.
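A minimal sketch of one such workaround, assuming you also control the writing side: declare the field as LargeBinary in an explicit Arrow schema and store that schema in the file, which `pqarrow.ReadTable` then honors on the read side. The field and file names below are placeholders.

```go
package main

import (
	"os"

	"github.com/apache/arrow/go/v17/arrow"
	"github.com/apache/arrow/go/v17/parquet"
	"github.com/apache/arrow/go/v17/parquet/pqarrow"
)

func main() {
	// Declare the column as LargeBinary (int64 offsets) up front.
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "payload", Type: arrow.BinaryTypes.LargeBinary},
	}, nil)

	f, err := os.Create("data.parquet")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// WithStoreSchema embeds the Arrow schema in the file metadata,
	// so pqarrow.ReadTable reproduces LargeBinary instead of Binary.
	w, err := pqarrow.NewFileWriter(schema, f,
		parquet.NewWriterProperties(),
		pqarrow.NewArrowWriterProperties(pqarrow.WithStoreSchema()))
	if err != nil {
		panic(err)
	}
	defer w.Close()
	// ... build records and call w.Write(rec) ...
}
```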

The error looks like:

```
panic: runtime error: slice bounds out of range [:-2147483014]

goroutine 95 [running]:
github.com/apache/arrow/go/v17/arrow/array.(*Binary).Value(...)
	/go/pkg/mod/github.com/apache/arrow/go/v17@v17.0.0/arrow/array/binary.go:59
github.com/apache/arrow/go/v17/arrow/array.(*Binary).ValueStr(0xc000178d20?, 0xc091402a00?)
	/go/pkg/mod/github.com/apache/arrow/go/v17@v17.0.0/arrow/array/binary.go:67 +0xfa
extractorvalidator/data.BootstrapRecordsFromParquet({0x1de1a40, 0xcc6a9775f0}, 0x0)
	/.../data/records.go:78 +0x582
main.validationWorker({0x1dccd90, 0x2c31840}, 0x0?, {0x0?}, 0xc0000315e0, 0xc000001de0, 0xc0000fe9c0)
	/.../command.go:428 +0x125
created by main.RunValidateCmd in goroutine 1
	/.../command.go:174 +0xb90
```

Version and platform

Arrow Version: github.com/apache/arrow/go/v17 v17.0.0
Platform: Linux 20.04.1-Ubuntu  x86_64 x86_64 x86_64 GNU/Linux

Component(s)

Parquet, Other

@zeroshade
Member

There are a couple of ways I can think of to address this, depending on your use case:

  1. We could add an option to the ArrowReadProperties to explicitly always use LargeString/LargeBinary for strings (eliminating the need for the workaround). This requires the user to know ahead of time that they need LargeString/LargeBinary, which may not be feasible or the best route.
  2. Are you using ReadTable, or are you streaming the records? We could check the size of the column data ahead of time and force a split into multiple records based on the column size, which would avoid this problem. That would require a bit of extra up-front work for the checking, but it allows seamless record reading without the user needing to know anything ahead of time. (The sketch after this comment contrasts the two read paths.)

Thoughts?
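For context, a minimal sketch of the two read paths in question, assuming a local file and a placeholder file name; ReadTable materializes the whole file as one table, while GetRecordReader streams record batches of at most BatchSize rows:

```go
package main

import (
	"context"
	"os"

	"github.com/apache/arrow/go/v17/arrow/memory"
	"github.com/apache/arrow/go/v17/parquet/file"
	"github.com/apache/arrow/go/v17/parquet/pqarrow"
)

func main() {
	f, err := os.Open("data.parquet")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	rdr, err := file.NewParquetReader(f)
	if err != nil {
		panic(err)
	}
	defer rdr.Close()

	props := pqarrow.ArrowReadProperties{BatchSize: 64 * 1024, Parallel: true}
	fr, err := pqarrow.NewFileReader(rdr, props, memory.DefaultAllocator)
	if err != nil {
		panic(err)
	}

	// Streaming path: nil, nil selects all columns and all row groups.
	rr, err := fr.GetRecordReader(context.Background(), nil, nil)
	if err != nil {
		panic(err)
	}
	defer rr.Release()
	for rr.Next() {
		rec := rr.Record()
		_ = rec // process one batch at a time
	}

	// Alternatively, materialize everything at once:
	// tbl, err := fr.ReadTable(context.Background())
}
```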

@tosinva-stripe
Author

tosinva-stripe commented Nov 21, 2024

Thanks for the response @zeroshade. I am using arrow/go/v17's pqarrow.ReadTable to read the records from Parquet files.

Both approaches seem reasonable; however, wouldn't using int64 offsets instead of int32 be the simplest approach?

@zeroshade
Member

I don't want to change the current default, as users may be relying on the existing behavior of defaulting to String and Binary rather than their Large variants.

Since both approaches are reasonable, I think I'll first simply add the option to force LargeString/LargeBinary, as that is much easier to do. I'll hold off on the automatic splitting until someone specifically requests it. I'll try to get to this in the next week or two.

kou changed the title from "[GO] array.Binary and array.String should use int64 offsets." to "array.Binary and array.String should use int64 offsets" on Nov 22, 2024
@tosinva-stripe
Author

Okay, this sounds good to me; thanks!

zeroshade added a commit that referenced this issue Dec 9, 2024
### Rationale for this change
closes #195

For Parquet files that contain more than 2 GB of data in a column, we
should allow a user to force using the LargeString/LargeBinary variants
without requiring a stored schema.

### What changes are included in this PR?
Adds `ForceLarge` option to `pqarrow.ArrowReadProperties` and enables it
to force usage of `LargeString` and `LargeBinary` data types.

### Are these changes tested?
Yes, a unit test is added.

### Are there any user-facing changes?
No breaking changes, only the addition of a new option.
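
A hedged sketch of how the new option might be used. The per-column SetForceLarge setter (modeled on the existing SetReadDict API) and the module path are assumptions, not details confirmed in this thread:

```go
package main

import (
	"context"
	"os"

	"github.com/apache/arrow-go/v18/arrow/memory"
	"github.com/apache/arrow-go/v18/parquet/file"
	"github.com/apache/arrow-go/v18/parquet/pqarrow"
)

func main() {
	f, err := os.Open("data.parquet")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	rdr, err := file.NewParquetReader(f)
	if err != nil {
		panic(err)
	}
	defer rdr.Close()

	props := pqarrow.ArrowReadProperties{}
	// Assumed per-column setter, modeled on SetReadDict: force column 0
	// to decode as LargeBinary/LargeString (int64 offsets).
	props.SetForceLarge(0, true)

	fr, err := pqarrow.NewFileReader(rdr, props, memory.DefaultAllocator)
	if err != nil {
		panic(err)
	}
	tbl, err := fr.ReadTable(context.Background())
	if err != nil {
		panic(err)
	}
	defer tbl.Release()
}
```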