array.Binary and array.String should use int64 offsets #195
There are a couple of ways we could adjust for this that I can think of, depending on your use case:
Thoughts?
Thanks for the response @zeroshade, I am using arrow/go/v17. Both approaches seem reasonable; however, wouldn't using int64 offsets instead of int32 be the simplest approach?
I don't want to change the current default, as it's possible users are relying on the current behavior of defaulting to String and Binary instead of their Large variants. If both approaches are reasonable, then I think I'll first simply add the option to force LargeString/LargeBinary, as that is much easier to do. I'll wait on attempting the automatic splitting until someone specifically requests it. I'll try to get to this in the next week or two.
Okay, this sounds good to me; thanks!
### Rationale for this change

Closes #195. For Parquet files that contain more than 2 GB of data in a column, we should allow a user to force using the LargeString/LargeBinary variants without requiring a stored schema.

### What changes are included in this PR?

Adds a `ForceLarge` option to `pqarrow.ArrowReadProperties`; enabling it forces usage of the `LargeString` and `LargeBinary` data types.

### Are these changes tested?

Yes, a unit test is added.

### Are there any user-facing changes?

No breaking changes, only the addition of a new option.
Describe the bug, including details regarding any error messages, version, and platform.
LargeBinary and LargeString use int64 offsets, while Binary and String use int32 offsets. This makes Binary and String susceptible to slice-index-out-of-bounds errors when the column/array holds more than ~2 GB (2^31 bytes) of data.
To reproduce, try deserializing a Parquet file whose column data exceeds roughly 2.2 GB.
A workaround is to force the Go library to deserialize the field/column as LargeBinary instead of Binary by writing the file with the `store_schema` option (see https://arrow.apache.org/docs/cpp/parquet.html#roundtripping-arrow-types-and-schema).

Error looks like:
version and platform
Component(s)
Parquet, Other