Skip to content

Conversation

jecsand838
Copy link
Contributor

@jecsand838 jecsand838 commented Sep 1, 2025

Which issue does this PR close?

Rationale for this change

Apache Avro’s decimal logical type annotates either bytes or fixed and carries precision and scale. Implementations should reject invalid combinations such as scale > precision, and the underlying bytes are the two’s‑complement big‑endian representation of the unscaled integer. On the Arrow side, Rust now exposes first‑class Decimal32, Decimal64, Decimal128, and Decimal256 data types with documented maximum precisions (9, 18, 38, 76 respectively). Until now, arrow-avro decoded all Avro decimals to 128/256‑bit Arrow decimals, even when a narrower type would suffice.

What changes are included in this PR?

arrow-avro/src/codec.rs

  • Map Codec::Decimal(precision, scale, _size) to Arrow’s Decimal32/64/128/256 by precision, preferring the narrowest type (≤9→32, ≤18→64, ≤38→128, otherwise 256).
  • Strengthen decimal attribute parsing:
    • Error if scale > precision.
    • Error if precision exceeds Arrow’s maximum (Decimal256).
    • If Avro uses fixed, check that declared precision fits the byte width (≤4→max 9, ≤8→18, ≤16→38, ≤32→76).
  • Update docstring of Codec::Decimal to mention Decimal32/64.

arrow-avro/src/reader/record.rs

  • Add Decoder::Decimal32 and Decoder::Decimal64 variants with corresponding builders (Decimal32Builder, Decimal64Builder).

  • Builder selection:

    • If Avro uses fixed: choose by size (≤4→Decimal32, ≤8→Decimal64, ≤16→Decimal128, ≤32→Decimal256).
    • If Avro uses bytes: choose by declared precision (≤9/≤18/≤38/≤76).
  • Implement decode paths that sign‑extend Avro’s two’s‑complement payload to 4/8 bytes and append values to the new builders; update append_null/flush for 32/64‑bit decimals.

arrow-avro/src/reader/mod.rs (tests)

  • Expand test_decimal to assert that:

    • bytes‑backed decimals with precision 4 map to Decimal32; precision 10 map to Decimal64;
    • legacy fixed[8] decimals map to Decimal64;
    • fixed[16] decimals map to Decimal128.
  • Add a nulls path test for bytes‑backed Decimal32.

Are these changes tested?

Yes. Unit tests under arrow-avro/src/reader/mod.rs construct expected Decimal32Array/Decimal64Array/Decimal128Array with with_precision_and_scale, and compare against batches decoded from Avro files (including legacy fixed and bytes‑backed cases). The tests also exercise small batch sizes to cover buffering paths; a new Avro data file is added for higher‑width decimals.

New Avro test file details:

  • test/data/int256_decimal.avro # bytes logicalType: decimal(precision=76, scale=10)
  • test/data/fixed256_decimal.avro # fixed[32] logicalType: decimal(precision=76, scale=10)
  • test/data/fixed_length_decimal_legacy_32.avro # fixed[4] logicalType: decimal(precision=9, scale=2)
  • test/data/int128_decimal.avro # bytes logicalType: decimal(precision=38, scale=2)

These new Avro test files were created using this script: https://gist.github.com/jecsand838/3890349bdb33082a3e8fdcae3257eef7

There is also an arrow-testing PR for these new files: apache/arrow-testing#112

Are there any user-facing changes?

N/A due to arrow-avro not being public.

- Expanded Decoder to handle Decimal32 and Decimal64 types alongside existing Decimal128 and Decimal256 support.
- Added precision and fixed size validation for Decimal32 and Decimal64 to ensure Arrow compatibility.
- Updated flush, decode, and append_null methods in the Decoder to process these new decimal types correctly.
- Refined test suite with new test cases to validate Decimal32/Decimal64 decoding and integration logic.
- Introduced thorough coverage across data-backed, fixed-sized, and precision-based decimal paths.
@github-actions github-actions bot added arrow Changes to the arrow crate arrow-avro arrow-avro crate labels Sep 1, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jecsand838

Can we please add a description of the new binary files added?

Also, one thing I worry about is that Decimal32 and Decimal64 are relatively new types (thanks to @CurtHagenlocher 🙏 ) but their support across the various arrow kernels now is not very large.

Thus if you have the avro reader return Decimal32/Decimal64 arrays, it might be hard to use them in the rest of the arrow-rs world until we complete the rest of the support

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How were these files created? Maybe we could add a README explaining what is in them and how they were made?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can do that. I have an arrow-testing PR outstanding for these files and the script used to create them saved as a gist: apache/arrow-testing#112
https://gist.github.com/jecsand838/3890349bdb33082a3e8fdcae3257eef7

Also, one thing I worry about is that Decimal32 and Decimal64 are relatively new types (thanks to @CurtHagenlocher 🙏 ) but their support across the various arrow kernels now is not very large.

Ah, that makes sense. Do you think it's worth feature flagging the Decimal32 and Decimal64 support for the time being?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate arrow-avro arrow-avro crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants