fix: cast numbers to float64 in descriptive statistics to avoid integer summation overflow errors #3527
Conversation
@ikrommyd - Thanks, but I don't think we should do this. I'd rather leave it to the user who knows their data better. Firstly, binary floats cannot represent decimal fractional values accurately, except for those which are an integer divided by a power of 2. Secondly, this implementation would trigger device to host data copies on GPU.
Why would it trigger copies to host? It's using
Regarding the binary float representation, I get that, but numpy is doing it anyway. I'm fine with leaving it as is, but I think the point was also to mimic numpy behavior here when these functions were implemented, especially since we're doing a NEP 18 dispatch to such functions.
@ianna I defer to your judgement on this one, but my two cents are that we should probably follow NumPy here, given that it doesn't seem particularly sensible to quantise
My whole point here is that even a simple mean calculation
@agoose77 mean, var, std are not quantized as integers in any case; you get a float back either way. The whole point is how you sum the elements of the array. For a mean calculation, if you choose to sum all the elements of the array as integers, you get a wrong mean when the values are large enough to overflow the integer type. That's why numpy chooses to cast to float64 before doing the summation.
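The overflow described above is easy to reproduce with plain NumPy (an illustrative sketch with made-up values, not the data from the issue):

```python
import numpy as np

data = np.full(4, 2**62, dtype=np.int64)

# An int64 reduction wraps around silently: 4 * 2**62 == 2**64 == 0 (mod 2**64).
print(data.sum())              # 0
print(data.sum() / len(data))  # 0.0 -- a completely wrong "mean"

# np.mean casts to float64 before summing, so it keeps the right magnitude:
print(np.mean(data))           # 4.611686018427388e+18
```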
Thanks, @agoose77
@ianna I do take your "copying to host memory" point seriously, but I don't think it actually happens. I can't find anything explicit in the documentation, though; do you have a source? I was under the impression that you can do dtype casting on the GPU.
My final argument here is that, indeed, we will not get exactly the same float64 output as numpy, since we're not using exactly the same algorithm. But my opinion is that I'd rather give the user a slightly different answer due to floating point precision, like this:

```python
In [12]: data = np.loadtxt("/Users/iason/Downloads/data.csv", delimiter=",", dtype=np.float64)

In [13]: np.var(data)
Out[13]: np.float64(2.1019389186985453e+17)

In [14]: ak.var(data)
Out[14]: np.float64(2.101938918697719e+17)
```

instead of a completely wrong value like this:

```python
In [9]: data = np.loadtxt("/Users/iason/Downloads/data.csv", delimiter=",", dtype=np.int64)

In [10]: np.var(data)
Out[10]: np.float64(2.1019389186985453e+17)

In [11]: ak.var(data)
Out[11]: np.float64(-5.237074053972603e+16)
```

In this case, the integer summation overflows. With this PR, these are the differences we observe for the example that the user provided:

```python
In [5]: data = np.loadtxt("/Users/iason/Downloads/data.csv", delimiter=",", dtype=np.int64)

In [6]: ak.var(data)
Out[6]: np.float64(2.101938918697719e+17)

In [7]: np.var(data)
Out[7]: np.float64(2.1019389186985446e+17)

In [8]: ak.std(data)
Out[8]: np.float64(458469074.0603688)

In [9]: np.std(data)
Out[9]: np.float64(458469074.0604588)
```

I think I prefer this. Oh, and if you expect the user to convert to float64 themselves, they will still see the difference:

```python
In [4]: ak.var(data)
Out[4]: np.float64(2.101938918697719e+17)

In [5]: np.var(data)
Out[5]: np.float64(2.1019389186985453e+17)
```

I'm just suggesting we do the conversion for them.
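The negative variance in the int64 case comes from the same wraparound mechanism. A sketch with synthetic values (the user's `data.csv` is not reproduced here) shows how an integer sum of squares can wrap:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(10**9, 2 * 10**9, size=1_000_000)  # int64 values

# Each squared deviation fits in int64, but their running sum (~1e23)
# blows far past the int64 maximum (~9.2e18) and wraps around.
mean_int = data.sum() // len(data)
ss_int = ((data - mean_int) ** 2).sum()  # wrapped, possibly negative
naive_var = ss_int / len(data)           # completely wrong magnitude

correct_var = np.var(data)               # numpy casts to float64 first
print(naive_var, correct_var)
```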
I'm not sure why this test fails:

```python
# Imports assumed from the surrounding test module:
import awkward as ak
import cupy.testing as cpt
import numpy as np

to_list = ak.to_list


def test_block_boundary_prod_complex13():
    rng = np.random.default_rng(seed=42)
    array = rng.integers(50, size=1000)
    complex_array = np.vectorize(complex)(
        array[0 : len(array) : 2], array[1 : len(array) : 2]
    )
    content = ak.contents.NumpyArray(complex_array)
    cuda_content = ak.to_backend(content, "cuda", highlevel=False)
    cpt.assert_allclose(
        ak.prod(cuda_content, -1, highlevel=False),
        ak.prod(content, -1, highlevel=False),
    )
    offsets = ak.index.Index64(np.array([0, 5, 996, 1000], dtype=np.int64))
    depth1 = ak.contents.ListOffsetArray(offsets, content)
    cuda_depth1 = ak.to_backend(depth1, "cuda", highlevel=False)
    cpt.assert_allclose(
        to_list(ak.prod(cuda_depth1, -1, highlevel=False)),
        to_list(ak.prod(depth1, -1, highlevel=False)),
    )
    del cuda_content, cuda_depth1
```
There might be a race condition or something in the kernels. I've seen it fail a couple of times.
Talked with @jpivarski briefly about this at SciPy, too. The small differences in the summations come from the fact that awkward doesn't implement Kahan summation. It was concluded from that chat, however, that it's better to give the user "slightly different values" than numpy (due to a different summation algorithm) and follow numpy's rule of casting everything to float64 in such functions, than to potentially give back extremely wrong values due to overflow errors without the user knowing.
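For reference, Kahan (compensated) summation carries an explicit error term alongside the running total. A minimal sketch, not awkward's or numpy's actual reducer code (numpy uses pairwise summation, a different error-reducing scheme):

```python
def kahan_sum(values):
    """Compensated summation: track the rounding error of each addition."""
    total = 0.0
    comp = 0.0                  # running compensation for lost low-order bits
    for v in values:
        y = v - comp            # subtract the error carried from the last step
        t = total + y           # low-order bits of y may be lost here...
        comp = (t - total) - y  # ...but this recovers (minus) what was lost
        total = t
    return total


vals = [0.1] * 1000
print(sum(vals))        # naive left-to-right sum accumulates rounding error
print(kahan_sum(vals))  # much closer to 100
```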
@ikrommyd - thanks for looking into it. Could you, please, add tests for the cases where the operations would fail without the cast-to-float64 change? Please follow the naming convention of the tests: `test_3527_<what_is_tested>.py`. Thanks!
Absolutely! We can generate an array in the test that looks like the one the user used to report the original issue. Will do.
@ianna tests are added, btw; all those assertions fail without this change.
Fixes #3525
As explained in the issue, we should cast ints and bools to float64 to avoid integer summation overflow errors in descriptive statistics, like numpy does. The tests pass, but I'm not 100% sure that I'm not missing an edge case that `ak.values_astype` doesn't cover.