
Conversation

@ikrommyd commented Jun 4, 2025

Fixes #3525

As explained in the issue, we should cast ints and bools to float64 to avoid integer-summation overflow in descriptive statistics, as numpy does. The tests pass, but I'm not 100% sure that I'm not missing an edge case that ak.values_astype doesn't cover.
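For illustration, a minimal sketch of the promotion being described, using the public ak.values_astype (the PR applies the equivalent cast inside the reducers themselves; the values below are made up to force an overflow):

import numpy as np
import awkward as ak

# three int64 values whose true sum exceeds the int64 range
array = ak.Array(np.array([2**62, 2**62, 2**62], dtype=np.int64))

promoted = ak.values_astype(array, np.float64)  # ints/bools -> float64
print(ak.sum(promoted))  # ~1.38e19, no int64 wrap-around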

@ianna left a comment

@ikrommyd - Thanks, but I don't think we should do this. I'd rather leave it to the user, who knows their data better. Firstly, binary floats cannot represent decimal fractional values accurately, except for those which are an integer divided by a power of 2. Secondly, this implementation would trigger device-to-host data copies on GPU.
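For illustration of the first point, in Python:

print(0.1 + 0.2)   # 0.30000000000000004 -- 1/10 has no finite binary expansion
print(0.5 + 0.25)  # 0.75 exactly -- both are integers divided by powers of 2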

@ikrommyd commented Jun 4, 2025

Why would it trigger copies to host? It's using self._backend.nplike.asarray(self._data, dtype=dtype) internally. I believe cupy.asarray does a device-side cast without copying to host.
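For what it's worth, a quick check one could run on a CUDA machine (assuming cupy is installed) to see that the cast stays on the device:

import cupy as cp

a = cp.arange(5, dtype=cp.int64)
b = cp.asarray(a, dtype=cp.float64)  # dtype cast performed on the GPU
print(b.device, b.dtype)             # still a device array, now float64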

Regarding the binary float representation, I get that, but numpy does it anyway.
Numpy explicitly does this for such functions: https://github.com/numpy/numpy/blob/ff1d6cc78322898c02339a529005eb358aeba327/numpy/_core/_methods.py#L160-L170
If we leave it to the user, we're expecting users to know when an integer summation is going to overflow and handle it themselves, which I don't think is going to happen. They will mostly get incorrect results without realizing it.
In this specific issue it produced a NaN, which was striking enough to notice. But you can just as easily get a valid-looking positive variance/std that is simply wrong, and users will trust it. For a mean calculation there isn't even an invalid value to raise suspicion. I think it's dangerous to emit incorrect numbers and expect users to figure it out.
I don't think a library should expect its users to think about integer overflow. It took me a good few minutes to understand why ak.var was giving an incorrect result.
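A paraphrase (not verbatim) of the dtype selection in the linked numpy code; the helper name here is made up:

import numpy as np

def accumulation_dtype(arr):
    # numpy's _mean/_var pick float64 for integer and boolean inputs
    # and leave floating-point inputs alone (simplified)
    if issubclass(arr.dtype.type, (np.integer, np.bool_)):
        return np.dtype("f8")
    return arr.dtype

print(accumulation_dtype(np.zeros(3, dtype=np.int32)))    # float64
print(accumulation_dtype(np.zeros(3, dtype=np.float32)))  # float32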

I'm fine with leaving it as is, but I think the point was also to mimic numpy behavior here when these functions were implemented, especially since we're doing a NEP 18 dispatch to such functions.
Per the project's README:
"Arrays are dynamically typed, but operations on them are compiled and fast. Their behavior coincides with NumPy when array dimensions are regular and generalizes when they're not."

@agoose77 commented Jun 4, 2025

@ianna I defer to your judgement on this one, but my two cents are that we probably should follow NumPy here, given that it doesn't seem particularly sensible to quantise std and var as integers. We naturally promote x / 2 to a floating-point array, so it would seem reasonable to do the same for var et al.

@ikrommyd commented Jun 4, 2025

My whole point here is that even a simple mean calculation, sum(values) / number_of_values, can give you a wrong number if sum(values) exceeds 2**63 - 1 (even worse for int32). The user will just trust that this is the right mean when it's not.
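A quick demonstration of that failure mode (values chosen to force the overflow):

import numpy as np

values = np.full(10, 2**62, dtype=np.int64)
print(values.sum())                  # -9223372036854775808: int64 wrap-around
print(values.sum(dtype=np.float64))  # 4.611686018427388e+19: correct
print(np.mean(values))               # 4.611686018427388e+18: correct, numpy accumulates in float64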

@ikrommyd commented Jun 4, 2025

@agoose77 mean, var, and std are not quantized as integers in any case; you get a float back either way. The whole point is how the elements of the array are summed. For a mean calculation, if you sum all the elements as integers, you get a wrong mean when the values are large enough to overflow the integer type. That's why numpy casts to float64 before doing the summation.

@ianna commented Jun 4, 2025

> @ianna I defer to your judgement on this one, but my two cents are that we probably should follow NumPy here, given that it doesn't seem particularly sensible to quantise std and var as integers. We naturally promote x / 2 to a floating-point array, so it would seem reasonable to do the same for var et al.

Thanks, @agoose77

@ikrommyd commented Jun 4, 2025

@ianna I do take your "copying to host memory" point seriously, but I don't think it actually happens. I can't find anything explicit in the documentation, though. Do you have a source? I was under the impression that you can do dtype casting on the GPU.

@ikrommyd commented Jun 12, 2025

My final argument here is that, indeed, we will not get exactly the same float64 output as numpy, since we're not using exactly the same algorithm. But in my opinion, I'd rather give the user a slightly different answer due to floating point precision, like this:

In [12]: data = np.loadtxt("/Users/iason/Downloads/data.csv", delimiter=",", dtype=np.float64)

In [13]: np.var(data)
Out[13]: np.float64(2.1019389186985453e+17)

In [14]: ak.var(data)
Out[14]: np.float64(2.101938918697719e+17)

instead of a completely wrong value like this:

In [9]: data = np.loadtxt("/Users/iason/Downloads/data.csv", delimiter=",", dtype=np.int64)

In [10]: np.var(data)
Out[10]: np.float64(2.1019389186985453e+17)

In [11]: ak.var(data)
Out[11]: np.float64(-5.237074053972603e+16)

In this case the wrong var is pretty striking because it's negative, but for a mean calculation, for example, the user is going to trust that the returned value is correct. I think that's more dangerous.

With this PR, these are the differences we observe for the example that the user provided:

In [5]: data = np.loadtxt("/Users/iason/Downloads/data.csv", delimiter=",", dtype=np.int64)

In [6]: ak.var(data)
Out[6]: np.float64(2.101938918697719e+17)

In [7]: np.var(data)
Out[7]: np.float64(2.1019389186985446e+17)

In [8]: ak.std(data)
Out[8]: np.float64(458469074.0603688)

In [9]: np.std(data)
Out[9]: np.float64(458469074.0604588)

I think I prefer 458469074.0603688 over a NaN.

Oh, and if you expect the user to convert to float64 themselves, they will still see the difference:

In [4]: ak.var(data)
Out[4]: np.float64(2.101938918697719e+17)

In [5]: np.var(data)
Out[5]: np.float64(2.1019389186985453e+17)

I'm just suggesting we do the conversion for them.

@ikrommyd commented

I'm not sure why test_block_boundary_prod_complex13 failed in the GPU tests. It has nothing to do with this PR:

import cupy.testing as cpt
import numpy as np

import awkward as ak

to_list = ak.operations.to_list


def test_block_boundary_prod_complex13():
    # build a complex array from interleaved halves of a random int array
    rng = np.random.default_rng(seed=42)
    array = rng.integers(50, size=1000)
    complex_array = np.vectorize(complex)(
        array[0 : len(array) : 2], array[1 : len(array) : 2]
    )
    content = ak.contents.NumpyArray(complex_array)
    cuda_content = ak.to_backend(content, "cuda", highlevel=False)
    # the CUDA reduction must agree with the CPU one
    cpt.assert_allclose(
        ak.prod(cuda_content, -1, highlevel=False),
        ak.prod(content, -1, highlevel=False),
    )

    # same check for a jagged layout whose list boundaries cross GPU blocks
    offsets = ak.index.Index64(np.array([0, 5, 996, 1000], dtype=np.int64))
    depth1 = ak.contents.ListOffsetArray(offsets, content)
    cuda_depth1 = ak.to_backend(depth1, "cuda", highlevel=False)
    cpt.assert_allclose(
        to_list(ak.prod(cuda_depth1, -1, highlevel=False)),
        to_list(ak.prod(depth1, -1, highlevel=False)),
    )
    del cuda_content, cuda_depth1

@ariostas commented

> I'm not sure why test_block_boundary_prod_complex13 failed in the GPU tests. It has nothing to do with this PR.

There might be a race condition or something in the kernels. I've seen it fail a couple of times.

@ikrommyd commented

Talked with @jpivarski briefly about this at SciPy, too. The small differences in the summations come from the fact that awkward doesn't implement Kahan summation. The conclusion from that chat, however, was that it's better to give the user slightly different values than numpy (due to the different summation algorithm) and follow numpy's rule of casting everything to float64 in such functions, than to potentially hand back extremely wrong values due to overflow without them knowing.
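For reference, a minimal sketch of Kahan (compensated) summation as mentioned above; this is illustrative, not awkward's or numpy's actual implementation:

def kahan_sum(values):
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for x in values:
        y = x - compensation            # re-inject previously lost bits
        t = total + y
        compensation = (t - total) - y  # what this addition just dropped
        total = t
    return total

values = [1e16] + [1.0] * 1000
print(sum(values))        # 1e+16: every small term is rounded away
print(kahan_sum(values))  # 1.0000000000001e+16: small terms recovered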

@ikrommyd requested a review from ianna on July 24, 2025
@ianna left a comment

@ikrommyd - thanks for looking into it. Could you please add tests for the cases where the operations would fail without the cast-to-float64 change? Please follow the naming convention for the tests: test_3527_<what_is_tested>.py. Thanks!

@ikrommyd commented Aug 3, 2025

> @ikrommyd - thanks for looking into it. Could you please add tests for the cases where the operations would fail without the cast-to-float64 change? Please follow the naming convention for the tests: test_3527_<what_is_tested>.py. Thanks!

Absolutely! We can generate an array in the test that looks like the one the user used to report the original issue. Will do.
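For instance, a sketch of what such a regression test might look like (hypothetical file name test_3527_no_integer_overflow_in_reducers.py; the array is synthesized rather than the user's actual data):

import numpy as np
import awkward as ak

def test_descriptive_statistics_do_not_overflow():
    # int64 values large enough that summing them (and their squares)
    # as integers overflows int64
    rng = np.random.default_rng(seed=42)
    data = rng.integers(2**61, 2**62, size=10_000, dtype=np.int64)
    array = ak.Array(data)

    # without the cast to float64, assertions like these fail
    assert np.isclose(ak.mean(array), np.mean(data))
    assert np.isclose(ak.var(array), np.var(data))
    assert np.isclose(ak.std(array), np.std(data))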

@ikrommyd commented Aug 4, 2025

@ianna the tests are added, by the way; all of those assertions fail on main. I can't test ak.linear_fit and ak.moment because they don't exist in numpy.

@ikrommyd requested a review from ianna on August 8, 2025
@ikrommyd marked this pull request as draft on August 21, 2025
Linked issue: ak.std returns nan wrongly when being applied on int array (#3525)