fix: cast numbers to float64 in descriptive statistics to avoid integer summation overflow errors #3527
Conversation
@ikrommyd - Thanks, but I don't think we should do this. I'd rather leave it to the user who knows their data better. Firstly, binary floats cannot represent decimal fractional values accurately, except for those which are an integer divided by a power of 2. Secondly, this implementation would trigger device to host data copies on GPU.
Why would it trigger copies to host? It's using
Regarding the binary float representation, I get that, but numpy is doing it anyway. I'm fine with leaving it as is, but I think the point was also to mimic numpy behavior here when these functions were implemented, especially since we're doing a NEP 18 dispatch to such functions.
@ianna I defer to your judgement on this one, but my two cents are that we should probably follow NumPy here, given that it doesn't seem particularly sensible to quantise
My whole point here is that even a simple mean calculation
@agoose77 mean, var, std are not quantized as integers in any case; you get a float back either way. The whole point is how you sum the elements of the array. For a mean calculation, if you choose to sum all the elements of the array as integers, you get a wrong mean when the values are large enough to overflow the integer type. That's why numpy chooses to cast to float64 before doing the summation.
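The overflow described above is easy to reproduce with plain NumPy (an illustrative sketch with made-up values, not the data from the issue):

```python
import numpy as np

data = np.full(4, 2**62, dtype=np.int64)

# An int64 reduction wraps around silently: 4 * 2**62 == 2**64 == 0 (mod 2**64).
print(data.sum())              # 0
print(data.sum() / len(data))  # 0.0 -- a completely wrong "mean"

# np.mean casts to float64 before summing, so it keeps the right magnitude:
print(np.mean(data))           # 4.611686018427388e+18
```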
Thanks, @agoose77
@ianna I do take your "copying to host memory" point seriously, but I don't think it actually happens. I can't find anything explicit in the documentation, though; do you have a source? I was under the impression that you can do dtype casting on the GPU.
My final argument here is that, indeed, we will not get exactly the same float64 output as numpy, since we're not using exactly the same algorithm. But my opinion is that I'd rather give the user a slightly different answer due to floating point precision, like this:

```python
In [12]: data = np.loadtxt("/Users/iason/Downloads/data.csv", delimiter=",", dtype=np.float64)

In [13]: np.var(data)
Out[13]: np.float64(2.1019389186985453e+17)

In [14]: ak.var(data)
Out[14]: np.float64(2.101938918697719e+17)
```

instead of a completely wrong value like this:

```python
In [9]: data = np.loadtxt("/Users/iason/Downloads/data.csv", delimiter=",", dtype=np.int64)

In [10]: np.var(data)
Out[10]: np.float64(2.1019389186985453e+17)

In [11]: ak.var(data)
Out[11]: np.float64(-5.237074053972603e+16)
```

In this case, the integer summation overflows. With this PR, these are the differences we observe for the example that the user provided:

```python
In [5]: data = np.loadtxt("/Users/iason/Downloads/data.csv", delimiter=",", dtype=np.int64)

In [6]: ak.var(data)
Out[6]: np.float64(2.101938918697719e+17)

In [7]: np.var(data)
Out[7]: np.float64(2.1019389186985446e+17)

In [8]: ak.std(data)
Out[8]: np.float64(458469074.0603688)

In [9]: np.std(data)
Out[9]: np.float64(458469074.0604588)
```

I think I prefer this. Oh, and if you expect the user to convert to float64 themselves, they will still see the difference:

```python
In [4]: ak.var(data)
Out[4]: np.float64(2.101938918697719e+17)

In [5]: np.var(data)
Out[5]: np.float64(2.1019389186985453e+17)
```

I'm just suggesting we do the conversion for them.
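The negative variance in the int64 case comes from the same wraparound mechanism. A sketch with synthetic values (the user's `data.csv` is not reproduced here) shows how an integer sum of squares can wrap:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(10**9, 2 * 10**9, size=1_000_000)  # int64 values

# Each squared deviation fits in int64, but their running sum (~1e23)
# blows far past the int64 maximum (~9.2e18) and wraps around.
mean_int = data.sum() // len(data)
ss_int = ((data - mean_int) ** 2).sum()  # wrapped, possibly negative
naive_var = ss_int / len(data)           # completely wrong magnitude

correct_var = np.var(data)               # numpy casts to float64 first
print(naive_var, correct_var)
```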
I'm not sure why this test fails:

```python
# Imports assumed from the surrounding test module:
import awkward as ak
import cupy.testing as cpt
import numpy as np

to_list = ak.to_list


def test_block_boundary_prod_complex13():
    rng = np.random.default_rng(seed=42)
    array = rng.integers(50, size=1000)
    complex_array = np.vectorize(complex)(
        array[0 : len(array) : 2], array[1 : len(array) : 2]
    )
    content = ak.contents.NumpyArray(complex_array)
    cuda_content = ak.to_backend(content, "cuda", highlevel=False)
    cpt.assert_allclose(
        ak.prod(cuda_content, -1, highlevel=False),
        ak.prod(content, -1, highlevel=False),
    )
    offsets = ak.index.Index64(np.array([0, 5, 996, 1000], dtype=np.int64))
    depth1 = ak.contents.ListOffsetArray(offsets, content)
    cuda_depth1 = ak.to_backend(depth1, "cuda", highlevel=False)
    cpt.assert_allclose(
        to_list(ak.prod(cuda_depth1, -1, highlevel=False)),
        to_list(ak.prod(depth1, -1, highlevel=False)),
    )
    del cuda_content, cuda_depth1
```
There might be a race condition or something in the kernels. I've seen it fail a couple of times.
Talked with @jpivarski briefly about this at SciPy, too. The small differences in the summations come from the fact that awkward doesn't implement Kahan summation. It was concluded from that chat, however, that it's better to give the user "slightly different values" than numpy (due to a different summation algorithm) and follow numpy's rule of casting everything to float64 in such functions, than to potentially give back extremely wrong values due to overflow errors without the user knowing.
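For reference, Kahan (compensated) summation carries an explicit error term alongside the running total. A minimal sketch, not awkward's or numpy's actual reducer code (numpy uses pairwise summation, a different error-reducing scheme):

```python
def kahan_sum(values):
    """Compensated summation: track the rounding error of each addition."""
    total = 0.0
    comp = 0.0                  # running compensation for lost low-order bits
    for v in values:
        y = v - comp            # subtract the error carried from the last step
        t = total + y           # low-order bits of y may be lost here...
        comp = (t - total) - y  # ...but this recovers (minus) what was lost
        total = t
    return total


vals = [0.1] * 1000
print(sum(vals))        # naive left-to-right sum accumulates rounding error
print(kahan_sum(vals))  # much closer to 100
```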
@ikrommyd - thanks for looking into it. Could you, please, add tests for the cases where the operations would fail without the cast-to-float64 change? Please follow the naming convention of the tests: `test_3527_<what_is_tested>.py`. Thanks!
Absolutely! We can generate an array in the test that looks like the one the user used to report the original issue. Will do.
@ianna tests are added, btw; all those assertions fail without this change.
Fixes #3525
As explained in the issue, we should cast ints and bools to float64 to avoid integer summation overflow errors in descriptive statistics, like numpy does. The tests pass, but I'm not 100% sure that I'm not missing an edge case that `ak.values_astype` doesn't cover.