Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Range semantics #56479

Open
LilithHafner opened this issue Nov 6, 2024 · 7 comments
Open

Range semantics #56479

LilithHafner opened this issue Nov 6, 2024 · 7 comments
Assignees
Labels
design Design of APIs or of the language itself needs decision A decision on this change is needed ranges Everything AbstractRange triage This should be discussed on a triage call

Comments

@LilithHafner
Copy link
Member

Theoretical question:
Is the range 0.0:0.1:1.0
A) a performance optimization that should behave identically to the vector [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] or
B) does it represent the 11 real numbers evenly spaced between 0 and 1, including both endpoints?

Practical implications:
Under A, diff(0:.1:1) is [0.1, 0.1, 0.09999999999999998, 0.10000000000000003, 0.09999999999999998, 0.09999999999999998, 0.09999999999999998, 0.10000000000000009, 0.09999999999999998, 0.09999999999999998] (current implementation)
Under B, a performance and accuracy improvement would be to define diff(x::AbstractRange) = fill(step(x), length(x)-1) which would return [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1].

Under A, searchsorted(1f8-10:1f8, 1f8) should return 5:9
Under B, searchsorted(1f8-10:1f8, 1f8) should return 9:9 (current implementation, see #34408)

Under A, collecting a range before performing an operation should never change results
Under B, a range can be more precise than the collected vector so it is possible for collect to introduce rounding errors which change results.

@LilithHafner LilithHafner added design Design of APIs or of the language itself needs decision A decision on this change is needed ranges Everything AbstractRange triage This should be discussed on a triage call labels Nov 6, 2024
@mbauman
Copy link
Member

mbauman commented Nov 6, 2024

Even better, diff(A::AbstractRange) could be implemented as r[begin:end-1] .- r[begin+1:end], which gives back the range:

julia> rangediff(r) = r[begin:end-1] .- r[begin+1:end]
rangediff (generic function with 4 methods)

julia> rangediff(0:.1:1)
StepRangeLen(-0.1, 0.0, 10)

julia> println(collect(ans))
[-0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1]

This is in keeping with how we do lots of range arithmetic — it's O(1) where possible, even if it means that the values are different than the collected versions would be. For example, (0:.1:1) ./ 3 is 0:1/30:1/3 instead of the division of the values the range generates:

julia> (0:.1:1) ./ 3
0.0:0.03333333333333333:0.3333333333333333

julia> (collect(0:.1:1) ./ 3) .- collect((0:.1:1) ./ 3)
11-element Vector{Float64}:
  0.0
  0.0
  0.0
 -1.3877787807814457e-17
  0.0
  0.0
 -2.7755575615628914e-17
 -2.7755575615628914e-17
  0.0
  0.0
  0.0

So what's the rule? I think it's gotta be a little gray:

  • If you can compute the operation with O(1) range operations directly on the begin / step / stop / len quad and return an O(1) object, then that's the right thing to do.
  • Otherwise, it's gotta work on the generated values

Where does this leave searching? No clue.

@adienes
Copy link
Contributor

adienes commented Nov 6, 2024

I have, in the past, though I do not remember the exact discussions, been informed of the following two stylized facts:

  • a range IS a vector, almost equal to its collected vector, but with some performance improvements on top
  • a range always includes its two endpoints

is it possible for a clean design to satisfy both? it may have to sacrifice "equally spaced."

@andrewjradcliffe
Copy link
Contributor

Even better, diff(A::AbstractRange) could be implemented as r[begin:end-1] .- r[begin+1:end], which gives back the range:

julia> rangediff(r) = r[begin:end-1] .- r[begin+1:end]
rangediff (generic function with 4 methods)

julia> rangediff(0:.1:1)
StepRangeLen(-0.1, 0.0, 10)

julia> println(collect(ans))
[-0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1]

I think you mean rangediff(r) = r[begin+1:end] .- r[begin:end-1] ^_^;;

@oscardssmith
Copy link
Member

triage defers to a higher power.

@StefanKarpinski
Copy link
Member

Lol, took me a while to realize I'm supposed to be the higher power.

@mbauman
Copy link
Member

mbauman commented Nov 7, 2024

So there are more than two possible semantics:

  1. There's the theoretical perfect spacing of 11 elements between 0.0 and 1.0 (or eleven increments of 1 starting from 1f8)
  2. There's the internal implementation, pre-rounding.
  3. There are potentially alternative implementations that try to work on what they think the maths behind the range are, based on the outputs of first(r), last(r), step(r), length(r).
  4. There's the rounded element-wise getindex approximation

In the case of searchsorted, we currently use a mix of 3 and 4. This means that we can find 0.3 in 0:0:.1:1, but I can't find the theoretical 3//10 — instead, both 3//10 and 4//10 correctly land between 0.3 and 0.4. That works, and it's "just" an optimization. But it assumes that the rounded range outputs (4) are monotonically increasing, which is why it falls apart with 1f8:

julia> searchsorted(1f8:1f8+8, 100000000)
1:1

julia> searchsorted(1f8:1f8+8, 100000001)
2:1

julia> searchsorted(collect(1f8:1f8+8), 100000001)
6:5

I think the rule has got to be about what the particular function returns. search/find/etc return indices — indices that are specific to the given range. Thus those indices must be able to be used as indices in the range and get the values they found.

Conversely, simple range arithmetic returns other ranges. As long as the start/stop and length are correctly computed, I think it's ok for the collected variants to give slightly different numeric answers. It's not terribly unlike different summation algorithms, really.

@mbauman
Copy link
Member

mbauman commented Nov 7, 2024

search/find/etc return indices — indices that are specific to the given range. Thus those indices must be able to be used as indices in the range and get the values they found.

Amusingly Horrifyingly, though, I don't see any feasible way to get around this disconnect:

julia> r = 1f8:1f8+12
1.0f8:1.0f0:1.0000002f8

julia> idxs = searchsorted(collect(r), 100000000) # Assuming searchsorted returns the range outputs that match
1:5

julia> r[idxs]
1.0f8:1.0f0:1.0f8

julia> all(==(100000000), r[idxs])
true

julia> r[idxs] .- 100000000
0.0f0:1.0f0:4.0f0

julia> collect(ans)
5-element Vector{Float32}:
 0.0
 1.0
 2.0
 3.0
 4.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
design Design of APIs or of the language itself needs decision A decision on this change is needed ranges Everything AbstractRange triage This should be discussed on a triage call
Projects
None yet
Development

No branches or pull requests

6 participants