Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster h.findBucket(v) #1662

Closed
wants to merge 0 commits into from
Closed

Conversation

imorph
Copy link
Contributor

@imorph imorph commented Oct 27, 2024

What problem this PR is solving

Hi!
It is part "two" of exploration for optimization possibilities for cockroachdb/cockroach#133306 (part "one" is here). Second largest contributor in original issue was findBucket function. Closer look at that function revealed some excellent suggestion from @beorn7 about smaller-sized arrays and linear search:

	// TODO(beorn7): For small numbers of buckets (<30), a linear search is
	// slightly faster than the binary search. If we really care, we could
	// switch from one search strategy to the other depending on the number
	// of buckets.
	//
	// Microbenchmarks (BenchmarkHistogramNoLabels):
	// 11 buckets: 38.3 ns/op linear - binary 48.7 ns/op
	// 100 buckets: 78.1 ns/op linear - binary 54.9 ns/op
	// 300 buckets: 154 ns/op linear - binary 61.6 ns/op

I decided to explore it in isolation and this is a result: https://github.com/imorph/go-search-research

TLDR:

  • linear search indeed faster for arrays < 35 elements (regardless position of the needle in haystack, but the closer to beginning the faster search)
  • linear search always outperforms binary search in case of +Inf bucket (regardless the size)
  • both search variants are performing the same (under lower bounds) when executed in parallel
  • sophisticated loop-unrolling implementation of linear search is not worth it

This PR is proposal to use this "adaptive" implementation

Design

  • use simple linear search for everything under 35 elements
  • check for +Inf before calling into sort.SearchFloat64s
  • use stdlib's sort.SearchFloat64s for everything else

Results

  • as expected short histograms performing better by 50%
  • in case of +Inf new implementation is better by 90% (no need to execute full binary search just check last element)
  • (!!!) mid/large/huge tests are faster as well by 12%-16%, i have no other explanation for that other than that early exit check are touching slice which leads to prefetching it into CPU cache and then binary search on it running faster
goos: linux
goarch: amd64
pkg: github.com/prometheus/client_golang/prometheus
cpu: Intel(R) Xeon(R) Platinum 8488C
                  │ ../../fbOLD  │             ../../fbNEW             │
                  │    sec/op    │   sec/op     vs base                │
FindBucketShort-8   13.450n ± 0%   6.365n ± 0%  -52.67% (p=0.000 n=10)
FindBucketMid-8      16.40n ± 1%   14.32n ± 0%  -12.68% (p=0.000 n=10)
FindBucketLarge-8    18.95n ± 0%   16.41n ± 1%  -13.43% (p=0.000 n=10)
FindBucketHuge-8     23.91n ± 0%   20.00n ± 0%  -16.35% (p=0.000 n=10)
FindBucketNone-8    23.835n ± 0%   2.012n ± 1%  -91.56% (p=0.000 n=10)
geomean              18.86n        9.036n       -52.08%

@vesari
@ArthurSens
@bwplotka
@kakkoyun

@imorph imorph force-pushed the faster_find_bucket branch from cce97b4 to e3911b2 Compare October 27, 2024 14:39
@imorph imorph changed the title Faster find bucket Faster h.findBucket(v) Oct 27, 2024
@ArthurSens
Copy link
Member

ArthurSens commented Oct 28, 2024

That's really cool @imorph, I didn't have the time to look deep into this but just skimmed over the changes. One question, how did you come up with the 35 limit? Was just intuition or backed by evidence? There's no wrong answer, I'm just curious :)

@imorph
Copy link
Contributor Author

imorph commented Oct 29, 2024

@ArthurSens I made bite-size research here -> https://github.com/imorph/go-search-research where benchmarks were executed for many different lengths of slice. On both x86 and M3 slice length where binary search starts to take over was somewhere in between 30 and 40 (it also depends on a position of an element in a slice - the closer to beginning the lengthier slice should be for binary search to win).

@kakkoyun
Copy link
Member

I plan to review this PR this weekend. Sorry for the tardiness.

@imorph
Copy link
Contributor Author

imorph commented Nov 2, 2024

One more before/after diff on another CPU (+ one test added):

goos: linux
goarch: amd64
pkg: github.com/prometheus/client_golang/prometheus
cpu: INTEL(R) XEON(R) GOLD 5512U
                   │     orig      │              new_35_low              │
                   │    sec/op     │    sec/op     vs base                │
FindBucketShort-56    19.74n ±  1%   11.34n ±  3%  -42.53% (p=0.000 n=10)
FindBucketMid-56      23.69n ±  1%   21.59n ±  1%   -8.85% (p=0.000 n=10)
FindBucketLarge-56    27.18n ± 14%   24.35n ±  0%  -10.41% (p=0.022 n=10)
FindBucketHuge-56     35.07n ±  1%   30.57n ±  1%  -12.84% (p=0.000 n=10)
FindBucketInf-56     34.810n ± 14%   3.140n ± 15%  -90.98% (p=0.000 n=10)
FindBucketLow-56     30.305n ±  0%   2.425n ± 42%  -92.00% (p=0.000 n=10)
geomean               27.88n         10.56n        -62.12%

Copy link
Member

@ArthurSens ArthurSens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the other PR, let's add a comment explaining that numbers were chosen based on empirical testing while linking to your PR description.

I've added some comments about early exits and then LGTM!

Comment on lines 861 to 864
n := len(h.upperBounds)

// Early exit: if v is less than or equal to the first upper bound, return 0
if v <= h.upperBounds[0] {
return 0
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we swap the order here between the early exit and n := ... just so it truly is an early exit? :o)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like there is cases when len(h.upperBounds) == 0 so we should first check for len otherwise TestNativeHistogram panics at factor 1.1 results in schema 3

// Early exit: if v is greater than the last upper bound, return len(h.upperBounds)
if v > h.upperBounds[n-1] {
return n
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we move this part upwards? Even before the linear search?

@ArthurSens
Copy link
Member

By the way, awesome research here: https://github.com/imorph/go-search-research

@imorph imorph force-pushed the faster_find_bucket branch from d6d7c58 to aaa3f32 Compare November 4, 2024 10:00
@ArthurSens
Copy link
Member

@imorph looks like you added some extra commits by accident 🤔

@imorph
Copy link
Contributor Author

imorph commented Nov 4, 2024

@imorph looks like you added some extra commits by accident 🤔

yes, sorry for the this git-mess =(

  • first I FF-merged main (with fixed test to make CI green again)
  • then "early exit re-arrangements" were pushed without sign-off (so I followed directions to rewrite DCO'less commit)
  • then one of the tests start to failing (turns out this n==0 check were needed on top of all other logic) and i reverted back a bit

I can reset to main this branch, and put everything new on top for more clearer picture. WDYT?

@ArthurSens
Copy link
Member

No worries!

Yeah, ideally we have only the relevant commits in this PR, so rebasing on main sounds like a good idea

@imorph imorph closed this Nov 4, 2024
@imorph imorph force-pushed the faster_find_bucket branch from e3cc4e4 to 0c73c1c Compare November 4, 2024 19:24
@imorph imorph mentioned this pull request Nov 4, 2024
@imorph
Copy link
Contributor Author

imorph commented Nov 4, 2024

@ArthurSens this PR closed itself after me trying to start from fresh main.

#1673 is what it should be from the start.

latest version comparison with main on apple M3:

goos: darwin
goarch: arm64
pkg: github.com/prometheus/client_golang/prometheus
cpu: Apple M3 Max
                   │  ../../main  │          ../../branch_alt           │
                   │    sec/op    │   sec/op     vs base                │
FindBucketShort-14    8.394n ± 1%   4.240n ± 1%  -49.49% (p=0.000 n=10)
FindBucketMid-14      10.73n ± 1%   10.92n ± 0%   +1.86% (p=0.000 n=10)
FindBucketLarge-14    13.73n ± 0%   13.95n ± 1%   +1.60% (p=0.000 n=10)
FindBucketHuge-14     19.05n ± 1%   19.38n ± 1%   +1.76% (p=0.000 n=10)
FindBucketInf-14     19.230n ± 1%   1.359n ± 1%  -92.94% (p=0.000 n=10)
FindBucketLow-14     22.010n ± 1%   1.344n ± 2%  -93.90% (p=0.000 n=10)
geomean               14.67n        5.327n       -63.68%

interesting that I don't see any significant improvements for basic cases (that were documented in PR description). May be this is some intel specifics or measurement artefact

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants