Benchmark order affects results #57
I had that in my experience.

```js
var n = 1
var emit = (m) => { n = (n * m) }

return () =>
{
  emit(-1)
}
```

It seems to heat things up, and then the order becomes irrelevant. You can also try running the tests in a 1-2-1-2 scheme.
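A sketch of how that "heat up" case could be registered as the first entry in the kind of `suite`/`add` setup used later in this thread; the deferred form, where the outer callback is setup and the returned function is what gets measured, is an assumption based on the `return () =>` in the snippet above:

```ts
suite(
  'JSON :: sigma vs parjs',
  // Fake load registered first, so whatever warm-up cost the first case pays
  // is spent here rather than on a real contender (assumption: benny-style API).
  add('zero', () => {
    let n = 1
    const emit = (m: number) => { n = n * m }
    // In the deferred form, only this returned function is the measured part.
    return () => {
      emit(-1)
    }
  }),
  add('sigma:defer', () => parseSigmaDefer(SAMPLE)),
  add('sigma:grammar', () => parseSigmaGrammar(SAMPLE)),
  ...handlers
)
```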
Not in my case.

```ts
add('sigma:defer:1', () => parseSigmaDefer(SAMPLE)),
add('sigma:grammar:1', () => parseSigmaGrammar(SAMPLE)),
add('sigma:defer:2', () => parseSigmaDefer(SAMPLE)),
add('sigma:grammar:2', () => parseSigmaGrammar(SAMPLE)),
```

```ts
add('sigma:grammar:1', () => parseSigmaGrammar(SAMPLE)),
add('sigma:defer:1', () => parseSigmaDefer(SAMPLE)),
add('sigma:grammar:2', () => parseSigmaGrammar(SAMPLE)),
add('sigma:defer:2', () => parseSigmaDefer(SAMPLE)),
```

As you can see, the first and second run of the same function give about the same result in both cases - however, the one that gets to go first is faster in both cases.

But wait, there's more.

```ts
parseSigmaGrammar(SAMPLE) // 👈 grammar first
parseSigmaDefer(SAMPLE)

suite(
  'JSON :: sigma vs parjs',
  add('sigma:defer', () => parseSigmaDefer(SAMPLE)), // 👈 defer first
  add('sigma:grammar', () => parseSigmaGrammar(SAMPLE)),
  add('parjs', () => parseParjs(SAMPLE)),
  ...handlers
)
```

```ts
parseSigmaDefer(SAMPLE) // 👈 defer first
parseSigmaGrammar(SAMPLE)

suite(
  'JSON :: sigma vs parjs',
  add('sigma:defer', () => parseSigmaDefer(SAMPLE)), // 👈 defer first
  add('sigma:grammar', () => parseSigmaGrammar(SAMPLE)),
  add('parjs', () => parseParjs(SAMPLE)),
  ...handlers
)
```

So it's really only a matter of which function gets called first - even if it gets called once outside of the benchmark, this somehow determines the winner. I have no explanation for this. 😅
I guess. How would you do that, though? Although, based on this, there is no reason to think the results would be any different.
I think you've already done 1-2-1-2 with:

```ts
add('sigma:grammar:1', () => parseSigmaGrammar(SAMPLE)),
add('sigma:defer:1', () => parseSigmaDefer(SAMPLE)),
add('sigma:grammar:2', () => parseSigmaGrammar(SAMPLE)),
add('sigma:defer:2', () => parseSigmaDefer(SAMPLE)),
```

The picture looks very similar to mine, and it was solved for me with the zero test. Another thing I can't see, but which may be important, is IO. If the parse routine involves reading actual files, the real deviation may be much more than the 1% displayed. 🤔 Benchmarking is hard - hopefully someone will join us with some better insights.

This is how things look with the zero test for me.
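On the IO point above: one way to rule it out is to load the fixture once, before any suite runs, so the measured callbacks only ever touch memory. A minimal sketch, assuming `SAMPLE` comes from a JSON file on disk (the path here is hypothetical; the thread doesn't show how the fixture is actually produced):

```ts
import { readFileSync } from 'node:fs'

// Hypothetical path - read once at module load, so no disk IO can leak
// into any of the benchmarked callbacks that parse SAMPLE.
const SAMPLE = readFileSync('./fixtures/sample.json', 'utf8')
```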
Yeah, I tried the "zero test" - I just didn't commit it (or include it above) because it didn't help.

```ts
suite(
  'zero',
  add('hello', () => {
    let n = 1
    const emit = (m: number) => {
      n = n * m
    }
    emit(-1)
    emit(-1)
    emit(-1)
  })
)
```

I tried it with/without
I tried adding a "warmup" suite as well.

```ts
suite(
  'warmup',
  add('woosh', () => {
    parseSigmaGrammar(SAMPLE)
    parseSigmaDefer(SAMPLE)
  })
)
```

The order of the two calls in this still somehow determines the outcome of the following suite. It really does look like the function that runs first somehow gets the most favorable conditions.

I really don't think mixing tests in a single runtime is reliable with V8 these days - the optimizations it makes are incredibly complex, and it's entirely plausible that one function could affect the performance of another, since any code appears able to affect the performance of the engine overall, at least temporarily.

My guess is the only reliable approach these days would be to fork the process before running each test. Or better still, just run the individual benchmarks one at a time under
Just to prove the point I'm trying to make, I decided to run each benchmark in isolation. Just an ugly quick hack, but...

```ts
const benchmarks = {
  'sigma:grammar': () => parseSigmaGrammar(SAMPLE),
  'sigma:defer': () => parseSigmaDefer(SAMPLE),
  parjs: () => parseParjs(SAMPLE)
} as any

function selectBenchmark() {
  for (const name in benchmarks) {
    for (const arg of process.argv) {
      if (arg === name) {
        return add(name, benchmarks[name])
      }
    }
  }
  throw new Error('no benchmark selected')
}

suite(
  'JSON :: sigma vs parjs',
  selectBenchmark(),
  ...handlers
)
```

And then in

And of course, I tried changing my script to:

The result is now what I expected:

Also, the numbers between individual runs are now much more consistent. I'm afraid the

For the record, I'm on Node v18.17.1.
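A hedged sketch of how that argv-based selection could be driven from a small parent script instead of npm scripts; the entry-point name `bench.js` and the hard-coded list of benchmark names are assumptions, not part of the setup above:

```ts
import { execFileSync } from 'node:child_process'

// Launch one fresh Node process per benchmark, so V8's optimization state
// from one case cannot bleed into the next.
for (const name of ['sigma:grammar', 'sigma:defer', 'parjs']) {
  execFileSync('node', ['bench.js', name], { stdio: 'inherit' })
}
```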
I'm thinking the same, more and more. Somehow the zero test fixed a lot of problems for me in the past (the most recent runs I did were on Node v16). I think you're right, and isolated tests are the way to go here. The idea of the zero test is: if the first test gets optimized, create some fake load to soak up that optimization. Then the following tests are all in the same conditions. The problem with isolated tests is that the numbers can change between runs. When I run grouped tests, the absolute numbers do change, but the ratio remains relatively stable, so I can compare.
Unfortunately, forking the process (or launching a dedicated process) would break compatibility with the browser. Isolating tests in a Worker is another option, but that breaks compatibility with the DOM and other facilities unavailable to workers. It's hard to think of a reliable way to do this both in browsers and under Node. 🤔
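For reference, the Worker variant under Node could look roughly like the sketch below; the separate `bench-worker.js` entry file and passing the benchmark name via `workerData` are assumptions:

```ts
import { Worker } from 'node:worker_threads'

// Each case gets its own Worker (its own V8 isolate), at the cost of losing
// the DOM and other main-thread-only facilities when the same code runs in a browser.
function runInWorker(name: string): Promise<number> {
  return new Promise((resolve, reject) => {
    const worker = new Worker('./bench-worker.js', { workerData: { name } })
    worker.on('error', reject)
    worker.on('exit', (code) => resolve(code))
  })
}
```

Usage would be something like `await runInWorker('sigma:defer')` per case, with the worker file doing the actual `suite`/`add` call.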
Just to eliminate the funky maths in
Oh, hello. https://github.com/Llorx/iso-bench 🤔
In browsers it's difficult, if not impossible. I've tried running iso-bench in workers instead of separate processes and still had optimization pollution, so I had to convert it to a fork. There MAY be a way in browsers, but it requires automatically reloading the page and such, which may not be too high a price to pay for the benefit of isolated benchmarking. That's something on my TODO list.
I have a benchmark here, in which the order of the tests seems to completely change the results.

That is, if the order is like this, `sigma:defer` wins by about 10-15%...

Whereas, if the order is like this, `sigma:grammar` wins by the same 10-15%...

So it would appear whatever runs first just wins.

I tried tweaking all the options as well - minimums, delay, etc. - nothing changes.

I wonder if `benchmark` is still reliable after 6 years and no updates? Its benchmarking method was first described 13 years ago - a lot of water under the bridge since then, I'm sure. To start with, I'd expect benchmarks to run in dedicated Workers, which I don't think existed back then?

Even then, they probably shouldn't run one after the other (111122223333) but rather round-robin (123123123123), or perhaps even randomly, to make sure they all get equally affected by the garbage collector, run-time optimizations, and other side effects. Ideally, they probably shouldn't even run in the same process, though.
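To make the round-robin idea concrete, here is a minimal hand-rolled sketch (not any library's API; the iteration count is arbitrary, and it assumes the `parseSigma*`/`parseParjs` helpers and the `SAMPLE` fixture from this thread are in scope):

```ts
import { performance } from 'node:perf_hooks'

// Stand-ins for the contenders discussed in this thread.
const tasks: Record<string, () => unknown> = {
  'sigma:grammar': () => parseSigmaGrammar(SAMPLE),
  'sigma:defer': () => parseSigmaDefer(SAMPLE),
  parjs: () => parseParjs(SAMPLE)
}

const totals: Record<string, number> = {}
for (const name in tasks) totals[name] = 0

// 123123123...: every task takes its turn each round, so warm-up effects and
// GC pauses are spread across all of them instead of favouring whoever ran first.
for (let round = 0; round < 1000; round++) {
  for (const name in tasks) {
    const start = performance.now()
    tasks[name]()
    totals[name] += performance.now() - start
  }
}

for (const name in tasks) {
  console.log(`${name}: ${(totals[name] / 1000).toFixed(3)} ms/round on average`)
}
```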