Reconsidering min()? #75
Hi,
Are you talking about the timeit module of the Python standard library? You should use the pyperf timeit command instead.
It seems like your benchmark has a flaw. Maybe you need to increase the number of warmup runs. See also https://pyperf.readthedocs.io/en/latest/system.html
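For context, a minimal pyperf-based version of such a micro-benchmark might look roughly like this; the benchmark name, statement, and setup below are placeholders, not the ones discussed in this issue:

```python
# Rough sketch of a pyperf micro-benchmark script (hypothetical statement/setup).
# pyperf spawns worker processes, performs warmup runs and reports mean +- std dev;
# its usual command-line options (e.g. writing results to a JSON file) are
# handled by the Runner itself.
import pyperf

runner = pyperf.Runner()
runner.timeit(
    name="dict lookup",                                 # placeholder name
    stmt="d['key_500']",                                # placeholder statement
    setup="d = {f'key_{i}': i for i in range(1000)}",   # placeholder setup
)
```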
I was a little confusing in my explanation. I post here my "steps to reproduce". These are the tests done with
Notice that I've not isolated any CPUs, since I have only 2 cores x 2 threads. This is the test I've done for
If I run this code with
So I focused on the results I got from the test with
that is, I manually dropped the slowest results (or, in other words, I considered only the first Gaussian curve). The mean in this case is compatible with the resulting distribution. The results are:
that is compatible with
It seems that both you and the timeit devs are right.
It seems like you have a lot of outliers. You may analyze the results with "pyperf stats" and "pyperf hist". You may try to increase --warmups.
Ok, I'll explain it better (and with better code; ignore the first one, it's bugged :) )

TL;DR: I wrote a little script that seems to produce the same results as your

Long explanation: I tried to raise the number of loops and teardowns to very high values. The result is the same:
Then I tried to remove the "noise" in my test code. This is the new code:
Basically, the script does this:
The results (without the plot):
Some considerations:
So, as I said in my previous post, it seems that both you and the timeit devs are right: you have to do some statistical calculations to remove the "noise", but the "real" value is around the minimum.
The pyperf API gives you access to all values. You can use min(), max(), median(), etc. You're free to ignore outliers. But I don't want to change the default: mean +- std dev is likely the best compromise for most common use cases. To measure a microbenchmark faster than 100 ns, you should follow pyperf's advice:
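For the API part mentioned at the start of that comment, here is a rough sketch of inspecting the raw values yourself; the file name "bench.json" and the 10% trimming cutoff are illustrative assumptions:

```python
# Sketch: load a saved benchmark and analyse the raw values directly,
# optionally ignoring outliers with a crude cutoff ("bench.json" is a placeholder).
import statistics
import pyperf

suite = pyperf.BenchmarkSuite.load("bench.json")
for bench in suite.get_benchmarks():
    values = sorted(bench.get_values())          # all measured values, in seconds
    print(bench.get_name())
    print("  min    :", min(values))
    print("  median :", statistics.median(values))
    print("  mean   :", statistics.mean(values))
    # arbitrary outlier filter: keep values within 10% of the minimum
    trimmed = [v for v in values if v <= values[0] * 1.10]
    print("  trimmed mean:", statistics.mean(trimmed))
```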
Ok, I can try these kernel parameters, but I did not suggest removing the mean and stdev, but rather removing the "unlikely" results first. I don't know how you measure these quantities, but if you don't remove the "tail" of the measurements, it's just wrong, since mean and stdev apply to a Gaussian distribution and, as you can see from the graph, this is not a Gaussian distribution.
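As an illustration of what "removing the tail" could mean in practice, here is a sketch with an arbitrary cutoff; the threshold and the sample timings below are made up and do not reflect pyperf's behaviour:

```python
# Sketch: compute mean/stdev only on values close to the minimum, dropping the
# slow "tail". The cutoff factor and example timings are arbitrary.
import statistics

def trimmed_stats(values, cutoff=1.05):
    """Keep only values within cutoff * min(values), then return (mean, stdev)."""
    fastest = min(values)
    kept = [v for v in values if v <= fastest * cutoff]
    if len(kept) < 2:
        return fastest, 0.0
    return statistics.mean(kept), statistics.stdev(kept)

# hypothetical timings in seconds: a fast cluster plus a slow tail
timings = [1.00, 1.01, 1.02, 1.01, 1.35, 1.40, 1.02]
print(trimmed_stats(timings))
```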
@Marco-Sulla the sporadic results might just be due to string hash randomization: the lowest 3 bits of random strings sometimes collide, so the dict lookup has to take an extra step (or several) to probe another slot. As a proof of this concept, running
Regardless of the distribution, multiplying the mean by the size of the data is a good estimator for the sum. When benchmarking code that runs many times, the sum is generally what I think people are looking to estimate. Question: "If I run this a million times, how long will that take in total?" Regardless of the type of distribution, a very good answer is: "One million times the mean."

I don't think outliers should be ignored without good reason, and I think your outliers are legitimate data points in this case, since hash randomization is a real part of the environment in which the code will be run. One could probably find a way to force a particular string hashing seed and run the benchmarks more deterministically on multiple different seeds to get consistent results for each.

And standard deviation is a good heuristic measure of spread, which can help detect when something is behaving unexpectedly or inconsistently, even if you can't produce a precise confidence interval from it without knowing the type of distribution. Two numbers can't possibly convey everything, but IMO mean and standard deviation are a good default.
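A sketch of what "forcing a particular string hashing seed" could look like: re-running the same timing snippet in child interpreters with PYTHONHASHSEED pinned to different values. The statement, setup, and seed list are illustrative, not taken from this issue:

```python
# Sketch: run the same micro-benchmark under several fixed hash seeds by
# setting PYTHONHASHSEED for child interpreters. Statement/setup/seeds are examples.
import os
import subprocess
import sys

SETUP = "d = {f'key_{i}': i for i in range(1000)}"
STMT = "d['key_500']"

for seed in (0, 1, 2, 3):
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-m", "timeit", "-s", SETUP, STMT],
        env=env, capture_output=True, text=True, check=True,
    )
    print(f"PYTHONHASHSEED={seed}: {result.stdout.strip()}")
```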
Yes, but I calculate the mean and the standard deviation after removing the "tail". The data shows that the times form multiple Gaussians, where the tallest Gaussian is the first one, containing the fastest benchmarks. So I remove the other Gaussians. How? With a simple t-test-like cutoff: if a datum is more than X sigma greater than the fastest benchmark, I remove it, where sigma is calculated using the minimum as the "real" value, and not the mean. It's imprecise, since the minimum is in the first tail of the first Gaussian, but it's a good approximation. This is my algorithm:

from math import sqrt
from time import time
import timeit


def mindev(data, xbar=None):
    """
    This function calculates the stdev around the _minimum_ of the data,
    and not the mean.
    """
    if not data:
        raise ValueError("No data")
    if xbar is None:
        xbar = min(data)
    sigma2 = 0
    for x in data:
        sigma2 += (x - xbar) ** 2
    N = len(data) - 1
    if N < 1:
        N = 1
    return sqrt(sigma2 / N)


def autorange(stmt, setup="pass", globals=None, ratio=1000, bench_time=10, number=None):
    if setup is None:
        setup = "pass"
    t = timeit.Timer(stmt=stmt, setup=setup, globals=globals)
    break_immediately = False
    if number is None:
        # get the number of needed loops automatically
        a = t.autorange()
        num = a[0]
        # the number of loops
        number = int(num / ratio)
        if number < 1:
            number = 1
        # the number of repeats of the loops
        repeat = int(num / number)
        if repeat < 1:
            repeat = 1
    else:
        repeat = 1
        break_immediately = True
    data_tmp = t.repeat(number=number, repeat=repeat)
    min_value = min(data_tmp)
    # I create a list with the minimum value
    data_min = [min_value]
    bench_start = time()
    while 1:
        # I get additional benchmarks until `bench_time` seconds pass
        data_min.extend(t.repeat(number=number, repeat=repeat))
        if break_immediately or time() - bench_start > bench_time:
            break
    # I sort the data...
    data_min.sort()
    # ... so the minimum is the first datum
    xbar = data_min[0]
    i = 0
    # I repeat until no data is removed
    while i < len(data_min):
        i = len(data_min)
        # I calculate the sigma using the minimum as the "real" value,
        # and not the mean
        sigma = mindev(data_min, xbar=xbar)
        for i in range(2, len(data_min)):
            # I find the point where the data are greater than
            # 3 sigma. Data are sorted...
            if data_min[i] - xbar > 3 * sigma:
                break
        k = i
        # do not remove too much data
        if i < 5:
            k = 5
        # remove the data beyond 3 sigma
        del data_min[k:]
    # I return the minimum as the real value, with the sigma calculated
    # around the minimum
    return (min(data_min) / number, mindev(data_min, xbar=xbar) / number)

Maybe the last step, the return, is incorrect. I could return the normal mean and stdev, since data_min should contain only the first Gaussian.
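A sketch of that alternative last step, written as a hypothetical helper operating on the data_min and number variables from the function above:

```python
# Possible alternative ending for autorange() above: report the ordinary
# mean and stdev of the trimmed data instead of the minimum-centred pair.
import statistics

def summarize(data_min, number):
    mean = statistics.mean(data_min) / number
    stdev = statistics.stdev(data_min) / number if len(data_min) > 1 else 0.0
    return mean, stdev
```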
I've done a little experiment: I ran timeit.repeat 1000 times on a really simple statement. After that, I sorted the list and grouped the results into time slices, counting the number of times a benchmark falls inside a certain time slice. This way I get a sort of probability distribution. Then I plotted the result: https://www.dropbox.com/s/f4naryc055k42cs/2020-07-24-080340_1366x768_scrot.png?dl=0
The result is quite interesting. As you can see, the distribution is not a normal distribution. It seems to be a multimodal normal distribution. Furthermore, the highest probability is near the minimum value, and not near the average (which is on the second peak).
Mean and stdev are only meaningful if the distribution is a normal distribution. It seems that the way you can have a sort of normal distribution is to consider only the first part of the plot.
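For reference, here is a sketch of the kind of experiment described above; the statement, repetition counts, and bin width are illustrative choices, not the ones used for the plot:

```python
# Sketch: repeat timeit many times on a trivial statement and bucket the results
# into time slices to approximate the distribution shown in the plot.
from collections import Counter
import timeit

timer = timeit.Timer("x = 1 + 1")            # a really simple statement
samples = sorted(timer.timeit(number=10_000) for _ in range(1000))

bin_width = (samples[-1] - samples[0]) / 50 or 1e-9   # 50 time slices (arbitrary)
histogram = Counter(int((s - samples[0]) / bin_width) for s in samples)
for bucket in sorted(histogram):
    start = samples[0] + bucket * bin_width
    print(f"{start:.6f}s  {histogram[bucket]:4d}  {'#' * (histogram[bucket] // 10 + 1)}")
```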
Since pyperf returns mean and stdev, does it "cut" the slowest runs?