Increase default running time per process? #136
On micro-benchmarks (values less than 100 ns), each process performs differently depending on many things: environment variables, the current working directory, address space layout (which is randomized on Linux: ASLR), the Python random hash seed (which indirectly changes the number of collisions in hash tables), etc. pyperf is the result of my research on benchmarking: https://vstinner.github.io/category/benchmark.html pyperf is not tuned for JIT compilers. I tried but failed to implement R changepoint in pyperf to decide when a benchmark looks "steady". I stopped my research at: https://vstinner.readthedocs.io/pypy_warmups.html Sadly, it seems that nobody has tried to tune pyperformance for PyPy so far. PyPy still uses its own benchmark suite and its own benchmark runner. If you want to change the default parameters, can you please show that the change has limited or no impact on the reproducibility of results? My main concern is getting reproducible results, not really running benchmarks fast. But I'm also annoyed that a whole run of pyperformance is so slow. Reproducible means, for example, that if you run a benchmark 5 times on the same machine, rebooting the machine between each run, you get almost the same values (mean +- std dev). For that, I like to use "pyperf dump" and "pyperf stats" to look at all values, not just the mean and std dev. On the other hand, I'm perfectly fine with having different parameters for JIT compilers. pyperf already has heuristics that are only enabled if a JIT compiler is detected. Currently, they are mostly about computing the number of warmups in the first (and maybe second) worker process.
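The reproducibility criterion described in this comment (several runs giving almost the same mean +- std dev) can be made concrete with a small sketch. This is only an illustration using the standard library: the helper names `summarize` and `runs_agree`, the 2% tolerance, and the timing values are all assumptions made up for the example, not part of pyperf and not measured data.

```python
import statistics


def summarize(values):
    """Return (mean, sample standard deviation) for one benchmark run."""
    return statistics.mean(values), statistics.stdev(values)


def runs_agree(runs, tolerance=0.02):
    """Return True if every per-run mean lies within `tolerance` (relative)
    of the mean of all per-run means. The 2% default is an assumption."""
    means = [statistics.mean(run) for run in runs]
    center = statistics.mean(means)
    return all(abs(m - center) <= tolerance * center for m in means)


# Made-up timings in nanoseconds, for illustration only.
stable_runs = [
    [100.0, 101.0, 99.0],
    [100.5, 100.0, 101.0],
    [99.5, 100.0, 100.5],
]
# The same runs plus one run that is clearly shifted.
shifted_runs = stable_runs + [[110.0, 111.0, 109.0]]

for run in stable_runs:
    mean, stdev = summarize(run)
    print(f"{mean:.1f} ns +- {stdev:.1f} ns")

print("stable runs agree:", runs_agree(stable_runs))
print("shifted runs agree:", runs_agree(shifted_runs))
```

A check like this only compares means; looking at every value, as "pyperf dump" and "pyperf stats" allow, can reveal patterns that a mean and standard deviation hide.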
Ah, also, I don't want to be the gatekeeper of pyperf, I want it to be useful to most people :-) That's why I added co-maintainers to the project: @corona10 and @pablogsal who also care about Python performance.
On CPython with CPU isolation, in my experience, 3 values per process (ignoring the first warmup) are almost the same. Computing more values per process wouldn't bring much benefit. If you don't use CPU isolation, it can be different. With a JIT compiler, it's likely very different. Also, Python 3.10 optimizes LOAD_ATTR if you run a code object often enough. Python 3.11 optimizes many more opcodes with a new "adaptive" bytecode design. So in recent years, CPython performance has also started to change depending on how many times you run a benchmark. It may also need more warmups ;-)
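To check whether successive values inside one process really are "almost the same", it helps to keep each iteration's timing instead of a single aggregate. The sketch below does that with `time.perf_counter` from the standard library. `workload` and `time_iterations` are hypothetical names chosen for this example, and this is not how pyperf itself collects values.

```python
import time


def workload():
    """Hypothetical stand-in for a benchmark body."""
    return sum(i * i for i in range(10_000))


def time_iterations(func, iterations=10):
    """Time each call separately so drift across iterations stays visible."""
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings


timings = time_iterations(workload)
for index, seconds in enumerate(timings):
    print(f"iteration {index}: {seconds * 1e6:.1f} us")
```

If the early iterations are consistently slower or faster than the later ones, the process has not reached a steady state within the measured window.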
@vstinner @kmod IMO, users should know that a JIT implementation needs warmup time, and that time should also be measurable and visible to end users through a benchmark. So I would like to suggest the following things.
WDYT?
Very important paper in this field: https://arxiv.org/abs/1602.00602 "Virtual Machine Warmup Blows Hot and Cold" (2017).
The number of times we run a benchmark and the duration of each run should be independent. We do want some form of inter-process warmup (compiling pyc files, warming O/S file caches, etc.) as that reduces noise, but allowing some VMs a free "warmup" time is nonsense. We can have benchmarks of varying lengths. I agree with @kmod that many (all?) of the pyperformance benchmarks do not reflect user experience. If a JIT compiler has a long warmup, but is fast in the long run, we should show that, not just say it is fast.
@markshannon So to be clear, pyperf already treats JIT implementations differently from non-JIT ones, and I am advocating for getting rid of this distinction. I think a single set of numbers should be chosen, and personally I think the JIT numbers (or higher) should be chosen, but I think choosing the non-JIT numbers for everyone would also be an improvement. Also, I could have been clearer: my proposal doesn't change the number of samples collected or the length of each sample, just the number of processes that those samples are spread across. Also, for what it's worth, pyperf already gives each process a short warmup period. @vstinner I disagree that reproducibility is the primary concern of benchmarking, because if that were true then "return 0" would be an ideal benchmarking methodology. The current interest in benchmarking comes from wanting to explain to users how their experience might change if they switch to a newer Python implementation; I don't think users really care whether the number is "15% +- 1%" or "15% +- 0.1%", but they would care if the real number is actually "25% +- 1%" because the benchmarking methodology was not representative of their workload. I.e., I think accuracy is generally more important than precision, and that's the tradeoff that I'm advocating for here. I could see the argument "Python processes run on average for 600ms so that's why we should keep that number", but personally I believe that premise is false. Maybe put another way: I think everything that's been said on this thread would also be an argument against me proposing to increase the runtime to 600ms if it were currently 300ms. So this thread seems to imply that we should actually decrease the amount of time per subprocess? For what it's worth, I believe pyperf's per-process execution time is a few orders of magnitude smaller than what everyone else uses, which suggests increasing it.
Using the default settings, pyperf aims to run 20 worker processes for ~600ms each. Or for implementations that are noted as having jits, 6 processes for 1600ms each.
Is there a strong reason for running so many subprocesses for such a short amount of time? It looks like the results are aggregated and process-to-process comparisons are dropped. 600ms/1600ms is a short amount of time when it comes to jit warmup and in my view doesn't quite reflect the typical experience that users have.
I'd like to propose a new set of numbers, such as 3 worker processes for 4s each. (I'd even be in support of 1 worker process for 12s.) I'd also like to propose using this configuration regardless of whether the implementation has a jit, since I put a higher weight on consistency than using more processes when possible.
What do you all think? I'm also curious what the cinder folks think, I saw @Orvid comment about this in facebookincubator/cinder#74 (comment)
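For reference, the total time budget of each configuration mentioned in this issue works out as follows. This is plain arithmetic on the approximate figures quoted above (the "~600ms" default is itself an estimate); the `Config` class and its field names are hypothetical and do not correspond to pyperf options.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """One process-count / per-process-duration pairing (illustrative only)."""
    label: str
    processes: int
    ms_per_process: int

    @property
    def total_ms(self):
        # Total time spent across all worker processes, in milliseconds.
        return self.processes * self.ms_per_process


configs = [
    Config("current default", 20, 600),
    Config("current JIT", 6, 1600),
    Config("proposed", 3, 4000),
    Config("proposed, single process", 1, 12000),
]

for config in configs:
    print(
        f"{config.label}: {config.processes} x {config.ms_per_process} ms "
        f"= {config.total_ms / 1000:.1f} s"
    )
```

The proposed configurations keep the same 12 s total as the current non-JIT default; only the split between process count and per-process duration changes.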