Cycle 5: Reliable performance benchmarking for the Astropy project #508
Conversation
I would prefer this approach over NumFOCUS AWS if we can pull it off. But I do have a concern about the bus factor here. Will we be back to square one when Aperio decides to stop maintaining this server, e.g. when the funding runs out, or you all win lotteries and retire?
I think the best way to guard against this is to have the Astropy Project pay directly for the server once we have identified which one to get, and for us to openly document the server setup (or, if not completely open, keep it in a repo that at the very least the CoCo has access to). Then anyone else can take over maintenance of the server. We plan to keep the server-side setup as simple as possible, with all the important configuration living in e.g. the core package repo.
Also, I think a lot of the risk around who looks after this can be mitigated by running nothing but a GitHub Actions runner on the server. That should make it as portable as is reasonably practical. If we need to shut it down and stop maintaining it, anyone with a bit of Linux sysadmin experience should be able to spin up a replacement.
If you manage to spin this up fast enough and have extra money left, is looking into replacing asv in scope, since we're talking about reliability?
Potentially, but I don't know if there's consensus on whether that's a good idea.
Please react to this comment to vote on this proposal (👍, 👎, or no reaction for +0)
I admit that performance is often high on users' wish lists, that we don't have reliable benchmarks now, and that this has been a roadmap item for years.
Thus, while I in principle agree that benchmarks are "good to have", I don't see this bringing actionable results. I think we need to figure out what we actually want to do with the benchmark results before we pay to generate them.
I think Moritz makes a pretty solid case here. However, I do want to point out that one possible answer would be to focus benchmarking effort on low-level code. With this target in mind, there's a possible synergy with my FR (#493), specifically with the APE I'm working on with @astrofrog, where we plan to propose splitting low-level code out of astropy into one or several separate packages, and we expect we'll have to build a low-level test suite to reach that goal. Since low-level code is used exclusively where performance is already recognized as critical, focusing on that test suite for benchmarks could actually yield a lot of value for a small (additional) cost.
Part of the reason the benchmarks have not been as useful as they could be is that it has not been possible to run them reliably as part of CI. Yes, we do have them running on astropy core in CI as opt-in (it requires a label), but we have a low bar for looking at regressions because of the lack of dedicated/stable hardware. If we could run the benchmarks in a more stable environment, we could be more precise about timings and look for smaller changes. We could also in principle run the benchmarks on all PRs, since it is sometimes unexpected PRs that introduce regressions.
I am a bit more sceptical about how well-defined that process is. There is not necessarily even a clear definition of "wrong results" in terms of how accurate and reproducible the results should be; there are a load of tests in
I can tell you that back when we had Tom's little machine and access to nightly run results whenever we felt like it, it was nice. During the campaign for the "performance release" (remember that?), it was helpful. Not having it at all is a regression, and I think it should be addressed.
I recently set up codspeed on a project of mine and it was pretty smooth to get working. Have you compared codspeed vs ASV in terms of expected cost, ease of maintenance, etc.? The huge advantage of codspeed for me was the ability to plug into pytest, so it didn't require maintaining a separate benchmark suite.
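For concreteness, here is a rough sketch (not taken from the astropy test suite, purely illustrative) of what such a pytest-level benchmark could look like, living alongside the existing tests rather than in a separate benchmark suite. Both pytest-benchmark and pytest-codspeed expose a `benchmark` fixture with roughly this calling convention, though the exact plugin invocation may differ:

```python
import numpy as np

from astropy import units as u
from astropy.coordinates import SkyCoord


def test_benchmark_skycoord_creation(benchmark):
    # Array inputs so the benchmark exercises the vectorised code path.
    ra = np.linspace(0.0, 359.0, 10_000) * u.deg
    dec = np.linspace(-89.0, 89.0, 10_000) * u.deg
    # The `benchmark` fixture (from pytest-benchmark or pytest-codspeed)
    # repeatedly times the callable it is given.
    benchmark(SkyCoord, ra=ra, dec=dec, frame="icrs")
```

Run under a benchmarking plugin on dedicated hardware, a test like this could give per-PR timing comparisons without a second benchmark suite to maintain.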
I also use codspeed for dkist and it's good. I would say the main drawbacks of it are:
I will clarify that the main objective of this FR is to provide dedicated hardware for stable benchmarking as a GitHub Actions runner. This means we could use it for asv, pytest-benchmark, or codspeed (I think).
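To make the asv side concrete, here is a minimal sketch of the kind of benchmark asv discovers and times: methods prefixed with `time_`, with `setup` excluded from the measured time. The table operations here are illustrative rather than taken from the actual astropy-benchmarks suite; on the dedicated runner the suite would simply be executed by asv (e.g. `asv run` or `asv continuous`) inside a GitHub Actions job:

```python
import numpy as np

from astropy.table import Table


class TimeTableOps:
    """asv times each ``time_*`` method; ``setup`` is not included."""

    def setup(self):
        # Build the inputs once, outside the timed region, so only the
        # operations below are measured.
        rng = np.random.default_rng(42)
        self.table = Table({"a": np.arange(100_000), "b": rng.random(100_000)})
        self.mask = self.table["b"] > 0.5

    def time_column_arithmetic(self):
        # Element-wise arithmetic across two columns.
        self.table["a"] * 2 + self.table["b"]

    def time_boolean_mask(self):
        # Row selection with a boolean mask.
        self.table[self.mask]
```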
I do worry about unwittingly introducing significant performance regressions with feature or bug-fix PRs. Making a plan for a stable hardware configuration for performance testing seems like a necessary first step. As mentioned, how to most effectively use that hardware is a separate (and not trivial) question, but just having the existing ASV benchmarks is a decent start.
The Cycle 5 funding request process has been hugely successful! On the downside, that means our funds are severely oversubscribed. Even after the Finance Committee and SPOC have taken into consideration community feedback/voting and alignment with the roadmap, there are still more funding requests than we can afford in 2026. We would like to stretch the budget as far as possible and fund as many activities as possible, while making sure the Project remains volunteer-driven. Hence, we would like to know whether this project would still meet its deliverables if your budget were reduced by 25%, 50%, or 100%; or if there is some other minimum, feel free to specify that instead. As a reminder, there will be more funding for 2027, and we expect the Cycle 6 call for 2027 funding requests to begin in the Fall of 2026. Thank you for your engagement and understanding as we continue to optimize our funding and budgeting processes and the balance of volunteer vs funded work! (@astrofrog)
The proposed budget (USD 6600) is already a strict minimum, since we definitely need to cover the costs of the server, and with fewer than 40 hours it likely won't be possible to get everything set up, which would then not be useful.
This proposal was solicited by the Strategic Planning Committee, as it was identified that no existing proposal addressed the roadmap item about performance benchmark reporting (which is why this was opened after the deadline).
Note that this is separate from the NumFOCUS approach that @pllim mentioned in https://groups.google.com/g/astropy-dev/c/Ns2jj7qtW-s; the approach in the current proposal has a lower monthly cost and includes developer time to make it happen.