
Conversation

@astrofrog
Member

This proposal was solicited by the Strategic Planning Committee after it was identified that no existing proposal addressed the roadmap item about performance benchmark reporting (which is why it was opened after the deadline).

Note that this is separate from the NumFOCUS approach which @pllim mentioned in https://groups.google.com/g/astropy-dev/c/Ns2jj7qtW-s - the approach in the current proposal has a lower monthly cost and includes developer time to make it happen.

@pllim
Member

pllim commented Nov 13, 2025

I would prefer this approach over NumFOCUS AWS if we can pull it off.

But I do have a concern about the bus factor here. Will we be back to square one if Aperio decides to stop maintaining this server, e.g. when the funding runs out, or you all win the lottery and retire?

@astrofrog
Member Author

I think the best way to guard against this is to have the astropy project pay directly for the server once we have identified which one to get, and for us to openly document the server setup, or, if not completely openly, at least keep it in a repo that the CoCo has access to. Then anyone else could take over maintenance of the server. We plan to keep the server-side setup as simple as possible, with all the important configuration living in e.g. the core package repo.

@Cadair
Member

Cadair commented Nov 18, 2025

Also, I think a lot of the risk around who looks after this can be mitigated by running nothing but a GitHub Actions runner on the server. That should make it as portable as is reasonably practical. If we need to shut it down and stop maintaining it, anyone with a bit of Linux sysadmin experience should be able to spin up a replacement.

@pllim
Member

pllim commented Nov 18, 2025

If you manage to spin this up fast enough and have money left over, is looking into replacing asv in scope, since we're talking about reliability?

@Cadair
Member

Cadair commented Nov 18, 2025

Potentially, but I don't know if there's consensus on whether that's a good idea.

@astrofrog astrofrog marked this pull request as ready for review November 21, 2025 21:28
@eteq
Member

eteq commented Nov 25, 2025

Please react to this comment to vote on this proposal ( 👍, 👎 or no reaction for +0)

@hamogu
Member

hamogu commented Nov 25, 2025

I admit that performance is often high on users' wish lists, that we don't have reliable benchmarks now, and that this has been a roadmap item for years.
Yet I don't think we (as a project) have figured out how to use benchmarks. We had ASV running for a while, and I don't remember (which might just mean I have a bad memory) it having a huge impact on our work.

  • If we fix a bug that produces wrong results, we'll fix it no matter the performance impact.
  • If we add a small new feature, people often already discuss the implementation in the PR, sometimes with specific, ad-hoc benchmarks testing just that feature. I believe it is unlikely that small features have such a wide impact, except in the obvious cases (e.g. units, tables) where we would discuss that in review anyway - and those are so rare that ad-hoc measurements on the developer's machine (e.g. a quick timeit run, as sketched after this list) are probably more cost-effective than running benchmarks on CI.
  • If we add a new feature that has widespread impact throughout astropy, we usually decide that we want it pretty early on. Unless it has a performance impact so obvious that we would notice anyway, I don't see us dropping a major feature for minor performance improvements (e.g. if it turned out that mix-in columns in tables cost performance, would we deprecate and remove that feature? Almost certainly not).
  • Personally, I don't think "accidental performance degradation over time" (i.e. degradation not related to bug fixes or intentional new features) is a concern in practice. One could imagine that a certain algorithm was the fastest choice with numpy 1.x and Python < 3.11, but has become slow with numpy 2+ and Python 3.12, so that the implementation could be changed to speed it up again. I can't prove that this doesn't happen in astropy, but it seems unlikely, since I'm not aware of anything like that in the scientific Python ecosystem. If it were widespread, we would have heard about it.
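
The kind of ad-hoc measurement mentioned in the second bullet is usually just a few lines run on the developer's machine. A minimal sketch (the operation timed here, Quantity creation, is only an illustrative placeholder, not a real case under discussion):

```python
# Quick ad-hoc benchmark a developer might run locally during a PR
# discussion, comparing timings before and after a change.
import timeit

setup = """
import numpy as np
import astropy.units as u
values = np.random.random(1_000_000)
"""

# Best-of-five timing of 100 calls to a single Quantity operation.
best = min(timeit.repeat("values * u.m", setup=setup, number=100, repeat=5))
print(f"Quantity creation: {best / 100 * 1e6:.1f} us per call")
```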

Thus, while I in principle agree that benchmarks are "good to have", I don't see this bringing actionable results. I think we need to figure out what we actually want to do with the benchmark results before we pay to generate them.

@neutrinoceros
Contributor

I think Moritz makes a pretty solid case here. However, I do want to point out that one possible answer would be to focus the benchmarking effort on low-level code. With this target in mind, there is a possible synergy with my FR (#493), specifically with the APE I'm working on with @astrofrog, where we plan to propose splitting low-level code out of astropy into one or several separate package(s), and where we expect we'll have to build a low-level test suite to reach that goal. Since low-level code is used exclusively where performance is already recognized as critical, focusing benchmarks on that test suite could yield a lot of value for a small (additional) cost.

@astrofrog
Member Author

Part of the reason the benchmarks have not been as useful as they could be is that it has not been possible to run them reliably as part of CI - yes, we do have them running on astropy core in CI as an opt-in (it requires a label), but because of the lack of dedicated/stable hardware we can only look for large regressions. If we could run the benchmarks in a more stable environment, we could be more precise about timings and look for smaller changes. We could also in principle run the benchmarks on all PRs, since it is sometimes unexpected PRs that introduce regressions.
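
For context, the benchmarks in question follow the usual asv conventions; a rough sketch of what such a benchmark looks like (the Quantity operations timed here are just illustrative, not the actual contents of the astropy benchmark suite):

```python
# A minimal asv-style benchmark: asv collects functions or methods whose
# names start with time_ (wall-clock time) and tracks the results across
# commits, so regressions show up as jumps in the timing history.
import numpy as np
import astropy.units as u


class TimeQuantity:
    """Illustrative timing benchmarks for Quantity arithmetic."""

    def setup(self):
        # setup() runs before each benchmark and is excluded from the timing.
        self.values = np.random.random(100_000)
        self.q = self.values * u.m

    def time_quantity_creation(self):
        self.values * u.m

    def time_quantity_to(self):
        self.q.to(u.km)
```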

@dhomeier
Contributor

  • If we fix a bug that produces wrong results, we'll fix it no matter the performance impact.
  • If we add a small new feature, people often already discuss the implementation in the PR, sometimes with specific, ad-hoc benchmarks testing just that feature. I believe it is unlikely that small features have such a wide impact, except in the obvious cases (e.g. units, tables) where we would discuss that in review anyway - and those are so rare that ad-hoc measurements on the developer's machine are probably more cost-effective than running benchmarks on CI.

I am a bit more sceptical about how well that process works in practice. There is not necessarily even a clear definition of "wrong results" in terms of how accurate and reproducible the results should be: there are plenty of tests in modeling that will not converge, or that find different results with slightly different inputs, or where you need to switch to a possibly slower solver to get more stable results. Likewise, the discussion of "serious" performance impacts sometimes feels rather anecdotal.
A clearer set of standards, and ways to verify them, could certainly be helpful.

@pllim
Member

pllim commented Nov 26, 2025

I can tell you that back when we had Tom's little machine and access to nightly run results whenever we felt like it, it was nice. During the campaign for the "performance release" (remember that?), it was helpful.

Not having it at all is a regression and I think it should be addressed.

@adrn
Member

adrn commented Dec 1, 2025

I recently set up codspeed on a project of mine and it was pretty smooth to get working. Have you compared codspeed vs ASV in terms of expected cost, ease of maintenance, etc.? The huge advantage of codspeed for me was the ability to plug into pytest, so it didn't require maintaining a separate benchmark suite.
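
To illustrate what "plugging into pytest" looks like, here is a rough sketch using the pytest-benchmark-style `benchmark` fixture (which, as I understand it, pytest-codspeed also supports); the benchmarked operation and the `coords` fixture are just placeholders:

```python
# An ordinary pytest test that doubles as a benchmark: the `benchmark`
# fixture runs the callable repeatedly and records the timings, while the
# assertion still checks correctness.
import numpy as np
import pytest
import astropy.units as u
from astropy.coordinates import SkyCoord


@pytest.fixture
def coords():
    rng = np.random.default_rng(0)
    return SkyCoord(ra=rng.uniform(0, 360, 10_000) * u.deg,
                    dec=rng.uniform(-90, 90, 10_000) * u.deg)


def test_separation_benchmark(benchmark, coords):
    # Time pairwise separations between consecutive coordinates.
    result = benchmark(coords[:-1].separation, coords[1:])
    assert np.all(result >= 0 * u.deg)
```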

@Cadair
Member

Cadair commented Dec 1, 2025

I also use codspeed for dkist and it's good.

I would say the main drawbacks of it are:

  • The free tier can only use their instrumented benchmarking, which can't benchmark syscalls, and therefore can't benchmark things like I/O.
  • It can't benchmark memory usage.

I will clarify that the main objective of this FR is to provide dedicated hardware for stable benchmarking, set up as a GitHub Actions runner. This means we could use it for asv, pytest-benchmark, or codspeed (I think).

@taldcroft
Member

I do worry about unwittingly introducing significant performance regressions with feature or bug-fix PRs. Making a plan for a stable hardware configuration for performance testing seems like a necessary first step. As mentioned, how to most effectively use that hardware is a separate (and not trivial) question, but just keeping the existing ASV suite running is a decent start.

@eteq
Member

eteq commented Dec 8, 2025

The Cycle 5 funding request process has been hugely successful! On the downside, that means our funds are severely oversubscribed. Even after the Finance Committee and SPOC have taken into consideration community feedback/voting and alignment with the roadmap, there are still more funding requests than we can afford in 2026.

We would like to stretch the budget as far as possible and fund as many activities as possible, while making sure the Project remains volunteer-driven. Hence, we would like to know if this project will still meet its deliverables if your budget is reduced by 25%, 50%, or 100%. Or, if there is some other minimum, feel free to specify that instead.

As a reminder, there will be more funding for 2027 and we expect the Cycle 6 call for 2027 funding requests to begin in the Fall of 2026.

Thank you for your engagement and understanding as we continue to optimize our funding and budgeting processes and the balance of volunteer vs funded work! (@astrofrog )

@astrofrog
Member Author

The budget requested (USD 6600) is already a strict minimum: we definitely need to cover the cost of the server, and with fewer than 40 hours of developer time it likely won't be possible to get everything set up, which would make the whole effort not useful.

