CI for benchmarks to track performance #13
Now that this issue exists, I'll write down one thought. Maybe we could just use a GitHub Actions self-hosted runner for this: https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners. I haven't looked into it in depth, but it appears to already have a built-in job queue system etc., so it would avoid a lot of reinventing the wheel.
The job time limit for self-hosted runners should be sufficiently high that we can run big jobs that the free GitHub-hosted runners probably don't allow. Also, GitHub Actions can schedule jobs à la cron instead of trying to run them on each push: https://docs.github.com/en/actions/reference/events-that-trigger-workflows#schedule. There even looks to be a way to trigger jobs manually. And of course the integration would be minimal: no need to build some properly authenticated HTTPS webhook server to handle GitHub hooks into testing-framework or whatever.
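Sketched concretely, a scheduled workflow on a self-hosted runner might look like the fragment below. The file name, cron time, timeout, and benchmark script are all illustrative assumptions, not anything that exists in this repo; only `schedule`, `workflow_dispatch`, and `runs-on: self-hosted` are the standard GitHub Actions mechanisms mentioned above.

```yaml
# .github/workflows/benchmarks.yml (hypothetical file name)
name: nightly-benchmarks
on:
  schedule:
    - cron: '0 2 * * *'    # every night at 02:00 UTC
  workflow_dispatch: {}    # allows manual triggering from the Actions tab
jobs:
  benchmarks:
    runs-on: self-hosted   # picked up by the machine's registered runner
    timeout-minutes: 1440  # well beyond the 6 h limit of hosted runners
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/run-benchmarks.sh  # hypothetical benchmark entry point
```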
Does it make sense to look at something like https://www.jenkins.io/, or do we make our own? A simple implementation would probably be some Node.js server acting as an endpoint that reacts to the GitHub commit hook.
Ok, that looks like an easy option.
If we do our own, there'd be no limits, and one could think about more sophisticated prioritization strategies.
I would be very cautious about trying to roll something decent from scratch. If we really need something beyond those limits, then it still might be worth looking at Jenkins or something else existing and mature. For example, Jenkins even seems to have a plugin for bisecting, although I'm not sure how necessary such functionality would be. If we already do nightly benchmarks, then there's probably not that much to bisect. And even if there is a need, one can bisect a single benchmark or a handful of benchmarks locally by hand, instead of having to bisect with an entire 12 h suite or whatever.
Yea, just some greenfield thinking, but likely the devil is in the details 😄
Moved it over here, as it seems more appropriate here.
We now have a minimum working version of this running on |
GitHub Actions are fine for running the regression tests, but we also want something to track performance (and precision) for long-running benchmarks.
Originally posted by @michael-schwarz in goblint/analyzer#234 (comment):
Something along these lines was supposed to be the outcome. Basing it on this benchexec framework has the advantage that it is the same setup as for SV-Comp, so all those tests work out of the box and our own tests can be integrated without too many issues. Also, this `tablegen` tool would in theory give us a nice diff of what changed between runs (or configurations) that could simply be served at some URL, so one can look at the results without having to ssh to the machine. One probably wants some glue code so that this is not all shell scripts but a bit more robust. But the idea was exactly this.
This is a bit optimistic, given that one of these runs will likely take >12 h (at least for SV-Comp), even on the new hardware.