
Improve efficiency of CI checks (so we can add MORE!) #13845

Open
Tracked by #13813
alamb opened this issue Dec 19, 2024 · 8 comments
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Dec 19, 2024

Is your feature request related to a problem or challenge?

There is a tension between adding more tests and PR / code velocity (more tests --> longer CI)

DataFusion runs many tests on every change to every PR. For example, my most recent PR ran 24 tests (link), consuming over an hour of worker time.

This has several challenges:

  • Increases CI cycle time, and thus how quickly contributors get feedback on their PRs
  • Limits the new tests we can add: as we contemplate adding even more testing (which we should), such as the sqllogictest suite, adding it to CI would make the time even worse
  • Wastes worker credits (we largely don't run out of credits because the Apache Software Foundation gets approximately unlimited time from github, but there are broader concerns, like the amount of CO2 generated by that waste 🙈)

Another observation is that there are several tests that rarely fail in PRs but offer important coverage, such as the Windows and Mac tests. We even disabled the Windows test due to its (lack of) speed.

Describe the solution you'd like

I would like to improve the efficiency of existing CI jobs, as well as have a mechanism to run both new and existing tests that offer important coverage but take too long to run on every CI invocation.

Describe alternatives you've considered

Here are some options from my past lives:

Option 1: Change some tests to only run on merges to main

In this option, we could change some checks to only run on merges to main. For example, we could run the windows tests only on merges to main rather than also on PRs.

Instead of all jobs being triggered like

# trigger for all PRs and changes to main
on:
  push:
    branches:
      - main
  pull_request:

we would change some tests to run like

# trigger for only changes to main
on:
  push:
    branches:
      - main

pros:

  • simple to implement (change github branch matching rules)

cons:

  • PRs that broke the tests could merge to main, so the tests would be broken and someone would have to chase / fix them
  • Someone (maintainers?) would have to triage and fix these tests

We could probably add some sort of job that automatically makes revert PRs for any code that broke the main-only tests, to help with triage (a lighter-weight sketch follows below).
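The sketch below files a tracking issue (rather than an automatic revert PR) whenever a main-only check fails. It is illustrative only: the workflow and job names are hypothetical and it has not been tested against DataFusion's CI, though actions/github-script is a real action.

# hypothetical: file a tracking issue when a main-only check fails
name: main-only tests
on:
  push:
    branches:
      - main

jobs:
  windows-tests:
    runs-on: windows-latest
    steps:
      - uses: actions/checkout@v4
      - run: cargo test --workspace
  file-issue-on-failure:
    needs: windows-tests
    if: failure()
    runs-on: ubuntu-latest
    steps:
      - uses: actions/github-script@v7
        with:
          script: |
            // open an issue pointing at the failing run so someone can triage it
            await github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: `main-only tests failed at ${context.sha}`,
              body: `See run https://github.com/${context.repo.owner}/${context.repo.repo}/actions/runs/${context.runId}`,
            });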

Option 2: Implement more sophisticated merge flow

In this option, I would imagine a workflow like

  1. Open a PR
  2. Subset of tests pass
  3. PR is "marked for merge" somehow
  4. Additional longer CI checks run and if pass PR is merged, if fail PR is "unmarked for merge"

This might already be supported by github in Merge Queues

There are probably bots like https://github.com/rust-lang/homu that could automate something like this
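For reference, GitHub's merge queue delivers a dedicated merge_group event to Actions, so a check can be made to run only when a PR enters the queue. A minimal sketch (not tested against DataFusion's actual workflows; the test invocation is a placeholder):

# extended checks that run only when a PR is queued for merge,
# not on every push during review
on:
  merge_group:

jobs:
  extended-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cargo test --workspace

Branch protection can then require this check, so a PR only lands if the queue run passes.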

pros:

  • more automation / less human intervention

cons:

  • More complicated to set up / maintain

Option 3: Your idea here

Additional context

No response

@findepi
Member

findepi commented Dec 19, 2024

In this option, we could change some checks to only run on merges to main. For example, we could run the windows tests only on merges to main rather than also on PRs.

Running tests after a merge has a problem: who is going to look at the results?
And who is going to fix things if something breaks?
Nightly tests seem suitable for stress tests and scale benchmarks, which inherently need to take time to be meaningful.

It seems to me that we should either give merge queue a shot, or at least understand why we're not doing that.

@comphead
Contributor

I agree with @findepi, once it is merged it is a deferred problem and much harder to find the root cause.
My initial naive thought was to have smoke tests/correctness tests on PRs and more advanced things like benchmarks before the release, but this approach has the same flaw of deferred problem detection.

The challenge is that we have focused on moving most test cases to sqllogictests, which is good and avoids test duplication. But the other side of the coin is that we cannot unit test some packages without end-to-end tests, which are sqllogictests, and the latter require all of DataFusion to be compiled.

@alamb
Contributor Author

alamb commented Dec 19, 2024

I agree with @findepi, once it is merged it is a deferred problem and much harder to find the root cause.

Yes, I also 100% agree.

My initial naive thought was to have smoke tests/correctness tests on PRs and more advanced things like benchmarks before the release, but this approach has the same flaw of deferred problem detection.

Maybe we could still run the more advanced tests before merging a PR, but would not run them all during development, review, and iteration.

This would ensure that if a PR introduced a problem it wouldn't be merged to main, so the error is not deferred.

I expect in most cases the "extended" tests will simply pass and things will be fine. Occasionally the extended tests would fail and we would have to dig into why before the PR was merged.

It seems like wishful thinking though, as I am not sure there is capacity to create and maintain such a workflow 🤔
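One relatively low-maintenance approximation, sketched here with a hypothetical run-extended-tests label (not an existing DataFusion convention), would be to gate the long jobs on a PR label, so they run only once a PR is considered ready:

# hypothetical: run extended tests only when a PR carries the
# 'run-extended-tests' label
on:
  pull_request:
    types: [labeled, synchronize]

jobs:
  extended-tests:
    if: contains(github.event.pull_request.labels.*.name, 'run-extended-tests')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cargo test --workspace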

@alamb
Contributor Author

alamb commented Dec 19, 2024

I was also thinking that this "when to run more extended tests" question may have stopped the great work @2010YOUY01 did running SQLancer.

I suspect it is somewhat thankless work to run, triage, and file those tickets, and it is totally understandable that interests move on.

@findepi
Member

findepi commented Dec 20, 2024

Maybe we could still run the more advanced tests before merging a PR, but would not run them all during development, review, and iteration.

This would ensure that if a PR introduced a problem it wouldn't be merged to main, so the error is not deferred.

That's exactly what merge queues are supposed to do.

benchmarks before the release, but this approach has the same flaw of deferred problem detection.

Note that benchmarks are not qualitative, but quantitative.
It's possible to automatically catch big regressions without false positives, but it's harder to catch small regressions.

If we have a merge queue, we should be able to run some benchmarks, but not necessarily all benchmarks.
With a merge queue I would still hope for an "interactive developer experience", i.e. that the PR gets merged not long after it is scheduled for a merge.

@alamb
Contributor Author

alamb commented Dec 21, 2024

It sounds like "implement a basic merge queue" would be a good next step, then!

@Omega359
Contributor

Maintaining an extended workflow shouldn't be too bad tbh. I think having a workflow that runs outside of PRs (in other words, runs nightly) could be useful as well for expensive tests that rarely break (or just take a long time to run). Think sqlite tests, etc.
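For what it's worth, a nightly trigger is cheap to express in Actions; a minimal sketch (the cron time and test target are placeholders, not DataFusion's real configuration):

# hypothetical nightly run for expensive, rarely-failing suites
on:
  schedule:
    - cron: '0 4 * * *'   # every day at 04:00 UTC
  workflow_dispatch:      # also allow manual runs for triage

jobs:
  sqlite-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cargo test --test sqllogictests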

@alamb
Contributor Author

alamb commented Dec 22, 2024

Maintaining an extended workflow shouldn't be too bad tbh. I think having a workflow that runs outside of PRs (in other words, runs nightly) could be useful as well for expensive tests that rarely break (or just take a long time to run). Think sqlite tests, etc.

I think the cost / annoyance will come when the tests start failing: someone has to care enough to look into the failure and triage / file tickets.
