
[CI] Idea: Merge gating #103180

Closed
brianseeders opened this issue Jun 23, 2021 · 17 comments
Labels
Feature:CI Continuous integration impact:low Addressing this issue will have a low level of impact on the quality/strength of our product. loe:small Small Level of Effort Team:Operations Team label for Operations Team

Comments

@brianseeders
Contributor

brianseeders commented Jun 23, 2021

Overview

The Not Rocket Science Rule Of Software Engineering:

Automatically maintain a repository of code that always passes all the tests

The Kibana Operations team has been brainstorming around the idea of implementing this rule for the Kibana repository. The high-level idea is that CI for master is always green, and that this is achieved by merging code using automated processes (instead of humans clicking the Merge button). Nothing gets merged unless it's guaranteed to work, assuming no network/dependency issues, etc.

The high-level process

  • Instead of merging a PR, a human marks it as "ready to merge"
  • A bot automatically creates a branch that contains the latest master commit + any PRs that have recently been marked as "ready to merge"
  • The full CI pipeline runs for this branch
  • PRs marked as "ready to merge" during this time are queued up for testing later
  • If CI succeeds, master is fast-forwarded to the branch, effectively merging the PRs, and the PRs will automatically be marked as "merged" in GitHub
  • The next batch of PRs is tested, and the cycle continues

If CI fails:

  • The failing batch of PRs is kicked out of the queue, and is automatically bisected to find the failing PR
  • The merge queue proceeds as normal, without the failing PRs
  • When a failing PR is found, it is kicked back to the PR author for changes, and the other PRs are re-added to the normal merge queue

We have discussed many more implementation details. This is just a high-level overview.
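
For illustration only, here is a minimal TypeScript sketch of the batching/bisection loop described above. Every type and helper in it (createCandidateBranch, runFullCi, fastForward, notifyAuthor) is a hypothetical stand-in for bot/CI glue, not existing Kibana code.

```ts
interface PullRequest {
  number: number;
  headSha: string;
}

// Hypothetical stand-ins for the real bot/CI plumbing.
async function createCandidateBranch(base: string, prs: PullRequest[]): Promise<string> {
  // e.g. create a throwaway branch containing latest `base` + each PR, in order
  return `gate/${base}-${prs.map((pr) => pr.number).join('-')}`;
}
async function runFullCi(branch: string): Promise<boolean> {
  // trigger the full pipeline for `branch` and wait for the result
  return true;
}
async function fastForward(target: string, branch: string): Promise<void> {
  // fast-forward `target` (e.g. master) to the tested `branch`
}
function notifyAuthor(pr: PullRequest): void {
  // kick the PR back to its author: comment and remove the "ready to merge" label
}

// Bisect a failing batch: test each half against the base; recurse into
// halves that still fail until the offending PR(s) are isolated.
async function bisectBatch(
  base: string,
  batch: PullRequest[]
): Promise<{ passing: PullRequest[]; failing: PullRequest[] }> {
  if (batch.length === 1) {
    return { passing: [], failing: batch };
  }
  const mid = Math.ceil(batch.length / 2);
  const passing: PullRequest[] = [];
  const failing: PullRequest[] = [];
  for (const half of [batch.slice(0, mid), batch.slice(mid)]) {
    if (await runFullCi(await createCandidateBranch(base, half))) {
      passing.push(...half);
    } else {
      const result = await bisectBatch(base, half);
      passing.push(...result.passing);
      failing.push(...result.failing);
    }
  }
  return { passing, failing };
}

async function processQueue(queue: PullRequest[]): Promise<void> {
  while (queue.length > 0) {
    // Everything currently marked "ready to merge" becomes one batch;
    // PRs marked during the run wait for the next iteration.
    const batch = queue.splice(0, queue.length);
    const candidate = await createCandidateBranch('master', batch);

    if (await runFullCi(candidate)) {
      // Master is fast-forwarded to the tested commit, so the merged
      // history is exactly what CI validated and the PRs show as merged.
      await fastForward('master', candidate);
    } else {
      const { passing, failing } = await bisectBatch('master', batch);
      failing.forEach((pr) => notifyAuthor(pr));
      // The innocent PRs rejoin the queue for the next batch.
      queue.unshift(...passing);
    }
  }
}
```

Note that bisecting halves against the base in isolation is enough to find a single bad PR; a batch that only fails in combination would need extra handling.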

Pros

(In theory)

  • The HEAD of master (or another tracked branch) is always build-able
  • PRs almost never need to be re-run for problems not related to the PR itself
    • e.g., no more merging upstream just because of an unrelated failure
  • Humans need to do very little failure triage as the process is fully automated
  • Reverts and skipped tests should rarely happen on tracked branches, except for flaky tests

Cons

  • Merging a PR potentially takes a long time, depending on what's currently in the queue, or if it happens to be batched with a bad PR
  • The bot will likely need to squash the commits in a PR before the merge process runs, replacing the commits on the PR with a single squashed commit (GitHub would show something like "mergebot force-pushed the pr branch from aabbccdd to aabbccee")
    • The original PR author will still be the commit author, but their commit won't be signed by them
  • Backports will be a little delayed because PRs need time to merge. Auto-backport will still work but be delayed.

Prior Art / Links

@brianseeders brianseeders added Team:Operations Team label for Operations Team Feature:CI Continuous integration labels Jun 23, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-operations (Team:Operations)

@leathekd

Nice- I'm excited to see how this turns out!

BTW, this is, essentially, the main feature of Zuul CI, as well. Perhaps there is some further inspiration or implementation details that will be useful to you here: https://zuul-ci.org/docs/zuul/discussion/gating.html Your description already sounds pretty similar, though.

@drewr

drewr commented Jun 24, 2021

@brianseeders Thanks for rebooting this!

@nkammah @kseniia-kolpakova @cachedout @alpar-t Maybe we can do this more generally?

@v1v
Member

v1v commented Jun 25, 2021

Heads up, one of the reasons we enabled mergify was to support a merge queue for the beats project in the future; some details were added in the comments on elastic/beats#24608 (comment)

If it helps, mergify is a free service for open source projects, and they provide a merge-queue implementation, plus some other automation that you can build on top of it. OTOH, mergify can be extended and they are happy to take contributions in case there are any missing features.

@brianseeders
Contributor Author

@v1v the main problem with mergify for us is that it doesn't seem to support parallelization or batches. Our CI just takes too long to support a serial one-at-a-time queue.

@alpar-t

alpar-t commented Jun 28, 2021

We have been playing around with the idea in Cloud too. I think the process described in this issue would work well for Cloud as well; the only additional feature I see us needing is the ability to manually override the queue or move a PR to the front (maybe both). The reason is that we may have infrastructure PRs that we need to be able to merge very quickly, since we might be responding to an incident in ESS.

For us, an added major benefit would be that we wouldn't need to come up with tools and processes to triage and route failures on the master branch, as these should be extremely rare and would likely have more to do with CI infrastructure than with the actual change. In fact, once we gain enough confidence, it doesn't really make sense to run builds on master any more.

Since we have build avoidance with Gradle, and assuming that we still run the PR checks as they are and run that (or a potentially extended) set of tests on merge, I'm not sure to what extent it would be worth investing in a complex solution that tries to optimistically parallelize the merges. Most of the checks would already have been run on the PR (especially if master hasn't diverged in meaningful ways), so it won't take long to assess whether a PR can be merged, and we don't really have that many PRs all merging at once, nor much urgency around it most of the time. It would be nice for sure, but I'm not sure it's something we absolutely need in a first implementation.

The challenges we identified are flaky tests and dependencies on external resources on the internet, both of which can cause a lot of back and forth and reduce trust in the process. Sometimes these dependencies disappear from the internet, and they would result in a failure popping up unjustly (e.g. on a PR that changes a completely different part of the code) that the PR author is not best suited to handle. We are addressing these separately, but I think they need to be addressed before the approach outlined here can be successful.

I'm curious about the advantages of the proposed implementation that uses a new branch. I was thinking we could also automate it by having the bot "click" the merge button when it knows things are OK to merge, merging master in as many times as it needs to; that would give more visibility into the process and keep things more familiar for folks. I think it would also keep GitHub as the single UI for interacting with the process, since we could link the individual builds to the individual commits on the PR.

@brianseeders
Contributor Author

the ability to manually override the queue or move a PR to the front (maybe both)

If we do it, we're planning to have a way to tag a PR as "emergency", which would immediately bump it to the front of the queue and let it run in a batch by itself, isolated from other PRs. Pushing straight to master is also an option, which would cause all of the PRs in the queue to restart testing in our current proposal.

I'm curious about the advantages of the proposed implementation that uses a new branch

A branch is simply a place to put the commits that need to be tested. If PRs A, B, and C are going to be tested in a batch together, we need a commit that contains latest master -> A -> B -> C, which we would put on a separate branch.

automate it by having the bot "click" the merge button

We considered this option as well. The main downside is that the commits that are tested by the gating system will be different than the commits that ultimately end up in master. If we simply fast-forward master to the latest tested batch, the PRs will still show up as merged, and master will contain exactly the commits that were tested. Doing it this way will simplify some of the implementation details, as commits will get created at the beginning and then essentially flow through the system to the end. The big downside is that we are going to be squashing commits inside the PR before we test, since we won't be able to do a real squash merge.
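
To make that concrete, here is a rough sketch of the git plumbing the bot could run for a batch of three PRs. The branch names, local PR refs, and exact commands are assumptions for illustration, not an actual design; it assumes the bot is allowed to force-push PR branches and fast-forward master.

```ts
// Illustrative only; refs and commands are assumptions, not an actual implementation.
import { execSync } from 'child_process';

const git = (cmd: string) => execSync(`git ${cmd}`, { stdio: 'inherit' });

// 1. Start the candidate branch from the latest master commit.
git('fetch origin');
git('checkout -B gate/batch origin/master');

// 2. Squash each PR onto the candidate and force-push the PR branch, so the
//    PR head is the exact commit being tested (the signed-commit caveat
//    mentioned under "Cons" above).
for (const pr of ['pr-a', 'pr-b', 'pr-c']) {
  git(`checkout ${pr}`);
  git('rebase gate/batch'); // replay the PR on top of the candidate
  git('reset --soft gate/batch'); // collapse it to a single staged change
  git(`commit -m "Squashed ${pr}"`);
  git(`push --force origin HEAD:${pr}`); // PR branch now matches the tested commit
  git('checkout gate/batch');
  git(`merge --ff-only ${pr}`); // stack the squashed commit onto the candidate
}

// 3. The full CI pipeline runs against gate/batch here.

// 4. On success, fast-forward master to the tested commits. Because every PR
//    head commit is now contained in master, GitHub marks the PRs as merged.
git('push origin gate/batch:master');
```

The key property is that master only ever advances to commits that CI has already validated.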

@joshdover
Contributor

@v1v the main problem with mergify for us is that it doesn't seem to support parallelization or batches. Our CI just takes too long to support a serial one-at-a-time queue.

I believe they do support batches, but they call them Speculative Checks

@cachedout
Contributor

I came across another service today that's trying to fit this niche as well. Could be worth evaluating: https://mergequeue.com/

@tylersmalley tylersmalley added 1 and removed 1 labels Oct 11, 2021
@exalate-issue-sync exalate-issue-sync bot added impact:low Addressing this issue will have a low level of impact on the quality/strength of our product. loe:small Small Level of Effort labels Nov 5, 2021
@tylersmalley
Contributor

Closing this issue, thanks for everyone's input. Seeing as the GitHub solution does not support squash merges, we are going to continue with this. The next phase will be generating the project specification, and we will ping folks here once that's available.

@nkammah
Contributor

nkammah commented Feb 26, 2022

@tylersmalley given how many teams have expressed interest in such a feature (Cloud Delivery cc @alpar-t @jeredding, O11y cc @v1v @KseniaElastic), would you be open to approaching this collaboratively? If so, let's sync up offline on the best collaboration model.

@cachedout
Contributor

@nkammah FYI: you meant to ping @kseniia-kolpakova instead. :)

We'd probably be interested in some exploratory discussions on this topic, but it's not a top priority ATM.

@jbudz
Member

jbudz commented Jul 13, 2023

https://github.blog/2023-07-12-github-merge-queue-is-generally-available/

Squash and merge does appear to be an option now.

@jbudz jbudz reopened this Jul 13, 2023
@mistic
Member

mistic commented Jul 13, 2023

@jbudz thanks for the heads up here. We have an item on our roadmap to look at this, but I was waiting for that GA announcement. Aside from that, when the time comes to look into this, we should test it in a smaller repo first, because a couple of months ago when we were looking into it there were still some bugs that would prevent us from using it for Kibana.

@watson
Contributor

watson commented Oct 19, 2023

We just hit yet another example of where a merge queue would have saved us from two PRs landing in main with conflicting logic changes, which in turn failed CI on main: #168735 and #166664 (resulting CI failure)

  • Both PRs edit x-pack/plugins/infra/server/lib/alerting/metric_threshold/metric_threshold_executor.test.ts
  • PR #168735 adds an extra reference to STARTED_AT_MOCK_DATE const
  • PR #166664 removes all mentions of STARTED_AT_MOCK_DATE from the file
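
For illustration, here is a simplified, hypothetical reconstruction of the conflict (not the real test file, and the date value is made up): each PR passes CI on its own against the old main, but their combination no longer compiles, which is exactly the situation a merge queue would have tested before merging.

```ts
import { describe, expect, it } from '@jest/globals';

// Removed entirely by #166664, along with all of its existing uses.
const STARTED_AT_MOCK_DATE = new Date('2023-09-28T13:00:00.000Z');

describe('metric threshold executor', () => {
  // New assertion added by #168735, which still references the constant;
  // once both PRs land, this reference points at a symbol that no longer exists.
  it('uses the mocked startedAt date', () => {
    expect(STARTED_AT_MOCK_DATE.getTime()).toBeGreaterThan(0);
  });
});
```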

@mistic mistic self-assigned this Oct 23, 2023
@v1v
Member

v1v commented Feb 6, 2024

Folks, we have tried to use the merge queue from GitHub in the elastic/apm-server repository but found that the current CLA check is blocking the merge queue.

[Screenshot: the CLA status check blocking the merge queue]

If I recall correctly, you enabled this back in the day but reverted it. Did you happen to solve the CLA check issue?

@mistic
Member

mistic commented Mar 19, 2024

After researching, debating, and trialing what would be needed on our side to enable a merge queue, we have decided not to go forward with it this time. The main reason is that we concluded the costs of implementing one exceed its benefits: the merge queue would improve an uncommon situation at the cost of adding complexity to the most common use case, on top of a significant increase in Buildkite minutes, which we are trying to reduce at the moment.

@mistic mistic closed this as completed Mar 19, 2024