
[CI] Idea: Merge gating #103180

Closed
brianseeders opened this issue Jun 23, 2021 · 17 comments
Labels
Feature:CI Continuous integration impact:low Addressing this issue will have a low level of impact on the quality/strength of our product. loe:small Small Level of Effort Team:Operations Team label for Operations Team

Comments

@brianseeders
Contributor

brianseeders commented Jun 23, 2021

Overview

The Not Rocket Science Rule Of Software Engineering:

Automatically maintain a repository of code that always passes all the tests

The Kibana Operations team has been brainstorming around the idea of implementing this rule for the Kibana repository. The high-level idea is that CI for master is always green, and that this is achieved by merging code using automated processes (instead of humans clicking the Merge button). Nothing gets merged unless it's guaranteed to work, assuming no network/dependency issues, etc.

The high-level process

  • Instead of merging a PR, a human marks it as "ready to merge"
  • A bot automatically creates a branch that contains the latest master commit + any PRs that have recently been marked as "ready to merge"
  • The full CI pipeline runs for this branch
  • PRs marked as "ready to merge" during this time are queued up for testing later
  • If CI succeeds, master is fast-forwarded to the branch, effectively merging the PRs, and the PRs will automatically be marked as "merged" in GitHub
  • The next batch of PRs is tested, and the cycle continues

If CI fails:

  • The failing batch of PRs is kicked out of the queue, and is automatically bisected to find the failing PR
  • The merge queue proceeds as normal, without the failing PRs
  • When a failing PR is found, it is kicked back to the PR author for changes, and the other PRs are re-added to the normal merge queue

We have discussed many more implementation details. This is just a high-level overview.
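
For illustration only, here is a minimal TypeScript sketch of the batching/bisection loop described above. Every type and helper in it (createCandidateBranch, runFullCi, fastForward, notifyAuthor) is a hypothetical stand-in for bot/CI glue, not existing Kibana code.

```ts
interface PullRequest {
  number: number;
  headSha: string;
}

// Hypothetical stand-ins for the real bot/CI plumbing.
async function createCandidateBranch(base: string, prs: PullRequest[]): Promise<string> {
  // e.g. create a throwaway branch containing latest `base` + each PR, in order
  return `gate/${base}-${prs.map((pr) => pr.number).join('-')}`;
}
async function runFullCi(branch: string): Promise<boolean> {
  // trigger the full pipeline for `branch` and wait for the result
  return true;
}
async function fastForward(target: string, branch: string): Promise<void> {
  // fast-forward `target` (e.g. master) to the tested `branch`
}
function notifyAuthor(pr: PullRequest): void {
  // kick the PR back to its author: comment and remove the "ready to merge" label
}

// Bisect a failing batch: test each half against the base; recurse into
// halves that still fail until the offending PR(s) are isolated.
async function bisectBatch(
  base: string,
  batch: PullRequest[]
): Promise<{ passing: PullRequest[]; failing: PullRequest[] }> {
  if (batch.length === 1) {
    return { passing: [], failing: batch };
  }
  const mid = Math.ceil(batch.length / 2);
  const passing: PullRequest[] = [];
  const failing: PullRequest[] = [];
  for (const half of [batch.slice(0, mid), batch.slice(mid)]) {
    if (await runFullCi(await createCandidateBranch(base, half))) {
      passing.push(...half);
    } else {
      const result = await bisectBatch(base, half);
      passing.push(...result.passing);
      failing.push(...result.failing);
    }
  }
  return { passing, failing };
}

async function processQueue(queue: PullRequest[]): Promise<void> {
  while (queue.length > 0) {
    // Everything currently marked "ready to merge" becomes one batch;
    // PRs marked during the run wait for the next iteration.
    const batch = queue.splice(0, queue.length);
    const candidate = await createCandidateBranch('master', batch);

    if (await runFullCi(candidate)) {
      // Master is fast-forwarded to the tested commit, so the merged
      // history is exactly what CI validated and the PRs show as merged.
      await fastForward('master', candidate);
    } else {
      const { passing, failing } = await bisectBatch('master', batch);
      failing.forEach((pr) => notifyAuthor(pr));
      // The innocent PRs rejoin the queue for the next batch.
      queue.unshift(...passing);
    }
  }
}
```

Note that bisecting halves against the base in isolation is enough to find a single bad PR; a batch that only fails in combination would need extra handling.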

Pros

(In theory)

  • The HEAD of master (or another tracked branch) is always build-able
  • PRs almost never need to be re-run for problems not related to the PR itself
    • e.g., no more merging upstream just because of an unrelated failure
  • Humans need to do very little failure triage as the process is fully automated
  • Reverts and skipped tests should rarely happen on tracked branches, except for flaky tests

Cons

  • Merging a PR potentially takes a long time, depending on what's currently in the queue, or if it happens to be batched with a bad PR
  • The bot will likely need to squash the commits in a PR before the merge process runs, replacing the commits on the PR with a single squashed commit (GitHub would show something like "mergebot force-pushed the pr branch from aabbccdd to aabbccee")
    • The original PR author will still be the commit author, but their commit won't be signed by them
  • Backports will be a little delayed because PRs need time to merge. Auto-backport will still work but be delayed.

Prior Art / Links

@brianseeders brianseeders added Team:Operations Team label for Operations Team Feature:CI Continuous integration labels Jun 23, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-operations (Team:Operations)

@leathekd

Nice- I'm excited to see how this turns out!

BTW, this is, essentially, the main feature of Zuul CI, as well. Perhaps there is some further inspiration or implementation details that will be useful to you here: https://zuul-ci.org/docs/zuul/discussion/gating.html Your description already sounds pretty similar, though.

@drewr

drewr commented Jun 24, 2021

@brianseeders Thanks for rebooting this!

@nkammah @kseniia-kolpakova @cachedout @alpar-t Maybe we can do this more generally?

@v1v
Member

v1v commented Jun 25, 2021

Heads up, one of the reasons we enabled mergify was to support a merge queue for the beats project in the future; some details were added in the comments on elastic/beats#24608 (comment)

If it helps, mergify is a free service for open source projects, and they provide a merge-queue implementation, plus some other automation that you can build on top of it. OTOH, mergify can be extended and they are happy to take contributions in case there are any missing features.

@brianseeders
Contributor Author

@v1v the main problem with mergify for us is that it doesn't seem to support parallelization or batches. Our CI just takes too long to support a serial one-at-a-time queue.

@alpar-t

alpar-t commented Jun 28, 2021

We have been playing around with the idea in Cloud too. I think the process described in this issue would work well for Cloud as well; the only additional feature I see us needing is the ability to manually override the queue or move a PR to the front (maybe both). The reason is that we may have infrastructure PRs that we need to be able to merge very quickly, since we might be responding to an incident in ESS.

For us, an added major benefit would be that we wouldn't need to come up with tools and processes to triage and route failures on the master branch, as these should be extremely rare and would likely have more to do with CI infrastructure than with the actual change. In fact, once we gain enough confidence, it doesn't really make sense to run builds on master any more.

Since we have build avoidance with Gradle, and assuming that we still run the PR checks as they are and run that (or a potentially extended) set of tests on merge, I'm not sure to what extent it would be worth investing in a complex solution that tries to optimistically parallelize the merges. Most of the checks would already have been run on the PR (especially if master hasn't diverged in meaningful ways), so it won't take long to assess whether a PR can be merged, and we don't really have that many PRs all merging at once, nor much urgency around it most of the time. It would be nice for sure, but I'm not sure it's something we absolutely need in a first implementation.

The challenges we identified are flaky tests and dependencies on external resources on the internet, both of which can cause a lot of back and forth and reduce trust in the process. Sometimes these dependencies disappear from the internet, and they would result in a failure popping up unjustly (e.g. on a PR that changes a completely different part of the code) that the PR author is not best suited to handle. We are addressing these separately, but I think they need to be addressed before the approach outlined here can be successful.

I'm curious about the advantages of the proposed implementation that uses a new branch. I was thinking we could also automate it by having the bot "click" the merge button when it knows things are OK to merge, merging master in as many times as it needs to; that would give more visibility into the process and keep things more familiar for folks. I think it would also keep GitHub as the single UI for interacting with the process, since we could link the individual builds to the individual commits on the PR.

@brianseeders
Contributor Author

the ability to manually override the queue or move a PR to the front (maybe both)

If we do it, we're planning to have a way to tag a PR as "emergency", which would immediately bump it to the front of the queue and let it run in a batch by itself, isolated from other PRs. Pushing straight to master is also an option, which would cause all of the PRs in the queue to restart testing in our current proposal.

I'm curious about the advantages of the proposed implementation that uses a new branch

A branch is simply a place to put the commits that need to be tested. If PRs A, B, and C are going to be tested in a batch together, we need a commit that contains latest master -> A -> B -> C, which we would put on a separate branch.

automate it by having the bot "click" the merge button

We considered this option as well. The main downside is that the commits that are tested by the gating system will be different than the commits that ultimately end up in master. If we simply fast-forward master to the latest tested batch, the PRs will still show up as merged, and master will contain exactly the commits that were tested. Doing it this way will simplify some of the implementation details, as commits will get created at the beginning and then essentially flow through the system to the end. The big downside is that we are going to be squashing commits inside the PR before we test, since we won't be able to do a real squash merge.
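
To make that concrete, here is a rough sketch of the git plumbing the bot could run for a batch of three PRs. The branch names, local PR refs, and exact commands are assumptions for illustration, not an actual design; it assumes the bot is allowed to force-push PR branches and fast-forward master.

```ts
// Illustrative only; refs and commands are assumptions, not an actual implementation.
import { execSync } from 'child_process';

const git = (cmd: string) => execSync(`git ${cmd}`, { stdio: 'inherit' });

// 1. Start the candidate branch from the latest master commit.
git('fetch origin');
git('checkout -B gate/batch origin/master');

// 2. Squash each PR onto the candidate and force-push the PR branch, so the
//    PR head is the exact commit being tested (the signed-commit caveat
//    mentioned under "Cons" above).
for (const pr of ['pr-a', 'pr-b', 'pr-c']) {
  git(`checkout ${pr}`);
  git('rebase gate/batch'); // replay the PR on top of the candidate
  git('reset --soft gate/batch'); // collapse it to a single staged change
  git(`commit -m "Squashed ${pr}"`);
  git(`push --force origin HEAD:${pr}`); // PR branch now matches the tested commit
  git('checkout gate/batch');
  git(`merge --ff-only ${pr}`); // stack the squashed commit onto the candidate
}

// 3. The full CI pipeline runs against gate/batch here.

// 4. On success, fast-forward master to the tested commits. Because every PR
//    head commit is now contained in master, GitHub marks the PRs as merged.
git('push origin gate/batch:master');
```

The key property is that master only ever advances to commits that CI has already validated.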

@joshdover
Contributor

@v1v the main problem with mergify for us is that it doesn't seem to support parallelization or batches. Our CI just takes too long to support a serial one-at-a-time queue.

I believe they do support batches, but they call them Speculative Checks

@cachedout
Contributor

I came across another service today that's trying to fit this niche as well. Could be worth evaluating: https://mergequeue.com/

@tylersmalley tylersmalley added 1 and removed 1 labels Oct 11, 2021
@exalate-issue-sync exalate-issue-sync bot added impact:low Addressing this issue will have a low level of impact on the quality/strength of our product. loe:small Small Level of Effort labels Nov 5, 2021
@tylersmalley
Contributor

Closing this issue, thanks for everyone's input. Seeing as the GitHub solution does not support squash merges, we are going to continue with this. The next phase will be generating the project specification, and we will ping folks here once that's available.

@nkammah
Contributor

nkammah commented Feb 26, 2022

@tylersmalley given how many teams have expressed interest in such a feature (Cloud Delivery cc @alpar-t @jeredding, O11y cc @v1v @KseniaElastic), would you be open to approaching this collaboratively? If so, let's sync up offline on the best collaboration model.

@cachedout
Contributor

@nkammah FYI: you meant to ping @kseniia-kolpakova instead. :)

We'd probably be interested in some exploratory discussions on this topic, but it's not a top priority ATM.

@jbudz
Member

jbudz commented Jul 13, 2023

https://github.blog/2023-07-12-github-merge-queue-is-generally-available/

Squash and merge does appear to be an option now.

@jbudz jbudz reopened this Jul 13, 2023
@mistic
Member

mistic commented Jul 13, 2023

@jbudz thanks for the heads up here. We have an item on our roadmap to look at this, but I was waiting for that GA announcement. Aside from that, when the time comes to look into this, we should test it in a smaller repo first, because a couple of months ago when we were looking into it there were still some bugs that would prevent us from using it for Kibana.

@watson
Contributor

watson commented Oct 19, 2023

We just hit yet another example of where a merge queue would have saved us from two PRs landing in main with conflicting logic changes, which in turn failed CI on main: #168735 and #166664 (resulting CI failure)

  • Both PRs edit x-pack/plugins/infra/server/lib/alerting/metric_threshold/metric_threshold_executor.test.ts
  • PR #168735 adds an extra reference to STARTED_AT_MOCK_DATE const
  • PR #166664 removes all mentions of STARTED_AT_MOCK_DATE from the file
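
For illustration, here is a simplified, hypothetical reconstruction of the conflict (not the real test file, and the date value is made up): each PR passes CI on its own against the old main, but their combination no longer compiles, which is exactly the situation a merge queue would have tested before merging.

```ts
import { describe, expect, it } from '@jest/globals';

// Removed entirely by #166664, along with all of its existing uses.
const STARTED_AT_MOCK_DATE = new Date('2023-09-28T13:00:00.000Z');

describe('metric threshold executor', () => {
  // New assertion added by #168735, which still references the constant;
  // once both PRs land, this reference points at a symbol that no longer exists.
  it('uses the mocked startedAt date', () => {
    expect(STARTED_AT_MOCK_DATE.getTime()).toBeGreaterThan(0);
  });
});
```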

@mistic mistic self-assigned this Oct 23, 2023
@v1v
Member

v1v commented Feb 6, 2024

Folks, we have tried to use the merge queue from GitHub in the elastic/apm-server repository but found that the current CLA check is blocking the merge queue.

[Screenshot: the CLA status check blocking the merge queue]

If I recall correctly, you enabled this back in the day but reverted it. Did you happen to solve the CLA check issue?

@mistic
Member

mistic commented Mar 19, 2024

After researching, debating, and trialing what would be needed on our side to enable a merge queue, we have decided not to go forward with it this time. The main reason is that we concluded the costs of implementing one exceed its benefits: the merge queue would improve an uncommon situation at the cost of adding complexity to the most common use case, on top of a significant increase in Buildkite minutes, which we are trying to reduce at the moment.

@mistic mistic closed this as completed Mar 19, 2024