Deflake etcd tests #13167
Comments
This sounds like a pretty interesting thing and also like a thing that alleviates a lot of pain and improves developer experience ! |
I was able to get a basic bash script using GitHub GraphQL API - https://github.com/karuppiah7890/issues-info/blob/main/etcd-io/etcd/issue-13167/find-flaky-tests-data.sh . It gives data like this - https://github.com/karuppiah7890/issues-info/blob/main/etcd-io/etcd/issue-13167/commit-and-check-data.json |
I'm able to get the number of successes and we can get failures too. Given total (for example 100) and any one of those (successes / failures), we get the other value too |
Great! Would you be interested in sending a PR that adds it to etcd?
Sure @serathius ! I was also wondering if I should try out a golang script too, so anyone can run it with just "go run" or similar on any platform. No need to worry about OS, bash shell being available, other tools being available etc. What do you think? |
Letting everyone run it is a good initiative, but long term we should just automate it. Most scripts are already written in bash and I don't think there is any need to invest too much in this script. It should be simple enough (2-3 commands) that it could be replaced when needed. I think it would make sense to revisit those improvements once we have established the whole process and automated it.
Makes sense @serathius ! 👍 I'll raise the PR and we can discuss more about the bash script as part of the PR |
This is to start measuring test flakiness and to see the numbers improve as we deflake flaky tests. Fixes etcd-io#13167
…ommits with failed status The workflow runs on a cron schedule on a weekly basis - once every week Fixes etcd-io#13167
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions. |
commenting to avoid closing of issue |
I hacked together a tool for finding/tracking/fixing flakes the other day: https://github.com/endocrimes/etcd-test-analyzer Because it parses all of the test results from every run in a given time period, it makes it relatively easy to modify to ask new questions in place, but definitely isn't a tool that is widely useful in its current form. |
Status update, running
So in the last 100 merged commits we got 24 test failures. Excluding 7 coverage failures (not blocking merge) and 2 recent failures due to post-merge bug #14101, we get 14% flakiness. Going down from 50% to 14% is a great result!
Looking into failures from last 100 runs (excluding coverage and known issues) we get failures in:
|
As there are a lot of tests, it would be great to get some help. Please let me know if you are interested in tackling one of the tests listed.
This should aid in debugging test flakes, especially in tests where the process is restarted very often and thus changes its pid. Now it's a lot easier to grep for different members, also when different tests fail at the same time. The test TestDowngradeUpgradeClusterOf3 as mentioned in etcd-io#13167 is a good example for that. Signed-off-by: Thomas Jungblut <[email protected]>
Status: 28% flakiness |
I noticed a recent increase in flakes (at least in my PRs). From https://github.com/etcd-io/etcd/actions/runs/4394774437/jobs/7696017126 we see 26% flakiness. Loved the recent initiative by @chaochn47 to use tools developed by @endocrimes in #15501. It would be great to integrate them into https://github.com/etcd-io/etcd/actions/workflows/measure-test-flakiness.yaml
Yeah, I can help add to the existing workflow. ETA next Monday |
Hi, I'd like to work on this! |
Thanks @nitishfy for your interest. The issue was created some time ago, so not everything is up to date; however, the high-level goals remain relevant. We want to improve our visibility into test flakes so we can fix them more effectively.
For the original plan, we instrumented etcd e2e tests to export JUnit reports, and @endocrimes and @karuppiah7890 implemented custom scripts to analyse them. This approach allowed us to start reporting flakes and manually creating issues to fix them.
One thing we can do better is to avoid developing our own scripting. The etcd community is not very big, so we want to avoid spreading ourselves too thin maintaining too many custom tools. With the introduction of SIG-etcd we now have the option to benefit from the whole ecosystem of tools built by the Kubernetes community. We should do that. One example of such a tool is testgrid, a test result visualization tool that uses the same JUnit reports to create a grid showing which tests passed and which failed. It makes it really easy to track flakes. For example: https://testgrid.k8s.io/sig-etcd-periodics#ci-etcd-e2e-amd64
I think we should work more on integrating with K8s tools. This first requires migrating etcd testing to Prow, the K8s CI tool. This work can be tracked in kubernetes/k8s.io#6102.
In the meantime we could ensure that all etcd tests generate a JUnit report that can be used later. Looking at github workflows only in https://github.com/etcd-io/etcd/blob/main/.github/workflows/tests-template.yaml etcd/.github/workflows/tests-template.yaml Lines 69 to 73 in 11ff264
|
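To illustrate how a JUnit report can feed flake tracking, here is a minimal Python sketch that reads a report and records the status of each test case. The sample XML, the test names and the `summarize` helper are all hypothetical and only follow the common JUnit XML layout (`testsuite` / `testcase` / `failure`); the reports etcd actually produces may differ in detail.

```python
import xml.etree.ElementTree as ET

# Hypothetical report in the common JUnit XML layout; not real etcd output.
SAMPLE_REPORT = """<testsuites>
  <testsuite name="example/e2e" tests="3" failures="1">
    <testcase classname="example/e2e" name="TestHypotheticalPut" time="0.5"/>
    <testcase classname="example/e2e" name="TestHypotheticalWatch" time="1.2">
      <failure message="timed out">context deadline exceeded</failure>
    </testcase>
    <testcase classname="example/e2e" name="TestHypotheticalSkipped" time="0.0">
      <skipped/>
    </testcase>
  </testsuite>
</testsuites>
"""


def summarize(report_xml: str) -> dict:
    """Map each test case name to 'passed', 'failed' or 'skipped'."""
    root = ET.fromstring(report_xml)
    statuses = {}
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            status = "failed"
        elif case.find("skipped") is not None:
            status = "skipped"
        else:
            status = "passed"
        statuses[case.get("name")] = status
    return statuses


if __name__ == "__main__":
    for name, status in summarize(SAMPLE_REPORT).items():
        print(f"{status:8} {name}")
```

Collecting these per-run summaries over time gives the same pass/fail grid that testgrid renders, which is one reason to reuse the existing tooling instead of maintaining a script like this.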
If we look at test results for commits on the main branch since we migrated to GitHub Actions, we get:
Where failure/success is based on the green check vs. red cross under the commit message (commits without either were not tested, as they were part of a multi-commit PR).
Those are all test failures on the main branch, i.e. after a PR passed tests and was approved. We can use those failures to estimate the chance of any PR failing tests purely due to test flakes.
A flakiness ratio of over 50% means that the average PR needs to be run 2 times, but sequences of failures may be much longer; 3-5 failures in a row is not uncommon. This can be frustrating, especially for new contributors, as there is no easy way to retrigger tests (you need to amend with an empty commit and push).
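As a rough check of the "2 runs on average" figure, the sketch below assumes every test run fails independently with the same probability, which is a simplification (real flakes may be correlated). Under that assumption the number of runs until the first pass follows a geometric distribution, so the expected number of runs is 1 / (1 - flake rate). The function names are illustrative only.

```python
def expected_runs(flake_rate: float) -> float:
    """Expected number of CI runs until the first pass (geometric distribution)."""
    if not 0.0 <= flake_rate < 1.0:
        raise ValueError("flake_rate must be in [0, 1)")
    return 1.0 / (1.0 - flake_rate)


def prob_streak(flake_rate: float, k: int) -> float:
    """Probability that a PR fails at least k times in a row before passing."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return flake_rate ** k


if __name__ == "__main__":
    for rate in (0.5, 0.22, 0.1):
        print(
            f"flake rate {rate:.0%}: "
            f"expected runs {expected_runs(rate):.2f}, "
            f"P(3+ failures in a row) {prob_streak(rate, 3):.1%}"
        )
```

At a 50% flake rate this gives 2 expected runs, and a streak of three or more failures has probability 0.5^3 = 12.5%, so roughly one PR in eight would hit it.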
Proposal
The etcd community should settle on a test flakiness target, measure it, and establish a process to fix flaky tests.
To start, I would propose targeting a 10% failure rate for the whole test suite. It should be reachable by fixing only a couple of tests, as the last runs gave us 22% (7 out of the last 32). Measuring flakiness could start with something simple, for example running a script once a week that checks the last 100 test results. If the measured flakiness is over our target, we should identify the most flaky tests, create issues for them, and encourage the community to fix them.
For the first couple of runs we could rely on executing the scripts manually, but we should plan to automate them.
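The measurement step proposed above could look roughly like the following Python sketch. It is not the script that was later added to etcd: the input format (one list of failed test names per run, empty meaning the run passed) and the test names are assumptions made for illustration. The sample reproduces the 7-out-of-32 figure quoted above.

```python
from collections import Counter

TARGET = 0.10  # the 10% failure-rate target proposed above


def flakiness(runs):
    """Fraction of runs with at least one failed test."""
    if not runs:
        raise ValueError("no runs to measure")
    return sum(1 for failed in runs if failed) / len(runs)


def most_flaky(runs, top=3):
    """Tests ranked by the number of runs in which they failed."""
    counts = Counter(name for failed in runs for name in set(failed))
    return counts.most_common(top)


if __name__ == "__main__":
    # Hypothetical data: 25 passing runs and 7 failing ones (7 out of 32).
    runs = [[] for _ in range(25)] + [
        ["TestA"], ["TestA"], ["TestA"], ["TestA"],
        ["TestB"], ["TestB"],
        ["TestA", "TestC"],
    ]
    rate = flakiness(runs)
    print(f"flakiness: {rate:.0%} (target {TARGET:.0%})")
    if rate > TARGET:
        for name, count in most_flaky(runs):
            print(f"  {name}: failed in {count} runs")
```

Ranking by the number of failing runs, rather than raw failure count, keeps a single run with many retries of one test from dominating the list.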
TODO:
cc @hexfusion @Rajalakshmi-Girish