Migrate the robustness tests to prow #18136
Comments
Do we have access to arm nodes in the Prow infra? The last I remember is that we were waiting for them. I don't see any updates regarding this on kubernetes/k8s.io#6102. So, it may be a blocker for the second point. |
Not great, but I will not block the migration regardless. Robustness tests only bring value if there is someone willing to review them. With Prow being much better, no one will be willing to review ARM robustness failures. |
I can see two options: pause running robustness for the ARM architecture (not ideal) or keep the ARM tests running on GitHub Actions. I don't see much activity in kubernetes/k8s.io#6102. Who should we ask, or where would be a good place to ask, for a status update/ETA for ARM nodegroups? |
Hi @upodroid - We spoke at KubeCon EU Paris about a dedicated |
I was thinking about the second option; however, due to the sub-par user experience, I expect it would be equivalent to the first one. |
Discussed on Slack with Arka, we'll be working on the following at the moment:
/assign @ArkaSaha30 @ivanvc |
@ivanvc: GitHub didn't allow me to assign the following users: ArkaSaha30. Note that only etcd-io members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/assign |
Currently, the robustness tests on GitHub Actions run only on main or PRs to main. Do we need to run them on |
There are no robustness tests on branches other than main. We develop and run the robustness tests from the main branch and validate binaries built from older branches. |
We have finished the first and the third tasks. When do you think would be a good time to remove the GitHub action, @serathius? We can't move forward with the second, as we don't have a timeline for when ARM runners are going to be available. |
We can keep arm64 on GitHub Actions so that we are not blocked on it. |
@ArkaSaha30, can you help with
Thanks. |
Update -
|
Looking at most recent full run it says:
Job logs show: {"Time":"2024-08-08T06:47:33.907178941Z","Action":"output","Package":"go.etcd.io/etcd/tests/v3/robustness","Test":"TestRobustnessExploratory/EtcdHighTraffic/ClusterOfSize1","Output":"/home/prow/go/src/github.com/etcd-io/etcd/bin/etcd (/home/prow/go/src/github.com/etcd-io/etcd/bin/etcd_--version) (79484): Git SH{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:173","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.ExecuteProcess","level":"error","msg":"Entrypoint received interrupt: terminated","severity":"error","time":"2024-08-08T06:47:36Z"}
++ early_exit_handler
++ '[' -n 17 ']'
++ kill -TERM 17
++ cleanup_dind
++ [[ false == \t\r\u\e ]]
+ EXIT_VALUE=143
Looks like the job was interrupted? Or is that expected / unrelated output? Job history shows as Edit: Interestingly |
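For anyone reading the log above: `EXIT_VALUE=143` is consistent with the usual shell convention where a process killed by signal N is reported as 128 + N, and 143 - 128 = 15, which is SIGTERM on Linux. That matches the `kill -TERM 17` line and the "Entrypoint received interrupt: terminated" message. A minimal sketch of that arithmetic follows; the helper name `describe_exit` is made up for illustration and is not part of Prow or etcd.

```python
import signal


def describe_exit(exit_value: int) -> str:
    """Interpret a shell-style exit status (hypothetical helper).

    Assumes the common shell convention that a status above 128
    means the process was terminated by signal (status - 128).
    """
    if exit_value > 128:
        signum = exit_value - 128
        try:
            # Map the number to a symbolic name such as SIGTERM.
            name = signal.Signals(signum).name
        except ValueError:
            # Unknown signal number on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {exit_value}"


if __name__ == "__main__":
    print(describe_exit(143))
```

This only explains how the job ended (it was sent SIGTERM); it says nothing about why the termination was requested.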
@jmhbnz, @serathius, are we ready to remove |
We can remove I don't think we can close this yet though, we still have a problem with the Edit: Defer to @serathius as tech lead for robustness for final decision on |
I think we are OK to make the presubmit job blocking.
My high-level question: why do we have separate |
My bad, I thought it was addressed in #17593. I see it's a different issue. It looks like they are consistently aborted at around 80 minutes. Following
I wonder if the ARM node or pods inside the node get rotated after 80m.
I'm unsure about this one. Should we only have |
Just giving an update that I have a thread in #sig-k8s-infra. It looks like the bug is in the infra, not the job itself. |
Link to kubernetes/k8s.io#7241 |
The ARM issues are now solved. There are multiple green runs in prow (https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/ci-etcd-robustness-arm64). @serathius, should we delete |
Don't know the exact differences in the job definition but from those 4 jobs
We only need two: one for amd64 and one for arm. As for the name, I think it would be better to follow the same convention as
|
The difference between the jobs is that
Which one would we need to keep, the one with gofail enabled or the other? |
The GitHub workflows we used to have didn't enable gofail, nor were we building the project. We should keep |
Good spotting @ivanvc. That seems reasonable to me, defer to @serathius for final decision. |
Lack of building and enabling gofail is expected because the difference between targets With the differences cleaned up I think we can leave |
As discussed in etcd-io/etcd#18136, ci-etcd-robustness-{arm64,amd64} were a duplication of the main branch jobs.
I believe the only outstanding task from this issue is marking the pre-submit jobs as blocking. @serathius, do you think we should do this soon, or should we leave them running for a little longer? |
I'll close this issue now since we don't have any outstanding tasks (please reopen if needed). Thanks to everyone who contributed to migrating the robustness tests. |
What would you like to be added?
After the last robustness team meeting, it was clear how superior Prow + TestGrid is to GitHub Actions.
https://testgrid.k8s.io/sig-etcd-robustness#Summary vs https://github.com/etcd-io/etcd/actions/workflows/robustness-nightly.yaml
Advantages:
TODO:
cc @jmhbnz @ivanvc
Why is this needed?
Migration to Prow opens a new chapter for the stability and debuggability of the robustness tests, with the goal of making the process more approachable for new contributors.