Migrate the robustness tests to prow #18136

Closed
5 tasks done
serathius opened this issue Jun 6, 2024 · 32 comments
Assignees
Labels
area/robustness-testing priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. type/feature

Comments

@serathius
Member

serathius commented Jun 6, 2024

What would you like to be added?

After the last robustness team meeting, it was clear how much better Prow + TestGrid is than GitHub Actions.

https://testgrid.k8s.io/sig-etcd-robustness#Summary vs https://github.com/etcd-io/etcd/actions/workflows/robustness-nightly.yaml

Advantages:

  • More stable: 7% vs 56% failure rate when running the same code.
  • TestGrid is a much better UI for monitoring failures, and it adds more advanced features such as creating bugs and alerting.
  • Prow is a much better tool for viewing test logs: it parses logs, groups them by test, doesn't crash the browser, doesn't fail when downloading compressed logs, and doesn't create archives.

TODO:

cc @jmhbnz @ivanvc

Why is this needed?

Migrating to Prow opens a new chapter for the stability and debuggability of the robustness tests, with the goal of making the process more approachable for new contributors.

@henrybear327
Contributor

@ArkaSaha30

@ivanvc
Member

ivanvc commented Jun 6, 2024

Do we have access to arm nodes in the Prow infra? The last I remember is that we were waiting for them. I don't see any updates regarding this on kubernetes/k8s.io#6102. So, it may be a blocker for the second point.

@serathius
Member Author

serathius commented Jun 6, 2024

Not great, but I will not block the migration regardless. Robustness tests only bring value if there is someone willing to review them. With Prow being so much better, no one will be willing to review arm robustness failures on GitHub Actions.

@ivanvc
Member

ivanvc commented Jun 6, 2024

I can see two options: pause running robustness for the ARM architecture (not ideal) or keep ARM tests running on GitHub Actions.

I don't see much activity in kubernetes/k8s.io#6102. Who or where would be a good place to ask for a status update/ETA for ARM nodegroups?

@jmhbnz
Member

jmhbnz commented Jun 6, 2024

Hi @upodroid - We spoke at KubeCon EU Paris about a dedicated arm64 cluster for prow. Can you please provide an update on the timeline for it being available?

@serathius
Member Author

I can see two options: pause running robustness for the ARM architecture (not ideal) or keep ARM tests running on GitHub Actions.

I was thinking about the second option; however, due to the sub-par user experience, I expect it would be equivalent to the first one.

@ivanvc
Member

ivanvc commented Jun 7, 2024

Discussed on Slack with Arka, we'll be working on the following at the moment:

/assign @ArkaSaha30 @ivanvc

@k8s-ci-robot

@ivanvc: GitHub didn't allow me to assign the following users: ArkaSaha30.

Note that only etcd-io members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

Discussed on Slack with Arka, we'll be working on the following at the moment:

/assign @ArkaSaha30 @ivanvc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ivanvc ivanvc removed the help wanted label Jun 7, 2024
@ArkaSaha30
Contributor

/assign

@ArkaSaha30
Contributor

Currently, the robustness tests on GitHub Actions run only on main or PRs to main. Do we need to run them on release-3.5 and release-3.4?
The existing robustness periodic and presubmit jobs can be configured to handle all three branches.

@serathius
Member Author

There are no robustness tests on branches other than main. We develop and run the robustness tests from the main branch and validate binaries built from older branches.

@jmhbnz jmhbnz added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jun 11, 2024
@ivanvc
Member

ivanvc commented Jun 13, 2024

We have finished the first and the third tasks. When would you think is a good time to remove the GitHub action @serathius?

We can't move forward with the second, as we don't have a timeline on when ARM runners are going to be available.

@serathius
Member Author

We have finished the first and the third tasks. When would you think is a good time to remove the GitHub action @serathius?

We can keep arm64 on GitHub Actions so that we are not blocked on it.

@ivanvc
Member

ivanvc commented Jun 14, 2024

@ArkaSaha30, can you help with "Remove non-arm robustness tests from GitHub Actions"?

Thanks.

@jmhbnz
Member

jmhbnz commented Aug 8, 2024

Update: arm64 runners have been enabled in Prow (refer to the k8s-infra Slack discussions: 1, 2).

@serathius
Member Author

ci-etcd-robustness-arm64 looks broken.
image

@jmhbnz
Member

jmhbnz commented Aug 8, 2024

ci-etcd-robustness-arm64 looks broken.

Looking at most recent full run it says:

Test started today at 5:36 PM failed after 1h19m14s.

Job logs show:

 {"Time":"2024-08-08T06:47:33.907178941Z","Action":"output","Package":"go.etcd.io/etcd/tests/v3/robustness","Test":"TestRobustnessExploratory/EtcdHighTraffic/ClusterOfSize1","Output":"/home/prow/go/src/github.com/etcd-io/etcd/bin/etcd (/home/prow/go/src/github.com/etcd-io/etcd/bin/etcd_--version) (79484): Git SH{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:173","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.ExecuteProcess","level":"error","msg":"Entrypoint received interrupt: terminated","severity":"error","time":"2024-08-08T06:47:36Z"}
++ early_exit_handler
++ '[' -n 17 ']'
++ kill -TERM 17
++ cleanup_dind
++ [[ false == \t\r\u\e ]]
+ EXIT_VALUE=143 

Looks like the job was interrupted? Or is that expected / unrelated output?
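For readers unfamiliar with the exit value in the log above: an exit status of 143 conventionally means the process was terminated by SIGTERM (128 + signal number 15 on Linux), which matches the `kill -TERM 17` line. The following is a minimal sketch that reproduces the status locally; the `sleep` process is only a stand-in for the test process and is not part of the Prow job.

```shell
# Start a long-running stand-in process in the background (assumption: any
# process that does not handle SIGTERM behaves the same way here).
sleep 30 &
pid=$!

# Send SIGTERM, as the early_exit_handler in the job log does.
kill -TERM "$pid"

# wait reports 128 + signal number for a child killed by a signal.
status=0
wait "$pid" || status=$?

echo "exit status: $status"   # prints "exit status: 143"
```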

Job config is here.

Job history shows as aborted: https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/ci-etcd-robustness-arm64

Edit: Interestingly ci-etcd-robustness-main-arm64 was fine https://testgrid.k8s.io/sig-etcd-robustness#ci-etcd-robustness-main-arm64. I am not too sure on the difference between those two jobs.

@ivanvc
Member

ivanvc commented Aug 16, 2024

@jmhbnz, @serathius, are we ready to remove optional: true from the robustness presubmit jobs and mark this issue as complete?

@jmhbnz
Member

jmhbnz commented Aug 16, 2024

@jmhbnz, @serathius, are we ready to remove optional: true from the robustness presubmit jobs and mark this issue as complete?

We can remove optional: true from the presubmits I believe, the job seems to be behaving about the same if not better than the amd64 equivalent presubmit.

I don't think we can close this yet though, we still have a problem with ci-etcd-robustness-arm64. Perhaps the team at the next robustness meeting could take a look at that, as I am out of my area of expertise trying to debug it.

Edit: Defer to @serathius as tech lead for robustness for final decision on optional: true.

@serathius
Member Author

Edit: Defer to @serathius as tech lead for robustness for final decision on optional: true.

I think we are OK to make the presubmit job blocking.

I don't think we can close this yet though, we still have a problem with ci-etcd-robustness-arm64. Perhaps the team at the next robustness meeting could take a look at that, as I am out of my area of expertise trying to debug it.

My high-level question: why do we have separate ci-etcd-robustness-amd64 and ci-etcd-robustness-main-amd64 jobs (mirrored for arm)?

@ivanvc
Member

ivanvc commented Aug 16, 2024

I don't think we can close this yet though, we still have a problem with ci-etcd-robustness-arm64. Perhaps the team at the next robustness meeting could take a look at that, as I am out of my area of expertise trying to debug it.

My bad, I thought it was addressed in #17593. I see it's a different issue.

It looks like the runs are consistently aborted at around 80 minutes. Following early_exit_handler, it seems like the process is being interrupted by its parent, which is consistent with the output from the logs:

{"Time":"2024-08-16T22:50:16.205037989Z","Action":"output","Package":"go.etcd.io/etcd/tests/v3/robustness","Test":"TestRobustnessExploratory","Output":"/home/prow/go/src/github.com/etcd-io/etcd/bin/etcd (/home/prow/go/src/github.com/etcd-io/etcd/bin/etcd_--version) (80167): Go OS{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:173","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.ExecuteProcess","level":"error","msg":"Entrypoint received interrupt: terminated","severity":"error","time":"2024-08-16T22:50:17Z"}

I wonder if the ARM node or pods inside the node get rotated after 80m.
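To check whether a given run was interrupted in this way, one option is to search its build log for the entrypoint message quoted above. This is only a sketch: the log file here is a temporary file with two sample lines (the first abbreviated from the log above, the second a made-up test output line), and where the real build log is stored is not covered.

```shell
# Create a small sample log (hypothetical contents, for illustration only).
log_file=$(mktemp)
cat > "$log_file" <<'EOF'
{"component":"entrypoint","level":"error","msg":"Entrypoint received interrupt: terminated","severity":"error","time":"2024-08-16T22:50:17Z"}
{"Action":"output","Test":"TestRobustnessExploratory","Output":"sample test output"}
EOF

# Count the lines that contain the entrypoint interrupt message.
# "|| true" keeps the script going when grep finds no match (exit status 1).
interrupts=$(grep -c '"msg":"Entrypoint received interrupt' "$log_file" || true)

echo "interrupt lines: $interrupts"
rm -f "$log_file"
```

A non-zero count suggests the run was terminated from outside rather than failing on a test assertion.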

My high-level question: why do we have separate ci-etcd-robustness-amd64 and ci-etcd-robustness-main-amd64 jobs (mirrored for arm)?

I'm unsure about this one. Should we only have ci-etcd-robustness-amd64?

@ivanvc
Member

ivanvc commented Aug 30, 2024

Just giving an update that I have a thread in #sig-k8s-infra. It looks like the bug is in the infra, not the job itself.

@ivanvc
Member

ivanvc commented Aug 30, 2024

Link to kubernetes/k8s.io#7241

@ivanvc
Member

ivanvc commented Sep 3, 2024

The ARM issues are now solved. There are multiple green runs in prow (https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/ci-etcd-robustness-arm64).

@serathius, should we delete ci-etcd-robustness-main-arm64 and only keep ci-etcd-robustness-arm64?

@serathius
Member Author

serathius commented Sep 4, 2024

I don't know the exact differences in the job definitions, but of these four jobs:

  • ci-etcd-robustness-amd64
  • ci-etcd-robustness-arm64
  • ci-etcd-robustness-main-amd64
  • ci-etcd-robustness-main-arm64

We only need two: one for amd64 and one for arm64. As for the name, I think it would be better to follow the same convention as ci-etcd-robustness-release35-amd64 and use the branch name in the job name. So preferably we keep:

  • ci-etcd-robustness-main-amd64
  • ci-etcd-robustness-main-arm64

@ivanvc
Member

ivanvc commented Sep 4, 2024

The difference between the jobs is that ci-etcd-robustness-{amd64,arm64} enables gofail (make gofail-enable) and builds the project (make build), while ci-etcd-robustness-main-{amd64,arm64} doesn't.

  • ci-etcd-robustness-arm64: https://github.com/kubernetes/test-infra/blob/cb419f072809b7554602219dadee3b0433b5682d/config/jobs/etcd/etcd-periodics.yaml#L171-L183
    result=0
    apt-get -o APT::Update::Error-Mode=any update && apt-get --yes install cmake libfuse3-dev libfuse3-3 fuse3
    sed -i 's/#user_allow_other/user_allow_other/g' /etc/fuse.conf
    make install-lazyfs
    set -euo pipefail
    GO_TEST_FLAGS="-v --count 120 --timeout '200m' --run TestRobustnessExploratory"
    make gofail-enable
    make build
    VERBOSE=1 GOOS=linux GOARCH=arm64 CPU=8 EXPECT_DEBUG=true GO_TEST_FLAGS=${GO_TEST_FLAGS} RESULTS_DIR=/data/results make test-robustness || result=$?
    if [ -d /data/results ]; then
      zip -r ${ARTIFACTS}/results.zip /data/results
    fi
    exit $result
    
  • ci-etcd-robustness-main-arm64: https://github.com/kubernetes/test-infra/blob/cb419f072809b7554602219dadee3b0433b5682d/config/jobs/etcd/etcd-periodics.yaml#L263-L273
    result=0
    apt update && apt-get --yes install cmake libfuse3-dev libfuse3-3 fuse3
    sed -i 's/#user_allow_other/user_allow_other/g' /etc/fuse.conf
    make install-lazyfs
    set -euo pipefail
    GO_TEST_FLAGS="-v --count 120 --timeout '200m' --run TestRobustnessExploratory"
    VERBOSE=1 GOOS=linux GOARCH=arm64 CPU=8 EXPECT_DEBUG=true GO_TEST_FLAGS=${GO_TEST_FLAGS} RESULTS_DIR=/data/results make test-robustness-main || result=$?
    if [ -d /data/results ]; then
      zip -r ${ARTIFACTS}/results.zip /data/results
    fi
    exit $result
    

Which one would we need to keep, the one with gofail enabled or the other?
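Both job scripts above share the same pattern for preserving the test exit status while still archiving results: the status is captured with `|| result=$?` so that the shell's errexit setting does not abort the script before the archive step. A minimal sketch of that pattern follows; `run_tests` is a hypothetical stand-in for the `make test-robustness` invocation, the archive step is replaced by an `echo`, and the sketch uses `set -eu` rather than the jobs' `set -euo pipefail` so it also runs under a plain POSIX sh.

```shell
set -eu

# Hypothetical stand-in for the test command; always fails with status 3.
run_tests() { return 3; }

job() {
  result=0
  results_dir=$(mktemp -d)

  # Capture the failure instead of letting errexit stop the script here.
  run_tests || result=$?

  # The archive step still runs after a test failure.
  if [ -d "$results_dir" ]; then
    echo "archiving $results_dir"
  fi
  rm -rf "$results_dir"

  # Propagate the original test status, as "exit $result" does in the jobs.
  return "$result"
}

job_status=0
job || job_status=$?
echo "job exited with $job_status"   # prints "job exited with 3"
```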

@ivanvc
Member

ivanvc commented Sep 6, 2024

The GitHub workflows we used to have didn't enable gofail, nor were we building the project. We should keep ci-etcd-robustness-main-{arm64,amd64}, which are already consistent with the job naming you suggested.

@jmhbnz
Member

jmhbnz commented Sep 7, 2024

The GitHub workflows we used to have didn't enable gofail, nor were we building the project. We should keep ci-etcd-robustness-main-{arm64,amd64}, which are already consistent with the job naming you suggested.

Good spotting @ivanvc. That seems reasonable to me, defer to @serathius for final decision.

@serathius
Member Author

The lack of building and enabling gofail is expected because of the difference between the two targets: make test-robustness just runs the tests (on a locally available binary), while make test-robustness-main tests etcd from the main branch (it downloads, enables gofail, and builds).

With the differences cleaned up I think we can leave ci-etcd-robustness-main-{arm64,amd64}.

ivanvc added a commit to ivanvc/test-infra that referenced this issue Sep 10, 2024
As discussed in etcd-io/etcd#18136, ci-etcd-robustness-{arm64,amd64}
were a duplication of the main branch jobs.
@ivanvc
Member

ivanvc commented Sep 10, 2024

I believe the only outstanding task from this issue is marking the pre-submit jobs as blocking. @serathius, do you think we should do this soon, or should we leave them running for a little longer?

@serathius
Member Author

I think we are good to mark them blocking. Robustness tests have been stable on both PRs and periodics.

image
image

@ivanvc
Member

ivanvc commented Sep 12, 2024

I'll close this issue now since we don't have any outstanding tasks (please reopen if needed).

Thanks to everyone who contributed to migrating the robustness tests.

@ivanvc ivanvc closed this as completed Sep 12, 2024

6 participants