[ci] flaky test: TestCheckpoint on almalinux-8 #4457

Closed
lifubang opened this issue Oct 18, 2024 · 4 comments

Comments

@lifubang
Member

=== RUN TestCheckpoint
time="2024-10-18T08:55:44Z" level=warning msg="--- Quoting "/tmp/TestCheckpoint214687474/003/criu-parent/dump.log""
time="2024-10-18T08:55:44Z" level=warning msg="118:(09.517977) freezer.state=FREEZING"
time="2024-10-18T08:55:44Z" level=warning msg="119:(09.618087) freezer.state=FREEZING"
time="2024-10-18T08:55:44Z" level=warning msg="120:(09.718192) freezer.state=FREEZING"
time="2024-10-18T08:55:44Z" level=warning msg="121:(09.818291) freezer.state=FREEZING"
time="2024-10-18T08:55:44Z" level=warning msg="122:(09.918412) freezer.state=FREEZING"
time="2024-10-18T08:55:44Z" level=warning msg="123:(10.001045) Error (criu/cr-dump.c:1779): Timeout reached. Try to interrupt: 0"
time="2024-10-18T08:55:44Z" level=warning msg="124:(10.001084) freezer.state=FREEZING"
time="2024-10-18T08:55:44Z" level=warning msg="125:(10.001125) Unfreezing tasks into 1"
time="2024-10-18T08:55:44Z" level=warning msg="126:(10.001128) \tUnseizing 45035 into 1"
time="2024-10-18T08:55:44Z" level=warning msg="127:(10.001140) Error (compel/src/lib/infect.c:418): Unable to detach from 45035: No such process"
time="2024-10-18T08:55:44Z" level=warning msg="128:(10.001144) Writing image inventory (version 1)"
time="2024-10-18T08:55:44Z" level=warning msg="129:(10.001223) Error (criu/cr-dump.c:1893): Pre-dumping FAILED."
time="2024-10-18T08:55:44Z" level=warning msg=---
checkpoint_test.go:93: criu failed: type PRE_DUMP errno 0
--- FAIL: TestCheckpoint (10.24s)

@lifubang
Member Author

I have seen this failure many times.

@rata
Member

rata commented Oct 18, 2024

It wouldn't be the first time criu becomes unreliable on old kernels. Maybe @kolyshkin has an insight on what to do (just skip it?)

@kolyshkin
Contributor

In general, yes, older kernels have issues when trying to freeze a cgroup, and as a result criu fails sometimes. We have a similar issue with runc itself (runc pause is flaky).

Both criu and runc retry (and I've changed the retry timings and attempts in runc at least twice), but sometimes it's still not enough.

The issue is probably the same as #4273.
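For readers unfamiliar with the retry approach mentioned above, here is a minimal Go sketch of the general idea: write `FROZEN` to `freezer.state`, read the state back, and if it still says `FREEZING`, thaw briefly and try again. This is not the actual runc or criu code. The names `stateFile`, `freezeWithRetry` and `fakeFreezer`, as well as the attempt count and pause, are hypothetical and chosen only for illustration; the fake stands in for a real cgroup so the logic can run anywhere.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// stateFile abstracts a cgroup v1 freezer.state file (hypothetical interface),
// so the retry logic can be exercised without a real cgroup.
type stateFile interface {
	Write(state string) error
	Read() (string, error)
}

// freezeWithRetry asks for FROZEN and re-checks the state. While the kernel
// still reports FREEZING, it thaws the cgroup, waits for pause, and retries.
// If every attempt ends in FREEZING, the cgroup is left thawed and an error
// is returned.
func freezeWithRetry(f stateFile, attempts int, pause time.Duration) error {
	for i := 0; i < attempts; i++ {
		if err := f.Write("FROZEN"); err != nil {
			return err
		}
		state, err := f.Read()
		if err != nil {
			return err
		}
		switch state {
		case "FROZEN":
			return nil
		case "FREEZING":
			// Not frozen yet: thaw so stuck tasks can make progress, then retry.
			if err := f.Write("THAWED"); err != nil {
				return err
			}
			time.Sleep(pause)
		default:
			return fmt.Errorf("unexpected freezer state %q", state)
		}
	}
	return errors.New("unable to freeze: state was still FREEZING after all attempts")
}

// fakeFreezer simulates a freezer that reports FREEZING for the first
// stuckReads reads after a freeze request.
type fakeFreezer struct {
	stuckReads int
	state      string
	writes     []string
}

func (f *fakeFreezer) Write(s string) error {
	f.writes = append(f.writes, s)
	f.state = s
	return nil
}

func (f *fakeFreezer) Read() (string, error) {
	if f.state == "FROZEN" && f.stuckReads > 0 {
		f.stuckReads--
		return "FREEZING", nil
	}
	return f.state, nil
}

func main() {
	f := &fakeFreezer{stuckReads: 3}
	err := freezeWithRetry(f, 10, time.Millisecond)
	fmt.Println("error:", err, "writes:", f.writes)
}
```

As noted above, retries like this only raise the odds of success; with a limited number of attempts the freeze can still fail, which matches what the CI log shows.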

@kolyshkin
Contributor

> In general, yes, older kernels have issues when trying to freeze a cgroup

Let me correct myself: it's not older kernels, it's cgroup v1 freezer.

kolyshkin added a commit to kolyshkin/criu that referenced this issue Dec 13, 2024
Cgroup v1 freezer has always been problematic, failing to freeze a
cgroup.

In runc, we have implemented a few kludges to increase the chance of
succeeding, but those are used when runc freezes a cgroup for its own
purposes (for "runc pause" and to modify device properties for cgroup
v1).

When criu is used, it fails to freeze a cgroup from time to time
(see [1], [2]). Let's try adding kludges similar to ones in runc.

Alas, I have absolutely no way to test this, so please review carefully.

[1]: opencontainers/runc#4273
[2]: opencontainers/runc#4457

Signed-off-by: Kir Kolyshkin <[email protected]>
kolyshkin added a commit to kolyshkin/criu that referenced this issue Dec 16, 2024
kolyshkin added a commit to kolyshkin/criu that referenced this issue Dec 17, 2024