
CRaC checkpoint of java app running inside a docker container (Mac OSX) #1

Open
tarilabs opened this issue Feb 23, 2023 · 5 comments

@tarilabs

Executive summary: I've read the manual, but it's not clear to me why, when checkpointing a Java app running inside a docker container, I should also checkpoint the docker container itself; I just want to "snapshot" my running Java app. Could you kindly clarify why a docker checkpoint is needed when performing a CRaC/criu checkpoint of a Java app running inside a docker container, even though I persist the dump files?

Details

Hi,
I've been experimenting with this project following this video from Devoxx and this great tutorial.

Since I'm on Mac OSX (and not Linux), I operate inside a docker container.

My goal is to "snapshot" a running Java app, using CRaC/criu at a point in time and restore it, following the tutorials mentioned and the documentation I could find here on github.

Since I operate the CRaC inside a container because I'm on Mac OSX, I make sure the files are collected on a mounted volume, so I can mount them across container restarts.

I have created a banal Java app to test this, here: https://github.com/tarilabs/demo20230223-counting-on-crac

# create directory to host crac files dump
mkdir crac-files
# prepare a docker container as the lab environment to operate within
docker build -f src/main/docker/Dockerfile.jvm -t demo20230223-counting-on-crac .
docker run -it --privileged -v $(pwd)/crac-files:/opt/crac-files --rm --name demo20230223-counting-on-crac demo20230223-counting-on-crac
java -XX:CRaCCheckpointTo=/opt/crac-files $JAVA_OPTS -jar $JAVA_APP_JAR

In another shell I perform:

docker exec -it -u root demo20230223-counting-on-crac /bin/bash
ps -u root
# typically java is PID 9, used below
jcmd 9 JDK.checkpoint

Up to here, everything works as expected, the app is checkpointed and dump files are created.

Now I want to restore, using the command:

java -XX:CRaCRestoreFrom=/opt/crac-files

I have tried 3 use-cases

Case A

If I run the restore command in the first shell while the docker container is still running, it works.

Case B

If in the second shell, I capture a docker checkpoint with something ~like:

exit
docker ps -a
docker commit CONTAINER_ID demo20230223-counting-on-crac:checkpoint

then in the first shell, I restart from the checkpoint with something ~like:

exit
docker run -it --privileged -v $(pwd)/crac-files:/opt/crac-files -p 8080:8080 --rm --name demo20230223-counting-on-crac demo20230223-counting-on-crac:checkpoint
java -XX:CRaCRestoreFrom=/opt/crac-files

it works.

Case C

In the second shell, I just exit.
In the first shell, I just exit.
No container is running and no docker-checkpoint was taken.
In the first shell I go with:

docker run -it --privileged -v $(pwd)/crac-files:/opt/crac-files --rm --name demo20230223-counting-on-crac demo20230223-counting-on-crac
java -XX:CRaCRestoreFrom=/opt/crac-files

I get:

Error (criu/cr-restore.c:1506): Can't fork for 9: File exists
Error (criu/cr-restore.c:2593): Restoring FAILED.

I don't get why I cannot just restart the Java app from the dumped files (which are available across container restarts, as they are on the host disk). Must some additional state of the docker container also be captured (with the docker checkpoint)?

Is this a limitation of the system I'm using (Mac OSX)? If I were on Linux, could I have turned the computer off and on again between the Java app checkpoint and restore?

Thanks!

@tarilabs tarilabs changed the title CRaC checkpoint of java app running inside CRaC checkpoint of java app running inside a docker container (Mac OSX) Feb 23, 2023
@snazarkin
Contributor

I've reproduced this on a native Linux host. For now, the workaround is to use the 17-crac+3 release.

@snazarkin
Contributor

While playing a bit more with this issue, I found that my host was running out of space. I can't reproduce the issue after cleanup.
Could you please check whether the VM that hosts Linux has enough free space?

@tarilabs
Author

Thank you @snazarkin but I can confirm I do NOT run out of space.

with 17-crac+3 indeed cannot reproduce

I've taken your suggestion and tried with 17-crac+3, and it indeed seems to work. Here I do the checkpoint:

Screenshot 2023-02-24 at 10 57 16

and then I'm able to exit container, start a new one, and resume from checkpoint:

Screenshot 2023-02-24 at 10 57 58

with 17-crac+4 can reproduce every time

If I switch back to 17-crac+4, I hit the same issue every time: the checkpoint is performed with (seemingly) no errors, I exit the container, I start a new one, but I cannot restore from the checkpoint:

Screenshot 2023-02-24 at 11 05 27

It is to be noted I'm not on a VM: I'm directly on Mac OSX and I perform those checkpoint/restore operations while inside the Docker container defined here.

System Version: macOS 13.2 (22D49)
Kernel Version: Darwin 22.3.0

Docker version 20.10.22, build 3a2c30b

@AntonKozlov
Member

From the log, it looks like in case C there is already a process with PID == 9, which prevents java from being restored. It is strange that this reproduces only with a fresh shell, and only with build 4.

In the last success report with 17-crac+3, you started a shell as the first process, so java got PID 17 at checkpoint. At restore in the fresh container, PID 17 was free, and the restore succeeded.
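
The PID clash described above can be checked before attempting a restore. The following is a minimal sketch, not part of CRaC or CRIU: `pid_in_use` and `check_restore_pid` are hypothetical helper names, and the check assumes a Linux `/proc` filesystem inside the container.

```shell
# Hypothetical helper: succeeds if the given PID (or thread ID) is already
# taken in the current PID namespace. Assumes Linux /proc is mounted.
pid_in_use() {
    [ -e "/proc/$1" ]
}

# Hypothetical pre-flight check: reports whether the PID recorded in the
# checkpoint image could be reused right now.
check_restore_pid() {
    local pid="$1"
    if pid_in_use "$pid"; then
        echo "PID $pid is taken"
        return 1
    fi
    echo "PID $pid is free"
}
```

For the image in this thread one might run `check_restore_pid 9 && java -XX:CRaCRestoreFrom=/opt/crac-files`. The check is only advisory: it looks at a single ID, while the restore also needs the IDs of the java threads, and another process can still take the PID between the check and the restore.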

A workaround is to ensure that no PID clash between checkpoint and restore is possible; e.g. in this example it's enough to run echo 128 > /proc/sys/kernel/ns_last_pid before java -XX:CRaCCheckpointTo=/opt/crac-files ....

anton@mercury:~/proj/demo20230223-counting-on-crac$ docker run -it --privileged -v $(pwd)/crac-files:/opt/crac-files --rm --name demo20230223-counting-on-crac demo20230223-counting-on-crac
root@8e2585a2d2fe:/# echo 128 > /proc/sys/kernel/ns_last_pid
root@8e2585a2d2fe:/# java -XX:CRaCCheckpointTo=/opt/crac-files $JAVA_OPTS -jar $JAVA_APP_JAR
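
The `echo 128` step above can be wrapped so that it never moves the counter backwards. This is a sketch only: `bump_last_pid` is a hypothetical helper, the default of 128 is simply the value used in this comment, and the second argument exists only so the function can be tried against an ordinary file. Writing the real `/proc/sys/kernel/ns_last_pid` needs sufficient privileges, which the `--privileged` container used in this thread provides.

```shell
# Hypothetical helper: raise the "last PID" counter to at least $1.
# $2 defaults to the real kernel file; pass another path to try it safely.
bump_last_pid() {
    local target="${1:-128}"
    local file="${2:-/proc/sys/kernel/ns_last_pid}"
    local current
    current="$(cat "$file")" || return 1
    # Only write when the counter is below the target, so PIDs that are
    # already high are left alone.
    if [ "$current" -lt "$target" ]; then
        echo "$target" > "$file" || return 1
    fi
}
```

Usage before checkpoint would then be `bump_last_pid 128 && java -XX:CRaCCheckpointTo=/opt/crac-files $JAVA_OPTS -jar $JAVA_APP_JAR`.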

We are working on improving the container experience now, and we'll look at how to cover this situation as well. CC @wkia

The same workaround helps on restore, but it should be applied either on checkpoint or on restore, not on both (it is an exclusive OR). Occasionally, trying to restore several times, to free the PIDs for java, helps as well:

anton@mercury:~/proj/demo20230223-counting-on-crac$ docker run -it --privileged -v $(pwd)/crac-files:/opt/crac-files --rm --name demo20230223-counting-on-crac demo20230223-counting-on-crac
root@3c5f050d2ead:/# echo 128 > /proc/sys/kernel/ns_last_pid
root@3c5f050d2ead:/# java -XX:CRaCRestoreFrom=/opt/crac-files

or

anton@mercury:~/proj/demo20230223-counting-on-crac$ docker run -it --privileged -v $(pwd)/crac-files:/opt/crac-files --rm --name demo20230223-counting-on-crac demo20230223-counting-on-crac
root@6ec5e9174b1e:/# java -XX:CRaCRestoreFrom=/opt/crac-files
Error (criu/cr-restore.c:1506): Can't fork for 9: File exists
Error (criu/cr-restore.c:2593): Restoring FAILED.
root@6ec5e9174b1e:/# java -XX:CRaCRestoreFrom=/opt/crac-files
pie: 9: Error (criu/pie/restorer.c:1919): Unable to create a thread: -17
pie: 9: Error (criu/pie/restorer.c:2056): Restorer fail 9
Error (criu/cr-restore.c:2593): Restoring FAILED.
root@6ec5e9174b1e:/# java -XX:CRaCRestoreFrom=/opt/crac-files
# this one succeeds
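
The manual retries in the transcript above can be automated with a small loop. This is a sketch: `retry` is a hypothetical helper, and the attempt limit is an arbitrary choice rather than a value from CRaC or CRIU.

```shell
# Hypothetical helper: run a command up to $1 times, stopping at the first
# success. Failed attempts are reported on stderr.
retry() {
    local attempts="$1"
    shift
    local i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $i of $attempts failed" >&2
        i=$((i + 1))
    done
    return 1
}
```

For example: `retry 5 java -XX:CRaCRestoreFrom=/opt/crac-files`. Note that the loop cannot tell a failed restore from a restored app that later exits with a non-zero status, so in the second case it would restore again.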

@tarilabs
Author

Thank you so much @AntonKozlov, 🙏

I'm playing around with the workaround to avoid PID clashes as described in #1 (comment) with 17-crac+4 even on more complex applications, and so far it works as I would expect! 🚀 🥳
