doc: ubuntu-minimal VM images do not support cloud-init as claimed #14605
Comments
Which image OS version did you use? |
@holmanb @blackboxsw this does appear to be a problem with the I've confirmed the issue doesn't affect ubuntu-minimal:22.04 and ubuntu:24.04 VM images. Any ideas what could be going on here? Is cloud-init broken in the 24.04 minimal image somehow? |
I was able to reproduce the issue with the example failing image. Using a modified version of cloud-init's Python detection code:
def is_platform_viable() -> bool:
    """Return True when this platform appears to have an LXD socket."""
    if not os.path.exists(LXD_SOCKET_PATH):
        LOG.warning(f"{LXD_SOCKET_PATH} does not exist")
        return False
    if not stat.S_ISSOCK(os.lstat(LXD_SOCKET_PATH).st_mode):
        LOG.warning(f"{LXD_SOCKET_PATH} is not a socket: {os.lstat(LXD_SOCKET_PATH).st_mode}")
        return False
    return True
I see the issue logged:
It looks like the lxd socket doesn't exist when cloud-init's Python code is running. ds-identify correctly identifies LXD as the datasource as a systemd-generator, but the later python code doesn't see it there. In the lxd agent logs I notice that pam is logging an error:
I'm not sure if that is related. Does the lxd agent modify or remove the socket after generators run? Possibly this fails due to a missing dependency which causes the above error? |
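One way to check whether the socket is merely late rather than missing is to poll for it and record how long it takes to appear. The following is a minimal debugging sketch, not cloud-init's actual behavior: the helper names `is_socket` and `wait_for_socket`, the timeout values, and the `/dev/lxd/sock` default are assumptions made for illustration.

```python
import os
import stat
import time
from typing import Optional

# Assumed socket location; the thread only refers to it as LXD_SOCKET_PATH.
LXD_SOCKET_PATH = "/dev/lxd/sock"


def is_socket(path: str) -> bool:
    """Return True if path exists and is a Unix socket."""
    try:
        return stat.S_ISSOCK(os.lstat(path).st_mode)
    except OSError:
        # Missing path or permission problem: treat as "not a socket".
        return False


def wait_for_socket(
    path: str = LXD_SOCKET_PATH,
    timeout: float = 5.0,
    interval: float = 0.1,
) -> Optional[float]:
    """Poll until path is a socket.

    Returns the number of seconds waited, or None if the timeout expired.
    """
    start = time.monotonic()
    deadline = start + timeout
    while True:
        if is_socket(path):
            return time.monotonic() - start
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

If `wait_for_socket()` returns a small positive number when run early in boot, that would point to an ordering problem rather than a socket that never gets created.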
It doesn't explain why it works in ubuntu:24.04 though; they should be the same from LXD's perspective. |
Maybe @simondeziel might know a difference in ubuntu 24.04 minimal |
@tomponline I also noticed that the agent doesn't even appear to run on non-minimal images:
vs a minimal vm:
|
@simondeziel as you're familiar with the lxd-agent package, could you please take a look at this? It looks like its units are not firing correctly (although I can get in fine) and are potentially starting too late for LXD. |
Here with LXD
That's using the current default image (Noble,
Now trying with the exact image you tried (
@rptb1 I added a |
Here is an exact paste of me reproducing the issue just now. If there is anything more I can do locally to help debug this please let me know.
|
@rptb1 I still cannot reproduce despite using the same base OS, 22.04 with the same kernel and same LXD rev:
|
@hamistao as you're looking at cloud-init issues ATM, would you mind seeing if you can reproduce this? |
@tomponline You got it! |
We had a quick Meet with @hamistao and he was able to reproduce the issue. The error he got is Here, I see that
|
So after a debugging session, @simondeziel and I compiled the following findings: As you can see, detecting the LXD datasource fails at
And the LXD-agent only starts at
That also happens to explain why @simondeziel wasn't able to reproduce it and I was: my machine's bulk is ideal to provoke this race consistently. Why this only happens on Also, lxd-agent is scheduled to start before This is the critical chain in this experiment.
@holmanb Could you please take a look at this when you have the time? We would like to know what your take is on fixing this. It is possible a patch on |
Nice sleuthing both! |
On the systemd side, I've come to the conclusion that Worth noting in the Locally (where I cannot reproduce the race), we clearly see that
|
A quick correction here, if it has a device suitable to be a datasource
|
I have been looking into the
And I got these logs on a
We can see that, assuming these timestamps are accurate and precise, the chronological order of operations is: |
And this is just a guess, but this behavior is so consistent in my setup, even though the time gap between the events is fairly small, that I am wondering whether this is really a race condition. That said, I have no idea what else it could be. |
Hey @hamistao and @simondeziel, thanks for digging into this.
I can still reproduce the issue locally:
$ lxc launch ubuntu-minimal:526b11bb926ebe8a1d05e5f69b02d4a311b311a9a9acfe760f210ef8d45c2bc6 minimal --vm
$ lxc exec minimal -- cloud-init status --long
status: done
extended_status: degraded done
boot_status_code: enabled-by-generator
last_update: Thu, 01 Jan 1970 00:00:16 +0000
detail: DataSourceNone
errors: []
recoverable_errors:
WARNING:
- Used fallback datasource
I would need to better understand LXD's architecture to help you debug the issue. Something is removing the socket at runtime after the systemd generator and before cloud-init-local starts. I'm not sure how, or why, but this still appears to be an issue in LXD. |
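When repeating this check across many launches, it can help to detect the fallback case automatically. Below is a small sketch that parses the text printed by `cloud-init status --long` as shown above and reports whether the `detail` field is `DataSourceNone`. The function names are hypothetical, and the parser only handles the simple top-level `key: value` lines seen in this output; it is not a general parser for the format.

```python
def parse_status(text: str) -> dict:
    """Collect top-level 'key: value' pairs from status output."""
    fields = {}
    for line in text.splitlines():
        # Skip blank lines, indented lines, and list items such as
        # "- Used fallback datasource".
        if not line or line[0].isspace() or line.startswith("-"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def used_fallback_datasource(text: str) -> bool:
    """Return True when the output reports DataSourceNone in 'detail'."""
    return parse_status(text).get("detail") == "DataSourceNone"
```

The text could be captured with `subprocess.run(["lxc", "exec", "minimal", "--", "cloud-init", "status", "--long"], capture_output=True, text=True).stdout` and passed to `used_fallback_datasource`.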
@holmanb Could you elaborate on your theory of I ask because some experiments suggest the problem lies in the order of steps. Read further if you wish to better understand the evidence at hand. As shown in #14605 (comment), the order of operations is not being conserved in
Then I tried to analyze On While on Then I started experimenting with the
I still cannot fully understand why this fixes the order of operations, since we have the following definitions for:
WantedBy and RequiredBy have the same behavior as adding
[Requires] is basically a more restrictive
We can rule out
A similar behavior can be achieved with
tbh I found the descriptions confusing and even contradictory sometimes, so the questions now are: A few more comments. Both images use cc @simondeziel |
@hamistao when added
|
I did see this, but the reverse property table conveys a different idea: So I tried it and got the results above |
Hmm, I read the
Would you mind double-checking that the key was not simply ignored?
|
Oh I see. I read the table as "Can be used in either [Unit] or [Install]". Indeed running That prompted me to double-check my workflow, and indeed there was a problem that made me attribute the success to the addition to I have now fixed my workflow and am running more tests; I should have more news soon. |
Checking the existence of the socket is how cloud-init identifies that it is running on LXD. If the socket exists, |
@holmanb Indeed that makes sense to me, but if the evidence of removal is just the absence of the |
My bad, I see that the issue is only happening on VMs, where cloud-init initially detects LXD via board-name since the socket doesn't exist yet. So you're right - I don't think that the socket was removed and re-added. Sorry for the misdirection there. Yes, if |
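The two detection paths described in this comment can be sketched as a single check that reports which signal matched. This is an illustration only, not cloud-init's or ds-identify's code: the function names, the sysfs path `/sys/class/dmi/id/board_name`, the expected value `LXD`, and the socket path are all assumptions that do not come from this thread.

```python
import os
import stat
from typing import Optional


def read_board_name(path: str) -> Optional[str]:
    """Return the stripped contents of the board-name file, or None."""
    try:
        with open(path, encoding="ascii", errors="replace") as f:
            return f.read().strip()
    except OSError:
        return None


def detect_lxd(
    socket_path: str = "/dev/lxd/sock",  # assumed path
    board_name_path: str = "/sys/class/dmi/id/board_name",  # assumed path
) -> Optional[str]:
    """Return 'socket', 'board-name', or None depending on what matched."""
    try:
        if stat.S_ISSOCK(os.lstat(socket_path).st_mode):
            return "socket"
    except OSError:
        pass  # Socket not present (yet); fall through to the DMI check.
    if read_board_name(board_name_path) == "LXD":  # assumed value
        return "board-name"
    return None
```

A result of `"board-name"` early in boot followed by `"socket"` later would match the situation described here: the VM is recognised as LXD before the agent has created the socket.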
ubuntu-minimal VM images do not support cloud-init.
I do not know whether this is a documentation issue, or a bug with the image builds.
If they're meant to support cloud-init, then it's a bug in the image builds, in which case please forward this or direct me to where I can report that. Otherwise the documentation is wrong.
Reproduction:
lxc launch ubuntu-minimal: test-container --profile default --profile test-profile
Wait a bit. Try lxc exec test-container -- ls /run and note that "cloud.init.ran" exists. It works in containers.
lxc launch --vm ubuntu-minimal: test-container2 --profile default --profile test-profile
Wait a bit. Try lxc exec test-container2 -- ls /run and note that "cloud.init.ran" does not exist. It doesn't work in minimal VM.
lxc launch --vm ubuntu: test-container3 --profile default --profile test-profile
Wait a bit. Try lxc exec test-container3 -- ls /run and note that "cloud.init.ran" exists. It does work in non-minimal VM.
An example non-working image is 526b11bb926ebe8a1d05e5f69b02d4a311b311a9a9acfe760f210ef8d45c2bc6.
Document: reference/remote_image_servers.md