
doc: ubuntu-minimal VM images do not support cloud-init as claimed #14605

Open
rptb1 opened this issue Dec 7, 2024 · 28 comments

@rptb1

rptb1 commented Dec 7, 2024

ubuntu-minimal VM images do not support cloud-init.

I do not know whether this is a documentation issue, or a bug with the image builds.

If they're meant to support cloud-init, then it's a bug in the image builds, in which case please forward this or direct me to where I can report that. Otherwise the documentation is wrong.

ubuntu-minimal:

This server provides official Ubuntu Minimal images. All images are cloud images, which means that they include both cloud-init and the lxd-agent.

Reproduction:

  1. Create a LXD profile with a simple cloud-init test, such as https://documentation.ubuntu.com/lxd/en/latest/cloud-init/#run-commands (see the profile sketch after this list).
  2. Try lxc launch ubuntu-minimal: test-container --profile default --profile test-profile. Wait a bit. Try lxc exec test-container -- ls /run and note that "cloud.init.ran" exists. It works in containers.
  3. Try lxc launch --vm ubuntu-minimal: test-container2 --profile default --profile test-profile. Wait a bit. Try lxc exec test-container2 -- ls /run and note that "cloud.init.ran" does not exist. It doesn't work in a minimal VM.
  4. Try lxc launch --vm ubuntu: test-container3 --profile default --profile test-profile. Wait a bit. Try lxc exec test-container3 -- ls /run and note that "cloud.init.ran" exists. It does work in a non-minimal VM.
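
For reference, the test profile from step 1 can be created like this (a sketch only; the user-data is the runcmd example from the linked docs, written so that it creates /run/cloud.init.ran for the checks above):

lxc profile create test-profile
cat << 'EOF' | lxc profile edit test-profile
config:
  cloud-init.user-data: |
    #cloud-config
    runcmd:
      - [touch, /run/cloud.init.ran]
description: ""
devices: {}
EOF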

An example non-working image is 526b11bb926ebe8a1d05e5f69b02d4a311b311a9a9acfe760f210ef8d45c2bc6.


Document: reference/remote_image_servers.md

@tomponline
Member

Which image OS version did you use?

@tomponline
Member

@holmanb @blackboxsw this does appear to be a problem with the ubuntu-minimal:24.04 VM image indeed.

I've confirmed the issue doesn't affect ubuntu-minimal:22.04 and ubuntu:24.04 VM images.

Any ideas what could be going on here? Is cloud-init broken in the 24.04 minimal image somehow?

@tomponline tomponline added the External Issue is about a bug/feature in another project label Dec 9, 2024
@holmanb
Member

holmanb commented Dec 9, 2024

Any ideas what could be going on here? Is cloud-init broken in the 24.04 minimal image somehow?

I was able to reproduce the issue with the example failing image.

Using a modified version of cloud-init's python detection code:
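# Context note: this is a debug-modified copy of is_platform_viable() from
# cloudinit/sources/DataSourceLXD.py; os, stat, LOG and LXD_SOCKET_PATH
# (/dev/lxd/sock) are already imported/defined at module level there.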

def is_platform_viable() -> bool:
    """Return True when this platform appears to have an LXD socket."""
    if not os.path.exists(LXD_SOCKET_PATH):
        LOG.warning(f"{LXD_SOCKET_PATH} does not exist")
        return False
    if not stat.S_ISSOCK(os.lstat(LXD_SOCKET_PATH).st_mode):
        LOG.warning(f"{LXD_SOCKET_PATH} is not a socket: {os.lstat(LXD_SOCKET_PATH).st_mode}")
        return False
    return True

I see the issue logged:

2024-12-09 17:25:09,172 - DataSourceLXD.py[WARNING]: /dev/lxd/sock does not exist

It looks like the LXD socket doesn't exist when cloud-init's Python code runs. ds-identify (running as a systemd generator) correctly identifies LXD as the datasource, but the later Python code doesn't see the socket.
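
As a quick spot-check while debugging, the socket's presence inside the guest can be verified from the host at any moment (instance name taken from the reproduction steps above):

lxc exec test-container2 -- sh -c 'test -S /dev/lxd/sock && echo socket present || echo socket missing'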

In the lxd-agent logs I notice that PAM is logging an error:

-- Boot 0f67701eac764777a4c03daad8595aec --
Dec 09 17:24:28 minimal-vm systemd[1]: Starting lxd-agent.service - LXD - agent...
Dec 09 17:24:29 minimal-vm lxd-agent[346]: time="2024-12-09T17:24:29Z" level=info msg=Starting
Dec 09 17:24:29 minimal-vm lxd-agent[346]: time="2024-12-09T17:24:29Z" level=info msg="Loading vsock module"
Dec 09 17:24:29 minimal-vm lxd-agent[346]: time="2024-12-09T17:24:29Z" level=info msg="Started vsock listener"
Dec 09 17:24:29 minimal-vm systemd[1]: Started lxd-agent.service - LXD - agent.
Dec 09 17:24:29 minimal-vm su[408]: (to root) root on pts/0
Dec 09 17:24:29 minimal-vm su[408]: pam_unix(su-l:session): session opened for user root(uid=0) by (uid=0)
Dec 09 17:24:29 minimal-vm su[408]: pam_systemd(su-l:session): Failed to connect to system bus: No such file or directory
Dec 09 17:24:57 minimal-vm su[408]: pam_unix(su-l:session): session closed for user root

I'm not sure if that is related.

Does the lxd agent modify or remove the socket after generators run? Possibly this fails due to a missing dependency which causes the above error?

@tomponline
Member

It doesn't explain why it works in ubuntu:24.04 though; they should be the same from LXD's perspective.

@tomponline
Member

Maybe @simondeziel knows of a difference in Ubuntu 24.04 minimal.

@holmanb
Member

holmanb commented Dec 9, 2024

@tomponline I also noticed that the agent doesn't even appear to run on non-minimal images:

○ lxd-agent.service - LXD - agent
     Loaded: loaded (/usr/lib/systemd/system/lxd-agent.service; static)
     Active: inactive (dead)
       Docs: https://documentation.ubuntu.com/lxd/en/latest/

vs a minimal vm:

root@minimal-vm:~# systemctl status lxd-agent
● lxd-agent.service - LXD - agent
     Loaded: loaded (/usr/lib/systemd/system/lxd-agent.service; static)
     Active: active (running) since Mon 2024-12-09 18:26:55 UTC; 2min 59s ago
       Docs: https://documentation.ubuntu.com/lxd/en/latest/
    Process: 280 ExecStartPre=/lib/systemd/lxd-agent-setup (code=exited, status=0/SUCCESS)
   Main PID: 357 (lxd-agent)
      Tasks: 8 (limit: 1124)
     Memory: 37.4M (peak: 39.3M)
        CPU: 400ms
     CGroup: /system.slice/lxd-agent.service
             └─357 /run/lxd_agent/lxd-agent

Dec 09 18:26:55 minimal-vm lxd-agent[357]: time="2024-12-09T18:26:55Z" level=info msg=Starting
Dec 09 18:26:55 minimal-vm lxd-agent[357]: time="2024-12-09T18:26:55Z" level=info msg="Loading vsock module"
Dec 09 18:26:55 minimal-vm lxd-agent[357]: time="2024-12-09T18:26:55Z" level=info msg="Started vsock listener"
Dec 09 18:26:55 minimal-vm systemd[1]: Started lxd-agent.service - LXD - agent.
Dec 09 18:26:57 minimal-vm su[438]: (to root) root on pts/0
Dec 09 18:26:57 minimal-vm su[438]: pam_unix(su-l:session): session opened for user root(uid=0) by (uid=0)
Dec 09 18:26:57 minimal-vm su[438]: pam_systemd(su-l:session): Failed to connect to system bus: No such file or directory
Dec 09 18:27:18 minimal-vm su[438]: pam_unix(su-l:session): session closed for user root
Dec 09 18:29:48 minimal-vm su[608]: (to root) root on pts/0
Dec 09 18:29:48 minimal-vm su[608]: pam_unix(su-l:session): session opened for user root(uid=0) by (uid=0)

@tomponline
Member

@simondeziel as you're familiar with the lxd-agent package, please could you take a look at this? It looks like its units are not firing right (although I can get in fine) and potentially starting too late for LXD.

@tomponline tomponline added this to the lxd-6.3 milestone Dec 9, 2024
@tomponline tomponline added the Bug Confirmed to be a bug label Dec 9, 2024
@simondeziel
Member

Here with LXD 5.21/stable, ubuntu-minimal: works:

root@v3:~# lxc launch --vm ubuntu-minimal: minimal-vm1 --profile default --profile new-keys
Creating minimal-vm1
Starting minimal-vm1
root@v3:~# lxc exec minimal-vm1 -- cloud-init status --wait
Error: LXD VM agent isn't currently running
root@v3:~# lxc exec minimal-vm1 -- cloud-init status --wait
Error: LXD VM agent isn't currently running
root@v3:~# lxc exec minimal-vm1 -- cloud-init status --wait
...........................
status: done
root@v3:~# lxc exec minimal-vm1 -- ls /run/cloud.init.ran
/run/cloud.init.ran

That's using the current default image (Noble, eb6632b7bffc386a3dbf81570d813f4133009dc7ea1cb640215fafc481f7d670) and with the expanded config:

root@v3:~# lxc config show -e minimal-vm1
architecture: x86_64
config:
  cloud-init.user-data: |
    #cloud-config
    runcmd:
      - [touch, /run/cloud.init.ran]
  image.architecture: amd64
  image.description: ubuntu 24.04 LTS amd64 (minimal release) (20241211)
  image.label: minimal release
  image.os: ubuntu
  image.release: noble
  image.serial: "20241211"
  image.type: disk1.img
  image.version: "24.04"
  volatile.base_image: eb6632b7bffc386a3dbf81570d813f4133009dc7ea1cb640215fafc481f7d670
  volatile.cloud-init.instance-id: 79056b7c-27b2-40da-a58f-5e62dc465257
  volatile.eth0.host_name: tap5b5a0b1c
  volatile.eth0.hwaddr: 00:16:3e:33:c0:78
  volatile.last_state.power: RUNNING
  volatile.uuid: 6bbfc369-767c-4048-84a1-3cd134adfcee
  volatile.uuid.generation: 6bbfc369-767c-4048-84a1-3cd134adfcee
  volatile.vsock_id: "3132111516"
devices:
  eth0:
    name: eth0
    network: lxdbr0
    type: nic
  root:
    path: /
    pool: default
    type: disk
ephemeral: false
profiles:
- default
- new-keys
stateful: false
description: ""

Now trying with the exact image you tried (526b11bb926ebe8a1d05e5f69b02d4a311b311a9a9acfe760f210ef8d45c2bc6), it still works here:

root@v3:~# lxc launch --vm ubuntu-minimal:526b11bb926ebe8a1d05e5f69b02d4a311b311a9a9acfe760f210ef8d45c2bc6 minimal-vm2 --profile default --profile new-keys
Creating minimal-vm2
Starting minimal-vm2                          
root@v3:~# lxc exec minimal-vm2 -- cloud-init status --wait

status: done
root@v3:~# lxc exec minimal-vm2 -- ls /run/cloud.init.ran
/run/cloud.init.ran

@rptb1 I added a cloud-init status --wait to be sure to wait long enough for cloud-init to do its job. Maybe you just raced it and concluded it didn't work? Or maybe I'm just unable to reproduce it.

@rptb1
Author

rptb1 commented Dec 21, 2024

Here is an exact paste of me reproducing the issue just now. If there is anything more I can do locally to help debug this, please let me know.

rb@kiwi:~$ lxc profile show test-profile
name: test-profile
description: ""
config:
  cloud-init.user-data: |
    #cloud-config
    runcmd:
      - [touch, /run/cloud.init.ran]
devices: {}
used_by: []
rb@kiwi:~$ lxc launch -s kiwi-tmp --vm ubuntu-minimal: test-container2 --profile default --profile test-profile
Launching test-container2
rb@kiwi:~$ lxc exec test-container2 -- cloud-init status --wait

status: done
rb@kiwi:~$ lxc exec test-container2 -- ls /run/cloud.init.ran
ls: cannot access '/run/cloud.init.ran': No such file or directory
rb@kiwi:~$ lxc config show test-container2
architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 24.04 LTS amd64 (minimal release) (20241211)
  image.label: minimal release
  image.os: ubuntu
  image.release: noble
  image.serial: "20241211"
  image.type: disk1.img
  image.version: "24.04"
  volatile.base_image: eb6632b7bffc386a3dbf81570d813f4133009dc7ea1cb640215fafc481f7d670
  volatile.cloud-init.instance-id: 72a11da0-1e49-4762-ba8f-3f62f63c46fb
  volatile.eth0.host_name: tap14b7ba95
  volatile.eth0.hwaddr: 00:16:3e:2e:93:50
  volatile.last_state.power: RUNNING
  volatile.uuid: 26aa50bd-c502-4946-a351-af065eb15bfc
  volatile.uuid.generation: 26aa50bd-c502-4946-a351-af065eb15bfc
  volatile.vsock_id: "4201320121"
devices:
  root:
    path: /
    pool: kiwi-tmp
    type: disk
ephemeral: false
profiles:
- default
- test-profile
stateful: false
description: ""
rb@kiwi:~$ snap list lxd
Name  Version      Rev    Tracking       Publisher   Notes
lxd   6.2-afb00d0  31571  latest/stable  canonical✓  -
rb@kiwi:~$ uname -a
Linux kiwi 5.15.0-130-generic #140-Ubuntu SMP Wed Dec 18 17:59:53 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
rb@kiwi:~$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.5 LTS
Release:	22.04
Codename:	jammy
rb@kiwi:~$ lxc launch -s kiwi-tmp --vm ubuntu: test-container3 --profile default --profile test-profile
Launching test-container3
rb@kiwi:~$ lxc exec test-container3 -- cloud-init status --wait
.........................
status: done
rb@kiwi:~$ lxc exec test-container3 -- ls /run/cloud.init.ran
/run/cloud.init.ran
rb@kiwi:~$ lxc exec test-container2 -- ls /run/cloud.init.ran
ls: cannot access '/run/cloud.init.ran': No such file or directory
rb@kiwi:~$ lxc stop test-container2
rb@kiwi:~$ lxc stop test-container3

@simondeziel
Member

@rptb1 I still cannot reproduce despite using the same base OS (22.04), the same kernel, and the same LXD rev:

root@v1:~# lxc profile show test-profile
name: test-profile
description: ""
config:
  cloud-init.user-data: |
    #cloud-config
    runcmd:
      - [touch, /run/cloud.init.ran]
devices: {}
used_by: []

root@v1:~# lxc launch --vm ubuntu-minimal:eb6632b7bffc386a3dbf81570d813f4133009dc7ea1cb640215fafc481f7d670 vm1 --profile default --profile test-profile
Launching vm1
root@v1:~# lxc exec vm1 -- cloud-init status --wait

status: done
root@v1:~# lxc exec vm1 -- ls /run/cloud.init.ran
/run/cloud.init.ran

root@v1:~# lxc config show vm1
architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 24.04 LTS amd64 (minimal release) (20241211)
  image.label: minimal release
  image.os: ubuntu
  image.release: noble
  image.serial: "20241211"
  image.type: disk1.img
  image.version: "24.04"
  volatile.base_image: eb6632b7bffc386a3dbf81570d813f4133009dc7ea1cb640215fafc481f7d670
  volatile.cloud-init.instance-id: 1100ed94-a9f9-4cfc-abe2-e97138ec5580
  volatile.eth0.host_name: tap783c977c
  volatile.eth0.hwaddr: 00:16:3e:be:7b:24
  volatile.last_state.power: RUNNING
  volatile.uuid: 945abbdb-fcba-4dad-ae42-d29908cb9fe2
  volatile.uuid.generation: 945abbdb-fcba-4dad-ae42-d29908cb9fe2
  volatile.vsock_id: "3932951384"
devices: {}
ephemeral: false
profiles:
- default
- test-profile
stateful: false
description: ""
root@v1:~# snap list lxd
Name  Version      Rev    Tracking     Publisher   Notes
lxd   6.2-afb00d0  31571  latest/edge  canonical✓  -
root@v1:~# uname -a
Linux v1 5.15.0-130-generic #140-Ubuntu SMP Wed Dec 18 17:59:53 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
root@v1:~# lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.5 LTS
Release:	22.04
Codename:	jammy

@tomponline
Member

@hamistao as you're looking at cloud-init issues ATM, would you mind seeing if you can reproduce this?

@tomponline tomponline assigned hamistao and unassigned simondeziel Jan 29, 2025
@hamistao
Contributor

@tomponline You got it!

@simondeziel
Member

We had a quick Meet with @hamistao and he was able to reproduce the issue. The error he got: cloud-init status --wait --long warns about falling back to the NoCloud DS instead of using the LXD datasource as it should.

Here, I see that lxd-agent.service starts before anything cloud-init related:

# systemd-analyze critical-chain 
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

graphical.target @7.206s
└─multi-user.target @7.205s
  └─snapd.seeded.service @5.709s +1.494s
    └─basic.target @5.689s
      └─sockets.target @5.689s
        └─snapd.socket @5.682s +5ms
          └─sysinit.target @5.660s
            └─cloud-init.service @4.366s +1.290s
              └─systemd-networkd-wait-online.service @2.535s +1.828s
                └─systemd-networkd.service @2.468s +53ms
                  └─network-pre.target @2.455s
                    └─cloud-init-local.service @708ms +1.745s
                      └─lxd-agent.service @1.044s +144ms
                        └─systemd-journald.socket @466ms
                          └─system.slice @393ms
                            └─-.slice @393ms

@hamistao
Contributor

hamistao commented Feb 14, 2025

After a debugging session, @simondeziel and I compiled the following findings:

As you can see, detecting the LXD datasource fails at 19:33:24,923:

2025-02-14 19:33:24,922 - sources[DEBUG]: Searching for local data source in: ['DataSourceLXD']
2025-02-14 19:33:24,922 - handlers.py[DEBUG]: start: init-local/search-LXD: searching for local data from DataSourceLXD
2025-02-14 19:33:24,922 - sources[DEBUG]: Seeing if we can get any data from <class 'cloudinit.sources.DataSourceLXD.DataSourceLXD'>
2025-02-14 19:33:24,922 - sources[DEBUG]: Update datasource metadata and network config due to events: boot-new-instance
**2025-02-14 19:33:24,923 - sources[DEBUG]: Did not detect DataSourceLXD**
2025-02-14 19:33:24,934 - performance.py[DEBUG]: Getting metadata took 0.012 seconds
2025-02-14 19:33:24,934 - sources[DEBUG]: Datasource DataSourceLXD not updated for events: boot-new-instance
2025-02-14 19:33:24,934 - handlers.py[DEBUG]: finish: init-local/search-LXD: SUCCESS: no local data found from DataSourceLXD

And the lxd-agent only starts at 19:33:25, so definitely after cloud-init failed to detect the LXD datasource.

Feb 14 19:33:24 ubuntu systemd[1]: Starting lxd-agent.service - LXD - agent...
Feb 14 19:33:25 ubuntu lxd-agent[309]: time="2025-02-14T19:33:25Z" level=info msg=Starting
Feb 14 19:33:25 vm lxd-agent[309]: time="2025-02-14T19:33:25Z" level=info msg="Started vsock listener"
Feb 14 19:33:25 vm systemd[1]: Started lxd-agent.service - LXD - agent.
Feb 14 19:34:27 vm su[926]: (to root) root on pts/0
Feb 14 19:34:27 vm su[926]: pam_unix(su-l:session): session opened for user root(uid=0) by (uid=0)

That also happens to explain why @simondeziel wasn't able to reproduce it and I was: the characteristics of my machine make it ideal for provoking this race consistently.

Why this only happens on ubuntu-minimal is still a mystery, but we can conjecture that something in the minimal image slows down the lxd-agent startup. As shown above, it takes 968ms on ubuntu-minimal:n, while further tests showed it takes 638ms on ubuntu:n, so perhaps this is something we can look into.

Also, lxd-agent is ordered to start before cloud-init-local.service, yet on some systems it consistently finishes starting after cloud-init-local has already run, even though its systemd unit declares it should start Before cloud-init. @simondeziel is looking into the semantics of Before and searching for a way to ensure the start of cloud-init-local.service is gated until lxd-agent is fully started.
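
For what it's worth, both the ordering systemd actually computed and the per-unit startup times can be inspected from inside each guest with standard systemd tooling (debugging aid only, nothing LXD-specific):

systemd-analyze blame | grep -E 'lxd-agent|cloud-init'
systemctl show -p Before lxd-agent.service
systemctl show -p After cloud-init-local.service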

This is the critical chain from this experiment:

graphical.target @15.964s
└─multi-user.target @15.962s
  └─snapd.seeded.service @11.064s +4.893s
    └─basic.target @10.834s
      └─sockets.target @10.833s
        └─snapd.socket @10.820s +11ms
          └─sysinit.target @10.799s
            └─cloud-init.service @7.588s +3.207s
              └─systemd-networkd-wait-online.service @8.862s +58ms
                └─systemd-networkd.service @8.808s +39ms
                  └─network-pre.target @6.147s
                    └─cloud-init-local.service @1.148s +4.995s
                      └─lxd-agent.service @2.663s +968ms
                        └─systemd-journald.socket @442ms
                          └─-.mount @316ms
                            └─-.slice @317ms

@holmanb Could you please take a look at this when you have the time? We would like to know your take on fixing this. It is possible a patch to cloud-init will be needed; if so, I would be happy to contribute said patch myself :D

@tomponline
Member

Nice sleuthing both!

@simondeziel
Member

On the systemd side, I've come to the conclusion that lxd-agent.service has the needed ordering condition so it is fully started before cloud-init-local.service is given the start signal.

Worth noting: in the systemctl status lxd-agent.service output, the VM's name changes midway from ubuntu to $instanceName, but that's part of the lxd-agent.service startup (https://github.com/canonical/lxd/blob/main/lxd-agent/main_agent.go#L67-L92), a bit before it signals readiness to systemd (https://github.com/canonical/lxd/blob/main/lxd-agent/main_agent.go#L178).

Locally (where I cannot reproduce the race), we clearly see that lxd-agent is fully started (16:48:27.129) before init-local starts logging (16:48:27,841):

root@vm1:~# journalctl -o short-iso-precise -u lxd-agent
2025-02-14T16:48:26.985087+00:00 ubuntu systemd[1]: Starting lxd-agent.service - LXD - agent...
2025-02-14T16:48:27.120015+00:00 ubuntu lxd-agent[338]: time="2025-02-14T16:48:27Z" level=info msg=Starting
2025-02-14T16:48:27.120575+00:00 vm1 lxd-agent[338]: time="2025-02-14T16:48:27Z" level=info msg="Started vsock listener"
2025-02-14T16:48:27.129900+00:00 vm1 systemd[1]: Started lxd-agent.service - LXD - agent.
2025-02-14T16:52:01.344202+00:00 vm1 su[907]: (to root) root on pts/0
2025-02-14T16:52:01.345823+00:00 vm1 su[907]: pam_unix(su-l:session): session opened for user root(uid=0) by (uid=0)
root@vm1:~# cat /var/log/cloud-init.log 
2025-02-14 16:48:27,841 - log_util.py[DEBUG]: Cloud-init v. 24.4-0ubuntu1~24.04.2 running 'init-local' at Fri, 14 Feb 2025 16:48:27 +0000. Up 3.35 seconds.
2025-02-14 16:48:27,841 - main.py[INFO]: PID [1] started cloud-init 'init-local'.
2025-02-14 16:48:27,841 - main.py[DEBUG]: No kernel command line url found.
2025-02-14 16:48:27,841 - main.py[DEBUG]: Closing stdin
2025-02-14 16:48:27,844 - util.py[DEBUG]: Writing to /var/log/cloud-init.log - ab: [640] 0 bytes
2025-02-14 16:48:27,844 - util.py[DEBUG]: Changing the ownership of /var/log/cloud-init.log to 0:4
2025-02-14 16:48:27,845 - util.py[DEBUG]: Writing to /var/lib/cloud/data/python-version - wb: [644] 4 bytes
2025-02-14 16:48:27,845 - util.py[DEBUG]: Attempting to remove /var/lib/cloud/instance/boot-finished
2025-02-14 16:48:27,845 - handlers.py[DEBUG]: start: init-local/check-cache: attempting to read from cache [check]
2025-02-14 16:48:27,845 - util.py[DEBUG]: Reading from /var/lib/cloud/instance/obj.pkl (quiet=False)
2025-02-14 16:48:27,845 - stages.py[DEBUG]: no cache found
2025-02-14 16:48:27,845 - handlers.py[DEBUG]: finish: init-local/check-cache: SUCCESS: no cache found
2025-02-14 16:48:27,850 - stages.py[DEBUG]: Using distro class <class 'cloudinit.distros.ubuntu.Distro'>
2025-02-14 16:48:27,851 - sources[DEBUG]: Looking for data source in: ['LXD', 'None'], via packages ['', 'cloudinit.sources'] that matches dependencies ['FILESYSTEM']
2025-02-14 16:48:27,855 - sources[DEBUG]: Searching for local data source in: ['DataSourceLXD']
2025-02-14 16:48:27,855 - handlers.py[DEBUG]: start: init-local/search-LXD: searching for local data from DataSourceLXD
2025-02-14 16:48:27,855 - sources[DEBUG]: Seeing if we can get any data from <class 'cloudinit.sources.DataSourceLXD.DataSourceLXD'>
2025-02-14 16:48:27,855 - sources[DEBUG]: Update datasource metadata and network config due to events: boot-new-instance
2025-02-14 16:48:27,856 - sources[DEBUG]: Detected DataSourceLXD
2025-02-14 16:48:27,871 - DataSourceLXD.py[DEBUG]: [GET] [HTTP:200] http://lxd/1.0/meta-data
2025-02-14 16:48:27,885 - DataSourceLXD.py[DEBUG]: [GET] [HTTP:200] http://lxd/1.0/config
2025-02-14 16:48:27,898 - DataSourceLXD.py[DEBUG]: [GET] [HTTP:200] http://lxd/1.0/config/cloud-init.user-data
2025-02-14 16:48:27,912 - DataSourceLXD.py[DEBUG]: [GET] [HTTP:200] http://lxd/1.0/devices
2025-02-14 16:48:27,912 - util.py[DEBUG]: Attempting to load yaml from string of length 70 with allowed root types (<class 'dict'>,)

@hamistao
Contributor

The error he got: cloud-init status --wait --long warns about falling back to the NoCloud DS instead of using the LXD datasource as it should.

A quick correction here: if cloud-init finds a device suitable to serve as a NoCloud datasource, NoCloud takes precedence over LXD. And by the way, using a NoCloud datasource works fine in minimal images. The fallback in this case is DataSourceNone, i.e. no datasource:

status: done
extended_status: degraded done
boot_status_code: enabled-by-generator
last_update: Thu, 01 Jan 1970 00:00:12 +0000
detail: DataSourceNone
errors: []
recoverable_errors:
WARNING:
        - Used fallback datasource
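
As an aside, the datasource cloud-init settled on can also be queried directly inside the guest (assuming the standard v1 instance-data keys):

cloud-init query v1.cloud_name
cloud-init query v1.platform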

@hamistao
Contributor

I have been looking into the lxd-agent code and found the section where we notify systemd to continue boot: https://github.com/hamistao/lxd/blob/main/lxd-agent/main_agent.go#L177. Then I added extra logging to help me determine whether /dev/lxd/sock has been created by the time we notify systemd:

if os.Getenv("NOTIFY_SOCKET") != "" {
	info, err := os.Stat("/dev/lxd/sock")
	if err != nil {
		fmt.Println(err.Error())
	} else {
		fmt.Println(info.Name())
		fmt.Println(info.Mode().String())
	}
	_, err = shared.RunCommand("systemd-notify", "READY=1")
	if err != nil {
		cancelStatusNotifier() // Ensure STOPPED status is written to QEMU status ringbuffer.
		cancelFunc()

		return fmt.Errorf("Failed to notify systemd of readiness: %w", err)
	}
}

And I got these logs on a ubuntu-minimal:n VM:

root@vm:~# journalctl -o short-iso-precise -u lxd-agent
2025-02-17T13:45:04.206596+00:00 ubuntu systemd[1]: Starting lxd-agent.service - LXD - agent...
2025-02-17T13:45:04.802147+00:00 ubuntu lxd-agent[300]: time="2025-02-17T13:45:04Z" level=info msg=Starting
2025-02-17T13:45:04.807475+00:00 vm lxd-agent[300]: time="2025-02-17T13:45:04Z" level=info msg="Started vsock listener"
2025-02-17T13:45:04.845477+00:00 vm lxd-agent[300]: sock
**2025-02-17T13:45:04.845477+00:00 vm lxd-agent[300]: Srw-------**
2025-02-17T13:45:04.895202+00:00 vm systemd[1]: Started lxd-agent.service - LXD - agent.
2025-02-17T13:45:19.105216+00:00 vm su[855]: (to root) root on pts/0
2025-02-17T13:45:19.106059+00:00 vm su[855]: pam_unix(su-l:session): session opened for user root(uid=0) by (uid=0)
root@vm:~# cat /var/log/cloud-init.log | grep DataSourceLXD
2025-02-17 13:45:04,646 - sources[DEBUG]: Searching for local data source in: ['DataSourceLXD']
2025-02-17 13:45:04,646 - handlers.py[DEBUG]: start: init-local/search-LXD: searching for local data from DataSourceLXD
2025-02-17 13:45:04,646 - sources[DEBUG]: Seeing if we can get any data from <class 'cloudinit.sources.DataSourceLXD.DataSourceLXD'>
**2025-02-17 13:45:04,647 - sources[DEBUG]: Did not detect DataSourceLXD**
2025-02-17 13:45:04,653 - sources[DEBUG]: Datasource DataSourceLXD not updated for events: boot-new-instance
2025-02-17 13:45:04,653 - handlers.py[DEBUG]: finish: init-local/search-LXD: SUCCESS: no local data found from DataSourceLXD

We can see that, assuming these timestamps are accurate and precise, the chronological order of operations is: cloud-init fails to detect LXD -> my additional logging -> lxd-agent notifies systemd to continue boot. That would mean that cloud-init is starting before the lxd-agent notifies systemd to release the cloud-init services. So far I am not sure what to conclude on this end.

@hamistao
Contributor

hamistao commented Feb 17, 2025

And this is just a guess, but this behavior is so consistent in my setup, even though the time gap between the events is somewhat small, that I am starting to wonder whether this is really a race condition. That said, I have no idea what else it could be.

@holmanb
Member

holmanb commented Feb 17, 2025

Hey @hamistao and @simondeziel, thanks for digging into this.

@holmanb Could you please take a look at this when you have the time? We would like to know your take on fixing this. It is possible a patch to cloud-init will be needed; if so, I would be happy to contribute said patch myself :D

I can still reproduce the issue locally:

$ lxc launch ubuntu-minimal:526b11bb926ebe8a1d05e5f69b02d4a311b311a9a9acfe760f210ef8d45c2bc6 minimal --vm
$ lxc exec minimal -- cloud-init status --long
status: done
extended_status: degraded done
boot_status_code: enabled-by-generator
last_update: Thu, 01 Jan 1970 00:00:16 +0000
detail: DataSourceNone
errors: []
recoverable_errors:
WARNING:
        - Used fallback datasource

I would need to better understand LXD's architecture to help you debug the issue. Something is removing the socket at runtime after the systemd generator and before cloud-init-local starts. I'm not sure how, or why, but this still appears to be an issue in LXD.

@hamistao
Contributor

hamistao commented Feb 20, 2025

Something is removing the socket at runtime after the systemd generator and before cloud-init-local starts. I'm not sure how, or why, but this still appears to be an issue in LXD.

@holmanb Could you elaborate on your theory of /dev/lxd/sock being removed? Could you detect the removal somehow?

I ask because some experiments suggest the problem lies in the ordering of steps. Read further if you wish to better understand the evidence at hand.

As shown in #14605 (comment), the order of operations is not being preserved in ubuntu-minimal. To test for that, I removed the logic that notifies systemd and releases the cloud-init services to start. Since the lxd-agent unit is configured with Before=... cloud-init-local.service, and cloud-init-local.service is responsible for detecting the datasource, we expect that not notifying systemd from the agent will block the start of cloud-init-local.service until it times out. This is what happens with ubuntu:n. But in ubuntu-minimal:n, there is some activity in the cloud-init logs even though the systemd notification was removed. This includes the line:

2025-02-20 05:18:04,839 - sources[DEBUG]: Did not detect DataSourceLXD

Then I analyzed the systemd-analyze plot output for boots on ubuntu-minimal:n and ubuntu:n.

On ubuntu-minimal:n, we can clearly see that cloud-init-local.service starts before lxd-agent has fully started:

[screenshot: systemd-analyze plot, ubuntu-minimal:n]

[screenshot: systemd-analyze plot, ubuntu-minimal:n]

While on ubuntu:n, the expected ordering can be seen:

[screenshot: systemd-analyze plot, ubuntu:n]

Then I started experimenting with the lxd-agent unit and found that removing cloud-init-local.service from Before and adding either WantedBy=cloud-init-local.service or RequiredBy=cloud-init-local.service fixes the problem and gives consistent behavior across both images, resulting in the following unit:

[Unit]
Description=LXD - agent
Documentation=https://documentation.ubuntu.com/lxd/en/latest/
Before=multi-user.target cloud-init.target cloud-init.service
[RequiredBy|WantedBy]=cloud-init-local.service
DefaultDependencies=no

I still cannot fully understand why this fixes the order of operations, since the documentation gives the following definitions:

Before:

If unit foo.service contains the setting Before=bar.service and both units are being started, bar.service's start-up is delayed until foo.service has finished starting up.

WantedBy and RequiredBy have the same behavior as adding Wants or Requires to cloud-init-local.service, and:

Wants:

Note that requirement dependencies do not influence the order in which services are started or stopped. This has to be configured independently with the After= or Before= options.

Requires is basically a more restrictive Wants (https://www.freedesktop.org/software/systemd/man/latest/systemd.unit.html#Requires=):

Similar to Wants=, but declares a stronger requirement dependency.

We can rule out Requires as a solution since:

If one of the other units fails to activate, and an ordering dependency After= on the failing unit is set, this unit will not be started. Besides, with or without specifying After=, this unit will be stopped (or restarted) if one of the other units is explicitly stopped (or restarted).

Similar behavior can be achieved with BindsTo as well, with a similar problem as with Requires:

Use the BindsTo= dependency type together with After= to ensure that a unit may never be in active state without a specific other unit also in active state (see below).

Configures requirement dependencies, very similar in style to Requires=. However, this dependency type is stronger: in addition to the effect of Requires= it declares that if the unit bound to is stopped, this unit will be stopped too.

To be honest, I found the descriptions confusing and even contradictory at times, so the questions now are:
1. Is it safe to use WantedBy as a solution to this problem?
2. Why do the mentioned keys fix the problem, while Before and After seem ineffective in ubuntu-minimal?

A few more comments: both images use systemd 255 (255.4-1ubuntu8.5), and I repeated some experiments more than once to ensure this wasn't just me luckily not hitting the race condition during a given run (note that so far the problematic behavior has been 100% consistent in my setup).

cc @simondeziel

@simondeziel
Member

@hamistao when you added WantedBy=cloud-init-local.service to the LXD agent unit, it seems like you added it to the [Unit] section, which I believe is not supported (https://www.freedesktop.org/software/systemd/man/latest/systemd.unit.html#Mapping%20of%20unit%20properties%20to%20their%20inverses):

Note: WantedBy=, RequiredBy=, and UpheldBy= are used in the [Install] section to create symlinks in .wants/, .requires/, and .upholds/ directories. They cannot be used directly as a unit configuration setting.
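
For completeness, the inverse relationship can be expressed with keys that are valid in [Unit] by putting a drop-in on cloud-init-local.service instead, e.g. (a sketch to experiment with, not a confirmed fix):

mkdir -p /etc/systemd/system/cloud-init-local.service.d
cat > /etc/systemd/system/cloud-init-local.service.d/lxd-agent.conf << 'EOF'
[Unit]
Wants=lxd-agent.service
After=lxd-agent.service
EOF
systemctl daemon-reload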

@hamistao
Contributor

hamistao commented Feb 20, 2025

I did see this, but the reverse property table conveys a different idea:

[screenshot: the "Mapping of unit properties to their inverses" table from the systemd.unit man page]

So I tried it and got the results above

@simondeziel
Member

I did see this, but the reverse property table conveys a different idea:

Hmm, I read the Where used column as confirming that WantedBy= is something to use only in the [Install] section, as indicated by https://www.freedesktop.org/software/systemd/man/latest/systemd.unit.html#WantedBy= being documented there.

[screenshot: the same systemd.unit man page table]

So I tried it and got the results above

Would you mind double checking the key was not simply ignored? journalctl --grep 'Unknown key name'

root@c1:~# cat /etc/systemd/system/lxd-agent.service.d/override.conf
[Unit]
WantedBy=cloud-init-local.service
root@c1:~# systemctl daemon-reload
root@c1:~# journalctl --grep 'Unknown key name'
-- Boot 2beba2ad2d1f405c84ad8ed88a5e0649 --
-- Boot 4986046da0924439b1d161340f43044f --
-- Boot 3716afc61f9c4d4dbc58d864e759de76 --
-- Boot e329263c6664477eb0322c5ce3f04051 --
Feb 20 17:30:06 c1 systemd[1]: /etc/systemd/system/lxd-agent.service.d/override.conf:2: Unknown key name 'WantedBy' in section 'Unit', ignoring.

@hamistao
Contributor

Oh I see. I read the table as "Can be used in either [Unit] or [Install]".

Indeed running journalctl --grep 'Unknown key name' shows Unknown key name 'WantedBy' in section 'Unit', ignoring.

That prompted me to double check my workflow, and indeed there was a problem that made me attribute the success to the addition of WantedBy when in fact it was only chance.

I have now fixed my workflow and am running more tests; I should have more news soon.

@holmanb
Member

holmanb commented Mar 12, 2025

Something is removing the socket at runtime after the systemd generator and before cloud-init-local starts. I'm not sure how, or why, but this still appears to be an issue in LXD.

@holmanb Could you elaborate on your theory of /dev/lxd/sock being removed? Could you detect the removal somehow?

Checking the existence of the socket is how cloud-init identifies that it is running on LXD. If the socket exists, cloud-init-local.service will identify that it was running on LXD.

@hamistao
Contributor

Checking the existence of the socket is how cloud-init identifies that it is running on LXD. If the socket exists, cloud-init-local.service will identify that it was running on LXD.

@holmanb Indeed that makes sense to me, but if the evidence of removal is just the absence of the /dev/lxd socket, then our current theory (that the root of the problem is cloud-init-local.service occasionally starting before the lxd-agent) is also a valid explanation. If you happen to know of something that could be messing up the boot sequence in such a manner, please let us know.
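
If it ever becomes necessary to tell "removed" apart from "never created", one option would be to watch the directory for filesystem events, assuming inotify-tools can be installed in the guest and that /dev/lxd already exists:

inotifywait -m -e create,delete /dev/lxd/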

@tomponline tomponline modified the milestones: lxd-6.3, lxd-6.4 Mar 13, 2025
@holmanb
Member

holmanb commented Mar 13, 2025

Checking the existence of the socket is how cloud-init identifies that it is running on LXD. If the socket exists, cloud-init-local.service will identify that it was running on LXD.

@holmanb Indeed that makes sense to me, but if the evidence of removal is just the absence of the /dev/lxd socket, then our current theory (that the root of the problem is cloud-init-local.service occasionally starting before the lxd-agent) is also a valid explanation. If you happen to know of something that could be messing up the boot sequence in such a manner, please let us know.

My bad, I see that the issue is only happening on VMs, where cloud-init initially detects via board-name since the socket doesn't exist yet. So you're right - I don't think that the socket was removed and re-added. Sorry for the misdirection there.

Yes, if cloud-init-local.service is starting before lxd-agent has created the socket, then this would likely cause the issue. Does this issue reproduce with cloud-init 24.4.1?
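
For reference, the cloud-init version inside the guest can be checked with (instance name as in my reproduction above):

lxc exec minimal -- cloud-init --version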

@canonical canonical locked and limited conversation to collaborators Mar 14, 2025