Skip to content

Conversation

yasminvalim
Copy link
Contributor

@yasminvalim yasminvalim commented Sep 2, 2025

BUG REPORTED:

OCP 4.17 Single node subsequent reboot after initial install fails.

SOLUTION:
Switch from using lsblk to blkid -p in the get_filesystem_uuid() function to bypass cached device information and retrieve fresh filesystem UUIDs.

@yasminvalim yasminvalim self-assigned this Sep 2, 2025
@yasminvalim yasminvalim added jira and removed jira labels Sep 2, 2025
Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to fix a race condition where outdated partition information, specifically UUIDs, could be used after a partition resize. The approach is to force a reread of the partition table and wait for udev events to settle before collecting device information.

The change to make reread_partition_table public is correct. However, the implementation in rootmap.rs has a couple of issues. It attempts to reread the partition table on what is likely a partition device instead of the parent whole-disk device, and it silently ignores errors which could cause the fix to be ineffective. My review includes a suggestion to address these points by finding the parent device and propagating errors.

With the suggested change, the fix should be more robust and reliable.

@yasminvalim yasminvalim requested a review from jlebon September 5, 2025 16:27
Copy link
Member

@jlebon jlebon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I don't think rereading the partition table is the fix for https://issues.redhat.com/browse/OCPBUGS-57573. The issue there has nothing to do with a stale partition table. The issue is that the rootfs filesystem UUID is stale.

To quote my comment from that bug:

Note to CoreOS devs: I think the easiest fix for this is to switch the rdcore rootmap code to use the blkid helper for this instead of lsblk:

pub fn blkid_single(dev: &Path) -> Result<HashMap<String, String>> {

I.e. get_filesystem_uuid() is what we call below when generating the root= karg; that function currently calls lsblk_single, but let's have it instead call blkid_single which uses blkid -p to bypass the cache.

@yasminvalim
Copy link
Contributor Author

Thanks for your review, @jlebon. I guess I understand better what happened now, I am new to this codebase so thank you for the explanation about the problem. I updated with the right fix :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants