Expand BOOTSEQ.md #193

Draft · wants to merge 9 commits into main
Conversation

tedstreete (Contributor)

Signed-off-by: tedstreete <[email protected]>
tedstreete requested a review from a team as a code owner on January 10, 2023 at 13:26.
@jainvipin left a comment

@tedstreete - good description on the challenges and possible alternatives.

- There are different kinds of resets. Generally, XPU core complex reset and XPU PCIe MAC reset should be separated for minimally disruptive ISSU, etc., but a complete SoC reset of the xPU will cause PCIe errors and may result in a controlled or uncontrolled host reset.
- See _Host Reset or Crash_ and _xPU Reset or Crash_ sections for specifics.

- There are use cases for network boot only where the host is used only to provide power and cooling to one or more xPUs.


Can we call this out explicitly as the bump-in-the-wire xPU use case?

@mrabeda commented on Feb 21, 2023

Is it only the "bump in the wire" case? It may also be a bare-metal hosting case, where the host is completely isolated from the xPU (that is, isolated to the extent a PCIe card can be isolated from the host: power and thermals).

- There are different kinds of resets. Generally, XPU core complex reset and XPU PCIe MAC reset should be separated for minimally disruptive ISSU, etc., but a complete SoC reset of the xPU will cause PCIe errors and requires a server reboot.
- There are use cases for network boot only.
- Resetting the xPU is assumed to cause PCIe errors that may impact the operation of the host OS.
- There are different kinds of resets. Generally, XPU core complex reset and XPU PCIe MAC reset should be separated for minimally disruptive ISSU, etc., but a complete SoC reset of the xPU will cause PCIe errors and may result in a controlled or uncontrolled host reset.


I understand this is not text you introduced; it existed previously. However, while we are making changes to this file, we could also change this statement to refer to xPU architectures that implement a PCIe function segregated from the ARM subsystem. E.g., if the PCIe layer is programmable and controlled by the ARM SoC, then this may not be applicable. Again, my point is only to clarify the statement, not to remove it altogether.

Contributor

Actually, the challenge occurs when the ARM cores control the PCIe layer. During the time it takes for the PCIe layer to initialize, the xPU is likely to cause PCIe errors for PCIe config cycles.


The "complete SoC reset" statement is the main line here, IMHO. I would be inclined to assume that, no matter what, a cold/warm full xPU reset domain incorporates both the PCIe subsystem and the Arm complex, whether the PCIe subsystem is completely independent or tied to the Arm complex, as @jainvipin distinguishes. As a result, PCIe errors caused by that reset may result in a host reset, controllable or not.

Downstream from there, we could either state an OPI preference or an advisory that xPU designs might/should draw separate reset domains for the PCIe subsystem and the Arm complex, for the reasons above. However, it might be risky, considering the possible relationship between the Arm complex and the PCIe subsystem. To add more flavor: in Intel IPUs, the IMC complex programs the PCIe subsystem and handles resets coming from the host (e.g. PERST#); the ACC complex is not involved.


> Actually, the challenge occurs when the ARM cores control the PCIe layer. During the time it takes for the PCIe layer to initialize, the xPU is likely to cause PCIe errors for PCIe config cycles.

Agree; if the ARM cores are involved in controlling the PCIe layer, then this would have an impact. If the programmability of the PCIe layer is controlled by the ARM complex, then that too would indirectly involve the ARM complex.

- These events may not be synchronous. The xPU may be installed in a PCIe slot providing persistent power. The xPU power cycle may be a function of the host BMC (and potentially the xPU BMC) rather than the power state of the host.
- Is the host CPU halted?
- Can we wait in UEFI/BIOS in the case of network boot via the xPU?
- The host OS continues its boot. The xPU needs to either respond to PCIe transactions or respond with config retry cycles (CRS) until it is ready. This automatically holds the host BIOS.


Retry config cycles (CRS) have limits. I mean, we wouldn't be able to hold this off long enough for a general-purpose OS to boot on the ARM subsystem. If you wonder why the OS needs to boot as long as PCIe functions can be served, I'd point to systems that have programs to download in order to assume a certain PCIe function, which can only happen after OS boot. For example, for an AMD Pensando device, the decision to present a VirtIO device versus a native Pensando device is a configurable, late-binding decision and can't be made before OS boot. Let's also include the BIOS option where the server BIOS gets a nod that xPU reset is complete before taking the server CPU out of reset.

Contributor

Agreed. CRS is not practical. Some xPUs may take as long as 5 minutes after power is applied to respond to PCIe config cycles. The x86 CPU will time out and crash in tens of milliseconds. There is another PCIe retry mechanism in which the target interrupts the x86 host when it is ready, but BIOS is not ready to accept interrupts that early in the boot process.
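
For illustration only, below is a minimal sketch of the CRS polling idea being discussed, assuming CRS Software Visibility is enabled so that a config read of the Vendor ID returns the special value 0x0001 while the endpoint is still initializing. The sysfs path, BDF, and polling budget are hypothetical; real firmware does this polling before the OS exists, and the point of this thread is that the budget would need to be minutes rather than the tens of milliseconds typical today.

```c
/* Sketch: poll a device's Vendor ID until it stops returning the CRS
 * "retry" value (0x0001) or all-ones (no response), within a long,
 * configurable budget.  Path/BDF and timings are illustrative only. */
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

#define CFG_PATH "/sys/bus/pci/devices/0000:3b:00.0/config" /* hypothetical xPU BDF */
#define POLL_BUDGET_SEC   (5 * 60)   /* the thread mentions xPUs needing ~5 minutes */
#define POLL_INTERVAL_SEC 1

static int read_vendor_id(uint16_t *vid)
{
    FILE *f = fopen(CFG_PATH, "rb");
    if (!f)
        return -1;
    size_t n = fread(vid, sizeof(*vid), 1, f);   /* Vendor ID lives at offset 0 */
    fclose(f);
    return n == 1 ? 0 : -1;
}

int main(void)
{
    for (int waited = 0; waited <= POLL_BUDGET_SEC; waited += POLL_INTERVAL_SEC) {
        uint16_t vid;
        if (read_vendor_id(&vid) == 0 && vid != 0x0001 && vid != 0xFFFF) {
            printf("device ready after ~%d s, vendor id 0x%04x\n", waited, vid);
            return 0;
        }
        sleep(POLL_INTERVAL_SEC);
    }
    fprintf(stderr, "device did not become ready within %d s\n", POLL_BUDGET_SEC);
    return 1;
}
```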

@glimchb (Member) commented on Feb 17, 2023

@RezaBacchus unless we change the BIOS to allow longer CRS timeouts; this is what we did at Dell... will that work?

Contributor

In some implementations, while the xPU is initializing the PCIe manager, the xPU ignores config cycles, leading the BIOS to infer that the card is absent. In other cases, the xPU claims the config cycle but fails to respond, causing a catastrophic reset due to a Table of Record (ToR) timeout in the x86 CPU. CRS assumes that the PCIe manager on the xPU is stable enough to avoid these two cases.
The same behavior is observed when the ARM cores lock up during runtime. A PF0 HW PCIe register that is independent of the SW running on the ARM is a solution to both the boot-up and runtime issues. This, however, will require the PCI-SIG to standardize the appropriate capability and status registers, and will take time to implement in silicon.


@RezaBacchus This would assume that the PCIe subsystem is independent from the Arm complex - at least to the point where it can set up config space on its own. This is not the case for the Intel IPU, as the IMC configures the PCIe subsystem (including config space), including releasing the PCIe PHY from reset and enabling the link training SSM.

Considering how implementations of the PCIe subsystem may vary across xPU vendors, and given the CRS reliability concerns, CRS might not be the way to go.


Good with options that do any coordination with BIOS/BMC; going in interdependent lock-step will ensure the dependencies are cleared.

- Is the host CPU halted?
- Can we wait in UEFI/BIOS in the case of network boot via the xPU?
- The host OS continues its boot. The xPU needs to either respond to PCIe transactions or respond with config retry cycles (CRS) until it is ready. This automatically holds the host BIOS.
- The xPU may hold its PCIe channel down during xPU boot. When the xPU has finished booting, it generates a hot-plug event, and the host will insert the xPU into the PCIe root complex. Is this true for all host OSes?


PCIe hot-plug is not a reliable way to do this across various OSes. The support exists in code, but no one ever uses it. The challenge is that, without the NIC booting, sometimes the host can't boot either (say, if it were to obtain its image over the network, i.e., via the same xPU that is not yet present).

Contributor

Network boot through the xPU is a requirement for cloud native architectures where the tenant may change frequently. There is a workaround today that requires a performance NIC to provide that capability. Operators are seeking ways to eliminate the additional NIC to reduce opex and carbon footprint.


Yes, network boot is a need (agree), as is not having to add another NIC just for this purpose. It is the hot-plug part I was wondering about: is it needed to achieve this?


- Who tells the host OS/xPU OS to wait, and how?
- PCIe has a mechanism for config retry cycles (CRS) that can be used
- The host BIOS may need to be modified to allow a longer CRS response time before timing out


I'd prefer this mechanism - it is foolproof, guaranteed to work with all xPUs, and doesn't leave any room for error. CRS could be hit-and-miss.

Contributor

@jainvipin - it is not clear which mechanism you are referring to.
The server BMC could poll the xPU via I2C or NC-SI for its boot status. To avoid the BMC having to know unique HW locations on xPUs from each curated vendor, the polling mechanism must be abstracted on the xPU, for example via Redfish or PLDM. Again, these functions must be operational on the xPU within seconds after power is applied to the xPU.
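
To make that abstraction concrete, here is a minimal sketch of such a poll over Redfish using libcurl. The endpoint URL, credentials, and the "Ready" status string are hypothetical placeholders, not defined by OPI or by an existing Redfish schema; a PLDM-based poll over MCTP would play the same role.

```c
/* Sketch: poll a hypothetical Redfish resource on the xPU until it reports
 * ready.  URL, credentials, and the status string checked are placeholders. */
#include <curl/curl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static char body[8192];

static size_t collect(char *data, size_t size, size_t nmemb, void *userp)
{
    (void)userp;
    size_t used = strlen(body);
    size_t n = size * nmemb;
    if (n > sizeof(body) - 1 - used)
        n = sizeof(body) - 1 - used;              /* truncate; this is only a sketch */
    memcpy(body + used, data, n);
    body[used + n] = '\0';
    return size * nmemb;
}

int main(void)
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *h = curl_easy_init();
    if (!h)
        return 1;

    curl_easy_setopt(h, CURLOPT_URL, "https://xpu-bmc.example/redfish/v1/Systems/xpu0");
    curl_easy_setopt(h, CURLOPT_USERPWD, "admin:password");   /* placeholder creds   */
    curl_easy_setopt(h, CURLOPT_SSL_VERIFYPEER, 0L);          /* lab sketch only     */
    curl_easy_setopt(h, CURLOPT_WRITEFUNCTION, collect);

    for (int tries = 0; tries < 300; tries++) {               /* ~5 minute budget    */
        body[0] = '\0';
        /* Real code should parse the JSON; substring match keeps the sketch short. */
        if (curl_easy_perform(h) == CURLE_OK && strstr(body, "\"Ready\"")) {
            printf("xPU reports ready\n");
            break;
        }
        sleep(1);
    }
    curl_easy_cleanup(h);
    curl_global_cleanup();
    return 0;
}
```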


Sorry, at a high level I meant host BIOS modification to coordinate this. I agree with the polling mechanism you mention and that it is desirable for it to happen soon after power is applied.

- During a host OS reboot or crash, the host writes to the Chipset Reset Control Register (i.e., IO port 0xCF9) or to the ACPI FADT (see the sketch below this list)
- The platform (CPLD) firmware can monitor for these events and trigger the host BMC/BIOS to coordinate with the xPU to take appropriate action. Potential actions include:
- Ensuring that the xPU is gracefully shut down before a full host reboot
- Force-resetting the xPU after a timeout if the xPU has failed to gracefully shut down
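
For reference, a minimal sketch of the 0xCF9 write mentioned in the first item above, assuming an Intel-PCH-style Reset Control Register and Linux userspace with I/O-port permission. The 0x06 (warm) and 0x0E (full) encodings are the conventional ones, but exact semantics are chipset-specific; real platforms usually reach this register through the kernel's reboot path or the ACPI FADT reset register instead.

```c
/* Sketch: request a host reset by writing the Reset Control Register at
 * I/O port 0xCF9.  Values are the conventional warm/full-reset encodings;
 * exact behavior is chipset-specific.  Requires root (ioperm). */
#include <stdio.h>
#include <sys/io.h>

#define RST_CNT_PORT 0xCF9
#define WARM_RESET   0x06
#define FULL_RESET   0x0E

int main(int argc, char **argv)
{
    unsigned char value = (argc > 1 && argv[1][0] == 'f') ? FULL_RESET : WARM_RESET;

    if (ioperm(RST_CNT_PORT, 1, 1) != 0) {   /* grant access to this single port */
        perror("ioperm");
        return 1;
    }
    fprintf(stderr, "writing 0x%02x to port 0x%x\n", value, RST_CNT_PORT);
    outb(value, RST_CNT_PORT);               /* the platform resets immediately  */
    return 0;
}
```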


This is an interesting scenario. Should we assume that the xPU would never require a graceful shutdown? Or, if a general-purpose OS is booting on the xPU, is it desired that the next reboot boots an image properly? What about disk/flash-storage inconsistencies in such conditions?

Contributor

Purposely rebooting the server requires a graceful shutdown with the appropriate security audit logs.
A controlled shutdown may be initiated during some error conditions that initiate a crashdump.
A forced shutdown may be initiated during some error (lockup, thermal) or timeout conditions.


Sure, we can provide better logs and reset reasons for planned vs. unplanned shutdowns.


xPU OS/FW crash and/or independent reboot will result in a Downstream Port Containment (DPC) event on the PCIe bus

- The host (BMC/BIOS) firmware and host OS need to implement the DPC and hot-plug requirements defined in PCIe Firmware Spec 3.3 for these scenarios to work.


Ack. This also requires some basic mechanism for the BMC/BIOS to learn that an xPU exists in the enumeration of devices on the server.
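
As a small illustration of that discovery step, a sketch that checks whether a given port advertises the Downstream Port Containment extended capability (capability ID 0x001D) by walking the PCIe extended capability list exposed through sysfs. The BDF is hypothetical, and reading config space above offset 0x100 generally requires root.

```c
/* Sketch: walk a port's PCIe extended capability list and report whether
 * Downstream Port Containment (DPC, capability ID 0x001D) is advertised. */
#include <stdio.h>
#include <stdint.h>

#define CFG_PATH   "/sys/bus/pci/devices/0000:3a:00.0/config"  /* hypothetical port */
#define DPC_CAP_ID 0x001D

static uint32_t cfg_read32(FILE *f, long off)
{
    uint32_t v = 0;
    if (fseek(f, off, SEEK_SET) != 0 || fread(&v, sizeof(v), 1, f) != 1)
        return 0;
    return v;
}

int main(void)
{
    FILE *f = fopen(CFG_PATH, "rb");
    if (!f) {
        perror(CFG_PATH);
        return 1;
    }
    long off = 0x100;                            /* extended capabilities start here */
    while (off) {
        uint32_t hdr = cfg_read32(f, off);
        if (hdr == 0 || hdr == 0xFFFFFFFF)
            break;                               /* no list, or read not permitted   */
        if ((hdr & 0xFFFF) == DPC_CAP_ID) {
            printf("DPC capability found at offset 0x%lx\n", off);
            fclose(f);
            return 0;
        }
        off = (hdr >> 20) & 0xFFC;               /* next-capability pointer          */
    }
    printf("DPC capability not advertised\n");
    fclose(f);
    return 1;
}
```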

## Error handling
## xPU OS Installation mode

- An xPU may go through multiple reboots during xPU OS installation. Consequently, the host OS may choose to bring the xPU and host PCIe link down prior to xPU OS install, and to hold the link down until the install process has completed


This essentially means hot-plug is the only way to do this? I.e., the host OS is booted, and the PCIe link is brought down and then up?


What about Downstream Port Containment (DPC)?

- xPU OS/FW reboot is assumed to cause a host OS crash
- Not in all cases. E.g., NVIDIA's design separates the various components, and an ARM complex reboot should have no direct effect on the host OS. Similarly, some vendors implement ISSU processes for certain FW updates where no disruption is observed.
- Coordinated shutdown
- The xPU can provide in-band or out-of-band signalling to the host OS that it intends to restart. The host OS can respond by either


Out-of-band will be tricky to implement and will further increase dependencies on some other infra to communicate this. OTOH, if we require the xPU to always reboot when the host reboots, then this can be solved by an implicit understanding that a server reboot is required whenever the xPU needs one (coordinated shutdown case).

Contributor

Simultaneously rebooting both x86 and xPU seems reasonable, but let us make sure this covers all use cases.


Maybe I misunderstood out-of-band; I thought it was some coordination using a software stack over an IP network, whereas what you meant is PLDM sensors, I2C, etc. (i.e., non-PCIe).

- Elect to contain the event, removing the xPU from the bus table. When the xPU restarts, it will generate a hot-plug event on the PCIe bus, allowing the host BIOS/OS to reinsert the xPU.
- Note: this behavior is currently a largely untested feature in the Linux host OS, and is not available on Windows (a host-side rescan fallback is sketched below this block)

- _Further discussion required_ NVIDIA's design separates the various components, and an ARM complex reboot should have no direct effect on the host OS. Similarly, some vendors implement ISSU processes for certain FW updates where no disruption is observed.
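
Related to the note above: if the host OS has dropped the xPU and no hot-plug interrupt is acted on, one pragmatic host-side fallback on Linux is to force a bus rescan once management signalling (BMC/Redfish/PLDM, as discussed earlier) indicates the xPU is back. A minimal sketch, equivalent to `echo 1 > /sys/bus/pci/rescan`, follows; it requires root and is not a substitute for proper hot-plug/DPC support.

```c
/* Sketch: ask the Linux PCI core to re-enumerate the bus, e.g. after
 * out-of-band signalling indicates the rebooted xPU is ready again. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/sys/bus/pci/rescan", "w");
    if (!f) {
        perror("/sys/bus/pci/rescan");
        return 1;
    }
    fputs("1", f);    /* same as: echo 1 > /sys/bus/pci/rescan */
    fclose(f);
    printf("PCI rescan requested\n");
    return 0;
}
```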


Ack - it depends on what you are upgrading/changing. But why is ISSU kept under the xPU crash/reset bullet?

glimchb marked this pull request as draft on March 14, 2023 at 02:15.