Expand BOOTSEQ.md #193
base: main
Conversation
Signed-off-by: tedstreete <[email protected]>
Signed-off-by: Ted Streete <[email protected]>
Signed-off-by: Ted Streete <[email protected]>
Signed-off-by: Ted Streete <[email protected]>
@tedstreete - good description of the challenges and possible alternatives.
- There are different kinds of resets. Generally, xPU core complex reset and xPU PCIe MAC reset should be separated for minimally disruptive ISSU etc., but a complete SoC reset of the xPU will cause PCIe errors and may result in a controlled or uncontrolled host reset.
- See _Host Reset or Crash_ and _xPU Reset or Crash_ sections for specifics.
- There are use cases for network boot only, where the host is used only to provide power and cooling to one or more xPUs.
Can we call this out explicitly as the bump-in-the-wire xPU use case?
Is it only the "bump in the wire" case? It may be a bare-metal hosting case, where the host is completely isolated from the xPU (meaning, to the level a PCIe card can be isolated from the host - power and thermals).
- There are different kinds of resets. Generally, xPU core complex reset and xPU PCIe MAC reset should be separated for minimally disruptive ISSU etc., but a complete SoC reset of the xPU will cause PCIe errors and needs a reboot of the server.
- There are use cases for network boot only.
- Resetting the xPU is assumed to cause PCIe errors that may impact the operation of the host OS.
- There are different kinds of resets. Generally, xPU core complex reset and xPU PCIe MAC reset should be separated for minimally disruptive ISSU etc., but a complete SoC reset of the xPU will cause PCIe errors and may result in a controlled or uncontrolled host reset.
I understand this is not text you introduced; it existed previously. However, while we are making changes to this file, we could also change this statement to refer to xPU architectures that implement a PCIe function segregated from the ARM subsystem; e.g. if the PCIe layer is programmable and is controlled by the ARM SoC, then this may not be applicable. Again, my point is only to clarify the statement, not to remove it altogether.
Actually the challenge occurs when the ARM cores control the PCIe layer. During the time it takes for the PCIe layer to initialize, the xPU is likely to cause PCIe errors for PCIe Config Cycles.
The "complete SoC reset" statement is the main line here, IMHO. I would be inclined to assume that no matter what, cold/warm full xPU reset domain incorporates both PCIe subsystem and Arm complex, no matter if PCIe subsystem is completely independent or tied to Arm complex, as @jainvipin distinguishes. As a result, PCIe errors caused by that reset may result in host reset, controllable or not.
Downstream from there, we could either state an OPI preference or advisory that xPU design might/should draw separate reset domains for PCIe subsystem and Arm complex, because of reasons. However, it might be risky, considering the possible relationship between Arm complex and PCIe subsystem. To add more flavor, in Intel IPUs, IMC complex programs PCIe subsystem and handles resets coming from the host (e.g. PERST#); ACC complex is not involved.
> Actually the challenge occurs when the ARM cores control the PCIe layer. During the time it takes for the PCIe layer to initialize, the xPU is likely to cause PCIe errors for PCIe Config Cycles.

Agree - if the ARM cores are involved in controlling the PCIe layer, then this would have an impact. And if the programmability of the PCIe layer is controlled by the ARM complex, then that too would indirectly involve the ARM complex.
- These events may not be synchronous. The xPU may be installed in a PCIe slot providing persistent power. xPU power cycle may be a function of the host BMC (and potentially the xPU BMC) rather than the power state of the host.
- Is the Host CPU halted?
- Can we wait in UEFI/BIOS in case of network boot via xPU?
- Host OS continues to its boot. The xPU needs to either respond to PCIe transactions or respond with retry config cycles (CRS) till it is ready. This automatically holds the host BIOS.
Retry cycles (CRS) have limits; we wouldn't be able to hold this long enough for a general-purpose OS to boot on the ARM subsystem. If you wonder why the OS needs to boot at all as long as PCIe functions can be served, I'd point to systems that have to download programs to assume a certain PCIe function, which can only happen after OS boot. For example, on the AMD Pensando device, the decision to present a VirtIO device versus a native Pensando/AMD device is a configurable/late-binding decision and can't be made before OS boot. Let's also include the BIOS option where the server BIOS gets a nod that the xPU reset is complete before taking the server CPU out of reset.
Agreed. CRS is not practical. Some xPUs may take as long as 5 minutes after power is applied to respond to PCIe Config Cycles. The x86 CPU will time out and crash in tens of milliseconds. There is another PCIe retry mechanism where the target interrupts the x86 host when it is ready, but the BIOS is not ready to accept interrupts that early in the boot process.
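For illustration only, a minimal host-side sketch (Linux sysfs, hypothetical BDF and timings) of what "waiting for the xPU to answer config cycles" looks like in practice; it assumes the function was already enumerated and does not solve the BIOS-timeout problem described above:

```python
# Sketch: poll a PCIe function's Vendor ID via Linux sysfs until the device
# answers config cycles. Assumes the function was already enumerated (the
# sysfs node exists); the BDF and timings are hypothetical.
import struct
import time

XPU_BDF = "0000:3b:00.0"                           # hypothetical xPU address
CONFIG = f"/sys/bus/pci/devices/{XPU_BDF}/config"

def vendor_id() -> int:
    with open(CONFIG, "rb") as f:
        return struct.unpack("<H", f.read(2))[0]   # bytes 0-1 = Vendor ID

deadline = time.monotonic() + 300                  # some xPUs may need minutes
while time.monotonic() < deadline:
    vid = vendor_id()
    if vid not in (0xFFFF, 0x0001):                # all-ones / CRS-SV dummy value
        print(f"xPU answering config cycles, Vendor ID 0x{vid:04x}")
        break
    time.sleep(1)
else:
    print("xPU never became ready within the timeout")
```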
@RezaBacchus unless we change the BIOS to allow longer CRS timeouts - this is what we did at DELL... will that work?
In some implementations, while the xPU is initializing the PCIe manager, the xPU ignores config cycles, leading the BIOS to infer that the card is absent. In other cases, the xPU claims the config cycle but fails to respond, causing a catastrophic reset due to a Table of Requests (TOR) timeout in the x86 CPU. CRS assumes that the PCIe manager on the xPU is stable enough to avoid these two cases.
The same behavior is observed when the ARM cores lock up during runtime. A PF0 HW PCIe register that is independent of the SW running on the ARM is a solution to both the boot-up and runtime issues. This, however, will require the PCI-SIG to standardize the appropriate capability and status registers, and will take time to implement in silicon.
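To make the PF0 idea concrete, here is a purely hypothetical sketch: it walks the PCIe extended capability list (header layout per the PCIe spec) looking for a Vendor-Specific Extended Capability (ID 0x000B) and reads an invented "boot status" dword from it. No such register is standardized today, which is exactly the gap the comment above points out.

```python
# Hypothetical sketch: walk PF0's PCIe extended capability list looking for a
# Vendor-Specific Extended Capability (ID 0x000B) and read an invented
# "boot status" dword from it. The header layout (16-bit ID, 4-bit version,
# 12-bit next offset, list starts at 0x100) follows the PCIe spec; the
# boot-status register and its offset are made up for illustration.
import struct

def read_dword(cfg: bytes, off: int) -> int:
    return struct.unpack_from("<I", cfg, off)[0]

def find_ext_cap(cfg: bytes, wanted_id: int):
    off = 0x100
    while off and off + 4 <= len(cfg):
        hdr = read_dword(cfg, off)
        cap_id, nxt = hdr & 0xFFFF, (hdr >> 20) & 0xFFF
        if cap_id == wanted_id:
            return off
        off = nxt
    return None

# Root is typically required to read the full 4 KiB config space via sysfs.
with open("/sys/bus/pci/devices/0000:3b:00.0/config", "rb") as f:  # hypothetical BDF
    cfg = f.read(4096)

vsec = find_ext_cap(cfg, 0x000B)                   # 0x000B = VSEC capability ID
if vsec is not None:
    boot_status = read_dword(cfg, vsec + 0x0C)     # invented register offset
    print(f"hypothetical xPU boot status: 0x{boot_status:08x}")
```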
@RezaBacchus This would assume that the PCIe subsystem is independent of the Arm complex - at least to the point where it can set up config space on its own. This is not the case for the Intel IPU, as the IMC configures the PCIe subsystem (including config space), including releasing the PCIe PHY from reset and enabling the link training SSM.
Considering how implementations of the PCIe subsystem may vary across xPU vendors, and given the CRS reliability concerns, CRS might not be the way to go.
Good with options that do any coordination with BIOS/BMC - going in interdependent lock-step will ensure the dependencies are cleared.
- Is the Host CPU halted?
- Can we wait in UEFI/BIOS in case of network boot via xPU?
- Host OS continues to its boot. The xPU needs to either respond to PCIe transactions or respond with retry config cycles (CRS) till it is ready. This automatically holds the host BIOS.
- The xPU may hold its PCIe channel down during xPU boot. When the xPU has finished booting, it generates a hot-plug event, and the host will insert the xPU into the PCIe root complex. Is this true for all host OSes?
PCIe hot-plug is not a reliable way to do this across various OSes. The support exists in code, but no one ever uses it. The challenge is that without the NIC booting, sometimes the host can't boot either (say, if it were to obtain its image over the network, i.e. through the same xPU that is not yet present).
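For reference, the host-side knobs involved are plain sysfs writes. A minimal sketch (Linux only, root required, hypothetical BDF and delay) of driving the detach/rescan explicitly rather than relying on native hot-plug:

```python
# Sketch: detach the xPU before it reboots and rescan afterwards, instead of
# depending on native PCIe hot-plug. Linux only, root required; the BDF and
# the wait time are hypothetical.
from pathlib import Path
import time

XPU_BDF = "0000:3b:00.0"                       # hypothetical xPU address

def detach(bdf: str) -> None:
    Path(f"/sys/bus/pci/devices/{bdf}/remove").write_text("1\n")

def rescan() -> None:
    Path("/sys/bus/pci/rescan").write_text("1\n")

detach(XPU_BDF)                                # before the xPU resets
time.sleep(120)                                # illustrative: wait for the xPU to boot
rescan()                                       # re-enumerate; drivers rebind if present
```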
Network boot through the xPU is a requirement for cloud native architectures where the tenant may change frequently. There is a workaround today that requires a performance NIC to provide that capability. Operators are seeking ways to eliminate the additional NIC to reduce opex and carbon footprint.
Yes, network boot is a need (agree), as is not having to add another NIC just for this purpose. It is the hot plug that I was wondering whether we need in order to achieve this.
- Who tells the host OS/xPU OS to wait, and how?
- PCIe has a mechanism for config retry cycles (CRS) that can be used
- Host BIOS may need to be modified to allow longer CRS response time before timing out
I'd prefer this mechanism - it is foolproof, guaranteed to work with all xPUs, and doesn't leave any room for error. CRS could be hit-or-miss.
@jainvipin - it is not clear which mechanism you are referring to.
The server BMC could poll the xPU via I2C or NC-SI for its boot status. To avoid the BMC having to know unique HW locations on xPUs from each curated vendor, the polling mechanism must be abstracted on the xPU - for example, Redfish or PLDM. Again, these functions must be operational on the xPU within seconds after power is applied to the xPU.
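As a rough sketch of that kind of abstracted polling, the snippet below has a management agent poll a Redfish resource on the xPU for boot progress. The endpoint URL, credentials, and the use of the BootProgress property are assumptions for illustration, not a defined OPI or DMTF mapping for xPUs:

```python
# Sketch: poll a Redfish resource on the xPU's management endpoint until it
# reports the OS is up. The URL, credentials and BootProgress usage are
# illustrative assumptions, not a standardized xPU mapping.
import time
import requests

SYSTEM_URL = "https://xpu-bmc.example.org/redfish/v1/Systems/1"  # hypothetical
AUTH = ("admin", "password")                   # placeholder credentials

deadline = time.monotonic() + 300
while time.monotonic() < deadline:
    try:
        r = requests.get(SYSTEM_URL, auth=AUTH, verify=False, timeout=5)
        r.raise_for_status()
        state = r.json().get("BootProgress", {}).get("LastState")
        if state == "OSRunning":
            print("xPU reports boot complete")
            break
    except requests.RequestException:
        pass                                   # the management stack may itself still be booting
    time.sleep(5)
else:
    print("xPU did not report ready in time")
```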
Sorry, at a high level I meant host BIOS modification to coordinate this. I agree with the polling mechanism you mention, and that it should happen soon after power is applied to the xPU.
- During host OS reboot or crashes, the host writes to the Chipset Reset Control Register (i.e., IO port 0xCF9) or to the ACPI FADT
- The Platform (CPLD) firmware can monitor for these events and trigger the host BMC/BIOS to coordinate with the xPU to take appropriate action. Potential actions include:
  - Ensuring that the xPU is gracefully shut down before a full host reboot
  - Force resetting the xPU after a timeout if the xPU has failed to gracefully shut down
This is an interesting scenario. Should we assume that the xPU will never require a graceful shutdown? Or, if a general-purpose OS is booting on the xPU, is it desired that the next reboot boots a proper image? What about disk/flash-storage inconsistencies in such conditions?
Purposely rebooting the server requires a graceful shutdown with the appropriate security audit logs.
A controlled shutdown may be initiated during some error conditions that initiate a crashdump.
A forced shutdown may be initiated during some error (lockup, thermal) or timeout conditions.
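A minimal sketch of the graceful-then-forced ordering described in the quoted bullets above, with the signalling calls left as hypothetical placeholders (PLDM, Redfish, a doorbell register, etc.) and the forced path using the kernel's per-device reset attribute where the device supports it:

```python
# Sketch: ask the xPU to shut down gracefully, wait a bounded time, then
# force a reset. The signalling helpers are hypothetical placeholders; the
# forced path uses the kernel's per-device reset attribute (FLR/SBR chosen
# by the kernel) where the device supports it.
from pathlib import Path
import time

XPU_BDF = "0000:3b:00.0"                       # hypothetical xPU address
GRACE_PERIOD_S = 60                            # illustrative timeout

def request_xpu_shutdown() -> None:
    """Placeholder: in-band doorbell, PLDM, Redfish, etc."""

def xpu_is_down() -> bool:
    """Placeholder: poll the same channel for shutdown completion."""
    return False

request_xpu_shutdown()
deadline = time.monotonic() + GRACE_PERIOD_S
while time.monotonic() < deadline:
    if xpu_is_down():
        break
    time.sleep(2)
else:
    # Graceful shutdown timed out: force-reset the xPU before the host reboot proceeds.
    Path(f"/sys/bus/pci/devices/{XPU_BDF}/reset").write_text("1\n")
```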
Sure, we can provide better logs and reset reasons during planned vs. unplanned shutdowns.
xPU OS/FW crash and/or independent reboot will result in a Downstream Port Containment (DPC) event on the PCIe bus.
- The host (BMC/BIOS) firmware and host OS need to implement the DPC and hot-plug requirements defined in PCIe Firmware spec 3.3 for these scenarios to work.
Ack. This also requires some basic mechanism for BMC/BIOS to learn that an xPU exists in the enumeration of devices on the server.
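One simplistic way a host-side agent could surface that, sketched below: scan the enumerated PCI functions in sysfs and flag those whose vendor ID matches a curated list. The ID list here is only illustrative, and a real flow would more likely publish the inventory over Redfish or PLDM:

```python
# Sketch: report which enumerated PCI functions look like xPUs so BMC/BIOS
# inventory can learn one exists. The vendor ID list is illustrative and
# would need to be curated; a real flow would publish this over Redfish/PLDM.
from pathlib import Path

XPU_VENDOR_IDS = {"0x1dd8", "0x15b3"}          # illustrative examples only

for dev in sorted(Path("/sys/bus/pci/devices").iterdir()):
    vendor = (dev / "vendor").read_text().strip()
    if vendor in XPU_VENDOR_IDS:
        device = (dev / "device").read_text().strip()
        print(f"possible xPU at {dev.name}: vendor={vendor} device={device}")
```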
## Error handling
## xPU OS Installation mode
- An xPU may go through multiple reboots during xPU OS installation. Consequently, the host OS may choose to bring the xPU and host PCIe link down prior to xPU OS install, and to hold the link down until the install process has completed.
This essentially means hot plug is the only way to do this? I.e. the host OS is booted, and the PCIe link is brought down and then up?
What about Downstream Port Containment (DPC)?
- xPU OS/FW reboot assumed to cause Host OS crash
- Not in all cases. E.g. NVIDIA's design separates the various components, and an ARM complex reboot should have no direct effect on the host OS. Similarly, some vendors implement ISSU processes for certain FW updates where no disruption is observed.
- Coordinated shutdown
- The xPU can provide in-band or out-of-band signalling to the host OS that it intends to restart. The host OS can respond by either
Out-of-band will be tricky to implement and will further increase the dependencies on other infrastructure to communicate this. OTOH, if we require the xPU to always reboot when the host reboots, then this can be solved by an implicit understanding that a reboot of the server is required when the xPU needs it (the coordinated shutdown case).
Simultaneously rebooting both x86 and xPU seems reasonable, but let us make sure this covers all use cases.
Maybe I misunderstood out-of-band; I thought it was coordination using some software stack over an IP network, whereas what you meant is PLDM sensors, I2C, etc. (i.e. non-PCIe).
- Elect to contain the event, removing the xPU from the bus table. When the xPU restarts, it will generate a hot-plug event on the PCIe bus, allowing the host BIOS/OS to reinsert the xPU.
- Note: This behavior is currently a largely untested feature in Linux host OS, and is not available on Windows.
- _Further discussion required:_ NVIDIA's design separates the various components, and an ARM complex reboot should have no direct effect on the host OS. Similarly, some vendors implement ISSU processes for certain FW updates where no disruption is observed.
Ack - it depends on what you are upgrading/changing. But why is ISSU kept under the xPU crash/reset bullet?