Expand BOOTSEQ.md #193
base: main
Conversation
Signed-off-by: tedstreete <[email protected]>
Signed-off-by: Ted Streete <[email protected]>
Signed-off-by: Ted Streete <[email protected]>
Signed-off-by: Ted Streete <[email protected]>
@tedstreete - good description of the challenges and possible alternatives.
- There are different kinds of resets. Generally, xPU core complex reset and xPU PCIe MAC reset should be separated for minimally disruptive ISSU etc., but a complete SoC reset of the xPU will cause PCIe errors and may result in a controlled or uncontrolled host reset.
- See _Host Reset or Crash_ and _xPU Reset or Crash_ sections for specifics.
- There are use cases for network boot only, where the host is used only to provide power and cooling to one or more xPUs.
Can we call this out explicitly as the bump-in-the-wire xPU use case?
Is it only the "bump in the wire" case? It may be a bare-metal hosting case, where the host is completely isolated from the xPU (meaning, to the level a PCIe card can be isolated from the host - power and thermals).
- There are different kinds of resets. Generally, xPU core complex reset and xPU PCIe MAC reset should be separated for minimally disruptive ISSU etc., but a complete SoC reset of the xPU will cause PCIe errors and needs a reboot of the server.
- There are use cases for network boot only.
- Resetting the xPU is assumed to cause PCIe errors that may impact the operation of the host OS.
- There are different kinds of resets. Generally, xPU core complex reset and xPU PCIe MAC reset should be separated for minimally disruptive ISSU etc., but a complete SoC reset of the xPU will cause PCIe errors and may result in a controlled or uncontrolled host reset.
I understand this is not text you introduced; it existed previously. However, while we are making changes to this file, we could also change this statement to refer to xPU architectures that implement a PCIe function segregated from the ARM subsystem; e.g. if the PCIe layer is programmable and is controlled by the ARM SoC, then this may not be applicable. Again, my point is only to clarify the statement, not to remove it altogether.
Actually the challenge occurs when the ARM cores control the PCIe layer. During the time it takes for the PCIe layer to initialize, the xPU is likely to cause PCIe errors for PCIe Config Cycles.
The "complete SoC reset" statement is the main line here, IMHO. I would be inclined to assume that no matter what, cold/warm full xPU reset domain incorporates both PCIe subsystem and Arm complex, no matter if PCIe subsystem is completely independent or tied to Arm complex, as @jainvipin distinguishes. As a result, PCIe errors caused by that reset may result in host reset, controllable or not.
Downstream from there, we could either state an OPI preference or advisory that xPU design might/should draw separate reset domains for PCIe subsystem and Arm complex, because of reasons. However, it might be risky, considering the possible relationship between Arm complex and PCIe subsystem. To add more flavor, in Intel IPUs, IMC complex programs PCIe subsystem and handles resets coming from the host (e.g. PERST#); ACC complex is not involved.
> Actually the challenge occurs when the ARM cores control the PCIe layer. During the time it takes for the PCIe layer to initialize, the xPU is likely to cause PCIe errors for PCIe Config Cycles.

Agree - if the ARM cores are involved in controlling the PCIe layer, then this would have an impact. And if the programmability of the PCIe layer is controlled by the ARM complex, then that too would indirectly involve the ARM complex.
- These events may not be synchronous. The xPU may be installed in a PCIe slot providing persistent power. xPU power cycle may be a function of the host BMC (and potentially the xPU BMC) rather than the power state of the host.
- Is the Host CPU halted?
- Can we wait in UEFI/BIOS in case of network boot via xPU?
- Host OS continues to its boot. The xPU needs to either respond to PCIe transactions or respond with retry config cycles (CRS) till it is ready. This automatically holds the host BIOS.
Retry cycles (CRS) have limits; we wouldn't be able to hold this long enough for a general-purpose OS to boot on the ARM subsystem. If you wonder why the OS needs to boot at all as long as PCIe functions can be served, I'd point to systems that have to download programs to assume a certain PCIe function, which can only happen after OS boot. For example, on the AMD Pensando device, the decision to present a VirtIO device versus a native Pensando/AMD device is a configurable/late-binding decision and can't be made before OS boot. Let's also include the BIOS option where the server BIOS gets a nod that the xPU reset is complete before taking the server CPU out of reset.
Agreed. CRS is not practical. Some xPUs may take as long as 5 minutes after power is applied to respond to PCIe Config Cycles. The x86 CPU will time out and crash in tens of milliseconds. There is another PCIe retry mechanism where the target interrupts the x86 host when it is ready, but the BIOS is not ready to accept interrupts that early in the boot process.
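For illustration only, a minimal host-side sketch (Linux sysfs, hypothetical BDF and timings) of what "waiting for the xPU to answer config cycles" looks like in practice; it assumes the function was already enumerated and does not solve the BIOS-timeout problem described above:

```python
# Sketch: poll a PCIe function's Vendor ID via Linux sysfs until the device
# answers config cycles. Assumes the function was already enumerated (the
# sysfs node exists); the BDF and timings are hypothetical.
import struct
import time

XPU_BDF = "0000:3b:00.0"                           # hypothetical xPU address
CONFIG = f"/sys/bus/pci/devices/{XPU_BDF}/config"

def vendor_id() -> int:
    with open(CONFIG, "rb") as f:
        return struct.unpack("<H", f.read(2))[0]   # bytes 0-1 = Vendor ID

deadline = time.monotonic() + 300                  # some xPUs may need minutes
while time.monotonic() < deadline:
    vid = vendor_id()
    if vid not in (0xFFFF, 0x0001):                # all-ones / CRS-SV dummy value
        print(f"xPU answering config cycles, Vendor ID 0x{vid:04x}")
        break
    time.sleep(1)
else:
    print("xPU never became ready within the timeout")
```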
@RezaBacchus unless we change the BIOS to allow longer CRS timeouts - this is what we did at DELL... will that work?
In some implementations, while the xPU is initializing the PCIe manager, the xPU ignores config cycles, leading the BIOS to infer that the card is absent. In other cases, the xPU claims the config cycle but fails to respond, causing a catastrophic reset due to a Table of Requests (TOR) timeout in the x86 CPU. CRS assumes that the PCIe manager on the xPU is stable enough to avoid these two cases.
The same behavior is observed when the ARM cores lock up during runtime. A PF0 HW PCIe register that is independent of the SW running on the ARM is a solution to both the boot-up and runtime issues. This, however, will require the PCI-SIG to standardize the appropriate capability and status registers, and will take time to implement in silicon.
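To make the PF0 idea concrete, here is a purely hypothetical sketch: it walks the PCIe extended capability list (header layout per the PCIe spec) looking for a Vendor-Specific Extended Capability (ID 0x000B) and reads an invented "boot status" dword from it. No such register is standardized today, which is exactly the gap the comment above points out.

```python
# Hypothetical sketch: walk PF0's PCIe extended capability list looking for a
# Vendor-Specific Extended Capability (ID 0x000B) and read an invented
# "boot status" dword from it. The header layout (16-bit ID, 4-bit version,
# 12-bit next offset, list starts at 0x100) follows the PCIe spec; the
# boot-status register and its offset are made up for illustration.
import struct

def read_dword(cfg: bytes, off: int) -> int:
    return struct.unpack_from("<I", cfg, off)[0]

def find_ext_cap(cfg: bytes, wanted_id: int):
    off = 0x100
    while off and off + 4 <= len(cfg):
        hdr = read_dword(cfg, off)
        cap_id, nxt = hdr & 0xFFFF, (hdr >> 20) & 0xFFF
        if cap_id == wanted_id:
            return off
        off = nxt
    return None

# Root is typically required to read the full 4 KiB config space via sysfs.
with open("/sys/bus/pci/devices/0000:3b:00.0/config", "rb") as f:  # hypothetical BDF
    cfg = f.read(4096)

vsec = find_ext_cap(cfg, 0x000B)                   # 0x000B = VSEC capability ID
if vsec is not None:
    boot_status = read_dword(cfg, vsec + 0x0C)     # invented register offset
    print(f"hypothetical xPU boot status: 0x{boot_status:08x}")
```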
@RezaBacchus This would assume that the PCIe subsystem is independent of the Arm complex - at least to the point where it can set up config space on its own. This is not the case for the Intel IPU, as the IMC configures the PCIe subsystem (including config space), including releasing the PCIe PHY from reset and enabling the link training SSM.
Considering how implementations of the PCIe subsystem may vary across xPU vendors, and given the CRS reliability concerns, CRS might not be the way to go.
Good with options that do any coordination with BIOS/BMC - going in interdependent lock-step will ensure the dependencies are cleared.
- Is the Host CPU halted?
- Can we wait in UEFI/BIOS in case of network boot via xPU?
- Host OS continues to its boot. The xPU needs to either respond to PCIe transactions or respond with retry config cycles (CRS) till it is ready. This automatically holds the host BIOS.
- The xPU may hold its PCIe channel down during xPU boot. When the xPU has finished booting, it generates a hot-plug event, and the host will insert the xPU into the PCIe root complex. Is this true for all host OSes?
PCIe hot-plug is not a reliable way to do this across various OSes. The support exists in code, but no one ever uses it. The challenge is that without the NIC booting, sometimes the host can't boot either (say, if it were to obtain its image over the network, i.e. through the same xPU that is not yet present).
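For reference, the host-side knobs involved are plain sysfs writes. A minimal sketch (Linux only, root required, hypothetical BDF and delay) of driving the detach/rescan explicitly rather than relying on native hot-plug:

```python
# Sketch: detach the xPU before it reboots and rescan afterwards, instead of
# depending on native PCIe hot-plug. Linux only, root required; the BDF and
# the wait time are hypothetical.
from pathlib import Path
import time

XPU_BDF = "0000:3b:00.0"                       # hypothetical xPU address

def detach(bdf: str) -> None:
    Path(f"/sys/bus/pci/devices/{bdf}/remove").write_text("1\n")

def rescan() -> None:
    Path("/sys/bus/pci/rescan").write_text("1\n")

detach(XPU_BDF)                                # before the xPU resets
time.sleep(120)                                # illustrative: wait for the xPU to boot
rescan()                                       # re-enumerate; drivers rebind if present
```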
Network boot through the xPU is a requirement for cloud native architectures where the tenant may change frequently. There is a workaround today that requires a performance NIC to provide that capability. Operators are seeking ways to eliminate the additional NIC to reduce opex and carbon footprint.
Yes, network boot is a need (agree), as is not having to add another NIC just for this purpose. It is the hot plug that I was wondering whether we need in order to achieve this.
- Who tells the host OS/xPU OS to wait, and how?
- PCIe has a mechanism for config retry cycles (CRS) that can be used
- Host BIOS may need to be modified to allow longer CRS response time before timing out
I'd prefer this mechanism - it is foolproof, guaranteed to work with all xPUs, and doesn't leave any room for error. CRS could be hit-or-miss.
@jainvipin - it is not clear which mechanism you are referring to.
The server BMC could poll the xPU via I2C or NC-SI for its boot status. To avoid the BMC having to know unique HW locations on xPUs from each curated vendor, the polling mechanism must be abstracted on the xPU - for example, Redfish or PLDM. Again, these functions must be operational on the xPU within seconds after power is applied to the xPU.
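As a rough sketch of that kind of abstracted polling, the snippet below has a management agent poll a Redfish resource on the xPU for boot progress. The endpoint URL, credentials, and the use of the BootProgress property are assumptions for illustration, not a defined OPI or DMTF mapping for xPUs:

```python
# Sketch: poll a Redfish resource on the xPU's management endpoint until it
# reports the OS is up. The URL, credentials and BootProgress usage are
# illustrative assumptions, not a standardized xPU mapping.
import time
import requests

SYSTEM_URL = "https://xpu-bmc.example.org/redfish/v1/Systems/1"  # hypothetical
AUTH = ("admin", "password")                   # placeholder credentials

deadline = time.monotonic() + 300
while time.monotonic() < deadline:
    try:
        r = requests.get(SYSTEM_URL, auth=AUTH, verify=False, timeout=5)
        r.raise_for_status()
        state = r.json().get("BootProgress", {}).get("LastState")
        if state == "OSRunning":
            print("xPU reports boot complete")
            break
    except requests.RequestException:
        pass                                   # the management stack may itself still be booting
    time.sleep(5)
else:
    print("xPU did not report ready in time")
```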
Sorry, at a high level I meant host BIOS modification to coordinate this. I agree with the polling mechanism you mention, and that it should happen soon after power is applied to the xPU.
- During host OS reboot or crashes, the host writes to the Chipset Reset Control Register (i.e., IO port 0xCF9) or to the ACPI FADT
- The Platform (CPLD) firmware can monitor for these events and trigger the host BMC/BIOS to coordinate with the xPU to take appropriate action. Potential actions include:
  - Ensuring that the xPU is gracefully shut down before a full host reboot
  - Force resetting the xPU after a timeout if the xPU has failed to gracefully shut down
This is an interesting scenario. Should we assume that the xPU will never require a graceful shutdown? Or, if a general-purpose OS is booting on the xPU, is it desired that the next reboot boots a proper image? What about disk/flash-storage inconsistencies in such conditions?
Purposely rebooting the server requires a graceful shutdown with the appropriate security audit logs.
A controlled shutdown may be initiated during some error conditions that initiate a crashdump.
A forced shutdown may be initiated during some error (lockup, thermal) or timeout conditions.
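A minimal sketch of the graceful-then-forced ordering described in the quoted bullets above, with the signalling calls left as hypothetical placeholders (PLDM, Redfish, a doorbell register, etc.) and the forced path using the kernel's per-device reset attribute where the device supports it:

```python
# Sketch: ask the xPU to shut down gracefully, wait a bounded time, then
# force a reset. The signalling helpers are hypothetical placeholders; the
# forced path uses the kernel's per-device reset attribute (FLR/SBR chosen
# by the kernel) where the device supports it.
from pathlib import Path
import time

XPU_BDF = "0000:3b:00.0"                       # hypothetical xPU address
GRACE_PERIOD_S = 60                            # illustrative timeout

def request_xpu_shutdown() -> None:
    """Placeholder: in-band doorbell, PLDM, Redfish, etc."""

def xpu_is_down() -> bool:
    """Placeholder: poll the same channel for shutdown completion."""
    return False

request_xpu_shutdown()
deadline = time.monotonic() + GRACE_PERIOD_S
while time.monotonic() < deadline:
    if xpu_is_down():
        break
    time.sleep(2)
else:
    # Graceful shutdown timed out: force-reset the xPU before the host reboot proceeds.
    Path(f"/sys/bus/pci/devices/{XPU_BDF}/reset").write_text("1\n")
```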
Sure, we can provide better logs and reset reasons during planned vs. unplanned shutdowns.
xPU OS/FW crash and/or independent reboot will result in a Downstream Port Containment (DPC) event on the PCIe bus.
- The host (BMC/BIOS) firmware and host OS need to implement the DPC and hot-plug requirements defined in PCIe Firmware spec 3.3 for these scenarios to work.
Ack. This also requires some basic mechanism for BMC/BIOS to learn that an xPU exists in the enumeration of devices on the server.
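One simplistic way a host-side agent could surface that, sketched below: scan the enumerated PCI functions in sysfs and flag those whose vendor ID matches a curated list. The ID list here is only illustrative, and a real flow would more likely publish the inventory over Redfish or PLDM:

```python
# Sketch: report which enumerated PCI functions look like xPUs so BMC/BIOS
# inventory can learn one exists. The vendor ID list is illustrative and
# would need to be curated; a real flow would publish this over Redfish/PLDM.
from pathlib import Path

XPU_VENDOR_IDS = {"0x1dd8", "0x15b3"}          # illustrative examples only

for dev in sorted(Path("/sys/bus/pci/devices").iterdir()):
    vendor = (dev / "vendor").read_text().strip()
    if vendor in XPU_VENDOR_IDS:
        device = (dev / "device").read_text().strip()
        print(f"possible xPU at {dev.name}: vendor={vendor} device={device}")
```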
## Error handling
## xPU OS Installation mode
- An xPU may go through multiple reboots during xPU OS installation. Consequently, the host OS may choose to bring the xPU and host PCIe link down prior to xPU OS install, and to hold the link down until the install process has completed.
This essentially means hot plug is the only way to do this? I.e. the host OS is booted, and the PCIe link is brought down and then up?
What about Downstream Port Containment (DPC)?
- xPU OS/FW reboot assumed to cause Host OS crash
- Not in all cases. E.g. NVIDIA's design separates the various components, and an ARM complex reboot should have no direct effect on the host OS. Similarly, some vendors implement ISSU processes for certain FW updates where no disruption is observed.
- Coordinated shutdown
- The xPU can provide in-band or out-of-band signalling to the host OS that it intends to restart. The host OS can respond by either
Out-of-band will be tricky to implement and will further increase the dependencies on other infrastructure to communicate this. OTOH, if we require the xPU to always reboot when the host reboots, then this can be solved by an implicit understanding that a reboot of the server is required when the xPU needs it (the coordinated shutdown case).
Simultaneously rebooting both x86 and xPU seems reasonable, but let us make sure this covers all use cases.
Maybe I misunderstood out-of-band; I thought it was coordination using some software stack over an IP network, whereas what you meant is PLDM sensors, I2C, etc. (i.e. non-PCIe).
- Elect to contain the event, removing the xPU from the bus table. When the xPU restarts, it will generate a hot-plug event on the PCIe bus, allowing the host BIOS/OS to reinsert the xPU.
- Note: This behavior is currently a largely untested feature in Linux host OS, and is not available on Windows.
- _Further discussion required:_ NVIDIA's design separates the various components, and an ARM complex reboot should have no direct effect on the host OS. Similarly, some vendors implement ISSU processes for certain FW updates where no disruption is observed.
Ack - it depends on what you are upgrading/changing. But why is ISSU kept under the xPU crash/reset bullet?