Troubleshooting guide #435

Open
6 tasks
lentzi90 opened this issue May 24, 2024 · 4 comments
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. triage/accepted Indicates an issue is ready to be actively worked on.

Comments

@lentzi90
Member

lentzi90 commented May 24, 2024

We should add a troubleshooting guide with the most common issues and solutions. Since we integrate with Cluster API, it also makes sense to link to their troubleshooting page, and if possible give some guidance on when an issue is with CAPI and when it is with Metal3.

Things to include

BMO/Ironic/IPA:

  • How to verify that BMO and Ironic are operational, to rule out configuration errors (see "Troubleshooting: Verify BMO and Ironic healthy", #490)
    • Containers should be running and not restarting. Repeated restarts could indicate that BMO cannot connect to Ironic.
    • Ironic should not be "waiting for IP". Check Ironic logs!
  • What do inspection errors look like, and what are the likely causes and solutions?
    • BMC credentials could be wrong or missing. These issues should show up in the BareMetalHost status or as events (to be confirmed)
    • The host is not able to communicate back results to Ironic (this will result in a timeout). Access to serial logs is needed to determine the exact issue in these cases.
    • Incompatible configuration, for example an attempt to use virtual media or UEFI on hardware that does not support it. This should show up in the BareMetalHost status or as events (to be confirmed).
  • Provisioning errors
    • Image errors (wrong checksum, missing image, image too large to decompress or too large for disk).
      This should show up in the BMH status (to be confirmed)
    • No root device found
      This should show up in the BMH status (to be confirmed)
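As a rough illustration of the checks above, here is a minimal Python sketch that works on the JSON printed by `kubectl get pods -o json` and `kubectl get baremetalhosts -o json`. The function names are made up for this example. It relies on the standard `containerStatuses` fields of a Pod and on the `status.errorType` / `status.errorMessage` fields of a BareMetalHost. Whether every inspection or provisioning failure listed above actually surfaces in those BMH fields is still to be confirmed, so treat the second helper as a starting point only.

```python
from typing import Iterator


def unhealthy_containers(pod_list: dict, max_restarts: int = 0) -> Iterator[tuple[str, str, str]]:
    """Yield (pod, container, reason) for containers that are not running
    or that have restarted more than max_restarts times."""
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        for cs in pod.get("status", {}).get("containerStatuses", []):
            state = cs.get("state", {})
            if "running" not in state:
                # A container state holds one of: running, waiting, terminated.
                kind = next(iter(state), "unknown")
                reason = state.get(kind, {}).get("reason", kind)
                yield pod_name, cs["name"], f"not running: {reason}"
            elif cs.get("restartCount", 0) > max_restarts:
                yield pod_name, cs["name"], f"restarted {cs['restartCount']} times"


def host_errors(bmh_list: dict) -> Iterator[tuple[str, str, str]]:
    """Yield (host, errorType, errorMessage) for BareMetalHosts that report an error.

    Assumption: the failure is reflected in status.errorMessage (to be confirmed
    for each error case discussed in this issue)."""
    for bmh in bmh_list.get("items", []):
        status = bmh.get("status", {})
        if status.get("errorMessage"):
            yield bmh["metadata"]["name"], status.get("errorType", ""), status["errorMessage"]
```

To use it, load the output of `kubectl get pods -n <namespace> -o json` (with whatever namespace BMO and Ironic are deployed in) using `json.loads` and pass the result to `unhealthy_containers`. An empty result only rules out crashing containers; the Ironic logs still need to be checked for the "waiting for IP" case.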

CAPM3/IPAM:

  • No BareMetalHost available/matching. This should show as an event when describing the Metal3Machine (to be confirmed)
  • Provider ID missing. This can happen if noCloudProvider is set to false on the Metal3Cluster when no external cloud provider is used.
  • nodeRef missing. This is really a CAPI-level issue. It could be caused by the image failing to boot or by the node failing to join the cluster. Access to the node or to serial logs is needed to determine the cause. The cloud-final logs are of particular interest.
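The provider ID and nodeRef cases could be narrowed down with something like the following Python sketch, which works on JSON from `kubectl get nodes -o json` (workload cluster), `kubectl get machines -o json` and a single Metal3Cluster object (management cluster). The helper names are hypothetical. The sketch treats an unset `spec.noCloudProvider` as false, which is an assumption, and the hint it prints only restates the cause described above.

```python
from typing import Optional


def nodes_without_provider_id(node_list: dict) -> list[str]:
    """Names of Nodes whose spec.providerID is empty or unset."""
    return [
        node["metadata"]["name"]
        for node in node_list.get("items", [])
        if not node.get("spec", {}).get("providerID")
    ]


def machines_without_node_ref(machine_list: dict) -> list[str]:
    """Names of CAPI Machines whose status.nodeRef is not set yet."""
    return [
        machine["metadata"]["name"]
        for machine in machine_list.get("items", [])
        if not machine.get("status", {}).get("nodeRef")
    ]


def provider_id_hint(metal3_cluster: dict, node_list: dict) -> Optional[str]:
    """Return a short hint when Nodes lack a provider ID, or None if all have one."""
    missing = nodes_without_provider_id(node_list)
    if not missing:
        return None
    # Assumption: an unset noCloudProvider behaves like false.
    no_cloud_provider = metal3_cluster.get("spec", {}).get("noCloudProvider", False)
    if not no_cloud_provider:
        return (
            f"{len(missing)} node(s) lack a provider ID and noCloudProvider is false; "
            "check that an external cloud provider is actually deployed"
        )
    return f"{len(missing)} node(s) lack a provider ID although noCloudProvider is true"
```

A Machine that shows up in `machines_without_node_ref` long after provisioning finished is the case where serial or cloud-final logs from the node are needed. The script cannot tell a failed boot from a failed cluster join.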
@metal3-io-bot metal3-io-bot added the needs-triage Indicates an issue lacks a `triage/foo` label and requires one. label May 24, 2024
@lentzi90
Member Author

/help
/triage accepted

@metal3-io-bot
Contributor

@lentzi90:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help
/triage accepted

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@metal3-io-bot metal3-io-bot added triage/accepted Indicates an issue is ready to be actively worked on. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. and removed needs-triage Indicates an issue lacks a `triage/foo` label and requires one. labels May 24, 2024
@Rozzii Rozzii moved this to Backlog in Metal3 - Roadmap Jun 28, 2024
@metal3-io-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 22, 2024
@dtantsur
Member

/remove-lifecycle stale
/lifecycle frozen

@metal3-io-bot metal3-io-bot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 22, 2024