@hexinw-nvidia
Contributor

Add --ft-non-retryable-exception-file to mark a node unhealthy when its workers fail with exceptions that match configured patterns (e.g., configuration errors). This avoids restarting workers for errors that a retry cannot fix.

Implementation:

  • Workers write full tracebacks via sys.excepthook to error files
  • Launcher checks error files against configured patterns on worker failure
  • Nodes with matching exceptions increment unhealthy_count and exit
  • Rendezvous uses unhealthy_count to decide whether the job can continue
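
The first two steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names `install_error_file_hook` and `is_non_retryable` are hypothetical, and plain substring matching is an assumption (the PR text does not say whether patterns are substrings or regular expressions).

```python
import sys
import traceback
from pathlib import Path
from typing import List


def install_error_file_hook(error_file: Path) -> None:
    """Worker side (hypothetical helper): dump uncaught tracebacks to error_file."""

    def hook(exc_type, exc_value, exc_tb):
        # Format the full traceback, not just the exception message.
        text = "".join(traceback.format_exception(exc_type, exc_value, exc_tb))
        error_file.write_text(text)
        # Keep the default behaviour of printing the traceback to stderr.
        sys.__excepthook__(exc_type, exc_value, exc_tb)

    sys.excepthook = hook


def is_non_retryable(error_file: Path, patterns: List[str]) -> bool:
    """Launcher side (hypothetical helper): does the error file match any pattern?

    Assumes substring matching; a missing error file counts as retryable.
    """
    if not error_file.exists():
        return False
    text = error_file.read_text()
    return any(pattern in text for pattern in patterns)
```

In this sketch the launcher would call `is_non_retryable` after a worker exits with a failure, and treat a `True` result as the signal to mark the node unhealthy instead of restarting the worker.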

Example: Configure a pattern such as "insufficient shared memory (shm)" to stop retries after a configuration error.
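
As an illustration of how such a pattern might be applied, the sketch below loads patterns from a file and checks a traceback against them. The file format (one pattern per line, with blank lines and `#` comments ignored) and the names `load_patterns` and `first_matching_pattern` are assumptions made for this example; the PR description does not specify them. Literal substring matching is used here so that the parentheses in the example pattern are not interpreted as regex syntax.

```python
from pathlib import Path
from typing import List, Optional


def load_patterns(pattern_file: Path) -> List[str]:
    """Read patterns from a file, one per line (assumed format)."""
    patterns = []
    for line in pattern_file.read_text().splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns


def first_matching_pattern(traceback_text: str, patterns: List[str]) -> Optional[str]:
    """Return the first pattern contained in the traceback text, or None."""
    for pattern in patterns:
        if pattern in traceback_text:
            return pattern
    return None
```

Returning the matched pattern rather than a boolean makes it easy to log which rule caused a node to be marked unhealthy.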
