feat: add non-retryable exception pattern matching #212

hexinw-nvidia · 2025-10-28T19:51:04Z

Add --ft-non-retryable-exception-file to mark nodes unhealthy when workers fail with specific exception patterns (e.g., config errors). This prevents retrying on errors that won't be fixed by retrying.

Implementation:

- Workers write full tracebacks via sys.excepthook to error files
- Launcher checks error files against configured patterns on worker failure
- Nodes with matching exceptions increment unhealthy_count and exit
- Rendezvous uses unhealthy_count to decide if job can continue
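
The first two steps can be sketched roughly as follows. This is an illustration only, not the code in this PR: the helper names `install_traceback_hook` and `has_non_retryable_error` are hypothetical, and plain substring matching is an assumption about how patterns are compared.

```python
import sys
import traceback
from pathlib import Path


def install_traceback_hook(error_file: Path) -> None:
    """Worker side (hypothetical helper): persist the full traceback of
    any uncaught exception to error_file."""

    def hook(exc_type, exc_value, exc_tb):
        text = "".join(traceback.format_exception(exc_type, exc_value, exc_tb))
        error_file.write_text(text)
        # Still report the exception on stderr as Python normally would.
        sys.__excepthook__(exc_type, exc_value, exc_tb)

    sys.excepthook = hook


def has_non_retryable_error(error_file: Path, patterns: list[str]) -> bool:
    """Launcher side (hypothetical helper): return True if the worker's
    error file contains any configured pattern.

    Assumption: patterns are compared as plain substrings.
    """
    if not error_file.exists():
        return False
    text = error_file.read_text()
    return any(pattern in text for pattern in patterns)
```

In this sketch the launcher would call `has_non_retryable_error` after a worker exits with a failure, and treat a `True` result as the signal to mark the node unhealthy instead of restarting the worker.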

Example: Configure patterns like "insufficient shared memory (shm)" to stop retrying on configuration errors.
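
A minimal sketch of how such a pattern file might be read and applied. The file layout here (one pattern per line, blank lines and `#` comments ignored) is an assumption, as are the helper names `load_patterns` and `matches_any`; the actual format accepted by `--ft-non-retryable-exception-file` may differ.

```python
import re
from pathlib import Path


def load_patterns(path: Path) -> list[str]:
    """Read patterns from a file (assumed format: one pattern per line,
    blank lines and lines starting with '#' are skipped)."""
    patterns = []
    for line in path.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns


def matches_any(traceback_text: str, patterns: list[str]) -> bool:
    """Return True if any pattern occurs literally in the traceback text."""
    # re.escape makes "(shm)" match literally instead of acting as a
    # regex capture group that would silently match "shm" without parens.
    return any(re.search(re.escape(p), traceback_text) for p in patterns)
```

The example pattern contains parentheses, so if patterns were treated as regular expressions without escaping, "(shm)" would be parsed as a group and the literal parentheses in the message would not match. Escaping the pattern (or using plain substring matching) avoids that surprise.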


hexinw-nvidia requested review from anjalibshah, apaithankar, namitdhameja, rhewett-nv and sbak5 October 28, 2025 19:51

hexinw-nvidia force-pushed the stop_retry branch from 3107ef8 to c6dd9d1 Compare October 28, 2025 19:53
