@mvanhorn mvanhorn commented Feb 9, 2026

Summary

  • Adds loopFailureSignatures tracking to runLoop() to detect and abort infinite cycles caused by repeated deterministic failures (e.g., expired provider auth tokens)
  • Reuses the existing loop_restart_signature_limit graph attribute (default 3) as the threshold
  • Resets the counter on any successful stage, so legitimate single-hop fail routing to recovery nodes is unaffected

Fixes #1

Details

The engine already had failure signature tracking for loop_restart edges (restartFailureSignatures), but the main runLoop had no equivalent protection. When every stage fails with failure_class=deterministic, conditional edge routing (e.g., check -> implement [condition="outcome=fail"]) keeps cycling forever.

This mirrors the existing loopRestart circuit breaker but applies it to the main loop.

Test plan

  • TestRun_DeterministicFailureCycle_AbortsInfiniteLoop — graph with a conditional retry cycle where all nodes exit 1; verifies the run aborts after 3 repeated signatures
  • TestRun_DeterministicFailure_SingleRouteToRecovery_StillWorks — single deterministic failure routes to a recovery node and completes successfully (no false positive)
  • Existing engine tests: 56 of 57 pass; the one failure is the pre-existing flaky TestWaitWithIdleWatchdog_ContextCancelKillsProcessGroup, unrelated to this change

🤖 Generated with Claude Code

…main loop

The engine already had failure signature tracking for loop_restart edges,
but the main runLoop had no equivalent protection. When a provider auth
token expires mid-run, every stage fails with failure_class=deterministic.
The per-node retry system (executeWithRetry) correctly blocks same-node
retries, but conditional edge routing (e.g., check -> implement on
outcome=fail) still follows fail edges, creating an infinite cycle of
1-2 second failures that never terminates.

This adds loopFailureSignatures tracking to runLoop. After each stage,
if the outcome is a deterministic failure, the failure signature is
recorded. If the same signature repeats N times (default 3, configurable
via loop_restart_signature_limit), the run aborts with a clear error.
The counter resets on any successful stage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@danshapiro danshapiro left a comment

Thorough, well-tested fix that follows established codebase patterns. Clean implementation of the circuit breaker for deterministic failure cycles. Merging now with two optional follow-up items to handle in a subsequent commit.

@danshapiro danshapiro merged commit 6215408 into danshapiro:main Feb 9, 2026
danshapiro pushed a commit that referenced this pull request Feb 9, 2026
… correctness

The circuit breaker for deterministic failure cycles (PR #2) tracked
failure signatures in-memory but did not persist them to checkpoint.json.
This meant a resumed run would restart the counter at zero, potentially
allowing 2×limit failures before the breaker tripped.

Changes:
- checkpoint() now serializes loopFailureSignatures to cp.Extra["loop_failure_signatures"]
  alongside the existing restart_failure_signatures
- resume.go restores loopFailureSignatures via restoreLoopFailureSignatures(),
  mirroring the existing restoreRestartFailureSignatures() pattern
- Added explanatory comment in loopRestart() documenting the intentional
  non-reset of loopFailureSignatures across loop restarts (if the same
  deterministic failure persists after a restart, the counter should keep
  accumulating)
- Added tests for round-trip serialization including populated maps,
  nil checkpoints, missing keys, and empty-key filtering

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

engine: infinite loop when provider auth expires mid-run (no cycle breaker for deterministic failures)
