Skip to content

🤖 fix: reuse cached skill diagnostics on transient discovery failures#2417

Open
ammar-agent wants to merge 3 commits intomainfrom
skill-loader-9ph2
Open

🤖 fix: reuse cached skill diagnostics on transient discovery failures#2417
ammar-agent wants to merge 3 commits intomainfrom
skill-loader-9ph2

Conversation

@ammar-agent
Copy link
Collaborator

Summary

Handle transient skill discovery transport failures without surfacing false SKILL.md is missing or unreadable diagnostics in SSH workspaces.

Background

Skill discovery currently treats all stat/read failures as deterministic file problems. On flaky SSH links this produced misleading invalid-skill diagnostics for valid skills and caused agentSkills.list/agentSkills.listDiagnostics to diverge from expected behavior.

Implementation

  • Added AgentSkillTransientDiscoveryError and transient error pattern classification in agentSkillsService.
  • Kept deterministic file/schema problems on the existing invalid-skill diagnostics path.
  • Added src/node/orpc/agentSkillsDiagnosticsCache.ts with:
    • cache key builder for discovery context
    • stale-on-transient fallback loader for diagnostics
  • Updated router.ts so both agentSkills.list and agentSkills.listDiagnostics share the fallback-backed diagnostics loader.
  • Added tests for:
    • transient stat/read error propagation in service diagnostics discovery
    • cache fallback on transient failures
    • no fallback for non-transient failures
    • transient failure rethrow when cache is empty

Validation

  • make static-check
  • bun test src/node/services/agentSkills/agentSkillsService.test.ts src/node/orpc/agentSkillsDiagnosticsCache.test.ts

Risks

Low to medium. The transient classifier is message-pattern based, so misclassification is possible if transport/runtime error strings change. The fallback is scoped per discovery context and only applies to explicit transient error type, minimizing risk of masking deterministic skill issues.


Generated with mux • Model: openai:gpt-5.3-codex • Thinking: xhigh • Cost: $0.00

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 90bbfce452

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@ammar-agent
Copy link
Collaborator Author

@codex review

Addressed the transient discovery regression by keeping non-diagnostic skill discovery resilient to per-skill transient I/O failures while retaining diagnostics-path transient rethrow behavior.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 565cf01152

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Treat transient SKILL.md stat/read failures as transport errors instead of invalid
skill diagnostics, then reuse the last good diagnostics snapshot for list/listDiagnostics
when discovery fails transiently.

- add AgentSkillTransientDiscoveryError classification in agentSkillsService
- add ORPC diagnostics cache + transient fallback helper
- route both agentSkills.list and listDiagnostics through shared diagnostics loader
- add service and cache tests for transient and non-transient failure paths

---

_Generated with `mux` • Model: `openai:gpt-5.3-codex` • Thinking: `xhigh` • Cost: `$0.00`_

<!-- mux-attribution: model=openai:gpt-5.3-codex thinking=xhigh costs=0.00 -->
@ammar-agent
Copy link
Collaborator Author

@codex review

Applied the requested fix:

  • Rebased on latest main.
  • Switched agentSkills.list back to discoverAgentSkills(...) (non-diagnostic resilient path).
  • Kept diagnostics cache fallback only on agentSkills.listDiagnostics.

Local validation:

  • make static-check

@chatgpt-codex-connector
Copy link

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant

Comments