Add RunLedger replay gate for agent regressions #24
ZackMitchell910 wants to merge 2 commits into cryxnet:main
Thanks for taking a look! This PR adds a replay-only RunLedger gate. The workflow run is currently waiting on fork approval (action_required). If you are open to it, please approve/authorize the workflow run so CI can complete. Happy to adjust anything.
Hi @ZackMitchell910,
Pull request overview
This PR adds a RunLedger-based CI gate for detecting regressions in tool-using agent behavior. It introduces a deterministic, replay-only evaluation suite that uses pre-recorded tool interactions (cassettes) to validate agent outputs against JSON schemas and baseline metrics without making external calls.
Key changes:
- Adds a complete RunLedger evaluation suite with configuration, test case, and replay cassettes
- Implements a GitHub Actions workflow that runs on pull requests to enforce regression gates
- Includes baseline metrics file for pass rate, tool call budgets, and timing thresholds
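To make the replay-only idea concrete, here is a minimal sketch of how a cassette lookup could work. The JSONL field names (`tool`, `args`, `result`), the `ReplayCassette` class, and the sample entry are assumptions for illustration only; they are not taken from RunLedger's actual cassette format or API.

```python
import json


class CassetteMiss(Exception):
    """Raised when the agent makes a tool call that was never recorded."""


class ReplayCassette:
    """Serves pre-recorded tool results instead of calling real tools."""

    def __init__(self, jsonl_text):
        self._entries = {}
        for line in jsonl_text.splitlines():
            if not line.strip():
                continue  # skip blank lines in the JSONL text
            entry = json.loads(line)
            key = self._key(entry["tool"], entry["args"])
            self._entries[key] = entry["result"]

    @staticmethod
    def _key(tool, args):
        # Canonical JSON so that argument order does not affect matching.
        return tool + ":" + json.dumps(args, sort_keys=True)

    def replay(self, tool, args):
        key = self._key(tool, args)
        if key not in self._entries:
            # Failing loudly keeps the run deterministic: no fallback to a live call.
            raise CassetteMiss(f"no recorded result for {tool} with args {args}")
        return self._entries[key]


# Hypothetical one-line cassette for the password reset case.
SAMPLE = (
    '{"tool": "search_docs", "args": {"q": "reset password"}, '
    '"result": {"snippet": "Use the reset link."}}'
)

cassette = ReplayCassette(SAMPLE)
print(cassette.replay("search_docs", {"q": "reset password"}))
```

Raising on an unrecorded call, rather than falling back to the real tool, is what keeps a gate of this kind free of external calls.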
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| .github/workflows/runledger.yml | GitHub Actions workflow that runs RunLedger evals on PRs with artifact uploads |
| evals/runledger/suite.yaml | Suite configuration defining agent command, budgets, assertions, and regression thresholds |
| evals/runledger/schema.json | JSON schema for validating agent output structure (category and reply fields) |
| evals/runledger/agent/agent.py | Stub agent implementation that handles task input and produces formatted output |
| evals/runledger/cases/t1.yaml | Test case definition for triaging a password reset ticket |
| evals/runledger/cassettes/t1.jsonl | Pre-recorded tool response for replay mode (search_docs result) |
| baselines/runledger-demo.json | Baseline metrics snapshot for regression comparison (pass rate, tool calls, timing) |
| README.md | Documentation explaining the RunLedger CI gate and how to run it locally |
| .gitignore | Excludes runledger_out/ directory from version control |
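As a rough illustration of what a baseline comparison of this kind might do, the sketch below checks a run's metrics against a stored snapshot. The metric names (`pass_rate`, `tool_calls`, `wall_ms`), the threshold parameters, and the `check_regressions` function are hypothetical; the real keys in baselines/runledger-demo.json and the thresholds in suite.yaml may differ.

```python
def check_regressions(baseline, current, max_pass_rate_drop=0.0, max_increase_pct=10.0):
    """Return a list of human-readable regression messages (empty means the gate passes)."""
    failures = []

    # Pass rate may not fall below the baseline by more than the allowed drop.
    if current["pass_rate"] < baseline["pass_rate"] - max_pass_rate_drop:
        failures.append(
            f"pass_rate dropped from {baseline['pass_rate']} to {current['pass_rate']}"
        )

    # Cost-style metrics may not grow by more than the allowed percentage.
    for metric in ("tool_calls", "wall_ms"):
        limit = baseline[metric] * (1 + max_increase_pct / 100)
        if current[metric] > limit:
            failures.append(
                f"{metric} rose from {baseline[metric]} to {current[metric]} (limit {limit:.1f})"
            )

    return failures


# Hypothetical numbers: one extra tool call trips the gate, a small timing change does not.
baseline = {"pass_rate": 1.0, "tool_calls": 1, "wall_ms": 100}
current = {"pass_rate": 1.0, "tool_calls": 2, "wall_ms": 105}

failures = check_regressions(baseline, current)
for message in failures:
    print("REGRESSION:", message)
```

In CI, a non-empty result would typically translate into a non-zero exit code so the pull request check fails.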
        ticket = msg.get("input", {}).get("ticket", "")
        send({"type": "tool_call", "name": "search_docs", "call_id": "c1", "args": {"q": ticket}})
    elif msg.get("type") == "tool_result":
        send({"type": "final_output", "output": {"category": "account", "reply": "Reset password instructions sent."}})
The hardcoded output values in the agent ("category": "account", "reply": "Reset password instructions sent.") do not use any information from the tool result. The agent receives a search result with a snippet "Use the reset link." but ignores it completely. This creates a disconnect between the tool call and the output, making the tool call appear unnecessary.
Suggested change:

    - send({"type": "final_output", "output": {"category": "account", "reply": "Reset password instructions sent."}})
    + tool_output = msg.get("output")
    + details = ""
    + if tool_output is not None:
    +     try:
    +         details = " Details: " + json.dumps(tool_output)
    +     except (TypeError, ValueError):
    +         details = " Details: " + str(tool_output)
    + send(
    +     {
    +         "type": "final_output",
    +         "output": {
    +             "category": "account",
    +             "reply": "Reset password instructions sent." + details,
    +         },
    +     }
    + )
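An alternative to appending the raw tool output as JSON is to extract only the snippet and fold it into the reply. The sketch below assumes the tool result arrives as a dict with a `snippet` key, which is a guess based on the "Use the reset link." snippet mentioned in the comment; `build_final_output` is a hypothetical helper, not part of the PR.

```python
def build_final_output(tool_output):
    """Build the final_output message, using the search snippet when one is present."""
    reply = "Reset password instructions sent."

    snippet = ""
    if isinstance(tool_output, dict):
        # Assumed shape: {"snippet": "..."}; anything else falls back to the base reply.
        snippet = str(tool_output.get("snippet", "")).strip()

    if snippet:
        reply = f"{reply} {snippet}"

    return {
        "type": "final_output",
        "output": {"category": "account", "reply": reply},
    }


message = build_final_output({"snippet": "Use the reset link."})
print(message["output"]["reply"])
```

Keeping the reply a short plain sentence avoids embedding serialized JSON in user-facing text, while still tying the output to what the tool returned.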
    description: triage a login ticket
    input:
      ticket: reset password
    cassette: cassettes/t1.jsonl
    assertions:
      - type: required_fields
The description field "triage a login ticket" contains a potential inaccuracy. The test case is about resetting a password, not triaging a login issue. Consider updating this to "triage a password reset ticket" or similar to better match the actual ticket content.
Suggested change:

    - description: triage a login ticket
    + description: triage a password reset ticket
      input:
        ticket: reset password
      cassette: cassettes/t1.jsonl
      assertions:
        - type: required_fields
Summary
runledger/Runledger@v0.1
runledger_out/
How to run locally
Notes