Problem Statement
Currently, running the amp-evaluation SDK locally requires writing Python scripts and managing configuration programmatically, which creates friction for rapid iteration and testing. Developers need a streamlined command-line interface to quickly evaluate agent traces and run experiments without boilerplate code.
Motivation
A dedicated CLI tool will:
- Reduce iteration time: Test evaluators instantly without writing Python scripts
- Improve developer experience: Simple commands for common evaluation workflows
- Enable CI/CD integration: Easy to incorporate into automated testing pipelines
- Lower barrier to entry: Make evaluation accessible to non-Python experts
Use Cases
1. Evaluate Exported Traces
Run evaluators against locally exported OTEL trace files (JSON format):

```bash
# Run all registered evaluators
amp-eval trace evaluate my_trace.json
# Run specific evaluators
amp-eval trace evaluate my_trace.json --evaluators latency,answer_relevancy
# Specify output format
amp-eval trace evaluate my_trace.json --output results.json --format json
# Use custom evaluator config
amp-eval trace evaluate my_trace.json --config evaluators.yaml
```
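For quick local testing it can help to fabricate a trace file by hand. The sketch below assumes the exported file follows the standard OTLP JSON layout (`resourceSpans` → `scopeSpans` → `spans`); the exact schema the CLI would accept is not defined in this proposal, so the field names and values are illustrative only.

```python
# Hypothetical helper: writes a minimal OTLP-style JSON trace to evaluate with
# `amp-eval trace evaluate my_trace.json`. The resourceSpans/scopeSpans/spans
# layout is an assumption based on the standard OTEL JSON export, not a
# confirmed amp-eval requirement.
import json

trace = {
    "resourceSpans": [{
        "resource": {"attributes": [
            {"key": "service.name", "value": {"stringValue": "my-agent"}},
        ]},
        "scopeSpans": [{
            "scope": {"name": "agent"},
            "spans": [{
                "traceId": "5b8aa5a2d2c872e8321cf37308d69df2",
                "spanId": "051581bf3cb55c13",
                "name": "agent.invoke",
                "startTimeUnixNano": "1700000000000000000",
                "endTimeUnixNano": "1700000001500000000",
                "attributes": [
                    {"key": "llm.response",
                     "value": {"stringValue": "Paris is the capital of France."}},
                ],
            }],
        }],
    }],
}

with open("my_trace.json", "w") as f:
    json.dump(trace, f, indent=2)
```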
2. Run Dataset Experiments
Execute experiment runs against local datasets:

```bash
# Run experiment with dataset
amp-eval experiment run my_dataset.csv --evaluators hallucination,toxicity
# Specify agent invoker
amp-eval experiment run my_dataset.csv --agent my_agent.py:invoke_agent
# Save detailed results
amp-eval experiment run my_dataset.csv --output-dir ./results --save-traces
# Resume failed experiment
amp-eval experiment run my_dataset.csv --resume ./results/experiment_123
```
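The `--agent my_agent.py:invoke_agent` flag implies a `module:function` entry point that the CLI loads and calls once per dataset row. The proposal does not define that contract, so the signature below is an assumption: a function that receives one record and returns the agent's answer.

```python
# my_agent.py -- hypothetical invoker referenced by `--agent my_agent.py:invoke_agent`.
# The row-in / answer-out signature is an assumption for illustration; the real
# contract would be defined by the CLI implementation.
from typing import Any, Dict


def invoke_agent(row: Dict[str, Any]) -> str:
    """Call the agent for one dataset row and return its answer."""
    question = row["question"]  # assumed dataset column name
    # ... invoke the real agent / LLM here ...
    return f"stub answer for: {question}"
```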
3. Evaluator Management
List and inspect available evaluators:

```bash
# List all registered evaluators
amp-eval evaluators list
# Show evaluator details
amp-eval evaluators info answer_relevancy
# Validate custom evaluator
amp-eval evaluators validate my_evaluator.py
```
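`amp-eval evaluators validate my_evaluator.py` suggests the CLI can load a user-supplied evaluator module and check that it exposes the expected interface. The SDK's evaluator contract is not spelled out here, so the sketch below uses a generic score-plus-reason return shape purely as an illustration of what such a file might contain.

```python
# my_evaluator.py -- hypothetical custom evaluator. The function name and the
# score/reason return shape are assumptions; the real interface would come from
# the amp-evaluation SDK.
from typing import Optional


def evaluate(output: str, expected: Optional[str] = None) -> dict:
    """Score one agent output between 0.0 and 1.0."""
    if not output.strip():
        return {"score": 0.0, "reason": "empty output"}
    if expected is not None and expected.lower() in output.lower():
        return {"score": 1.0, "reason": "expected answer found in output"}
    return {"score": 0.5, "reason": "no reference answer to compare against"}
```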
4. Configuration Management
Manage evaluation configurations:

```bash
# Initialize config template
amp-eval init --template experiment
# Validate configuration
amp-eval config validate evaluators.yaml
# Show current config
amp-eval config show
```

Proposed Commands
Command Structure

```
amp-eval <resource> <action> [arguments] [options]
```
Command Reference
| Command | Description |
|---|---|
| `amp-eval trace evaluate <file>` | Evaluate single trace file |
| `amp-eval trace batch <dir>` | Evaluate multiple traces in directory |
| `amp-eval experiment run <dataset>` | Run experiment against dataset |
| `amp-eval evaluators list` | List available evaluators |
| `amp-eval evaluators info <name>` | Show evaluator details |
| `amp-eval init` | Initialize configuration template |
| `amp-eval config validate <file>` | Validate config file |
| `amp-eval version` | Show CLI and SDK versions |
Acceptance Criteria
- Users can evaluate a trace file with a single command
- Users can run experiments against datasets without writing code
- CLI provides helpful error messages and validation
- Configuration can be specified via files or flags
- Output formats include console, JSON, and CSV
- Documentation includes usage examples for all commands
- CLI is packaged and installable via pip
- Exit codes are appropriate for CI/CD integration
- Tests cover core CLI functionality
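To make the exit-code and testing criteria concrete, a CI smoke test could shell out to the CLI and assert on return codes. This sketch assumes the conventional 0-on-success / non-zero-on-error behavior proposed above; the fixture path is a placeholder.

```python
# test_cli_smoke.py -- hedged sketch of a pytest-style smoke test for CI.
# Assumes exit code 0 on success and non-zero on errors; paths are placeholders.
import subprocess


def test_evaluate_valid_trace_exits_zero():
    result = subprocess.run(
        ["amp-eval", "trace", "evaluate", "tests/fixtures/my_trace.json"],
        capture_output=True, text=True,
    )
    assert result.returncode == 0, result.stderr


def test_missing_trace_file_exits_nonzero():
    result = subprocess.run(
        ["amp-eval", "trace", "evaluate", "does_not_exist.json"],
        capture_output=True, text=True,
    )
    assert result.returncode != 0
```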