Align built-in code evaluator interface with SDK evaluators #3591

## Motivation

The built-in code evaluator uses a legacy interface that differs from SDK custom evaluators (`@ag.evaluator`). This causes two problems.

First, users cannot access trace data. A built-in code evaluator cannot inspect spans, latency, token usage, or other internals recorded during execution, which limits what users can evaluate.

Second, the interface is confusing. The `app_params` parameter is deprecated and always empty. The `correct_answer` parameter requires a separate `correct_answer_key` setting. Users who learn the SDK interface must learn a different pattern for built-in evaluators.

## Current Interface

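```python
from typing import Dict, Union

def evaluate(
    app_params: Dict[str, str],      # deprecated, always {}
    inputs: Dict[str, str],          # inputs sent to the application
    output: Union[str, Dict],        # application output
    correct_answer: str,             # requires correct_answer_key setting
) -> float:
    ...  # user-defined scoring logic
```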

## Proposed Interface

```python
from typing import Any, Dict

def evaluate(
    testcase: Dict[str, Any],        # testcase data (includes correct_answer)
    inputs: Dict[str, Any],          # inputs sent to the application
    outputs: Any,                    # application outputs
    trace: Dict[str, Any],           # full trace data
) -> float:
    ...  # user-defined scoring logic
```

This matches the fields available to SDK evaluators through `WorkflowServiceRequestData`. Users access `correct_answer` directly from the testcase dict.
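As an illustration, an exact-match evaluator written against the proposed signature might look like the sketch below. It assumes the expected value is stored in the testcase under the key `correct_answer` and that an exact match after trimming whitespace is the desired check. Neither is prescribed by this issue.

```python
from typing import Any, Dict


def evaluate(
    testcase: Dict[str, Any],
    inputs: Dict[str, Any],
    outputs: Any,
    trace: Dict[str, Any],
) -> float:
    # Assumption: the expected value is stored under "correct_answer".
    expected = testcase.get("correct_answer")
    if expected is None:
        # No ground truth available for this testcase, so score it as a miss.
        return 0.0
    # Compare as trimmed strings so that non-string outputs are handled too.
    return 1.0 if str(outputs).strip() == str(expected).strip() else 0.0
```

With this shape there is no separate `correct_answer_key` setting to configure: the evaluator decides which testcase field to read.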

## Requirements

1. **Version flag for migration.** Add a hidden `version` setting (similar to the one used by LLM-as-a-judge). Existing evaluators keep version 1 and continue working. New evaluators default to version 2, which uses the new interface.

2. **Update default template and presets.** The default code and all presets should use the new interface. Consider adding presets that demonstrate trace access (e.g., latency checks).

3. **Support both interfaces in execution.** The handler and sandbox must check the version and pass the appropriate parameters to user code.

4. **Update documentation.** Document the new interface, the fields available in the trace dict, and the migration path for existing evaluators.
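The version check in requirements 1 and 3 could be sketched roughly as follows. This is not the actual handler code: `run_user_evaluator`, the shape of the `settings` dict, the fallback to version 1 when `version` is missing, and the `"correct_answer"` fallback for `correct_answer_key` are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict

LEGACY_VERSION = 1
CURRENT_VERSION = 2


def run_user_evaluator(
    user_fn: Callable[..., float],
    settings: Dict[str, Any],
    testcase: Dict[str, Any],
    inputs: Dict[str, Any],
    outputs: Any,
    trace: Dict[str, Any],
) -> float:
    """Hypothetical dispatcher: call user code with the arguments its version expects."""
    # Assumption: evaluators saved before the flag existed carry no "version" key.
    version = int(settings.get("version", LEGACY_VERSION))

    if version == LEGACY_VERSION:
        # Legacy signature: (app_params, inputs, output, correct_answer).
        key = settings.get("correct_answer_key", "correct_answer")
        correct_answer = str(testcase.get(key, ""))
        return float(user_fn({}, inputs, outputs, correct_answer))

    if version == CURRENT_VERSION:
        # New signature: (testcase, inputs, outputs, trace).
        return float(user_fn(testcase, inputs, outputs, trace))

    raise ValueError(f"Unsupported evaluator version: {version!r}")
```

Keeping the branch in one place means the handler and the sandbox can share the same mapping from version to argument list, and unknown versions fail loudly instead of silently calling user code with the wrong parameters.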
