Align built-in code evaluator interface with SDK evaluators #3591

@mmabrouk

Motivation

The built-in code evaluator uses a legacy interface that differs from SDK custom evaluators (@ag.evaluator). This causes two problems.

First, users cannot access trace data. Evaluators cannot inspect spans, latency, token usage, or other internals recorded during execution, so evaluations are limited to the inputs, the output, and the expected answer.

Second, the interface is confusing. The app_params parameter is deprecated and always empty. The correct_answer parameter requires a separate correct_answer_key setting. Users who learn the SDK interface must learn a different pattern for built-in evaluators.

Current Interface

from typing import Dict, Union

def evaluate(
    app_params: Dict[str, str],      # deprecated, always {}
    inputs: Dict[str, str],
    output: Union[str, Dict],
    correct_answer: str,             # requires correct_answer_key setting
) -> float:
    ...

Proposed Interface

from typing import Any, Dict

def evaluate(
    testcase: Dict[str, Any],        # testcase data (includes correct_answer)
    inputs: Dict[str, Any],          # inputs sent to the application
    outputs: Any,                    # application outputs
    trace: Dict[str, Any],           # full trace data
) -> float:
    ...

This matches the fields available to SDK evaluators through WorkflowServiceRequestData. Users access correct_answer directly from the testcase dict.
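
To illustrate the proposed signature, here is a minimal sketch of an exact-match evaluator. It assumes the expected answer is stored under a `correct_answer` key in the testcase and that comparing `outputs` as a string is acceptable; the sample data at the bottom is hypothetical.

```python
from typing import Any, Dict


def evaluate(
    testcase: Dict[str, Any],
    inputs: Dict[str, Any],
    outputs: Any,
    trace: Dict[str, Any],
) -> float:
    """Return 1.0 when the output matches the expected answer, else 0.0."""
    # Assumption: the expected answer lives under the "correct_answer" key.
    expected = testcase.get("correct_answer")
    if expected is None:
        return 0.0
    return 1.0 if str(outputs).strip() == str(expected).strip() else 0.0


# Hypothetical sample data, for illustration only.
score = evaluate(
    testcase={"country": "France", "correct_answer": "Paris"},
    inputs={"country": "France"},
    outputs=" Paris ",
    trace={},
)
print(score)
```

In this sketch the evaluator reads the expected answer straight from the testcase and never consults a `correct_answer_key` setting.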

Requirements

  1. Version flag for migration. Add a hidden version setting (similar to the one used for LLM-as-a-judge). Existing evaluators keep version 1 and continue working. New evaluators default to version 2 with the new interface.

  2. Update default template and presets. The default code and all presets should use the new interface. Consider adding presets that demonstrate trace access (e.g., latency checks).

  3. Support both interfaces in execution. The handler and sandbox must check the version and pass the appropriate parameters to user code.

  4. Update documentation. Document the new interface, the available fields in the trace dict, and the migration path for existing evaluators.
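
Requirements 2 and 3 could be sketched roughly as follows. This is not the actual handler or sandbox code: `run_evaluator`, `legacy_exact_match`, and `latency_check` are hypothetical names, and the `latency_ms` trace field and the 2000 ms threshold are assumptions made only for illustration, since the trace schema is not specified here.

```python
from typing import Any, Callable, Dict


def run_evaluator(
    evaluate: Callable[..., float],
    version: int,
    testcase: Dict[str, Any],
    inputs: Dict[str, Any],
    outputs: Any,
    trace: Dict[str, Any],
    correct_answer_key: str = "correct_answer",
) -> float:
    """Call user code with the arguments that match its interface version."""
    if version == 1:
        # Legacy interface: app_params is deprecated and always empty,
        # and correct_answer is looked up through correct_answer_key.
        correct_answer = str(testcase.get(correct_answer_key, ""))
        return float(evaluate({}, inputs, outputs, correct_answer))
    if version == 2:
        # New interface: pass the testcase and the trace through unchanged.
        return float(evaluate(testcase, inputs, outputs, trace))
    raise ValueError(f"Unsupported evaluator version: {version}")


def legacy_exact_match(app_params, inputs, output, correct_answer):
    """Version 1 style evaluator."""
    return 1.0 if output == correct_answer else 0.0


def latency_check(testcase, inputs, outputs, trace):
    """Version 2 style evaluator that only looks at the trace."""
    # Hypothetical field: the real trace may name or nest latency differently.
    latency_ms = trace.get("latency_ms", float("inf"))
    return 1.0 if latency_ms <= 2000 else 0.0


testcase = {"correct_answer": "Paris"}
v1_score = run_evaluator(legacy_exact_match, 1, testcase, {}, "Paris", {})
v2_score = run_evaluator(latency_check, 2, testcase, {}, "Paris", {"latency_ms": 850})
print(v1_score, v2_score)
```

Keeping the version check in one dispatch function means existing version 1 code runs unmodified, while version 2 code receives the full testcase and trace.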
