Coded evaluators #509

radu-mocanu · 2025-08-22T13:14:59Z

UiPath Agent Evaluation Framework

Evaluation system for assessing agent performance with multiple evaluator types.

AgentEvaluator

Main orchestrator that manages multiple evaluators, handles agent execution, and provides async result streaming with error handling.
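The streaming-with-error-handling behaviour described above can be illustrated with a minimal, self-contained sketch. It does not use the uipath package and is not the AgentEvaluator implementation; `stream_results`, `exact`, and `broken` are hypothetical names chosen for illustration, and evaluators are modelled as plain async functions.

```python
import asyncio
from typing import Any, AsyncIterator, Awaitable, Callable, Dict, List, Tuple

# Hypothetical evaluator signature: (expected, actual) -> score
EvaluatorFn = Callable[[Dict[str, Any], Dict[str, Any]], Awaitable[float]]


async def stream_results(
    evaluators: Dict[str, EvaluatorFn],
    expected: Dict[str, Any],
    actual: Dict[str, Any],
) -> AsyncIterator[Tuple[str, Any]]:
    """Yield (evaluator_name, score_or_error) as each evaluator finishes."""

    async def run_one(name: str, fn: EvaluatorFn) -> Tuple[str, Any]:
        try:
            return name, await fn(expected, actual)
        except Exception as exc:
            # A failing evaluator is reported as an error and does not stop the others.
            return name, f"ERROR: {exc}"

    tasks = [asyncio.create_task(run_one(n, f)) for n, f in evaluators.items()]
    for finished in asyncio.as_completed(tasks):
        yield await finished


async def exact(expected: Dict[str, Any], actual: Dict[str, Any]) -> float:
    return 100.0 if expected == actual else 0.0


async def broken(expected: Dict[str, Any], actual: Dict[str, Any]) -> float:
    raise ValueError("boom")


async def collect() -> List[Tuple[str, Any]]:
    evaluators = {"exact": exact, "broken": broken}
    return [item async for item in stream_results(evaluators, {"a": 1}, {"a": 1})]


results = dict(asyncio.run(collect()))
```

Wrapping each evaluator in its own try/except means a single failure surfaces as one error entry while the remaining results are still delivered.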

Built-in Evaluators

ExactMatchEvaluator: Strict structural comparison with boolean scoring (True/False)
JsonSimilarityEvaluator: Flexible comparison with numerical scoring (0-100) using token-level similarity and Levenshtein distance
LlmAsAJudgeEvaluator: LLM-powered subjective assessment with structured JSON responses
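To make the difference between the first two scoring styles concrete, here is a rough sketch, assuming that "strict structural comparison" behaves like Python equality and that similarity is derived from the edit distance between canonical JSON strings. The functions `levenshtein`, `exact_match`, and `string_similarity` are illustrative only; the real JsonSimilarityEvaluator also uses token-level similarity and may weigh things differently.

```python
import json
from typing import Any


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def exact_match(expected: Any, actual: Any) -> bool:
    """Boolean score: True only when both structures are equal."""
    return expected == actual


def string_similarity(expected: Any, actual: Any) -> float:
    """Numerical score in [0, 100] based on edit distance of canonical JSON."""
    a = json.dumps(expected, sort_keys=True)
    b = json.dumps(actual, sort_keys=True)
    longest = max(len(a), len(b))
    return 100.0 * (1 - levenshtein(a, b) / longest)


strict = exact_match({"content": "123"}, {"content": 123})
close = string_similarity({"content": "123"}, {"content": "124"})
```

Here `strict` is False because a string and an integer differ, while `close` lands just below 100 because the two serialized outputs differ by a single character.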

EvaluationService

Orchestrates full evaluation workflows with auto-discovery, parallel execution, progress reporting, and result persistence.

Score Types

BOOLEAN: Pass/fail evaluations
NUMERICAL: Scored evaluations (0-100)
ERROR: Graceful error handling with diagnostics
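One way to picture how the three score types fit together is the sketch below. `ScoreKind`, `Result`, and `safe_score` are hypothetical stand-ins, not the package's ScoreType or EvaluationResult classes, whose exact fields are not shown in this description.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Union


class ScoreKind(Enum):
    """Hypothetical mirror of the three score types listed above."""
    BOOLEAN = "boolean"
    NUMERICAL = "numerical"
    ERROR = "error"


@dataclass
class Result:
    score: Union[bool, float]
    kind: ScoreKind
    details: Optional[str] = None  # diagnostics, filled in for errors


def safe_score(fn: Callable[[], Union[bool, float]]) -> Result:
    """Run a scoring function and classify its outcome instead of raising."""
    try:
        value = fn()
    except Exception as exc:
        return Result(score=0.0, kind=ScoreKind.ERROR, details=repr(exc))
    # bool must be checked first because bool is a subclass of int.
    if isinstance(value, bool):
        return Result(score=value, kind=ScoreKind.BOOLEAN)
    return Result(score=float(value), kind=ScoreKind.NUMERICAL)


passed = safe_score(lambda: True)
graded = safe_score(lambda: 87)
failed = safe_score(lambda: 1 / 0)
```

Returning an ERROR result rather than propagating the exception keeps one faulty evaluator from aborting a whole evaluation run.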

How it works:

Initialize the AgentEvaluator

from uipath.eval import AgentEvaluator
from uipath.eval.evaluators import (
    BaseEvaluator,
    ExactMatchEvaluator,
    JsonSimilarityEvaluator,
    LlmAsAJudgeEvaluator,
)

# ChatModels must also be imported; its module path is not shown in this description.
agent_evaluator = AgentEvaluator(
    evaluators=[
        ExactMatchEvaluator(),
        LlmAsAJudgeEvaluator(
            model=ChatModels.gpt_4o_2024_08_06,
            prompt="As an expert evaluator, analyze the semantic similarity of these JSON contents.\n----\nExpectedOutput:\n{{ExpectedOutput}}\n----\nActualOutput:\n{{ActualOutput}}\n",
        ),
        JsonSimilarityEvaluator(),
    ],
    path_to_agent="C:\\Users\\radu.mocanu\\agents_playground\\example",
)
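The judge prompt above contains `{{ExpectedOutput}}` and `{{ActualOutput}}` placeholders. As a sketch of how such placeholders might be filled before the prompt reaches the model (an assumption; the description does not show how LlmAsAJudgeEvaluator substitutes them), `render_prompt` below is a hypothetical helper using plain string replacement:

```python
import json
from typing import Any, Dict

TEMPLATE = (
    "As an expert evaluator, analyze the semantic similarity of these JSON contents.\n"
    "----\nExpectedOutput:\n{{ExpectedOutput}}\n"
    "----\nActualOutput:\n{{ActualOutput}}\n"
)


def render_prompt(template: str, expected: Dict[str, Any], actual: Dict[str, Any]) -> str:
    """Replace both placeholders with JSON-serialized outputs (illustrative only)."""
    return (
        template
        .replace("{{ExpectedOutput}}", json.dumps(expected))
        .replace("{{ActualOutput}}", json.dumps(actual))
    )


prompt = render_prompt(TEMPLATE, {"content": "123"}, {"content": "123"})
```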

Run the evaluations

# Must run inside an async function (or another context that allows top-level await).
results = await agent_evaluator.run_and_collect(
    agent_input={"human_input": 123},
    expected_output={"content": "123"},
)

(Optional) Define a custom evaluator

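from typing import Any, Dict, Optional

# UiPathEvalSpan, EvaluationResult and ScoreType must also be imported;
# their module paths are not shown in this description.


class CustomEvaluator(BaseEvaluator):
    async def evaluate(
        self,
        agent_input: Optional[Dict[str, Any]],
        expected_output: Dict[str, Any],
        actual_output: Dict[str, Any],
        uipath_eval_spans: Optional[list[UiPathEvalSpan]],
        execution_logs: str,
    ) -> EvaluationResult:
        # Full score if the agent printed the marker string, zero otherwise.
        if "my custom print" in execution_logs:
            return EvaluationResult(score=100, score_type=ScoreType.NUMERICAL)
        return EvaluationResult(score=0, score_type=ScoreType.NUMERICAL)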

Register it

agent_evaluator.add_evaluator(
    CustomEvaluator(
        name="MyCustomEvaluator",
        description="This evaluator checks if 'my custom print' was printed by the agent",
    ),
)

Run the evaluations (with real-time evaluation reporting)

async for eval_item_result in agent_evaluator.run(
    agent_input={"human_input": 123},
    expected_output={"content": "123"},
):
    # Handle each result as it arrives, for example by printing it.
    print(eval_item_result)

Development Package

Add this package as a dependency in your pyproject.toml, using one of the two version specifiers below:

[project]
dependencies = [
  # Exact version:
  "uipath==2.1.28.dev1005090830",

  # Or any version from this PR:
  "uipath>=2.1.28.dev1005090000,<2.1.28.dev1005100000"
]

[[tool.uv.index]]
name = "testpypi"
url = "https://test.pypi.org/simple/"
publish-url = "https://test.pypi.org/legacy/"
explicit = true

[tool.uv.sources]
uipath = { index = "testpypi" }

radu-mocanu self-assigned this Aug 22, 2025

radu-mocanu added the build:dev Create a dev build from the pr label Aug 22, 2025

radu-mocanu changed the title ~~[WIP]: capture traces for evals~~ [WIP]: capture spans for evals Aug 22, 2025

radu-mocanu force-pushed the fix/trace branch from d674560 to cb1df50 Compare August 26, 2025 15:20

radu-mocanu changed the title ~~[WIP]: capture spans for evals~~ [WIP]: coded evaluators Aug 26, 2025

radu-mocanu force-pushed the fix/trace branch from cb1df50 to 60b1917 Compare August 27, 2025 16:57

radu-mocanu changed the title ~~[WIP]: coded evaluators~~ Coded evaluators Aug 27, 2025

radu-mocanu requested a review from cristipufu August 27, 2025 16:58

radu-mocanu force-pushed the fix/trace branch 7 times, most recently from b3880e0 to d62f39b Compare August 27, 2025 17:21

radu-mocanu removed the build:dev Create a dev build from the pr label Aug 27, 2025

radu-mocanu force-pushed the fix/trace branch 2 times, most recently from dfce953 to 7f65e52 Compare August 27, 2025 17:38

feat: add AgentEvaluator

7e7e633

radu-mocanu force-pushed the fix/trace branch from 7f65e52 to 7e7e633 Compare September 1, 2025 16:12
