radu-mocanu
Contributor

@radu-mocanu radu-mocanu commented Aug 22, 2025

UiPath Agent Evaluation Framework

An evaluation system for assessing agent performance using multiple evaluator types.

AgentEvaluator

Main orchestrator that manages multiple evaluators, handles agent execution, and provides async result streaming with error handling.
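To illustrate what "async result streaming with error handling" can look like, here is a minimal sketch that is not taken from this PR. `stream_results`, `EvaluatorFn`, and the two toy evaluators are hypothetical names used only for illustration. The sketch assumes each evaluator is an async callable and that a failing evaluator should be reported rather than abort the others.

```python
import asyncio
from typing import Any, AsyncIterator, Awaitable, Callable, Dict, Tuple

# Hypothetical evaluator signature for this sketch only: an async callable
# that takes (expected, actual) and returns a numeric score.
EvaluatorFn = Callable[[Dict[str, Any], Dict[str, Any]], Awaitable[float]]


async def stream_results(
    evaluators: Dict[str, EvaluatorFn],
    expected: Dict[str, Any],
    actual: Dict[str, Any],
) -> AsyncIterator[Tuple[str, str, Any]]:
    """Yield (name, status, payload) as soon as each evaluator finishes."""

    async def run_one(name: str, fn: EvaluatorFn) -> Tuple[str, str, Any]:
        try:
            return name, "ok", await fn(expected, actual)
        except Exception as exc:  # one failure must not stop the others
            return name, "error", repr(exc)

    tasks = [asyncio.create_task(run_one(n, f)) for n, f in evaluators.items()]
    for finished in asyncio.as_completed(tasks):
        yield await finished


async def exact(expected: Dict[str, Any], actual: Dict[str, Any]) -> float:
    return 100.0 if expected == actual else 0.0


async def broken(expected: Dict[str, Any], actual: Dict[str, Any]) -> float:
    raise ValueError("evaluator crashed")


async def collect() -> Dict[str, Tuple[str, Any]]:
    out: Dict[str, Tuple[str, Any]] = {}
    async for name, status, payload in stream_results(
        {"exact": exact, "broken": broken},
        {"content": "123"},
        {"content": "123"},
    ):
        out[name] = (status, payload)
    return out


results = asyncio.run(collect())
```

The crashing evaluator surfaces as an "error" item in the stream while the other evaluator still reports its score.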

Built-in Evaluators

  • ExactMatchEvaluator: Strict structural comparison with boolean scoring (True/False)
  • JsonSimilarityEvaluator: Flexible comparison with numerical scoring (0-100) using token-level similarity and Levenshtein distance
  • LlmAsAJudgeEvaluator: LLM-powered subjective assessment with structured JSON responses
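The difference between strict boolean scoring and 0-100 similarity scoring can be sketched as follows. This is an illustrative approximation, not the code in this PR: the function names are hypothetical, and the per-key averaging rule is an assumption chosen to keep the example short. Only the use of Levenshtein distance comes from the description above.

```python
from typing import Any, Dict


def exact_match(expected: Any, actual: Any) -> bool:
    """Strict structural comparison: True only if both values are equal."""
    return expected == actual


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def string_similarity(expected: str, actual: str) -> float:
    """Map edit distance to a 0-100 score (100 means identical)."""
    longest = max(len(expected), len(actual))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(expected, actual) / longest)


def json_similarity(expected: Dict[str, Any], actual: Dict[str, Any]) -> float:
    """Assumed rule: average the per-key score over the expected keys."""
    if not expected:
        return 100.0
    total = 0.0
    for key, exp_val in expected.items():
        if key not in actual:
            continue  # a missing key contributes 0
        act_val = actual[key]
        if isinstance(exp_val, str) and isinstance(act_val, str):
            total += string_similarity(exp_val, act_val)
        elif exp_val == act_val:
            total += 100.0
    return total / len(expected)
```

With these helpers, `{"content": "123"}` versus `{"content": 123}` fails the exact match because the types differ, while two nearly identical strings still earn partial credit from the similarity score.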

EvaluationService

Orchestrates full evaluation workflows with auto-discovery, parallel execution, progress reporting, and result persistence.

Score Types

  • BOOLEAN: Pass/fail evaluations
  • NUMERICAL: Scored evaluations (0-100)
  • ERROR: Graceful error handling with diagnostics
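As a rough sketch of how the three score types could fit together, the snippet below defines local stand-ins for `ScoreType` and `EvaluationResult`. These are not the package's actual definitions: the enum values, the `details` field, and the `safe_score` helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Optional, Union


class ScoreType(Enum):
    # Local stand-in; the string values are assumptions.
    BOOLEAN = "boolean"
    NUMERICAL = "numerical"
    ERROR = "error"


@dataclass
class EvaluationResult:
    # Local stand-in; `details` is a hypothetical field for diagnostics.
    score: Union[bool, float]
    score_type: ScoreType
    details: Optional[str] = None


def safe_score(fn: Callable[..., Any], *args: Any) -> EvaluationResult:
    """Run a scoring function and classify its outcome by score type."""
    try:
        value = fn(*args)
    except Exception as exc:
        # Report the failure as an ERROR result instead of raising.
        return EvaluationResult(0, ScoreType.ERROR, details=repr(exc))
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return EvaluationResult(value, ScoreType.BOOLEAN)
    value = float(value)
    if not 0 <= value <= 100:
        return EvaluationResult(0, ScoreType.ERROR, f"score {value} outside 0-100")
    return EvaluationResult(value, ScoreType.NUMERICAL)
```

Calling `safe_score(lambda: 1 / 0)` yields an ERROR result whose `details` carries the exception text, so a caller can keep processing the remaining evaluators.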

How it works:

  1. Initialize the AgentEvaluator
# ChatModels must also be imported; its module path is not shown in this description.
from uipath.eval import AgentEvaluator
from uipath.eval.evaluators import (
    BaseEvaluator,
    ExactMatchEvaluator,
    JsonSimilarityEvaluator,
    LlmAsAJudgeEvaluator,
)

agent_evaluator = AgentEvaluator(
    evaluators=[
        ExactMatchEvaluator(),
        LlmAsAJudgeEvaluator(
            model=ChatModels.gpt_4o_2024_08_06,
            prompt="As an expert evaluator, analyze the semantic similarity of these JSON contents.\n----\nExpectedOutput:\n{{ExpectedOutput}}\n----\nActualOutput:\n{{ActualOutput}}\n",
        ),
        JsonSimilarityEvaluator(),
    ],
    path_to_agent="C:\\Users\\radu.mocanu\\agents_playground\\example",
)
  2. Run the evaluations
# Must run inside an async function (or a REPL that supports top-level await).
await agent_evaluator.run_and_collect(
    agent_input={
        "human_input": 123
    },
    expected_output={
        "content": "123"
    },
)
  3. (Optional) Define a custom evaluator
from typing import Any, Dict, Optional

# UiPathEvalSpan, EvaluationResult and ScoreType must also be imported;
# their module paths are not shown in this description.
class CustomEvaluator(BaseEvaluator):
    async def evaluate(
        self,
        agent_input: Optional[Dict[str, Any]],
        expected_output: Dict[str, Any],
        actual_output: Dict[str, Any],
        uipath_eval_spans: Optional[list[UiPathEvalSpan]],
        execution_logs: str,
    ) -> EvaluationResult:
        # Full score if the agent printed the marker string, zero otherwise.
        if "my custom print" in execution_logs:
            return EvaluationResult(
                score=100,
                score_type=ScoreType.NUMERICAL,
            )
        return EvaluationResult(
            score=0,
            score_type=ScoreType.NUMERICAL,
        )
  • Register it
agent_evaluator.add_evaluator(
    CustomEvaluator(
        name="MyCustomEvaluator",
        description="This evaluator checks if 'my custom print' was printed by the agent",
    ),
)
  • Run the evaluations (with real-time evaluation reporting)
async for eval_item_result in agent_evaluator.run(
    agent_input={
        "human_input": 123
    },
    expected_output={
        "content": "123"
    },
):
    # Placeholder body: handle each result as soon as it is reported.
    print(eval_item_result)

Development Package

  • Add this package as a dependency in your pyproject.toml:
[project]
dependencies = [
  # Exact version:
  "uipath==2.1.28.dev1005090830",

  # Any version from PR
  "uipath>=2.1.28.dev1005090000,<2.1.28.dev1005100000"
]

[[tool.uv.index]]
name = "testpypi"
url = "https://test.pypi.org/simple/"
publish-url = "https://test.pypi.org/legacy/"
explicit = true

[tool.uv.sources]
uipath = { index = "testpypi" }

@radu-mocanu radu-mocanu self-assigned this Aug 22, 2025
@radu-mocanu radu-mocanu added the build:dev Create a dev build from the pr label Aug 22, 2025
@radu-mocanu radu-mocanu changed the title [WIP]: capture traces for evals [WIP]: capture spans for evals Aug 22, 2025
@radu-mocanu radu-mocanu changed the title [WIP]: capture spans for evals [WIP]: coded evaluators Aug 26, 2025
@radu-mocanu radu-mocanu changed the title [WIP]: coded evaluators Coded evaluators Aug 27, 2025
@radu-mocanu radu-mocanu requested a review from cristipufu August 27, 2025 16:58
@radu-mocanu radu-mocanu force-pushed the fix/trace branch 7 times, most recently from b3880e0 to d62f39b Compare August 27, 2025 17:21
@radu-mocanu radu-mocanu removed the build:dev Create a dev build from the pr label Aug 27, 2025
@radu-mocanu radu-mocanu force-pushed the fix/trace branch 2 times, most recently from dfce953 to 7f65e52 Compare August 27, 2025 17:38