
Conversation

@ftnext (Contributor) commented on Feb 8, 2026

Please ensure you have read the contribution guide before creating a pull request.

Link to Issue or Description of Change

1. Link to an existing issue (if applicable):

Problem:
adk eval did not support repeated runs, so users had to rely on external loops to reduce nondeterminism. When adding multi-run support, reporting also needs to stay consistent between the summary and the detailed output.

Solution:
Added --num_runs to adk eval (default: 1, min: 1) and wired it into inference execution so each eval set can be run multiple times in a single CLI invocation.
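
For illustration, an option with those constraints could be declared with click roughly as follows. This is a minimal sketch, not the code in this PR: the real cli_eval command takes more arguments, and the help text here is only illustrative.

  import click

  @click.command()
  @click.option(
      "--num_runs",
      type=click.IntRange(min=1),  # click reports a usage error for values below 1
      default=1,
      show_default=True,
      help="Number of times to run each eval set in a single invocation.",
  )
  def cli_eval(num_runs: int) -> None:
      # Placeholder body; the real command wires num_runs into inference execution.
      click.echo(f"Each eval set will run {num_runs} time(s).")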

For reporting:

  • Summary is aggregated per eval case (eval_set_id + eval_id) across runs (a sketch of this aggregation follows the list).
  • Detailed output (--print_detailed_results) keeps per-run details (each run's EvalCaseResult) to avoid mismatches between aggregated scores and invocation-level debugging info.
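
A minimal, self-contained sketch of that aggregation idea, using plain dicts in place of the PR's EvalCaseResult objects. The dict shapes and the pass/fail rule shown ("all aggregated metrics must pass") are assumptions for illustration, not the PR's exact logic.

  import statistics
  from collections import defaultdict

  def aggregate_by_eval_case(per_run_results):
      """Group per-run results by (eval_set_id, eval_id) and average each metric."""
      grouped = defaultdict(list)
      for result in per_run_results:
          grouped[(result["eval_set_id"], result["eval_id"])].append(result)

      summary = {}
      for case_key, case_results in grouped.items():
          scores = defaultdict(list)   # metric name -> scores across runs
          passes = defaultdict(list)   # metric name -> pass/fail across runs
          for result in case_results:
              for name, metric in result["metrics"].items():
                  scores[name].append(metric["score"])
                  passes[name].append(metric["passed"])
          summary[case_key] = {
              name: {
                  "score": statistics.mean(values),  # mean score across runs
                  "passed": all(passes[name]),       # assumed status rule
              }
              for name, values in scores.items()
          }
      return summary

With --num_runs 2, each (eval_set_id, eval_id) key then carries one averaged score per metric in the summary, while the per-run results remain available for the detailed output.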

Also added unit tests for:

  • Aggregation behavior across runs.
  • CLI validation for invalid --num_runs values (a hedged test sketch follows this list).
  • Multi-run summary aggregation and detailed per-run printing behavior.
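
As a rough illustration of the CLI validation test, here is a hedged sketch. It assumes the cli_eval click command is importable from google.adk.cli.cli_tools_click and that two positional arguments are enough to reach option parsing; the actual tests in this PR may be structured differently.

  from click.testing import CliRunner

  from google.adk.cli.cli_tools_click import cli_eval

  def test_cli_eval_rejects_num_runs_less_than_one(tmp_path):
      runner = CliRunner()
      # An out-of-range value should make click fail before the command body runs.
      result = runner.invoke(
          cli_eval, [str(tmp_path), "my_eval_set", "--num_runs", "0"]
      )
      assert result.exit_code != 0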

Testing Plan

Unit Tests:

  • I have added or updated unit tests for my change.
  • All unit tests pass locally.

Passed test command:

% pytest tests/unittests/cli

========================= 272 passed, 145 warnings in 9.53s =========================

Manual End-to-End (E2E) Tests:

  1. Run:
    adk eval <agent_dir> <eval_set_or_id> --num_runs 2
  2. Confirm summary is reported per eval case (not doubled by run count).
  3. Confirm eval set result files are saved for each run.
    ls <agent_dir>/.adk/eval_results/
  4. Run with detailed output:
    adk eval <agent_dir> <eval_set_or_id> --num_runs 2 --print_detailed_results
  5. Confirm detailed results are shown per run.

Checklist

  • I have read the CONTRIBUTING.md document.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.
  • I have manually tested my changes end-to-end.
  • Any dependent changes have been merged and published in downstream modules.

Additional context

This PR intentionally focuses on the adk eval --num_runs part of #4410.

Current persistence behavior in this PR:

  • For adk eval --num_runs N, eval set result files are saved for each run
    through the existing LocalEvalService flow (i.e., multiple
    .evalset_result.json files can be created).

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @ftnext, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the adk eval command-line utility by introducing a --num_runs flag, enabling repeated execution of evaluation sets. This feature aims to mitigate nondeterminism in evaluation results by running tests multiple times. The changes include robust aggregation of summary metrics across these runs, while maintaining the ability to view detailed results for each individual run, ensuring comprehensive reporting and debugging capabilities.

Highlights

  • Multi-run Evaluation Support: Introduced a new --num_runs option to the adk eval command, allowing users to execute evaluation sets multiple times to reduce nondeterminism.
  • Result Aggregation: Implemented logic to aggregate summary results per evaluation case (combining eval_set_id and eval_id) across all runs, providing a consolidated view.
  • Detailed Output Preservation: Ensured that when --print_detailed_results is used, the output still displays per-run details, preventing loss of invocation-level debugging information.
  • CLI Validation and Testing: Added unit tests for the new aggregation behavior, validation of the --num_runs argument (minimum value of 1), and correct multi-run summary and detailed printing.


Changelog
  • src/google/adk/cli/cli_eval.py
    • Imported statistics module for calculating mean scores during aggregation.
    • Imported EvalMetricResult and EvalStatus for use in new aggregation functions.
    • Added _generate_final_eval_status to determine the overall evaluation status from aggregated metric results.
    • Added _aggregate_metric_results to compute aggregated scores (mean) and status for individual metrics across multiple runs.
    • Implemented aggregate_eval_case_results to group and aggregate EvalCaseResult objects by their eval_set_id and eval_id.
    • Modified the _collect_inferences call to process a list of repeated_inference_requests based on the num_runs parameter.
    • Updated the main cli_eval function to apply aggregate_eval_case_results to the collected evaluation results before generating the summary.
  • src/google/adk/cli/cli_tools_click.py
    • Added the --num_runs option to the cli_eval command, defaulting to 1 and enforcing a minimum value of 1.
    • Updated the cli_eval function signature to accept the new num_runs argument.
    • Extended the cli_eval docstring to include a description for the NUM_RUNS parameter.
    • Imported the new aggregate_eval_case_results function.
    • Modified the cli_eval function to duplicate inference requests num_runs times, creating repeated_inference_requests.
    • Integrated the call to aggregate_eval_case_results to process the raw evaluation results before displaying them.
  • tests/unittests/cli/utils/test_cli_eval.py
    • Added necessary imports for aggregate_eval_case_results, EvalMetricResult, EvalStatus, and EvalCaseResult.
    • Introduced test_aggregate_eval_case_results_across_runs to verify the correct aggregation of EvalCaseResult objects when multiple runs are involved.
  • tests/unittests/cli/utils/test_cli_tools_click.py
    • Added imports for EvalMetricResult, EvalStatus, and EvalCaseResult for new test cases.
    • Added test_cli_eval_rejects_num_runs_less_than_one to validate the CLI's handling of invalid --num_runs values.
    • Added test_cli_eval_with_num_runs_aggregates_per_eval_case to confirm that adk eval correctly aggregates results for summary output when --num_runs is specified.
    • Added test_cli_eval_with_num_runs_prints_details_per_run to verify that detailed results are printed for each individual run when --print_detailed_results is combined with --num_runs.
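
Read together, the cli_tools_click.py and cli_eval.py entries above amount to the following flow. The sketch below is a simplified, runnable stand-in that uses plain dicts rather than the real request and EvalCaseResult types.

  from collections import defaultdict

  inference_requests = [
      {"eval_set_id": "set_1", "eval_id": "case_a"},
      {"eval_set_id": "set_1", "eval_id": "case_b"},
  ]
  num_runs = 2

  # cli_tools_click.py duplicates the requests so each eval case runs num_runs times.
  repeated_inference_requests = inference_requests * num_runs

  # Pretend each repeated request produced one per-run result, then group them per
  # eval case; this mirrors the grouping step inside aggregate_eval_case_results.
  results_by_case = defaultdict(list)
  for request in repeated_inference_requests:
      results_by_case[(request["eval_set_id"], request["eval_id"])].append(request)

  for case_key, runs in results_by_case.items():
      print(case_key, f"-> {len(runs)} run(s)")  # each case shows num_runs entries

Detailed output, by contrast, iterates over the ungrouped per-run results, which is why --print_detailed_results still shows one entry per run.
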
Activity
  • The pull request introduces a new --num_runs command-line option for adk eval to enable repeated evaluation runs.
  • New Python functions (_generate_final_eval_status, _aggregate_metric_results, aggregate_eval_case_results) have been added to handle the aggregation of evaluation metrics and statuses across multiple runs.
  • The core cli_eval logic has been updated to incorporate the multi-run execution and result aggregation.
  • Comprehensive unit tests have been added to validate the new --num_runs CLI option, the aggregation logic, and the behavior of detailed result printing.
  • The author has confirmed that all unit tests pass locally and has performed manual end-to-end tests to verify the feature's functionality, including summary aggregation and detailed result saving/printing.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review: /gemini review - Performs a code review for the current pull request in its current state.
  • Pull Request Summary: /gemini summary - Provides a summary of the current pull request in its current state.
  • Comment: @gemini-code-assist - Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help: /gemini help - Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@adk-bot added the "eval [Component] This issue is related to evaluation" label on Feb 8, 2026
@gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request introduces support for running evaluations multiple times using the --num_runs flag in adk eval. The implementation correctly handles the repetition of inference requests and aggregates the results for a consolidated summary report, while preserving per-run details in the detailed output. The code is well-structured and includes comprehensive unit tests for the new functionality. I've provided a couple of minor suggestions to enhance code conciseness.

Comment on lines +187 to +189
    if key not in eval_results_by_case_id:
      eval_results_by_case_id[key] = []
    eval_results_by_case_id[key].append(eval_result)

Severity: medium

To make the code more concise, you can use dict.setdefault() to simplify the logic for grouping evaluation results by case ID. This avoids the explicit check for the key's existence.

    eval_results_by_case_id.setdefault(key, []).append(eval_result)

Comment on lines +198 to +200
        if metric_name not in metric_results_by_name:
          metric_results_by_name[metric_name] = []
        metric_results_by_name[metric_name].append(metric_result)

Severity: medium

Similar to the previous comment, you can use dict.setdefault() here to make the code for grouping metric results by name more concise.

        metric_results_by_name.setdefault(metric_name, []).append(metric_result)

@ryanaiagent self-assigned this on Feb 10, 2026
@ryanaiagent (Collaborator) commented on Feb 10, 2026

Hi @ftnext, thank you for your contribution! We appreciate you taking the time to submit this pull request.
The _get_eval_results_by_eval_id method explicitly uses inference_requests = [...] * num_runs to repeat inference requests, so the framework itself manages the repetition to address non-determinism.
Regarding reporting, the EvalCaseResult class includes fields for both overall_eval_metric_results (for aggregated results) and eval_metric_result_per_invocation (for detailed per-invocation results). The _EvalMetricResultWithInvocation class further supports this by storing both actual and expected invocations alongside the evaluation metric result for detailed analysis.
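
For readers unfamiliar with those fields, a hedged illustration of how both views could be read from a single result object; the attribute names come from the comment above, and their exact types are an assumption.

  def describe(eval_case_result) -> None:
      # Aggregated view: one entry per metric for the whole eval case.
      for metric_result in eval_case_result.overall_eval_metric_results:
          print("overall:", metric_result)
      # Detailed view: metric results paired with actual and expected invocations.
      for per_invocation in eval_case_result.eval_metric_result_per_invocation:
          print("per invocation:", per_invocation)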

@ryanaiagent added the "request clarification [Status] The maintainer need clarification or more information from the author" and "answered [Status] This issue has been answered by the maintainer" labels on Feb 10, 2026
Run locally:
pyink --config pyproject.toml src/google/adk/cli/cli_eval.py src/google/adk/cli/cli_tools_click.py tests/unittests/cli/utils/test_cli_eval.py tests/unittests/cli/utils/test_cli_tools_click.py

https://github.com/google/adk-python/actions/runs/21793195105/job/63159177296
@ftnext (Contributor, Author) commented on Feb 11, 2026

@ryanaiagent Thanks for your comment.

You are absolutely right that AgentEvaluator already handles num_runs by repeating inference requests and supports both aggregated and detailed outputs via EvalCaseResult.

For this PR specifically, my focus is adk eval CLI behavior.
Today, adk eval does not call AgentEvaluator; it goes through LocalEvalService directly in cli_tools_click.py (_collect_inferences / _collect_eval_results).
So the intent of this change is to bring CLI behavior closer to the same pattern (num_runs repetition + aggregated summary + detailed per-run output) while staying within the current CLI execution path.

