
Conversation

@ftnext (Contributor) commented on Feb 8, 2026

Please ensure you have read the contribution guide before creating a pull request.

Link to Issue or Description of Change

1. Link to an existing issue (if applicable):

Problem:
adk eval did not support repeated runs, so users had to rely on external loops to reduce nondeterminism. When adding multi-run support, reporting also needs to stay consistent between the summary and the detailed output.

Solution:
Added --num_runs to adk eval (default: 1, min: 1) and wired it into inference execution so each eval set can be run multiple times in a single CLI invocation.
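
For illustration, an option with those constraints could be declared with click roughly as follows. This is a minimal sketch, not the code in this PR: the real cli_eval command takes more arguments, and the help text here is only illustrative.

  import click

  @click.command()
  @click.option(
      "--num_runs",
      type=click.IntRange(min=1),  # click reports a usage error for values below 1
      default=1,
      show_default=True,
      help="Number of times to run each eval set in a single invocation.",
  )
  def cli_eval(num_runs: int) -> None:
      # Placeholder body; the real command wires num_runs into inference execution.
      click.echo(f"Each eval set will run {num_runs} time(s).")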

For reporting:

  • Summary is aggregated per eval case (eval_set_id + eval_id) across runs (a sketch of this aggregation follows the list).
  • Detailed output (--print_detailed_results) keeps per-run details (each run's EvalCaseResult) to avoid mismatches between aggregated scores and invocation-level debugging info.
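
A minimal, self-contained sketch of that aggregation idea, using plain dicts in place of the PR's EvalCaseResult objects. The dict shapes and the pass/fail rule shown ("all aggregated metrics must pass") are assumptions for illustration, not the PR's exact logic.

  import statistics
  from collections import defaultdict

  def aggregate_by_eval_case(per_run_results):
      """Group per-run results by (eval_set_id, eval_id) and average each metric."""
      grouped = defaultdict(list)
      for result in per_run_results:
          grouped[(result["eval_set_id"], result["eval_id"])].append(result)

      summary = {}
      for case_key, case_results in grouped.items():
          scores = defaultdict(list)   # metric name -> scores across runs
          passes = defaultdict(list)   # metric name -> pass/fail across runs
          for result in case_results:
              for name, metric in result["metrics"].items():
                  scores[name].append(metric["score"])
                  passes[name].append(metric["passed"])
          summary[case_key] = {
              name: {
                  "score": statistics.mean(values),  # mean score across runs
                  "passed": all(passes[name]),       # assumed status rule
              }
              for name, values in scores.items()
          }
      return summary

With --num_runs 2, each (eval_set_id, eval_id) key then carries one averaged score per metric in the summary, while the per-run results remain available for the detailed output.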

Also added unit tests for:

  • Aggregation behavior across runs.
  • CLI validation for invalid --num_runs values (a hedged test sketch follows this list).
  • Multi-run summary aggregation and detailed per-run printing behavior.
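
As a rough illustration of the CLI validation test, here is a hedged sketch. It assumes the cli_eval click command is importable from google.adk.cli.cli_tools_click and that two positional arguments are enough to reach option parsing; the actual tests in this PR may be structured differently.

  from click.testing import CliRunner

  from google.adk.cli.cli_tools_click import cli_eval

  def test_cli_eval_rejects_num_runs_less_than_one(tmp_path):
      runner = CliRunner()
      # An out-of-range value should make click fail before the command body runs.
      result = runner.invoke(
          cli_eval, [str(tmp_path), "my_eval_set", "--num_runs", "0"]
      )
      assert result.exit_code != 0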

Testing Plan

Unit Tests:

  • I have added or updated unit tests for my change.
  • All unit tests pass locally.

Passed test command:

% pytest tests/unittests/cli

========================= 272 passed, 145 warnings in 9.53s =========================

Manual End-to-End (E2E) Tests:

  1. Run:
    adk eval <agent_dir> <eval_set_or_id> --num_runs 2
  2. Confirm summary is reported per eval case (not doubled by run count).
  3. Confirm eval set result files are saved for each run.
    ls <agent_dir>/.adk/eval_results/
  4. Run with detailed output:
    adk eval <agent_dir> <eval_set_or_id> --num_runs 2 --print_detailed_results
  5. Confirm detailed results are shown per run.

Checklist

  • I have read the CONTRIBUTING.md document.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.
  • I have manually tested my changes end-to-end.
  • Any dependent changes have been merged and published in downstream modules.

Additional context

This PR intentionally focuses on the adk eval --num_runs part of #4410.

Current persistence behavior in this PR:

  • For adk eval --num_runs N, eval set result files are saved for each run
    through the existing LocalEvalService flow (i.e., multiple
    .evalset_result.json files can be created).

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @ftnext, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the adk eval command-line utility by introducing a --num_runs flag, enabling repeated execution of evaluation sets. This feature aims to mitigate nondeterminism in evaluation results by running tests multiple times. The changes include robust aggregation of summary metrics across these runs, while maintaining the ability to view detailed results for each individual run, ensuring comprehensive reporting and debugging capabilities.

Highlights

  • Multi-run Evaluation Support: Introduced a new --num_runs option to the adk eval command, allowing users to execute evaluation sets multiple times to reduce nondeterminism.
  • Result Aggregation: Implemented logic to aggregate summary results per evaluation case (combining eval_set_id and eval_id) across all runs, providing a consolidated view.
  • Detailed Output Preservation: Ensured that when --print_detailed_results is used, the output still displays per-run details, preventing loss of invocation-level debugging information.
  • CLI Validation and Testing: Added unit tests for the new aggregation behavior, validation of the --num_runs argument (minimum value of 1), and correct multi-run summary and detailed printing.


Changelog
  • src/google/adk/cli/cli_eval.py
    • Imported statistics module for calculating mean scores during aggregation.
    • Imported EvalMetricResult and EvalStatus for use in new aggregation functions.
    • Added _generate_final_eval_status to determine the overall evaluation status from aggregated metric results.
    • Added _aggregate_metric_results to compute aggregated scores (mean) and status for individual metrics across multiple runs.
    • Implemented aggregate_eval_case_results to group and aggregate EvalCaseResult objects by their eval_set_id and eval_id.
    • Modified the _collect_inferences call to process a list of repeated_inference_requests based on the num_runs parameter.
    • Updated the main cli_eval function to apply aggregate_eval_case_results to the collected evaluation results before generating the summary.
  • src/google/adk/cli/cli_tools_click.py
    • Added the --num_runs option to the cli_eval command, defaulting to 1 and enforcing a minimum value of 1.
    • Updated the cli_eval function signature to accept the new num_runs argument.
    • Extended the cli_eval docstring to include a description for the NUM_RUNS parameter.
    • Imported the new aggregate_eval_case_results function.
    • Modified the cli_eval function to duplicate inference requests num_runs times, creating repeated_inference_requests.
    • Integrated the call to aggregate_eval_case_results to process the raw evaluation results before displaying them.
  • tests/unittests/cli/utils/test_cli_eval.py
    • Added necessary imports for aggregate_eval_case_results, EvalMetricResult, EvalStatus, and EvalCaseResult.
    • Introduced test_aggregate_eval_case_results_across_runs to verify the correct aggregation of EvalCaseResult objects when multiple runs are involved.
  • tests/unittests/cli/utils/test_cli_tools_click.py
    • Added imports for EvalMetricResult, EvalStatus, and EvalCaseResult for new test cases.
    • Added test_cli_eval_rejects_num_runs_less_than_one to validate the CLI's handling of invalid --num_runs values.
    • Added test_cli_eval_with_num_runs_aggregates_per_eval_case to confirm that adk eval correctly aggregates results for summary output when --num_runs is specified.
    • Added test_cli_eval_with_num_runs_prints_details_per_run to verify that detailed results are printed for each individual run when --print_detailed_results is combined with --num_runs.
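
Read together, the cli_tools_click.py and cli_eval.py entries above amount to the following flow. The sketch below is a simplified, runnable stand-in that uses plain dicts rather than the real request and EvalCaseResult types.

  from collections import defaultdict

  inference_requests = [
      {"eval_set_id": "set_1", "eval_id": "case_a"},
      {"eval_set_id": "set_1", "eval_id": "case_b"},
  ]
  num_runs = 2

  # cli_tools_click.py duplicates the requests so each eval case runs num_runs times.
  repeated_inference_requests = inference_requests * num_runs

  # Pretend each repeated request produced one per-run result, then group them per
  # eval case; this mirrors the grouping step inside aggregate_eval_case_results.
  results_by_case = defaultdict(list)
  for request in repeated_inference_requests:
      results_by_case[(request["eval_set_id"], request["eval_id"])].append(request)

  for case_key, runs in results_by_case.items():
      print(case_key, f"-> {len(runs)} run(s)")  # each case shows num_runs entries

Detailed output, by contrast, iterates over the ungrouped per-run results, which is why --print_detailed_results still shows one entry per run.
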
Activity
  • The pull request introduces a new --num_runs command-line option for adk eval to enable repeated evaluation runs.
  • New Python functions (_generate_final_eval_status, _aggregate_metric_results, aggregate_eval_case_results) have been added to handle the aggregation of evaluation metrics and statuses across multiple runs.
  • The core cli_eval logic has been updated to incorporate the multi-run execution and result aggregation.
  • Comprehensive unit tests have been added to validate the new --num_runs CLI option, the aggregation logic, and the behavior of detailed result printing.
  • The author has confirmed that all unit tests pass locally and has performed manual end-to-end tests to verify the feature's functionality, including summary aggregation and detailed result saving/printing.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review: /gemini review - Performs a code review for the current pull request in its current state.
  • Pull Request Summary: /gemini summary - Provides a summary of the current pull request in its current state.
  • Comment: @gemini-code-assist - Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help: /gemini help - Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@adk-bot added the "eval [Component] This issue is related to evaluation" label on Feb 8, 2026
@gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request introduces support for running evaluations multiple times using the --num_runs flag in adk eval. The implementation correctly handles the repetition of inference requests and aggregates the results for a consolidated summary report, while preserving per-run details in the detailed output. The code is well-structured and includes comprehensive unit tests for the new functionality. I've provided a couple of minor suggestions to enhance code conciseness.

Comment on lines +187 to +189
    if key not in eval_results_by_case_id:
      eval_results_by_case_id[key] = []
    eval_results_by_case_id[key].append(eval_result)

Severity: medium

To make the code more concise, you can use dict.setdefault() to simplify the logic for grouping evaluation results by case ID. This avoids the explicit check for the key's existence.

    eval_results_by_case_id.setdefault(key, []).append(eval_result)

Comment on lines +198 to +200
        if metric_name not in metric_results_by_name:
          metric_results_by_name[metric_name] = []
        metric_results_by_name[metric_name].append(metric_result)

Severity: medium

Similar to the previous comment, you can use dict.setdefault() here to make the code for grouping metric results by name more concise.

        metric_results_by_name.setdefault(metric_name, []).append(metric_result)

@ryanaiagent self-assigned this on Feb 10, 2026
@ryanaiagent (Collaborator) commented on Feb 10, 2026

Hi @ftnext, thank you for your contribution! We appreciate you taking the time to submit this pull request.
The _get_eval_results_by_eval_id method explicitly uses inference_requests = [...] * num_runs to repeat inference requests, so the framework itself manages the repetition to address non-determinism.
Regarding reporting, the EvalCaseResult class includes fields for both overall_eval_metric_results (for aggregated results) and eval_metric_result_per_invocation (for detailed per-invocation results). The _EvalMetricResultWithInvocation class further supports this by storing both actual and expected invocations alongside the evaluation metric result for detailed analysis.
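
For readers unfamiliar with those fields, a hedged illustration of how both views could be read from a single result object; the attribute names come from the comment above, and their exact types are an assumption.

  def describe(eval_case_result) -> None:
      # Aggregated view: one entry per metric for the whole eval case.
      for metric_result in eval_case_result.overall_eval_metric_results:
          print("overall:", metric_result)
      # Detailed view: metric results paired with actual and expected invocations.
      for per_invocation in eval_case_result.eval_metric_result_per_invocation:
          print("per invocation:", per_invocation)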

@ryanaiagent added the "request clarification [Status] The maintainer need clarification or more information from the author" and "answered [Status] This issue has been answered by the maintainer" labels on Feb 10, 2026
Run locally:
pyink --config pyproject.toml src/google/adk/cli/cli_eval.py src/google/adk/cli/cli_tools_click.py tests/unittests/cli/utils/test_cli_eval.py tests/unittests/cli/utils/test_cli_tools_click.py

https://github.com/google/adk-python/actions/runs/21793195105/job/63159177296
@ftnext (Contributor, Author) commented on Feb 11, 2026

@ryanaiagent Thanks for your comment.

You are absolutely right that AgentEvaluator already handles num_runs by repeating inference requests and supports both aggregated and detailed outputs via EvalCaseResult.

For this PR specifically, my focus is adk eval CLI behavior.
Today, adk eval does not call AgentEvaluator; it goes through LocalEvalService directly in cli_tools_click.py (_collect_inferences / _collect_eval_results).
So the intent of this change is to bring CLI behavior closer to the same pattern (num_runs repetition + aggregated summary + detailed per-run output) while staying within the current CLI execution path.

