Open
Labels
feature request, improvement
Description
Is this a new feature, an improvement, or a change to existing functionality?
Improvement
How would you describe the priority of this feature request
Medium
Please provide a clear description of problem this feature solves
A comprehensive evaluation of an agent usually requires multiple benchmarks. Currently, running multiple benchmarks means defining one config file per benchmark dataset.
- It would be nice to define all evaluators and datasets in one config file.
- When using Weave to log eval results, the one-dataset-per-config limit forces runs on different datasets to be logged as separate rows, even though they use the exact same workflow config. This makes viewing and filtering metrics on observability platforms like Weave non-intuitive.
Describe your ideal solution
Add the ability to specify multiple dataset file paths in the eval section of the config file, and the ability to associate each evaluator with one of those datasets.
For example, the eval section of the config file could change to something like this:
    ...
    eval:
      general:
        output_dir: ./eval_results/
        max_concurrency: 10
        dataset:
          dataset_name_1:
            _type: json
            file_path: data/file_1.json
          dataset_name_2:
            _type: json
            file_path: data/file_2.json
      evaluators:
        evaluator_1:
          _type: dabstep_easy_evaluator
          dataset: dataset_name_1
        evaluator_2:
          _type: dabstep_hard_evaluator
          dataset: dataset_name_2
        evaluator_3:
          _type: dabstep_easy_evaluator
          dataset: dataset_name_2

All evaluator results would then be reported to the observability platform as a single row.
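To make the proposed association concrete, here is a minimal sketch of how the evaluator-to-dataset mapping might be resolved. It is not the toolkit's implementation: the dict below is a hand-written stand-in for the proposed YAML, and `group_evaluators_by_dataset` and `flatten_results` are hypothetical helper names introduced only for illustration.

```python
from collections import defaultdict

# Hypothetical in-memory form of the proposed `eval` section.
# This mirrors the example config above; it is not an existing schema.
eval_config = {
    "general": {
        "output_dir": "./eval_results/",
        "max_concurrency": 10,
        "dataset": {
            "dataset_name_1": {"_type": "json", "file_path": "data/file_1.json"},
            "dataset_name_2": {"_type": "json", "file_path": "data/file_2.json"},
        },
    },
    "evaluators": {
        "evaluator_1": {"_type": "dabstep_easy_evaluator", "dataset": "dataset_name_1"},
        "evaluator_2": {"_type": "dabstep_hard_evaluator", "dataset": "dataset_name_2"},
        "evaluator_3": {"_type": "dabstep_easy_evaluator", "dataset": "dataset_name_2"},
    },
}


def group_evaluators_by_dataset(config):
    """Map each dataset name to the evaluators that reference it.

    Raises ValueError if an evaluator points at a dataset that is not defined.
    """
    datasets = config["general"]["dataset"]
    groups = defaultdict(list)
    for name, evaluator in config["evaluators"].items():
        dataset_name = evaluator.get("dataset")
        if dataset_name not in datasets:
            raise ValueError(
                f"evaluator {name!r} references unknown dataset {dataset_name!r}"
            )
        groups[dataset_name].append(name)
    return dict(groups)


def flatten_results(scores):
    """Merge per-evaluator scores into one flat row (assumed logging shape)."""
    return {f"{evaluator}.score": score for evaluator, score in scores.items()}


groups = group_evaluators_by_dataset(eval_config)
print(groups)
# {'dataset_name_1': ['evaluator_1'], 'dataset_name_2': ['evaluator_2', 'evaluator_3']}
```

Grouping by dataset would let the workflow run once per dataset and feed the outputs to every evaluator attached to it, and a flat row keyed by evaluator name is one possible shape for the single-row report requested above.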
Additional context
No response
Code of Conduct
- I agree to follow this project's Code of Conduct
- I have searched the open feature requests and have found no duplicates for this feature request