Define multiple datasets in a config file #1310

@Jack-Yu-815

Description

Is this a new feature, an improvement, or a change to existing functionality?

Improvement

How would you describe the priority of this feature request

Medium

Please provide a clear description of problem this feature solves

A comprehensive evaluation of an agent usually requires multiple benchmarks. Currently, running multiple benchmarks requires one config file per benchmark dataset.

  1. It would be nice to define all evaluators and datasets in one config file.
  2. When using Weave to log eval results, the one-dataset-per-config limit forces runs on different datasets to be logged as separate rows, even though they use the exact same workflow config. This makes viewing and filtering metrics on observability platforms like Weave unintuitive.

Describe your ideal solution

Add the ability to specify multiple dataset file paths in the eval section of the config file, and the ability to associate each evaluator with one of those datasets.

For example, the eval section of the config file could change to something like this:

...
eval:
  general:
    output_dir: ./eval_results/
    max_concurrency: 10

    dataset:
      dataset_name_1:
        _type: json
        file_path: data/file_1.json

      dataset_name_2:
        _type: json
        file_path: data/file_2.json

  evaluators:
    evaluator_1:
      _type: dabstep_easy_evaluator
      dataset: dataset_name_1

    evaluator_2:
      _type: dabstep_hard_evaluator
      dataset: dataset_name_2

    evaluator_3:
      _type: dabstep_easy_evaluator
      dataset: dataset_name_2

Then report all evaluator results to the observability platform as a single row.
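
To make the proposal concrete, here is a minimal Python sketch of how such a config could be handled once parsed. It is not the toolkit's actual implementation: the function names `group_evaluators_by_dataset` and `flatten_to_single_row`, the `<evaluator>.score` column naming, and the plain-dict representation of the config are all assumptions made for illustration. It checks that every evaluator references a defined dataset, groups evaluators so each dataset only needs to be run once, and merges per-evaluator scores into one flat record suitable for logging as a single row.

```python
from collections import defaultdict

# Hypothetical parsed form of the proposed `eval` section, written as the
# plain dicts a YAML loader would return. Not an existing toolkit schema.
eval_config = {
    "general": {
        "output_dir": "./eval_results/",
        "max_concurrency": 10,
        "dataset": {
            "dataset_name_1": {"_type": "json", "file_path": "data/file_1.json"},
            "dataset_name_2": {"_type": "json", "file_path": "data/file_2.json"},
        },
    },
    "evaluators": {
        "evaluator_1": {"_type": "dabstep_easy_evaluator", "dataset": "dataset_name_1"},
        "evaluator_2": {"_type": "dabstep_hard_evaluator", "dataset": "dataset_name_2"},
        "evaluator_3": {"_type": "dabstep_easy_evaluator", "dataset": "dataset_name_2"},
    },
}


def group_evaluators_by_dataset(cfg):
    """Map each dataset name to the evaluators that reference it.

    Raises ValueError if an evaluator names a dataset that is not defined
    under general.dataset.
    """
    datasets = cfg["general"]["dataset"]
    groups = defaultdict(list)
    for evaluator_name, spec in cfg["evaluators"].items():
        dataset_name = spec.get("dataset")
        if dataset_name not in datasets:
            raise ValueError(
                f"Evaluator {evaluator_name!r} references unknown dataset {dataset_name!r}"
            )
        groups[dataset_name].append(evaluator_name)
    return dict(groups)


def flatten_to_single_row(scores):
    """Merge {evaluator_name: score} into one flat record (one row)."""
    return {f"{name}.score": score for name, score in scores.items()}
```

With this grouping, the workflow would run once per dataset and each group's evaluators would score that run's outputs. The flattened record then carries every evaluator's score in one row, keyed by evaluator name.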

Additional context

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
  • I have searched the open feature requests and have found no duplicates for this feature request

Metadata

Labels

feature request: New feature or request
improvement: Improvement to existing functionality
