Feature Request: Grouped output metrics & exportable result files for benchmarks #1172

Open
owenparsons opened this issue Jan 22, 2025 · 3 comments

@owenparsons

Summary:

After using the Inspect AI framework while implementing the NIAH benchmark (as part of ASET), I would like to propose two feature enhancements based on my usage. Specifically, I think it would be advantageous to have (i) support for grouping performance metrics by certain parameters (e.g., categorical variables) and (ii) a more flexible output file export mechanism.

Disclaimer: I'm aware that one or both of these features may already exist in the framework and I might not have come across them yet. If so, I can update the NIAH benchmark to incorporate them and will close this request. If this functionality exists but isn't documented, I think it would be great to have it covered in the examples.

Current Behaviour:

  • Inspect AI currently computes each metric across all samples, without the ability to group the results by categorical parameters (e.g., experiment configurations).
  • In the NIAH benchmark, I worked around this limitation by using a wrapper function to pass additional meta-information. Custom metric functions were then used to generate subset scores based on these parameters (a sketch of this workaround is included after this list).
  • I think it would be beneficial to have a more streamlined approach to producing grouped performance metrics by experimental parameters directly in the framework.
  • Additionally, it would be useful to be able to generate output files that contain a summary of performance across these grouped metrics.
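For reference, here is a minimal sketch of the kind of workaround used in NIAH: a custom metric that groups per-sample scores by a metadata field and reports a mean per group. It assumes Inspect AI's `@metric` decorator and `Score` objects carrying per-sample `metadata` and an `as_float()` helper; the exact metric signature varies between versions, and the `group_key` field name is purely illustrative.

```python
from collections import defaultdict

from inspect_ai.scorer import Score, metric


@metric
def grouped_mean(group_key: str = "context_length"):
    """Report the mean score separately for each value of a metadata field."""

    def metric_fn(scores: list[Score]) -> dict[str, float]:
        # Bucket per-sample scores by the chosen metadata field.
        groups: dict[str, list[float]] = defaultdict(list)
        for score in scores:
            label = str((score.metadata or {}).get(group_key, "unknown"))
            groups[label].append(score.as_float())
        # Return one mean per group, e.g. {"1k": 0.92, "8k": 0.75}.
        return {
            label: sum(values) / len(values)
            for label, values in groups.items()
        }

    return metric_fn
```

Such a metric can be attached to a scorer via its metrics argument in the usual way, but each new grouping still requires bespoke code like this, which is the motivation for proposal 1 below.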

Proposed Changes:

  1. Grouped Output Metrics:
    • Extend the functionality of Inspect AI to support grouped output metrics, where performance scores can be categorised or grouped based on specific experimental parameters.
    • This would help in evaluating benchmarks across different configurations and make it easier to analyse and compare results.
  2. Exportable Output File:
    • It's my understanding that it's currently only possible to extract certain information from the logs. It would be helpful to provide an explicit feature to save the output metrics as a standalone, tabular file.
    • The ability to define the desired output file location and format during the evaluation run would greatly improve usability, especially when dealing with large sets of benchmark results (a rough sketch of the manual export step is included after this list).
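To illustrate the export side of the request, below is a rough sketch of the manual step that is needed today: reading a finished evaluation log and flattening its top-level metrics into a CSV file. It assumes `read_eval_log()` from `inspect_ai.log` and a `results.scores[*].metrics` structure; attribute names may differ between versions, and the file paths in the usage comment are placeholders.

```python
import csv

from inspect_ai.log import read_eval_log


def export_metrics(log_file: str, csv_path: str) -> None:
    """Flatten the top-level metrics of an eval log into a tabular CSV file."""
    log = read_eval_log(log_file)
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["scorer", "metric", "value"])
        if log.results is not None:
            for score in log.results.scores:
                for name, metric in score.metrics.items():
                    writer.writerow([score.name, name, metric.value])


# Example (placeholder paths):
# export_metrics("logs/<run-id>.json", "niah_metrics.csv")
```

A built-in equivalent, with the output location and format configurable at evaluation time, is what proposal 2 asks for.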

Version

  • Inspect AI version during development: 0.3.44
@jjallaire
Collaborator

cc @dragonstyle

@dragonstyle
Collaborator

Thanks for these suggestions! I'm planning on doing some work on improving our scoring support in the next couple of weeks - I'll put these on the list of items to work through!

@dragonstyle dragonstyle self-assigned this Jan 23, 2025
@owenparsons
Author

> Thanks for these suggestions! I'm planning on doing some work on improving our scoring support in the next couple of weeks - I'll put these on the list of items to work through!

That's great, thank you very much! If you have any questions, or if I can support in any way, just let me know.
