Feature Request: Grouped output metrics & exportable result files for benchmarks #1172

Open
owenparsons opened this issue Jan 22, 2025 · 3 comments

@owenparsons

Summary:

After using the Inspect AI framework while implementing the NIAH benchmark (as part of ASET), I would like to propose two feature enhancements based on my usage. Specifically, I think it would be advantageous to have (i) support for grouping performance metrics by certain parameters (e.g., categorical variables) and (ii) a more flexible output file export mechanism.

Disclaimer: I'm aware that one or both of these features may already exist in the framework and I might not have come across them yet. If so, I can update the NIAH benchmark to incorporate them and will close this request. If this functionality exists but isn't documented, I think it would be great to have it covered in the examples.

Current Behaviour:

  • Inspect AI currently computes each metric across all samples, without the ability to group the results by categorical parameters (e.g., experiment configurations).
  • In the NIAH benchmark, I worked around this limitation by using a wrapper function to pass additional meta-information. Custom metric functions were then used to generate subset scores based on these parameters (a sketch of this workaround is included after this list).
  • I think it would be beneficial to have a more streamlined approach to producing grouped performance metrics by experimental parameters directly in the framework.
  • Additionally, it would be useful to be able to generate output files that contain a summary of performance across these grouped metrics.
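For reference, here is a minimal sketch of the kind of workaround used in NIAH: a custom metric that groups per-sample scores by a metadata field and reports a mean per group. It assumes Inspect AI's `@metric` decorator and `Score` objects carrying per-sample `metadata` and an `as_float()` helper; the exact metric signature varies between versions, and the `group_key` field name is purely illustrative.

```python
from collections import defaultdict

from inspect_ai.scorer import Score, metric


@metric
def grouped_mean(group_key: str = "context_length"):
    """Report the mean score separately for each value of a metadata field."""

    def metric_fn(scores: list[Score]) -> dict[str, float]:
        # Bucket per-sample scores by the chosen metadata field.
        groups: dict[str, list[float]] = defaultdict(list)
        for score in scores:
            label = str((score.metadata or {}).get(group_key, "unknown"))
            groups[label].append(score.as_float())
        # Return one mean per group, e.g. {"1k": 0.92, "8k": 0.75}.
        return {
            label: sum(values) / len(values)
            for label, values in groups.items()
        }

    return metric_fn
```

Such a metric can be attached to a scorer via its metrics argument in the usual way, but each new grouping still requires bespoke code like this, which is the motivation for proposal 1 below.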

Proposed Changes:

  1. Grouped Output Metrics:
    • Extend the functionality of Inspect AI to support grouped output metrics, where performance scores can be categorised or grouped based on specific experimental parameters.
    • This would help in evaluating benchmarks across different configurations and make it easier to analyse and compare results.
  2. Exportable Output File:
    • It's my understanding that it's currently only possible to extract certain information from the logs. It would be helpful to provide an explicit feature to save the output metrics as a standalone, tabular file.
    • The ability to define the desired output file location and format during the evaluation run would greatly improve usability, especially when dealing with large sets of benchmark results (a rough sketch of the manual export step is included after this list).
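To illustrate the export side of the request, below is a rough sketch of the manual step that is needed today: reading a finished evaluation log and flattening its top-level metrics into a CSV file. It assumes `read_eval_log()` from `inspect_ai.log` and a `results.scores[*].metrics` structure; attribute names may differ between versions, and the file paths in the usage comment are placeholders.

```python
import csv

from inspect_ai.log import read_eval_log


def export_metrics(log_file: str, csv_path: str) -> None:
    """Flatten the top-level metrics of an eval log into a tabular CSV file."""
    log = read_eval_log(log_file)
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["scorer", "metric", "value"])
        if log.results is not None:
            for score in log.results.scores:
                for name, metric in score.metrics.items():
                    writer.writerow([score.name, name, metric.value])


# Example (placeholder paths):
# export_metrics("logs/<run-id>.json", "niah_metrics.csv")
```

A built-in equivalent, with the output location and format configurable at evaluation time, is what proposal 2 asks for.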

Version

  • Inspect AI version during development: 0.3.44
@jjallaire
Collaborator

cc @dragonstyle

@dragonstyle
Collaborator

Thanks for these suggestions! I'm planning on doing some work on improving our scoring support in the next couple of weeks - I'll put these on the list of items to work through!

@dragonstyle dragonstyle self-assigned this Jan 23, 2025
@owenparsons
Author

> Thanks for these suggestions! I'm planning on doing some work on improving our scoring support in the next couple of weeks - I'll put these on the list of items to work through!

That's great, thank you very much! If you have any questions, or if I can support in any way, just let me know.
