
[leaderboard] rethink approach to deciding on which models to test #83

Open
davidschlangen opened this issue Apr 19, 2024 · 1 comment

Comments

@davidschlangen
Contributor

We need to get more clarity on what we want to achieve with the leaderboard, and then think about how we can achieve it.

A) Do we want to mirror other leaderboards? Rule could be: At any time, we strive to have numbers for the 30 best-performing models on ChatArena.

B) Do we want to identify the Pareto frontier of size/performance, independently of a model's ranking elsewhere? That would be great, but it makes the search space too large; we need to limit the number of models we test.

C) Are there certain models for which we want to know numbers, regardless of performance elsewhere? I guess. Rule could be: Test "big name models" (e.g., Llama-3) once they become available, to allow us to set expectations (e.g., how good derivatives might be).

My guess would be that a combination of A) and C) would be best. This limits testing to 30 models, with some fluctuation. Could even be automated: Once a model has appeared on ChatArena that we have not tested, we run the benchmark.

(We want to automatically parse their list anyway, to check for rank correlations.)
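A minimal sketch of what such an automated check could look like (purely illustrative, and assuming the ChatArena ranking has been exported to a local CSV with `model` and `elo` columns, and that our own leaderboard scores are available as a dict; all file names and column names here are placeholders):

```python
# Hypothetical sketch: compare our leaderboard against an external ranking
# and flag untested models. File paths and column names are placeholders.
import csv
from scipy.stats import spearmanr


def load_arena_top_n(path: str, n: int = 30) -> list[str]:
    """Read an (assumed) exported ChatArena ranking, best model first."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    rows.sort(key=lambda r: float(r["elo"]), reverse=True)  # assumed column name
    return [r["model"] for r in rows[:n]]


def untested_models(arena_models: list[str], our_scores: dict[str, float]) -> list[str]:
    """Models in the Arena top-N that we have no benchmark numbers for yet."""
    return [m for m in arena_models if m not in our_scores]


def rank_correlation(arena_models: list[str], our_scores: dict[str, float]) -> float:
    """Spearman rank correlation between the two rankings, on their overlap."""
    shared = [m for m in arena_models if m in our_scores]
    arena_ranks = [arena_models.index(m) for m in shared]
    ours_sorted = sorted(shared, key=lambda m: our_scores[m], reverse=True)
    our_ranks = [ours_sorted.index(m) for m in shared]
    rho, _ = spearmanr(arena_ranks, our_ranks)
    return rho
```

Running `untested_models(...)` on a schedule would then give the trigger list ("a model has appeared on ChatArena that we have not tested"), and `rank_correlation(...)` the correlation check mentioned above.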

@davidschlangen
Contributor Author

This didn't actually address the question of what we want to achieve. My proposal for that would be something like:

To provide an up-to-date overview of the performance of the most prominent LLMs (closed and open) as conversational agents (to the extent that it is measured by our instrument), in order to help ourselves and others make decisions about what to use for related purposes.

@davidschlangen davidschlangen changed the title [benchmark] rethink approach to deciding on which models to test [leaderboard] rethink approach to deciding on which models to test Apr 23, 2024