We need to get more clarity on what we want to achieve with the leaderboard, and then think about how we can achieve it.
A) Do we want to mirror other leaderboards? A rule could be: at any time, we strive to have numbers for the 30 best-performing models on ChatArena.
B) Do we want to identify the Pareto frontier of size/performance, independently of a model's ranking elsewhere? That would be great, but it makes the search space too large; we need to limit the number of models we test.
C) Are there certain models for which we want to know numbers, regardless of their performance elsewhere? I think so. A rule could be: test "big name models" (e.g., Llama-3) as soon as they become available, to allow us to set expectations (e.g., for how good derivatives might be).
My guess would be that a combination of A) and C) would be best. This limits testing to roughly 30 models, with some fluctuation. It could even be automated: once a model that we have not tested appears on ChatArena, we run the benchmark (a sketch of such a check is included below).
(We want to automatically parse their list anyway, to check for rank correlations.)
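To make that concrete, here is a minimal sketch of the proposed automation: pull the current ChatArena ranking, list the top-30 models we have no numbers for yet, and compute the rank correlation between their ranking and our scores. The file names (`chatarena_leaderboard.csv`, `clembench_results.csv`) and column names (`model`, `arena_rank`, `clemscore`) are illustrative assumptions, not an existing interface.

```python
# Sketch only: assumes the ChatArena ranking has been exported to a CSV
# with columns `model` and `arena_rank`, and that our own results sit in
# a CSV with columns `model` and `clemscore`. Both paths are placeholders.
import pandas as pd
from scipy.stats import spearmanr

ARENA_CSV = "chatarena_leaderboard.csv"  # hypothetical export of their list
OURS_CSV = "clembench_results.csv"       # hypothetical export of our results


def untested_top_models(n: int = 30) -> list[str]:
    """Top-n ChatArena models for which we have no benchmark numbers yet."""
    arena = pd.read_csv(ARENA_CSV).sort_values("arena_rank").head(n)
    ours = pd.read_csv(OURS_CSV)
    return sorted(set(arena["model"]) - set(ours["model"]))


def rank_correlation() -> float:
    """Spearman correlation between ChatArena rank and our score, over the
    models that appear in both lists. The score is negated so that a
    positive value means the two rankings agree."""
    merged = pd.read_csv(ARENA_CSV).merge(pd.read_csv(OURS_CSV), on="model")
    rho, _ = spearmanr(merged["arena_rank"], -merged["clemscore"])
    return rho


if __name__ == "__main__":
    todo = untested_top_models()
    print(f"{len(todo)} of the top 30 still to benchmark: {todo}")
    print(f"Rank correlation on the overlap: {rank_correlation():.2f}")
```

The new-model check could run on a schedule; any model it reports would then be queued for a benchmark run, which keeps the manual decision limited to edge cases under rule C).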
This didn't actually address the question of what we want to achieve. My proposal for that would be something like:
To provide an up-to-date overview of the performance of the most prominent LLMs (closed and open) as conversational agents (to the extent that this is measured by our instrument), in order to help ourselves and others make decisions about what to use for related purposes.
davidschlangen changed the title from "[benchmark] rethink approach to deciding on which models to test" to "[leaderboard] rethink approach to deciding on which models to test" on Apr 23, 2024.