We need to get more clarity on what we want to achieve with the leaderboard, and then think about how we can achieve it.
A) Do we want to mirror other leaderboards? A rule could be: at any time, we strive to have numbers for the 30 best-performing models on ChatArena.
B) Do we want to identify the Pareto frontier of size/performance, independently of a model's ranking elsewhere? That would be great, but it makes the search space too large; we need to limit the number of models we test.
C) Are there certain models for which we want to know numbers, regardless of their performance elsewhere? I think so. A rule could be: test "big name models" (e.g., Llama-3) as soon as they become available, to allow us to set expectations (e.g., for how good derivatives might be).
My guess would be that a combination of A) and C) would be best. This limits testing to roughly 30 models, with some fluctuation. It could even be automated: once a model that we have not tested appears on ChatArena, we run the benchmark (a sketch of such a check is included below).
(We want to automatically parse their list anyway, to check for rank correlations.)
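To make that concrete, here is a minimal sketch of the proposed automation: pull the current ChatArena ranking, list the top-30 models we have no numbers for yet, and compute the rank correlation between their ranking and our scores. The file names (`chatarena_leaderboard.csv`, `clembench_results.csv`) and column names (`model`, `arena_rank`, `clemscore`) are illustrative assumptions, not an existing interface.

```python
# Sketch only: assumes the ChatArena ranking has been exported to a CSV
# with columns `model` and `arena_rank`, and that our own results sit in
# a CSV with columns `model` and `clemscore`. Both paths are placeholders.
import pandas as pd
from scipy.stats import spearmanr

ARENA_CSV = "chatarena_leaderboard.csv"  # hypothetical export of their list
OURS_CSV = "clembench_results.csv"       # hypothetical export of our results


def untested_top_models(n: int = 30) -> list[str]:
    """Top-n ChatArena models for which we have no benchmark numbers yet."""
    arena = pd.read_csv(ARENA_CSV).sort_values("arena_rank").head(n)
    ours = pd.read_csv(OURS_CSV)
    return sorted(set(arena["model"]) - set(ours["model"]))


def rank_correlation() -> float:
    """Spearman correlation between ChatArena rank and our score, over the
    models that appear in both lists. The score is negated so that a
    positive value means the two rankings agree."""
    merged = pd.read_csv(ARENA_CSV).merge(pd.read_csv(OURS_CSV), on="model")
    rho, _ = spearmanr(merged["arena_rank"], -merged["clemscore"])
    return rho


if __name__ == "__main__":
    todo = untested_top_models()
    print(f"{len(todo)} of the top 30 still to benchmark: {todo}")
    print(f"Rank correlation on the overlap: {rank_correlation():.2f}")
```

The new-model check could run on a schedule; any model it reports would then be queued for a benchmark run, which keeps the manual decision limited to edge cases under rule C).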
This didn't actually address the question of what we want to achieve. My proposal for that would be something like:
To provide an up-to-date overview of the performance of the most prominent LLMs (closed and open) as conversational agents (to the extent that this is measured by our instrument), in order to help ourselves and others make decisions about what to use for related purposes.
davidschlangen changed the title from "[benchmark] rethink approach to deciding on which models to test" to "[leaderboard] rethink approach to deciding on which models to test" on Apr 23, 2024.