Claude Code Leaderboard??? #81
Hi @hesreallyhim! It's funny that you thought of this idea, because we've actually been working on a leaderboard for Claude Code usage (as well as Gemini CLI and Codex usage) called Splitrail Leaderboard. It's a bit raw still and can't be used just yet, but here's a screenshot: [screenshot] It's open source on GitHub. It's more oriented around tokens/cost/usage, but your suggestions sound interesting!
@bl-ue @nikshepsvn so there are in fact two leaderboards for CC usage? that is... very remarkable!
I want to set up a "leaderboard" for Claude Code (see e.g. the HuggingFace leaderboards) - more in the spirit of friendly competition than anything scientific or scholarly - but it's hard to come up with a good approach/design, given the variety of resources featured here and the constant fluctuation in models, API vs. subscription plans, etc.
My first thought is to focus on domain areas, like "TDD" or "UI", where submissions would consist of repositories with specialized configurations - slash commands, CLAUDE.md, hooks, sub-agents, etc. Then, rather than trying to design specific challenges, just use Claude as an LLM-as-judge to review the submissions and decide which one is the "winner". So the prompt would be something like the sketch below.
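As a very rough sketch (the judging criteria, the model alias, and the way submissions get pasted in are all placeholder choices, not a settled design), the judge prompt and API call might look like:

```python
import anthropic

# Hypothetical LLM-as-judge step; nothing here is a fixed design.
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

JUDGE_PROMPT = """You are judging a friendly "TDD" leaderboard for Claude Code setups.
Each submission is a repository with a specialized configuration: CLAUDE.md,
slash commands, hooks, and/or sub-agents, all aimed at test-driven development.

For each submission below, assess (1) how well the configuration enforces a
red-green-refactor workflow, (2) how clear and reusable the setup is, and
(3) originality. Then declare one winner and justify the choice.

{submissions}"""

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder; any capable model works
    max_tokens=1024,
    messages=[{"role": "user", "content": JUDGE_PROMPT.format(submissions="...")}],
)
print(response.content[0].text)
```

Rotating the judge model or averaging over a few runs would presumably dampen some of the single-shot noise.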
Obviously this would be rudimentary and not very "objective", but the idea is really to stimulate people to experiment with specialized Claude Code designs and, hopefully, learn from each other rather than compete (there is no prize anyway). I notice a lot of people seem to like TDD, which is why I brought it up, but I'd love some feedback if anyone has thoughts on what a Claude Code Leaderboard could look like.
There are also a lot of different ways to evaluate submissions - e.g. set up a prepared container with a fixed prompt and a pre-determined goal/endpoint, then launch Claude Code in each submission with that common prompt and see how each framework performs (see the sketch below) - but again it's really tricky, and I'm inclined to do something less rigid at the moment because of all the independent variables.
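To make the container idea concrete, a minimal harness might look roughly like this - the image name, mount layout, and task prompt are all made up, and it assumes an image with Claude Code installed and credentials already provisioned (`claude -p` is Claude Code's non-interactive print mode):

```python
import subprocess

# Hypothetical harness: run one common task against each submission in a
# throwaway container. Image, paths, and prompt are placeholders.
SUBMISSIONS = ["./submissions/repo-a", "./submissions/repo-b"]
TASK = "Add a /health endpoint with tests, following this repo's conventions."

for repo in SUBMISSIONS:
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "-v", f"{repo}:/work", "-w", "/work",
            "cc-eval:latest",      # placeholder image with Claude Code installed
            "claude", "-p", TASK,  # print mode: run the prompt and exit
        ],
        capture_output=True,
        text=True,
        timeout=1800,  # cap each run; a real harness would catch TimeoutExpired
    )
    print(f"{repo}: exit {result.returncode}")
```

The transcripts (and the resulting diffs) could then be fed into the same LLM-as-judge prompt as above.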