Describe the issue
I recently evaluated GPT-5 on this benchmark and observed significantly worse performance than expected, with results substantially below comparable models (e.g., GPT-4 and other baselines).
ID datapoint
What is the issue
Has anyone else encountered similar issues? I’m curious whether this might be related to:
- Specific evaluation settings or prompts (see the sketch after this list)
- Domain-specific limitations of the model
- Potential implementation or deployment quirks
If you’ve run similar experiments or have insights, please share your findings or suggestions below.
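For concreteness, here is a minimal sketch of the kind of plain API call I have in mind, assuming the standard OpenAI Python SDK. The system prompt and question are placeholders rather than the benchmark's actual harness; any settings where the real harness diverges from a default baseline like this (system prompt, sampling parameters, output parsing) are exactly what I'd like to rule out:

```python
# Hypothetical minimal call, assuming the standard OpenAI Python SDK.
# The system prompt and question below are placeholders, not the
# benchmark's actual harness; sampling parameters are left at defaults.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "<benchmark question here>"},
    ],
)
print(response.choices[0].message.content)
```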
Proposed Changes
Additional context
I'm looking forward to seeing the official evaluation results on this leaderboard to better understand the model's capabilities. Thank you!