Describe the issue
I recently evaluated GPT-5 on this benchmark and observed significantly worse performance than expected, with results substantially below comparable models (e.g., GPT-4 and other baselines).
ID datapoint
What is the issue
Has anyone else encountered similar issues? I’m curious whether this might be related to:
- Specific evaluation settings or prompts (see the sketch after this list)
- Domain-specific limitations of the model
- Potential implementation or deployment quirks
If you’ve run similar experiments or have insights, please share your findings or suggestions below.
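For concreteness, here is a minimal sketch of the kind of plain API call I have in mind, assuming the standard OpenAI Python SDK. The system prompt and question are placeholders rather than the benchmark's actual harness; any settings where the real harness diverges from a default baseline like this (system prompt, sampling parameters, output parsing) are exactly what I'd like to rule out:

```python
# Hypothetical minimal call, assuming the standard OpenAI Python SDK.
# The system prompt and question below are placeholders, not the
# benchmark's actual harness; sampling parameters are left at defaults.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "<benchmark question here>"},
    ],
)
print(response.choices[0].message.content)
```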
Proposed Changes
Additional context
I'm looking forward to seeing the official evaluation results on this leaderboard to better understand the model's capabilities. Thank you!