Fix calibration setting in the code evaluation.
Add --no_execute argument for code evaluation.
Support concurrent API inference for o1 and deepseek-chat.
Fix API inference for Google Gemini.
Add --instruction_prefix and --response_prefix arguments for code generation.
Change --id_range input type.
Add --revision argument for code generation.
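
As a rough illustration of what concurrent API inference means in practice, the sketch below fans blocking requests out over a thread pool. It is not BigCodeBench's implementation: `query_model` and `generate_concurrently` are hypothetical names, and the stand-in function returns a canned string instead of calling any real provider.

```python
from concurrent.futures import ThreadPoolExecutor


def query_model(prompt):
    """Stand-in for a blocking API request; a real client call would go here."""
    return f"response to: {prompt}"


def generate_concurrently(prompts, max_workers=8):
    """Send prompts in parallel and return responses in the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves the order of its inputs.
        return list(pool.map(query_model, prompts))
```

Threads suit this case because each request spends most of its time waiting on the network rather than computing.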

Evaluated LLMs (144 models)

Qwen2.5-Coder-32B-Instruct
grok-beta
claude-3-5-haiku-20241022

Full Changelog: v0.2.0...v0.2.1.post2

06 Oct 08:28

terryyz

v0.2.0.post3

58b3f2d

Release BigCodeBench v0.2.0

Breaking Change

No more waiting! The evaluation now fully supports batch inference!
No more environment configs! The code execution is done by a remote API endpoint by default, and can be customized.
No more multiple commands! bigcodebench.evaluate will be good enough to handle most cases.

What's Changed

add multiprocessing support for sanitization step by @sk-g in #37
Remove extra period in task BigCodeBench/16 by @hvaara in #38
Await futures in progress checker by @hvaara in #48
A few args have been added to this version, including --direct_completion and --local_execute. See Advanced Usage for the details.

Dataset maintenance

The benchmark data has been bumped to v0.1.2. You can load the dataset with from datasets import load_dataset; ds = load_dataset("bigcode/bigcodebench", split="v0.1.2")
BigCodeBench/16: removed period
BigCodeBench/37: added pandas requirement
BigCodeBench/178: removed urllib requirement
BigCodeBench/241: added required plot title
BigCodeBench/267: added required plot title
BigCodeBench/760: changed the import of datetime
BigCodeBench/1006: replaced test links due to the potential connection block

New Contributors

@sk-g made their first contribution in #37
@hvaara made their first contribution in #38

Evaluated LLMs (139 models)

o1-Preview-2024-09-12 (temperature=1)
Gemini-1.5-Pro-002
Llama-3.1 models
DeepSeek-V2.5
Qwen-2.5 models
Qwen-2.5-Coder models
and more

PyPI: https://pypi.org/project/bigcodebench/0.2.0.post3/

Full Changelog: v0.1.9...v0.2.0.post3

Contributors

hvaara and sk-g

10 Aug 10:21

terryyz

v0.1.9

4d05ba9

Release BigCodeBench v0.1.9

Full Changelog: v0.1.8...v0.1.9

17 Jul 20:11

terryyz

v0.1.8

32f5382

Release BigCodeBench v0.1.8

Features:

Support BigCodeBench-Hard subset: #17
Identify and fix tokenizer setup: #21
Customize the tokenizer: #20
Add the pass rate result log: #20

Contributors:

@marianna13: #20

Models:

A total of 96 models at the time of the release

Acknowledgement:

Full Changelog: v0.1.7...v0.1.8

Contributors

takkyu2, imamnurby, and 2 other contributors

05 Jul 14:16

terryyz

0.1.7.post2

a02256f

Release v0.1.7.post2

Enhanced the calculation of ground truth pass rate, and addressed the issue mentioned in #12 (comment).
Updated the README docs.

27 Jun 23:52

terryyz

v0.1.7

afbf8de

Release BigCodeBench v0.1.7

Fix some identified issues:

The ground truth pass rate was not previously computed in the correct way.
RAM limits passed as arguments raised errors because they were set as float type.
User permission was not correctly set up in the Evaluate Docker.
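
To illustrate the RAM-limit issue: integer-only interfaces such as Python's `resource.setrlimit` expect whole numbers of bytes, so a limit computed as a float needs converting first. The sketch below assumes the limit is given in GiB and that the error came from such an interface. The release note does not say which call was involved, and `ram_limit_bytes` is a hypothetical helper.

```python
def ram_limit_bytes(limit_gib):
    """Convert a RAM limit in GiB (int or float) to a whole number of bytes."""
    if limit_gib <= 0:
        raise ValueError("RAM limit must be positive")
    # int() truncates any fractional byte left over from the float product.
    return int(limit_gib * 1024 ** 3)
```

For example, `ram_limit_bytes(1.5)` yields an `int` that can be handed to an integer-only API without a type error.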

Features:
--check-gt-only will print out the pass rate when finishing.

26 Jun 21:39

terryyz

v0.1.6

f6fc695

Release BigCodeBench v0.1.6

New features:

The RAM setup is now adjustable via specific arguments.
Parallel ground truth checking is supported. Potentially failed checks are skipped during execution. A warning will be issued if the ground truth pass rate falls below 0.95.
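
The behaviour described above can be pictured with a minimal sketch. It is an illustration only, not BigCodeBench's implementation: `check_task`, `ground_truth_pass_rate`, and the task dictionaries are hypothetical, and a thread pool stands in for whatever parallelism the tool actually uses. Only the 0.95 threshold comes from the release note.

```python
import warnings
from concurrent.futures import ThreadPoolExecutor, as_completed

PASS_RATE_THRESHOLD = 0.95  # threshold quoted in the release note


def check_task(task):
    """Hypothetical check: run a task's reference solution against its expected value."""
    return task["solution"]() == task["expected"]


def ground_truth_pass_rate(tasks, max_workers=4):
    """Check all tasks in parallel and return the fraction that passed."""
    passed = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(check_task, t) for t in tasks]
        for future in as_completed(futures):
            try:
                passed += bool(future.result())
            except Exception:
                # A check that crashes is counted as failed and skipped,
                # so one bad task does not abort the whole run.
                continue
    rate = passed / len(tasks) if tasks else 0.0
    if rate < PASS_RATE_THRESHOLD:
        warnings.warn(
            f"Ground truth pass rate {rate:.2f} is below {PASS_RATE_THRESHOLD}"
        )
    return rate
```

Counting a crashed check as a failure keeps the run going while still letting the pass rate, and hence the warning, reflect the problem.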

18 Jun 13:31

terryyz

v0.1.5

0f0ea6e

Release BigCodeBench v0.1.5

New features:

The data is downloaded from HF hub by default.
The data formats on HF and on GitHub have been unified.

Releases: bigcode-project/bigcodebench
