
πŸ… Open Agent Leaderboard

πŸ€— HF Leaderboard

πŸŽ‰ Updates

  • 2025/1/23: Added gpt-4o, Qwen2.5-72B-Instruct, Qwen2.5-7B-Instruct, Qwen2-1.5B-Instruct, Qwen2-0.5B-Instruct, Llama-3.3-70B-Instruct, Llama-3.1-8B-Instruct, and Internllm2_5-7B to the leaderboard.
  • 2025/1/07: The Open Agent Leaderboard is released.

πŸ“– Introduction

This project aims to provide a fair comparison of various agents by evaluating their performance on different datasets and LLMs. Built on top of the OmAgent framework, it allows for simple, quick, and accurate assessments of agents.

Supported benchmark datasets:

  • gsm8k
  • AQuA

Supported algorithms:

  • IO (Input-Output)
  • CoT (Chain-of-Thought)
  • SC-CoT (Self-Consistency CoT)
  • PoT (Program-of-Thought)
  • ReAct-Pro*

Supported LLMs:

  • gpt-3.5-turbo
  • gpt-4o
  • Doubao-lite-32k
  • Qwen2.5-72B-Instruct
  • Qwen2.5-7B-Instruct
  • Qwen2-1.5B-Instruct
  • Qwen2-0.5B-Instruct
  • Llama-3.3-70B-Instruct
  • Llama-3.1-8B-Instruct
  • Internllm2_5-7B

πŸ… Leaderboards

Math tasks

| Rank | Algorithm | LLM | Eval Date | Avg Score | gsm8k Score | gsm8k Cost($) | AQuA Score | AQuA Cost($) |
|------|-----------|-----|-----------|-----------|-------------|---------------|------------|--------------|
| 1 | CoT | Qwen2.5-72B-Instruct | 2025/1/22 | 89.55 | 92.87 | 0.7195 | 86.22 | 0.0808 |
| 2 | SC-CoT | Qwen2.5-72B-Instruct | 2025/1/22 | 89.45 | 93.86 | 5.9858 | 85.04 | 1.0348 |
| 3 | CoT | Llama-3.3-70B-Instruct | 2025/1/22 | 88.7 | 93.93 | 0.687 | 83.46 | 0.0927 |
| 4 | SC-CoT | Llama-3.3-70B-Instruct | 2025/1/22 | 88.68 | 95.07 | 6.2005 | 82.28 | 1.0756 |
| 5 | SC-CoT | gpt-4o | 2025/1/22 | 88.46 | 90.3 | 31.0542 | 86.61 | 8.1485 |
| 6 | CoT | gpt-4o | 2025/1/22 | 88.39 | 94.09 | 4.5367 | 82.68 | 1.0417 |
| 7 | IO | Llama-3.3-70B-Instruct | 2025/1/22 | 87.48 | 92.27 | 0.4709 | 82.68 | 0.0798 |
| 8 | CoT | Doubao-lite-32k | 2025/1/7 | 86 | 89.31 | 0.0558 | 82.68 | 0.0066 |
| 9 | SC-CoT | Qwen2.5-7B-Instruct | 2025/1/22 | 85.53 | 91.13 | 0 | 79.92 | 0 |
| 10 | IO | Qwen2.5-72B-Instruct | 2025/1/22 | 85.42 | 86.58 | 0.4899 | 84.25 | 0.0742 |
| 11 | SC-CoT | Doubao-lite-32k | 2025/1/7 | 84.18 | 87.26 | 0.2083 | 81.1 | 0.0519 |
| 12 | PoT | gpt-4o | 2025/1/22 | 84.15 | 93.1 | 4.2166 | 75.2 | 1.6087 |
| 13 | PoT | Qwen2.5-72B-Instruct | 2025/1/22 | 83.77 | 92.34 | 0.7054 | 75.2 | 0.1645 |
| 14 | ReAct-Pro* | Llama-3.3-70B-Instruct | 2025/1/22 | 83.39 | 87.64 | 10.1124 | 79.13 | 0.768 |
| 15 | CoT | Qwen2.5-7B-Instruct | 2025/1/22 | 83.19 | 85.67 | 0 | 80.71 | 0 |
| 16 | IO | gpt-4o | 2025/1/22 | 82 | 88.4 | 3.3463 | 75.59 | 1.1453 |
| 17 | ReAct-Pro* | Doubao-lite-32k | 2025/1/7 | 81.58 | 85.6 | 0.2512 | 77.56 | 0.0445 |
| 18 | ReAct-Pro* | Qwen2.5-72B-Instruct | 2025/1/22 | 80.25 | 87.26 | 10.5479 | 73.23 | 0.3177 |
| 19 | ReAct-Pro* | Qwen2.5-7B-Instruct | 2025/1/22 | 78.64 | 82.87 | 0 | 74.41 | 0 |
| 20 | PoT | Llama-3.3-70B-Instruct | 2025/1/22 | 76.31 | 73.09 | 0.9736 | 79.53 | 0.1746 |
| 21 | PoT | Doubao-lite-32k | 2025/1/7 | 75.63 | 79.61 | 0.0576 | 71.65 | 0.0147 |
| 22 | IO | Doubao-lite-32k | 2025/1/7 | 75.58 | 72.02 | 0.0354 | 79.13 | 0.0058 |
| 23 | SC-CoT | gpt-3.5-turbo | 2025/1/7 | 73.03 | 79.91 | 3.3938 | 66.14 | 0.7888 |
| 24 | CoT | gpt-3.5-turbo | 2025/1/7 | 69.86 | 78.7 | 0.6788 | 61.02 | 0.0957 |
| 25 | ReAct-Pro* | gpt-3.5-turbo | 2025/1/7 | 69.74 | 74.91 | 3.4633 | 64.57 | 0.4928 |
| 26 | PoT | gpt-3.5-turbo | 2025/1/7 | 68.17 | 76.88 | 0.6902 | 59.45 | 0.1748 |
| 27 | CoT | Llama-3.1-8B-Instruct | 2025/1/22 | 68.04 | 75.44 | 0 | 60.63 | 0 |
| 28 | IO | Qwen2.5-7B-Instruct | 2025/1/22 | 67.99 | 57.24 | 0 | 78.74 | 0 |
| 29 | SC-CoT | Llama-3.1-8B-Instruct | 2025/1/22 | 66.46 | 73.46 | 0 | 59.45 | 0 |
| 30 | CoT | Internllm2_5-7B | 2025/1/22 | 65.24 | 77.71 | 0 | 52.76 | 0 |
| 31 | PoT | Qwen2.5-7B-Instruct | 2025/1/22 | 63.47 | 58.83 | 0 | 68.11 | 0 |
| 32 | ReAct-Pro* | Llama-3.1-8B-Instruct | 2025/1/22 | 61.65 | 67.78 | 0 | 55.51 | 0 |
| 33 | ReAct-Pro* | gpt-4o | 2025/1/22 | 60.4 | 63.31 | 39.0751 | 57.48 | 2.304 |
| 34 | IO | Llama-3.1-8B-Instruct | 2025/1/22 | 54.17 | 57.16 | 0 | 51.18 | 0 |
| 35 | CoT | Qwen2-1.5B-Instruct | 2025/1/22 | 48.03 | 55.5 | 0 | 40.55 | 0 |
| 36 | SC-CoT | Internllm2_5-7B | 2025/1/22 | 43.8 | 48.22 | 0 | 39.37 | 0 |
| 37 | IO | gpt-3.5-turbo | 2025/1/7 | 38.41 | 37.83 | 0.3328 | 38.98 | 0.038 |
| 38 | PoT | Llama-3.1-8B-Instruct | 2025/1/22 | 37.64 | 38.67 | 0 | 36.61 | 0 |
| 39 | PoT | Internllm2_5-7B | 2025/1/22 | 37.41 | 38.21 | 0 | 36.61 | 0 |
| 40 | ReAct-Pro* | Internllm2_5-7B | 2025/1/22 | 37.23 | 33.51 | 0 | 40.94 | 0 |
| 41 | CoT | Qwen2-0.5B-Instruct | 2025/1/22 | 34.51 | 35.94 | 0 | 33.07 | 0 |
| 42 | IO | Internllm2_5-7B | 2025/1/22 | 29.62 | 11.6 | 0 | 47.64 | 0 |
| 43 | ReAct-Pro* | Qwen2-1.5B-Instruct | 2025/1/22 | 25.23 | 24.87 | 0 | 25.59 | 0 |
| 44 | PoT | Qwen2-1.5B-Instruct | 2025/1/22 | 24.61 | 18.5 | 0 | 30.71 | 0 |
| 45 | IO | Qwen2-1.5B-Instruct | 2025/1/22 | 22.91 | 16.68 | 0 | 29.13 | 0 |
| 46 | IO | Qwen2-0.5B-Instruct | 2025/1/22 | 20.94 | 14.71 | 0 | 27.17 | 0 |
| 47 | SC-CoT | Qwen2-1.5B-Instruct | 2025/1/22 | 17.69 | 11.75 | 0 | 23.62 | 0 |
| 48 | ReAct-Pro* | Qwen2-0.5B-Instruct | 2025/1/22 | 15.84 | 7.66 | 0 | 24.02 | 0 |
| 49 | PoT | Qwen2-0.5B-Instruct | 2025/1/22 | 13.47 | 9.62 | 0 | 17.32 | 0 |
| 50 | SC-CoT | Qwen2-0.5B-Instruct | 2025/1/22 | 12.25 | 1.67 | 0 | 22.83 | 0 |

Evaluation details can be found in the Evaluation details section below and on the Hugging Face leaderboard. Avg Score is the mean of the gsm8k and AQuA scores.

  • IO (Input-Output) is the baseline method that directly prompts the model with the question and expects an answer without any intermediate reasoning steps. It represents the most basic way of using language models and serves as a reference point for evaluating the effectiveness of other algorithms.

  • ReAct-Pro*: We modified ReAct to ReAct-Pro, following the Reflexion repository. A comparison with the original ReAct repository can be found in the Comparison of ReAct and ReAct-Pro section.

Leaderboard Visualization

πŸ› οΈ How to Install

  1. Clone the repository:

    git clone https://github.com/om-ai-lab/open-agent-leaderboard.git
    cd open-agent-leaderboard
  2. Install dependencies:

    pip install -r requirements.txt

πŸ—οΈ How to Evaluate Agents

Step 1. Implement your agent in the OmAgent repository

Navigate to the agent repository:

git clone https://github.com/om-ai-lab/OmAgent.git
cd OmAgent

Set up the environment:

pip install -e omagent-core

Implement your agent in the OmAgent repository; see the examples/cot folder for a reference implementation.

Step 2. Inference in OmAgent Repository

Run the inference script (cot as an example):

cd examples/cot
python eval_demo.py --model_id your_model_id --dataset_name your_dataset_name --dataset_path your_dataset_path --output_path your_output_path --output_name your_output_name --cot_method your_cot_method

Output Format

The output results are saved in JSON format and include the following fields:

  • id: The unique identifier of the sample.
  • question: The input question provided to the model.
  • last_output: The raw output generated by the model.
  • output_postprocess (optional): The model's output after post-processing (answer extraction and cleanup).
  • ground_truth (optional): The correct answer for the sample.
  • prompt_tokens: The number of tokens in the input prompt.
  • completion_tokens: The number of tokens in the model's output.

Example of an output JSON file:

{
    "dataset": "gsm8k",
    "model_id": "gpt-3.5-turbo",
    "alg": "COT",
    "model_result": [
        {
            "id": 1,
            "question": "Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today.....",
            "last_output": "Janet's ducks lay 16 eggs per day. She eats 3 for breakfast and uses 4 to bake muffins,...",
            "output_postprocess": "Paris",
            "ground_truth": "Paris",
            "prompt_tokens": 10,
            "completion_tokens": 5
        }
    ]
}
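
To sanity-check a results file before running the full evaluation, the following is a minimal sketch assuming the JSON layout above. The file name is illustrative, and the exact-match comparison is an assumption; main.py applies its own answer extraction and scoring.

import json

# Load an inference output file in the format shown above
# (the file name is illustrative).
with open("example/gsm8k_results_cot.json") as f:
    result = json.load(f)

records = result["model_result"]

# Naive exact-match scoring; the official evaluation in main.py
# may normalize answers differently.
correct = sum(1 for r in records
              if r.get("output_postprocess") == r.get("ground_truth"))

total_prompt = sum(r["prompt_tokens"] for r in records)
total_completion = sum(r["completion_tokens"] for r in records)

print(f"score: {100 * correct / len(records):.2f}")
print(f"input tokens: {total_prompt:,}, output tokens: {total_completion:,}")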

Step 3. Evaluate inference results

Run the main script to perform evaluations:

python main.py --dataset <dataset_name> --model <model_name> --method <method_name> --output_dir <output_directory>

Parameters

  • --random_seed: Random seed, default is 1.
  • --dataset: Dataset to use, options are aqua, gsm8k.
  • --minibatch_size: Minibatch size, default is 1.
  • --max_num_worker: Maximum number of workers for the data loader, default is 4.
  • --model: Model used for decoding, options are gpt-4o-mini, gpt-4o, gpt-3.5-turbo.
  • --method: Method, options are zero_shot, zero_shot_cot, few_shot, few_shot_cot.
  • --cot_trigger_no: Trigger sentence number for chain of thought, default is 1.
  • --max_length: Maximum length of model output, default is 2048.
  • --max_length_direct: Maximum length of direct model answer, default is 32.
  • --limit_dataset_size: Whether to limit the test dataset size, default is 0 (no limit).
  • --output_dir: Output directory, default is ./outputs/.
  • --output_path: Output path, default is empty.
  • --agent: Agent used for the experiment, options are cot, pot, sc_cot, react.
  • --system_prompt: System prompt, default is empty.
  • --openai_api_key: OpenAI API key, default is empty.
  • --openai_url: OpenAI API URL, default is https://api.openai.com/v1.

Example

python main.py --output_path example/gsm8k_results_cot.json --dataset gsm8k --method few_shot_cot

Evaluation details

| Algorithm | Dataset | Eval Date | LLM | Score | Pass rate | X-shot | Parameters | Samples | Total input tokens | Average input tokens | Total output tokens | Average output tokens | All tokens | Cost($) |
|-----------|---------|-----------|-----|-------|-----------|--------|------------|---------|--------------------|----------------------|---------------------|-----------------------|------------|---------|
| IO | gsm8k | 2025/1/7 | gpt-3.5-turbo | 37.83 | 99.92 | 8 | | 1319 | 546,990 | 415 | 39,563 | 30 | 586,553 | 0.3328 |
| IO | gsm8k | 2025/1/7 | Doubao-lite-32k | 72.02 | 99.92 | 8 | | 1319 | 617,377 | 468 | 123,106 | 93 | 740,483 | 0.0354 |
| IO | gsm8k | 2025/1/22 | gpt-4o | 88.4 | 100 | 8 | | 1319 | 542,416 | 411 | 199,030 | 151 | 741,446 | 3.3463 |
| IO | gsm8k | 2025/1/22 | Qwen2.5-72B-Instruct | 86.58 | 100 | 8 | | 1319 | 555,340 | 421 | 313,720 | 238 | 869,060 | 0.4899 |
| IO | gsm8k | 2025/1/22 | Llama-3.3-70B-Instruct | 92.27 | 100 | 8 | | 1319 | 583,916 | 443 | 251,359 | 191 | 835,275 | 0.4709 |
| IO | gsm8k | 2025/1/22 | Qwen2.5-7B-Instruct | 57.24 | 100 | 8 | | 1319 | 596,229 | 452 | 291,684 | 221 | 887,913 | 0.0000 |
| IO | gsm8k | 2025/1/22 | Llama-3.1-8B-Instruct | 57.16 | 99.55 | 8 | | 1319 | 550,941 | 418 | 1,194,488 | 906 | 1,745,429 | 0.0000 |
| IO | gsm8k | 2025/1/22 | Internllm2_5-7B | 11.6 | 97.95 | 8 | | 1319 | 679,302 | 515 | 434,426 | 329 | 1,113,728 | 0.0000 |
| IO | gsm8k | 2025/1/22 | Qwen2-1.5B-Instruct | 16.68 | 100 | 8 | | 1319 | 568,530 | 431 | 168,466 | 128 | 736,996 | 0.0000 |
| IO | gsm8k | 2025/1/22 | Qwen2-0.5B-Instruct | 14.71 | 100 | 8 | | 1319 | 568,116 | 431 | 266,781 | 202 | 834,897 | 0.0000 |
| ReAct-Pro* | gsm8k | 2025/1/7 | gpt-3.5-turbo | 74.91 | 99.39 | 8 | max_steps=10 | 1319 | 6,506,164 | 4,933 | 140,122 | 106 | 6,646,286 | 3.4633 |
| ReAct-Pro* | gsm8k | 2025/1/7 | Doubao-lite-32k | 85.6 | 99.62 | 8 | max_steps=10 | 1319 | 5,862,016 | 4,444 | 136,623 | 104 | 5,998,639 | 0.2512 |
| ReAct-Pro* | gsm8k | 2025/1/22 | gpt-4o | 63.31 | 99.55 | 8 | max_steps=10 | 1319 | 14,411,173 | 10,926 | 304,714 | 231 | 14,715,887 | 39.0751 |
| ReAct-Pro* | gsm8k | 2025/1/22 | Qwen2.5-72B-Instruct | 87.26 | 100 | 8 | max_steps=10 | 1319 | 18,160,983 | 13,769 | 549,454 | 417 | 18,710,437 | 10.5479 |
| ReAct-Pro* | gsm8k | 2025/1/22 | Llama-3.3-70B-Instruct | 87.64 | 99.92 | 8 | max_steps=10 | 1319 | 17,038,928 | 12,918 | 898,936 | 682 | 17,937,864 | 10.1124 |
| ReAct-Pro* | gsm8k | 2025/1/22 | Qwen2.5-7B-Instruct | 82.87 | 100 | 8 | max_steps=10 | 1319 | 14,355,752 | 10,884 | 495,162 | 375 | 14,850,914 | 0.0000 |
| ReAct-Pro* | gsm8k | 2025/1/22 | Llama-3.1-8B-Instruct | 67.78 | 98.56 | 8 | max_steps=10 | 1319 | 21,044,978 | 15,955 | 1,790,789 | 1,358 | 22,835,767 | 0.0000 |
| ReAct-Pro* | gsm8k | 2025/1/22 | Internllm2_5-7B | 33.51 | 97.95 | 8 | max_steps=10 | 1319 | 30,120,070 | 22,836 | 5,549,919 | 4,208 | 35,669,989 | 0.0000 |
| ReAct-Pro* | gsm8k | 2025/1/22 | Qwen2-1.5B-Instruct | 24.87 | 80.21 | 8 | max_steps=10 | 1319 | 9,133,603 | 6,925 | 694,398 | 526 | 9,828,001 | 0.0000 |
| ReAct-Pro* | gsm8k | 2025/1/22 | Qwen2-0.5B-Instruct | 7.66 | 95.22 | 8 | max_steps=10 | 1319 | 52,431,343 | 39,751 | 2,961,268 | 2,245 | 55,392,611 | 0.0000 |
| PoT | gsm8k | 2025/1/7 | gpt-3.5-turbo | 76.88 | 99.24 | 8 | | 1319 | 1,090,418 | 827 | 96,662 | 73 | 1,187,080 | 0.6902 |
| PoT | gsm8k | 2025/1/7 | Doubao-lite-32k | 79.61 | 92.57 | 8 | | 1319 | 1,170,038 | 887 | 118,017 | 89 | 1,288,055 | 0.0576 |
| PoT | gsm8k | 2025/1/22 | gpt-4o | 93.1 | 99.77 | 8 | | 1319 | 1,101,672 | 835 | 146,240 | 111 | 1,247,912 | 4.2166 |
| PoT | gsm8k | 2025/1/22 | Qwen2.5-72B-Instruct | 92.34 | 99.39 | 8 | | 1319 | 1,106,682 | 839 | 144,528 | 110 | 1,251,210 | 0.7054 |
| PoT | gsm8k | 2025/1/22 | Llama-3.3-70B-Instruct | 73.09 | 79.61 | 8 | | 1319 | 1,126,025 | 854 | 601,019 | 456 | 1,727,044 | 0.9736 |
| PoT | gsm8k | 2025/1/22 | Qwen2.5-7B-Instruct | 58.83 | 70.51 | 8 | | 1319 | 1,145,390 | 868 | 217,432 | 165 | 1,362,822 | 0.0000 |
| PoT | gsm8k | 2025/1/22 | Llama-3.1-8B-Instruct | 38.67 | 55.42 | 8 | | 1319 | 1,147,538 | 870 | 243,573 | 185 | 1,391,111 | 0.0000 |
| PoT | gsm8k | 2025/1/22 | Internllm2_5-7B | 38.21 | 48.9 | 8 | | 1319 | 1,136,843 | 862 | 188,106 | 143 | 1,324,949 | 0.0000 |
| PoT | gsm8k | 2025/1/22 | Qwen2-1.5B-Instruct | 18.5 | 31.01 | 8 | | 1319 | 1,151,528 | 873 | 175,994 | 133 | 1,327,522 | 0.0000 |
| PoT | gsm8k | 2025/1/22 | Qwen2-0.5B-Instruct | 9.62 | 16.91 | 8 | | 1319 | 1,151,528 | 873 | 237,607 | 180 | 1,389,135 | 0.0000 |
| CoT | gsm8k | 2025/1/7 | gpt-3.5-turbo | 78.7 | 100 | 8 | | 1319 | 953,242 | 723 | 134,799 | 102 | 1,088,041 | 0.6788 |
| CoT | gsm8k | 2025/1/7 | Doubao-lite-32k | 89.31 | 100 | 8 | | 1319 | 1,042,095 | 790 | 159,725 | 121 | 1,201,820 | 0.0558 |
| CoT | gsm8k | 2025/1/22 | gpt-4o | 94.09 | 100 | 8 | | 1319 | 948,668 | 719 | 216,498 | 164 | 1,165,166 | 4.5367 |
| CoT | gsm8k | 2025/1/22 | Qwen2.5-72B-Instruct | 92.87 | 100 | 8 | | 1319 | 1,005,119 | 762 | 271,133 | 206 | 1,276,252 | 0.7195 |
| CoT | gsm8k | 2025/1/22 | Llama-3.3-70B-Instruct | 93.93 | 100 | 8 | | 1319 | 990,168 | 751 | 228,497 | 173 | 1,218,665 | 0.6870 |
| CoT | gsm8k | 2025/1/22 | Qwen2.5-7B-Instruct | 85.67 | 100 | 8 | | 1319 | 1,046,008 | 793 | 244,797 | 186 | 1,290,805 | 0.0000 |
| CoT | gsm8k | 2025/1/22 | Llama-3.1-8B-Instruct | 75.44 | 99.92 | 8 | | 1319 | 990,168 | 751 | 258,161 | 196 | 1,248,329 | 0.0000 |
| CoT | gsm8k | 2025/1/22 | Internllm2_5-7B | 77.71 | 99.7 | 8 | | 1319 | 968,163 | 734 | 234,000 | 177 | 1,202,163 | 0.0000 |
| CoT | gsm8k | 2025/1/22 | Qwen2-1.5B-Instruct | 55.5 | 100 | 8 | | 1319 | 1,032,818 | 783 | 185,707 | 141 | 1,218,525 | 0.0000 |
| CoT | gsm8k | 2025/1/22 | Qwen2-0.5B-Instruct | 35.94 | 99.92 | 8 | | 1319 | 1,032,818 | 783 | 190,641 | 145 | 1,223,459 | 0.0000 |
| SC-CoT | gsm8k | 2025/1/7 | gpt-3.5-turbo | 79.91 | 99.92 | 8 | temperature=1, path_num=5 | 1319 | 2,740,652 | 2,078 | 1,348,960 | 1,023 | 4,089,612 | 3.3938 |
| SC-CoT | gsm8k | 2025/1/7 | Doubao-lite-32k | 87.26 | 99.92 | 8 | temperature=1, path_num=5 | 1319 | 2,691,714 | 2,041 | 1,197,099 | 908 | 3,888,813 | 0.2083 |
| SC-CoT | gsm8k | 2025/1/22 | gpt-4o | 90.3 | 99.92 | 8 | temperature=1, path_num=5 | 1319 | 3,590,336 | 2,722 | 2,207,837 | 1,674 | 5,798,173 | 31.0542 |
| SC-CoT | gsm8k | 2025/1/22 | Qwen2.5-72B-Instruct | 93.86 | 100 | 8 | temperature=1, path_num=5 | 1319 | 8,136,223 | 6,168 | 2,481,785 | 1,882 | 10,618,008 | 5.9858 |
| SC-CoT | gsm8k | 2025/1/22 | Llama-3.3-70B-Instruct | 95.07 | 100 | 8 | temperature=1, path_num=5 | 1319 | 8,413,717 | 6,379 | 2,585,077 | 1,960 | 10,998,794 | 6.2005 |
| SC-CoT | gsm8k | 2025/1/22 | Qwen2.5-7B-Instruct | 91.13 | 100 | 8 | temperature=1, path_num=5 | 1319 | 8,586,888 | 6,510 | 2,554,097 | 1,936 | 11,140,985 | 0.0000 |
| SC-CoT | gsm8k | 2025/1/22 | Llama-3.1-8B-Instruct | 73.46 | 99.55 | 8 | temperature=1, path_num=5 | 1319 | 8,630,514 | 6,543 | 3,148,202 | 2,387 | 11,778,716 | 0.0000 |
| SC-CoT | gsm8k | 2025/1/22 | Internllm2_5-7B | 48.22 | 98.41 | 8 | temperature=1, path_num=5 | 1319 | 10,678,792 | 8,096 | 3,847,639 | 2,917 | 14,526,431 | 0.0000 |
| SC-CoT | gsm8k | 2025/1/22 | Qwen2-1.5B-Instruct | 11.75 | 91.89 | 8 | temperature=1, path_num=5 | 1319 | 9,066,115 | 6,873 | 3,345,827 | 2,537 | 12,411,942 | 0.0000 |
| SC-CoT | gsm8k | 2025/1/22 | Qwen2-0.5B-Instruct | 1.67 | 94.69 | 8 | temperature=1, path_num=5 | 1319 | 11,019,864 | 8,355 | 5,445,856 | 4,129 | 16,465,720 | 0.0000 |
| IO | AQuA | 2025/1/7 | gpt-3.5-turbo | 38.98 | 100 | 0 | | 254 | 25,701 | 101 | 16,770 | 66 | 42,471 | 0.0380 |
| IO | AQuA | 2025/1/7 | Doubao-lite-32k | 79.13 | 100 | 0 | | 254 | 33,058 | 130 | 54,684 | 215 | 87,742 | 0.0058 |
| IO | AQuA | 2025/1/22 | gpt-4o | 75.59 | 97.24 | 0 | | 254 | 25,631 | 101 | 108,121 | 426 | 133,752 | 1.1453 |
| IO | AQuA | 2025/1/22 | Qwen2.5-72B-Instruct | 84.25 | 99.61 | 0 | | 254 | 25,397 | 100 | 106,207 | 418 | 131,604 | 0.0742 |
| IO | AQuA | 2025/1/22 | Llama-3.3-70B-Instruct | 82.68 | 99.21 | 0 | | 254 | 32,809 | 129 | 108,758 | 428 | 141,567 | 0.0798 |
| IO | AQuA | 2025/1/22 | Qwen2.5-7B-Instruct | 78.74 | 98.43 | 0 | | 254 | 33,271 | 131 | 104,500 | 411 | 137,771 | 0.0000 |
| IO | AQuA | 2025/1/22 | Llama-3.1-8B-Instruct | 51.18 | 98.82 | 0 | | 254 | 26,459 | 104 | 106,647 | 420 | 133,106 | 0.0000 |
| IO | AQuA | 2025/1/22 | Internllm2_5-7B | 47.64 | 90.94 | 0 | | 254 | 50,232 | 198 | 134,809 | 531 | 185,041 | 0.0000 |
| IO | AQuA | 2025/1/22 | Qwen2-1.5B-Instruct | 29.13 | 97.64 | 0 | | 254 | 27,937 | 110 | 43,110 | 170 | 71,047 | 0.0000 |
| IO | AQuA | 2025/1/22 | Qwen2-0.5B-Instruct | 27.17 | 98.82 | 0 | | 254 | 27,937 | 110 | 82,478 | 325 | 110,415 | 0.0000 |
| CoT | AQuA | 2025/1/7 | gpt-3.5-turbo | 61.02 | 93.7 | 0 | | 254 | 25,447 | 100 | 55,346 | 218 | 80,793 | 0.0957 |
| CoT | AQuA | 2025/1/7 | Doubao-lite-32k | 82.68 | 97.24 | 0 | | 254 | 27,978 | 110 | 66,599 | 262 | 94,577 | 0.0066 |
| CoT | AQuA | 2025/1/22 | gpt-4o | 82.68 | 98.03 | 0 | | 254 | 25,123 | 99 | 97,894 | 385 | 123,017 | 1.0417 |
| CoT | AQuA | 2025/1/22 | Qwen2.5-72B-Instruct | 86.22 | 99.21 | 0 | | 254 | 25,143 | 99 | 118,146 | 465 | 143,289 | 0.0808 |
| CoT | AQuA | 2025/1/22 | Llama-3.3-70B-Instruct | 83.46 | 98.43 | 0 | | 254 | 32,555 | 128 | 131,834 | 519 | 164,389 | 0.0927 |
| CoT | AQuA | 2025/1/22 | Qwen2.5-7B-Instruct | 80.71 | 99.61 | 0 | | 254 | 33,017 | 130 | 116,719 | 460 | 149,736 | 0.0000 |
| CoT | AQuA | 2025/1/22 | Llama-3.1-8B-Instruct | 60.63 | 100 | 0 | | 254 | 32,555 | 128 | 111,880 | 440 | 144,435 | 0.0000 |
| CoT | AQuA | 2025/1/22 | Internllm2_5-7B | 52.76 | 89.37 | 0 | | 254 | 26,610 | 105 | 100,910 | 397 | 127,520 | 0.0000 |
| CoT | AQuA | 2025/1/22 | Qwen2-1.5B-Instruct | 40.55 | 98.82 | 0 | | 254 | 30,477 | 120 | 79,563 | 313 | 110,040 | 0.0000 |
| CoT | AQuA | 2025/1/22 | Qwen2-0.5B-Instruct | 33.07 | 98.82 | 0 | | 254 | 30,477 | 120 | 86,862 | 342 | 117,339 | 0.0000 |
| PoT | AQuA | 2025/1/7 | gpt-3.5-turbo | 59.45 | 100 | 0 | | 254 | 225,162 | 886 | 41,492 | 163 | 266,654 | 0.1748 |
| PoT | AQuA | 2025/1/7 | Doubao-lite-32k | 71.65 | 96.85 | 0 | | 254 | 259,863 | 1,023 | 49,573 | 195 | 309,436 | 0.0147 |
| PoT | AQuA | 2025/1/22 | gpt-4o | 75.2 | 100 | 0 | | 254 | 222,717 | 877 | 105,191 | 414 | 327,908 | 1.6087 |
| PoT | AQuA | 2025/1/22 | Qwen2.5-72B-Instruct | 75.2 | 100 | 0 | | 254 | 249,215 | 981 | 42,549 | 168 | 291,764 | 0.1645 |
| PoT | AQuA | 2025/1/22 | Llama-3.3-70B-Instruct | 79.53 | 99.21 | 0 | | 254 | 240,735 | 948 | 69,064 | 272 | 309,799 | 0.1746 |
| PoT | AQuA | 2025/1/22 | Qwen2.5-7B-Instruct | 68.11 | 100 | 0 | | 254 | 264,517 | 1,041 | 49,211 | 194 | 313,728 | 0.0000 |
| PoT | AQuA | 2025/1/22 | Llama-3.1-8B-Instruct | 36.61 | 96.85 | 0 | | 254 | 240,613 | 947 | 50,301 | 198 | 290,914 | 0.0000 |
| PoT | AQuA | 2025/1/22 | Internllm2_5-7B | 36.61 | 98.82 | 0 | | 254 | 233,505 | 919 | 68,457 | 270 | 301,962 | 0.0000 |
| PoT | AQuA | 2025/1/22 | Qwen2-1.5B-Instruct | 30.71 | 96.46 | 0 | | 254 | 246,560 | 971 | 51,915 | 204 | 298,475 | 0.0000 |
| PoT | AQuA | 2025/1/22 | Qwen2-0.5B-Instruct | 17.32 | 92.13 | 0 | | 254 | 258,867 | 1,019 | 63,414 | 250 | 322,281 | 0.0000 |
| SC-CoT | AQuA | 2025/1/7 | gpt-3.5-turbo | 66.14 | 99.21 | 0 | temperature=1, path_num=5 | 254 | 482,192 | 1,898 | 365,143 | 1,438 | 847,335 | 0.7888 |
| SC-CoT | AQuA | 2025/1/7 | Doubao-lite-32k | 81.1 | 97.24 | 0 | temperature=1, path_num=5 | 254 | 503,751 | 1,983 | 382,235 | 1,505 | 885,986 | 0.0519 |
| SC-CoT | AQuA | 2025/1/22 | gpt-4o | 86.61 | 98.82 | 0 | temperature=1, path_num=5 | 254 | 744,478 | 2,931 | 628,728 | 2,475 | 1,373,206 | 8.1485 |
| SC-CoT | AQuA | 2025/1/22 | Qwen2.5-72B-Instruct | 85.04 | 99.21 | 0 | temperature=1, path_num=5 | 254 | 1,051,218 | 4,139 | 784,451 | 3,088 | 1,835,669 | 1.0348 |
| SC-CoT | AQuA | 2025/1/22 | Llama-3.3-70B-Instruct | 82.28 | 99.21 | 0 | temperature=1, path_num=5 | 254 | 1,135,251 | 4,469 | 772,673 | 3,042 | 1,907,924 | 1.0756 |
| SC-CoT | AQuA | 2025/1/22 | Qwen2.5-7B-Instruct | 79.92 | 100 | 0 | temperature=1, path_num=5 | 254 | 1,098,280 | 4,324 | 747,052 | 2,941 | 1,845,332 | 0.0000 |
| SC-CoT | AQuA | 2025/1/22 | Llama-3.1-8B-Instruct | 59.45 | 97.24 | 0 | temperature=1, path_num=5 | 254 | 971,003 | 3,823 | 680,330 | 2,678 | 1,651,333 | 0.0000 |
| SC-CoT | AQuA | 2025/1/22 | Internllm2_5-7B | 39.37 | 98.03 | 0 | temperature=1, path_num=5 | 254 | 1,420,494 | 5,592 | 875,728 | 3,448 | 2,296,222 | 0.0000 |
| SC-CoT | AQuA | 2025/1/22 | Qwen2-1.5B-Instruct | 23.62 | 96.46 | 0 | temperature=1, path_num=5 | 254 | 1,034,362 | 4,072 | 740,973 | 2,917 | 1,775,335 | 0.0000 |
| SC-CoT | AQuA | 2025/1/22 | Qwen2-0.5B-Instruct | 22.83 | 97.24 | 0 | temperature=1, path_num=5 | 254 | 1,246,929 | 4,909 | 968,162 | 3,812 | 2,215,091 | 0.0000 |
| ReAct-Pro* | AQuA | 2025/1/7 | gpt-3.5-turbo | 64.57 | 98.03 | 0 | max_steps=10 | 254 | 862,614 | 3,396 | 40,973 | 161 | 903,587 | 0.4928 |
| ReAct-Pro* | AQuA | 2025/1/7 | Doubao-lite-32k | 77.56 | 96.06 | 0 | max_steps=10 | 254 | 977,890 | 3,850 | 54,951 | 216 | 1,032,841 | 0.0445 |
| ReAct-Pro* | AQuA | 2025/1/22 | gpt-4o | 57.48 | 97.24 | 0 | max_steps=10 | 254 | 615,589 | 2,424 | 76,507 | 301 | 692,096 | 2.3040 |
| ReAct-Pro* | AQuA | 2025/1/22 | Qwen2.5-72B-Instruct | 73.23 | 100 | 0 | max_steps=10 | 254 | 441,765 | 1,739 | 121,838 | 480 | 563,603 | 0.3177 |
| ReAct-Pro* | AQuA | 2025/1/22 | Llama-3.3-70B-Instruct | 79.13 | 99.61 | 0 | max_steps=10 | 254 | 1,119,143 | 4,406 | 243,236 | 958 | 1,362,379 | 0.7680 |
| ReAct-Pro* | AQuA | 2025/1/22 | Qwen2.5-7B-Instruct | 74.41 | 99.21 | 0 | max_steps=10 | 254 | 564,165 | 2,221 | 131,679 | 518 | 695,844 | 0.0000 |
| ReAct-Pro* | AQuA | 2025/1/22 | Llama-3.1-8B-Instruct | 55.51 | 96.85 | 0 | max_steps=10 | 254 | 3,764,723 | 14,822 | 576,098 | 2,268 | 4,340,821 | 0.0000 |
| ReAct-Pro* | AQuA | 2025/1/22 | Internllm2_5-7B | 40.94 | 96.85 | 0 | max_steps=10 | 254 | 3,592,039 | 14,142 | 836,762 | 3,294 | 4,428,801 | 0.0000 |
| ReAct-Pro* | AQuA | 2025/1/22 | Qwen2-1.5B-Instruct | 25.59 | 96.06 | 0 | max_steps=10 | 254 | 4,555,858 | 17,936 | 516,146 | 2,032 | 5,072,004 | 0.0000 |
| ReAct-Pro* | AQuA | 2025/1/22 | Qwen2-0.5B-Instruct | 24.02 | 96.85 | 0 | max_steps=10 | 254 | 6,344,167 | 24,977 | 825,920 | 3,252 | 7,170,087 | 0.0000 |

Default settings:

  • temperature = 0

LLM prices:

  • gpt-3.5-turbo:
    • $0.50 / 1M tokens (input)
    • $1.50 / 1M tokens (output)
  • Doubao-lite-32k (1 USD = 7.3249 CNY):
    • $0.04096 / 1M tokens (input)
    • $0.08200 / 1M tokens (output)
  • gpt-4o-2024-08-06:
    • $2.50 / 1M tokens (input)
    • $10.00 / 1M tokens (output)
  • Qwen2.5-72B-Instruct and Llama-3.3-70B-Instruct:
    • approximately $0.564 / 1M tokens for both input and output, back-calculated from the Cost($) columns above
  • Other open-source LLMs:
    • Deployed locally; see the OmAgent repository for more information.
    • Their cost is not counted in the leaderboard (reported as 0.0000).
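
The Cost($) columns follow directly from these per-token prices. A minimal sketch for gpt-3.5-turbo (prices taken from the list above; token counts from the IO/gsm8k row):

# Reproduce a Cost($) cell from token counts and the prices above.
GPT35_INPUT_PRICE = 0.5 / 1_000_000   # $ per input token
GPT35_OUTPUT_PRICE = 1.5 / 1_000_000  # $ per output token

def run_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * GPT35_INPUT_PRICE + output_tokens * GPT35_OUTPUT_PRICE

# IO on gsm8k with gpt-3.5-turbo: 546,990 input / 39,563 output tokens
print(round(run_cost(546_990, 39_563), 4))  # 0.3328, matching the table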

Pass Rate*: the percentage of predictions that are valid, where a prediction is valid if it is neither empty nor null.
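
In code, that definition amounts to the following sketch (the record layout follows the output format above; using output_postprocess as the prediction field is an assumption):

# Pass rate: share of predictions that are neither empty nor null.
def pass_rate(records: list[dict]) -> float:
    valid = sum(1 for r in records
                if r.get("output_postprocess") not in (None, ""))
    return 100 * valid / len(records)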

Compare to original agent repositories

| Algorithm | Dataset | Eval Time | LLM | Framework | Score |
|-----------|---------|-----------|-----|-----------|-------|
| CoT | gsm8k | 2025/1/7 | gpt-3.5-turbo | Original repo | 79.23 |
| CoT | gsm8k | 2025/1/7 | gpt-3.5-turbo | OmAgent | 78.70 |
| CoT | AQuA | 2025/1/7 | gpt-3.5-turbo | Original repo | 60.63 |
| CoT | AQuA | 2025/1/7 | gpt-3.5-turbo | OmAgent | 61.02 |
| PoT | gsm8k | 2025/1/7 | gpt-4o-mini | Original repo | 86.35 |
| PoT | gsm8k | 2025/1/7 | gpt-4o-mini | OmAgent | 88.25 |
| ReAct | AQuA | 2025/1/7 | gpt-3.5-turbo | Original repo | 35.04 |
| ReAct | AQuA | 2025/1/7 | gpt-3.5-turbo | OmAgent | 34.25 |
| ReAct | HotpotQA | 2025/1/8 | gpt-3.5-turbo | Original repo | 28.00 |
| ReAct | HotpotQA | 2025/1/8 | gpt-3.5-turbo | OmAgent | 27.40 |

Note:

  • The original repo is the official repository of the agent implementation.
  • OmAgent is the implementation of the agent in this project.
  • There is no official implementation of SC-CoT.

Comparison of ReAct and ReAct-Pro

| Algorithm | Dataset | Eval Time | LLM | Score | Pass Rate |
|-----------|---------|-----------|-----|-------|-----------|
| ReAct | gsm8k | 2025/1/7 | gpt-3.5-turbo | 38.13 | 100.00 |
| ReAct-Pro | gsm8k | 2025/1/7 | gpt-3.5-turbo | 74.91 | 99.39 |
| ReAct | AQuA | 2025/1/7 | gpt-3.5-turbo | 34.25 | 97.64 |
| ReAct-Pro | AQuA | 2025/1/7 | gpt-3.5-turbo | 64.57 | 98.03 |

πŸ”— Related works

Open Agent Leaderboard is built on top of the OmAgent repository.

Acknowledgments

We extend our deepest gratitude to the authors and contributors of the following datasets: gsm8k, AQuA; agent algorithms: CoT, SC-CoT, PoT, ReAct; and LLMs: gpt-3.5-turbo, Doubao-lite-32k.

⭐️ Citation

If you find this repository helpful, please cite it:

@misc{open-agent-leaderboard,
    title={Open Agent Leaderboard},
    author={Om AI Lab},
    year={2025},
    publisher={GitHub},
    howpublished={\url{https://github.com/om-ai-lab/open-agent-leaderboard}}
}

πŸ”” Follow us

You can follow us on X and Discord for more updates and discussions.

🀝 Contributing

Feel free to submit issues and pull requests.

πŸ“ License

This project is licensed under the MIT License.