
πŸ… Open Agent Leaderboard

πŸ€— HF Leaderboard

πŸŽ‰ Updates

  • 2025/1/23: Added gpt-4o, Qwen2.5-72B-Instruct, Qwen2.5-7B-Instruct, Qwen2-1.5B-Instruct, Qwen2-0.5B-Instruct, Llama-3.3-70B-Instruct, Llama-3.1-8B-Instruct, and Internllm2_5-7B to the leaderboard.
  • 2025/1/07: The Open Agent Leaderboard is released.

πŸ“– Introduction

This project aims to provide a fair comparison of various agents by evaluating their performance on different datasets and LLMs. Built on top of the OmAgent framework, it allows for simple, quick, and accurate assessments of agents.

Supported benchmark datasets:

  • gsm8k
  • AQuA

Supported algorithms:

  • IO (Input-Output)
  • CoT (Chain-of-Thought)
  • SC-CoT (Self-Consistency CoT)
  • PoT (Program-of-Thought)
  • ReAct-Pro*

Supported LLMs:

  • gpt-3.5-turbo
  • gpt-4o
  • Doubao-lite-32k
  • Qwen2.5-72B-Instruct
  • Qwen2.5-7B-Instruct
  • Qwen2-1.5B-Instruct
  • Qwen2-0.5B-Instruct
  • Llama-3.3-70B-Instruct
  • Llama-3.1-8B-Instruct
  • Internllm2_5-7B

πŸ… Leaderboards

Math tasks

| Rank | Algorithm | LLM | Eval Date | Avg Score | gsm8k Score | gsm8k Cost($) | AQuA Score | AQuA Cost($) |
|------|-----------|-----|-----------|-----------|-------------|---------------|------------|--------------|
| 1 | CoT | Qwen2.5-72B-Instruct | 2025/1/22 | 89.55 | 92.87 | 0.7195 | 86.22 | 0.0808 |
| 2 | SC-CoT | Qwen2.5-72B-Instruct | 2025/1/22 | 89.45 | 93.86 | 5.9858 | 85.04 | 1.0348 |
| 3 | CoT | Llama-3.3-70B-Instruct | 2025/1/22 | 88.7 | 93.93 | 0.687 | 83.46 | 0.0927 |
| 4 | SC-CoT | Llama-3.3-70B-Instruct | 2025/1/22 | 88.68 | 95.07 | 6.2005 | 82.28 | 1.0756 |
| 5 | SC-CoT | gpt-4o | 2025/1/22 | 88.46 | 90.3 | 31.0542 | 86.61 | 8.1485 |
| 6 | CoT | gpt-4o | 2025/1/22 | 88.39 | 94.09 | 4.5367 | 82.68 | 1.0417 |
| 7 | IO | Llama-3.3-70B-Instruct | 2025/1/22 | 87.48 | 92.27 | 0.4709 | 82.68 | 0.0798 |
| 8 | CoT | Doubao-lite-32k | 2025/1/7 | 86 | 89.31 | 0.0558 | 82.68 | 0.0066 |
| 9 | SC-CoT | Qwen2.5-7B-Instruct | 2025/1/22 | 85.53 | 91.13 | 0 | 79.92 | 0 |
| 10 | IO | Qwen2.5-72B-Instruct | 2025/1/22 | 85.42 | 86.58 | 0.4899 | 84.25 | 0.0742 |
| 11 | SC-CoT | Doubao-lite-32k | 2025/1/7 | 84.18 | 87.26 | 0.2083 | 81.1 | 0.0519 |
| 12 | PoT | gpt-4o | 2025/1/22 | 84.15 | 93.1 | 4.2166 | 75.2 | 1.6087 |
| 13 | PoT | Qwen2.5-72B-Instruct | 2025/1/22 | 83.77 | 92.34 | 0.7054 | 75.2 | 0.1645 |
| 14 | ReAct-Pro* | Llama-3.3-70B-Instruct | 2025/1/22 | 83.39 | 87.64 | 10.1124 | 79.13 | 0.768 |
| 15 | CoT | Qwen2.5-7B-Instruct | 2025/1/22 | 83.19 | 85.67 | 0 | 80.71 | 0 |
| 16 | IO | gpt-4o | 2025/1/22 | 82 | 88.4 | 3.3463 | 75.59 | 1.1453 |
| 17 | ReAct-Pro* | Doubao-lite-32k | 2025/1/7 | 81.58 | 85.6 | 0.2512 | 77.56 | 0.0445 |
| 18 | ReAct-Pro* | Qwen2.5-72B-Instruct | 2025/1/22 | 80.25 | 87.26 | 10.5479 | 73.23 | 0.3177 |
| 19 | ReAct-Pro* | Qwen2.5-7B-Instruct | 2025/1/22 | 78.64 | 82.87 | 0 | 74.41 | 0 |
| 20 | PoT | Llama-3.3-70B-Instruct | 2025/1/22 | 76.31 | 73.09 | 0.9736 | 79.53 | 0.1746 |
| 21 | PoT | Doubao-lite-32k | 2025/1/7 | 75.63 | 79.61 | 0.0576 | 71.65 | 0.0147 |
| 22 | IO | Doubao-lite-32k | 2025/1/7 | 75.58 | 72.02 | 0.0354 | 79.13 | 0.0058 |
| 23 | SC-CoT | gpt-3.5-turbo | 2025/1/7 | 73.03 | 79.91 | 3.3938 | 66.14 | 0.7888 |
| 24 | CoT | gpt-3.5-turbo | 2025/1/7 | 69.86 | 78.7 | 0.6788 | 61.02 | 0.0957 |
| 25 | ReAct-Pro* | gpt-3.5-turbo | 2025/1/7 | 69.74 | 74.91 | 3.4633 | 64.57 | 0.4928 |
| 26 | PoT | gpt-3.5-turbo | 2025/1/7 | 68.17 | 76.88 | 0.6902 | 59.45 | 0.1748 |
| 27 | CoT | Llama-3.1-8B-Instruct | 2025/1/22 | 68.04 | 75.44 | 0 | 60.63 | 0 |
| 28 | IO | Qwen2.5-7B-Instruct | 2025/1/22 | 67.99 | 57.24 | 0 | 78.74 | 0 |
| 29 | SC-CoT | Llama-3.1-8B-Instruct | 2025/1/22 | 66.46 | 73.46 | 0 | 59.45 | 0 |
| 30 | CoT | Internllm2_5-7B | 2025/1/22 | 65.24 | 77.71 | 0 | 52.76 | 0 |
| 31 | PoT | Qwen2.5-7B-Instruct | 2025/1/22 | 63.47 | 58.83 | 0 | 68.11 | 0 |
| 32 | ReAct-Pro* | Llama-3.1-8B-Instruct | 2025/1/22 | 61.65 | 67.78 | 0 | 55.51 | 0 |
| 33 | ReAct-Pro* | gpt-4o | 2025/1/22 | 60.4 | 63.31 | 39.0751 | 57.48 | 2.304 |
| 34 | IO | Llama-3.1-8B-Instruct | 2025/1/22 | 54.17 | 57.16 | 0 | 51.18 | 0 |
| 35 | CoT | Qwen2-1.5B-Instruct | 2025/1/22 | 48.03 | 55.5 | 0 | 40.55 | 0 |
| 36 | SC-CoT | Internllm2_5-7B | 2025/1/22 | 43.8 | 48.22 | 0 | 39.37 | 0 |
| 37 | IO | gpt-3.5-turbo | 2025/1/7 | 38.41 | 37.83 | 0.3328 | 38.98 | 0.038 |
| 38 | PoT | Llama-3.1-8B-Instruct | 2025/1/22 | 37.64 | 38.67 | 0 | 36.61 | 0 |
| 39 | PoT | Internllm2_5-7B | 2025/1/22 | 37.41 | 38.21 | 0 | 36.61 | 0 |
| 40 | ReAct-Pro* | Internllm2_5-7B | 2025/1/22 | 37.23 | 33.51 | 0 | 40.94 | 0 |
| 41 | CoT | Qwen2-0.5B-Instruct | 2025/1/22 | 34.51 | 35.94 | 0 | 33.07 | 0 |
| 42 | IO | Internllm2_5-7B | 2025/1/22 | 29.62 | 11.6 | 0 | 47.64 | 0 |
| 43 | ReAct-Pro* | Qwen2-1.5B-Instruct | 2025/1/22 | 25.23 | 24.87 | 0 | 25.59 | 0 |
| 44 | PoT | Qwen2-1.5B-Instruct | 2025/1/22 | 24.61 | 18.5 | 0 | 30.71 | 0 |
| 45 | IO | Qwen2-1.5B-Instruct | 2025/1/22 | 22.91 | 16.68 | 0 | 29.13 | 0 |
| 46 | IO | Qwen2-0.5B-Instruct | 2025/1/22 | 20.94 | 14.71 | 0 | 27.17 | 0 |
| 47 | SC-CoT | Qwen2-1.5B-Instruct | 2025/1/22 | 17.69 | 11.75 | 0 | 23.62 | 0 |
| 48 | ReAct-Pro* | Qwen2-0.5B-Instruct | 2025/1/22 | 15.84 | 7.66 | 0 | 24.02 | 0 |
| 49 | PoT | Qwen2-0.5B-Instruct | 2025/1/22 | 13.47 | 9.62 | 0 | 17.32 | 0 |
| 50 | SC-CoT | Qwen2-0.5B-Instruct | 2025/1/22 | 12.25 | 1.67 | 0 | 22.83 | 0 |

Evaluation details can be found in the Evaluation details section below and on the Hugging Face leaderboard. Avg Score is the mean of the gsm8k and AQuA scores.

  • IO (Input-Output) is the baseline method that directly prompts the model with the question and expects an answer without any intermediate reasoning steps. It represents the most basic way of using language models and serves as a reference point for evaluating the effectiveness of other algorithms.

  • ReAct-Pro*: We modified ReAct to ReAct-Pro, following the Reflexion repository. A comparison with the original ReAct repository can be found in the Comparison of ReAct and ReAct-Pro section.

Leaderboard Visualization

πŸ› οΈ How to Install

  1. Clone the repository:

    git clone https://github.com/om-ai-lab/open-agent-leaderboard.git
    cd open-agent-leaderboard
  2. Install dependencies:

    pip install -r requirements.txt

πŸ—οΈ How to Evaluate Agents

Step 1. Implement your agent in the OmAgent repository

Navigate to the agent repository:

git clone https://github.com/om-ai-lab/OmAgent.git
cd OmAgent

Set up the environment:

pip install -e omagent-core

Implement your agent in the OmAgent repository; see the examples/cot folder for a reference implementation.

Step 2. Inference in OmAgent Repository

Run the inference script (cot as an example):

cd examples/cot
python eval_demo.py --model_id your_model_id --dataset_name your_dataset_name --dataset_path your_dataset_path --output_path your_output_path --output_name your_output_name --cot_method your_cot_method

Output Format

The output results are saved in JSON format and include the following fields:

  • id: The unique identifier of the sample.
  • question: The input question provided to the model.
  • last_output: The raw output generated by the model.
  • output_postprocess (optional): The model's output after post-processing (answer extraction and cleanup).
  • ground_truth (optional): The correct answer for the sample.
  • prompt_tokens: The number of tokens in the input prompt.
  • completion_tokens: The number of tokens in the model's output.

Example of an output JSON file:

{
    "dataset": "gsm8k",
    "model_id": "gpt-3.5-turbo",
    "alg": "COT",
    "model_result": [
        {
            "id": 1,
            "question": "Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today.....",
            "last_output": "Janet's ducks lay 16 eggs per day. She eats 3 for breakfast and uses 4 to bake muffins,...",
            "output_postprocess": "Paris",
            "ground_truth": "Paris",
            "prompt_tokens": 10,
            "completion_tokens": 5
        }
    ]
}
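
To sanity-check a results file before running the full evaluation, the following is a minimal sketch assuming the JSON layout above. The file name is illustrative, and the exact-match comparison is an assumption; main.py applies its own answer extraction and scoring.

import json

# Load an inference output file in the format shown above
# (the file name is illustrative).
with open("example/gsm8k_results_cot.json") as f:
    result = json.load(f)

records = result["model_result"]

# Naive exact-match scoring; the official evaluation in main.py
# may normalize answers differently.
correct = sum(1 for r in records
              if r.get("output_postprocess") == r.get("ground_truth"))

total_prompt = sum(r["prompt_tokens"] for r in records)
total_completion = sum(r["completion_tokens"] for r in records)

print(f"score: {100 * correct / len(records):.2f}")
print(f"input tokens: {total_prompt:,}, output tokens: {total_completion:,}")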

Step 3. Evaluate inference results

Run the main script to perform evaluations:

python main.py --dataset <dataset_name> --model <model_name> --method <method_name> --output_dir <output_directory>

Parameters

  • --random_seed: Random seed, default is 1.
  • --dataset: Dataset to use, options are aqua, gsm8k.
  • --minibatch_size: Minibatch size, default is 1.
  • --max_num_worker: Maximum number of workers for the data loader, default is 4.
  • --model: Model used for decoding, options are gpt-4o-mini, gpt-4o, gpt-3.5-turbo.
  • --method: Method, options are zero_shot, zero_shot_cot, few_shot, few_shot_cot.
  • --cot_trigger_no: Trigger sentence number for chain of thought, default is 1.
  • --max_length: Maximum length of model output, default is 2048.
  • --max_length_direct: Maximum length of direct model answer, default is 32.
  • --limit_dataset_size: Whether to limit the test dataset size, default is 0 (no limit).
  • --output_dir: Output directory, default is ./outputs/.
  • --output_path: Output path, default is empty.
  • --agent: Agent used for the experiment, options are cot, pot, sc_cot, react.
  • --system_prompt: System prompt, default is empty.
  • --openai_api_key: OpenAI API key, default is empty.
  • --openai_url: OpenAI API URL, default is https://api.openai.com/v1.

Example

python main.py --output_path example/gsm8k_results_cot.json --dataset gsm8k --method few_shot_cot

Evaluation details

| Algorithm | Dataset | Eval Date | LLM | Score | Pass rate | X-shot | Parameters | Samples | Total input tokens | Average input tokens | Total output tokens | Average output tokens | All tokens | Cost($) |
|-----------|---------|-----------|-----|-------|-----------|--------|------------|---------|--------------------|----------------------|---------------------|-----------------------|------------|---------|
| IO | gsm8k | 2025/1/7 | gpt-3.5-turbo | 37.83 | 99.92 | 8 | | 1319 | 546,990 | 415 | 39,563 | 30 | 586,553 | 0.3328 |
| IO | gsm8k | 2025/1/7 | Doubao-lite-32k | 72.02 | 99.92 | 8 | | 1319 | 617,377 | 468 | 123,106 | 93 | 740,483 | 0.0354 |
| IO | gsm8k | 2025/1/22 | gpt-4o | 88.4 | 100 | 8 | | 1319 | 542,416 | 411 | 199,030 | 151 | 741,446 | 3.3463 |
| IO | gsm8k | 2025/1/22 | Qwen2.5-72B-Instruct | 86.58 | 100 | 8 | | 1319 | 555,340 | 421 | 313,720 | 238 | 869,060 | 0.4899 |
| IO | gsm8k | 2025/1/22 | Llama-3.3-70B-Instruct | 92.27 | 100 | 8 | | 1319 | 583,916 | 443 | 251,359 | 191 | 835,275 | 0.4709 |
| IO | gsm8k | 2025/1/22 | Qwen2.5-7B-Instruct | 57.24 | 100 | 8 | | 1319 | 596,229 | 452 | 291,684 | 221 | 887,913 | 0.0000 |
| IO | gsm8k | 2025/1/22 | Llama-3.1-8B-Instruct | 57.16 | 99.55 | 8 | | 1319 | 550,941 | 418 | 1,194,488 | 906 | 1,745,429 | 0.0000 |
| IO | gsm8k | 2025/1/22 | Internllm2_5-7B | 11.6 | 97.95 | 8 | | 1319 | 679,302 | 515 | 434,426 | 329 | 1,113,728 | 0.0000 |
| IO | gsm8k | 2025/1/22 | Qwen2-1.5B-Instruct | 16.68 | 100 | 8 | | 1319 | 568,530 | 431 | 168,466 | 128 | 736,996 | 0.0000 |
| IO | gsm8k | 2025/1/22 | Qwen2-0.5B-Instruct | 14.71 | 100 | 8 | | 1319 | 568,116 | 431 | 266,781 | 202 | 834,897 | 0.0000 |
| ReAct-Pro* | gsm8k | 2025/1/7 | gpt-3.5-turbo | 74.91 | 99.39 | 8 | max_steps=10 | 1319 | 6,506,164 | 4,933 | 140,122 | 106 | 6,646,286 | 3.4633 |
| ReAct-Pro* | gsm8k | 2025/1/7 | Doubao-lite-32k | 85.6 | 99.62 | 8 | max_steps=10 | 1319 | 5,862,016 | 4,444 | 136,623 | 104 | 5,998,639 | 0.2512 |
| ReAct-Pro* | gsm8k | 2025/1/22 | gpt-4o | 63.31 | 99.55 | 8 | max_steps=10 | 1319 | 14,411,173 | 10,926 | 304,714 | 231 | 14,715,887 | 39.0751 |
| ReAct-Pro* | gsm8k | 2025/1/22 | Qwen2.5-72B-Instruct | 87.26 | 100 | 8 | max_steps=10 | 1319 | 18,160,983 | 13,769 | 549,454 | 417 | 18,710,437 | 10.5479 |
| ReAct-Pro* | gsm8k | 2025/1/22 | Llama-3.3-70B-Instruct | 87.64 | 99.92 | 8 | max_steps=10 | 1319 | 17,038,928 | 12,918 | 898,936 | 682 | 17,937,864 | 10.1124 |
| ReAct-Pro* | gsm8k | 2025/1/22 | Qwen2.5-7B-Instruct | 82.87 | 100 | 8 | max_steps=10 | 1319 | 14,355,752 | 10,884 | 495,162 | 375 | 14,850,914 | 0.0000 |
| ReAct-Pro* | gsm8k | 2025/1/22 | Llama-3.1-8B-Instruct | 67.78 | 98.56 | 8 | max_steps=10 | 1319 | 21,044,978 | 15,955 | 1,790,789 | 1,358 | 22,835,767 | 0.0000 |
| ReAct-Pro* | gsm8k | 2025/1/22 | Internllm2_5-7B | 33.51 | 97.95 | 8 | max_steps=10 | 1319 | 30,120,070 | 22,836 | 5,549,919 | 4,208 | 35,669,989 | 0.0000 |
| ReAct-Pro* | gsm8k | 2025/1/22 | Qwen2-1.5B-Instruct | 24.87 | 80.21 | 8 | max_steps=10 | 1319 | 9,133,603 | 6,925 | 694,398 | 526 | 9,828,001 | 0.0000 |
| ReAct-Pro* | gsm8k | 2025/1/22 | Qwen2-0.5B-Instruct | 7.66 | 95.22 | 8 | max_steps=10 | 1319 | 52,431,343 | 39,751 | 2,961,268 | 2,245 | 55,392,611 | 0.0000 |
| PoT | gsm8k | 2025/1/7 | gpt-3.5-turbo | 76.88 | 99.24 | 8 | | 1319 | 1,090,418 | 827 | 96,662 | 73 | 1,187,080 | 0.6902 |
| PoT | gsm8k | 2025/1/7 | Doubao-lite-32k | 79.61 | 92.57 | 8 | | 1319 | 1,170,038 | 887 | 118,017 | 89 | 1,288,055 | 0.0576 |
| PoT | gsm8k | 2025/1/22 | gpt-4o | 93.1 | 99.77 | 8 | | 1319 | 1,101,672 | 835 | 146,240 | 111 | 1,247,912 | 4.2166 |
| PoT | gsm8k | 2025/1/22 | Qwen2.5-72B-Instruct | 92.34 | 99.39 | 8 | | 1319 | 1,106,682 | 839 | 144,528 | 110 | 1,251,210 | 0.7054 |
| PoT | gsm8k | 2025/1/22 | Llama-3.3-70B-Instruct | 73.09 | 79.61 | 8 | | 1319 | 1,126,025 | 854 | 601,019 | 456 | 1,727,044 | 0.9736 |
| PoT | gsm8k | 2025/1/22 | Qwen2.5-7B-Instruct | 58.83 | 70.51 | 8 | | 1319 | 1,145,390 | 868 | 217,432 | 165 | 1,362,822 | 0.0000 |
| PoT | gsm8k | 2025/1/22 | Llama-3.1-8B-Instruct | 38.67 | 55.42 | 8 | | 1319 | 1,147,538 | 870 | 243,573 | 185 | 1,391,111 | 0.0000 |
| PoT | gsm8k | 2025/1/22 | Internllm2_5-7B | 38.21 | 48.9 | 8 | | 1319 | 1,136,843 | 862 | 188,106 | 143 | 1,324,949 | 0.0000 |
| PoT | gsm8k | 2025/1/22 | Qwen2-1.5B-Instruct | 18.5 | 31.01 | 8 | | 1319 | 1,151,528 | 873 | 175,994 | 133 | 1,327,522 | 0.0000 |
| PoT | gsm8k | 2025/1/22 | Qwen2-0.5B-Instruct | 9.62 | 16.91 | 8 | | 1319 | 1,151,528 | 873 | 237,607 | 180 | 1,389,135 | 0.0000 |
| CoT | gsm8k | 2025/1/7 | gpt-3.5-turbo | 78.7 | 100 | 8 | | 1319 | 953,242 | 723 | 134,799 | 102 | 1,088,041 | 0.6788 |
| CoT | gsm8k | 2025/1/7 | Doubao-lite-32k | 89.31 | 100 | 8 | | 1319 | 1,042,095 | 790 | 159,725 | 121 | 1,201,820 | 0.0558 |
| CoT | gsm8k | 2025/1/22 | gpt-4o | 94.09 | 100 | 8 | | 1319 | 948,668 | 719 | 216,498 | 164 | 1,165,166 | 4.5367 |
| CoT | gsm8k | 2025/1/22 | Qwen2.5-72B-Instruct | 92.87 | 100 | 8 | | 1319 | 1,005,119 | 762 | 271,133 | 206 | 1,276,252 | 0.7195 |
| CoT | gsm8k | 2025/1/22 | Llama-3.3-70B-Instruct | 93.93 | 100 | 8 | | 1319 | 990,168 | 751 | 228,497 | 173 | 1,218,665 | 0.6870 |
| CoT | gsm8k | 2025/1/22 | Qwen2.5-7B-Instruct | 85.67 | 100 | 8 | | 1319 | 1,046,008 | 793 | 244,797 | 186 | 1,290,805 | 0.0000 |
| CoT | gsm8k | 2025/1/22 | Llama-3.1-8B-Instruct | 75.44 | 99.92 | 8 | | 1319 | 990,168 | 751 | 258,161 | 196 | 1,248,329 | 0.0000 |
| CoT | gsm8k | 2025/1/22 | Internllm2_5-7B | 77.71 | 99.7 | 8 | | 1319 | 968,163 | 734 | 234,000 | 177 | 1,202,163 | 0.0000 |
| CoT | gsm8k | 2025/1/22 | Qwen2-1.5B-Instruct | 55.5 | 100 | 8 | | 1319 | 1,032,818 | 783 | 185,707 | 141 | 1,218,525 | 0.0000 |
| CoT | gsm8k | 2025/1/22 | Qwen2-0.5B-Instruct | 35.94 | 99.92 | 8 | | 1319 | 1,032,818 | 783 | 190,641 | 145 | 1,223,459 | 0.0000 |
| SC-CoT | gsm8k | 2025/1/7 | gpt-3.5-turbo | 79.91 | 99.92 | 8 | temperature=1, path_num=5 | 1319 | 2,740,652 | 2,078 | 1,348,960 | 1,023 | 4,089,612 | 3.3938 |
| SC-CoT | gsm8k | 2025/1/7 | Doubao-lite-32k | 87.26 | 99.92 | 8 | temperature=1, path_num=5 | 1319 | 2,691,714 | 2,041 | 1,197,099 | 908 | 3,888,813 | 0.2083 |
| SC-CoT | gsm8k | 2025/1/22 | gpt-4o | 90.3 | 99.92 | 8 | temperature=1, path_num=5 | 1319 | 3,590,336 | 2,722 | 2,207,837 | 1,674 | 5,798,173 | 31.0542 |
| SC-CoT | gsm8k | 2025/1/22 | Qwen2.5-72B-Instruct | 93.86 | 100 | 8 | temperature=1, path_num=5 | 1319 | 8,136,223 | 6,168 | 2,481,785 | 1,882 | 10,618,008 | 5.9858 |
| SC-CoT | gsm8k | 2025/1/22 | Llama-3.3-70B-Instruct | 95.07 | 100 | 8 | temperature=1, path_num=5 | 1319 | 8,413,717 | 6,379 | 2,585,077 | 1,960 | 10,998,794 | 6.2005 |
| SC-CoT | gsm8k | 2025/1/22 | Qwen2.5-7B-Instruct | 91.13 | 100 | 8 | temperature=1, path_num=5 | 1319 | 8,586,888 | 6,510 | 2,554,097 | 1,936 | 11,140,985 | 0.0000 |
| SC-CoT | gsm8k | 2025/1/22 | Llama-3.1-8B-Instruct | 73.46 | 99.55 | 8 | temperature=1, path_num=5 | 1319 | 8,630,514 | 6,543 | 3,148,202 | 2,387 | 11,778,716 | 0.0000 |
| SC-CoT | gsm8k | 2025/1/22 | Internllm2_5-7B | 48.22 | 98.41 | 8 | temperature=1, path_num=5 | 1319 | 10,678,792 | 8,096 | 3,847,639 | 2,917 | 14,526,431 | 0.0000 |
| SC-CoT | gsm8k | 2025/1/22 | Qwen2-1.5B-Instruct | 11.75 | 91.89 | 8 | temperature=1, path_num=5 | 1319 | 9,066,115 | 6,873 | 3,345,827 | 2,537 | 12,411,942 | 0.0000 |
| SC-CoT | gsm8k | 2025/1/22 | Qwen2-0.5B-Instruct | 1.67 | 94.69 | 8 | temperature=1, path_num=5 | 1319 | 11,019,864 | 8,355 | 5,445,856 | 4,129 | 16,465,720 | 0.0000 |
| IO | AQuA | 2025/1/7 | gpt-3.5-turbo | 38.98 | 100 | 0 | | 254 | 25,701 | 101 | 16,770 | 66 | 42,471 | 0.0380 |
| IO | AQuA | 2025/1/7 | Doubao-lite-32k | 79.13 | 100 | 0 | | 254 | 33,058 | 130 | 54,684 | 215 | 87,742 | 0.0058 |
| IO | AQuA | 2025/1/22 | gpt-4o | 75.59 | 97.24 | 0 | | 254 | 25,631 | 101 | 108,121 | 426 | 133,752 | 1.1453 |
| IO | AQuA | 2025/1/22 | Qwen2.5-72B-Instruct | 84.25 | 99.61 | 0 | | 254 | 25,397 | 100 | 106,207 | 418 | 131,604 | 0.0742 |
| IO | AQuA | 2025/1/22 | Llama-3.3-70B-Instruct | 82.68 | 99.21 | 0 | | 254 | 32,809 | 129 | 108,758 | 428 | 141,567 | 0.0798 |
| IO | AQuA | 2025/1/22 | Qwen2.5-7B-Instruct | 78.74 | 98.43 | 0 | | 254 | 33,271 | 131 | 104,500 | 411 | 137,771 | 0.0000 |
| IO | AQuA | 2025/1/22 | Llama-3.1-8B-Instruct | 51.18 | 98.82 | 0 | | 254 | 26,459 | 104 | 106,647 | 420 | 133,106 | 0.0000 |
| IO | AQuA | 2025/1/22 | Internllm2_5-7B | 47.64 | 90.94 | 0 | | 254 | 50,232 | 198 | 134,809 | 531 | 185,041 | 0.0000 |
| IO | AQuA | 2025/1/22 | Qwen2-1.5B-Instruct | 29.13 | 97.64 | 0 | | 254 | 27,937 | 110 | 43,110 | 170 | 71,047 | 0.0000 |
| IO | AQuA | 2025/1/22 | Qwen2-0.5B-Instruct | 27.17 | 98.82 | 0 | | 254 | 27,937 | 110 | 82,478 | 325 | 110,415 | 0.0000 |
| CoT | AQuA | 2025/1/7 | gpt-3.5-turbo | 61.02 | 93.7 | 0 | | 254 | 25,447 | 100 | 55,346 | 218 | 80,793 | 0.0957 |
| CoT | AQuA | 2025/1/7 | Doubao-lite-32k | 82.68 | 97.24 | 0 | | 254 | 27,978 | 110 | 66,599 | 262 | 94,577 | 0.0066 |
| CoT | AQuA | 2025/1/22 | gpt-4o | 82.68 | 98.03 | 0 | | 254 | 25,123 | 99 | 97,894 | 385 | 123,017 | 1.0417 |
| CoT | AQuA | 2025/1/22 | Qwen2.5-72B-Instruct | 86.22 | 99.21 | 0 | | 254 | 25,143 | 99 | 118,146 | 465 | 143,289 | 0.0808 |
| CoT | AQuA | 2025/1/22 | Llama-3.3-70B-Instruct | 83.46 | 98.43 | 0 | | 254 | 32,555 | 128 | 131,834 | 519 | 164,389 | 0.0927 |
| CoT | AQuA | 2025/1/22 | Qwen2.5-7B-Instruct | 80.71 | 99.61 | 0 | | 254 | 33,017 | 130 | 116,719 | 460 | 149,736 | 0.0000 |
| CoT | AQuA | 2025/1/22 | Llama-3.1-8B-Instruct | 60.63 | 100 | 0 | | 254 | 32,555 | 128 | 111,880 | 440 | 144,435 | 0.0000 |
| CoT | AQuA | 2025/1/22 | Internllm2_5-7B | 52.76 | 89.37 | 0 | | 254 | 26,610 | 105 | 100,910 | 397 | 127,520 | 0.0000 |
| CoT | AQuA | 2025/1/22 | Qwen2-1.5B-Instruct | 40.55 | 98.82 | 0 | | 254 | 30,477 | 120 | 79,563 | 313 | 110,040 | 0.0000 |
| CoT | AQuA | 2025/1/22 | Qwen2-0.5B-Instruct | 33.07 | 98.82 | 0 | | 254 | 30,477 | 120 | 86,862 | 342 | 117,339 | 0.0000 |
| PoT | AQuA | 2025/1/7 | gpt-3.5-turbo | 59.45 | 100 | 0 | | 254 | 225,162 | 886 | 41,492 | 163 | 266,654 | 0.1748 |
| PoT | AQuA | 2025/1/7 | Doubao-lite-32k | 71.65 | 96.85 | 0 | | 254 | 259,863 | 1,023 | 49,573 | 195 | 309,436 | 0.0147 |
| PoT | AQuA | 2025/1/22 | gpt-4o | 75.2 | 100 | 0 | | 254 | 222,717 | 877 | 105,191 | 414 | 327,908 | 1.6087 |
| PoT | AQuA | 2025/1/22 | Qwen2.5-72B-Instruct | 75.2 | 100 | 0 | | 254 | 249,215 | 981 | 42,549 | 168 | 291,764 | 0.1645 |
| PoT | AQuA | 2025/1/22 | Llama-3.3-70B-Instruct | 79.53 | 99.21 | 0 | | 254 | 240,735 | 948 | 69,064 | 272 | 309,799 | 0.1746 |
| PoT | AQuA | 2025/1/22 | Qwen2.5-7B-Instruct | 68.11 | 100 | 0 | | 254 | 264,517 | 1,041 | 49,211 | 194 | 313,728 | 0.0000 |
| PoT | AQuA | 2025/1/22 | Llama-3.1-8B-Instruct | 36.61 | 96.85 | 0 | | 254 | 240,613 | 947 | 50,301 | 198 | 290,914 | 0.0000 |
| PoT | AQuA | 2025/1/22 | Internllm2_5-7B | 36.61 | 98.82 | 0 | | 254 | 233,505 | 919 | 68,457 | 270 | 301,962 | 0.0000 |
| PoT | AQuA | 2025/1/22 | Qwen2-1.5B-Instruct | 30.71 | 96.46 | 0 | | 254 | 246,560 | 971 | 51,915 | 204 | 298,475 | 0.0000 |
| PoT | AQuA | 2025/1/22 | Qwen2-0.5B-Instruct | 17.32 | 92.13 | 0 | | 254 | 258,867 | 1,019 | 63,414 | 250 | 322,281 | 0.0000 |
| SC-CoT | AQuA | 2025/1/7 | gpt-3.5-turbo | 66.14 | 99.21 | 0 | temperature=1, path_num=5 | 254 | 482,192 | 1,898 | 365,143 | 1,438 | 847,335 | 0.7888 |
| SC-CoT | AQuA | 2025/1/7 | Doubao-lite-32k | 81.1 | 97.24 | 0 | temperature=1, path_num=5 | 254 | 503,751 | 1,983 | 382,235 | 1,505 | 885,986 | 0.0519 |
| SC-CoT | AQuA | 2025/1/22 | gpt-4o | 86.61 | 98.82 | 0 | temperature=1, path_num=5 | 254 | 744,478 | 2,931 | 628,728 | 2,475 | 1,373,206 | 8.1485 |
| SC-CoT | AQuA | 2025/1/22 | Qwen2.5-72B-Instruct | 85.04 | 99.21 | 0 | temperature=1, path_num=5 | 254 | 1,051,218 | 4,139 | 784,451 | 3,088 | 1,835,669 | 1.0348 |
| SC-CoT | AQuA | 2025/1/22 | Llama-3.3-70B-Instruct | 82.28 | 99.21 | 0 | temperature=1, path_num=5 | 254 | 1,135,251 | 4,469 | 772,673 | 3,042 | 1,907,924 | 1.0756 |
| SC-CoT | AQuA | 2025/1/22 | Qwen2.5-7B-Instruct | 79.92 | 100 | 0 | temperature=1, path_num=5 | 254 | 1,098,280 | 4,324 | 747,052 | 2,941 | 1,845,332 | 0.0000 |
| SC-CoT | AQuA | 2025/1/22 | Llama-3.1-8B-Instruct | 59.45 | 97.24 | 0 | temperature=1, path_num=5 | 254 | 971,003 | 3,823 | 680,330 | 2,678 | 1,651,333 | 0.0000 |
| SC-CoT | AQuA | 2025/1/22 | Internllm2_5-7B | 39.37 | 98.03 | 0 | temperature=1, path_num=5 | 254 | 1,420,494 | 5,592 | 875,728 | 3,448 | 2,296,222 | 0.0000 |
| SC-CoT | AQuA | 2025/1/22 | Qwen2-1.5B-Instruct | 23.62 | 96.46 | 0 | temperature=1, path_num=5 | 254 | 1,034,362 | 4,072 | 740,973 | 2,917 | 1,775,335 | 0.0000 |
| SC-CoT | AQuA | 2025/1/22 | Qwen2-0.5B-Instruct | 22.83 | 97.24 | 0 | temperature=1, path_num=5 | 254 | 1,246,929 | 4,909 | 968,162 | 3,812 | 2,215,091 | 0.0000 |
| ReAct-Pro* | AQuA | 2025/1/7 | gpt-3.5-turbo | 64.57 | 98.03 | 0 | max_steps=10 | 254 | 862,614 | 3,396 | 40,973 | 161 | 903,587 | 0.4928 |
| ReAct-Pro* | AQuA | 2025/1/7 | Doubao-lite-32k | 77.56 | 96.06 | 0 | max_steps=10 | 254 | 977,890 | 3,850 | 54,951 | 216 | 1,032,841 | 0.0445 |
| ReAct-Pro* | AQuA | 2025/1/22 | gpt-4o | 57.48 | 97.24 | 0 | max_steps=10 | 254 | 615,589 | 2,424 | 76,507 | 301 | 692,096 | 2.3040 |
| ReAct-Pro* | AQuA | 2025/1/22 | Qwen2.5-72B-Instruct | 73.23 | 100 | 0 | max_steps=10 | 254 | 441,765 | 1,739 | 121,838 | 480 | 563,603 | 0.3177 |
| ReAct-Pro* | AQuA | 2025/1/22 | Llama-3.3-70B-Instruct | 79.13 | 99.61 | 0 | max_steps=10 | 254 | 1,119,143 | 4,406 | 243,236 | 958 | 1,362,379 | 0.7680 |
| ReAct-Pro* | AQuA | 2025/1/22 | Qwen2.5-7B-Instruct | 74.41 | 99.21 | 0 | max_steps=10 | 254 | 564,165 | 2,221 | 131,679 | 518 | 695,844 | 0.0000 |
| ReAct-Pro* | AQuA | 2025/1/22 | Llama-3.1-8B-Instruct | 55.51 | 96.85 | 0 | max_steps=10 | 254 | 3,764,723 | 14,822 | 576,098 | 2,268 | 4,340,821 | 0.0000 |
| ReAct-Pro* | AQuA | 2025/1/22 | Internllm2_5-7B | 40.94 | 96.85 | 0 | max_steps=10 | 254 | 3,592,039 | 14,142 | 836,762 | 3,294 | 4,428,801 | 0.0000 |
| ReAct-Pro* | AQuA | 2025/1/22 | Qwen2-1.5B-Instruct | 25.59 | 96.06 | 0 | max_steps=10 | 254 | 4,555,858 | 17,936 | 516,146 | 2,032 | 5,072,004 | 0.0000 |
| ReAct-Pro* | AQuA | 2025/1/22 | Qwen2-0.5B-Instruct | 24.02 | 96.85 | 0 | max_steps=10 | 254 | 6,344,167 | 24,977 | 825,920 | 3,252 | 7,170,087 | 0.0000 |

Default settings:

  • temperature = 0

LLM prices:

  • gpt-3.5-turbo:
    • $0.50 / 1M tokens (input)
    • $1.50 / 1M tokens (output)
  • Doubao-lite-32k (1 USD = 7.3249 CNY):
    • $0.04096 / 1M tokens (input)
    • $0.08200 / 1M tokens (output)
  • gpt-4o-2024-08-06:
    • $2.50 / 1M tokens (input)
    • $10.00 / 1M tokens (output)
  • Qwen2.5-72B-Instruct and Llama-3.3-70B-Instruct:
    • approximately $0.564 / 1M tokens for both input and output, back-calculated from the Cost($) columns above
  • Other open-source LLMs:
    • Deployed locally; see the OmAgent repository for more information.
    • Their cost is not counted in the leaderboard (reported as 0.0000).
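
The Cost($) columns follow directly from these per-token prices. A minimal sketch for gpt-3.5-turbo (prices taken from the list above; token counts from the IO/gsm8k row):

# Reproduce a Cost($) cell from token counts and the prices above.
GPT35_INPUT_PRICE = 0.5 / 1_000_000   # $ per input token
GPT35_OUTPUT_PRICE = 1.5 / 1_000_000  # $ per output token

def run_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * GPT35_INPUT_PRICE + output_tokens * GPT35_OUTPUT_PRICE

# IO on gsm8k with gpt-3.5-turbo: 546,990 input / 39,563 output tokens
print(round(run_cost(546_990, 39_563), 4))  # 0.3328, matching the table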

Pass Rate*: the percentage of predictions that are valid, where a prediction is valid if it is neither empty nor null.
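
In code, that definition amounts to the following sketch (the record layout follows the output format above; using output_postprocess as the prediction field is an assumption):

# Pass rate: share of predictions that are neither empty nor null.
def pass_rate(records: list[dict]) -> float:
    valid = sum(1 for r in records
                if r.get("output_postprocess") not in (None, ""))
    return 100 * valid / len(records)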

Compare to original agent repositories

| Algorithm | Dataset | Eval Time | LLM | Framework | Score |
|-----------|---------|-----------|-----|-----------|-------|
| CoT | gsm8k | 2025/1/7 | gpt-3.5-turbo | Original repo | 79.23 |
| CoT | gsm8k | 2025/1/7 | gpt-3.5-turbo | OmAgent | 78.70 |
| CoT | AQuA | 2025/1/7 | gpt-3.5-turbo | Original repo | 60.63 |
| CoT | AQuA | 2025/1/7 | gpt-3.5-turbo | OmAgent | 61.02 |
| PoT | gsm8k | 2025/1/7 | gpt-4o-mini | Original repo | 86.35 |
| PoT | gsm8k | 2025/1/7 | gpt-4o-mini | OmAgent | 88.25 |
| ReAct | AQuA | 2025/1/7 | gpt-3.5-turbo | Original repo | 35.04 |
| ReAct | AQuA | 2025/1/7 | gpt-3.5-turbo | OmAgent | 34.25 |
| ReAct | HotpotQA | 2025/1/8 | gpt-3.5-turbo | Original repo | 28.00 |
| ReAct | HotpotQA | 2025/1/8 | gpt-3.5-turbo | OmAgent | 27.40 |

Note:

  • The original repo is the official repository of the agent implementation.
  • OmAgent is the implementation of the agent in this project.
  • There is no official implementation of SC-CoT.

Comparison of ReAct and ReAct-Pro

| Algorithm | Dataset | Eval Time | LLM | Score | Pass Rate |
|-----------|---------|-----------|-----|-------|-----------|
| ReAct | gsm8k | 2025/1/7 | gpt-3.5-turbo | 38.13 | 100.00 |
| ReAct-Pro | gsm8k | 2025/1/7 | gpt-3.5-turbo | 74.91 | 99.39 |
| ReAct | AQuA | 2025/1/7 | gpt-3.5-turbo | 34.25 | 97.64 |
| ReAct-Pro | AQuA | 2025/1/7 | gpt-3.5-turbo | 64.57 | 98.03 |

πŸ”— Related works

Open Agent Leaderboard is built on top of the OmAgent repository.

Acknowledgments

We extend our deepest gratitude to the authors and contributors of the following datasets: gsm8k, AQuA; agent algorithms: CoT, SC-CoT, PoT, ReAct; and LLMs: gpt-3.5-turbo, Doubao-lite-32k.

⭐️ Citation

If you find this repository helpful, please cite it:

@misc{open-agent-leaderboard,
    title={Open Agent Leaderboard},
    author={Om AI Lab},
    year={2025},
    publisher={GitHub},
    howpublished={\url{https://github.com/om-ai-lab/open-agent-leaderboard}}
}

πŸ”” Follow us

You can follow us on X and Discord for more updates and discussions.

🀝 Contributing

Feel free to submit issues and pull requests.

πŸ“ License

This project is licensed under the MIT License.