Merge pull request #3 from open-compass/eval
update livemathbench-judge
jnanliu authored Jan 10, 2025
2 parents 8a38733 + dc63d5c commit a77d3d0
Showing 3 changed files with 9 additions and 4 deletions.
7 changes: 5 additions & 2 deletions README.md
@@ -16,8 +16,9 @@
 [📚[LeaderBoard](https://github.com/open-compass/GPassK/index.html)] -->
 
 ## 🚀 News
+- **[2025.1.10]** 🔥 We release a small-scale judge model, [LiveMath-Judge](https://huggingface.co/jnanliu/LiveMath-Judge).
 - **[2025.1.6]** 🔥 **[LiveMathBench](https://huggingface.co/datasets/opencompass/LiveMathBench)** can now be accessed through Hugging Face, and you can evaluate your LLMs on it using G-Pass@k in OpenCompass. We have addressed potential errors in LiveMathBench and inconsistencies in the sampling parameters. Please also refer to the updated version of our **[Paper](http://arxiv.org/abs/2412.13147)** for further details.
-- **[2024.12.18]** We release the **[ArXiv Paper](http://arxiv.org/abs/2412.13147)** of G-Pass@k. 🎉🎉🎉
+- **[2024.12.18]** 🎉 We release the **[ArXiv Paper](http://arxiv.org/abs/2412.13147)** of G-Pass@k.
 
 
 ## ☀️Introduction
@@ -87,10 +88,12 @@ lmdeploy serve api_server Qwen/Qwen2.5-72B-Instruct --server-port 8000 \
   --cache-max-entry-count 0.9 \
   --log-level INFO
 ```
-After setting up the judge model, define the URLs in the `eval_urls` within `opencompass_config_templates/*.py`. Adjust other parameters such as `k``temperatures`, `llm_infos`, and other params according to your needs.
+After setting up the judge model, define the URLs in `eval_urls` and set `eval_model_name` within `opencompass_config_templates/*.py`. Adjust other parameters such as `k`, `temperatures`, and `llm_infos` according to your needs.
 
 > ❗️Note that omitting `eval_urls` will default to an internal rule-based judge, which might only apply to datasets with numerical answers.
+> 💡You can now use [LiveMath-Judge](https://huggingface.co/jnanliu/LiveMath-Judge) for judging, which greatly reduces deployment and inference costs.
 ### 4. Evaluation
 
 To begin the evaluation, first generate the necessary configuration files by running the following script:
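For orientation, the judge-related settings in a template look roughly like the sketch below. The endpoint URL is an assumption (a local lmdeploy server on port 8000, matching the `--server-port 8000` flag above), and the `k` and `temperatures` values are placeholders, not values taken from the repository.

```python
# Sketch of the judge-related settings in opencompass_config_templates/*.py.
eval_urls = [
    'http://127.0.0.1:8000',  # hypothetical: judge endpoint served by lmdeploy above
]
eval_model_name = 'Qwen/Qwen2.5-72B-Instruct'  # must match the served judge model

k = 16                # placeholder: number of generations for G-Pass@k
temperatures = [1.0]  # placeholder: sampling temperatures
```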
3 changes: 2 additions & 1 deletion opencompass_config_templates/nono1.py
@@ -21,6 +21,7 @@
 eval_urls = [
     # Put your judge model URLs here
 ]
+eval_model_name = 'Qwen/Qwen2.5-72B-Instruct'
 
 
 llm_infos = [
@@ -81,7 +82,7 @@
         abbr=f'LiveMathBench-v{version}_k{"-".join(map(str, [k] if isinstance(k, int) else k))}_r{replication}'
     ))
     livemathbench_dataset['eval_cfg']['evaluator'].update(dict(
-        model_name='Qwen/Qwen2.5-72B-Instruct',
+        model_name=eval_model_name,
         url=eval_urls,
         k=k,
         replication=replication
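Because `model_name` is now read from `eval_model_name`, swapping in a different judge, such as the newly released LiveMath-Judge, becomes a one-line template edit. A minimal sketch, assuming LiveMath-Judge can be served with the same lmdeploy invocation pattern shown in the README:

```python
# Hypothetical: serve the judge first, e.g.
#   lmdeploy serve api_server jnanliu/LiveMath-Judge --server-port 8000
# then point the template at it instead of Qwen2.5-72B-Instruct.
eval_model_name = 'jnanliu/LiveMath-Judge'
```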
3 changes: 2 additions & 1 deletion opencompass_config_templates/o1.py
@@ -21,6 +21,7 @@
 eval_urls = [
     # Put your judge model URLs here
 ]
+eval_model_name = 'Qwen/Qwen2.5-72B-Instruct'
 
 
 llm_infos = [
@@ -69,7 +70,7 @@
         abbr=f'LiveMathBench-v{version}-k{"_".join(map(str, [k] if isinstance(k, int) else k))}-r{replication}'
     ))
     livemathbench_dataset['eval_cfg']['evaluator'].update(dict(
-        model_name='Qwen/Qwen2.5-72B-Instruct',
+        model_name=eval_model_name,
         url=eval_urls,
         k=k,
         replication=replication
