This repository provides tools to easily create benchmarks and run automatic evaluations end-to-end with language models, across domains and modalities. Check out [our paper](https://arxiv.org/abs/2504.01001).
```bash
git clone https://github.com/deep-spin/zsb.git
cd zsb
python -m venv venv
source venv/bin/activate
poetry install
```
Tested with `python==3.10` and `poetry==1.6.1`.
We provide benchmarks for general capabilities in four languages (English, French, Chinese, and Korean), translation, and vision-language general capabilities in English (check the data folder). All models supported in litellm (e.g., OpenAI, Anthropic, Together) or vllm (e.g., most HF models) can be used for data creation, response generation, and evaluation.
For example, to get responses from `google/gemma-2-9b-it` for our English general capabilities benchmark, run:

```bash
python zsb/scripts/generate_answers.py \
    --model_name google/gemma-2-9b-it \
    --model_type vllm \
    --prompts_path data/general_capabilities_english.jsonl \
    --output_path example_answers.jsonl
```
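Once generation finishes, you can take a quick look at the output file. A minimal sketch for inspecting it; we only assume the file is JSONL (one JSON object per line), and print the field names rather than assuming them:

```python
import json

# Peek at the first row of the generated answers file to see its structure.
# Assumption: the output is JSONL, as the .jsonl extension suggests.
with open("example_answers.jsonl") as f:
    first_row = json.loads(f.readline())

print("Fields per row:", sorted(first_row.keys()))
```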
Then, to evaluate with `claude-3-5-sonnet-20241022` as a judge, run:

```bash
python zsb/scripts/generate_da_eval.py \
    --task general_purpose_chat_english \
    --model_name claude-3-5-sonnet-20241022 \
    --model_type litellm \
    --answers_path example_answers.jsonl \
    --output_path example_judgments.jsonl
```
The scores for each instance will be in the `judgement` entry of each row in the output file.
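To aggregate the per-instance scores into a single number, something like the following works. This is a hedged sketch: it assumes each row's `judgement` entry is a numeric score (or a string that parses as one); if your judge returns structured output, adapt the extraction accordingly:

```python
import json

# Average the per-instance scores in the judgments file.
# Assumption (not guaranteed by the repo): "judgement" holds a numeric
# score, or a string that casts cleanly to float.
scores = []
with open("example_judgments.jsonl") as f:
    for line in f:
        row = json.loads(line)
        scores.append(float(row["judgement"]))

print(f"Mean score over {len(scores)} instances: {sum(scores) / len(scores):.3f}")
```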
You can also create a new benchmark for the tasks we support. For example, to create 10 instances of the Korean general capabilities task with `Qwen/Qwen2.5-72B-Instruct`, run:
```bash
python zsb/scripts/generate_prompts.py \
    --task general_purpose_chat_korean \
    --n_prompts 10 \
    --model_name Qwen/Qwen2.5-72B-Instruct \
    --model_type vllm \
    --output_path example_dataset.jsonl \
    --seed 42 \
    --model_args "{'proper_model_args':{'tensor_parallel_size':4},'sampling_params':{'temperature':0,'max_tokens':8192}}"
```
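Note that the `--model_args` value uses Python dict syntax (single quotes), not strict JSON. Before launching a long generation job, you can sanity-check that the string parses; a small standalone sketch (how the script consumes these arguments internally is not shown here):

```python
import ast

# The --model_args string uses Python literal syntax, so json.loads would
# reject the single quotes; ast.literal_eval parses it safely.
model_args = "{'proper_model_args':{'tensor_parallel_size':4},'sampling_params':{'temperature':0,'max_tokens':8192}}"
parsed = ast.literal_eval(model_args)

assert isinstance(parsed, dict), "expected a dict of model arguments"
print(parsed["sampling_params"])  # {'temperature': 0, 'max_tokens': 8192}
```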
To list all existing tasks, run:

```bash
python zsb/tasks/list.py
```
Check out our guide to creating new tasks under the tasks folder.
```bibtex
@misc{pombal2025zeroshotbenchmarkingframeworkflexible,
  title={Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models},
  author={José Pombal and Nuno M. Guerreiro and Ricardo Rei and André F. T. Martins},
  year={2025},
  eprint={2504.01001},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2504.01001},
}
```