C-Eval Performance and Script
This project tested the performance of the relevant models on the recently released C-Eval benchmark dataset. The test set consists of 12.3K multiple-choice questions covering 52 subjects. Below are the average validation and test set scores for some of the models. For the complete results, please refer to our technical report.
Model | Valid (zero-shot) | Valid (5-shot) | Test (zero-shot) | Test (5-shot) |
---|---|---|---|---|
Chinese-Alpaca-33B | 43.3 | 42.6 | 41.6 | 40.4 |
Chinese-LLaMA-33B | 34.9 | 38.4 | 34.6 | 39.5 |
Chinese-Alpaca-Plus-13B | 43.3 | 42.4 | 41.5 | 39.9 |
Chinese-LLaMA-Plus-13B | 27.3 | 34.0 | 27.8 | 33.3 |
Chinese-Alpaca-Plus-7B | 36.7 | 32.9 | 36.4 | 32.3 |
Chinese-LLaMA-Plus-7B | 27.3 | 28.3 | 26.9 | 28.4 |
In the following, we describe how to run predictions on the C-Eval dataset. Users can also refer to our Colab notebook.
Download the dataset from the path specified by official C-Eval, and unzip the file into the `data` folder:
```bash
wget https://huggingface.co/datasets/ceval/ceval-exam/resolve/main/ceval-exam.zip
unzip ceval-exam.zip -d data
```
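It can be worth verifying the extraction before running the evaluation. Below is a minimal sanity check, assuming the archive unpacks into `dev/`, `val/`, and `test/` subfolders of per-subject CSV files (the layout of the official C-Eval release); the paths are illustrative:

```python
import os

# Check that the extracted C-Eval data has the expected splits.
# Assumed layout: data/{dev,val,test}/<subject>_<split>.csv
for split in ("dev", "val", "test"):
    split_dir = os.path.join("data", split)
    n_csv = len([f for f in os.listdir(split_dir) if f.endswith(".csv")])
    print(f"{split}: {n_csv} subject files")  # expect 52 files per split
```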
Move the `data` folder to the `scripts/ceval` directory of this project.
Run the following script:
```bash
model_path=path/to/chinese_llama_or_alpaca
output_path=path/to/your_output_dir

cd scripts/ceval
python eval.py \
    --model_path ${model_path} \
    --cot False \
    --few_shot False \
    --with_prompt True \
    --constrained_decoding True \
    --temperature 0.2 \
    --n_times 1 \
    --ntrain 5 \
    --do_save_csv False \
    --do_test False \
    --output_dir ${output_path}
```
- `model_path`: Path to the model to be evaluated (the model merged with LoRA, in HF format)
- `cot`: Whether to use chain-of-thought
- `few_shot`: Whether to use few-shot evaluation
- `ntrain`: Number of few-shot demonstrations when `few_shot=True` (5-shot: `ntrain=5`); has no effect when `few_shot=False`
- `with_prompt`: Whether the model input contains the instruction prompt for Alpaca models
- `constrained_decoding`: Since the standard answer format for C-Eval is the option letter 'A'/'B'/'C'/'D', we provide two methods for extracting answers from the models' outputs (see the sketch after this list):
    - `constrained_decoding=True`: compute the probability that the first token generated by the model is 'A', 'B', 'C', or 'D', and choose the most probable one as the answer
    - `constrained_decoding=False`: extract the answer token from the model's output with regular expressions
- `temperature`: Temperature for decoding
- `n_times`: Number of repeated evaluations; a matching number of folders will be generated under `output_dir`
- `do_save_csv`: Whether to save the model outputs, extracted answers, etc. in CSV files
- `output_dir`: Output path for the results
- `do_test`: Whether to evaluate on the valid or test set: evaluate on the valid set when `do_test=False` and on the test set when `do_test=True`
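To make the `constrained_decoding=True` behavior concrete, here is a minimal sketch of the idea using Hugging Face `transformers`: run one forward pass and compare the logits of the four option letters at the first generated position. The model path, prompt, and the single-token assumption for 'A'–'D' are illustrative, not the project's actual implementation in `eval.py`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative path; the real script loads the model given by --model_path.
model_path = "path/to/chinese_llama_or_alpaca"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
model.eval()

# Question plus the four options, formatted as the evaluation prompt.
prompt = "..."

# Token ids for the option letters; assumes each letter maps to one token.
choice_ids = [tokenizer.encode(c, add_special_tokens=False)[0] for c in "ABCD"]

with torch.no_grad():
    inputs = tokenizer(prompt, return_tensors="pt")
    next_token_logits = model(**inputs).logits[0, -1]  # first generated token

# Choose the option letter with the highest next-token probability.
answer = "ABCD"[int(torch.argmax(next_token_logits[choice_ids]))]
print(answer)
```

Because only the four option tokens are compared, the model always produces a well-formed answer, which avoids the parsing failures that regular-expression extraction can hit on free-form output.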
- After evaluation finishes, the script creates directories `outputs/take*`, where `*` is a number ranging from 0 to `n_times-1`, each storing the results of one of the `n_times` repeated evaluations.
- Each `outputs/take*` directory contains a `submission.json` and a `summary.json`. If `do_save_csv=True`, there will also be 52 CSV files containing the model outputs, extracted answers, etc. for each subject.
- `submission.json` stores the generated answers in the official submission format and can be submitted for evaluation:
```json
{
    "computer_network": {
        "0": "A",
        "1": "B",
        ...
    },
    "marxism": {
        "0": "B",
        "1": "A",
        ...
    },
    ...
}
```
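The format is simply a nested mapping from subject to question index (as a string) to the chosen letter. As a reference, here is a minimal sketch of assembling such a file, where the `answers` dict of per-subject prediction lists is a hypothetical stand-in for your own results:

```python
import json

# Hypothetical per-subject predicted letters, in question order.
answers = {"computer_network": ["A", "B"], "marxism": ["B", "A"]}

# Inner keys are question indices as strings, per the official format.
submission = {
    subject: {str(i): letter for i, letter in enumerate(letters)}
    for subject, letters in answers.items()
}

with open("submission.json", "w", encoding="utf-8") as f:
    json.dump(submission, f, ensure_ascii=False, indent=2)
```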
- `summary.json` stores the evaluation results for the 52 subjects, the 4 broader categories, and an overall average. For instance, the 'All' key at the end of the JSON file shows the overall average score:

  ```json
  "All": {
      "score": 0.36701337295690933,
      "num": 1346,
      "correct": 494.0
  }
  ```

  where `score` is the overall accuracy, `num` is the total number of evaluation examples, and `correct` is the number of correct predictions.
When evaluating on the test set (`do_test=True`), `score` and `correct` are 0, since no labels are available. Test set results require submitting the `submission.json` file to official C-Eval. For detailed instructions, please refer to the official submission process provided by C-Eval.
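For validation runs, a quick way to inspect the results is to read `summary.json` directly; the take directory below is illustrative:

```python
import json

# Read the summary produced by one evaluation run (path is illustrative).
with open("outputs/take0/summary.json", encoding="utf-8") as f:
    summary = json.load(f)

# Overall accuracy across all validation examples.
overall = summary["All"]
print(f"accuracy: {overall['score']:.4f} "
      f"({overall['correct']:.0f}/{overall['num']})")
```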