
Commit 1736c0a

Merge pull request #611 from ymcui/ceval-notebook
Add C-Eval notebook & Release v4.1
2 parents c7e9782 + 3be07ec commit 1736c0a

File tree: 5 files changed (+9238 -4 lines)


README.md

+4 -2
@@ -37,7 +37,9 @@

 ## News

-**[2023/06/08] [Release v4.0](https://github.com/ymcui/Chinese-LLaMA-Alpaca/releases/tag/v4.0): Released Chinese LLaMA/Alpaca-33B, added a privateGPT usage example, added C-Eval results, etc.**
+**[2023/06/16] [Release v4.1](https://github.com/ymcui/Chinese-LLaMA-Alpaca/releases/tag/v4.1): Released a new technical report, added a C-Eval decoding script, added a low-resource model merging script, etc.**
+
+[2023/06/08] [Release v4.0](https://github.com/ymcui/Chinese-LLaMA-Alpaca/releases/tag/v4.0): Released Chinese LLaMA/Alpaca-33B, added a privateGPT usage example, added C-Eval results, etc.

 [2023/06/05] llama.cpp now supports Apple Silicon GPU decoding with a substantial speed-up; see [Discussions#Developer announcement](https://github.com/ymcui/Chinese-LLaMA-Alpaca/discussions/505)

@@ -229,7 +231,7 @@ chinese_llama_lora_7b/

 ### Objective Performance Evaluation

-This project also tested the relevant models on objective "NLU"-style evaluation sets. Results of this kind are not subjective: the model only needs to output a given label (a label-mapping strategy must be designed), so they offer another perspective on the capabilities of large models. We tested the models on the recently released [C-Eval benchmark](https://cevalbenchmark.com), whose test set contains 12.3K multiple-choice questions covering 52 subjects. Below are the valid and test set results (Average) of some models; complete results will be updated in the [technical report](https://arxiv.org/abs/2304.08177) later
+This project also tested the relevant models on objective "NLU"-style evaluation sets. Results of this kind are not subjective: the model only needs to output a given label (a label-mapping strategy must be designed), so they offer another perspective on the capabilities of large models. We tested the models on the recently released [C-Eval benchmark](https://cevalbenchmark.com), whose test set contains 12.3K multiple-choice questions covering 52 subjects. Below are the valid and test set results (Average) of some models; for complete results, please refer to the [technical report](https://arxiv.org/abs/2304.08177)

 | Models | Valid (zero-shot) | Valid (5-shot) | Test (zero-shot) | Test (5-shot) |
 | ----------------------- | :---------------: | :------------: | :--------------: | :-----------: |

README_EN.md

+4 -2
@@ -39,7 +39,9 @@ To promote open research of large models in the Chinese NLP community, this proj

 ## News

-**[June 8, 2023] [Release v4.0](https://github.com/ymcui/Chinese-LLaMA-Alpaca/releases/tag/v4.0): LLaMA/Alpaca 33B versions are available. We also add privateGPT demo, C-Eval results, etc.**
+**[June 16, 2023] [Release v4.1](https://github.com/ymcui/Chinese-LLaMA-Alpaca/releases/tag/v4.1): New technical report, add C-Eval inference script, add low-resource model merging script, etc.**
+
+[June 8, 2023] [Release v4.0](https://github.com/ymcui/Chinese-LLaMA-Alpaca/releases/tag/v4.0): LLaMA/Alpaca 33B versions are available. We also add privateGPT demo, C-Eval results, etc.

 [May 16, 2023] [Release v3.2](https://github.com/ymcui/Chinese-LLaMA-Alpaca/releases/tag/v3.2): Add SFT scripts, LangChain supports, Gradio-based web demo, etc.

@@ -233,7 +235,7 @@ In order to quickly evaluate the actual performance of related models, this proj

 ### NLU Performance Test

-This project also conducted tests on relevant models using the "NLU" objective evaluation dataset. The results of this type of evaluation are objective and only require the output of given labels, so they can provide insights into the capabilities of large models from another perspective. In the recently launched [C-Eval dataset](https://cevalbenchmark.com/), this project tested the performance of the relevant models. The test set contains 12.3K multiple-choice questions covering 52 subjects. The following are the evaluation results (average) of some models on the validation and test sets, and the complete results will be updated in the [technical report](https://arxiv.org/abs/2304.08177) later.
+This project also conducted tests on relevant models using the "NLU" objective evaluation dataset. The results of this type of evaluation are objective and only require the output of given labels, so they can provide insights into the capabilities of large models from another perspective. In the recently launched [C-Eval dataset](https://cevalbenchmark.com/), this project tested the performance of the relevant models. The test set contains 12.3K multiple-choice questions covering 52 subjects. The following are the evaluation results (average) of some models on the validation and test sets. For complete results, please refer to our [technical report](https://arxiv.org/abs/2304.08177).

 | Models | Valid (zero-shot) | Valid (5-shot) | Test (zero-shot) | Test (5-shot) |
 | ----------------------- | :---------------: | :------------: | :--------------: | :-----------: |
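
The paragraph above notes that C-Eval scoring only asks the model to emit a given label, which requires designing a label-mapping strategy, and the new ceval_example_for_chinese_alpaca.ipynb notebook demonstrates decoding C-Eval with Chinese Alpaca. The sketch below illustrates what such decoding *could* look like for a merged, Hugging Face-format Chinese Alpaca checkpoint; the model path, prompt template, and regex-based label mapping are illustrative assumptions, not the project's actual script.

```python
# Minimal sketch (not the project's official C-Eval script): decode one
# C-Eval-style multiple-choice item and map the free-form output to a label.
import re

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/merged-chinese-alpaca"  # assumption: a locally merged HF-format model

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype=torch.float16, device_map="auto"
)

def build_prompt(question: str, choices: list[str]) -> str:
    """Format a zero-shot multiple-choice prompt (template is illustrative only)."""
    options = "\n".join(f"{label}. {text}" for label, text in zip("ABCD", choices))
    return (
        "The following is a multiple-choice question. Answer with a single letter.\n\n"
        f"{question}\n{options}\nAnswer:"
    )

@torch.no_grad()
def predict_label(question: str, choices: list[str]) -> str:
    prompt = build_prompt(question, choices)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    completion = tokenizer.decode(
        output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    # Label mapping: take the first A/B/C/D appearing in the completion;
    # fall back to "A" if nothing matches (real scripts use more robust rules).
    match = re.search(r"[ABCD]", completion)
    return match.group(0) if match else "A"

print(predict_label("1 + 1 = ?", ["1", "2", "3", "4"]))
```

An alternative mapping compares the model's next-token logits for the four option letters instead of parsing generated text, which tends to be more robust when a model does not reliably emit a bare letter; either way, the predicted labels are scored against the gold answers to produce the averages reported in the table above.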

notebooks/README.md

+8
@@ -1,5 +1,13 @@
 # Notebook Examples

+### ceval_example_for_chinese_alpaca.ipynb
+
+An example of decoding the C-Eval dataset with the Chinese Alpaca model.
+
+Example of decoding C-Eval dataset with Chinese Alpaca.
+
+We recommend checking the latest version on Colab / Check latest notebook: <a href="https://colab.research.google.com/drive/12YewimRT7JuqJGOejxN7YG8jq2de4DnF?usp=sharing" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
+
 ### convert_and_quantize_chinese_llama_and_alpaca.ipynb

 A Colab example of converting and quantizing Chinese LLaMA/Alpaca models (including the Plus versions), for workflow reference only.
