
Commit 1ee944a

chore: update the examples
1 parent f73f9df commit 1ee944a

15 files changed: +661 −4 lines changed

README.md

Lines changed: 17 additions & 1 deletion
@@ -3,6 +3,10 @@
 </p>
 
 
+<h3 align="center">
+<p>Don't Repeat Yourself</p>
+</h3>
+
 <div style="text-align:center">
 <span style="width:80%;display:inline-block">
 
@@ -31,10 +35,10 @@
 * [Virtual Adversarial Training](#虚拟对抗训练)
 * [Complex Training Control](#复杂训练控制)
 * [Text Generation](#文本生成)
+* [Implement Your Own Model](#你自己的模型)
 * [More Documentation](#more-documentation)
 
 
-
 Although general-purpose large models have attracted most of the attention over the past year or so, many people have realized that task-oriented models remain irreplaceable at this stage: they handle certain specific tasks with better reliability and higher efficiency, and in particular they can implement agents that cooperate with LLMs.
 
 Unlike LLMs, task-oriented model development is not "one trick fits all": each task needs a purpose-built model. In practice we often need to experiment quickly with deep neural network models, search for the best architecture and hyperparameters, deploy the best model, and sometimes build a demo for verification.
@@ -204,6 +208,18 @@ DLK relies on two registry systems: one is `intc`'s `config` registry `cregister`, the other…
 
 `dlk` also implements a variety of `token_sample` methods, following `fairseq`'s implementation, which provide powerful control over text generation.
 
+#### Implement Your Own Model
+
+See `./examples/001_first_example` for how to implement your own model.
+
+After reading the example you may wonder: this doesn't seem simpler than implementing a model directly, and some of the concepts even make it look more complicated.
+That's true if you only want to train a simple model without prediction, demos, and so on. But `dlk` provides a unified framework: follow the steps to implement the corresponding components and you get a usable model, and all of the work is reusable, including the components you just implemented.
+
+`dlk` also provides many tools for optimization, so you are not limited to simple models.
+
+Remember, the principle of this package is Don't Repeat Yourself.
+
+
 #### More Documentation
 
 TODO

README_en.md

Lines changed: 17 additions & 0 deletions
@@ -2,6 +2,10 @@
 <h2 align="center"> Deep Learning toolKit (dlk)</h2>
 </p>
 
+<h3 align="center">
+<p>Don't Repeat Yourself</p>
+</h3>
+
 
 <div style="text-align:center">
 <span style="width:80%;display:inline-block">
@@ -30,6 +34,7 @@
 * [Adversarial Training](#adversarial-training)
 * [Complex Training Control](#complex-training-control)
 * [Text Generation](#text-generation)
+* [Your Own Model](#your-own-model)
 * [More Documentation](#more-documentation)
 
 Although general-purpose large models have attracted most of the attention over the past year or so, many people have realized that task-oriented models remain irreplaceable at this stage: they handle certain specific tasks with better reliability and higher efficiency, and in particular they can implement agents that cooperate with LLMs.
@@ -196,6 +201,18 @@ The advantage of using registries is that we don't need to be concerned about wh
 
 DLK also implements various `token_sample` methods, inspired by `fairseq`, providing powerful control capabilities for text generation.
 
+
+#### Your Own Model
+
+See `./examples/001_first_example` to implement your own model.
+
+As you can see, that completes a simple model. But you may wonder: this doesn't seem simpler than implementing a model directly, and some of the concepts even make it look more complicated.
+That's true if you just want to train a simple model and don't need to consider prediction, a demo, and so on. But dlk provides a very unified framework: just follow the steps to implement the corresponding components and you get a usable model, and all of the work is reusable, including the components you just implemented.
+
+Moreover, `dlk` also provides many optimization tools, so you are not limited to simple models.
+
+The principle of this package is Don't Repeat Yourself.
+
 #### More Documentation
 
 TODO

dlk/data/datamodule/default.py

Lines changed: 1 addition & 2 deletions
@@ -156,8 +156,7 @@ def test_dataloader(self):
         )
 
     def online_dataloader(self, data):
-        """get the data collate_fn"""
-        # return DataLoader(self.mnist_test, batch_size=self.batch_size)
+        """get the online dataloader"""
        if not self._online_key_type_pairs:
            self._online_key_type_pairs = self.dataset_creator.real_key_type_pairs(
                self.dataset_config.key_type_pairs, data

examples/001_first_example/.intc.json

Lines changed: 9 additions & 0 deletions

@@ -0,0 +1,9 @@
{
    "entry": [
        "config"
    ],
    "src": [
        "dlk",
        "src"
    ]
}

examples/001_first_example/config/fit.jsonc

Lines changed: 46 additions & 0 deletions

@@ -0,0 +1,46 @@
{
    "@fit": {
        "specific": {},
        "log_dir": "./logs",
        "processed_data_dir": "./data/processed_data",
        "@trainer@lightning": {
            "max_epochs": 6,
            "devices": "auto",
            "strategy": "auto",
            "accelerator": "auto",
            "@callback@lr_monitor": {
                "logging_interval": "step"
            },
            "@callback@checkpoint": {
                "monitor": "val_acc",
                "save_top_k": 0
            }
        },
        "@datamodule@my_datamodule": {
            "train_batch_size": 32
        },
        "@imodel@default": {
            "@model@my_model": {
                "bert_model_path": "./pretrain/",
                "bert_dim": 768,
                "label_num": 2
            },
            "@scheduler@linear_warmup": {
                "num_warmup_steps": 100
            },
            "@loss@cross_entropy": {
                "pred_truth_pair": {
                    "logits": "label_ids"
                },
                "ignore_index": -100
            },
            "@optimizer@adamw": {
                "lr": 2e-5,
                "eps": 1e-08
            },
            "@postprocessor@txt_cls": {
                "label_vocab": "label_vocab.json"
            }
        }
    }
}

examples/001_first_example/config/processor.jsonc

Lines changed: 5 additions & 0 deletions

@@ -0,0 +1,5 @@
{
    "@processor@my": {
        "tokenizer_path": "./pretrain/tokenizer.json"
    }
}

examples/001_first_example/demo.py

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
# Copyright the author(s) of DLK.
#
# This source code is licensed under the Apache license found in the
# LICENSE file in the root directory of this source tree.

from dlk.demo import Demo

demo = Demo(
    display_config={"@display@txt_cls": {}},
    process_config="./config/processor.jsonc",
    fit_config="./config/fit.jsonc",
    checkpoint="./logs/0/checkpoint/last.ckpt",
)

# example input to try: a subtle and well-crafted ( for the most part ) chiller .

examples/001_first_example/process.py

Lines changed: 45 additions & 0 deletions

@@ -0,0 +1,45 @@
# Copyright the author(s) of DLK.
#
# This source code is licensed under the Apache license found in the
# LICENSE file in the root directory of this source tree.

import copy
import json
import uuid

import pandas as pd
import src  # registers the custom components (@processor@my, ...)
from datasets import load_dataset

from dlk.preprocess import PreProcessor


def flat(data):
    """flatten the columnar dataset dict into a list of row dicts (like zip)"""
    sentences = data["sentence"]
    uuids = data["uuid"]
    labelses = data["labels"]
    return [
        {"sentence": sentence, "labels": [labels], "uuid": uuid}
        for sentence, labels, uuid in zip(sentences, labelses, uuids)
    ]


label_map = {0: "neg", 1: "pos"}

data = load_dataset("data/sst2")  # load the local copy of the sst2 dataset
data = data.map(
    lambda one: {
        "sentence": one["sentence"],
        "labels": label_map[one["label"]],
        "uuid": str(uuid.uuid1()),
    },
    remove_columns=["sentence", "label"],
)
input = {
    "train": pd.DataFrame(flat(data["train"].to_dict())),
    "valid": pd.DataFrame(flat(data["validation"].to_dict())),
}

processor = PreProcessor("./config/processor.jsonc")
processor.fit(input)

examples/001_first_example/readme.md

Lines changed: 97 additions & 0 deletions

@@ -0,0 +1,97 @@

### The First Project

This example is the same as `../text_cls`, but built from scratch.

#### Prepare the BERT model and .intc.json

Download the BERT pretrained model and its `tokenizer.json` to `./pretrain`.

Add a file `.intc.json` in the current dir:
```json
{
    "entry": [
        "config" // the configs will be put in this dir; intc-lsp will serve the files in this dir
    ],
    "src": [ // the source code for this project
        "dlk", // the dlk package
        "src" // and the source code we will write
    ]
}
```

#### Prepare the Dataset

Load and convert the `sst2` dataset; see `./process.py`:

```python
import uuid

import pandas as pd
from datasets import load_dataset


def flat(data):
    """flatten the columnar dataset dict into a list of row dicts (like zip)"""
    sentences = data["sentence"]
    uuids = data["uuid"]
    labelses = data["labels"]
    return [
        {"sentence": sentence, "labels": [labels], "uuid": uuid}
        for sentence, labels, uuid in zip(sentences, labelses, uuids)
    ]


label_map = {0: "neg", 1: "pos"}

data = load_dataset("sst2")  # load the dataset from huggingface datasets, or wherever you want
data = data.map(
    lambda one: {
        "sentence": one["sentence"],
        "labels": label_map[one["label"]],
        "uuid": str(uuid.uuid1()),
    },
    remove_columns=["sentence", "label"],
)
input = {
    "train": pd.DataFrame(flat(data["train"].to_dict())),  # convert the train and valid parts to pd.DataFrame
    "valid": pd.DataFrame(flat(data["validation"].to_dict())),
}
```
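
For reference, a single row of the resulting `input["train"]` DataFrame looks roughly like this (the sentence and label are illustrative, and the uuid is elided):

```python
{"sentence": "a subtle and well-crafted ( for the most part ) chiller .", "labels": ["pos"], "uuid": "..."}
```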

#### Prepare the Process code

dlk provides a general `@processor@default` processor for preprocessing the data, but for this example we create a simple one; see the code in `./src/my_processor.py`, registered as `@processor@my`. A rough sketch of its core follows.
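
The sketch below only illustrates the tokenization at the heart of such a processor, using the huggingface `tokenizers` package; the dlk processor base class and the `cregister` registration in the real file are omitted, and the class and method names are hypothetical:

```python
# Sketch only: dlk's processor base class and cregister registration are omitted.
from tokenizers import Tokenizer


class MyProcessorCore:
    def __init__(self, tokenizer_path: str):
        # `tokenizer_path` mirrors the key in ./config/processor.jsonc
        self.tokenizer = Tokenizer.from_file(tokenizer_path)

    def process_one(self, sentence: str) -> dict:
        # turn one raw sentence into model-ready features
        encoding = self.tokenizer.encode(sentence)
        return {"input_ids": encoding.ids, "attention_mask": encoding.attention_mask}
```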

#### Prepare the config of the preprocessor

Configure `@processor@my`; see `./config/processor.jsonc`.

#### Run the preprocessor

Then run the preprocessing:

```bash
python process.py
```

#### Prepare the model code

dlk provides many modules, but here we just want to test our own.

We define a simple classification model in `./src/my_model.py`, registered as `@model@my_model`; a plain-PyTorch sketch of its shape follows.
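
This is an assumption-laden illustration, not the file's actual contents: it uses plain PyTorch and `transformers`, omits dlk's model base class and registration, and the constructor arguments simply mirror the `bert_model_path`, `bert_dim`, and `label_num` keys in `./config/fit.jsonc`:

```python
# Sketch only: dlk's model base class and @model@my_model registration are omitted.
import torch
import torch.nn as nn
from transformers import BertModel


class MyModelCore(nn.Module):
    def __init__(self, bert_model_path: str, bert_dim: int, label_num: int):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_model_path)  # ./pretrain/
        self.classifier = nn.Linear(bert_dim, label_num)  # 768 -> 2 in this example

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # classify on the pooled [CLS] representation -> (batch, label_num) logits
        return self.classifier(out.pooler_output)
```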

#### Prepare the DataModule

dlk also provides a general datamodule, but for this case we implement our own.

We define a simple datamodule in `./src/my_datamodule.py`, registered as `@datamodule@my_datamodule`; conceptually it just wraps the processed splits in dataloaders, as the sketch below shows.
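
A minimal plain-PyTorch illustration (the lightning/dlk integration of the real file is omitted; `train_batch_size` mirrors the key in `./config/fit.jsonc`):

```python
# Sketch only: the real datamodule plugs into dlk/lightning.
from torch.utils.data import DataLoader, Dataset


class ListDataset(Dataset):
    """wrap a list of processed examples (dicts of tensors)"""

    def __init__(self, examples: list):
        self.examples = examples

    def __len__(self) -> int:
        return len(self.examples)

    def __getitem__(self, idx: int):
        return self.examples[idx]


def train_dataloader(train_examples: list, train_batch_size: int = 32) -> DataLoader:
    # shuffle only the training split
    return DataLoader(ListDataset(train_examples), batch_size=train_batch_size, shuffle=True)
```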

#### Prepare the `fit.jsonc`

Besides the `datamodule` and `model`, there are many other modules, such as the `loss`, `optimizer`, and `scheduler`; these are easy to understand, and we reuse the ones built into dlk.

See `./config/fit.jsonc`.


#### Train your model

Just prepare `train.py` (a hypothetical sketch follows) and run:

```bash
python train.py
```
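
What `train.py` might contain, by analogy with `process.py` and `demo.py`; the `Train` entry point named below is an assumption, not an API confirmed by this example:

```python
# HYPOTHETICAL sketch of train.py: `Train` is an assumed entry point; check
# dlk's documentation for the real one. The pattern mirrors
# PreProcessor("./config/processor.jsonc").fit(...) in process.py.
import src  # registers @processor@my, @model@my_model, @datamodule@my_datamodule

from dlk.train import Train  # ASSUMPTION: module/class name not confirmed here

Train("./config/fit.jsonc").fit()
```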

examples/001_first_example/src/__init__.py

Lines changed: 5 additions & 0 deletions

@@ -0,0 +1,5 @@
# Copyright the author(s) of DLK.
#
# This source code is licensed under the Apache license found in the
# LICENSE file in the root directory of this source tree.
from . import my_datamodule, my_model, my_processor
