chore: update the examples

cstsunfu · cstsunfu · commit 1ee944a300ca · 2024-04-09T00:53:25.000+08:00
diff --git a/README.md b/README.md
@@ -3,6 +3,10 @@
 </p>
 
 
+<h3 align="center">
+    <p>Don't Repeat Yourself</p>
+</h3>
+
 <div style="text-align:center">
 <span style="width:80%;display:inline-block">
 
@@ -31,10 +35,10 @@
         * [Virtual Adversarial Training](#虚拟对抗训练)
         * [Complex Training Control](#复杂训练控制)
         * [Text Generation](#文本生成)
+    * [Implement Your Own Model](#你自己的模型)
     * [More Document](#more-document)
 
 
-
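 Although general-purpose large models have drawn most of the attention over the past year or so, many people have come to realize that task-oriented models remain irreplaceable at this stage: they are more reliable and more efficient on certain specific tasks, and they can in particular be used to implement agents that work alongside LLMs.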
 
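 Unlike LLMs, task-oriented model development has no one-size-fits-all solution; each task needs a model developed specifically for it. In practice we often need to run quick experiments on deep neural network models, search for the best architecture and parameters, deploy the best model, and sometimes build a demo to validate it.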
@@ -204,6 +208,18 @@ DLK依赖两个注册系统，一套是`intc`的`config`注册`cregister`，一
 
 `dlk` also draws on the `fairseq` implementation to provide a variety of `token_sample` methods, giving very powerful control over text generation.
 
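+#### Implement Your Own Model
+
+See `./examples/001_first_example` to implement your own model.
+
+After reading the example you may wonder: this does not seem any simpler than implementing a model directly, and the many new concepts even make it look more complicated.
+That is true if you only want to train a simple model and do not need prediction, a demo, and so on. But `dlk` provides a unified framework: you only need to follow the steps and implement the corresponding components to get a usable model, and all of that work is reusable, including the components you just implemented.
+
+`dlk` also provides many optimization tools, so you are not limited to simple models.
+
+Remember, the principle of this package is Don't Repeat Yourself.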
+
+
 #### More Document
 
 TODO
diff --git a/README_en.md b/README_en.md
@@ -2,6 +2,10 @@
   <h2 align="center"> Deep Learning toolKit (dlk)</h2>
 </p>
 
+<h3 align="center">
+    <p>Don't Repeat Yourself</p>
+</h3>
+
 
 <div style="text-align:center">
 <span style="width:80%;display:inline-block">
@@ -30,6 +34,7 @@
         * [Adversarial Training](#adversarial-training)
         * [Complex Training Control](#complex-training-control)
         * [Text Generation](#text-generation)
+    * [Your Own Model](#your-own-model)
     * [More Documentation](#more-documentation)
 
 Although general-purpose large models have attracted most people's attention in the past year or so, I believe many people have realized that task-oriented models are still irreplaceable at this stage: they offer better reliability and higher efficiency on certain specific tasks, and they can in particular be used to implement agents that cooperate with LLMs.
@@ -196,6 +201,18 @@ The advantage of using registries is that we don't need to be concerned about wh
 
 DLK also implements various `token_sample` methods, inspired by `fairseq`, providing powerful control capabilities for text generation.
 
+
+#### Your Own Model
+
+See `./examples/001_first_example` to implement your own model.
+
+After working through the example, you have a simple model. But you may wonder: this doesn't seem any simpler than implementing a model directly, and the many new concepts even make it look more complicated.
+That is true if you only want to train a simple model and do not need prediction, a demo, and so on. But `dlk` provides a unified framework: you only need to follow the steps and implement the corresponding components to get a usable model, and all of that work is reusable, including the components you just implemented.
+
+Moreover, `dlk` also provides many optimization tools, so that you are not limited to simple models.
+
+The principle of this package is Don't Repeat Yourself.
+
 #### More Documentation
 
 TODO
diff --git a/dlk/data/datamodule/default.py b/dlk/data/datamodule/default.py
@@ -156,8 +156,7 @@ def test_dataloader(self):
         )
 
     def online_dataloader(self, data):
-        """get the data collate_fn"""
-        # return DataLoader(self.mnist_test, batch_size=self.batch_size)
+        """get the online dataloader"""
         if not self._online_key_type_pairs:
             self._online_key_type_pairs = self.dataset_creator.real_key_type_pairs(
                 self.dataset_config.key_type_pairs, data
diff --git a/examples/001_first_example/.intc.json b/examples/001_first_example/.intc.json
@@ -0,0 +1,9 @@
+{
+    "entry": [
+        "config"
+    ],
+    "src": [
+        "dlk",
+        "src"
+    ]
+}
diff --git a/examples/001_first_example/config/fit.jsonc b/examples/001_first_example/config/fit.jsonc
@@ -0,0 +1,46 @@
+{
+    "@fit": {
+        "specific": {},
+        "log_dir": "./logs",
+        "processed_data_dir": "./data/processed_data",
+        "@trainer@lightning": {
+            "max_epochs": 6,
+            "devices": "auto",
+            "strategy": "auto",
+            "accelerator": "auto",
+            "@callback@lr_monitor": {
+                "logging_interval": "step"
+            },
+            "@callback@checkpoint": {
+                "monitor": "val_acc",
+                "save_top_k": 0
+            }
+        },
+        "@datamodule@my_datamodule": {
+            "train_batch_size": 32
+        },
+        "@imodel@default": {
+            "@model@my_model": {
+                "bert_model_path": "./pretrain/",
+                "bert_dim": 768,
+                "label_num": 2
+            },
+            "@scheduler@linear_warmup": {
+                "num_warmup_steps": 100
+            },
+            "@loss@cross_entropy": {
+                "pred_truth_pair": {
+                    "logits": "label_ids"
+                },
+                "ignore_index": -100
+            },
+            "@optimizer@adamw": {
+                "lr": 2e-5,
+                "eps": 1e-08
+            },
+            "@postprocessor@txt_cls": {
+                "label_vocab": "label_vocab.json"
+            }
+        }
+    }
+}
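Review note: the component keys in `fit.jsonc` follow the `@<kind>@<name>` shape that the example readme uses when it says a class is registered as `@model@my_model`. As a rough illustration of that naming scheme only, the sketch below walks a trimmed copy of the config and lists the components it names. This is not how `intc` parses configs; `COMPONENT_KEY` and `list_components` are hypothetical helpers, and treating a bare `@fit` key as a kind without a name is an assumption.

```python
import re

# Assumed key shape: "@<kind>" or "@<kind>@<name>"; any other key is a plain parameter.
COMPONENT_KEY = re.compile(r"^@(?P<kind>[^@]+)(?:@(?P<name>[^@]+))?$")


def list_components(config, path=()):
    """Yield (path, kind, name) for every component key in a nested config dict."""
    for key, value in config.items():
        match = COMPONENT_KEY.match(key)
        if match:
            yield path + (key,), match["kind"], match["name"]
        if isinstance(value, dict):
            # Components can nest, e.g. callbacks inside the trainer.
            yield from list_components(value, path + (key,))


# Trimmed copy of the fit config from this commit.
fit_config = {
    "@fit": {
        "log_dir": "./logs",
        "@trainer@lightning": {
            "max_epochs": 6,
            "@callback@lr_monitor": {"logging_interval": "step"},
        },
        "@datamodule@my_datamodule": {"train_batch_size": 32},
        "@imodel@default": {
            "@model@my_model": {"label_num": 2},
            "@optimizer@adamw": {"lr": 2e-5},
        },
    }
}

components = [(kind, name) for _, kind, name in list_components(fit_config)]
for kind, name in components:
    print(kind, name)
```

Listing the keys this way makes it easy to see which parts of the config come from the example's own `src` package (`my_datamodule`, `my_model`) and which are reused from `dlk`.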
diff --git a/examples/001_first_example/config/processor.jsonc b/examples/001_first_example/config/processor.jsonc
@@ -0,0 +1,5 @@
+{
+    "@processor@my": {
+        "tokenizer_path": "./pretrain/tokenizer.json"
+    }
+}
diff --git a/examples/001_first_example/demo.py b/examples/001_first_example/demo.py
@@ -0,0 +1,15 @@
+# Copyright the author(s) of DLK.
+#
+# This source code is licensed under the Apache license found in the
+# LICENSE file in the root directory of this source tree.
+
+from dlk.demo import Demo
+
+demo = Demo(
+    display_config={"@display@txt_cls": {}},
+    process_config="./config/processor.jsonc",
+    fit_config="./config/fit.jsonc",
+    checkpoint="./logs/0/checkpoint/last.ckpt",
+)
+
+# a subtle and well-crafted ( for the most part ) chiller .
diff --git a/examples/001_first_example/process.py b/examples/001_first_example/process.py
@@ -0,0 +1,45 @@
+# Copyright the author(s) of DLK.
+#
+# This source code is licensed under the Apache license found in the
+# LICENSE file in the root directory of this source tree.
+
+import copy
+import json
+import uuid
+
+import pandas as pd
+import src
+from datasets import load_dataset
+
+from dlk.preprocess import PreProcessor
+
+
+def flat(data):
+    """Convert column-oriented data (a dict of lists) into a list of row dicts."""
+    sentences = data["sentence"]
+    uuids = data["uuid"]
+    labelses = data["labels"]
+    return [
+        {"sentence": sentence, "labels": [labels], "uuid": uid}
+        for sentence, labels, uid in zip(sentences, labelses, uuids)
+    ]
+
+
+label_map = {0: "neg", 1: "pos"}
+
+data = load_dataset("data/sst2")
+data = data.map(
+    lambda one: {
+        "sentence": one["sentence"],
+        "labels": label_map[one["label"]],
+        "uuid": str(uuid.uuid1()),
+    },
+    remove_columns=["sentence", "label"],
+)
+input = {
+    "train": pd.DataFrame(flat(data["train"].to_dict())),
+    "valid": pd.DataFrame(flat(data["validation"].to_dict())),
+}
+
+processor = PreProcessor("./config/processor.jsonc")
+processor.fit(input)
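Review note: `flat` is the only non-obvious step in `process.py`. It turns the column-oriented dict returned by `Dataset.to_dict()` into one dict per example and wraps each label in a list. A minimal sketch on a hypothetical two-row sample (the sentences and ids are made up, and no `datasets` or `pandas` install is needed to run it):

```python
def flat(data):
    """Convert column-oriented data (a dict of lists) into a list of row dicts."""
    sentences = data["sentence"]
    uuids = data["uuid"]
    labelses = data["labels"]
    return [
        {"sentence": sentence, "labels": [labels], "uuid": uid}
        for sentence, labels, uid in zip(sentences, labelses, uuids)
    ]


# Hypothetical sample in the same shape as data["train"].to_dict()
columns = {
    "sentence": ["a fine film", "a dull plot"],
    "labels": ["pos", "neg"],
    "uuid": ["id-0", "id-1"],
}

rows = flat(columns)
for row in rows:
    print(row)
```

Each row can then be passed to `pd.DataFrame(...)` as in the script above; note that `labels` becomes a one-element list per row.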
diff --git a/examples/001_first_example/readme.md b/examples/001_first_example/readme.md
@@ -0,0 +1,97 @@
+### The First Project
+
+This example builds the same model as `../text_cls`, but from scratch.
+
+#### Prepare the BERT model and .intc.json
+
+Download the pretrained BERT model and its `tokenizer.json` to `./pretrain`.
+
+Add a `.intc.json` file in the current directory:
+```jsonc
+{
+    "entry": [
+        "config" // config files live in this directory; intc-lsp serves the files in it
+    ],
+    "src": [ // the source code for this project
+        "dlk", // the dlk package
+        "src" // the code we will write
+    ]
+}
+```
+
+#### Prepare the Dataset
+
+Load and convert the `sst2` dataset; see `./process.py`.
+
+```python
+import uuid
+
+import pandas as pd
+from datasets import load_dataset
+
+
+def flat(data):
+    """Convert column-oriented data (a dict of lists) into a list of row dicts."""
+    sentences = data["sentence"]
+    uuids = data["uuid"]
+    labelses = data["labels"]
+    return [
+        {"sentence": sentence, "labels": [labels], "uuid": uid}
+        for sentence, labels, uid in zip(sentences, labelses, uuids)
+    ]
+
+
+label_map = {0: "neg", 1: "pos"}
+
+data = load_dataset("sst2")  # load from Hugging Face datasets, or substitute your own data
+data = data.map(
+    lambda one: {
+        "sentence": one["sentence"],
+        "labels": label_map[one["label"]],
+        "uuid": str(uuid.uuid1()),
+    },
+    remove_columns=["sentence", "label"],
+)
+input = {
+    # convert the train and validation splits to pd.DataFrame
+    "train": pd.DataFrame(flat(data["train"].to_dict())),
+    "valid": pd.DataFrame(flat(data["validation"].to_dict())),
+}
+```
+
+#### Prepare the processor code
+
+dlk provides a general `@processor@default` processor for preprocessing data, but for this example we write a simple one. See the code in `./src/my_processor.py`, which is registered as `@processor@my`.
+
+#### Prepare the config of the preprocessor
+Configure `@processor@my`; see `./config/processor.jsonc`.
+
+#### Run the preprocessor
+
+Then run the preprocessing step:
+
+```bash
+python process.py
+```
+
+#### Prepare the model code
+
+dlk provides many modules, but here we just want to test our own.
+
+We define a simple classification model in `./src/my_model.py`, registered as `@model@my_model`.
+
+#### Prepare the DataModule
+dlk also provides a general datamodule, but for this case we implement our own.
+
+We define a simple datamodule in `./src/my_datamodule.py`, registered as `@datamodule@my_datamodule`.
+
+#### Prepare the `fit.jsonc`
+
+Besides `datamodule` and `model`, there are other modules such as `loss`, `optimizer`, and `scheduler`. They are easy to understand, so we reuse the ones built into dlk.
+
+See `./config/fit.jsonc`.
+
+
+#### Train your model
+
+Prepare `train.py`, then run:
+```bash
+python train.py
+```
diff --git a/examples/001_first_example/src/__init__.py b/examples/001_first_example/src/__init__.py
@@ -0,0 +1,5 @@
+# Copyright the author(s) of DLK.
+#
+# This source code is licensed under the Apache license found in the
+# LICENSE file in the root directory of this source tree.
+from . import my_datamodule, my_model, my_processor
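Review note: this `__init__.py` imports all three modules, and `process.py` imports `src` without using it directly. With a decorator-based registry, importing a module is what runs its decorators, which is presumably why these imports exist. The sketch below shows that general pattern in isolation. It is a hypothetical stand-in, not the `dlk` or `intc` registration API: `REGISTRY`, `register`, `build`, and the toy `MyModel` are all illustrative names.

```python
# Hypothetical registry keyed by (kind, name); not dlk's or intc's actual API.
REGISTRY = {}


def register(kind, name):
    """Return a class decorator that stores the class under (kind, name)."""

    def decorator(cls):
        key = (kind, name)
        if key in REGISTRY:
            raise KeyError(f"{key} is already registered")
        REGISTRY[key] = cls
        return cls

    return decorator


def build(kind, name, **params):
    """Look up a registered class and instantiate it with config parameters."""
    return REGISTRY[(kind, name)](**params)


# The decorator runs at import time, so a module must be imported
# before its class can be found by name.
@register("model", "my_model")
class MyModel:
    def __init__(self, bert_dim=768, label_num=2):
        self.bert_dim = bert_dim
        self.label_num = label_num


model = build("model", "my_model", label_num=2)
print(type(model).__name__, model.bert_dim, model.label_num)
```

Under this pattern a config key such as `@model@my_model` only needs the kind and name to find the class, so swapping components is a config change rather than a code change.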
diff --git a/examples/001_first_example/src/my_datamodule.py b/examples/001_first_example/src/my_datamodule.py
diff --git a/examples/001_first_example/src/my_model.py b/examples/001_first_example/src/my_model.py
diff --git a/examples/001_first_example/src/my_processor.py b/examples/001_first_example/src/my_processor.py
diff --git a/examples/001_first_example/train.py b/examples/001_first_example/train.py
diff --git a/examples/text_cls/readme.md b/examples/text_cls/readme.md
