Commit 72e6fad

Merge branch 'main' of https://github.com/hiyouga/LLaMA-Factory into qwen2_audio

BUAADreamer committed Jan 20, 2025
2 parents 4a5cec0 + 1f47b61
Showing 62 changed files with 537 additions and 233 deletions.
3 changes: 2 additions & 1 deletion README.md
@@ -244,6 +244,7 @@ Compared to ChatGLM's [P-Tuning](https://github.com/THUDM/ChatGLM2-6B/tree/main/
| [Pixtral](https://huggingface.co/mistralai) | 12B | pixtral |
| [Qwen/QwQ (1-2.5) (Code/Math/MoE)](https://huggingface.co/Qwen) | 0.5B/1.5B/3B/7B/14B/32B/72B/110B | qwen |
| [Qwen2-VL/QVQ](https://huggingface.co/Qwen) | 2B/7B/72B | qwen2_vl |
+| [Qwen2-Audio](https://huggingface.co/Qwen) | 7B | qwen2_audio |
| [Skywork o1](https://huggingface.co/Skywork) | 8B | skywork_o1 |
| [StarCoder 2](https://huggingface.co/bigcode) | 3B/7B/15B | - |
| [TeleChat2](https://huggingface.co/Tele-AI) | 3B/7B/35B/115B | telechat2 |
@@ -499,7 +500,7 @@ pip install .
2. Install transformers from the main branch.

```bash
-git clone -b https://github.com/huggingface/transformers.git
+git clone -b main https://github.com/huggingface/transformers.git
cd transformers
pip install .
```
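
To confirm that the source install actually picked up a main-branch build, a quick version check helps; a minimal sketch (the `.dev0` suffix is typical of development builds, not guaranteed):

```bash
# Print the installed transformers version; a main-branch source install
# usually reports a version string ending in ".dev0".
python -c "import transformers; print(transformers.__version__)"
```
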
3 changes: 2 additions & 1 deletion README_zh.md
@@ -246,6 +246,7 @@ https://github.com/user-attachments/assets/e6ce34b0-52d5-4f3e-a830-592106c4c272
| [Pixtral](https://huggingface.co/mistralai) | 12B | pixtral |
| [Qwen/QwQ (1-2.5) (Code/Math/MoE)](https://huggingface.co/Qwen) | 0.5B/1.5B/3B/7B/14B/32B/72B/110B | qwen |
| [Qwen2-VL/QVQ](https://huggingface.co/Qwen) | 2B/7B/72B | qwen2_vl |
+| [Qwen2-Audio](https://huggingface.co/Qwen) | 7B | qwen2_audio |
| [Skywork o1](https://huggingface.co/Skywork) | 8B | skywork_o1 |
| [StarCoder 2](https://huggingface.co/bigcode) | 3B/7B/15B | - |
| [TeleChat2](https://huggingface.co/Tele-AI) | 3B/7B/35B/115B | telechat2 |
@@ -501,7 +502,7 @@ pip install .
2. Install transformers from the main branch.

```bash
-git clone -b https://github.com/huggingface/transformers.git
+git clone -b main https://github.com/huggingface/transformers.git
cd transformers
pip install .
```
Binary file modified assets/wechat.jpg
Binary file modified assets/wechat_npu.jpg
46 changes: 46 additions & 0 deletions data/README.md
@@ -24,6 +24,7 @@ Currently we support datasets in **alpaca** and **sharegpt** format.
"tools": "the column name in the dataset containing the tool description. (default: None)",
"images": "the column name in the dataset containing the image inputs. (default: None)",
"videos": "the column name in the dataset containing the videos inputs. (default: None)",
"audios": "the column name in the dataset containing the audios inputs. (default: None)",
"chosen": "the column name in the dataset containing the chosen answers. (default: None)",
"rejected": "the column name in the dataset containing the rejected answers. (default: None)",
"kto_tag": "the column name in the dataset containing the kto tags. (default: None)"
@@ -150,6 +151,10 @@ An additional column `images` is required. Please refer to the [sharegpt](#share

An additional column `videos` is required. Please refer to the [sharegpt](#sharegpt-format) format for details.

+### Multimodal Audio Dataset
+
+An additional column `audios` is required. Please refer to the [sharegpt](#sharegpt-format) format for details.

## Sharegpt Format

### Supervised Fine-Tuning Dataset
@@ -374,6 +379,47 @@ Regarding the above dataset, the *dataset description* in `dataset_info.json` sh
}
```

+### Multimodal Audio Dataset
+
+- [Example dataset](mllm_audio_demo.json)
+
+Multimodal audio datasets require an `audios` column containing the paths to the input audios.
+
+The number of audios must match the number of `<audio>` tokens in the conversations.
+
+```json
+[
+  {
+    "conversations": [
+      {
+        "from": "human",
+        "value": "<audio>human instruction"
+      },
+      {
+        "from": "gpt",
+        "value": "model response"
+      }
+    ],
+    "audios": [
+      "audio path (required)"
+    ]
+  }
+]
+```
+
+Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:
+
+```json
+"dataset_name": {
+  "file_name": "data.json",
+  "formatting": "sharegpt",
+  "columns": {
+    "messages": "conversations",
+    "audios": "audios"
+  }
+}
+```
+
### OpenAI Format

The openai format is simply a special case of the sharegpt format, where the first message may be a system prompt.
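
The audio-count rule introduced above is easy to violate when assembling data by hand, so a small pre-flight check pays off. A minimal sketch, assuming a sharegpt-style file shaped like the example; the file name `mllm_audio_demo.json` is only an illustration:

```python
import json

# For every sample, the number of <audio> tags across all conversation turns
# should equal the number of entries in the "audios" column.
with open("mllm_audio_demo.json", encoding="utf-8") as f:
    samples = json.load(f)

for i, sample in enumerate(samples):
    tag_count = sum(turn["value"].count("<audio>") for turn in sample["conversations"])
    audio_count = len(sample.get("audios", []))
    if tag_count != audio_count:
        print(f"sample {i}: {tag_count} <audio> tags vs {audio_count} audio paths")
```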
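
Once such a dataset is registered in `dataset_info.json`, it slots into a training config shaped like the example YAML files touched later in this commit. A minimal sketch, not a config shipped with the repo: the dataset name `mllm_audio_demo`, the `qwen2_audio` template from the README table, and the model id `Qwen/Qwen2-Audio-7B-Instruct` are assumptions here.

```yaml
### model
model_name_or_path: Qwen/Qwen2-Audio-7B-Instruct  # assumed model id
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all

### dataset
dataset: mllm_audio_demo  # assumed registration name in dataset_info.json
template: qwen2_audio
cutoff_len: 2048

### output
output_dir: saves/qwen2_audio-7b/lora/sft

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
bf16: true
```
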
47 changes: 47 additions & 0 deletions data/README_zh.md
@@ -24,6 +24,7 @@
"tools": "数据集代表工具描述的表头名称(默认:None)",
"images": "数据集代表图像输入的表头名称(默认:None)",
"videos": "数据集代表视频输入的表头名称(默认:None)",
"audios": "数据集代表音频输入的表头名称(默认:None)",
"chosen": "数据集代表更优回答的表头名称(默认:None)",
"rejected": "数据集代表更差回答的表头名称(默认:None)",
"kto_tag": "数据集代表 KTO 标签的表头名称(默认:None)"
@@ -150,6 +151,10 @@ KTO datasets require an additional `kto_tag` column. Please refer to the [sharegpt](#s

Multimodal video datasets require an additional `videos` column. Please refer to the [sharegpt](#sharegpt-格式) format for details.

+### Multimodal Audio Dataset
+
+Multimodal audio datasets require an additional `audios` column. Please refer to the [sharegpt](#sharegpt-格式) format for details.

## Sharegpt Format

### Supervised Fine-Tuning Dataset
@@ -374,6 +379,48 @@ KTO datasets require an additional `kto_tag` column containing bool-typed human
}
```

+### Multimodal Audio Dataset
+
+- [Example dataset](mllm_audio_demo.json)
+
+Multimodal audio datasets require an additional `audios` column containing the paths to the input audios.
+
+Note that the number of audios must exactly match the number of `<audio>` tokens in the text.
+
+```json
+[
+  {
+    "conversations": [
+      {
+        "from": "human",
+        "value": "<audio>human instruction"
+      },
+      {
+        "from": "gpt",
+        "value": "model response"
+      }
+    ],
+    "audios": [
+      "audio path (required)"
+    ]
+  }
+]
+```
+
+For data in the above format, the *dataset description* in `dataset_info.json` should be:
+
+```json
+"dataset_name": {
+  "file_name": "data.json",
+  "formatting": "sharegpt",
+  "columns": {
+    "messages": "conversations",
+    "audios": "audios"
+  }
+}
+```
+
+
### OpenAI Format

The OpenAI format is simply a special case of the sharegpt format, where the first message may be a system prompt.
8 changes: 4 additions & 4 deletions examples/extras/adam_mini/qwen2_full_sft.yaml
@@ -34,7 +34,7 @@ bf16: true
ddp_timeout: 180000000

### eval
-val_size: 0.1
-per_device_eval_batch_size: 1
-eval_strategy: steps
-eval_steps: 500
+# val_size: 0.1
+# per_device_eval_batch_size: 1
+# eval_strategy: steps
+# eval_steps: 500
8 changes: 4 additions & 4 deletions examples/extras/apollo/llama3_full_sft.yaml
@@ -39,7 +39,7 @@ pure_bf16: true
ddp_timeout: 180000000

### eval
-val_size: 0.1
-per_device_eval_batch_size: 1
-eval_strategy: steps
-eval_steps: 500
+# val_size: 0.1
+# per_device_eval_batch_size: 1
+# eval_strategy: steps
+# eval_steps: 500
8 changes: 4 additions & 4 deletions examples/extras/badam/llama3_full_sft.yaml
@@ -37,7 +37,7 @@ lr_scheduler_type: cosine
warmup_ratio: 0.1

### eval
-val_size: 0.1
-per_device_eval_batch_size: 1
-eval_strategy: steps
-eval_steps: 500
+# val_size: 0.1
+# per_device_eval_batch_size: 1
+# eval_strategy: steps
+# eval_steps: 500
9 changes: 5 additions & 4 deletions examples/extras/fsdp_qlora/llama3_lora_sft.yaml
@@ -7,6 +7,7 @@ trust_remote_code: true
stage: sft
do_train: true
finetuning_type: lora
+lora_rank: 8
lora_target: all

### dataset
@@ -35,7 +36,7 @@ bf16: true
ddp_timeout: 180000000

### eval
-val_size: 0.1
-per_device_eval_batch_size: 1
-eval_strategy: steps
-eval_steps: 500
+# val_size: 0.1
+# per_device_eval_batch_size: 1
+# eval_strategy: steps
+# eval_steps: 500
8 changes: 4 additions & 4 deletions examples/extras/galore/llama3_full_sft.yaml
@@ -38,7 +38,7 @@ pure_bf16: true
ddp_timeout: 180000000

### eval
-val_size: 0.1
-per_device_eval_batch_size: 1
-eval_strategy: steps
-eval_steps: 500
+# val_size: 0.1
+# per_device_eval_batch_size: 1
+# eval_strategy: steps
+# eval_steps: 500
8 changes: 4 additions & 4 deletions examples/extras/llama_pro/llama3_freeze_sft.yaml
@@ -36,7 +36,7 @@ bf16: true
ddp_timeout: 180000000

### eval
-val_size: 0.1
-per_device_eval_batch_size: 1
-eval_strategy: steps
-eval_steps: 500
+# val_size: 0.1
+# per_device_eval_batch_size: 1
+# eval_strategy: steps
+# eval_steps: 500
9 changes: 5 additions & 4 deletions examples/extras/loraplus/llama3_lora_sft.yaml
@@ -6,6 +6,7 @@ trust_remote_code: true
stage: sft
do_train: true
finetuning_type: lora
+lora_rank: 8
lora_target: all
loraplus_lr_ratio: 16.0

@@ -35,7 +36,7 @@ bf16: true
ddp_timeout: 180000000

### eval
-val_size: 0.1
-per_device_eval_batch_size: 1
-eval_strategy: steps
-eval_steps: 500
+# val_size: 0.1
+# per_device_eval_batch_size: 1
+# eval_strategy: steps
+# eval_steps: 500
8 changes: 4 additions & 4 deletions examples/extras/mod/llama3_full_sft.yaml
@@ -35,7 +35,7 @@ pure_bf16: true
ddp_timeout: 180000000

### eval
-val_size: 0.1
-per_device_eval_batch_size: 1
-eval_strategy: steps
-eval_steps: 500
+# val_size: 0.1
+# per_device_eval_batch_size: 1
+# eval_strategy: steps
+# eval_steps: 500
9 changes: 5 additions & 4 deletions examples/extras/pissa/llama3_lora_sft.yaml
@@ -6,6 +6,7 @@ trust_remote_code: true
stage: sft
do_train: true
finetuning_type: lora
+lora_rank: 8
lora_target: all
pissa_init: true
pissa_iter: 16
@@ -37,7 +38,7 @@ bf16: true
ddp_timeout: 180000000

### eval
-val_size: 0.1
-per_device_eval_batch_size: 1
-eval_strategy: steps
-eval_steps: 500
+# val_size: 0.1
+# per_device_eval_batch_size: 1
+# eval_strategy: steps
+# eval_steps: 500
8 changes: 4 additions & 4 deletions examples/train_full/llama3_full_sft.yaml
@@ -34,7 +34,7 @@ bf16: true
ddp_timeout: 180000000

### eval
-val_size: 0.1
-per_device_eval_batch_size: 1
-eval_strategy: steps
-eval_steps: 500
+# val_size: 0.1
+# per_device_eval_batch_size: 1
+# eval_strategy: steps
+# eval_steps: 500
10 changes: 6 additions & 4 deletions examples/train_full/qwen2vl_full_sft.yaml
@@ -1,5 +1,7 @@
### model
model_name_or_path: Qwen/Qwen2-VL-7B-Instruct
+image_resolution: 262144
+video_resolution: 16384
trust_remote_code: true

### method
@@ -37,7 +39,7 @@ bf16: true
ddp_timeout: 180000000

### eval
-val_size: 0.1
-per_device_eval_batch_size: 1
-eval_strategy: steps
-eval_steps: 500
+# val_size: 0.1
+# per_device_eval_batch_size: 1
+# eval_strategy: steps
+# eval_steps: 500
9 changes: 5 additions & 4 deletions examples/train_lora/llama3_lora_dpo.yaml
@@ -6,6 +6,7 @@ trust_remote_code: true
stage: dpo
do_train: true
finetuning_type: lora
+lora_rank: 8
lora_target: all
pref_beta: 0.1
pref_loss: sigmoid # choices: [sigmoid (dpo), orpo, simpo]
@@ -36,7 +37,7 @@ bf16: true
ddp_timeout: 180000000

### eval
-val_size: 0.1
-per_device_eval_batch_size: 1
-eval_strategy: steps
-eval_steps: 500
+# val_size: 0.1
+# per_device_eval_batch_size: 1
+# eval_strategy: steps
+# eval_steps: 500
9 changes: 5 additions & 4 deletions examples/train_lora/llama3_lora_kto.yaml
@@ -6,6 +6,7 @@ trust_remote_code: true
stage: kto
do_train: true
finetuning_type: lora
+lora_rank: 8
lora_target: all
pref_beta: 0.1

@@ -35,7 +36,7 @@ bf16: true
ddp_timeout: 180000000

### eval
-val_size: 0.1
-per_device_eval_batch_size: 1
-eval_strategy: steps
-eval_steps: 500
+# val_size: 0.1
+# per_device_eval_batch_size: 1
+# eval_strategy: steps
+# eval_steps: 500
1 change: 1 addition & 0 deletions examples/train_lora/llama3_lora_ppo.yaml
@@ -7,6 +7,7 @@ trust_remote_code: true
stage: ppo
do_train: true
finetuning_type: lora
+lora_rank: 8
lora_target: all

### dataset