
Problem when fine-tuning Qwen2-VL-7B-instruct: it reports len(videos) is less than the number of tokens. #6421

Open
yzd11 opened this issue Dec 22, 2024 · 0 comments
Labels
pending This problem is yet to be addressed

yzd11 commented Dec 22, 2024

I ran into a problem with a custom fine-tuning dataset. Here is a sample record from its messages column:

[
  { "role": "user", "content": "图中的组织类型是什么?" },
  { "role": "assistant", "content": "图中的组织切片显示了一个鼻内肿块的病理学样本,这是从鼻腔内部取样的。" },
  { "role": "user", "content": "放大倍率是多少?" },
  { "role": "assistant", "content": "图像的放大倍数为200倍。" },
  { "role": "user", "content": "根据病理学特征,诊断是什么?" },
  { "role": "assistant", "content": "诊断为B细胞淋巴瘤。B细胞淋巴瘤起源于B淋巴细胞,这是一种白血细胞。影像中显示,大量圆形大细胞弥漫性地遮盖了呼吸道上皮下的基底膜。结合其他临床和实验室资料,这一诊断得以确立。" }
]

Here is the configuration in my dataset_info.json:

"llava_med_zh_60k": {
  "file_name": "/root/autodl-tmp/tool/data/llava-med-zh-instruct-60k/data",
  "formatting": "sharegpt",
  "num_samples": 5000,
  "columns": {
    "messages": "messages",
    "images": "images"
  },
  "tags": {
    "role_tag": "role",
    "content_tag": "content",
    "user_tag": "user",
    "assistant_tag": "assistant"
  }
}
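
For comparison, the multimodal examples that ship with LLaMA-Factory (mllm_demo.json) put an <image> placeholder inside the user turn and list the image paths in a parallel images field. Assuming the same convention applies to my Parquet data (I am not certain it does), a single record should look roughly like this sketch, with a hypothetical image path:

# Sketch of one multimodal sharegpt record, following the <image>-placeholder
# convention from LLaMA-Factory's mllm_demo.json; the image path is made up.
record = {
    "messages": [
        {"role": "user", "content": "<image>图中的组织类型是什么?"},
        {"role": "assistant", "content": "图中的组织切片显示了一个鼻内肿块的病理学样本。"},
    ],
    "images": ["images/0001.jpg"],  # hypothetical path
}

# Sanity check: exactly one <image> token per entry in "images".
n_placeholders = sum(m["content"].count("<image>") for m in record["messages"])
assert n_placeholders == len(record["images"])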

Other datasets, such as alpaca-gpt4-data-zh, load fine; only this dataset fails.

Here is the error message I get:

[INFO|tokenization_utils_base.py:2209] 2024-12-23 03:22:13,019 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2209] 2024-12-23 03:22:13,019 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2209] 2024-12-23 03:22:13,019 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2209] 2024-12-23 03:22:13,019 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2209] 2024-12-23 03:22:13,019 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2209] 2024-12-23 03:22:13,019 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2475] 2024-12-23 03:22:13,470 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|image_processing_base.py:373] 2024-12-23 03:22:13,471 >> loading configuration file /root/autodl-tmp/models/Qwen2-VL-7B-Instruct/preprocessor_config.json
[INFO|image_processing_base.py:373] 2024-12-23 03:22:13,473 >> loading configuration file /root/autodl-tmp/models/Qwen2-VL-7B-Instruct/preprocessor_config.json
[INFO|image_processing_base.py:429] 2024-12-23 03:22:13,473 >> Image processor Qwen2VLImageProcessor {
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "patch_size": 14,
  "processor_class": "Qwen2VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "max_pixels": 12845056,
    "min_pixels": 3136
  },
  "temporal_patch_size": 2
}

[INFO|tokenization_utils_base.py:2209] 2024-12-23 03:22:13,474 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2209] 2024-12-23 03:22:13,474 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2209] 2024-12-23 03:22:13,474 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2209] 2024-12-23 03:22:13,474 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2209] 2024-12-23 03:22:13,474 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2209] 2024-12-23 03:22:13,474 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2475] 2024-12-23 03:22:13,944 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|processing_utils.py:755] 2024-12-23 03:22:14,746 >> Processor Qwen2VLProcessor:
- image_processor: Qwen2VLImageProcessor {
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "patch_size": 14,
  "processor_class": "Qwen2VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "max_pixels": 12845056,
    "min_pixels": 3136
  },
  "temporal_patch_size": 2
}

- tokenizer: Qwen2TokenizerFast(name_or_path='/root/autodl-tmp/models/Qwen2-VL-7B-Instruct', vocab_size=151643, model_max_length=32768, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

{
  "processor_class": "Qwen2VLProcessor"
}

[INFO|2024-12-23 03:22:14] llamafactory.data.template:157 >> Add <|im_end|> to stop words.
[INFO|2024-12-23 03:22:14] llamafactory.data.loader:157 >> Loading dataset /root/autodl-tmp/tool/data/identity.json...
[INFO|2024-12-23 03:22:15] llamafactory.data.loader:157 >> Loading dataset /root/autodl-tmp/tool/data/alpaca-gpt4-data-zh.json...
[INFO|2024-12-23 03:22:16] llamafactory.data.loader:157 >> Sampled 5000 examples from dataset /root/autodl-tmp/tool/data/alpaca-gpt4-data-zh.json.
[INFO|2024-12-23 03:22:16] llamafactory.data.loader:157 >> Loading dataset /root/autodl-tmp/tool/data/alpaca_data_cleaned.json...
[INFO|2024-12-23 03:22:17] llamafactory.data.loader:157 >> Sampled 5000 examples from dataset /root/autodl-tmp/tool/data/alpaca_data_cleaned.json.
[INFO|2024-12-23 03:22:17] llamafactory.data.loader:157 >> Loading dataset /root/autodl-tmp/tool/data/ChatMed_Consult-v0.3.json...
[INFO|2024-12-23 03:22:18] llamafactory.data.loader:157 >> Sampled 5000 examples from dataset /root/autodl-tmp/tool/data/ChatMed_Consult-v0.3.json.
[INFO|2024-12-23 03:22:18] llamafactory.data.loader:157 >> Loading dataset /root/autodl-tmp/tool/data/Chinese-medical-dialogue.json...
[INFO|2024-12-23 03:22:19] llamafactory.data.loader:157 >> Sampled 5000 examples from dataset /root/autodl-tmp/tool/data/Chinese-medical-dialogue.json.
[INFO|2024-12-23 03:22:19] llamafactory.data.loader:157 >> Loading dataset /root/autodl-tmp/tool/data/medical-zh-instruct.jsonl...
[INFO|2024-12-23 03:22:23] llamafactory.data.loader:157 >> Sampled 5000 examples from dataset /root/autodl-tmp/tool/data/medical-zh-instruct.jsonl.
[INFO|2024-12-23 03:22:23] llamafactory.data.loader:157 >> Loading dataset /root/autodl-tmp/tool/data/sharegpt_gpt4.jsonl...
[INFO|2024-12-23 03:22:24] llamafactory.data.loader:157 >> Sampled 5000 examples from dataset /root/autodl-tmp/tool/data/sharegpt_gpt4.jsonl.
[INFO|2024-12-23 03:22:24] llamafactory.data.loader:157 >> Loading dataset /root/autodl-tmp/tool/data/DISC-Med-SFT_released.jsonl...
[INFO|2024-12-23 03:22:25] llamafactory.data.loader:157 >> Sampled 5000 examples from dataset /root/autodl-tmp/tool/data/DISC-Med-SFT_released.jsonl.
[INFO|2024-12-23 03:22:25] llamafactory.data.loader:157 >> Loading dataset /root/autodl-tmp/tool/data/llava-med-zh-instruct-60k/data...
Setting num_proc from 16 to 14 for the train split as it only contains 14 shards.
Generating train split: 56649 examples [00:09, 5991.54 examples/s] 
[INFO|2024-12-23 03:22:35] llamafactory.data.loader:157 >> Sampled 5000 examples from dataset /root/autodl-tmp/tool/data/llava-med-zh-instruct-60k/data.
Converting format of dataset (num_proc=16): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5000/5000 [00:20<00:00, 249.41 examples/s]
Running tokenizer on dataset (num_proc=16):   0%|                                                                                                                          | 0/40091 [00:03<?, ? examples/s]
multiprocess.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/root/miniconda3/envs/llamafactory/lib/python3.11/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/llamafactory/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 614, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "/root/miniconda3/envs/llamafactory/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3470, in _map_single
    batch = apply_function_on_filtered_inputs(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/llamafactory/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3349, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/autodl-tmp/LLaMA-Factory/src/llamafactory/data/processors/supervised.py", line 107, in preprocess_supervised_dataset
    input_ids, labels = _encode_supervised_example(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/autodl-tmp/LLaMA-Factory/src/llamafactory/data/processors/supervised.py", line 48, in _encode_supervised_example
    messages = template.mm_plugin.process_messages(prompt + response, images, videos, processor)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/autodl-tmp/LLaMA-Factory/src/llamafactory/data/mm_plugin.py", line 615, in process_messages
    raise ValueError(f"`len(videos)` is less than the number of {VIDEO_PLACEHOLDER} tokens.")
ValueError: `len(videos)` is less than the number of <video> tokens.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/envs/llamafactory/bin/llamafactory-cli", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/autodl-tmp/LLaMA-Factory/src/llamafactory/cli.py", line 112, in main
    run_exp()
  File "/root/autodl-tmp/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/root/autodl-tmp/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 51, in run_sft
    dataset_module = get_dataset(template, model_args, data_args, training_args, stage="sft", **tokenizer_module)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/autodl-tmp/LLaMA-Factory/src/llamafactory/data/loader.py", line 270, in get_dataset
    dataset = _get_preprocessed_dataset(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/autodl-tmp/LLaMA-Factory/src/llamafactory/data/loader.py", line 205, in _get_preprocessed_dataset
    dataset = dataset.map(
              ^^^^^^^^^^^^
  File "/root/miniconda3/envs/llamafactory/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/llamafactory/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/llamafactory/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3185, in map
    for rank, done, content in iflatmap_unordered(
  File "/root/miniconda3/envs/llamafactory/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 654, in iflatmap_unordered
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/root/miniconda3/envs/llamafactory/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 654, in <listcomp>
    [async_result.get(timeout=0.05) for async_result in async_results]
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/llamafactory/lib/python3.11/site-packages/multiprocess/pool.py", line 774, in get
    raise self._value
ValueError: `len(videos)` is less than the number of <video> tokens.

But my data only contains images; there are no videos at all.
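
Judging from the traceback, mm_plugin.process_messages appears to consume one entry from videos for every <video> placeholder it finds in the message contents, and to raise once there are more placeholders than videos. A rough reconstruction of that check (my sketch, not the actual source):

VIDEO_PLACEHOLDER = "<video>"  # the default value, overridable via the env var

def check_video_tokens(messages: list[dict], videos: list) -> None:
    """Reconstruction of the failing check: every <video> placeholder in the
    message contents must be backed by one entry in `videos`."""
    num_tokens = sum(m["content"].count(VIDEO_PLACEHOLDER) for m in messages)
    if num_tokens > len(videos):
        raise ValueError(
            f"`len(videos)` is less than the number of {VIDEO_PLACEHOLDER} tokens."
        )

If that reading is right, the error means something in my messages is being counted as a <video> placeholder even though the videos list is empty.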

I tried setting the environment variable to

VIDEO_PLACEHOLDER=""

but nothing I tried made any difference.
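
In hindsight, an empty placeholder probably cannot work anyway: in Python the empty string is a substring of every string, so a count-based check (which is what I assume is used here) would see a "placeholder" between every pair of characters:

# Why VIDEO_PLACEHOLDER="" would backfire under a count-based check (my assumption):
print("abc".count(""))    # 4 -- the empty string matches before, between, and after every char
print("" in "any text")   # True -- every string contains the empty string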

I suspect it is because the dataset is stored in Parquet format, but I am not sure; this is my first time fine-tuning a large multimodal model and I have no prior experience.

I don't know the right way to handle this kind of data. Every tutorial I have seen uses JSON, but this dataset is in Parquet format.
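
As a sanity check, I would inspect the Parquet shards directly and search for literal <video> strings; a sketch using the datasets library is below. The *.parquet glob is my guess at the shard naming, and the column names come from my dataset_info.json above:

# Load the Parquet shards and scan for literal "<video>" strings in the
# message contents; the glob pattern is an assumption about the shard names.
from datasets import load_dataset

ds = load_dataset(
    "parquet",
    data_files="/root/autodl-tmp/tool/data/llava-med-zh-instruct-60k/data/*.parquet",
    split="train",
)

bad_rows = []
for i, row in enumerate(ds):
    if any("<video>" in m["content"] for m in row["messages"]):
        bad_rows.append(i)

print(f"{len(bad_rows)} of {len(ds)} rows contain a literal <video> string")
print("first offending rows:", bad_rows[:10])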

github-actions bot added the pending (This problem is yet to be addressed) label on Dec 22, 2024