Describe the bug
I’m encountering a runtime crash when using beam search (num_beams > 1) with the GOT-OCR2.0 model in swift infer.
The crash happens consistently and looks like a beam-dimension handling bug in the multimodal generation path (the Qwen2-based GOT-OCR2.0 model); a sketch of the relevant beam expansion follows below.
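For context, beam search expands every input along the batch dimension before decoding. A minimal sketch of that behavior, simplified from transformers' `_expand_inputs_for_generation` (not GOT- or ms-swift-specific code, and the last comment is my assumption about the root cause):

```python
import torch

# Rough sketch of what beam search does to the batch dimension before decoding
# (simplified; not the actual transformers source).
batch_size, seq_len, num_beams = 1, 32, 5
input_ids = torch.zeros(batch_size, seq_len, dtype=torch.long)

# Each sample is repeated num_beams times along dim 0, so the decoder sees an
# effective batch of batch_size * num_beams.
expanded_ids = input_ids.repeat_interleave(num_beams, dim=0)
print(expanded_ids.shape)  # torch.Size([5, 32])

# My assumption: the image features that GOT inserts into the embedding sequence
# are not expanded the same way, so a later reshape folds the beam factor into
# the hidden dimension (1024 * 5 = 5120) instead of the batch dimension.
```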
Model
stepfun-ai/GOT-OCR2_0
Command used
CUDA_VISIBLE_DEVICES=0 swift infer \
  --adapters checkpoint-226000 \
  --temperature 0 \
  --num_beams 5 \
  --repetition_penalty 1.08 \
  --val_dataset test.jsonl \
  --max_new_tokens 4096 \
  --stream false \
  --result_path 226k_beam5_results.jsonl
Expected behavior
Inference should complete normally with beam search enabled.
Actual behavior
Inference crashes immediately with a matrix shape mismatch during attention projection.
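For reference, the arithmetic behind the reported shapes (numbers come from the traceback below; the interpretation is my assumption):

```python
hidden_size = 1024    # Qwen2 hidden size; o_proj is Linear(1024, 1024)
num_beams = 5

mat1_last_dim = 5120  # last dim of the flattened attention output in the error
assert mat1_last_dim == hidden_size * num_beams  # 5120 = 1024 * 5

# F.linear requires mat1.shape[-1] == o_proj.in_features (1024), so the call
# fails with "mat1 and mat2 shapes cannot be multiplied (288x5120 and 1024x1024)".
```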
Full traceback
run sh: /home/vlm/dataset/printed/myenv/bin/python3 /home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/swift/cli/infer.py --adapters checkpoint-226000 --temperature 0 --num_beams 5 --repetition_penalty 1.08 --val_dataset test.jsonl --max_new_tokens 4096 --stream false --result_path 226k_beam5_results.jsonl
[INFO:swift] Successfully registered /home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/swift/llm/dataset/data/dataset_info.json.
[INFO:swift] Loading the model using model_dir: checkpoint-226000
[INFO:swift] Successfully loaded /home/vlm/dataset/printed/checkpoint-226000/args.json.
[INFO:swift] rank: -1, local_rank: -1, world_size: 1, local_world_size: 1
[INFO:swift] Downloading the model from ModelScope Hub, model_id: stepfun-ai/GOT-OCR2_0
Downloading Model from https://www.modelscope.cn to directory: /home/vlm/.cache/modelscope/hub/models/stepfun-ai/GOT-OCR2_0
[INFO:swift] Loading the model using model_dir: /home/vlm/.cache/modelscope/hub/models/stepfun-ai/GOT-OCR2_0
torch_dtype is deprecated! Use dtype instead!
[INFO:swift] Because len(args.val_dataset) > 0, setting split_dataset_ratio: 0.0
[INFO:swift] Setting args.lazy_tokenize: True
[INFO:swift] Setting args.eval_human: False
[INFO:swift] Global seed set to 42
[INFO:swift] args: InferArguments(model='stepfun-ai/GOT-OCR2_0', model_type='got_ocr2', model_revision=None, task_type='causal_lm', torch_dtype=torch.bfloat16, attn_impl=None, new_special_tokens=[], num_labels=None, problem_type=None, rope_scaling=None, device_map=None, max_memory={}, max_model_len=None, local_repo_path=None, init_strategy=None, template='got_ocr2', system=None, max_length=32768, truncation_strategy='delete', max_pixels=None, agent_template=None, norm_bbox=None, use_chat_template=True, padding_free=False, padding_side='right', loss_scale='default', sequence_parallel_size=1, response_prefix=None, template_backend='swift', dataset=[], val_dataset=['page_test.jsonl'], split_dataset_ratio=0.0, data_seed=42, dataset_num_proc=1, load_from_cache_file=True, dataset_shuffle=True, val_dataset_shuffle=False, streaming=False, interleave_prob=None, stopping_strategy='first_exhausted', shuffle_buffer_size=1000, download_mode='reuse_dataset_if_exists', columns={}, strict=False, remove_unused_columns=True, model_name=None, model_author=None, custom_dataset_info=[], quant_method=None, quant_bits=None, hqq_axis=None, bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_quant_type='nf4', bnb_4bit_use_double_quant=True, bnb_4bit_quant_storage=None, max_new_tokens=4096, temperature=0.0, top_k=None, top_p=None, repetition_penalty=1.08, num_beams=5, stream=False, stop_words=[], logprobs=False, top_logprobs=None, ckpt_dir='/home/vlm/dataset/printed/hindi_master_aug_dpi2/v0-20260113-092657/checkpoint-226000', lora_modules=[], tuner_backend='peft', train_type='lora', adapters=['/home/vlm/dataset/printed/hindi_master_aug_dpi2/v0-20260113-092657/checkpoint-226000'], external_plugins=[], seed=42, model_kwargs={}, load_args=True, load_data_args=False, packing=False, packing_length=None, lazy_tokenize=True, cached_dataset=[], custom_register_path=[], use_hf=False, hub_token=None, ddp_timeout=18000000, ddp_backend=None, ignore_args_error=False, use_swift_lora=False, vllm_gpu_memory_utilization=0.9, vllm_tensor_parallel_size=1, vllm_pipeline_parallel_size=1, vllm_enable_expert_parallel=False, vllm_max_num_seqs=256, vllm_max_model_len=None, vllm_disable_custom_all_reduce=True, vllm_enforce_eager=False, vllm_limit_mm_per_prompt={}, vllm_max_lora_rank=16, vllm_enable_prefix_caching=False, vllm_use_async_engine=False, vllm_quantization=None, vllm_reasoning_parser=None, vllm_disable_cascade_attn=False, vllm_data_parallel_size=1, gpu_memory_utilization=None, tensor_parallel_size=None, limit_mm_per_prompt=None, data_parallel_size=None, use_async_engine=None, sglang_tp_size=1, sglang_pp_size=1, sglang_dp_size=1, sglang_ep_size=1, sglang_enable_ep_moe=False, sglang_mem_fraction_static=None, sglang_context_length=None, sglang_disable_cuda_graph=False, sglang_quantization=None, sglang_kv_cache_dtype='auto', sglang_enable_dp_attention=False, sglang_disable_custom_all_reduce=True, lmdeploy_tp=1, lmdeploy_session_len=None, lmdeploy_cache_max_entry_count=0.8, lmdeploy_quant_policy=0, lmdeploy_vision_batch_size=1, merge_lora=False, safe_serialization=True, max_shard_size='5GB', infer_backend='pt', result_path='/home/vlm/dataset/printed/226k_beam5_results.jsonl', write_batch_size=1000, metric=None, max_batch_size=1, val_dataset_sample=None)
[INFO:swift] Downloading the model from ModelScope Hub, model_id: stepfun-ai/GOT-OCR2_0
Downloading Model from https://www.modelscope.cn to directory: /home/vlm/.cache/modelscope/hub/models/stepfun-ai/GOT-OCR2_0
[INFO:swift] Loading the model using model_dir: /home/vlm/.cache/modelscope/hub/models/stepfun-ai/GOT-OCR2_0
[INFO:swift] model_kwargs: {'device_map': 'cuda:0'}
torch_dtype is deprecated! Use dtype instead!
[INFO:swift] default_system: ' You should follow the instructions carefully and explain your answers in detail.'
[INFO:swift] max_length: 32768
[INFO:swift] response_prefix: ''
[INFO:swift] agent_template: hermes
[INFO:swift] norm_bbox: norm1000
[INFO:swift] Setting ROOT_IMAGE_DIR: None. You can adjust this hyperparameter through the environment variable: ROOT_IMAGE_DIR.
[INFO:swift] model: PeftModelForCausalLM(
(base_model): LoraModel(
(model): GOTQwenForCausalLM(
(model): GOTQwenModel(
(embed_tokens): Embedding(151860, 1024)
(layers): ModuleList(
(0-23): 24 x Qwen2DecoderLayer(
(self_attn): Qwen2Attention(
(q_proj): lora.Linear(
(base_layer): Linear(in_features=1024, out_features=1024, bias=True)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=1024, out_features=32, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=32, out_features=1024, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(k_proj): lora.Linear(
(base_layer): Linear(in_features=1024, out_features=1024, bias=True)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=1024, out_features=32, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=32, out_features=1024, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(v_proj): lora.Linear(
(base_layer): Linear(in_features=1024, out_features=1024, bias=True)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=1024, out_features=32, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=32, out_features=1024, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(o_proj): lora.Linear(
(base_layer): Linear(in_features=1024, out_features=1024, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=1024, out_features=32, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=32, out_features=1024, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
)
(mlp): Qwen2MLP(
(gate_proj): lora.Linear(
(base_layer): Linear(in_features=1024, out_features=2816, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=1024, out_features=32, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=32, out_features=2816, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(up_proj): lora.Linear(
(base_layer): Linear(in_features=1024, out_features=2816, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=1024, out_features=32, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=32, out_features=2816, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(down_proj): lora.Linear(
(base_layer): Linear(in_features=2816, out_features=1024, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=2816, out_features=32, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=32, out_features=1024, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(act_fn): SiLUActivation()
)
(input_layernorm): Qwen2RMSNorm((1024,), eps=1e-06)
(post_attention_layernorm): Qwen2RMSNorm((1024,), eps=1e-06)
)
)
(norm): Qwen2RMSNorm((1024,), eps=1e-06)
(rotary_emb): Qwen2RotaryEmbedding()
(vision_tower_high): ImageEncoderViT(
(patch_embed): PatchEmbed(
(proj): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
)
(blocks): ModuleList(
(0-11): 12 x Block(
(norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(qkv): Linear(in_features=768, out_features=2304, bias=True)
(proj): Linear(in_features=768, out_features=768, bias=True)
)
(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(mlp): MLPBlock(
(lin1): Linear(in_features=768, out_features=3072, bias=True)
(lin2): Linear(in_features=3072, out_features=768, bias=True)
(act): GELU(approximate='none')
)
)
)
(neck): Sequential(
(0): Conv2d(768, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): LayerNorm2d()
(2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(3): LayerNorm2d()
)
(net_2): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(net_3): Conv2d(512, 1024, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
)
(mm_projector_vary): Linear(in_features=1024, out_features=1024, bias=True)
)
(lm_head): Linear(in_features=1024, out_features=151860, bias=False)
)
)
)
[INFO:swift] Start time of running main: 2026-01-18 15:33:37.360416
[INFO:swift] swift.version: 2.36.0
[INFO:swift] request_config: RequestConfig(max_tokens=4096, temperature=0.0, top_k=None, top_p=None, repetition_penalty=1.08, num_beams=5, stop=[], seed=None, stream=False, logprobs=False, top_logprobs=None, n=1, best_of=None, presence_penalty=0.0, frequency_penalty=0.0, length_penalty=1.0, return_details=False)
[INFO:swift] val_dataset: Dataset({
features: ['images', 'messages'],
num_rows: 505
})
0%| | 0/505 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/swift/cli/infer.py", line 5, in <module>
infer_main()
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/swift/llm/infer/infer.py", line 291, in infer_main
return SwiftInfer(args).main()
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/swift/llm/base.py", line 49, in main
result = self.run()
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/swift/llm/infer/infer.py", line 91, in run
result = self.infer_dataset()
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/swift/llm/infer/infer.py", line 247, in infer_dataset
result_list += self._batch_infer(shard_dataset, request_config)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/swift/llm/infer/infer.py", line 278, in _batch_infer
resp_list = self.infer(val_dataset, request_config, template=self.template, use_tqdm=True, **self.infer_kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/swift/llm/infer/infer_engine/pt_engine.py", line 562, in infer
res += self._infer(
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/swift/llm/infer/infer_engine/pt_engine.py", line 525, in _infer
res = infer_func(**kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/swift/llm/infer/infer_engine/pt_engine.py", line 370, in _infer_full
output = dict(template.generate(self.model, **generate_kwargs))
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/swift/llm/template/base.py", line 682, in generate
return model.generate(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/peft/peft_model.py", line 1973, in generate
outputs = self.base_model.generate(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2564, in generate
result = decoding_method(
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/transformers/generation/utils.py", line 3265, in _beam_search
model_outputs = self(**model_inputs, return_dict=True)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
File "/home/vlm/.cache/huggingface/modules/transformers_modules/GOT_hyphen_OCR2_0/modeling_GOT.py", line 347, in forward
outputs = self.model(
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
File "/home/vlm/.cache/huggingface/modules/transformers_modules/GOT_hyphen_OCR2_0/modeling_GOT.py", line 300, in forward
return super(GOTQwenModel, self).forward(
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/transformers/utils/generic.py", line 1064, in wrapper
outputs = func(self, *args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 384, in forward
hidden_states = decoder_layer(
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/transformers/modeling_layers.py", line 94, in __call__
return super().__call__(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
return func(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 234, in forward
hidden_states, _ = self.self_attn(
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
return func(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 182, in forward
attn_output = self.o_proj(attn_output)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/peft/tuners/lora/layer.py", line 757, in forward
result = self.base_layer(x, *args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 134, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (288x5120 and 1024x1024)
0%| | 0/505 [00:08<?, ?it/s]
Key frame (repeated from the full traceback above):
File "transformers/models/qwen2/modeling_qwen2.py", line 182, in forward
attn_output = self.o_proj(attn_output)
Note:
- The size 5120 equals 1024 (hidden size) × num_beams (5), which suggests the beam dimension is being folded into the hidden dimension instead of being treated as part of the batch dimension (a rough expansion sketch follows after this list).
- The error occurs inside the Qwen2 attention path, at the o_proj projection.
- This only happens for multimodal generation (vision + text).
- Greedy decoding (num_beams=1) works correctly.
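A rough sketch of the kind of expansion I would expect for the multimodal inputs (purely illustrative; the function name, tensor name, and call site are assumptions, not actual ms-swift or GOT code):

```python
import torch

def expand_image_features_for_beams(image_features: torch.Tensor, num_beams: int) -> torch.Tensor:
    """Hypothetical helper: replicate per-sample image features along the batch
    dimension so each beam gets its own copy, matching the beam-expanded
    input_ids (batch_size * num_beams)."""
    # (batch, num_image_tokens, hidden) -> (batch * num_beams, num_image_tokens, hidden)
    return image_features.repeat_interleave(num_beams, dim=0)

# Example: (1, 256, 1024) -> (5, 256, 1024) for num_beams=5, so the hidden size
# stays 1024 instead of being folded into a 5120-wide dimension.
features = torch.randn(1, 256, 1024)
print(expand_image_features_for_beams(features, 5).shape)  # torch.Size([5, 256, 1024])
```

If the GOT forward path merges image embeddings prepared for the original batch with beam-expanded text embeddings, applying something like this before the merge should keep the batch sizes consistent; I have not verified where exactly the expansion is missing.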
Your hardware and system info
ms-swift version: 2.36.0
transformers version: 4.57.1
torch version: 2.9.0+cu128
CUDA available: True
CUDA version: 12.8
GPU: NVIDIA RTX 6000 Ada Generation
Python version: 3.10.12
OS: Ubuntu 24.04.3 LTS