
Process hangs when using the given metrics to evaluate in PRM multi-GPU training #83

@great-luao

Description


System Info

I'm on a Linux platform and my Python environment matches requirements.txt.
During multi-GPU training of the PRM, the process always hangs at some point in the evaluation loop. After debugging, I found that the hang is caused by the provided "preprocess_logits_for_metrics" function in prm/code/finetune_qwen.py. Strangely, evaluation works fine in single-GPU training.
After some further tests, I found that the hang only occurs when the two GPUs are processing data samples with different numbers of steps: as shown below, the final logits have different lengths after we take out the step tags.
[screenshot: the logits returned by the two ranks differ in length after step-tag extraction]

For now, the bug can be worked around by adding padding code in Transformers' Trainer and a few small tricks, but I wonder whether it could be fixed permanently, or by setting some parameters.
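A more permanent fix is to make `preprocess_logits_for_metrics` itself return a fixed-shape tensor, so the cross-rank gather in Trainer's evaluation loop never sees mismatched lengths. The sketch below is a minimal illustration of that idea, not the actual function from finetune_qwen.py: `STEP_TAG_ID`, `MAX_STEPS`, and `PAD_VALUE` are assumed names and values.

```python
import torch

# Assumptions (not from finetune_qwen.py):
STEP_TAG_ID = 7    # token id of the step tag; in practice taken from the tokenizer
MAX_STEPS = 16     # upper bound on reasoning steps per sample
PAD_VALUE = -100   # sentinel that compute_metrics should filter out

def preprocess_logits_for_metrics(logits, labels):
    """Return a fixed-shape (batch, MAX_STEPS) tensor on every rank.

    Each row holds the model's score at the step-tag positions, padded
    with PAD_VALUE, so ranks with different step counts still produce
    tensors of identical shape and the distributed gather cannot hang.
    """
    scores = logits.softmax(dim=-1)  # (batch, seq_len, vocab)
    padded = torch.full((logits.size(0), MAX_STEPS), float(PAD_VALUE))
    for i in range(logits.size(0)):
        mask = labels[i] == STEP_TAG_ID             # step-tag positions
        step_scores = scores[i, mask, STEP_TAG_ID]  # one score per step
        n = min(step_scores.numel(), MAX_STEPS)
        padded[i, :n] = step_scores[:n]
    return padded
```

With this, the paired `compute_metrics` would simply drop entries equal to `PAD_VALUE` before computing accuracy, so the padding never affects the reported numbers.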

[screenshots: padding code added in Transformers' Trainer as a workaround]

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the codebase (such as scripts/, ...)
  • My own task or dataset (give details below)

Reproduction

Simply run finetune_qwen.py with a multi-GPU setting.

Expected behavior

The evaluation loop should complete without hanging in the multi-GPU setting.

Metadata

Labels: bug (Something isn't working)