
[BUG] any clue for MFU drop? #6727
Open · SeunghyunSEO opened this issue Nov 8, 2024 · 4 comments
Labels: bug (Something isn't working), training

SeunghyunSEO commented Nov 8, 2024

[Figure: MFU (y-axis) vs. training step (x-axis), showing periodic drops]

I'm testing Qwen 72B with the Hugging Face Trainer, and whenever I train a 72B-scale model with ZeRO-3 offload I hit a periodic performance drop.
I'm sorry that I can't share the training code, but it always happens, even when I reduce the batch tokens from 8k to 6k (B*T = valid batch tokens; B and T vary).
(For context, I'm testing on 40GB A100 GPUs, and this is a 12-node training run.)
Is there any logic in DeepSpeed that manages or flushes memory periodically? (I couldn't find any.)
Of course, I never evaluate the model or save a checkpoint in this window (the figure above does not even cover one epoch).
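
One way to check whether the dips line up with allocator or garbage-collector activity is to log gc and CUDA memory stats every step; a minimal sketch using a Trainer callback (the callback name is made up for illustration, not part of transformers or DeepSpeed):

import gc
import torch
from transformers import TrainerCallback

class MemoryProbeCallback(TrainerCallback):
    # Hypothetical probe: log gc and CUDA allocator state each step to see
    # what correlates with the MFU dips.
    def on_step_end(self, args, state, control, **kwargs):
        gen0, gen1, gen2 = gc.get_count()                 # pending objects per gc generation
        alloc = torch.cuda.memory_allocated() / 2**30     # GiB currently allocated
        reserved = torch.cuda.memory_reserved() / 2**30   # GiB held by the caching allocator
        print(f"step {state.global_step}: gc=({gen0},{gen1},{gen2}) "
              f"alloc={alloc:.1f}GiB reserved={reserved:.1f}GiB")

# trainer = Trainer(..., callbacks=[MemoryProbeCallback()])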

SeunghyunSEO added the bug and training labels on Nov 8, 2024
SeunghyunSEO (Author) commented:

The wall-clock breakdown log looks fine, and the DeepSpeed engine didn't warn about any potential memory-allocation issue in this run:

[2024-11-08 08:06:52,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1030.38
[2024-11-08 08:06:52,982] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 9598.41 | bwd_microstep: 22109.06 | bwd_inner_microstep: 22029.22 | bwd_allreduce_microstep: 79.70 | step_microstep: 2089.56
[2024-11-08 08:06:52,982] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 9598.39 | bwd: 22109.07 | bwd_inner: 22029.25 | bwd_allreduce: 79.71 | step: 2089.57
[2024-11-08 08:07:52,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 914.37
[2024-11-08 08:07:52,036] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 10297.89 | bwd_microstep: 22481.11 | bwd_inner_microstep: 22408.00 | bwd_allreduce_microstep: 72.99 | step_microstep: 1868.60
[2024-11-08 08:07:52,036] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 10297.87 | bwd: 22481.12 | bwd_inner: 22408.03 | bwd_allreduce: 72.99 | step: 1868.60
[2024-11-08 08:08:45,548] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 910.70
[2024-11-08 08:08:45,552] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 8604.31 | bwd_microstep: 22356.71 | bwd_inner_microstep: 22290.00 | bwd_allreduce_microstep: 66.60 | step_microstep: 1682.31
[2024-11-08 08:08:45,552] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 8604.30 | bwd: 22356.72 | bwd_inner: 22290.02 | bwd_allreduce: 66.60 | step: 1682.31
[2024-11-08 08:09:41,747] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 893.85
[2024-11-08 08:09:41,751] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 9596.46 | bwd_microstep: 22489.96 | bwd_inner_microstep: 22427.50 | bwd_allreduce_microstep: 62.35 | step_microstep: 2290.03
[2024-11-08 08:09:41,751] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 9596.45 | bwd: 22489.98 | bwd_inner: 22427.52 | bwd_allreduce: 62.35 | step: 2290.03
[2024-11-08 08:10:35,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 902.15
[2024-11-08 08:10:35,609] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 9191.02 | bwd_microstep: 22638.14 | bwd_inner_microstep: 22595.38 | bwd_allreduce_microstep: 42.65 | step_microstep: 1648.88
[2024-11-08 08:10:35,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 9191.01 | bwd: 22638.15 | bwd_inner: 22595.40 | bwd_allreduce: 42.66 | step: 1648.89
[2024-11-08 08:11:33,132] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 905.79
[2024-11-08 08:11:33,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 10292.54 | bwd_microstep: 22519.36 | bwd_inner_microstep: 22420.87 | bwd_allreduce_microstep: 98.35 | step_microstep: 2510.57
[2024-11-08 08:11:33,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 10292.52 | bwd: 22519.37 | bwd_inner: 22420.90 | bwd_allreduce: 98.37 | step: 2510.57

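For reference, those per-step timing lines come from DeepSpeed's wall-clock breakdown logging; a minimal config sketch that enables it (the ZeRO-3 offload settings below are illustrative placeholders, not the exact config from this run):

# Minimal DeepSpeed config sketch (assumed, not the exact config used here):
# "wall_clock_breakdown": True is what produces the fwd/bwd/step timing lines above.
ds_config = {
    "wall_clock_breakdown": True,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
    },
}
# With the HF Trainer this can be passed as TrainingArguments(..., deepspeed=ds_config).
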
jomayeri (Contributor) commented Nov 14, 2024

It happens every 100th step. Is there nothing in your code that runs every 100th step?

jomayeri self-assigned this Nov 14, 2024
SeunghyunSEO (Author) commented Nov 15, 2024

> It happens every 100th step. Is there nothing in your code that runs every 100th step?

Hi! Thanks for your response.
I'm just using the HF Trainer, and as far as I can tell nothing runs every 100 steps.
But I suspected gc, a memory flush, defragmentation, or something similar, so I inserted a gc.collect() call every 50 steps. The MFU drop then occurs every 50 steps (instead of every 100), and the size of the drop is reduced.
So I think there may be a memory-handling problem, but I couldn't find any such logic in DeepSpeed...
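
For reference, the forced collection was wired in roughly like the sketch below (a minimal sketch of the experiment described above; the callback name and the 50-step interval are illustrative, nothing built into transformers or DeepSpeed):

import gc
from transformers import TrainerCallback

class PeriodicGcCallback(TrainerCallback):
    # Hypothetical helper: force a full Python gc pass every `interval`
    # optimizer steps so the stall happens at a known, controlled point.
    def __init__(self, interval=50):
        self.interval = interval

    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step > 0 and state.global_step % self.interval == 0:
            gc.collect()

# trainer = Trainer(..., callbacks=[PeriodicGcCallback(interval=50)])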

jomayeri (Contributor) commented:
It may be garbage collection. The reason I'm skeptical is that gc wouldn't be that periodic in nature; every 100th step seems intentional.
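
One way to settle whether it is the Python gc would be to timestamp actual collections and check whether they line up with the 100-step period; a minimal probe sketch (the function name is illustrative):

import gc
import time

# Hypothetical probe: record when Python's cyclic gc actually runs, so the
# timestamps can be compared against the steps where MFU drops.
def gc_probe(phase, info):
    if phase == "start":
        print(f"[gc] {time.strftime('%H:%M:%S')} collecting generation {info['generation']}")

gc.callbacks.append(gc_probe)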
