
[BUG] any clue for MFU drop? #6727
Open · SeunghyunSEO opened this issue Nov 8, 2024 · 4 comments
Labels: bug (Something isn't working), training

SeunghyunSEO commented Nov 8, 2024

[Figure: MFU (y-axis) vs. training step (x-axis), showing periodic drops]

I'm testing Qwen 72B with the Hugging Face Trainer, and whenever I train a 72B-scale model with ZeRO-3 offload I hit a periodic performance drop.
I'm sorry that I can't share the training code, but it always happens, even when I reduce the batch tokens from 8k to 6k (B*T = valid batch tokens; B and T vary).
(For context, I'm testing on 40GB A100 GPUs, and this is a 12-node training run.)
Is there any logic in DeepSpeed that manages or flushes memory periodically? (I couldn't find any.)
Of course, I never evaluate the model or save a checkpoint in this window (the figure above does not even cover one epoch).
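
One way to check whether the dips line up with allocator or garbage-collector activity is to log gc and CUDA memory stats every step; a minimal sketch using a Trainer callback (the callback name is made up for illustration, not part of transformers or DeepSpeed):

import gc
import torch
from transformers import TrainerCallback

class MemoryProbeCallback(TrainerCallback):
    # Hypothetical probe: log gc and CUDA allocator state each step to see
    # what correlates with the MFU dips.
    def on_step_end(self, args, state, control, **kwargs):
        gen0, gen1, gen2 = gc.get_count()                 # pending objects per gc generation
        alloc = torch.cuda.memory_allocated() / 2**30     # GiB currently allocated
        reserved = torch.cuda.memory_reserved() / 2**30   # GiB held by the caching allocator
        print(f"step {state.global_step}: gc=({gen0},{gen1},{gen2}) "
              f"alloc={alloc:.1f}GiB reserved={reserved:.1f}GiB")

# trainer = Trainer(..., callbacks=[MemoryProbeCallback()])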

SeunghyunSEO added the bug and training labels on Nov 8, 2024
SeunghyunSEO (Author) commented:

The wall-clock breakdown log looks fine, and the DeepSpeed engine didn't warn about any potential memory-allocation issue in this run:

[2024-11-08 08:06:52,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1030.38
[2024-11-08 08:06:52,982] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 9598.41 | bwd_microstep: 22109.06 | bwd_inner_microstep: 22029.22 | bwd_allreduce_microstep: 79.70 | step_microstep: 2089.56
[2024-11-08 08:06:52,982] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 9598.39 | bwd: 22109.07 | bwd_inner: 22029.25 | bwd_allreduce: 79.71 | step: 2089.57
[2024-11-08 08:07:52,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 914.37
[2024-11-08 08:07:52,036] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 10297.89 | bwd_microstep: 22481.11 | bwd_inner_microstep: 22408.00 | bwd_allreduce_microstep: 72.99 | step_microstep: 1868.60
[2024-11-08 08:07:52,036] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 10297.87 | bwd: 22481.12 | bwd_inner: 22408.03 | bwd_allreduce: 72.99 | step: 1868.60
[2024-11-08 08:08:45,548] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 910.70
[2024-11-08 08:08:45,552] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 8604.31 | bwd_microstep: 22356.71 | bwd_inner_microstep: 22290.00 | bwd_allreduce_microstep: 66.60 | step_microstep: 1682.31
[2024-11-08 08:08:45,552] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 8604.30 | bwd: 22356.72 | bwd_inner: 22290.02 | bwd_allreduce: 66.60 | step: 1682.31
[2024-11-08 08:09:41,747] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 893.85
[2024-11-08 08:09:41,751] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 9596.46 | bwd_microstep: 22489.96 | bwd_inner_microstep: 22427.50 | bwd_allreduce_microstep: 62.35 | step_microstep: 2290.03
[2024-11-08 08:09:41,751] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 9596.45 | bwd: 22489.98 | bwd_inner: 22427.52 | bwd_allreduce: 62.35 | step: 2290.03
[2024-11-08 08:10:35,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 902.15
[2024-11-08 08:10:35,609] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 9191.02 | bwd_microstep: 22638.14 | bwd_inner_microstep: 22595.38 | bwd_allreduce_microstep: 42.65 | step_microstep: 1648.88
[2024-11-08 08:10:35,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 9191.01 | bwd: 22638.15 | bwd_inner: 22595.40 | bwd_allreduce: 42.66 | step: 1648.89
[2024-11-08 08:11:33,132] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 905.79
[2024-11-08 08:11:33,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 10292.54 | bwd_microstep: 22519.36 | bwd_inner_microstep: 22420.87 | bwd_allreduce_microstep: 98.35 | step_microstep: 2510.57
[2024-11-08 08:11:33,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 10292.52 | bwd: 22519.37 | bwd_inner: 22420.90 | bwd_allreduce: 98.37 | step: 2510.57

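For reference, those per-step timing lines come from DeepSpeed's wall-clock breakdown logging; a minimal config sketch that enables it (the ZeRO-3 offload settings below are illustrative placeholders, not the exact config from this run):

# Minimal DeepSpeed config sketch (assumed, not the exact config used here):
# "wall_clock_breakdown": True is what produces the fwd/bwd/step timing lines above.
ds_config = {
    "wall_clock_breakdown": True,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
    },
}
# With the HF Trainer this can be passed as TrainingArguments(..., deepspeed=ds_config).
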
jomayeri (Contributor) commented Nov 14, 2024

It happens every 100th step. Is there nothing in your code that runs every 100th step?

jomayeri self-assigned this Nov 14, 2024
SeunghyunSEO (Author) commented Nov 15, 2024

> It happens every 100th step. Is there nothing in your code that runs every 100th step?

Hi! Thanks for your response.
I'm just using the HF Trainer, and as far as I can tell nothing runs every 100 steps.
But I suspected gc, a memory flush, defragmentation, or something similar, so I inserted a gc.collect() call every 50 steps. The MFU drop then occurs every 50 steps (instead of every 100), and the size of the drop is reduced.
So I think there may be a memory-handling problem, but I couldn't find any such logic in DeepSpeed...
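
For reference, the forced collection was wired in roughly like the sketch below (a minimal sketch of the experiment described above; the callback name and the 50-step interval are illustrative, nothing built into transformers or DeepSpeed):

import gc
from transformers import TrainerCallback

class PeriodicGcCallback(TrainerCallback):
    # Hypothetical helper: force a full Python gc pass every `interval`
    # optimizer steps so the stall happens at a known, controlled point.
    def __init__(self, interval=50):
        self.interval = interval

    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step > 0 and state.global_step % self.interval == 0:
            gc.collect()

# trainer = Trainer(..., callbacks=[PeriodicGcCallback(interval=50)])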

jomayeri (Contributor) commented:
It may be garbage collection. The reason I'm skeptical is that gc wouldn't be that periodic in nature; every 100th step seems intentional.
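
One way to settle whether it is the Python gc would be to timestamp actual collections and check whether they line up with the 100-step period; a minimal probe sketch (the function name is illustrative):

import gc
import time

# Hypothetical probe: record when Python's cyclic gc actually runs, so the
# timestamps can be compared against the steps where MFU drops.
def gc_probe(phase, info):
    if phase == "start":
        print(f"[gc] {time.strftime('%H:%M:%S')} collecting generation {info['generation']}")

gc.callbacks.append(gc_probe)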
