support split qkv linear and sp overlap comm #415
base: main
SP is a fantastic piece of work: it is very elegant and concise. At the current stage, a transformer layer's forward and backward passes involve 8 all-to-all operations, with 5 opportunities for overlapping communication:

- Forward pass: the QKV matrix operations can be pipelined alongside some of the all-to-all communications.
- Backward pass: the DQ, DK, DV all-to-all communications can be pipelined alongside matrix operations.
- Backward pass: DO_w can run in parallel with DO_input, involving matrix operations and all-to-all communications.

Similar overlap-comm strategies are used in Megatron for TP/TP-sp parallelism.

I tested under conditions of 1N8C, zero1, disabled activation checkpointing, ds-sp=8, and gbs=16:

- 1B 64K
- 7B 16K

Both showed over 10% improvement (I also found that for mega-ds, using split QKV by itself can enhance performance by reducing slice + cat operations in fwd/bwd), even though some of the TFLOPs were already at a relatively good level.

Co-work with microsoft/Megatron-DeepSpeed#415

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Heyang Qin <[email protected]>
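The remark about split QKV avoiding slice + cat operations can be illustrated with a minimal NumPy sketch. This is not the Megatron-DeepSpeed implementation: the shapes, the weight names (`w_q`, `w_k`, `w_v`, `w_qkv`) and the simple column-wise concatenation are assumptions for illustration only, and real fused QKV layouts are typically interleaved per attention head.

```python
import numpy as np

# Hypothetical toy sizes, chosen only for illustration.
seq_len, hidden = 4, 8
rng = np.random.default_rng(0)

x = rng.standard_normal((seq_len, hidden))
w_q = rng.standard_normal((hidden, hidden))
w_k = rng.standard_normal((hidden, hidden))
w_v = rng.standard_normal((hidden, hidden))

# Fused path: one large matmul, then slice the result into Q, K and V.
# The slice here (and the matching concatenation of gradients in the
# backward pass) is the extra work the commit message refers to.
w_qkv = np.concatenate([w_q, w_k, w_v], axis=1)  # shape (hidden, 3 * hidden)
qkv = x @ w_qkv
q_fused, k_fused, v_fused = np.split(qkv, 3, axis=1)

# Split path: three independent matmuls, no slicing needed afterwards.
# Each output is available on its own, so its communication could in
# principle start while the next projection is still being computed.
q_split = x @ w_q
k_split = x @ w_k
v_split = x @ w_v

# Both paths compute the same projections.
assert np.allclose(q_fused, q_split)
assert np.allclose(k_fused, k_split)
assert np.allclose(v_fused, v_split)
```

The point of the sketch is only that the two paths are numerically equivalent; any performance difference depends on the actual kernels and communication schedule.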
microsoft/DeepSpeed#5691 is merged. @inkcherry, do you still need this PR to be reviewed? Can you resolve the conflicts on this branch?
@tohtana, @loadams: I noticed that microsoft/DeepSpeed#5691 is merged. Could you merge this one? Thanks!
Hello, when I run pretrain_gpt.py, I hit the following bugs:
@yingtongxiong If using this branch,
Hi @inkcherry - could you take a look at resolving the merge conflicts on this?
Hi @loadams, currently master mds + master ds (steps 197~200):
This branch + ds fix patch + overlap enabled (steps 197~200):
Hello, I have now hit this problem; the script I am running is pretrain_gpt.py.
I can run this shell script (with flash-v2 enabled and activation checkpointing disabled) if I don't enable the two overlap options.
@yingtongxiong
Works with microsoft/DeepSpeed#5691.
When using ds_sequence_parallel, set the following two flags to enable overlapped communication:
--split-qkv-linear
--ds-sequence-parallel-overlap-comm
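As a rough sketch of how the two flags behave on the command line, assuming they are plain boolean switches (an assumption; the PR text above only names them), they could be modeled with `argparse` like this. The `build_parser` helper is hypothetical and is not the repository's actual argument code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser modeling the two flags as boolean switches."""
    parser = argparse.ArgumentParser(description="overlap-comm flag sketch")
    # Assumed semantics: use separate Q, K, V linear layers instead of a fused one.
    parser.add_argument("--split-qkv-linear", action="store_true")
    # Assumed semantics: overlap sequence-parallel all-to-all with matmuls.
    parser.add_argument("--ds-sequence-parallel-overlap-comm", action="store_true")
    return parser


# Both flags passed: overlap is requested.
args = build_parser().parse_args(
    ["--split-qkv-linear", "--ds-sequence-parallel-overlap-comm"]
)
print(args.split_qkv_linear, args.ds_sequence_parallel_overlap_comm)  # True True

# No flags passed: both default to False.
defaults = build_parser().parse_args([])
print(defaults.split_qkv_linear, defaults.ds_sequence_parallel_overlap_comm)  # False False
```

In practice the flags would simply be appended to the existing pretrain_gpt.py launch arguments alongside the sequence-parallel settings already in use.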