[Pytorch] Update context parallel softmax lse correction func #716
Conversation
Hi @Kite0011, thank you for your contribution! Could you please sign your commits (as outlined here: https://github.com/NVIDIA/TransformerEngine/blob/main/CONTRIBUTING.rst)?
Hi @Kite0011, really appreciate the fix. The math is correct. The problem is that the function should do an in-place update. softmax_lse is initialized here: softmax_lse = torch.clone(softmax_lse_per_step[0]).to(torch.double). Every time you call flash_attn_fwd_softmax_lse_correction, you should update that same initialized softmax_lse. Some calls update the whole softmax_lse, some only update half of it, but all of them should work on the same tensor; otherwise the final softmax_lse is wrong.
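For illustration, here is a minimal sketch of an in-place, overflow-safe LSE merge of the kind being discussed (the function name and calling pattern follow the conversation above; the shapes and loop are illustrative, not the actual TE call sites):

```python
import torch

def flash_attn_fwd_softmax_lse_correction(softmax_lse: torch.Tensor,
                                           softmax_lse_per_step: torch.Tensor) -> None:
    """Merge two log-sum-exp tensors in place: lse = log(exp(a) + exp(b)).

    The max-shifted form keeps exp() arguments non-positive, so nothing
    overflows even when the raw lse values are large.
    """
    max_scale = torch.max(softmax_lse, softmax_lse_per_step)
    min_scale = torch.min(softmax_lse, softmax_lse_per_step)
    new_scale = max_scale + torch.log1p(torch.exp(min_scale - max_scale))
    # In-place write: every caller keeps working on this same tensor.
    softmax_lse.copy_(new_scale)

# Illustrative accumulation loop: each step folds its lse into the same tensor.
softmax_lse_per_step = [torch.randn(2, 8, 1024, dtype=torch.double) * 100 for _ in range(4)]
softmax_lse = torch.clone(softmax_lse_per_step[0]).to(torch.double)
for lse_step in softmax_lse_per_step[1:]:
    flash_attn_fwd_softmax_lse_correction(softmax_lse, lse_step)
```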
@Kite0011 out of curiosity, did you actually hit a case where the value is out of range even for the double data type?
Thank you for your response, I will modify my PR according to this link later. |
Thank you for your reply, you are correct, the …
Yeah, I think changing it to softmax_lse.copy_(new_scale) should work. Interesting that even double cannot cover the range. However, with your fix, I think even an FP32 softmax_lse should now work. Anyway, that needs some further testing; you can leave it as double for now, and I will make a change if necessary. Thanks for the fix, really appreciate it.
I've made a revision. Could you check whether there is anything else that needs to be done?
I have made a revision, could you please review it again? Thank you for your hard work. It also seems you're right that the lse-related operations should all be safe at this point; but considering that the lse itself doesn't take much memory or time, it's also fine to keep the data type as double.
LGTM. Thanks! |
LGTM after signing the commits.
By the way, I would like to ask whether the accumulation of the forward out and the backward dq, dkv should also first be converted to fp32 for accumulation, and then converted back to fp16/bf16 before being copied back in place?
@Infi-zc I also considered this, but with BF16 results correction I did not see a loss curve issue. Did you encounter any case that needs FP32 accumulation?
Not yet; I'm not quite certain. I plan to conduct more experiments to verify this further.
Yeah, sounds good, let me know if you encounter the issue. Thanks. |
In my case, I didn't encounter a loss diff from bf16's fwd out and bwd dq, dkv; we have tried fp32 accumulation, but it didn't make much difference to the final result.
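For context, the two accumulation strategies being compared look roughly like this (a sketch under assumed shapes; the real per-step correction factors come from the softmax lse, but made-up values are enough to show the dtype handling):

```python
import torch

# Illustrative per-step partial outputs (bf16) and made-up correction factors.
per_step_out = [torch.randn(2, 1024, 8, 64, dtype=torch.bfloat16) for _ in range(4)]
factors = [0.1, 0.2, 0.3, 0.4]

# Option A: accumulate in fp32, cast back to bf16 only at the end.
acc = torch.zeros(2, 1024, 8, 64, dtype=torch.float32)
for f, o in zip(factors, per_step_out):
    acc += f * o.float()
out_fp32_path = acc.to(torch.bfloat16)

# Option B: correct and accumulate directly in bf16 (what the thread reports as
# sufficient in practice, with only ~1e-3..1e-2 step-level differences).
out_bf16_path = torch.zeros(2, 1024, 8, 64, dtype=torch.bfloat16)
for f, o in zip(factors, per_step_out):
    out_bf16_path += f * o
```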
@Kite0011 I see some unrelated commits in this PR - I think this is some rebase issue. Could you resolve that? |
5f8df41 to b381929
Signed-off-by: kitefang <[email protected]>
/te-ci pytorch
Merged, thank you for the contribution @Kite0011 ! |
I've noticed a slight discrepancy, with differences at the level of 1e-3 to 1e-2 at certain steps, when compared to the results without context parallelism (cp), or with cp where out and dqkv are accumulated using fp32. But their convergence trends are consistent. @xrennvidia @Kite0011
…#716) [Pytorch] Update context parallel softmax lse correction func. Signed-off-by: kitefang <[email protected]> Co-authored-by: kitefang <[email protected]> Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
The original implementation would result in 'nan' when the value of lse.exp() exceeds the range of double, causing incorrect values and gradients at the corresponding positions.
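As a rough numerical illustration of the failure mode (not the actual TE code): merging two LSE values as log(exp(a) + exp(b)) overflows once a value passes roughly 709, the largest argument exp() can take in double precision, whereas the max-shifted form stays finite. The resulting inf then shows up as nan as soon as an inf - inf occurs in the downstream corrections.

```python
import torch

a = torch.tensor([800.0], dtype=torch.double)  # an lse value beyond exp()'s double range
b = torch.tensor([790.0], dtype=torch.double)

# Naive merge: exp(800.) overflows to inf, so the merged lse becomes inf,
# and any later inf - inf (e.g. in an output correction factor) becomes nan.
naive = torch.log(a.exp() + b.exp())            # tensor([inf])

# Max-shifted merge: exp() only ever sees a non-positive argument.
m = torch.maximum(a, b)
stable = m + torch.log1p(torch.exp(torch.minimum(a, b) - m))  # ~tensor([800.0000])

print(naive, stable)
```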