
[PyTorch] Adjusted the logic of MHA and DPA to enable speculative decoding #668

Merged · 14 commits merged into NVIDIA:main from pr_inference_params on Mar 6, 2024

Conversation

@Oleg-Goncharov (Collaborator) commented Feb 15, 2024:

This PR modifies the logic of MHA and DPA to use InferenceParams (a KV-cache) and extends the unfused softmax to support rectangular score matrices, enabling speculative decoding.

  • Added test cases to test_numerics.py that check whether the outputs of TransformerLayer and MultiheadAttention produced in a single full-sequence run match those produced incrementally with the KV-cache (a sketch follows this list)
  • The longer the input sequence, the larger the divergence between the outputs, due to accumulation of numerical error over the forward pass
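Below is a minimal sketch (not taken from the PR) of the behaviour the new tests exercise: a TransformerLayer run once over a full sequence is compared against the same layer run chunk by chunk through the KV-cache, where a chunk of several tokens stands in for draft tokens being verified during speculative decoding, so the unfused softmax sees a rectangular (chunk_len × total_kv_len) score matrix. The InferenceParams import path, its (max_batch_size, max_sequence_length) constructor, the sequence_len_offset attribute, and the inference_params forward argument are assumptions about Transformer Engine's post-merge PyTorch API.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.pytorch.attention import InferenceParams  # assumed import path

hidden, ffn, heads = 256, 1024, 4
seq_len, batch, chunk = 32, 2, 4  # chunk > 1 mimics verifying several draft tokens at once

layer = te.TransformerLayer(hidden, ffn, heads).cuda().eval()  # default causal self-attention, sbhd layout
x = torch.randn(seq_len, batch, hidden, device="cuda")

with torch.no_grad():
    # Reference: one full-sequence forward pass.
    full_out = layer(x)

    # Incremental: feed the sequence in chunks, reusing the KV-cache between calls.
    params = InferenceParams(max_batch_size=batch, max_sequence_length=seq_len)  # assumed signature
    outs = []
    for start in range(0, seq_len, chunk):
        outs.append(layer(x[start:start + chunk], inference_params=params))
        params.sequence_len_offset += chunk  # assumed attribute: advance the cache offset
    incremental_out = torch.cat(outs, dim=0)

# The two paths should agree up to numerical error; as noted above, the error grows
# with sequence length, so a loose tolerance is used here.
torch.testing.assert_close(full_out, incremental_out, atol=1e-2, rtol=1e-2)
```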

@Oleg-Goncharov added the enhancement (New feature or request) label on Feb 16, 2024
@Oleg-Goncharov changed the title from "[Pytorch] Adjusted the logic of MHA and DPA to enable speculative decoding" to "[PyTorch] Adjusted the logic of MHA and DPA to enable speculative decoding" on Feb 16, 2024
@Oleg-Goncharov (Collaborator, Author) commented Feb 16, 2024:

/te-ci

@ptrendx (Member) commented Feb 16, 2024:

/te-ci pytorch

Resolved review threads:
  • transformer_engine/pytorch/softmax.py — 2 threads (outdated)
  • transformer_engine/pytorch/attention.py — 6 threads (5 outdated)
@timmoon10 timmoon10 self-requested a review February 17, 2024 01:08
Oleg-Goncharov and others added 3 commits February 19, 2024 17:47
Co-authored-by: Tim Moon <[email protected]>
Signed-off-by: Oleg Goncharov <[email protected]>
@timmoon10 (Collaborator) left a comment:


LGTM, pending CI.

@timmoon10 (Collaborator):

/te-ci pytorch

Signed-off-by: Oleg Goncharov <[email protected]>
@timmoon10 (Collaborator):

/te-ci pytorch

@ksivaman (Member):

/te-ci pytorch

@ksivaman (Member) commented Mar 6, 2024:

/te-ci pytorch

@ksivaman ksivaman merged commit b459ccc into NVIDIA:main Mar 6, 2024
20 checks passed
@Oleg-Goncharov Oleg-Goncharov deleted the pr_inference_params branch March 6, 2024 20:40
Labels: enhancement (New feature or request) · 4 participants