[FP8 Training Feature Request] Smooth SwiGLU and Configurable AdamWFp8 #1691

Comments
We have an FP8 Adam optimizer in torchao, implemented by @gau-nernst. More details here: https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim#benchmarks
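For reference, a minimal sketch of dropping that optimizer into a training step; the `AdamWFp8` import path follows the prototype folder linked above and may move between torchao releases, and FP8 optimizer state assumes a recent PyTorch and a CUDA GPU with FP8 support.

```python
# Minimal sketch: torchao's FP8 AdamW as a drop-in replacement for torch.optim.AdamW.
# Import path follows the linked prototype folder; it may differ across torchao versions.
import torch
from torchao.prototype.low_bit_optim import AdamWFp8

model = torch.nn.Linear(4096, 4096, bias=False).cuda().bfloat16()
optim = AdamWFp8(model.parameters(), lr=1e-4, weight_decay=0.1)

x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)
loss = model(x).float().pow(2).mean()
loss.backward()
optim.step()
optim.zero_grad()
```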
@vkuzo thoughts on the priority of adding a Smooth-SwiGLU for float8 training?
Regarding the FP8 optimizer, they don't publish the code, but I think they use tensor-wise scaling for the FP8 optimizer state. Our implementation is different: we use group-wise scaling (block size 256 by default), similar to 8-bit AdamW from bitsandbytes, so our implementation should be better, and we don't need different dtypes for the 1st and 2nd moments. You can try it.
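To illustrate the group-wise scaling mentioned above, here is a rough sketch (illustrative names, not torchao's internal API) that quantizes an optimizer-state tensor to FP8 E4M3 with one scale per block of 256 elements rather than a single tensor-wise scale:

```python
# Illustrative sketch of group-wise FP8 (E4M3) quantization of an optimizer
# state tensor: one scale per block of 256 elements instead of one scale for
# the whole tensor. Not torchao's internal implementation.
import torch

def quantize_fp8_blockwise(state: torch.Tensor, block_size: int = 256):
    flat = state.detach().float().flatten()
    pad = (-flat.numel()) % block_size
    flat = torch.nn.functional.pad(flat, (0, pad))
    blocks = flat.view(-1, block_size)
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448 for E4M3
    # one scale per block, so each block maps into the representable FP8 range
    scale = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / fp8_max
    q = (blocks / scale).to(torch.float8_e4m3fn)
    return q, scale

def dequantize_fp8_blockwise(q: torch.Tensor, scale: torch.Tensor, numel: int):
    # upcast and rescale, then drop the padding added during quantization
    return (q.float() * scale).flatten()[:numel]
```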
Smooth-SwiGLU sounds interesting to try; we'd welcome community contributions to add it. Note that the paper states it is useful for delayed scaling. We lowered the priority of delayed scaling due to a lack of use cases / excitement in the community (see #1680); we plan to split it from the production API and move it to the prototype folder.
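For anyone who wants to pick this up, here is a minimal bf16 reference sketch of the Smooth-SwiGLU idea from the paper: a per-channel scale shrinks the SwiGLU output before the (would-be FP8) down projection and is folded back into the down-projection weight, so exact arithmetic is unchanged while the quantized activation has fewer outliers. The paper fuses this with the FP8 casts; module and buffer names here are illustrative.

```python
# Illustrative Smooth-SwiGLU-style MLP (after https://arxiv.org/abs/2409.12517).
# A per-channel scale is divided out of the SwiGLU output before the down
# projection and multiplied back into the down-projection weight, so exact
# arithmetic is unchanged while the activation fed to the (would-be FP8)
# down projection is smoother. This is a reference module, not a fused FP8 kernel.
import torch
import torch.nn.functional as F

class SmoothSwiGLU(torch.nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = torch.nn.Linear(dim, hidden, bias=False)
        self.w_up = torch.nn.Linear(dim, hidden, bias=False)
        self.w_down = torch.nn.Linear(hidden, dim, bias=False)
        # per-channel smoothing scale; in practice derived from activation stats
        self.register_buffer("scale", torch.ones(hidden))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.silu(self.w_gate(x)) * self.w_up(x)
        h = h / self.scale                       # smooth before the (FP8) cast
        w = self.w_down.weight * self.scale      # fold scale into input channels
        return F.linear(h, w)
```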
Ah I see, thanks for the pointers. Yeah, I guess Smooth-SwiGLU might still be interesting 👍 Sad to see delayed scaling being deprecated, but it is what it is. I'll keep this open for Smooth-SwiGLU for now.
Based on https://arxiv.org/abs/2409.12517, it would be nice to have the two features used there:

- Smooth-SwiGLU, to keep FP8 activations in the SwiGLU MLP well behaved
- an FP8 AdamW optimizer with configurable dtypes for the optimizer moments

I might have overlooked something, so let me know if this is already integrated. And thanks for the wonderful project :)