
[Fp8 Training Feature Request] Smooth SwiGlu and Configurable AdamWFp8 #1691

Open
vasqu opened this issue Feb 10, 2025 · 5 comments

vasqu commented Feb 10, 2025

Based on https://arxiv.org/abs/2409.12517, it would be nice to have the two features used there:

  • Smooth-SwiGLU -> a modified SwiGLU that stabilizes FP8 training by taming activation outliers (not sure whether this is already possible with the current features); a rough sketch of the idea is at the end of this comment.
  • An AdamW optimizer with FP8 state that uses a different FP8 format per moment, i.e. E4M3 for the first moment and E5M2 for the second moment.

Might have overlooked things so lmk if it's already integrated or something. And thx for the wonderful project :)
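
For concreteness, here's a rough sketch of where the smoothing would sit in a SwiGLU MLP. The module name, the per-channel max heuristic, and the high-precision-only forward are illustrative assumptions on my part, not the paper's reference code; a real FP8 path would cast `h_smooth` to FP8 before the down projection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmoothSwiGLUMLP(nn.Module):
    """Sketch of a SwiGLU MLP with a per-channel smoothing scale applied
    before the down projection, in the spirit of Smooth-SwiGLU.
    Names and the scale heuristic are illustrative only."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Standard SwiGLU: SiLU(gate) * up.
        h = F.silu(self.w_gate(x)) * self.w_up(x)

        # Per-channel smoothing: divide out large channel magnitudes so the
        # hidden activation is better behaved for an FP8 cast, and fold the
        # inverse scale into the down projection so the math is unchanged
        # in high precision. (An FP8 path would cast `h_smooth` here.)
        s = h.detach().abs().amax(dim=tuple(range(h.ndim - 1))).clamp(min=1e-4)
        h_smooth = h / s
        return F.linear(h_smooth, self.w_down.weight * s)
```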

supriyar (Contributor) commented:

We have an FP8 Adam optimizer in torchao, implemented by @gau-nernst. More details here: https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim#benchmarks
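
A minimal usage sketch (the import path and class name may have changed since this was written, and the hyperparameters are placeholders; the linked README is the source of truth):

```python
import torch
from torchao.prototype.low_bit_optim import AdamWFp8  # see the README above for the current path

# Drop-in replacement for torch.optim.AdamW; optimizer state is stored in FP8
# with group-wise scales.
model = torch.nn.Linear(4096, 4096, device="cuda", dtype=torch.bfloat16)
optimizer = AdamWFp8(model.parameters(), lr=3e-4, weight_decay=0.1)

for _ in range(10):
    x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)
    loss = model(x).float().pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Note that the FP8 variant may require a recent GPU with native FP8 support; the linked README lists the exact requirements.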

supriyar (Contributor) commented:

@vkuzo, thoughts on the priority of adding Smooth-SwiGLU for float8 training?

gau-nernst (Collaborator) commented:

Regarding the FP8 optimizer: they don't publish their code, but I think they use tensor-wise scaling for the FP8 optimizer state. Our implementation is different - we use group-wise scaling (group size 256 by default), similar to the 8-bit AdamW from bnb (bitsandbytes) - so it should quantize the optimizer state more accurately, and we don't need different dtypes for the 1st and 2nd moments. You can give it a try.
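
To make the tensor-wise vs. group-wise distinction concrete, here is a toy illustration (not torchao's actual kernel; the block size and E4M3 choice just mirror what's described above):

```python
import torch

def quantize_fp8_groupwise(x: torch.Tensor, block_size: int = 256):
    # One scale per contiguous block of `block_size` elements, instead of a
    # single scale for the whole tensor. Toy version: assumes numel is a
    # multiple of block_size and uses E4M3 for the payload.
    flat = x.float().flatten().view(-1, block_size)
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    scale = flat.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / fp8_max
    q = (flat / scale).to(torch.float8_e4m3fn)
    return q, scale  # dequantize with q.float() * scale

# Because each block carries its own scale, quantization error stays local,
# which is why a single format (E4M3) can cover both Adam moments.
state = torch.randn(1024, 1024)
q, scale = quantize_fp8_groupwise(state)
recon = (q.float() * scale).view_as(state)
max_err = (recon - state).abs().max()
```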

vkuzo (Contributor) commented Feb 11, 2025

Smooth-SwiGLU -> a modified SwiGLU that stabilizes FP8 training by taming activation outliers (not sure whether this is already possible with the current features).

Smooth-SwiGLU sounds interesting to try; we'd welcome community contributions to add it.

Note that the paper states this is useful for delayed scaling. We lowered the priority of delayed scaling due to a lack of use cases / excitement in the community (see #1680); we plan to split it from the production API and move it to the prototype folder.

vasqu (Author) commented Feb 11, 2025

Ah, I see, thanks for the pointers. Yeah, I guess Smooth-SwiGLU might still be interesting 👍 Sad to see delayed scaling being deprecated, but it is what it is.

Will keep this issue open for Smooth-SwiGLU for now.
