[FP8 Training Feature Request] Smooth SwiGLU and Configurable AdamWFp8 #1691

Comments
We have an FP8 Adam optimizer in torchao, implemented by @gau-nernst. More details here: https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim#benchmarks
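For reference, a minimal sketch of dropping that optimizer into a training step; the `AdamWFp8` import path follows the prototype folder linked above and may move between torchao releases, and FP8 optimizer state assumes a recent PyTorch and a CUDA GPU with FP8 support.

```python
# Minimal sketch: torchao's FP8 AdamW as a drop-in replacement for torch.optim.AdamW.
# Import path follows the linked prototype folder; it may differ across torchao versions.
import torch
from torchao.prototype.low_bit_optim import AdamWFp8

model = torch.nn.Linear(4096, 4096, bias=False).cuda().bfloat16()
optim = AdamWFp8(model.parameters(), lr=1e-4, weight_decay=0.1)

x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)
loss = model(x).float().pow(2).mean()
loss.backward()
optim.step()
optim.zero_grad()
```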
@vkuzo thoughts on the priority of adding a Smooth-SwiGLU for float8 training?
Regarding the FP8 optimizer, they don't publish the code, but I think they use tensor-wise scaling for the FP8 optimizer state. Our implementation is different: we use group-wise scaling (block size 256 by default), similar to 8-bit AdamW from bitsandbytes, so our implementation should be better, and we don't need different dtypes for the 1st and 2nd moments. You can try it.
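To illustrate the group-wise scaling mentioned above, here is a rough sketch (illustrative names, not torchao's internal API) that quantizes an optimizer-state tensor to FP8 E4M3 with one scale per block of 256 elements rather than a single tensor-wise scale:

```python
# Illustrative sketch of group-wise FP8 (E4M3) quantization of an optimizer
# state tensor: one scale per block of 256 elements instead of one scale for
# the whole tensor. Not torchao's internal implementation.
import torch

def quantize_fp8_blockwise(state: torch.Tensor, block_size: int = 256):
    flat = state.detach().float().flatten()
    pad = (-flat.numel()) % block_size
    flat = torch.nn.functional.pad(flat, (0, pad))
    blocks = flat.view(-1, block_size)
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448 for E4M3
    # one scale per block, so each block maps into the representable FP8 range
    scale = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / fp8_max
    q = (blocks / scale).to(torch.float8_e4m3fn)
    return q, scale

def dequantize_fp8_blockwise(q: torch.Tensor, scale: torch.Tensor, numel: int):
    # upcast and rescale, then drop the padding added during quantization
    return (q.float() * scale).flatten()[:numel]
```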
Smooth-SwiGLU sounds interesting to try; we'd welcome community contributions to add it. Note that the paper states it is useful for delayed scaling. We lowered the priority of delayed scaling due to a lack of use cases / excitement in the community (see #1680); we plan to split it from the production API and move it to the prototype folder.
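For anyone who wants to pick this up, here is a minimal bf16 reference sketch of the Smooth-SwiGLU idea from the paper: a per-channel scale shrinks the SwiGLU output before the (would-be FP8) down projection and is folded back into the down-projection weight, so exact arithmetic is unchanged while the quantized activation has fewer outliers. The paper fuses this with the FP8 casts; module and buffer names here are illustrative.

```python
# Illustrative Smooth-SwiGLU-style MLP (after https://arxiv.org/abs/2409.12517).
# A per-channel scale is divided out of the SwiGLU output before the down
# projection and multiplied back into the down-projection weight, so exact
# arithmetic is unchanged while the activation fed to the (would-be FP8)
# down projection is smoother. This is a reference module, not a fused FP8 kernel.
import torch
import torch.nn.functional as F

class SmoothSwiGLU(torch.nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = torch.nn.Linear(dim, hidden, bias=False)
        self.w_up = torch.nn.Linear(dim, hidden, bias=False)
        self.w_down = torch.nn.Linear(hidden, dim, bias=False)
        # per-channel smoothing scale; in practice derived from activation stats
        self.register_buffer("scale", torch.ones(hidden))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.silu(self.w_gate(x)) * self.w_up(x)
        h = h / self.scale                       # smooth before the (FP8) cast
        w = self.w_down.weight * self.scale      # fold scale into input channels
        return F.linear(h, w)
```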
Ah I see, thanks for the pointers. Yeah, I guess Smooth-SwiGLU might still be interesting 👍 Sad to see delayed scaling being deprecated, but it is what it is. I'll keep this open for Smooth-SwiGLU for now.
Based on https://arxiv.org/abs/2409.12517, it would be nice to have the two features used there:

- Smooth-SwiGLU, to keep FP8 activations in the SwiGLU MLP well behaved
- an FP8 AdamW optimizer with configurable dtypes for the optimizer moments

I might have overlooked something, so let me know if this is already integrated. And thanks for the wonderful project :)