
[Question FBGEMM_GPU] Adam optimizer not optimized #2824

Open
JacoCheung opened this issue Jul 11, 2024 · 8 comments

Comments

@JacoCheung

Hi team, I'm using the Adam optimizer for my model, but there is a warning regarding performance. Can it be resolved? Or do you have any quantitative numbers for the perf degradation?

[FBGEMM_GPU] NOTE: The training optimizer 'adam' is marked as
        EXPERIMENTAL and thus not optimized, in order to reduce code compilation
        times and build sizes!

I also noted that there was an earlier discussion about the optimizer; it seemed that Adam was not considered for optimization. I'd like to know what the current plan for Adam is. Thanks!
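For context, a minimal sketch of the kind of TBE setup that triggers this warning (the table shape, hyperparameters, and import paths are illustrative; the exact module layout varies across fbgemm_gpu releases):

```python
from fbgemm_gpu.split_embedding_configs import EmbOptimType as OptimType, SparseType
from fbgemm_gpu.split_table_batched_embeddings_ops_common import EmbeddingLocation
from fbgemm_gpu.split_table_batched_embeddings_ops_training import (
    ComputeDevice,
    SplitTableBatchedEmbeddingBagsCodegen,
)

# One embedding table: 1000 rows x 128 dims, held on the GPU, trained with Adam.
tbe = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[(1000, 128, EmbeddingLocation.DEVICE, ComputeDevice.CUDA)],
    optimizer=OptimType.ADAM,   # prints the EXPERIMENTAL warning quoted above
    learning_rate=1e-3,
    eps=1e-8,
    beta1=0.9,
    beta2=0.999,
    weights_precision=SparseType.FP32,
)
```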

@JacoCheung (Author)

Another issue: when I specify the output datatype as BF16, I get a "not implemented" error.

@sryap (Contributor) commented Jul 29, 2024

Hi @JacoCheung

I'm using the Adam optimizer for my model, but there is a warning regarding performance. Can it be resolved? Or do you have any quantitative numbers for the perf degradation?

You can move Adam off the experimental optimizer list by changing the setting "is_experimental_optimizer": True to False. This should make it more performant.

Another issue: when I specify the output datatype as BF16, I get a "not implemented" error.

We have enabled BF16 output for every optimizer. Could you share an error log?

@JacoCheung (Author)

Hi @sryap, thanks for your reply. I'll try this flag out.

Re BF16, it seems that the error is raised by the forward kernel (regardless of the optimizer). I was using v6.0.0; I checked the changelog just now and found out it's supported since v7.0.0.
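For reference, on versions that support it, BF16 output is requested through the TBE constructor's output_dtype argument. A minimal sketch (shapes, hyperparameters, and import paths are illustrative):

```python
import torch

from fbgemm_gpu.split_embedding_configs import EmbOptimType as OptimType, SparseType
from fbgemm_gpu.split_table_batched_embeddings_ops_common import EmbeddingLocation
from fbgemm_gpu.split_table_batched_embeddings_ops_training import (
    ComputeDevice,
    SplitTableBatchedEmbeddingBagsCodegen,
)

tbe = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[(1000, 128, EmbeddingLocation.DEVICE, ComputeDevice.CUDA)],
    optimizer=OptimType.ADAM,
    learning_rate=1e-3,
    weights_precision=SparseType.FP32,  # embedding weights stay in FP32
    output_dtype=SparseType.BF16,       # pooled forward output comes back as BF16
)

# Two bags over the single table: bag 0 -> rows {1, 2}, bag 1 -> row {3}.
indices = torch.tensor([1, 2, 3], dtype=torch.long, device="cuda")
offsets = torch.tensor([0, 2, 3], dtype=torch.long, device="cuda")
out = tbe(indices=indices, offsets=offsets)
assert out.dtype == torch.bfloat16
```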

@JacoCheung (Author)

Regarding the fp16 output dtype, fbgemm does not have a scaler for backward/update. Is this intended?

@sryap (Contributor) commented Jul 29, 2024

Regarding the fp16 output dtype, fbgemm does not have a scaler for backward/update. Is this intended?

Which scalar are you referring to?

@JacoCheung (Author)

The scaler used in mixed-precision training.

@sryap (Contributor) commented Aug 6, 2024

Could you please share the link to the scalar that you're referring to? Thanks

@JacoCheung (Author) commented Aug 13, 2024

Sorry for the confusion. Let me clarify a little bit.

The scaler I'm referring to is a generic concept in mixed-precision training, especially FP16 training. In an FP16 training scheme, the loss is usually scaled, so the dgrad computed in the backward pass is scaled as well, and there needs to be an unscaling step for the wgrad (or dgrad).

However, fbgemm_gpu fuses the update with the backward/dgrad pass (TBE has no explicit wgrad). So I would expect the forward() function of the TBE operator to accept a scaling factor and perform the dgrad/wgrad unscaling in the backward stage.
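To illustrate the gap being described: with a dense model, PyTorch's GradScaler unscales the gradients between backward() and the optimizer step, whereas with TBE the update happens inside the fused backward, which is what the request above is getting at. A minimal sketch of the usual dense-model pattern (the model, data, and hyperparameters are placeholders):

```python
import torch

model = torch.nn.Linear(128, 1).cuda()
optimizer = torch.optim.Adam(model.parameters())
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(torch.randn(32, 128, device="cuda")).mean()
    scaler.scale(loss).backward()  # all gradients are multiplied by the loss scale
    scaler.unscale_(optimizer)     # dense grads are divided back down here, before the update
    scaler.step(optimizer)
    scaler.update()

# With TBE, the embedding update is applied inside the fused backward, so the
# (still-scaled) wgrad would need to be unscaled there -- hence the request to
# pass the scale factor into the TBE forward()/backward.
```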
