Advantage of the 4-bit Quantization #4

Open · amitsrivastava78 opened this issue Jun 12, 2019 · 8 comments

@amitsrivastava78
Hi @submission2019,
First of all, I would like to congratulate you for coming up with this paper and opening the GitHub project for analysis. I have gone through your paper and the GitHub project in depth, and I would like to know the following:

  1. What is the advantage of this approach over 8-bit quantization? Since all operations should be byte aligned, the mathematical operations should be at least 8-bit, and the storage also seems to be 8-bit aligned, so I cannot see where the advantage lies in doing 4-bit quantization. Also, I can see there is a drop in accuracy of about 2~3% compared to 8-bit quantization.

So maybe there is a bigger picture that I am not able to see; can you please point me in the right direction?

Regards
Amit

@submission2019
Owner

Hello.
The advantage of 4-bit weights and activations comes from the 2x reduction in bandwidth relative to 8-bit. Lots of neural network workloads are bandwidth bound, so reducing the number of bits increases throughput and reduces power consumption.

Of course, in order to benefit from 4-bit quantization we need dedicated HW that supports manipulation at resolutions lower than a byte (8 bits). Some HW vendors already offer experimental HW/features for enthusiasts to experiment with int4. For example, NVidia added support for the int4/uint4 datatypes as part of the CUDA 10 TensorCore HW.
On the other hand, a lot of academic and industrial research is focused on methods that bring the accuracy of int4 inference close to that of int8. The goal of our work is to suggest and evaluate such methods, allowing int4 inference of convolutional neural networks with relatively small degradation in accuracy.
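
As a rough illustration of the accuracy cost being discussed here, below is a minimal per-tensor symmetric fake-quantization sketch in plain PyTorch. This is not the repository's code; the function name and the simple per-tensor scaling are my own simplification, just to show how the 16 levels of int4 compare with the 256 levels of int8.

```python
import torch

def fake_quantize(x, num_bits=4):
    # Symmetric uniform quantization to 2**num_bits levels, simulated in
    # float ("fake quantization"), as commonly used to estimate the
    # accuracy impact of low-precision inference.
    qmax = 2 ** (num_bits - 1) - 1          # 7 for int4, 127 for int8
    scale = x.abs().max() / qmax            # single per-tensor scale
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q * scale                        # dequantize back to float

x = torch.randn(1, 64, 56, 56)
print((x - fake_quantize(x, 4)).abs().mean())   # quantization error at 4 bits
print((x - fake_quantize(x, 8)).abs().mean())   # noticeably smaller at 8 bits
```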

@amitsrivastava78
Author

@submission2019 , @ynahshan , thanks for pointing me in the right direction. The paper looks promising; have you thought about commercializing this solution in any product?
Also, when using your algorithm on MobileNet the accuracy is very low; can you throw some light on this?

Regards
Amit

@submission2019
Owner

Hi.
We didn't try to apply our methods to MobileNet, so I don't know the reason for the poor results you observe. It could be related to the depthwise convolutions that MobileNet mostly consists of. Unfortunately, given the diversity of deep learning models, it is often necessary to analyse the model and fine-tune the quantization methods for the specific model.
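
One way to see why the depthwise layers are a plausible culprit is to look at how much the per-channel weight ranges vary inside them. Below is a small diagnostic sketch using torchvision's mobilenet_v2, not this repository's code; a large spread suggests a single per-tensor scale wastes most of the 16 levels available at 4 bits.

```python
import torch
from torchvision.models import mobilenet_v2

# Depthwise convolutions have groups == in_channels; print the spread of
# per-output-channel weight ranges for each of them.
model = mobilenet_v2(pretrained=True)
for name, m in model.named_modules():
    if isinstance(m, torch.nn.Conv2d) and m.groups == m.in_channels and m.in_channels > 1:
        per_ch = m.weight.detach().abs().amax(dim=(1, 2, 3))
        print(f"{name}: max/min per-channel range = {(per_ch.max() / per_ch.min()).item():.1f}")
```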

@amitsrivastava78
Author

@submission2019 , @ynahshan thanks for the reply, I am closing this issue. If I manage to improve the MobileNet accuracy I will post the code and method here as well.

Regards
Amit

@amitsrivastava78
Author

@submission2019 , thanks for the reply regarding the MobileNet part. Yes, we are facing the same low-accuracy issue with MobileNetV2. Can you please describe the measures you have taken? For us the Top-1 accuracy with 4-bit for mobilenet_v2 comes to ~49%; can you please tell us the exact steps for getting it to 70%?

Regards
Amit

@limerainne

Dear @amitsrivastava78,

In my previous comment, I made a mistake in the test (I accidentally set the bit width to 8-bit), which produced an incorrectly high accuracy.

Sorry for the wrong information and for deleting my comment without proper notice.

P.S. To avoid confusion (since the authors were referred to in your comment): I'm not related to the authors.

@jonathanbonnard

jonathanbonnard commented Nov 7, 2019

Hi,
I have encountered the same problem with MobileNetV2 and I think I know where the problem is.
In fact, the program is quantizing the 3rd sub-layer (aka the linear bottleneck), but it should not. The output of this sub-layer has to be kept at 4 bits + 4 bits + log2(nb_out_channels), otherwise the dynamic range will be clipped and this leads to wrong input values for the next 1x1 convolution.
However, I don't know where the program should be modified to change this behaviour... Maybe the authors can help?
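
One possible way to experiment with this, sketched with torchvision's MobileNetV2 rather than this repository's code (the `should_quantize` hook below is hypothetical, only meant to show which modules would be excluded from 4-bit quantization):

```python
import torch
from torchvision.models import mobilenet_v2

# Collect the projection ("linear bottleneck") 1x1 convolutions of each
# inverted residual block, so a quantization wrapper could leave their
# outputs at higher precision instead of clipping them to 4 bits.
model = mobilenet_v2(pretrained=True)
skip = set()
for block in model.modules():
    if type(block).__name__ == "InvertedResidual":
        convs = [m for m in block.conv if isinstance(m, torch.nn.Conv2d)]
        skip.add(id(convs[-1]))   # the last bare Conv2d in the block is the linear projection

def should_quantize(module):
    # Hypothetical hook: a quantization wrapper would call this before
    # wrapping a layer, leaving the linear-bottleneck outputs alone.
    return id(module) not in skip
```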

@ghost

ghost commented Feb 8, 2020

Hi,
I want to save the quantized model and analyze its metrics such as inference time, model size, FLOPs, and parameter count. Can anyone give me some advice? Or have you already finished this?
@amitsrivastava78 @submission2019 @limerainne @jonathanbonnard @ynahshan
Thanks a lot!
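
Not an answer from the authors, but a minimal sketch of the size/parameter/latency part in plain PyTorch (the model below is a stand-in; swap in the quantized one). Note that fake-quantized weights stored as float32 will not show a smaller file on disk; real savings require packing the int4 values. FLOPs are usually counted with a separate tool such as ptflops or thop.

```python
import os, time, torch
from torchvision.models import mobilenet_v2

model = mobilenet_v2()      # stand-in; replace with the quantized model
model.eval()

# Model size: serialize the state dict and check the file size on disk.
torch.save(model.state_dict(), "model.pth")
print("size (MB):", os.path.getsize("model.pth") / 1e6)

# Parameter count.
print("params (M):", sum(p.numel() for p in model.parameters()) / 1e6)

# Rough CPU latency: warm up once, then average over a few runs.
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    model(x)                                    # warm-up
    t0 = time.time()
    for _ in range(20):
        model(x)
print("latency (ms):", (time.time() - t0) / 20 * 1000)
```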
