Optimize quantization process with QTensor::quantize_onto #2408
Motivation:
The current `QTensor::quantize` quantizes the `src` tensor onto the same device as `src`. This behavior is OK for most use cases, but there is a specific condition where it is problematic: any time you are quantizing a tensor that is not on the CPU. This is because we only support quantization on the CPU. To implement quantization on a non-CPU device, we currently do the following:
1. Quantize on the CPU.
2. Trigger a synchronizing htod copy of the quantized data onto the device (the same applies for Metal).
Because of the 2 copies and the fact that we are synchronizing the CUDA device (I'm not sure about the semantics for Metal, but we are certainly copying the data), this hurts performance!
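For concreteness, here is a minimal sketch of the current path on a CUDA device; the shape, dtype, and variable names are illustrative and not taken from this PR:

```rust
use candle_core::quantized::{GgmlDType, QTensor};
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let cuda = Device::new_cuda(0)?;
    // A full-precision weight tensor that already lives on the GPU.
    let cuda_weights = Tensor::randn(0f32, 1.0, (4096, 4096), &cuda)?;

    // Quantization only runs on the CPU, so this call has to copy the data
    // back to the host, quantize it there, and then issue a synchronizing
    // htod copy of the quantized blocks back onto the GPU.
    let qweights = QTensor::quantize(&cuda_weights, GgmlDType::Q4K)?;
    println!("quantized shape: {:?}", qweights.shape());
    Ok(())
}
```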
The solution is a simple modification and the introduction of a new API, `QTensor::quantize_onto`. This new API takes a CPU tensor, quantizes it on the CPU, and then performs a single synchronizing htod copy. This halves the data transfers/synchronizations that take place.
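A minimal sketch of what the new path could look like, assuming for illustration a signature along the lines of `QTensor::quantize_onto(src: &Tensor, dtype: GgmlDType, device: &Device) -> Result<QTensor>` (see the PR diff for the exact signature):

```rust
use candle_core::quantized::{GgmlDType, QTensor};
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let cuda = Device::new_cuda(0)?;
    // Weights start on the CPU, e.g. freshly loaded from disk.
    let cpu_weights = Tensor::randn(0f32, 1.0, (4096, 4096), &Device::Cpu)?;

    // Quantize on the CPU and place the result directly on the target device:
    // the only transfer is a single synchronizing htod copy of the already
    // quantized data.
    let qweights = QTensor::quantize_onto(&cpu_weights, GgmlDType::Q4K, &cuda)?;
    println!("quantized shape: {:?}", qweights.shape());
    Ok(())
}
```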