
Simplify mm dispatch and remove qint16 and qint32 #94

Merged
merged 6 commits into from
Feb 21, 2024
Conversation

dacorvo
Collaborator

@dacorvo dacorvo commented Feb 21, 2024

This simplifies the QTensor mm dispatch method to:

  • restrict it to the configurations actually used in transformers models,
  • return a float Tensor instead of a qint32 QTensor.

Eventually, these dispatched methods will not materialize the intermediate int32 Tensor, thanks to fused kernels.

The qint16 and qint32 qtypes are removed.

This change is required to remove the qint32 QTensor: materializing an int32 output Tensor should be avoided, as it is large and cannot be used downstream without dequantization. This completely removes the need for qint32.
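The idea can be sketched as follows. This is a minimal illustration, not the actual quanto dispatch code: the function name `qmm_to_float` and the per-tensor scale handling are hypothetical, and numpy stands in for the real tensor backend. The point is that the int32 accumulator is rescaled to float before being returned, so no qint32 tensor is ever exposed to the caller.

```python
import numpy as np

def qmm_to_float(a_int8, a_scale, b_int8, b_scale):
    """Hypothetical quantized matmul: int32 accumulation, float output."""
    # Accumulate the int8 product in int32 to avoid overflow.
    acc_int32 = a_int8.astype(np.int32) @ b_int8.astype(np.int32)
    # Dequantize on the spot: the caller only ever sees a float result,
    # never the intermediate int32 (qint32-like) tensor.
    return acc_int32.astype(np.float32) * (a_scale * b_scale)

# Quantize two small float matrices to int8 with per-tensor scales.
a = np.array([[0.5, -1.0], [2.0, 0.25]], dtype=np.float32)
b = np.array([[1.0, 0.0], [-0.5, 2.0]], dtype=np.float32)
a_scale = np.abs(a).max() / 127
b_scale = np.abs(b).max() / 127
a_q = np.round(a / a_scale).astype(np.int8)
b_q = np.round(b / b_scale).astype(np.int8)

out = qmm_to_float(a_q, a_scale, b_q, b_scale)
print(out.dtype)  # float32: the int32 accumulator stays internal
```

A fused kernel would go one step further and never materialize `acc_int32` at all, rescaling inside the matmul loop instead.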
@dacorvo dacorvo changed the title Simply mm dispatch and remove qint16 and qint32 Simplify mm dispatch and remove qint16 and qint32 Feb 21, 2024
@dacorvo dacorvo merged commit cb386af into main Feb 21, 2024
3 checks passed
@dacorvo dacorvo deleted the mixed_mm branch February 21, 2024 13:37