
Simplify mm dispatch and remove qint16 and qint32 #94

Merged
merged 6 commits into from
Feb 21, 2024
Conversation

dacorvo
Collaborator

@dacorvo dacorvo commented Feb 21, 2024

This simplifies the QTensor mm dispatch method to:

  • restrict it to the configurations actually used in transformers models,
  • return a float Tensor instead of a qint32 QTensor.

Eventually, these dispatched methods will not materialize the intermediate int32 Tensor, thanks to fused kernels.

The qint16 and qint32 qtypes are removed.

This change is required to remove the qint32 QTensor: materializing an int32 output Tensor should be avoided, as it is large and cannot be used downstream without dequantization. This completely removes the need for qint32.
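The idea can be sketched as follows. This is a minimal illustration, not the actual quanto dispatch code: the function name `qmm_to_float` and the per-tensor scale handling are hypothetical, and numpy stands in for the real tensor backend. The point is that the int32 accumulator is rescaled to float before being returned, so no qint32 tensor is ever exposed to the caller.

```python
import numpy as np

def qmm_to_float(a_int8, a_scale, b_int8, b_scale):
    """Hypothetical quantized matmul: int32 accumulation, float output."""
    # Accumulate the int8 product in int32 to avoid overflow.
    acc_int32 = a_int8.astype(np.int32) @ b_int8.astype(np.int32)
    # Dequantize on the spot: the caller only ever sees a float result,
    # never the intermediate int32 (qint32-like) tensor.
    return acc_int32.astype(np.float32) * (a_scale * b_scale)

# Quantize two small float matrices to int8 with per-tensor scales.
a = np.array([[0.5, -1.0], [2.0, 0.25]], dtype=np.float32)
b = np.array([[1.0, 0.0], [-0.5, 2.0]], dtype=np.float32)
a_scale = np.abs(a).max() / 127
b_scale = np.abs(b).max() / 127
a_q = np.round(a / a_scale).astype(np.int8)
b_q = np.round(b / b_scale).astype(np.int8)

out = qmm_to_float(a_q, a_scale, b_q, b_scale)
print(out.dtype)  # float32: the int32 accumulator stays internal
```

A fused kernel would go one step further and never materialize `acc_int32` at all, rescaling inside the matmul loop instead.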
@dacorvo dacorvo changed the title Simply mm dispatch and remove qint16 and qint32 Simplify mm dispatch and remove qint16 and qint32 Feb 21, 2024
@dacorvo dacorvo merged commit cb386af into main Feb 21, 2024
3 checks passed
@dacorvo dacorvo deleted the mixed_mm branch February 21, 2024 13:37