graph: backend: dnnl: support permute for scale and zps inputs #2291
Conversation
During the validation of dynamic per-channel quantization cases, especially those with zps and transposed MatMul, some issues were discovered. I have therefore pushed two new commits to this PR that enhance the handling of dynamic per-channel quantization cases; please review again.
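For reference, here is a minimal sketch of how such a case is expressed through the oneDNN Graph API: a dynamic per-channel dequantize with zero points. The shapes, tensor IDs, and channel axis below are assumptions for illustration, not values from this PR.

```cpp
#include <cstdint>
#include <string>
#include <oneapi/dnnl/dnnl_graph.hpp>
using namespace dnnl::graph;

int main() {
    // int8 weight of shape (K, N), dequantized per channel along axis 1
    // with f32 scales and s8 zero points of length N (assumed shapes).
    const logical_tensor::dims wei_shape {256, 64};
    const logical_tensor::dims ch_shape {64};

    logical_tensor wei_s8 {0, logical_tensor::data_type::s8, wei_shape,
            logical_tensor::layout_type::strided};
    logical_tensor scales {1, logical_tensor::data_type::f32, ch_shape,
            logical_tensor::layout_type::strided};
    logical_tensor zps {2, logical_tensor::data_type::s8, ch_shape,
            logical_tensor::layout_type::strided};
    logical_tensor wei_f32 {3, logical_tensor::data_type::f32, wei_shape,
            logical_tensor::layout_type::strided};

    op deq {0, op::kind::DynamicDequantize, "deq"};
    deq.set_attr<std::string>(op::attr::qtype, "per_channel");
    deq.set_attr<int64_t>(op::attr::axis, 1); // channel axis of the weight
    deq.add_inputs({wei_s8, scales, zps});
    deq.add_output(wei_f32);

    graph g {dnnl::engine::kind::cpu};
    g.add_op(deq);
    g.finalize();
    return 0;
}
```

The `qtype = per_channel` attribute together with an explicit `axis` is what distinguishes these cases from the per-tensor ones.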
Description
We recently supported the compressed SDPA patterns which incorporate scales and zero points as inputs. We also aim to support problems with a Key shape of (N, H, S, D) and `transpose_b=true` for the QK MatMul. For such cases with a transposed MatMul, the graph library adds `permute` ops prior to the pattern inputs, including the scale and zp tensors. To be more specific, for a woq MatMul with `transpose_b=true`, after graph compilation we will have `permute` ops inserted into the graph. The permute ops will not touch the physical memories but only change the mds.
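For illustration, a minimal sketch of the user-visible QK MatMul built through the oneDNN Graph API, with a (N, H, S, D) Key and `transpose_b=true`; all shapes and tensor IDs are assumptions. The permute ops discussed above are inserted by the library during compilation and do not appear in the graph as built here.

```cpp
#include <oneapi/dnnl/dnnl_graph.hpp>
using namespace dnnl::graph;

int main() {
    // Query and Key both have shape (N, H, S, D); with transpose_b = true
    // the MatMul computes Q x K^T, so no explicit transpose op appears in
    // the user-built pattern.
    logical_tensor query {0, logical_tensor::data_type::f32,
            {1, 16, 384, 64}, logical_tensor::layout_type::strided};
    logical_tensor key {1, logical_tensor::data_type::f32,
            {1, 16, 384, 64}, logical_tensor::layout_type::strided};
    logical_tensor score {2, logical_tensor::data_type::f32,
            {1, 16, 384, 384}, logical_tensor::layout_type::strided};

    op qk {0, op::kind::MatMul, "qk_matmul"};
    qk.set_attr<bool>(op::attr::transpose_b, true);
    qk.add_inputs({query, key});
    qk.add_output(score);

    graph g {dnnl::engine::kind::cpu};
    g.add_op(qk);
    g.finalize();
    return 0;
}
```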
However, we faced an issue: the DNNL backend always regards scale and zp tensors as being in `abx` format, leading to failing results. Hence we need to set the tag explicitly as `abx` and execute extra reorders. After the change, the graph after compilation contains these extra reorders on the scale and zp inputs.
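The distinction can be sketched with the oneDNN primitive API (tensor sizes are assumptions): permuting axes only rewrites the memory descriptor over the same buffer, whereas handing the backend a plain, `abx`-style tensor requires an actual reorder.

```cpp
#include <oneapi/dnnl/dnnl.hpp>
using namespace dnnl;

int main() {
    engine eng {engine::kind::cpu, 0};
    stream strm {eng};

    // A 2x4 f32 tensor stored row-major ("ab", the 2D member of the
    // plain abx family).
    memory::desc src_md {{2, 4}, memory::data_type::f32,
            memory::format_tag::ab};

    // Permuting axes only rewrites dims/strides in the md; the physical
    // buffer is untouched. This md views the same data as a 4x2 tensor.
    memory::desc permuted_md = src_md.permute_axes({1, 0});

    // To give the backend a dense row-major tensor, an explicit reorder
    // from the strided (permuted) view into a plain md is required.
    memory::desc plain_md {{4, 2}, memory::data_type::f32,
            memory::format_tag::ab};
    memory src_mem {permuted_md, eng};
    memory dst_mem {plain_md, eng};
    reorder(src_mem, dst_mem).execute(strm, src_mem, dst_mem);
    strm.wait();
    return 0;
}
```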