
doc: graph: add document for sdpa with compressed key and value #2301

Open · wants to merge 1 commit into main

Conversation

wzt1997 (Contributor) commented Dec 20, 2024

General

  1. Add document for SDPA with compressed Key and Value

@wzt1997 wzt1997 added the documentation A request to change/fix/improve the documentation. Codeowner: @oneapi-src/onednn-doc label Dec 20, 2024
@wzt1997 wzt1997 requested a review from TaoLv December 20, 2024 03:49
@wzt1997 wzt1997 self-assigned this Dec 20, 2024
@wzt1997 wzt1997 requested review from a team as code owners December 20, 2024 03:49
mgouicem (Contributor) left a comment:

(open question) Do we want to have a separate document for non-quantized and quantized patterns?
My understanding of the existing SDPA page was that it would group the different data types, given that it currently has a subsection for the floating-point pattern.

Review comment on doc/graph/sdpa_with_compressed_kv.md (outdated, resolved)
TaoLv (Contributor) commented Dec 20, 2024

(open question) Do we want to have a separate document for non-quantized and quantized patterns?

Thank you @mgouicem! That's also my question. Initially, when I added the floating-point section, I was thinking of including all of these patterns (fp, int8-quantized, and KV-only-quantized) together on a single page. But with this PR, it seems there would be too much information. Maybe we need @ranukund's input on which format is better.

@wzt1997 wzt1997 force-pushed the zhitao/doc-compressed-sdpa branch from 51d1971 to 9c91ae9 Compare December 24, 2024 01:26
wzt1997 (Contributor, Author) commented Dec 24, 2024

Do we want to have a separate document for non-quantized and quantized patterns?

Thanks for the question! From my perspective, it's better to put the quantized SDPA patterns in a separate document, as they require much more information than the pure floating-point patterns, such as the fpmath mode setting, grouped quantization, and the extra dynamic quantization ops.

It's also worth noting that this PR only includes the document for quantized SDPA patterns with compressed KV, not pure quantized SDPA. We may need to think about merging them in the future.

TaoLv (Contributor) left a comment:

Document folder has been changed on main branch. Please rebase the PR. Thanks.

| dt_fp | dt_fp | N/A | N/A | dt_int | dt_fp | u4,s4,u8,s8,s32 | dt_fp |

Notes:
- dt_fp can be one of f16, bf16, or f32.
Reviewer (Contributor):

What's the problem if we directly expand dt_fp as f32, bf16, f16 in the table above?

wzt1997 (Contributor, Author):

If we directly used f32, bf16, and f16 in the table, it would result in an excessively long table. Furthermore, it could lead users to misinterpret that these data types can be mixed, such as using f16 for Query with bf16 for Key scale. However, all cells marked dt_fp must use the same data type.

@wzt1997 wzt1997 force-pushed the zhitao/doc-compressed-sdpa branch from 9c91ae9 to 2f307af Compare December 24, 2024 05:34
vpirogov (Member) commented:
@ranukund, please help with the review.

@TaoLv TaoLv added the component:graph-api Codeowner: @oneapi-src/onednn-graph label Jan 3, 2025
ranukund (Contributor) left a comment:

I've left a few comments; please incorporate them as you see fit.


## Overview

The KV Cache mechanism was developed to improve the efficiency of models by
Reviewer (Contributor):

We need to spell out KV.


## Overview

The KV Cache mechanism was developed to improve the efficiency of models by
Reviewer (Contributor):

Suggested change
The KV Cache mechanism was developed to improve the efficiency of models by
The KV Cache mechanism is developed to improve the efficiency of models by

usage, and are subsequently de-quantized to wider floating point types such as
f16 and bf16 for computation.

It's worth noting that grouped quantization is required to improve model
Reviewer (Contributor):

Suggested change
It's worth noting that grouped quantization is required to improve model
Note that grouped quantization is required to improve the model

f16 and bf16 for computation.

It's worth noting that grouped quantization is required to improve model
accurarcy, especially for int4 data types. In this case, group size is needed
Reviewer (Contributor):

Suggested change
accurarcy, especially for int4 data types. In this case, group size is needed
accuracy, especially for int4 data types. In this case, group size is needed

as an attribute for quantization, which indicates the number of elements that
share the same scaling factor and zero-points in each quantization group.
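The grouped-quantization scheme the excerpt describes (a group of elements along the head dimension sharing one scale and one zero-point) can be sketched in NumPy. This is an illustrative sketch, not the oneDNN implementation; the tensor shape, 4-bit asymmetric scheme, and function names are assumptions for the example:

```python
import numpy as np

def group_quantize(x, group_size, num_bits=4):
    # Split the last dimension into groups of `group_size`; each group
    # shares one scale and one zero-point (asymmetric quantization).
    qmax = 2**num_bits - 1
    g = x.reshape(*x.shape[:-1], -1, group_size)
    lo = g.min(axis=-1, keepdims=True)
    hi = g.max(axis=-1, keepdims=True)
    scale = np.maximum((hi - lo) / qmax, 1e-8)
    zero_point = np.round(-lo / scale)
    q = np.clip(np.round(g / scale) + zero_point, 0, qmax)
    return q, scale, zero_point

def group_dequantize(q, scale, zero_point, shape):
    # De-quantize back to floating point for computation.
    return ((q - zero_point) * scale).reshape(shape)

# A Key tensor with hypothetical shape (N, H, S, D) and group size G = 16:
# the per-group scales/zero-points then have D/G entries per row,
# matching the (N, H, S, D/G) shape mentioned later in the review.
key = np.random.randn(1, 2, 8, 32).astype(np.float32)
q, scale, zp = group_quantize(key, group_size=16)
print(scale.squeeze(-1).shape)  # (1, 2, 8, 2)
key_hat = group_dequantize(q, scale, zp, key.shape)
```

The round trip is lossy: the reconstruction error per element is on the order of half a scale step, which is why the group size trades accuracy against metadata overhead.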

The notations used in the document:
Reviewer (Contributor):

Suggested change
The notations used in the document:
The notations used in this topic are:

intermediate results of the dot products between Query and Key which takes
\f$O(S^2)\f$ memory. It may lead to Out-of-Memory error when computing long
sequence length input on platforms with limited memory.
2. The compressed SDPA patterns functionally support all input shapes meeting
Reviewer (Contributor):

Suggested change
2. The compressed SDPA patterns functionally support all input shapes meeting
- The compressed SDPA patterns functionally support all input shapes meeting

sequence length input on platforms with limited memory.
2. The compressed SDPA patterns functionally support all input shapes meeting
the shape requirements of each operation in the graph.
3. CPU
Reviewer (Contributor):

Suggested change
3. CPU
- CPU

- oneDNN does not provide optimized implementation on CPU currently. All
executions will be implemented with the primitive-based reference
computation.
4. GPU
Reviewer (Contributor):

Suggested change
4. GPU
- GPU

4. GPU
- Optimized implementation is available for 4D Q/K/V tensors with the shape
defined as (N, H, S, D) for Query, Key and Value, and (N, H, S, D/G) for
scales and zero-points( if available ).
Reviewer (Contributor):

Suggested change
scales and zero-points( if available ).
scales and zero-points (if available).

computation data type on Intel Graphics Products with Intel(R) Xe Matrix
Extensions (Intel(R) XMX) support.
- If int4 zero-points are specified, optimized implementation will be only
avaibable when group size equals to 16.
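The GPU constraints quoted above (4D Q/K/V tensors, and group size 16 when int4 zero-points are used) can be collected into a small checker. This is a hypothetical helper written for illustration, not part of the oneDNN API; the function name and parameters are assumptions:

```python
def uses_optimized_gpu_path(q_shape, k_shape, v_shape, group_size,
                            int4_zero_points=False):
    # Mirror the notes from the document: the optimized GPU implementation
    # needs 4D (N, H, S, D) tensors for Query, Key, and Value, and with
    # int4 zero-points the quantization group size must equal 16.
    if not all(len(s) == 4 for s in (q_shape, k_shape, v_shape)):
        return False
    if int4_zero_points and group_size != 16:
        return False
    return True

shape = (2, 16, 128, 64)  # hypothetical (N, H, S, D)
print(uses_optimized_gpu_path(shape, shape, shape, group_size=16,
                              int4_zero_points=True))   # True
print(uses_optimized_gpu_path(shape, shape, shape, group_size=32,
                              int4_zero_points=True))   # False
```

When the check fails, execution would fall back to the reference computation mentioned in the CPU note rather than the optimized kernel.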
Reviewer (Contributor):

Suggested change
avaibable when group size equals to 16.
available when the group size equals 16.

Labels
component:graph-api Codeowner: @oneapi-src/onednn-graph documentation A request to change/fix/improve the documentation. Codeowner: @oneapi-src/onednn-doc
5 participants