xe: sdpa: Improve performance of quantization with better alignment and prefetching #2322

Open

umar456 wants to merge 4 commits into main from uarshad/sdpa_scale_zp_alignment

+160 −32

Contributor

umar456 commented Dec 27, 2024

Description

This PR improves the performance of the micro SDPA kernel by adding prefetching and setting better alignment when generating the microkernels. The impact is significant for certain sizes: performance relative to the original version ranges from 0.89x to 1.26x.

umar456 added performance platform:gpu-intel labels

umar456 requested a review from a team as a code owner

December 27, 2024 20:56

Contributor Author

umar456 commented Dec 27, 2024

make test
disable device_cpu
disable benchdnn_all
enable benchdnn_nightly
enable benchdnn_graph

petercad reviewed

src/gpu/intel/ocl/micro_sdpa.cl Outdated

    
    #if VAL_SCALES == QUANTIZE_2D
            /* Prefetch V scales. */
            cooperative_prefetch_2d_maybe_rem(V_scales, d / VAL_GROUP_SIZE, k - k0,
                    d / VAL_GROUP_SIZE,

Contributor

petercad Jan 2, 2025 (edited)

You'll want a constant here to allow the compiler to unroll loops inside cooperative_prefetch_2d_maybe_rem:

Suggested change

-                   d / VAL_GROUP_SIZE,
+                   D_MAX / VAL_GROUP_SIZE,
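
For readers following the thread, the point about constants and unrolling can be sketched in plain C. This is only an illustration of the general pattern (a compile-time trip count plus a runtime guard for the remainder); it is not the implementation of cooperative_prefetch_2d_maybe_rem, and MAX_COLS and touched_columns are hypothetical names made up for this example.

```c
#include <assert.h>

/* Hypothetical compile-time upper bound, standing in for a value such as
 * D_MAX / VAL_GROUP_SIZE. It is an assumption made for this sketch. */
#define MAX_COLS 8

/* Counts how many columns a fixed-trip-count loop actually touches.
 * Because the loop bound is a constant, a compiler is free to unroll the
 * loop fully; the runtime comparison masks off columns past the real size. */
static int touched_columns(int runtime_cols) {
    assert(runtime_cols >= 0);
    int touched = 0;
    for (int c = 0; c < MAX_COLS; c++) {
        if (c < runtime_cols) touched++; /* remainder guard */
    }
    return touched;
}
```

With a runtime loop bound instead of MAX_COLS, the compiler generally cannot know the trip count and has to keep the loop, which is the cost the review comment is pointing at.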

umar456 added 4 commits

January 3, 2025 01:00


xe: sdpa: Improve scale and zp alignment (7eb0386)
xe: sdpa: Prefetch scales and zero_points (64e2b75)
xe: ukernel: fix cooperative prefetch function to avoid overlap (05c73c2)
xe: sdpa: Update prefetch functions to improve performance
umar456 force-pushed the uarshad/sdpa_scale_zp_alignment branch from 9af6de9 to 4746108

January 3, 2025 09:01

Contributor Author

umar456 commented Jan 3, 2025

make test
disable device_cpu
disable benchdnn_all
enable benchdnn_nightly
enable benchdnn_graph

petercad reviewed

src/gpu/intel/ocl/micro_sdpa.cl

    
            /* n_sg */ sg_per_wg,
            /* sg_size */ SUBGROUP_SIZE,
            /* cache */ LSC_LDCC_L1C_L3C);
    //return;

Contributor

petercad Jan 3, 2025

Does it improve performance to have the first K tile prefetch here (before loading Q)? IIRC in my earlier testing it was better to delay the first K tile prefetch until after issuing the Q load.

petercad reviewed

src/gpu/intel/ocl/micro_sdpa.cl

    
    cooperative_prefetch_2d_k(
            /* ptr */ K,
            /* r */ k,
            /* c */ d, // faster than D_MAX

Contributor

petercad Jan 3, 2025

It's not so much that it's faster than D_MAX; rather, we need to avoid out-of-bounds prefetches.
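
The out-of-bounds concern can be illustrated with a small C check. This is a sketch under stated assumptions: rows are taken to be tightly packed (row stride equal to the number of valid columns), which the excerpts here do not confirm, and prefetch_in_bounds is a hypothetical helper, not part of the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true when a prefetch of rows x cols elements, starting at the
 * beginning of the buffer, stays inside a buffer of valid_rows x valid_cols
 * elements. Assumption for this sketch: rows are tightly packed, so the row
 * stride equals valid_cols. */
static bool prefetch_in_bounds(size_t rows, size_t cols,
        size_t valid_rows, size_t valid_cols) {
    assert(valid_cols > 0);
    if (rows == 0 || cols == 0) return true; /* nothing is touched */
    /* Offset of the last element touched in the last prefetched row. */
    size_t last_offset = (rows - 1) * valid_cols + (cols - 1);
    return last_offset < valid_rows * valid_cols;
}
```

For example, with 4 rows and a head size of 64, passing a padded column count of 128 makes the last row run past the end of the buffer, while passing 64 stays in bounds.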

petercad reviewed

src/gpu/intel/ocl/micro_sdpa.cl

    
    cooperative_prefetch_2d_maybe_rem(
            /* ptr */ K_scales,
            /* r */ k,
            /* c */ D_MAX / KEY_GROUP_SIZE,

Contributor

petercad Jan 3, 2025

Need to avoid out-of-bounds prefetches here, and similarly for the zp prefetches:

Suggested change

-               /* c */ D_MAX / KEY_GROUP_SIZE,
+               /* c */ d / KEY_GROUP_SIZE,

petercad reviewed

src/gpu/intel/ocl/micro_sdpa.cl

    
    cooperative_prefetch_2d_k(
            /* ptr */ K + (k0 + ugemm_kq_wg_tile_m) * stride_k,
            /* r */ k - k0 - ugemm_kq_wg_tile_m,
            /* c */ D_MAX,

Contributor

petercad Jan 3, 2025

Avoid OOB access:

Suggested change

-                       /* c */ D_MAX,
+                       /* c */ d,

petercad reviewed

src/gpu/intel/ocl/micro_sdpa.cl

    
            /* sg_id */ sg_ij,
            /* n_sg */ sg_per_wg,
            /* sg_size */ SUBGROUP_SIZE,
            /* cache */ LSC_LDCC_L1C_L3C);

Contributor

petercad Jan 3, 2025

This tile is so small that it doesn't need cooperative prefetching (hence the earlier simpler logic). Does this change improve performance?

Labels

performance platform:gpu-intel