Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[QST] wgmma.mma_async instructions are serialized due to insufficient register resources #1840

Open
vmarkovtsev opened this issue Sep 25, 2024 · 5 comments

Comments

@vmarkovtsev
Copy link
Contributor

vmarkovtsev commented Sep 25, 2024

When I build the latest cutlass library for 90a, I see a lot of warnings like:

ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE'

Am I doing something wrong, or is that intended?

@thakkarV
Copy link
Collaborator

thakkarV commented Sep 25, 2024

This is quite a large tile size for pingpong. @IonThruster do you know why we are stamping this tile size out?

@vmarkovtsev what is your CTK version? Please use 12.3 or newer if you are not already

@vmarkovtsev
Copy link
Contributor Author

vmarkovtsev commented Sep 25, 2024

My CTK is 12.4. Those kernels are generated with python/cutlass_library/generator.py from the following cmake:

cmake -B build_12.4 -D CUTLASS_NVCC_ARCHS=90;90a -D CMAKE_CUDA_ARCHITECTURES=90;90a -D CUTLASS_ENABLE_TESTS=OFF -D CUTLASS_LIBRARY_OPERATIONS=gemm -D CUTLASS_LIBRARY_KERNELS=sgemm,bf16 -D CUTLASS_ENABLE_EXAMPLES=OFF -D CUTLASS_LIBRARY_IGNORE_KERNELS=planar,complex,spgemm,grouped,mixed,_u8,_s8,_tt,s*bf16gemm,fp8,e4m3,e5m2,void,s16816gemm -D CMAKE_CUDA_COMPILER=/usr/local/cuda-12.4/bin/nvcc -D CMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0'

@vmarkovtsev
Copy link
Contributor Author

Full log

+ make -j192 cutlass_library_static
  [  0%] Building CXX object tools/library/CMakeFiles/cutlass_library_objs.dir/src/manifest.cpp.o
  [  0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/handle.cu.o
  [  0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/singleton.cu.o
  [  1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/all_sm80_sgemm_gemm_operations.cu.o
  [  1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/util.cu.o
  [  1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_int4.cu.o
  [  1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/operation_table.cu.o
  [  1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_256x128_8x5_nn_align1.cu.o
  [  1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_int8_canonical.cu.o
  [  1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/all_sm50_sgemm_gemm_operations.cu.o
  [  1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_int8_interleaved_32.cu.o
  [  2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x256_8x5_nn_align1.cu.o
  [  2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x128_8x5_nn_align1.cu.o
  [  3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_int8_interleaved_64.cu.o
  [  3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_128x128_8x2_nn_align1.cu.o
  [  3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_256x128_8x4_nn_align1.cu.o
  [  3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_e4m3a_e4m3out.cu.o
  [  3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x256_8x4_nn_align1.cu.o
  [  3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_e5m2a_e4m3out.cu.o
  [  3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x128_8x4_nn_align1.cu.o
  [  3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_e4m3a_e5m2out.cu.o
  [  4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_128x64_8x2_nn_align1.cu.o
  [  4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x64_8x5_nn_align1.cu.o
  [  5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_e5m2a_e5m2out.cu.o
  [  5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_64x128_8x2_nn_align1.cu.o
  [  5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_64x128_8x5_nn_align1.cu.o
  [  5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_64x64_8x2_nn_align1.cu.o
  [  6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/all_sm90_s64x256x16gemm_bf16_gemm_operations.cu.o
  [  7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_fp8in_fp16out.cu.o
  [  7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_64x64_8x5_nn_align1.cu.o
  [  7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_nnn_align8.cu.o
  [  7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_128x32_8x2_nn_align1.cu.o
  [  7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_32x128_8x2_nn_align1.cu.o
  [  7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_128x128_8x2_nt_align1.cu.o
  [  7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_32x128_8x5_nn_align1.cu.o
  [  7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_fp8in_bf16out.cu.o
  [  7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [  7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x32_8x5_nn_align1.cu.o
  [  7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [  8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o
  [  8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_128x64_8x2_nt_align1.cu.o
  [  8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/all_sm90_bf16_s64x256x16gemm_bf16_gemm_operations.cu.o
  [  9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_256x128_8x5_nt_align1.cu.o
  [  9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_nnn_align8.cu.o
  [  9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_64x128_8x2_nt_align1.cu.o
  [  9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_64x64_8x2_nt_align1.cu.o
  [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_nnn_align8.cu.o
  [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x128_8x5_nt_align1.cu.o
  [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o
  [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_256x128_8x4_nt_align1.cu.o
  [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o
  [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_32x128_8x2_nt_align1.cu.o
  [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_nnn_align8.cu.o
  [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_nnn_align8.cu.o
  [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x256_8x5_nt_align1.cu.o
  [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_fp8in_fp32out.cu.o
  [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_128x32_8x2_nt_align1.cu.o
  [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_128x128_8x2_tn_align1.cu.o
  [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_128x64_8x2_tn_align1.cu.o
  [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_fp32out.cu.o
  [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o
  [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o
  [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x256_8x4_nt_align1.cu.o
  [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_nnn_align8.cu.o
  [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_64x128_8x2_tn_align1.cu.o
  [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_64x64_8x2_tn_align1.cu.o
  [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_ntn_align8.cu.o
  [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_fp_other.cu.o
  [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_128x32_8x2_tn_align1.cu.o
  [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x128_8x4_nt_align1.cu.o
  [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_fp_mixed_input.cu.o
  [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_32x128_8x2_tn_align1.cu.o
  [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o
  [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x64_8x5_nt_align1.cu.o
  [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_int_mixed_input.cu.o
  [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_64x128_8x5_nt_align1.cu.o
  [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_ntn_align8.cu.o
  [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/all_sm90_bf16_s64x128x16gemm_bf16_gemm_operations.cu.o
  [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_64x64_8x5_nt_align1.cu.o
  [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o
  [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/initialize_reference_operations.cu.o
  [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/all_sm90_s64x128x16gemm_bf16_gemm_operations.cu.o
  [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8.cu.o
  [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x32_8x5_nt_align1.cu.o
  [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reduction/reduction_device.cu.o
  [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o
  [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_ntn_align8.cu.o
  [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8.cu.o
  [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_32x128_8x5_nt_align1.cu.o
  [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_ntn_align8.cu.o
  [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o
  [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_256x128_8x5_tn_align1.cu.o
  [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_ntn_align8.cu.o
  [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reduction/init_reduction_operations.cu.o
  [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o
  [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o
  [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/conv2d.cu.o
  [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_nnn_align8.cu.o
  [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/conv3d.cu.o
  [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x256_8x5_tn_align1.cu.o
  [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x128_8x5_tn_align1.cu.o
  [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o
  [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x256_8x4_tn_align1.cu.o
  [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_ntn_align8.cu.o
  [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o
  [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/all_gemm_operations.cu.o
  [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o
  [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o
  [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_tnn_align8.cu.o
  [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x128_8x4_tn_align1.cu.o
  [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_nnn_align8.cu.o
  [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x64_8x5_tn_align1.cu.o
  [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_64x128_8x5_tn_align1.cu.o
  [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o
  [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_nnn_align8.cu.o
  [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o
  [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o
  [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_tnn_align8.cu.o
  [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_tnn_align8.cu.o
  [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_tnn_align8.cu.o
  [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o
  [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o
  [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_nnn_align8.cu.o
  [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o
  [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_32x128_8x5_tn_align1.cu.o
  [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_256x128_8x4_tn_align1.cu.o
  [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o
  [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o
  [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_tnn_align8.cu.o
  [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o
  [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_nnn_align8.cu.o
  [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_64x64_8x5_tn_align1.cu.o
  [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_nnn_align8.cu.o
  [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_tnn_align8.cu.o
  [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_nnn_align8.cu.o
  [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_128x32_8x5_tn_align1.cu.o
  [ 40%] Building CXX object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/initialize_all.cpp.o
  [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o
  [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_nnn_align8.cu.o
  [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o
  [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o
  [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8.cu.o
  [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o
  [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o
  [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_ntn_align8.cu.o
  [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o
  [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_ntn_align8.cu.o
  [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8.cu.o
  [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o
  [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o
  [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_ntn_align8.cu.o
  [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o
  [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_ntn_align8.cu.o
  [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_ntn_align8.cu.o
  [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o
  [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_ntn_align8.cu.o
  [ 47%] Built target cutlass_library_gemm_sm50_sgemm_objs
  [ 47%] Linking CUDA static library libcutlass_gemm_sm50_sgemm.a
  [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 48%] Built target cutlass_library_gemm_sm50_sgemm_static
  [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o
  [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_ntn_align8.cu.o
  [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o
  [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_ntn_align8.cu.o
  [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o
  [ 54%] Built target cutlass_library_gemm_sm80_sgemm_objs
  [ 54%] Linking CUDA static library libcutlass_gemm_sm80_sgemm.a
  [ 54%] Built target cutlass_library_gemm_sm80_sgemm_static
  [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8.cu.o
  [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_tnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_tnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_ntn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_nnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_nnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_ntn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_nnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_nnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_nnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_ntn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_ntn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_nnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_nnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o
  [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8.cu.o
  [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o
  [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_tnn_align8.cu.o
  [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o
  [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o
  [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_tnn_align8.cu.o
  [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_nnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_tnn_align8.cu.o
  [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o
  [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_tnn_align8.cu.o
  [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o
  [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o
  [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_tnn_align8.cu.o
  [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_tnn_align8.cu.o
  [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o
  [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o
  [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_tnn_align8.cu.o
  [ 73%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 73%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 73%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 73%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 73%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 73%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 73%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 73%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 73%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 74%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 74%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 74%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x256x16gemm_bf16/cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 74%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 75%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 75%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 75%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 75%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 75%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 75%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 76%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 76%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 76%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 76%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o
  [ 77%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_tnn_align8.cu.o
  [ 77%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 77%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
  [ 77%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 77%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 77%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o
  [ 77%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 78%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 79%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 79%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 79%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 79%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 79%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 80%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 80%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 80%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 80%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 80%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 80%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 80%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  [ 81%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 81%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  [ 81%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 81%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 81%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  [ 82%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 82%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  [ 82%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  [ 82%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_ntn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_ntn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  [ 82%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_ntn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  [ 82%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 82%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 82%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 84%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 84%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 84%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 85%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 85%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 85%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 85%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 85%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 85%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 85%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 86%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 87%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 87%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 87%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 87%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 87%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 87%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 88%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 88%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 88%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o
  [ 88%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 88%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o
  [ 89%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 89%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 89%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 89%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 90%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 90%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 90%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 90%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 90%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 90%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 90%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 90%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 90%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE'
  [ 91%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI123cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI123cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI123cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_ntn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  [ 91%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 92%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 92%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 92%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 92%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 92%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o
  [ 92%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 93%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
  [ 93%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o
  [ 93%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 93%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 93%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o
  [ 93%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  [ 93%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI119cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  [ 93%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  [ 94%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_epi_nosmem.cu.o
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  [ 94%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_tnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  [ 94%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o
  [ 95%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_epi_nosmem.cu.o
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI117cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_tnn_align8_warpspecialized_epi_nosmemEEvNT_6ParamsE'
  [ 95%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 95%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o
  [ 95%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI126cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_256x128x64_1x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  [ 95%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o
  [ 95%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 96%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 96%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_epi_nosmem.cu.o
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI128cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmemEEvNT_6ParamsE'
  [ 96%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o
  [ 97%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 97%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 97%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 97%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 98%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 98%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_epi_nosmem.cu.o
  [ 98%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 98%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 98%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
  [ 98%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_nnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
  [100%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ntn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
  [100%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
  [100%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
  [100%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_tnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
  [100%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_nnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
  [100%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ntn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI123cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI123cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI123cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI123cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI123cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_nnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI123cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_f32_f32_128x256x64_1x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE'
  [100%] Built target cutlass_library_gemm_sm90_s64x256x16gemm_bf16_objs
  [100%] Linking CUDA static library libcutlass_gemm_sm90_s64x256x16gemm_bf16.a
  [100%] Built target cutlass_library_gemm_sm90_s64x256x16gemm_bf16_static
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_ntn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE'
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE'
  [100%] Built target cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_objs
  [100%] Linking CUDA static library libcutlass_gemm_sm90_bf16_s64x256x16gemm_bf16.a
  [100%] Built target cutlass_library_gemm_sm90_bf16_s64x256x16gemm_bf16_static
  ptxas info    : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelI125cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tmaEEvNT_6ParamsE'
  [100%] Built target cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs
  [100%] Linking CUDA static library libcutlass_gemm_sm90_s64x128x16gemm_bf16.a
  [100%] Built target cutlass_library_gemm_sm90_s64x128x16gemm_bf16_static
  [100%] Built target cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs
  [100%] Linking CUDA static library libcutlass_gemm_sm90_bf16_s64x128x16gemm_bf16.a
  [100%] Built target cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_static
  [100%] Built target cutlass_library_objs
  [100%] Linking CXX static library libcutlass.a
  [100%] Built target cutlass_library_static

@thakkarV
Copy link
Collaborator

@vmarkovtsev I think you can ignore these if its not coming from majority of the profiler kernels. Sometimes the Cartesian product of configurations can lead to a few outliers that spill.

Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants