Allow compiling cuda without mmq and flash attention #11190

Open
milot-mirdita wants to merge 1 commit into master from allow-compiling-cuda-without-mmq-and-flash-attention

Conversation

milot-mirdita

I have integrated the ProstT5 protein language model into Foldseek. Thanks a lot for the great library! I am upstreaming a few fixes for issues I found in ggml during the integration. I hope it's okay to push the changes here and that they get synced to the main ggml repo at some point.

This is the last patch in my patch series. Feel free to reject this one since it might be too specific.

I want to reduce CI compile times and binary sizes for the CUDA builds. My model doesn't benefit much from flash attention, and I only use f16 weights, so I added options to disable the kernels that take the longest to compile and contribute the most to binary size. For MMQ I reuse the existing GGML_CUDA_FORCE_CUBLAS option; for flash attention I added a new GGML_CUDA_FA option.
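
For reference, a downstream project that vendors ggml could switch both kernel families off like this. This is a minimal sketch: GGML_CUDA_FA is the option added in this PR, GGML_CUDA and GGML_CUDA_FORCE_CUBLAS are existing ggml options, and the project name, target, and layout are made up for illustration.

# Hypothetical consumer CMakeLists.txt; assumes ggml is vendored in ./ggml
cmake_minimum_required(VERSION 3.18)
project(consumer LANGUAGES C CXX)

set(GGML_CUDA              ON  CACHE BOOL "" FORCE) # build the CUDA backend
set(GGML_CUDA_FA           OFF CACHE BOOL "" FORCE) # skip the FlashAttention kernels (this PR)
set(GGML_CUDA_FORCE_CUBLAS ON  CACHE BOOL "" FORCE) # skip the MMQ kernels, use cuBLAS instead

add_subdirectory(ggml)

add_executable(app main.cpp)
target_link_libraries(app PRIVATE ggml)

Passing the same options on the cmake command line (-DGGML_CUDA=ON -DGGML_CUDA_FA=OFF -DGGML_CUDA_FORCE_CUBLAS=ON) should have the same effect for a standalone build.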

github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Jan 11, 2025
milot-mirdita force-pushed the allow-compiling-cuda-without-mmq-and-flash-attention branch from 3a1d670 to 8dfe3d8 on January 11, 2025 at 09:30

JohannesGaessler (Collaborator) left a comment:

The FLASH_ATTN_AVAILABLE macro is not being applied correctly to all ggml FlashAttention kernels, but this is not the fault of this PR; I'll fix it myself (unless you want to do it).

@@ -149,6 +149,7 @@ set (GGML_CUDA_PEER_MAX_BATCH_SIZE "128" CACHE STRING
"ggml: max. batch size for using peer access")
option(GGML_CUDA_NO_PEER_COPY "ggml: do not use peer to peer copies" OFF)
option(GGML_CUDA_NO_VMM "ggml: do not try to use CUDA VMM" OFF)
option(GGML_CUDA_FA "ggml: compile with FlashAttention" ON)

Suggested change
-option(GGML_CUDA_FA "ggml: compile with FlashAttention" ON)
+option(GGML_CUDA_FA "ggml: compile ggml FlashAttention kernels" ON)

file(GLOB SRCS "template-instances/fattn-vec*f16-f16.cu")
list(FILTER GGML_SOURCES_CUDA EXCLUDE REGEX ".*fattn.*")
list(FILTER GGML_HEADERS_CUDA EXCLUDE REGEX ".*fattn.*")
# message(FATAL_ERROR ${GGML_SOURCES_CUDA})

Forgot to remove?

@@ -151,6 +151,10 @@ typedef float2 dfloat2;
#define FLASH_ATTN_AVAILABLE
#endif // !(defined(GGML_USE_MUSA) && __MUSA_ARCH__ <= GGML_CUDA_CC_QY1)

#if !defined(GGML_CUDA_FA)
#undef FLASH_ATTN_AVAILABLE
#endif

Suggested change
-#endif
+#endif // !defined(GGML_CUDA_FA)

#include "ggml-cuda/fattn.cuh"
#endif

Suggested change
-#endif
+#endif // FLASH_ATTN_AVAILABLE

ggml_cuda_flash_attn_ext(ctx, dst);
break;
#else
return false;
#endif

Suggested change
-#endif
+#endif // FLASH_ATTN_AVAILABLE

Comment on lines +3 to +9
#ifdef GGML_CUDA_FORCE_CUBLAS
void ggml_cuda_op_mul_mat_q(
ggml_backend_cuda_context &,
const ggml_tensor *, const ggml_tensor *, ggml_tensor *, const char *, const float *,
const char *, float *, const int64_t, const int64_t, const int64_t,
const int64_t, cudaStream_t) {}
#else

Add GGML_ABORT("CUDA was compiled without MMQ support") to the function instead.
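
Concretely, the stub could then look something like the following, keeping the #ifdef structure from the hunk above and using the message from the review comment; this is only a sketch of the suggestion, not code from the PR.

#ifdef GGML_CUDA_FORCE_CUBLAS
// MMQ was compiled out: abort loudly if this code path is ever reached.
void ggml_cuda_op_mul_mat_q(
    ggml_backend_cuda_context &,
    const ggml_tensor *, const ggml_tensor *, ggml_tensor *, const char *, const float *,
    const char *, float *, const int64_t, const int64_t, const int64_t,
    const int64_t, cudaStream_t) {
    GGML_ABORT("CUDA was compiled without MMQ support");
}
#else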

@@ -2924,6 +2925,7 @@ extern DECL_MMQ_CASE(GGML_TYPE_IQ3_S);
extern DECL_MMQ_CASE(GGML_TYPE_IQ1_S);
extern DECL_MMQ_CASE(GGML_TYPE_IQ4_NL);
extern DECL_MMQ_CASE(GGML_TYPE_IQ4_XS);
#endif

Suggested change
-#endif
+#endif // !defined(GGML_CUDA_FORCE_CUBLAS)
