Replies: 3 comments 1 reply
-
I tried other combinations. The Flux Q4_K_S just seems to be faster than the Flux Q3_K_S, despite the latter being smaller. T5 FP8 + Flux Q3_K_S obviously don't fit together in 8 GB VRAM, and still the Flux Q3_K_S was slower.
With 512 x 512 images:
T5 FP8 + Flux Q3_K_S (…)
T5 Q3_K_L + Flux Q4_K_S (…)
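As a rough sanity check on which combinations can fit in 8 GB: weight memory is about params × bits-per-weight / 8. A back-of-the-envelope sketch; the parameter counts (Flux.1-dev ≈ 12B, T5-XXL ≈ 4.7B) and the GGUF bits-per-weight figures are approximations I'm assuming, not measurements:

```python
# Rough weight-only footprint: params * bits-per-weight / 8.
# Activations, the VAE and CUDA overhead come on top of this.
GIB = 1024 ** 3

models = {                      # (approx. params, approx. bits/weight)
    "Flux Q3_K_S": (12e9, 3.5),
    "Flux Q4_K_S": (12e9, 4.6),
    "T5 Q3_K_L":   (4.7e9, 4.3),
    "T5 FP8":      (4.7e9, 8.0),
}

for name, (params, bpw) in models.items():
    print(f"{name:12s} ~{params * bpw / 8 / GIB:.1f} GiB")

# Flux Q3_K_S (~4.9) + T5 FP8 (~4.4) is roughly 9.3 GiB: over 8 GB, as reported.
# Flux Q3_K_S (~4.9) + T5 Q3_K_L (~2.4) is roughly 7.3 GiB: just fits.
```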
-
I think that the Q4 quants are, in general, faster to dequantise than the Q3 or Q5 quants. I suspect this is because 4-bit quants fit more simply into the 8-bit pseudo-tensors than 3-bit or 5-bit quants do?
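To illustrate the intuition, a minimal numpy sketch (illustrative packing, not the exact GGUF block layout): two 4-bit values sit exactly in one byte and unpack with a mask and a shift, whereas 5-bit values, as in GGUF's Q5_0/Q5_1, keep their fifth bit in a separate bit-plane that has to be gathered and OR-ed back in:

```python
import numpy as np

def unpack_q4(packed: np.ndarray) -> np.ndarray:
    # Two aligned 4-bit values per byte: one mask and one shift, nothing else.
    lo = packed & 0x0F
    hi = packed >> 4
    return np.stack([lo, hi], axis=-1).reshape(-1)

def unpack_q5(packed_lo: np.ndarray, fifth_bits: np.ndarray) -> np.ndarray:
    # 5-bit values don't align to bytes, so (as in GGUF's Q5_0/Q5_1) the low
    # 4 bits are packed like Q4 and the 5th bit lives in a separate bit-plane;
    # reassembly costs an extra gather, shift and OR per value.
    lo4 = unpack_q4(packed_lo)
    idx = np.arange(lo4.size)
    fifth = (fifth_bits[idx // 8] >> (idx % 8)) & 1
    return lo4 | (fifth.astype(lo4.dtype) << 4)
```

If I read the llama.cpp layouts right, Q3_K splits values the same way (a 2-bit base array plus a separate high-bit mask), which would fit the pattern of Q3 and Q5 dequantising slower than Q4.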
-
The code I'm using from city96 has on-the-fly encoding to Q4_1, Q5_1, and Q8_0 - not Q6. So doing a Q6 mix requires using the patching mechanism (downloading a Q6 model and mixing it in). I tend to do Q4, Q8, and bfloat16 mixes. But I'm working on some objective measures of the accuracy of different quants in different parts of the model.
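For anyone curious what the on-the-fly encoding amounts to, here is a minimal sketch of Q8_0-style block quantization as I understand the GGUF format (one scale per block of 32 values), not city96's actual code:

```python
import numpy as np

def quantize_q8_0(x: np.ndarray):
    # One fp16 scale per block of 32 values; weights reconstruct as d * q.
    # Assumes len(x) is a multiple of 32.
    blocks = x.astype(np.float32).reshape(-1, 32)
    d = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    d[d == 0] = 1.0                      # all-zero blocks: any scale works
    q = np.round(blocks / d).clip(-127, 127).astype(np.int8)
    return d.astype(np.float16), q

def dequantize_q8_0(d: np.ndarray, q: np.ndarray) -> np.ndarray:
    return (d.astype(np.float32) * q.astype(np.float32)).reshape(-1)
```

Q4_1 and Q5_1 follow the same per-block pattern but store an offset alongside the scale, so values reconstruct as d * q + m.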
-
My system: I have an old system with low VRAM, so Flux was always slow. But with the Flux Q3_K_S model and the T5 Q3_K_L encoder, I was able to generate without offloading to RAM.
https://github.com/city96/ComfyUI-GGUF
I thought this would improve speeds, but it's about the same and in some cases even worse when "loaded completely" is used. So is there a different bottleneck here?
This is with 1024 x 1024 images. The speed was about the same as with larger models, where Flux needs to be "loaded partially":
T5 Q3_K_L + Flux Q3_K_S (loaded completely)
T5 FP8 + Flux Q4_K_S (loaded partially)
And with 512 x 512 images, generation is even faster when the Flux model is "loaded partially"!
T5 Q3_K_L + Flux Q3_K_S (loaded completely)
T5 FP8 + Flux Q4_K_S (loaded partially)
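If it helps narrow down the bottleneck, this is roughly how I'd time the two loading modes fairly on a CUDA machine; `run_step` here is a hypothetical callable that performs one sampler step, not a ComfyUI API:

```python
import time
import torch

def steps_per_second(run_step, n_steps=20, warmup=3):
    # Warm up first, so lazy weight loads and kernel autotuning
    # don't get charged to the timed runs.
    for _ in range(warmup):
        run_step()
    torch.cuda.synchronize()             # flush queued GPU work before timing
    t0 = time.perf_counter()
    for _ in range(n_steps):
        run_step()
    torch.cuda.synchronize()
    return n_steps / (time.perf_counter() - t0)
```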