Replies: 3 comments 1 reply
-
I tried other combinations. The Flux Q4_K_S just seems to be faster than the Flux Q3_K_S, despite the latter being smaller. T5 FP8 + Flux Q3_K_S obviously don't fit together in 8 GB VRAM, and still the Flux Q3_K_S was slower.
With 512 x 512 images:
T5 FP8 + Flux Q3_K_S (…)
T5 Q3_K_L + Flux Q4_K_S (…)
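As a rough sanity check on which combinations can fit in 8 GB: weight memory is about params × bits-per-weight / 8. A back-of-the-envelope sketch; the parameter counts (Flux.1-dev ≈ 12B, T5-XXL ≈ 4.7B) and the GGUF bits-per-weight figures are approximations I'm assuming, not measurements:

```python
# Rough weight-only footprint: params * bits-per-weight / 8.
# Activations, the VAE and CUDA overhead come on top of this.
GIB = 1024 ** 3

models = {                      # (approx. params, approx. bits/weight)
    "Flux Q3_K_S": (12e9, 3.5),
    "Flux Q4_K_S": (12e9, 4.6),
    "T5 Q3_K_L":   (4.7e9, 4.3),
    "T5 FP8":      (4.7e9, 8.0),
}

for name, (params, bpw) in models.items():
    print(f"{name:12s} ~{params * bpw / 8 / GIB:.1f} GiB")

# Flux Q3_K_S (~4.9) + T5 FP8 (~4.4) is roughly 9.3 GiB: over 8 GB, as reported.
# Flux Q3_K_S (~4.9) + T5 Q3_K_L (~2.4) is roughly 7.3 GiB: just fits.
```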
-
I think that the Q4 quants are, in general, faster to dequantise than the Q3 or Q5 quants. I suspect this is because 4-bit quants fit more simply into the 8-bit pseudo-tensors than 3-bit or 5-bit quants do?
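To illustrate the intuition, a minimal numpy sketch (illustrative packing, not the exact GGUF block layout): two 4-bit values sit exactly in one byte and unpack with a mask and a shift, whereas 5-bit values, as in GGUF's Q5_0/Q5_1, keep their fifth bit in a separate bit-plane that has to be gathered and OR-ed back in:

```python
import numpy as np

def unpack_q4(packed: np.ndarray) -> np.ndarray:
    # Two aligned 4-bit values per byte: one mask and one shift, nothing else.
    lo = packed & 0x0F
    hi = packed >> 4
    return np.stack([lo, hi], axis=-1).reshape(-1)

def unpack_q5(packed_lo: np.ndarray, fifth_bits: np.ndarray) -> np.ndarray:
    # 5-bit values don't align to bytes, so (as in GGUF's Q5_0/Q5_1) the low
    # 4 bits are packed like Q4 and the 5th bit lives in a separate bit-plane;
    # reassembly costs an extra gather, shift and OR per value.
    lo4 = unpack_q4(packed_lo)
    idx = np.arange(lo4.size)
    fifth = (fifth_bits[idx // 8] >> (idx % 8)) & 1
    return lo4 | (fifth.astype(lo4.dtype) << 4)
```

If I read the llama.cpp layouts right, Q3_K splits values the same way (a 2-bit base array plus a separate high-bit mask), which would fit the pattern of Q3 and Q5 dequantising slower than Q4.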
-
The code I'm using from city96 has on-the-fly encoding to Q4_1, Q5_1, and Q8_0 - not Q6. So doing a Q6 mix requires using the patching mechanism (downloading a Q6 model and mixing it in). I tend to do Q4, Q8, and bfloat16 mixes. But I'm working on some objective measures of the accuracy of different quants in different parts of the model.
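For anyone curious what the on-the-fly encoding amounts to, here is a minimal sketch of Q8_0-style block quantization as I understand the GGUF format (one scale per block of 32 values), not city96's actual code:

```python
import numpy as np

def quantize_q8_0(x: np.ndarray):
    # One fp16 scale per block of 32 values; weights reconstruct as d * q.
    # Assumes len(x) is a multiple of 32.
    blocks = x.astype(np.float32).reshape(-1, 32)
    d = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    d[d == 0] = 1.0                      # all-zero blocks: any scale works
    q = np.round(blocks / d).clip(-127, 127).astype(np.int8)
    return d.astype(np.float16), q

def dequantize_q8_0(d: np.ndarray, q: np.ndarray) -> np.ndarray:
    return (d.astype(np.float32) * q.astype(np.float32)).reshape(-1)
```

Q4_1 and Q5_1 follow the same per-block pattern but store an offset alongside the scale, so values reconstruct as d * q + m.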
-
My system: I have an old system with low VRAM, so Flux was always slow. But with the Flux Q3_K_S model and the T5 Q3_K_L encoder, I was able to generate without offloading to RAM.
https://github.com/city96/ComfyUI-GGUF
I thought this would improve speeds, but it's about the same and in some cases even worse when "loaded completely" is used. So is there a different bottleneck here?
This is with 1024 x 1024 images. The speed was about the same as with larger models, where Flux needs to be "loaded partially":
T5 Q3_K_L + Flux Q3_K_S (loaded completely)
T5 FP8 + Flux Q4_K_S (loaded partially)
And with 512 x 512 images, generation is even faster when the Flux model is "loaded partially"!
T5 Q3_K_L + Flux Q3_K_S (loaded completely)
T5 FP8 + Flux Q4_K_S (loaded partially)
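If it helps narrow down the bottleneck, this is roughly how I'd time the two loading modes fairly on a CUDA machine; `run_step` here is a hypothetical callable that performs one sampler step, not a ComfyUI API:

```python
import time
import torch

def steps_per_second(run_step, n_steps=20, warmup=3):
    # Warm up first, so lazy weight loads and kernel autotuning
    # don't get charged to the timed runs.
    for _ in range(warmup):
        run_step()
    torch.cuda.synchronize()             # flush queued GPU work before timing
    t0 = time.perf_counter()
    for _ in range(n_steps):
        run_step()
    torch.cuda.synchronize()
    return n_steps / (time.perf_counter() - t0)
```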