
Add a launch argument for non_blocking=True #5268

Open
mturnshek opened this issue Oct 17, 2024 · 3 comments
Labels
Feature A new feature to add to ComfyUI.

Comments

@mturnshek
Contributor

mturnshek commented Oct 17, 2024

Feature Idea

In model_management.py

def device_should_use_non_blocking(device):
    if not device_supports_non_blocking(device):
        return False
    return False
    # return True #TODO: figure out why this causes memory issues on Nvidia and possibly others

Changing this function back to its pre-TODO state results in a large speedup in model patching. (19s -> 6s for Flux LoRAs on my computer.) It probably also speeds up loading in other areas.

This is because the largest bottleneck is the one-by-one blocking transfer of each layer of the unet to the GPU, which is massively accelerated if non_blocking=True.
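
For illustration, here's a minimal PyTorch sketch (not ComfyUI code, and the tensor count and sizes are made up) of why the non-blocking path helps so much when many layers are moved one at a time:

import torch

# Host tensors in pinned (page-locked) memory; pinning is what allows
# CUDA host-to-device copies to run asynchronously.
layers = [torch.randn(1024, 1024).pin_memory() for _ in range(64)]

# Blocking transfers: each .to() waits for the previous copy to finish.
on_gpu_blocking = [t.to("cuda") for t in layers]

# Non-blocking transfers: the copies are queued and can overlap, and we
# synchronize once at the end instead of once per layer.
on_gpu_async = [t.to("cuda", non_blocking=True) for t in layers]
torch.cuda.synchronize()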

Are there still memory issues? Changes made since the TODO was written, such as 39f114c, could mean the problems that originally caused them are less relevant now, or gone entirely.

Please consider re-adding support for non_blocking=True as a launch argument so users can start trying it out again.

Existing Solutions

No response

Other

No response

@mturnshek mturnshek added the Feature A new feature to add to ComfyUI. label Oct 17, 2024
@mturnshek mturnshek changed the title Add a launch argument for nonblocking=True Add a launch argument for non_blocking=True Oct 17, 2024
@comfyanonymous
Owner

Can you check if it's better now?

@mturnshek
Contributor Author

I can confirm that ComfyUI is way faster for both loading and model patching now, as of your commit from 40 minutes ago (6715899).

Loading the base model has gone from about 40s to about 5s for me. That's probably a 3-8x overall speedup for patching and loading, which are huge bottlenecks.

@mturnshek
Contributor Author

mturnshek commented Oct 17, 2024

By the way, have you ever thought about or looked into combining similar layer patches into one tensor before casting them and transferring them to the device in this section of model_patcher.py?

        load_completely.sort(reverse=True)
        for x in load_completely:
            n = x[1]
            m = x[2]
            weight_key = "{}.weight".format(n)
            bias_key = "{}.bias".format(n)
            if hasattr(m, "comfy_patched_weights"):
                if m.comfy_patched_weights == True:
                    continue

            self.patch_weight_to_device(weight_key, device_to=device_to)
            self.patch_weight_to_device(bias_key, device_to=device_to)
            logging.debug("lowvram: loaded module regularly {} {}".format(n, m))
            m.comfy_patched_weights = True

The transfers are all done individually since the layers have different shapes and that's the simplest approach. But if similarly shaped layers were combined and sent over as one stacked tensor, the number of calls would drop from something like 429 to ~14 for Flux's unet, with much more data per call.

I'm not sure how efficient sending lots of small tensors is compared to grouping them and sending one larger block, but the difference could be significant. A rough sketch of the idea is below.
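
Something along these lines is what I'm imagining (a hypothetical sketch, not the actual model_patcher code; transfer_grouped and weights are made-up names):

from collections import defaultdict
import torch

def transfer_grouped(weights, device):
    # Group weights by shape so same-shaped layers can travel together.
    groups = defaultdict(list)
    for name, w in weights.items():
        groups[w.shape].append((name, w))

    out = {}
    for shape, items in groups.items():
        # One stack and one transfer per shape group instead of one per layer.
        stacked = torch.stack([w for _, w in items]).to(device, non_blocking=True)
        for (name, _), moved in zip(items, stacked):
            out[name] = moved  # a view into the stacked device tensor
    return out

The torch.stack does add an extra host-side copy, so I'm not sure whether the fewer-but-larger transfers win out in practice.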
