MultiGPU Work Units For Accelerated Sampling #7063
base: master
Conversation
- …about the full scope of current sampling run, fix Hook Keyframes' guarantee_steps=1 inconsistent behavior with sampling split across different Sampling nodes/sampling runs by referencing 'sigmas'
- …ches to use target_dict instead of target so that more information can be provided about the current execution environment if needed
- … to separate out Wrappers/Callbacks/Patches into different hook types (all affect transformer_options)
- … hook_type, modified necessary code to no longer need to manually separate out hooks by hook_type
- …ptions to not conflict with the "sigmas" that will overwrite "sigmas" in _calc_cond_batch
- …ade AddModelsHook operational and compliant with should_register result, moved TransformerOptionsHook handling out of ModelPatcher.register_all_hook_patches, support patches in TransformerOptionsHook properly by casting any patches/wrappers/hooks to proper device at sample time
- …nsHook are not yet operational
- …ops nodes by properly caching between positive and negative conds, make hook_patches_backup behave as intended (in the case that something pre-registers WeightHooks on the ModelPatcher instead of registering it at sample time)
- …added some doc strings and removed a so-far unused variable
- …ok to InjectionsHook (not yet implemented, but at least getting the naming figured out)
- …t torch hardware device
I've been given the greenlight to finish this PR! Anyone who experienced black images, could you tell me your operating system, pytorch version, and hardware? I am able to reproduce, but want to confirm the scale of the problem. Anyone who experienced issues with changing loras causing memory leak messages, could you give a step-by-step guide to reproduce? I think I can brute force the steps, but having a good way to reproduce would be great!

@monstari I am able to reproduce your issue with Sage Attention, but haven't tried torch compile yet. What happens if you do torch compile without sage attention? From my initial look, it may be a triton bug w/ sage attention, but I'll need to reconfirm later.

Since I have a way to reproduce the memory management issues when close to the VRAM cap, I'll work on that too. The goal is for the major problems to be solved over the course of 1-3 weeks and then to submit the PR for review.
Thanks very much for working on this. I have been using it for about a month or two now. I started with 2x 3090 and recently 3x 3090 (not great, it has to be 4x 3090). So far it's been OK aside from the OOM after repeated runs. I just updated and am now going to test again to see if it's resolved.
@jkyamog for the OOMs, could you try to purposely make them happen with a 'simple' workflow and detail the steps here? I actually have a dual 3090 setup for testing at the moment as well, so that would be very helpful! Also, what operating system and pytorch version are you running, just so I can take a note?
Sure, this is the stock workflow from the ComfyUI wiki; I then added MultiGPU and set it to 2 GPUs. I got an OOM error after a 2nd run. Flux doesn't really work well with multi-GPU, but several batches or a more complicated workflow still improve it. This is just a simple workflow to get the OOM error. I have attached the image here, so you can easily import it into ComfyUI.

Might not be related: I have noticed, after updating to head, that when set to 3 GPUs it will now keep eating system RAM and swap until the kernel kills WSL. Before, I could run 3 GPUs, but it wasn't really helpful as 1 GPU was typically idle. So it's not really an issue and it might be unrelated. I have 96GB RAM, 88GB allocated to WSL, with 32GB swap. I am running in WSL; here are the relevant libs on conda/pip
I noticed that there’s no speed boost when using distilled models with CFG=1. Since Normalized Attention Guidance already provides similar negative conditioning at CFG=1, would it be possible to explore solutions similar to XDit’s parallel processing in the future? Also, if we have more than two GPUs, I assume this solution wouldn’t be as useful, since we can only apply two conditioning streams. Thanks for all your work still!
Hello, I also want to run this model with 4x 4090 GPUs. Can you share the workflow?
@Kosinkadink AMD users will need this patch:

```diff
diff --git a/comfy/model_management.py b/comfy/model_management.py
index 4ac04b8b..b396a034 100644
--- a/comfy/model_management.py
+++ b/comfy/model_management.py
@@ -194,7 +194,7 @@ def get_all_torch_devices(exclude_current=False):
     global cpu_state
     devices = []
     if cpu_state == CPUState.GPU:
-        if is_nvidia():
+        if is_nvidia() or is_amd():
             for i in range(torch.cuda.device_count()):
                 devices.append(torch.device(i))
         elif is_intel_xpu():
```
Thank you for the additional info! @DKingAlpha thanks for the heads up!

Firstly, has anyone here been able to get this working on Linux (not WSL)? And if so, what type of GPUs were they?

Secondly, @jkyamog this PR currently only does conditioning splitting - making conditionings run on separate GPUs. Wan2.1 has only two conditionings (positive and negative) without masking, so you are only able to accelerate it 2x with 2 GPUs - the other GPUs will have no work to be split for them. This is also the issue with models that only have one conditioning - there is nothing to split. I will be looking at some parallel attention schemes to try to overcome this limitation soon.

I did not have as much time to look into the remaining issues as I thought; I apologize for the delay. I will keep looking into them, plus acceleration beyond just conditioning splitting, soon.
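To make the scaling limit concrete, here is a small hypothetical illustration (not code from this PR, and all names here are made up): when conditionings are the work units, any GPU beyond the number of conditionings simply receives nothing to do.

```python
def assign_work_units(conds, devices):
    """Hypothetical round-robin assignment of conditioning work units to devices."""
    assignment = {dev: [] for dev in devices}
    for i, cond in enumerate(conds):
        assignment[devices[i % len(devices)]].append(cond)
    return assignment

# Wan 2.1 t2v with CFG has only positive and negative conds, so with four GPUs
# two of them get no work unit and the speedup from splitting stays around 2x:
print(assign_work_units(["positive", "negative"],
                        ["cuda:0", "cuda:1", "cuda:2", "cuda:3"]))
# -> {'cuda:0': ['positive'], 'cuda:1': ['negative'], 'cuda:2': [], 'cuda:3': []}
```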
Hi, in order to make this setup work with 2 GPUs, do you need enough VRAM to be able to run the Wan model twice on your first GPU? I noticed I get OOM errors when the deepclone part starts. I'm guessing that the clone requires the full model to be loaded, and then also the copy of the model, before it can be placed onto the 2nd GPU? Thanks.
That should not be a requirement. What are your exact errors? (post full stack trace + workflow)
MultiGPU work units are a feature only for nodes that use native sampling or that specifically reimplement support - the node you're looking at is a wrapper custom node that does not use native sampling.
Do you have a workflow that I can test to see if I installed this correctly? |
I think I have it working with a fix. 2x A40 on a RunPod. I reproduced black outputs and colorful noise in flux-dev fp8, cfg=1.1.

I got a black screen on some async tensor casting experiments I was doing for another change, and debugged it to be a race between the CUDA streams and the PyTorch garbage collector, so I thought I'd check for the same bug here. I remember @Kosinkadink saying this was blocked by black screens in a Discord post. So I think something similar is going on here, where the GPU->GPU ops are asynchronous with respect to the CPU, and the CPU is able to run ahead and queue a cudaAsyncFree on one GPU while the other is still bus mastering the .to transfers, depending on who is the bus master and tensor owner. In the case of pull DMA this can easily be a race that corrupts tensors before the transfer completes. PyTorch documentation is sparse on this, so it's all theory.

So if I'm right, this can be fixed by always bounce buffering through RAM, which syncs the CPU. There is, in theory, a performance penalty here as it changes the DMA path from master-slave to master-RAM-master, but I'm not observing any penalty in my initial tests (B=4 1024x1024 cfg=1.1 Flux dev).

Properly syncing the GPU->GPU DMA is a complex web of driver specifics, so this is a lot easier. If this ends up being slow for other use cases (very large latents), you could chunk the .to as a series of queued copies instead, so the two bus masters start overlapping work and the performance will likely converge on something very close to master-slave, given the above.
Overview
This PR adds support for MultiGPU acceleration via 'work unit' splitting - by default, conditionings are treated as work units. Any model that uses more than a single conditioning can be sped up via MultiGPU Work Units - positive+negative, multiple positive/masked conditionings, etc. The code is extensible so that extensions can implement their own work units; as a proof of concept, I have implemented AnimateDiff-Evolved contexts behaving as work units.
As long as there is a heavy bottleneck on the GPU, there will be a noticeable performance improvement. If the GPU is only lightly loaded (e.g. an RTX 4090 sampling a single 512x512 SD1.5 image), the overhead of splitting and combining work units will result in a performance loss compared to using just one GPU.
The MultiGPU Work Units node can be placed in (almost) any existing workflow. When only one device is found, the node does effectively nothing, so workflows making use of the node will stay compatible between single and multi-GPU setups.

The feature works best when work splitting is symmetrical (the GPUs are the same or have roughly the same performance), with the slowest GPU acting as the limiter. For asymmetrical setups, the MultiGPU Options node can be used to inform the load balancing code about the relative performance of the devices in the setup.

Hardware compatibility:
Nvidia (CUDA): Tested, works ✅
AMD (ROCm): Untested, will validate soon ⚠️
AMD (DirectML): Untested ⚠️
Intel (Arc XPU): Tested, does not work on Windows but works on Linux
Implementation Details
Based on `max_gpus` and the number of available devices, the main ModelPatcher is cloned and relevant properties (like the model) are deepcloned after the values are unloaded. MultiGPU clones are stored on the ModelPatcher's `additional_models` under the key `multigpu`. During sampling, the deepcloned ModelPatchers are re-cloned with the values from the main ModelPatcher, with any `additional_models` kept consistent. To avoid unnecessarily deepcloning models, `currently_loaded_models` from `comfy.model_management` is checked for a matching deepcloned model, in which case it is (soft) cloned and made to match the main ModelPatcher.

When native conds are used as the work units, `_calc_cond_batch` calls and returns `_calc_cond_batch_multigpu`, avoiding a potential performance regression if the single-GPU code were refactored. In the future, this can be revisited to reuse the same code while carefully comparing performance for various models. No processes are created, only Python threads; while the GIL limits CPU performance, the GPU being the bottleneck makes diffusion I/O-bound rather than CPU-bound, and this vastly improves compatibility with existing code (a simplified sketch of the thread dispatch is included at the end of this section).

Since deepcloning requires that the base model is 'clean', `comfy.model_management` has received an `unload_model_and_clones` function to unload only specific models and their clones.

The `--cuda-device` startup argument has been refactored to accept a string rather than an int, allowing multiple ids to be provided while not breaking any existing usage. This can be used not only to limit ComfyUI's visibility to a subset of devices per instance, but also to set their order (the first id is treated as device:0, the second as device:1, etc.).
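For illustration only, here is a minimal sketch of the thread-based dispatch described above. It is not the actual `_calc_cond_batch_multigpu` code; `run_work_units_threaded`, `model_per_device`, and `work_units` are hypothetical stand-ins for the per-device model clones and the split conds. The point it shows is that plain Python threads are enough, because the GIL is released while CUDA kernels execute.

```python
import threading

import torch

def run_work_units_threaded(model_per_device, work_units, out_device):
    """Hypothetical sketch (not the PR's code): run one work unit per device
    on its own Python thread and gather the outputs on out_device.

    model_per_device: dict mapping torch.device -> model clone on that device
    work_units: list of (device, input_tensor) pairs
    """
    results = [None] * len(work_units)

    def worker(idx, device, unit):
        # The GIL is released while CUDA kernels run, so devices execute
        # concurrently even though these are ordinary threads.
        with torch.no_grad():
            out = model_per_device[device](unit.to(device))
        results[idx] = out.to(out_device)  # gather back on the main device

    threads = [
        threading.Thread(target=worker, args=(i, dev, unit))
        for i, (dev, unit) in enumerate(work_units)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```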
Performance (will add more examples soon)
Wan 1.3B t2v: 1.85x uplift for 2 RTX 4090s vs 1 RTX 4090.
Wan 14B t2v: 1.89x uplift for 2 RTX 4090s vs 1 RTX 4090.