Copilot AI commented Jan 4, 2026

Fix VRAM cache calculation to properly account for device_working_mem_gb

Problem Analysis

When generating with larger models (like Q8 Z-Image Turbo), users get OOM errors during VAE decoding.

Root Cause:
The ZImageLatentsToImageInvocation and ZImageImageToLatentsInvocation do not request additional working memory for VAE operations, unlike the standard SD1.5/SDXL/SD3/CogView4 invocations. This means the model cache doesn't offload enough models from VRAM before VAE operations run, leaving no room for the operation's intermediate tensors.
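
The cache-side mechanics can be made concrete with a small toy model. The sketch below is purely illustrative (it is not InvokeAI's ModelCache, and the sizes are invented); it shows how a working-memory reservation forces the cache to offload other models before the VAE runs, whereas without the reservation the decode's intermediate tensors have nowhere to go.

```python
import contextlib


class ToyVramCache:
    """Toy stand-in for the cache behaviour described above (not InvokeAI's ModelCache)."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity = capacity_bytes
        self.resident: dict[str, int] = {}  # model name -> size in bytes

    def _free(self) -> int:
        return self.capacity - sum(self.resident.values())

    def _evict_until(self, needed: int) -> None:
        # Offload resident models (insertion order) until `needed` bytes are free.
        for name in list(self.resident):
            if self._free() >= needed:
                break
            print(f"offloading {name} to make room")
            del self.resident[name]

    @contextlib.contextmanager
    def model_on_device(self, name: str, size: int, working_mem_bytes: int = 0):
        # Reserve room for the weights *plus* the caller's working memory.
        self._evict_until(size + working_mem_bytes)
        if self._free() < size + working_mem_bytes:
            raise MemoryError("out of VRAM")  # the OOM the issue reports
        self.resident[name] = size
        yield


cache = ToyVramCache(capacity_bytes=12 * 2**30)  # a 12 GB card
cache.resident = {"text_encoder": 4 * 2**30, "transformer": 7 * 2**30}

# Without working_mem_bytes the small VAE "fits", but its intermediates would not.
# With the reservation, the cache offloads other models first:
with cache.model_on_device("vae", size=300 * 2**20, working_mem_bytes=3 * 2**30):
    pass  # decode would run here with ~3 GB of headroom
```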

Comparison:

  • LatentsToImageInvocation (SD1.5/SDXL): Calls estimate_vae_working_memory_sd15_sdxl() and passes working_mem_bytes to model_on_device()
  • SD3LatentsToImageInvocation: Calls estimate_vae_working_memory_sd3() and passes working_mem_bytes
  • CogView4LatentsToImageInvocation: Calls estimate_vae_working_memory_cogview4() and passes working_mem_bytes
  • ZImageLatentsToImageInvocation: Didn't estimate or request working memory (NOW FIXED ✅)
  • ZImageImageToLatentsInvocation: Didn't estimate or request working memory (NOW FIXED ✅)

Changes Made

  • Analyze the issue and identify root cause
  • Confirm the bug in z_image_latents_to_image.py and z_image_image_to_latents.py
  • Implement working memory estimation for Z-Image VAE decode
  • Implement working memory estimation for Z-Image VAE encode
  • Update both invocations to request working memory via model_on_device(working_mem_bytes=...)
  • Add test to verify working memory estimation is called correctly
  • Fix test to handle FluxAutoEncoder vs AutoencoderKL differences
  • Fix test to use model_construct() to bypass Pydantic validation for mock objects
  • Run code review (passed with no issues)
  • Run security scan (no vulnerabilities found)
  • Final validation complete

Technical Details

The fix adds working memory estimation to both Z-Image VAE invocations:

  1. Detects whether the VAE is FLUX (FluxAutoEncoder) or Diffusers (AutoencoderKL)
  2. Calls the appropriate estimation function:
    • estimate_vae_working_memory_flux() for FLUX VAE
    • estimate_vae_working_memory_sd3() for AutoencoderKL
  3. Passes the estimated working memory to model_on_device(working_mem_bytes=...)

This ensures the model cache properly offloads models to make room for VAE operations before they run, preventing OOM errors.
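
For intuition only, a decode-side estimate of this kind typically scales with the output resolution and the widest intermediate activation of the decoder. The sketch below is a hypothetical stand-in, not the body of estimate_vae_working_memory_flux() or estimate_vae_working_memory_sd3(); the channel count, upscale factor, and overhead multiplier are assumptions.

```python
import torch


# Hypothetical stand-in for a VAE-decode working-memory estimate. The 8x spatial
# scale, 512-channel width, and 2.0 overhead multiplier are illustrative assumptions.
def estimate_vae_decode_working_memory(
    latents: torch.Tensor,
    spatial_scale: int = 8,           # latent -> pixel upscale factor (assumed)
    max_decoder_channels: int = 512,  # widest intermediate activation (assumed)
    dtype: torch.dtype = torch.float16,
    overhead: float = 2.0,            # headroom for additional temporaries (assumed)
) -> int:
    """Rough upper bound, in bytes, on the VRAM a decode needs beyond the VAE weights."""
    batch, _, height, width = latents.shape
    out_h, out_w = height * spatial_scale, width * spatial_scale
    element_size = torch.empty((), dtype=dtype).element_size()
    return int(batch * max_decoder_channels * out_h * out_w * element_size * overhead)


# A 1024x1024 image decoded from a 128x128 latent needs roughly:
latents = torch.zeros(1, 16, 128, 128)
print(f"{estimate_vae_decode_working_memory(latents) / 2**30:.1f} GiB")  # ~2.0 GiB
```

The estimate is then passed as working_mem_bytes when locking the VAE onto the GPU, so the reservation covers the decode's intermediate tensors rather than just the model weights.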

Test Fixes

  1. Fixed a unit test issue where the config attribute was being set on the FluxAutoEncoder mock, which doesn't have that attribute. The test now sets config attributes only for AutoencoderKL VAEs.
  2. Fixed a Pydantic validation error by using model_construct() instead of the regular constructor to create invocation instances with mock fields, bypassing validation while still testing the core logic.
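
A minimal sketch of the model_construct() pattern, using a toy Pydantic model rather than the real ZImageLatentsToImageInvocation (the field and class names here are stand-ins):

```python
from unittest.mock import MagicMock

from pydantic import BaseModel


class LatentsField(BaseModel):
    latents_name: str


class ToyInvocation(BaseModel):
    """Stand-in for an invocation whose fields are typed Pydantic models."""

    latents: LatentsField


# The regular constructor validates field types, so a MagicMock is rejected:
#   ToyInvocation(latents=MagicMock())  ->  pydantic ValidationError

# model_construct() skips validation, letting a test inject mocks while still
# exercising the invocation's logic:
inv = ToyInvocation.model_construct(latents=MagicMock())
print(inv.latents.latents_name)  # behaves like any MagicMock attribute
```

The trade-off is that nothing validates the mock fields, so the test still has to assert on the calls it cares about (here, that model_on_device() received the expected working_mem_bytes).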

Files Modified

  • invokeai/app/invocations/z_image_latents_to_image.py: Added working memory estimation for decode
  • invokeai/app/invocations/z_image_image_to_latents.py: Added working memory estimation for encode
  • tests/app/invocations/test_z_image_working_memory.py: Added tests to verify working memory estimation

Expected Impact

Users will no longer need to manually set max_cache_vram_gb to work around OOM errors. The device_working_mem_gb setting (default 3GB) will now work correctly for Z-Image models, as the VAE operations will request appropriate working memory and the model cache will offload models accordingly.

Original prompt

This section details the original issue to be resolved

<issue_title>[bug]: Out of Memory errors with larger models</issue_title>
<issue_description>### Is there an existing issue for this problem?

  • I have searched the existing issues

Install method

Invoke's Launcher

Operating system

Linux

GPU vendor

Nvidia (CUDA)

GPU model

RTX 4070

GPU VRAM

12GB

Version number

6.10.0rc2

Browser

No response

System Information

No response

What happened

When generating with the Q8 Z-Image Turbo model, I am getting out of memory errors during the VAE decoding phase. I can avoid the errors by setting max_cache_vram_gb to 4 GB, at which point I see VRAM use rise to ~4 GB. However, it doesn't seem intuitive to me that adjusting the VRAM cache should be the way to fix the error.

I also tried setting device_working_mem_gb: 4 in my config file, but without the VRAM cache setting, I again get OOM.

Here is the log from a successful generation with the VRAM cache limited to 4 GB:

[2026-01-04 13:56:33,447]::[ModelManagerService]::INFO --> [MODEL CACHE] Calculated model RAM cache size: 4096.00 MB. Heuristics applied: [1, 2].
[2026-01-04 13:56:33,521]::[InvokeAI]::INFO --> Invoke running on http://127.0.0.1:9090 (Press CTRL+C to quit)
[2026-01-04 13:57:16,700]::[InvokeAI]::INFO --> Executing queue item 124, session 59a3e7ce-10c9-4f8f-8243-e42ac139e7b8
C:\DWR\gits\invoke-20251225\InvokeAI\invokeai\backend\quantization\gguf\loaders.py:43: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\torch\csrc\utils\tensor_numpy.cpp:209.)
  torch_tensor = torch.from_numpy(tensor.data)
[2026-01-04 13:57:24,816]::[Qwen3EncoderGGUFLoader]::INFO --> Detected llama.cpp GGUF format, converting keys to PyTorch format
[2026-01-04 13:57:24,818]::[Qwen3EncoderGGUFLoader]::INFO --> Qwen3 GGUF Encoder config detected: layers=36, hidden=2560, heads=32, kv_heads=8, intermediate=9728, head_dim=128
[2026-01-04 13:57:25,834]::[Qwen3EncoderGGUFLoader]::INFO --> Dequantized embed_tokens weight for embedding lookups
[2026-01-04 13:57:25,835]::[Qwen3EncoderGGUFLoader]::INFO --> Tied lm_head.weight to embed_tokens.weight (GGUF tied embeddings)
[2026-01-04 13:57:28,880]::[ModelManagerService]::INFO --> [MODEL CACHE] Locking model cache entry cc563632-c564-42b4-abd6-64fee52df1ab:text_encoder (Type: Qwen3ForCausalLM), but it has already been dropped from the RAM cache. This is a sign that the model loading order is non-optimal in the invocation code (See https://github.com/invoke-ai/InvokeAI/issues/7513).
[2026-01-04 13:57:30,223]::[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model 'cc563632-c564-42b4-abd6-64fee52df1ab:text_encoder' (Qwen3ForCausalLM) onto cuda device in 1.34s. Total model size: 4326.88MB, VRAM: 3585.01MB (82.9%)
[2026-01-04 13:57:30,224]::[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model 'cc563632-c564-42b4-abd6-64fee52df1ab:tokenizer' (Qwen2TokenizerFast) onto cuda device in 0.00s. Total model size: 0.00MB, VRAM: 0.00MB (0.0%)
[2026-01-04 13:57:31,127]::[ModelManagerService]::INFO --> [MODEL CACHE] Unlocking model cache entry cc563632-c564-42b4-abd6-64fee52df1ab:text_encoder (Type: Qwen3ForCausalLM), but it has already been dropped from the RAM cache. This is a sign that the model loading order is non-optimal in the invocation code (See https://github.com/invoke-ai/InvokeAI/issues/7513).
[2026-01-04 13:57:39,397]::[Qwen3EncoderGGUFLoader]::INFO --> Detected llama.cpp GGUF format, converting keys to PyTorch format
[2026-01-04 13:57:39,398]::[Qwen3EncoderGGUFLoader]::INFO --> Qwen3 GGUF Encoder config detected: layers=36, hidden=2560, heads=32, kv_heads=8, intermediate=9728, head_dim=128
[2026-01-04 13:57:40,505]::[Qwen3EncoderGGUFLoader]::INFO --> Dequantized embed_tokens weight for embedding lookups
[2026-01-04 13:57:40,505]::[Qwen3EncoderGGUFLoader]::INFO --> Tied lm_head.weight to embed_tokens.weight (GGUF tied embeddings)
[2026-01-04 13:57:41,865]::[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model 'cc563632-c564-42b4-abd6-64fee52df1ab:text_encoder' (Qwen3ForCausalLM) onto cuda device in 1.31s. Total model size: 4326.88MB, VRAM: 3585.01MB (82.9%)
[2026-01-04 13:57:41,866]::[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model 'cc563632-c564-42b4-abd6-64fee52df1ab:tokenizer' (Qwen2TokenizerFast) onto cuda device in 0.00s. Total model size: 0.00MB, VRAM: 0.00MB (0.0%)
[2026-01-04 13:57:52,471]::[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model '8f5ce803-d406-403b-babf-490a231dcdbc:transformer' (ZImageTransformer2DModel) onto cuda device in 1.28s. Total mo...


- Fixes lstein/InvokeAI#7


Copilot AI and others added 2 commits January 4, 2026 22:47
Copilot AI changed the title from "[WIP] Fix out of memory errors with larger models" to "Fix Z-Image VAE operations not reserving working memory for OOM prevention" on Jan 4, 2026
Copilot AI requested a review from lstein January 4, 2026 22:53

lstein commented Jan 5, 2026

@copilot This error is occurring in the unit test:

FAILED tests/app/invocations/test_z_image_working_memory.py::TestZImageWorkingMemory::test_z_image_latents_to_image_requests_working_memory[AutoEncoder] - AttributeError: Mock object has no attribute 'config'


Copilot AI commented Jan 5, 2026

@copilot This error is occurring in the unit test:

FAILED tests/app/invocations/test_z_image_working_memory.py::TestZImageWorkingMemory::test_z_image_latents_to_image_requests_working_memory[Aut...

Fixed in commit e13401b. The issue was that FluxAutoEncoder doesn't have a config attribute (only AutoencoderKL does). Updated the test to only set config attributes when testing with AutoencoderKL, matching the actual code behavior where config.scaling_factor and config.shift_factor are only accessed for AutoencoderKL VAEs.
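
A sketch of the parametrized mock setup this describes; the VAE classes below are local stand-ins (the real test targets the FLUX AutoEncoder and diffusers' AutoencoderKL), and spec= is what makes the FLUX-style mock reject a config attribute just as the real class does.

```python
from unittest.mock import MagicMock

import pytest


class FakeFluxAutoEncoder:  # stand-in: no `config` attribute, like the FLUX VAE
    def decode(self, z): ...


class FakeAutoencoderKL:  # stand-in: diffusers-style VAE that carries a config
    config = None

    def decode(self, z): ...


@pytest.fixture(params=["AutoEncoder", "AutoencoderKL"])
def mock_vae(request):
    if request.param == "AutoencoderKL":
        vae = MagicMock(spec=FakeAutoencoderKL)
        # Only the diffusers-style VAE exposes these; setting them on the
        # FLUX-style mock is what raised "Mock object has no attribute 'config'".
        vae.config.scaling_factor = 1.0  # placeholder values
        vae.config.shift_factor = 0.0
    else:
        vae = MagicMock(spec=FakeFluxAutoEncoder)
    return vae
```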


lstein commented Jan 5, 2026

@copilot Still failures on the python tests:

=========================== short test summary info ============================
FAILED tests/app/invocations/test_z_image_working_memory.py::TestZImageWorkingMemory::test_z_image_latents_to_image_requests_working_memory[AutoencoderKL] - pydantic_core._pydantic_core.ValidationError: 2 validation errors for ZImageLatentsToImageInvocation
latents
  Input should be a valid dictionary or instance of LatentsField [type=model_type, input_value=<MagicMock id='140315402449072'>, input_type=MagicMock]
    For further information visit https://errors.pydantic.dev/2.12/v/model_type
vae
  Input should be a valid dictionary or instance of VAEField [type=model_type, input_value=<MagicMock id='140315402451952'>, input_type=MagicMock]
    For further information visit https://errors.pydantic.dev/2.12/v/model_type
FAILED tests/app/invocations/test_z_image_working_memory.py::TestZImageWorkingMemory::test_z_image_latents_to_image_requests_working_memory[AutoEncoder] - pydantic_core._pydantic_core.ValidationError: 2 validation errors for ZImageLatentsToImageInvocation
latents
  Input should be a valid dictionary or instance of LatentsField [type=model_type, input_value=<MagicMock id='140315406743648'>, input_type=MagicMock]
    For further information visit https://errors.pydantic.dev/2.12/v/model_type
vae
  Input should be a valid dictionary or instance of VAEField [type=model_type, input_value=<MagicMock id='140315402388832'>, input_type=MagicMock]
    For further information visit https://errors.pydantic.dev/2.12/v/model_type
= 2 failed, 649 passed, 100 skipped, 3 deselected, 10 warnings in 80.41s (0:01:20) =
Error: Process completed with exit code 1.

lstein marked this pull request as ready for review January 5, 2026 03:14
lstein merged commit 4201d4c into lstein-master on Jan 5, 2026
19 of 25 checks passed