feat: transition from compile-time to runtime backend discovery #1448
leejet merged 6 commits into leejet:master
Conversation
```cmake
set(GGML_SYCL ON)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-narrowing -fsycl")
add_definitions(-DSD_USE_SYCL)
# disable fast-math on host, see:
```
This part (not changed by the PR) is weird: it could make sense to set it for ggml, but for sd.cpp? If so, what will happen when we add support for GGML_BACKEND_DL?
```cpp
#define __STATIC_INLINE__ static inline
#endif

inline void ggml_backend_load_all_once() {
```
@Cyberhan123, why exactly does this need to be inline?
Also: do we really need the #if defined(GGML_BACKEND_DL) guard? As I understand it, for static-backend mode ggml_backend_device_count would return > 0 anyway?
To avoid conflicts with the ggml API, inlining is required.
The reason is that in a shared-library scenario, the host has to call ggml_backend_load_all() or ggml_backend_load_all_from_path externally. In that case the backend device count is already greater than 0, so the scan is skipped.
Therefore, a more reasonable solution is to call ggml_init_best in example/xxx, not in the library's init_best.
This approach directly exposes ggml_backend_t in the C API, which is actually better.
It's just a compromise I made to adapt to the previous C API chosen for setting the device via strings.
For a concrete example, please refer to this link: https://github.com/Cyberhan123/slab.rs/blob/main/crates/slab-ggml/src/lib.rs#L49
To explain: in DLL mode, the ggml backend libraries and the executable are not in the same directory, so ggml_backend_load_all_from_path is required.
Actually, I am not asking about the dynamic case: I'm asking why this block is skipped with #ifdef for the static one 🙂 Does anything break if we do this?
```diff
diff --git a/src/ggml_extend_backend.hpp b/src/ggml_extend_backend.hpp
index 19cf56f..1b7a083 100644
--- a/src/ggml_extend_backend.hpp
+++ b/src/ggml_extend_backend.hpp
@@ -7,33 +7,31 @@
 #include "ggml-backend.h"
 #include "ggml.h"

 #ifndef __STATIC_INLINE__
 #define __STATIC_INLINE__ static inline
 #endif

 inline void ggml_backend_load_all_once() {
-#if defined(GGML_BACKEND_DL)
     // If the host process already preloaded backends explicitly
     // (for example via ggml_backend_load / ggml_backend_load_all_from_path),
     // do not rescan the default paths again.
     if (ggml_backend_dev_count() > 0) {
         return;
     }
     // In dynamic-backend mode the backend modules are discovered at runtime,
     // so we must load them before asking for the CPU backend or its proc table.
     static std::once_flag once;
     std::call_once(once, []() {
         if (ggml_backend_dev_count() > 0) {
             return;
         }
         ggml_backend_load_all();
     });
-#endif
 }

 #if defined(GGML_BACKEND_DL)
```
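The skip-if-preloaded behavior this diff implements can be modeled in a standalone sketch. Everything here is mocked (the device count and the loader stand in for ggml_backend_dev_count and ggml_backend_load_all; this is not the real ggml API), but the control flow is the same: skip if the host already registered backends, otherwise scan exactly once across threads.

```cpp
#include <mutex>

static int g_dev_count  = 0;  // stands in for ggml_backend_dev_count()
static int g_scan_calls = 0;  // how many times the default-path scan ran

static void mock_load_all() {  // stands in for ggml_backend_load_all()
    ++g_scan_calls;
    g_dev_count = 2;           // pretend two backends were discovered
}

inline void backend_load_all_once() {
    // Host preloaded backends explicitly: do not rescan.
    if (g_dev_count > 0) {
        return;
    }
    static std::once_flag once;
    std::call_once(once, [] {
        // Re-check inside the once-guard: another thread may have raced us
        // past the first check before the scan ran.
        if (g_dev_count > 0) {
            return;
        }
        mock_load_all();
    });
}
```

Repeated calls after the first are no-ops, both via the fast-path count check and via the `std::once_flag`.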
There may be problems with it, because it is loaded through a shared library.
The reason I am asking is: if we ever have a situation where a file includes this with GGML_BACKEND_DL defined, and another one without GGML_BACKEND_DL, we'll hit an ODR violation. IMHO, that could be harder to debug than whatever issue comes from ggml_backend_dev_count() possibly returning 0 for a static-backend build (and looking at the ggml code, that would only happen with no backend enabled anyway).
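The macro-dependent inline body can be illustrated in a single file. Here `load_once_path` is a hypothetical stand-in (not the real header's function) whose body changes with `GGML_BACKEND_DL`; if two translation units compile it with the macro set differently, they carry two differing definitions of the same inline function, which is exactly the ODR violation described above.

```cpp
// The body of this inline function depends on a configuration macro.
// One TU built with -DGGML_BACKEND_DL and one built without it would
// each emit a different definition under the same name: ODR violation,
// undefined behavior, and typically no diagnostic from the linker.
#ifdef GGML_BACKEND_DL
inline int load_once_path() { return 1; }  // dynamic: scan default paths
#else
inline int load_once_path() { return 0; }  // static: scan compiled out
#endif
```

Keeping the function body identical regardless of `GGML_BACKEND_DL` removes this failure mode entirely.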
There are currently multiple PRs addressing similar changes around backend discovery and default selection. Among them, this PR has the most focused scope and does not mix in additional features, which makes it easier to review and merge safely. I also see that @stduhpf and @Cyberhan123 are already listed as co-authors here; thanks for taking care of that.

Given that, I'm going to start with this PR and get it merged first. For #1184 and #1368, since they include additional functionality beyond backend discovery/selection, I'd recommend splitting those parts into separate, focused PRs. This will make the review process smoother and help get the remaining changes merged incrementally.

Thanks everyone for the work on this!
Applies the following parts from leejet#1184 and leejet#1368:
- Introduce ggml_extend_backend.hpp for dynamic backend management.
- Convert backend-specific SD_USE_* preprocessor tests to runtime tests, propagating the backend handler when needed.

Additionally, to make this work with minimal changes:
- A new function sd_get_default_backend replaces the default backend selection in src/stable-diffusion.cpp and src/upscaler.cpp, preserving the SD_VK_DEVICE env var support.
- Clean up SD_USE_* defines from CMakeLists.txt (no other build changes).

This is not just a refactor, because it improves device selection a bit:
- Previously, Vulkan selected device 0 by default, but this was the wrong choice on my system, which has the iGPU as 0 and the discrete card as 1. The new selection algorithm correctly prioritizes the discrete GPU.
- The upscaler now follows SD_VK_DEVICE too.

Co-authored-by: Stéphane du Hamel <stephduh@live.fr>
Co-authored-by: Cyberhan123 <255542417@qq.com>
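The discrete-over-integrated preference can be sketched with mocked device types. `DevType` and `pick_default_device` are illustrative names, not the sd.cpp or ggml API; the point is the priority order: first discrete GPU, then integrated GPU, then whatever is at index 0.

```cpp
#include <cstddef>
#include <vector>

enum class DevType { CPU, IGPU, DGPU };

// Return the index of the preferred default device.
inline int pick_default_device(const std::vector<DevType>& devs) {
    for (std::size_t i = 0; i < devs.size(); ++i)
        if (devs[i] == DevType::DGPU) return static_cast<int>(i);  // discrete first
    for (std::size_t i = 0; i < devs.size(); ++i)
        if (devs[i] == DevType::IGPU) return static_cast<int>(i);  // then integrated
    return 0;                                                      // fall back to device 0
}
```

With the device list from the PR description (iGPU at 0, discrete card at 1), this picks index 1 instead of the old hard-coded 0.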
If one translation unit includes the header with GGML_BACKEND_DL defined, and another includes it without, we'll hit an ODR violation, which is undefined behavior. For the current static-backend build, `ggml_backend_dev_count() > 0` will skip the `ggml_backend_load_all` call anyway.
The `#include "ggml-cpu.h"` removal triggered a build error in util.cpp. To avoid this kind of compatibility issue, it is preferable to build the same source code regardless of the GGML_BACKEND_DL configuration. The problem is that util.cpp requires ggml-cpu.h, which cannot be included alongside ggml_extend_backend.hpp. But since this is only required for a single function, I am using an include-inside-namespace approach to move away the definitions, and keep the diff minimal.

Other ways to avoid this issue could be:
* removing sd_get_system_info entirely (which may not be a bad idea, as dynamic backends could have been built with completely different flags);
* moving the sd_get_system_info definition to a separate source file;
* putting ggml_extend_backend.hpp functions into their own namespace;
* renaming all ggml_extend_backend.hpp symbols to ggml_ext_*.
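The include-inside-namespace approach can be illustrated with a self-contained analog. All names here are hypothetical; in util.cpp the real trick wraps `#include "ggml-cpu.h"` in a namespace so its declarations no longer collide with the ones pulled in via ggml_extend_backend.hpp. Since a second header file can't be shown in a single fence, both "header" bodies are inlined directly:

```cpp
#include <cstring>

// Analog of the clash: both "headers" want to define sd_cpu_info().
// In the same translation unit, two definitions at global scope would
// be a redefinition error.

// Definitions as if pulled in from ggml_extend_backend.hpp:
inline const char* sd_cpu_info() { return "from ggml_extend_backend.hpp"; }

// Definitions as if pulled in via `namespace cpu_detail { #include ... }`:
// the namespace gives every included symbol a distinct qualified name.
namespace cpu_detail {
inline const char* sd_cpu_info() { return "from ggml-cpu.h"; }
}
```

The single function that needs the ggml-cpu.h symbols then calls them as `cpu_detail::...`, and the rest of the file is untouched.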
acba5a9 to 0fd3c74
…split-spec

Adds the LTX-2-specific overrides on top of the auto-fit lazy-load and split-spec machinery that lives on backend-fit:
- LTXAVEmbedder::set_llm_lazy_load forwards to the inner Gemma LLM runner so the planner's "lazy load if init-time SUM exceeds device cap" path can defer Gemma's params alloc until the first encode.
- LTXAVModel::set_lazy_load forwards to the LTXAVRunner.
- stable-diffusion.cpp's LTX-2 branch primes the cond/DiT split specs (prime_cond_spec / prime_dit_spec) before constructing the runners, so layer-split / row-split can apply.
- LTXAVModel now uses diffusion_backend (per-component) rather than the legacy single backend handle.

Also restores the 4-arg patch_weight signature in src/llm.hpp; the ltx2.3 PR predated the runtime-backend-discovery PR (leejet#1448), which changed patch_weight's signature.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The TRUE per-layer streaming path was unconditionally evicting every block back to CPU after each forward pass, even when there was plenty of free VRAM left. For an 8-step generation that re-streams the entire model 7 extra times.

Decide once, on the first sampling step, how many leading blocks fit permanently in VRAM (after subtracting prefetch headroom + compute buffer + safety margin) and skip the eviction for those indices. Later steps' prime_prefetch starts at the first non-resident block, so the cache prefix is hit for free. The pattern follows ComfyUI's ModelPatcher.partially_load(): a static partition is simpler and cheaper than dynamic eviction for the cyclic-sequential access pattern of diffusion sampling.

Also fix MemoryBudgetManager::query_device_memory(): the SD_USE_CUDA guard was dead code after PR leejet#1448 switched to runtime backend discovery, so every build was returning the hardcoded 8 GB / 4 GB fallback regardless of the real GPU. Use ggml_backend_dev_memory() instead; it works for CUDA, Vulkan, and Metal.

For ZImage 8 steps at 688x1024 on RTX 3060 12 GB:
- before: 7.21s/step steady, 57.80s sampling
- after: 4.45s/step steady, 39.64s sampling (1.46x)

The same caching helper (compute_resident_block_count) is added to LayerExecutionEngine and applied to z_image, flux (double + single), mmdit, anima, qwen_image. UNet (skip connections) and WAN (no per-layer streaming yet) are unchanged.
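The static-partition decision described above can be sketched as a greedy prefix fit. The signature and names here are illustrative (the real compute_resident_block_count lives in the PR's LayerExecutionEngine): subtract the headroom reservations from free VRAM, then count how many leading blocks fit into what remains.

```cpp
#include <cstddef>
#include <vector>

// Count how many leading blocks fit permanently in VRAM after reserving
// headroom (prefetch + compute buffer + safety margin). Blocks past the
// returned index keep being streamed/evicted; blocks before it stay resident.
std::size_t compute_resident_block_count(std::size_t free_vram,
                                         std::size_t headroom,
                                         const std::vector<std::size_t>& block_sizes) {
    std::size_t budget = free_vram > headroom ? free_vram - headroom : 0;
    std::size_t count  = 0;
    for (std::size_t s : block_sizes) {
        if (s > budget) break;  // first block that does not fit ends the prefix
        budget -= s;
        ++count;
    }
    return count;
}
```

Computed once on the first sampling step, the result is a fixed resident prefix; later steps start prefetching at that index, which is what makes the cache prefix free.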