feat: transition from compile-time to runtime backend discovery #1448
leejet merged 6 commits into leejet:master
Conversation
```cmake
set(GGML_SYCL ON)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-narrowing -fsycl")
add_definitions(-DSD_USE_SYCL)
# disable fast-math on host, see:
```
This part (not changed by the PR) is weird: it could make sense to set it for ggml, but for sd.cpp? If so, what will happen when we add support for GGML_BACKEND_DL?
```cpp
#define __STATIC_INLINE__ static inline
#endif

inline void ggml_backend_load_all_once() {
```
@Cyberhan123, why exactly does this need to be inline?
Also: do we really need the #if defined(GGML_BACKEND_DL) guard? As I understand it, for static-backend mode ggml_backend_device_count would return > 0 anyway?
To avoid conflicts with the ggml API, inlining is required.
The reason is that in a shared-library scenario, the host has to call ggml_backend_load_all() or ggml_backend_load_all_from_path externally. In that case the backend device count is already greater than 0, so the scan is skipped.
Therefore, a more reasonable solution is to call ggml_init_best in example/xxx, not in the library's init_best.
This approach directly exposes ggml_backend_t in the C API, which is actually better.
It's just a compromise I made to adapt to the previous C API chosen for setting the device via strings.
For a concrete example, please refer to this link: https://github.com/Cyberhan123/slab.rs/blob/main/crates/slab-ggml/src/lib.rs#L49
To explain: in DLL mode, the ggml backend libraries and the executable are not in the same directory, so ggml_backend_load_all_from_path is required.
Actually, I am not asking about the dynamic case: I'm asking why this block is skipped with #ifdef for the static one 🙂 Does anything break if we do this?
```diff
diff --git a/src/ggml_extend_backend.hpp b/src/ggml_extend_backend.hpp
index 19cf56f..1b7a083 100644
--- a/src/ggml_extend_backend.hpp
+++ b/src/ggml_extend_backend.hpp
@@ -7,33 +7,31 @@
 #include "ggml-backend.h"
 #include "ggml.h"

 #ifndef __STATIC_INLINE__
 #define __STATIC_INLINE__ static inline
 #endif

 inline void ggml_backend_load_all_once() {
-#if defined(GGML_BACKEND_DL)
     // If the host process already preloaded backends explicitly
     // (for example via ggml_backend_load / ggml_backend_load_all_from_path),
     // do not rescan the default paths again.
     if (ggml_backend_dev_count() > 0) {
         return;
     }
     // In dynamic-backend mode the backend modules are discovered at runtime,
     // so we must load them before asking for the CPU backend or its proc table.
     static std::once_flag once;
     std::call_once(once, []() {
         if (ggml_backend_dev_count() > 0) {
             return;
         }
         ggml_backend_load_all();
     });
-#endif
 }

 #if defined(GGML_BACKEND_DL)
```
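The skip-if-preloaded behavior this diff implements can be modeled in a standalone sketch. Everything here is mocked (the device count and the loader stand in for ggml_backend_dev_count and ggml_backend_load_all; this is not the real ggml API), but the control flow is the same: skip if the host already registered backends, otherwise scan exactly once across threads.

```cpp
#include <mutex>

static int g_dev_count  = 0;  // stands in for ggml_backend_dev_count()
static int g_scan_calls = 0;  // how many times the default-path scan ran

static void mock_load_all() {  // stands in for ggml_backend_load_all()
    ++g_scan_calls;
    g_dev_count = 2;           // pretend two backends were discovered
}

inline void backend_load_all_once() {
    // Host preloaded backends explicitly: do not rescan.
    if (g_dev_count > 0) {
        return;
    }
    static std::once_flag once;
    std::call_once(once, [] {
        // Re-check inside the once-guard: another thread may have raced us
        // past the first check before the scan ran.
        if (g_dev_count > 0) {
            return;
        }
        mock_load_all();
    });
}
```

Repeated calls after the first are no-ops, both via the fast-path count check and via the `std::once_flag`.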
There may be problems with it, because it is loaded through a shared library.
The reason I am asking is: if we ever have a situation where a file includes this with GGML_BACKEND_DL defined, and another one without GGML_BACKEND_DL, we'll hit an ODR violation. IMHO, that could be harder to debug than whatever issue comes from ggml_backend_dev_count() possibly returning 0 for a static-backend build (and looking at the ggml code, that would only happen with no backend enabled anyway).
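The macro-dependent inline body can be illustrated in a single file. Here `load_once_path` is a hypothetical stand-in (not the real header's function) whose body changes with `GGML_BACKEND_DL`; if two translation units compile it with the macro set differently, they carry two differing definitions of the same inline function, which is exactly the ODR violation described above.

```cpp
// The body of this inline function depends on a configuration macro.
// One TU built with -DGGML_BACKEND_DL and one built without it would
// each emit a different definition under the same name: ODR violation,
// undefined behavior, and typically no diagnostic from the linker.
#ifdef GGML_BACKEND_DL
inline int load_once_path() { return 1; }  // dynamic: scan default paths
#else
inline int load_once_path() { return 0; }  // static: scan compiled out
#endif
```

Keeping the function body identical regardless of `GGML_BACKEND_DL` removes this failure mode entirely.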
There are currently multiple PRs addressing similar changes around backend discovery and default selection. Among them, this PR has the most focused scope and does not mix in additional features, which makes it easier to review and merge safely. I also see that @stduhpf and @Cyberhan123 are already listed as co-authors here; thanks for taking care of that.

Given that, I'm going to start with this PR and get it merged first. For #1184 and #1368, since they include additional functionality beyond backend discovery/selection, I'd recommend splitting those parts into separate, focused PRs. This will make the review process smoother and help get the remaining changes merged incrementally.

Thanks everyone for the work on this!
Applies the following parts from leejet#1184 and leejet#1368:
- Introduce ggml_extend_backend.hpp for dynamic backend management.
- Convert backend-specific SD_USE_* preprocessor tests to runtime tests, propagating the backend handler when needed.

Additionally, to make this work with minimal changes:
- A new function sd_get_default_backend replaces the default backend selection in src/stable-diffusion.cpp and src/upscaler.cpp, preserving the SD_VK_DEVICE env var support.
- Clean up SD_USE_* defines from CMakeLists.txt (no other build changes).

This is not just a refactor, because it improves device selection a bit:
- Previously, Vulkan selected device 0 by default, but this was the wrong choice on my system, which has the iGPU as 0 and the discrete card as 1. The new selection algorithm correctly prioritizes the discrete GPU.
- The upscaler now follows SD_VK_DEVICE too.

Co-authored-by: Stéphane du Hamel <stephduh@live.fr>
Co-authored-by: Cyberhan123 <255542417@qq.com>
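The discrete-over-integrated preference can be sketched with mocked device types. `DevType` and `pick_default_device` are illustrative names, not the sd.cpp or ggml API; the point is the priority order: first discrete GPU, then integrated GPU, then whatever is at index 0.

```cpp
#include <cstddef>
#include <vector>

enum class DevType { CPU, IGPU, DGPU };

// Return the index of the preferred default device.
inline int pick_default_device(const std::vector<DevType>& devs) {
    for (std::size_t i = 0; i < devs.size(); ++i)
        if (devs[i] == DevType::DGPU) return static_cast<int>(i);  // discrete first
    for (std::size_t i = 0; i < devs.size(); ++i)
        if (devs[i] == DevType::IGPU) return static_cast<int>(i);  // then integrated
    return 0;                                                      // fall back to device 0
}
```

With the device list from the PR description (iGPU at 0, discrete card at 1), this picks index 1 instead of the old hard-coded 0.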
If one translation unit includes the header with GGML_BACKEND_DL defined, and another includes it without, we'll hit an ODR violation, which is undefined behavior. For the current static-backend build, `ggml_backend_dev_count() > 0` will skip the `ggml_backend_load_all` call anyway.
The `#include "ggml-cpu.h"` removal triggered a build error in util.cpp. To avoid this kind of compatibility issue, it is preferable to build the same source code regardless of the GGML_BACKEND_DL configuration. The problem is that util.cpp requires ggml-cpu.h, which cannot be included alongside ggml_extend_backend.hpp. But since this is only required for a single function, I am using an include-inside-namespace approach to move away the definitions, and keep the diff minimal.

Other ways to avoid this issue could be:
* removing sd_get_system_info entirely (which may not be a bad idea, as dynamic backends could have been built with completely different flags);
* moving the sd_get_system_info definition to a separate source file;
* putting ggml_extend_backend.hpp functions into their own namespace;
* renaming all ggml_extend_backend.hpp symbols to ggml_ext_*.
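The include-inside-namespace approach can be illustrated with a self-contained analog. All names here are hypothetical; in util.cpp the real trick wraps `#include "ggml-cpu.h"` in a namespace so its declarations no longer collide with the ones pulled in via ggml_extend_backend.hpp. Since a second header file can't be shown in a single fence, both "header" bodies are inlined directly:

```cpp
#include <cstring>

// Analog of the clash: both "headers" want to define sd_cpu_info().
// In the same translation unit, two definitions at global scope would
// be a redefinition error.

// Definitions as if pulled in from ggml_extend_backend.hpp:
inline const char* sd_cpu_info() { return "from ggml_extend_backend.hpp"; }

// Definitions as if pulled in via `namespace cpu_detail { #include ... }`:
// the namespace gives every included symbol a distinct qualified name.
namespace cpu_detail {
inline const char* sd_cpu_info() { return "from ggml-cpu.h"; }
}
```

The single function that needs the ggml-cpu.h symbols then calls them as `cpu_detail::...`, and the rest of the file is untouched.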
acba5a9 to 0fd3c74
…split-spec

Adds the LTX-2-specific overrides on top of the auto-fit lazy-load and split-spec machinery that lives on backend-fit:
- LTXAVEmbedder::set_llm_lazy_load forwards to the inner Gemma LLM runner so the planner's "lazy load if init-time SUM exceeds device cap" path can defer Gemma's params alloc until the first encode.
- LTXAVModel::set_lazy_load forwards to the LTXAVRunner.
- stable-diffusion.cpp's LTX-2 branch primes the cond/DiT split specs (prime_cond_spec / prime_dit_spec) before constructing the runners, so layer-split / row-split can apply.
- LTXAVModel now uses diffusion_backend (per-component) rather than the legacy single backend handle.

Also restores the 4-arg patch_weight signature in src/llm.hpp; the ltx2.3 PR predated the runtime-backend-discovery PR (leejet#1448), which changed patch_weight's signature.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The TRUE per-layer streaming path was unconditionally evicting every block back to CPU after each forward pass, even when there was plenty of free VRAM left. For an 8-step generation that re-streams the entire model 7 extra times.

Decide once, on the first sampling step, how many leading blocks fit permanently in VRAM (after subtracting prefetch headroom + compute buffer + safety margin) and skip the eviction for those indices. Later steps' prime_prefetch starts at the first non-resident block, so the cache prefix is hit for free. The pattern follows ComfyUI's ModelPatcher.partially_load(): a static partition is simpler and cheaper than dynamic eviction for the cyclic-sequential access pattern of diffusion sampling.

Also fix MemoryBudgetManager::query_device_memory(): the SD_USE_CUDA guard was dead code after PR leejet#1448 switched to runtime backend discovery, so every build was returning the hardcoded 8 GB / 4 GB fallback regardless of the real GPU. Use ggml_backend_dev_memory() instead; it works for CUDA, Vulkan, and Metal.

For ZImage 8 steps at 688x1024 on RTX 3060 12 GB:
- before: 7.21s/step steady, 57.80s sampling
- after: 4.45s/step steady, 39.64s sampling (1.46x)

The same caching helper (compute_resident_block_count) is added to LayerExecutionEngine and applied to z_image, flux (double + single), mmdit, anima, qwen_image. UNet (skip connections) and WAN (no per-layer streaming yet) are unchanged.
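The static-partition decision described above can be sketched as a greedy prefix fit. The signature and names here are illustrative (the real compute_resident_block_count lives in the PR's LayerExecutionEngine): subtract the headroom reservations from free VRAM, then count how many leading blocks fit into what remains.

```cpp
#include <cstddef>
#include <vector>

// Count how many leading blocks fit permanently in VRAM after reserving
// headroom (prefetch + compute buffer + safety margin). Blocks past the
// returned index keep being streamed/evicted; blocks before it stay resident.
std::size_t compute_resident_block_count(std::size_t free_vram,
                                         std::size_t headroom,
                                         const std::vector<std::size_t>& block_sizes) {
    std::size_t budget = free_vram > headroom ? free_vram - headroom : 0;
    std::size_t count  = 0;
    for (std::size_t s : block_sizes) {
        if (s > budget) break;  // first block that does not fit ends the prefix
        budget -= s;
        ++count;
    }
    return count;
}
```

Computed once on the first sampling step, the result is a fixed resident prefix; later steps start prefetching at that index, which is what makes the cache prefix free.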