Name and Version
"llama-b7761-bin-win-vulkan-x64"
C:\llama-vulkan-llm>llama-cli.exe --version
load_backend: loaded RPC backend from C:\llama-vulkan-llm\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) 140V GPU (16GB) (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from C:\llama-vulkan-llm\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\llama-vulkan-llm\ggml-cpu-alderlake.dll
version: 7761 (a89002f)
built with Clang 19.1.5 for Windows x86_64
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-cli
Command line
VULKAN BACKEND (version: 7761 (a89002f07), built with Clang 19.1.5 for Windows x86_64):
llama-cli.exe -m "D:\LLM-Images\Qwen3\gguf\Qwen3_30B_A3B_Q4_K_M.gguf" -ngl 45 -c 4096 -b 20 -ub 8 -t 8 -fa off --no-mmap --no-kv-offload
llama-cli.exe -m "D:\LLM-Images\Qwen3\gguf\Qwen3_30B_A3B_Q4_K_M.gguf" -ngl 45 -c 4096 -b 20 -ub 8 -t 8 -fa off
llama-cli.exe -m "D:\LLM-Images\Qwen3\gguf\Qwen3_30B_A3B_Q4_K_M.gguf" -ngl 45 -c 4096 -b 20 -ub 8 -t 8
llama-cli.exe -m "D:\LLM-Images\Qwen3\gguf\Qwen3_30B_A3B_Q4_K_M.gguf" -ngl 45 -c 1024 -b 20 -ub 8 -t 8
llama-cli.exe -m "D:\LLM-Images\Qwen3\gguf\Qwen3_30B_A3B_Q4_K_M.gguf" -ngl 10 -c 1024 -b 1 -ub 1 -t 4
llama-cli.exe -m "D:\LLM-Images\Qwen3\gguf\Qwen3_30B_A3B_Q4_K_M.gguf" -ngl 45 -c 1024 -b 1 -ub 1 -t 4
llama-cli.exe -m "D:\LLM-Images\Qwen3\gguf\Qwen3_30B_A3B_Q4_K_M.gguf" -ngl 20 -c 1024 -b 1 -ub 1 -t 4
llama-cli.exe -m "D:\LLM-Images\Qwen3\gguf\Qwen3_30B_A3B_Q4_K_M.gguf" -ngl 10 -c 1024 -b 1 -ub 1 -t 4
llama-cli.exe -m "D:\LLM-Images\Qwen3\gguf\Qwen3_30B_A3B_Q4_K_M.gguf" -ngl 15 -c 1024 -b 1 -ub 1 -t 4
llama-cli.exe -m "D:\LLM-Images\Qwen3\gguf\Qwen3_30B_A3B_Q4_K_M.gguf" -ngl 10 -c 1024 -b 10 -ub 4 -t 8 -fa off
SYCL BACKEND (version: 7770 (fe44d3557), built with Clang 19.1.5 for Windows x86_64):
llama-cli.exe -m "D:\LLM-Images\Qwen3\gguf\Qwen3_30B_A3B_Q4_K_M.gguf" -ngl 45 -n 4096 -c 8192 -b 28 -t 4
llama-cli.exe -m "D:\LLM-Images\Qwen3\gguf\Qwen3_30B_A3B_Q4_K_M.gguf" -ngl 20 -n 512 -c 1024 -b 14 -t 4
llama-cli.exe -m "D:\LLM-Images\Qwen3\gguf\Qwen3_30B_A3B_Q4_K_M.gguf" -ngl 20 -n 512 -c 1024 -b 1 -t 4
llama-cli.exe -m "D:\LLM-Images\Qwen3\gguf\Qwen3_30B_A3B_Q4_K_M.gguf" -ngl 10 -n 512 -c 1024 -b 1 -t 4
llama-cli.exe -m "D:\LLM-Images\Qwen3\gguf\Qwen3_30B_A3B_Q4_K_M.gguf" -ngl 0 -n 512 -c 1024 -b 1 -t 4
Problem description & steps to reproduce
COORDINATED BUG REPORT: Critical Incompatibilities Between Intel Lunar Lake APU, Vulkan Driver, and llama.cpp Vulkan Backend
Subject: Systematic instability, performance regression, and memory management corruption during LLM inference on Intel Core Ultra 7 258V (Lunar Lake) with integrated Arc 140V GPU.
Target Audiences: Intel Graphics Driver Team, Khronos Vulkan Working Group / Intel Vulkan Implementation, llama.cpp developer team (ggml-vulkan).
Date: 19.01.2026
Note: This report was created in collaboration with DeepSeek AI.
Affected Hardware: MSI Claw 8 AI with Intel Core Ultra 7 258V processor
Software: Primary - LLAMA.CPP (Vulkan Backend), Secondary - LLAMA.CPP (SYCL Backend)
-
Summary and Critical Priority
This report documents a cascade of deep-seated, mutually reinforcing software bugs that prevent stable, performant operation of modern Large Language Models (LLMs) on the new Intel Lunar Lake APU architecture (Unified Memory Architecture, UMA) with the Vulkan backend of llama.cpp. The bugs force the user to choose between two operational states, both of which demand significant compromises:
Mode A (Manual Memory Override): Stability is purchased at the cost of massive performance degradation (~50% reduction) and non-deterministic crashes with certain parameters.
Mode B (Dynamic Auto Mode): Near-total failure of GPU acceleration (OutOfDeviceMemory) or, when it works at all, absurd memory-accounting errors (64-bit wraparounds) coupled with unusable performance.
Together, these constitute a severe functional impairment of a leading-edge AI platform and require coordinated effort from all three parties to resolve.
-
Complete System Configuration
Hardware: Laptop with Intel Core Ultra 7 258V processor (Lunar Lake). Integrated Intel Arc 140V GPU. 32 GB LPDDR5X system RAM (Unified Memory). BIOS: E1T52IMS.112
Operating System: Microsoft Windows 11 Home, Version 10.0.26200 (Build 26200).
Graphics Driver: Intel Graphics Driver, Version 32.0.101.8331.
Application Software: llama.cpp (command-line binary llama-cli.exe), Version b7761 (Commit a89002f), compiled with Vulkan support.
Test Model: "Qwen3_30B_A3B_Q4_K_M.gguf" (Mixture-of-Experts), GGUF format, 4-bit quantization.
-
Detailed Bug Description and Isolation
Bug 1 (Intel Driver / Vulkan Implementation): Flash-Attention pipeline creation failure in Auto Mode.
Symptom: vk::Device::createComputePipeline: ErrorOutOfDeviceMemory when attempting to load the model with -fa auto (or default).
Environment: Dynamic memory mode ("Shared GPU Memory Override" disabled).
Consequence: Flash Attention, a critical performance feature, cannot be used on the GPU at all. Workaround: force -fa off.
Bug 2 (Intel Driver / Vulkan Implementation): Insufficient Device Memory allocation in Auto Mode.
Symptom: vk::CommandBuffer::begin: ErrorOutOfDeviceMemory or ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory when loading the model, even with -fa off.
Environment: Dynamic mode. Occurs even with moderate GPU offloading (-ngl 20) of a 30B model.
Consequence: GPU acceleration of realistic LLMs in Auto Mode is impossible. The driver apparently cannot allocate sufficient contiguous Device-Local memory from the UMA pool to meet llama.cpp's requirements.
Bug 3 (llama.cpp Vulkan Backend & Driver Interaction): Catastrophic memory accounting and buffer size errors.
Symptom 1: On context shutdown, a warning that the Vulkan0 compute buffer size of X MiB does not match the expectation of Y MiB, where X is typically more than 30 times larger than Y.
Symptom 2: Absurd values in the memory breakdown, e.g., free: 17592186043963 MiB. That figure is just 453 MiB short of 2^44 MiB (= 2^64 bytes, ~16 EiB), i.e. clearly an unsigned 64-bit underflow/wraparound rather than a real quantity.
Environment: Occurs in both modes (Auto and Override), particularly pronounced with large contexts (-c 4096).
Consequence: Points to severe logical errors in the calculation, allocation, or deallocation of GPU buffers in ggml-vulkan.cpp, and thus to a memory-leak or corruption risk.
Bug 4 (Interaction Bug Driver/Backend): Destabilizing combination of manual override and batch processing.
Symptom: Cumulative memory corruption after N requests, leading to generation of only "?" or complete crash.
Environment: Exclusively in manual override mode (e.g., 57% / 24 GB reserved). Directly correlated with the logical (-b) and physical (-ub) batch sizes.
Consequence: Forces conservative, performance-limiting parameters (-b 16 -ub 8) to achieve stability.
-
Reproduction Instructions
Provision a system with the above configuration.
For Bugs 1 & 2: Disable "Shared GPU Memory Override", restart. Attempt to load model with: llama-cli.exe -m <30B_MoE_Model> -ngl 20 -fa off.
For Bug 3: Perform any successful load, then exit llama-cli. Check the memory breakdown in the log for the buffer-size mismatch warning and the absurd free value described above.
For Bug 4: Set "Shared GPU Memory Override" to 57% (24 GB) and restart. Load the model with unstable parameters: llama-cli.exe -m <30B_MoE_Model> -ngl 45 -c 2048 -b 28 -ub 14 -t 8 -fa off --no-mmap. Wait for a crash or corrupted output within the first 10 requests.
-
Workarounds and Current Mitigation
The only functional, albeit suboptimal, state requires:
Activation of "Shared GPU Memory Override" (~24GB).
Use of severely restricted llama.cpp parameters: -ngl 45 -c 1024 -b 20 -ub 8 -t 6 -fa off --no-mmap.
Resulting Performance: ~21 tokens/s (prompt processing), ~25 tokens/s (generation).
-
Specific Requests to the Respective Teams
To the Intel Graphics Driver Team:
Priority 1: Resolution of Bugs 1 and 2. The Vulkan driver must be capable of reliably creating compute pipelines for Flash-Attention and allocating sufficient Device-Local memory for demanding LLM workloads in dynamic UMA mode (Auto).
Priority 2: Investigation of the interaction with llama.cpp in manual override mode (Bug 4). Why does the combination of a large, static VRAM reservation and intensive, repeated allocation/deallocation by llama.cpp lead to unstable memory management?
To the Vulkan Implementers (Intel) / Khronos Group:
Review of Vulkan runtime compliance in the UMA scenario, particularly concerning intensive use of VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT on system memory.
To the llama.cpp Development Team (ggml-vulkan maintainers):
Priority 1: Investigation and resolution of the critical memory-accounting bug (Bug 3). The discrepancy between expected and actual buffer sizes, together with the 64-bit wraparound in the reported free memory, indicates serious defects in buffer bookkeeping.
Priority 2: Implementation of more robust fallback mechanisms for driver errors (automatic fallback to -fa off, more conservative default parameters for Intel iGPUs); a sketch of such a fallback follows this list.
Priority 3: Improvement of interaction with manually configured UMA memory, potentially through an adaptive batch size logic.
-
Conclusion
The Intel Lunar Lake APU represents a promising platform for local AI inference. However, the current interplay between the driver, Vulkan API, and the popular llama.cpp software is fundamentally impaired. The documented bugs force users to accept significant performance degradation and instability. A coordinated resolution by the aforementioned teams is urgently required to unlock the potential of this hardware and ensure a positive user experience.
Attachments (Text files): MSI-CLAW-8-AI-Win11-System-Information.txt, llama--version.txt, CMD-VULKAN--log01.txt, CMD-SYCL--log01.txt, LM-Studio-0_3_39-System-Information.txt, LM-Studio-0_3_39-ERROR-LOG01-ngl34.txt, LM-Studio-0_3_39-LOG02-ngl20.txt
MSI-CLAW-8-AI-Win11-System-Information.txt
CMD-VULKAN--log01.txt
CMD-SYCL--log01.txt
LM-Studio-0_3_39-System-Information.txt
LM-Studio-0_3_39-ERROR-LOG01-ngl34.txt
LM-Studio-0_3_39-LOG02-ngl20.txt