Hello,

If I load the main model and the draft model onto the same GPU (for example GPU.0), the problem does not arise. If I load the main model onto GPU.0 and the draft model onto GPU.1, the error below appears.
Linux xpu 6.19.3-061903-generic #202602191659 SMP PREEMPT_DYNAMIC Sat Feb 21 08:17:10 UTC 2026 x86_64 x86_64 x86_64 GNU/Linux
Config:

"Qwen3-14B-int4-ov-spec": {
"model_name": "Qwen3-14B-int4-ov-spec",
"model_path": "/mnt/data2/models/OpenVINO/Qwen3-14B-int4-ov",
"device": "GPU.1",
"model_type": "llm",
"engine": "ovgenai",
"draft_model_path": "/mnt/data2/models/OpenVINO/Qwen3-0.6B-int4-ov",
"draft_device": "GPU.2",
"num_assistant_tokens": 7,
"runtime_config": {
"PERFORMANCE_HINT": "LATENCY"
}
},
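For reference, the setup this config describes maps onto the openvino-genai speculative-decoding API roughly as follows. This is a minimal sketch, assuming the `draft_model()` helper and the `LLMPipeline(..., draft_model=...)` signature from openvino-genai; the paths, devices, and `num_assistant_tokens` value are taken from the config above:

```python
def build_spec_decode_pipeline(model_path, device, draft_path, draft_device,
                               num_assistant_tokens=7):
    """Build an LLMPipeline with a draft model on a (possibly different) device."""
    # Deferred import: requires the openvino-genai package and a GPU runtime.
    import openvino_genai as ov_genai

    # The draft model is compiled for its own target device (e.g. "GPU.2"),
    # which may differ from the main model's device (e.g. "GPU.1").
    draft = ov_genai.draft_model(draft_path, draft_device)
    pipe = ov_genai.LLMPipeline(model_path, device, draft_model=draft)

    # Enabling speculative decoding: request N tokens per assistant step.
    config = ov_genai.GenerationConfig()
    config.num_assistant_tokens = num_assistant_tokens
    return pipe, config


# Example call with the values from the config above (hypothetical paths):
# pipe, config = build_spec_decode_pipeline(
#     "/mnt/data2/models/OpenVINO/Qwen3-14B-int4-ov", "GPU.1",
#     "/mnt/data2/models/OpenVINO/Qwen3-0.6B-int4-ov", "GPU.2")
```

In my testing, the `RuntimeError` below only occurs when `device` and `draft_device` point at different GPUs; with both on the same GPU the pipeline builds and generates normally.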
OpenArc server log:

2026-02-22 12:41:58,202 - ERROR - [DEBUG] draft_model_loaded: True
2026-02-22 12:41:58,203 - ERROR - [DEBUG] self.model_num_assistant_tokens: 3
2026-02-22 12:41:58,203 - ERROR - [DEBUG] generation_kwargs.num_assistant_tokens: 3
2026-02-22 12:41:58,203 - ERROR - [DEBUG] generation_kwargs.assistant_confidence_threshold: 0.0
2026-02-22 12:42:17,029 - INFO - [LLM Worker: Qwen3-14B-int4-ov-spec] Metrics: {'load_time (s)': 28.29, 'ttft (s)': 0.37, 'tpot (ms)': 54.28816, 'prefill_throughput (tokens/s)': 2000.81, 'decode_throughput (tokens/s)': 18.42022, 'decode_duration (s)': 18.82504, 'input_token': 731, 'new_token': 341, 'total_token': 1072, 'stream': True, 'stream_chunk_tokens': 1}
2026-02-22 12:42:17,758 - INFO - Request received: POST /v1/chat/completions from 127.0.0.1
2026-02-22 12:42:17,765 - INFO - "Qwen3-8B-int4-ov" request received
2026-02-22 12:42:17,766 - INFO - Request completed: POST /v1/chat/completions status=400 duration=0.007s
2026-02-22 12:42:33,721 - INFO - Request received: POST /openarc/unload from 127.0.0.1
2026-02-22 12:42:34,434 - INFO - [Qwen3-14B-int4-ov-spec] unloaded successfully
2026-02-22 12:42:34,435 - INFO - Request completed: POST /openarc/unload status=200 duration=0.714s
2026-02-22 12:42:41,835 - INFO - Request received: POST /openarc/load from 127.0.0.1
2026-02-22 12:42:41,837 - INFO - Qwen3-14B-int4-ov-spec loading...
2026-02-22 12:42:41,837 - INFO - ModelType.LLM on GPU.1 with {}
2026-02-22 12:42:42,245 - INFO - Loaded draft model from /mnt/data2/models/OpenVINO/Qwen3-0.6B-int4-ov on GPU.2
2026-02-22 12:43:09,562 - ERROR - Model loading failed for Qwen3-14B-int4-ov-spec
Traceback (most recent call last):
File "/home/arc/OpenArc/src/server/model_registry.py", line 145, in _load_task
model_instance = await create_model_instance(load_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/arc/OpenArc/src/server/model_registry.py", line 254, in create_model_instance
await asyncio.to_thread(model_instance.load_model, load_config)
File "/usr/local/lib/python3.11/asyncio/threads.py", line 25, in to_thread
return await loop.run_in_executor(None, func_call)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/arc/OpenArc/src/engine/ov_genai/llm.py", line 306, in load_model
self.model = LLMPipeline(
^^^^^^^^^^^^
RuntimeError: Exception from src/inference/src/cpp/core.cpp:110:
Exception from src/inference/src/dev/plugin.cpp:54:
Check 'false' failed at src/plugins/intel_gpu/src/plugin/program_builder.cpp:163:
[GPU] ProgramBuilder build failed!
Exception from src/plugins/intel_gpu/src/runtime/ocl/ocl_common.hpp:40:
[GPU] clEnqueueNDRangeKernel, error code: -52 CL_INVALID_KERNEL_ARGS
2026-02-22 12:43:09,669 - INFO - Request completed: POST /openarc/load status=500 duration=27.834s
uv pip list output:

(openarc) arc@xpu:~/OpenArc$ uv pip list
Package Version Editable project location
-------------------------- ---------------------- -------------------------
about-time 4.2.1
addict 2.4.0
aiohappyeyeballs 2.6.1
aiohttp 3.12.14
aiosignal 1.4.0
alive-progress 3.2.0
annotated-types 0.7.0
anyio 4.9.0
asttokens 3.0.0
attrs 25.3.0
audioread 3.0.1
autograd 1.8.0
babel 2.17.0
blis 1.3.0
brotli 1.1.0
catalogue 2.0.10
certifi 2025.7.14
cffi 2.0.0
charset-normalizer 3.4.2
click 8.2.1
cloudpathlib 0.22.0
cma 4.2.0
colorama 0.4.6
comm 0.2.3
confection 0.1.5
contourpy 1.3.2
cryptography 46.0.3
csvw 3.6.0
curated-tokenizers 0.0.9
curated-transformers 0.1.1
cycler 0.12.1
cymem 2.0.11
datasets 4.0.0
ddgs 9.6.1
debugpy 1.8.17
decorator 5.2.1
deprecated 1.2.18
dill 0.3.8
distro 1.9.0
dlinfo 2.0.0
docopt 0.6.2
espeakng-loader 0.2.4
evdev 1.9.2
executing 2.2.1
fastapi 0.116.1
filelock 3.18.0
fonttools 4.58.5
frozenlist 1.7.0
fsspec 2025.3.0
grapheme 0.6.0
griffe 1.14.0
h11 0.16.0
h2 4.3.0
hf-xet 1.1.5
hpack 4.1.0
httpcore 1.0.9
httpx 0.28.1
httpx-sse 0.4.3
huggingface-hub 0.33.4
hyperframe 6.1.0
idna 3.10
iniconfig 2.3.0
inquirerpy 0.3.4
ipykernel 7.0.1
ipython 9.6.0
ipython-pygments-lexers 1.1.1
ipywidgets 8.1.7
isodate 0.7.2
jedi 0.19.2
jinja2 3.1.6
jiter 0.11.0
joblib 1.5.1
jsonschema 4.24.0
jsonschema-specifications 2025.4.1
jupyter-client 8.6.3
jupyter-core 5.9.1
jupyterlab-widgets 3.0.15
kiwisolver 1.4.8
kokoro 0.9.4
langcodes 3.5.0
language-data 1.3.0
language-tags 1.2.0
lazy-loader 0.4
librosa 0.11.0
llvmlite 0.45.0
loguru 0.7.3
lxml 6.0.2
marisa-trie 1.3.1
markdown-it-py 3.0.0
markupsafe 3.0.2
matplotlib 3.10.3
matplotlib-inline 0.1.7
mcp 1.20.0
mdurl 0.1.2
misaki 0.9.4
mpmath 1.3.0
msgpack 1.1.1
multidict 6.6.3
multiprocess 0.70.16
murmurhash 1.0.13
natsort 8.4.0
nest-asyncio 1.6.0
networkx 3.4.2
ninja 1.11.1.4
nncf 2.17.0
num2words 0.5.14
numba 0.62.0
numpy 2.2.6
onnx 1.18.0
openai 2.2.0
openai-agents 0.4.2
openarc 2.0 /home/arc/OpenArc
openvino 2026.1.0.dev20260221
openvino-genai 2026.1.0.0.dev20260221
openvino-telemetry 2025.2.0
openvino-tokenizers 2026.1.0.0.dev20260221
optimum 1.27.0
optimum-intel 1.25.2
packaging 25.0
pandas 2.2.3
parso 0.8.5
pexpect 4.9.0
pfzy 0.3.4
phonemizer-fork 3.3.2
pillow 11.3.0
pip 25.2
platformdirs 4.4.0
pluggy 1.6.0
pooch 1.8.2
preshed 3.0.10
primp 0.15.0
prompt-toolkit 3.0.52
propcache 0.3.2
protobuf 6.31.1
psutil 7.0.0
ptyprocess 0.7.0
pure-eval 0.2.3
pyarrow 20.0.0
pycparser 2.23
pydantic 2.11.7
pydantic-core 2.33.2
pydantic-settings 2.11.0
pydot 3.0.4
pygments 2.19.2
pyjwt 2.10.1
pymoo 0.6.1.5
pynput 1.8.1
pyparsing 3.2.3
pytest 8.4.2
python-dateutil 2.9.0.post0
python-dotenv 1.2.1
python-multipart 0.0.20
python-xlib 0.33
pytz 2025.2
pyyaml 6.0.2
pyzmq 27.1.0
rdflib 7.2.1
referencing 0.36.2
regex 2024.11.6
requests 2.32.4
rfc3986 1.5.0
rich 14.0.0
rich-click 1.8.9
rpds-py 0.26.0
safetensors 0.5.3
scikit-learn 1.7.0
scipy 1.16.0
segments 2.3.0
setuptools 80.9.0
shellingham 1.5.4
six 1.17.0
smart-open 7.3.1
smolagents 1.22.0
sniffio 1.3.1
socksio 1.0.0
sounddevice 0.5.2
soundfile 0.13.1
soxr 1.0.0
spacy 3.8.7
spacy-curated-transformers 0.3.1
spacy-legacy 3.0.12
spacy-loggers 1.0.5
srsly 2.5.1
sse-starlette 3.0.3
stack-data 0.6.3
starlette 0.47.1
sympy 1.14.0
tabulate 0.9.0
termcolor 3.1.0
thinc 8.3.6
threadpoolctl 3.6.0
tokenizers 0.21.2
torch 2.8.0+cpu
torchvision 0.23.0+cpu
tornado 6.5.2
tqdm 4.67.1
traitlets 5.14.3
transformers 4.52.4
typer 0.19.2
types-requests 2.32.4.20250913
typing-extensions 4.14.1
typing-inspection 0.4.1
tzdata 2025.2
uritemplate 4.2.0
urllib3 2.5.0
uvicorn 0.35.0
wasabi 1.1.3
wcwidth 0.2.14
weasel 0.4.1
widgetsnbextension 4.0.14
wrapt 1.17.2
xxhash 3.5.0
yarl 1.20.1