When offloading to the iGPU (UHD 770) in a Docker container from https://github.com/mudler/LocalAI (b2128), llama.cpp crashes with the following error:
The number of work-items in each dimension of a work-group cannot exceed {512, 512, 512} for this device -54 (PI_ERROR_INVALID_WORK_GROUP_SIZE)
Exception caught at file:/build/backend/cpp/llama/llama.cpp/ggml-sycl.cpp, line:12708
From trial and error, the crash only occurs when the number of predicted tokens exceeds 256; if I cap the prediction at 256 tokens, it does not happen.
Tested with multiple 7B Mistral models, with both Q6 and Q8 quantization.
Intel oneAPI version 2024.0
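For reference, this is a sketch of the workaround I am using: capping generation at 256 tokens per request via LocalAI's OpenAI-compatible API. The endpoint path, port, and model name below are examples and will differ per setup; the relevant part is the "max_tokens" field.

```shell
# Build the request payload; "max_tokens": 256 keeps the prediction
# under the threshold that triggers the SYCL work-group-size crash.
# Model name and prompt are placeholders.
PAYLOAD='{
  "model": "mistral-7b-q6",
  "prompt": "Hello",
  "max_tokens": 256
}'
echo "$PAYLOAD"

# Sent to the running LocalAI container, e.g. (adjust host/port):
# curl http://localhost:8080/v1/completions \
#   -H "Content-Type: application/json" \
#   -d "$PAYLOAD"
```

With this cap in place the crash does not reproduce; anything above 256 predicted tokens triggers it.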