Note: This issue was copied from ggml-org#16012
Original Author: @samteezy
Original Issue Number: ggml-org#16012
Created: 2025-09-15T13:55:42Z
Feature Description
After running into ggml-org#15804 and being directed to PR ggml-org#14236, I suggest adding --mmproj-device as an argument, to be consistent with how device selection works elsewhere in llama.cpp.
This should be available in both llama-cli and llama-server.
Motivation
Users may want to specify different devices for different configs, and managing an env var is not ideal, nor is it consistent with how similar features are already set up within llama.cpp. In my setup, my more powerful, newer GPU is recognized as Vulkan1 and an older one as Vulkan0.
I should add that I'm actually having trouble getting the existing environment variable MTMD_BACKEND_DEVICE to work. Whether I set it to Vulkan1 or 1, llama-server still uses the default Vulkan0. As a result, vision inference tk/s performance takes a dive, since Vulkan0 is a bottleneck.
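To illustrate the request, here is a sketch of the two invocation styles: the current env-var approach mentioned above, and the proposed flag. Model and mmproj file names are placeholders, and the --mmproj-device flag is the proposal, not an existing option:

```shell
# Current approach: device for the multimodal projector is picked via an env var
MTMD_BACKEND_DEVICE=Vulkan1 llama-server -m model.gguf --mmproj mmproj.gguf

# Proposed: an explicit argument, consistent with existing device-selection flags
llama-server -m model.gguf --mmproj mmproj.gguf --mmproj-device Vulkan1
```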
Possible Implementation
No response