webui: Agentic Loop + MCP Client with support for Tools, Resources and Prompts #18655
allozaur merged 374 commits into ggml-org:master from
Conversation
Thank you for the architectural unification! The SearchableDropdownMenu refactor is superb, and we're making good progress! The only remaining items/features for an MVP (and testing on my side):
tools/server/webui/docs/architecture/high-level-architecture-simplified.md
I'm interested in this, so I gave the PR a look and wanted to ask: are local MCP servers planned to be supported? Right now it looks like a URL is required, with no local "command"/"args"/"env" type option (for Node, NPX/UVX, Docker, etc.). It might be possible to work around this with an MCP proxy server, but built-in support for local servers, as in many MCP clients, would be welcome. For example, Cursor, VS Code, OpenCode, Roo Code, Antigravity, LM Studio, and others support the following with small variations: lots of examples here. I know it's still WIP, but I just wanted to ask. Or maybe I've missed it?
This PR is browser-side (Svelte) -> TCP (streamable-http / SSE / WebSocket), so there is no stdio support in the browser. EDIT: And once we have the MCP client in the browser, nothing prevents a small example script in Python or Node.js from relaying MCP to stdio :)
Hey! We are introducing a solid basis for MCP support in llama.cpp, starting with a pure WebUI implementation. We will add further enhancements in the near future ;)
Thanks folks, that makes sense. I'll make a small proxy script then; I just wanted to make sure I wasn't overlooking a component.
I think I can whip something up, but I wouldn't say no to a reference. If the MCP button appears in the WebUI as is, this sort of question is bound to come up, so a small example proxy script in the docs might not be a bad idea. I appreciate the work being done on this and the other WebUI PRs.
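As a rough illustration of what such a relay could look like (a sketch only, not anything this PR ships: the endpoint URL, the `--relay` flag, and the one-JSON-body-per-request framing are all assumptions; a real bridge would also handle SSE streams and session headers), a stdio-to-HTTP proxy mostly just forwards newline-delimited JSON-RPC messages in both directions:

```typescript
// Hypothetical stdio <-> HTTP MCP relay sketch (Node 18+, built-in fetch).
// Assumption: the server accepts one JSON-RPC 2.0 message per POST and
// replies with a single JSON body (SSE streaming is not handled here).
import * as readline from "node:readline";

// Validate one newline-delimited JSON-RPC 2.0 message read from stdin.
function parseJsonRpc(line: string): Record<string, unknown> | null {
  try {
    const msg = JSON.parse(line);
    if (msg && typeof msg === "object" && msg.jsonrpc === "2.0") {
      return msg as Record<string, unknown>;
    }
  } catch {
    // not valid JSON: fall through and reject
  }
  return null;
}

// Forward one message to the HTTP endpoint and return the reply body.
async function relay(url: string, msg: Record<string, unknown>): Promise<string> {
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(msg),
  });
  return res.text();
}

// Only start the relay loop when invoked as `... --relay <url>`, so that
// loading this file for testing has no side effects.
const relayIdx = process.argv.indexOf("--relay");
if (relayIdx !== -1 && process.argv[relayIdx + 1]) {
  const endpoint = process.argv[relayIdx + 1];
  const rl = readline.createInterface({ input: process.stdin });
  rl.on("line", async (line) => {
    const msg = parseJsonRpc(line);
    if (msg) process.stdout.write((await relay(endpoint, msg)) + "\n");
  });
}
```

A stdio MCP client would then be pointed at this script as its "command", while the script forwards traffic to the webui-reachable HTTP endpoint.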
1c7048d to b11b32e
Since this works, all that remains is to refactor the CoT with the new pre-rendering format (client-specific tags <<< ... >>>) to have complete control over the context and to send the "interleaving" back to the server during an agentic loop. This will answer several questions from llama.cpp users and developers about powerful models like Minimax. And for Qwen, it will finally provide visibility into the CoT during the agentic loop!
Testing interleaved reasoning blocks and tool calls. I need to remove "Filter reasoning after first turn", which is useless now. InterleavedThinkingBlockAndImage.mp4
Obviously, with the refactoring of the CoT display, for now all the reasoning is sent back to the model with our proprietary UI tags included! A simple strip that reuses the regular expressions before sending to the API will restore the previous behavior. Later, we'll need an option to choose whether to send this reasoning back to the backend, preserving the actual location of each reasoning block! For the MCP tool responses, the model has access to everything during the agentic loop, but once a new loop is started, the previous loops are "compressed" to the last N lines set in the option, just like the display. This invalidates the token cache, so the backend has to rebuild the last modified agentic loop. This optimizes the context and doesn't degrade agentic performance, because the LLM is supposed to have already performed its synthesis.
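A minimal sketch of that strip step, assuming the client-specific markers are the literal `<<< ... >>>` tags mentioned above (the exact tag syntax and the function names are illustrative, not the PR's actual code):

```typescript
// Illustrative strip of client-side display tags before sending history
// back to the API. Assumes reasoning blocks are wrapped in literal
// "<<<" ... ">>>" markers, as described above; real tag syntax may differ.
function stripDisplayTags(text: string): string {
  // Non-greedy match so each interleaved block is removed separately.
  return text.replace(/<<<[\s\S]*?>>>/g, "").trim();
}

// Variant for the future option: keep the reasoning content in place at
// its actual location, dropping only the marker tags themselves.
function stripMarkersOnly(text: string): string {
  return text.replace(/<<<|>>>/g, "");
}
```

`stripDisplayTags` restores the old behavior (no reasoning resent); `stripMarkersOnly` corresponds to the proposed option of resending reasoning while preserving block positions.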
A little fun with image generation through MCP (a server separate from the LLM server, dedicated to image generation; nothing prevents having both in one if there is enough VRAM for both inference instances). MCP-Image-Gen2.mp4
The CI fails because we simplified the code path so that the modality type is detected post-upload (better) instead of pre-upload (file picker, too limiting and incompatible). The Storybook tests still check the old UX, where the Images/Audio buttons were disabled based on model modalities, but that logic was removed from ChatForm in the refactor. Now the file picker accepts everything and validation happens client-side after upload, with a text fallback, so the tests are looking for DOM elements that no longer exist. The modality props became orphaned: ChatFormActions still receives them, but ChatForm no longer computes or passes them. We need to drop these obsolete UI tests and keep the actual modality validation logic in unit tests, where it belongs.
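The post-upload validation described here could look roughly like this (a sketch with hypothetical names and shapes, not the actual ChatForm code): accept any file, then degrade unsupported media to its text fallback after upload instead of disabling buttons up front.

```typescript
// Hypothetical post-upload modality check with text fallback.
// `Modalities` mirrors the idea that the server reports what the loaded
// model supports; all names here are illustrative, not the real API.
interface Modalities {
  vision: boolean;
  audio: boolean;
}

type Attachment =
  | { kind: "image"; dataUrl: string; textFallback?: string }
  | { kind: "audio"; dataUrl: string; textFallback?: string }
  | { kind: "text"; content: string };

// Validate after upload: unsupported media degrades to its text fallback
// (e.g. extracted text or a description) rather than being rejected by
// the file picker before the upload even happens.
function validateAttachment(att: Attachment, mod: Modalities): Attachment | null {
  if (att.kind === "text") return att;
  const supported = att.kind === "image" ? mod.vision : mod.audio;
  if (supported) return att;
  if (att.textFallback !== undefined) {
    return { kind: "text", content: att.textFallback };
  }
  return null; // nothing usable for this model
}
```

Because the decision happens after upload, it is plain logic with no DOM involved, which is why it belongs in unit tests rather than Storybook UI tests.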
I've certainly wondered about this scenario. There are cases where I can see chaining vision and non-vision models being useful, such as having a vision model do OCR and then a non-vision model follow up. I'd love an omni-model that's on par, but in my tests, even the recent large Qwen vision models still lag in coding and logic compared to their non-vision counterparts. That said, the "transformation" is incomplete and prompt-based, isn't it? Like describing or OCR'ing the image, rather than actually converting the image into non-vision text tokens? I assume the button was grayed out because the non-vision model wouldn't be able to 'see' the image or make sense of the tokens. Personally, I'd totally support a PR to change it regardless.
As a precaution, I simply fixed the problem. But since the question was raised, I ran a simple test: I sent an image and asked the model NOT to respond; it accepted. After several exchanges, I requested a very detailed description of the image, and it managed to provide it without error!
Actually, it's quite simple. The image remains attached in the client-side prompt and, of course, is sent again on the next request. A non-VL model would throw an error, so the feature is simply a safety measure. Alternatively, you could filter out the image and continue in text mode using a user-requested description instead, though this is less precise than having a VL model process the actual image. In fact, it would reduce code complexity and give users more freedom. You'd just need a small notification that the image and its full description are no longer part of the context.
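That filtering step could be sketched like this (hypothetical message-part shape loosely following the OpenAI-style `image_url` parts mentioned later in the thread; names are illustrative):

```typescript
// Sketch: before resending history to a non-VL model, replace each
// image part with its description so the conversation can continue in
// text mode. The `description` field is a hypothetical place where the
// user-requested description would have been stored.
type Part =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string }; description?: string };

function toTextOnly(parts: Part[]): Part[] {
  return parts.map((p) =>
    p.type === "image_url"
      ? {
          type: "text" as const,
          // The bracketed note doubles as the user-facing notification
          // that the image itself is no longer in the context.
          text: `[image removed: ${p.description ?? "no description available"}]`,
        }
      : p
  );
}
```

Applied only when the target model lacks vision support, this keeps the request valid while making the degradation explicit in the transcript.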
Okay, I've updated the entire Qwen3.5 series with the latest GGUF and mmproj from Unsloth, and it works well. But indeed, I'm asking the model again; this isn't inside the agentic loop, although it should work there: the LLM needs to be given a loop so that it expects a specific thing from the generator. Then I can also do it in the sandbox, which can return images like Claude's "View" command. I'll test it right away.
Agentic loop with an exit condition based on an image description: simply enter "an unspecified object" in the image generator prompt, then describe the image. If it turns out to be something that could be a container, keep prompting the generator with "an unspecified object" until you get something that isn't a container. Loop-On-Image.mp4 Work...
More seriously, we need to try and see why it's not working for you. |
conversation_0d2d6fae-79f3-4a3b-a171-4c7b722a127f_agentic_loop_with_ex.json (conversation with embedded images). My last commit is 213c4a0. I'll update from master and retry:
No problem on the latest master.
Be aware that models can sometimes mimic tool calls and responses; for MCP, remember to force the LLM to populate a description field in the JSON: it helps a lot! We have a fix included in this PR (stripping reserved display tags); you need to be on a very recent master.
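One way to force that field is in the tool's input schema itself, by listing it as required (an illustrative tool definition following the usual MCP `inputSchema` JSON Schema shape; the tool name and wording are made up, not from this PR):

```typescript
// Illustrative MCP tool definition whose input schema forces the model
// to fill a "description" field alongside the actual prompt. Structured
// output constraints then reject calls that omit it.
const generateImageTool = {
  name: "generate_image",
  description: "Generate an image from a text prompt.",
  inputSchema: {
    type: "object",
    properties: {
      prompt: { type: "string" },
      description: {
        type: "string",
        description: "One sentence stating why this tool is being called.",
      },
    },
    required: ["prompt", "description"],
  },
};
```

Making the model articulate its intent per call also makes mimicked (hallucinated) tool calls easier to spot in the transcript.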
On the next conversation iteration, the tool call is no longer present; it's just "my prompt to generate x", "assistant reply explaining it couldn't see the image in the tool call, only that text string, exactly as shown in the screenshot", "image_url message part holding a data URL to a PNG image", "my comment asking if the assistant can see the image now". In its reply, Qwen is now somewhat confused: it confirms that it can see an image, but thinks it was added from some different source and is probably not related to the tool call. So perhaps my own custom MCP server is not returning images correctly -- though from my reading of the MCP SDK docs, that is how you're supposed to do it.
Are you in router mode with multiple models, or a single model?
Single-model mode. I debugged it to this point: [AgenticStore] Skipping image attachment (model "undefined" does not support vision). This was in the console all along. Clearly this is wrong; the model is perfectly capable of handling images.
Cool, we finally found the bug! I'm going to make a patch and PR it!
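The log above suggests a guard of roughly this kind (a sketch with hypothetical names, not the actual AgenticStore code or the eventual patch): when model metadata hasn't been resolved in single-model mode, the model is `undefined`, and a naive check treats "unknown" as "no vision".

```typescript
// Sketch of the guard class behind "model 'undefined' does not support
// vision". All names are hypothetical.
interface ModelInfo {
  id: string;
  supportsVision: boolean;
}

// Buggy shape: `!model?.supportsVision` also skips the attachment when
// the metadata simply hasn't loaded yet (model === undefined).
// Safer shape: only skip when we positively know vision is absent.
function shouldAttachImage(model: ModelInfo | undefined): boolean {
  if (model === undefined) return true; // metadata unknown: don't block
  return model.supportsVision;
}
```

Whether "unknown" should default to attaching (and let the backend reject) or to blocking is a judgment call; the log shows the current code takes the blocking branch even when nothing is actually known about the model.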
In the meantime, you can switch to router mode and install just one model to test it. Then you'll see; you'll want more.
I am aware of the multi-model mode. The reason I don't use it is an annoying usability snafu in the webui: I always have to choose the loaded model, because the select doesn't default to one of the models already loaded.
Oh, there's another, more serious reason. The model routing proxy works very poorly with agentic coding when all you have is something like a Strix Halo and the prompt is around 100k tokens long and for some reason must be reprocessed. This takes a while (at least 15 minutes, I think), and every piece of software I've ever tried times out at 5 minutes, so they retry. In multi-model mode, the client's connection to the proxy drops, but the proxy doesn't drop its connection to the backend server actually doing the inference; soon the client sends a new request, and at that point you have a bunch of inferences all running, all starting from the first token again. In single-model mode, when the client disconnects, the server stops and saves its progress on the prompt, so the next retry continues appropriately. Multi-model mode is pretty much useless to me because of this problem, as long as I use slow hardware.
I admit it annoys me too. I'll keep it in mind for the next PR! This is exactly the kind of feedback we need to improve the user experience!
That, on the other hand, is very important! I also have Strix Halo devices at work, with people using llama.cpp in router mode on them. So I need to look into that!
Yep, I've had a Strix Halo doing "work" all night, which involved just reprocessing some 200k-token prompts over and over again. Because this hardware is mediocre for this kind of long-context work with the 122B model, prompt processing and token generation can get very slow, but that doesn't really bother me as long as it happens while I sleep. It's not as if I hear its fans screaming, and when I come back in the morning, the "night shift" has usually made lots of nice progress. But it does need a way to recover from a single HTTP timeout...
We'll find this bug, don't worry. I have the equipment to test it! Can you open GitHub issues for both, with what we've learned? That way we won't forget anything!
Yes, I suppose I will be doing that.
I made the MCP model=undefined ticket. I don't think I'm confident enough to create the ticket about the multi-model server and backend timeouts, at least not until I confirm the bad behavior with long contexts on the routing server again. I see there's been work on llama.cpp, such as steady context checkpoints, which could also influence the behavior.

One major issue I see is that prompt processing isn't interruptible. For instance, if the web client closes the HTTP request, llama.cpp continues processing the prompt regardless. I think a key improvement would be to cancel the prompt when the client no longer wants it.

I'm also seeing what are probably bugs in how concurrent requests targeting the same prompt on multiple slots are handled. The bug is pretty bad: all the work gets thrown away and prompt processing starts from scratch. This is probably what I was seeing with the routing server. When you have multiple copies of the same prompt, it seems the first inference finishes, then the new slot gets its turn, but it doesn't reuse the cache; it seems to delete everything and start from 0. My testing of this is quite rough, but I think I reproduced it multiple times tonight just by clicking cancel and continue in Kilocode.

This problem seems to go away with -np 1, which removes the parallel slots. I think what happens with -np 1 is that the next request in the queue isn't processed at all until the previous inference fully completes, and then it uses the KV cache appropriately. I don't think parallel prompts that share a prefix of about 99% of their length actually have to work; just cancelling when the client isn't listening anymore would seem a simple and sufficient solution to me.
You're right; I'm also limiting myself to -np 1 because I've encountered performance issues, but I haven't been able to pinpoint the cause well enough to troubleshoot; it was random. I don't think it's due to KV cache fragmentation with -kvu, but further testing is needed. However, interrupting an inference in progress works; have you tested it recently? With the webui or a custom client?
I think inference does interrupt appropriately, but it seems to me that prompt processing does not. It seems to be on rails and runs its course even when the HTTP client is no longer connected to the server. It probably terminates immediately afterwards without inferring a single token.
That's possible; the interrupt for prompt preprocessing is missing. On an RTX it's quite fast, but it should be fixed: it must stop at the boundaries of the batches.
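Conceptually (the server itself is C++, so this TypeScript sketch is only illustrative of the control flow, not of llama.cpp's actual code), stopping at batch boundaries means checking a cancel flag between batches of the prompt, so that on disconnect the work done so far is kept and a retry can resume:

```typescript
// Illustrative batch loop: process a prompt in chunks and stop at a
// batch boundary as soon as the client has gone away. Returns how many
// tokens were processed, i.e. the progress a retry could resume from.
function processPrompt(
  tokens: number[],
  batchSize: number,
  isCancelled: () => boolean
): number {
  let processed = 0;
  while (processed < tokens.length) {
    if (isCancelled()) break; // check at every batch boundary
    const batch = tokens.slice(processed, processed + batchSize);
    // ... feed `batch` to the model here ...
    processed += batch.length;
  }
  return processed;
}
```

The key property is that cancellation latency is bounded by one batch rather than by the whole prompt, which matters exactly in the 100k+ token scenario described above.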
Just chiming in that on a 3090, I'm pretty sure I've hit the bug of old prompts continuing to be processed in parallel with new prompts after a cancel. My solution was also to set -np 1.
Bothers me as well, tbh.
Interesting! I'm going to set it back to 4 to continue trying to pinpoint the bug! Note that -kvu (unified KV cache) is also enabled by default, which is supposed to make this transparent when using only one thread.
Please open an issue if one doesn't already exist, so that we don't forget it.
Oh yeah, sorry - for me I believe I hit it with Roo Code. It's been too long to remember the details well at this point, unfortunately. Usually something unexpected would happen, like a loop or a crash on Roo's side or an unloaded model; then I'd restart the server and find everything going at a snail's pace, with signs pointing to the old attempt still processing in the background (and usually stuck). The problem went away with -np 1 and I haven't thought much about it since.
Couldn't find one exactly, so sure. I think it's different enough from this one to warrant a separate issue, but it's also important: 😉
For all of you who have stdin/stdout binaries and need a bridge for this feature: https://github.com/AgentForgeEngine/mpc-bridge - more features coming soon.





New features
`llama-server` command with `--webui-mcp-proxy` flag to enable it)

UI Improvements
Architecture refactors/improvements:
`.service.ts` format

Important
This PR includes MCP-only changes, but it builds on a couple of PRs improving architecture and UI foundations in the codebase: #19689, #19685, #19596, #19586, #19571, #19556, #19551, #19541 and #20066
Video demos
Adding a new MCP Server and using it within an Agentic Loop
demo1.mp4
Using MCP Prompts
demo2.mp4
Using MCP Resources
demo3.mp4
Image Generation and Web Search using different MCP servers
demo4.mp4