server : add Anthropic Messages API support#17570

Merged
ngxson merged 9 commits into ggml-org:master from noname22:feature/anthropic-api-support
Nov 28, 2025
Conversation

@noname22
Contributor

claude-code

Summary

This PR adds Anthropic Messages API compatibility to llama-server. The implementation converts Anthropic's request format to the OpenAI-compatible internal format, reusing the existing inference pipeline.

Motivation

  • Enables llama.cpp to serve as a local/self-hosted alternative to Anthropic's Claude API
  • Allows Claude Code and other Anthropic-compatible clients to work with llama-server

Features Implemented

Endpoints:

  • POST /v1/messages - Chat completions with streaming support
  • POST /v1/messages/count_tokens - Token counting for prompts
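
To make the endpoint shape concrete, here is a minimal sketch of a request body for POST /v1/messages following the Anthropic Messages API shape. The model name, system prompt, and host/port in the comment are placeholder assumptions, not values from this PR; llama-server serves whatever model it was launched with.

```python
import json

def build_messages_request(prompt: str) -> dict:
    # Minimal Anthropic-style request body. "max_tokens" is required by the
    # Anthropic API shape; the model name here is a placeholder.
    return {
        "model": "local-model",
        "max_tokens": 256,
        "system": "You are a helpful assistant.",
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": prompt}]},
        ],
    }

body = build_messages_request("Hello!")
print(json.dumps(body, indent=2))
# To send it to a local server (hypothetical host/port):
#   curl http://localhost:8080/v1/messages \
#        -H "content-type: application/json" -d "$BODY"
```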

Functionality:

  • Streaming with proper Anthropic SSE event types (message_start, content_block_delta, etc.)
  • Tool use (function calling) with tool_use/tool_result content blocks
  • Vision support with image content blocks (base64 and URL)
  • System prompts and multi-turn conversations
  • Extended thinking parameter support
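
The streaming protocol emits a fixed sequence of SSE event types. The sketch below (not the server implementation, just an illustration of the documented ordering for a single text content block) shows how deltas are framed by start/stop events:

```python
# Illustrative sketch of the Anthropic streaming SSE event order for one
# text content block; each delta corresponds to a streamed token chunk.
def anthropic_event_sequence(deltas):
    yield "message_start"
    yield "content_block_start"
    for _ in deltas:
        yield "content_block_delta"
    yield "content_block_stop"
    yield "message_delta"   # carries stop_reason and usage
    yield "message_stop"

events = list(anthropic_event_sequence(["Hel", "lo"]))
print(events)
```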

Testing

  • Tests in test_anthropic_api.py
  • Tests cover: basic messages, streaming, tools, vision, token counting, parameters, error handling, content block indices

@noname22
Contributor Author

New PR to allow maintainers to edit.
Old PR here: #17425

@github-actions bot added the examples, python (python script changes), server labels on Nov 28, 2025
@noname22
Contributor Author

The RISCV test is failing with:

The self-hosted runner lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

I'm guessing it's not related to the PR? Any way to retry?

@ngxson
Contributor

ngxson commented Nov 28, 2025

This PR can be merged when the server CI passes. The other CI jobs are not important.

@ngxson ngxson merged commit ddf9f94 into ggml-org:master Nov 28, 2025
65 of 69 checks passed
@ericcurtin
Collaborator

ericcurtin commented Nov 28, 2025

I stumbled across this as it hit conflicts with my PR. I'm curious: what models does this work with? With sufficient hardware, is this capable of beating Claude's cloud models?

@noname22
Contributor Author

Technically it works with pretty much any model, but to get anywhere near Claude Sonnet you'd probably need a large, agentic model like MiniMax M2, Kimi K2, Qwen3 Coder 480B-A35B, etc.

That being said, I've had decent results for simple tasks with Qwen3 Coder 30B-A3B and gpt-oss-20b on a single 4090.

In my very subjective experience, the same models tend to perform a lot better with the Claude Code CLI app than with alternatives such as Open Code or gemini-cli and its clones, like Qwen3-Coder (the cli app).

@ericcurtin
Collaborator

ericcurtin commented Nov 28, 2025

Interesting... If you want to take a quick peek, I fixed the conflicts here:

#17554

although they weren't major conflicts, it was just moving code from one place to another.

@noname22
Contributor Author

> Interesting... If you want to take a quick peek, I fixed the conflicts here:
>
> #17554
>
> although they weren't major conflicts, it was just moving code from one place to another.

What files were the conflicts in, server.cpp?

```cpp
json server_task_result_cmpl_partial::to_json_anthropic() {
    json events = json::array();
    bool first = (n_decoded == 1);
    static bool text_block_started = false;
```
Contributor

@ngxson ngxson Dec 22, 2025

@noname22 is there any reason why this is static?

Using static here will cause a data race when two requests run in parallel. Please create a PR to remove this static. nvm, I'll remove it

Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026
* server : add Anthropic Messages API support

* remove -@pytest.mark.slow from tool calling/jinja tests

* server : remove unused code and slow/skip on test_anthropic_vision_base64_with_multimodal_model in test_anthropic_api.py

* server : removed redundant n field logic in anthropic_params_from_json

* server : use single error object instead of error_array in streaming response handler for /v1/chat/completions and use unordered_set instead of set in to_json_anthropic_stream()

* server : refactor Anthropic API to use OAI conversion

* make sure basic test always go first

* clean up

* clean up api key check, add test

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026
