From 9e5c06970eba56854dba4e19efc98e93e34ebdaf Mon Sep 17 00:00:00 2001
From: dk67604
Date: Fri, 16 Jan 2026 21:57:39 -0800
Subject: [PATCH 1/2] docs: add strands-vllm community model provider
 documentation

Add documentation for strands-vllm, a vLLM model provider for Strands
Agents SDK with Token-In/Token-Out (TITO) support for agentic RL training.

Features documented:

- OpenAI-compatible API integration with vLLM
- TITO support for capturing prompt_token_ids and token_ids
- Tool call validation hooks for RL-friendly error feedback
- Agent Lightning integration via OpenTelemetry spans
- TokenManager for building RL trajectories with loss masks

References:

- strands-vllm: https://github.com/agents-community/strands-vllm
- Agent Lightning: https://github.com/microsoft/agent-lightning

Closes #432
---
 docs/community/model-providers/vllm.md | 278 +++++++++++++++++++++++++
 mkdocs.yml                             |   1 +
 2 files changed, 279 insertions(+)
 create mode 100644 docs/community/model-providers/vllm.md

diff --git a/docs/community/model-providers/vllm.md b/docs/community/model-providers/vllm.md
new file mode 100644
index 00000000..1b612503
--- /dev/null
+++ b/docs/community/model-providers/vllm.md
@@ -0,0 +1,278 @@
# vLLM

{{ community_contribution_banner }}

!!! info "Language Support"
    This provider is only supported in Python.

[strands-vllm](https://github.com/agents-community/strands-vllm) is a [vLLM](https://docs.vllm.ai/) model provider for the Strands Agents SDK with Token-In/Token-Out (TITO) support for agentic RL training. It integrates with vLLM's OpenAI-compatible API and is optimized for reinforcement learning workflows with [Agent Lightning](https://blog.vllm.ai/2025/10/22/agent-lightning.html).

**Features:**

- **OpenAI-Compatible API**: Uses vLLM's OpenAI-compatible `/v1/chat/completions` endpoint with streaming
- **TITO Support**: Captures `prompt_token_ids` and `token_ids` directly from vLLM - no retokenization drift
- **Tool Call Validation**: Hook-based validation to reject unknown tools and invalid JSON inputs (RL-friendly error feedback)
- **Agent Lightning Integration**: Automatically adds token IDs to OpenTelemetry spans for RL training data extraction
- **Streaming**: Full streaming support with token ID capture via `VLLMTokenRecorder`

## Installation

Install strands-vllm along with the Strands Agents SDK:

```bash
pip install strands-vllm strands-agents-tools
```

For the retokenization drift demos (these require a HuggingFace tokenizer):

```bash
pip install "strands-vllm[drift]" strands-agents-tools
```

## Requirements

- A vLLM server running your model (v0.10.2+ for `return_token_ids` support)
- For tool calling: vLLM must be started with tool calling enabled and an appropriate chat template

## Usage

### 1. Start vLLM Server

First, start a vLLM server with your model:

```bash
vllm serve <model> \
    --host 0.0.0.0 \
    --port 8000
```

For tool calling support, add the appropriate flags for your model:

```bash
vllm serve <model> \
    --host 0.0.0.0 \
    --port 8000 \
    --enable-auto-tool-choice \
    --tool-call-parser <parser>  # e.g., llama3_json, hermes
```

See the [vLLM tool calling documentation](https://docs.vllm.ai/en/latest/features/tool_calling.html) for supported parsers and chat templates.
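Before wiring up an agent, it helps to confirm that the server is reachable and to see the exact model ID it is serving. Below is a minimal check against the OpenAI-compatible `/v1/models` endpoint; it assumes the default `localhost:8000` address from the commands above and uses only the standard library:

```python
import json
import urllib.request

# Query vLLM's OpenAI-compatible model listing endpoint.
with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
    payload = json.load(resp)

# Each entry's "id" is a model_id you can pass to VLLMModel below.
for entry in payload["data"]:
    print(entry["id"])
```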
### 2. Basic Agent

```python
import os
from strands import Agent
from strands_vllm import VLLMModel, VLLMTokenRecorder

# Configure via environment variables or directly
base_url = os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1")
model_id = os.getenv("VLLM_MODEL_ID", "<model-id>")

model = VLLMModel(
    base_url=base_url,
    model_id=model_id,
    return_token_ids=True,
)

recorder = VLLMTokenRecorder()
agent = Agent(model=model, callback_handler=recorder)

result = agent("What is the capital of France?")
print(result)

# Access TITO data for RL training
print(f"Prompt tokens: {len(recorder.prompt_token_ids or [])}")
print(f"Response tokens: {len(recorder.token_ids or [])}")
```

### 3. Tool Call Validation (Recommended for RL)

vLLM tool parsers can post-process model outputs, potentially creating tool calls for unknown tools. Use `VLLMToolValidationHooks` to validate tool calls before execution:

```python
import os
from strands import Agent
from strands_tools.calculator import calculator
from strands_vllm import VLLMModel, VLLMToolValidationHooks

model = VLLMModel(
    base_url=os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1"),
    model_id=os.getenv("VLLM_MODEL_ID", "<model-id>"),
    return_token_ids=True,
)

# Add validation hook - rejects unknown tools with deterministic error feedback
agent = Agent(
    model=model,
    tools=[calculator],
    hooks=[VLLMToolValidationHooks()],
)

result = agent("Compute 17 * 19 using the calculator tool.")
print(result)
```

The hook validates:

- **Tool name**: Must exist in agent's tool registry
- **Tool input**: If input is a JSON string, must be valid JSON (catches `JSONDecodeError`)

Invalid tool calls receive a deterministic error `toolResult`, providing clean RL signals.

### 4. Agent Lightning Integration

`VLLMTokenRecorder` automatically adds token IDs to OpenTelemetry spans for [Agent Lightning](https://blog.vllm.ai/2025/10/22/agent-lightning.html) compatibility:

```python
import os
from strands import Agent
from strands_vllm import VLLMModel, VLLMTokenRecorder

model = VLLMModel(
    base_url=os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1"),
    model_id=os.getenv("VLLM_MODEL_ID", "<model-id>"),
    return_token_ids=True,
)

# add_to_span=True (default) adds token IDs to OpenTelemetry spans
recorder = VLLMTokenRecorder(add_to_span=True)
agent = Agent(model=model, callback_handler=recorder)

result = agent("Hello!")
```

The following span attributes are set:

| Attribute | Description |
| --------- | ----------- |
| `llm.token_count.prompt` | Token count for the prompt (OpenTelemetry semantic convention) |
| `llm.token_count.completion` | Token count for the completion (OpenTelemetry semantic convention) |
| `llm.hosted_vllm.prompt_token_ids` | Token ID array for the prompt |
| `llm.hosted_vllm.response_token_ids` | Token ID array for the response |
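To check these attributes end to end, you can install an in-memory span exporter and read the token IDs back after a run. The sketch below uses the OpenTelemetry SDK and assumes Strands publishes its spans through the global tracer provider; the attribute names are the ones from the table above:

```python
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter
from strands import Agent
from strands_vllm import VLLMModel, VLLMTokenRecorder

# Install an in-memory exporter before creating the agent.
exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
trace.set_tracer_provider(provider)

model = VLLMModel(
    base_url=os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1"),
    model_id=os.getenv("VLLM_MODEL_ID", "<model-id>"),
    return_token_ids=True,
)
agent = Agent(model=model, callback_handler=VLLMTokenRecorder())
agent("Hello!")

# Read the token ID attributes back from the finished spans.
for span in exporter.get_finished_spans():
    prompt_ids = span.attributes.get("llm.hosted_vllm.prompt_token_ids")
    if prompt_ids is not None:
        print(span.name, "prompt tokens:", len(prompt_ids))
```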
### 5. RL Training with TokenManager

For building RL-ready trajectories with loss masks:

```python
import asyncio
import os
from strands import Agent, tool
from strands_tools.calculator import calculator as _calculator_impl
from strands_vllm import TokenManager, VLLMModel, VLLMTokenRecorder, VLLMToolValidationHooks

@tool
def calculator(expression: str) -> dict:
    return _calculator_impl(expression=expression)

async def main():
    model = VLLMModel(
        base_url=os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1"),
        model_id=os.getenv("VLLM_MODEL_ID", "<model-id>"),
        return_token_ids=True,
    )

    recorder = VLLMTokenRecorder()
    agent = Agent(
        model=model,
        tools=[calculator],
        hooks=[VLLMToolValidationHooks()],
        callback_handler=recorder,
    )

    await agent.invoke_async("What is 25 * 17?")

    # Build RL trajectory with loss mask
    tm = TokenManager()
    for entry in recorder.history:
        if entry.get("prompt_token_ids"):
            tm.add_prompt(entry["prompt_token_ids"])  # loss_mask=0
        if entry.get("token_ids"):
            tm.add_response(entry["token_ids"])  # loss_mask=1

    print(f"Total tokens: {len(tm)}")
    print(f"Prompt tokens: {sum(1 for m in tm.loss_mask if m == 0)}")
    print(f"Response tokens: {sum(1 for m in tm.loss_mask if m == 1)}")
    print(f"Token IDs: {tm.token_ids[:20]}...")  # First 20 tokens
    print(f"Loss mask: {tm.loss_mask[:20]}...")

asyncio.run(main())
```

## Configuration

### Model Configuration

The `VLLMModel` accepts the following parameters:

| Parameter | Description | Example | Required |
| --------- | ----------- | ------- | -------- |
| `base_url` | vLLM server URL | `"http://localhost:8000/v1"` | Yes |
| `model_id` | Model identifier | `"<model-id>"` | Yes |
| `api_key` | API key (usually `"EMPTY"` for local vLLM) | `"EMPTY"` | No (default: `"EMPTY"`) |
| `return_token_ids` | Request token IDs from vLLM | `True` | No (default: `False`) |
| `disable_tools` | Remove `tools`/`tool_choice` from requests | `True` | No (default: `False`) |
| `params` | Additional generation parameters | `{"temperature": 0, "max_tokens": 256}` | No |

### VLLMTokenRecorder Configuration

| Parameter | Description | Default |
| --------- | ----------- | ------- |
| `inner` | Inner callback handler to chain | `None` |
| `add_to_span` | Add token IDs to OpenTelemetry spans | `True` |

### VLLMToolValidationHooks

No configuration required. Simply add to agent's hooks:

```python
agent = Agent(model=model, tools=[...], hooks=[VLLMToolValidationHooks()])
```

## Why TITO Matters

In agent RL training, retokenization drift can cause training instability:

1. **Non-unique tokenization**: The word "HAVING" might be tokenized as `H` + `AVING` during generation but `HAV` + `ING` when retokenized
2. **Tool-call serialization**: Tool call JSON may be normalized/reformatted by parsers
3. **Chat template differences**: Different frameworks may use different chat templates

Using vLLM's `return_token_ids` feature captures the exact tokens used during inference, eliminating these issues.

Reference: [No More Retokenization Drift](https://blog.vllm.ai/2025/10/22/agent-lightning.html)

## Troubleshooting

### Connection errors to vLLM server

Ensure your vLLM server is running and accessible:

```bash
# Check if server is responding
curl http://localhost:8000/health
```

### No token IDs captured

Ensure:

1. vLLM version is 0.10.2 or later
2. `return_token_ids=True` is set on `VLLMModel`
3. Your vLLM server supports `return_token_ids` in streaming mode
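If all three conditions hold, a quick end-to-end check can confirm whether token IDs are actually being captured. This is a minimal sketch reusing the recorder from the basic agent example above:

```python
import os
from strands import Agent
from strands_vllm import VLLMModel, VLLMTokenRecorder

model = VLLMModel(
    base_url=os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1"),
    model_id=os.getenv("VLLM_MODEL_ID", "<model-id>"),
    return_token_ids=True,
)
recorder = VLLMTokenRecorder()
agent = Agent(model=model, callback_handler=recorder)
agent("Reply with one word.")

# Both should be non-empty when TITO capture is working.
print("prompt_token_ids captured:", bool(recorder.prompt_token_ids))
print("token_ids captured:", bool(recorder.token_ids))
```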
### Tool calls for unknown tools

If vLLM's tool parser produces tool calls for tools not in your registry, add `VLLMToolValidationHooks` to get deterministic error feedback instead of crashes.

### Model only supports single tool calls

Some models/chat templates only support one tool call per message. If you see `"This model only supports single tool-calls at once!"`, adjust your prompts to request one tool at a time.

## References

* [strands-vllm Repository](https://github.com/agents-community/strands-vllm)
* [vLLM Documentation](https://docs.vllm.ai/)
* [Agent Lightning GitHub](https://github.com/microsoft/agent-lightning) - The absolute trainer to light up AI agents
* [Agent Lightning Blog Post](https://blog.vllm.ai/2025/10/22/agent-lightning.html) - No More Retokenization Drift
* [Strands Agents API](../../api-reference/python/models/model.md)

diff --git a/mkdocs.yml b/mkdocs.yml
index 7f725987..232898f4 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -228,6 +228,7 @@ nav:
 - Nebius Token Factory: community/model-providers/nebius-token-factory.md
 - NVIDIA NIM: community/model-providers/nvidia-nim.md
 - SGLang: community/model-providers/sglang.md
+- vLLM: community/model-providers/vllm.md
 - MLX: community/model-providers/mlx.md
 - Session Managers:
 - Amazon AgentCore Memory: community/session-managers/agentcore-memory.md

From d1172ef36acce56067737a848935e9a69a2f2a35 Mon Sep 17 00:00:00 2001
From: dk67604
Date: Thu, 22 Jan 2026 22:07:28 -0800
Subject: [PATCH 2/2] Update docs

---
 docs/community/model-providers/vllm.md | 48 ++++++++++++--------------
 1 file changed, 22 insertions(+), 26 deletions(-)

diff --git a/docs/community/model-providers/vllm.md b/docs/community/model-providers/vllm.md
index 1b612503..02af6817 100644
--- a/docs/community/model-providers/vllm.md
+++ b/docs/community/model-providers/vllm.md
@@ -11,10 +11,13 @@

 - **OpenAI-Compatible API**: Uses vLLM's OpenAI-compatible `/v1/chat/completions` endpoint with streaming
 - **TITO Support**: Captures `prompt_token_ids` and `token_ids` directly from vLLM - no retokenization drift
-- **Tool Call Validation**: Hook-based validation to reject unknown tools and invalid JSON inputs (RL-friendly error feedback)
+- **Tool Call Validation**: Optional hooks for RL-friendly error messages (allowed tools list, schema validation)
 - **Agent Lightning Integration**: Automatically adds token IDs to OpenTelemetry spans for RL training data extraction
 - **Streaming**: Full streaming support with token ID capture via `VLLMTokenRecorder`

+!!! tip "Why TITO?"
+    Traditional retokenization can cause drift in RL training—the same text may tokenize differently during inference vs. training (e.g., "HAVING" → `H`+`AVING` vs. `HAV`+`ING`). TITO captures exact tokens from vLLM, eliminating this issue. See [No More Retokenization Drift](https://blog.vllm.ai/2025/10/22/agent-lightning.html) for details.
+
 ## Installation

 Install strands-vllm along with the Strands Agents SDK:

@@ -86,9 +89,9 @@ print(f"Prompt tokens: {len(recorder.prompt_token_ids or [])}")
 print(f"Response tokens: {len(recorder.token_ids or [])}")
 ```

-### 3. Tool Call Validation (Recommended for RL)
+### 3. Tool Call Validation (Optional, Recommended for RL)

-vLLM tool parsers can post-process model outputs, potentially creating tool calls for unknown tools. Use `VLLMToolValidationHooks` to validate tool calls before execution:
+Strands SDK already handles unknown tools and malformed JSON gracefully. `VLLMToolValidationHooks` adds RL-friendly enhancements:

 ```python
 import os
 from strands import Agent
 from strands_tools.calculator import calculator
 from strands_vllm import VLLMModel, VLLMToolValidationHooks
@@ -102,7 +105,6 @@ model = VLLMModel(
     base_url=os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1"),
     model_id=os.getenv("VLLM_MODEL_ID", "<model-id>"),
     return_token_ids=True,
 )

-# Add validation hook - rejects unknown tools with deterministic error feedback
 agent = Agent(
     model=model,
     tools=[calculator],
@@ -113,12 +115,12 @@ result = agent("Compute 17 * 19 using the calculator tool.")
 print(result)
 ```

-The hook validates:
+**What it adds beyond Strands defaults:**

-- **Tool name**: Must exist in agent's tool registry
-- **Tool input**: If input is a JSON string, must be valid JSON (catches `JSONDecodeError`)
+- **Unknown tool errors include allowed tools list** — helps RL training learn valid tool names
+- **Schema validation** — catches missing required args and unknown args before tool execution

-Invalid tool calls receive a deterministic error `toolResult`, providing clean RL signals.
+Invalid tool calls receive deterministic error messages, providing cleaner RL training signals.

 ### 4. Agent Lightning Integration

@@ -222,25 +224,19 @@ The `VLLMModel` accepts the following parameters:
 | `inner` | Inner callback handler to chain | `None` |
 | `add_to_span` | Add token IDs to OpenTelemetry spans | `True` |

-### VLLMToolValidationHooks
-
-No configuration required. Simply add to agent's hooks:
-
-```python
-agent = Agent(model=model, tools=[...], hooks=[VLLMToolValidationHooks()])
-```
+### VLLMToolValidationHooks Configuration

-## Why TITO Matters
-
-In agent RL training, retokenization drift can cause training instability:
-
-1. **Non-unique tokenization**: The word "HAVING" might be tokenized as `H` + `AVING` during generation but `HAV` + `ING` when retokenized
-2. **Tool-call serialization**: Tool call JSON may be normalized/reformatted by parsers
-3. **Chat template differences**: Different frameworks may use different chat templates
+| Parameter | Description | Default |
+| --------- | ----------- | ------- |
+| `include_allowed_tools_in_errors` | Include list of allowed tools in error messages | `True` |
+| `max_allowed_tools_in_error` | Maximum tool names to show in error messages | `25` |
+| `validate_input_shape` | Validate required/unknown args against schema | `True` |

-Using vLLM's `return_token_ids` feature captures the exact tokens used during inference, eliminating these issues.
+**Example error messages** (more informative than Strands defaults):

-Reference: [No More Retokenization Drift](https://blog.vllm.ai/2025/10/22/agent-lightning.html)
+- Unknown tool: `Error: unknown tool: fake_tool | allowed_tools=[calculator, search, ...]`
+- Missing argument: `Error: tool_name=<name> | missing required argument(s): expression`
+- Unknown argument: `Error: tool_name=<name> | unknown argument(s): invalid_param`

 ## Troubleshooting

@@ -261,9 +257,9 @@ Ensure:
 2. `return_token_ids=True` is set on `VLLMModel`
 3. Your vLLM server supports `return_token_ids` in streaming mode

-### Tool calls for unknown tools
+### RL training needs cleaner error signals

-If vLLM's tool parser produces tool calls for tools not in your registry, add `VLLMToolValidationHooks` to get deterministic error feedback instead of crashes.
+Strands handles unknown tools gracefully, but for RL training you may want more informative errors. Add `VLLMToolValidationHooks` to get errors that include the list of allowed tools and validate argument schemas.

 ### Model only supports single tool calls