feat: Integrate Chutes API with Kimi K2.5-TEE model (#1)
Conversation
- Add ChutesClient class for Chutes API (https://llm.chutes.ai/v1)
- Support CHUTES_API_TOKEN environment variable for authentication
- Set moonshotai/Kimi-K2.5-TEE as default model
- Enable thinking mode by default with <think>...</think> parsing
- Use Kimi K2.5 recommended parameters (temp=1.0, top_p=0.95 for thinking)
- Increase context limit to 256K tokens for Kimi K2.5
- Add openai>=1.0.0 dependency for OpenAI-compatible API client
- Keep LiteLLMClient as fallback for other providers
- Add get_llm_client() factory function for provider selection

Based on tau-agent integration pattern from: https://github.com/unconst/tau-agent
📝 Walkthrough

Multi-provider LLM support is added via a factory function that selects between Chutes API and OpenRouter clients based on configuration. The agent initialization flow now retrieves LLM clients dynamically. Configuration defaults are updated to use the Chutes provider with the Kimi-K2.5-TEE model, including thinking mode and cost tracking. New dependency on openai>=1.0.0.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Agent
    participant Factory as get_llm_client
    participant Config
    participant ChutesClient
    participant ChutesAPI

    Agent->>Config: Read provider config
    Config-->>Agent: provider="chutes", model="Kimi-K2.5-TEE"
    Agent->>Factory: Call get_llm_client(provider, model, ...)
    Factory->>Factory: Check provider == "chutes"
    Factory->>ChutesClient: Initialize with API token & model
    ChutesClient-->>Factory: Instance created
    Factory-->>Agent: Return ChutesClient
    Agent->>ChutesClient: chat(messages, thinking_mode=True)
    ChutesClient->>ChutesAPI: POST /chat/completions
    ChutesAPI-->>ChutesClient: Response with thinking + text
    ChutesClient->>ChutesClient: Parse thinking, calculate cost
    ChutesClient-->>Agent: LLMResponse(text, thinking, cost)
```
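For orientation, a minimal sketch of what such a factory could look like. The stand-in client classes and parameter names below are simplifications for illustration, not the actual implementation in src/llm/client.py:

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChutesClient:
    """Stand-in for the real Chutes client (OpenAI-compatible endpoint)."""
    api_token: Optional[str]
    model: str
    base_url: str = "https://llm.chutes.ai/v1"


@dataclass
class LiteLLMClient:
    """Stand-in for the LiteLLM-based fallback client."""
    model: str


def get_llm_client(provider: str, model: str):
    """Return an LLM client for the configured provider (sketch only)."""
    if provider == "chutes":
        # Chutes authenticates via the CHUTES_API_TOKEN environment variable.
        return ChutesClient(api_token=os.environ.get("CHUTES_API_TOKEN"), model=model)
    # Any other provider falls back to the LiteLLM-based client.
    return LiteLLMClient(model=model)
```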
Estimated Code Review Effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/llm/client.py`:
- Line 381: The two clients use different hard-coded defaults for cost_limit
(LiteLLMClient sets 10.0 while ChutesClient sets 100.0); make them consistent by
centralizing the default: add a shared constant (e.g., DEFAULT_LLM_COST_LIMIT)
or read the same env var fallback in both classes, and update LiteLLMClient and
ChutesClient to use that constant/env-based default when initializing
self.cost_limit so both clients default to the same value and honor
LLM_COST_LIMIT uniformly.
🧹 Nitpick comments (7)
src/llm/client.py (5)
23-23: Unused import `sys`.

The `sys` module is imported but not used anywhere in this file.

🧹 Proposed fix

```diff
-import sys
```
194-204: Regex only captures the first `<think>` block.

If the model returns multiple `<think>...</think>` blocks, this regex with `re.search` will only capture the first one; `re.sub` will remove all of them, but the thinking content from subsequent blocks will be lost.

♻️ Proposed fix to capture all thinking blocks

```diff
 def _parse_thinking_content(self, text: str) -> Tuple[str, str]:
     """Parse thinking content from response.

     Kimi K2.5 can return thinking content in:
     1. <think>...</think> tags (for some deployments)
     2. reasoning_content field (official API)

     Returns (thinking_content, final_response).
     """
     if not text:
         return "", ""

     # Check for <think>...</think> pattern
     think_pattern = r"<think>(.*?)</think>"
-    match = re.search(think_pattern, text, re.DOTALL)
+    matches = re.findall(think_pattern, text, re.DOTALL)

-    if match:
-        thinking = match.group(1).strip()
+    if matches:
+        thinking = "\n\n".join(m.strip() for m in matches)
         # Remove the think block from the response
         response = re.sub(think_pattern, "", text, flags=re.DOTALL).strip()
         return thinking, response

     return "", text
```
260-267: Preserve exception chain for better debugging.

When re-raising exceptions, use `raise ... from e` to preserve the original traceback, which aids debugging.

♻️ Proposed fix

```diff
 try:
     response = self._client.chat.completions.create(**kwargs)
     self._request_count += 1
 except Exception as e:
     error_msg = str(e)
     if "authentication" in error_msg.lower() or "api_key" in error_msg.lower() or "unauthorized" in error_msg.lower():
-        raise LLMError(error_msg, code="authentication_error")
+        raise LLMError(error_msg, code="authentication_error") from e
     elif "rate" in error_msg.lower() or "limit" in error_msg.lower():
-        raise LLMError(error_msg, code="rate_limit")
+        raise LLMError(error_msg, code="rate_limit") from e
     else:
-        raise LLMError(error_msg, code="api_error")
+        raise LLMError(error_msg, code="api_error") from e
```
296-304: Hardcoded pricing may become stale.

The cost estimation uses hardcoded pricing values. Consider adding a comment noting where to find current pricing, or making these values configurable.
📝 Suggested improvement
```diff
-    # Estimate cost (Kimi K2.5 pricing via Chutes - approximate)
-    # $0.60 per million input tokens, $2.50 per million output tokens
-    input_cost_per_1k = 0.0006  # $0.60 / 1000
-    output_cost_per_1k = 0.0025  # $2.50 / 1000
+    # Estimate cost (Kimi K2.5 pricing via Chutes - approximate as of Feb 2026)
+    # Check https://chutes.ai/pricing for current rates
+    # $0.60 per million input tokens, $2.50 per million output tokens
+    input_cost_per_1k = 0.0006  # $0.60 / 1M = $0.0006 / 1K
+    output_cost_per_1k = 0.0025  # $2.50 / 1M = $0.0025 / 1K
```
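If configurability is preferred over a comment, one option is to read the rates from the environment with the hardcoded values as fallbacks. This is only a sketch; the variable names below are hypothetical and not part of the PR:

```python
import os

# Allow operators to override pricing without a code change; the variable
# names are illustrative, and the defaults mirror the values in the diff above.
input_cost_per_1k = float(os.environ.get("CHUTES_INPUT_COST_PER_1K", "0.0006"))
output_cost_per_1k = float(os.environ.get("CHUTES_OUTPUT_COST_PER_1K", "0.0025"))
```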
466-473: Preserve exception chain for better debugging.

Same issue as in `ChutesClient`: use `raise ... from e` to preserve the original traceback.

♻️ Proposed fix

```diff
 except Exception as e:
     error_msg = str(e)
     if "authentication" in error_msg.lower() or "api_key" in error_msg.lower():
-        raise LLMError(error_msg, code="authentication_error")
+        raise LLMError(error_msg, code="authentication_error") from e
     elif "rate" in error_msg.lower() or "limit" in error_msg.lower():
-        raise LLMError(error_msg, code="rate_limit")
+        raise LLMError(error_msg, code="rate_limit") from e
     else:
-        raise LLMError(error_msg, code="api_error")
+        raise LLMError(error_msg, code="api_error") from e
```

src/config/defaults.py (1)
82-89: Clarify caching behavior for Chutes API.

The comments mention "Chutes may support server-side caching" and "Keep system prompt stable for best performance," but `cache_extended_retention` and `cache_key` appear to be unused in `ChutesClient` (as noted by static analysis). Consider documenting whether these are placeholders for future use or should be removed.

agent.py (1)
60-60: Unused imports `ChutesClient` and `LiteLLMClient`.

These classes are imported but never directly referenced in this file. The factory function `get_llm_client` returns them internally.

🧹 Proposed fix

```diff
-from src.llm.client import get_llm_client, CostLimitExceeded, ChutesClient, LiteLLMClient
+from src.llm.client import get_llm_client, CostLimitExceeded
```
```python
self.model = model
self.temperature = temperature
self.max_tokens = max_tokens
self.cost_limit = cost_limit or float(os.environ.get("LLM_COST_LIMIT", "10.0"))
```
Inconsistent default cost limit between clients.
LiteLLMClient defaults to $10.0 while ChutesClient defaults to $100.0. This inconsistency could surprise users when switching providers.
🔧 Proposed fix for consistency
```diff
-self.cost_limit = cost_limit or float(os.environ.get("LLM_COST_LIMIT", "10.0"))
+self.cost_limit = cost_limit or float(os.environ.get("LLM_COST_LIMIT", "100.0"))
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-self.cost_limit = cost_limit or float(os.environ.get("LLM_COST_LIMIT", "10.0"))
+self.cost_limit = cost_limit or float(os.environ.get("LLM_COST_LIMIT", "100.0"))
```
🤖 Prompt for AI Agents
In `@src/llm/client.py` at line 381, The two clients use different hard-coded
defaults for cost_limit (LiteLLMClient sets 10.0 while ChutesClient sets 100.0);
make them consistent by centralizing the default: add a shared constant (e.g.,
DEFAULT_LLM_COST_LIMIT) or read the same env var fallback in both classes, and
update LiteLLMClient and ChutesClient to use that constant/env-based default
when initializing self.cost_limit so both clients default to the same value and
honor LLM_COST_LIMIT uniformly.
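For illustration only, a minimal sketch of the centralized default described in the prompt above. The class bodies are reduced to the relevant line, and the 100.0 fallback value is an assumption to be aligned with project policy:

```python
import os
from typing import Optional

# Shared default so both clients honor LLM_COST_LIMIT uniformly.
# The constant name follows the reviewer's suggestion; the 100.0 fallback
# value is an assumption, not something fixed by the PR.
DEFAULT_LLM_COST_LIMIT = float(os.environ.get("LLM_COST_LIMIT", "100.0"))


class ChutesClient:
    def __init__(self, cost_limit: Optional[float] = None) -> None:
        # Fall back to the shared default instead of a per-class literal.
        self.cost_limit = cost_limit if cost_limit is not None else DEFAULT_LLM_COST_LIMIT


class LiteLLMClient:
    def __init__(self, cost_limit: Optional[float] = None) -> None:
        self.cost_limit = cost_limit if cost_limit is not None else DEFAULT_LLM_COST_LIMIT
```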
This umbrella commit combines changes from all three feature PRs:

- PR #1: Chutes API integration with Kimi K2.5-TEE model
- PR #2: Comprehensive documentation with Mermaid diagrams
- PR #3: Remove OpenRouter support, replace litellm with Chutes API

Conflicts resolved by taking the latest implementation from PR #3, which provides a cleaner httpx-based client without litellm dependency.
Summary
This PR integrates Chutes.ai API into baseagent, following the integration pattern from tau-agent.
Changes
New Features
- New `ChutesClient` for the Chutes API (`https://llm.chutes.ai/v1`)
- Default model: `moonshotai/Kimi-K2.5-TEE` (1T params, 32B activated)
- Thinking mode enabled by default with `<think>...</think>` parsing

Configuration Updates
- Default LLM provider set to `chutes`
- Temperature: `1.0` for thinking mode, `0.6` for instant mode
- `top_p`: `0.95` (Kimi K2.5 best practice)

Dependencies
- Add `openai>=1.0.0` for the OpenAI-compatible API client

Usage
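(The usage snippet from the original description did not survive extraction; below is a minimal sketch that exercises the Chutes endpoint directly through the `openai` package this PR adds, rather than the project's own client wrapper.)

```python
import os

from openai import OpenAI

# Point the OpenAI-compatible client at the Chutes endpoint; the model name
# and recommended sampling parameters come from the PR description.
client = OpenAI(
    base_url="https://llm.chutes.ai/v1",
    api_key=os.environ["CHUTES_API_TOKEN"],
)

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5-TEE",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=1.0,  # thinking-mode setting recommended for Kimi K2.5
    top_p=0.95,
)
print(response.choices[0].message.content)
```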
References
Acceptance Criteria
Summary by CodeRabbit