
feat: Integrate Chutes API with Kimi K2.5-TEE model #1

Merged
echobt merged 1 commit into main from feature/chutes-kimi-k25-integration on Feb 3, 2026

Conversation

echobt (Contributor) commented Feb 3, 2026

Summary

This PR integrates the Chutes.ai API into baseagent, following the integration pattern from tau-agent.

Changes

New Features

  • ChutesClient class for Chutes API (https://llm.chutes.ai/v1)
  • CHUTES_API_TOKEN environment variable for authentication
  • Default model: moonshotai/Kimi-K2.5-TEE (1T params, 32B activated)
  • Thinking mode enabled by default with <think>...</think> parsing
  • get_llm_client() factory function for easy provider selection
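
As a rough illustration of the provider-selection pattern these features describe, here is a minimal, self-contained sketch built directly on the openai>=1.0.0 SDK added by this PR. The real code wraps the clients in ChutesClient / LiteLLMClient classes, so the function body below is an assumption, not the actual implementation:

# Sketch only: shows the provider-selection idea using the openai SDK directly.
# The real get_llm_client() in src/llm/client.py returns ChutesClient or
# LiteLLMClient instances; constructor details here are assumptions.
import os

from openai import OpenAI


def get_llm_client(provider: str = "chutes") -> OpenAI:
    if provider == "chutes":
        # Chutes exposes an OpenAI-compatible endpoint at https://llm.chutes.ai/v1
        return OpenAI(
            base_url="https://llm.chutes.ai/v1",
            api_key=os.environ["CHUTES_API_TOKEN"],
        )
    # Fallback provider (OpenRouter also speaks the OpenAI protocol)
    return OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.environ["OPENROUTER_API_KEY"],
    )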

Configuration Updates

  • Default provider: chutes
  • Temperature: 1.0 for thinking mode, 0.6 for instant mode
  • Top-p: 0.95 (Kimi K2.5 best practice)
  • Context window: 256K tokens
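
Taken together, these values correspond to a defaults block along the following lines. The enable_thinking and model_context_limit names are reported elsewhere in this PR; the remaining keys are assumptions about how src/config/defaults.py is laid out:

# Illustrative defaults mirroring the values listed above (key names are partly assumptions).
LLM_DEFAULTS = {
    "provider": "chutes",
    "model": "moonshotai/Kimi-K2.5-TEE",
    "enable_thinking": True,            # thinking mode on by default
    "temperature": 1.0,                 # use 0.6 instead for instant mode
    "top_p": 0.95,                      # Kimi K2.5 best practice
    "model_context_limit": 256_000,     # 256K-token context window
}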

Dependencies

  • Added openai>=1.0.0 for OpenAI-compatible API client

Usage

# With Chutes API (default)
export CHUTES_API_TOKEN="your-token"
python agent.py --instruction "Your task here..."

# With OpenRouter (fallback)
export LLM_PROVIDER="openrouter"
export OPENROUTER_API_KEY="your-key"
python agent.py --instruction "Your task here..."
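
The factory can also be used programmatically. A hedged example based on the get_llm_client() and chat() interfaces described in this PR; the exact keyword names are assumptions, and the text/thinking/cost fields follow the LLMResponse shape shown in the review below:

# Programmatic usage sketch; argument names and defaults are assumptions.
import os

from src.llm.client import get_llm_client

os.environ.setdefault("CHUTES_API_TOKEN", "your-token")

client = get_llm_client(provider="chutes", model="moonshotai/Kimi-K2.5-TEE")
response = client.chat([{"role": "user", "content": "Summarize this repository."}])

print(response.thinking)  # content parsed out of <think>...</think>
print(response.text)      # final answer with thinking blocks removed
print(response.cost)      # estimated per-call cost, used for cost-limit enforcement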

References

Acceptance Criteria

  • LLM client supports Chutes API at https://llm.chutes.ai/v1
  • CHUTES_API_TOKEN environment variable is used for authentication
  • Default model is set to moonshotai/Kimi-K2.5-TEE
  • Thinking mode is enabled by default with proper parsing
  • Configuration uses Kimi K2.5 recommended parameters
  • Dependencies are updated (openai package)
  • Agent remains backward-compatible with LiteLLMClient fallback

Summary by CodeRabbit

  • New Features
    • Added multi-provider LLM support with Chutes API as default provider and OpenRouter as fallback option
    • Integrated Chutes API with Kimi K2.5-TEE model
    • Enabled thinking mode capability for enhanced LLM responses
    • Added per-call cost tracking and cost limit enforcement
    • Introduced configuration-driven provider selection and flexible model parameters

- Add ChutesClient class for Chutes API (https://llm.chutes.ai/v1)
- Support CHUTES_API_TOKEN environment variable for authentication
- Set moonshotai/Kimi-K2.5-TEE as default model
- Enable thinking mode by default with <think>...</think> parsing
- Use Kimi K2.5 recommended parameters (temp=1.0, top_p=0.95 for thinking)
- Increase context limit to 256K tokens for Kimi K2.5
- Add openai>=1.0.0 dependency for OpenAI-compatible API client
- Keep LiteLLMClient as fallback for other providers
- Add get_llm_client() factory function for provider selection

Based on tau-agent integration pattern from:
https://github.com/unconst/tau-agent

coderabbitai bot commented Feb 3, 2026

📝 Walkthrough

Multi-provider LLM support is added via a factory function that selects between Chutes API and OpenRouter clients based on configuration. The agent initialization flow now retrieves LLM clients dynamically. Configuration defaults updated to use Chutes provider with Kimi-K2.5-TEE model, including thinking mode and cost tracking. New dependency on openai>=1.0.0.

Changes

  • Multi-provider LLM factory and client implementation (agent.py, src/llm/client.py): Introduces the get_llm_client factory function for dynamic provider selection. Adds a new ChutesClient class with thinking mode, cost tracking, and Chutes API integration. Extends LLMResponse with thinking and cost fields (see the sketch after this list). Updates the LiteLLMClient constructor and chat signature. The agent now uses the factory instead of direct client instantiation, with configuration-driven provider selection.
  • Configuration defaults (src/config/defaults.py): Replaces the previous Codex-simulated config with Chutes/OpenRouter-friendly defaults. Sets provider to "chutes" and model to "moonshotai/Kimi-K2.5-TEE"; introduces enable_thinking, cost_limit, cache_extended_retention, and cache_key fields. Updates token limits (model_context_limit: 200000 → 256000) and adds cache/compaction parameters.
  • Dependency management (pyproject.toml, requirements.txt): Adds openai>=1.0.0 as a runtime dependency to support the Chutes API integration.
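
A minimal sketch of the extended response shape mentioned in the first item above; the actual LLMResponse in src/llm/client.py almost certainly carries more fields (usage, model, raw response), so treat this as an illustration only:

# Minimal sketch of the extended LLMResponse described above (not the actual class).
from dataclasses import dataclass


@dataclass
class LLMResponse:
    text: str            # final answer, with <think> blocks stripped
    thinking: str = ""   # reasoning parsed from <think>...</think> or reasoning_content
    cost: float = 0.0    # estimated per-call cost in USD, for cost-limit enforcement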

Sequence Diagram

sequenceDiagram
    participant Agent
    participant Factory as get_llm_client
    participant Config
    participant ChutesClient
    participant ChutesAPI

    Agent->>Config: Read provider config
    Config-->>Agent: provider="chutes", model="Kimi-K2.5-TEE"
    Agent->>Factory: Call get_llm_client(provider, model, ...)
    Factory->>Factory: Check provider == "chutes"
    Factory->>ChutesClient: Initialize with API token & model
    ChutesClient-->>Factory: Instance created
    Factory-->>Agent: Return ChutesClient
    Agent->>ChutesClient: chat(messages, thinking_mode=True)
    ChutesClient->>ChutesAPI: POST /chat/completions
    ChutesAPI-->>ChutesClient: Response with thinking + text
    ChutesClient->>ChutesClient: Parse thinking, calculate cost
    ChutesClient-->>Agent: LLMResponse(text, thinking, cost)

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 A factory springs up, providers now dance—
Chutes and OpenRouter get their chance,
Thinking modes enabled, costs tracked with care,
Kimi-K2.5 whispers wisdom in the air! 🎩✨

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 warning

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 76.92%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title accurately captures the main change (integrating the Chutes API with the Kimi K2.5-TEE model), which is the primary objective reflected across all modified files.


coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/llm/client.py`:
- Line 381: The two clients use different hard-coded defaults for cost_limit
(LiteLLMClient sets 10.0 while ChutesClient sets 100.0); make them consistent by
centralizing the default: add a shared constant (e.g., DEFAULT_LLM_COST_LIMIT)
or read the same env var fallback in both classes, and update LiteLLMClient and
ChutesClient to use that constant/env-based default when initializing
self.cost_limit so both clients default to the same value and honor
LLM_COST_LIMIT uniformly.
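
For illustration, a heavily simplified sketch of the centralization described above, using the comment's own DEFAULT_LLM_COST_LIMIT name. The 10.0 default below is a placeholder, since the point of the comment is that the project still has to pick a single value:

# Simplified sketch of a shared, env-driven cost-limit default (placeholder value).
import os

DEFAULT_LLM_COST_LIMIT = float(os.environ.get("LLM_COST_LIMIT", "10.0"))


class LiteLLMClient:
    def __init__(self, model: str, cost_limit: float | None = None):
        self.model = model
        # Both clients fall back to the same constant, so LLM_COST_LIMIT is honored uniformly.
        self.cost_limit = cost_limit if cost_limit is not None else DEFAULT_LLM_COST_LIMIT


class ChutesClient:
    def __init__(self, model: str, cost_limit: float | None = None):
        self.model = model
        self.cost_limit = cost_limit if cost_limit is not None else DEFAULT_LLM_COST_LIMIT
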
🧹 Nitpick comments (7)
src/llm/client.py (5)

23-23: Unused import sys.

The sys module is imported but not used anywhere in this file.

🧹 Proposed fix
-import sys

194-204: Regex only captures the first <think> block.

If the model returns multiple <think>...</think> blocks, this regex with re.search will only capture the first one, and re.sub will remove all of them, so the thinking content from subsequent blocks is lost.

♻️ Proposed fix to capture all thinking blocks
     def _parse_thinking_content(self, text: str) -> Tuple[str, str]:
         """Parse thinking content from response.
         
         Kimi K2.5 can return thinking content in:
         1. <think>...</think> tags (for some deployments)
         2. reasoning_content field (official API)
         
         Returns (thinking_content, final_response).
         """
         if not text:
             return "", ""
         
         # Check for <think>...</think> pattern
         think_pattern = r"<think>(.*?)</think>"
-        match = re.search(think_pattern, text, re.DOTALL)
+        matches = re.findall(think_pattern, text, re.DOTALL)
         
-        if match:
-            thinking = match.group(1).strip()
+        if matches:
+            thinking = "\n\n".join(m.strip() for m in matches)
             # Remove the think block from the response
             response = re.sub(think_pattern, "", text, flags=re.DOTALL).strip()
             return thinking, response
         
         return "", text

260-267: Preserve exception chain for better debugging.

When re-raising exceptions, use raise ... from e to preserve the original traceback, which aids debugging.

♻️ Proposed fix
         try:
             response = self._client.chat.completions.create(**kwargs)
             self._request_count += 1
         except Exception as e:
             error_msg = str(e)
             if "authentication" in error_msg.lower() or "api_key" in error_msg.lower() or "unauthorized" in error_msg.lower():
-                raise LLMError(error_msg, code="authentication_error")
+                raise LLMError(error_msg, code="authentication_error") from e
             elif "rate" in error_msg.lower() or "limit" in error_msg.lower():
-                raise LLMError(error_msg, code="rate_limit")
+                raise LLMError(error_msg, code="rate_limit") from e
             else:
-                raise LLMError(error_msg, code="api_error")
+                raise LLMError(error_msg, code="api_error") from e

296-304: Hardcoded pricing may become stale.

The cost estimation uses hardcoded pricing values. Consider adding a comment noting where to find current pricing, or making these configurable.

📝 Suggested improvement
-        # Estimate cost (Kimi K2.5 pricing via Chutes - approximate)
-        # $0.60 per million input tokens, $2.50 per million output tokens
-        input_cost_per_1k = 0.0006  # $0.60 / 1000
-        output_cost_per_1k = 0.0025  # $2.50 / 1000
+        # Estimate cost (Kimi K2.5 pricing via Chutes - approximate as of Feb 2026)
+        # Check https://chutes.ai/pricing for current rates
+        # $0.60 per million input tokens, $2.50 per million output tokens
+        input_cost_per_1k = 0.0006  # $0.60 / 1M = $0.0006 / 1K
+        output_cost_per_1k = 0.0025  # $2.50 / 1M = $0.0025 / 1K
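
To make the arithmetic concrete, here is a minimal sketch of how per-1K rates like these turn token usage into a per-call estimate; the helper name and signature are assumptions, not the actual ChutesClient code:

# Illustration only: token usage -> approximate USD cost at the rates discussed above.
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  input_cost_per_1k: float = 0.0006,
                  output_cost_per_1k: float = 0.0025) -> float:
    return (
        (prompt_tokens / 1000) * input_cost_per_1k
        + (completion_tokens / 1000) * output_cost_per_1k
    )


# Example: 10,000 prompt tokens and 2,000 completion tokens
# -> 10 * 0.0006 + 2 * 0.0025 = 0.006 + 0.005 = $0.011
print(estimate_cost(10_000, 2_000))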

466-473: Preserve exception chain for better debugging.

Same issue as in ChutesClient - use raise ... from e to preserve the original traceback.

♻️ Proposed fix
         except Exception as e:
             error_msg = str(e)
             if "authentication" in error_msg.lower() or "api_key" in error_msg.lower():
-                raise LLMError(error_msg, code="authentication_error")
+                raise LLMError(error_msg, code="authentication_error") from e
             elif "rate" in error_msg.lower() or "limit" in error_msg.lower():
-                raise LLMError(error_msg, code="rate_limit")
+                raise LLMError(error_msg, code="rate_limit") from e
             else:
-                raise LLMError(error_msg, code="api_error")
+                raise LLMError(error_msg, code="api_error") from e
src/config/defaults.py (1)

82-89: Clarify caching behavior for Chutes API.

The comments mention "Chutes may support server-side caching" and "Keep system prompt stable for best performance," but cache_extended_retention and cache_key appear to be unused in ChutesClient (as noted by static analysis). Consider documenting whether these are placeholders for future use or should be removed.

agent.py (1)

60-60: Unused imports ChutesClient and LiteLLMClient.

These classes are imported but never directly referenced in this file. The factory function get_llm_client returns them internally.

🧹 Proposed fix
-from src.llm.client import get_llm_client, CostLimitExceeded, ChutesClient, LiteLLMClient
+from src.llm.client import get_llm_client, CostLimitExceeded

In src/llm/client.py around line 381 (context for the comment below):

self.model = model
self.temperature = temperature
self.max_tokens = max_tokens
self.cost_limit = cost_limit or float(os.environ.get("LLM_COST_LIMIT", "10.0"))

⚠️ Potential issue | 🟡 Minor

Inconsistent default cost limit between clients.

LiteLLMClient defaults to $10.0 while ChutesClient defaults to $100.0. This inconsistency could surprise users when switching providers.

🔧 Proposed fix for consistency
-        self.cost_limit = cost_limit or float(os.environ.get("LLM_COST_LIMIT", "10.0"))
+        self.cost_limit = cost_limit or float(os.environ.get("LLM_COST_LIMIT", "100.0"))

echobt added a commit that referenced this pull request Feb 3, 2026
This umbrella commit combines changes from all three feature PRs:
- PR #1: Chutes API integration with Kimi K2.5-TEE model
- PR #2: Comprehensive documentation with Mermaid diagrams
- PR #3: Remove OpenRouter support, replace litellm with Chutes API

Conflicts resolved by taking the latest implementation from PR #3,
which provides a cleaner httpx-based client without litellm dependency.
echobt merged commit c25a4a5 into main on Feb 3, 2026
1 check passed