Bundle: Added "count-tokens" procedure to inform about the size#59
Conversation
WalkthroughThe changes introduce a new utility function to count tokens in output files using both GPT-4o and Anthropic Claude tokenizers. This function is integrated into the bundle process, and relevant documentation and dependencies are updated to reflect the new capability and model token limitations. Changes
Sequence Diagram(s)sequenceDiagram
participant Builder as LllmsTxtBuilder
participant Util as util.count_tokens
participant GPT as tiktoken
participant Anthropic as Anthropic API
Builder->>Builder: Write llms-full.txt
Builder->>Util: count_tokens(llms-full.txt)
Util->>GPT: Encode file content
GPT-->>Util: Return GPT token count
alt If ANTHROPIC_API_KEY set
Util->>Anthropic: count_tokens(system prompt, user message)
Anthropic-->>Util: Return Anthropic token count
end
Util->>Util: Log token counts and warnings if necessary
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Possibly related PRs
Suggested reviewers
Poem
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (4)
🚧 Files skipped from review as they are similar to previous changes (4)
✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
CHANGES.md(1 hunks)pyproject.toml(2 hunks)src/cratedb_about/bundle/llmstxt.py(2 hunks)src/cratedb_about/bundle/util.py(1 hunks)
🧰 Additional context used
🧠 Learnings (2)
CHANGES.md (3)
Learnt from: amotl
PR: crate/about#0
File: :0-0
Timestamp: 2025-04-16T14:20:35.508Z
Learning: When creating content for an llms.txt file (following the llmstxt.org specification), consistent and straightforward language takes precedence over stylistic variation since the primary audience is language models rather than human readers.
Learnt from: amotl
PR: crate/about#0
File: :0-0
Timestamp: 2025-04-16T14:16:33.171Z
Learning: When creating content for an llms.txt file (following the llmstxt.org specification), consistent and straightforward language takes precedence over stylistic variation since the primary audience is language models rather than human readers.
Learnt from: amotl
PR: crate/about#0
File: :0-0
Timestamp: 2025-04-16T14:20:35.508Z
Learning: When creating content for an llms.txt file (following the llmstxt.org specification), consistent and straightforward language takes precedence over stylistic variation since the primary audience is language models rather than human readers.
src/cratedb_about/bundle/llmstxt.py (4)
Learnt from: amotl
PR: crate/about#0
File: :0-0
Timestamp: 2025-04-16T14:20:35.508Z
Learning: When creating content for an llms.txt file (following the llmstxt.org specification), consistent and straightforward language takes precedence over stylistic variation since the primary audience is language models rather than human readers.
Learnt from: amotl
PR: crate/about#0
File: :0-0
Timestamp: 2025-04-16T14:20:35.508Z
Learning: When creating content for an llms.txt file (following the llmstxt.org specification), consistent and straightforward language takes precedence over stylistic variation since the primary audience is language models rather than human readers.
Learnt from: amotl
PR: crate/about#0
File: :0-0
Timestamp: 2025-04-16T14:16:33.171Z
Learning: When creating content for an llms.txt file (following the llmstxt.org specification), consistent and straightforward language takes precedence over stylistic variation since the primary audience is language models rather than human readers.
Learnt from: amotl
PR: #32
File: src/cratedb_about/outline/cratedb-outline.yaml:321-329
Timestamp: 2025-05-15T21:25:54.870Z
Learning: In the CrateDB outline YAML, content organization prioritizes thematic grouping (keeping related topics together) over content type grouping (separating tutorials from reference docs), as demonstrated by placing the multi-tenancy tutorial alongside user management and privileges documentation in the API section.
🧬 Code Graph Analysis (1)
src/cratedb_about/bundle/llmstxt.py (1)
src/cratedb_about/bundle/util.py (1)
count_tokens(11-43)
🔇 Additional comments (7)
src/cratedb_about/bundle/util.py (2)
6-6: LGTM - Clean import organization.The tiktoken import is properly placed with other third-party imports.
24-24: Model Name Verification CompleteThe model identifier
claude-sonnet-4-20250514is confirmed as current and valid per Anthropic’s July 2025 API documentation. No changes are needed.CHANGES.md (1)
5-6: LGTM - Clear and informative changelog entry.The changelog entry clearly documents the new functionality and provides important context about the token limitations for Sonnet and Opus models.
src/cratedb_about/bundle/llmstxt.py (2)
11-11: LGTM - Proper import placement.The import is correctly placed with other project-specific imports and follows the established import organization.
54-54: LGTM - Well-positioned token counting integration.The token counting call is appropriately placed after the llms-full.txt file is generated, providing immediate feedback about the file size constraints.
pyproject.toml (2)
82-82: LGTM - Appropriate tiktoken dependency addition.The tiktoken dependency is correctly added to the main dependencies since it's used unconditionally for GPT token counting, with a reasonable version constraint.
95-95: LGTM - Logical placement of anthropic dependency.The anthropic dependency is appropriately placed in the optional LLM dependencies group since it's only used when the API key is available.
a7b8426 to
6f1b28a
Compare
6f1b28a to
3a01872
Compare
3a01872 to
6cd40db
Compare
173220a to
16bc361
Compare
6cd40db to
be7459f
Compare
be7459f to
18dc78e
Compare
... of the outcome. Sonnet and Opus are limited to 200_000 input tokens.
18dc78e to
73004d5
Compare
... of the outcome. Sonnet and Opus are limited to 200_000 input tokens.