Conversation

@google-labs-jules

Replaced the 0.75 tokens-per-character heuristic in ExtractAction with accurate token counting using js-tiktoken (cl100k_base).
This lets the context window be used fully without risking overflow or overly aggressive truncation.
Added src/utils/token-counter.ts with countTokens and truncateToTokenLimit functions.
Added scripts/test-token-counter.ts for verification.


PR created automatically by Jules for task 14303993714170113002 started by @shrisukhani

Replaced the conservative character-based token estimation in ExtractAction
with precise token counting using `js-tiktoken`. Added `truncateToTokenLimit`
utility to safely truncate text while preserving context limits. Added
verification script `scripts/test-token-counter.ts`.
@google-labs-jules
Author

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!


For security, I will only act on instructions from the user who triggered this task.

New to Jules? Learn more at jules.google/docs.

@shrisukhani
Contributor

@claude review this pr and find any bugs / issues

@claude

claude bot commented Dec 15, 2025

Claude finished @shrisukhani's task in 1m 7s


PR Review Complete

Tasks

  • Read repository guidelines (CLAUDE.md - not found)
  • Read changed files
  • Review token counting implementation
  • Check for bugs and issues
  • Post review feedback

Review Findings

I've reviewed the PR and found one critical bug and several important considerations:

🐛 Critical Bug: Token Limit Overflow

Location: src/utils/token-counter.ts:35-38

The truncateToTokenLimit function has a critical bug where it can return text that exceeds the token limit when the limit is smaller than the truncation message itself.

const availableTokens = Math.max(0, tokenLimit - messageTokens);
const truncatedTokens = tokens.slice(0, availableTokens);
return enc.decode(truncatedTokens) + truncationMessage;

Problem: If tokenLimit < messageTokens, the function returns only the truncation message, which itself exceeds the limit. This defeats the entire purpose of having a token limit.

Example:

  • Token limit: 5
  • Default message "\n[Content truncated due to length]": 8 tokens
  • Result: Returns 8 tokens (exceeds limit by 60%)

Impact: This could cause context window overflow in the LLM, exactly what this PR aims to prevent.

Fix: The function should handle this edge case by either:

  1. Throwing an error when tokenLimit < messageTokens
  2. Truncating the message itself to fit within the limit
  3. Returning an empty string or simplified message

The test file (scripts/test-token-counter.ts:55-69) even acknowledges this issue but dismisses it as "acceptable behavior"; it is not acceptable for a production token-limiting function.
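
A minimal, dependency-free sketch of fix option 2 (capping the marker itself) is shown below. The encoder is abstracted behind an interface so the edge case can be exercised without js-tiktoken; this is an illustration, not the PR's actual code.

```typescript
// Minimal interface mirroring the subset of js-tiktoken used here.
interface Encoder {
  encode(text: string): number[];
  decode(tokens: number[]): string;
}

// Toy one-token-per-character encoder, for demonstration only; the real
// implementation would wrap getEncoding("cl100k_base").
const charEncoder: Encoder = {
  encode: (text) => Array.from(text).map((c) => c.codePointAt(0) as number),
  decode: (tokens) => tokens.map((t) => String.fromCodePoint(t)).join(""),
};

function truncateToTokenLimit(
  enc: Encoder,
  text: string,
  tokenLimit: number,
  truncationMessage = "\n[Content truncated due to length]"
): string {
  if (tokenLimit <= 0) return ""; // nothing can fit
  const tokens = enc.encode(text);
  if (tokens.length <= tokenLimit) return text; // already within the limit
  const messageTokens = enc.encode(truncationMessage);
  if (messageTokens.length >= tokenLimit) {
    // The marker alone would overflow: cap the marker itself.
    return enc.decode(messageTokens.slice(0, tokenLimit));
  }
  const available = tokenLimit - messageTokens.length;
  return enc.decode(tokens.slice(0, available)) + truncationMessage;
}
```

With charEncoder, `truncateToTokenLimit(charEncoder, "abcdef", 3, "XY")` yields "aXY" (three tokens), and a limit of 1 yields just "X"; the cap holds in every branch.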


⚠️ Other Issues

1. Memory Leak Potential

Location: src/utils/token-counter.ts:5

The encoding object is created at module load time but never freed:

const enc = getEncoding("cl100k_base");

According to js-tiktoken documentation, encodings should be freed when no longer needed using enc.free(). For a long-running process, this could lead to memory accumulation.

Recommendation: Consider implementing a cleanup mechanism or document that this is intentional for performance.

2. Missing Error Handling

Location: src/utils/token-counter.ts:10-11

The countTokens function has no error handling. If text is not a string or encoding fails, it will throw an uncaught exception.

Recommendation: Add basic input validation:

export function countTokens(text: string): number {
  if (typeof text !== 'string') {
    throw new TypeError('countTokens expects a string');
  }
  return enc.encode(text).length;
}

3. Test Coverage Gap

Location: scripts/test-token-counter.ts

The test file identifies the overflow bug (lines 55-69) but doesn't assert that it's wrong; it just logs and continues. The test should either:

  • Assert that an error is thrown for invalid limits
  • Assert that the result respects the limit in all cases
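
One way to turn the logged observation into a hard assertion is sketched below; `assertRespectsLimit` is a hypothetical helper, and the `countTokens`/`truncate` parameters stand in for the functions from src/utils/token-counter.ts.

```typescript
// Hypothetical helper: fail loudly whenever the truncated result still
// exceeds the requested token limit, instead of just logging it.
function assertRespectsLimit(
  countTokens: (text: string) => number,
  truncate: (text: string, limit: number) => string,
  text: string,
  limit: number
): void {
  const result = truncate(text, limit);
  const actual = countTokens(result);
  if (actual > limit) {
    throw new Error(
      `truncateToTokenLimit returned ${actual} tokens for limit ${limit}`
    );
  }
}
```

Running this across a range of limits, including limits smaller than the truncation message, would have caught the overflow bug directly.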

4. Hardcoded Encoding Choice

Location: src/utils/token-counter.ts:3-5

The comment says this is "a good approximation for other models" but cl100k_base is specifically for OpenAI GPT-3.5/4. For Claude, Gemini, or other models, this could be inaccurate.

Consideration: The PR description says this is for "maximizing the use of the context window without risking overflow" - but using the wrong tokenizer for non-OpenAI models could still cause overflow.


✅ What's Good

  1. Clear improvement over heuristic: Replacing 0.75 * chars with actual token counting is definitely better
  2. Well-documented code: Comments explain the tokenization approach
  3. Clean integration: The changes to extract.ts are minimal and focused
  4. Test file included: Shows thought about verification (even if incomplete)

Recommendations

Must Fix:

  • Fix the token limit overflow bug in truncateToTokenLimit

Should Consider:

  • Add proper error handling for edge cases
  • Fix or improve test assertions
  • Document the memory management strategy for the encoder
  • Consider whether cl100k_base is appropriate for all model types used in this project

@chatgpt-codex-connector

Summary

  • Updated truncateToTokenLimit to enforce hard token limits: it now handles non-positive limits and truncation-message overflow, and preserves text that already fits untruncated. src/utils/token-counter.ts (L18-L45)
  • Adjusted the token-counter verification script to assert capped behavior when the truncation message exceeds the limit, keeping existing checks intact. scripts/test-token-counter.ts (L45-L66)

Testing

  • ⚠️ yarn ts-node scripts/test-token-counter.ts (fails in this workspace because the package isn’t in the Yarn lockfile).
  • npx ts-node -r tsconfig-paths/register scripts/test-token-counter.ts

