
Add xAI Grok Tokenizer Integration #10

@ivelin-web

Description

  • Use @xenova/transformers to load the Xenova/grok-1-tokenizer from Hugging Face.

    • For Grok models:

      • Grok-1: SentencePiece-based tokenizer with a 131,072-token vocabulary.
      • Grok-2 / Grok-2-Vision: confirm when available; assume the same tokenizer, or adjust if xAI releases a separate package.
  • Implement dynamic import:

    ```js
    // Lazy-load the tokenizer so @xenova/transformers is only pulled in when needed.
    const { AutoTokenizer } = await import('@xenova/transformers');
    const tok = await AutoTokenizer.from_pretrained('Xenova/grok-1-tokenizer');
    // encode() returns an array of token IDs; its length is the token count.
    return (txt) => tok.encode(txt).length;
    ```
  • Determine each Grok model’s context limit and save it in config (note: Grok-1’s open release uses an 8,192-token context window; 131,072 is its vocabulary size). See the config sketch after this list.
  • In tokenizers/index.ts, when tenant === 'grok', return (text) => tok.encode(text).length (see the integration sketch below).
  • Add a fallback approximation (text.length / 4) if loading fails, and prefix the meter with .
  • Create a small unit test: load a 1,000-token Grok prompt and confirm the count falls within ±5% of xAI’s published numbers (see the test sketch below).
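A minimal sketch of the context-limit config. The file name, constant name, and the commented-out Grok-2 entries are placeholders, not from this issue; values should be verified against xAI’s published specs:

```ts
// Hypothetical config module, e.g. src/config/grok.ts (name is an assumption).
export const GROK_CONTEXT_LIMITS: Record<string, number> = {
  'grok-1': 8_192,          // open-weights Grok-1 release
  // 'grok-2': TODO,        // confirm once xAI publishes the limit
  // 'grok-2-vision': TODO, // confirm once xAI publishes the limit
};
```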
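A sketch of the tokenizers/index.ts integration with the fallback path. The getTokenCounter entry point, the TokenCounter type, and the module-level cache are assumptions for illustration; only the @xenova/transformers calls come from the snippet above:

```ts
type TokenCounter = (text: string) => number;

let grokCounter: TokenCounter | null = null;

export async function getTokenCounter(tenant: string): Promise<TokenCounter> {
  if (tenant === 'grok') {
    if (grokCounter) return grokCounter; // reuse the already-loaded tokenizer

    try {
      // Dynamic import keeps @xenova/transformers out of the main bundle.
      const { AutoTokenizer } = await import('@xenova/transformers');
      const tok = await AutoTokenizer.from_pretrained('Xenova/grok-1-tokenizer');
      grokCounter = (text) => tok.encode(text).length;
    } catch {
      // Fallback approximation when the tokenizer cannot be loaded;
      // callers can detect this case to prefix the meter accordingly.
      grokCounter = (text) => Math.ceil(text.length / 4);
    }
    return grokCounter;
  }

  throw new Error(`Unknown tenant: ${tenant}`);
}
```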
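A sketch of the accuracy test, assuming Vitest as the test runner; the fixture path and the expected count are placeholders for a real prompt with a known xAI-published token count:

```ts
import { describe, expect, it } from 'vitest';
import { readFileSync } from 'node:fs';
import { getTokenCounter } from '../src/tokenizers'; // path is an assumption

describe('grok tokenizer', () => {
  it('counts a ~1,000-token prompt within ±5% of the reference figure', async () => {
    const prompt = readFileSync('test/fixtures/grok-1000-token-prompt.txt', 'utf8');
    const expected = 1000; // replace with the count published by xAI for this prompt

    const count = (await getTokenCounter('grok'))(prompt);

    expect(count).toBeGreaterThanOrEqual(expected * 0.95);
    expect(count).toBeLessThanOrEqual(expected * 1.05);
  });
});
```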
