
embed fails on threads with bodies over 8192 tokens #2

@RomneyDa


Repro

Running gitcrawl refresh openclaw/openclaw against the live repo:

[embed] embedding 1-64 of 6917
[embed] embedding 65-128 of 6917
openai embeddings failed with status 400: Invalid 'input[29]': maximum input length is 8192 tokens.

Offender: openclaw/openclaw#27137 — body is 38,697 chars / 10,154 cl100k_base tokens, over the 8,192 cap. The whole batch of 64 fails because of one oversized input, and the run aborts.

Cause

embeddingTextForBasis (internal/store/embedding_tasks.go:100) builds title + "\n\n" + body with no length cap. There is no token counting anywhere in the repo. client.Embed (internal/openai/client.go:67) sends the batch as-is.

Fix

Truncate per-input before send. Plan: cap at 24,000 runes (~6.3k tokens at the observed 3.81 chars/token) inside embeddingTextForBasis, with a defensive cap in the OpenAI client. Include the cap in embeddingContentHash so changes force re-embed. Log when truncation happens.

Chunking (splitting and averaging vectors, or a multi-vector schema) was considered and rejected: it adds too much surface area for a rare case, and the trailing tail of a paste-dump issue is rarely the identifying signal anyway.
