Repro
Running gitcrawl refresh openclaw/openclaw against the live repo:
[embed] embedding 1-64 of 6917
[embed] embedding 65-128 of 6917
openai embeddings failed with status 400: Invalid 'input[29]': maximum input length is 8192 tokens.
Offender: openclaw/openclaw#27137 — body is 38,697 chars / 10,154 cl100k_base tokens, over the 8,192 cap. The whole batch of 64 fails because of one oversized input, and the run aborts.
Cause
embeddingTextForBasis (internal/store/embedding_tasks.go:100) builds title + "\n\n" + body with no length cap. There is no token counting anywhere in the repo. client.Embed (internal/openai/client.go:67) sends the batch as-is.
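A minimal sketch of the behavior described above, assuming a simplified signature (the real function in internal/store/embedding_tasks.go takes whatever basis struct the repo defines): the text is plain concatenation with no length or token cap, so an oversized body flows straight into the embeddings request.

```go
package main

import "fmt"

// embeddingTextForBasis, as described in the report: title + "\n\n" + body,
// with no cap of any kind. Simplified sketch, not the repo's exact code.
func embeddingTextForBasis(title, body string) string {
	return title + "\n\n" + body
}

func main() {
	// A 38,697-char body (the size of openclaw/openclaw#27137) passes through untouched.
	huge := make([]byte, 38697)
	for i := range huge {
		huge[i] = 'x'
	}
	text := embeddingTextForBasis("Some issue title", string(huge))
	fmt.Println(len(text)) // 38715: title (16) + separator (2) + body (38697)
}
```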
Fix
Truncate per-input before send. Plan: cap at 24,000 runes (~6.3k tokens at the observed 3.81 chars/token) inside embeddingTextForBasis, with a defensive cap in the OpenAI client. Include the cap in embeddingContentHash so changes force re-embed. Log when truncation happens.
Chunking (split + average vectors, or multi-vector schema) was considered and rejected: too much surface area for a rare case, and the trailing tail of a paste-dump issue is rarely the identifying signal anyway.
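Folding the cap into embeddingContentHash, as the plan calls for, might look like the sketch below. The hash construction (SHA-256, NUL separator) is an assumption for illustration; only the idea of mixing the cap value into the hash comes from the plan.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
)

// embeddingContentHash mixes the truncation cap into the content hash,
// so changing the cap changes the hash and forces a re-embed of stored
// rows. Hypothetical sketch; the repo's actual hash inputs may differ.
func embeddingContentHash(text string, capRunes int) string {
	h := sha256.New()
	h.Write([]byte(strconv.Itoa(capRunes)))
	h.Write([]byte{0}) // separator so cap and text bytes cannot collide
	h.Write([]byte(text))
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := embeddingContentHash("title\n\nbody", 24000)
	b := embeddingContentHash("title\n\nbody", 16000)
	fmt.Println(a != b) // true: a different cap invalidates cached embeddings
}
```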