Repro
Running gitcrawl refresh openclaw/openclaw against the live repo:
[embed] embedding 1-64 of 6917
[embed] embedding 65-128 of 6917
openai embeddings failed with status 400: Invalid 'input[29]': maximum input length is 8192 tokens.
Offender: openclaw/openclaw#27137 — body is 38,697 chars / 10,154 cl100k_base tokens, over the 8,192 cap. The whole batch of 64 fails because of one oversized input, and the run aborts.
Cause
embeddingTextForBasis (internal/store/embedding_tasks.go:100) builds title + "\n\n" + body with no length cap. There is no token counting anywhere in the repo. client.Embed (internal/openai/client.go:67) sends the batch as-is.
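A minimal sketch of the behavior described above, assuming a simplified signature (the real function in internal/store/embedding_tasks.go takes whatever basis struct the repo defines): the text is plain concatenation with no length or token cap, so an oversized body flows straight into the embeddings request.

```go
package main

import "fmt"

// embeddingTextForBasis, as described in the report: title + "\n\n" + body,
// with no cap of any kind. Simplified sketch, not the repo's exact code.
func embeddingTextForBasis(title, body string) string {
	return title + "\n\n" + body
}

func main() {
	// A 38,697-char body (the size of openclaw/openclaw#27137) passes through untouched.
	huge := make([]byte, 38697)
	for i := range huge {
		huge[i] = 'x'
	}
	text := embeddingTextForBasis("Some issue title", string(huge))
	fmt.Println(len(text)) // 38715: title (16) + separator (2) + body (38697)
}
```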
Fix
Truncate per-input before send. Plan: cap at 24,000 runes (~6.3k tokens at the observed 3.81 chars/token) inside embeddingTextForBasis, with a defensive cap in the OpenAI client. Include the cap in embeddingContentHash so changes force re-embed. Log when truncation happens.
Chunking (split + average vectors, or multi-vector schema) was considered and rejected: too much surface area for a rare case, and the trailing tail of a paste-dump issue is rarely the identifying signal anyway.
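Folding the cap into embeddingContentHash, as the plan calls for, might look like the sketch below. The hash construction (SHA-256, NUL separator) is an assumption for illustration; only the idea of mixing the cap value into the hash comes from the plan.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
)

// embeddingContentHash mixes the truncation cap into the content hash,
// so changing the cap changes the hash and forces a re-embed of stored
// rows. Hypothetical sketch; the repo's actual hash inputs may differ.
func embeddingContentHash(text string, capRunes int) string {
	h := sha256.New()
	h.Write([]byte(strconv.Itoa(capRunes)))
	h.Write([]byte{0}) // separator so cap and text bytes cannot collide
	h.Write([]byte(text))
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := embeddingContentHash("title\n\nbody", 24000)
	b := embeddingContentHash("title\n\nbody", 16000)
	fmt.Println(a != b) // true: a different cap invalidates cached embeddings
}
```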