
Embedding truncation and chunking should have clearer responsibilities across memory, file, and directory vectorization #531

@dddgogogo


Summary

OpenViking currently handles oversized vectorization inputs in several different places, but the division of responsibility between them is unclear:

  • some paths benefit from chunking
  • some paths rely on truncation
  • some paths do both indirectly

This makes it hard to reason about whether a given code path implements a deliberate semantic strategy or is just an emergency guard.

Why I am opening this

From an operator / maintainer perspective, there are two important design questions:

  • when should OpenViking chunk content before indexing?
  • when should it merely truncate?

These two strategies solve different problems and should not be treated as interchangeable.
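To make the distinction concrete, here is a minimal sketch of the two strategies side by side. The names (`chunk_text`, `truncate_text`) and the size limits are illustrative assumptions, not OpenViking's actual API:

```python
MAX_EMBED_CHARS = 8192  # assumed embedding input limit, for illustration only

def chunk_text(text: str, size: int = 1024, overlap: int = 128) -> list[str]:
    """Semantic strategy: split long content into overlapping windows
    so every part of the input stays retrievable."""
    if len(text) <= size:
        return [text]
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def truncate_text(text: str, limit: int = MAX_EMBED_CHARS) -> str:
    """Safety guard: hard-cap the input so the embedding call cannot fail,
    at the cost of silently dropping the tail."""
    return text[:limit]
```

The key asymmetry: chunking preserves all content across multiple vectors, while truncation keeps exactly one vector and discards everything past the limit.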

Observed distinction

In practice, we found:

  • chunking is the right main strategy for long memory content, because it preserves semantic coverage and improves retrieval quality
  • truncation is still useful as a low-level safety guard for generic vectorization inputs, such as unexpectedly large file content or directory metadata

The problem is that the system boundary between those responsibilities is not obvious.

Actual concern

Without a clear design boundary, it is easy to end up with one of these failure modes:

  1. truncation silently hides semantic loss that should have been solved by chunking
  2. chunking logic only exists for one content type while other large inputs still hit generic guards
  3. operators cannot tell whether a retrieval problem came from chunking policy or silent truncation

Expected behavior

I think OpenViking should make the strategy explicit:

  • chunking = semantic indexing strategy for long structured content
  • truncation = final safety guard for oversized raw inputs that should never crash the embedding path
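One way to make that split explicit is a per-context-type policy table, where chunking is the semantic strategy and truncation stays on as the final guard in every path. Everything below (`POLICY`, `prepare_for_embedding`, the limits) is a hypothetical sketch, not OpenViking's current design:

```python
MAX_EMBED_CHARS = 8192  # assumed embedding input limit

# Hypothetical policy: which context types chunk vs merely truncate.
POLICY = {
    "memory": "chunk",            # semantically important long content
    "file": "truncate",           # raw input: guard only
    "directory_meta": "truncate",
}

def prepare_for_embedding(context_type: str, text: str,
                          chunk_size: int = 1024) -> list[str]:
    """Return the list of embedding inputs derived from one source record."""
    strategy = POLICY.get(context_type, "truncate")
    if strategy == "chunk":
        pieces = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    else:
        pieces = [text]
    # Truncation remains the last-resort guard regardless of strategy,
    # so no single input can crash the embedding path.
    return [p[:MAX_EMBED_CHARS] for p in pieces]
```

With this shape, the answer to "which paths do both, and in what order" is fixed by construction: chunk first where the policy says so, then guard every piece.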

Suggested fixes

  1. Document a clear policy

    • which context types should chunk
    • which context types may truncate
    • which paths should do both, and in what order
  2. Make truncation observable

    • log when truncation happens
    • include original size and truncated size
    • ideally expose metrics for truncated vectorization events
  3. Prefer chunking for semantically important long content

    • memory content
    • possibly large text resources
    • possibly other long context records where retrieval precision matters
  4. Keep truncation as a last-resort guard

    • especially for file content, directory meta, or generic embedding API calls
    • but make sure it is obvious that semantic coverage may be reduced
  5. Clarify retrieval implications

    • how chunk results map back to parent context
    • how reranking / dedup should handle many chunks from one source
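For fix 5, one plausible shape for mapping chunk hits back to their parent record and deduplicating before reranking could look like the following; the hit fields (`parent_id`, `chunk_idx`, `score`) are assumptions for illustration:

```python
def collapse_chunk_hits(hits: list[dict], top_k: int = 5) -> list[dict]:
    """Keep only the best-scoring chunk per parent record, then rank parents.

    Each hit is assumed to carry {"parent_id", "chunk_idx", "score"} so that
    many chunks from one source collapse to a single retrieval result.
    """
    best: dict = {}
    for hit in hits:
        pid = hit["parent_id"]
        if pid not in best or hit["score"] > best[pid]["score"]:
            best[pid] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)[:top_k]
```

Keeping the winning `chunk_idx` in the result also preserves enough information to highlight which part of the parent actually matched.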

Why this matters

This is partly a correctness issue and partly an operator ergonomics issue. If chunking and truncation are not clearly separated, systems may look stable while actually losing recall quality or silently indexing incomplete content.

Concrete request

Please consider either:

  • formalizing the current intended design in docs and config,
  • or refactoring the indexing pipeline so chunking vs truncation responsibilities are more explicit and consistent.
