Summary
OpenViking currently has multiple ways to cope with oversized vectorization inputs, but the division of responsibilities between them is unclear:
- some paths benefit from chunking
- some paths rely on truncation
- some paths do both indirectly
This makes it hard to reason about whether a given code path implements a deliberate semantic strategy or is just an emergency guard.
Why I am opening this
From an operator / maintainer perspective, there is an important design question:
- when should OpenViking chunk content before indexing?
- when should it merely truncate?
These two strategies solve different problems and should not be treated as interchangeable.
Observed distinction
In practice, we found:
- chunking is the right primary strategy for long memory content, because it preserves semantic coverage and improves retrieval quality
- truncation is still useful as a low-level safety guard for generic vectorization inputs, such as unexpectedly large file content or directory metadata
The problem is that the system boundary between those responsibilities is not obvious.
Actual concern
Without a clear design boundary, it is easy to end up with one of these failure modes:
- truncation silently hides semantic loss that should have been solved by chunking
- chunking logic only exists for one content type while other large inputs still hit generic guards
- operators cannot tell whether a retrieval problem came from chunking policy or silent truncation
Expected behavior
I think OpenViking should make the strategy explicit:
- chunking = semantic indexing strategy for long structured content
- truncation = final safety guard for oversized raw inputs that should never crash the embedding path
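To make the distinction concrete, here is a minimal sketch of what an explicit strategy boundary could look like. All names (`ContentKind`, `chunk_text`, `prepare_for_embedding`, `MAX_EMBED_CHARS`) are illustrative assumptions, not actual OpenViking APIs:

```python
from enum import Enum, auto

MAX_EMBED_CHARS = 8000  # assumed embedding input limit, not a real OpenViking constant

class ContentKind(Enum):
    MEMORY = auto()          # long, semantically important -> chunk
    FILE_CONTENT = auto()    # generic raw input -> truncate as a guard
    DIRECTORY_META = auto()  # small metadata -> truncate as a guard

def chunk_text(text: str, size: int) -> list[str]:
    """Naive fixed-size chunking; a real implementation would respect semantics."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def prepare_for_embedding(kind: ContentKind, text: str) -> list[str]:
    if kind is ContentKind.MEMORY:
        # Semantic indexing strategy: preserve full coverage via chunking.
        return chunk_text(text, MAX_EMBED_CHARS)
    # Final safety guard: never crash the embedding path,
    # at the explicit cost of semantic loss.
    return [text[:MAX_EMBED_CHARS]]
```

The point of the sketch is only that the decision is made in one visible place, keyed on content type, rather than scattered across unrelated guards.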
Suggested fixes
- Document a clear policy:
  - which context types should chunk
  - which context types may truncate
  - which paths should do both, and in what order
- Make truncation observable:
  - log when truncation happens
  - include the original size and the truncated size
  - ideally, expose metrics for truncated vectorization events
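As a sketch of the observability point, a guard like the following would make truncation loud instead of silent. The logger name, function name, and the metrics comment are assumptions for illustration, not existing OpenViking code:

```python
import logging

# Hypothetical logger name; any vectorization-path logger would do.
logger = logging.getLogger("openviking.vectorize")

def truncate_with_telemetry(text: str, limit: int, source: str) -> str:
    """Last-resort size guard that records the semantic loss it causes."""
    if len(text) <= limit:
        return text
    logger.warning(
        "truncating vectorization input from %s: original=%d chars, kept=%d chars",
        source, len(text), limit,
    )
    # If the deployment exposes metrics, a counter such as
    # "truncated_vectorization_events" could be incremented here as well.
    return text[:limit]
```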
- Prefer chunking for semantically important long content:
  - memory content
  - possibly large text resources
  - possibly other long context records where retrieval precision matters
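For such content, overlap-aware chunking is a common way to keep retrieval from losing sentences that straddle a chunk boundary. The sizes and the helper name below are illustrative assumptions, not OpenViking's actual implementation:

```python
def chunk_with_overlap(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into fixed-size windows that overlap by `overlap` characters,
    so no span of the original falls entirely between two chunks."""
    if size <= overlap:
        raise ValueError("chunk size must exceed overlap")
    step = size - overlap
    # max(..., 1) ensures even an empty or short input yields one chunk.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```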
- Keep truncation as a last-resort guard:
  - especially for file content, directory metadata, and generic embedding API calls
  - but make it obvious that semantic coverage may be reduced
- Clarify retrieval implications:
  - how chunk results map back to their parent context
  - how reranking / dedup should handle many chunks from one source
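One simple policy for the last point is to collapse per-chunk hits into per-source results, so a single long document cannot flood the result list. The `ChunkHit` shape and the best-score rule below are assumptions for illustration, not OpenViking APIs:

```python
from dataclasses import dataclass

@dataclass
class ChunkHit:
    parent_id: str   # id of the source record the chunk came from
    chunk_index: int
    score: float

def collapse_to_parents(hits: list[ChunkHit]) -> list[tuple[str, float]]:
    """Dedup many chunk hits from one source, keeping the best score per parent."""
    best: dict[str, float] = {}
    for h in hits:
        if h.parent_id not in best or h.score > best[h.parent_id]:
            best[h.parent_id] = h.score
    # Rank parents by their best chunk score, highest first.
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)
```

Other aggregation rules (mean score, reciprocal rank fusion) would also fit; the design question is only that the chunk-to-parent mapping is explicit.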
Why this matters
This is partly a correctness issue and partly an operator ergonomics issue. If chunking and truncation are not clearly separated, systems may look stable while actually losing recall quality or silently indexing incomplete content.
Concrete request
Please consider either:
- formalizing the current intended design in docs and config,
- or refactoring the indexing pipeline so chunking vs truncation responsibilities are more explicit and consistent.