
Embedding truncation and chunking should have clearer responsibilities across memory, file, and directory vectorization #531

@dddgogogo


Summary

OpenViking currently handles oversized vectorization inputs in several different places, but the division of responsibility between them is unclear:

  • some paths benefit from chunking
  • some paths rely on truncation
  • some paths do both indirectly

This makes it hard to reason about whether a given code path implements a deliberate semantic strategy or is just an emergency guard.

Why I am opening this

From an operator / maintainer perspective, there are two important design questions:

  • when should OpenViking chunk content before indexing?
  • when should it merely truncate?

These two strategies solve different problems and should not be treated as interchangeable.
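To make the distinction concrete, here is a minimal sketch of the two strategies side by side. The names (`chunk_text`, `truncate_text`) and the size limits are illustrative assumptions, not OpenViking's actual API:

```python
MAX_EMBED_CHARS = 8192  # assumed embedding input limit, for illustration only

def chunk_text(text: str, size: int = 1024, overlap: int = 128) -> list[str]:
    """Semantic strategy: split long content into overlapping windows
    so every part of the input stays retrievable."""
    if len(text) <= size:
        return [text]
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def truncate_text(text: str, limit: int = MAX_EMBED_CHARS) -> str:
    """Safety guard: hard-cap the input so the embedding call cannot fail,
    at the cost of silently dropping the tail."""
    return text[:limit]
```

The key asymmetry: chunking preserves all content across multiple vectors, while truncation keeps exactly one vector and discards everything past the limit.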

Observed distinction

In practice, we found:

  • chunking is the right main strategy for long memory content, because it preserves semantic coverage and improves retrieval quality
  • truncation is still useful as a low-level safety guard for generic vectorization inputs, such as unexpectedly large file content or directory metadata

The problem is that the system boundary between those responsibilities is not obvious.

Actual concern

Without a clear design boundary, it is easy to end up with one of these failure modes:

  1. truncation silently hides semantic loss that should have been solved by chunking
  2. chunking logic only exists for one content type while other large inputs still hit generic guards
  3. operators cannot tell whether a retrieval problem came from chunking policy or silent truncation

Expected behavior

I think OpenViking should make the strategy explicit:

  • chunking = semantic indexing strategy for long structured content
  • truncation = final safety guard for oversized raw inputs that should never crash the embedding path
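One way to make that split explicit is a per-context-type policy table, where chunking is the semantic strategy and truncation stays on as the final guard in every path. Everything below (`POLICY`, `prepare_for_embedding`, the limits) is a hypothetical sketch, not OpenViking's current design:

```python
MAX_EMBED_CHARS = 8192  # assumed embedding input limit

# Hypothetical policy: which context types chunk vs merely truncate.
POLICY = {
    "memory": "chunk",            # semantically important long content
    "file": "truncate",           # raw input: guard only
    "directory_meta": "truncate",
}

def prepare_for_embedding(context_type: str, text: str,
                          chunk_size: int = 1024) -> list[str]:
    """Return the list of embedding inputs derived from one source record."""
    strategy = POLICY.get(context_type, "truncate")
    if strategy == "chunk":
        pieces = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    else:
        pieces = [text]
    # Truncation remains the last-resort guard regardless of strategy,
    # so no single input can crash the embedding path.
    return [p[:MAX_EMBED_CHARS] for p in pieces]
```

With this shape, the answer to "which paths do both, and in what order" is fixed by construction: chunk first where the policy says so, then guard every piece.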

Suggested fixes

  1. Document a clear policy

    • which context types should chunk
    • which context types may truncate
    • which paths should do both, and in what order
  2. Make truncation observable

    • log when truncation happens
    • include original size and truncated size
    • ideally expose metrics for truncated vectorization events
  3. Prefer chunking for semantically important long content

    • memory content
    • possibly large text resources
    • possibly other long context records where retrieval precision matters
  4. Keep truncation as a last-resort guard

    • especially for file content, directory meta, or generic embedding API calls
    • but make sure it is obvious that semantic coverage may be reduced
  5. Clarify retrieval implications

    • how chunk results map back to parent context
    • how reranking / dedup should handle many chunks from one source
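For fix 5, one plausible shape for mapping chunk hits back to their parent record and deduplicating before reranking could look like the following; the hit fields (`parent_id`, `chunk_idx`, `score`) are assumptions for illustration:

```python
def collapse_chunk_hits(hits: list[dict], top_k: int = 5) -> list[dict]:
    """Keep only the best-scoring chunk per parent record, then rank parents.

    Each hit is assumed to carry {"parent_id", "chunk_idx", "score"} so that
    many chunks from one source collapse to a single retrieval result.
    """
    best: dict = {}
    for hit in hits:
        pid = hit["parent_id"]
        if pid not in best or hit["score"] > best[pid]["score"]:
            best[pid] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)[:top_k]
```

Keeping the winning `chunk_idx` in the result also preserves enough information to highlight which part of the parent actually matched.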

Why this matters

This is partly a correctness issue and partly an operator ergonomics issue. If chunking and truncation are not clearly separated, systems may look stable while actually losing recall quality or silently indexing incomplete content.

Concrete request

Please consider either:

  • formalizing the current intended design in docs and config,
  • or refactoring the indexing pipeline so chunking vs truncation responsibilities are more explicit and consistent.
