Oli edited this page Mar 5, 2026 · 1 revision

v0.6 — Adaptive Knowledge Ingestion Pipeline

Anchor issue: #78
Status: ⏳ Not started — runs in parallel with v0.5


A consciousness engine without the ability to assimilate new knowledge efficiently is, to borrow a biological analogy, a very sophisticated brain that cannot eat. The knowledge ingestion pipeline is the system's alimentary canal — unglamorous, perhaps, but rather important if the system is to grow.

The full specification is in Issue #33 and docs/ADAPTIVE_INGESTION_README.md. What follows is the essential summary.


Core Deliverables

  • Adaptive ingestion workers with CPU autotuner — designed for 8-core, 16GB hardware; self-regulating under load
  • Layout-aware chunking at three user-selectable levels:

    | Level    | Chunk Tokens | Model                    | Typical Use          |
    |----------|--------------|--------------------------|----------------------|
    | Fast     | 650–800      | MiniLM-L6-v2 (ONNX/Int8) | Quick loads          |
    | Balanced | 750–900      | MiniLM-L6-v2 or MPNet    | Most documents       |
    | Deep     | 500–700      | all-mpnet-base-v2        | High-recall research |
  • Tightened vector DB contract — ANN and search logic stay inside the database; no client-side duplication
  • Knowledge graph builder from vector kNN neighbours: Document → Chunk → Concept, with CONTAINS, SIMILAR_TO, TAGGED_AS edges
  • Persistent Jobs UI — preflight ETAs, per-stage progress bars, job management; responsive across all viewports
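The chunking levels above map naturally onto a small configuration table. The sketch below is illustrative only — the names `ChunkProfile`, `PROFILES`, and `chunk_tokens` are assumptions, not part of the spec, and the greedy splitter ignores the layout-awareness the real pipeline would add:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChunkProfile:
    min_tokens: int
    max_tokens: int
    model: str

# Illustrative mapping of the three levels from the table above.
PROFILES = {
    "fast":     ChunkProfile(650, 800, "MiniLM-L6-v2"),
    "balanced": ChunkProfile(750, 900, "MiniLM-L6-v2"),
    "deep":     ChunkProfile(500, 700, "all-mpnet-base-v2"),
}

def chunk_tokens(tokens: list[str], profile: ChunkProfile) -> list[list[str]]:
    """Greedy split into chunks of at most max_tokens.

    A real layout-aware chunker would split on section and paragraph
    boundaries first; this sketch only shows the token-budget contract.
    """
    return [tokens[i:i + profile.max_tokens]
            for i in range(0, len(tokens), profile.max_tokens)]
```

Keeping the level definitions in one table makes the Fast/Balanced/Deep choice a pure configuration switch rather than three code paths.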
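The "self-regulating under load" behaviour of the ingestion workers could be as simple as a feedback loop on the load average. A minimal sketch, assuming a 1-minute load average is sampled periodically — `tune_worker_count` and its thresholds are hypothetical, not the actual autotuner:

```python
def tune_worker_count(load_avg: float, current: int,
                      cores: int = 8, max_workers: int = 8,
                      target_load: float = 0.75) -> int:
    """Adjust the worker count from a sampled load average.

    Scales down when load exceeds the target share of available cores,
    scales up when utilisation drops below half the target, and holds
    steady in between. Defaults reflect the 8-core target hardware.
    """
    utilisation = load_avg / cores
    if utilisation > target_load:
        return max(1, current - 1)            # back off under pressure
    if utilisation < target_load * 0.5:
        return min(max_workers, current + 1)  # spare capacity: add a worker
    return current                            # within band: hold steady
```

On POSIX systems the sample could come from `os.getloadavg()`; the dead band between the two thresholds prevents the pool from oscillating on noisy load readings.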
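The knowledge graph builder's edge types can be illustrated with plain triples. This is a sketch under assumed input shapes (chunk-to-document map, per-chunk kNN results from the vector DB, per-chunk concept tags); `build_edges` and the threshold are hypothetical:

```python
def build_edges(chunks: dict[str, str],
                knn: dict[str, list[tuple[str, float]]],
                tags: dict[str, list[str]],
                sim_threshold: float = 0.8) -> list[tuple[str, str, str]]:
    """Emit (src, relation, dst) triples for the Document → Chunk → Concept graph.

    - CONTAINS:   document owns chunk
    - SIMILAR_TO: chunk pairs whose vector-kNN score clears the threshold
                  (emitted once per unordered pair)
    - TAGGED_AS:  chunk links to a concept node
    """
    edges = []
    for chunk_id, doc_id in chunks.items():
        edges.append((doc_id, "CONTAINS", chunk_id))
    for chunk_id, neighbours in knn.items():
        for other_id, score in neighbours:
            if score >= sim_threshold and chunk_id < other_id:
                edges.append((chunk_id, "SIMILAR_TO", other_id))
    for chunk_id, concepts in tags.items():
        for concept in concepts:
            edges.append((chunk_id, "TAGGED_AS", concept))
    return edges
```

Note the kNN scores come from the vector DB per the tightened contract above; the graph builder only consumes neighbour lists, it never recomputes similarity client-side.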

Acceptance Criteria

  • Import a PDF of ≥300MB on 8-core/16GB hardware without running out of memory
  • Preflight ETA accurate to ±25% after two minutes of micro-benchmarking
  • Self-query MRR@10 ≥ 0.6 — the system can find what it has ingested
  • Jobs UI persists across page reloads
  • Vector DB is the single and exclusive source of embeddings and search — no exceptions
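The preflight ETA criterion amounts to extrapolating total runtime from a short micro-benchmark. A minimal sketch — `preflight_eta` is an illustrative name, and real accuracy within ±25% depends on the sample being representative of the whole document:

```python
def preflight_eta(sample_bytes: int, sample_seconds: float,
                  total_bytes: int) -> float:
    """Extrapolate total job runtime (seconds) from a micro-benchmark.

    Measures throughput over a small sample, then scales linearly to the
    full input. Assumes roughly uniform per-byte cost across the document.
    """
    throughput = sample_bytes / sample_seconds  # bytes per second
    return total_bytes / throughput
```

For example, if two minutes of benchmarking process 10 MB in 5 seconds of active work, a 300 MB PDF would be estimated at 150 seconds; the ±25% tolerance then allows an actual runtime between roughly 112 and 188 seconds.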
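The self-query MRR@10 criterion has a standard definition: the mean, over queries, of the reciprocal rank of the first relevant result within the top 10 (zero if it does not appear). A small sketch with an assumed input shape of (ranked result IDs, relevant ID) pairs:

```python
def mrr_at_10(results: list[tuple[list[str], str]]) -> float:
    """Mean Reciprocal Rank at cutoff 10.

    Each element pairs a ranked list of result IDs with the single
    relevant ID. Reciprocal rank is 1/position of the first hit within
    the top 10, or 0 if the relevant ID is absent from the top 10.
    """
    total = 0.0
    for ranked, relevant in results:
        for pos, doc_id in enumerate(ranked[:10], start=1):
            if doc_id == relevant:
                total += 1.0 / pos
                break
    return total / len(results)
```

In the self-query setting, each ingested chunk (or a question generated from it) is issued as a query and the chunk itself is the relevant ID, so MRR@10 ≥ 0.6 means the system typically ranks its own content near the top.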
