
feat: ChromaDB unified knowledge store — vector + metadata persistence with hot-reload#120

Merged
Steake merged 6 commits into main from
copilot/implement-sqlite-knowledge-store
Mar 6, 2026

Conversation

Contributor

Copilot AI commented Mar 6, 2026

Description

Implements ChromaKnowledgeStore — a unified ChromaDB-backed persistence layer for both symbolic knowledge metadata and vector embeddings. This replaces the originally planned SQLite approach; ChromaDB handles both vector embeddings and structured metadata in a single embedded, zero-server persistence layer, aligning with GödelOS's existing vector search architecture.

Core Implementation

  • godelOS/core_kr/knowledge_store/chroma_store.py — ChromaKnowledgeStore(KnowledgeStoreBackend) using chromadb.PersistentClient with a collection-per-context strategy (TRUTHS, BELIEFS, HYPOTHETICAL → separate Chroma collections)
  • AST nodes pickle-serialized as hex blobs in metadata for exact round-trip fidelity; human-readable statement text used as documents for Chroma's automatic vector embedding
  • query_by_similarity(query_text, context_id, n_results) — semantic vector search via Chroma's native embedding search
  • query_by_metadata(context_id, filters) — structured retrieval via Chroma where clause filters
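The serialization scheme above can be sketched in isolation (a minimal illustration using plain tuples in place of real AST nodes; these helper names are hypothetical, not the actual ChromaKnowledgeStore internals):

```python
import hashlib
import pickle

def serialize_statement(statement) -> str:
    """Pickle -> hex string, since Chroma metadata values must be text."""
    return pickle.dumps(statement).hex()

def deserialize_statement(hex_str: str):
    """Hex string -> unpickled statement, for exact round-trip fidelity."""
    return pickle.loads(bytes.fromhex(hex_str))

def statement_id(context_id: str, statement) -> str:
    """One possible deterministic, context-scoped document id."""
    return hashlib.sha256(
        context_id.encode("utf-8") + b"|" + pickle.dumps(statement)
    ).hexdigest()
```

The hex round-trip matters because Chroma stores documents and metadata as text, while pickle output is raw bytes.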

Backend Selection & Configuration

  • godelOS/core_kr/knowledge_store/interface.py — KnowledgeStoreInterface now accepts backend/db_path constructor args and respects KNOWLEDGE_STORE_BACKEND / KNOWLEDGE_STORE_PATH env vars
  • Added get_context_info() to KnowledgeStoreBackend ABC and InMemoryKnowledgeStore
  • Unknown backend values log a warning and fall back to memory
  • Default contexts created idempotently (persisted backends don't re-create on restart)
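The selection logic described above might look roughly like this (a sketch under the documented env-var names; the real constructor wiring lives in interface.py):

```python
import logging
import os
from typing import Optional, Tuple

logger = logging.getLogger(__name__)

def select_backend(
    backend: Optional[str] = None, db_path: Optional[str] = None
) -> Tuple[str, str]:
    """Resolve backend/path from constructor args, falling back to env vars."""
    backend = (backend or os.environ.get("KNOWLEDGE_STORE_BACKEND", "memory")).lower()
    db_path = db_path or os.environ.get("KNOWLEDGE_STORE_PATH", "./data/chroma")
    if backend not in ("memory", "chroma"):
        # Unknown values log a warning and fall back to the in-memory store
        logger.warning("Unknown knowledge store backend %r; using memory", backend)
        backend = "memory"
    return backend, db_path
```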

Hot-Reload

  • godelOS/core_kr/knowledge_store/hot_reloader.py — OntologyHotReloader watches a configured directory for .ttl/.json-ld files; computes triple-level deltas and fires on_add/on_remove callbacks (works with any backend, including Chroma). Sub-5s latency with debounce.
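The delta step can be illustrated on its own (watchdog wiring and debounce omitted; the callback signatures here are assumptions, not the exact hot_reloader.py API):

```python
from typing import Callable, Set, Tuple

Triple = Tuple[str, str, str]

def fire_delta_callbacks(
    previous: Set[Triple],
    current: Set[Triple],
    on_add: Callable[[Triple], None],
    on_remove: Callable[[Triple], None],
) -> None:
    """Diff two snapshots of a parsed ontology file and fire callbacks."""
    for triple in sorted(current - previous):
        on_add(triple)       # triple appeared since the last parse
    for triple in sorted(previous - current):
        on_remove(triple)    # triple was deleted from the file
```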

Vector Database Overlap Audit

  • backend/core/vector_database.py — Added deprecation notice documenting the overlap between the in-memory FAISS-based PersistentVectorDatabase and the new ChromaDB-backed store. New code should prefer ChromaKnowledgeStore for unified knowledge + vector retrieval.

Other Changes

  • godelOS/core_kr/knowledge_store/__init__.py — Exports ChromaKnowledgeStore and OntologyHotReloader
  • requirements.txt — Added chromadb>=0.4.0
  • backend/.env.example — Added KNOWLEDGE_STORE_BACKEND, KNOWLEDGE_STORE_PATH, ONTOLOGY_WATCH_DIR with documented defaults

Related Issues

Closes #112

Test Evidence

40 new tests in tests/test_chroma_knowledge_store.py covering:

  • Round-trip persistence (add → destroy client → reopen from disk → verify)
  • All KnowledgeStoreBackend operations via ChromaDB (context CRUD, statement CRUD, query pattern matching, variable binding)
  • Semantic similarity retrieval (query_by_similarity)
  • Structured metadata retrieval (query_by_metadata)
  • KnowledgeStoreInterface backend selection via constructor and env vars
  • Hot-reloader delta detection (add/modify/remove) including live watchdog observer test
  • TTL and JSON-LD parser edge cases

All 62 knowledge store tests pass (40 new + 22 existing):

tests/test_chroma_knowledge_store.py     40 passed
tests/test_knowledge_store.py            10 passed, 1 skipped (pre-existing)
tests/test_knowledge_store_shim.py        6 passed
tests/test_knowledge_store_enhanced.py    6 passed

CodeQL: 0 alerts. chromadb 1.5.2: no known advisories.

Checklist

  • Tests pass locally (pytest tests/)
  • Code is formatted (black . and isort .)
  • Documentation updated (if applicable)
  • No secrets or credentials committed
  • Related issue linked above


Copilot AI and others added 3 commits March 6, 2026 05:22
…ion utility

Co-authored-by: Steake <530040+Steake@users.noreply.github.com>
…ration

Co-authored-by: Steake <530040+Steake@users.noreply.github.com>
…ge cases

Co-authored-by: Steake <530040+Steake@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Implement SQLite backend for persistent knowledge store" to "feat: persistent SQLite knowledge store with hot-reload ontology" Mar 6, 2026
@Steake
Owner

Steake commented Mar 6, 2026

⚠️ Scope Change — Pivot from SQLite to ChromaDB

This PR was opened against the original SQLite brief. That brief has been superseded. Please stop all SQLite implementation work and reorient entirely to ChromaDB.

Rationale

GödelOS already has a vector search layer (backend/core/vector_database.py, backend/core/distributed_vector_search.py). SQLite would introduce a second storage format sitting alongside it — redundant, mismatched, and philosophically incoherent for a system that is fundamentally about semantic knowledge retrieval. ChromaDB is embedded, zero-server, Python-native, and handles both vector embeddings and structured metadata in a single persistence layer. No Docker. No ceremony.

What to implement instead

Replace all SQLite code with:

# requirements.txt — add:
chromadb>=0.4.0

# godelOS/core_kr/knowledge_store/chroma_store.py
class ChromaKnowledgeStore(KnowledgeStoreInterface):
    ...

Full specification is in the updated issue #112 body. Please read it in full before proceeding.

Specific changes required

  • Remove all SQLite / aiosqlite code from this branch
  • Implement ChromaKnowledgeStore(KnowledgeStoreInterface) in godelOS/core_kr/knowledge_store/chroma_store.py
  • Collection-per-context: TRUTHS, BELIEFS, HYPOTHETICAL → separate Chroma collections
  • Env vars: KNOWLEDGE_STORE_BACKEND=chroma, KNOWLEDGE_STORE_PATH=./data/chroma
  • Semantic retrieval via Chroma vector search; structured retrieval via where metadata filters
  • OntologyHotReloader watching configured directory — upserts delta into Chroma collections
  • Audit backend/core/vector_database.py for overlap; deprecate redundant in-memory store where ChromaDB supersedes it
  • Round-trip persistence tests, semantic retrieval tests, hot-reload tests
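The two retrieval modes asked for above can be sketched as follows (assuming `col` is a standard Chroma collection and a `subject` metadata key exists; this is illustrative, not the final API):

```python
def semantic_query(col, text: str, n: int = 5):
    """Vector search: Chroma embeds `text` and returns the nearest documents."""
    return col.query(query_texts=[text], n_results=n)

def structured_query(col, subject: str):
    """Structured retrieval via a metadata `where` filter."""
    return col.get(where={"subject": subject})
```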

PR title and description should be updated to reflect ChromaDB, not SQLite.

Do not merge the SQLite implementation. The direction has changed.

@Steake Steake changed the title from "feat: persistent SQLite knowledge store with hot-reload ontology" to "feat: ChromaDB unified knowledge store — vector + metadata persistence with hot-reload" Mar 6, 2026
@Steake
Owner

Steake commented Mar 6, 2026


@copilot - Address this most pressing of matters.

Copilot AI and others added 2 commits March 6, 2026 05:43
…re with vector + metadata persistence

Co-authored-by: Steake <530040+Steake@users.noreply.github.com>
…lter, comment style

Co-authored-by: Steake <530040+Steake@users.noreply.github.com>
Contributor Author

Copilot AI commented Mar 6, 2026

⚠️ Scope Change — Pivot from SQLite to ChromaDB

This PR was opened against the original SQLite brief. That brief has been superseded. Please stop all SQLite implementation work and r...

Done. Commit f76b5f1 completes the pivot:

  • All SQLite/aiosqlite code removed (sqlite_store.py, migrate.py, test_sqlite_knowledge_store.py)
  • ChromaKnowledgeStore(KnowledgeStoreBackend) implemented in chroma_store.py — collection-per-context, pickle-serialized AST blobs, vector embeddings for semantic search
  • query_by_similarity() for semantic retrieval, query_by_metadata() for structured where filters
  • KNOWLEDGE_STORE_BACKEND=chroma, KNOWLEDGE_STORE_PATH=./data/chroma wired into KnowledgeStoreInterface
  • OntologyHotReloader retained with generic callbacks (works with any backend)
  • Deprecation notice added to backend/core/vector_database.py documenting the overlap
  • 40 new tests pass (round-trip persistence, semantic retrieval, metadata query, hot-reload); 22 existing tests unaffected
  • CodeQL: 0 alerts

@Steake Steake marked this pull request as ready for review March 6, 2026 09:17
@Steake Steake self-requested a review as a code owner March 6, 2026 09:17
Copilot AI review requested due to automatic review settings March 6, 2026 09:17
@Steake Steake merged commit 310a5f3 into main Mar 6, 2026
Copilot AI added a commit that referenced this pull request Mar 6, 2026
…orkspace), #118 (EmergenceDetector), #119 (transparency), #120 (ChromaDB)

Resolve .gitignore conflict (keep both gitignore entries).
unified_consciousness_engine.py auto-merged cleanly:
IIT φ calculator coexists with GlobalWorkspace broadcaster.
All 51 tests pass (30 IIT + 21 GlobalWorkspace).

Co-authored-by: Steake <530040+Steake@users.noreply.github.com>
Contributor

Copilot AI left a comment


Pull request overview

Adds a ChromaDB-backed knowledge store backend to unify persistence of symbolic knowledge + vector search, and introduces an ontology hot-reloader to apply ontology deltas without restarts.

Changes:

  • Introduce ChromaKnowledgeStore backed by chromadb.PersistentClient, plus interface wiring for backend selection via constructor args and env vars.
  • Add OntologyHotReloader (watchdog-based) with basic Turtle/JSON-LD triple parsing and delta callbacks.
  • Add comprehensive tests and update configuration/docs (requirements + .env.example) plus a deprecation note for the legacy FAISS vector DB.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
tests/test_chroma_knowledge_store.py New test suite covering Chroma backend behavior, persistence, interface selection, and hot reload.
requirements.txt Adds chromadb dependency.
godelOS/core_kr/knowledge_store/interface.py Adds get_context_info() and enables backend selection (memory vs chroma) with idempotent default context creation.
godelOS/core_kr/knowledge_store/hot_reloader.py Adds watchdog-based ontology watcher and parsers to compute triple deltas.
godelOS/core_kr/knowledge_store/chroma_store.py Implements the Chroma-backed KnowledgeStoreBackend including similarity + metadata querying.
godelOS/core_kr/knowledge_store/__init__.py Exposes ChromaKnowledgeStore and OntologyHotReloader.
backend/core/vector_database.py Documents deprecation/overlap with the new Chroma-backed store.
backend/.env.example Documents new env vars for knowledge store backend/path and ontology watch directory.

Comment on lines +21 to +22
import pytest


Copilot AI Mar 6, 2026


pytest is imported but never used in this test module. This will trip linters and is unnecessary since the tests are written with unittest; please remove the unused import (or convert the file to pytest style and actually use it).

Suggested change
import pytest

Comment on lines +38 to +43
Each context becomes a separate Chroma collection. AST statements
are stored as pickled blobs in the document field for exact
round-trip fidelity, while extracted subject/predicate/object
strings are stored as metadata for structured ``where`` queries.
ChromaDB also creates vector embeddings of the document text,
enabling semantic similarity retrieval.

Copilot AI Mar 6, 2026


The class docstring says AST statements are stored as pickled blobs in the document field, but the implementation stores human-readable text in documents and the pickled blob in metadata under _blob. Please update the docstring to match the actual storage layout (or vice versa) to avoid misleading future changes.

Suggested change
Each context becomes a separate Chroma collection. AST statements
are stored as pickled blobs in the document field for exact
round-trip fidelity, while extracted subject/predicate/object
strings are stored as metadata for structured ``where`` queries.
ChromaDB also creates vector embeddings of the document text,
enabling semantic similarity retrieval.

Each context becomes a separate Chroma collection. Human-readable
text for each AST statement (for example, a stringified form) is
stored in the collection's ``documents`` field, which Chroma uses
for embedding generation and semantic similarity search. The
original AST object is pickled and stored in the metadata under
a dedicated key (e.g. ``"_blob"``) to preserve exact round-trip
fidelity, while extracted subject/predicate/object strings and
other structured attributes are stored as additional metadata
for ``where``-style queries.

Comment on lines +171 to +176
if metadata:
    statement_ast = statement_ast.with_updated_metadata(metadata)

# Duplicate check
if self.statement_exists(statement_ast, [context_id]):
    return False

Copilot AI Mar 6, 2026


add_statement() does a duplicate check via statement_exists(), which iterates/deserializes every statement in the collection. Since you already compute a deterministic doc_id, this can be made much faster by checking for that id directly (e.g., col.get(ids=[doc_id])) and only falling back to unification when needed.
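The suggested fast path is small (hedged sketch; `col` is any Chroma-like collection exposing `get(ids=...)`):

```python
def is_duplicate(col, doc_id: str) -> bool:
    """Cheap duplicate check by deterministic id, instead of a full scan."""
    result = col.get(ids=[doc_id])
    return bool(result.get("ids"))
```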

Comment on lines +119 to +127
def _serialize_statement(statement: AST_Node) -> str:
    """Pickle → hex-encoded string (Chroma documents are text)."""
    return pickle.dumps(statement).hex()

@staticmethod
def _deserialize_statement(hex_str: str) -> AST_Node:
    """Hex string -> unpickled AST node."""
    return pickle.loads(bytes.fromhex(hex_str))  # noqa: S301 - trusted internal data


Copilot AI Mar 6, 2026


Statements are deserialized via pickle.loads() from data stored on disk. If the persistence directory can be modified by an attacker (or is synced from an untrusted source), this is an arbitrary code execution vector. Consider switching to a safe serialization format for AST nodes (custom JSON/CBOR), or at minimum enforce a strong trust boundary (permissions, validation) and document that the store directory must be treated as trusted-only.
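For simple triple-shaped statements, a pickle-free encoding could look like this (a sketch only; real AST nodes would need a proper per-node-type (de)serializer):

```python
import json
from typing import Tuple

def triple_to_json(s: str, p: str, o: str) -> str:
    """Safe, human-auditable alternative to pickle for flat statements."""
    return json.dumps({"s": s, "p": p, "o": o}, sort_keys=True)

def triple_from_json(payload: str) -> Tuple[str, str, str]:
    d = json.loads(payload)
    return (d["s"], d["p"], d["o"])
```

Unlike pickle, json.loads cannot execute code, so a tampered store directory degrades to a parse error rather than arbitrary code execution.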

Comment on lines +360 to +366
# Early return if collection is empty (minus sentinel doc)
actual_count = max(col.count() - 1, 0)
if actual_count == 0:
    return []

n = min(n_results, actual_count)

Copilot AI Mar 6, 2026


query_by_similarity() assumes every collection contains exactly one sentinel doc and does col.count() - 1. If a collection exists without the sentinel (e.g., migrated/externally created), this will undercount and can incorrectly return [] despite real documents. Consider detecting sentinel presence explicitly (e.g., check __context_meta__ exists) or compute actual_count via a where filter on _blob.

Suggested change
# Early return if collection is empty (minus sentinel doc)
actual_count = max(col.count() - 1, 0)
if actual_count == 0:
    return []
n = min(n_results, actual_count)

# Do not assume a sentinel document; let Chroma handle empty collections.
n = max(int(n_results), 0)

Comment on lines +34 to +46
items = data if isinstance(data, list) else data.get("@graph", [data])
for item in items:
    subject = item.get("@id", "")
    for key, value in item.items():
        if key.startswith("@"):
            continue
        values = value if isinstance(value, list) else [value]
        for v in values:
            if isinstance(v, dict):
                obj = v.get("@id", v.get("@value", str(v)))
            else:
                obj = str(v)
            triples.add((subject, key, obj))

Copilot AI Mar 6, 2026


_parse_jsonld_triples() will emit triples with an empty subject when an item lacks @id (since it defaults to ""). That can produce meaningless deltas and unexpected on_add/on_remove calls; consider skipping items without a valid subject id instead of using an empty string.
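The suggested guard is small (sketch only; item shapes follow the JSON-LD handling quoted above):

```python
def valid_subjects(items):
    """Yield only items that carry a usable @id, skipping anonymous nodes."""
    for item in items:
        subject = item.get("@id")
        if not subject:
            continue  # no empty-string subjects in the emitted triples
        yield subject, item
```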

Comment on lines +167 to +168
"""
return None # default; concrete backends should override

Copilot AI Mar 6, 2026


KnowledgeStoreBackend.get_context_info() is part of the backend contract but is not marked @abstractmethod and returns None by default, so a backend can silently “implement” it without providing real behavior. Consider making it abstract (consistent with the other backend methods) or raising NotImplementedError to avoid hard-to-debug runtime None results.

Suggested change
"""
return None  # default; concrete backends should override

Raises:
    NotImplementedError: If the backend does not implement this method.
"""
raise NotImplementedError(
    "KnowledgeStoreBackend.get_context_info() must be implemented by concrete backends"
)

Comment on lines +114 to +115
"""Deterministic ID for a statement in a context."""
blob = pickle.dumps(statement)

Copilot AI Mar 6, 2026


_statement_id() takes a context_id parameter but never uses it, which is misleading (and suggests IDs differ per-context when they currently don’t). Either incorporate context_id into the hash (e.g., prefix it) or remove the parameter to avoid confusion.

Suggested change
"""Deterministic ID for a statement in a context."""
blob = pickle.dumps(statement)

"""Deterministic ID for a statement within a given context."""
# Incorporate the context_id into the hash so IDs are context-scoped.
context_bytes = context_id.encode("utf-8")
blob = context_bytes + b"|" + pickle.dumps(statement)

Comment on lines +57 to +74
This is a lightweight parser covering the common ``<s> <p> <o> .`` form
and simple prefixed names. It is **not** a full Turtle parser.
"""
triples: Set[tuple] = set()
prefixes: Dict[str, str] = {}
try:
    with open(path, "r", encoding="utf-8") as fh:
        for raw_line in fh:
            line = raw_line.strip()
            if not line or line.startswith("#"):
                continue
            if line.lower().startswith("@prefix"):
                parts = line.rstrip(".").split()
                if len(parts) >= 3:
                    prefix = parts[1].rstrip(":")
                    uri = parts[2].strip("<>")
                    prefixes[prefix] = uri
                continue

Copilot AI Mar 6, 2026


_parse_ttl_triples() collects @prefix declarations into prefixes but never uses them, and the docstring claims support for prefixed names. Either implement prefix expansion using the collected prefixes, or remove the unused prefixes logic and adjust the docstring to reflect what’s actually supported.
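Prefix expansion, if implemented, could be as simple as (hypothetical helper, not part of the PR):

```python
def expand_term(term: str, prefixes: dict) -> str:
    """Expand a prefixed name (e.g. ex:Socrates) using collected @prefix URIs."""
    if term.startswith("<") and term.endswith(">"):
        return term[1:-1]  # already a full IRI
    if ":" in term:
        prefix, local = term.split(":", 1)
        if prefix in prefixes:
            return prefixes[prefix] + local
    return term  # leave unknown prefixes and plain literals untouched
```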

Steake added a commit that referenced this pull request Mar 6, 2026
…#116)

* Initial plan

* feat(iit): implement bipartition MI approximation for calculate_phi and add unit tests

Replace the heuristic-based InformationIntegrationTheory.calculate_phi()
with a tractable bipartition mutual-information approximation (Tononi 2004):
- Convert subsystem dicts to numeric vectors via recursive flattening
- Enumerate all non-trivial bipartitions at subsystem level (63 cuts)
- φ = min MI across all cuts, with noise-floor suppression for idle states
- Preserve contradiction penalty from self-model validator
- Add 'phi' field to WebSocket broadcast payload for acceptance criteria
- 27 unit tests: idle→φ=0, active→φ>0, penalty, helpers, performance <50ms

Co-authored-by: Steake <530040+Steake@users.noreply.github.com>

* address code review: document magic numbers, add division guard, expand test coverage

Co-authored-by: Steake <530040+Steake@users.noreply.github.com>

* merge: integrate main with PRs #114 (schema contracts), #117 (GlobalWorkspace), #118 (EmergenceDetector), #119 (transparency), #120 (ChromaDB)

Resolve .gitignore conflict (keep both gitignore entries).
unified_consciousness_engine.py auto-merged cleanly:
IIT φ calculator coexists with GlobalWorkspace broadcaster.
All 51 tests pass (30 IIT + 21 GlobalWorkspace).

Co-authored-by: Steake <530040+Steake@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Steake <530040+Steake@users.noreply.github.com>
Copilot AI added a commit that referenced this pull request Mar 6, 2026
…owledge-endpoints

# Conflicts:
#	backend/unified_server.py
#	godelOS/core_kr/knowledge_store/__init__.py


Development

Successfully merging this pull request may close these issues.

infra: replace in-memory knowledge store with ChromaDB — unified vector + metadata persistence

3 participants