Conversation
…ion utility Co-authored-by: Steake <530040+Steake@users.noreply.github.com>
…ration Co-authored-by: Steake <530040+Steake@users.noreply.github.com>
…ge cases Co-authored-by: Steake <530040+Steake@users.noreply.github.com>
@copilot - Address this most pressing of matters.
…re with vector + metadata persistence Co-authored-by: Steake <530040+Steake@users.noreply.github.com>
…lter, comment style Co-authored-by: Steake <530040+Steake@users.noreply.github.com>
Done. Commit
…orkspace), #118 (EmergenceDetector), #119 (transparency), #120 (ChromaDB) Resolve .gitignore conflict (keep both gitignore entries). unified_consciousness_engine.py auto-merged cleanly: IIT φ calculator coexists with GlobalWorkspace broadcaster. All 51 tests pass (30 IIT + 21 GlobalWorkspace). Co-authored-by: Steake <530040+Steake@users.noreply.github.com>
Pull request overview
Adds a ChromaDB-backed knowledge store backend to unify persistence of symbolic knowledge + vector search, and introduces an ontology hot-reloader to apply ontology deltas without restarts.
Changes:
- Introduce `ChromaKnowledgeStore` backed by `chromadb.PersistentClient`, plus interface wiring for backend selection via constructor args and env vars.
- Add `OntologyHotReloader` (watchdog-based) with basic Turtle/JSON-LD triple parsing and delta callbacks.
- Add comprehensive tests and update configuration/docs (requirements + `.env.example`), plus a deprecation note for the legacy FAISS vector DB.
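The delta-callback mechanism described above reduces to a set difference over parsed triples. A minimal sketch (hedged: `compute_delta` and the sample triples are illustrative, not the PR's actual `OntologyHotReloader` API):

```python
# Minimal sketch of triple-delta computation for ontology hot reload.
# The (subject, predicate, object) tuples and the on_add/on_remove callback
# names follow the PR description; everything else is illustrative.
def compute_delta(old_triples: set, new_triples: set):
    """Return (added, removed) triples between two ontology snapshots."""
    return new_triples - old_triples, old_triples - new_triples

old = {("ex:Alice", "ex:knows", "ex:Bob")}
new = {("ex:Alice", "ex:knows", "ex:Bob"),
       ("ex:Alice", "ex:likes", "ex:Carol")}

added, removed = compute_delta(old, new)
for triple in added:
    print("on_add:", triple)     # fired once per newly appearing triple
for triple in removed:
    print("on_remove:", triple)  # none in this example
```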
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| tests/test_chroma_knowledge_store.py | New test suite covering Chroma backend behavior, persistence, interface selection, and hot reload. |
| requirements.txt | Adds chromadb dependency. |
| godelOS/core_kr/knowledge_store/interface.py | Adds get_context_info() and enables backend selection (memory vs chroma) with idempotent default context creation. |
| godelOS/core_kr/knowledge_store/hot_reloader.py | Adds watchdog-based ontology watcher and parsers to compute triple deltas. |
| godelOS/core_kr/knowledge_store/chroma_store.py | Implements the Chroma-backed KnowledgeStoreBackend including similarity + metadata querying. |
| godelOS/core_kr/knowledge_store/__init__.py | Exposes ChromaKnowledgeStore and OntologyHotReloader. |
| backend/core/vector_database.py | Documents deprecation/overlap with the new Chroma-backed store. |
| backend/.env.example | Documents new env vars for knowledge store backend/path and ontology watch directory. |
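The new env vars documented above might be consumed roughly like this (a sketch; the actual `KnowledgeStoreInterface` wiring and default values may differ — `resolve_backend` and the `./knowledge_store_db` default are assumptions):

```python
import os

# Hedged sketch of env-var-driven backend selection. The env var names
# come from the PR; the function, defaults, and precedence are illustrative.
def resolve_backend(backend=None, db_path=None):
    # Constructor args take precedence over environment variables.
    backend = backend or os.environ.get("KNOWLEDGE_STORE_BACKEND", "memory")
    db_path = db_path or os.environ.get("KNOWLEDGE_STORE_PATH", "./knowledge_store_db")
    if backend not in ("memory", "chroma"):
        raise ValueError(f"Unknown knowledge store backend: {backend!r}")
    return backend, db_path

print(resolve_backend(backend="chroma", db_path="/tmp/kb"))
```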
import pytest
pytest is imported but never used in this test module. This will trip linters and is unnecessary since the tests are written with unittest; please remove the unused import (or convert the file to pytest style and actually use it).
Suggested change: remove the line

import pytest
Each context becomes a separate Chroma collection. AST statements
are stored as pickled blobs in the document field for exact
round-trip fidelity, while extracted subject/predicate/object
strings are stored as metadata for structured ``where`` queries.
ChromaDB also creates vector embeddings of the document text,
enabling semantic similarity retrieval.
The class docstring says AST statements are stored as pickled blobs in the document field, but the implementation stores human-readable text in documents and the pickled blob in metadata under _blob. Please update the docstring to match the actual storage layout (or vice versa) to avoid misleading future changes.
Each context becomes a separate Chroma collection. AST statements
are stored as pickled blobs in the document field for exact
round-trip fidelity, while extracted subject/predicate/object
strings are stored as metadata for structured ``where`` queries.
ChromaDB also creates vector embeddings of the document text,
enabling semantic similarity retrieval.

Suggested replacement:

Each context becomes a separate Chroma collection. Human-readable
text for each AST statement (for example, a stringified form) is
stored in the collection's ``documents`` field, which Chroma uses
for embedding generation and semantic similarity search. The
original AST object is pickled and stored in the metadata under
a dedicated key (e.g. ``"_blob"``) to preserve exact round-trip
fidelity, while extracted subject/predicate/object strings and
other structured attributes are stored as additional metadata
for ``where``-style queries.
if metadata:
    statement_ast = statement_ast.with_updated_metadata(metadata)

# Duplicate check
if self.statement_exists(statement_ast, [context_id]):
    return False
add_statement() does a duplicate check via statement_exists(), which iterates/deserializes every statement in the collection. Since you already compute a deterministic doc_id, this can be made much faster by checking for that id directly (e.g., col.get(ids=[doc_id])) and only falling back to unification when needed.
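The fast path suggested here can be sketched with a stand-in collection (`FakeCollection` and `add_statement` below are illustrative stubs, not the PR's actual code; real code would call Chroma's `Collection.get(ids=[...])`, which returns only the ids it actually finds):

```python
# Hypothetical sketch of the reviewer's suggestion: check the deterministic
# doc_id directly instead of scanning/deserializing the whole collection.
class FakeCollection:
    """Minimal stand-in for a Chroma collection, for illustration only."""
    def __init__(self):
        self._docs = {}

    def get(self, ids):
        # Mirrors Chroma's behavior of returning only the ids that exist.
        return {"ids": [i for i in ids if i in self._docs]}

    def add(self, ids, documents):
        for i, d in zip(ids, documents):
            self._docs[i] = d

def add_statement(col, doc_id, document):
    # Fast path: O(1) id lookup instead of iterating every statement.
    if col.get(ids=[doc_id])["ids"]:
        return False  # duplicate — already stored under this id
    col.add(ids=[doc_id], documents=[document])
    return True

col = FakeCollection()
print(add_statement(col, "abc123", "likes(alice, bob)"))  # first insert
print(add_statement(col, "abc123", "likes(alice, bob)"))  # duplicate
```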
def _serialize_statement(statement: AST_Node) -> str:
    """Pickle → hex-encoded string (Chroma documents are text)."""
    return pickle.dumps(statement).hex()

@staticmethod
def _deserialize_statement(hex_str: str) -> AST_Node:
    """Hex string → unpickled AST node."""
    return pickle.loads(bytes.fromhex(hex_str))  # noqa: S301 - trusted internal data
Statements are deserialized via pickle.loads() from data stored on disk. If the persistence directory can be modified by an attacker (or is synced from an untrusted source), this is an arbitrary code execution vector. Consider switching to a safe serialization format for AST nodes (custom JSON/CBOR), or at minimum enforce a strong trust boundary (permissions, validation) and document that the store directory must be treated as trusted-only.
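One safe-serialization direction could look like the following (a sketch under the assumption that AST nodes can be reduced to plain data; the field names are hypothetical, not the project's actual AST schema):

```python
import json

# Hypothetical sketch of a JSON round-trip replacing pickle for on-disk
# statements. JSON carries only plain data (strings, numbers, lists, dicts),
# so loading it cannot execute arbitrary code the way pickle.loads() can.
def serialize_node(node: dict) -> str:
    # sort_keys gives a canonical form, useful if the text also feeds
    # deterministic id hashing.
    return json.dumps(node, sort_keys=True)

def deserialize_node(text: str) -> dict:
    return json.loads(text)

# Illustrative node shape — not the real AST_Node structure.
node = {"type": "ApplicationNode", "operator": "likes", "arguments": ["alice", "bob"]}
assert deserialize_node(serialize_node(node)) == node
```

A real migration would need a to-dict/from-dict layer for each AST node class, but it removes the arbitrary-code-execution risk entirely rather than relying on directory permissions.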
# Early return if collection is empty (minus sentinel doc)
actual_count = max(col.count() - 1, 0)
if actual_count == 0:
    return []

n = min(n_results, actual_count)
query_by_similarity() assumes every collection contains exactly one sentinel doc and does col.count() - 1. If a collection exists without the sentinel (e.g., migrated/externally created), this will undercount and can incorrectly return [] despite real documents. Consider detecting sentinel presence explicitly (e.g., check __context_meta__ exists) or compute actual_count via a where filter on _blob.
# Early return if collection is empty (minus sentinel doc)
actual_count = max(col.count() - 1, 0)
if actual_count == 0:
    return []
n = min(n_results, actual_count)

Suggested replacement:

# Do not assume a sentinel document; let Chroma handle empty collections.
n = max(int(n_results), 0)
items = data if isinstance(data, list) else data.get("@graph", [data])
for item in items:
    subject = item.get("@id", "")
    for key, value in item.items():
        if key.startswith("@"):
            continue
        values = value if isinstance(value, list) else [value]
        for v in values:
            if isinstance(v, dict):
                obj = v.get("@id", v.get("@value", str(v)))
            else:
                obj = str(v)
            triples.add((subject, key, obj))
_parse_jsonld_triples() will emit triples with an empty subject when an item lacks @id (since it defaults to ""). That can produce meaningless deltas and unexpected on_add/on_remove calls; consider skipping items without a valid subject id instead of using an empty string.
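The suggested guard can be sketched as follows (the loop mirrors the snippet above; the sample items are illustrative):

```python
# Sketch of the suggested fix: skip items that lack a usable "@id" instead
# of emitting triples with an empty subject.
items = [
    {"@id": "ex:alice", "ex:knows": {"@id": "ex:bob"}},
    {"ex:orphan": "no-subject"},  # lacks @id — should be skipped, not emitted
]

triples = set()
for item in items:
    subject = item.get("@id")
    if not subject:
        continue  # no valid subject id → no triples for this item
    for key, value in item.items():
        if key.startswith("@"):
            continue
        values = value if isinstance(value, list) else [value]
        for v in values:
            obj = v.get("@id", v.get("@value", str(v))) if isinstance(v, dict) else str(v)
            triples.add((subject, key, obj))

print(triples)  # only the triple with a real subject survives
```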
| """ | ||
| return None # default; concrete backends should override |
KnowledgeStoreBackend.get_context_info() is part of the backend contract but is not marked @abstractmethod and returns None by default, so a backend can silently “implement” it without providing real behavior. Consider making it abstract (consistent with the other backend methods) or raising NotImplementedError to avoid hard-to-debug runtime None results.
| """ | |
| return None # default; concrete backends should override | |
| Raises: | |
| NotImplementedError: If the backend does not implement this method. | |
| """ | |
| raise NotImplementedError( | |
| "KnowledgeStoreBackend.get_context_info() must be implemented by concrete backends" | |
| ) |
| """Deterministic ID for a statement in a context.""" | ||
| blob = pickle.dumps(statement) |
_statement_id() takes a context_id parameter but never uses it, which is misleading (and suggests IDs differ per-context when they currently don’t). Either incorporate context_id into the hash (e.g., prefix it) or remove the parameter to avoid confusion.
| """Deterministic ID for a statement in a context.""" | |
| blob = pickle.dumps(statement) | |
| """Deterministic ID for a statement within a given context.""" | |
| # Incorporate the context_id into the hash so IDs are context-scoped. | |
| context_bytes = context_id.encode("utf-8") | |
| blob = context_bytes + b"|" + pickle.dumps(statement) |
This is a lightweight parser covering the common ``<s> <p> <o> .`` form
and simple prefixed names. It is **not** a full Turtle parser.
"""
triples: Set[tuple] = set()
prefixes: Dict[str, str] = {}
try:
    with open(path, "r", encoding="utf-8") as fh:
        for raw_line in fh:
            line = raw_line.strip()
            if not line or line.startswith("#"):
                continue
            if line.lower().startswith("@prefix"):
                parts = line.rstrip(".").split()
                if len(parts) >= 3:
                    prefix = parts[1].rstrip(":")
                    uri = parts[2].strip("<>")
                    prefixes[prefix] = uri
                continue
_parse_ttl_triples() collects @prefix declarations into prefixes but never uses them, and the docstring claims support for prefixed names. Either implement prefix expansion using the collected prefixes, or remove the unused prefixes logic and adjust the docstring to reflect what’s actually supported.
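If the prefix-expansion route is taken, it could look like this (a hedged sketch, not the module's actual code; `expand` is a hypothetical helper):

```python
# Hypothetical helper: expand a Turtle term using collected @prefix mappings.
def expand(term: str, prefixes: dict) -> str:
    """Expand 'ex:Alice' -> 'http://example.org/Alice' when the prefix is known."""
    # Full IRIs in angle brackets: strip the brackets, no expansion needed.
    if term.startswith("<") and term.endswith(">"):
        return term[1:-1]
    # Prefixed name: split on the first colon and look up the namespace.
    if ":" in term:
        prefix, _, local = term.partition(":")
        base = prefixes.get(prefix)
        if base is not None:
            return base + local
    # Unknown prefix or plain token: leave unchanged.
    return term

prefixes = {"ex": "http://example.org/"}
print(expand("ex:Alice", prefixes))   # expanded via the ex: prefix
print(expand("<http://a/b>", prefixes))  # angle-bracket IRI, brackets stripped
```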
…#116)

* Initial plan

* feat(iit): implement bipartition MI approximation for calculate_phi and add unit tests

Replace the heuristic-based InformationIntegrationTheory.calculate_phi() with a tractable bipartition mutual-information approximation (Tononi 2004):
- Convert subsystem dicts to numeric vectors via recursive flattening
- Enumerate all non-trivial bipartitions at subsystem level (63 cuts)
- φ = min MI across all cuts, with noise-floor suppression for idle states
- Preserve contradiction penalty from self-model validator
- Add 'phi' field to WebSocket broadcast payload for acceptance criteria
- 27 unit tests: idle→φ=0, active→φ>0, penalty, helpers, performance <50ms

Co-authored-by: Steake <530040+Steake@users.noreply.github.com>

* address code review: document magic numbers, add division guard, expand test coverage

Co-authored-by: Steake <530040+Steake@users.noreply.github.com>

* merge: integrate main with PRs #114 (schema contracts), #117 (GlobalWorkspace), #118 (EmergenceDetector), #119 (transparency), #120 (ChromaDB)

Resolve .gitignore conflict (keep both gitignore entries). unified_consciousness_engine.py auto-merged cleanly: IIT φ calculator coexists with GlobalWorkspace broadcaster. All 51 tests pass (30 IIT + 21 GlobalWorkspace).

Co-authored-by: Steake <530040+Steake@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Steake <530040+Steake@users.noreply.github.com>
…owledge-endpoints

# Conflicts:
#   backend/unified_server.py
#   godelOS/core_kr/knowledge_store/__init__.py
Description
Implements `ChromaKnowledgeStore` — a unified ChromaDB-backed persistence layer for both symbolic knowledge metadata and vector embeddings. This replaces the originally planned SQLite approach; ChromaDB handles both vector embeddings and structured metadata in a single embedded, zero-server persistence layer, aligning with GödelOS's existing vector search architecture.

Core Implementation
- `godelOS/core_kr/knowledge_store/chroma_store.py` — `ChromaKnowledgeStore(KnowledgeStoreBackend)` using `chromadb.PersistentClient` with a collection-per-context strategy (`TRUTHS`, `BELIEFS`, `HYPOTHETICAL` → separate Chroma collections)
- `query_by_similarity(query_text, context_id, n_results)` — semantic vector search via Chroma's native embedding search
- `query_by_metadata(context_id, filters)` — structured retrieval via Chroma `where`-clause filters

Backend Selection & Configuration
- `godelOS/core_kr/knowledge_store/interface.py` — `KnowledgeStoreInterface` now accepts `backend`/`db_path` constructor args and respects `KNOWLEDGE_STORE_BACKEND`/`KNOWLEDGE_STORE_PATH` env vars
- Added `get_context_info()` to the `KnowledgeStoreBackend` ABC and `InMemoryKnowledgeStore` (default backend: `memory`)

Hot-Reload
- `godelOS/core_kr/knowledge_store/hot_reloader.py` — `OntologyHotReloader` watches a configured directory for `.ttl`/`.json-ld` files; computes triple-level deltas and fires `on_add`/`on_remove` callbacks (works with any backend, including Chroma). Sub-5s latency with debounce.

Vector Database Overlap Audit
- `backend/core/vector_database.py` — Added a deprecation notice documenting the overlap between the in-memory FAISS-based `PersistentVectorDatabase` and the new ChromaDB-backed store. New code should prefer `ChromaKnowledgeStore` for unified knowledge + vector retrieval.

Other Changes
- `godelOS/core_kr/knowledge_store/__init__.py` — Exports `ChromaKnowledgeStore` and `OntologyHotReloader`
- `requirements.txt` — Added `chromadb>=0.4.0`
- `backend/.env.example` — Added `KNOWLEDGE_STORE_BACKEND`, `KNOWLEDGE_STORE_PATH`, `ONTOLOGY_WATCH_DIR` with documented defaults

Related Issues
Closes #112
Test Evidence
40 new tests in `tests/test_chroma_knowledge_store.py` covering:
- `KnowledgeStoreBackend` operations via ChromaDB (context CRUD, statement CRUD, query pattern matching, variable binding)
- Similarity querying (`query_by_similarity`)
- Metadata querying (`query_by_metadata`)
- `KnowledgeStoreInterface` backend selection via constructor and env vars

All 62 knowledge store tests pass (40 new + 22 existing).
CodeQL: 0 alerts.
chromadb 1.5.2: no known advisories.

Checklist
- Tests pass (`pytest tests/`)
- Code formatted (`black .` and `isort .`)
pytest tests/)black .andisort .)✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.