# SMF

Filesystem-native memory infrastructure for AI agents and organisational knowledge.
Semantic Memory Filesystem (SMF) is a research-grade memory architecture built around a simple proposition: the filesystem itself can serve as the primary substrate for agent memory.
Directories represent entity classes. Files represent entities. Symbolic links represent relationships. Standard POSIX operations become part of the retrieval surface.
Rather than wrapping a database in a filesystem metaphor, SMF treats the filesystem as the actual store. This makes the memory layer directly inspectable, versionable, portable, and auditable.
SMF is built for legibility.
Most memory systems hide structure behind APIs, vector indexes, or orchestration layers. SMF keeps the structure exposed:
- entities are stored as ordinary filesystem objects
- relations are encoded with symlinks
- provenance is attached to the objects themselves
- retrieval combines lexical, semantic, graph, temporal, fact, and auxiliary channels over an inspectable substrate
- the store remains compatible with ordinary shell tools and Git workflows
The result is a memory system that is both machine-usable and human-readable.
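As a minimal sketch of the core idea (the directory names and file contents below are illustrative only, not SMF's actual schema):

```python
import os
from pathlib import Path

# Illustrative layout; SMF's real ontology and file format may differ.
store = Path("memory")
(store / "actors").mkdir(parents=True, exist_ok=True)
(store / "decisions").mkdir(parents=True, exist_ok=True)

# Entities are ordinary files.
actor = store / "actors" / "ada.md"
actor.write_text("name: Ada\nrole: engineer\n")

decision = store / "decisions" / "adopt-uv.md"
decision.write_text("summary: adopt uv for dependency management\n")

# A relation is a symlink from one entity to another.
link = store / "decisions" / "adopt-uv.made-by"
if not link.is_symlink():
    os.symlink(os.path.relpath(actor, link.parent), link)

# Standard POSIX operations become the retrieval surface.
print(link.resolve().read_text().splitlines()[0])  # name: Ada
```

Because the store is just files and symlinks, `find`, `grep`, and `git log` work on it unchanged.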
```
Source Material
       ↓
Stage 0 · INGEST          sanitisation, chunking, addressing
Stage 1 · EXTRACT         entities, summaries, facts, events, topics
Stage 2 · LINK            resolution, typed relationships, graph construction
Stage 3 · ENRICH          derived signals, profiles, secondary structure
Stage 4 · SYNTHESIZE      smart folders, materialised views
Stage 5 · META-REFLECT    confidence adjustment, maintenance, lifecycle logic
       ↓
Memory Store              actors/  interactions/  vco/  decisions/
                          rationale/  time/  events/  topics/
       ↓
Retrieval                 BM25 · embeddings · graph traversal · temporal
                          filtering · fact search · event search ·
                          auxiliary memory channels
```
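The retrieval channels each return an independent ranking. One common way to merge such rankings is reciprocal rank fusion; whether SMF uses RRF or a different fusion rule is not specified here, so treat this as an illustrative merge, not the documented one:

```python
from collections import defaultdict

def rrf(channel_rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists with reciprocal rank fusion.

    Each inner list is one channel's ranking, best result first.
    RRF is an assumed fusion rule for illustration, not SMF's actual code.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in channel_rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

merged = rrf([
    ["fact:budget", "event:review"],   # e.g. BM25
    ["event:review", "actor:ada"],     # e.g. graph traversal
    ["event:review", "fact:budget"],   # e.g. temporal filtering
])
# event:review appears near the top of every channel, so it ranks first
```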
The repository currently exposes an eight-class ontology, a six-stage pipeline, and a multi-channel retrieval stack. Public benchmark reporting is centered on LoCoMo and on the effects of prompt and judge methodology.
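Stage 0 of the pipeline (sanitisation, chunking, addressing) can be sketched as content-addressed chunking. The hash scheme, chunk size, and sanitisation rule here are assumptions for illustration; SMF's actual Stage 0 may differ:

```python
import hashlib

def ingest(text: str, chunk_chars: int = 400) -> dict[str, str]:
    """Sanitise, chunk, and content-address a source document.

    A sketch of Stage 0 only: the whitespace normalisation, fixed-size
    chunk boundaries, and sha256 addressing are assumptions.
    """
    text = " ".join(text.split())  # trivial sanitisation: collapse whitespace
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    # Address each chunk by the hash of its content, so identical chunks
    # deduplicate and addresses stay stable across re-ingestion.
    return {hashlib.sha256(c.encode()).hexdigest()[:12]: c for c in chunks}

chunks = ingest("Alice met   Bob on Tuesday to review the budget.")
```

Content addressing keeps the store idempotent under re-ingestion, which matters when the substrate is plain files under version control.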
This repository includes both benchmarked paths and broader system modules that are still being validated.
| Component | Status |
|---|---|
| Entity store and ontology | Implemented |
| Stages 0–2 | Implemented and benchmarked |
| Stages 3–5 | Implemented; selectively exercised |
| Multi-channel retrieval | Implemented and benchmarked |
| RAPTOR support | Implemented |
| Turbo backend | Implemented; still being tuned |
| Operational memory modules | Implemented; not the main public benchmark path |
| Lifecycle management | Implemented; limited evaluation coverage |
| MCP and security layers | Implemented; broader hardening in progress |
SMF is evaluated across multiple benchmarks for long-horizon conversational and agent memory.
| Benchmark | Scope | Status |
|---|---|---|
| LoCoMo | Long-conversation memory (5 categories, 1,986 QA pairs) | 1-conv results below; full 10-conv runs in progress |
| LongMemEval | Long-term memory across sessions | Harness integrated, runs in progress |
| BEAM | Large-scale memory (1M+ tokens) | Harness integrated, runs in progress |
Results will be updated as runs complete across all three benchmarks.
All J-scores below use a dedicated GPT-4.1 judge that is independent of the QA model. Earlier configurations that used self-judging (the QA model evaluating its own answers) produced J-scores inflated by up to 0.20 and have been removed.
| Configuration | J-score | F1 | Matches | Notes |
|---|---|---|---|---|
| Sonnet 4.6 store + Cohere rerank | 70.9% | 0.541 | 141/199 | Best overall |
| Full retrieval (Sonnet store) | 70.4% | 0.597 | 140/199 | All channels enabled |
| Baseline retrieval (Sonnet store) | 70.4% | 0.557 | 140/199 | BM25 + graph + temporal only |
| Groq 70B (all stages) | 65.8% | 0.505 | 131/199 | — |
| Groq 8B (all stages) | 62.3% | 0.515 | 124/199 | Structure carries even with 8B |
Key findings:
- Structure > retrieval sophistication. Stripping the retrieval stack down to BM25 and graph traversal produces the same J-score (70.4%) and match count (140/199) as the full stack with embeddings, RAPTOR, neural reranking, and all multi-stage retrievers.
- Structure > model scale. Moving from 8B to Sonnet (50x scale) improves J-score by only 8.1 percentage points. The filesystem structure carries the performance.
- Self-judging inflates scores. Our own earlier configurations scored up to J=0.91 when the QA model judged itself. Under a dedicated judge, the same architectures score 0.62–0.71. This ~0.20 inflation is comparable to what we observe across the ecosystem.
This repository takes an explicit position: LoCoMo results across systems are not directly comparable. The ecosystem has no standard evaluation protocol; systems use different judge models, different judge prompts, different category subsets, and different metrics.
What we found:
- The most widely adopted judge prompt instructs: "be generous...as long as it touches on the same topic." An independent audit found this accepts 62.8% of intentionally wrong answers.
- One system reports J=0.912 but F1=0.279 — 91% "correct" by a lenient judge, but only 28% token overlap with gold answers.
- Another excludes adversarial questions (the hardest category) from its reported 90.1%.
- Reproducibility gaps of 17–54 percentage points exist for the same system on the same benchmark.
What we do differently: dedicated GPT-4.1 judge (independent of QA model), strict prompt ("same core fact/meaning"), all 5 categories included, both F1 and J-score reported, and rejudge.py provided for independent verification.
For the curious: under ecosystem-standard practices (lenient judge, gpt-4o-mini, frontier QA model), our internal estimates place SMF at 88–92%. We report 70.4% instead.
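The gap between J-score and F1 above is easier to read with the F1 definition in hand: token-level overlap between the predicted and gold answers, SQuAD-style. A minimal sketch (the normalisation rule here is a simplification of typical harness code):

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """SQuAD-style token-overlap F1.

    Normalisation (lowercasing, whitespace split) is an assumption;
    real harnesses also strip punctuation and articles.
    """
    pred = prediction.lower().split()
    ref = gold.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# A lenient judge can mark a verbose answer "correct" while F1 stays low:
score = token_f1(
    "They discussed it during the quarterly budget review meeting",
    "the budget review",
)
```

This is why a system can report J=0.912 alongside F1=0.279: the judge accepts topical answers that share few tokens with the gold answer.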
| Dimension | SMF | Conventional memory systems |
|---|---|---|
| Primary store | Filesystem | Database / vector store |
| Relations | Symlinks | Hidden graph edges / foreign keys |
| Inspection | POSIX-native | Product-specific tooling |
| Versioning | Git-native | Usually secondary |
| Portability | Filesystem operations | Export / import workflows |
| Failure analysis | Inspect the memory directly | Inspect layers around it |
```shell
uv sync
uv run smf doctor
uv run pytest -q
uv run smf ingest path/to/transcript.txt
uv run smf daemon
uv run smf-mcp

smf benchmark-run-locomo --preset score --graph-first \
    --force-provider groq --judge-model gpt-4.1 \
    --out data/results.json
smf benchmark-run-locomo --prompt-set competitive
```
```shell
smf benchmark-run-locomo --judge-mode strict
```

```
smf/
├── api/         FastAPI server
├── benchmark/   LoCoMo, LongMemEval, BEAM harnesses
├── cli/         Typer CLI
├── core/        Config and core models
├── daemon/      Background execution and scheduling
├── inference/   Provider integrations
├── lifecycle/   Memory lifecycle management
├── memory/      Operational memory layer
├── mcp/         MCP server
├── pipeline/    Six-stage processing pipeline
├── qa/          Answer generation and prompt sets
├── search/      Retrieval stack
├── security/    ACLs, redaction, agent scoping
├── storage/     Entity store and provenance
└── turbo/       Optional acceleration layer
```
```
GROQ_API_KEY=
CEREBRAS_API_KEY=
ANTHROPIC_API_KEY=
OPENAI_API_KEY=
GEMINI_API_KEY=
```

- full 10-conversation LoCoMo evaluation with dedicated judge
- LongMemEval and BEAM benchmark runs
- re-evaluate earlier configurations (C1–C8) with dedicated GPT-4.1 judge
- expand validation for Turbo, lifecycle, and operational-memory paths
- harden MCP and security workflows for broader deployments
See LICENSE.