# SMF

Filesystem-native memory infrastructure for AI agents and organisational knowledge.
Semantic Memory Filesystem (SMF) is a research-grade memory architecture built around a simple proposition: the filesystem itself can serve as the primary substrate for agent memory.
Directories represent entity classes. Files represent entities. Symbolic links represent relationships. Standard POSIX operations become part of the retrieval surface.
Rather than wrapping a database in a filesystem metaphor, SMF treats the filesystem as the actual store. This makes the memory layer directly inspectable, versionable, portable, and auditable.
SMF is built for legibility.
Most memory systems hide structure behind APIs, vector indexes, or orchestration layers. SMF keeps the structure exposed:
- entities are stored as ordinary filesystem objects
- relations are encoded with symlinks
- provenance is attached to the objects themselves
- retrieval combines lexical, semantic, graph, temporal, fact, and auxiliary channels over an inspectable substrate
- the store remains compatible with ordinary shell tools and Git workflows
The result is a memory system that is both machine-usable and human-readable.
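As a minimal sketch of the core idea (the directory names and file contents below are illustrative only, not SMF's actual schema):

```python
import os
from pathlib import Path

# Illustrative layout; SMF's real ontology and file format may differ.
store = Path("memory")
(store / "actors").mkdir(parents=True, exist_ok=True)
(store / "decisions").mkdir(parents=True, exist_ok=True)

# Entities are ordinary files.
actor = store / "actors" / "ada.md"
actor.write_text("name: Ada\nrole: engineer\n")

decision = store / "decisions" / "adopt-uv.md"
decision.write_text("summary: adopt uv for dependency management\n")

# A relation is a symlink from one entity to another.
link = store / "decisions" / "adopt-uv.made-by"
if not link.is_symlink():
    os.symlink(os.path.relpath(actor, link.parent), link)

# Standard POSIX operations become the retrieval surface.
print(link.resolve().read_text().splitlines()[0])  # name: Ada
```

Because the store is just files and symlinks, `find`, `grep`, and `git log` work on it unchanged.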
```
Source Material
       ↓
Stage 0 · INGEST          sanitisation, chunking, addressing
Stage 1 · EXTRACT         entities, summaries, facts, events, topics
Stage 2 · LINK            resolution, typed relationships, graph construction
Stage 3 · ENRICH          derived signals, profiles, secondary structure
Stage 4 · SYNTHESIZE      smart folders, materialised views
Stage 5 · META-REFLECT    confidence adjustment, maintenance, lifecycle logic
       ↓
Memory Store              actors/  interactions/  vco/  decisions/
                          rationale/  time/  events/  topics/
       ↓
Retrieval                 BM25 · embeddings · graph traversal · temporal
                          filtering · fact search · event search ·
                          auxiliary memory channels
```
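The retrieval channels each return an independent ranking. One common way to merge such rankings is reciprocal rank fusion; whether SMF uses RRF or a different fusion rule is not specified here, so treat this as an illustrative merge, not the documented one:

```python
from collections import defaultdict

def rrf(channel_rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists with reciprocal rank fusion.

    Each inner list is one channel's ranking, best result first.
    RRF is an assumed fusion rule for illustration, not SMF's actual code.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in channel_rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

merged = rrf([
    ["fact:budget", "event:review"],   # e.g. BM25
    ["event:review", "actor:ada"],     # e.g. graph traversal
    ["event:review", "fact:budget"],   # e.g. temporal filtering
])
# event:review appears near the top of every channel, so it ranks first
```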
The repository currently exposes an eight-class ontology, a six-stage pipeline, and a multi-channel retrieval stack. Public benchmark reporting is centered on LoCoMo and on the effects of prompt and judge methodology.
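Stage 0 of the pipeline (sanitisation, chunking, addressing) can be sketched as content-addressed chunking. The hash scheme, chunk size, and sanitisation rule here are assumptions for illustration; SMF's actual Stage 0 may differ:

```python
import hashlib

def ingest(text: str, chunk_chars: int = 400) -> dict[str, str]:
    """Sanitise, chunk, and content-address a source document.

    A sketch of Stage 0 only: the whitespace normalisation, fixed-size
    chunk boundaries, and sha256 addressing are assumptions.
    """
    text = " ".join(text.split())  # trivial sanitisation: collapse whitespace
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    # Address each chunk by the hash of its content, so identical chunks
    # deduplicate and addresses stay stable across re-ingestion.
    return {hashlib.sha256(c.encode()).hexdigest()[:12]: c for c in chunks}

chunks = ingest("Alice met   Bob on Tuesday to review the budget.")
```

Content addressing keeps the store idempotent under re-ingestion, which matters when the substrate is plain files under version control.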
This repository includes both benchmarked paths and broader system modules that are still being validated.
| Component | Status |
|---|---|
| Entity store and ontology | Implemented |
| Stages 0–2 | Implemented and benchmarked |
| Stages 3–5 | Implemented; selectively exercised |
| Multi-channel retrieval | Implemented and benchmarked |
| RAPTOR support | Implemented |
| Turbo backend | Implemented; still being tuned |
| Operational memory modules | Implemented; not the main public benchmark path |
| Lifecycle management | Implemented; limited evaluation coverage |
| MCP and security layers | Implemented; broader hardening in progress |
SMF is evaluated across multiple benchmarks for long-horizon conversational and agent memory.
| Benchmark | Scope | Status |
|---|---|---|
| LoCoMo | Long-conversation memory (5 categories, 1,986 QA pairs) | 1-conv results below; full 10-conv runs in progress |
| LongMemEval | Long-term memory across sessions | Harness integrated, runs in progress |
| BEAM | Large-scale memory (1M+ tokens) | Harness integrated, runs in progress |
Results will be updated as runs complete across all three benchmarks.
All J-scores below use a dedicated GPT-4.1 judge that is independent of the QA model. Earlier configurations that used self-judging (the QA model evaluating its own answers) produced J-scores inflated by up to 0.20 and have been removed.
| Configuration | J-score | F1 | Matches | Notes |
|---|---|---|---|---|
| Sonnet 4.6 store + Cohere rerank | 70.9% | 0.541 | 141/199 | Best overall |
| Full retrieval (Sonnet store) | 70.4% | 0.597 | 140/199 | All channels enabled |
| Baseline retrieval (Sonnet store) | 70.4% | 0.557 | 140/199 | BM25 + graph + temporal only |
| Groq 70B (all stages) | 65.8% | 0.505 | 131/199 | — |
| Groq 8B (all stages) | 62.3% | 0.515 | 124/199 | Structure carries even with 8B |
Key findings:
- Structure > retrieval sophistication. Stripping the retrieval stack down to BM25 and graph traversal produces the same J-score (70.4%) and match count (140/199) as the full stack with embeddings, RAPTOR, neural reranking, and all multi-stage retrievers.
- Structure > model scale. Moving from 8B to Sonnet (50x scale) improves J-score by only 8.1 percentage points. The filesystem structure carries the performance.
- Self-judging inflates scores. Our own earlier configurations scored up to J=0.91 when the QA model judged itself. Under a dedicated judge, the same architectures score 0.62–0.71. This ~0.20 inflation is comparable to what we observe across the ecosystem.
This repository takes an explicit position: LoCoMo results across systems are not directly comparable. The ecosystem has no standard evaluation protocol; systems use different judge models, different judge prompts, different category subsets, and different metrics.
What we found:
- The most widely adopted judge prompt instructs: "be generous...as long as it touches on the same topic." An independent audit found this accepts 62.8% of intentionally wrong answers.
- One system reports J=0.912 but F1=0.279 — 91% "correct" by a lenient judge, but only 28% token overlap with gold answers.
- Another excludes adversarial questions (the hardest category) from its reported 90.1%.
- Reproducibility gaps of 17–54 percentage points exist for the same system on the same benchmark.
What we do differently: dedicated GPT-4.1 judge (independent of QA model), strict prompt ("same core fact/meaning"), all 5 categories included, both F1 and J-score reported, and rejudge.py provided for independent verification.
For the curious: under ecosystem-standard practices (lenient judge, gpt-4o-mini, frontier QA model), our internal estimates place SMF at 88–92%. We report 70.4% instead.
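The gap between J-score and F1 above is easier to read with the F1 definition in hand: token-level overlap between the predicted and gold answers, SQuAD-style. A minimal sketch (the normalisation rule here is a simplification of typical harness code):

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """SQuAD-style token-overlap F1.

    Normalisation (lowercasing, whitespace split) is an assumption;
    real harnesses also strip punctuation and articles.
    """
    pred = prediction.lower().split()
    ref = gold.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# A lenient judge can mark a verbose answer "correct" while F1 stays low:
score = token_f1(
    "They discussed it during the quarterly budget review meeting",
    "the budget review",
)
```

This is why a system can report J=0.912 alongside F1=0.279: the judge accepts topical answers that share few tokens with the gold answer.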
| Dimension | SMF | Conventional memory systems |
|---|---|---|
| Primary store | Filesystem | Database / vector store |
| Relations | Symlinks | Hidden graph edges / foreign keys |
| Inspection | POSIX-native | Product-specific tooling |
| Versioning | Git-native | Usually secondary |
| Portability | Filesystem operations | Export / import workflows |
| Failure analysis | Inspect the memory directly | Inspect layers around it |
```shell
uv sync
uv run smf doctor
uv run pytest -q
uv run smf ingest path/to/transcript.txt
uv run smf daemon
uv run smf-mcp

smf benchmark-run-locomo --preset score --graph-first \
    --force-provider groq --judge-model gpt-4.1 \
    --out data/results.json
smf benchmark-run-locomo --prompt-set competitive
```
```shell
smf benchmark-run-locomo --judge-mode strict
```

```
smf/
├── api/         FastAPI server
├── benchmark/   LoCoMo, LongMemEval, BEAM harnesses
├── cli/         Typer CLI
├── core/        Config and core models
├── daemon/      Background execution and scheduling
├── inference/   Provider integrations
├── lifecycle/   Memory lifecycle management
├── memory/      Operational memory layer
├── mcp/         MCP server
├── pipeline/    Six-stage processing pipeline
├── qa/          Answer generation and prompt sets
├── search/      Retrieval stack
├── security/    ACLs, redaction, agent scoping
├── storage/     Entity store and provenance
└── turbo/       Optional acceleration layer
```
```
GROQ_API_KEY=
CEREBRAS_API_KEY=
ANTHROPIC_API_KEY=
OPENAI_API_KEY=
GEMINI_API_KEY=
```

- full 10-conversation LoCoMo evaluation with dedicated judge
- LongMemEval and BEAM benchmark runs
- re-evaluate earlier configurations (C1–C8) with dedicated GPT-4.1 judge
- expand validation for Turbo, lifecycle, and operational-memory paths
- harden MCP and security workflows for broader deployments
See LICENSE.