Dynamis Labs · Semantic Memory Filesystem

SMF
Filesystem-native memory infrastructure for AI agents and organisational knowledge

github.com/Dynamis-Labs/SMF


Abstract

Semantic Memory Filesystem (SMF) is a research-grade memory architecture built around a simple proposition: the filesystem itself can serve as the primary substrate for agent memory.

Directories represent entity classes. Files represent entities. Symbolic links represent relationships. Standard POSIX operations become part of the retrieval surface.

Rather than wrapping a database in a filesystem metaphor, SMF treats the filesystem as the actual store. This makes the memory layer directly inspectable, versionable, portable, and auditable.
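The mapping can be sketched with ordinary shell commands. The directory and file names below are illustrative only, not the actual SMF layout:

```shell
# Directories are entity classes, files are entities, symlinks are relations.
store=$(mktemp -d)
mkdir -p "$store/actors" "$store/decisions"

# An entity is just a file.
printf 'name: Ada\nrole: engineer\n' > "$store/actors/ada.md"
printf 'summary: adopt SMF\n'        > "$store/decisions/d-001.md"

# A relation is a symlink from one entity's neighbourhood to another entity.
mkdir -p "$store/decisions/d-001.links"
ln -s ../../actors/ada.md "$store/decisions/d-001.links/decided-by"

# Standard POSIX tools now read the memory directly.
readlink "$store/decisions/d-001.links/decided-by"   # prints: ../../actors/ada.md
```

Nothing here requires a client library: `ls`, `cat`, and `readlink` are already a complete read path over the store.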


Design

SMF is built for legibility.

Most memory systems hide structure behind APIs, vector indexes, or orchestration layers. SMF keeps the structure exposed:

  • entities are stored as ordinary filesystem objects
  • relations are encoded with symlinks
  • provenance is attached to the objects themselves
  • retrieval combines lexical, semantic, graph, temporal, fact, and auxiliary channels over an inspectable substrate
  • the store remains compatible with ordinary shell tools and Git workflows

The result is a memory system that is both machine-usable and human-readable.


System Structure

```
Source Material
    ↓
Stage 0 · INGEST
    sanitisation, chunking, addressing

Stage 1 · EXTRACT
    entities, summaries, facts, events, topics

Stage 2 · LINK
    resolution, typed relationships, graph construction

Stage 3 · ENRICH
    derived signals, profiles, secondary structure

Stage 4 · SYNTHESIZE
    smart folders, materialised views

Stage 5 · META-REFLECT
    confidence adjustment, maintenance, lifecycle logic

Memory Store
    actors/
    interactions/
    vco/
    decisions/
    rationale/
    time/
    events/
    topics/

Retrieval
    BM25
    embeddings
    graph traversal
    temporal filtering
    fact search
    event search
    auxiliary memory channels
```

The repository currently exposes an eight-class ontology, a six-stage pipeline, and a multi-channel retrieval stack. Public benchmark reporting is centered on LoCoMo and on the effects of prompt and judge methodology.
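As a toy illustration of the retrieval stack's premise, two of the channels reduce to POSIX primitives over the store. The paths below are invented for the sketch, and the real stack layers ranking and fusion on top:

```shell
store=$(mktemp -d)
mkdir -p "$store/topics"
printf 'Ada raised the SMF migration on 2024-05-01\n' > "$store/topics/migration.md"

# Lexical channel ~ grep over entity files.
grep -rl 'migration' "$store"

# Temporal channel ~ filesystem timestamps via find predicates (GNU find).
find "$store" -name '*.md' -newermt '2000-01-01'
```

Because every channel ultimately resolves to files and links, a failed retrieval can be debugged with the same two commands.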


Implementation Status

This repository includes both benchmarked paths and broader system modules that are still being validated.

| Component | Status |
|---|---|
| Entity store and ontology | Implemented |
| Stages 0–2 | Implemented and benchmarked |
| Stages 3–5 | Implemented; selectively exercised |
| Multi-channel retrieval | Implemented and benchmarked |
| RAPTOR support | Implemented |
| Turbo backend | Implemented; still being tuned |
| Operational memory modules | Implemented; not the main public benchmark path |
| Lifecycle management | Implemented; limited evaluation coverage |
| MCP and security layers | Implemented; broader hardening in progress |

Evaluation

SMF is evaluated across multiple benchmarks for long-horizon conversational and agent memory.

Active Benchmarks

| Benchmark | Scope | Status |
|---|---|---|
| LoCoMo | Long-conversation memory (5 categories, 1,986 QA pairs) | 1-conv results below; full 10-conv runs in progress |
| LongMemEval | Long-term memory across sessions | Harness integrated, runs in progress |
| BEAM | Large-scale memory (1M+ tokens) | Harness integrated, runs in progress |

Results will be updated as runs complete across all three benchmarks.

LoCoMo Results (dedicated GPT-4.1 judge, strict evaluation)

All J-scores below use a dedicated GPT-4.1 judge independent of the QA model. Earlier configurations that used self-judging (the QA model evaluating its own answers) produced inflated J-scores up to 0.20 higher and have been removed.

| Configuration | J-score | F1 | Matches | Notes |
|---|---|---|---|---|
| Sonnet 4.6 store + Cohere rerank | 70.9% | 0.541 | 141/199 | Best overall |
| Full retrieval (Sonnet store) | 70.4% | 0.597 | 140/199 | All channels enabled |
| Baseline retrieval (Sonnet store) | 70.4% | 0.557 | 140/199 | BM25 + graph + temporal only |
| Groq 70B (all stages) | 65.8% | 0.505 | 131/199 | |
| Groq 8B (all stages) | 62.3% | 0.515 | 124/199 | Structure carries even with 8B |

Key findings:

  • Structure > retrieval sophistication. Stripping the retrieval stack down to BM25 and graph traversal yields the same J-score (70.4%) and match count (140/199) as the full stack with embeddings, RAPTOR, neural reranking, and all multi-stage retrievers.
  • Structure > model scale. Moving from 8B to Sonnet (50x scale) improves J-score by only 8.1 percentage points. The filesystem structure carries the performance.
  • Self-judging inflates scores. Our own earlier configurations scored up to J=0.91 when the QA model judged itself. Under a dedicated judge, the same architectures score 0.62–0.71. This ~0.20 inflation is comparable to what we observe across the ecosystem.

Evaluation methodology

This repository takes an explicit position that LoCoMo results across systems are not directly comparable: the ecosystem has no standard evaluation protocol, so systems use different judge models, different judge prompts, different category subsets, and different metrics.

What we do differently: a dedicated GPT-4.1 judge (independent of the QA model), a strict judge prompt ("same core fact/meaning"), all 5 categories included, both F1 and J-score reported, and rejudge.py provided for independent verification.

For the curious: under ecosystem-standard practices (lenient judge, gpt-4o-mini, frontier QA model), our internal estimates place SMF at 88–92%. We report 70.4% instead.


Why This Substrate

| Dimension | SMF | Conventional memory systems |
|---|---|---|
| Primary store | Filesystem | Database / vector store |
| Relations | Symlinks | Hidden graph edges / foreign keys |
| Inspection | POSIX-native | Product-specific tooling |
| Versioning | Git-native | Usually secondary |
| Portability | Filesystem operations | Export / import workflows |
| Failure analysis | Inspect the memory directly | Inspect layers around it |
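Because the store is plain files, Git-native versioning is literal: memory history is ordinary Git history over the store directory. An illustrative sketch (not an SMF command):

```shell
store=$(mktemp -d)
cd "$store"
git init -q
mkdir actors
printf 'name: Ada\n' > actors/ada.md
git add actors/ada.md
git -c user.name=smf -c user.email=smf@example.com commit -qm 'remember actor: ada'

# Memory history is just Git history; memory diffs are Git diffs.
git log --oneline -- actors/ada.md
```

The same mechanism gives rollback (`git revert`), branching of memory states, and audit via `git blame` for free.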

Quick Start

```shell
uv sync
uv run smf doctor
uv run pytest -q
```

```shell
uv run smf ingest path/to/transcript.txt
uv run smf daemon
uv run smf-mcp
```

Benchmark commands

```shell
smf benchmark-run-locomo --preset score --graph-first \
  --force-provider groq --judge-model gpt-4.1 \
  --out data/results.json

smf benchmark-run-locomo --prompt-set competitive
smf benchmark-run-locomo --judge-mode strict
```

Repository Layout

```
smf/
├── api/          FastAPI server
├── benchmark/    LoCoMo, LongMemEval, BEAM harnesses
├── cli/          Typer CLI
├── core/         Config and core models
├── daemon/       Background execution and scheduling
├── inference/    Provider integrations
├── lifecycle/    Memory lifecycle management
├── memory/       Operational memory layer
├── mcp/          MCP server
├── pipeline/     Six-stage processing pipeline
├── qa/           Answer generation and prompt sets
├── search/       Retrieval stack
├── security/     ACLs, redaction, agent scoping
├── storage/      Entity store and provenance
└── turbo/        Optional acceleration layer
```

Environment

```
GROQ_API_KEY=
CEREBRAS_API_KEY=
ANTHROPIC_API_KEY=
OPENAI_API_KEY=
GEMINI_API_KEY=
```

Roadmap

  • full 10-conversation LoCoMo evaluation with dedicated judge
  • LongMemEval and BEAM benchmark runs
  • re-evaluate earlier configurations (C1–C8) with dedicated GPT-4.1 judge
  • expand validation for Turbo, lifecycle, and operational-memory paths
  • harden MCP and security workflows for broader deployments

License

See LICENSE.
