mindsdb · torrmal · May 14, 2026
diff --git a/anton/README.md b/anton/README.md
@@ -614,8 +614,7 @@ Modes (env var `ANTON_ACC_MODE`, mirrors `ANTON_MEMORY_MODE`):
 | `passive` (default) | Layer 1 only. Lessons drain to memory at end-of-turn; next turn's system prompt picks them up. Adds zero surface to the turn loop. |
 | `active` | Layer 1 + Layer 2. Lessons ALSO inject inline as text blocks in `tool_results` so the LLM sees them on the very next round. Stronger learning signal; more invasive. |
 
-What is NOT yet wired (deliberate):
-  - **Layer 3 — retrieval-scored rule ranking.** At system-prompt assembly, score each candidate rule by relevance to the current turn's context and load the top-K within the token budget. Per-rule retrieval counters age out rules that never make the cut. Pairs with optional outcome-tracking (did the rule reduce its target pattern after it landed?). Needs a small embedding index over rules + a ranker call on the load path.
+  - **Layer 3 — retrieval-scored rule ranking.** Built. `Cortex.build_memory_context()` now routes `## When` rules through a BM25 ranker (`anton/core/memory/ranker.py`) scored against the current user message; only the top-K within the char budget land in the prompt. `## Always` / `## Never` rules bypass the ranker — they're unconditional by definition. Every rule that lands gets a retrieval counter bump in `rules.stats.json` (the sidecar at `anton/core/memory/rule_stats.py`). The Phase C outcome bridge wires the ACC's end-of-turn flush back into stats: when a detector fires AND its corresponding rule was loaded this turn, the rule's `ignored` counter bumps — high `ignored` is the consolidator's signal to rewrite or escalate. `/memory rankings` is the debug surface.
 
 ### Vocabulary discipline
 
@@ -659,6 +658,67 @@ Detectors are stateless functions of `Sequence[Event] → Lesson | None`. Each d
 
 Like the cerebellum, the ACC is a *producer*. It does not own storage. Lessons it generates flow into the same Engram pipeline that the cerebellum and consolidator already use. The de-dupe predicate is caller-supplied (`has_similar_lesson`) so the wiring layer can choose substring, embedding, or semantic similarity without changing ACC internals.
 
+## Layer 3 — Retrieval-Scored Rule Ranking
+
+Layers 1 and 2 produce rules; Layer 3 decides which ones to load on any given turn. The Cortex no longer dumps every `## When` rule into the system prompt — it scores each rule by relevance to the current user message and loads the top-K within budget. The mechanism is BM25 (lexical) rather than embeddings (semantic) because the corpus is tiny (<50 rules typically), the rules are 1–3 sentences, and rules + user messages share domain nouns. Microseconds per call, no LLM dependency.
+
+### What ranks vs. what doesn't
+
+| Section in `rules.md` | Treatment | Reason |
+|---|---|---|
+| `## Always` | Loaded in full every turn | Unconditional — ranking would defeat the point |
+| `## Never` | Loaded in full every turn | Unconditional |
+| `## When` | Ranked by BM25 relevance, top-K within budget | Conditional rules ARE relevance-shaped by construction |
+
+### Pieces
+
+| Module | Role |
+|---|---|
+| `anton/core/memory/ranker.py` | `Ranker.rank(rules, query)` → BM25-scored `RankedRule`s. `Ranker.select_within_budget(ranked, budget_tokens, floor_k, cap_k)` → final selection. No LLM call, no API key, deterministic. |
+| `anton/core/memory/rule_stats.py` | `RuleStats` sidecar at `~/.anton/memory/rules.stats.json`. Buffer-and-flush write pattern — `record_retrieval` / `record_ignored` are in-memory dict updates; `flush()` does a single atomic `.tmp + os.replace` under `fcntl.flock`. One disk write per turn, not one per rule. |
+| `anton/core/memory/cortex.py` | `_retrieve_relevant_rules` rewrites the `## When` section through the ranker and records retrievals. `consume_retrieved_this_turn()` exposes the per-turn rule-id set to the outcome bridge. |
+| `anton/core/session.py` | `_schedule_acc_flush` now consults the per-turn retrieval set and bumps `ignored` on rules whose ACC-detected pattern fired despite being loaded. |
+| `anton/memory/manage.py` | `/memory rankings` debug surface. Highlights noisy rules (high `ignored`) and cold rules (zero retrievals). |
+
+### Cold-start behaviour
+
+- No user message yet OR query has no scorable terms after stopword removal → all rules loaded in input order. Ranker only filters under budget pressure.
+- Corpus under `_RULES_BUDGET_CHARS` (~6000 chars) → no ranking; full corpus loaded.
+- New rule (just encoded) → starts at zero retrievals/ignored, isn't penalised at tiebreak. First retrieval creates its record.
+
+### Stable rule identity
+
+Stats key: `sha256(rule.text.strip().lower())[:16]`. Stable for the rule's lifetime — but a consolidator rewrite changes the hash and resets the counters. Acceptable v1 trade-off; v2 should attach a UUID in the rule's HTML-comment metadata so edits preserve identity. Without that, large-scale rephrasing zeroes out the very telemetry we'd use to decide which rules to keep.
+
+### The outcome bridge (Phase C)
+
+Layer 1's `_schedule_acc_flush()` already drains lessons through `cortex.encode()`. Layer 3 adds one step before encoding:
+
+1. Get the ACC's fired lessons via `at_end_of_turn()`.
+2. Call `cortex.consume_retrieved_this_turn()` — returns the set of rule IDs that landed in this turn's prompt, and clears the set.
+3. For each fired lesson, if its rule-ID is in the retrieved set, bump `rule_stats.record_ignored(rule.rule)`. The LLM saw the rule and the pattern fired anyway — that's a strong "this rule isn't sticking" signal.
+4. Flush stats. Then encode the engrams as before.
+
+Brand-new lessons (never been retrieved because the rule was just created) correctly skip the bump — the LLM can't be ignoring a rule it hasn't seen.
+
+### Debug surface — `/memory rankings`
+
+```
+$ anton  → /memory rankings
+
+Rule rankings (retrieval-scoring telemetry)
+
+   RETR   IGN  LAST        RULE
+     12     0  2026-05-14  Use ONE scratchpad name per task and reuse it for every cell...
+     11     2  2026-05-14  When a tool fails, don't retry with the same arguments...
+      7     0  2026-05-13  When the same error message appears repeatedly in one turn...
+      3     0  2026-05-12  Don't reset the scratchpad to recover from errors...
+      1     0  2026-04-30  For CSV files with mixed column types, pass low_memory=False...
+      0     0  —           Use httpx instead of requests for HTTP calls.
+```
+
+Noisy rules (`IGN > 0`) render in warning color — candidates for rewriting or escalation. Cold rules (`RETR = 0`) render dim — candidates for compaction. The consolidator can later read `rules.stats.json` directly to drive automated aging-out.
+
 ## Structured Output — `LLMClient.generate_object`
 
 Anton has a single primitive for getting structured data out of the LLM, used by the cerebellum, the consolidator, the cortex's identity/compaction passes, the connect collector, the skill drafter, and the custom-datasource flow. It lives at `anton/llm/client.py`:
@@ -778,6 +838,8 @@ anton/core/memory/                 LONG-TERM MEMORY (brain-mapped modules)
 ├── consolidator.py     Consolidator class (sleep-replay → Engrams)
 ├── cerebellum.py       Cerebellum class (per-cell supervised error learning)
 ├── acc.py              AnteriorCingulate class (turn-level pattern error detection)
+├── ranker.py           BM25 ranker for retrieval-scored rule selection (Layer 3)
+├── rule_stats.py       Per-rule retrieval/ignored counter sidecar (Layer 3)
 └── skills.py           Skill, SkillStore, SkillStats — procedural memory storage layer
 
 anton/memory/                      LEGACY / ORTHOGONAL (not the brain-mapped memory system)

diff --git a/anton/core/memory/cortex.py b/anton/core/memory/cortex.py
@@ -27,6 +27,8 @@
 from anton.core.memory.base import HippocampusProtocol
 from anton.core.memory.base import Engram
 from anton.core.memory.hippocampus import Hippocampus
+from anton.core.memory.ranker import Ranker
+from anton.core.memory.rule_stats import RuleStats, rule_id
 
 if TYPE_CHECKING:
     from anton.core.llm.client import LLMClient
@@ -129,6 +131,32 @@ def __init__(
         self._llm = llm_client
         self._turn_count = 0
 
+        # Layer 3 — retrieval-scored rule ranking.
+        # Stateless BM25 ranker (no LLM call, no API key). The
+        # cortex re-uses it for both global and project rule paths.
+        self._ranker = Ranker()
+        # Per-rule retrieval / outcome counters. Sidecar JSON lives
+        # alongside global rules.md (one file across project switches —
+        # rules can fire either scope). Best-effort: when the
+        # hippocampus is a remote / protocol-only backend without a
+        # local `_dir`, RuleStats stays None and the cortex skips the
+        # counter bumps without losing the ranker behaviour itself.
+        global_dir = getattr(global_hc, "_dir", None)
+        self._rule_stats: RuleStats | None = (
+            RuleStats(Path(global_dir) / "rules.stats.json")
+            if isinstance(global_dir, (str, Path))
+            else None
+        )
+        # Phase C — outcome bridge. Cumulative set of rule IDs that
+        # landed in this turn's system prompt across however many
+        # `build_memory_context` calls happened (a turn may rebuild
+        # the prompt mid-flight on retries/compaction). The ACC's
+        # end-of-turn flush drains this via `consume_retrieved_this_turn`
+        # and, for each lesson whose rule_id is in the set, bumps the
+        # corresponding rule's `ignored` counter — the LLM saw the
+        # rule and the pattern still fired.
+        self._retrieved_this_turn: set[str] = set()
+
         # One-time migration: identity is singular and global. Any entries that
         # landed in project scope from the old encode() bug are merged upward.
         # Global wins on key conflicts — orphaned entries are likely stale
@@ -213,28 +241,121 @@ async def build_memory_context(self, user_message: str = "") -> str:
         if minds_topic:
             sections.append(f"## Minds — Datasource Context\n{minds_topic}")
 
+        # Layer 3 — flush the buffered retrieval counters once per
+        # build (one disk write per turn, not one per rule). Best-
+        # effort: if there's no stats backing store (remote
+        # hippocampus, missing dir), this is a no-op.
+        if self._rule_stats is not None:
+            try:
+                self._rule_stats.flush()
+            except OSError:
+                # Stats are telemetry, not gating data — a failed write
+                # must not break system-prompt assembly.
+                pass
+
         if not sections:
             return ""
 
         return "\n\n" + "\n\n".join(sections)
 
-    async def _retrieve_relevant_rules(self, all_rules: str, user_message: str) -> str:
-        """Filter rules to only those relevant to the current user message.
+    # Regex to strip <!-- ... --> metadata from a rule line. Module-
+    # scoped at the class for readability; cheap to (re)compile.
+    import re as _re
+    _METADATA_RE = _re.compile(r"<!--.*?-->", _re.DOTALL)
 
-        Brain analog: dlPFC cue-dependent recall — the prefrontal cortex
-        selects which memories to activate based on current goals, rather
-        than loading everything into working memory.
+    def _extract_rule_body(self, line: str) -> str:
+        """Pull the human-readable rule text out of a bullet line.
 
-        Always/Never rules are behavioral constraints — always loaded in full.
-        Only conditional (When/If) rules are filtered by relevance.
-        If rules are under budget or no LLM is available, returns as-is.
+        ``- Use httpx instead of requests <!-- confidence:high ts:... -->``
+        →
+        ``Use httpx instead of requests``
+
+        Used as both the BM25 document AND the stable-hash input for
+        `RuleStats`, so the metadata comments (which carry per-rule
+        timestamps that change on every write) don't corrupt either.
         """
-        if not user_message or self._llm is None:
+        s = (line or "").strip()
+        if s.startswith("- "):
+            s = s[2:].strip()
+        s = self._METADATA_RE.sub("", s).strip()
+        return s
+
+    def _record_retrievals_for_lines(self, lines: list[str]) -> None:
+        """Bump retrieval counters for every rule-bullet line that
+        actually carries content. Section headers / blank lines are
+        skipped — they aren't rules. No-op when ``_rule_stats`` is
+        unavailable (remote backend, etc.).
+
+        Also populates ``self._retrieved_this_turn`` (rule-ID set)
+        so the Phase C outcome bridge can correlate fired lessons
+        against rules-that-were-actually-loaded."""
+        if self._rule_stats is None:
+            return
+        for line in lines:
+            stripped = line.strip()
+            if not stripped.startswith("- "):
+                continue
+            body = self._extract_rule_body(line)
+            if body:
+                self._rule_stats.record_retrieval(body)
+                self._retrieved_this_turn.add(rule_id(body))
+
+    def consume_retrieved_this_turn(self) -> set[str]:
+        """Return the set of rule IDs retrieved into the system prompt
+        since the last call, AND clear the set.
+
+        Take-and-clear: the consumer (typically the ACC end-of-turn
+        flush) reads the snapshot once per turn. Multiple consumers
+        would each see a different filtered view, which is rarely
+        what callers want — if more than one consumer needs the
+        signal, build a fan-out at the wiring layer instead.
+
+        Empty set is a valid answer (cold start, no rules in memory,
+        or remote hippocampus where stats tracking is disabled)."""
+        out = self._retrieved_this_turn
+        self._retrieved_this_turn = set()
+        return out
+
+    async def _retrieve_relevant_rules(self, all_rules: str, user_message: str) -> str:
+        """Select the rules that go into the system prompt.
+
+        Layer 3 — retrieval-scored rule ranking:
+
+          - ``## Always`` / ``## Never`` rules are unconditional and
+            always loaded in full. They're not ranked because ranking
+            unconditional rules is a category error.
+          - ``## When`` rules are ranked by BM25 relevance against the
+            current ``user_message``. The top-K within budget land in
+            the prompt; the rest are dropped for this turn.
+          - Every rule that lands in the prompt bumps its retrieval
+            counter via ``RuleStats``. Phase C (outcome bridge) will
+            later use these counters to compute an "ignored" signal
+            when ACC detects the corresponding pattern despite the
+            rule having been loaded.
+
+        Cold-start behaviour: when the corpus fits in the char budget
+        OR the user message has no scorable terms, all rules are
+        loaded and their retrievals recorded. The ranker is a
+        budget-pressure tool, not a permanent filter.
+
+        Brain analog: dlPFC cue-dependent recall. The PFC scores
+        relevance against current goals and activates the top
+        candidates rather than loading everything into working memory.
+        """
+        # No query → unfiltered. Still record retrievals so the
+        # telemetry is honest about what's in the prompt.
+        if not user_message:
+            self._record_retrievals_for_lines(all_rules.splitlines())
             return all_rules
+
+        # Under budget → no point ranking; load all + record.
         if len(all_rules) <= self._RULES_BUDGET_CHARS:
+            self._record_retrievals_for_lines(all_rules.splitlines())
             return all_rules
 
-        # Split rules into mandatory (Always/Never) and filterable (When)
+        # Split into mandatory (Always / Never / non-section lines) vs.
+        # rankable (When bullets). Section headers stay with mandatory
+        # so the output keeps its markdown structure.
         lines = all_rules.splitlines()
         mandatory_lines: list[str] = []
         when_lines: list[str] = []
@@ -250,7 +371,7 @@ async def _retrieve_relevant_rules(self, all_rules: str, user_message: str) -> s
                 mandatory_lines.append(line)
             elif stripped.startswith("## When"):
                 current_section = "when"
-                mandatory_lines.append(line)  # keep the header
+                mandatory_lines.append(line)
             elif stripped.startswith("## ") or stripped.startswith("# "):
                 current_section = ""
                 mandatory_lines.append(line)
@@ -259,37 +380,54 @@ async def _retrieve_relevant_rules(self, all_rules: str, user_message: str) -> s
             else:
                 mandatory_lines.append(line)
 
-        # If When section is small, no need to filter
+        # Tiny When section → no ranking work to do.
         when_text = "\n".join(when_lines).strip()
         if not when_text or len(when_text) < 1000:
+            self._record_retrievals_for_lines(lines)
             return all_rules
 
-        # Filter only the When rules
-        try:
-            response = await self._llm.code(
-                system=self._RULES_RETRIEVAL_PROMPT,
-                messages=[
-                    {
-                        "role": "user",
-                        "content": f"User message: {user_message}\n\nRules:\n{when_text}",
-                    }
-                ],
-                max_tokens=4096,
-            )
-            result = response.content.strip()
-            if result.upper() == "NONE":
-                filtered_when = ""
-            elif result:
-                filtered_when = result
-            else:
-                filtered_when = when_text
-        except Exception:
-            filtered_when = when_text
+        # Build (body, original_line) pairs so we can rank on bodies
+        # but emit the original markdown lines (preserving metadata
+        # comments the consumer / consolidator might still read).
+        candidates: list[tuple[str, str]] = []
+        for line in when_lines:
+            body = self._extract_rule_body(line)
+            if body:
+                candidates.append((body, line))
+
+        if not candidates:
+            self._record_retrievals_for_lines(lines)
+            return all_rules
+
+        bodies = [b for b, _ in candidates]
+        body_to_line = {b: l for b, l in candidates}
+
+        ranked = self._ranker.rank(bodies, user_message)
+
+        # Remaining char budget = total budget minus what mandatory
+        # lines already consume. Convert to a rough token budget
+        # (~4 chars/token English heuristic) for the ranker's selector.
+        mandatory_chars = sum(len(l) + 1 for l in mandatory_lines)
+        remaining_chars = max(0, self._RULES_BUDGET_CHARS - mandatory_chars)
+        remaining_tokens = max(100, remaining_chars // 4)
+        selected = self._ranker.select_within_budget(
+            ranked, budget_tokens=remaining_tokens
+        )
+
+        selected_lines: list[str] = []
+        for r in selected:
+            line = body_to_line.get(r.text)
+            if line is not None:
+                selected_lines.append(line)
+
+        # Record retrievals for everything that lands in the prompt —
+        # mandatory rules AND the selected When rules. (Section
+        # headers and blanks are filtered inside the helper.)
+        self._record_retrievals_for_lines(mandatory_lines + selected_lines)
 
-        # Reassemble: mandatory sections + filtered When rules
         output = "\n".join(mandatory_lines)
-        if filtered_when:
-            output += "\n" + filtered_when
+        if selected_lines:
+            output += "\n" + "\n".join(selected_lines)
         return output
 
     def get_scratchpad_context(self) -> str: