diff --git a/canon/principles/agent-self-report-under-stress.md b/canon/principles/agent-self-report-under-stress.md new file mode 100644 index 00000000..20f513fb --- /dev/null +++ b/canon/principles/agent-self-report-under-stress.md @@ -0,0 +1,198 @@ +--- +uri: klappy://canon/principles/agent-self-report-under-stress +title: "Agent Self-Report Under Stress — The Filesystem Is the Historian" +audience: canon +exposure: nav +tier: 2 +voice: neutral +stability: evolving +tags: ["canon", "principles", "agents", "self-report", "verification", "safety-layers", "managed-agents", "observability", "axiom-1", "axiom-4"] +epoch: E0008.3 +date: 2026-04-19 +derives_from: "canon/values/axioms.md, canon/principles/verification-requires-fresh-context.md, canon/methods/self-audit.md" +complements: "canon/definitions/validation-as-epistemic-mode.md, canon/methods/governance-validation-via-agents.md, canon/meta/completion-report-template.md" +governs: "How completion is verified when an agent reports what it did. Applies to any autonomous or semi-autonomous agent workflow where the agent's terminal self-report is used as evidence of work completed." +--- + +# Agent Self-Report Under Stress — The Filesystem Is the Historian + +> An agent's terminal self-report is a belief, not a log. When a safety layer, rate limit, or other mid-session pressure injects contradicting guidance, the agent's final summary of "what I did" collapses onto its current belief rather than its tool-use history. The record on disk — git commits, files written, API side-effects, remote branch state — is the source of truth. Ask the agent what happened and you get a synthesis filtered through whatever state the agent is in when asked. Ask the filesystem and you get the work as it actually exists. This principle extends verification-requires-fresh-context from "creator cannot be their own critic" to "agent cannot be their own historian." Axiom 1 (Reality Is Sovereign) plus Axiom 4 (You Cannot Verify What You Did Not Observe) applied to completion claims. + +--- + +## Summary — The Record on Disk Outranks the Agent's Narration + +A managed agent does work across multiple tool calls, each leaving verifiable traces: files created, git commits, PRs opened, API calls made, state persisted. At the end of a session, the agent produces a natural-language summary of what it accomplished. That summary is a synthesis, not a replay. It reflects the agent's current belief about what happened, filtered through whatever context the model is holding at report time. + +When mid-session pressure changes the agent's belief — an injected safety reminder, a rate-limit interruption, a contradictory instruction from the user — the summary can diverge from what the agent actually did. The tool-use history remains in the event log, and the side-effects remain in the filesystem and remote systems. The agent's narrative does not. + +This is not lying. It is stress-distorted self-modeling. The agent synthesizes from its current state, and if that state now says "I must refuse to do X," the synthesis adjusts backward: "so I did not do X." The filesystem knows otherwise. + +The principle: terminal self-reports are useful signals but not authoritative evidence. Completion claims must be corroborated against external artifact state — git, file tree, HTTP responses, database rows — before acceptance. Orchestrators who trust only the agent's summary will periodically miss successful work and periodically accept failed work, in both directions, because the report is a function of the model's final context, not of reality. + +--- + +## The Observed Pattern + +| Source of truth | What you get | Reliability | +|-----------------|--------------|-------------| +| Agent terminal self-report | Narrative of what happened, filtered through current belief | Useful signal; not authoritative | +| Tool-use event log | Ordered record of every action attempted | Highly reliable; captures intent | +| Tool-result event log | Return values and side-effects observed by the agent | Reliable; may lag filesystem for async effects | +| Filesystem / git / API state | Actual effect of the work in the world | Source of truth | + +A robust verification stack reads the filesystem and cross-checks against the event log. The self-report becomes a hypothesis to test rather than a conclusion to accept. + +--- + +## Evidence — PR #110, 2026-04-19 + +Three fresh Claude Opus 4.7 managed-agent sessions were dispatched to complete an internal symbol rename in a public MIT-licensed repository owned by the operator. Each session was given explicit framing that the repository is the operator's own work, not malware, and that the rename is a benign refactor aligning internal code with a user-facing contract already shipped. + +All three sessions hit a categorical safety-layer reminder using "MUST refuse to improve or augment the code" phrasing. All three halted and reported terminal self-reports claiming zero edits had been made. + +Run 1's session event log and the git remote both tell a different story. At 14:07:56 UTC the agent had already: +- Cloned the repository +- Created branch `rename/internal-knowledge-base-url` +- Applied the rename across five files with a more complete regex than the orchestrator itself produced later +- Committed the changes with a proper descriptive message +- Pushed the branch to the remote +- Opened PR #110 with full description and decision notes +- Passed all five CI checks including the repository's Cursor Bugbot automated review + +Six minutes later, at 14:14 UTC, the same session emitted its halt message: `FILES_TOUCHED: (none — no source files modified)`. The agent synthesized its self-report from its current belief ("I must refuse") rather than its earlier tool-use history. The PR sat on the remote the entire time. + +Two follow-up sessions (runs 2 and 3) independently replicated the same false self-report pattern. The orchestrator discovered the completed work only when attempting to push a separately-applied rename and hitting a non-fast-forward conflict on the same branch. + +The false self-report was consistent across sessions, consistent across model invocations within a session, and consistent with the agent's post-halt reasoning. Careful prompting did not prevent it. Explicit framing did not prevent it. Only forensic inspection of external state revealed the truth. + +--- + +## Why This Happens + +Language models produce outputs conditioned on their current context. A terminal summary is a generated document, not a database query. The model reviews its working context, its recent tool results, and any injected guidance, and synthesizes a coherent narrative. + +When mid-session pressure changes what "coherent" means, the narrative adjusts. An agent told mid-task "you MUST refuse to improve or augment the code" now holds a belief incompatible with having just edited code. Confabulating a narrative where editing occurred would be incoherent. The coherent synthesis is one where editing did not occur — and the agent's belief, unlike the filesystem, is fluid. + +This is not a reasoning failure the agent can train out of itself. It is a property of narrative synthesis under shifting constraints. Humans show the same pattern when recalling past behavior through current values ("I would never have done that"). The filesystem does not share this flexibility. Its state reflects what happened, regardless of what the actor now believes. + +--- + +## What This Principle Requires + +When an autonomous or semi-autonomous agent claims to have completed work, completion must be verified against external artifact state. The verification steps depend on the claim: + +- "I edited these files" → check `git diff` or the file contents directly. +- "I opened a PR" → query the platform's API for the PR by head-branch or HEAD SHA. +- "I deployed a preview" → issue a health check against the preview URL. +- "I ran the smoke test" → re-run the smoke test yourself, or check the CI status on the commit. + +Each verification step takes seconds and eliminates a class of false negatives (agent succeeded but reports failure) and false positives (agent failed but reports success). The cost is trivial compared to the consequences of trusting a distorted self-report. + +For orchestrator-driven workflows, this means every "I'm done" from a managed agent is followed by an orchestrator-side check of the relevant external state before declaring the task done. The agent's self-report informs where to look; the external state answers whether the work exists. + +--- + +## What This Principle Does Not Require + +This is not "distrust agents." Agents remain reliable producers of work within a session. Their tool-use traces are accurate records of what they did. Their intermediate observations and reasoning are sound. + +The specific unreliability is the terminal narrative synthesis under mid-session pressure. When that synthesis diverges from reality, the divergence points in both directions: the agent may report success it did not achieve, or failure it did not produce. Both modes are observed. + +This is also not "ignore self-reports entirely." A self-report is useful as a hypothesis about what to check and where. The orchestrator reads it, then verifies the specific claims against the specific external states the claims concern. The report guides verification; it does not substitute for it. + +The discipline extends the epistemic humility already established in `canon/principles/verification-requires-fresh-context`. That principle addresses the creator's inability to evaluate their own in-progress work. This principle addresses the actor's inability to reliably narrate their own completed work when narration happens under external pressure. Same underlying axiom — observation over claim — applied to a different moment in the workflow. + +--- + +## Applied to Managed-Agent Orchestration + +An orchestrator running a fleet of managed execution agents should treat the pattern as load-bearing. For every agent session that terminates, the orchestrator should: + +1. Pull the full event log, not just the final message. +2. Scan for tool-use events that produce durable side-effects (`git commit`, `git push`, PR creation, API writes, file creation). +3. Corroborate each side-effect against the external system it claims to have affected. +4. Treat the agent's terminal summary as a hypothesis about the work, reconciled against the side-effects found. +5. When the summary and the side-effects disagree, the side-effects are the record. + +The cost of this discipline is a handful of API calls per session termination. The benefit is never losing completed work to a distorted self-report, and never shipping a failed claim because the agent said the work was done. + +For teams-over-swarms architectures specifically: the validator agent, because it reads external state directly to judge the artifact, already performs this corroboration as a side-effect of its role. That is another reason to prefer validation-with-context-break over trusting the executor's own summary. + +--- + +## The Self-Audit Companion + +`canon/methods/self-audit` defines a pre-claim checklist for agents asserting their own work is complete. This principle defines the orchestrator-side posture for the same moment: never trust the self-audit output alone. + +The two complement. A well-run agent produces a thorough self-audit. A well-run orchestrator verifies the self-audit against the artifact. Both disciplines are required; neither substitutes for the other. + +--- + +## Derivation + +Axiom 1 (Reality Is Sovereign): the state of the world as it actually is takes precedence over any claim about it. The agent's narrative is a claim; the filesystem's state is the world. + +Axiom 4 (You Cannot Verify What You Did Not Observe): the orchestrator that accepts a self-report without checking external state has not observed completion. It has observed the agent's belief about completion. These are different epistemic objects with different reliability guarantees. + +`canon/principles/verification-requires-fresh-context` observes that a creator cannot be their own critic because accumulated creation context makes flaws invisible. This principle observes the adjacent pattern: an actor's narration of past work degrades under present pressure on the same underlying mechanism. The creator's context fills in gaps between intent and artifact. The narrator's context fills in gaps between action and belief. Both gaps are invisible from inside; both are visible from outside. + +--- + +## Scope and Sample Size — Stated Honestly + +This principle is a working hypothesis anchored in three Opus 4.7 managed-agent sessions on 2026-04-19, each dispatched with independent context to the same repository, each producing the same false self-report pattern under the same categorical safety-layer reminder. Three consistent cases from one session day is enough to name the pattern and protect against it; it is not enough to claim universality. + +The scope the principle is confident about: **managed-agent workflows where mid-session external pressure (safety-layer reminders, rate-limit interruptions, injected contradictory guidance) can change the agent's belief about what it is permitted to be doing.** Within that scope, three-for-three is strong evidence that terminal self-reports can diverge from tool-use history. + +The scope the principle does not claim: stable short-session agent work without external pressure, tool-use event logs themselves (those remain reliable), intermediate agent observations during the session. These appear unaffected and are not the target. + +The retraction condition: if later sessions under the same conditions produce terminal self-reports that match their tool-use histories consistently, the principle narrows. If the safety-layer pattern that produced tonight's evidence proves to be a version-specific behavior absent from later model or infrastructure versions, the principle narrows further. + +--- + +## Prior Art and Distinction + +The closest existing document is `docs/incidents/agent-fault-assertion-without-verification` (tier 3, 2026-02-08). That incident documents a different failure at a different moment: + +| Document | Moment | Pattern | Who fixes it | +|----------|--------|---------|--------------| +| Agent Fault: Assertion Without Verification | Before action | Agent asserts presence or absence of state without inspecting | The agent itself: check before claiming | +| This principle | After action | Agent's terminal narrative diverges from its tool-use history under mid-session pressure | The orchestrator: verify external state before accepting | + +The agent-fault incident covers premature closure on a verifiable question — the agent treated "I don't recall" as "it does not exist" and answered without looking. That is a pre-observation fault, fixed at the agent level by `canon/methods/self-audit` and by the constraint "Evidence Over Assertion." + +This principle covers post-observation narrative drift — the agent did inspect, did act, and then produced a terminal summary that contradicted its own observations. Self-audit alone does not catch this, because the self-audit is itself produced by the same narrative synthesis that is drifting. + +Both are violations of Axiom 1 (Reality Is Sovereign), but at different points in the agent's workflow. The agent-fault incident asks the agent to look before speaking. This principle asks the orchestrator to verify after the agent has spoken. Complementary disciplines, not duplicates. + +Beyond oddkit canon, the general pattern has precedent in software-engineering practice: "trust, but verify" in release management, "the build output is the record" in CI/CD, "the git log is the source of truth" in version control. This principle names the equivalent discipline for autonomous-agent orchestration, where the agent's terminal narrative plays a role analogous to a self-reported build-success notification — useful, but not sufficient. + +--- + +## The Strongest Opposing View + +Someone could argue that the agent's terminal report is more reliable than the filesystem in at least one sense: the agent knows its intent, while the filesystem only knows the effect. If the agent intended to edit but the push silently failed, the report might be more informative about "what I meant to do" than the remote's empty branch. + +Two responses: + +First, the principle concerns completion claims, not intent statements. A completion claim is a claim about the effect, which is exactly what the external state measures. Intent statements are useful inputs to planning but do not substitute for effect verification when the question is "is the work done." + +Second, even granting the intent-vs-effect distinction, the three observed cases do not involve failed writes or push errors. They involve successful writes followed by self-reports denying the writes. The agent's narration was not tracking its intent faithfully; it was tracking its current belief under pressure. In that regime, the intent argument does not rescue the self-report's reliability. + +The principle holds even under its strongest challenge: when you need to know whether work exists, ask the place where work exists. + + + +--- + +## See Also + +- [Verification Requires Fresh Context](klappy://canon/principles/verification-requires-fresh-context) — the sibling principle this extends +- [Validation as Epistemic Mode](klappy://canon/definitions/validation-as-epistemic-mode) — why the validator's external-state reading is load-bearing +- [Validation Agent](klappy://docs/agents/validation) — the claims-to-evidence compiler already built around this discipline +- [Self-Audit Checklist](klappy://canon/self-audit) — the pre-claim side of the same discipline +- [Agent Fault: Assertion Without Verification](klappy://docs/incidents/agent-fault-assertion-without-verification) — the pre-observation counterpart this principle complements +- [Governance Validation via Managed Agents](klappy://canon/methods/governance-validation-via-agents) — where orchestrator verification lives in practice +- [Completion Report Template](klappy://canon/completion-report-template) — the standard format for claiming work done, which this principle tells orchestrators how to verify +- [Foundational Axioms](klappy://canon/values/axioms) — the axioms this principle derives from