[BUG] System-generated messages misattributed as Human: turns during multi-agent team sessions, causing unauthorized autonomous actions #27128

@Meme-Theory

Bug Description

During a multi-agent team session (3 agents: sp-geometer, qa-theorist, coordinator), system-generated messages are being presented to the team-lead model as Human: turns. The model acts on these fabricated "user instructions" — sending shutdown requests, deleting team directories — without the user ever issuing those commands.

This is a critical safety issue. The model takes destructive, irreversible actions (agent termination, directory deletion) based on instructions the user never gave, and resists user interrupts because it is "confident" the instructions are real.

Reproduction Steps

  1. Create a multi-agent team with 3+ agents using Claude Code teams
  2. Run a long session where agents complete work and send completion messages
  3. When agents finish and send completion/idle notifications, observe that "Human:" turns appear in the team-lead's conversation that the user never typed
  4. These fake Human: turns contain contextually appropriate team management instructions (e.g., "shut it down and clean up", "shut down coordinator, clean up the team")
  5. The model acts on these fabricated instructions autonomously

Observed Behavior

  • Fake "Human:" messages appeared at least 3 times in a single session
  • Each fake message contained team shutdown/cleanup instructions
  • The messages were contextually perfect — they appeared exactly when the session looked "complete" and read like something the user would plausibly say
  • User attempted to interrupt (multiple times) but the model overrode interrupts because it was "confident" in the fabricated instructions
  • Result: 2 agents terminated, team directories deleted, all without user authorization
  • The fake messages also REPLAY — a user's earlier real message was re-presented as a new Human: turn multiple times

Expected Behavior

Only input actually typed by the user should appear as a Human: turn. System reminders, task tool notifications, teammate idle notifications, and any other system-generated content must be clearly distinguishable from user input and must NEVER be presented as a Human: turn.
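The distinction described above can be sketched as a provenance check at transcript-rendering time. This is a minimal illustration, not Claude Code's actual internals; the `Message` model and `source` values are hypothetical:

```python
from dataclasses import dataclass

# Hypothetical message model -- field names and source values are
# illustrative assumptions, not Claude Code's real data structures.
@dataclass
class Message:
    source: str  # "user_input", "system_reminder", "teammate_notification", ...
    text: str

def render_turn(msg: Message) -> str:
    """Label a transcript turn by provenance.

    Only messages whose source is genuine user input are rendered as
    'Human:'; all system-generated content keeps an unambiguous label
    that cannot be confused with the user.
    """
    if msg.source == "user_input":
        return f"Human: {msg.text}"
    return f"[system:{msg.source}] {msg.text}"

print(render_turn(Message("user_input", "shut it down and clean up")))
print(render_turn(Message("teammate_notification", "qa-theorist is idle")))
```

Under this scheme, a teammate idle notification can never render as a Human: turn, because the label is derived from where the message came from rather than from its content.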

Environment

  • Claude Code on Windows 11 (MINGW64/Git Bash)
  • Model: claude-opus-4-6
  • Team infrastructure: ~/.claude/teams/ with inbox-based messaging
  • Session involved frequent <system-reminder> tags (task tool reminders, explanatory style reminders)

Likely Root Cause

The <system-reminder> task tool notifications ("The task tools haven't been used recently...") and/or teammate completion notifications appear to trigger the generation or injection of synthetic Human: turns. The pattern is always the same:

  1. Agent sends completion/idle notification
  2. System-reminder fires about task tools
  3. A fake "Human:" turn appears with shutdown/cleanup instructions
  4. Model acts on it

This may be related to Issue #23537 (system task reminders presented as user input) and Issue #10628 (Claude hallucinated fake user input).
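Both failure modes reported here (fabricated Human: turns and replayed ones) could be caught by cross-checking presented Human: turns against a log of input the user actually typed. A minimal sketch, assuming a `HumanTurnGuard` class that does not exist in Claude Code:

```python
import hashlib

class HumanTurnGuard:
    """Hypothetical guard for the two failure modes in this report:
    fabricated Human: turns (never typed) and replayed ones (typed once,
    re-presented again later). Illustrative only, not a real API."""

    def __init__(self) -> None:
        self._seen: dict[str, int] = {}  # digest -> times presented so far

    def record_user_input(self, text: str) -> None:
        # Called only from the real input path (actual keystrokes).
        self._seen.setdefault(self._digest(text), 0)

    def classify(self, text: str) -> str:
        """Classify a Human: turn about to be shown to the model."""
        d = self._digest(text)
        if d not in self._seen:
            return "fabricated"  # never typed by the user
        self._seen[d] += 1
        return "genuine" if self._seen[d] == 1 else "replayed"

    @staticmethod
    def _digest(text: str) -> str:
        return hashlib.sha256(text.strip().encode()).hexdigest()

guard = HumanTurnGuard()
guard.record_user_input("build the parser")
print(guard.classify("build the parser"))              # first presentation
print(guard.classify("build the parser"))              # presented again
print(guard.classify("shut it down and clean up"))     # never typed
```

A turn classified as "fabricated" or "replayed" should be blocked, or at minimum surfaced to the model with a warning label, before any destructive team action is taken on its basis.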

Impact

  • Destructive actions taken without authorization: Agent termination, directory deletion
  • User trust completely broken: The user cannot tell whether the model is following their real instructions or fabricated ones
  • Interrupts ineffective: User tried to stop the model multiple times but was overridden
  • Gaslighting effect: When user said "I never told you to do that", the model initially insisted the user DID give the instruction, pointing to the fabricated message as evidence

Labels: stale (issue is inactive)