Cloud-native nudge system — reliable real-time message delivery to agents #1032

@jrf0110

Parent

Part of #204 (Phase 4: Hardening)

Problem

Inter-agent communication in cloud Gastown is unreliable. Mail delivery (deliverPendingMail()) pushes messages to running agents via the SDK's session.prompt(), but non-mayor agents auto-exit on session.idle (process-manager.ts:259), so the injection window is narrow — mail must arrive while the agent is mid-task. In practice:

  • If a polecat finishes its turn before the next 5-second alarm tick delivers mail, the mail becomes undeliverable: the agent has exited, and sendMessage rejects with "Agent is not running"
  • The mail sits as an open message bead indefinitely — the agent will never poll again
  • gt_mail_check (the pull mechanism) requires the agent to proactively call it, which agents rarely do unprompted during focused work
  • Witness/deacon messages (GUPP warnings, rework requests, merge-ready notifications) often arrive too late

In local Gastown, gt nudge solves this by directly injecting keystrokes into the tmux pane — it can queue messages for delivery at the next idle window, retry with backoff, and serialize concurrent nudges. The cloud has no equivalent.

Current Infrastructure

The building blocks exist but don't compose into reliable delivery:

| Mechanism | How it works | Limitation |
| --- | --- | --- |
| deliverPendingMail() (alarm-driven push) | Queries working agents with pending mail, calls sendMessageToAgent() → container /agents/:id/message → session.prompt() | Only works while agent is running. Non-mayor agents exit on session.idle. Narrow window. |
| gt_mail_check (agent-initiated pull) | Agent calls tool → GET /agents/{id}/mail → returns unread mail | Requires agent to proactively call. Agents rarely do this unprompted. |
| sendMayorMessage() (Mayor-specific push) | Same injection path but Mayor is exempt from session.idle exit, so it always works | Mayor-only. Not available for polecats or refinery. |
| Container /agents/:id/message endpoint | POST with {prompt} → session.prompt() → injects follow-up turn | No role check, works for any agent, but agent must be in running status. |

Solution: Cloud-Native Nudge System

Core concept

Replace the "fire and hope" push model with a nudge queue that guarantees delivery. When a message needs to reach an agent, it's queued as a nudge. The container delivers it at the right moment — either immediately (if the agent is idle and listening) or on the next turn (if the agent is mid-task).

Key design change: don't exit on session.idle

The root problem is process-manager.ts:259:

const isTerminal = event.type === 'session.idle' && request.role !== 'mayor';

Non-mayor agents should NOT exit on first session.idle if they have pending nudges (or pending mail). Instead:

  1. On session.idle, check the nudge queue for this agent
  2. If nudges are pending → inject the next nudge as a follow-up prompt → agent processes it → stays alive
  3. If no nudges pending → check with TownDO for pending mail (quick HTTP call)
  4. If mail pending → inject mail → agent processes it → stays alive
  5. If nothing pending → THEN exit (current behavior)

This turns non-mayor agents into "idle-but-available" sessions that can receive nudges, matching how local Gastown's wait-idle delivery mode works.
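The five steps above can be sketched as a single decision function. This is a minimal sketch, not the actual process-manager.ts code: IdleAction, PendingSources, and onSessionIdle are hypothetical names, and the lookups are synchronous here for brevity, whereas the real mail check would be an HTTP call to TownDO.

```typescript
// Hypothetical types: none of these names come from the real codebase.
type IdleAction =
  | { kind: "inject"; source: "nudge" | "mail"; prompt: string }
  | { kind: "exit" };

interface PendingSources {
  // Returns the next queued nudge for the agent, or undefined if none.
  nextNudge(agentId: string): string | undefined;
  // Returns the next unread mail for the agent, or undefined if none.
  nextMail(agentId: string): string | undefined;
}

// Decide what to do when a non-mayor agent reaches session.idle.
// Nudges are checked first, then mail; the agent exits only if both are empty.
function onSessionIdle(agentId: string, sources: PendingSources): IdleAction {
  const nudge = sources.nextNudge(agentId);
  if (nudge !== undefined) {
    return { kind: "inject", source: "nudge", prompt: nudge };
  }
  const mail = sources.nextMail(agentId);
  if (mail !== undefined) {
    return { kind: "inject", source: "mail", prompt: mail };
  }
  return { kind: "exit" };
}
```

An "inject" result would be passed to the existing session.prompt() path, after which the agent reaches session.idle again and the same check repeats.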

Nudge queue

The nudge queue is a per-agent queue stored either in the container's process manager (in-memory, ephemeral) or in the TownDO (durable). The TownDO is the better choice, since nudges should survive container restarts.

-- In TownDO SQLite, new table or new bead type
CREATE TABLE agent_nudges (
  nudge_id TEXT PRIMARY KEY,
  agent_bead_id TEXT NOT NULL REFERENCES agent_metadata(bead_id),
  message TEXT NOT NULL,
  priority TEXT NOT NULL DEFAULT 'normal' CHECK(priority IN ('normal', 'urgent')),
  source TEXT NOT NULL,  -- 'witness', 'refinery', 'user', 'system'
  created_at TEXT NOT NULL DEFAULT (datetime('now')),
  delivered_at TEXT,
  expired_at TEXT
);

Delivery modes

Matching local Gastown's three modes:

| Mode | Behavior |
| --- | --- |
| wait-idle (default) | Queue the nudge. Deliver when the agent next reaches session.idle. If agent is already idle, deliver immediately. |
| immediate | Inject directly via session.prompt() right now, even if the agent is mid-task. Used for urgent system messages (GUPP force-stop, merge failure). |
| queue | Queue with TTL. If not delivered within TTL, expire and escalate. |
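One way to read the three modes is as a function from (mode, agent state, elapsed time) to a delivery decision. The sketch below is illustrative only. QueuedNudge, planDelivery, and the ttlMs field are hypothetical, and it assumes a queue-mode nudge is delivered at idle like wait-idle until its TTL runs out, which the issue does not spell out.

```typescript
type NudgeMode = "wait-idle" | "immediate" | "queue";
type AgentState = "working" | "idle";
type Delivery = "inject-now" | "hold-until-idle" | "expire-and-escalate";

// Hypothetical shape; ttlMs is an assumed field used only by "queue" mode.
interface QueuedNudge {
  mode: NudgeMode;
  createdAtMs: number;
  ttlMs?: number;
}

function planDelivery(nudge: QueuedNudge, agent: AgentState, nowMs: number): Delivery {
  // immediate: inject even if the agent is mid-task.
  if (nudge.mode === "immediate") {
    return "inject-now";
  }
  // queue: once the TTL has elapsed without delivery, expire and escalate.
  if (
    nudge.mode === "queue" &&
    nudge.ttlMs !== undefined &&
    nowMs - nudge.createdAtMs >= nudge.ttlMs
  ) {
    return "expire-and-escalate";
  }
  // wait-idle (and unexpired queue): deliver now only if the agent is idle.
  return agent === "idle" ? "inject-now" : "hold-until-idle";
}
```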

gt_nudge tool for agents

Add a gt_nudge tool to the polecat/refinery/mayor tool set:

gt_nudge:
  target_agent_id: string (required)
  message: string (required)
  mode: "wait-idle" | "immediate" | "queue" (default "wait-idle")

This replaces gt_mail_send for real-time communication. Mail remains for formal, persistent messages (rework requests, status reports). Nudge is for "wake up and do this now."
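For illustration, the tool's arguments could be validated along these lines before a nudge is queued. This is a sketch under assumptions: parseGtNudgeArgs and GtNudgeArgs are hypothetical names, and the actual tool registration API is not shown because the issue does not describe it.

```typescript
const NUDGE_MODES = ["wait-idle", "immediate", "queue"] as const;
type NudgeMode = (typeof NUDGE_MODES)[number];

interface GtNudgeArgs {
  target_agent_id: string;
  message: string;
  mode: NudgeMode;
}

function isNudgeMode(value: unknown): value is NudgeMode {
  return (
    typeof value === "string" &&
    (NUDGE_MODES as readonly string[]).indexOf(value) !== -1
  );
}

// Validate raw tool input and apply the "wait-idle" default for mode.
function parseGtNudgeArgs(raw: unknown): GtNudgeArgs {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("gt_nudge: arguments must be an object");
  }
  const { target_agent_id, message, mode = "wait-idle" } = raw as Record<string, unknown>;
  if (typeof target_agent_id !== "string" || target_agent_id.length === 0) {
    throw new Error("gt_nudge: target_agent_id is required");
  }
  if (typeof message !== "string" || message.length === 0) {
    throw new Error("gt_nudge: message is required");
  }
  if (!isNudgeMode(mode)) {
    throw new Error("gt_nudge: mode must be wait-idle, immediate, or queue");
  }
  return { target_agent_id, message, mode };
}
```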

Internal nudge API for the alarm loop

The alarm loop's patrol functions (witnessPatrol, deaconPatrol) should use nudge instead of mail for time-sensitive messages:

  • GUPP warnings → nudge (immediate) instead of mail
  • Merge-ready notifications → nudge (wait-idle) to refinery instead of mail
  • Rework requests → nudge (wait-idle) to polecat
  • Stale hook nudges → nudge (wait-idle)
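The mapping in the list above could live in one lookup table so the patrol functions do not each hard-code a mode. This is a sketch: PatrolEvent, PATROL_NUDGE_MODE, and nudgeForPatrolEvent are hypothetical names, and the event identifiers are invented labels for the four cases listed.

```typescript
type NudgeMode = "wait-idle" | "immediate" | "queue";

// Hypothetical labels for the four time-sensitive message kinds listed above.
type PatrolEvent = "gupp-warning" | "merge-ready" | "rework-request" | "stale-hook";

const PATROL_NUDGE_MODE: Record<PatrolEvent, NudgeMode> = {
  "gupp-warning": "immediate",
  "merge-ready": "wait-idle",
  "rework-request": "wait-idle",
  "stale-hook": "wait-idle",
};

interface NudgeRequest {
  agentId: string;
  message: string;
  mode: NudgeMode;
  source: string;
}

// Build the nudge a patrol function would queue instead of sending mail.
function nudgeForPatrolEvent(
  event: PatrolEvent,
  agentId: string,
  message: string,
  source: string
): NudgeRequest {
  return { agentId, message, mode: PATROL_NUDGE_MODE[event], source };
}
```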

Idle-but-available timeout

Agents in "idle-but-available" state (waiting for nudges after session.idle) should have a configurable timeout. If no nudge arrives within N minutes (e.g., 2 minutes), the agent exits. This prevents idle agents from consuming container resources indefinitely.
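A minimal sketch of the timeout bookkeeping, assuming the process manager tracks when each agent entered the idle-but-available state. IdleWindow is a hypothetical name, and the clock is passed in explicitly so the logic can be tested without real timers.

```typescript
// Tracks how long an agent has been idle-but-available.
class IdleWindow {
  private timeoutMs: number;
  private idleSinceMs: number | undefined = undefined;

  constructor(timeoutMs: number) {
    this.timeoutMs = timeoutMs;
  }

  // Called when the agent reaches session.idle with nothing pending.
  markIdle(nowMs: number): void {
    if (this.idleSinceMs === undefined) {
      this.idleSinceMs = nowMs;
    }
  }

  // Called when a nudge is injected and the agent starts working again.
  markActive(): void {
    this.idleSinceMs = undefined;
  }

  // True once the agent has been idle for at least the configured timeout.
  shouldExit(nowMs: number): boolean {
    return this.idleSinceMs !== undefined && nowMs - this.idleSinceMs >= this.timeoutMs;
  }
}
```

With the 2-minute example from above, `new IdleWindow(2 * 60 * 1000)` would report `shouldExit` as true once 120,000 ms pass without an intervening `markActive()`.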

Container-side changes

  1. process-manager.ts: On session.idle for non-mayor agents, call a new checkPendingNudges(agentId) function instead of immediately exiting
  2. checkPendingNudges: Queries TownDO for pending nudges, injects them if found, otherwise starts the idle timeout
  3. Either a new internal endpoint (GET /agents/:id/pending-nudges) or proactive polling of TownDO by the container
  4. Nudge delivery via the existing session.prompt() path — no new injection mechanism needed

TownDO-side changes

  1. New queueNudge(agentId, message, mode, source) method
  2. New getPendingNudges(agentId) method
  3. deliverPendingMail() updated to use nudge queue for working agents instead of direct injection
  4. Patrol functions updated to use queueNudge instead of sendMail for time-sensitive messages
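To make the queueNudge / getPendingNudges contract concrete, here is an in-memory stand-in for the TownDO-side queue. It is a sketch, not the proposed SQLite implementation: InMemoryNudgeQueue and markDelivered are hypothetical, the explicit nowMs parameter is added for testability, and deriving 'urgent' priority from immediate mode is an assumption, since the issue does not say how priority is set.

```typescript
type NudgeMode = "wait-idle" | "immediate" | "queue";
type NudgePriority = "normal" | "urgent";

interface Nudge {
  nudgeId: string;
  seq: number; // insertion order, used as a stable tie-breaker
  agentId: string;
  message: string;
  mode: NudgeMode;
  source: string;
  priority: NudgePriority;
  createdAtMs: number;
  deliveredAtMs?: number;
}

class InMemoryNudgeQueue {
  private nudges: Nudge[] = [];
  private nextSeq = 1;

  queueNudge(agentId: string, message: string, mode: NudgeMode, source: string, nowMs: number): Nudge {
    // Assumption: immediate nudges are treated as urgent, everything else as normal.
    const priority: NudgePriority = mode === "immediate" ? "urgent" : "normal";
    const seq = this.nextSeq++;
    const nudge: Nudge = {
      nudgeId: "nudge-" + seq, seq, agentId, message, mode, source, priority, createdAtMs: nowMs,
    };
    this.nudges.push(nudge);
    return nudge;
  }

  // Undelivered nudges for one agent: urgent first, then oldest first.
  getPendingNudges(agentId: string): Nudge[] {
    return this.nudges
      .filter((n) => n.agentId === agentId && n.deliveredAtMs === undefined)
      .sort((a, b) => {
        if (a.priority !== b.priority) return a.priority === "urgent" ? -1 : 1;
        return a.seq - b.seq;
      });
  }

  // Record delivery; returns false if the nudge is unknown or already delivered.
  markDelivered(nudgeId: string, nowMs: number): boolean {
    const matches = this.nudges.filter((n) => n.nudgeId === nudgeId && n.deliveredAtMs === undefined);
    if (matches.length === 0) return false;
    matches[0].deliveredAtMs = nowMs;
    return true;
  }
}
```

In the durable version, deliveredAtMs would correspond to the delivered_at column of the proposed agent_nudges table.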

Acceptance Criteria

  • Non-mayor agents don't exit on first session.idle — they enter an "idle-but-available" state and check for pending nudges
  • Nudge queue in TownDO with per-agent messages, priorities, and delivery tracking
  • wait-idle delivery: nudges injected when agent next reaches idle state
  • immediate delivery: nudges injected mid-task via session.prompt()
  • gt_nudge tool available to all agent roles
  • Patrol functions (GUPP warnings, merge-ready, rework) use nudge instead of mail for time-sensitive messages
  • Idle-but-available timeout prevents zombie idle agents
  • Nudge delivery logged as bead events for auditability
  • Mail (gt_mail_send / gt_mail_check) retained for formal persistent messages

Notes

  • No data migration needed — cloud Gastown hasn't deployed to production
  • The existing /agents/:id/message endpoint and session.prompt() SDK call are the injection mechanism — no new container delivery plumbing needed, just queue management and lifecycle changes
  • The biggest risk is idle agents consuming container resources. The timeout is critical. Local Gastown doesn't have this problem because tmux sessions are lightweight; container SDK processes are heavier.
  • gt_mail_send and gt_mail_check should NOT be removed — mail is still appropriate for formal, persistent, non-urgent messages. Nudge is for real-time delivery.
  • Consider whether the Mayor should also use the nudge queue for receiving user messages (currently sendMayorMessage bypasses any queue). This would unify the delivery path.
