Parent
Part of #204 (Phase 4: Hardening)
Problem
Inter-agent communication in cloud Gastown is unreliable. Mail delivery (deliverPendingMail()) pushes messages to running agents via the SDK's session.prompt(), but non-mayor agents auto-exit on session.idle (process-manager.ts:259), so the injection window is narrow — mail must arrive while the agent is mid-task. In practice:
- If a polecat finishes its turn before the next 5-second alarm tick delivers mail, the mail is undeliverable (the agent has exited, and sendMessage rejects with "Agent is not running")
- The mail sits as an open message bead indefinitely, because the agent will never poll again
- gt_mail_check (the pull mechanism) requires the agent to proactively call it, which agents rarely do unprompted during focused work
- Witness/deacon messages (GUPP warnings, rework requests, merge-ready notifications) often arrive too late
In local Gastown, gt nudge solves this by directly injecting keystrokes into the tmux pane — it can queue messages for delivery at the next idle window, retry with backoff, and serialize concurrent nudges. The cloud has no equivalent.
Current Infrastructure
The building blocks exist but don't compose into reliable delivery:
| Mechanism | How it works | Limitation |
| --- | --- | --- |
| deliverPendingMail() (alarm-driven push) | Queries working agents with pending mail, calls sendMessageToAgent() → container /agents/:id/message → session.prompt() | Only works while agent is running. Non-mayor agents exit on session.idle. Narrow window. |
| gt_mail_check (agent-initiated pull) | Agent calls tool → GET /agents/{id}/mail → returns unread mail | Requires agent to proactively call. Agents rarely do this unprompted. |
| sendMayorMessage() (Mayor-specific push) | Same injection path but Mayor is exempt from session.idle exit, so it always works | Mayor-only. Not available for polecats or refinery. |
| Container /agents/:id/message endpoint | POST with {prompt} → session.prompt() → injects follow-up turn | No role check, works for any agent, but agent must be in running status. |
Solution: Cloud-Native Nudge System
Core concept
Replace the "fire and hope" push model with a nudge queue that guarantees delivery. When a message needs to reach an agent, it's queued as a nudge. The container delivers it at the right moment — either immediately (if the agent is idle and listening) or on the next turn (if the agent is mid-task).
Key design change: don't exit on session.idle
The root problem is process-manager.ts:259:
    const isTerminal = event.type === 'session.idle' && request.role !== 'mayor';
Non-mayor agents should NOT exit on first session.idle if they have pending nudges (or pending mail). Instead:
- On session.idle, check the nudge queue for this agent
- If nudges are pending → inject the next nudge as a follow-up prompt → agent processes it → stays alive
- If no nudges pending → check with TownDO for pending mail (quick HTTP call)
- If mail pending → inject mail → agent processes it → stays alive
- If nothing pending → THEN exit (current behavior)
This turns non-mayor agents into "idle-but-available" sessions that can receive nudges, matching how local Gastown's wait-idle delivery mode works.
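The branching order above can be sketched as a pure decision function. This is a minimal illustration, not the actual process-manager.ts code: the names IdleAction, IdleContext, and decideOnIdle are hypothetical, and the sketch assumes both queues are passed in already fetched, whereas a real implementation would probably query TownDO for mail only after the nudge queue comes back empty.

```typescript
// Hypothetical names throughout; only the branching order comes from the steps above.
type IdleAction =
  | { kind: "stay" } // Mayor: already exempt from the idle exit
  | { kind: "inject"; source: "nudge" | "mail"; prompt: string }
  | { kind: "exit" };

interface IdleContext {
  role: string; // e.g. "mayor", "polecat", "refinery"
  pendingNudges: string[]; // oldest first
  pendingMail: string[]; // oldest first, as reported by TownDO
}

function decideOnIdle(ctx: IdleContext): IdleAction {
  if (ctx.role === "mayor") return { kind: "stay" };

  // Nudges take precedence over mail.
  const nudge = ctx.pendingNudges[0];
  if (nudge !== undefined) return { kind: "inject", source: "nudge", prompt: nudge };

  const mail = ctx.pendingMail[0];
  if (mail !== undefined) return { kind: "inject", source: "mail", prompt: mail };

  // Nothing pending: fall back to the current behavior.
  return { kind: "exit" };
}
```

Keeping the decision separate from the side effects (the HTTP call and session.prompt()) would make the lifecycle change testable without a running container.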
Nudge queue
A per-agent queue stored either in the container's process manager (in-memory, ephemeral) or in the TownDO (durable). The TownDO is better since nudges should survive container restarts.
    -- In TownDO SQLite, new table or new bead type
    CREATE TABLE agent_nudges (
      nudge_id TEXT PRIMARY KEY,
      agent_bead_id TEXT NOT NULL REFERENCES agent_metadata(bead_id),
      message TEXT NOT NULL,
      priority TEXT NOT NULL DEFAULT 'normal' CHECK(priority IN ('normal', 'urgent')),
      source TEXT NOT NULL, -- 'witness', 'refinery', 'user', 'system'
      created_at TEXT NOT NULL DEFAULT (datetime('now')),
      delivered_at TEXT,
      expired_at TEXT
    );
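To make the table's semantics concrete, here is a small sketch of how pending nudges might be selected from rows of this shape. The AgentNudgeRow type mirrors the columns above; the pendingNudgesFor helper and its ordering (urgent before normal, then oldest first) are assumptions for illustration, since the schema does not specify a delivery order.

```typescript
// Row shape mirroring the agent_nudges columns above.
interface AgentNudgeRow {
  nudge_id: string;
  agent_bead_id: string;
  message: string;
  priority: "normal" | "urgent";
  source: string;
  created_at: string; // SQLite datetime('now') text, e.g. "2025-01-01 00:00:00"
  delivered_at: string | null;
  expired_at: string | null;
}

// Hypothetical helper: a nudge is pending when it is neither delivered nor expired.
// Assumed ordering: urgent first, then oldest first.
function pendingNudgesFor(rows: AgentNudgeRow[], agentBeadId: string): AgentNudgeRow[] {
  return rows
    .filter(
      (r) => r.agent_bead_id === agentBeadId && r.delivered_at === null && r.expired_at === null,
    )
    .sort((a, b) => {
      if (a.priority !== b.priority) return a.priority === "urgent" ? -1 : 1;
      // The datetime text format sorts chronologically as a plain string.
      if (a.created_at < b.created_at) return -1;
      if (a.created_at > b.created_at) return 1;
      return 0;
    });
}
```

In the TownDO itself the same selection would more likely be a single SQL query with a WHERE clause on delivered_at and expired_at.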
Delivery modes
Matching local Gastown's three modes:
| Mode | Behavior |
| --- | --- |
| wait-idle (default) | Queue the nudge. Deliver when the agent next reaches session.idle. If agent is already idle, deliver immediately. |
| immediate | Inject directly via session.prompt() right now, even if the agent is mid-task. Used for urgent system messages (GUPP force-stop, merge failure). |
| queue | Queue with TTL. If not delivered within TTL, expire and escalate. |
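One way to read the three modes is as a small planning function that decides, per nudge, whether to inject now or enqueue. The sketch below is illustrative only: planDelivery, isExpired, and the two-state AgentState are hypothetical, the TTL value is a caller-supplied assumption, and what "escalate" means after expiry is left out because the issue does not define it.

```typescript
type NudgeMode = "wait-idle" | "immediate" | "queue";
type AgentState = "idle" | "busy"; // simplified; a real agent has more states

type DeliveryPlan =
  | { action: "inject-now" }
  | { action: "enqueue"; expiresAtMs: number | null }; // null means no TTL

// Hypothetical planner: maps a delivery mode and agent state to an action.
function planDelivery(
  mode: NudgeMode,
  state: AgentState,
  nowMs: number,
  ttlMs: number,
): DeliveryPlan {
  switch (mode) {
    case "immediate":
      // Inject even if the agent is mid-task.
      return { action: "inject-now" };
    case "wait-idle":
      // Already idle: deliver right away. Otherwise wait for the next session.idle.
      return state === "idle"
        ? { action: "inject-now" }
        : { action: "enqueue", expiresAtMs: null };
    case "queue":
      // Always queued, with a deadline after which the nudge expires.
      return { action: "enqueue", expiresAtMs: nowMs + ttlMs };
  }
}

// A queued nudge with a deadline is expired once the deadline has passed.
function isExpired(expiresAtMs: number | null, nowMs: number): boolean {
  return expiresAtMs !== null && nowMs >= expiresAtMs;
}
```

Whether queue mode should also deliver immediately to an already-idle agent is a design question this sketch does not settle; it enqueues unconditionally.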
gt_nudge tool for agents
Add a gt_nudge tool to the polecat/refinery/mayor tool set:
    gt_nudge:
      target_agent_id: string (required)
      message: string (required)
      mode: "wait-idle" | "immediate" | "queue" (default "wait-idle")
This replaces gt_mail_send for real-time communication. Mail remains for formal, persistent messages (rework requests, status reports). Nudge is for "wake up and do this now."
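As a rough illustration of the tool's input contract, the following sketch validates a raw gt_nudge payload and applies the wait-idle default. The function name parseGtNudgeInput and the error messages are hypothetical; only the three fields and the default come from the schema above, and the real tool registration mechanism is not shown.

```typescript
const NUDGE_MODES = ["wait-idle", "immediate", "queue"] as const;
type NudgeMode = (typeof NUDGE_MODES)[number];

interface GtNudgeInput {
  target_agent_id: string;
  message: string;
  mode: NudgeMode;
}

function isNudgeMode(value: unknown): value is NudgeMode {
  return NUDGE_MODES.some((m) => m === value);
}

// Hypothetical validator: rejects malformed input and fills in the default mode.
function parseGtNudgeInput(raw: unknown): GtNudgeInput {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("gt_nudge input must be an object");
  }
  const input = raw as Record<string, unknown>;
  const target = input["target_agent_id"];
  const message = input["message"];
  const mode = input["mode"] === undefined ? "wait-idle" : input["mode"];

  if (typeof target !== "string" || target === "") {
    throw new Error("target_agent_id is required");
  }
  if (typeof message !== "string" || message === "") {
    throw new Error("message is required");
  }
  if (!isNudgeMode(mode)) {
    throw new Error("mode must be wait-idle, immediate, or queue");
  }
  return { target_agent_id: target, message, mode };
}
```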
Internal nudge API for the alarm loop
The alarm loop's patrol functions (witnessPatrol, deaconPatrol) should use nudge instead of mail for time-sensitive messages:
- GUPP warnings → nudge (immediate) instead of mail
- Merge-ready notifications → nudge (wait-idle) to refinery instead of mail
- Rework requests → nudge (wait-idle) to polecat
- Stale hook nudges → nudge (wait-idle)
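The mapping above could live in a single lookup table so the patrol functions do not each hard-code a mode. This is a sketch under assumptions: the PatrolMessageKind labels, the NudgeRoute shape, and routeFor are invented for illustration, and targetRole is null wherever the list above does not name a recipient.

```typescript
type NudgeMode = "wait-idle" | "immediate" | "queue";

// Hypothetical labels for the four message kinds listed above.
type PatrolMessageKind = "gupp-warning" | "merge-ready" | "rework-request" | "stale-hook";

interface NudgeRoute {
  mode: NudgeMode;
  targetRole: "refinery" | "polecat" | null; // null: recipient not specified in the list
}

const PATROL_NUDGE_ROUTES: Record<PatrolMessageKind, NudgeRoute> = {
  "gupp-warning": { mode: "immediate", targetRole: null },
  "merge-ready": { mode: "wait-idle", targetRole: "refinery" },
  "rework-request": { mode: "wait-idle", targetRole: "polecat" },
  "stale-hook": { mode: "wait-idle", targetRole: null },
};

function routeFor(kind: PatrolMessageKind): NudgeRoute {
  return PATROL_NUDGE_ROUTES[kind];
}
```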
Idle-but-available timeout
Agents in "idle-but-available" state (waiting for nudges after session.idle) should have a configurable timeout. If no nudge arrives within N minutes (e.g., 2 minutes), the agent exits. This prevents idle agents from consuming container resources indefinitely.
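A minimal sketch of the timeout bookkeeping follows, assuming the 2-minute figure from the example above as the default. The IdleWindow class and its method names are hypothetical, and time is passed in explicitly rather than read from a clock so the logic can be checked without real timers.

```typescript
const DEFAULT_IDLE_TIMEOUT_MS = 2 * 60 * 1000; // the 2-minute example; meant to be configurable

// Hypothetical tracker for the "idle-but-available" window of one agent.
class IdleWindow {
  private idleSinceMs: number | null = null;
  private readonly timeoutMs: number;

  constructor(timeoutMs: number = DEFAULT_IDLE_TIMEOUT_MS) {
    this.timeoutMs = timeoutMs;
  }

  // Called on session.idle when nothing is pending. Repeated calls keep the first timestamp.
  markIdle(nowMs: number): void {
    if (this.idleSinceMs === null) this.idleSinceMs = nowMs;
  }

  // Called when a nudge or mail is injected and the agent starts a new turn.
  markBusy(): void {
    this.idleSinceMs = null;
  }

  // True once the agent has been idle for at least the timeout.
  shouldExit(nowMs: number): boolean {
    return this.idleSinceMs !== null && nowMs - this.idleSinceMs >= this.timeoutMs;
  }
}
```

A container would call shouldExit from whatever periodic check it already runs, or replace the polling with a single timer that is cancelled in markBusy.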
Container-side changes
- process-manager.ts: On session.idle for non-mayor agents, call a new checkPendingNudges(agentId) function instead of immediately exiting
- checkPendingNudges: Queries TownDO for pending nudges, injects them if found, otherwise starts the idle timeout
- New internal endpoint GET /agents/:id/pending-nudges, or alternatively the container proactively polls TownDO
- Nudge delivery via the existing session.prompt() path; no new injection mechanism is needed
TownDO-side changes
- New queueNudge(agentId, message, mode, source) method
- New getPendingNudges(agentId) method
- deliverPendingMail() updated to use nudge queue for working agents instead of direct injection
- Patrol functions updated to use queueNudge instead of sendMail for time-sensitive messages
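Acceptance Criteria
- Non-mayor agents do not exit on first session.idle; they enter an "idle-but-available" state and check for pending nudges
- wait-idle delivery: nudges injected when agent next reaches idle state
- immediate delivery: nudges injected mid-task via session.prompt()
- gt_nudge tool available to all agent roles
- Existing mail tools (gt_mail_send / gt_mail_check) retained for formal persistent messages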
Notes
- No data migration needed — cloud Gastown hasn't deployed to production
- The existing /agents/:id/message endpoint and session.prompt() SDK call are the injection mechanism, so no new container delivery plumbing is needed, just queue management and lifecycle changes
- The biggest risk is idle agents consuming container resources. The timeout is critical. Local Gastown doesn't have this problem because tmux sessions are lightweight; container SDK processes are heavier.
- gt_mail_send and gt_mail_check should NOT be removed; mail is still appropriate for formal, persistent, non-urgent messages. Nudge is for real-time delivery.
- Consider whether the Mayor should also use the nudge queue for receiving user messages (currently sendMayorMessage bypasses any queue). This would unify the delivery path.