[Gastown] PR 5.5: Container — Adopt kilo serve for Agent Management #305

@jrf0110

Description

Overview

Replace the current stdin/stdout-based agent process management in the Town Container with Kilo's built-in HTTP server (kilo serve). The container currently spawns kilo code --non-interactive as fire-and-forget child processes and communicates via raw stdin pipes. This is fragile and provides no structured observability.

Decision: We are going forward with the kilo serve route. See analysis: docs/gt/opencode-server-analysis.md

Context

kilo serve starts a headless HTTP server (OpenAPI 3.1) with session management, structured message sending, SSE event streaming, abort/fork/revert, diff inspection, and more. The SDK (@kilocode/sdk/v2/server) provides createOpencodeServer() to manage the server lifecycle.

Current flow:

Container Control Server (port 8080)
  └── Bun.spawn('kilo code --non-interactive') × N agents
      └── stdin/stdout pipes (fragile, unstructured)

Target flow:

Container Control Server (port 8080)
  └── kilo serve (port 4096+N) × M server instances (one per worktree)
      └── HTTP API: POST /session/:id/message, GET /event (SSE), etc.

Scope

1. Replace process-manager.ts internals

  • Instead of Bun.spawn(['kilo', 'code', '--non-interactive', ...]), use createOpencodeServer() from @kilocode/sdk/v2/server (or equivalent) to start kilo serve instances
  • One kilo serve instance per worktree/project directory (since a server is scoped to one project)
  • Manage port allocation for multiple server instances within the container
  • Track server instances and their sessions instead of raw child processes
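The registry described above could be sketched roughly as follows. All names here (ServeRegistry, acquire/release, the 4096 base port from the flow diagram) are illustrative, not taken from the actual process-manager.ts:

```typescript
// Illustrative sketch: one kilo serve instance per worktree, with simple
// incrementing port allocation and session tracking.
const BASE_PORT = 4096;

interface ServerInstance {
  port: number;
  worktree: string;
  sessionIds: Set<string>;
}

class ServeRegistry {
  private byWorktree = new Map<string, ServerInstance>();
  private nextOffset = 0;

  // Return the existing instance for a worktree, or allocate a port for a new one.
  acquire(worktree: string): ServerInstance {
    let inst = this.byWorktree.get(worktree);
    if (!inst) {
      inst = {
        port: BASE_PORT + this.nextOffset++,
        worktree,
        sessionIds: new Set(),
      };
      this.byWorktree.set(worktree, inst);
    }
    return inst;
  }

  // Drop a session; returns true when the instance has no sessions left
  // (i.e. it is a candidate for shutdown).
  release(worktree: string, sessionId: string): boolean {
    const inst = this.byWorktree.get(worktree);
    if (!inst) return true;
    inst.sessionIds.delete(sessionId);
    return inst.sessionIds.size === 0;
  }
}
```

A real implementation would also need to handle port collisions with unrelated processes and recycle ports from stopped instances; this only shows the shape of the bookkeeping.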

2. Replace stdin-based messaging with HTTP API

  • sendMessage(agentId, prompt) → POST /session/:id/message or POST /session/:id/prompt_async
  • getProcessStatus(agentId) → GET /session/status (structured session-level status)
  • Agent abort → POST /session/:id/abort (clean abort instead of SIGTERM)
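One way to keep this mapping testable is to build the requests as plain data before dispatching them. The endpoint paths below come from the mapping above; the message body shape (parts array) is an assumption about the kilo serve API, not confirmed by this issue:

```typescript
// Hypothetical request builders for the kilo serve HTTP API.
type Req = { method: "GET" | "POST"; path: string; body?: unknown };

function messageRequest(sessionId: string, prompt: string, async = false): Req {
  const verb = async ? "prompt_async" : "message";
  return {
    method: "POST",
    path: `/session/${encodeURIComponent(sessionId)}/${verb}`,
    body: { parts: [{ type: "text", text: prompt }] }, // assumed payload shape
  };
}

function statusRequest(): Req {
  return { method: "GET", path: "/session/status" };
}

function abortRequest(sessionId: string): Req {
  return {
    method: "POST",
    path: `/session/${encodeURIComponent(sessionId)}/abort`,
  };
}
```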

3. Replace agent-runner.ts startup flow

  • After git clone/worktree setup, start a kilo serve instance for the worktree (if not already running)
  • Create a new session on the server: POST /session
  • Send the initial prompt via POST /session/:id/message with model/agent/system-prompt configuration
  • Return session ID as the agent's handle (instead of process PID)
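The startup steps above could be orchestrated roughly as below, with the HTTP client injected so the flow is testable without a live server. The /session and /session/:id/message endpoints are from this issue; the option names (model, agent) and response shape ({ id }) are assumptions:

```typescript
// Sketch of the agent-runner startup flow against an already-running
// kilo serve instance. `post` is an injected HTTP client.
type Post = (path: string, body: unknown) => Promise<any>;

async function startAgent(
  post: Post,
  opts: { prompt: string; model?: string; agent?: string },
): Promise<string> {
  // 1. Create a new session on the server.
  const session = await post("/session", {});
  // 2. Send the initial prompt with model/agent configuration.
  await post(`/session/${session.id}/message`, {
    model: opts.model,
    agent: opts.agent,
    parts: [{ type: "text", text: opts.prompt }],
  });
  // 3. The session ID becomes the agent's handle (replacing the process PID).
  return session.id;
}
```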

4. Wire up SSE event streaming

  • Subscribe to GET /event on each kilo serve instance
  • Forward relevant events (tool calls, completions, errors) to the heartbeat reporter
  • This replaces the raw stdout pipe reading with typed, structured events
  • Enables the future WebSocket streaming endpoint (/agents/:agentId/stream) referenced in the TODO
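Consuming GET /event means parsing SSE frames off the response stream. A minimal parser might look like this; only the wire framing (`data:` lines, blank-line delimiters) is standard SSE, while the event payload shapes are whatever kilo serve actually emits:

```typescript
// Minimal SSE frame parser: splits a text chunk into JSON event payloads.
// A production version would buffer partial frames across chunks.
function parseSse(chunk: string): unknown[] {
  const events: unknown[] = [];
  for (const frame of chunk.split("\n\n")) {
    const data = frame
      .split("\n")
      .filter((line) => line.startsWith("data:"))
      .map((line) => line.slice(5).trim())
      .join("\n");
    if (data) events.push(JSON.parse(data));
  }
  return events;
}
```

Each parsed event would then be filtered (tool calls, completions, errors) and forwarded to the heartbeat reporter.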

5. Update control server endpoints

| Endpoint | Current | After |
| --- | --- | --- |
| POST /agents/start | Spawns kilo process | Creates session on kilo server |
| POST /agents/:id/message | Writes to stdin pipe | POST /session/:id/message |
| GET /agents/:id/status | Process lifecycle (pid, exit code) | Session status (active tools, message count, etc.) |
| POST /agents/:id/stop | SIGTERM/SIGKILL on process | POST /session/:id/abort + optionally stop server if no more sessions |
| GET /health | Process count | Server instance count + session count |
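Because the external contract must not change, the control server effectively becomes a proxy that rewrites its /agents routes onto the per-worktree kilo serve instances. A sketch of that translation, with a hypothetical in-memory agent-to-session map:

```typescript
// Illustrative route translation: control-server /agents paths map onto
// the kilo serve session API per the table above. Lookup table is hypothetical.
const sessionByAgent = new Map<string, { port: number; sessionId: string }>();

function upstreamFor(
  method: string,
  agentPath: string,
): { method: string; url: string } | null {
  const msg = agentPath.match(/^\/agents\/([^/]+)\/message$/);
  if (method === "POST" && msg) {
    const t = sessionByAgent.get(msg[1]);
    if (!t) return null;
    return {
      method: "POST",
      url: `http://127.0.0.1:${t.port}/session/${t.sessionId}/message`,
    };
  }
  const stop = agentPath.match(/^\/agents\/([^/]+)\/stop$/);
  if (method === "POST" && stop) {
    const t = sessionByAgent.get(stop[1]);
    if (!t) return null;
    return {
      method: "POST",
      url: `http://127.0.0.1:${t.port}/session/${t.sessionId}/abort`,
    };
  }
  return null;
}
```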

6. Update heartbeat reporter

  • Report session-level status instead of process-level status
  • Include active tool calls and last message info from SSE events

What stays the same

  • Git clone/worktree management (git-manager.ts) — unchanged
  • Container control server (port 8080) — same interface for TownContainer DO
  • Agent environment variable setup — still needed for gastown plugin config
  • Dockerfile — still needs kilo installed globally

Acceptance Criteria

  • Container starts kilo serve instances instead of kilo code --non-interactive processes
  • Agents are managed as sessions within kilo server instances
  • Follow-up messages use HTTP API instead of stdin pipes
  • Agent status reflects session-level detail (not just process alive/dead)
  • SSE event subscription is wired up for observability
  • Clean abort via server API works
  • Existing control server endpoints maintain the same external contract (no breaking changes for TownContainer DO)
  • All existing container tests pass (or are updated to reflect new internals)

Risks & Notes

  • Port management: Each kilo serve needs its own port. Need port allocation strategy (e.g., 4096 + incrementing counter)
  • One server per worktree: A kilo server is scoped to one project dir. Multiple agents sharing a worktree can share a server with separate sessions; agents in different worktrees need separate servers
  • Resource overhead: Marginal — kilo serve is a single Bun process either way, just with HTTP server overhead instead of raw stdin/stdout
  • Migration path: Can be done incrementally — start with HTTP messaging, then add SSE, then refine status reporting

Parent issue: #204
