Parent Issue
#204 (Gastown Cloud — Phase 2)
Bug Description
When a polecat agent fails to start in the container (e.g., git clone fails because the rig's configured default branch doesn't exist in the repo), the polecat is left in a zombie hook state: status=idle but current_hook_bead_id is still set. Subsequent slingBead calls pick this agent (it looks idle) but fail with a hook conflict.
Root Cause Chain
- Rig created with wrong default branch: User creates a rig with defaultBranch: 'main' but the repo actually uses master. This can happen because the CreateRigDialog defaults to 'main' and the user doesn't change it.
- Clone failure in container: startAgentInContainer sends the start request to the container, which calls git clone --no-checkout --branch main <url> → fatal: Remote branch main not found in upstream origin.
- Agent left in zombie state: The polecat was already created and hooked to the bead in slingBead() before the container start was attempted. When the container rejects the start, schedulePendingWork logs the failure and retries on the next alarm, but the agent remains hooked to the bead with status=idle.
- Cascading failures: The next slingBead call finds this zombie polecat (idle), tries to hook a new bead, and gets Agent is already hooked to bead <old-bead-id>.
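The chain above can be reproduced in a small in-memory model. This is only a sketch: the Agent shape, the local hookBead and startAgentInContainer stand-ins, and reproduceZombie are hypothetical and do not reflect the real Rig durable object code. They only mimic the ordering described above (hook first, start second, no cleanup on failure).

```typescript
// Hypothetical in-memory model; the real state lives in the Rig durable object.
type Agent = {
  id: string;
  status: "idle" | "working";
  currentHookBeadId: string | null;
};

// Mimics the conflict check seen in the logs: at most one hook per agent.
function hookBead(agent: Agent, beadId: string): void {
  if (agent.currentHookBeadId !== null) {
    throw new Error(`Agent is already hooked to bead ${agent.currentHookBeadId}`);
  }
  agent.currentHookBeadId = beadId;
}

// Stand-in for a container start that fails (e.g. the git clone error).
function startAgentInContainer(_agent: Agent): boolean {
  return false;
}

function reproduceZombie(): { agent: Agent; conflict: string | null } {
  const agent: Agent = { id: "maple", status: "idle", currentHookBeadId: null };

  // First sling: the hook is set before the container start is attempted.
  hookBead(agent, "bead-1");
  if (startAgentInContainer(agent)) {
    agent.status = "working";
  }
  // The start failed: status stays idle and the hook is never cleared.

  // Second sling: the agent looks idle, so it is picked again.
  let conflict: string | null = null;
  try {
    hookBead(agent, "bead-2");
  } catch (err) {
    conflict = (err as Error).message;
  }
  return { agent, conflict };
}
```

Calling reproduceZombie() leaves the agent idle but still hooked to the first bead, and the second hook attempt surfaces the same conflict message as in the report.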
Logs
[Rig.do] startAgentInContainer: error response: {"error":"git clone --no-checkout --branch main https://github.com/jrf0110/8track.git ... failed: fatal: Remote branch main not found"}
[Rig.do] schedulePendingWork: FAILED to start agent in container (attempt 1/5)
...
[Rig.do] getOrCreateAgent: found existing agent id=... name=Maple role=polecat status=idle current_hook=d97de1ce-...
[Rig.do] getOrCreateAgent: returning existing agent (idle=true, singleton=false)
[Rig.do] hookBead: CONFLICT - agent ... already hooked to d97de1ce-...
Fix Required
Two issues need to be addressed, plus one preventive improvement:
1. getOrCreateAgent should skip agents with existing hooks when looking for idle polecats
The query currently sorts by status=idle first, but doesn't filter out agents that still have current_hook_bead_id set. An agent with status=idle AND a non-null hook is in an inconsistent state — it should either be cleaned up or skipped.
-- Current (broken)
WHERE role = ? ORDER BY CASE WHEN status = 'idle' THEN 0 ELSE 1 END
-- Fixed
WHERE role = ? AND (status != 'idle' OR current_hook_bead_id IS NULL)
ORDER BY CASE WHEN status = 'idle' THEN 0 ELSE 1 END
Or: getOrCreateAgent should unhook the zombie agent before returning it.
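For illustration, the fixed selection rule can be expressed as a predicate over rows. This is a sketch, assuming a row shape with role, status, and current_hook_bead_id columns. AgentRow, isSelectable, and pickAgent are hypothetical names; the real getOrCreateAgent runs the SQL above rather than filtering in memory.

```typescript
// Hypothetical row shape, assumed from the column names in the query.
type AgentRow = {
  id: string;
  role: string;
  status: string;
  current_hook_bead_id: string | null;
};

// Same predicate as the fixed WHERE clause:
// role = ? AND (status != 'idle' OR current_hook_bead_id IS NULL)
function isSelectable(agent: AgentRow, role: string): boolean {
  return (
    agent.role === role &&
    (agent.status !== "idle" || agent.current_hook_bead_id === null)
  );
}

// Same ordering as the ORDER BY: idle agents sort before all others.
function pickAgent(agents: AgentRow[], role: string): AgentRow | undefined {
  return agents
    .filter((a) => isSelectable(a, role))
    .sort((a, b) => Number(a.status !== "idle") - Number(b.status !== "idle"))[0];
}
```

With this rule, an idle agent that still carries a hook is never returned as a candidate, so a fresh polecat would be created instead of reusing the zombie.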
2. Failed container starts should unhook the agent
In schedulePendingWork, when startAgentInContainer fails, the error is logged but the agent is not unhooked. The bead stays in_progress with the agent still hooked, creating the zombie state. On failure, the agent should be unhooked and the bead reverted to open.
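A minimal sketch of that rollback, assuming the start call is asynchronous and rejects on failure. The Agent and Bead shapes and the startWithRollback helper are hypothetical. The actual schedulePendingWork also tracks retry attempts, which this sketch leaves out, and whether the rollback should happen on every failed attempt or only after the last one is an open design choice.

```typescript
// Hypothetical shapes; the real records live in the Rig durable object.
type Agent = {
  id: string;
  status: "idle" | "working";
  currentHookBeadId: string | null;
};
type Bead = { id: string; status: "open" | "in_progress" };

// Tries to start the agent; on failure, undoes the hook and reopens the bead
// so that neither record is left pointing at the other.
async function startWithRollback(
  agent: Agent,
  bead: Bead,
  start: (agent: Agent) => Promise<void>,
): Promise<boolean> {
  try {
    await start(agent);
    agent.status = "working";
    return true;
  } catch (err) {
    console.error(`failed to start agent ${agent.id}:`, err);
    agent.currentHookBeadId = null; // unhook the agent
    agent.status = "idle";
    bead.status = "open"; // revert the bead so it can be slung again
    return false;
  }
}
```

After a failed start the agent is idle with no hook, which is a consistent state that later slingBead calls can safely reuse.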
3. (Preventive) Auto-detect default branch
The CreateRigDialog defaults to main, but many repos use master. Consider using the GitHub/GitLab API to fetch the actual default branch when a repo is selected from integrations, or running git ls-remote --symref <url> HEAD during rig creation to detect it.
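If the git ls-remote route is taken, the branch can be read from the symref line of the command output, which has the form "ref: refs/heads/<branch>" followed by a tab and HEAD. The sketch below only parses that output. parseDefaultBranch is a hypothetical helper, and how and where the command is actually run (for example, inside the container) is left open.

```typescript
// Parses the output of `git ls-remote --symref <url> HEAD`.
// The symref line looks like: "ref: refs/heads/master\tHEAD"
// Returns the branch name, or null when no symref line is present.
function parseDefaultBranch(lsRemoteOutput: string): string | null {
  for (const line of lsRemoteOutput.split("\n")) {
    const match = /^ref: refs\/heads\/(\S+)\s+HEAD$/.exec(line.trim());
    if (match) {
      return match[1] ?? null;
    }
  }
  return null;
}
```

A null result could be treated as "unknown", in which case the dialog would fall back to its current default and let the user confirm the branch.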