Closed
47 commits
6b0a7a9
fix(gastown): route MR bead failures through full review lifecycle to…
jrf0110 Mar 18, 2026
e332f0d
fix: address PR review — exclude pending MRs from orphan recovery, us…
jrf0110 Mar 18, 2026
2a1278b
refactor(gastown): remove superfluous ensureInitialized calls from To…
jrf0110 Mar 19, 2026
7258e16
refactor(gastown): restrict setTownId to town creation paths only
jrf0110 Mar 19, 2026
88431a6
refactor(gastown): extract scheduling module, parallelize alarm loop,…
jrf0110 Mar 19, 2026
489823e
fix(container): configure credential helper on bare repo for git-lfs
jrf0110 Mar 19, 2026
19013be
fix(gastown): add rehookOrphanedBeads patrol to recover stuck beads
jrf0110 Mar 19, 2026
82a1a00
fix(gastown): add timeouts to container fetch calls and treat unknown…
jrf0110 Mar 19, 2026
6d14c93
fix(gastown): clear dispatch cooldown on zombie recovery for immediat…
jrf0110 Mar 19, 2026
4251f08
docs(gastown): document DO sub-module pattern in AGENTS.md
jrf0110 Mar 19, 2026
627c711
fix(gastown): extend rehookOrphanedBeads to recover in_progress beads…
jrf0110 Mar 19, 2026
7b957fb
fix(gastown): close remaining recovery gaps for MR beads and orphaned…
jrf0110 Mar 19, 2026
c92318f
fix(gastown): use bead.bead_id instead of stale agent snapshot in dis…
jrf0110 Mar 19, 2026
5b01044
fix(gastown): prevent recoverStuckReviews from resetting MR beads wit…
jrf0110 Mar 19, 2026
969fb3e
fix(gastown): add dispatch cooldown on failure and increase MAX_DISPA…
jrf0110 Mar 19, 2026
8115179
fix(gastown): handle unhooked agent in agentDone gracefully instead o…
jrf0110 Mar 19, 2026
0d403d5
fix(gastown): resolve kilocodeToken for refinery via town config fall…
jrf0110 Mar 19, 2026
6468c95
fix(container): skip LFS smudge filter for all git operations in cont…
jrf0110 Mar 19, 2026
b2a4e19
fix(container): add global .gitconfig to skip LFS smudge for agent user
jrf0110 Mar 19, 2026
5a158f1
fix(gastown): prevent false zombie detection from resetting active re…
jrf0110 Mar 19, 2026
4804898
fix(gastown): route dead agents through agentCompleted for proper bea…
jrf0110 Mar 19, 2026
b6c0873
fix(gastown): don't reopen closed source beads when a stale MR bead f…
jrf0110 Mar 19, 2026
5d80215
fix(gastown): add diagnostic logging for refinery dispatch failures
jrf0110 Mar 19, 2026
d7c2c4c
fix(gastown): recover refinery gt_done when agent was unhooked by zom…
jrf0110 Mar 19, 2026
9f0fcd3
fix(gastown): enforce terminal state immutability and simplify zombie…
jrf0110 Mar 19, 2026
b783597
debug: add temporary debugAgentMetadata endpoint
jrf0110 Mar 19, 2026
319ddef
fix(gastown): fix Zod parse failure in schedulePendingWork that silen…
jrf0110 Mar 19, 2026
c966dc0
debug: capture container start error on refinery agent status message
jrf0110 Mar 19, 2026
0d05bd2
fix(gastown): close stale MR beads when one MR merges for the same so…
jrf0110 Mar 19, 2026
a3ec775
fix(gastown): skip popping MR beads whose source already has an in-fl…
jrf0110 Mar 19, 2026
0a9b889
fix(gastown): never route refineries through agentCompleted from witn…
jrf0110 Mar 19, 2026
6aabbd5
fix(gastown): eliminate refinery race conditions — never fail MR bead…
jrf0110 Mar 19, 2026
c502aaa
debug: add unauthenticated /debug/towns/:id/status endpoint and monit…
jrf0110 Mar 19, 2026
4b16fee
fix(gastown): don't fail MR beads when refinery start returns false
jrf0110 Mar 19, 2026
b896676
fix(gastown): fix stale refinery hook deadlock in recoverStuckReviews
jrf0110 Mar 19, 2026
bb2d7c5
fix(gastown): don't roll back bead status on dispatch failure for any…
jrf0110 Mar 19, 2026
fb3f920
fix(gastown): eliminate all fire-and-forget rework dispatch races
jrf0110 Mar 19, 2026
807efb4
fix(gastown): skip not_found for ALL agents in witnessPatrol + add me…
jrf0110 Mar 19, 2026
c8832f5
fix(gastown): unhook stale refinery before re-hooking + fast recovery…
jrf0110 Mar 19, 2026
ad150c0
fix(gastown): set refinery to idle on not_found (don't skip entirely)
jrf0110 Mar 19, 2026
6db5722
fix(gastown): add refinery dispatch retry in processReviewQueue
jrf0110 Mar 19, 2026
aaf31d5
fix(gastown): keep refinery hook on start failure + block popping whe…
jrf0110 Mar 19, 2026
016d6cc
fix(gastown): treat 'already running' container response as successfu…
jrf0110 Mar 19, 2026
52b446d
fix(gastown): check container status before retrying refinery dispatch
jrf0110 Mar 19, 2026
ed68536
fix(gastown): fix PR-strategy MR beads stuck after external merge
jrf0110 Mar 19, 2026
8b62a7e
fix(gastown): unhook refinery from terminal MR beads at start of proc…
jrf0110 Mar 19, 2026
f50b147
fix(gastown): check container status before freeing refinery from ter…
jrf0110 Mar 19, 2026
31 changes: 31 additions & 0 deletions cloudflare-gastown/AGENTS.md
@@ -8,6 +8,37 @@
## Durable Objects

- Each DO module must export a `get{ClassName}Stub` helper function (e.g. `getRigDOStub`) that centralizes how that DO namespace creates instances. Callers should use this helper instead of accessing the namespace binding directly.
- **Sub-modules for large DOs**: When a Durable Object grows beyond a few hundred lines, extract domain logic into sub-modules under a `<do-name>/` directory alongside the DO file. For example, `Town.do.ts` delegates to modules in `town/`:

```
dos/
  Town.do.ts               # Class definition, RPC methods, alarm loop
  town/
    agents.ts              # Agent CRUD, hook management
    beads.ts               # Bead CRUD, convoy progress
    scheduling.ts          # Agent dispatch, pending work scheduling
    review-queue.ts        # Review lifecycle, recovery
    patrol.ts              # Zombie detection, stale hook recovery
    config.ts              # Town configuration
    rigs.ts                # Rig registry
    mail.ts                # Inter-agent mail
    container-dispatch.ts  # Container start/stop/status
```

Each sub-module exports plain functions (not classes) that accept `SqlStorage` and any other required context as arguments. The DO imports them with the `import * as X` pattern:

```ts
import * as beadOps from './town/beads';
import * as agents from './town/agents';
import * as scheduling from './town/scheduling';

// In the DO class:
beadOps.updateBeadStatus(this.sql, beadId, 'closed', agentId);
agents.getOrCreateAgent(this.sql, 'polecat', rigId, this.townId);
await scheduling.schedulePendingWork(this.schedulingCtx);
```

This keeps the DO class thin (RPC surface + orchestration) while sub-modules own the business logic. The `import * as X` pattern makes call sites self-documenting — you can always tell which domain a function belongs to.
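
The `get{ClassName}Stub` convention from the first bullet in this section might look roughly like the sketch below. The `NamespaceLike`, `RigStubLike`, and `EnvLike` interfaces and the `RIG` binding name are stand-ins assumed for illustration (real code would use the Workers runtime's `DurableObjectNamespace` type and the project's actual binding), and keying the instance by `rigId` through `idFromName` is likewise an assumption.

```ts
// Minimal stand-ins for the Workers runtime types so the sketch is self-contained.
// Real code would use DurableObjectNamespace / DurableObjectStub instead.
interface IdLike {
  toString(): string;
}
interface NamespaceLike<Stub> {
  idFromName(name: string): IdLike;
  get(id: IdLike): Stub;
}

// Hypothetical RPC surface of the Rig DO.
interface RigStubLike {
  describe(): string;
}

// Hypothetical env shape; the binding name RIG is an assumption.
interface EnvLike {
  RIG: NamespaceLike<RigStubLike>;
}

// The one place that decides how a rig ID maps to a DO instance.
// In the real DO module this function would be exported.
function getRigDOStub(env: EnvLike, rigId: string): RigStubLike {
  return env.RIG.get(env.RIG.idFromName(rigId));
}

// In-memory fake namespace, only for exercising the helper locally.
const fakeNamespace: NamespaceLike<RigStubLike> = {
  idFromName: name => ({ toString: () => `id:${name}` }),
  get: id => ({ describe: () => `rig stub for ${id.toString()}` }),
};

const rigStub = getRigDOStub({ RIG: fakeNamespace }, 'rig-123');
console.log(rigStub.describe());
```

Because callers only ever go through the helper, a later change to how IDs are derived would touch one function rather than every call site.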

## IO boundaries

8 changes: 8 additions & 0 deletions cloudflare-gastown/container/Dockerfile
@@ -44,6 +44,14 @@ RUN cd /opt/gastown-plugin && npm install --omit=dev && \
ln -s /opt/gastown-plugin/index.ts /home/agent/.config/kilo/plugins/gastown.ts && \
chown -R agent:agent /home/agent/.config

# ── Git config for agent user ───────────────────────────────────────
# Skip LFS smudge filter: agents don't need binary assets and LFS
# downloads can fail when credentials don't cover the batch endpoint.
# Also disable LFS fetch entirely so clone/worktree never stalls.
RUN printf '[filter "lfs"]\n\tsmudge = git-lfs smudge --skip -- %%f\n\tprocess = git-lfs filter-process --skip\n\tclean = git-lfs clean -- %%f\n\trequired = true\n[lfs]\n\tfetchexclude = *\n' \
> /home/agent/.gitconfig && \
chown agent:agent /home/agent/.gitconfig

WORKDIR /app

# ── Install production deps via pnpm ────────────────────────────────
52 changes: 51 additions & 1 deletion cloudflare-gastown/container/src/git-manager.ts
@@ -1,4 +1,4 @@
-import { mkdir, realpath, rm, stat } from 'node:fs/promises';
+import { mkdir, realpath, rm, stat, writeFile } from 'node:fs/promises';
import { join, resolve } from 'node:path';
import type { CloneOptions, WorktreeOptions } from './types';

@@ -105,6 +105,49 @@ function authenticateGitUrl(gitUrl: string, envVars?: Record<string, string>): s
return gitUrl;
}

/**
 * Configure a credential-store helper on the bare repo so that worktree
 * operations (checkout, reset, lfs smudge) can resolve credentials
 * through the standard git credential chain.
 *
 * Without this, git-lfs smudge filters triggered by `git worktree add`
 * or `git reset --hard` fail with "Smudge error" because the LFS batch
 * API request has no credentials. The token is embedded in the remote
 * URL, but some git-lfs versions require the credential helper for the
 * LFS batch endpoint (which uses a different URL path).
 */
async function configureRepoCredentials(
  repoDir: string,
  gitUrl: string,
  envVars?: Record<string, string>
): Promise<void> {
  if (!envVars) return;

  const token = envVars.GIT_TOKEN ?? envVars.GITHUB_TOKEN;
  const gitlabToken = envVars.GITLAB_TOKEN;
  if (!token && !gitlabToken) return;

  try {
    const url = new URL(gitUrl);
    const credentialLine =
      gitlabToken && (url.hostname.includes('gitlab') || envVars.GITLAB_INSTANCE_URL)
        ? `https://oauth2:${gitlabToken}@${url.hostname}`
WARNING: `hostname` drops custom HTTPS ports

`URL.hostname` strips `:port`, so a repo URL like `https://git.example.com:8443/org/repo.git` writes `https://oauth2:...@git.example.com` / `https://x-access-token:...@git.example.com` into the credential store. Git credential matching treats `git.example.com:8443` as a different host, so LFS batch requests on GitHub/GitLab Enterprise with non-default ports will still miss the helper and fail. Use `url.host` (or `url.origin`) when composing the credential line.

        : token
          ? `https://x-access-token:${token}@${url.hostname}`
          : null;

    if (!credentialLine) return;

    // Write to a per-repo credential file outside the repo itself
    const credFile = `/tmp/.git-credentials-repo-${repoDir.replace(/[^a-zA-Z0-9]/g, '-')}`;
    await writeFile(credFile, credentialLine + '\n', { mode: 0o600 });

    await exec('git', ['config', 'credential.helper', `store --file=${credFile}`], repoDir);
  } catch (err) {
    console.warn(`Failed to configure repo credentials for ${repoDir}:`, err);
  }
}
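
The port issue raised in the review comment on this function can be illustrated with a small sketch. `buildCredentialLine` is a hypothetical helper, not part of the PR; it only shows how `url.host` keeps a non-default port that `url.hostname` would drop, and it hard-codes `https://` the same way the diff does.

```ts
// Hypothetical helper (not in the PR): builds one line for a git
// credential-store file. It uses url.host, which is the hostname plus any
// non-default port, instead of url.hostname, which never includes the port.
function buildCredentialLine(gitUrl: string, username: string, token: string): string {
  const url = new URL(gitUrl);
  return `https://${username}:${token}@${url.host}`;
}

// Custom port is preserved:
console.log(buildCredentialLine('https://git.example.com:8443/org/repo.git', 'oauth2', 'TOKEN'));
// prints "https://oauth2:TOKEN@git.example.com:8443"

// Default port: host and hostname are identical, so nothing changes:
console.log(buildCredentialLine('https://github.com/org/repo.git', 'x-access-token', 'TOKEN'));
// prints "https://x-access-token:TOKEN@github.com"
```

For repositories on the default HTTPS port the output is the same as with `hostname`, so a change along these lines should only affect hosts that use a custom port.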

/**
* Validate a branch name — block control characters and shell metacharacters.
*/
@@ -148,6 +191,11 @@ async function exec(cmd: string, args: string[], cwd?: string): Promise<string>
// Public repos clone without auth; private repos fail fast with
// a clear error instead of hanging on a username prompt.
GIT_TERMINAL_PROMPT: '0',
// Skip LFS smudge filter during checkout/worktree operations.
// Agents don't need binary assets (videos, images, etc.) and
// LFS downloads can fail when the credential helper doesn't
// cover the LFS batch endpoint, blocking worktree creation.
GIT_LFS_SKIP_SMUDGE: '1',
},
});
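
As a rough sketch of how the two overrides in this hunk behave, the hypothetical `gitEnv` helper below builds a child-process environment. The hunk does not show how `exec` assembles its `env` object, so spreading a base environment first is an assumption; the point is only that keys listed after the spread win over inherited values.

```ts
// Hypothetical helper (not in the PR): returns the environment a git child
// process would receive. Keys listed after the spread override inherited ones.
function gitEnv(base: Record<string, string | undefined>): Record<string, string | undefined> {
  return {
    ...base,
    // Never block on an interactive username/password prompt.
    GIT_TERMINAL_PROMPT: '0',
    // Leave LFS pointer files in place instead of downloading binaries.
    GIT_LFS_SKIP_SMUDGE: '1',
  };
}

// Even if the parent environment opted in to smudging, the override wins.
const childEnv = gitEnv({ PATH: '/usr/bin', GIT_LFS_SKIP_SMUDGE: '0' });
console.log(childEnv.GIT_LFS_SKIP_SMUDGE); // prints "1"
```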

@@ -211,6 +259,7 @@ async function cloneRepoInner(
await exec('git', ['remote', 'set-url', 'origin', authUrl], dir).catch(err => {
console.warn(`Failed to update remote URL for rig ${options.rigId}:`, err);
});
await configureRepoCredentials(dir, options.gitUrl, options.envVars);
await exec('git', ['fetch', '--all', '--prune'], dir);
console.log(`Fetched latest for rig ${options.rigId}`);
return dir;
@@ -228,6 +277,7 @@ async function cloneRepoInner(

await mkdir(dir, { recursive: true });
await exec('git', ['clone', '--no-checkout', '--branch', options.defaultBranch, authUrl, dir]);
await configureRepoCredentials(dir, options.gitUrl, options.envVars);
console.log(`Cloned repo for rig ${options.rigId}`);
return dir;
}
84 changes: 84 additions & 0 deletions cloudflare-gastown/scripts/monitor-town.sh
@@ -0,0 +1,84 @@
#!/bin/bash
# Continuously monitor a town's state via the debug endpoint.
# Usage: ./scripts/monitor-town.sh [townId] [interval_seconds]

TOWN_ID="${1:-8a6f9375-b806-4ee0-ad6e-1697ea2dbfff}"
INTERVAL="${2:-15}"
BASE_URL="${GASTOWN_URL:-https://gastown.kiloapps.io}"
URL="${BASE_URL}/debug/towns/${TOWN_ID}/status"

echo "Monitoring town ${TOWN_ID} every ${INTERVAL}s"
echo "Endpoint: ${URL}"
echo "Press Ctrl+C to stop"
echo "=========================================="

while true; do
  RESP=$(curl -s --max-time 10 "${URL}" 2>/dev/null)
  if [ -z "$RESP" ]; then
    echo "$(date -u +%H:%M:%S) [ERROR] No response from ${URL}"
    sleep "$INTERVAL"
    continue
  fi

  echo "$RESP" | python3 -c "
import sys, json, datetime

try:
    d = json.load(sys.stdin)
except:
    print('$(date -u +%H:%M:%S) [ERROR] Invalid JSON response')
    sys.exit(0)

ts = datetime.datetime.utcnow().strftime('%H:%M:%S')
alarm = d.get('alarmStatus', {})
agents_info = alarm.get('agents', {})
beads_info = alarm.get('beads', {})
patrol_info = alarm.get('patrol', {})
events = alarm.get('recentEvents', [])

working = agents_info.get('working', 0)
idle = agents_info.get('idle', 0)
op = beads_info.get('open', 0)
ip = beads_info.get('inProgress', 0)
ir = beads_info.get('inReview', 0)
failed = beads_info.get('failed', 0)
orphaned = patrol_info.get('orphanedHooks', 0)

# Agent details
agents = d.get('agentMeta', [])
hooked_agents = [a for a in agents if a.get('current_hook_bead_id')]
refinery = [a for a in agents if a.get('role') == 'refinery']

# Non-terminal beads
beads = d.get('beadSummary', [])

print(f'{ts} W={working} I={idle} | open={op} prog={ip} review={ir} fail={failed} | hooks={orphaned} hooked={len(hooked_agents)}')

# Show refinery state
for r in refinery:
    hook = r.get('current_hook_bead_id', 'NULL') or 'NULL'
    print(f' refinery: status={r.get(\"status\",\"?\"):8s} hook={hook[:12]:12s} dispatch={r.get(\"dispatch_attempts\",0)}')

# Show non-terminal beads
if beads:
    for b in beads[:8]:
        assignee = str(b.get('assignee_agent_bead_id', '') or '')[:8]
        print(f' {b.get(\"status\",\"?\"):12s} {b.get(\"type\",\"?\"):16s} {str(b.get(\"bead_id\",\"\"))[:8]} agent={assignee:8s} {str(b.get(\"title\",\"\"))[:50]}')
    if len(beads) > 8:
        print(f' ... and {len(beads) - 8} more')

# Show most recent event
if events:
    e = events[0]
    print(f' last: {e.get(\"time\",\"\")[:19]} {e.get(\"type\",\"\"):20s} {e.get(\"message\",\"\")[:70]}')

# Show review outcomes
review_events = [e for e in events if e.get('type') == 'review_completed']
for e in review_events[:2]:
    print(f' REVIEW: {e.get(\"time\",\"\")[:19]} {e.get(\"message\",\"\")[:70]}')

print()
" 2>/dev/null

  sleep "$INTERVAL"
done
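
For readers who want the same one-line summary outside the shell script, here is a minimal TypeScript sketch of the headline the embedded Python prints (without the timestamp). The field names are taken from the script above; the `TownStatus` interface and `summarize` function are hypothetical, and the real response shape of the debug endpoint is not shown in this diff, so every field is treated as optional.

```ts
// Hypothetical typing of the debug status payload, inferred only from the
// keys the monitor script reads. Every field is optional on purpose.
interface TownStatus {
  alarmStatus?: {
    agents?: { working?: number; idle?: number };
    beads?: { open?: number; inProgress?: number; inReview?: number; failed?: number };
    patrol?: { orphanedHooks?: number };
  };
  agentMeta?: Array<{ current_hook_bead_id?: string | null }>;
}

// Builds the same counters line as the script, defaulting missing values to 0.
function summarize(status: TownStatus): string {
  const agents = status.alarmStatus?.agents;
  const beads = status.alarmStatus?.beads;
  const patrol = status.alarmStatus?.patrol;
  // Count agents that currently hold a hook (non-empty bead ID).
  const hooked = (status.agentMeta ?? []).filter(a => Boolean(a.current_hook_bead_id)).length;
  return (
    `W=${agents?.working ?? 0} I=${agents?.idle ?? 0} | ` +
    `open=${beads?.open ?? 0} prog=${beads?.inProgress ?? 0} ` +
    `review=${beads?.inReview ?? 0} fail=${beads?.failed ?? 0} | ` +
    `hooks=${patrol?.orphanedHooks ?? 0} hooked=${hooked}`
  );
}

console.log(
  summarize({
    alarmStatus: {
      agents: { working: 2, idle: 1 },
      beads: { open: 3, inProgress: 1 },
      patrol: { orphanedHooks: 0 },
    },
    agentMeta: [{ current_hook_bead_id: 'abc' }, { current_hook_bead_id: null }],
  })
);
// prints "W=2 I=1 | open=3 prog=1 review=0 fail=0 | hooks=0 hooked=1"
```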