feat(kiloclaw): mint per-instance <label>.kiloclaw.ai URLs (PR3)#3029

Merged
pandemicsyn merged 10 commits into main from
florian/feat/namebased-pr3
May 4, 2026

@pandemicsyn pandemicsyn commented May 4, 2026

Summary

Users now open their KiloClaw instance on a per-instance virtual host (i-<hex>.kiloclaw.ai) instead of the shared claw.kilo.ai endpoint. Completes PR3 of the name-based routing rollout — PR1 built the host space, PR2 taught the worker to route by Host, and this change flips the dashboard URL generator.

Default-on everywhere on merge. Production defaults to https://{label}.kiloclaw.ai; local dev derives a loopback-parity template from KILOCLAW_API_URL (e.g. http://{label}.kiloclaw.localhost:8795). No Vercel env edits needed. Kill switch: KILOCLAW_INSTANCE_URL_TEMPLATE=legacy disables per-instance URLs without a code deploy.

Safe mixed-fleet rollout. workerUrlForInstance only emits a per-instance URL for instances on controllerCapabilitiesVersion >= 2. Pre-v2 machines lack the per-instance origin in their openclaw allowlist and would fail WebSocket origin checks; they stay on the legacy URL until their next restart.

Also in this PR:

  • Consolidated the four catch-all proxy branches (/i/:instanceId/*, host-based, cookie-routed, default) onto a single proxyThroughTarget helper. Net -345 lines in src/index.ts, and a WebSocket request that gets no upgrade from upstream now receives one unified response (502 { error: 'WebSocket upgrade failed' }).
  • Moved hostname-label.ts + sandbox-id.ts into @kilocode/worker-utils so apps/web can share the label encoding.
  • Added a reserved hostname-label guard for `claw` so that claw.kiloclaw.ai/<non-controller-path> falls through to the catch-all instead of returning a 404 "Instance not found".
  • KILOCLAW_CHECKIN_URL flipped to https://claw.kiloclaw.ai/api/controller/checkin for newly-provisioned machines; legacy URL stays live.

Verification

  • Post-merge: production getStatus returns workerUrl = https://i-<hex>.kiloclaw.ai for a v2 instance.
  • Local dev: restart the Next dev server, confirm workerUrl = http://i-<hex>.kiloclaw.localhost:8795 for a v2 instance (no .env.local edit).
  • Kill switch: setting KILOCLAW_INSTANCE_URL_TEMPLATE=legacy in Vercel reverts to the legacy URL without a redeploy.
  • v1 instance still returns the legacy URL until restarted.
  • DNS prerequisite: proxied AAAA * → 100:: on kiloclaw.ai + wildcard cert SAN covering *.kiloclaw.ai.

Visual Changes

N/A.

Reviewer Notes

  • Goes live on merge in both prod and dev. Kill switch is KILOCLAW_INSTANCE_URL_TEMPLATE=legacy.
  • Capability version gates everything. Worth a look at workerUrlForInstance and resolveInstanceUrlTemplate to confirm the fallback semantics.
  • WS no-upgrade response changed from passing the raw upstream response through (on /i and default branches) to returning a normalized 502. No test asserted the previous shape; new behavior avoids leaking upstream error detail.
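
A minimal sketch of the normalized no-upgrade response described in the last note. The `UpstreamResult` and `ProxyOutcome` shapes and the `finishWebSocketProxy` name are hypothetical stand-ins for illustration; the real `proxyThroughTarget` helper works with the worker's own response objects and may differ.

```typescript
// Hypothetical, simplified view of what the upstream returned.
interface UpstreamResult {
  status: number;
  webSocket: object | null; // non-null when the upstream accepted the upgrade
  body: string; // raw upstream body, which may contain provider error detail
}

// Hypothetical, simplified view of what the proxy hands back to the client.
interface ProxyOutcome {
  status: number;
  body: string;
  upgraded: boolean;
}

function finishWebSocketProxy(
  upstream: UpstreamResult,
  log: (message: string) => void = console.warn,
): ProxyOutcome {
  if (upstream.webSocket !== null) {
    // Upgrade succeeded: the relay takes over from here.
    return { status: 101, body: "", upgraded: true };
  }
  // No upgrade: record the upstream status for operators, but never
  // forward the raw upstream body to the client.
  log(`WebSocket upgrade failed; upstream status ${upstream.status}`);
  return {
    status: 502,
    body: JSON.stringify({ error: "WebSocket upgrade failed" }),
    upgraded: false,
  };
}
```

The point of the sketch is that the client-visible body is constant, while the upstream status is only visible in logs.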

All four catch-all paths (/i/:instanceId/*, host-based, cookie-routed,
default personal) now share a single `proxyThroughTarget` helper.
Previously the host branch used the helper while the other three
inlined the same ~130-line HTTP + WebSocket relay; this commit
collapses them onto the helper.

The helper gains optional `unreachableHint` / `startingUpHint`
parameters so the default-personal branch can keep its user-facing
hint strings (these are test-asserted). All other behavior is
preserved — same status codes, same JSON shapes, identical
WebSocket relay semantics.

One pre-existing inconsistency is unified rather than preserved: the
cookie branch used to return `{ error: 'WebSocket upgrade failed' }`
with status 502 when the upstream returned no webSocket. The helper
(and the /i and default branches) return the raw containerResponse in
that case, which is strictly more informative. No test asserted the
cookie-branch-specific response.

src/index.ts: +48 / -393 (net -345).
…kage reuse

Move the pure sandboxId <-> hostname-label logic (plus sandboxId <-> userId
encoding) from `services/kiloclaw/src/auth/` into
`@kilocode/worker-utils` so `apps/web` can use it to mint per-instance
URLs in PR3 without duplicating the base64url / base32hex encoding.

- New subpath exports: `@kilocode/worker-utils/hostname-label` and
  `@kilocode/worker-utils/sandbox-id`.
- The existing `services/kiloclaw/src/auth/hostname-label.ts` and
  `sandbox-id.ts` become thin re-export shims so the many existing
  `./auth/hostname-label` / `./auth/sandbox-id` imports inside the
  worker don't have to migrate all at once.
- Tests move with the implementation.

No behaviour change.
Wires the dashboard to emit per-instance hostnames as `workerUrl` so
users of v2+ instances open their instance directly on its virtual host
instead of the single shared `claw.kilo.ai` / `claw.kilosessions.ai`
endpoint. Completes the PR3 step of the name-based routing rollout;
PR1 built the host space, PR2 taught the worker to route by Host.

- New env var `KILOCLAW_INSTANCE_URL_TEMPLATE` (e.g.
  `https://{label}.kiloclaw.ai`). Unset → legacy single-host behaviour
  (dev default, no change).
- `workerUrlForInstance` helper expands the template only when the
  instance is on `controllerCapabilitiesVersion >= 2`. Pre-v2 instances
  don't have their per-instance origin in
  `OPENCLAW_ALLOWED_ORIGINS`, so WebSocket upgrades from the new host
  would fail openclaw's exact-match origin check; keep them on the
  legacy host until they restart onto v2.
- `getStatus` tRPC procedures (personal + org) thread the new field
  through and compute `workerUrl` via the helper. No-instance sentinel
  stays on the legacy URL (no sandboxId yet to label).
- `PlatformStatusResponse` type gains `controllerCapabilitiesVersion`;
  worker DO was already emitting it, this just exposes it to callers.
- Worker `KILOCLAW_CHECKIN_URL` flipped from
  `claw.kilosessions.ai` to `claw.kiloclaw.ai`. Only affects
  newly-provisioned / restarted machines; running machines continue
  hitting the legacy URL (still live via the existing custom domain).
- Test fixtures (state tests, walkthrough) updated for the new field.
- New helper covered by 9 unit tests in `instance-url.test.ts`.
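
The capability gate and template expansion described above could look roughly like the sketch below. It assumes the hostname label has already been derived from the sandboxId (the encoding is not shown here), and the `InstanceInfo` shape and the three-argument signature are illustrative assumptions rather than the actual `workerUrlForInstance` API.

```typescript
// Hypothetical input shape; the real helper reads these from getStatus data.
interface InstanceInfo {
  label: string | null; // e.g. "i-<hex>", already derived from the sandboxId
  controllerCapabilitiesVersion: number | null;
}

const LABEL_PLACEHOLDER = "{label}";

function workerUrlForInstance(
  instance: InstanceInfo | null,
  template: string, // e.g. "https://{label}.kiloclaw.ai"; empty means legacy
  legacyUrl: string,
): string {
  // No instance yet (no sandboxId to label): stay on the legacy URL.
  if (instance === null || instance.label === null) return legacyUrl;
  // Pre-v2 controllers lack the per-instance origin in their allowlist.
  if ((instance.controllerCapabilitiesVersion ?? 0) < 2) return legacyUrl;
  // Unset or placeholder-less template: fall back to the legacy URL.
  if (!template.includes(LABEL_PLACEHOLDER)) return legacyUrl;
  return template.split(LABEL_PLACEHOLDER).join(instance.label);
}
```

Every fallback returns the same legacy URL, so a v1 instance and a misconfigured template are indistinguishable to the caller in this sketch.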
…claw' label, warn on misconfigured URL template

Three PR review findings on the PR2/PR3 routing work.

1. proxyThroughTarget: on a WebSocket request where the upstream returns
   a non-upgrade response, return a normalized 502 JSON
   `{ error: 'WebSocket upgrade failed' }` instead of the raw upstream
   response. The previous helper passed `containerResponse` straight
   through (matching the pre-refactor /i/ and default branches but
   changing the cookie-routed branch's contract, which was 502 JSON).
   Raw upstream bodies on this edge path can leak provider/controller
   error detail to the Control UI; normalize to a minimal error body
   and log the upstream status for operators. Unified across all four
   call sites.

2. Host-based routing: add an explicit `claw` reserved-label guard.
   With PR3 flipping KILOCLAW_CHECKIN_URL to `claw.kiloclaw.ai`, that
   hostname now enters the `*.kiloclaw.ai/*` wildcard route. The
   controller check-in path is registered before the catch-all so it
   works, but any other path on that host was hitting
   handleHostBasedRoute → `claw` fails label parsing → 404 "Instance
   not found" — a confusing error for a reserved operational hostname.
   Short-circuit the host branch for reserved labels so requests fall
   through to cookie/default routing and produce the normal catch-all
   responses instead. Introduces RESERVED_INSTANCE_HOST_LABELS as an
   explicit set so future reserved hostnames (`api`, `www`, etc.) are
   trivial to add.

3. workerUrlForInstance: log a one-time `console.warn` when
   KILOCLAW_INSTANCE_URL_TEMPLATE is set but missing the `{label}`
   placeholder. Silently falling back to the legacy URL hides the
   misconfiguration. Guarded by a module-level flag so the warning
   doesn't spam logs on every getStatus call.
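
As an illustration of finding 2, here is a small sketch of a reserved-label short-circuit. `RESERVED_INSTANCE_HOST_LABELS` is the name from the commit message; `leadingLabel`, `decideHostRoute`, and the suffix format (`kiloclaw.ai`, no leading dot or port) are assumptions made for this example. Label parsing and the 404 path of the real `handleHostBasedRoute` are not modeled.

```typescript
// Reserved operational hostnames that must never be treated as instances.
const RESERVED_INSTANCE_HOST_LABELS: ReadonlySet<string> = new Set(["claw"]);

// Returns the single label in front of the suffix, or null if the
// hostname is not a one-label subdomain of the suffix.
function leadingLabel(hostname: string, hostSuffix: string): string | null {
  const lower = hostname.toLowerCase();
  const tail = `.${hostSuffix.toLowerCase()}`;
  if (!lower.endsWith(tail)) return null;
  const label = lower.slice(0, -tail.length);
  return label.length > 0 && !label.includes(".") ? label : null;
}

type HostRouteDecision = "route-to-instance" | "fall-through";

function decideHostRoute(hostname: string, hostSuffix: string): HostRouteDecision {
  const label = leadingLabel(hostname, hostSuffix);
  // Reserved labels skip the host branch so that cookie/default routing
  // can produce the normal catch-all response.
  if (label === null || RESERVED_INSTANCE_HOST_LABELS.has(label)) {
    return "fall-through";
  }
  return "route-to-instance";
}
```

Keeping the reserved labels in a set means adding another operational hostname is a one-line change.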

kilo-code-bot Bot commented May 4, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Files Reviewed (4 incremental files)
  • apps/web/src/lib/config.server.test.ts
  • apps/web/src/lib/config.server.ts
  • services/kiloclaw/src/index.test.ts
  • services/kiloclaw/src/index.ts

Reviewed by gpt-5.5-2026-04-23 · 1,334,887 tokens

…pack

Next.js / Turbopack can't resolve `./instance-id.js` when apps/web
imports @kilocode/worker-utils/hostname-label through the subpath
export — the .js rewrite convention only works in resolvers that do
the TS→JS extension mapping (Vitest, tsgo). Turbopack looks for the
literal .js filename and fails.

Drop the .js suffix on the three sibling imports that crossed the
package boundary. worker-utils uses `moduleResolution: bundler`, which
accepts extensionless imports, so the typecheck and Vitest runs stay
green.
…=production

So the per-instance URL rollout goes live automatically on merge, without
needing a Vercel env var edit.

- New `resolveInstanceUrlTemplate(envVar, nodeEnv)` pure function with
  three-level resolution: explicit override wins (including empty string
  as a kill switch), then NODE_ENV=production defaults to the canonical
  `https://{label}.kiloclaw.ai` template, otherwise empty (dev/test).
- Operators can roll back without a code deploy by setting
  `KILOCLAW_INSTANCE_URL_TEMPLATE=` (empty) in Vercel.
- Dev/test stay on legacy localhost unless a dev opts in by setting the
  dev-parity template (`http://{label}.kiloclaw.localhost:8795`)
  explicitly.
- Factored out of the config-module scope so it's testable without
  forcing a re-import of config.server.ts, which runs production-only
  validation on unrelated secrets at module load time.
Comment thread apps/web/src/lib/config.server.ts
…CLAW_API_URL

Previously `resolveInstanceUrlTemplate` only defaulted on in
production; dev/test returned empty so the dashboard kept emitting the
legacy `KILOCLAW_API_URL` (usually `http://localhost:8795`) as
`workerUrl` until a developer manually added
`KILOCLAW_INSTANCE_URL_TEMPLATE` to `apps/web/.env.local`. That hoop
defeats the point of merging the feature — local repro of the
per-instance flow is exactly what devs need to verify changes.

Make the new pattern default in dev too, derived from
`KILOCLAW_API_URL`:

- `http://localhost:8795` -> `http://{label}.kiloclaw.localhost:8795`
- `http://127.0.0.1:9000` -> `http://{label}.kiloclaw.localhost:9000`
- Non-loopback / missing / unparsable `KILOCLAW_API_URL` falls back to
  `http://{label}.kiloclaw.localhost:8795` (the wrangler dev default).

Scheme and port are preserved from `KILOCLAW_API_URL` so a dev
running wrangler on a non-default port still gets a working template.

Opt-out is unchanged: `KILOCLAW_INSTANCE_URL_TEMPLATE=` (empty) in
env returns empty and falls back to legacy routing. Tests exercise
prod default, dev defaults across loopback/non-loopback URLs, explicit
overrides, and the kill-switch opt-out.
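
The derivation rules above can be sketched as follows. The function name `deriveDevInstanceUrlTemplate` and the exact set of hosts treated as loopback are assumptions for illustration; only the two mappings and the fallback listed in the commit message come from the source.

```typescript
// The wrangler dev default named in the commit message.
const FALLBACK_DEV_TEMPLATE = "http://{label}.kiloclaw.localhost:8795";

// Assumed loopback set; the real implementation may recognise more hosts.
const LOOPBACK_HOSTS: ReadonlySet<string> = new Set(["localhost", "127.0.0.1"]);

function deriveDevInstanceUrlTemplate(apiUrl: string | undefined): string {
  if (!apiUrl) return FALLBACK_DEV_TEMPLATE;
  let parsed: URL;
  try {
    parsed = new URL(apiUrl);
  } catch {
    // Unparsable KILOCLAW_API_URL: use the default.
    return FALLBACK_DEV_TEMPLATE;
  }
  if (!LOOPBACK_HOSTS.has(parsed.hostname)) return FALLBACK_DEV_TEMPLATE;
  // Preserve scheme and port so non-default wrangler ports keep working.
  const port = parsed.port ? `:${parsed.port}` : "";
  return `${parsed.protocol}//{label}.kiloclaw.localhost${port}`;
}
```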
The `*.kiloclaw.ai/*` wildcard route catches `www.kiloclaw.ai`; without
an explicit handler it would surface as "Instance not found" 404 because
`www` fails hostname-label parsing. Add a canonical-redirect set
(currently just `www`) that 301s to the apex host derived from
`KILOCLAW_INSTANCE_HOST_SUFFIX` + `KILOCLAW_INSTANCE_URL_SCHEME`, so
dev parity works automatically (`www.kiloclaw.localhost:8795` ->
`kiloclaw.localhost:8795`) without hardcoding the apex.

Redirect target is built via URL setters (pathname/search), not string
concatenation, to sidestep the scheme-relative `//` open-redirect class
PR2 had to patch out of the capability-gate path.
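
To illustrate the URL-setter approach, here is a hedged sketch. `buildApexRedirectUrl` is the name mentioned later in this PR, but the signature and the assumed suffix format (host plus optional port, no leading dot) are guesses for the example. A later commit in this PR drops the worker-side redirect, so this shows only the technique.

```typescript
function buildApexRedirectUrl(
  requestUrl: string,
  hostSuffix: string, // assumed form: "kiloclaw.ai" or "kiloclaw.localhost:8795"
  scheme: "http" | "https",
): string {
  const incoming = new URL(requestUrl);
  // The origin comes only from trusted configuration.
  const target = new URL(`${scheme}://${hostSuffix}`);
  // Setters keep attacker-controlled input confined to the path and query;
  // a path such as "//evil.example" cannot replace the host.
  target.pathname = incoming.pathname;
  target.search = incoming.search;
  return target.toString();
}
```

With plain concatenation, a request path beginning with `//` can end up being read as a scheme-relative URL if the prefix is ever empty or relative; assigning through the setters keeps the configured host fixed.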

Two new tests cover the prod and dev-parity cases.

DNS side: the existing proxied wildcard `AAAA * -> 100::` record on
`kiloclaw.ai` already covers `www`, and the wildcard cert SAN matches
one-label subdomains. No extra DNS / cert work needed.
Comment thread services/kiloclaw/src/index.ts Outdated

pandemicsyn commented May 4, 2026

❯ dig i-fe17060fff0f40318cf92307d5b97e5d.kiloclaw.ai

; <<>> DiG 9.10.6 <<>> i-fe17060fff0f40318cf92307d5b97e5d.kiloclaw.ai
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44369
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;i-fe17060fff0f40318cf92307d5b97e5d.kiloclaw.ai.        IN A

;; ANSWER SECTION:
i-fe17060fff0f40318cf92307d5b97e5d.kiloclaw.ai. 175 IN A 172.67.171.87
i-fe17060fff0f40318cf92307d5b97e5d.kiloclaw.ai. 175 IN A 104.21.39.198

;; Query time: 6 msec
;; SERVER: 100.100.100.100#53(100.100.100.100)
;; WHEN: Mon May 04 14:03:01 CDT 2026
;; MSG SIZE  rcvd: 107


cloud/services/kiloclaw florian/feat/namebased-pr3 ≡
❯ http https://i-fe17060fff0f40318cf92307d5b97e5d.kiloclaw.ai/stuff
HTTP/1.1 401 Unauthorized
CF-RAY: 9f69c41fcda21916-ATL
Connection: keep-alive
Content-Length: 35
Content-Type: application/json
Date: Mon, 04 May 2026 19:03:33 GMT
Nel: {"report_to":"cf-nel","success_fraction":0.0,"max_age":604800}
Report-To: {"group":"cf-nel","max_age":604800,"endpoints":[{"url":"https://a.nel.cloudflare.com/report/v4?s=RsZLY%2BuYyDwNuGLWWC1slsw%2FjqZjEIJ7ghiqt5mMi%2FgkWs%2BmZubSNJnUvAmSbN%2BnwY8XpnHcII7s48ogftKH4n2ScDkK7I9gkXqfDTNu1AZ64zRxENIfsNscfq8%2FMasPSiO1V9hSw%2B9mbL6rJzfxcnGJKF2gzgHhQHeXM%2BhNE2RhWmeH4gkVT7hUaGpw"}]}
Server: cloudflare
alt-svc: h3=":443"; ma=86400

{
    "error": "Authentication required"
}



cloud/services/kiloclaw florian/feat/namebased-pr3 ≡
❯ http https://i-fe17060fff0f40318cf92307d5b97e5d.kiloclaw.ai/health
HTTP/1.1 200 OK
CF-RAY: 9f69c4cda9f2087e-DFW
Connection: keep-alive
Content-Encoding: zstd
Content-Type: application/json
Date: Mon, 04 May 2026 19:04:00 GMT
Nel: {"report_to":"cf-nel","success_fraction":0.0,"max_age":604800}
Report-To: {"group":"cf-nel","max_age":604800,"endpoints":[{"url":"https://a.nel.cloudflare.com/report/v4?s=PEuWBP7%2FTNmgPurO6U4NyBHCgiIalFB77Vwxx4sxofongFZQHPbzsCHJcRgyHBtPMpGIwNPGHm1pdh0RPQLtfeyFRmx7Jm%2F2XnmQ6fXodrLUpE7hKUzXyLILlnRGPLEmFHpBADSbxja8YhLBrXwaw5WERB5CGdTCGkKhKDMdwyFGq72Bf5z9qZ%2BuNa8P"}]}
Server: cloudflare
Transfer-Encoding: chunked
alt-svc: h3=":443"; ma=86400

{
    "gateway_port": 18789,
    "service": "kiloclaw",
    "status": "ok"
}

And the www redirect via the cf rules remains intact:

❯ http https://www.kiloclaw.ai
HTTP/1.1 301 Moved Permanently
CF-RAY: 9f69c7044f806471-ATL
Connection: keep-alive
Content-Length: 0
Date: Mon, 04 May 2026 19:05:31 GMT
Location: https://kilo.ai/kiloclaw
Nel: {"report_to":"cf-nel","success_fraction":0.0,"max_age":604800}
Report-To: {"group":"cf-nel","max_age":604800,"endpoints":[{"url":"https://a.nel.cloudflare.com/report/v4?s=r0db0VOdVds6LsjSaZZDgz0C48MhdPstb%2FdmlbMWVUzTCGneuLyTIaZvVGbuYYh9bvrJg0PufHqfuhFAOEP%2BGp0HIKQoWnjZh%2BaxCHEe%2FV28cEBGVP11OolavHTSkmx3iYmvXnPAszq%2FOdIupm8%3D"}]}
Server: cloudflare

… switch

Two review findings on the latest PR3 commits.

1. **www redirect removed.** `CANONICAL_APEX_REDIRECT_LABELS` and
   `buildApexRedirectUrl` lived inside `handleHostBasedRoute`, which
   runs from the catch-all route. The catch-all is behind the global
   `authGuard` middleware, so unauthenticated `www.kiloclaw.ai`
   requests would 401 before the redirect could fire — exactly the
   traffic the redirect was meant to serve. Tests passed because the
   auth middleware is mocked to always succeed in the worker test
   harness. Rather than rearrange the middleware chain to make it
   work, drop the worker-side redirect entirely: the `www` → apex
   redirect is handled by Cloudflare DNS/edge routes, which is the
   right layer for this anyway (no worker invocation cost, runs
   before any auth, always correct).

2. **Kill switch now uses an explicit `legacy` sentinel.** The
   previous `KILOCLAW_INSTANCE_URL_TEMPLATE=` (empty string) rollback
   was brittle: Vercel / Node env pipelines frequently coerce empty
   entries into "unset", making an empty-string rollback
   indistinguishable from the default-on path (fails open). Switch to
   a non-empty word sentinel: `KILOCLAW_INSTANCE_URL_TEMPLATE=legacy`
   (case-insensitive) disables per-instance URLs. Empty string now
   falls through to the production/dev defaults, matching "unset"
   semantics across all env pipelines.

Tests updated to cover the new sentinel behavior and to assert that
empty string no longer disables the feature.
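
The resolution order after this change might be sketched as below. The third `devTemplate` parameter, the whitespace trimming, and returning an empty string to mean "use the legacy URL" are assumptions for illustration; the source only documents a `resolveInstanceUrlTemplate(envVar, nodeEnv)` signature from an earlier commit.

```typescript
const PRODUCTION_TEMPLATE = "https://{label}.kiloclaw.ai";
const LEGACY_SENTINEL = "legacy";

function resolveInstanceUrlTemplate(
  envVar: string | undefined, // KILOCLAW_INSTANCE_URL_TEMPLATE
  nodeEnv: string | undefined, // NODE_ENV
  devTemplate: string, // hypothetical: template derived from KILOCLAW_API_URL
): string {
  const value = (envVar ?? "").trim();
  // Kill switch: a non-empty sentinel survives env pipelines that
  // coerce empty strings into "unset".
  if (value.toLowerCase() === LEGACY_SENTINEL) return "";
  // Any other non-empty value is an explicit override.
  if (value !== "") return value;
  // Empty and unset behave the same: fall through to the defaults.
  return nodeEnv === "production" ? PRODUCTION_TEMPLATE : devTemplate;
}
```

Treating empty and unset identically is what makes the switch fail closed: the only way to disable the feature is to set a value that cannot be silently dropped.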

pandemicsyn commented May 4, 2026

local dev in action

Screenshot 2026-05-04 at 2 28 41 PM

@pandemicsyn pandemicsyn merged commit 9e80deb into main May 4, 2026
39 checks passed
@pandemicsyn pandemicsyn deleted the florian/feat/namebased-pr3 branch May 4, 2026 20:27

2 participants