feat(kiloclaw): use fresh token when redeploying instance#1008
feat(kiloclaw): use fresh token when redeploying instance#1008pandemicsyn merged 5 commits intomainfrom
Conversation
| } | ||
| const nextAuthSecret = this.env.NEXTAUTH_SECRET; | ||
|
|
||
| let kilocodeApiKey = this.kilocodeApiKey ?? undefined; |
There was a problem hiding this comment.
WARNING: Fallback can redeploy with an already-expired API key
kilocodeApiKey defaults to the persisted value without checking this.kilocodeApiKeyExpiresAt. Once the 30-day token has aged out, any Hyperdrive timeout or lookup failure in this block will still inject the expired key into the next machine config, so the redeployed gateway comes up unable to authenticate. Please only reuse the stored key while it is still unexpired, otherwise fail the restart/provision instead of shipping a known-bad credential.
There was a problem hiding this comment.
yea, i think a bad cred here is better than fully preventing a deploy.
Code Review SummaryStatus: 1 Issues Found | Recommendation: Address before merge Overview
Issue Details (click to expand)WARNING
Other Observations (not in diff)N/A Files Reviewed (8 files)
Reviewed by gpt-5.4-20260305 · 1,421,058 tokens |
|
|
||
| export type KiloTokenPayload = z.infer<typeof kiloTokenPayload>; | ||
|
|
||
| const KILO_TOKEN_VERSION = 3; |
There was a problem hiding this comment.
Also defined in kiloclaw/src/config.ts
Can we do an export on this where it makes sense?
| if (params.env) { | ||
| payload.env = params.env; | ||
| } | ||
|
|
There was a problem hiding this comment.
Should we parse (payload) to just double check if someone adds a field to extra that doesn't exist in the Zod schema?
| if (!this.userId) { | ||
| throw new Error('Cannot build env vars: userId missing'); | ||
| } | ||
| if (!this.env.NEXTAUTH_SECRET) { |
There was a problem hiding this comment.
If minting fails (timeout, DB error, no Hyperdrive), kilocodeApiKey remains this.kilocodeApiKey which I think means the stale stored value remains?
Add check this.kilocodeApiKeyExpiresAt to see if that stored key is already expired before passing it to buildEnvVars . Otherwise, I think the machine gets deployed with expired creds.
There was a problem hiding this comment.
Yea, my theory was - id rather re-deploy you with a stale cred - than fail if hyper drive is timing out. But maybe, this should just kaboom. If hyperdrive is having issues - other stuff is probably breaking anyway (including token validation when you hit the claw control ui proxy.
Let me just change this to kaboom if hyper drive is offline and key expired.
| } | ||
|
|
||
| private hasExpiredStoredApiKey(): boolean { | ||
| if (!this.kilocodeApiKey || !this.kilocodeApiKeyExpiresAt) { |
There was a problem hiding this comment.
WARNING: Missing expiry still lets an unknown-age API key through
kilocodeApiKeyExpiresAt is optional in InstanceConfigSchema and defaults to null in persisted state, so legacy instances or any config patch that only writes kilocodeApiKey will hit this branch and be treated as reusable forever. If Hyperdrive is unavailable, the worker can still redeploy with a token whose age is unknown, which is the same failure mode this change is trying to prevent. Treat a missing or unparsable expiry as unusable so the fallback only reuses keys with a known future expiration.
Summary
KiloClaw instances receive a KiloCode API key (30-day HS256 JWT) at provision time from the Next.js backend. Previously,
restartGateway()andstart()re-deployed the same stale key stored in the DO — if the key was near expiry or the user's pepper was rotated, the machine would boot with a dead credential.This PR makes
buildUserEnvVars()mint a fresh API key on every redeploy/start by querying the user's currentapi_token_pepperfrom Postgres via Hyperdrive and signing a new 30-day JWT. The stored key serves as a graceful fallback if Hyperdrive is unavailable.Scope of this change:
signKiloToken()to@kilocode/worker-utilsalongside the existingverifyKiloToken— typedSignKiloTokenExtrais derived viaPick<KiloTokenPayload, ...>so sign/verify can't drift. Other workers (gastown, webhook-agent-ingest) are not changed; they can adopt the shared function in follow-up PRs.withTimeout) to prevent Hyperdrive issues from extending downtime while the machine is stopped.NEXTAUTH_SECRETto the platform routerequireEnvVarscheck — fails fast (503) instead of silently reusing a stale key.start()still returns early when the machine is already instartedstate — no minting in that path.Verification
pnpm typecheck— clean (worker-utils + kiloclaw)pnpm test— 30/30 worker-utils, 524/524 kiloclawpnpm lint— clean (worker-utils + kiloclaw)Visual Changes
N/A
Reviewer Notes
fly.updateMachine()succeeds. If the Fly op fails, DO state points at a key that never reached the machine. This matches howtrackedImageTagalready works — the persisted key is "latest minted" (fallback cache + expiry reporting viagetConfig()), not "confirmed on machine".getConfig()//api/kiloclaw/config, notgetStatus().