feat(kiloclaw): use fresh token when redeploying instance #1008

Merged
pandemicsyn merged 5 commits into main from florian/feat/refresh-token-on-redeploy on Mar 10, 2026

pandemicsyn commented Mar 10, 2026

Summary

KiloClaw instances receive a KiloCode API key (30-day HS256 JWT) at provision time from the Next.js backend. Previously, restartGateway() and start() re-deployed the same stale key stored in the DO — if the key was near expiry or the user's pepper was rotated, the machine would boot with a dead credential.

This PR makes buildUserEnvVars() mint a fresh API key on every redeploy/start by querying the user's current api_token_pepper from Postgres via Hyperdrive and signing a new 30-day JWT. The stored key serves as a graceful fallback if Hyperdrive is unavailable.
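
To make the signing step concrete, here is a minimal sketch of producing an HS256 JWT with a 30-day expiry. It is not the PR's signKiloToken(): the function name signSketchToken, the payload field names (kiloUserId, apiTokenPepper), and the use of node:crypto are assumptions for illustration, and the real implementation may well use a JWT library or Web Crypto instead.

```typescript
import { createHmac } from 'node:crypto';

// Hypothetical payload shape; the real KiloTokenPayload fields are not shown on this page.
interface SketchTokenPayload {
  kiloUserId: string;
  apiTokenPepper: string | null;
  version: number;
}

const THIRTY_DAYS_SECONDS = 30 * 24 * 60 * 60;

// Serialize a JSON value as an unpadded base64url segment, as JWTs require.
function encodeSegment(value: object): string {
  return Buffer.from(JSON.stringify(value), 'utf8').toString('base64url');
}

// Sign a compact HS256 JWT that expires 30 days after `nowSeconds`.
function signSketchToken(
  payload: SketchTokenPayload,
  secret: string,
  nowSeconds: number = Math.floor(Date.now() / 1000),
): { token: string; expiresAt: string } {
  const exp = nowSeconds + THIRTY_DAYS_SECONDS;
  const header = { alg: 'HS256', typ: 'JWT' };
  const claims = { ...payload, iat: nowSeconds, exp };
  const signingInput = `${encodeSegment(header)}.${encodeSegment(claims)}`;
  // HS256 is HMAC-SHA256 over "<header>.<claims>" using the shared secret.
  const signature = createHmac('sha256', secret).update(signingInput).digest('base64url');
  return {
    token: `${signingInput}.${signature}`,
    expiresAt: new Date(exp * 1000).toISOString(),
  };
}
```

Returning the expiry alongside the token is one way a caller could persist both values together, so a later fallback can tell how old the stored key is.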

Scope of this change:

  • Adds signKiloToken() to @kilocode/worker-utils alongside the existing verifyKiloToken — typed SignKiloTokenExtra is derived via Pick<KiloTokenPayload, ...> so sign/verify can't drift. Other workers (gastown, webhook-agent-ingest) are not changed; they can adopt the shared function in follow-up PRs.
  • Wraps the mint in a 5-second timeout (withTimeout) to prevent Hyperdrive issues from extending downtime while the machine is stopped.
  • Adds NEXTAUTH_SECRET to the platform route requireEnvVars check — fails fast (503) instead of silently reusing a stale key.
  • Does not address proactive refresh of running instances (that'll be part 2). start() still returns early when the machine is already in started state — no minting in that path.
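
The timeout wrapper mentioned in the list above can be sketched as follows. The PR's actual withTimeout signature is not shown here, so the parameter names and the error message are assumptions; this is only one common way to bound a promise with a timer.

```typescript
// Reject if `work` has not settled within `ms` milliseconds.
// Note: this does not cancel the underlying work, it only stops waiting for it.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      reject(new Error(`${label} timed out after ${ms}ms`));
    }, ms);
    work.then(
      (value) => {
        clearTimeout(timer); // settled in time: drop the pending timer
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}
```

A caller would wrap the mint with something like `withTimeout(mintFreshKey(), 5000, 'mint API key')`, where `mintFreshKey` is a hypothetical name for the Hyperdrive lookup plus signing step.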

Verification

  • pnpm typecheck — clean (worker-utils + kiloclaw)
  • pnpm test — 30/30 worker-utils, 524/524 kiloclaw
  • pnpm lint — clean (worker-utils + kiloclaw)

Visual Changes

N/A

Reviewer Notes

  • The minted key is persisted to DO state before fly.updateMachine() succeeds. If the Fly op fails, DO state points at a key that never reached the machine. This matches how trackedImageTag already works — the persisted key is "latest minted" (fallback cache + expiry reporting via getConfig()), not "confirmed on machine".
  • Key expiry is surfaced via getConfig() / /api/kiloclaw/config, not getStatus().

pandemicsyn marked this pull request as ready for review on March 10, 2026, 21:57
}
const nextAuthSecret = this.env.NEXTAUTH_SECRET;

let kilocodeApiKey = this.kilocodeApiKey ?? undefined;
Contributor

WARNING: Fallback can redeploy with an already-expired API key

kilocodeApiKey defaults to the persisted value without checking this.kilocodeApiKeyExpiresAt. Once the 30-day token has aged out, any Hyperdrive timeout or lookup failure in this block will still inject the expired key into the next machine config, so the redeployed gateway comes up unable to authenticate. Please only reuse the stored key while it is still unexpired, otherwise fail the restart/provision instead of shipping a known-bad credential.

Contributor Author

Yeah, I think a bad cred here is better than fully preventing a deploy.


kilo-code-bot Bot commented Mar 10, 2026

Code Review Summary

Status: 1 Issue Found | Recommendation: Address before merge

Overview

| Severity | Count |
| --- | --- |
| CRITICAL | 0 |
| WARNING | 1 |
| SUGGESTION | 0 |

Issue Details

WARNING

| File | Line | Issue |
| --- | --- | --- |
| kiloclaw/src/durable-objects/kiloclaw-instance.ts | 2815 | Missing kilocodeApiKeyExpiresAt still allows fallback to an unknown-age stored API key |

Other Observations (not in diff)

N/A

Files Reviewed (8 files)
  • kiloclaw/src/config.ts - 0 issues
  • kiloclaw/src/durable-objects/kiloclaw-instance.test.ts - 0 issues
  • kiloclaw/src/durable-objects/kiloclaw-instance.ts - 1 issue
  • kiloclaw/src/index.test.ts - 0 issues
  • kiloclaw/src/index.ts - 0 issues
  • packages/worker-utils/src/index.ts - 0 issues
  • packages/worker-utils/src/kilo-token.test.ts - 0 issues
  • packages/worker-utils/src/kilo-token.ts - 0 issues

Reviewed by gpt-5.4-20260305 · 1,421,058 tokens

Comment thread packages/worker-utils/src/kilo-token.ts Outdated

export type KiloTokenPayload = z.infer<typeof kiloTokenPayload>;

const KILO_TOKEN_VERSION = 3;
Contributor

Also defined in kiloclaw/src/config.ts
Can we do an export on this where it makes sense?

if (params.env) {
  payload.env = params.env;
}

Contributor

Should we run the Zod schema's parse(payload) here as a double check, in case someone adds a field to extra that doesn't exist in the schema?
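
A dependency-free sketch of the kind of check being suggested is below. One caveat worth hedging: as far as I know, a default Zod object schema strips unknown keys during parse rather than rejecting them, so catching a stray field would need a strict schema (`.strict()`). The allowed key list and the function name here are hypothetical stand-ins, since the real kiloTokenPayload schema is not shown.

```typescript
// Hypothetical allow-list standing in for the keys of the real Zod schema.
const ALLOWED_PAYLOAD_KEYS = new Set<string>(['kiloUserId', 'apiTokenPepper', 'version', 'env']);

// Throw if the payload carries a field the schema does not know about.
// This mimics what a strict schema parse would do before signing.
function assertKnownPayloadKeys(payload: Record<string, unknown>): void {
  const unknownKeys = Object.keys(payload).filter((key) => !ALLOWED_PAYLOAD_KEYS.has(key));
  if (unknownKeys.length > 0) {
    throw new Error(`Unexpected token payload fields: ${unknownKeys.join(', ')}`);
  }
}
```

Running a check like this just before signing means a field that slipped past the types fails loudly at mint time instead of producing a token that verification later rejects or silently trims.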

if (!this.userId) {
  throw new Error('Cannot build env vars: userId missing');
}
if (!this.env.NEXTAUTH_SECRET) {
Contributor

If minting fails (timeout, DB error, no Hyperdrive), kilocodeApiKey stays set to this.kilocodeApiKey, which I think means the stale stored value is reused?

Add a check on this.kilocodeApiKeyExpiresAt to see whether the stored key is already expired before passing it to buildEnvVars. Otherwise, I think the machine gets deployed with expired creds.

Contributor Author

Yeah, my theory was: I'd rather redeploy you with a stale cred than fail if Hyperdrive is timing out. But maybe this should just kaboom. If Hyperdrive is having issues, other stuff is probably breaking anyway (including token validation when you hit the claw control UI proxy).

Let me just change this to kaboom if Hyperdrive is offline and the key is expired.
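
The behavior agreed on in this thread (try to mint, fall back to the stored key, fail only when the mint fails and the stored key is expired) could look roughly like the sketch below. All names are hypothetical, and the minting step is injected as a function so the flow can be read without the Hyperdrive and signing details.

```typescript
type KeySource = 'minted' | 'stored';

interface ResolvedKey {
  key: string;
  source: KeySource;
}

// Prefer a freshly minted key; reuse the stored key only if it is still valid;
// otherwise abort the redeploy instead of shipping a known-bad credential.
async function resolveApiKeyForRedeploy(
  storedKey: string | null,
  storedKeyExpired: boolean,
  mintFreshKey: () => Promise<string>,
): Promise<ResolvedKey> {
  try {
    return { key: await mintFreshKey(), source: 'minted' };
  } catch (err) {
    if (storedKey === null || storedKeyExpired) {
      const reason = err instanceof Error ? err.message : String(err);
      throw new Error(
        `Cannot redeploy: mint failed (${reason}) and no unexpired stored key is available`,
      );
    }
    // Mint failed but the stored key is still usable: degrade gracefully.
    return { key: storedKey, source: 'stored' };
  }
}
```

Keeping the expiry decision outside this function (as a boolean input) separates the question of how to judge a stored key from the question of what to do once that judgment is made.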

}

private hasExpiredStoredApiKey(): boolean {
  if (!this.kilocodeApiKey || !this.kilocodeApiKeyExpiresAt) {
Contributor

WARNING: Missing expiry still lets an unknown-age API key through

kilocodeApiKeyExpiresAt is optional in InstanceConfigSchema and defaults to null in persisted state, so legacy instances or any config patch that only writes kilocodeApiKey will hit this branch and be treated as reusable forever. If Hyperdrive is unavailable, the worker can still redeploy with a token whose age is unknown, which is the same failure mode this change is trying to prevent. Treat a missing or unparsable expiry as unusable so the fallback only reuses keys with a known future expiration.
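
A sketch of the stricter check recommended here, where a missing or unparsable expiry counts as unusable, might look like this. It assumes the expiry is persisted as an ISO 8601 string, which this page does not confirm, and the function name is hypothetical.

```typescript
// Return true only when a stored key exists AND has a known expiry in the future.
// Missing or unparsable expiries are treated as "do not reuse".
function isStoredKeyReusable(
  key: string | null | undefined,
  expiresAt: string | null | undefined,
  nowMs: number = Date.now(),
): boolean {
  if (!key || !expiresAt) {
    return false; // legacy state with no recorded expiry: age unknown
  }
  const expiresAtMs = Date.parse(expiresAt);
  if (Number.isNaN(expiresAtMs)) {
    return false; // corrupt or unexpected format
  }
  return expiresAtMs > nowMs;
}
```

Phrasing the predicate as "reusable" rather than "expired" makes the unknown-age case fall on the safe side by default, since every early return denies reuse.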

pandemicsyn merged commit 3a4be8c into main on Mar 10, 2026
18 checks passed
pandemicsyn deleted the florian/feat/refresh-token-on-redeploy branch on March 10, 2026, 23:03