feat(kiloclaw): named instance tiers + admin temporary size override#3051
pandemicsyn wants to merge 20 commits into main
Conversation
Tier keys now consistently encode CPU count and RAM (perf-1-3, perf-4-8, perf-4-16) instead of mixing the disambiguating-suffix shape with a bare 'perf-1'. Legacy keys already followed this pattern (shared-2-3, shared-2-4); the offered tier list now matches. Also drops the unused displayPriceUsd field from the catalog and Zod schema. Pricing belongs in Postgres / Stripe price IDs once billing wires up tier-aware pricing — keeping a hardcoded USD number in TS implied more authority than it had. Migration 0110 regenerated with the new tier key in the CHECK constraint; the schema's CHECK helper switched to sql.raw to match the existing enumCheck convention, so the constraint inlines literal strings instead of being parameterized.
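For illustration, a minimal sketch of the shape this describes, assuming a drizzle-style `check` helper; the key list and helper names are approximations of what lives in `@kilocode/kiloclaw-instance-tiers` and `packages/db`, not the actual code:

```ts
// Hypothetical sketch only: tier keys encode CPU count and RAM, and the CHECK
// helper inlines literal strings via sql.raw (matching the enumCheck convention).
import { sql } from "drizzle-orm";
import { check } from "drizzle-orm/pg-core";

export const OFFERED_TIER_KEYS = ["perf-1-3", "perf-4-8", "perf-4-16"] as const;
export const LEGACY_TIER_KEYS = ["shared-2-3", "shared-2-4"] as const;
export type TierKey =
  | (typeof OFFERED_TIER_KEYS)[number]
  | (typeof LEGACY_TIER_KEYS)[number];

// Inlining the literals (rather than parameterizing) keeps the generated
// migration's CHECK constraint readable and stable across regenerations.
export function instanceTypeCheck(columnName: string) {
  const literals = [...OFFERED_TIER_KEYS, ...LEGACY_TIER_KEYS, "custom"]
    .map((key) => `'${key}'`)
    .join(", ");
  return check(
    "kiloclaw_instances_instance_type_check",
    sql.raw(`${columnName} IN (${literals})`)
  );
}
```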
…-types Renumbers our migration from 0110 to 0112 to accommodate main's new 0110_nappy_newton_destine and 0111_shocking_selene migrations. Union-merges the new @kilocode/kilo-chat-hooks dependency in apps/web alongside our @kilocode/kiloclaw-instance-tiers addition, and merges the admin detail page imports block, where main added scheduled-action notify fields and ours added the tier catalog helpers.
… copy Adds a read-side self-heal helper (resolveInstanceTypeFromState) that treats the incoherent persisted shape `instanceType='custom'` with `machineSize=null` as null instead of returning 'custom'. This stops the admin UI from labeling docker-local instances 'Custom' when there's no hardware evidence to back the label up. Used everywhere the DO resolves its current tier — getStatus, getDebugState, provision-path inference, and resize previousTier — so a stale label can't propagate or stick. Three new test cases cover (a) the broken shape resolves to null, (b) custom backed by a real non-catalog machineSize is preserved, and (c) re-provisioning an instance in the broken shape doesn't propagate 'custom'.

Dialog copy is now provider-aware:

- Action sentence mentions storage explicitly.
- Drops the unconditional 'Fly volumes can grow but not shrink' line.
- Orange Fly volume warning fires only on Fly when storage actually grows, with a more specific message.
- Adds a muted info Alert on docker-local explaining that the host bind mount means storage stays the same regardless of tier — previously silent, which was misleading because the description claimed storage would change.

Includes [instance-tier-debug] logs at every site that reads or writes instanceType, so we can keep tracing if the broken shape shows up on other instances. Logs are cheap and will be removed in a follow-up once we're confident the self-heal covers all real cases.
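A minimal sketch of the self-heal described above; the real helper lives alongside the DO and its state type is richer than shown here:

```ts
// Sketch only: field names approximate the DO state described in this PR.
type MachineSize = {
  cpus: number;
  memoryMb: number;
  cpuKind: "shared" | "performance";
};

interface TierStateSlice {
  instanceType: string | null; // catalog tier key, "custom", or null
  machineSize: MachineSize | null; // observed or catalog-derived hardware
}

export function resolveInstanceTypeFromState(state: TierStateSlice): string | null {
  // "custom" is only meaningful when backed by hardware evidence; the
  // incoherent persisted shape (custom + null machineSize) resolves to null
  // so a stale label can't propagate. Read-side only: nothing is written back.
  if (state.instanceType === "custom" && state.machineSize === null) {
    return null;
  }
  return state.instanceType;
}
```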
…tate Two real backfill gaps surfaced when manually testing legacy Fly instances:

1. The alarm-path `reconcileMachine` already calls `fly.getMachine` but never read `machine.config.guest`, so legacy instances (provisioned before this PR) sat at `Instance Tier: Unknown` indefinitely — no alarm tick would ever label them. The `syncStatusFromLiveCheck` path did backfill, but only `getStatus` triggers it.
2. The admin instance detail page calls `getDebugState`, which never dispatches a live check. So even an admin actively viewing a legacy instance wouldn't trigger a backfill — only the user-facing dashboard does.

Extract the backfill block into a shared `backfillMachineSizeFromFlyConfig` helper in reconcile.ts and call it from both `reconcileMachine` (alarm) and `syncStatusFromLiveCheck` (live-check). Extract the live-check dispatch into a `maybeDispatchLiveCheck` private method on the DO and call it from both `getStatus` and `getDebugState` so admins get fresh hardware observation without waiting for the alarm.

Tests cover (a) alarm-driven backfill on legacy state, (b) alarm no-op when machineSize is already set, (c) alarm writes 'custom' for non-catalog Fly guest, and (d) getDebugState dispatches the live-check. Required adding a `getAlarm` shim to the test fake storage that was missing — getDebugState already called it but no existing test exercised it.

Also tightens the Fly volume warning copy to spell out the consequence: 'Fly volumes can grow but cannot be shrunk, so you will not be able to downgrade this instance.'
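An illustrative sketch of the shared-dispatch shape on the Durable Object; the method names match the description above, everything else (debounce flag, bodies) is an assumption:

```ts
// Illustrative sketch only; the real DO also handles alarms, provisioning, etc.
class KiloClawInstanceSketch {
  private liveCheckInFlight = false;

  // Called from both getStatus (user dashboard) and getDebugState (admin page)
  // so legacy instances get a hardware observation without waiting for an alarm.
  private maybeDispatchLiveCheck(): void {
    if (this.liveCheckInFlight) return;
    this.liveCheckInFlight = true;
    void this.syncStatusFromLiveCheck().finally(() => {
      this.liveCheckInFlight = false;
    });
  }

  async getStatus() {
    this.maybeDispatchLiveCheck();
    // ...return customer-facing status...
  }

  async getDebugState() {
    this.maybeDispatchLiveCheck();
    // ...return admin-facing debug state...
  }

  private async syncStatusFromLiveCheck(): Promise<void> {
    // Reads machine.config.guest from Fly and calls the shared
    // backfillMachineSizeFromFlyConfig helper (also used by the alarm path).
  }
}
```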
…nges The alarm tick was unconditionally calling syncInstanceTypeToPostgresHelper once per running instance per cadence (5 min for running, 30 min for idle), which meant a Hyperdrive round-trip per instance per tick for a SQL no-op once the column was populated. Same wasteful pattern as tracked_image_tag, but worth fixing here since it accumulates linearly with fleet size.

Move the sync to the two sites that actually change instanceType:

1. backfillMachineSizeFromFlyConfig (called by alarm reconcileMachine and user-facing syncStatusFromLiveCheck): when a legacy instance gets its tier inferred from the live Fly guest, sync immediately via waitUntil.
2. resizeMachine: already syncs explicitly when the admin picks a new tier — unchanged.

Helper signature simplified — drops the (now redundant) inference branch and persist callback. Helper is now a thin 'if non-null, write to PG' which mirrors how it's used.

Tradeoff: loses self-heal on transient Postgres outages at backfill time. If a backfill-time sync fails and DO state already has the tier, nothing retries until the next tier change. Acceptable because backfill itself is retriable on the next live-check or alarm tick (cures the DO state side), and resize remains an explicit retry vector for tier changes.

Tests:

- alarm backfill on legacy state asserts syncInstanceType IS called with the resolved tier
- alarm with already-populated state asserts syncInstanceType is NOT called (the no-op case we're optimizing for)
- alarm with custom-shape Fly guest asserts syncInstanceType called with 'custom'
- getDebugState live-check assertions extended to cover both backfill (Postgres synced) and steady-state (no Postgres call)
- updated three pre-existing alarm tests that were asserting waitUntil promise counts — counts decreased by 1 since the unconditional instance_type sync is gone.

Same optimization could/should be applied to syncTrackedImageTag at some point — same pattern, same waste — but out of scope for this PR.
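A compressed sketch of the "sync only when the tier actually changes" shape, assuming a Cloudflare-style `waitUntil`; the names are approximations of the real helper:

```ts
// Sketch only: the real helper takes the DO's storage handle and Postgres client.
export async function backfillTierAndSync(
  state: { instanceType: string | null },
  observedTier: string | null,
  syncInstanceTypeToPostgres: (tier: string) => Promise<void>,
  waitUntil: (p: Promise<unknown>) => void
): Promise<void> {
  // Steady state (already labeled, or nothing observed): no DO write and,
  // crucially, no per-tick Hyperdrive round-trip for a SQL no-op.
  if (observedTier === null || state.instanceType === observedTier) return;

  state.instanceType = observedTier;
  // Best-effort: if this fails, DO state is still correct and the next
  // backfill or explicit resize remains a retry vector for the Postgres side.
  waitUntil(syncInstanceTypeToPostgres(observedTier));
}
```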
…chosen size The 'Bump Volume to 15 GB' button was a hardcoded customer-support workaround. Generalize it to a proper extend-volume admin tool that takes a target size and updates DO state alongside the Fly volume.

Changes:

- Rename BumpVolumeTo15GbDialog.tsx → ExtendVolumeDialog.tsx with a number input for target size. Defaults to current + 5 GB. Validates > current and ≤ 500 GB, requires confirmation checkbox.
- Platform /api/platform/extend-volume route accepts targetSizeGb in the body (was hardcoded 15). After the Fly extend succeeds, calls a new DO method recordVolumeExtend() to update DO state.
- New DO method KiloClawInstance.recordVolumeExtend(newSizeGb) writes volumeSizeGb and flips instanceType to 'custom' (an arbitrary extend is by definition off the catalog ladder), then fires the Postgres sync via waitUntil. Validates 1 ≤ size ≤ 500 GB.
- Admin tRPC handler validates targetSizeGb > current via getDebugStatus before dispatching, so the round-trip to Fly is never wasted.
- Audit log records previousSizeGb + sizeGb.
- Internal client signature gains a targetSizeGb param.

This means a perf-1-3 instance whose volume gets extended to 15 GB correctly shows as Custom tier afterward, and a future tier resize via canUpgradeTo correctly compares against the actual 15 GB on disk (rejecting any tier with smaller volumeSizeGb).

Tests:

- recordVolumeExtend persists state, marks custom, syncs Postgres.
- Rejects invalid sizes (0, 501, non-integer).
- Rejects when not provisioned.
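A rough sketch of the new DO method's validation and state flip; persistence, audit metadata, and the return value are omitted, and the names approximate the description above:

```ts
// Sketch only: the real method also writes durable storage and audit metadata.
export function recordVolumeExtendSketch(
  state: {
    provisioned: boolean;
    volumeSizeGb: number | null;
    instanceType: string | null;
  },
  newSizeGb: number,
  syncInstanceTypeToPostgres: () => Promise<void>,
  waitUntil: (p: Promise<unknown>) => void
): void {
  if (!state.provisioned) {
    throw new Error("cannot record volume extend on an unprovisioned instance");
  }
  if (!Number.isInteger(newSizeGb) || newSizeGb < 1 || newSizeGb > 500) {
    throw new Error(`invalid target volume size: ${newSizeGb}`);
  }
  state.volumeSizeGb = newSizeGb;
  // An arbitrary extend is, by definition, off the catalog ladder.
  state.instanceType = "custom";
  waitUntil(syncInstanceTypeToPostgres());
}
```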
Adds an admin-only `adminMachineSizeOverride` on the Instance DO that wins over the tier-derived `machineSize` for runtime spec construction without changing `instanceType` or `volumeSizeGb`. Lets support staff temporarily bump a customer's hardware (e.g. for OOM recovery) without disturbing billing or volume size, and revert later. Sticky until cleared explicitly or auto-cleared by a tier resize. The customer dashboard never sees the override; only admin getDebugState exposes it.

Two preset hardware shapes (perf-4-8 / perf-4-16) are enforced at the admin tRPC + worker platform-route boundary; the DO is preset-agnostic and stores a literal MachineSize. A new `kiloclaw_instances.admin_size_override jsonb` column with a partial index (non-null + non-destroyed) powers the admin "Has size override" filter chip and the per-row `Override` badge on the instance list page. Sync to Postgres is conditional — fired only on explicit set/clear/auto-clear-on-resize, not from the alarm tick (no observation path). Squashed into the same 0112 migration as the existing `instance_type` column since neither has shipped yet.

See ~/fd-plans/kiloclaw/admin-machine-size-override.md and admin-machine-size-override-deviations.md.
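The precedence rule reduces to a one-line helper; a sketch under the assumption that DO state carries both fields:

```ts
// Sketch only: MachineSize mirrors the hardware shape used elsewhere in the DO.
type MachineSize = {
  cpus: number;
  memoryMb: number;
  cpuKind: "shared" | "performance";
};

interface OverrideStateSlice {
  machineSize: MachineSize | null; // billable tier hardware
  adminMachineSizeOverride: MachineSize | null; // temporary support override
}

// Applied at runtime-spec construction sites only; tier comparisons, the
// customer dashboard, and backfill keep reading state.machineSize directly.
export function effectiveMachineSize(state: OverrideStateSlice): MachineSize | null {
  return state.adminMachineSizeOverride ?? state.machineSize;
}
```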
…-types

```
# Conflicts:
#   packages/db/src/migrations/meta/0112_snapshot.json
#   packages/db/src/migrations/meta/_journal.json
```
When `clearAdminMachineSizeOverride` short-circuits because DO state is already null, also fire the best-effort Postgres sync. The `admin_size_override` column is the read cache for the admin "Has size override" filter and badge — if a prior best-effort sync failed (or DO was restored without the override while Postgres still held a stale payload), the list page would show a phantom override forever. An admin clicking "Clear Size Override" must repair the cache even when the DO is already null. Sync is idempotent via IS DISTINCT FROM, so this is a SQL no-op when Postgres already matches.
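A sketch of the idempotent cache write, assuming a drizzle-style `db.execute`; the table and column names follow the PR, the rest is illustrative:

```ts
// Sketch only: the real sync helper lives in the worker's Postgres layer.
import { sql } from "drizzle-orm";

type Executor = { execute(query: unknown): Promise<unknown> };

export async function syncAdminSizeOverrideSketch(
  db: Executor,
  instanceId: string,
  payloadJson: string | null // serialized override payload, or null to clear
): Promise<void> {
  // IS DISTINCT FROM makes this a SQL no-op when Postgres already matches,
  // so firing it on the "DO state already null" short-circuit is safe and cheap.
  await db.execute(sql`
    UPDATE kiloclaw_instances
    SET admin_size_override = ${payloadJson}::jsonb
    WHERE id = ${instanceId}
      AND admin_size_override IS DISTINCT FROM ${payloadJson}::jsonb
  `);
}
```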
Five distinct fixes from PR review:

1. Ownership check on admin hardware mutations. resizeMachine, setAdminMachineSizeOverride, and clearAdminMachineSizeOverride now call assertInstanceBelongsToUser after resolveInstance, so passing userId=A + instanceId=B (where B belongs to user C) no longer targets C's instance while the audit log records target=A.
2. /provision no longer defaults instanceType on re-provision. provision() is overloaded as the entrypoint for both fresh-create AND config-update flows on existing instances. Defaulting to DEFAULT_INSTANCE_TIER unconditionally silently overwrote custom (e.g. extend-volume) and legacy tier hardware on the next config change. The default is now gated on shouldInsertInstanceRecord; for re-provisions, instanceType passes through as undefined and the DO's inferredInstanceType path preserves existing state. Regression tests added in platform-provision-bootstrap.test.ts.
3. cpu_kind validation via MachineSizeSchema.safeParse. A new parseMachineSizeFromFlyGuest helper in machine-config.ts replaces three sites that cast `cpu_kind as 'shared' | 'performance' | undefined` (reconcile alarm, startExistingMachine, restart-path backfill). Unknown vocabulary from Fly is now caught and logged instead of landing in DO state and silently flipping the instance to 'custom'.
4. parseAdminSizeOverride uses a shared Zod schema. A new AdminSizeOverridePayloadSchema in admin-size-override.ts is the canonical shape for the kiloclaw_instances.admin_size_override JSONB column. AdminSizeOverrideRow is now an alias of AdminSizeOverridePayload — single source of truth, automatic field-drift detection, no hand-rolled type guards.
5. wrangler.jsonc no longer commits empty NF_DEPLOYMENT_PLAN_PERF_* keys. Empty-string defaults relied on optionalEnv trim semantics; removing the keys means values come solely from environment overrides, and getNorthflankConfig falls back to NF_DEPLOYMENT_PLAN.
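A sketch of fix 3 above, assuming the schema fields shown; the real `MachineSizeSchema` in machine-config.ts is the source of truth and may carry more fields:

```ts
// Sketch only: validates Fly guest vocabulary instead of casting cpu_kind.
import { z } from "zod";

const MachineSizeSchema = z.object({
  cpus: z.number().int().positive(),
  memoryMb: z.number().int().positive(),
  cpuKind: z.enum(["shared", "performance"]),
});

export type MachineSize = z.infer<typeof MachineSizeSchema>;

export function parseMachineSizeFromFlyGuest(
  guest: { cpus?: number; memory_mb?: number; cpu_kind?: string } | undefined
): MachineSize | null {
  if (!guest) return null;
  const parsed = MachineSizeSchema.safeParse({
    cpus: guest.cpus,
    memoryMb: guest.memory_mb,
    cpuKind: guest.cpu_kind,
  });
  if (!parsed.success) {
    // Unknown vocabulary from Fly is logged (the real code uses the DO's warn
    // logger) instead of landing in DO state and flipping the tier to 'custom'.
    console.warn("[kiloclaw] unrecognized Fly guest shape", guest);
    return null;
  }
  return parsed.data;
}
```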
Code Review Summary

Status: 2 Issues Found | Recommendation: Address before merge

No new issues were found in the incremental changes.

Other Observations (not in diff): issues found in unchanged code that cannot receive inline comments.

Files reviewed: 1 incremental file. Reviewed by gpt-5.5-20260423 · 778,703 tokens.
Set/clear of an admin size override no longer requires the instance to be stopped. The override is a pure DO state write; the Fly `updateMachine(guest=...)` call doesn't happen until the next stop/start cycle (in `startExistingMachine`), where Fly's own state machine enforces the constraint. Setting on a running machine is safe — the current container keeps running on tier hardware until the next restart, at which point the override applies. Tier resize keeps `requireStopped: true` because it extends the Fly volume in place, which Fly only allows on stopped machines. `assertAdminSizeChangeAllowed` gains a `requireStopped: boolean` flag; callers pass it explicitly so the constraint travels with the call site that needs it. `beforePhrase` is now optional (only used when `requireStopped: true`). UI: the Size Override button is no longer disabled when the machine is running. The dialog surfaces an inline note when the instance is running explaining that the change applies on the next stop/start cycle. Existing tests for "rejects when running" flipped to assert the new persists-on-running behavior. Northflank rejection unchanged — Northflank's async pod-rollout model isn't wired up for either path yet.
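A sketch of how the flag might travel with the call site; the real guard's signature and the provider/status vocabulary are assumptions:

```ts
// Sketch only.
interface InstanceSnapshot {
  provider: "fly" | "docker-local" | "northflank";
  isRunning: boolean;
}

export function assertAdminSizeChangeAllowed(
  instance: InstanceSnapshot,
  opts: { requireStopped: boolean; beforePhrase?: string }
): void {
  if (instance.provider === "northflank") {
    // Northflank's async pod-rollout model isn't wired up for either path yet.
    throw new Error("Size changes are not yet supported on Northflank");
  }
  if (opts.requireStopped && instance.isRunning) {
    // Only tier resize needs this: it extends the Fly volume in place, which
    // Fly only allows on stopped machines. Overrides apply at next stop/start.
    throw new Error(`Stop the instance ${opts.beforePhrase ?? "before resizing"}.`);
  }
}

// Callers pass the constraint explicitly, e.g.:
//   tier resize:        { requireStopped: true, beforePhrase: "before resizing its tier" }
//   set/clear override: { requireStopped: false }
```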
The override callout was rendered as a single grid cell wedged between Region and Volume Size, which broke the 3-column rhythm of every row below it and ended up visually disconnected from the Override badge it was reporting on. Moved to a full-width Alert banner above the Live Worker Status grid (matching the existing `workerStatusError` pattern in the same card). Tightened the copy to a single line for the hardware/actor/reason information plus a small caveat about billing.
Drop the redundant 'Billing stays on the customer's tier; this override is invisible to the customer dashboard.' line — the shield icon, amber styling, and 'Admin size override' header already telegraph the gist, and the line was wrapping awkwardly. Also shorten the metadata line to '· actor, time — reason' instead of '· set by actor time — reason'.
…e line shadcn's AlertDescription applies `grid justify-items-start gap-1`, which makes every direct child a separate grid row. The inline <span>/<strong>/<em>/text nodes were each landing on their own line. Wrapping the whole content in a single <p> collapses it to one grid row and lets the inline elements flow naturally.
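A sketch of the fix, assuming the standard shadcn import path; the banner values shown are placeholders, not the real copy:

```tsx
// Sketch only: wrapping the inline nodes in a single <p> gives AlertDescription's
// grid exactly one child row, so <strong>/<em>/text flow naturally on one line.
import { Alert, AlertDescription, AlertTitle } from "@/components/ui/alert";

export function AdminSizeOverrideBannerSketch() {
  return (
    <Alert>
      <AlertTitle>Admin size override</AlertTitle>
      <AlertDescription>
        <p>
          <strong>4 CPU / 16 GB</strong> · admin@example.com, 2h ago —{" "}
          <em>OOM recovery</em>
        </p>
      </AlertDescription>
    </Alert>
  );
}
```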
…de scrub Three review-feedback fixes:

1. Postgres restore now derives machineSize and volumeSizeGb from the catalog when the persisted instanceType is a known offered/legacy tier. Previously the restore left hardware nulls beside a non-null tier label, so the next live-check would observe whatever Fly reported and could relabel the instance as custom if the running guest did not match the catalog yet. For a custom or null tier, it leaves the nulls and lets the existing live-check backfill self-heal.
2. The comment on /extend-volume best-effort DO catchup was misleading — it claimed alarm/live-check would self-heal, but the alarm only reconciles machineSize, not volume size. Replaced with an honest description of the actual recovery path (admin re-runs /extend-volume; both the Fly extend and DO recordVolumeExtend are idempotent on retry) and a follow-up note for proper alarm-driven volume reconciliation.
3. softDeleteUser now clears kiloclaw_instances.admin_size_override on (a) the deleted user's retained destroyed instances, and (b) any instance where the deleted user was the admin actor. Both protect admin PII (actorEmail) and free-form reason text from persisting in the denormalized read cache after soft-delete. A new test in user.test.ts covers both paths plus a control row.
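A sketch of fix 1's restore-time derivation; `getTierByKey` is a hypothetical catalog lookup standing in for whatever `@kilocode/kiloclaw-instance-tiers` actually exposes:

```ts
// Sketch only.
type MachineSize = { cpus: number; memoryMb: number; cpuKind: "shared" | "performance" };

interface RestoredState {
  instanceType: string | null;
  machineSize: MachineSize | null;
  volumeSizeGb: number | null;
}

export function hydrateHardwareFromTier(
  state: RestoredState,
  getTierByKey: (key: string) => { machineSize: MachineSize; volumeSizeGb: number } | undefined
): void {
  // For "custom" or null tiers, leave the nulls alone and let the existing
  // live-check backfill self-heal from the observed Fly guest.
  if (!state.instanceType || state.instanceType === "custom") return;
  const tier = getTierByKey(state.instanceType);
  if (!tier) return;
  // Filling hardware from the catalog prevents the next live-check from
  // relabeling a known tier as "custom" just because the guest lagged behind.
  state.machineSize ??= tier.machineSize;
  state.volumeSizeGb ??= tier.volumeSizeGb;
}
```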
Originally added during the backfill investigation and kept for production observation. The self-heal and tier-backfill paths have stabilized, and the doWarn on schema-validation failure (parseMachineSizeFromFlyGuest returning null) is the only remaining diagnostic — appropriate for a real edge case worth alerting on. The verbose [instance-tier-debug] log added no actionable signal in steady state.
```diff
-  `[admin-kiloclaw] extendVolume triggered by admin ${ctx.user.id} (${ctx.user.google_user_email}) app=${input.appName} volume=${input.volumeId} size=15GB`
+  `[admin-kiloclaw] extendVolume triggered by admin ${ctx.user.id} (${ctx.user.google_user_email}) app=${input.appName} volume=${input.volumeId} targetSizeGb=${input.targetSizeGb}`
   );
   const instance = await resolveInstance(input.userId, input.instanceId);
```
The other three admin hardware mutation routes added in this PR (resizeMachine, setAdminMachineSizeOverride, clearAdminMachineSizeOverride) all call assertInstanceBelongsToUser(instance, input.userId) immediately after resolveInstance. extendVolume was modified here (the new targetSizeGb) but did not pick up the same guard.
resolveInstance(input.userId, input.instanceId) calls getInstanceById(instanceId), which queries only by id and does not filter by user_id. The platform route's resolveInstanceDoKey also ignores userId when instanceId is supplied. So an admin who passes userId=A together with instanceId=B where B belongs to user C ends up extending C's volume while the audit log row records target_user_id=A. The Fly app/volume consistency check on lines 2747 to 2752 passes as long as the supplied appName and volumeId match the resolved instance, which they do for C's instance. Fly volumes can grow but cannot be shrunk, so the storage change is permanent and attributed to the wrong user.
Suggestion: add assertInstanceBelongsToUser(instance, input.userId); immediately after the resolveInstance call. A regression test alongside the existing extendVolume tests would lock the behavior in so the guard does not silently regress in a future refactor.
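A sketch of the suggested placement; `resolveInstance` and `assertInstanceBelongsToUser` are the existing helpers named above, passed in here only to keep the sketch self-contained:

```ts
// Sketch only: surrounding tRPC plumbing, audit logging, and the Fly call omitted.
export async function extendVolumeGuarded(
  input: { userId: string; instanceId: string; targetSizeGb: number },
  deps: {
    resolveInstance: (userId: string, instanceId: string) => Promise<{ userId: string }>;
    assertInstanceBelongsToUser: (instance: { userId: string }, userId: string) => void;
  }
): Promise<void> {
  const instance = await deps.resolveInstance(input.userId, input.instanceId);
  // Same guard as the other admin hardware mutations: a userId/instanceId
  // mismatch fails here, before any Fly call or audit log row is written.
  deps.assertInstanceBelongsToUser(instance, input.userId);
  // ...existing appName/volumeId consistency check and Fly extend follow...
}
```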
```ts
let providerState = getNorthflankProviderState(state);
const projectId = providerState.projectId;
const volumeName = providerState.volumeName ?? names.volumeName;
const deploymentPlan = resolveNorthflankDeploymentPlan(
```
Storage sizing on Northflank did not pick up the same tier-from-state shape used here for the deployment plan. Both Northflank provisioning paths still hardcode storageSizeMb: config.volumeSizeMb (the global NF_VOLUME_SIZE_MB): ensureProvisioningResources at line 519 and ensureStorage at line 562. Meanwhile the DO's provision() now derives state.volumeSizeGb from the tier catalog (20 GB for perf-4-8, 40 GB for perf-4-16), and getStatus() returns that value to the customer dashboard.
Net effect for a freshly provisioned Northflank instance at any tier other than perf-1-3:
- Postgres `instance_type` and DO `state.volumeSizeGb` advertise the catalog tier (20 or 40 GB).
- Actual Northflank volume size is whatever `NF_VOLUME_SIZE_MB` resolves to (default 10240, so 10 GB).
- The customer fills past 10 GB and hits disk-full errors with no signal anywhere that the persisted state diverged from reality.
The cleanest fix mirrors the deployment plan path: read state.volumeSizeGb (or look it up from the catalog when an instanceType is set) and pass that as storageSizeMb in both provisioning sites. If sizing per instance is not viable on Northflank because of storage class constraints, the alternative is to reject tiers other than the default on Northflank provisioning at the platform /provision route, matching how tier resize is already rejected on Northflank. Either way the persisted state should not promise more storage than the volume actually has.
A test in providers/northflank/index.test.ts parallel to the existing tier deployment plan coverage would seal the chosen behavior.
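For concreteness, a sketch of the first option (derive `storageSizeMb` from DO state or the catalog, with the global value as fallback); the helper name and parameters are illustrative, not code from the PR:

```ts
// Sketch only: both Northflank provisioning sites would call something like this
// instead of passing config.volumeSizeMb (NF_VOLUME_SIZE_MB) unconditionally.
export function resolveNorthflankStorageSizeMb(
  state: { volumeSizeGb: number | null; instanceType: string | null },
  catalogVolumeGb: (tierKey: string) => number | undefined,
  fallbackMb: number // config.volumeSizeMb
): number {
  if (state.volumeSizeGb != null) return state.volumeSizeGb * 1024;
  if (state.instanceType && state.instanceType !== "custom") {
    const gb = catalogVolumeGb(state.instanceType);
    if (gb != null) return gb * 1024;
  }
  return fallbackMb;
}
```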
Summary
1. Named instance-tier catalog
Replaces the admin UI's free-form
`{cpus, memory_mb, cpu_kind}` resize with a named-tier catalog as the source of truth for provisioning and resizing. New shared package `@kilocode/kiloclaw-instance-tiers` consumed by both worker and web.

Tier keys:

- `perf-1-3` (default)
- `perf-4-8`
- `perf-4-16`
- `shared-2-3` (legacy)
- `shared-2-4` (legacy)

`resizeMachine(targetTierKey)` is upgrade-only and stopped-machine-only; downgrades, sidegrades, and legacy-as-target are rejected server-side. Upgrade policy lives in the catalog (`canUpgradeTo`) so future providers reuse it.

Per-provider:

- Fly: tier hardware is observed and backfilled from `machine.config.guest`.
- docker-local: tier hardware maps onto the container's `Memory`/`NanoCpus`. No automatic backfill for legacy docker-local instances (the provider doesn't observe its own container; the cure is re-provision or admin resize).
- Northflank: tiers map to `NF_DEPLOYMENT_PLAN_PERF_*` env vars (falling back to `NF_DEPLOYMENT_PLAN`). Resize rejected with "not yet supported."

DO state gains `instanceType` + `volumeSizeGb`; Postgres gains `kiloclaw_instances.instance_type` (nullable text, CHECK-constrained, partial index). Postgres syncs only when DO state actually changes — at provision insert, at resize, or at a backfill that observes a previously-unknown shape. No alarm-tick churn.

Also fixes two stale bullets in `services/kiloclaw/AGENTS.md` that incorrectly claimed Next.js was the sole `kiloclaw_instances` writer (contradicting `.specs/kiloclaw-datamodel.md` rule 21 and the actual Worker insert path).

2. Admin temporary CPU/RAM override
Support occasionally needs to temporarily upgrade a customer's hardware to recover them (OOM, runaway process) and revert later. Tier resize is the wrong tool — it's upgrade-only by design and tier-coupled (rewrites billing-relevant fields). Adds a parallel admin-only override that wins over the tier-derived
`machineSize` for runtime spec construction without touching `instanceType` or `volumeSizeGb`.

Key design points:

- DO state gains `adminMachineSizeOverride: MachineSize | null` + metadata. A new `effectiveMachineSize(state) = override ?? machineSize` helper is applied at every Fly-guest construction site; `state.machineSize` stays the billable tier hardware everywhere else (tier comparisons, customer dashboard, backfill).
- Presets `'perf-4-8'` / `'perf-4-16'` are enforced at the tRPC + worker route boundary; the DO accepts any `MachineSize`. Adding a preset or relaxing to free-form is a tRPC-layer change.
- Tier resize auto-clears the override (and reports `clearedOverride` in its response + audit log) so a tier change doesn't silently downgrade or free-upgrade the customer.
- Customer-facing `getStatus()` keeps returning the billable tier hardware. The override is admin-only by design — billing stays on the tier, support cost is the upgrade subsidy.
- `backfillMachineSizeFromFlyConfig` and `startExistingMachine`'s legacy branch both early-return when an override is active, so the live Fly guest (which reflects the override) doesn't get mistaken for tier hardware.
- The override takes effect at the next stop/start cycle via `startExistingMachine` (where `fly.updateMachine(guest=...)` enforces Fly's own stopped-state requirement). Northflank rejected for both.

Out of scope
User-facing upgrade flow • per-tier billing/pricing • Northflank tier resize • volume downgrades • docker-local auto-backfill • override TTL/auto-expiry (additive) • outstanding-overrides admin dashboard view (the filter exists; a panel is a small follow-up) • alarm-driven volume-size reconciliation for
`/extend-volume`.

Verification
Manually verified end-to-end on docker-local and Fly: provision → resize
`perf-1-3 → perf-4-8` → `docker inspect` confirms `Memory: 8 GiB` and `NanoCpus: 4` matching the catalog; provider-aware dialog Alerts render correctly. Set/clear override on a running Fly instance, restart, confirm the Fly guest matches the override on the new machine.

Visual Changes
Admin detail page: catalog-driven tier resize dialog, Instance Tier badge with optional
`Override` chip, Volume Size row, new Size Override button + dialog (two radios + reason field, amber styling). When override is active, a banner above the Live Worker Status grid shows hardware + actor + reason.

Admin list page: per-row `Override` badge with tooltip; `Has size override` filter chip in the toolbar.

No user-facing surface changes.
Reviewer Notes
- `0116_low_amazoness.sql` adds both columns + indexes + CHECK; nullable, additive, no defaults.
- `'custom'` is terminal only until the next resize: written by backfill when the live Fly guest doesn't match any catalog entry, and replaced by a real key on the next resize.
- `resolveInstanceTypeFromState` self-heals the incoherent `custom + null machineSize` shape (read-side, doesn't write back).