67 changes: 67 additions & 0 deletions api/openapi.yaml
@@ -35,6 +35,8 @@
description: Local dev

tags:
- name: health
description: Health and leader-election probes
- name: tasks
description: Download task management
- name: subtasks
@@ -67,7 +69,7 @@
paths:
# ========== Tasks ==========
/tasks:
get:
Check warning on line 72 in api/openapi.yaml (GitHub Actions / OpenAPI lint): operation-description: Operation "description" must be present and non-empty string.
tags: [tasks]
summary: List tasks
operationId: listTasks
@@ -90,7 +92,7 @@
'401': {$ref: '#/components/responses/Unauthenticated'}
'429': {$ref: '#/components/responses/RateLimited'}

post:
Check warning on line 95 in api/openapi.yaml (GitHub Actions / OpenAPI lint): operation-description: Operation "description" must be present and non-empty string.
tags: [tasks]
summary: Create download task
operationId: createTask
@@ -191,7 +193,7 @@
parameters:
- $ref: '#/components/parameters/TaskId'

get:
Check warning on line 196 in api/openapi.yaml (GitHub Actions / OpenAPI lint): operation-description: Operation "description" must be present and non-empty string.
tags: [tasks]
summary: Get task by ID
operationId: getTask
@@ -202,7 +204,7 @@
application/json:
schema: {$ref: '#/components/schemas/DownloadTask'}

patch:
Check warning on line 207 in api/openapi.yaml (GitHub Actions / OpenAPI lint): operation-description: Operation "description" must be present and non-empty string.
tags: [tasks]
summary: Update task (priority only)
operationId: updateTask
@@ -224,7 +226,7 @@
/tasks/{taskId}/cancel:
parameters:
- $ref: '#/components/parameters/TaskId'
post:
Check warning on line 229 in api/openapi.yaml (GitHub Actions / OpenAPI lint): operation-description: Operation "description" must be present and non-empty string.
tags: [tasks]
summary: Cancel task (async)
operationId: cancelTask
@@ -250,7 +252,7 @@
/tasks/{taskId}/retry:
parameters:
- $ref: '#/components/parameters/TaskId'
post:
Check warning on line 255 in api/openapi.yaml (GitHub Actions / OpenAPI lint): operation-description: Operation "description" must be present and non-empty string.
tags: [tasks]
summary: Retry failed subtasks
operationId: retrySubtasks
@@ -296,7 +298,7 @@
/tasks/{taskId}/upgrade:
parameters:
- $ref: '#/components/parameters/TaskId'
post:
Check warning on line 301 in api/openapi.yaml (GitHub Actions / OpenAPI lint): operation-description: Operation "description" must be present and non-empty string.
tags: [tasks]
summary: Upgrade to new revision (incremental)
operationId: upgradeTask
@@ -320,7 +322,7 @@
/tasks/{taskId}/subtasks:
parameters:
- $ref: '#/components/parameters/TaskId'
get:
Check warning on line 325 in api/openapi.yaml (GitHub Actions / OpenAPI lint): operation-description: Operation "description" must be present and non-empty string.
tags: [subtasks]
summary: List subtasks of a task
operationId: listSubtasks
@@ -343,7 +345,7 @@
/tasks/{taskId}/source-allocation:
parameters:
- $ref: '#/components/parameters/TaskId'
get:
Check warning on line 348 in api/openapi.yaml (GitHub Actions / OpenAPI lint): operation-description: Operation "description" must be present and non-empty string.
tags: [tasks]
summary: View source allocation (multi-source visualization)
operationId: getSourceAllocation
@@ -357,7 +359,7 @@
/tasks/{taskId}/events:
parameters:
- $ref: '#/components/parameters/TaskId'
get:
Check warning on line 362 in api/openapi.yaml (GitHub Actions / OpenAPI lint): operation-description: Operation "description" must be present and non-empty string.
tags: [tasks]
summary: Task event log
operationId: getTaskEvents
@@ -434,6 +436,35 @@
            application/json:
              schema: {$ref: '#/components/schemas/ModelInfo'}

  # ========== Health ==========
  /health/active:
    get:
      tags: [health]
      summary: LB target (200 iff this instance holds the leader advisory lock)
      operationId: healthActive
      security: []
      responses:
        '200':
          description: Active or recovering; LB should route to this instance.
          content:
            application/json:
              schema:
                type: object
                properties:
                  status: { type: string, enum: [active] }
                  controller_state: { type: string, enum: [recovering, active] }
        '503':
          description: Standby; LB must NOT route to this instance.
          content:
            application/json:
              schema:
                type: object
                properties:
                  detail:
                    type: object
                    properties:
                      controller_state: { type: string, enum: [standby] }

  # ========== Executors ==========
  /executors/register:
    post:
@@ -644,6 +675,18 @@
          description: Server requesting backoff
          headers:
            Retry-After: {schema: {type: integer}}
        '503':
          description: Controller is recovering after a failover (INVARIANT 33). Retry shortly.
          content:
            application/json:
              schema:
                type: object
                properties:
                  detail:
                    type: object
                    properties:
                      code: { type: string, enum: [CONTROLLER_RECOVERING] }
                      message: { type: string }

/executors/{executorId}/renew:
parameters:
@@ -729,6 +772,18 @@
type: integer
got:
type: integer
        '503':
          description: Controller is recovering after a failover (INVARIANT 33). Retry shortly.
          content:
            application/json:
              schema:
                type: object
                properties:
                  detail:
                    type: object
                    properties:
                      code: { type: string, enum: [CONTROLLER_RECOVERING] }
                      message: { type: string }

/hf-proxy/subtask/{subtaskId}:
get:
@@ -899,6 +954,18 @@
type: integer
        '409':
          description: STALE_ASSIGNMENT or epoch fence violation
        '503':
          description: Controller is recovering after a failover (INVARIANT 33). Retry shortly.
          content:
            application/json:
              schema:
                type: object
                properties:
                  detail:
                    type: object
                    properties:
                      code: { type: string, enum: [CONTROLLER_RECOVERING] }
                      message: { type: string }

# ========== Tenants / Quota / Audit ==========
/tenants/{tenantId}:
39 changes: 39 additions & 0 deletions docs/operator/executor-runbook.md
@@ -112,3 +112,42 @@ and any deployment manifests; they are now ignored):
rather than executor→HF directly. For the internal beta this is acceptable;
global rate-limit coordination and an executor-local credential pool are
Phase 3 items.

## Controller leadership (W3c)

As of Phase 2 W3c, the controller supports active/standby deployments via an
app-level leader election. The instance holding a session-level
PostgreSQL advisory lock (`pg_try_advisory_lock(<DLW_ACTIVE_LOCK_ID>)`) is
the **active**; all others are **standby**.
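
The acquisition step can be sketched as follows. This is a minimal illustration, not the controller's actual code: the helper name `try_acquire_leader_lock` is hypothetical, and the cursor is assumed to be any DB-API 2.0 cursor that uses `%s` placeholders (psycopg does). Only `pg_try_advisory_lock` and the default key come from this runbook.

```python
# Default DLW_ACTIVE_LOCK_ID as documented in this runbook.
DEFAULT_ACTIVE_LOCK_ID = 0x444C5743414B5631


def try_acquire_leader_lock(cursor, lock_id=DEFAULT_ACTIVE_LOCK_ID):
    """Return True if the session behind `cursor` now holds the lock.

    Hypothetical helper. pg_try_advisory_lock never blocks: it returns
    true when the lock was obtained and false when another session
    already holds it.
    """
    cursor.execute("SELECT pg_try_advisory_lock(%s)", (lock_id,))
    row = cursor.fetchone()
    return bool(row[0])
```

Because the lock is session-level, the connection that acquired it has to stay open for as long as the instance should remain active; closing it (or losing it) releases the lock.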

**LB routing:** point the load balancer's health check at `GET /health/active`
(returns 200 only when this instance holds the lock). `/health/live` and
`/health/ready` remain the k8s liveness/readiness probes — unchanged.
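
To make the routing rule concrete, here is a small sketch of how a health checker might interpret a `GET /health/active` response. The function name `classify_health_active` is hypothetical; the body shapes (`controller_state` at the top level on 200, nested under `detail` on 503) are taken from the `api/openapi.yaml` change in this PR.

```python
import json
from typing import Optional, Tuple


def classify_health_active(status_code: int, body: str) -> Tuple[bool, Optional[str]]:
    """Return (route_here, controller_state) for a /health/active response.

    Routing depends only on the status code; the state is extracted for
    logging and may be None if the body is missing or malformed.
    """
    try:
        payload = json.loads(body)
    except ValueError:
        payload = {}
    if not isinstance(payload, dict):
        payload = {}

    if status_code == 200:
        # 200 covers both `recovering` and `active`.
        return True, payload.get("controller_state")

    detail = payload.get("detail")
    state = detail.get("controller_state") if isinstance(detail, dict) else None
    return False, state
```

Keying the routing decision on the status code alone keeps the check compatible with load balancers that cannot inspect response bodies.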

**Failover behaviour:** when the active dies, PG auto-releases the advisory
lock the instant its holding session ends. A standby's leader-loop poll
(default 5 s, configurable via `DLW_LEADER_POLL_INTERVAL_SECONDS`) acquires
the freed lock and promotes through `standby → recovering → active`. During
the `recovering` phase the executor-loop endpoints (heartbeat, poll, report)
return **503 `CONTROLLER_RECOVERING`** — executors retry through their
existing tenacity backoff. Total RTO target: ≤ 10 min.
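
The promotion sequence can be modelled as a small state machine. The sketch below is a toy model under stated assumptions, not the controller's implementation: the class `LeaderLoop`, its method names, and the injected `try_lock` / `recover` callables are all hypothetical, and running recovery on the same tick that acquires the lock is a modelling choice. Loss of the lock while active is not modelled.

```python
from typing import Callable


class LeaderLoop:
    """Toy model of the standby -> recovering -> active promotion."""

    def __init__(self, try_lock: Callable[[], bool], recover: Callable[[], None]):
        self.try_lock = try_lock  # e.g. wraps pg_try_advisory_lock
        self.recover = recover    # hypothetical recovery routine
        self.state = "standby"

    def tick(self) -> str:
        """Run one poll iteration and return the resulting state."""
        if self.state == "standby":
            if not self.try_lock():
                return self.state
            self.state = "recovering"
        if self.state == "recovering":
            try:
                self.recover()
            except Exception:
                # Stay in `recovering`; the next tick retries recovery.
                return self.state
            self.state = "active"
        return self.state

    def rejects_executor_calls(self) -> bool:
        """True while heartbeat/poll/report should answer 503.

        Only the `recovering` case is described in this runbook; standby
        behaviour for these endpoints is not modelled here.
        """
        return self.state == "recovering"
```

In a real deployment `tick()` would be driven by a timer using `DLW_LEADER_POLL_INTERVAL_SECONDS`; injecting the callables keeps the model testable without a database.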

**Relationship to PG-level failover (`promote-standby.sh`):** the app-level
lock is orthogonal to PostgreSQL primary failover. The runbook
`deploy/runbooks/scripts/promote-standby.sh` promotes the PG primary itself
(CH-Q3); after that script runs, the controller pods reconnect and the
advisory lock is re-acquired automatically by whichever pod wins the race.

**Required environment variables:**

- `DLW_ACTIVE_LOCK_ID`: bigint advisory-lock key. Default
  `0x444C5743414B5631`. **All controller instances MUST use the same value**;
  with mismatched values each instance locks a different key, so more than
  one can believe it is active.
- `DLW_LEADER_POLL_INTERVAL_SECONDS`: standby poll interval in seconds
  (default 5.0, range 0.5–60.0).
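
As an illustration of the constraints above, the following sketch reads and validates both variables. It is not the controller's settings code: `load_leader_settings` is a hypothetical name, accepting hex notation for the lock ID is an assumption, and so is rejecting (rather than clamping) out-of-range values.

```python
import os
from typing import Mapping, Tuple

DEFAULT_LOCK_ID = 0x444C5743414B5631  # documented default
DEFAULT_POLL_SECONDS = 5.0            # documented default
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1  # PostgreSQL bigint range


def load_leader_settings(env: Mapping[str, str] = os.environ) -> Tuple[int, float]:
    """Return (lock_id, poll_interval_seconds) from the environment."""
    raw_id = env.get("DLW_ACTIVE_LOCK_ID")
    # Base 0 accepts both "123" and "0x7B" (assumed input formats).
    lock_id = int(raw_id, 0) if raw_id is not None else DEFAULT_LOCK_ID
    if not INT64_MIN <= lock_id <= INT64_MAX:
        raise ValueError("DLW_ACTIVE_LOCK_ID does not fit in a bigint")

    raw_poll = env.get("DLW_LEADER_POLL_INTERVAL_SECONDS")
    poll = float(raw_poll) if raw_poll is not None else DEFAULT_POLL_SECONDS
    if not 0.5 <= poll <= 60.0:
        raise ValueError("DLW_LEADER_POLL_INTERVAL_SECONDS must be in 0.5..60.0")

    return lock_id, poll
```

The default key fits in a signed 64-bit integer, which is what the single-argument form of `pg_try_advisory_lock` expects.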

**Removed environment variables (W3c):**

- `DLW_STRICT_RECOVERY`: deleted. Recovery failures now keep the controller
  in `recovering` and retry on the next leader-loop tick. Heartbeats (and the
  other executor-loop endpoints) keep returning 503 in the meantime, so a
  stuck recovery is alertable from log volume.