Merged
17 commits
23a85f1
Phase 3 SP3 design — incremental download + global dedup (refcount/GC)
l17728 May 19, 2026
ca2c9c8
Phase 3 SP3 implementation plan — incremental download + global dedup
l17728 May 19, 2026
d1cb171
Phase 3 SP3 plan/spec — apply 2-reviewer pre-execution fixes
l17728 May 19, 2026
e0607c9
feat(sp3): GC config + inherit status + upgrade_from_revision wire + …
l17728 May 19, 2026
ed373da
feat(sp3): StorageObject/SubtaskObjectRef models + migration
l17728 May 19, 2026
491afb9
feat(sp3): storage_objects service (upsert/ref/deref/gc + inherit ide…
l17728 May 19, 2026
7e6ac2b
test(sp3): deref decrement + gc zero-refcount/grace coverage
l17728 May 19, 2026
0c7ed6d
feat(sp3): diff_and_dedup (existing-object => inherit, unified dedup)
l17728 May 19, 2026
b92d341
feat(sp3): run diff_and_dedup before plan_task_sources; planner skips…
l17728 May 19, 2026
baf25de
feat(sp3): record_object on success (idempotent) + claim includes inh…
l17728 May 19, 2026
332ced0
feat(sp3): executor inherit materialization (S3 copy / hardlink) + ru…
l17728 May 19, 2026
9e91d4b
feat(sp3): DELETE /api/v1/tasks/{id} (terminal-only, tenant-scoped, d…
l17728 May 19, 2026
ab102af
feat(sp3): leader-gated GC loop (reclaims refcount=0 storage_objects)
l17728 May 19, 2026
846dd21
test(sp3): M4 milestone-gate fixes — alembic EXPECTED_TABLES + leak t…
l17728 May 19, 2026
a501f42
test(sp3): E2E-incremental — 1-file-changed upgrade inherits >=90%
l17728 May 19, 2026
94deb1e
docs(sp3): OpenAPI DELETE/inherit_from_key + operator incremental guide
l17728 May 19, 2026
6037d0f
fix(sp3): exempt inherit subtasks from the local-disk pre-flight
l17728 May 19, 2026
33 changes: 33 additions & 0 deletions api/openapi.yaml
@@ -166,7 +166,7 @@

# ========== Tasks ==========
/tasks:
get:

Check warning on line 169 in api/openapi.yaml (GitHub Actions / OpenAPI lint): operation-description: Operation "description" must be present and non-empty string.
tags: [tasks]
summary: List tasks
operationId: listTasks
@@ -194,7 +194,7 @@
schema: {$ref: '#/components/schemas/RbacDenied'}
'429': {$ref: '#/components/responses/RateLimited'}

post:

Check warning on line 197 in api/openapi.yaml (GitHub Actions / OpenAPI lint): operation-description: Operation "description" must be present and non-empty string.
tags: [tasks]
summary: Create download task
operationId: createTask
@@ -296,7 +296,7 @@
parameters:
- $ref: '#/components/parameters/TaskId'

get:

Check warning on line 299 in api/openapi.yaml (GitHub Actions / OpenAPI lint): operation-description: Operation "description" must be present and non-empty string.
tags: [tasks]
summary: Get task by ID
operationId: getTask
@@ -315,7 +315,7 @@
'404':
description: Task not found or cross-tenant ID (existence not leaked)

patch:

Check warning on line 318 in api/openapi.yaml (GitHub Actions / OpenAPI lint): operation-description: Operation "description" must be present and non-empty string.
tags: [tasks]
summary: Update task (priority only)
operationId: updateTask
@@ -334,10 +334,41 @@
application/json:
schema: {$ref: '#/components/schemas/DownloadTask'}

    delete:
      tags: [tasks]
      summary: Delete a terminal task (dereferences its storage objects)
      description: >
        Terminal-only (succeeded/failed/cancelled → else 409). Tenant-scoped
        and RBAC-gated. Dereferences each subtask's storage_objects row
        (refcount--); physical bytes are NOT deleted here: refcount=0 rows
        are reclaimed by the leader-gated GC past the grace window.
      operationId: deleteTask
      responses:
        '204':
          description: Task deleted; referenced storage_objects dereferenced
        '401': {$ref: '#/components/responses/Unauthenticated'}
        '403':
          description: RBAC denied
          content:
            application/json:
              schema: {$ref: '#/components/schemas/RbacDenied'}
        '404':
          description: Task not found or cross-tenant ID (existence not leaked)
        '409':
          description: Task is not terminal
          content:
            application/json:
              schema:
                type: object
                required: [code]
                properties:
                  code: {type: string, enum: [TASK_NOT_TERMINAL]}
                  status: {type: string}

/tasks/{taskId}/cancel:
parameters:
- $ref: '#/components/parameters/TaskId'
post:

Check warning on line 371 in api/openapi.yaml (GitHub Actions / OpenAPI lint): operation-description: Operation "description" must be present and non-empty string.
tags: [tasks]
summary: Cancel task (async)
operationId: cancelTask
@@ -371,7 +402,7 @@
/tasks/{taskId}/retry:
parameters:
- $ref: '#/components/parameters/TaskId'
post:

Check warning on line 405 in api/openapi.yaml (GitHub Actions / OpenAPI lint): operation-description: Operation "description" must be present and non-empty string.
tags: [tasks]
summary: Retry failed subtasks
operationId: retrySubtasks
@@ -417,7 +448,7 @@
/tasks/{taskId}/upgrade:
parameters:
- $ref: '#/components/parameters/TaskId'
post:

Check warning on line 451 in api/openapi.yaml (GitHub Actions / OpenAPI lint): operation-description: Operation "description" must be present and non-empty string.
tags: [tasks]
summary: Upgrade to new revision (incremental)
operationId: upgradeTask
@@ -441,7 +472,7 @@
/tasks/{taskId}/subtasks:
parameters:
- $ref: '#/components/parameters/TaskId'
get:

Check warning on line 475 in api/openapi.yaml (GitHub Actions / OpenAPI lint): operation-description: Operation "description" must be present and non-empty string.
tags: [subtasks]
summary: List subtasks of a task
operationId: listSubtasks
@@ -464,7 +495,7 @@
/tasks/{taskId}/source-allocation:
parameters:
- $ref: '#/components/parameters/TaskId'
get:

Check warning on line 498 in api/openapi.yaml (GitHub Actions / OpenAPI lint): operation-description: Operation "description" must be present and non-empty string.
tags: [tasks]
summary: View source allocation (multi-source visualization)
operationId: getSourceAllocation
@@ -478,7 +509,7 @@
/tasks/{taskId}/events:
parameters:
- $ref: '#/components/parameters/TaskId'
get:

Check warning on line 512 in api/openapi.yaml (GitHub Actions / OpenAPI lint): operation-description: Operation "description" must be present and non-empty string.
tags: [tasks]
summary: Task event log
operationId: getTaskEvents
@@ -1651,6 +1682,7 @@
type: string
enum:
- pending
- inherit
- assigned
- downloading
- verifying_local
@@ -1724,6 +1756,7 @@
expected_sha256: {type: string, pattern: '^[0-9a-f]{64}$', nullable: true}
actual_sha256: {type: string, pattern: '^[0-9a-f]{64}$', nullable: true}
status: {$ref: '#/components/schemas/SubtaskStatus'}
inherit_from_key: {type: string, nullable: true, maxLength: 1024}
executor_id: {type: string, nullable: true}
executor_epoch: {type: integer, format: int64, nullable: true}
assignment_token: {type: string, format: uuid, nullable: true}
89 changes: 89 additions & 0 deletions docs/operator/incremental-download.md
@@ -0,0 +1,89 @@
# Incremental Download + Global Dedup — Operator Guide (SP3)

> **Cross-references**: `docs/v2.0/06-platform-and-ecosystem.md` §2 (incremental /
> diff design) and §3.1–§3.3 (global dedup, refcount, delete dependency);
> `docs/v2.0/INVARIANTS.md`, INVARIANT 14 (one physical copy per tenant + backend + content).

---

## 1. `upgrade_from_revision` — sha-diff against what already exists

When a task is created with `upgrade_from_revision` set (the prior git sha),
the scheduling phase runs `diff_and_dedup` **before** source planning. For each
still-`pending` subtask whose HuggingFace `expected_sha256` already has a
`storage_objects` row for `(tenant_id, storage_id, sha256)`, the subtask is
flipped to status `inherit` and a `subtask_object_refs` row + a refcount
increment are recorded immediately. Only files whose content actually changed
(new sha) stay `pending` and flow through the normal SP2 multi-source planner.

This is **unified with cross-task dedup**: the lookup is purely by content
sha, so a file identical to one already stored by *any* prior task or revision
(not just the named `upgrade_from_revision`) is inherited too. There is no
separate "dedup mode" — one code path covers both.
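
The pass described above can be sketched in memory as follows. This is a minimal illustration, not the service code: the `Subtask` and `StorageObject` classes, the `refs` set standing in for `subtask_object_refs`, and the choice to set `inherit_from_key` at diff time are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Subtask:
    # Hypothetical stand-in for a subtask row.
    id: int
    expected_sha256: str
    status: str = "pending"
    inherit_from_key: Optional[str] = None


@dataclass
class StorageObject:
    # Hypothetical stand-in for a storage_objects row.
    key: str
    refcount: int = 0


def diff_and_dedup(subtasks, objects, refs, tenant_id, storage_id):
    """Flip pending subtasks whose content already exists to `inherit`.

    objects: dict mapping (tenant_id, storage_id, sha256) -> StorageObject
    refs:    set of (subtask_id, identity) pairs, standing in for
             subtask_object_refs
    Returns the subtasks that stay pending (their content is new).
    """
    still_pending = []
    for st in subtasks:
        if st.status != "pending":
            continue
        ident = (tenant_id, storage_id, st.expected_sha256)
        obj = objects.get(ident)
        if obj is None:
            # New sha: leave it for the normal source planner.
            still_pending.append(st)
            continue
        st.status = "inherit"
        st.inherit_from_key = obj.key  # assumption: recorded at diff time
        if (st.id, ident) not in refs:
            refs.add((st.id, ident))
            obj.refcount += 1
    return still_pending
```

Because the lookup key contains only the tenant, the backend, and the content sha, the same function covers both the upgrade case and cross-task dedup.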

## 2. `storage_objects` refcount model (INVARIANT 14)

`storage_objects` has `UNIQUE(tenant_id, storage_id, sha256)` — there is at
most **one** physical copy of a given content blob per tenant per storage
backend. Every subtask that resolves to that content holds a
`subtask_object_refs` row; `refcount` is the number of live references.

- `record_object` (on download success) upserts the row and adds a ref —
but it is a **no-op when a ref for that subtask already exists** (the
inherit path already added one in `diff_and_dedup`; this prevents a
double-count).
- `deref_subtask` removes a subtask's ref and decrements `refcount`.
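
A small SQLite sketch of this refcount model is shown below. The column set is reduced to what the section mentions, and the one-ref-per-subtask primary key on `subtask_object_refs` is an assumption made to keep the example short; the real schema lives in the SP3 migration.

```python
import sqlite3


def open_db():
    # In-memory database with the uniqueness rule from INVARIANT 14.
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE storage_objects (
            id INTEGER PRIMARY KEY,
            tenant_id TEXT NOT NULL,
            storage_id TEXT NOT NULL,
            sha256 TEXT NOT NULL,
            refcount INTEGER NOT NULL DEFAULT 0,
            UNIQUE (tenant_id, storage_id, sha256)
        );
        CREATE TABLE subtask_object_refs (
            subtask_id INTEGER PRIMARY KEY,  -- assumption: one ref per subtask
            object_id INTEGER NOT NULL REFERENCES storage_objects(id)
        );
    """)
    return db


def record_object(db, subtask_id, tenant_id, storage_id, sha256):
    # No-op when this subtask already holds a ref (e.g. from the inherit path).
    held = db.execute(
        "SELECT 1 FROM subtask_object_refs WHERE subtask_id = ?", (subtask_id,)
    ).fetchone()
    if held:
        return
    # Upsert: the UNIQUE constraint keeps a single row per content blob.
    db.execute(
        "INSERT OR IGNORE INTO storage_objects (tenant_id, storage_id, sha256) "
        "VALUES (?, ?, ?)",
        (tenant_id, storage_id, sha256),
    )
    (object_id,) = db.execute(
        "SELECT id FROM storage_objects "
        "WHERE tenant_id = ? AND storage_id = ? AND sha256 = ?",
        (tenant_id, storage_id, sha256),
    ).fetchone()
    db.execute(
        "INSERT INTO subtask_object_refs (subtask_id, object_id) VALUES (?, ?)",
        (subtask_id, object_id),
    )
    db.execute(
        "UPDATE storage_objects SET refcount = refcount + 1 WHERE id = ?",
        (object_id,),
    )


def deref_subtask(db, subtask_id):
    # Remove the subtask's ref (if any) and decrement the object's refcount.
    row = db.execute(
        "SELECT object_id FROM subtask_object_refs WHERE subtask_id = ?",
        (subtask_id,),
    ).fetchone()
    if row is None:
        return
    db.execute("DELETE FROM subtask_object_refs WHERE subtask_id = ?", (subtask_id,))
    db.execute(
        "UPDATE storage_objects SET refcount = refcount - 1 WHERE id = ?", (row[0],)
    )
```

Calling `record_object` twice for the same subtask leaves `refcount` unchanged, which is the double-count guard described in the first bullet.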

## 3. Inherit materialization (no re-download)

An `inherit` subtask is claimed like any other, but the executor does **not**
fetch source bytes. Instead `materialize_inherit` performs:

- **S3 backend**: a server-side `copy_object` (in-region, ≈ free, no egress).
- **local backend**: `os.link` (hardlink), falling back to a copy on `EXDEV`.

It then reports success with the file's known sha — no HuggingFace or mirror
traffic is generated for inherited files.
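
For the local backend, the hardlink-then-copy behaviour can be sketched like this. The function name `materialize_inherit_local` and its return value are hypothetical; only `os.link` and the `EXDEV` fallback come from the description above.

```python
import errno
import os
import shutil


def materialize_inherit_local(src_path, dst_path):
    """Materialize an inherited file without fetching source bytes.

    Tries a hardlink first; if source and destination are on different
    filesystems (EXDEV), falls back to a byte copy.
    Returns "hardlink" or "copy" so the caller can log which path ran.
    """
    os.makedirs(os.path.dirname(dst_path) or ".", exist_ok=True)
    try:
        os.link(src_path, dst_path)
        return "hardlink"
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise  # any other failure is a real inherit-copy failure
        shutil.copy2(src_path, dst_path)
        return "copy"
```

Errors other than `EXDEV` are re-raised on purpose, so that the caller can treat them as an inherit copy failure rather than silently copying.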

## 4. `DELETE /api/v1/tasks/{id}`

- **Terminal-only**: the task must be `succeeded`, `failed`, or `cancelled`.
A non-terminal task returns **409** `{"code": "TASK_NOT_TERMINAL"}`.
- **Tenant-scoped + RBAC-gated**: a cross-tenant id returns 404 (existence is
not leaked); an RBAC denial returns 403; an unauthenticated request returns 401.
- On success (**204**) every subtask of the task is dereferenced
(`refcount--`) and the task row is deleted (FK cascade removes subtasks and
their object refs).
- It does **not** delete physical bytes — only DB references. Reclamation is
the GC's job (§5).
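
The status-code rules above can be summarised as a small decision function. This is a sketch under assumptions: the `Task` class, the function name, the order of the checks, and the use of the 409 body's `status` field to echo the task's current status are illustrative and not taken from the handler.

```python
from dataclasses import dataclass

TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


@dataclass
class Task:
    # Hypothetical stand-in for a task row.
    id: str
    tenant_id: str
    status: str


def delete_task_response(task, caller_tenant_id, authenticated, rbac_allowed):
    """Return (http_status, body) for DELETE /api/v1/tasks/{id}.

    `task` is None when no task with that id exists.
    """
    if not authenticated:
        return 401, None
    if not rbac_allowed:
        return 403, None
    # Unknown id and cross-tenant id look identical to the caller.
    if task is None or task.tenant_id != caller_tenant_id:
        return 404, None
    if task.status not in TERMINAL_STATUSES:
        # Assumption: `status` carries the task's current status.
        return 409, {"code": "TASK_NOT_TERMINAL", "status": task.status}
    # The caller would now deref every subtask and delete the task row.
    return 204, None
```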

## 5. Leader-gated GC

The active controller runs a background GC loop (standby controllers do not):

- `DLW_GC_INTERVAL_SECONDS` (default 60) — how often a GC tick runs.
- `DLW_GC_GRACE_SECONDS` (default 3600) — a `storage_objects` row is only
reclaimed once it has been at `refcount = 0` for at least this long.

Each reclaiming tick emits an audited `storage.gc` event (system-scope:
`tenant_id=null`, `actor_user_id=null`) with `{"reclaimed": <n>}`.
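
A single GC tick might look roughly like the sketch below. The two environment variables and their defaults are from the list above; the `zero_since` timestamp, the row dictionaries, and the exact event shape are assumptions made for illustration.

```python
import os


def gc_settings(env=None):
    """Read (interval_seconds, grace_seconds) with the documented defaults."""
    env = os.environ if env is None else env
    return (
        int(env.get("DLW_GC_INTERVAL_SECONDS", "60")),
        int(env.get("DLW_GC_GRACE_SECONDS", "3600")),
    )


def gc_tick(rows, now, grace_seconds, is_leader):
    """Reclaim rows that have sat at refcount = 0 for at least the grace window.

    rows: list of dicts with 'refcount' and 'zero_since' (epoch seconds, or
          None while the row is referenced); 'zero_since' is a hypothetical field.
    Mutates `rows` in place. Returns the audit event for a reclaiming tick,
    or None when nothing was reclaimed or this controller is a standby.
    """
    if not is_leader:
        return None

    def reclaimable(row):
        return (
            row["refcount"] == 0
            and row["zero_since"] is not None
            and now - row["zero_since"] >= grace_seconds
        )

    keep = [row for row in rows if not reclaimable(row)]
    reclaimed = len(rows) - len(keep)
    rows[:] = keep
    if reclaimed == 0:
        return None
    return {
        "event": "storage.gc",
        "tenant_id": None,
        "actor_user_id": None,
        "payload": {"reclaimed": reclaimed},
    }
```

The grace window protects a row whose refcount briefly touches zero, for example between a delete and a new task that inherits the same content.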

### Inherit-copy-failure self-heal

If an inherit copy fails on the executor, `complete_subtask` undoes the
diff-time `refcount++` (via `deref_subtask`), clears `inherit_from_key`, and
re-queues the subtask as `pending` so it is downloaded normally on the next
scheduling pass. A failed inherit therefore never leaks refcount and never
strands a file.
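
The rollback can be pictured as a single state transition. In this sketch the subtask is a plain dict and `object_ref` is a hypothetical field naming the object whose refcount was incremented at diff time; the real logic lives in `complete_subtask`.

```python
def on_inherit_copy_failed(subtask, refcounts):
    """Undo the diff-time bookkeeping for a failed inherit and re-queue it.

    subtask:   dict with 'status', 'inherit_from_key', 'object_ref'
    refcounts: dict mapping object identity -> refcount
    """
    ref = subtask.get("object_ref")
    if ref is not None:
        refcounts[ref] -= 1  # undo the diff-time refcount++
        subtask["object_ref"] = None
    subtask["inherit_from_key"] = None
    subtask["status"] = "pending"  # downloaded normally on the next pass
    return subtask
```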

## 6. Scope / deferred to Phase 4

SP3's GC only frees `refcount = 0` **database rows** past the grace window.

> An inherited file's `storage_objects` row tracks the *original* (source)
> key. The executor's server-side copy writes bytes under the new revision's
> key, and no `storage_objects` row tracks those bytes. They are orphaned
> bytes, to be reclaimed by Phase 4's physical GC.

Also deferred to Phase 4: physical S3 / filesystem byte reclamation, and
quota- or LRU-driven eviction of cold content.