Skip to content

Phase 3 SP3 — Incremental download + global dedup (refcount/GC)#17

Merged
l17728 merged 17 commits into
mainfrom
feat/phase-3-sp3-incremental-download
May 19, 2026
Merged

Phase 3 SP3 — Incremental download + global dedup (refcount/GC)#17
l17728 merged 17 commits into
mainfrom
feat/phase-3-sp3-incremental-download

Conversation

@l17728
Copy link
Copy Markdown
Owner

@l17728 l17728 commented May 19, 2026

Summary

  • upgrade_from_revision + unified diff/dedup: a file whose HF sha already has a storage_objects row for (tenant,storage,sha) → inherit subtask (executor S3 copy_object / os.link, no download); else normal SP2 download. Unified with cross-task dedup (content-sha lookup, not just the named prior revision).
  • storage_objects refcount (INVARIANT 14 = UNIQUE(tenant,storage,sha)) + subtask_object_refs; record_object on success is inherit-idempotent (ref-exists ⇒ no double-count). Inherit-copy failure self-heals (deref + re-queue as pending, banner 7f).
  • DELETE /api/v1/tasks/{id} (terminal-only → 409, tenant-scoped, RBAC) derefs; leader-gated GC loop reclaims refcount=0 rows past grace (audited storage.gc). Phase 3 sub-project 3 of 4 (SP1 Phase 3 SP1 — Multi-tenancy (OIDC + RBAC + tenant scoping + quota) #15, SP2 Phase 3 SP2 — Multi-source (SourceDriver + NameResolver + LPT/chunk routing) #16 merged).
  • Final whole-impl review applied: fixed a HIGH where claim_one_subtask disk pre-flight would silently strand inherit subtasks (server-side copy needs zero local disk) — now bypassed for inherit, with a regression test.

Test plan

  • uv run pytest green: 381 passed incl. tests/e2e/test_incremental.py (≥90% inherit, roadmap §3.5)
  • invariant_lint (+ "inherit" in VALID_SUBTASK_STATUS) / openapi (spectral 0 errors + swagger-cli) / no-direct-status-write green
  • alembic up→down→up clean (down_revision bb1dd2c45a12 → 7636b35e4881)

Spec: docs/superpowers/specs/2026-05-19-phase-3-sp3-incremental-download-design.md
Plan: docs/superpowers/plans/2026-05-19-phase-3-sp3-incremental-download.md

🤖 Generated with Claude Code

l17728 and others added 17 commits May 19, 2026 08:53
Brainstormed spec for Phase 3 sub-project 3. Unified diff+dedup: an
existing storage_objects row for (tenant,storage,sha) => inherit via
executor S3 copy_object / os.link (no download). refcount on complete,
DELETE /tasks (terminal-only) deref, leader-gated GC of refcount=0 rows.
Stacks on SP2 (diff before plan_task_sources). Aggressive deferrals
(quota/LRU evict, physical byte GC, BLAKE3) keep it one spec->plan cycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
12 bite-sized TDD tasks across 5 milestones (M1 schema/models/config/lint,
M2 storage_objects svc, M3 diff/dedup+wiring, M4 executor inherit+DELETE+GC,
M5 e2e+docs+PR). Complete code; self-reviewed. SP1/SP2 lessons baked in
(drop_all->create_all+teardown fixtures, models in __init__.py, inherit in
lint set first, real CI gate list, inherit double-count idempotency test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2 opus reviewers: 1 BLOCKER + 5 IMPORTANT, all fixed before impl:
- BLOCKER: fully-inherited task stranded in paused_external -> T6 moves
  pending-subs early-return ABOVE plan_task_sources' pause gates
- StorageConfig had no backend_type (local hardlink dead) -> T1 adds it,
  T8 threads StorageBackend.backend_type via poll
- runner anchor wrong -> T8 uses _execute_subtask/self._s
- storage.gc audit dropped -> T10 _gc_loop write_audit
- create_task never threaded upgrade_from_revision -> T1 ctor wire+test
- inherit-then-fail refcount leak -> T7 deref + re-queue to pending
spec banner §7 records all 6 as authoritative rulings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Test fixture seeds StorageBackend(id=1) — DownloadTask.storage_id FKs
storage_backends; the plan's fixture omitted it (same correction Task 3
applied to the storage_objects fixture).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… inherit

plan_task_sources now loads pending-only subs at the top and returns
early when none remain, so a fully-inherited task never trips the
pinned/no-sha/no-speed pause gates (banner 7a). Dropped an unused
SourceManifest import the plan's test carried.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…erit

complete_subtask records a storage_object on success (no-op when an
inherit ref already exists) and, on a failed inherit copy, derefs +
re-queues the file as pending (banner 7f). claim_one_subtask now also
claims `inherit` subs for the lightweight copy.

Deviation: test_inherit_complete_does_not_double_count seeds the sub as
`assigned` not `inherit` — claim_one_subtask transitions inherit->assigned
before the executor calls complete_subtask, and complete_subtask rejects
non-assigned rows (scheduler.py:123). The idempotency path (record_ref_only
already added the ref => record_object no-ops) is still fully exercised.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nner dispatch + poll backend_type

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eref)

Two plan-defect fixes required to make the route work:
- @router.delete needs response_model=None: the `-> None` return
  annotation makes FastAPI infer NoneType as a response model, tripping
  its "204 must not have a response body" assertion.
- DownloadTask.subtasks needed cascade "delete" + passive_deletes=True:
  the relationship previously had only "save-update, merge", so
  session.delete(task) with subtasks loaded UPDATE-d children task_id to
  NULL (NOT NULL violation) instead of deferring to the FK's existing
  ON DELETE CASCADE. Plain "delete" (not "delete-orphan") carries no
  orphan-on-lazy-load risk; matches the relationship's documented intent.

Test fixture also gets the now-standard intermediate await s.flush()
after Project/User/StorageBackend before DownloadTask (FK ordering).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eardown

Two plan gaps surfaced only by the full-suite M4 milestone E2E:
- test_alembic EXPECTED_TABLES didn't include the SP3 tables the
  migration correctly creates (storage_objects, subtask_object_refs).
- test_delete_task's app_client fixture seeded fixed-PK rows with no
  teardown drop_all, leaking into test_executors/test_hf_proxy (which
  errored in the full suite, passed isolated — the recurring session-DB
  cross-module collision class).

Full suite now 379 passed, 0 errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixture seeds StorageBackend(id=1) (DownloadTask.storage_id FK) — same
correction Tasks 5/9 applied; the plan's fixture omitted it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Final-review HIGH: claim_one_subtask gated EVERY candidate (incl.
`inherit`) on size + 200MB <= free_bytes. An inherit subtask materializes
via a server-side S3 copy_object / local hardlink and writes zero
executor local bytes, so SP3's headline scenario — a 1-file change in a
huge repo producing N-1 inherit subtasks — would silently never be
claimed on any realistically-sized executor (task hangs in `downloading`,
no error). The 380-test suite missed it because fixtures seed ample disk
+ tiny files. Added a regression test proving an inherit sub bypasses the
gate while a same-size real download stays disk-gated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@l17728 l17728 merged commit b3b0b09 into main May 19, 2026
12 checks passed
@l17728 l17728 deleted the feat/phase-3-sp3-incremental-download branch May 19, 2026 02:46
@l17728 l17728 mentioned this pull request May 19, 2026
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant