Merged
23 commits
ccdb9e8
Phase 3 SP2 design — multi-source (SourceDriver + NameResolver + spee…
l17728 May 18, 2026
22e9b32
Phase 3 SP2 implementation plan — multi-source
l17728 May 18, 2026
8d2dad6
Phase 3 SP2 plan/spec — apply 2-reviewer pre-execution fixes
l17728 May 18, 2026
45c3a54
feat(sp2): multi-source config + pyyaml dep
l17728 May 18, 2026
4ff732b
feat(sp2): SourceDriver Protocol + manifest/file/token dataclasses
l17728 May 18, 2026
1027ddd
feat(sp2): HuggingFace + hf_mirror SourceDrivers
l17728 May 18, 2026
282a5aa
feat(sp2): ModelScope SourceDriver
l17728 May 18, 2026
7efd7f1
feat(sp2): source registry + sources.yaml
l17728 May 18, 2026
45e8af0
feat(sp2): NameResolver 3-tier + resolver-rules.yaml
l17728 May 18, 2026
35b6339
feat(sp2): SubtaskChunk/SourceSpeedSample/SourceBlacklist models + mi…
l17728 May 18, 2026
82d19dc
test(sp2): add SP2's 3 tables to test_alembic EXPECTED_TABLES
l17728 May 18, 2026
64a978a
feat(sp2): source speed EWMA fusion + controller-side probe
l17728 May 18, 2026
af2b0c2
feat(sp2): LPT greedy assignment + optimal-combo selection
l17728 May 18, 2026
7187e26
feat(sp2): source blacklist service
l17728 May 18, 2026
c7a7bcd
feat(sp2): plan_task_sources (strategy filter + INV12 + chunk-split +…
l17728 May 18, 2026
eb0c0ab
feat(sp2): generalized /source-proxy with per-source cred (INV 2)
l17728 May 18, 2026
eb02bc3
feat(sp2): executor stream_source -> /source-proxy (all paths) + conf…
l17728 May 18, 2026
4a13f75
feat(sp2): lifespan registry/resolver bootstrap + scheduling/rebalanc…
l17728 May 18, 2026
c05b6a4
feat(sp2): HF sha256 authority — non-HF mismatch blacklists source 24h
l17728 May 18, 2026
d8f4568
fix(sp2): M4 integration fixes (source-proxy back-compat + test isola…
l17728 May 18, 2026
faf3004
test(sp2): E2E-002 auto_balance prefers fast source + HF-authority pause
l17728 May 18, 2026
8a977a0
docs(sp2): OpenAPI source-proxy op + operator multi-source guide
l17728 May 18, 2026
9b911c6
docs(sp2): record final-review chunk-alignment known limitation (bann…
l17728 May 18, 2026
57 changes: 57 additions & 0 deletions api/openapi.yaml
@@ -166,7 +166,7 @@

  # ========== Tasks ==========
  /tasks:
    get:
[GitHub Actions / OpenAPI lint] Warning on line 169 in api/openapi.yaml: operation-description: Operation "description" must be present and non-empty string.
      tags: [tasks]
      summary: List tasks
      operationId: listTasks
@@ -194,7 +194,7 @@
              schema: {$ref: '#/components/schemas/RbacDenied'}
        '429': {$ref: '#/components/responses/RateLimited'}

    post:
[GitHub Actions / OpenAPI lint] Warning on line 197 in api/openapi.yaml: operation-description: Operation "description" must be present and non-empty string.
      tags: [tasks]
      summary: Create download task
      operationId: createTask
@@ -296,7 +296,7 @@
    parameters:
      - $ref: '#/components/parameters/TaskId'

    get:
[GitHub Actions / OpenAPI lint] Warning on line 299 in api/openapi.yaml: operation-description: Operation "description" must be present and non-empty string.
      tags: [tasks]
      summary: Get task by ID
      operationId: getTask
@@ -315,7 +315,7 @@
        '404':
          description: Task not found or cross-tenant ID (existence not leaked)

    patch:
[GitHub Actions / OpenAPI lint] Warning on line 318 in api/openapi.yaml: operation-description: Operation "description" must be present and non-empty string.
      tags: [tasks]
      summary: Update task (priority only)
      operationId: updateTask
@@ -337,7 +337,7 @@
  /tasks/{taskId}/cancel:
    parameters:
      - $ref: '#/components/parameters/TaskId'
    post:
[GitHub Actions / OpenAPI lint] Warning on line 340 in api/openapi.yaml: operation-description: Operation "description" must be present and non-empty string.
      tags: [tasks]
      summary: Cancel task (async)
      operationId: cancelTask
@@ -371,7 +371,7 @@
  /tasks/{taskId}/retry:
    parameters:
      - $ref: '#/components/parameters/TaskId'
    post:
[GitHub Actions / OpenAPI lint] Warning on line 374 in api/openapi.yaml: operation-description: Operation "description" must be present and non-empty string.
      tags: [tasks]
      summary: Retry failed subtasks
      operationId: retrySubtasks
@@ -417,7 +417,7 @@
  /tasks/{taskId}/upgrade:
    parameters:
      - $ref: '#/components/parameters/TaskId'
    post:
[GitHub Actions / OpenAPI lint] Warning on line 420 in api/openapi.yaml: operation-description: Operation "description" must be present and non-empty string.
      tags: [tasks]
      summary: Upgrade to new revision (incremental)
      operationId: upgradeTask
@@ -441,7 +441,7 @@
  /tasks/{taskId}/subtasks:
    parameters:
      - $ref: '#/components/parameters/TaskId'
    get:
[GitHub Actions / OpenAPI lint] Warning on line 444 in api/openapi.yaml: operation-description: Operation "description" must be present and non-empty string.
      tags: [subtasks]
      summary: List subtasks of a task
      operationId: listSubtasks
@@ -464,7 +464,7 @@
  /tasks/{taskId}/source-allocation:
    parameters:
      - $ref: '#/components/parameters/TaskId'
    get:
[GitHub Actions / OpenAPI lint] Warning on line 467 in api/openapi.yaml: operation-description: Operation "description" must be present and non-empty string.
      tags: [tasks]
      summary: View source allocation (multi-source visualization)
      operationId: getSourceAllocation
@@ -478,7 +478,7 @@
  /tasks/{taskId}/events:
    parameters:
      - $ref: '#/components/parameters/TaskId'
    get:
[GitHub Actions / OpenAPI lint] Warning on line 481 in api/openapi.yaml: operation-description: Operation "description" must be present and non-empty string.
      tags: [tasks]
      summary: Task event log
      operationId: getTaskEvents
@@ -960,6 +960,63 @@
        '503':
          description: Forwarded from HF — upstream unavailable.

  /source-proxy/subtask/{subtaskId}:
    get:
      tags: [executors]
      summary: Multi-source reverse proxy — stream a subtask's file from its assigned source
      description: >-
        Controller-side reverse proxy. Authenticates the executor (mTLS + JWT),
        verifies subtask ownership (assignment_token + epoch fence), resolves the
        planner-assigned source for this subtask (defaults to huggingface when
        unassigned), injects the appropriate credential, and streams the bytes
        back. The executor never holds source credentials directly (INVARIANT 2).
      operationId: sourceProxySubtask
      parameters:
        - name: subtaskId
          in: path
          required: true
          schema:
            type: string
            format: uuid
        - name: X-Assignment-Token
          in: header
          required: true
          schema:
            type: string
            format: uuid
        - name: Range
          in: header
          required: false
          schema:
            type: string
      responses:
        '200':
          description: Full file stream.
          content:
            application/octet-stream:
              schema:
                type: string
                format: binary
        '206':
          description: Partial content (Range request).
          content:
            application/octet-stream:
              schema:
                type: string
                format: binary
        '401':
          description: Missing or invalid mTLS / executor JWT.
        '403':
          description: NOT_YOUR_SUBTASK — authenticated executor does not own this subtask.
        '404':
          description: Subtask not found.
        '409':
          description: STALE_ASSIGNMENT or EPOCH_MISMATCH (fence-token mismatch).
        '502':
          description: SOURCE_UNAVAILABLE — source returned an unrecoverable error.
        '503':
          description: Source unreachable or all sources exhausted.

  /executors/{executorId}/subtasks/{subtaskId}/complete:
    parameters:
      - in: path
12 changes: 12 additions & 0 deletions config/resolver-rules.yaml
@@ -0,0 +1,12 @@
identity_organizations:
  - deepseek-ai
  - Qwen
  - 01-ai
  - THUDM
  - baichuan-inc
  - mistralai
aliases:
  - hf_org: meta-llama
    modelscope_org: LLM-Research
    transform: "Meta-{name}"
per_model_overrides: []
22 changes: 22 additions & 0 deletions config/sources.yaml
@@ -0,0 +1,22 @@
sources:
  - id: huggingface
    enabled: true
    driver: huggingface
    config: {base_url: "https://huggingface.co", timeout_seconds: 30}
    cost_per_gb_egress: 0.09
  - id: hf_mirror
    enabled: true
    driver: hf_mirror
    config: {base_url: "https://hf-mirror.com", timeout_seconds: 30}
    cost_per_gb_egress: 0.0
  - id: modelscope
    enabled: true
    driver: modelscope
    config: {base_url: "https://www.modelscope.cn", timeout_seconds: 30}
    cost_per_gb_egress: 0.0
balancing:
  speed_ewma_alpha: 0.3
  chunk_level_min_file_mb: 100
regional_defaults:
  cn-north: ["hf_mirror", "modelscope", "huggingface"]
  us-east: ["huggingface"]
196 changes: 196 additions & 0 deletions docs/operator/multi-source.md
@@ -0,0 +1,196 @@
# Multi-Source Download — Operator Guide (SP2)

> **Cross-references**: `docs/v2.0/06-platform-and-ecosystem.md` §1 (design rationale,
> scheduling algorithm, name-resolver detail); `docs/v2.0/INVARIANTS.md` 11/12/13
> (SHA256 authority, cross-source verification, HF-down policy).

---

## 1. `config/sources.yaml`

Controls which download sources the controller loads at startup.

```yaml
sources:
  - id: huggingface            # unique identifier used in source_blacklist, logs, etc.
    enabled: true
    driver: huggingface        # must be a supported driver (see §1.1)
    config:
      base_url: "https://huggingface.co"
      timeout_seconds: 30
    cost_per_gb_egress: 0.09   # USD; used by cost-aware scheduling (future)

  - id: hf_mirror
    enabled: true
    driver: hf_mirror
    config:
      base_url: "https://hf-mirror.com"
      timeout_seconds: 30
    cost_per_gb_egress: 0.0

  - id: modelscope
    enabled: true
    driver: modelscope
    config:
      base_url: "https://www.modelscope.cn"
      timeout_seconds: 30
    cost_per_gb_egress: 0.0

balancing:
  speed_ewma_alpha: 0.3          # EWMA smoothing factor for speed samples
  chunk_level_min_file_mb: 100   # files smaller than this are not chunk-routed

regional_defaults:
  cn-north: ["hf_mirror", "modelscope", "huggingface"]
  us-east: ["huggingface"]
```

### 1.1 Supported drivers (v2.0)

| Driver ID | Endpoint | Notes |
|---------------|-----------------------|------------------------------|
| `huggingface` | huggingface.co | Authoritative SHA256 source |
| `hf_mirror` | hf-mirror.com | HF-compatible reverse proxy |
| `modelscope` | modelscope.cn | Requires name-resolver rules |

Entries with an unrecognised driver are logged as a warning and skipped at startup.
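
The skip-on-unknown-driver rule can be illustrated with a minimal Python sketch. It operates on an already-parsed `sources.yaml` dictionary; the function name `select_sources`, the logger name, and the treatment of `enabled: false` entries as "not loaded" are assumptions made for illustration, not the controller's actual code.

```python
import logging

logger = logging.getLogger("dlw.sources")  # hypothetical logger name

# Driver IDs taken from the table in §1.1.
SUPPORTED_DRIVERS = {"huggingface", "hf_mirror", "modelscope"}


def select_sources(config: dict) -> list[dict]:
    """Return enabled entries with a supported driver, preserving file order."""
    selected = []
    for entry in config.get("sources", []):
        if not entry.get("enabled", False):
            continue  # assumption: disabled entries are simply not loaded
        if entry.get("driver") not in SUPPORTED_DRIVERS:
            # Documented behaviour: log a warning and skip, do not abort startup.
            logger.warning("skipping source %r: unrecognised driver %r",
                           entry.get("id"), entry.get("driver"))
            continue
        selected.append(entry)
    return selected


config = {
    "sources": [
        {"id": "huggingface", "enabled": True, "driver": "huggingface"},
        {"id": "wisemodel", "enabled": True, "driver": "wisemodel"},  # deferred to v2.1
        {"id": "modelscope", "enabled": False, "driver": "modelscope"},
    ]
}
loaded_ids = [s["id"] for s in select_sources(config)]
```

With this input only `huggingface` survives: `wisemodel` is skipped with a warning and `modelscope` is disabled.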

Drivers deferred to v2.1: `wisemodel`, `opencsg`, `s3_mirror`, `plugin`
(see §7 Deferred to v2.1).

---

## 2. `config/resolver-rules.yaml`

Maps HuggingFace repo IDs to their equivalents on other sources.

```yaml
identity_organizations:
  # Organizations whose repo IDs are identical across all sources.
  - deepseek-ai
  - Qwen
  - 01-ai
  - THUDM
  - baichuan-inc
  - mistralai

aliases:
  # Transform rules for organizations with different naming on other sources.
  - hf_org: meta-llama
    modelscope_org: LLM-Research
    transform: "Meta-{name}"   # e.g. Llama-3.1-8B → Meta-Llama-3.1-8B

per_model_overrides:
  # Exact per-model overrides; these take precedence over aliases.
  # - hf: "specific-org/specific-model"
  #   modelscope: "different-org/different-name"
```

The resolver applies three-tier lookup: (1) identity match, (2) alias rule,
(3) source search API fallback (cached 24 h). If no mapping is found the
source is skipped for that task.
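
As a rough illustration of the lookup order, the sketch below resolves a HuggingFace repo ID to a ModelScope repo ID from a parsed rules dictionary. The function name `resolve_modelscope` and the `search` callback are hypothetical. Checking `per_model_overrides` before the identity tier is an assumption (the guide only states that overrides beat aliases), and the 24 h cache for the search fallback is omitted.

```python
from typing import Callable, Optional

RULES = {
    "identity_organizations": ["deepseek-ai", "Qwen", "01-ai", "THUDM",
                               "baichuan-inc", "mistralai"],
    "aliases": [
        {"hf_org": "meta-llama", "modelscope_org": "LLM-Research",
         "transform": "Meta-{name}"},
    ],
    "per_model_overrides": [],
}


def resolve_modelscope(
    hf_repo: str,
    rules: dict,
    search: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Map 'org/name' on HuggingFace to a ModelScope repo ID, or None."""
    org, name = hf_repo.split("/", 1)

    # Exact overrides first (assumed ordering, see lead-in).
    for override in rules.get("per_model_overrides") or []:
        if override.get("hf") == hf_repo:
            return override.get("modelscope")

    # Tier 1: identity match, the repo ID is the same on every source.
    if org in rules.get("identity_organizations", []):
        return hf_repo

    # Tier 2: alias rule, swap the org and apply the name transform.
    for alias in rules.get("aliases", []):
        if alias["hf_org"] == org:
            new_name = alias["transform"].format(name=name)
            return alias["modelscope_org"] + "/" + new_name

    # Tier 3: source search API fallback (hypothetical callback).
    return search(hf_repo) if search is not None else None
```

A `None` result corresponds to "the source is skipped for that task".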

---

## 3. SP2 Environment Settings (`DLW_*`)

All settings live under `SourceSettings` in `src/dlw/config.py` and are read
from environment variables with the `DLW_` prefix (or from the Helm configmap).

| Setting | Default | Description |
|-----------------------------------|---------|--------------------------------------------------------------------------|
| `DLW_PROBE_SIZE_MB` | 32 | Megabytes downloaded per source during the scheduling-phase speed probe |
| `DLW_PROBE_TIMEOUT_S` | 8.0 | Soft deadline (seconds) for each probe; partial bytes still recorded |
| `DLW_PROBE_HISTORY_WEIGHT` | 0.3 | EWMA history weight; live probe weight = `1 - probe_history_weight` |
| `DLW_CHUNK_LEVEL_MIN_FILE_MB` | 100 | Files smaller than this are not split across sources |
| `DLW_SHA_MISMATCH_BLACKLIST_HOURS`| 24 | Duration to blacklist a `(source, repo, filename)` after SHA256 mismatch |
| `DLW_REBALANCE_INTERVAL_SECONDS` | 60.0 | How often the background rebalancer re-evaluates in-flight task routing |

Tuning guidance: increase `PROBE_SIZE_MB` to 64 for large-model repos where
speed variance is high; reduce `PROBE_TIMEOUT_S` below 8 only on low-latency
networks where probes consistently finish within 3-4 s.
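
The relationship between `DLW_PROBE_HISTORY_WEIGHT` and the live probe can be sketched as a simple weighted blend. This is an illustration under stated assumptions, not the project's implementation: the function names, the MB/s unit, and the "no history means use the live value" rule are all hypothetical.

```python
from typing import Optional


def probe_speed_mbps(bytes_received: int, elapsed_s: float) -> float:
    """Throughput in MB/s; also valid for a probe cut off at the soft deadline."""
    return (bytes_received / (1024 * 1024)) / elapsed_s


def fuse_speed(
    history_mbps: Optional[float],
    live_mbps: float,
    history_weight: float = 0.3,  # DLW_PROBE_HISTORY_WEIGHT default
) -> float:
    """Blend stored history with the live probe: w * history + (1 - w) * live."""
    if history_mbps is None:
        return live_mbps  # assumption: with no history, trust the live probe
    return history_weight * history_mbps + (1.0 - history_weight) * live_mbps


# A probe that received 16 MB of the 32 MB target before the 8 s deadline.
partial = probe_speed_mbps(16 * 1024 * 1024, 8.0)
fused = fuse_speed(10.0, 20.0)
```

With the default weight, a history of 10 MB/s and a live probe of 20 MB/s blend to roughly 17 MB/s, so the live measurement dominates.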

---

## 4. `source_strategy` Task Field

Set in the `source_strategy` field of a `POST /api/v1/tasks` request.

| Value | Behaviour |
|------------------|--------------------------------------------------------------------------|
| `auto_balance` | Default. Probe all enabled sources, allocate files/chunks by speed. |
| `fastest_only` | Probe all sources, use only the single fastest. |
| `pin_huggingface`| Skip probe; download everything from HuggingFace only. |
| `pin_modelscope` | Skip probe; use ModelScope only (resolver rules applied automatically). |
| `list:a,b` | Use only the listed source IDs (comma-separated); probe between them. |

Sources listed in `source_blacklist` (array of source IDs) are always excluded
regardless of strategy.
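
The table and the blacklist rule can be read as a filter over the enabled source IDs. The following sketch shows one way to express that; `candidate_sources` and `needs_probe` are hypothetical names, and rejecting unknown strategy strings with `ValueError` is an assumption.

```python
PINNED = {"pin_huggingface": "huggingface", "pin_modelscope": "modelscope"}


def candidate_sources(strategy: str, enabled: list[str],
                      blacklist: list[str]) -> list[str]:
    """Source IDs a task may use, after applying strategy and blacklist."""
    if strategy in ("auto_balance", "fastest_only"):
        wanted = list(enabled)
    elif strategy in PINNED:
        wanted = [PINNED[strategy]]
    elif strategy.startswith("list:"):
        wanted = [s.strip() for s in strategy[len("list:"):].split(",") if s.strip()]
    else:
        raise ValueError(f"unknown source_strategy: {strategy!r}")
    blocked = set(blacklist)
    # Blacklisted or non-enabled sources are dropped regardless of strategy.
    return [s for s in wanted if s in enabled and s not in blocked]


def needs_probe(strategy: str) -> bool:
    """Pinned strategies skip the speed probe; all others probe."""
    return strategy not in PINNED


enabled = ["huggingface", "hf_mirror", "modelscope"]
balanced = candidate_sources("auto_balance", enabled, ["hf_mirror"])
listed = candidate_sources("list:hf_mirror,modelscope", enabled, ["modelscope"])
```

For `fastest_only`, the single fastest source would then be picked from the returned candidates once probe results are available.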

---

## 5. SHA256 Authority Rules (INVARIANTS 11/12/13)

### INVARIANT 11 — HF is the authoritative SHA256 source

All files downloaded from any source must be verified against the SHA256 value
that HuggingFace provides in its LFS manifest. No other source's self-reported
SHA256 is accepted as truth.

### INVARIANT 12 — Cross-source verification is mandatory

After completing a download (single-source or chunk-level multi-source), the
controller compares the actual file SHA256 against the HF-supplied value.
A mismatch triggers a `(source_id, repo_id, filename)` blacklist for
`sha_mismatch_blacklist_hours` (default 24 h). Subsequent subtasks for that
combination fall back to HuggingFace.
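
A minimal sketch of the verify-then-blacklist step follows, assuming an in-memory blacklist keyed by `(source_id, repo_id, filename)`. The function names and the in-memory dictionary are hypothetical (the PR stores blacklist entries in a database table), and a real implementation would hash the stream incrementally rather than hold the file in memory.

```python
import hashlib
from datetime import datetime, timedelta, timezone

# (source_id, repo_id, filename) -> expiry time; stand-in for the real table.
_blacklist: dict[tuple[str, str, str], datetime] = {}


def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def verify_and_record(source_id: str, repo_id: str, filename: str,
                      data: bytes, hf_sha256: str, now: datetime,
                      blacklist_hours: int = 24) -> bool:
    """Compare the actual SHA256 with the HF-supplied value; blacklist on mismatch."""
    if sha256_of(data) == hf_sha256.lower():
        return True
    if source_id != "huggingface":
        # Only non-HF sources are blacklisted; HF itself is the authority.
        key = (source_id, repo_id, filename)
        _blacklist[key] = now + timedelta(hours=blacklist_hours)
    return False


def is_blacklisted(source_id: str, repo_id: str, filename: str,
                   now: datetime) -> bool:
    expiry = _blacklist.get((source_id, repo_id, filename))
    return expiry is not None and now < expiry


now = datetime(2026, 5, 18, tzinfo=timezone.utc)
expected = sha256_of(b"weights")
ok = verify_and_record("hf_mirror", "org/model", "model.safetensors",
                       b"weights", expected, now)
bad = verify_and_record("hf_mirror", "org/model", "model.safetensors",
                        b"tampered", expected, now)
```

After the second call the `(hf_mirror, org/model, model.safetensors)` combination stays blacklisted until 24 hours after `now`.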

### INVARIANT 13 — HF unavailable → task paused unless `trust_non_hf_sha256`

When HuggingFace is unreachable and the task was created with the default
`trust_non_hf_sha256: false`:

- The task transitions to `paused_external` with error code `no_sha256_authority`.
- No bytes are downloaded from alternative sources because integrity cannot be
guaranteed.

Set `trust_non_hf_sha256: true` on the task to opt out of this guarantee and
allow downloads to proceed using other sources' self-reported checksums.

**Special case**: if a file has no SHA256 pinned in HF's manifest at all (rare,
typically raw text files), it is always routed exclusively through HuggingFace
regardless of strategy.
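
The INVARIANT 13 rules reduce to a small decision function. The sketch below collapses the task-level pause and the per-file special case into one hypothetical helper, `route_file`; the returned dictionary shape is an assumption made for illustration.

```python
from typing import Optional


def route_file(hf_reachable: bool, hf_sha256: Optional[str],
               trust_non_hf_sha256: bool, candidates: list[str]) -> dict:
    """Decide where a file may be fetched from under INVARIANT 13."""
    if not hf_reachable:
        if not trust_non_hf_sha256:
            # Default: no authority available, so pause instead of downloading.
            return {"state": "paused_external",
                    "error": "no_sha256_authority", "sources": []}
        # Opt-out: proceed with the other sources' self-reported checksums.
        others = [s for s in candidates if s != "huggingface"]
        return {"state": "downloading", "error": None, "sources": others}
    if hf_sha256 is None:
        # Special case: nothing to verify against, route through HF only.
        return {"state": "downloading", "error": None,
                "sources": ["huggingface"]}
    return {"state": "downloading", "error": None, "sources": list(candidates)}


candidates = ["hf_mirror", "modelscope", "huggingface"]
paused = route_file(False, None, False, candidates)
trusted = route_file(False, None, True, candidates)
unpinned = route_file(True, None, False, candidates)
```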

---

## 6. Scheduling and Rebalancing — Leader-Gated

The scheduling phase (task `pending` → `scheduling` → `downloading`) and the
background rebalancer both run exclusively on the **active controller** (the
current Raft/leader-election winner). Standby controllers do not run probes or
mutate source assignments.

Task state `scheduling` is transient: it covers the probe window
(`probe_timeout_s`) plus assignment computation. If the controller loses
leadership during scheduling the task reverts to `pending` and is picked up by
the new leader.
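
The PR's commit list mentions an LPT greedy assignment for the assignment computation. The sketch below is a generic LPT-style (longest processing time first) greedy for sources of different speeds, not the project's code: the function name, the MB and MB/s units, and the tie-breaking by name are assumptions. Files are taken largest first and each goes to the source that would finish it earliest.

```python
def lpt_assign(file_sizes_mb: dict[str, float],
               speeds_mbps: dict[str, float]) -> dict[str, str]:
    """Greedy LPT-style assignment of files to sources of differing speed."""
    finish = {src: 0.0 for src in speeds_mbps}  # projected busy time per source
    assignment: dict[str, str] = {}
    # Largest files first; ties broken by file name for determinism.
    ordered = sorted(file_sizes_mb.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, size in ordered:
        # Pick the source with the earliest projected completion for this file.
        best = min(finish,
                   key=lambda src: (finish[src] + size / speeds_mbps[src], src))
        finish[best] += size / speeds_mbps[best]
        assignment[name] = best
    return assignment


plan = lpt_assign(
    {"a.safetensors": 400.0, "b.safetensors": 300.0, "c.safetensors": 100.0},
    {"hf_mirror": 100.0, "huggingface": 50.0},
)
```

In this example the 400 MB and 100 MB files go to the faster `hf_mirror` (5 s of work) and the 300 MB file goes to `huggingface` (6 s), which keeps the two sources close to balanced.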

The rebalancer (interval: `rebalance_interval_seconds`) re-probes sources for
tasks whose in-flight speed deviates significantly from the initial probe, and
may reassign future chunks. It does not interrupt chunks already in flight.

Note: per-executor probing (where each executor independently probes its local
sources) is deferred to v2.1.

---

## 7. Deferred to v2.1

The following capabilities are scoped out of v2.0 and will ship in v2.1:

- `wisemodel` and `opencsg` source drivers
- `s3_mirror` driver and per-task `s3_direct_source` (schema reserved)
- Plugin-based source driver API
- Per-executor probing (executors report their own speed matrix independently)
- Automatic 5xx / health-check triggered source blacklist transitions
- Source cost accounting UI and budget enforcement