fix: use non-blocking dispatch to prevent pipeline starvation #505
Replace blocking semaphore acquire in the dispatch loop with a non-blocking `try_acquire` that breaks out when the semaphore is full. This causes the outer loop to re-query the frontier, picking up newly-ready downstream tasks instead of draining a stale snapshot.

Fixes #504

Made-with: Cursor
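The non-blocking `try_acquire` described above can be sketched as follows. This is a minimal illustration assuming a plain counting-semaphore shape; the real `TrackingSemaphore` in `async_scheduler.py` carries additional tracking state not shown here.

```python
import asyncio


class TrackingSemaphore:
    """Minimal sketch (not the real class): a counting semaphore that
    offers a non-blocking try_acquire next to the usual async acquire."""

    def __init__(self, permits: int) -> None:
        self._permits = permits
        self._released = asyncio.Event()
        self._released.set()

    def try_acquire(self) -> bool:
        # Non-blocking: return False instead of suspending when no
        # permits remain. Safe without locks on a single event loop,
        # since there is no await between the check and the decrement.
        if self._permits == 0:
            return False
        self._permits -= 1
        return True

    async def acquire(self) -> None:
        # Blocking variant: wait for a release, then retry.
        while not self.try_acquire():
            self._released.clear()
            await self._released.wait()

    def release(self) -> None:
        self._permits += 1
        self._released.set()
```

A caller can then poll for a permit and fall back to other work when the semaphore is contended, which is exactly the behavior the dispatch loop needs.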
Greptile Summary
| Filename | Overview |
|---|---|
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py | Adds TrackingSemaphore.try_acquire() and replaces blocking acquire() calls in _main_dispatch_loop and _drain_frontier with non-blocking try_acquire() + break; the semaphore_full flag in the main loop correctly gates re-waiting on worker completion. |
| packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py | Adds unit test for TrackingSemaphore.try_acquire and integration test for downstream-interleaving; the new buffer manager is passed a list[str] instead of an ArtifactStorage instance, which is a type violation (though harmless here). |
Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[_main_dispatch_loop: outer while True] --> B[clear wake_event\nget_ready_tasks]
B --> C{for task in ready}
C --> D{try_acquire?}
D -- False --> E[semaphore_full = True\nbreak inner loop]
D -- True --> F[dispatch task\n_spawn_worker]
F --> C
E --> G[checkpoint completed RGs]
G --> H{all_done?}
H -- Yes --> I[exit loop]
H -- No --> J{not ready\nOR semaphore_full?}
J -- Yes --> K[await wake_event.wait]
K --> A
J -- No --> A
subgraph Worker task
W1[_execute_task_inner_impl] --> W2{LLM task?}
W2 -- Yes --> W3[acquire llm_wait_sem\nrelease submission_sem]
W2 -- No --> W4[run generator]
W3 --> W4
W4 --> W5[finally: in_flight.discard\nrelease submission_sem\nwake_event.set]
end
```
This is a comment left during a code review.
Path: packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py
Line: 1526
Comment:
**Wrong type passed to `RowGroupBufferManager`**
`graph.columns` is a `list[str]`, but `RowGroupBufferManager.__init__` expects an `ArtifactStorage`. This works at runtime here because `on_finalize_row_group` is `None` so `checkpoint_row_group` (the only code path that calls `self._artifact_storage`) is never reached — but it violates the type contract and would confuse a reader or a type-checker. A `MagicMock()` (consistent with the pattern used in other tests such as `test_scheduler_with_buffer_manager`) would be the correct substitute.
```suggestion
buffer_manager = RowGroupBufferManager(MagicMock())
```
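As the comment suggests, a `MagicMock` satisfies the constructor's type contract without touching real storage. A minimal sketch of the pattern — the `RowGroupBufferManager` shape and the `write_row_group` method here are assumed for illustration, not the real classes:

```python
from unittest.mock import MagicMock


class RowGroupBufferManager:
    """Toy stand-in (assumed shape): the constructor expects an
    ArtifactStorage, which is only used on the checkpoint path."""

    def __init__(self, artifact_storage):
        self._artifact_storage = artifact_storage

    def checkpoint_row_group(self, row_group):
        # Only this path touches storage, so tests that never
        # checkpoint can safely pass a mock.
        self._artifact_storage.write_row_group(row_group)


# MagicMock answers any attribute access, so it stands in for the
# ArtifactStorage contract without a real implementation.
storage = MagicMock()
buffer_manager = RowGroupBufferManager(storage)
buffer_manager.checkpoint_row_group("rg-0")
```

Unlike passing `graph.columns`, the mock also keeps working (and records the call) if a future test change does exercise the checkpoint path.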
Reviews (3). Last reviewed commit: "Merge branch 'main' into nmulepati/fix/5..."
Docs preview: https://96a60b73.dd-docs-preview.pages.dev
📋 Summary
The async scheduler's dispatch loop processes a stale snapshot of ready tasks, preventing pipeline parallelism in multi-model workloads. When the submission semaphore is full, the loop blocks on `acquire()` while newly-ready downstream tasks accumulate in the frontier unseen. This PR replaces the blocking acquire with a non-blocking `try_acquire` that breaks out of the loop when the semaphore is contended, causing the outer loop to re-query the frontier and pick up downstream tasks.

🔗 Related Issue
Fixes #504
🔄 Changes
- `TrackingSemaphore.try_acquire()` — a non-blocking acquire that returns `False` when no permits are available instead of blocking (async_scheduler.py#L58-L63)
- Replaced `await self._submission_semaphore.acquire()` with `try_acquire()` + `break` in both `_main_dispatch_loop` and `_drain_frontier`, so the dispatch loop breaks out when the semaphore is full and re-queries the frontier on the next iteration (async_scheduler.py#L306-L315, async_scheduler.py#L412-L419)
- Changed the wait condition in `_main_dispatch_loop` from `if not ready` to `if not ready or semaphore_full` so the loop waits for a worker completion before re-querying when it broke out early (async_scheduler.py#L334)

🧪 Testing
- `make test` passes (317/317 dataset builder tests)
- Unit test for `TrackingSemaphore.try_acquire`
- Integration test (`test_scheduler_downstream_interleaves_with_upstream`) — builds the exact pipeline topology from the issue (topic → 3 gen cols → 3 judge cols), uses a small semaphore (4) with 10 records, and asserts that judge tasks begin dispatching before all gen tasks finish

✅ Checklist
Made with Cursor