[Core][EC] Add async-load path for EC connector by furionw · Pull Request #2 · furionw/vllm

furionw · 2026-05-02T16:39:41Z

Why

Today's EC connector worker does a synchronous CPU→GPU H2D inline with the prefill step on the default stream — every cached image stalls its prefill behind a sequential copy. Mirroring how the KV connector parks remote-KV requests in WAITING_FOR_REMOTE_KVS, this lets the scheduler park the request, dispatch the H2D on a side stream, and re-admit only after the worker reports completion. H2D overlaps the previous step's compute instead of serializing on the default stream.

What Change

New RequestStatus.WAITING_FOR_EMBEDDINGS and a per-mm_hash LOADING/ABANDONED state machine in the scheduler.
has_cache_item widened to bool | (hit, load_async); new request_async_load, get_finished_loads(encoder_cache), evict_orphan connector hooks.
register_external_loaded(h, n, refs) promotes completed hashes with non-empty refs so LRU can't evict before waiters re-admit.
Atomic per-request prepass; chunked-prefill RUNNING-path park; ABANDONED resurrection + same-hash orphan rescue.

Test Plan

23 unit tests pass (tests/v1/ec_connector/unit/test_async_load.py) — state machine, back-compat shim, CUDA event cross-stream visibility invariant.
mypy-local skipped: pre-existing KVConnectorOutput.merge union-attr error in vllm/v1/outputs.py, verified on plain origin/main.
Full e2e Dynamo 2B sweep — pending dev-image rebuild against this branch.

Adds a scheduler-driven async-load fast path to the EC (encoder-cache) connector that mirrors the KV connector's WAITING_FOR_REMOTE_KVS pattern. When a connector returns (hit=True, load_async=True) from has_cache_item, the scheduler parks the request in a new RequestStatus.WAITING_FOR_EMBEDDINGS, dispatches the worker H2D on a side stream, and re-admits the request only after the worker reports the load done via ECConnectorOutput.finished_loading. This overlaps the H2D with the previous step's compute instead of stalling the consumer step's prefill behind an inline copy on the default stream. Design highlights: - has_cache_item return widened to bool | tuple[bool, bool]; legacy bool return continues to route through the existing sync external_load path. - New get_finished_loads(encoder_cache=...) worker hook reports completed H2Ds keyed by mm_hash. The connector moves each completed tensor into encoder_cache before reporting, so the next step's _gather_mm_embeddings sees it. Re-admission gated on event.query() == True; per CUDA semantics this guarantees cross-stream visibility, so no default_stream.wait_event is needed at dispatch time (would have killed the overlap). - Per-mm_hash _EmbeddingLoadState (LOADING / ABANDONED) decouples scheduler bookkeeping from encoder_cache_manager state. Hashes never appear in encoder_cache_manager.cached while still in flight, eliminating the admit-before-ready race. - Encoder-cache slots are reserved up-front via reserve_async_loaded_slots when the request is parked, NOT at register_external_loaded time. Without up-front reservation, concurrent encoder compute on other requests could eat the budget during the park window and trip the capacity assertion when the parked load completes. release_async_loaded_slots returns slots on the ABANDONED branch. - register_external_loaded promotes the entry into cached with non-empty refs so the LRU cannot evict it between completion and the waiter request's re-admission. Slots are NOT decremented again here (the reserve at park time already did that). - Atomic per-request prepass (_prepare_async_load_indices) detects async hits without mutating scheduler state mid-scan. - RUNNING-path park: chunked-prefill mid-flight that hits an async EC on a later mm_feature is queued for end-of-iteration removal from self.running, avoiding mid-iteration mutation. - ABANDONED resurrection + same-hash orphan rescue handled. - New request_async_load hook decouples async-path semantics from the legacy update_state_after_alloc. - New ECConnectorMetadata.evict_orphan slot for scheduler->worker tensor cleanup of abandoned loads. Tests cover: bool/tuple back-compat shim, RequestStatus enum ordering, state-machine transitions, EncoderCacheManager.check_only + reserve_async_loaded_slots / release_async_loaded_slots / register_external_loaded, ECConnectorOutput.finished_loading, ECConnectorBase get_finished_loads + request_async_load defaults, and a CUDA event lifecycle test that validates the cross-stream visibility invariant the design depends on (tests/v1/ec_connector/unit/test_async_load.py, 25 cases). Verified locally on Qwen3-VL-2B-Instruct: 5/5 sweep iters complete cleanly for both dynamo-fd and dynamo-fd-ec configs across 30 prompts each. nsys shows the dedicated side stream is created and its H2D copies overlap kernels on the default stream — the design's load-bearing perf invariant. Note: on this small-model + huge-encoder-cache workload the EC barely fires (GPU-resident encoder cache absorbs all reuse), so the perf delta is workload-dependent and not visible at this scale. Co-authored-by: Claude

PR vllm-project#40145 (deepstack buffer optimization) added a strict check that raises ValueError when `_get_deepstack_input_embeds(num_tokens)` is called with `num_tokens` greater than the buffer's stored span. This regresses the EC-connector async-load path: the scheduler parks requests in WAITING_FOR_EMBEDDINGS, so total_num_scheduled_tokens can land between cudagraph capture sizes (e.g. 34), and the model_runner pads inputs_embeds to the next bucket (40) before forward(). The deepstack buffer is sized to the actual mm-token span (34), so forward(inputs_embeds.size(0)=40) trips the check. Zero-pad the tail of the buffer (growing the underlying tensor when needed) so padded positions contribute no deepstack signal — same intent as the upstream follow-up in vllm-project#40932. Also relax _clear_deepstack_input_embeds to clamp to stored count instead of raising on padded clears. Repros on Qwen3-VL-2B with --enable-prefix-caching, the EC connector enabled, and a chunked-prefill schedule that parks any request mid- flight; observed in the fd-ec smoke for the 397B sweep. Co-authored-by: Claude Signed-off-by: furionw <qiwa@nvidia.com>

furionw force-pushed the qiwa/async-ec-load branch from 1ba781f to 945db52 Compare May 2, 2026 22:59

furionw force-pushed the qiwa/async-ec-load branch from 945db52 to daddf0d Compare May 2, 2026 23:19

furionw changed the base branch from main to releases/v0.20.1 May 2, 2026 23:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Core][EC] Add async-load path for EC connector#2

[Core][EC] Add async-load path for EC connector#2
furionw wants to merge 2 commits into
releases/v0.20.1from
qiwa/async-ec-load

furionw commented May 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

furionw commented May 2, 2026

Why

What Change

Test Plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant