[Core][EC] Add async-load path for EC connector#2
Draft
furionw wants to merge 2 commits into
Draft
Conversation
1ba781f to
945db52
Compare
Adds a scheduler-driven async-load fast path to the EC (encoder-cache) connector that mirrors the KV connector's WAITING_FOR_REMOTE_KVS pattern. When a connector returns (hit=True, load_async=True) from has_cache_item, the scheduler parks the request in a new RequestStatus.WAITING_FOR_EMBEDDINGS, dispatches the worker H2D on a side stream, and re-admits the request only after the worker reports the load done via ECConnectorOutput.finished_loading. This overlaps the H2D with the previous step's compute instead of stalling the consumer step's prefill behind an inline copy on the default stream. Design highlights: - has_cache_item return widened to bool | tuple[bool, bool]; legacy bool return continues to route through the existing sync external_load path. - New get_finished_loads(encoder_cache=...) worker hook reports completed H2Ds keyed by mm_hash. The connector moves each completed tensor into encoder_cache before reporting, so the next step's _gather_mm_embeddings sees it. Re-admission gated on event.query() == True; per CUDA semantics this guarantees cross-stream visibility, so no default_stream.wait_event is needed at dispatch time (would have killed the overlap). - Per-mm_hash _EmbeddingLoadState (LOADING / ABANDONED) decouples scheduler bookkeeping from encoder_cache_manager state. Hashes never appear in encoder_cache_manager.cached while still in flight, eliminating the admit-before-ready race. - Encoder-cache slots are reserved up-front via reserve_async_loaded_slots when the request is parked, NOT at register_external_loaded time. Without up-front reservation, concurrent encoder compute on other requests could eat the budget during the park window and trip the capacity assertion when the parked load completes. release_async_loaded_slots returns slots on the ABANDONED branch. - register_external_loaded promotes the entry into cached with non-empty refs so the LRU cannot evict it between completion and the waiter request's re-admission. Slots are NOT decremented again here (the reserve at park time already did that). - Atomic per-request prepass (_prepare_async_load_indices) detects async hits without mutating scheduler state mid-scan. - RUNNING-path park: chunked-prefill mid-flight that hits an async EC on a later mm_feature is queued for end-of-iteration removal from self.running, avoiding mid-iteration mutation. - ABANDONED resurrection + same-hash orphan rescue handled. - New request_async_load hook decouples async-path semantics from the legacy update_state_after_alloc. - New ECConnectorMetadata.evict_orphan slot for scheduler->worker tensor cleanup of abandoned loads. Tests cover: bool/tuple back-compat shim, RequestStatus enum ordering, state-machine transitions, EncoderCacheManager.check_only + reserve_async_loaded_slots / release_async_loaded_slots / register_external_loaded, ECConnectorOutput.finished_loading, ECConnectorBase get_finished_loads + request_async_load defaults, and a CUDA event lifecycle test that validates the cross-stream visibility invariant the design depends on (tests/v1/ec_connector/unit/test_async_load.py, 25 cases). Verified locally on Qwen3-VL-2B-Instruct: 5/5 sweep iters complete cleanly for both dynamo-fd and dynamo-fd-ec configs across 30 prompts each. nsys shows the dedicated side stream is created and its H2D copies overlap kernels on the default stream — the design's load-bearing perf invariant. Note: on this small-model + huge-encoder-cache workload the EC barely fires (GPU-resident encoder cache absorbs all reuse), so the perf delta is workload-dependent and not visible at this scale. Co-authored-by: Claude
945db52 to
daddf0d
Compare
PR vllm-project#40145 (deepstack buffer optimization) added a strict check that raises ValueError when `_get_deepstack_input_embeds(num_tokens)` is called with `num_tokens` greater than the buffer's stored span. This regresses the EC-connector async-load path: the scheduler parks requests in WAITING_FOR_EMBEDDINGS, so total_num_scheduled_tokens can land between cudagraph capture sizes (e.g. 34), and the model_runner pads inputs_embeds to the next bucket (40) before forward(). The deepstack buffer is sized to the actual mm-token span (34), so forward(inputs_embeds.size(0)=40) trips the check. Zero-pad the tail of the buffer (growing the underlying tensor when needed) so padded positions contribute no deepstack signal — same intent as the upstream follow-up in vllm-project#40932. Also relax _clear_deepstack_input_embeds to clamp to stored count instead of raising on padded clears. Repros on Qwen3-VL-2B with --enable-prefix-caching, the EC connector enabled, and a chunked-prefill schedule that parks any request mid- flight; observed in the fd-ec smoke for the 397B sweep. Co-authored-by: Claude Signed-off-by: furionw <qiwa@nvidia.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
Today's EC connector worker does a synchronous CPU→GPU H2D inline with the prefill step on the default stream — every cached image stalls its prefill behind a sequential copy. Mirroring how the KV connector parks remote-KV requests in
WAITING_FOR_REMOTE_KVS, this lets the scheduler park the request, dispatch the H2D on a side stream, and re-admit only after the worker reports completion. H2D overlaps the previous step's compute instead of serializing on the default stream.What Change
RequestStatus.WAITING_FOR_EMBEDDINGSand a per-mm_hashLOADING/ABANDONEDstate machine in the scheduler.has_cache_itemwidened tobool | (hit, load_async); newrequest_async_load,get_finished_loads(encoder_cache),evict_orphanconnector hooks.register_external_loaded(h, n, refs)promotes completed hashes with non-empty refs so LRU can't evict before waiters re-admit.Test Plan
tests/v1/ec_connector/unit/test_async_load.py) — state machine, back-compat shim, CUDA event cross-stream visibility invariant.KVConnectorOutput.mergeunion-attr error invllm/v1/outputs.py, verified on plainorigin/main.