Skip to content

[Core][EC] Add async-load path for EC connector#2

Draft
furionw wants to merge 2 commits into
releases/v0.20.1from
qiwa/async-ec-load
Draft

[Core][EC] Add async-load path for EC connector#2
furionw wants to merge 2 commits into
releases/v0.20.1from
qiwa/async-ec-load

Conversation

@furionw
Copy link
Copy Markdown
Owner

@furionw furionw commented May 2, 2026

Why

Today's EC connector worker does a synchronous CPU→GPU H2D inline with the prefill step on the default stream — every cached image stalls its prefill behind a sequential copy. Mirroring how the KV connector parks remote-KV requests in WAITING_FOR_REMOTE_KVS, this lets the scheduler park the request, dispatch the H2D on a side stream, and re-admit only after the worker reports completion. H2D overlaps the previous step's compute instead of serializing on the default stream.

What Change

  • New RequestStatus.WAITING_FOR_EMBEDDINGS and a per-mm_hash LOADING/ABANDONED state machine in the scheduler.
  • has_cache_item widened to bool | (hit, load_async); new request_async_load, get_finished_loads(encoder_cache), evict_orphan connector hooks.
  • register_external_loaded(h, n, refs) promotes completed hashes with non-empty refs so LRU can't evict before waiters re-admit.
  • Atomic per-request prepass; chunked-prefill RUNNING-path park; ABANDONED resurrection + same-hash orphan rescue.

Test Plan

  • 23 unit tests pass (tests/v1/ec_connector/unit/test_async_load.py) — state machine, back-compat shim, CUDA event cross-stream visibility invariant.
  • mypy-local skipped: pre-existing KVConnectorOutput.merge union-attr error in vllm/v1/outputs.py, verified on plain origin/main.
  • Full e2e Dynamo 2B sweep — pending dev-image rebuild against this branch.

@furionw furionw force-pushed the qiwa/async-ec-load branch from 1ba781f to 945db52 Compare May 2, 2026 22:59
Adds a scheduler-driven async-load fast path to the EC (encoder-cache)
connector that mirrors the KV connector's WAITING_FOR_REMOTE_KVS pattern.
When a connector returns (hit=True, load_async=True) from has_cache_item,
the scheduler parks the request in a new RequestStatus.WAITING_FOR_EMBEDDINGS,
dispatches the worker H2D on a side stream, and re-admits the request only
after the worker reports the load done via ECConnectorOutput.finished_loading.
This overlaps the H2D with the previous step's compute instead of stalling
the consumer step's prefill behind an inline copy on the default stream.

Design highlights:
- has_cache_item return widened to bool | tuple[bool, bool]; legacy bool
  return continues to route through the existing sync external_load path.
- New get_finished_loads(encoder_cache=...) worker hook reports completed
  H2Ds keyed by mm_hash. The connector moves each completed tensor into
  encoder_cache before reporting, so the next step's _gather_mm_embeddings
  sees it. Re-admission gated on event.query() == True; per CUDA semantics
  this guarantees cross-stream visibility, so no default_stream.wait_event
  is needed at dispatch time (would have killed the overlap).
- Per-mm_hash _EmbeddingLoadState (LOADING / ABANDONED) decouples scheduler
  bookkeeping from encoder_cache_manager state. Hashes never appear in
  encoder_cache_manager.cached while still in flight, eliminating the
  admit-before-ready race.
- Encoder-cache slots are reserved up-front via reserve_async_loaded_slots
  when the request is parked, NOT at register_external_loaded time.
  Without up-front reservation, concurrent encoder compute on other
  requests could eat the budget during the park window and trip the
  capacity assertion when the parked load completes.
  release_async_loaded_slots returns slots on the ABANDONED branch.
- register_external_loaded promotes the entry into cached with non-empty
  refs so the LRU cannot evict it between completion and the waiter
  request's re-admission. Slots are NOT decremented again here (the
  reserve at park time already did that).
- Atomic per-request prepass (_prepare_async_load_indices) detects async
  hits without mutating scheduler state mid-scan.
- RUNNING-path park: chunked-prefill mid-flight that hits an async EC on
  a later mm_feature is queued for end-of-iteration removal from
  self.running, avoiding mid-iteration mutation.
- ABANDONED resurrection + same-hash orphan rescue handled.
- New request_async_load hook decouples async-path semantics from the
  legacy update_state_after_alloc.
- New ECConnectorMetadata.evict_orphan slot for scheduler->worker tensor
  cleanup of abandoned loads.

Tests cover: bool/tuple back-compat shim, RequestStatus enum ordering,
state-machine transitions, EncoderCacheManager.check_only +
reserve_async_loaded_slots / release_async_loaded_slots /
register_external_loaded, ECConnectorOutput.finished_loading,
ECConnectorBase get_finished_loads + request_async_load defaults, and
a CUDA event lifecycle test that validates the cross-stream visibility
invariant the design depends on (tests/v1/ec_connector/unit/test_async_load.py,
25 cases).

Verified locally on Qwen3-VL-2B-Instruct: 5/5 sweep iters complete cleanly
for both dynamo-fd and dynamo-fd-ec configs across 30 prompts each.
nsys shows the dedicated side stream is created and its H2D copies
overlap kernels on the default stream — the design's load-bearing
perf invariant. Note: on this small-model + huge-encoder-cache workload
the EC barely fires (GPU-resident encoder cache absorbs all reuse), so
the perf delta is workload-dependent and not visible at this scale.

Co-authored-by: Claude
@furionw furionw force-pushed the qiwa/async-ec-load branch from 945db52 to daddf0d Compare May 2, 2026 23:19
@furionw furionw changed the base branch from main to releases/v0.20.1 May 2, 2026 23:19
PR vllm-project#40145 (deepstack buffer optimization) added a strict check that
raises ValueError when `_get_deepstack_input_embeds(num_tokens)` is
called with `num_tokens` greater than the buffer's stored span. This
regresses the EC-connector async-load path: the scheduler parks
requests in WAITING_FOR_EMBEDDINGS, so total_num_scheduled_tokens can
land between cudagraph capture sizes (e.g. 34), and the model_runner
pads inputs_embeds to the next bucket (40) before forward(). The
deepstack buffer is sized to the actual mm-token span (34), so
forward(inputs_embeds.size(0)=40) trips the check.

Zero-pad the tail of the buffer (growing the underlying tensor when
needed) so padded positions contribute no deepstack signal — same
intent as the upstream follow-up in vllm-project#40932. Also relax
_clear_deepstack_input_embeds to clamp to stored count instead of
raising on padded clears.

Repros on Qwen3-VL-2B with --enable-prefix-caching, the EC connector
enabled, and a chunked-prefill schedule that parks any request mid-
flight; observed in the fd-ec smoke for the 397B sweep.

Co-authored-by: Claude
Signed-off-by: furionw <qiwa@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant