Description
Issue reported via Foundry Discord https://aka.ms/foundry/discord
When using Foundry Local with the Microsoft Agent Framework, CPU models run successfully, but NPU‑backed models fail at the fan‑out stage when multiple parallel LLM calls are made.
This occurs even on systems with sufficient shared memory (32 GB). Single, serialized inference works on NPU; failures only appear when fan‑out / parallel orchestration is used.
This makes NPU models unusable today for multi‑agent or fan‑out workflows.
Repro steps
1. Install Foundry Local on a Copilot+ PC with NPU support (Windows 11 24H2).
2. Clone and run: https://github.com/leestott/agentframework--foundrylocal
3. Configure the agent to use:
   - CPU model → ✅ succeeds
   - NPU model → ❌ fails during fan‑out
4. Observe the failure when multiple concurrent LLM calls are triggered.
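The fan‑out stage that triggers the failure can be reduced to a small sketch. Everything below is an illustrative assumption, not the sample's actual code: `call_model` is a stub standing in for a chat‑completion request to Foundry Local's local endpoint, and the prompts are placeholders. The relevant point is the shape of the orchestration, where all LLM calls are launched concurrently with `asyncio.gather`:

```python
import asyncio

# Hypothetical stand-in for a single chat-completion request to a
# Foundry Local model. In the real sample this would be an HTTP call
# to the locally hosted endpoint; here it only simulates latency.
async def call_model(prompt: str) -> str:
    await asyncio.sleep(0.01)  # simulated inference latency
    return f"response to: {prompt}"

async def fan_out(prompts: list[str]) -> list[str]:
    # Fan-out stage: every LLM call is in flight at once.
    # CPU models complete this; NPU-backed models reportedly fail here.
    return await asyncio.gather(*(call_model(p) for p in prompts))

if __name__ == "__main__":
    results = asyncio.run(fan_out(["task 1", "task 2", "task 3"]))
    print(len(results))
```

Running the same prompts one at a time (awaiting each call before starting the next) is the serialized path that works on NPU.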
Expected behavior
NPU models should either:
Support fan‑out / parallel inference, or
Fail gracefully with a clear error indicating that parallel execution is not supported
Actual behavior
CPU models complete successfully
NPU models fail at fan‑out / parallel execution stage
Failure appears related to NPU execution provider session / memory constraints rather than total system memory
Environment
Foundry Local: 0.8.119 (latest as of March 2026)
OS: Windows 11 24H2
Hardware: Copilot+ PC (ARM) with Qualcomm NPU (QNN execution provider)
Memory: 32 GB shared system memory
Agent Framework: used via agentframework--foundrylocal sample
Notes
NPU works correctly for single-agent, serialized inference
Issue only occurs with parallel / fan‑out agent execution
This appears to be a limitation of current NPU execution providers (session concurrency, buffer/KV cache duplication)
Clarifying whether this is a known limitation (and documenting it) would help developers avoid confusion when targeting NPUs with Agent Framework.
The reporter was asked to validate the issue by confirming it with the NPU model Qwen2.5-7b, and to upload logs and a screenshot of the demo and the failure.