## Summary

The sandbox proxy's `inference.local` interception path fully buffers upstream streaming responses (SSE) before sending any bytes back to the client. This converts a streaming response (fast TTFB, incremental tokens) into a buffered response (slow TTFB, instant completion), causing downstream clients with TTFB timeouts to abort before the proxy finishes buffering.
## Actual Behavior

- Client sends `POST /v1/chat/completions` with `"stream": true` to `inference.local`
- Proxy forwards to the upstream (e.g. an NVIDIA NIM endpoint), which begins streaming SSE events (TTFB ~200ms)
- `response.bytes().await` in `proxy_to_backend()` blocks for the full generation time (10-60s)
- Proxy sends nothing back to the client during this entire period
- Client's TTFB/first-event timeout fires and aborts the request
- Proxy eventually finishes buffering and tries to write to a closed connection
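The timeline above can be simulated without the proxy at all. The sketch below (standalone, not code from this repository) fakes an upstream that emits five SSE-style chunks 100ms apart, then compares the buffered pattern (drain everything before the first write, as `response.bytes().await` does) against incremental forwarding:

```rust
// Standalone sketch: why full buffering destroys TTFB.
// A fake "upstream" yields 5 SSE-like chunks, one every 100 ms.
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

fn upstream() -> mpsc::Receiver<Vec<u8>> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for i in 0..5 {
            thread::sleep(Duration::from_millis(100));
            let _ = tx.send(format!("data: token-{i}\n\n").into_bytes());
        }
    });
    rx
}

/// Mimics `response.bytes().await`: drain the whole body first.
fn buffered_ttfb() -> Duration {
    let start = Instant::now();
    let body: Vec<u8> = upstream().iter().flatten().collect();
    let _ = body; // only now could the first byte reach the client
    start.elapsed()
}

/// Forward each chunk as it arrives; TTFB is roughly one chunk delay.
fn streaming_ttfb() -> Duration {
    let start = Instant::now();
    let rx = upstream();
    let first = rx.recv().unwrap(); // first byte to the client here
    let ttfb = start.elapsed();
    let _ = (first, rx.iter().count()); // drain the remaining chunks
    ttfb
}

fn main() {
    println!("buffered TTFB:  {:?}", buffered_ttfb()); // ~500 ms
    println!("streaming TTFB: {:?}", streaming_ttfb()); // ~100 ms
}
```

With the real proxy the chunk delay is token generation time, so the buffered TTFB grows to the full 10-60s generation window.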
## Expected Behavior
The proxy should forward response headers and SSE chunks to the client incrementally as they arrive from the upstream, preserving the streaming semantics and sub-second TTFB.
## Root Cause

Three layers conspire to prevent streaming:

- `navigator-router/src/backend.rs:112-115` — `response.bytes().await` collects the entire upstream response body into memory before returning
- `ProxyResponse` struct (`backend.rs:9-13`) — stores the body as a single `bytes::Bytes` blob with no streaming abstraction
- `format_http_response()` (`l7/inference.rs:244-246`) — replaces the upstream `Transfer-Encoding: chunked` with a `Content-Length` based on the buffered body size, then writes everything in a single `write_all` call
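To make the second layer concrete, here is a sketch of the two response shapes. `ProxyResponse` is named in the report but its fields here are guesses; `StreamingResponse` is a hypothetical alternative. In the real async code the body would be an `impl Stream<Item = Bytes>`; a blocking iterator stands in for it so the sketch stays self-contained:

```rust
/// Roughly what the proxy returns today: one contiguous blob,
/// which cannot exist until the upstream has finished generating.
/// (Field names are illustrative, not copied from backend.rs.)
struct ProxyResponse {
    status: u16,
    body: Vec<u8>, // stands in for bytes::Bytes
}

/// Hypothetical streaming variant: headers up front, chunks on demand.
struct StreamingResponse {
    status: u16,
    chunks: Box<dyn Iterator<Item = Vec<u8>>>,
}

fn main() {
    let streaming = StreamingResponse {
        status: 200,
        chunks: Box::new((0..3).map(|i| format!("data: {i}\n\n").into_bytes())),
    };
    // Headers can be written as soon as `status` is known...
    assert_eq!(streaming.status, 200);
    // ...while the body is forwarded chunk by chunk as it arrives.
    let total: usize = streaming.chunks.map(|c| c.len()).sum();
    assert_eq!(total, 27);
    let _ = ProxyResponse { status: 200, body: Vec::new() };
}
```

The key property is that constructing `StreamingResponse` does not require the body to exist yet, so header writing is decoupled from generation time.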
## Affected Path

Only the HTTPS/CONNECT inference interception path (`inference.local`) is affected. The plain HTTP forward proxy path uses `copy_bidirectional` and streams correctly.

Call chain: `handle_inference_interception()` → `route_inference_request()` → `proxy_with_candidates()` → `proxy_to_backend()` → `response.bytes().await`
## Fix Direction

- Add a streaming variant of `proxy_to_backend()` that returns headers plus an `impl Stream<Item = Bytes>` instead of a fully-buffered `ProxyResponse`
- Have `route_inference_request()` write response headers immediately, then forward body chunks incrementally to the TLS client
- For streaming responses, emit `Transfer-Encoding: chunked` instead of `Content-Length`
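The last bullet implies the proxy must frame chunks itself once it stops computing a `Content-Length`. Per RFC 9112 §7.1, each chunk is its size in hex, CRLF, the data, CRLF, with a zero-size chunk terminating the body. A minimal encoder (a hypothetical helper, not an existing function in this codebase) could look like:

```rust
/// Frame one body chunk per HTTP/1.1 chunked transfer coding
/// (RFC 9112 §7.1): hex size, CRLF, data, CRLF.
fn encode_chunk(data: &[u8]) -> Vec<u8> {
    let mut out = format!("{:x}\r\n", data.len()).into_bytes();
    out.extend_from_slice(data);
    out.extend_from_slice(b"\r\n");
    out
}

/// The terminating zero-length chunk that ends the body.
fn last_chunk() -> &'static [u8] {
    b"0\r\n\r\n"
}

fn main() {
    // "data: hi\n\n" is 10 bytes, so the size prefix is hex "a".
    assert_eq!(encode_chunk(b"data: hi\n\n"), b"a\r\ndata: hi\n\n\r\n".to_vec());
    assert_eq!(last_chunk(), &b"0\r\n\r\n"[..]);
}
```

Each SSE event pulled from the upstream stream would be passed through `encode_chunk` and written to the TLS client immediately, with `last_chunk()` written when the upstream stream ends.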