Inference Interception Follow-ups
Status: Draft
Date: 2026-02-23
Context: Follow-ups from MR !38 (feat(inference): inference interception and routing).
Confirmed Constraints
- Inference routing is currently supported only in cluster mode.
- Local sandbox mode (`--policy-rules`/`--policy-data`) is not expected to support inference routing at this time.
Gaps Observed
1) Missing cluster-only guardrail in local mode
Current behavior can accept `CONNECT` for `inspect_for_inference` and then fail later if inference runtime prerequisites are missing.
Desired behavior:
- If the sandbox is not running with cluster prerequisites (`sandbox_id` + gateway endpoint + TLS state), fail fast and clearly.
- Do not emit `200 Connection Established` for requests that cannot be serviced.
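As a sketch of the fail-fast check, assuming a `ClusterContext` shape and 503 response body that are illustrative, not the actual runtime types:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClusterContext:
    """Hypothetical container for cluster prerequisites; field names
    mirror the list above, not the real implementation."""
    sandbox_id: Optional[str] = None
    gateway_endpoint: Optional[str] = None
    tls_ready: bool = False

def missing_prerequisites(ctx: ClusterContext) -> list:
    """Return the list of cluster prerequisites that are absent."""
    missing = []
    if not ctx.sandbox_id:
        missing.append("sandbox_id")
    if not ctx.gateway_endpoint:
        missing.append("gateway endpoint")
    if not ctx.tls_ready:
        missing.append("TLS state")
    return missing

def connect_response(ctx: ClusterContext) -> str:
    """Decide the CONNECT response before tunneling: never send an
    optimistic 200 when inference cannot be serviced."""
    missing = missing_prerequisites(ctx)
    if missing:
        return ("HTTP/1.1 503 Service Unavailable\r\n\r\n"
                "inference routing unavailable: missing " + ", ".join(missing))
    return "HTTP/1.1 200 Connection Established\r\n\r\n"
```

The key property is that the prerequisite check runs before any bytes of the `200` response are written.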
2) Incomplete Anthropic authentication handling
Current routing path is OpenAI-Bearer-centric and does not provide full Anthropic-compatible API key behavior.
Desired behavior:
- Preserve credential isolation for Anthropic flows just like OpenAI flows.
- Ensure route credentials (not sandbox/client credentials) are used for Anthropic upstream calls.
- Support Anthropic header semantics end-to-end (`x-api-key`, `anthropic-version`, and related required headers).
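A minimal sketch of the protocol-aware credential rewrite. The protocol labels (`openai_*`, `anthropic_messages`) follow this document's terminology; the function name, dict-based header model, and the pinned `anthropic-version` fallback value are assumptions:

```python
def rewrite_auth_headers(protocol: str, headers: dict, route_api_key: str) -> dict:
    # Strip inbound client credentials regardless of header casing, so
    # sandbox/client secrets never reach the upstream provider.
    out = {k: v for k, v in headers.items()
           if k.lower() not in ("authorization", "x-api-key")}
    if protocol.startswith("openai_"):
        out["authorization"] = "Bearer " + route_api_key
    elif protocol == "anthropic_messages":
        out["x-api-key"] = route_api_key
        # Assumed default if the client omitted the version header;
        # the real implementation may instead validate/reject.
        out.setdefault("anthropic-version", "2023-06-01")
    return out
```

Note that non-sensitive headers (e.g. `content-type`) pass through untouched, which preserves request compatibility.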
3) api_patterns policy field is not wired
`InferencePolicy.api_patterns` exists in the proto, but interception currently uses built-in defaults only.
Desired behavior:
- If the policy defines `api_patterns`, use them.
- If absent/empty, fall back to built-in defaults.
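The fallback logic could look like this sketch (the default pattern list is illustrative, not the actual built-ins):

```python
import re
from typing import Optional

# Illustrative defaults only; the real built-in pattern set lives in the
# interception code, not here.
DEFAULT_API_PATTERNS = [r"^/v1/chat/completions$", r"^/v1/messages$"]

def effective_patterns(policy_patterns: Optional[list]) -> list:
    """Policy patterns win when present and non-empty; otherwise fall
    back to the built-in defaults."""
    pats = policy_patterns if policy_patterns else DEFAULT_API_PATTERNS
    return [re.compile(p) for p in pats]

def matches_inference_api(path: str, policy_patterns: Optional[list] = None) -> bool:
    return any(p.search(path) for p in effective_patterns(policy_patterns))
```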
4) Missing GET /v1/models support
Current interception focuses on completion/messages-style write endpoints and does not provide route-aware model-listing behavior.
Desired behavior:
- Support `GET /v1/models` for intercepted OpenAI-compatible flows.
- Return a deterministic response strategy:
- proxy to compatible backend route, or
- synthesize a route-aware response when backend behavior is unsuitable.
- Ensure policy and route compatibility filtering applies to model listing just like inference generation endpoints.
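The synthesize option might look like this sketch, assuming routes carry a `model` field and that policy filtering reduces to an allow-set (both are assumptions about the route/policy shapes):

```python
import time
from typing import Optional

def synthesize_models_response(routes: list, allowed_models: Optional[set] = None) -> dict:
    """Build an OpenAI-compatible GET /v1/models body from route
    configuration, applying the same compatibility filtering used for
    generation endpoints."""
    models = []
    for route in routes:
        model_id = route["model"]
        # Policy/route filtering: skip models the caller may not use.
        if allowed_models is not None and model_id not in allowed_models:
            continue
        models.append({
            "id": model_id,
            "object": "model",
            "created": int(time.time()),
            "owned_by": route.get("owned_by", "route"),
        })
    return {"object": "list", "data": models}
```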
5) Missing streaming support
Current proxying model is request/response body buffering, which is insufficient for streaming APIs.
Desired behavior:
- Support streaming for OpenAI and Anthropic-compatible APIs.
- Preserve chunk boundaries and event framing (`text/event-stream` / SSE semantics where applicable).
- Propagate cancellation/connection-close behavior correctly across sandbox proxy, gRPC transport, and backend.
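One way to preserve SSE event framing regardless of how the transport splits or coalesces chunks is to re-buffer on the blank-line event delimiter, as in this sketch:

```python
def iter_sse_events(chunks):
    """Re-frame an arbitrary byte-chunk stream into complete SSE events
    (terminated by a blank line), so event boundaries survive even when
    the transport splits an event across chunks or packs several events
    into one chunk."""
    buf = b""
    for chunk in chunks:
        buf += chunk
        while b"\n\n" in buf:
            event, buf = buf.split(b"\n\n", 1)
            yield event + b"\n\n"
    if buf:
        # Trailing partial data at stream end; forward as-is.
        yield buf
```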
Next Steps
P0: Enforce cluster-only inference explicitly
- Add startup validation in sandbox runtime:
- If inference routing is configured but cluster prerequisites are absent, fail sandbox startup with a clear error.
- Add proxy-side defensive handling:
- For `inspect_for_inference`, verify prerequisites before returning `200 Connection Established`.
- Return a deterministic error response when inference is unavailable.
- Add tests:
- Unit/integration test for local mode + inference policy => explicit failure.
- Regression test ensuring no optimistic `200` is sent before prerequisite validation.
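The startup validation could reduce to a guard like this sketch (the function and error names are illustrative, not the actual runtime API):

```python
class SandboxConfigError(Exception):
    """Raised at sandbox startup for unsupportable configuration."""

def validate_inference_config(inference_enabled: bool, cluster_mode: bool) -> None:
    """Fail sandbox startup early with an actionable error instead of
    accepting traffic that will fail later."""
    if inference_enabled and not cluster_mode:
        raise SandboxConfigError(
            "inference routing requires cluster mode; "
            "local sandbox mode (--policy-rules/--policy-data) does not support it"
        )
```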
P0: Full Anthropic API key support
- Define protocol-aware auth rewrite behavior in router backend:
- `openai_*`: `Authorization: Bearer <route.api_key>`.
- `anthropic_messages`: set `x-api-key: <route.api_key>` and preserve/validate `anthropic-version` behavior.
- Strip inbound client credentials for Anthropic at interception boundary:
- Remove `authorization` and `x-api-key` from forwarded headers.
- Keep non-sensitive headers that are required for request compatibility.
- Add tests:
- Router integration test that verifies `x-api-key` rewrite for Anthropic routes.
- Regression test proving client-supplied Anthropic credentials are not forwarded upstream.
P1: Add GET /v1/models support
- Extend inference API pattern matching to classify models-list requests.
- Implement route-aware handling in gateway/router for model-list requests.
- Add tests:
- e2e test for intercepted `GET /v1/models`.
- compatibility test across multiple allowed routes.
P1: Add streaming support
- Define streaming transport contract across proxy <-> gateway (gRPC streaming or framed chunk transport).
- Implement protocol-aware streaming passthrough in router backend.
- Ensure cancellation propagation and timeout behavior are explicit.
- Add tests:
- e2e streaming chat completion test.
- disconnection/cancellation regression test.
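The cancellation-propagation requirement can be sketched with an async relay: when the client disconnects, the forwarding task is cancelled, and a `finally` guarantees the upstream stream is closed rather than leaked (all names are illustrative):

```python
import asyncio

async def relay(upstream, downstream_write):
    """Forward streamed chunks from an async generator to a writer.
    On cancellation (client disconnect), close the upstream so the
    backend connection is released."""
    try:
        async for chunk in upstream:
            downstream_write(chunk)
    finally:
        await upstream.aclose()

async def demo():
    """Small self-test: cancel mid-stream and observe upstream cleanup."""
    state = {"closed": False}

    async def upstream():
        try:
            while True:
                yield b"data: x\n\n"
                await asyncio.sleep(0.01)
        finally:
            state["closed"] = True  # runs on cancellation/close

    received = []
    task = asyncio.create_task(relay(upstream(), received.append))
    await asyncio.sleep(0.05)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return state["closed"], received
```

The same principle applies across the sandbox proxy, the gRPC transport, and the backend hop: each layer must translate its local disconnect signal into closing the next stream down.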
P1: Wire policy-driven API pattern configuration
- Map `sandbox.policy.inference.api_patterns` into the sandbox inference interception context.
- Add validation for malformed patterns.
- Add tests for:
- custom pattern match,
- default fallback behavior,
- invalid pattern rejection.
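Pattern validation might be a load-time compile pass like this sketch (assuming the patterns are regular expressions, which is an assumption about the proto field's semantics):

```python
import re

def compile_api_patterns(patterns: list) -> list:
    """Validate policy-supplied patterns up front, so a malformed entry
    is rejected at policy load time rather than at request time."""
    compiled = []
    for p in patterns:
        try:
            compiled.append(re.compile(p))
        except re.error as e:
            raise ValueError("invalid api_pattern %r: %s" % (p, e)) from None
    return compiled
```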
Definition of Done
- Local mode with inference configuration fails fast with a clear, actionable error.
- Anthropic requests route successfully using route-managed credentials only.
- `GET /v1/models` works through interception with policy/route-aware behavior.
- Streaming inference requests are supported end-to-end (including cancellation).
- `api_patterns` works when configured and defaults remain backward compatible.
- Unit/integration coverage added for each gap above.
Originally by @pimlock on 2026-02-22T22:10:54.126-08:00