
Durable execution: minimal advanced-template + chatbot integration #207

Open
dhruv0811 wants to merge 4 commits into main from prose-min-templates

Conversation

@dhruv0811
Contributor

@dhruv0811 dhruv0811 commented May 1, 2026

Summary

Wires the advanced templates (agent-langgraph-advanced, agent-openai-advanced) and the e2e chatbot up to the bridge's LongRunningAgentServer so an in-flight agent run survives a pod restart without losing the user's turn. Companion to bridge PR databricks-ai-bridge#425 (merged).

This is a re-imagined, minimal version of #204, addressing the feedback that the prior PR's 506 lines (and especially the 272-line Express proxy) were too clunky. The keystone simplification: the Express proxy is gone. Vercel AI SDK's databricksFetch (in providers-server.ts) is already the single boundary every agent request flows through, so all the durable-execution glue now lives there.

What the chatbot does

All durable-execution work consolidated into e2e-chatbot-app-next/packages/ai-sdk-providers/src/providers-server.ts:

  • Background-mode injection. When API_PROXY points at a long-running server, set body.background = true on streaming requests so the server persists every SSE frame to its durable store and the retrieve endpoint can resume mid-stream.
  • Rotated-conversation alias capture. Sniff response.resumed SSE sentinels to capture the bridge's rotated conversation_id (e.g. chat-123::attempt-2), keyed by the original chat id in a module-scope Map. On the next user turn, databricksFetch substitutes the captured alias into the outgoing body.context.conversation_id so the request lands on the post-resume session, not the orphan-poisoned original.
  • Auto-resume on stream close. If the SSE stream closes without [DONE], transparently re-stream from GET /responses/{id}?stream=true&starting_after=<seq> (capped at 5 retries). Bytes pass through untouched; we only sniff data frames to track response_id / sequence_number / DONE state.

All three sit inside databricksFetch — no new server route, no separate proxy layer, no new module.
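For concreteness, the pass-through bookkeeping can be sketched as follows. This is an illustrative Python sketch of the logic only (the real implementation is TypeScript in providers-server.ts); the `StreamTracker` name and the exact frame shapes are assumptions, while the `/responses/{id}?stream=true&starting_after=<seq>` resume path and the `[DONE]` sentinel come from the description above.

```python
import json

class StreamTracker:
    """Illustrative sketch: pass SSE bytes through untouched while
    remembering response_id / last sequence_number / DONE state, so a
    stream that dies mid-flight can be resumed from where it left off."""

    def __init__(self):
        self.response_id = None
        self.last_seq = None
        self.done = False

    def observe(self, line: str) -> None:
        # Only `data:` frames carry payloads; everything else is ignored.
        if not line.startswith("data:"):
            return
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            self.done = True
            return
        event = json.loads(payload)
        # Field names here are assumed for illustration.
        self.response_id = event.get("response", {}).get("id", self.response_id)
        if "sequence_number" in event:
            self.last_seq = event["sequence_number"]

    def resume_path(self):
        # Stream closed without [DONE]: re-stream from the last seen frame.
        if self.done or self.response_id is None:
            return None
        seq = self.last_seq if self.last_seq is not None else 0
        return f"/responses/{self.response_id}?stream=true&starting_after={seq}"
```

In the real code this runs inside the fetch wrapper with the retry count capped at MAX_RESUME_ATTEMPTS.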

What the templates do

  • Cross-turn UI-echo dedup. Both advanced templates' chat handlers dedupe the chatbot's full-history echo against the SDK's own state. LangGraph asks the checkpointer (agent.aget_state(config)); OpenAI treats the session as authoritative whenever non-empty (session.get_items()). When the SDK already holds the prior turns, only the latest user message is forwarded — no synthetic tool events, no SDK-specific wrappers, no bridge involvement. Lives in agent_server/utils.py::deduplicate_input per template.
  • Bridge log surfacing. Uvicorn's default logging config drops INFO from non-uvicorn loggers, silently swallowing the durable-execution lifecycle breadcrumbs. Both advanced templates' start_server.py now attach a StreamHandler to the databricks_ai_bridge logger so [durable] resume…, [durable] stale-scan loop start…, and friends reach app stdout.
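The log-surfacing shim amounts to a few lines of standard-library logging; a minimal sketch (the formatter string and the handler-dedup guard are illustrative, the logger name is the one used above):

```python
import logging
import sys

# Uvicorn's default config only routes its own loggers to stdout, so
# attach a handler to the bridge logger explicitly; otherwise the
# [durable] lifecycle breadcrumbs are dropped at INFO level.
bridge_logger = logging.getLogger("databricks_ai_bridge")
if not bridge_logger.handlers:  # avoid stacking handlers on reload
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    bridge_logger.addHandler(handler)
bridge_logger.setLevel(logging.INFO)
```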

What is NOT in the templates

  • No Express proxy. The previous iteration of this PR added a 272-line /invocations proxy in server/src/index.ts to do background-mode injection + alias capture + auto-resume. All three responsibilities collapsed into databricksFetch; the Express layer is back to its main-branch shape.
  • No per-SDK adapter wrappers. No AsyncDatabricksSession.aget_tuple overrides, no DatabricksSaver.aget_tuple overrides. The bridge contract is "agent owns its session/checkpointer state"; templates abide by it.
  • No bridge dependency pin. LongRunningAgentServer already exists in databricks-ai-bridge 0.19.0 (older flavor), so without the new prose-recovery release the templates degrade gracefully — the chatbot's alias capture stays dormant because no response.resumed sentinels arrive. We'll bump the dependency floor when 0.20.0 (containing #425) ships.
  • No bundled databricks.yml customizations. Bundle names and app names stay at the template defaults; per-deploy customizations live outside the PR.

Files

Component | File
Background injection + alias capture + auto-resume | e2e-chatbot-app-next/packages/ai-sdk-providers/src/providers-server.ts
LangGraph UI-echo dedup helper | agent-langgraph-advanced/agent_server/utils.py::deduplicate_input
LangGraph dedup wired into stream_handler | agent-langgraph-advanced/agent_server/agent.py
OpenAI UI-echo dedup heuristic (session is authoritative) | agent-openai-advanced/agent_server/utils.py::deduplicate_input
Bridge log surfacing | agent-{langgraph,openai}-advanced/agent_server/start_server.py

Settings

Setting | Default | Rationale
MAX_RESUME_ATTEMPTS (chatbot) | 5 | Cap auto-resume attempts on SSE close-without-DONE so a permanently-down server doesn't loop forever.

Testing the durable path before 0.20.0 ships

Two pieces of pre-release setup are needed if you want to exercise the new prose-recovery flow end-to-end (after 0.20.0 they go away):

  1. Pin to bridge main. Add a temporary [tool.uv.sources] block to either advanced template's pyproject.toml:

    [tool.uv.sources]
    databricks-ai-bridge = { git = "https://github.com/databricks/databricks-ai-bridge.git", branch = "main" }

    Without this, databricks-ai-bridge resolves to the 0.19.0 PyPI release (which still has LongRunningAgentServer but lacks #425's prose-recovery, heartbeat, and attempt_number rotation).

  2. Enable the debug kill endpoint. Set LONG_RUNNING_ENABLE_DEBUG_KILL=1 on the deployed app (env block in databricks.yml). This gates POST /_debug/kill_task/{response_id}, the test-only endpoint that simulates a pod crash mid-stream. Leave it unset in production.

Test plan

  • agent-langgraph-advanced UI testing on dhruv-lg-claude-durable (Claude Sonnet 4.5) — multi-tool kill mid-deep_research, multi-turn ✅
  • agent-openai-advanced UI testing on dhruv-oai-gpt-durable (GPT-5) — same matrix ✅
  • agent-openai-advanced UI testing on dhruv-oai-claude-durable (Claude Sonnet 4.5) — same matrix ✅
  • Base-template regression: agent-openai-agents-sdk deployed as dhruv-oai-base-newui against the same chatbot branch — confirms the new providers-server.ts doesn't break templates that don't use durable execution ✅
  • e2e integration tests still green for all advanced templates

Companion PR

databricks-ai-bridge#425: LongRunningAgentServer durable prose-recovery + always-rotate (merged).

Known follow-ups (non-blocking)

  • Move bridge log surfacing into LongRunningAgentServer.__init__ itself so individual templates don't each need to wire it up. Currently lives in each start_server.py.
  • Bump the databricks-ai-bridge dependency floor in both advanced templates once a release containing #425 ships on PyPI.
  • Lift the chatbot's auto-resume + alias capture into the upstream @databricks/ai-sdk-provider package once the durable contract stabilizes.
Demo recordings: Text.Only.-.Prose.mov, Tool.Calling.Multiturn.-.Prose.mov

dhruv0811 added 3 commits May 1, 2026 21:25
Lets the advanced templates use `LongRunningAgentServer` end-to-end so an
in-flight agent run survives a pod restart without losing the user's turn.

Chatbot side (`e2e-chatbot-app-next/packages/ai-sdk-providers/src/providers-server.ts`):
the AI SDK's custom `databricksFetch` is the single boundary every agent
request flows through, so we put all the durable-execution glue here:
  - inject `background: true` on streaming requests when `API_PROXY` points
    at a long-running server,
  - capture the rotated `conversation_id` emitted in `response.resumed`
    sentinels and replay it on subsequent turns so the next request lands
    on the post-resume session,
  - on a stream that closes without `[DONE]`, transparently re-stream from
    `GET /responses/{id}?stream=true&starting_after=<seq>` (capped retries).

Template side:
  - factor the LangGraph dedup helper into `agent_server/utils.py` as
    `deduplicate_input` (matches the OpenAI template's existing helper),
    and call it from `stream_handler` so checkpointer-backed history
    isn't double-counted with the chatbot's UI-echoed history,
  - pin `databricks-ai-bridge` to the durable-execution branch via
    `[tool.uv.sources]` until that work ships in a stable release.

Plumbing:
  - `start_app.py` now honors `APP_TEMPLATES_BRANCH` (default `main`) when
    cloning the chatbot, so a non-main branch can be tested end-to-end.
Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
Uvicorn's default logging config drops INFO records from non-uvicorn
loggers, which silently swallows all the durable-execution lifecycle
breadcrumbs (task spawn, resume, prose-recovery build, terminal-status
CAS, stale-scan claims). When debugging a deployed app the only signal
left was raw uvicorn access logs — not enough to tell whether the
durable path was even firing.

Attach a stream handler to the `databricks_ai_bridge` logger explicitly
so its lifecycle logs reach app stdout. Long-term this belongs in the
bridge's `LongRunningAgentServer.__init__`, but doing it in the
templates means we don't have to wait for a bridge release.
The prior heuristic compared `session_items >= messages - 1` to decide
whether to forward only the latest user message. Under prose-recovery
+ always-rotate the rotated session has FEWER items than the chatbot's
accumulated UI echo (attempt N+1's session is fresh while the UI
accumulated events from both attempts), so the heuristic was returning
all messages — duplicates of attempt N+1's tool_calls plus the orphan
tool_use from attempt N.

The Runner then combined session+input, producing duplicate function_call
items that the OpenAI SDK grouped into a malformed assistant.tool_calls
block. Anthropic-backed models (databricks-claude-*) rejected the
request with 400 "tool_use ids were found without tool_result blocks
immediately after". gpt-* tolerated it; LangGraph templates were unaffected
because their dedup uses checkpointer state, not a count heuristic.

Fix: if the session has any items at all, treat it as authoritative for
cross-turn history and forward only the latest message. First-turn path
(empty session) still returns the full input so MLflow evaluation works.

This was originally fixed in the prior templates PR (commit 31d87d6,
"agent-openai-advanced: trust session as authoritative for cross-turn
dedup") and was inadvertently dropped when re-imagining the PR from main.
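The old-vs-new heuristic described in this commit can be sketched in Python as follows (message and session-item shapes are simplified for illustration; the real helper is agent_server/utils.py::deduplicate_input):

```python
def deduplicate_input_old(messages, session_items):
    """Old count-based heuristic. Under always-rotate the fresh rotated
    session holds FEWER items than the accumulated UI echo, so this
    falls through and forwards duplicate history."""
    if len(session_items) >= len(messages) - 1:
        return messages[-1:]
    return messages

def deduplicate_input(messages, session_items):
    """Fixed heuristic: any non-empty session is authoritative for
    cross-turn history, so forward only the latest user message.
    An empty session (first turn) still gets the full input so
    MLflow evaluation sees the whole conversation."""
    if session_items:
        return messages[-1:]
    return messages
```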
@dhruv0811 dhruv0811 force-pushed the prose-min-templates branch from cfba80e to 6e3b3be Compare May 1, 2026 21:59
@dhruv0811 dhruv0811 requested a review from bbqiu May 1, 2026 22:08
dhruv0811 added a commit that referenced this pull request May 1, 2026
#425 merged so the durable-execution bits are on bridge main now. Pin
both advanced templates to the bridge main branch (instead of the now-
deleted feature branch); remove this block entirely once a release ships.

Also revert the APP_TEMPLATES_BRANCH support in start_app.py — that was
a testing-only hack to let deployed templates clone the chatbot from a
non-main branch while #207 was open. Once #207 lands, mainline is the
right default.

Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
dhruv0811 added a commit that referenced this pull request May 1, 2026
#425 merged so the durable-execution bits are on bridge main now. Pin
both advanced templates to the bridge main branch (instead of the now-
deleted feature branch); remove this block entirely once a release ships.

Also revert the APP_TEMPLATES_BRANCH support in start_app.py — that was
a testing-only hack to let deployed templates clone the chatbot from a
non-main branch while #207 was open. Once #207 lands, mainline is the
right default.

Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
@dhruv0811 dhruv0811 force-pushed the prose-min-templates branch from 754242e to d13064e Compare May 1, 2026 22:12
#425 merged so the durable-execution bits are on bridge main now.
Templates only need PyPI databricks-ai-bridge — `LongRunningAgentServer`
already exists in 0.19.0 (older flavor), so without the new release the
templates degrade gracefully (chatbot's alias capture stays dormant
because no `response.resumed` sentinels arrive). To exercise the new
prose-recovery features end-to-end before 0.20.0 ships, add a temporary
`[tool.uv.sources]` block pinning the bridge to its main branch.

Also revert the APP_TEMPLATES_BRANCH support in start_app.py — that was
a testing-only hack to let deployed templates clone the chatbot from a
non-main branch while #207 was open. Once #207 lands, mainline is the
right default.
@dhruv0811 dhruv0811 force-pushed the prose-min-templates branch from d13064e to 432d654 Compare May 1, 2026 22:13

logger = logging.getLogger(__name__)

# Surface databricks_ai_bridge INFO logs (durable-execution lifecycle:
Contributor Author


@bbqiu do we want to maybe gate this behind some debug env var? (my pr desc mentions the env var used to enable the debug kill endpoint, maybe we can bar the logging behind that too?)
