
Durable execution: minimal advanced-template + chatbot integration #207

Open
dhruv0811 wants to merge 4 commits into main from prose-min-templates

Conversation

@dhruv0811
Contributor

@dhruv0811 dhruv0811 commented May 1, 2026

Summary

Wires the advanced templates (agent-langgraph-advanced, agent-openai-advanced) and the e2e chatbot up to the bridge's LongRunningAgentServer so an in-flight agent run survives a pod restart without losing the user's turn. Companion to bridge PR databricks-ai-bridge#425 (merged).

This is a re-imagined, minimal version of #204, addressing the feedback that the prior PR's 506 lines (and especially the 272-line Express proxy) were too clunky. The keystone simplification: the Express proxy is gone. Vercel AI SDK's databricksFetch (in providers-server.ts) is already the single boundary every agent request flows through, so all the durable-execution glue now lives there.

What the chatbot does

All durable-execution work consolidated into e2e-chatbot-app-next/packages/ai-sdk-providers/src/providers-server.ts:

  • Background-mode injection. When API_PROXY points at a long-running server, set body.background = true on streaming requests so the server persists every SSE frame to its durable store and the retrieve endpoint can resume mid-stream.
  • Rotated-conversation alias capture. Sniff response.resumed SSE sentinels to capture the bridge's rotated conversation_id (e.g. chat-123::attempt-2), keyed by the original chat id in a module-scope Map. On the next user turn, databricksFetch substitutes the captured alias into the outgoing body.context.conversation_id so the request lands on the post-resume session, not the orphan-poisoned original.
  • Auto-resume on stream close. If the SSE stream closes without [DONE], transparently re-stream from GET /responses/{id}?stream=true&starting_after=<seq> (capped at 5 retries). Bytes pass through untouched; we only sniff data frames to track response_id / sequence_number / DONE state.

All three sit inside databricksFetch — no new server route, no separate proxy layer, no new module.
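For concreteness, the pass-through bookkeeping can be sketched as follows. This is an illustrative Python sketch of the logic only (the real implementation is TypeScript in providers-server.ts); the `StreamTracker` name and the exact frame shapes are assumptions, while the `/responses/{id}?stream=true&starting_after=<seq>` resume path and the `[DONE]` sentinel come from the description above.

```python
import json

class StreamTracker:
    """Illustrative sketch: pass SSE bytes through untouched while
    remembering response_id / last sequence_number / DONE state, so a
    stream that dies mid-flight can be resumed from where it left off."""

    def __init__(self):
        self.response_id = None
        self.last_seq = None
        self.done = False

    def observe(self, line: str) -> None:
        # Only `data:` frames carry payloads; everything else is ignored.
        if not line.startswith("data:"):
            return
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            self.done = True
            return
        event = json.loads(payload)
        # Field names here are assumed for illustration.
        self.response_id = event.get("response", {}).get("id", self.response_id)
        if "sequence_number" in event:
            self.last_seq = event["sequence_number"]

    def resume_path(self):
        # Stream closed without [DONE]: re-stream from the last seen frame.
        if self.done or self.response_id is None:
            return None
        seq = self.last_seq if self.last_seq is not None else 0
        return f"/responses/{self.response_id}?stream=true&starting_after={seq}"
```

In the real code this runs inside the fetch wrapper with the retry count capped at MAX_RESUME_ATTEMPTS.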

What the templates do

  • Cross-turn UI-echo dedup. Both advanced templates' chat handlers dedupe the chatbot's full-history echo against the SDK's own state. LangGraph asks the checkpointer (agent.aget_state(config)); OpenAI treats the session as authoritative whenever non-empty (session.get_items()). When the SDK already holds the prior turns, only the latest user message is forwarded — no synthetic tool events, no SDK-specific wrappers, no bridge involvement. Lives in agent_server/utils.py::deduplicate_input per template.
  • Bridge log surfacing. Uvicorn's default logging config drops INFO from non-uvicorn loggers, silently swallowing the durable-execution lifecycle breadcrumbs. Both advanced templates' start_server.py now attach a StreamHandler to the databricks_ai_bridge logger so [durable] resume…, [durable] stale-scan loop start…, and friends reach app stdout.
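The log-surfacing shim amounts to a few lines of standard-library logging; a minimal sketch (the formatter string and the handler-dedup guard are illustrative, the logger name is the one used above):

```python
import logging
import sys

# Uvicorn's default config only routes its own loggers to stdout, so
# attach a handler to the bridge logger explicitly; otherwise the
# [durable] lifecycle breadcrumbs are dropped at INFO level.
bridge_logger = logging.getLogger("databricks_ai_bridge")
if not bridge_logger.handlers:  # avoid stacking handlers on reload
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    bridge_logger.addHandler(handler)
bridge_logger.setLevel(logging.INFO)
```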

What is NOT in the templates

  • No Express proxy. The previous iteration of this PR added a 272-line /invocations proxy in server/src/index.ts to do background-mode injection + alias capture + auto-resume. All three responsibilities collapsed into databricksFetch; the Express layer is back to its main-branch shape.
  • No per-SDK adapter wrappers. No AsyncDatabricksSession.aget_tuple overrides, no DatabricksSaver.aget_tuple overrides. The bridge contract is "agent owns its session/checkpointer state"; templates abide by it.
  • No bridge dependency pin. LongRunningAgentServer already exists in databricks-ai-bridge 0.19.0 (older flavor), so without the new prose-recovery release the templates degrade gracefully — the chatbot's alias capture stays dormant because no response.resumed sentinels arrive. We'll bump the dependency floor when 0.20.0 (containing #425) ships.
  • No bundled databricks.yml customizations. Bundle names and app names stay at the template defaults; per-deploy customizations live outside the PR.

Files

Component | File
Background injection + alias capture + auto-resume | e2e-chatbot-app-next/packages/ai-sdk-providers/src/providers-server.ts
LangGraph UI-echo dedup helper | agent-langgraph-advanced/agent_server/utils.py::deduplicate_input
LangGraph dedup wired into stream_handler | agent-langgraph-advanced/agent_server/agent.py
OpenAI UI-echo dedup heuristic (session is authoritative) | agent-openai-advanced/agent_server/utils.py::deduplicate_input
Bridge log surfacing | agent-{langgraph,openai}-advanced/agent_server/start_server.py

Settings

Setting | Default | Rationale
MAX_RESUME_ATTEMPTS (chatbot) | 5 | Cap auto-resume attempts on SSE close-without-DONE so a permanently-down server doesn't loop forever.

Testing the durable path before 0.20.0 ships

Two pieces of pre-release setup are needed if you want to exercise the new prose-recovery flow end-to-end (after 0.20.0 they go away):

  1. Pin to bridge main. Add a temporary [tool.uv.sources] block to either advanced template's pyproject.toml:

    [tool.uv.sources]
    databricks-ai-bridge = { git = "https://github.com/databricks/databricks-ai-bridge.git", branch = "main" }

    Without this, databricks-ai-bridge resolves to the 0.19.0 PyPI release (which still has LongRunningAgentServer but lacks #425's prose-recovery, heartbeat, and attempt_number rotation).

  2. Enable the debug kill endpoint. Set LONG_RUNNING_ENABLE_DEBUG_KILL=1 on the deployed app (env block in databricks.yml). This gates POST /_debug/kill_task/{response_id}, the test-only endpoint that simulates a pod crash mid-stream. Leave it unset in production.

Test plan

  • agent-langgraph-advanced UI testing on dhruv-lg-claude-durable (Claude Sonnet 4.5) — multi-tool kill mid-deep_research, multi-turn ✅
  • agent-openai-advanced UI testing on dhruv-oai-gpt-durable (GPT-5) — same matrix ✅
  • agent-openai-advanced UI testing on dhruv-oai-claude-durable (Claude Sonnet 4.5) — same matrix ✅
  • Base-template regression: agent-openai-agents-sdk deployed as dhruv-oai-base-newui against the same chatbot branch — confirms the new providers-server.ts doesn't break templates that don't use durable execution ✅
  • e2e integration tests still green for all advanced templates

Companion PR

databricks-ai-bridge#425: LongRunningAgentServer durable prose-recovery + always-rotate (merged).

Known follow-ups (non-blocking)

  • Move bridge log surfacing into LongRunningAgentServer.__init__ itself so individual templates don't each need to wire it up. Currently lives in each start_server.py.
  • Bump the databricks-ai-bridge dependency floor in both advanced templates once a release containing #425 ships on PyPI.
  • Lift the chatbot's auto-resume + alias capture into the upstream @databricks/ai-sdk-provider package once the durable contract stabilizes.
Demo recordings: Text.Only.-.Prose.mov, Tool.Calling.Multiturn.-.Prose.mov

dhruv0811 added 3 commits May 1, 2026 21:25
Lets the advanced templates use `LongRunningAgentServer` end-to-end so an
in-flight agent run survives a pod restart without losing the user's turn.

Chatbot side (`e2e-chatbot-app-next/packages/ai-sdk-providers/src/providers-server.ts`):
the AI SDK's custom `databricksFetch` is the single boundary every agent
request flows through, so we put all the durable-execution glue here:
  - inject `background: true` on streaming requests when `API_PROXY` points
    at a long-running server,
  - capture the rotated `conversation_id` emitted in `response.resumed`
    sentinels and replay it on subsequent turns so the next request lands
    on the post-resume session,
  - on a stream that closes without `[DONE]`, transparently re-stream from
    `GET /responses/{id}?stream=true&starting_after=<seq>` (capped retries).

Template side:
  - factor the LangGraph dedup helper into `agent_server/utils.py` as
    `deduplicate_input` (matches the OpenAI template's existing helper),
    and call it from `stream_handler` so checkpointer-backed history
    isn't double-counted with the chatbot's UI-echoed history,
  - pin `databricks-ai-bridge` to the durable-execution branch via
    `[tool.uv.sources]` until that work ships in a stable release.

Plumbing:
  - `start_app.py` now honors `APP_TEMPLATES_BRANCH` (default `main`) when
    cloning the chatbot, so a non-main branch can be tested end-to-end.
Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
Uvicorn's default logging config drops INFO records from non-uvicorn
loggers, which silently swallows all the durable-execution lifecycle
breadcrumbs (task spawn, resume, prose-recovery build, terminal-status
CAS, stale-scan claims). When debugging a deployed app the only signal
left was raw uvicorn access logs — not enough to tell whether the
durable path was even firing.

Attach a stream handler to the `databricks_ai_bridge` logger explicitly
so its lifecycle logs reach app stdout. Long-term this belongs in the
bridge's `LongRunningAgentServer.__init__`, but doing it in the
templates means we don't have to wait for a bridge release.
The prior heuristic compared `session_items >= messages - 1` to decide
whether to forward only the latest user message. Under prose-recovery
+ always-rotate the rotated session has FEWER items than the chatbot's
accumulated UI echo (attempt N+1's session is fresh while the UI
accumulated events from both attempts), so the heuristic was returning
all messages — duplicates of attempt N+1's tool_calls plus the orphan
tool_use from attempt N.

The Runner then combined session+input, producing duplicate function_call
items that the OpenAI SDK grouped into a malformed assistant.tool_calls
block. Anthropic-backed models (databricks-claude-*) rejected the
request with 400 "tool_use ids were found without tool_result blocks
immediately after". gpt-* tolerated it; LangGraph templates were unaffected
because their dedup uses checkpointer state, not a count heuristic.

Fix: if the session has any items at all, treat it as authoritative for
cross-turn history and forward only the latest message. First-turn path
(empty session) still returns the full input so MLflow evaluation works.

This was originally fixed in the prior templates PR (commit 31d87d6,
"agent-openai-advanced: trust session as authoritative for cross-turn
dedup") and was inadvertently dropped when re-imagining the PR from main.
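The old-vs-new heuristic described in this commit can be sketched in Python as follows (message and session-item shapes are simplified for illustration; the real helper is agent_server/utils.py::deduplicate_input):

```python
def deduplicate_input_old(messages, session_items):
    """Old count-based heuristic. Under always-rotate the fresh rotated
    session holds FEWER items than the accumulated UI echo, so this
    falls through and forwards duplicate history."""
    if len(session_items) >= len(messages) - 1:
        return messages[-1:]
    return messages

def deduplicate_input(messages, session_items):
    """Fixed heuristic: any non-empty session is authoritative for
    cross-turn history, so forward only the latest user message.
    An empty session (first turn) still gets the full input so
    MLflow evaluation sees the whole conversation."""
    if session_items:
        return messages[-1:]
    return messages
```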
@dhruv0811 dhruv0811 force-pushed the prose-min-templates branch from cfba80e to 6e3b3be Compare May 1, 2026 21:59
@dhruv0811 dhruv0811 requested a review from bbqiu May 1, 2026 22:08
dhruv0811 added a commit that referenced this pull request May 1, 2026
#425 merged so the durable-execution bits are on bridge main now. Pin
both advanced templates to the bridge main branch (instead of the now-
deleted feature branch); remove this block entirely once a release ships.

Also revert the APP_TEMPLATES_BRANCH support in start_app.py — that was
a testing-only hack to let deployed templates clone the chatbot from a
non-main branch while #207 was open. Once #207 lands, mainline is the
right default.

Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
dhruv0811 added a commit that referenced this pull request May 1, 2026
#425 merged so the durable-execution bits are on bridge main now. Pin
both advanced templates to the bridge main branch (instead of the now-
deleted feature branch); remove this block entirely once a release ships.

Also revert the APP_TEMPLATES_BRANCH support in start_app.py — that was
a testing-only hack to let deployed templates clone the chatbot from a
non-main branch while #207 was open. Once #207 lands, mainline is the
right default.

Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
@dhruv0811 dhruv0811 force-pushed the prose-min-templates branch from 754242e to d13064e Compare May 1, 2026 22:12
#425 merged so the durable-execution bits are on bridge main now.
Templates only need PyPI databricks-ai-bridge — `LongRunningAgentServer`
already exists in 0.19.0 (older flavor), so without the new release the
templates degrade gracefully (chatbot's alias capture stays dormant
because no `response.resumed` sentinels arrive). To exercise the new
prose-recovery features end-to-end before 0.20.0 ships, add a temporary
`[tool.uv.sources]` block pinning the bridge to its main branch.

Also revert the APP_TEMPLATES_BRANCH support in start_app.py — that was
a testing-only hack to let deployed templates clone the chatbot from a
non-main branch while #207 was open. Once #207 lands, mainline is the
right default.
@dhruv0811 dhruv0811 force-pushed the prose-min-templates branch from d13064e to 432d654 Compare May 1, 2026 22:13

logger = logging.getLogger(__name__)

# Surface databricks_ai_bridge INFO logs (durable-execution lifecycle:
Contributor Author


@bbqiu do we want to maybe gate this behind some debug env var? (my pr desc mentions the env var used to enable the debug kill endpoint, maybe we can bar the logging behind that too?)
