Durable execution: minimal advanced-template + chatbot integration #207
Open
Lets the advanced templates use `LongRunningAgentServer` end-to-end so an
in-flight agent run survives a pod restart without losing the user's turn.
Chatbot side (`e2e-chatbot-app-next/packages/ai-sdk-providers/src/providers-server.ts`):
the AI SDK's custom `databricksFetch` is the single boundary every agent
request flows through, so we put all the durable-execution glue here:
- inject `background: true` on streaming requests when `API_PROXY` points
at a long-running server,
- capture the rotated `conversation_id` emitted in `response.resumed`
sentinels and replay it on subsequent turns so the next request lands
on the post-resume session,
- on a stream that closes without `[DONE]`, transparently re-stream from
`GET /responses/{id}?stream=true&starting_after=<seq>` (capped retries).
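The two stateful pieces of this glue, alias capture and capped auto-resume, can be sketched as follows. This is a rough Python stand-in for illustration (the real implementation is TypeScript inside `databricksFetch`); the frame shapes, helper names, and the `fetch_frames` callback are assumptions, not the actual chatbot API:

```python
# original chat id -> rotated conversation id captured from response.resumed
_aliases: dict[str, str] = {}
MAX_RESUME_ATTEMPTS = 5

def outgoing_conversation_id(chat_id: str) -> str:
    # replay the captured alias so the next turn lands on the post-resume session
    return _aliases.get(chat_id, chat_id)

def stream_with_resume(chat_id, response_id, fetch_frames):
    """Yield SSE data frames; if the stream closes before the [DONE] sentinel,
    re-stream from the last seen sequence number. fetch_frames(response_id,
    starting_after=...) stands in for
    GET /responses/{id}?stream=true&starting_after=<seq>.
    Frames are dicts, except the literal "[DONE]" terminator string."""
    last_seq = 0
    for _ in range(MAX_RESUME_ATTEMPTS + 1):
        for frame in fetch_frames(response_id, starting_after=last_seq):
            if frame == "[DONE]":
                yield frame
                return
            if frame.get("type") == "response.resumed":
                # capture the rotated conversation_id, e.g. chat-123::attempt-2
                _aliases[chat_id] = frame["conversation_id"]
            last_seq = frame.get("sequence_number", last_seq)
            yield frame
        # fell off the stream without [DONE]; loop re-streams from last_seq
    raise RuntimeError(f"no [DONE] after {MAX_RESUME_ATTEMPTS} resume attempts")
```

Frames pass through unmodified; the wrapper only sniffs them for the sequence number and the resume sentinel, matching the "bytes pass through untouched" behavior described above.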
Template side:
- factor the LangGraph dedup helper into `agent_server/utils.py` as
`deduplicate_input` (matches the OpenAI template's existing helper),
and call it from `stream_handler` so checkpointer-backed history
isn't double-counted with the chatbot's UI-echoed history,
- pin `databricks-ai-bridge` to the durable-execution branch via
`[tool.uv.sources]` until that work ships in a stable release.
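The temporary pin might look something like this; the repository URL and branch name below are placeholders (the PR pins a feature branch whose name isn't given here), not values taken from the PR:

```toml
# pyproject.toml -- temporary; delete once the durable-execution work
# ships in a stable databricks-ai-bridge release on PyPI.
[tool.uv.sources]
databricks-ai-bridge = { git = "https://github.com/databricks/databricks-ai-bridge.git", branch = "main" }
```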
Plumbing:
- `start_app.py` now honors `APP_TEMPLATES_BRANCH` (default `main`) when
cloning the chatbot, so a non-main branch can be tested end-to-end.
Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
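The `APP_TEMPLATES_BRANCH` plumbing above amounts to reading an env var with a `main` fallback when building the clone command. A minimal sketch, assuming the script shells out to `git` (the function name and flag set are illustrative, not the actual `start_app.py` code):

```python
import os

def chatbot_clone_args(repo_url: str, dest: str) -> list[str]:
    # Honor APP_TEMPLATES_BRANCH (default "main") so a non-main chatbot
    # branch can be exercised end-to-end from a deployed template.
    branch = os.environ.get("APP_TEMPLATES_BRANCH", "main")
    return ["git", "clone", "--branch", branch, "--single-branch", repo_url, dest]
```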
Uvicorn's default logging config drops INFO records from non-uvicorn loggers, which silently swallows all the durable-execution lifecycle breadcrumbs (task spawn, resume, prose-recovery build, terminal-status CAS, stale-scan claims). When debugging a deployed app the only signal left was raw uvicorn access logs — not enough to tell whether the durable path was even firing.

Attach a stream handler to the `databricks_ai_bridge` logger explicitly so its lifecycle logs reach app stdout. Long-term this belongs in the bridge's `LongRunningAgentServer.__init__`, but doing it in the templates means we don't have to wait for a bridge release.
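The wiring is standard `logging` setup; a minimal sketch of what the templates do (the formatter string is an assumption, the logger name is the one the commit names):

```python
import logging
import sys

# Attach a handler directly to the bridge logger so its INFO-level
# durable-execution lifecycle logs reach app stdout even though uvicorn's
# default config ignores non-uvicorn loggers.
bridge_logger = logging.getLogger("databricks_ai_bridge")
bridge_logger.setLevel(logging.INFO)
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
)
bridge_logger.addHandler(handler)
```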
The prior heuristic compared `session_items >= messages - 1` to decide whether to forward only the latest user message. Under prose-recovery + always-rotate the rotated session has FEWER items than the chatbot's accumulated UI echo (attempt N+1's session is fresh while the UI accumulated events from both attempts), so the heuristic was returning all messages — duplicates of attempt N+1's tool_calls plus the orphan tool_use from attempt N.

The Runner then combined session+input, producing duplicate function_call items that the OpenAI SDK grouped into a malformed assistant.tool_calls block. Anthropic-backed models (databricks-claude-*) rejected the request with 400 "tool_use ids were found without tool_result blocks immediately after". gpt-* tolerated it; LangGraph templates were unaffected because their dedup uses checkpointer state, not a count heuristic.

Fix: if the session has any items at all, treat it as authoritative for cross-turn history and forward only the latest message. The first-turn path (empty session) still returns the full input so MLflow evaluation works.

This was originally fixed in the prior templates PR (commit 31d87d6, "agent-openai-advanced: trust session as authoritative for cross-turn dedup") and was inadvertently dropped when re-imagining the PR from main.
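The fixed rule is small enough to sketch as a helper (the name matches the PR's `deduplicate_input`; the signature and list-of-strings shape are assumptions for illustration):

```python
def deduplicate_input(messages: list, session_items: list) -> list:
    """Decide how much of the chatbot's UI-echoed history to forward.

    The removed heuristic forwarded only the latest message when
    len(session_items) >= len(messages) - 1; under always-rotate the fresh
    post-resume session has FEWER items than the UI echo, so it wrongly
    forwarded everything and produced duplicate function_call items.

    New rule: a non-empty session is authoritative for cross-turn history,
    so forward only the latest user message. An empty session (first turn)
    still gets the full input so MLflow evaluation works.
    """
    if session_items:
        return messages[-1:]
    return list(messages)
```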
Force-pushed cfba80e to 6e3b3be
dhruv0811 added a commit that referenced this pull request on May 1, 2026
#425 merged so the durable-execution bits are on bridge main now. Pin both advanced templates to the bridge main branch (instead of the now-deleted feature branch); remove this block entirely once a release ships.

Also revert the APP_TEMPLATES_BRANCH support in start_app.py — that was a testing-only hack to let deployed templates clone the chatbot from a non-main branch while #207 was open. Once #207 lands, mainline is the right default.

Signed-off-by: Dhruv Gupta <dhruv.gupta@databricks.com>
Force-pushed 754242e to d13064e
#425 merged so the durable-execution bits are on bridge main now. Templates only need PyPI `databricks-ai-bridge` — `LongRunningAgentServer` already exists in 0.19.0 (older flavor), so without the new release the templates degrade gracefully (chatbot's alias capture stays dormant because no `response.resumed` sentinels arrive). To exercise the new prose-recovery features end-to-end before 0.20.0 ships, add a temporary `[tool.uv.sources]` block pinning the bridge to its main branch.

Also revert the APP_TEMPLATES_BRANCH support in start_app.py — that was a testing-only hack to let deployed templates clone the chatbot from a non-main branch while #207 was open. Once #207 lands, mainline is the right default.
Force-pushed d13064e to 432d654
dhruv0811 commented on May 1, 2026
```python
logger = logging.getLogger(__name__)

# Surface databricks_ai_bridge INFO logs (durable-execution lifecycle:
```
@bbqiu do we want to maybe gate this behind some debug env var? (My PR description mentions the env var used to enable the debug kill endpoint; maybe we can gate the logging behind that too?)
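If the gating the comment suggests were adopted, it might look like the sketch below. Reusing `LONG_RUNNING_ENABLE_DEBUG_KILL` as the gate and the function name are both assumptions from the comment, not code in the PR:

```python
import logging
import os
import sys

def maybe_attach_bridge_handler(env=os.environ) -> bool:
    # Only surface bridge lifecycle logs when the debug env var is set,
    # so production deployments keep quiet stdout.
    if env.get("LONG_RUNNING_ENABLE_DEBUG_KILL") == "1":
        bridge_logger = logging.getLogger("databricks_ai_bridge")
        bridge_logger.setLevel(logging.INFO)
        bridge_logger.addHandler(logging.StreamHandler(sys.stdout))
        return True
    return False
```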
Summary

Wires the advanced templates (`agent-langgraph-advanced`, `agent-openai-advanced`) and the e2e chatbot up to the bridge's `LongRunningAgentServer` so an in-flight agent run survives a pod restart without losing the user's turn. Companion to bridge PR databricks-ai-bridge#425 (merged).

This is a re-imagined, minimal version of #204 — addressing the feedback that the prior PR's 506 lines (and especially the 272-line Express proxy) were too clunky. The keystone simplification: the Express proxy is gone. Vercel AI SDK's `databricksFetch` (in `providers-server.ts`) is already the single boundary every agent request flows through, so all the durable-execution glue lives there now.

What the chatbot does
All durable-execution work is consolidated into `e2e-chatbot-app-next/packages/ai-sdk-providers/src/providers-server.ts`:

- Background mode: when `API_PROXY` points at a long-running server, set `body.background = true` on streaming requests so the server persists every SSE frame to its durable store and the retrieve endpoint can resume mid-stream.
- Alias capture: watch `response.resumed` SSE sentinels to capture the bridge's rotated `conversation_id` (e.g. `chat-123::attempt-2`), keyed by the original chat id in a module-scope `Map`. On the next user turn, `databricksFetch` substitutes the captured alias into the outgoing `body.context.conversation_id` so the request lands on the post-resume session, not the orphan-poisoned original.
- Auto-resume: on a stream that closes without `[DONE]`, transparently re-stream from `GET /responses/{id}?stream=true&starting_after=<seq>` (capped at 5 retries). Bytes pass through untouched; we only sniff data frames to track `response_id` / `sequence_number` / DONE state.

All three sit inside `databricksFetch` — no new server route, no separate proxy layer, no new module.

What the templates do
- Cross-turn dedup: LangGraph reads checkpointer state (`agent.aget_state(config)`); OpenAI treats the session as authoritative whenever non-empty (`session.get_items()`). When the SDK already holds the prior turns, only the latest user message is forwarded — no synthetic tool events, no SDK-specific wrappers, no bridge involvement. Lives in `agent_server/utils.py::deduplicate_input` per template.
- Lifecycle logging: `start_server.py` now attaches a `StreamHandler` to the `databricks_ai_bridge` logger so `[durable] resume…`, `[durable] stale-scan loop start…`, and friends reach app stdout.
What is NOT in this PR

- No Express `/invocations` proxy in `server/src/index.ts` to do background-mode injection + alias capture + auto-resume. All three responsibilities collapsed into `databricksFetch`; the express layer is back to its `main`-branch shape.
- No `AsyncDatabricksSession.aget_tuple` overrides, no `DatabricksSaver.aget_tuple` overrides. The bridge contract is "agent owns its session/checkpointer state"; templates abide by it.
- No dependency-floor bump: `LongRunningAgentServer` already exists in `databricks-ai-bridge` 0.19.0 (older flavor), so without the new prose-recovery release the templates degrade gracefully — the chatbot's alias capture stays dormant because no `response.resumed` sentinels arrive. We'll bump the dependency floor when 0.20.0 (containing #425) ships.
- No `databricks.yml` customizations. Bundle names and app names stay at the template defaults; per-deploy customizations live outside the PR.

Files
- `e2e-chatbot-app-next/packages/ai-sdk-providers/src/providers-server.ts`
- `agent-langgraph-advanced/agent_server/utils.py::deduplicate_input`
- `stream_handler` in `agent-langgraph-advanced/agent_server/agent.py`
- `agent-openai-advanced/agent_server/utils.py::deduplicate_input`
- `agent-{langgraph,openai}-advanced/agent_server/start_server.py`

Settings
- `MAX_RESUME_ATTEMPTS` (chatbot)

Testing the durable path before 0.20.0 ships
Two pieces of pre-release setup are needed if you want to exercise the new prose-recovery flow end-to-end (after 0.20.0 they go away):
**Pin to bridge main.** Add a temporary `[tool.uv.sources]` block to either advanced template's `pyproject.toml`. Without this, `databricks-ai-bridge` resolves to the 0.19.0 PyPI release (which still has `LongRunningAgentServer` but lacks #425's prose-recovery, heartbeat, and `attempt_number` rotation).

**Enable the debug kill endpoint.** Set `LONG_RUNNING_ENABLE_DEBUG_KILL=1` on the deployed app (env block in `databricks.yml`). This gates `POST /_debug/kill_task/{response_id}`, the test-only endpoint that simulates a pod crash mid-stream. Leave it unset in production.

Test plan
- `agent-langgraph-advanced` UI testing on `dhruv-lg-claude-durable` (Claude Sonnet 4.5) — multi-tool kill mid-`deep_research`, multi-turn ✅
- `agent-openai-advanced` UI testing on `dhruv-oai-gpt-durable` (GPT-5) — same matrix ✅
- `agent-openai-advanced` UI testing on `dhruv-oai-claude-durable` (Claude Sonnet 4.5) — same matrix ✅
- `agent-openai-agents-sdk` deployed as `dhruv-oai-base-newui` against the same chatbot branch — confirms the new `providers-server.ts` doesn't break templates that don't use durable execution ✅

Companion PR
databricks-ai-bridge#425 — `LongRunningAgentServer` durable prose-recovery + always-rotate (merged).

Known follow-ups (non-blocking)
- Attach the `databricks_ai_bridge` log handler inside `LongRunningAgentServer.__init__` itself so individual templates don't each need to wire it up. Currently lives in each `start_server.py`.
- Bump the `databricks-ai-bridge` dependency floor in both advanced templates once a release containing #425 ships on PyPI.
- Fold the durable-execution glue into the `@databricks/ai-sdk-provider` package once the durable contract stabilizes.

Text.Only.-.Prose.mov
Tool.Calling.Multiturn.-.Prose.mov