Add structured-output migration repro test by fergusfinn · Pull Request #1 · doublewordai/dynamo

fergusfinn · 2026-03-25T15:21:37Z

Summary

add a focused fault-tolerance repro for structured-output request migration with real local dynamo.frontend and real dynamo.vllm workers
extend the migration test helper to accept a payload override so the repro can inject an OpenAI response_format JSON schema request
kill the serving worker mid-stream and assert the resumed response still parses as JSON

What this reproduces

On the current v1.0.1-based environment, request migration is happening, but the resumed structured-output stream becomes invalid JSON.

Observed in the targeted test run on gotenks:

frontend logs Stream disconnected... recreating stream...
migration metrics report ongoing_request: 1, new_request: 0
the resumed response contains nested restarted JSON content, for example:

{
  "animals": [
    {
      "name": "Lion",
      "habitat": "{
        "animals": [
          {

json.loads(...) then fails with:

json.decoder.JSONDecodeError: Invalid control character at: line 7 column 20 (char 66)

Repro command

source /tmp/dynamo-install-mv3o75/.venv/bin/activate
cd /tmp/dynamo-install-mv3o75
python -m pytest -q tests/fault_tolerance/migration/test_vllm_structured.py::test_request_migration_vllm_aggregated_structured_output -s

Latest result

FAILED tests/fault_tolerance/migration/test_vllm_structured.py::test_request_migration_vllm_aggregated_structured_output[tcp]
1 failed in 49.31s

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>

…amo#6655) Signed-off-by: PeaBrane <yanrpei@gmail.com>

Signed-off-by: alec-flowers <aflowers@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

…otes (ai-dynamo#6648) Signed-off-by: Dan Gil <dagil@nvidia.com>

Signed-off-by: Dan Gil <dagil@nvidia.com>

…i-dynamo#6662) Signed-off-by: Dan Gil <dagil@nvidia.com> Signed-off-by: dagil-nvidia <dagil@nvidia.com> Signed-off-by: Harrison Saturley-Hall <hsaturleyhal@nvidia.com> Co-authored-by: Harrison Saturley-Hall <hsaturleyhal@nvidia.com>

…namo#6663) Signed-off-by: PeaBrane <yanrpei@gmail.com>

Signed-off-by: ashnamehrotra <ashnamehrotra@gmail.com> Signed-off-by: Hannah Zhang <hannahz@nvidia.com>

…i-dynamo#6679) Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

…es (ai-dynamo#6682) Signed-off-by: Anant Sharma <anants@nvidia.com>

ai-dynamo#6708)

…metadata to backends (ai-dynamo#6692) (ai-dynamo#6718) Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>

) Signed-off-by: Hannah Zhang <hannahz@nvidia.com>

Signed-off-by: Hannah Zhang <hannahz@nvidia.com>

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>

…6650) (ai-dynamo#6706) Signed-off-by: Anant Sharma <anants@nvidia.com>

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

Signed-off-by: Dan Gil <dagil@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: PeaBrane <yanrpei@gmail.com>

Signed-off-by: Harrison Saturley-Hall <hsaturleyhal@nvidia.com>

…amo#6753) Signed-off-by: Indrajit Bhosale <iamindrajitb@gmail.com>

…age type is pvc (ai-dynamo#6752) (ai-dynamo#6755) Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>

…iation (ai-dynamo#6651) (ai-dynamo#6776) Signed-off-by: Guan Luo <41310872+GuanLuo@users.noreply.github.com>

Signed-off-by: Qi Wang <qiwa@nvidia.com>

…rker (ai-dynamo#6765) Signed-off-by: Krishnan Prashanth <kprashanth@nvidia.com>

…s (http://nvbugs/5936491/1) (ai-dynamo#6772) Signed-off-by: Matej Kosec <mkosec@nvidia.com>

…amo#7283 (ai-dynamo#7284) Signed-off-by: Dan Gil <dagil@nvidia.com>

…mo#7306) Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com> Signed-off-by: Anant Sharma <anants@nvidia.com> Co-authored-by: Anant Sharma <anants@nvidia.com>

Signed-off-by: Hannah Zhang <hannahz@nvidia.com> Signed-off-by: Dan Gil <dagil@nvidia.com> Co-authored-by: hhzhang16 <54051230+hhzhang16@users.noreply.github.com>

Signed-off-by: Dan Gil <dagil@nvidia.com> Signed-off-by: akshatha-k <33278067+akshatha-k@users.noreply.github.com> Co-authored-by: akshatha-k <33278067+akshatha-k@users.noreply.github.com>

…7312) (ai-dynamo#7332) Signed-off-by: Dmitry Tokarev <dtokarev@nvidia.com> Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com> Co-authored-by: Dmitry Tokarev <dtokarev@nvidia.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>

…ynamo#7336, ai-dynamo#7350, ai-dynamo#7352) (ai-dynamo#7354) Signed-off-by: Dan Gil <dagil@nvidia.com> Signed-off-by: dagil-nvidia <dagil@nvidia.com> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: Neal Vaidya <nealv@nvidia.com>

…namo#7372) (ai-dynamo#7393)

ai-dynamo#7404) Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com> Co-authored-by: Biswa Panda <biswa.panda@gmail.com>

…7410) Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com> Signed-off-by: Harrison Saturley-Hall <hsaturleyhal@nvidia.com>

Signed-off-by: Dan Gil <dagil@nvidia.com> Signed-off-by: Neal Vaidya <nealv@nvidia.com> Signed-off-by: athreesh <anish.maddipoti@utexas.edu> Signed-off-by: Anish <80174047+athreesh@users.noreply.github.com> Signed-off-by: akshatha-k <akshutk@gmail.com> Signed-off-by: Nikhar Maheshwari <nikharm@nvidia.com> Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com> Signed-off-by: Dmitry Tokarev <dtokarev@nvidia.com> Co-authored-by: Neal Vaidya <nealv@nvidia.com> Co-authored-by: Anish <80174047+athreesh@users.noreply.github.com> Co-authored-by: akshatha-k <33278067+akshatha-k@users.noreply.github.com> Co-authored-by: akshatha-k <akshutk@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: nikharm <nikharm@nvidia.com> Co-authored-by: Keiven C <213854356+keivenchang@users.noreply.github.com> Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com> Co-authored-by: Dmitry Tokarev <dtokarev@nvidia.com>

Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>

…to be use Kimi's tokenizer and fix tiktoken multi-byte handling (ai-dynamo#7424)

…mo#7433) Signed-off-by: Dan Gil <dagil@nvidia.com> Co-authored-by: Ben Hamm <ben.hamm@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>

…o#7412) (ai-dynamo#7429) Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com> Co-authored-by: Biswa Panda <biswa.panda@gmail.com>

Signed-off-by: Dan Gil <dagil@nvidia.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a10f9700e4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-25T15:27:11Z

+@pytest.mark.vllm
+@pytest.mark.gpu_1
+@pytest.mark.e2e
+@pytest.mark.post_merge


Mark the repro test non-blocking until migration fix lands

This new test is tagged post_merge, and the post-merge workflow selects all vllm and gpu_1 tests with (pre_merge or post_merge) markers, so it will run in nightly CI; because the test asserts json.loads(response_text) succeeds after forced migration (the known repro path), it will keep the post-merge pipeline red on environments where the migration bug is still present. Please gate it with xfail/skip (or remove the post_merge marker) until the product fix is merged.

Useful? React with 👍 / 👎.

fergusfinn · 2026-03-25T15:32:44Z

Reran the repro on the rebased main branch in a fresh Python 3.11 environment (/tmp/dynamo-install-mv3o75/.venv-main) with current vllm deps and rebuilt local bindings.

Command:

source /tmp/dynamo-install-mv3o75/.venv-main/bin/activate
cd /tmp/dynamo-install-mv3o75
python -m pytest -q tests/fault_tolerance/migration/test_vllm_structured.py::test_request_migration_vllm_aggregated_structured_output -s

Latest result on main:

FAILED tests/fault_tolerance/migration/test_vllm_structured.py::test_request_migration_vllm_aggregated_structured_output[tcp]
1 failed in 60.17s

Observed behavior differs from the earlier v1.0.1 snapshot run:

worker 2 is killed as intended
frontend logs Stream disconnected... recreating stream...
worker 1 then hits a vLLM engine crash / segfault during the resumed request
frontend logs Cannot recreate stream: no instances found for endpoint dynamo/backend/generate
the test fails in validate_response(...) because the client sees an event: error SSE and a 12.293s delay instead of a completed migrated response

So on rebased main, this repro is still failing, but the failure mode is now migration collapse / backend crash rather than malformed resumed JSON.

github-actions · 2026-05-10T10:45:32Z

This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions · 2026-05-15T11:33:53Z

This PR has been closed due to inactivity. If you believe this PR is still relevant, please feel free to reopen it with additional context or information.

jasonqinzhou and others added 30 commits February 26, 2026 18:13

chore: update aiconfigurator commit (ai-dynamo#6634)

a04a2c3

fix: fix gpu discovery preflight job (ai-dynamo#6628) (ai-dynamo#6640)

b697689

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>

feat: add profiler job overrides (ai-dynamo#6607) (ai-dynamo#6641)

6406d97

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>

docs(cherry-pick): fix typos and wrong imports in router docs (ai-dyn…

ec7a9e2

…amo#6655) Signed-off-by: PeaBrane <yanrpei@gmail.com>

chore(deps): bump vLLM to 0.16.0 (release/1.0.0) (ai-dynamo#6652)

ce8e6f3

Signed-off-by: alec-flowers <aflowers@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

docs: update release artifacts to v0.9.0 and add v0.9.0.post1 patch n…

207e4c9

…otes (ai-dynamo#6648) Signed-off-by: Dan Gil <dagil@nvidia.com>

docs: fix rendered.Dockerfile path in container README (ai-dynamo#6646)

52a2e28

Signed-off-by: Dan Gil <dagil@nvidia.com>

docs(cherry-pick): add Ryan Olson to flash indexer author list (ai-dy…

b48bd74

…namo#6663) Signed-off-by: PeaBrane <yanrpei@gmail.com>

docs(cherry pick): v1beta1 dgdr docs (ai-dynamo#6647) (ai-dynamo#6713)

db7579d

Signed-off-by: ashnamehrotra <ashnamehrotra@gmail.com> Signed-off-by: Hannah Zhang <hannahz@nvidia.com>

fix: proper DGD prefix for naive fallback in DGDR (ai-dynamo#6667) (a…

eef02ee

…i-dynamo#6679) Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

ci: trigger container validation dynamo on planner and frontend chang…

dc2fc96

…es (ai-dynamo#6682) Signed-off-by: Anant Sharma <anants@nvidia.com>

fix: Fix chat processor for vllm video/audio examples (ai-dynamo#6689) (

adbef62

ai-dynamo#6708)

fix: propagate vllm-distributed-executor-backend annotation from DGD …

cbe9ea2

…metadata to backends (ai-dynamo#6692) (ai-dynamo#6718) Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>

fix: AutoApply bool --> AutoApply *bool (ai-dynamo#6683) (ai-dynamo#6712

c8b8896

) Signed-off-by: Hannah Zhang <hannahz@nvidia.com>

fix: store nodesWithGPUs (ai-dynamo#6690) (ai-dynamo#6714)

abb6961

Signed-off-by: Hannah Zhang <hannahz@nvidia.com>

fix: shell-quote Ray leader args (ai-dynamo#6693) (ai-dynamo#6711)

a0bd5d2

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>

chore: upgrade nixl to 0.10.0 (ai-dynamo#6685) (ai-dynamo#6701)

90bdacf

ci: enable EFA builds in post-merge and release pipelines (ai-dynamo#…

83b6222

…6650) (ai-dynamo#6706) Signed-off-by: Anant Sharma <anants@nvidia.com>

fix: broken link in profiler guide (ai-dynamo#6709) (ai-dynamo#6710)

4f6c053

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

docs: add Dynamo Docs Guide and Claude Code skills (ai-dynamo#6715)

acdf3fd

Signed-off-by: Dan Gil <dagil@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

feat: default router_event_threads to 4 (cherry-pick) (ai-dynamo#6724)

9ce6506

Signed-off-by: PeaBrane <yanrpei@gmail.com>

chore: update my-registry and my-tag references (ai-dynamo#6729)

3c3d00c

Signed-off-by: Harrison Saturley-Hall <hsaturleyhal@nvidia.com>

chore: update tools to latest prerelease for dynamo (ai-dynamo#6728)

c3e64ae

Signed-off-by: Harrison Saturley-Hall <hsaturleyhal@nvidia.com>

fix: Phase out llava and make EPD single GPU (ai-dynamo#6674) (ai-dyn…

7ad6d6e

…amo#6753) Signed-off-by: Indrajit Bhosale <iamindrajitb@gmail.com>

fix: always emit pvc block in operator configmap when checkpoint stor…

980f2e8

…age type is pvc (ai-dynamo#6752) (ai-dynamo#6755) Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>

fix(perf): add embedding transfer implementation with NIXL WRITE init…

32abd33

…iation (ai-dynamo#6651) (ai-dynamo#6776) Signed-off-by: Guan Luo <41310872+GuanLuo@users.noreply.github.com>

fix: remove costly logs in EPD (ai-dynamo#6742)

2162d1b

Signed-off-by: Qi Wang <qiwa@nvidia.com>

fix: handle missing out_hidden_size for LLaVA models in EPD encode wo…

4aa6b8d

…rker (ai-dynamo#6765) Signed-off-by: Krishnan Prashanth <kprashanth@nvidia.com>

fix: [cherry-pick] fix TRT-LLM worker SSH crash in non-root container…

38db289

…s (http://nvbugs/5936491/1) (ai-dynamo#6772) Signed-off-by: Matej Kosec <mkosec@nvidia.com>

dagil-nvidia and others added 18 commits March 12, 2026 17:53

docs: add FastVideo example and guide with light sidebar reorg ai-dyn…

2e3605e

…amo#7283 (ai-dynamo#7284) Signed-off-by: Dan Gil <dagil@nvidia.com>

ci: add a preliminary compliance scan to ci (ai-dynamo#7289) (ai-dyna…

a3ca551

…mo#7306) Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com> Signed-off-by: Anant Sharma <anants@nvidia.com> Co-authored-by: Anant Sharma <anants@nvidia.com>

docs: add docs for DGDR usage -- golden path (ai-dynamo#7304)

fad6f6c

Signed-off-by: Hannah Zhang <hannahz@nvidia.com> Signed-off-by: Dan Gil <dagil@nvidia.com> Co-authored-by: hhzhang16 <54051230+hhzhang16@users.noreply.github.com>

docs: add introduction page to Getting Started (ai-dynamo#7314)

a8eb56d

Signed-off-by: Dan Gil <dagil@nvidia.com> Signed-off-by: akshatha-k <33278067+akshatha-k@users.noreply.github.com> Co-authored-by: akshatha-k <33278067+akshatha-k@users.noreply.github.com>

chore: update attributions for 1.0.0 (ai-dynamo#7331)

c886c58

Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>

fix: remove cutlass-dsl stub that crashes TRT-LLM on CUDA 13.1 (ai-dy…

c3989ac

…namo#7372) (ai-dynamo#7393)

fix: populate logprobs bytes and token fields in OpenAI-compatible re… (

4535afd

ai-dynamo#7404) Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com> Co-authored-by: Biswa Panda <biswa.panda@gmail.com>

chore: update the version numbers of artifacts for v1.0.1 (ai-dynamo#…

fbbcb6d

…7410) Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com> Signed-off-by: Harrison Saturley-Hall <hsaturleyhal@nvidia.com>

fix: manual change reflecting the desire on pr7411 (ai-dynamo#7413)

1aad319

Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>

fix: (ai-dynamo#6653, ai-dynamo#6996) allow deepseek v3 architecture …

16ce48a

…to be use Kimi's tokenizer and fix tiktoken multi-byte handling (ai-dynamo#7424)

docs: cherry-pick README revision and banner assets to 1.0.1 (ai-dyna…

c45e9b5

…mo#7433) Signed-off-by: Dan Gil <dagil@nvidia.com> Co-authored-by: Ben Hamm <ben.hamm@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chore: update attributions with epp go project included (ai-dynamo#7438)

b4b18d5

Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>

docs: add experimental recipes details for Kimi-K2.5 recipe (ai-dynam…

8816784

…o#7412) (ai-dynamo#7429) Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com> Co-authored-by: Biswa Panda <biswa.panda@gmail.com>

docs: update architecture overview diagram (ai-dynamo#7442)

5534a9d

Signed-off-by: Dan Gil <dagil@nvidia.com>

Add structured-output migration repro test

a10f970

fergusfinn force-pushed the repro/structured-output-migration branch from a10f970 to 43f38e4 Compare March 25, 2026 15:25

chatgpt-codex-connector Bot reviewed Mar 25, 2026

View reviewed changes

fergusfinn force-pushed the main branch from 7edb07b to 5534a9d Compare March 25, 2026 15:39

fergusfinn force-pushed the repro/structured-output-migration branch from 43f38e4 to a10f970 Compare March 25, 2026 15:39

fergusfinn mentioned this pull request Mar 25, 2026

[BUG]: Request migration corrupts structured-output responses after worker crash ai-dynamo/dynamo#7634

Closed

fergusfinn force-pushed the main branch from 5534a9d to d232b45 Compare April 9, 2026 14:45

github-actions Bot added the Stale label May 10, 2026

github-actions Bot closed this May 15, 2026

github-actions Bot deleted the repro/structured-output-migration branch May 15, 2026 11:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add structured-output migration repro test#1

Add structured-output migration repro test#1
fergusfinn wants to merge 171 commits into
mainfrom
repro/structured-output-migration

fergusfinn commented Mar 25, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Mar 25, 2026

Uh oh!

fergusfinn commented Mar 25, 2026

Uh oh!

github-actions Bot commented May 10, 2026

Uh oh!

github-actions Bot commented May 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

Conversation

fergusfinn commented Mar 25, 2026

Summary

What this reproduces

Repro command

Latest result

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

fergusfinn commented Mar 25, 2026

Uh oh!

github-actions Bot commented May 10, 2026

Uh oh!

github-actions Bot commented May 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants