[serve] cb error by SunMarc · Pull Request #45691 · huggingface/transformers

SunMarc · 2026-04-28T17:05:52Z

What does this PR do?

This PR improves continuous batching (CB) handling when the CB worker thread dies from an unexpected error. The server now tells the user clearly that it needs to be restarted.
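
As a rough illustration of the idea (not the code in this PR), the sketch below shows one way a background worker's death can be surfaced as an explicit "restart the server" error instead of an empty result. `ToyBatchingManager`, `WorkerDiedError`, and `generate` are hypothetical names made up for this example; they are not part of the transformers API.

```python
import queue
import threading


class WorkerDiedError(RuntimeError):
    """Raised when the background generation thread is no longer usable."""


class ToyBatchingManager:
    """Hypothetical stand-in for a batching manager (single caller at a time)."""

    def __init__(self, step_fn):
        self._step_fn = step_fn
        self._requests = queue.Queue()
        self._results = queue.Queue()
        self._fatal_error = None
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        try:
            while True:
                item = self._requests.get()
                if item is None:  # sentinel: clean shutdown
                    return
                self._results.put(self._step_fn(item))
        except Exception as exc:
            # Remember the unexpected error; the thread exits after this.
            self._fatal_error = exc

    def _raise_if_dead(self):
        if self._fatal_error is not None or not self._thread.is_alive():
            raise WorkerDiedError(
                f"Generation worker died ({self._fatal_error!r}); "
                "please restart the server."
            )

    def generate(self, item, poll=0.05):
        # Fail fast if the worker is already gone.
        self._raise_if_dead()
        self._requests.put(item)
        while True:
            try:
                return self._results.get(timeout=poll)
            except queue.Empty:
                # No result yet: check whether the worker died in the meantime.
                self._raise_if_dead()
```

With this shape, a request that is in flight when the worker crashes, and every request after it, gets the same actionable error rather than an unrelated failure further down the request handler.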

cc @qgallouedec

SunMarc · 2026-04-28T17:06:04Z

@bot /style

qgallouedec

Thanks. For context, the motivation for this PR is that I occasionally hit this kind of error, and this should help with debugging.

[...]
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52036 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     ::1:52052 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52036 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52052 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52052 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52052 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52052 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52036 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52052 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52036 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52052 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52052 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52036 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52052 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52052 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
2026-04-28 02:46:28,290 - ContinuousBatchingLogger - ERROR - Error in generation loop: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/site-packages/transformers/generation/continuous_batching/continuous_api.py", line 1101, in _run_generation_loop
    self._inner_generation_loop(batch_processor)
  File "/opt/conda/lib/python3.11/site-packages/transformers/generation/continuous_batching/continuous_api.py", line 1124, in _inner_generation_loop
    self._generation_step()
  File "/opt/conda/lib/python3.11/site-packages/transformers/generation/continuous_batching/continuous_api.py", line 1040, in _generation_step
    self.batch_processor._generation_step(self.model)
  File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/generation/continuous_batching/continuous_api.py", line 538, in _generation_step
    self.capture_graph(forward_fn, compute_stream, *args)
  File "/opt/conda/lib/python3.11/site-packages/transformers/generation/continuous_batching/continuous_api.py", line 546, in capture_graph
    forward_fn(*args)
  File "/opt/conda/lib/python3.11/site-packages/transformers/generation/continuous_batching/continuous_api.py", line 569, in _forward_process_and_sample
    logits = self._model_forward(model, batch_data).float()  # convert to fp32 to match generate
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/generation/continuous_batching/continuous_api.py", line 575, in _model_forward
    return model(**batch_data).logits
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/utils/generic.py", line 887, in wrapper
    output = func(self, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 462, in forward
    outputs: BaseModelOutputWithPast = self.model(
                                       ^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/utils/generic.py", line 963, in wrapper
    output = func(self, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/utils/output_capturing.py", line 248, in wrapper
    outputs = func(self, *args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 399, in forward
    hidden_states = decoder_layer(
                    ^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/modeling_layers.py", line 93, in __call__
    return super().__call__(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 293, in forward
    hidden_states, _ = self.self_attn(
                       ^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 222, in forward
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/kernels/layer/func.py", line 297, in forward
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 145, in apply_rotary_pos_emb
    k_embed = (k * cos) + (rotate_half(k) * sin)
              ~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

INFO:     ::1:52036 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     ::1:52056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     ::1:52052 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/site-packages/uvicorn/protocols/http/h11_impl.py", line 415, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 56, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/fastapi/applications.py", line 1159, in __call__
    await super().__call__(scope, receive, send)
  File "/opt/conda/lib/python3.11/site-packages/starlette/applications.py", line 90, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/conda/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/opt/conda/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/opt/conda/lib/python3.11/site-packages/starlette/middleware/base.py", line 191, in __call__
    with recv_stream, send_stream, collapse_excgroups():
  File "/opt/conda/lib/python3.11/contextlib.py", line 158, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/opt/conda/lib/python3.11/site-packages/starlette/_utils.py", line 87, in collapse_excgroups
    raise exc
  File "/opt/conda/lib/python3.11/site-packages/starlette/middleware/base.py", line 193, in __call__
    response = await self.dispatch_func(request, call_next)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/cli/serving/server.py", line 85, in request_id_middleware
    response = await call_next(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/starlette/middleware/base.py", line 168, in call_next
    raise app_exc from app_exc.__cause__ or app_exc.__context__
  File "/opt/conda/lib/python3.11/site-packages/starlette/middleware/base.py", line 144, in coro
    await self.app(scope, receive_or_disconnect, send_no_error)
  File "/opt/conda/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 63, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/opt/conda/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/opt/conda/lib/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/conda/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/opt/conda/lib/python3.11/site-packages/starlette/routing.py", line 660, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/conda/lib/python3.11/site-packages/starlette/routing.py", line 680, in app
    await route.handle(scope, receive, send)
  File "/opt/conda/lib/python3.11/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/opt/conda/lib/python3.11/site-packages/fastapi/routing.py", line 134, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/opt/conda/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/opt/conda/lib/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/conda/lib/python3.11/site-packages/fastapi/routing.py", line 120, in app
    response = await f(request)
               ^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/fastapi/routing.py", line 674, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/fastapi/routing.py", line 328, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/cli/serving/server.py", line 93, in chat_completions
    return await chat_handler.handle_request(body, request.state.request_id)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/cli/serving/chat_completion.py", line 158, in handle_request
    return await self._non_streaming(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/cli/serving/chat_completion.py", line 299, in _non_streaming
    parsed = parse_tool_calls(processor, generated_ids, tool_config["schema"])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/cli/serving/utils.py", line 150, in parse_tool_calls
    parsed = processor.parse_response(generated_ids, schema)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3328, in parse_response
    (isinstance(response, list) and not isinstance(response[0], int))
                                                   ~~~~~~~~^^^
IndexError: list index out of range
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
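
The second traceback above ends in an `IndexError` on `response[0]`. That is consistent with the request handler receiving an empty list of generated ids after the generation loop had already failed, although the log does not state this directly. The snippet below is a minimal reproduction of that indexing pattern only; `looks_like_batch` and `looks_like_batch_guarded` are hypothetical helpers written for this example, not functions from transformers.

```python
def looks_like_batch(response):
    # Same shape as the expression in the traceback: indexes [0]
    # without first checking that the list is non-empty.
    return isinstance(response, list) and not isinstance(response[0], int)


def looks_like_batch_guarded(response):
    # Hypothetical guard: an empty list is treated as "not a batch"
    # instead of raising IndexError.
    return (
        isinstance(response, list)
        and len(response) > 0
        and not isinstance(response[0], int)
    )


try:
    looks_like_batch([])
except IndexError as exc:
    print(f"unguarded check failed: {exc}")

print(looks_like_batch_guarded([]))
```

A guard like this would only hide the symptom. The underlying problem in the log is the dead generation loop, which is why an explicit error asking for a restart is the more useful signal.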

HuggingFaceDocBuilderDev · 2026-04-29T13:49:48Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

SunMarc · 2026-04-29T13:52:57Z

@bot /style

github-actions · 2026-04-29T13:54:01Z

Style fix runs successfully without any file modified.

cb error

c9cc099

qgallouedec approved these changes Apr 28, 2026

evalstate mentioned this pull request Apr 28, 2026

Cumulative defect fixes from recent Transformers PRs evalstate/transformers#41

Open

simpler tests

486b9e8

fix

6147b04

SunMarc enabled auto-merge April 29, 2026 14:07

SunMarc added this pull request to the merge queue Apr 29, 2026

Merged via the queue into main with commit c641d13 Apr 29, 2026
18 checks passed

SunMarc deleted the handle-cb-error branch April 29, 2026 14:17

evalstate mentioned this pull request Apr 29, 2026

Cumulative feature and defect updates from recent Transformers PRs evalstate/transformers#42

Open

SunMarc merged 3 commits into main from handle-cb-error
