[serve] cb error #45691

Merged: SunMarc merged 3 commits into main from handle-cb-error on Apr 29, 2026

SunMarc commented Apr 28, 2026

What does this PR do?

This PR improves how the server handles continuous batching (CB) when the CB worker thread dies from an unexpected error. The user is now told clearly that the server needs to be restarted.
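
For readers unfamiliar with the pattern, here is a minimal sketch of one way a server can notice that its worker thread has died and tell callers to restart. It is not the code from this PR: `BackgroundWorker`, `WorkerDiedError`, `step_fn` and `submit` are hypothetical names chosen for illustration, and the real CB manager may work differently.

```python
import queue
import threading
from concurrent.futures import Future


class WorkerDiedError(RuntimeError):
    """Raised when the background worker has died and cannot serve requests."""


class BackgroundWorker:
    """Hypothetical stand-in for a manager that runs generation in a worker thread."""

    def __init__(self, step_fn):
        self._step_fn = step_fn          # the work done per request (may raise)
        self._requests = queue.Queue()   # holds (payload, Future) pairs
        self._lock = threading.Lock()    # guards _fatal_error and enqueueing
        self._fatal_error = None
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _dead_error(self):
        # Build a clear, user-facing error that keeps the original cause attached.
        err = WorkerDiedError(
            f"Worker thread died ({self._fatal_error!r}). Please restart the server."
        )
        err.__cause__ = self._fatal_error
        return err

    def _run(self):
        current = None
        try:
            while True:
                payload, current = self._requests.get()
                current.set_result(self._step_fn(payload))
                current = None
        except Exception as exc:
            # Record the failure, then fail the in-flight and queued requests
            # so that no caller is left waiting on a dead thread.
            with self._lock:
                self._fatal_error = exc
                pending = [current] if current is not None else []
                while not self._requests.empty():
                    pending.append(self._requests.get_nowait()[1])
                for future in pending:
                    future.set_exception(self._dead_error())

    def submit(self, payload):
        # Refuse new work once the worker is known to be dead.
        with self._lock:
            if self._fatal_error is not None:
                raise self._dead_error()
            future = Future()
            self._requests.put((payload, future))
            return future


if __name__ == "__main__":
    def flaky_step(x):
        if x < 0:
            raise ValueError("simulated device failure")
        return x * 2

    worker = BackgroundWorker(flaky_step)
    print(worker.submit(21).result(timeout=5))
    try:
        worker.submit(-1).result(timeout=5)
    except WorkerDiedError as err:
        print(err)
```

The lock makes the check in `submit` and the drain in `_run` mutually exclusive, so a request is either failed by the drain or rejected up front. A server built this way could map `WorkerDiedError` to an explicit error response instead of an unrelated downstream exception.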

cc @qgallouedec

SunMarc commented Apr 28, 2026

@bot /style

qgallouedec left a comment

Thanks. For context, the motivation for this PR is that I occasionally get this kind of error, and this change should help with debugging.

[...]
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52036 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     ::1:52052 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52036 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52052 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52052 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52052 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52052 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52036 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52052 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52036 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52052 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52052 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52036 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52052 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
INFO:     ::1:52052 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
2026-04-28 02:46:28,290 - ContinuousBatchingLogger - ERROR - Error in generation loop: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/site-packages/transformers/generation/continuous_batching/continuous_api.py", line 1101, in _run_generation_loop
    self._inner_generation_loop(batch_processor)
  File "/opt/conda/lib/python3.11/site-packages/transformers/generation/continuous_batching/continuous_api.py", line 1124, in _inner_generation_loop
    self._generation_step()
  File "/opt/conda/lib/python3.11/site-packages/transformers/generation/continuous_batching/continuous_api.py", line 1040, in _generation_step
    self.batch_processor._generation_step(self.model)
  File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/generation/continuous_batching/continuous_api.py", line 538, in _generation_step
    self.capture_graph(forward_fn, compute_stream, *args)
  File "/opt/conda/lib/python3.11/site-packages/transformers/generation/continuous_batching/continuous_api.py", line 546, in capture_graph
    forward_fn(*args)
  File "/opt/conda/lib/python3.11/site-packages/transformers/generation/continuous_batching/continuous_api.py", line 569, in _forward_process_and_sample
    logits = self._model_forward(model, batch_data).float()  # convert to fp32 to match generate
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/generation/continuous_batching/continuous_api.py", line 575, in _model_forward
    return model(**batch_data).logits
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/utils/generic.py", line 887, in wrapper
    output = func(self, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 462, in forward
    outputs: BaseModelOutputWithPast = self.model(
                                       ^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/utils/generic.py", line 963, in wrapper
    output = func(self, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/utils/output_capturing.py", line 248, in wrapper
    outputs = func(self, *args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 399, in forward
    hidden_states = decoder_layer(
                    ^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/modeling_layers.py", line 93, in __call__
    return super().__call__(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 293, in forward
    hidden_states, _ = self.self_attn(
                       ^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 222, in forward
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/kernels/layer/func.py", line 297, in forward
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 145, in apply_rotary_pos_emb
    k_embed = (k * cos) + (rotate_half(k) * sin)
              ~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

INFO:     ::1:52036 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     ::1:52056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     ::1:52052 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/site-packages/uvicorn/protocols/http/h11_impl.py", line 415, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 56, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/fastapi/applications.py", line 1159, in __call__
    await super().__call__(scope, receive, send)
  File "/opt/conda/lib/python3.11/site-packages/starlette/applications.py", line 90, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/conda/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/opt/conda/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/opt/conda/lib/python3.11/site-packages/starlette/middleware/base.py", line 191, in __call__
    with recv_stream, send_stream, collapse_excgroups():
  File "/opt/conda/lib/python3.11/contextlib.py", line 158, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/opt/conda/lib/python3.11/site-packages/starlette/_utils.py", line 87, in collapse_excgroups
    raise exc
  File "/opt/conda/lib/python3.11/site-packages/starlette/middleware/base.py", line 193, in __call__
    response = await self.dispatch_func(request, call_next)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/cli/serving/server.py", line 85, in request_id_middleware
    response = await call_next(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/starlette/middleware/base.py", line 168, in call_next
    raise app_exc from app_exc.__cause__ or app_exc.__context__
  File "/opt/conda/lib/python3.11/site-packages/starlette/middleware/base.py", line 144, in coro
    await self.app(scope, receive_or_disconnect, send_no_error)
  File "/opt/conda/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 63, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/opt/conda/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/opt/conda/lib/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/conda/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/opt/conda/lib/python3.11/site-packages/starlette/routing.py", line 660, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/conda/lib/python3.11/site-packages/starlette/routing.py", line 680, in app
    await route.handle(scope, receive, send)
  File "/opt/conda/lib/python3.11/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/opt/conda/lib/python3.11/site-packages/fastapi/routing.py", line 134, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/opt/conda/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/opt/conda/lib/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/conda/lib/python3.11/site-packages/fastapi/routing.py", line 120, in app
    response = await f(request)
               ^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/fastapi/routing.py", line 674, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/fastapi/routing.py", line 328, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/cli/serving/server.py", line 93, in chat_completions
    return await chat_handler.handle_request(body, request.state.request_id)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/cli/serving/chat_completion.py", line 158, in handle_request
    return await self._non_streaming(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/cli/serving/chat_completion.py", line 299, in _non_streaming
    parsed = parse_tool_calls(processor, generated_ids, tool_config["schema"])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/cli/serving/utils.py", line 150, in parse_tool_calls
    parsed = processor.parse_response(generated_ids, schema)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3328, in parse_response
    (isinstance(response, list) and not isinstance(response[0], int))
                                                   ~~~~~~~~^^^
IndexError: list index out of range
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
[transformers] [Request received] Model: Qwen/Qwen2.5-7B-Instruct@main, CB: True
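
The second traceback ends in `IndexError: list index out of range` on the expression `isinstance(response, list) and not isinstance(response[0], int)`, which is what Python raises when `response` is an empty list. That would be consistent with the request receiving no generated tokens after the generation loop failed, though the log alone does not confirm it. A minimal sketch of the failure and of an explicit guard follows; `looks_like_batch`, `checked_looks_like_batch` and `EmptyGenerationError` are hypothetical names, not part of transformers.

```python
class EmptyGenerationError(RuntimeError):
    """Hypothetical error for a request that produced no tokens."""


def looks_like_batch(response):
    # Same expression as in the traceback: indexing [0] fails on an empty list.
    return isinstance(response, list) and not isinstance(response[0], int)


def checked_looks_like_batch(response):
    # Check for an empty list first so the caller gets a descriptive error
    # instead of an IndexError from deep inside the parsing code.
    if isinstance(response, list) and len(response) == 0:
        raise EmptyGenerationError(
            "No tokens were generated; the generation worker may have failed."
        )
    return looks_like_batch(response)


if __name__ == "__main__":
    try:
        looks_like_batch([])
    except IndexError as err:
        print(f"unguarded: IndexError: {err}")

    try:
        checked_looks_like_batch([])
    except EmptyGenerationError as err:
        print(f"guarded: {err}")
```

With a guard like this, the 500 response would point at the generation failure rather than at tool-call parsing.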

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

SunMarc commented Apr 29, 2026

@bot /style

github-actions Bot commented Apr 29, 2026

Style fix ran successfully; no files were modified.

@SunMarc SunMarc enabled auto-merge April 29, 2026 14:07
@SunMarc SunMarc added this pull request to the merge queue Apr 29, 2026
Merged via the queue into main with commit c641d13 Apr 29, 2026
18 checks passed
@SunMarc SunMarc deleted the handle-cb-error branch April 29, 2026 14:17