[AMD][ROCM] Fix benchmark_serving Rust Tokenizer Crash via Direct transformers AutoTokenizer #1253
ChangLiu0709 wants to merge 1 commit into main
Conversation
## Problem
`benchmark_serving.py` imported `get_tokenizer` from vllm:

```python
try:
    from vllm.transformers_utils.tokenizer import get_tokenizer
except ImportError:
    from backend_request_func import get_tokenizer
```
vllm's `get_cached_tokenizer()` accesses `tokenizer.all_special_tokens_extended`,
which does not exist on Rust-backed `TokenizersBackend` (e.g. GLM-5 / ZhipuAI
models). There is no Python slow tokenizer fallback for these models —
`use_fast=False` still returns `TokenizersBackend` — so `--tokenizer-mode slow` is
also insufficient.

```
AttributeError: TokenizersBackend has no attribute all_special_tokens_extended
```
The crash occurs in the benchmark client, not in the SGLang server.
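The failure mode can be reproduced in miniature without loading a real model. The class below is a hypothetical stand-in (not the real `TokenizersBackend`) that simply lacks the attribute; it shows why a bare attribute access crashes while a `getattr` probe would not:

```python
# Stand-in for a Rust-backed tokenizer that exposes all_special_tokens
# but NOT all_special_tokens_extended (illustrative class, not the real one).
class RustBackedTokenizer:
    all_special_tokens = ["<eos>"]


tok = RustBackedTokenizer()

# A bare access, as in vllm's cache wrapper, raises AttributeError:
try:
    _ = tok.all_special_tokens_extended
    crashed = False
except AttributeError:
    crashed = True

# A defensive client could probe with getattr and fall back instead:
extended = getattr(tok, "all_special_tokens_extended", tok.all_special_tokens)
```

The PR sidesteps the problem differently — by not entering the vllm code path at all — but the sketch shows why the crash is unavoidable once that path is taken.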
## Fix
Replace the vllm import with a self-contained `get_tokenizer()` backed by
`transformers.AutoTokenizer.from_pretrained()`. This avoids vllm's
`get_cached_tokenizer()` entirely while maintaining full API compatibility with
all three call sites in `benchmark_serving.py`.
## Verification
Confirmed fix resolves the crash on:
- lmsysorg/sglang-rocm:v0.5.10rc0-rocm700-mi35x-20260422 (vllm 0.9.2rc2)
- lmsysorg/sglang-rocm:v0.5.10.post1-rocm700-mi35x-20260428
with model amd/GLM-5-MXFP4 (TP=8, MI355X, EAGLE MTP speculative decoding).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
```python
def get_tokenizer(tokenizer_id, tokenizer_mode="auto", trust_remote_code=False, **kwargs):
    """Load tokenizer directly via transformers, bypassing vllm's get_cached_tokenizer.

    vllm's get_cached_tokenizer() accesses tokenizer.all_special_tokens_extended which
    does not exist on Rust-backed TokenizersBackend (e.g. GLM-5). Using transformers
    AutoTokenizer directly avoids that code path entirely.
    """
    from transformers import AutoTokenizer

    use_fast = tokenizer_mode != "slow"
    return AutoTokenizer.from_pretrained(
        tokenizer_id,
        use_fast=use_fast,
        trust_remote_code=trust_remote_code,
    )
```
🔴 The new `get_tokenizer()` only branches `use_fast` on `tokenizer_mode == 'slow'` versus everything else, but argparse still advertises `choices=['auto', 'slow', 'mistral', 'custom']` for `--tokenizer-mode`. Passing `--tokenizer-mode mistral` (or `custom`) now silently falls through to a regular fast `AutoTokenizer` with no warning — for Mistral models that means wrong tokenization and bogus `prompt_len`/throughput numbers. Fix by either restricting the argparse choices to `['auto', 'slow']` or raising `NotImplementedError` for the unsupported modes (the previous fallback in `backend_request_func.py:527-535` had a dedicated mistral branch that returned `MistralTokenizer.from_pretrained(...)`).

Separately, and more minor: the signature accepts `**kwargs` but never forwards them to `AutoTokenizer.from_pretrained` — no current caller passes extras, so this is latent today, but either drop `**kwargs` or pass them through to match the prior fallback signature.
**Extended reasoning**

**What's wrong**
The replacement `get_tokenizer()` (`utils/bench_serving/benchmark_serving.py:48-61`) advertises an API matching the prior fallback but silently drops two declared inputs:

```python
def get_tokenizer(tokenizer_id, tokenizer_mode="auto", trust_remote_code=False, **kwargs):
    from transformers import AutoTokenizer
    use_fast = tokenizer_mode != "slow"
    return AutoTokenizer.from_pretrained(
        tokenizer_id,
        use_fast=use_fast,
        trust_remote_code=trust_remote_code,
    )
```

1. `tokenizer_mode` values `mistral` and `custom` are silently downgraded to `auto`. The argparse declaration further down in the file states:
```python
parser.add_argument(
    '--tokenizer-mode',
    type=str,
    default="auto",
    choices=['auto', 'slow', 'mistral', 'custom'],
    help='The tokenizer mode.\n\n* "auto" will use the '
    'fast tokenizer if available.\n* "slow" will '
    'always use the slow tokenizer. \n* '
    '"mistral" will always use the `mistral_common` tokenizer. \n*'
    '"custom" will use --tokenizer to select the preregistered tokenizer.')
```

   The previous fallback in `backend_request_func.py:527-535` had a dedicated mistral branch that returned `MistralTokenizer.from_pretrained(...)`, and the original `vllm.transformers_utils.tokenizer.get_tokenizer` (now removed via this PR) handled `mistral` and `custom` natively. The new helper only checks `tokenizer_mode != 'slow'`, so any non-slow value collapses to a `use_fast=True` `AutoTokenizer`.
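The argparse side of this contract can be checked in isolation. A standalone sketch (a fresh parser, not the benchmark's actual parser object) shows that `choices` happily accepts `mistral` and hands it downstream unchanged:

```python
import argparse

# argparse's `choices` validation only checks membership in the list;
# it cannot ensure the downstream function distinguishes the values.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--tokenizer-mode",
    default="auto",
    choices=["auto", "slow", "mistral", "custom"],
)

# The value is accepted and passed through unchanged:
args = parser.parse_args(["--tokenizer-mode", "mistral"])
```

So the CLI contract still advertises four modes; the enforcement has to live in the function that consumes `args.tokenizer_mode`.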
2. `**kwargs` is accepted but never forwarded. The previous fallback at `backend_request_func.py:537-541` forwarded `**kwargs` to `AutoTokenizer.from_pretrained`. The new body never references `kwargs` at all, so any caller passing e.g. `revision=...` or `cache_dir=...` would have those kwargs silently swallowed.
**Step-by-step proof for the mistral case**

1. User runs `python benchmark_serving.py --model mistralai/Mistral-7B-Instruct-v0.3 --tokenizer-mode mistral --dataset-name random ...`.
2. argparse accepts `mistral` because it's in `choices`; `args.tokenizer_mode == 'mistral'`.
3. `main()` calls `get_tokenizer(tokenizer_id, tokenizer_mode='mistral', trust_remote_code=...)`.
4. Inside `get_tokenizer`, `use_fast = 'mistral' != 'slow'` evaluates to `True`.
5. `AutoTokenizer.from_pretrained(tokenizer_id, use_fast=True, trust_remote_code=...)` returns a regular HuggingFace fast tokenizer — not `MistralTokenizer` from `mistral_common`.
6. The benchmark proceeds with a tokenizer that does not match Mistral's official tokenization (e.g. tekken-based models). `prompt_len`, `actual_output_lens`, and the derived throughput numbers are all computed against the wrong tokenizer.
7. No warning is emitted — the user thinks they're benchmarking Mistral mode but is actually benchmarking the HF fast tokenizer.
**Why current code doesn't prevent this**

Argparse's `choices` validation only confirms the string is one of the four allowed values; it doesn't ensure the receiving function actually distinguishes them. There is no assertion, log line, or branch in the new `get_tokenizer` that observes `tokenizer_mode` outside of the `!= 'slow'` comparison.
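The collapse itself is a one-line truth table. A sketch of the only branch the new helper contains makes it explicit:

```python
# The new helper's only use of tokenizer_mode: everything except "slow"
# becomes a fast tokenizer, including modes it cannot actually honor.
def resolve_use_fast(tokenizer_mode: str) -> bool:
    return tokenizer_mode != "slow"


resolved = {mode: resolve_use_fast(mode) for mode in ("auto", "slow", "mistral", "custom")}
# "mistral" and "custom" map to the same fast path as "auto".
```

Three of the four advertised modes land on the identical code path, which is exactly the silent downgrade described above.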
**Impact**

- Mistral mode: Real, user-visible silent miscalibration when a user explicitly opts into Mistral tokenization. Niche for this benchmark client, but the fact that argparse advertises it as a valid choice is the user contract being broken.
- Custom mode: Same silent fallthrough; affects any custom preregistered tokenizer the user expected.
- `**kwargs` drop: Latent — no current caller (`_init_tokenizer_worker`, `main`, `sample_random_requests` via the worker init) passes extras, so there's no observable bug today. This is the kind of API-contract mismatch a refuter rightly flagged as hypothetical; it's flagged here only because it appears alongside the substantive mistral/custom regression and is trivially fixable in the same edit.
**How to fix**

Pick one of:

```python
# Option A — narrow what we advertise:
parser.add_argument('--tokenizer-mode', choices=['auto', 'slow'], ...)
```

```python
# Option B — fail loudly on the unsupported modes:
def get_tokenizer(tokenizer_id, tokenizer_mode="auto", trust_remote_code=False, **kwargs):
    if tokenizer_mode in ("mistral", "custom"):
        raise NotImplementedError(
            f"tokenizer_mode={tokenizer_mode!r} is not supported by this benchmark client"
        )
    from transformers import AutoTokenizer
    use_fast = tokenizer_mode != "slow"
    return AutoTokenizer.from_pretrained(
        tokenizer_id, use_fast=use_fast, trust_remote_code=trust_remote_code, **kwargs
    )
```

Option B also forwards `**kwargs`, addressing the secondary signature-shim issue at the same time. Option A is simpler if Mistral/custom are not on the roadmap for this script.
```python
def get_tokenizer(tokenizer_id, tokenizer_mode="auto", trust_remote_code=False, **kwargs):
    """Load tokenizer directly via transformers, bypassing vllm's get_cached_tokenizer.

    vllm's get_cached_tokenizer() accesses tokenizer.all_special_tokens_extended which
    does not exist on Rust-backed TokenizersBackend (e.g. GLM-5). Using transformers
    AutoTokenizer directly avoids that code path entirely.
    """
    from transformers import AutoTokenizer

    use_fast = tokenizer_mode != "slow"
    return AutoTokenizer.from_pretrained(
        tokenizer_id,
        use_fast=use_fast,
        trust_remote_code=trust_remote_code,
    )
```
🔴 The new self-contained `get_tokenizer` calls `AutoTokenizer.from_pretrained` directly but never applies `_fix_tokenizer_for_sglang`, which the prior `backend_request_func.get_tokenizer` fallback applied at `backend_request_func.py:542`. On transformers v5, `LlamaTokenizerFast.__init__` rebuilds the `pre_tokenizer`/`decoder` from scratch and breaks DeepSeek-R1-class tokenizers — without the fix, a 7000-token prompt is encoded as ~35000 tokens client-side, silently inflating reported TTFT by ~5x. Fix by importing `_fix_tokenizer_for_sglang` from `backend_request_func` and calling it on the loaded tokenizer before returning.
**Extended reasoning**
**What the bug is**

The PR replaces the prior tokenizer import:

```python
try:
    from vllm.transformers_utils.tokenizer import get_tokenizer
except ImportError:
    from backend_request_func import get_tokenizer
```

with a self-contained helper at `benchmark_serving.py:48-61` that calls `AutoTokenizer.from_pretrained` raw. This correctly fixes the GLM-5 `TokenizersBackend` crash. However, the previous `backend_request_func.get_tokenizer` fallback path (`backend_request_func.py:537-542`) did not just call `AutoTokenizer.from_pretrained` — it then ran the result through `_fix_tokenizer_for_sglang(tokenizer, pretrained_model_name_or_path)` before returning. The new helper drops that post-processing step, and since the helper is now the only path, no users get the fix.
**Why `_fix_tokenizer_for_sglang` matters**

The fix function (`backend_request_func.py:442-509`) is documented in detail. In transformers v5, `LlamaTokenizerFast.__init__` rebuilds the `pre_tokenizer` and `decoder` from class-specific components rather than honoring the originals from `tokenizer.json`. For models like DeepSeek-R1 that declare `LlamaTokenizerFast` but actually need the original `Sequence`/`ByteLevel` components, v5 silently replaces the `Sequence` pre_tokenizer with `Metaspace` and the `ByteLevel` decoder with a different `Sequence`. The fix re-reads `tokenizer.json` directly and restores the originals; it also re-applies `add_bos_token`/`add_eos_token` from `tokenizer_config.json` when the class is in a known list (Llama/Gemma/Cohere variants).

The sglang server applies an equivalent fix in `hf_transformers_utils.py`. So if the benchmark client doesn't apply it, the client and server tokenize identical text differently — per the docstring, a 7000-token prompt becomes ~35000 tokens server-side, inflating TTFT ~5x.
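The shape of that restoration step can be sketched at the JSON layer alone. This is a simplified illustration with a synthetic `tokenizer.json`; the real `_fix_tokenizer_for_sglang` additionally rebuilds the live `tokenizers` backend objects from these specs:

```python
import json
import os
import tempfile


def original_components(tokenizer_json_path):
    """Re-read tokenizer.json and return the pre_tokenizer / decoder specs
    that a class __init__ rebuilding components from scratch would discard."""
    with open(tokenizer_json_path) as f:
        spec = json.load(f)
    return spec.get("pre_tokenizer"), spec.get("decoder")


# Synthetic tokenizer.json mimicking the DeepSeek-R1-style layout described
# above: a Sequence pre_tokenizer and a ByteLevel decoder.
sample = {
    "pre_tokenizer": {"type": "Sequence", "pretokenizers": [{"type": "ByteLevel"}]},
    "decoder": {"type": "ByteLevel"},
}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(sample, f)
    path = f.name

pre_tok, dec = original_components(path)
os.unlink(path)
```

The point is that the ground truth never leaves disk — a fix-up pass can always recover the declared components and reassign them onto the loaded tokenizer's backend.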
**Step-by-step proof**

1. User runs `benchmark_serving.py --model deepseek-ai/DeepSeek-R1 ...` in an environment with `transformers>=5.0` (and either no vllm, or any current setup, since the vllm import has been removed).
2. `main` calls the new `get_tokenizer` (`benchmark_serving.py:48-61`), which calls `AutoTokenizer.from_pretrained` and returns the result directly.
3. Because DeepSeek-R1's `tokenizer_config.json` lists `LlamaTokenizerFast` as the class, transformers v5's `LlamaTokenizerFast.__init__` overwrites the `Sequence` pre_tokenizer with `Metaspace` and the `ByteLevel` decoder with `Sequence`. The returned tokenizer is corrupted relative to the JSON.
4. Pre-PR, the fallback path at `backend_request_func.py:542` would have run `_fix_tokenizer_for_sglang`, which re-reads `tokenizer.json` and replaces `backend.pre_tokenizer`/`backend.decoder` with the originals. Post-PR, that step never runs.
5. `sample_random_requests` uses this tokenizer to encode prompts client-side; the server's sglang process applies its own equivalent fix and tokenizes the same prompt very differently. Per the fix's docstring, a 7000-token client prompt expands to ~35000 tokens on the server.
6. Server-side TTFT (which scales with actual prompt length) is ~5x larger than the prompt length the client thinks it sent. The benchmark prints the inflated TTFT as a real measurement.
**Why existing code doesn't prevent this**

The new helper in `benchmark_serving.py` has no awareness of `_fix_tokenizer_for_sglang` — it just wraps `AutoTokenizer.from_pretrained`. The fix function still exists in the sibling module `backend_request_func.py` (which is already imported at the top of `benchmark_serving.py` for `ASYNC_REQUEST_FUNCS`) but is never called from the new path. There is no automatic detection or fallback.
**Impact**

Silent, ~5x TTFT inflation on a primary benchmark target (DeepSeek-R1 and other Llama/Gemma/Cohere-class models on transformers v5). The benchmark continues to run, prints plausible-looking numbers, and reports a false performance regression. Users tracking SGLang vs. baseline performance on these models would see misleading results without any error message.

Note: pre-PR users who had vllm installed went through vllm's own `get_tokenizer` (which also lacked this SGLang-specific fix), so they didn't have the fix either. The clear regression is for the no-vllm path, which previously got the fix via the `backend_request_func` fallback. With the new code that path is gone, so no one gets it.
**How to fix**

Import `_fix_tokenizer_for_sglang` from `backend_request_func` (already imported in this file) and apply it before returning:

```python
def get_tokenizer(tokenizer_id, tokenizer_mode="auto", trust_remote_code=False, **kwargs):
    from transformers import AutoTokenizer
    from backend_request_func import _fix_tokenizer_for_sglang

    use_fast = tokenizer_mode != "slow"
    tokenizer = AutoTokenizer.from_pretrained(
        tokenizer_id,
        use_fast=use_fast,
        trust_remote_code=trust_remote_code,
    )
    return _fix_tokenizer_for_sglang(tokenizer, tokenizer_id)
```

This preserves the GLM-5 fix (still bypasses vllm's `get_cached_tokenizer`) while restoring the transformers v5 compatibility fix that the prior fallback provided.
## Summary

- Replaces the vllm `get_tokenizer` import in `benchmark_serving.py` with a self-contained `get_tokenizer()` backed by `transformers.AutoTokenizer.from_pretrained()`.
- vllm's `get_cached_tokenizer()` does a bare `tokenizer.all_special_tokens_extended` access. This attribute does not exist on Rust-backed `TokenizersBackend` (HuggingFace `tokenizers` library). Models with no Python slow tokenizer fallback (e.g. `amd/GLM-5-MXFP4`) crash with `AttributeError: TokenizersBackend has no attribute all_special_tokens_extended`.
- `--tokenizer-mode slow` is also insufficient — `use_fast=False` still returns `TokenizersBackend` for these models.
- The crash occurs in the benchmark client (`benchmark_serving.py`), not the SGLang server.
- `AutoTokenizer.from_pretrained()` always returns a proper Python-wrapped tokenizer, bypassing the vllm code path entirely. Fully API-compatible with all three call sites in `benchmark_serving.py` (`_init_tokenizer_worker`, `main`, `sample_random_requests`).
- `get_tokenizer` is no longer imported from vllm even when vllm is installed.

## Test plan

- `lmsysorg/sglang-rocm:v0.5.10rc0-rocm700-mi35x-20260422` (vllm 0.9.2rc2) with `amd/GLM-5-MXFP4`, TP=8, MI355X — benchmark completes successfully (exit code 0).
- `lmsysorg/sglang-rocm:v0.5.10.post1-rocm700-mi35x-20260428` with same model/config — benchmark completes successfully (exit code 0).
- All `get_tokenizer` call sites in `benchmark_serving.py` exercised without error.

🤖 Generated with Claude Code