[AMD][ROCM] Fix benchmark_serving Rust Tokenizer Crash via Direct transformers AutoTokenizer #1253
ChangLiu0709 wants to merge 1 commit into main
Conversation
## Problem
`benchmark_serving.py` imported `get_tokenizer` from vllm:

```python
try:
    from vllm.transformers_utils.tokenizer import get_tokenizer
except ImportError:
    from backend_request_func import get_tokenizer
```
vllm's `get_cached_tokenizer()` accesses `tokenizer.all_special_tokens_extended`,
which does not exist on Rust-backed `TokenizersBackend` (e.g. GLM-5 / ZhipuAI
models). There is no Python slow tokenizer fallback for these models —
`use_fast=False` still returns `TokenizersBackend` — so `--tokenizer-mode slow` is
also insufficient.

```
AttributeError: TokenizersBackend has no attribute all_special_tokens_extended
```
The crash occurs in the benchmark client, not in the SGLang server.
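The failure mode can be reproduced in miniature without loading a real model. The class below is a hypothetical stand-in (not the real `TokenizersBackend`) that simply lacks the attribute; it shows why a bare attribute access crashes while a `getattr` probe would not:

```python
# Stand-in for a Rust-backed tokenizer that exposes all_special_tokens
# but NOT all_special_tokens_extended (illustrative class, not the real one).
class RustBackedTokenizer:
    all_special_tokens = ["<eos>"]


tok = RustBackedTokenizer()

# A bare access, as in vllm's cache wrapper, raises AttributeError:
try:
    _ = tok.all_special_tokens_extended
    crashed = False
except AttributeError:
    crashed = True

# A defensive client could probe with getattr and fall back instead:
extended = getattr(tok, "all_special_tokens_extended", tok.all_special_tokens)
```

The PR sidesteps the problem differently — by not entering the vllm code path at all — but the sketch shows why the crash is unavoidable once that path is taken.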
## Fix
Replace the vllm import with a self-contained `get_tokenizer()` backed by
`transformers.AutoTokenizer.from_pretrained()`. This avoids vllm's
`get_cached_tokenizer()` entirely while maintaining full API compatibility with
all three call sites in `benchmark_serving.py`.
## Verification
Confirmed fix resolves the crash on:
- lmsysorg/sglang-rocm:v0.5.10rc0-rocm700-mi35x-20260422 (vllm 0.9.2rc2)
- lmsysorg/sglang-rocm:v0.5.10.post1-rocm700-mi35x-20260428
with model amd/GLM-5-MXFP4 (TP=8, MI355X, EAGLE MTP speculative decoding).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
```python
def get_tokenizer(tokenizer_id, tokenizer_mode="auto", trust_remote_code=False, **kwargs):
    """Load tokenizer directly via transformers, bypassing vllm's get_cached_tokenizer.

    vllm's get_cached_tokenizer() accesses tokenizer.all_special_tokens_extended which
    does not exist on Rust-backed TokenizersBackend (e.g. GLM-5). Using transformers
    AutoTokenizer directly avoids that code path entirely.
    """
    from transformers import AutoTokenizer

    use_fast = tokenizer_mode != "slow"
    return AutoTokenizer.from_pretrained(
        tokenizer_id,
        use_fast=use_fast,
        trust_remote_code=trust_remote_code,
    )
```
🔴 The new `get_tokenizer()` only branches `use_fast` on `tokenizer_mode == 'slow'` versus everything else, but argparse still advertises `choices=['auto', 'slow', 'mistral', 'custom']` for `--tokenizer-mode`. Passing `--tokenizer-mode mistral` (or `custom`) now silently falls through to a regular fast `AutoTokenizer` with no warning — for Mistral models that means wrong tokenization and bogus `prompt_len`/throughput numbers. Fix by either restricting the argparse choices to `['auto', 'slow']` or raising `NotImplementedError` for the unsupported modes (the previous fallback in `backend_request_func.py:527-535` had a dedicated mistral branch that returned `MistralTokenizer.from_pretrained(...)`).

Separately, and more minor: the signature accepts `**kwargs` but never forwards them to `AutoTokenizer.from_pretrained` — no current caller passes extras, so this is latent today, but either drop `**kwargs` or pass them through to match the prior fallback signature.
**Extended reasoning**

**What's wrong**
The replacement `get_tokenizer()` (`utils/bench_serving/benchmark_serving.py:48-61`) advertises an API matching the prior fallback but silently drops two declared inputs:

```python
def get_tokenizer(tokenizer_id, tokenizer_mode="auto", trust_remote_code=False, **kwargs):
    from transformers import AutoTokenizer
    use_fast = tokenizer_mode != "slow"
    return AutoTokenizer.from_pretrained(
        tokenizer_id,
        use_fast=use_fast,
        trust_remote_code=trust_remote_code,
    )
```

1. `tokenizer_mode` values `mistral` and `custom` are silently downgraded to `auto`. The argparse declaration further down in the file states:
```python
parser.add_argument(
    '--tokenizer-mode',
    type=str,
    default="auto",
    choices=['auto', 'slow', 'mistral', 'custom'],
    help='The tokenizer mode.\n\n* "auto" will use the '
    'fast tokenizer if available.\n* "slow" will '
    'always use the slow tokenizer. \n* '
    '"mistral" will always use the `mistral_common` tokenizer. \n*'
    '"custom" will use --tokenizer to select the preregistered tokenizer.')
```

   The previous fallback in `backend_request_func.py:527-535` had a dedicated mistral branch that returned `MistralTokenizer.from_pretrained(...)`, and the original `vllm.transformers_utils.tokenizer.get_tokenizer` (now removed via this PR) handled `mistral` and `custom` natively. The new helper only checks `tokenizer_mode != 'slow'`, so any non-slow value collapses to a `use_fast=True` `AutoTokenizer`.
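The argparse side of this contract can be checked in isolation. A standalone sketch (a fresh parser, not the benchmark's actual parser object) shows that `choices` happily accepts `mistral` and hands it downstream unchanged:

```python
import argparse

# argparse's `choices` validation only checks membership in the list;
# it cannot ensure the downstream function distinguishes the values.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--tokenizer-mode",
    default="auto",
    choices=["auto", "slow", "mistral", "custom"],
)

# The value is accepted and passed through unchanged:
args = parser.parse_args(["--tokenizer-mode", "mistral"])
```

So the CLI contract still advertises four modes; the enforcement has to live in the function that consumes `args.tokenizer_mode`.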
2. `**kwargs` is accepted but never forwarded. The previous fallback at `backend_request_func.py:537-541` forwarded `**kwargs` to `AutoTokenizer.from_pretrained`. The new body never references `kwargs` at all, so any caller passing e.g. `revision=...` or `cache_dir=...` would have those kwargs silently swallowed.
**Step-by-step proof for the mistral case**

1. User runs `python benchmark_serving.py --model mistralai/Mistral-7B-Instruct-v0.3 --tokenizer-mode mistral --dataset-name random ...`.
2. argparse accepts `mistral` because it's in `choices`; `args.tokenizer_mode == 'mistral'`.
3. `main()` calls `get_tokenizer(tokenizer_id, tokenizer_mode='mistral', trust_remote_code=...)`.
4. Inside `get_tokenizer`, `use_fast = 'mistral' != 'slow'` evaluates to `True`.
5. `AutoTokenizer.from_pretrained(tokenizer_id, use_fast=True, trust_remote_code=...)` returns a regular HuggingFace fast tokenizer — not `MistralTokenizer` from `mistral_common`.
6. The benchmark proceeds with a tokenizer that does not match Mistral's official tokenization (e.g. tekken-based models). `prompt_len`, `actual_output_lens`, and the derived throughput numbers are all computed against the wrong tokenizer.
7. No warning is emitted — the user thinks they're benchmarking Mistral mode but is actually benchmarking the HF fast tokenizer.
**Why current code doesn't prevent this**

Argparse's `choices` validation only confirms the string is one of the four allowed values; it doesn't ensure the receiving function actually distinguishes them. There is no assertion, log line, or branch in the new `get_tokenizer` that observes `tokenizer_mode` outside of the `!= 'slow'` comparison.
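The collapse itself is a one-line truth table. A sketch of the only branch the new helper contains makes it explicit:

```python
# The new helper's only use of tokenizer_mode: everything except "slow"
# becomes a fast tokenizer, including modes it cannot actually honor.
def resolve_use_fast(tokenizer_mode: str) -> bool:
    return tokenizer_mode != "slow"


resolved = {mode: resolve_use_fast(mode) for mode in ("auto", "slow", "mistral", "custom")}
# "mistral" and "custom" map to the same fast path as "auto".
```

Three of the four advertised modes land on the identical code path, which is exactly the silent downgrade described above.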
**Impact**

- Mistral mode: Real, user-visible silent miscalibration when a user explicitly opts into Mistral tokenization. Niche for this benchmark client, but the fact that argparse advertises it as a valid choice is the user contract being broken.
- Custom mode: Same silent fallthrough; affects any custom preregistered tokenizer the user expected.
- `**kwargs` drop: Latent — no current caller (`_init_tokenizer_worker`, `main`, `sample_random_requests` via the worker init) passes extras, so there's no observable bug today. This is the kind of API-contract mismatch a refuter rightly flagged as hypothetical; it's flagged here only because it appears alongside the substantive mistral/custom regression and is trivially fixable in the same edit.
**How to fix**

Pick one of:

```python
# Option A — narrow what we advertise:
parser.add_argument('--tokenizer-mode', choices=['auto', 'slow'], ...)
```

```python
# Option B — fail loudly on the unsupported modes:
def get_tokenizer(tokenizer_id, tokenizer_mode="auto", trust_remote_code=False, **kwargs):
    if tokenizer_mode in ("mistral", "custom"):
        raise NotImplementedError(
            f"tokenizer_mode={tokenizer_mode!r} is not supported by this benchmark client"
        )
    from transformers import AutoTokenizer
    use_fast = tokenizer_mode != "slow"
    return AutoTokenizer.from_pretrained(
        tokenizer_id, use_fast=use_fast, trust_remote_code=trust_remote_code, **kwargs
    )
```

Option B also forwards `**kwargs`, addressing the secondary signature-shim issue at the same time. Option A is simpler if Mistral/custom are not on the roadmap for this script.
```python
def get_tokenizer(tokenizer_id, tokenizer_mode="auto", trust_remote_code=False, **kwargs):
    """Load tokenizer directly via transformers, bypassing vllm's get_cached_tokenizer.

    vllm's get_cached_tokenizer() accesses tokenizer.all_special_tokens_extended which
    does not exist on Rust-backed TokenizersBackend (e.g. GLM-5). Using transformers
    AutoTokenizer directly avoids that code path entirely.
    """
    from transformers import AutoTokenizer

    use_fast = tokenizer_mode != "slow"
    return AutoTokenizer.from_pretrained(
        tokenizer_id,
        use_fast=use_fast,
        trust_remote_code=trust_remote_code,
    )
```
🔴 The new self-contained `get_tokenizer` calls `AutoTokenizer.from_pretrained` directly but never applies `_fix_tokenizer_for_sglang`, which the prior `backend_request_func.get_tokenizer` fallback applied at `backend_request_func.py:542`. On transformers v5, `LlamaTokenizerFast.__init__` rebuilds the `pre_tokenizer`/`decoder` from scratch and breaks DeepSeek-R1-class tokenizers — without the fix, a 7000-token prompt is encoded as ~35000 tokens client-side, silently inflating reported TTFT by ~5x. Fix by importing `_fix_tokenizer_for_sglang` from `backend_request_func` and calling it on the loaded tokenizer before returning.
**Extended reasoning**
**What the bug is**

The PR replaces the prior tokenizer import:

```python
try:
    from vllm.transformers_utils.tokenizer import get_tokenizer
except ImportError:
    from backend_request_func import get_tokenizer
```

with a self-contained helper at `benchmark_serving.py:48-61` that calls `AutoTokenizer.from_pretrained` raw. This correctly fixes the GLM-5 `TokenizersBackend` crash. However, the previous `backend_request_func.get_tokenizer` fallback path (`backend_request_func.py:537-542`) did not just call `AutoTokenizer.from_pretrained` — it then ran the result through `_fix_tokenizer_for_sglang(tokenizer, pretrained_model_name_or_path)` before returning. The new helper drops that post-processing step, and since the helper is now the only path, no users get the fix.
**Why `_fix_tokenizer_for_sglang` matters**

The fix function (`backend_request_func.py:442-509`) is documented in detail. In transformers v5, `LlamaTokenizerFast.__init__` rebuilds the `pre_tokenizer` and `decoder` from class-specific components rather than honoring the originals from `tokenizer.json`. For models like DeepSeek-R1 that declare `LlamaTokenizerFast` but actually need the original `Sequence`/`ByteLevel` components, v5 silently replaces the `Sequence` pre_tokenizer with `Metaspace` and the `ByteLevel` decoder with a different `Sequence`. The fix re-reads `tokenizer.json` directly and restores the originals; it also re-applies `add_bos_token`/`add_eos_token` from `tokenizer_config.json` when the class is in a known list (Llama/Gemma/Cohere variants).

The sglang server applies an equivalent fix in `hf_transformers_utils.py`. So if the benchmark client doesn't apply it, the client and server tokenize identical text differently — per the docstring, a 7000-token prompt becomes ~35000 tokens server-side, inflating TTFT ~5x.
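The shape of that restoration step can be sketched at the JSON layer alone. This is a simplified illustration with a synthetic `tokenizer.json`; the real `_fix_tokenizer_for_sglang` additionally rebuilds the live `tokenizers` backend objects from these specs:

```python
import json
import os
import tempfile


def original_components(tokenizer_json_path):
    """Re-read tokenizer.json and return the pre_tokenizer / decoder specs
    that a class __init__ rebuilding components from scratch would discard."""
    with open(tokenizer_json_path) as f:
        spec = json.load(f)
    return spec.get("pre_tokenizer"), spec.get("decoder")


# Synthetic tokenizer.json mimicking the DeepSeek-R1-style layout described
# above: a Sequence pre_tokenizer and a ByteLevel decoder.
sample = {
    "pre_tokenizer": {"type": "Sequence", "pretokenizers": [{"type": "ByteLevel"}]},
    "decoder": {"type": "ByteLevel"},
}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(sample, f)
    path = f.name

pre_tok, dec = original_components(path)
os.unlink(path)
```

The point is that the ground truth never leaves disk — a fix-up pass can always recover the declared components and reassign them onto the loaded tokenizer's backend.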
**Step-by-step proof**

1. User runs `benchmark_serving.py --model deepseek-ai/DeepSeek-R1 ...` in an environment with `transformers>=5.0` (and either no vllm, or any current setup, since the vllm import has been removed).
2. `main` calls the new `get_tokenizer` (`benchmark_serving.py:48-61`), which calls `AutoTokenizer.from_pretrained` and returns the result directly.
3. Because DeepSeek-R1's `tokenizer_config.json` lists `LlamaTokenizerFast` as the class, transformers v5's `LlamaTokenizerFast.__init__` overwrites the `Sequence` pre_tokenizer with `Metaspace` and the `ByteLevel` decoder with `Sequence`. The returned tokenizer is corrupted relative to the JSON.
4. Pre-PR, the fallback path at `backend_request_func.py:542` would have run `_fix_tokenizer_for_sglang`, which re-reads `tokenizer.json` and replaces `backend.pre_tokenizer`/`backend.decoder` with the originals. Post-PR, that step never runs.
5. `sample_random_requests` uses this tokenizer to encode prompts client-side; the server's sglang process applies its own equivalent fix and tokenizes the same prompt very differently. Per the fix's docstring, a 7000-token client prompt expands to ~35000 tokens on the server.
6. Server-side TTFT (which scales with actual prompt length) is ~5x larger than the prompt length the client thinks it sent. The benchmark prints the inflated TTFT as a real measurement.
**Why existing code doesn't prevent this**

The new helper in `benchmark_serving.py` has no awareness of `_fix_tokenizer_for_sglang` — it just wraps `AutoTokenizer.from_pretrained`. The fix function still exists in the sibling module `backend_request_func.py` (which is already imported at the top of `benchmark_serving.py` for `ASYNC_REQUEST_FUNCS`) but is never called from the new path. There is no automatic detection or fallback.
**Impact**

Silent, ~5x TTFT inflation on a primary benchmark target (DeepSeek-R1 and other Llama/Gemma/Cohere-class models on transformers v5). The benchmark continues to run, prints plausible-looking numbers, and reports a false performance regression. Users tracking SGLang vs. baseline performance on these models would see misleading results without any error message.

Note: pre-PR users who had vllm installed went through vllm's own `get_tokenizer` (which also lacked this SGLang-specific fix), so they didn't have the fix either. The clear regression is for the no-vllm path, which previously got the fix via the `backend_request_func` fallback. With the new code that path is gone, so no one gets it.
**How to fix**

Import `_fix_tokenizer_for_sglang` from `backend_request_func` (already imported in this file) and apply it before returning:

```python
def get_tokenizer(tokenizer_id, tokenizer_mode="auto", trust_remote_code=False, **kwargs):
    from transformers import AutoTokenizer
    from backend_request_func import _fix_tokenizer_for_sglang

    use_fast = tokenizer_mode != "slow"
    tokenizer = AutoTokenizer.from_pretrained(
        tokenizer_id,
        use_fast=use_fast,
        trust_remote_code=trust_remote_code,
    )
    return _fix_tokenizer_for_sglang(tokenizer, tokenizer_id)
```

This preserves the GLM-5 fix (still bypasses vllm's `get_cached_tokenizer`) while restoring the transformers v5 compatibility fix that the prior fallback provided.
## Summary

- Replaces the vllm `get_tokenizer` import in `benchmark_serving.py` with a self-contained `get_tokenizer()` backed by `transformers.AutoTokenizer.from_pretrained()`.
- vllm's `get_cached_tokenizer()` does a bare `tokenizer.all_special_tokens_extended` access. This attribute does not exist on Rust-backed `TokenizersBackend` (HuggingFace `tokenizers` library). Models with no Python slow tokenizer fallback (e.g. `amd/GLM-5-MXFP4`) crash with `AttributeError: TokenizersBackend has no attribute all_special_tokens_extended`.
- `--tokenizer-mode slow` is also insufficient — `use_fast=False` still returns `TokenizersBackend` for these models.
- The crash occurs in the benchmark client (`benchmark_serving.py`), not the SGLang server.
- `AutoTokenizer.from_pretrained()` always returns a proper Python-wrapped tokenizer, bypassing the vllm code path entirely. Fully API-compatible with all three call sites in `benchmark_serving.py` (`_init_tokenizer_worker`, `main`, `sample_random_requests`).
- `get_tokenizer` is no longer imported from vllm even when vllm is installed.

## Test plan

- `lmsysorg/sglang-rocm:v0.5.10rc0-rocm700-mi35x-20260422` (vllm 0.9.2rc2) with `amd/GLM-5-MXFP4`, TP=8, MI355X — benchmark completes successfully (exit code 0).
- `lmsysorg/sglang-rocm:v0.5.10.post1-rocm700-mi35x-20260428` with same model/config — benchmark completes successfully (exit code 0).
- All `get_tokenizer` call sites in `benchmark_serving.py` exercised without error.

🤖 Generated with Claude Code