18 changes: 14 additions & 4 deletions utils/bench_serving/benchmark_serving.py
@@ -45,10 +45,20 @@
 from tqdm.asyncio import tqdm
 from transformers import PreTrainedTokenizerBase

-try:
-    from vllm.transformers_utils.tokenizer import get_tokenizer
-except ImportError:
-    from backend_request_func import get_tokenizer
+def get_tokenizer(tokenizer_id, tokenizer_mode="auto", trust_remote_code=False, **kwargs):
+    """Load tokenizer directly via transformers, bypassing vllm's get_cached_tokenizer.
+
+    vllm's get_cached_tokenizer() accesses tokenizer.all_special_tokens_extended which
+    does not exist on Rust-backed TokenizersBackend (e.g. GLM-5). Using transformers
+    AutoTokenizer directly avoids that code path entirely.
+    """
+    from transformers import AutoTokenizer
+    use_fast = tokenizer_mode != "slow"
+    return AutoTokenizer.from_pretrained(
+        tokenizer_id,
+        use_fast=use_fast,
+        trust_remote_code=trust_remote_code,
+    )

Check failure on line 61 in utils/bench_serving/benchmark_serving.py

Claude / Claude Code Review

New get_tokenizer silently drops declared inputs (tokenizer_mode mistral/custom, **kwargs)

The new `get_tokenizer()` only branches `use_fast` on `tokenizer_mode == 'slow'` versus everything else, but argparse still advertises `choices=['auto', 'slow', 'mistral', 'custom']` for `--tokenizer-mode`. Passing `--tokenizer-mode mistral` (or `custom`) now silently falls through to a regular fast `AutoTokenizer` with no warning — for Mistral models that means wrong tokenization and bogus `prompt_len`/throughput numbers. Fix by either restricting the argparse choices to `['auto', 'slow']` or raising `NotImplementedError` for the unsupported modes.

Check failure on line 61 in utils/bench_serving/benchmark_serving.py

Claude / Claude Code Review

New get_tokenizer drops _fix_tokenizer_for_sglang, causing silent ~5x TTFT inflation on transformers v5 models

The new self-contained `get_tokenizer` calls `AutoTokenizer.from_pretrained` directly but never applies `_fix_tokenizer_for_sglang`, which the prior `backend_request_func.get_tokenizer` fallback applied at backend_request_func.py:542. On transformers v5, `LlamaTokenizerFast.__init__` rebuilds the pre_tokenizer/decoder from scratch and breaks DeepSeek-R1-class tokenizers — without the fix, a 7000-token prompt is encoded as ~35000 tokens client-side, silently inflating reported TTFT by ~5x. Fix by importing `_fix_tokenizer_for_sglang` from `backend_request_func` and calling it on the loaded tokenizer before returning.
Comment on lines +48 to +61

🔴 The new get_tokenizer() only branches use_fast on tokenizer_mode == 'slow' versus everything else, but argparse still advertises choices=['auto', 'slow', 'mistral', 'custom'] for --tokenizer-mode. Passing --tokenizer-mode mistral (or custom) now silently falls through to a regular fast AutoTokenizer with no warning — for Mistral models that means wrong tokenization and bogus prompt_len/throughput numbers. Fix by either restricting the argparse choices to ['auto', 'slow'] or raising NotImplementedError for the unsupported modes (the previous fallback in backend_request_func.py:527-535 had a dedicated mistral branch that returned MistralTokenizer.from_pretrained(...)). Separately and more minor: the signature accepts **kwargs but never forwards them to AutoTokenizer.from_pretrained — no current caller passes extras so this is latent today, but either drop **kwargs or pass them through to match the prior fallback signature.

Extended reasoning...

What's wrong

The replacement get_tokenizer() (utils/bench_serving/benchmark_serving.py:48-61) advertises an API matching the prior fallback but silently drops two declared inputs:

def get_tokenizer(tokenizer_id, tokenizer_mode="auto", trust_remote_code=False, **kwargs):
    from transformers import AutoTokenizer
    use_fast = tokenizer_mode != "slow"
    return AutoTokenizer.from_pretrained(
        tokenizer_id,
        use_fast=use_fast,
        trust_remote_code=trust_remote_code,
    )

1. tokenizer_mode values mistral and custom are silently downgraded to auto. The argparse declaration further down in the file states:

parser.add_argument(
    '--tokenizer-mode',
    type=str,
    default="auto",
    choices=['auto', 'slow', 'mistral', 'custom'],
    help='The tokenizer mode.\n\n* "auto" will use the '
    'fast tokenizer if available.\n* "slow" will '
    'always use the slow tokenizer. \n* '
    '"mistral" will always use the `mistral_common` tokenizer. \n*'
    '"custom" will use --tokenizer to select the preregistered tokenizer.')

The previous fallback in backend_request_func.py:527-535 had a dedicated mistral branch that returned MistralTokenizer.from_pretrained(...), and the original vllm.transformers_utils.tokenizer.get_tokenizer (now removed via this PR) handled mistral and custom natively. The new helper only checks tokenizer_mode != 'slow', so any non-slow value collapses to use_fast=True AutoTokenizer.
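For illustration, the shape of that prior mistral branch, reconstructed from the description above (the authoritative body lives at backend_request_func.py:527-535; the import path shown here is an assumption, only the MistralTokenizer.from_pretrained(...) call is confirmed):

# Reconstructed sketch of the prior fallback's mistral branch; not part of
# this PR. The import path is assumed; only the from_pretrained call is
# confirmed by the description above.
if tokenizer_mode == "mistral":
    from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
    return MistralTokenizer.from_pretrained(tokenizer_id)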

2. **kwargs is accepted but never forwarded. The previous fallback at backend_request_func.py:537-541 forwarded **kwargs to AutoTokenizer.from_pretrained. The new body never references kwargs at all, so any caller passing e.g. revision=... or cache_dir=... would have those kwargs silently swallowed.
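A hypothetical call that shows the swallow (the revision argument is illustrative; no current caller passes it):

# Hypothetical: 'revision' lands in **kwargs but is never forwarded to
# AutoTokenizer.from_pretrained, so the default branch loads with no warning.
tokenizer = get_tokenizer("deepseek-ai/DeepSeek-R1", revision="refs/pr/12")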

Step-by-step proof for the mistral case

  1. User runs python benchmark_serving.py --model mistralai/Mistral-7B-Instruct-v0.3 --tokenizer-mode mistral --dataset-name random ....
  2. argparse accepts mistral because it's in choices. args.tokenizer_mode == 'mistral'.
  3. main() calls get_tokenizer(tokenizer_id, tokenizer_mode='mistral', trust_remote_code=...).
  4. Inside get_tokenizer, use_fast = 'mistral' != 'slow' evaluates to True.
  5. AutoTokenizer.from_pretrained(tokenizer_id, use_fast=True, trust_remote_code=...) returns a regular HuggingFace fast tokenizer — not MistralTokenizer from mistral_common.
  6. The benchmark proceeds with a tokenizer that does not match Mistral's official tokenization (e.g. tekken-based models). prompt_len, actual_output_lens, and the derived throughput numbers are all computed against the wrong tokenizer.
  7. No warning is emitted — the user thinks they're benchmarking Mistral mode but is actually benchmarking the HF fast tokenizer.
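Steps 4-5 are easy to confirm in a REPL (hypothetical snippet, not part of the PR):

# Hypothetical check of steps 4-5: both modes come back as the same plain HF
# fast tokenizer class; no MistralTokenizer is ever constructed.
tok_auto = get_tokenizer("mistralai/Mistral-7B-Instruct-v0.3", tokenizer_mode="auto")
tok_mistral = get_tokenizer("mistralai/Mistral-7B-Instruct-v0.3", tokenizer_mode="mistral")
assert type(tok_auto) is type(tok_mistral)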

Why current code doesn't prevent this

Argparse's choices validation only confirms the string is one of the four allowed values; it doesn't ensure the receiving function actually distinguishes them. There is no assertion, log line, or branch in the new get_tokenizer that observes tokenizer_mode outside of the != 'slow' comparison.

Impact

  • Mistral mode: Real, user-visible silent miscalibration when a user explicitly opts into Mistral tokenization. Niche for this benchmark client, but the fact that argparse advertises it as a valid choice is the user contract being broken.
  • Custom mode: Same silent fallthrough; affects any custom preregistered tokenizer the user expected.
  • **kwargs drop: Latent — no current caller (_init_tokenizer_worker, main, sample_random_requests via the worker init) passes extras, so there's no observable bug today. This is the kind of API-contract mismatch a refuter rightly flagged as hypothetical; it's flagged here only because it appears alongside the substantive mistral/custom regression and is trivially fixable in the same edit.

How to fix

Pick one of:

# Option A — narrow what we advertise:
parser.add_argument('--tokenizer-mode', choices=['auto', 'slow'], ...)
# Option B — fail loudly on the unsupported modes:
def get_tokenizer(tokenizer_id, tokenizer_mode="auto", trust_remote_code=False, **kwargs):
    if tokenizer_mode in ("mistral", "custom"):
        raise NotImplementedError(
            f"tokenizer_mode={tokenizer_mode!r} is not supported by this benchmark client"
        )
    from transformers import AutoTokenizer
    use_fast = tokenizer_mode != "slow"
    return AutoTokenizer.from_pretrained(
        tokenizer_id, use_fast=use_fast, trust_remote_code=trust_remote_code, **kwargs
    )

Option B also forwards **kwargs, addressing the secondary signature-shim issue at the same time. Option A is simpler if Mistral/custom are not on the roadmap for this script.

Comment on lines +48 to +61

🔴 The new self-contained get_tokenizer calls AutoTokenizer.from_pretrained directly but never applies _fix_tokenizer_for_sglang, which the prior backend_request_func.get_tokenizer fallback applied at backend_request_func.py:542. On transformers v5, LlamaTokenizerFast.__init__ rebuilds the pre_tokenizer/decoder from scratch and breaks DeepSeek-R1-class tokenizers — without the fix, a 7000-token prompt is encoded as ~35000 tokens client-side, silently inflating reported TTFT by ~5x. Fix by importing _fix_tokenizer_for_sglang from backend_request_func and calling it on the loaded tokenizer before returning.

Extended reasoning...

What the bug is

The PR replaces the prior tokenizer import:

try:
    from vllm.transformers_utils.tokenizer import get_tokenizer
except ImportError:
    from backend_request_func import get_tokenizer

with a self-contained helper at benchmark_serving.py:48-61 that calls AutoTokenizer.from_pretrained raw. This correctly fixes the GLM-5 TokenizersBackend crash. However, the previous backend_request_func.get_tokenizer fallback path (backend_request_func.py:537-542) did not just call AutoTokenizer.from_pretrained — it then ran the result through _fix_tokenizer_for_sglang(tokenizer, pretrained_model_name_or_path) before returning. The new helper drops that post-processing step, and since the helper is now the only path, no users get the fix.

Why _fix_tokenizer_for_sglang matters

The fix function (backend_request_func.py:442-509) is documented in detail. In transformers v5, LlamaTokenizerFast.__init__ rebuilds the pre_tokenizer and decoder from class-specific components rather than honoring the originals from tokenizer.json. For models like DeepSeek-R1 that declare LlamaTokenizerFast but actually need the original Sequence/ByteLevel components, v5 silently replaces the Sequence pre_tokenizer with Metaspace and the ByteLevel decoder with a different Sequence. The fix re-reads tokenizer.json directly and restores the originals; it also re-applies add_bos_token/add_eos_token from tokenizer_config.json when the class is in a known list (Llama/Gemma/Cohere variants).

The sglang server applies an equivalent fix in hf_transformers_utils.py. So if the benchmark client doesn't apply it, the client and server tokenize identical text differently — per the docstring, a 7000-token prompt becomes ~35000 tokens server-side, inflating TTFT ~5x.
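A minimal sketch of the kind of repair the fix performs, assuming a local model directory containing tokenizer.json (the function name and standalone form are illustrative; the real implementation is _fix_tokenizer_for_sglang at backend_request_func.py:442-509, which also re-applies add_bos_token/add_eos_token):

import os
from tokenizers import Tokenizer

def restore_original_backend(tokenizer, model_dir):
    # Illustrative only: re-read the tokenizer.json that transformers v5's
    # LlamaTokenizerFast.__init__ may have overridden, and restore the declared
    # pre_tokenizer and decoder on the Rust backend.
    original = Tokenizer.from_file(os.path.join(model_dir, "tokenizer.json"))
    backend = tokenizer.backend_tokenizer
    backend.pre_tokenizer = original.pre_tokenizer
    backend.decoder = original.decoder
    return tokenizer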

Step-by-step proof

  1. User runs benchmark_serving.py --model deepseek-ai/DeepSeek-R1 ... in an environment with transformers>=5.0 (and either no vllm, or any current setup since the vllm import has been removed).
  2. main calls the new get_tokenizer (benchmark_serving.py:48-61), which calls AutoTokenizer.from_pretrained and returns the result directly.
  3. Because DeepSeek-R1's tokenizer_config.json lists LlamaTokenizerFast as the class, transformers v5's LlamaTokenizerFast.__init__ overwrites the Sequence pre_tokenizer with Metaspace and the ByteLevel decoder with Sequence. Returned tokenizer is corrupted relative to the JSON.
  4. Pre-PR, the fallback path at backend_request_func.py:542 would have run _fix_tokenizer_for_sglang, which re-reads tokenizer.json and replaces backend.pre_tokenizer/backend.decoder with the originals. Post-PR, that step never runs.
  5. sample_random_requests uses this tokenizer to encode prompts client-side; the server's sglang process applies its own equivalent fix and tokenizes the same prompt very differently. Per the fix's docstring, a 7000-token client prompt expands to ~35000 tokens on the server.
  6. Server-side TTFT (which scales with the actual prompt length) is ~5x larger than it would be for the prompt length the client thinks it sent. The benchmark prints the inflated TTFT as a real measurement.
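An illustrative way to observe the divergence from the client side (not part of the PR; _fix_tokenizer_for_sglang is the existing helper described above):

# Illustrative sanity check: tokenize the same text with the raw AutoTokenizer
# and with a copy run through the existing _fix_tokenizer_for_sglang helper;
# diverging token counts signal the transformers v5 breakage.
from transformers import AutoTokenizer
from backend_request_func import _fix_tokenizer_for_sglang

model_id = "deepseek-ai/DeepSeek-R1"
raw = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
fixed = _fix_tokenizer_for_sglang(
    AutoTokenizer.from_pretrained(model_id, trust_remote_code=True), model_id
)
text = "hello world " * 1000  # stand-in for a long benchmark prompt
print(len(raw.encode(text)), len(fixed.encode(text)))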

Why existing code doesn't prevent this

The new helper in benchmark_serving.py has no awareness of _fix_tokenizer_for_sglang — it just wraps AutoTokenizer.from_pretrained. The fix function still exists in the sibling module backend_request_func.py (which is already imported at the top of benchmark_serving.py for ASYNC_REQUEST_FUNCS) but is never called from the new path. There is no automatic detection or fallback.

Impact

Silent, ~5x TTFT inflation on a primary benchmark target (DeepSeek-R1 and other Llama/Gemma/Cohere-class models on transformers v5). The benchmark continues to run, prints plausible-looking numbers, and reports a false performance regression. Users tracking SGLang vs. baseline performance on these models would see misleading results without any error message.

Note: pre-PR users who had vllm installed went through vllm's own get_tokenizer (which also lacked this SGLang-specific fix), so they didn't have the fix either. The clear regression is for the no-vllm path that previously got the fix via the backend_request_func fallback. With the new code that path is gone, so no one gets it.

How to fix

Import _fix_tokenizer_for_sglang from backend_request_func (already imported in this file) and apply it before returning:

def get_tokenizer(tokenizer_id, tokenizer_mode="auto", trust_remote_code=False, **kwargs):
    from transformers import AutoTokenizer
    from backend_request_func import _fix_tokenizer_for_sglang
    use_fast = tokenizer_mode != "slow"
    tokenizer = AutoTokenizer.from_pretrained(
        tokenizer_id,
        use_fast=use_fast,
        trust_remote_code=trust_remote_code,
    )
    return _fix_tokenizer_for_sglang(tokenizer, tokenizer_id)

This preserves the GLM-5 fix (still bypasses vllm's get_cached_tokenizer) while restoring the v5 transformers compatibility fix that the prior fallback provided.


 try:
     from vllm.utils import FlexibleArgumentParser