[AMD][ROCM] Fix benchmark_serving Rust Tokenizer Crash via Direct transformers AutoTokenizer #1253
base: main
utils/bench_serving/benchmark_serving.py

```diff
@@ -45,10 +45,20 @@
 from tqdm.asyncio import tqdm
 from transformers import PreTrainedTokenizerBase
 
-try:
-    from vllm.transformers_utils.tokenizer import get_tokenizer
-except ImportError:
-    from backend_request_func import get_tokenizer
+def get_tokenizer(tokenizer_id, tokenizer_mode="auto", trust_remote_code=False, **kwargs):
+    """Load tokenizer directly via transformers, bypassing vllm's get_cached_tokenizer.
+
+    vllm's get_cached_tokenizer() accesses tokenizer.all_special_tokens_extended which
+    does not exist on Rust-backed TokenizersBackend (e.g. GLM-5). Using transformers
+    AutoTokenizer directly avoids that code path entirely.
+    """
+    from transformers import AutoTokenizer
+    use_fast = tokenizer_mode != "slow"
+    return AutoTokenizer.from_pretrained(
+        tokenizer_id,
+        use_fast=use_fast,
+        trust_remote_code=trust_remote_code,
+    )
```
Check failure on line 61 in utils/bench_serving/benchmark_serving.py
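As context for the docstring above, here is a minimal diagnostic sketch (not part of the PR) that reproduces the attribute access performed by vllm's `get_cached_tokenizer()`; the checkpoint path is a placeholder:

```python
# Probe whether a given checkpoint would hit the crash the docstring describes:
# vllm's get_cached_tokenizer() dereferences tokenizer.all_special_tokens_extended,
# which a Rust-backed tokenizer object may not expose.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("/models/GLM-5", use_fast=True, trust_remote_code=True)
try:
    _ = tok.all_special_tokens_extended  # the attribute vllm's caching wrapper reads
    print("ok: vllm's get_cached_tokenizer() would not raise on this attribute")
except AttributeError:
    print("crash path: this tokenizer object has no all_special_tokens_extended")
```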
Comment on lines +48 to +61
Contributor
🔴 The new self-contained `get_tokenizer()` replaces the prior fallback import.

Extended reasoning

**What the bug is**

The PR replaces the prior tokenizer import:

```python
try:
    from vllm.transformers_utils.tokenizer import get_tokenizer
except ImportError:
    from backend_request_func import get_tokenizer
```

with a self-contained helper at `utils/bench_serving/benchmark_serving.py:48-61`.
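For orientation, a minimal usage sketch of the new helper as the benchmark's call path would exercise it; the import, checkpoint path, and prompt are assumptions, not code from the PR:

```python
# Hypothetical invocation, run from utils/bench_serving/ and assuming the module can be
# imported without side effects; the checkpoint path is a placeholder.
from benchmark_serving import get_tokenizer

tok = get_tokenizer("/models/GLM-5", tokenizer_mode="auto", trust_remote_code=True)
prompt_ids = tok("Summarize the following paragraph:").input_ids
print(len(prompt_ids))  # the kind of prompt_len count the benchmark derives per request
```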
```diff
 
 try:
     from vllm.utils import FlexibleArgumentParser
```
🔴 The new `get_tokenizer()` only branches `use_fast` on `tokenizer_mode == 'slow'` versus everything else, but argparse still advertises `choices=['auto', 'slow', 'mistral', 'custom']` for `--tokenizer-mode`. Passing `--tokenizer-mode mistral` (or `custom`) now silently falls through to a regular fast `AutoTokenizer` with no warning; for Mistral models that means wrong tokenization and bogus `prompt_len`/throughput numbers. Fix by either restricting the argparse choices to `['auto', 'slow']` or raising `NotImplementedError` for the unsupported modes (the previous fallback in `backend_request_func.py:527-535` had a dedicated `mistral` branch that returned `MistralTokenizer.from_pretrained(...)`). Separately, and more minor: the signature accepts `**kwargs` but never forwards them to `AutoTokenizer.from_pretrained`; no current caller passes extras, so this is latent today, but either drop `**kwargs` or pass them through to match the prior fallback signature.

Extended reasoning
**What's wrong**

The replacement `get_tokenizer()` (`utils/bench_serving/benchmark_serving.py:48-61`) advertises an API matching the prior fallback but silently drops two declared inputs:

1. `tokenizer_mode` values `mistral` and `custom` are silently downgraded to `auto`. The argparse declaration further down in the file still advertises `choices=['auto', 'slow', 'mistral', 'custom']` for `--tokenizer-mode`. The previous fallback in `backend_request_func.py:527-535` had a dedicated `mistral` branch that returned `MistralTokenizer.from_pretrained(...)`, and the original `vllm.transformers_utils.tokenizer.get_tokenizer` (now removed via this PR) handled `mistral` and `custom` natively. The new helper only checks `tokenizer_mode != 'slow'`, so any non-`slow` value collapses to a `use_fast=True` `AutoTokenizer`.

2. `**kwargs` is accepted but never forwarded. The previous fallback at `backend_request_func.py:537-541` forwarded `**kwargs` to `AutoTokenizer.from_pretrained`. The new body never references `kwargs` at all, so any caller passing e.g. `revision=...` or `cache_dir=...` would have those kwargs silently swallowed.
**Step-by-step proof for the mistral case**

1. Run `python benchmark_serving.py --model mistralai/Mistral-7B-Instruct-v0.3 --tokenizer-mode mistral --dataset-name random ...`.
2. argparse accepts `mistral` because it's in `choices`, so `args.tokenizer_mode == 'mistral'`.
3. `main()` calls `get_tokenizer(tokenizer_id, tokenizer_mode='mistral', trust_remote_code=...)`.
4. Inside the new `get_tokenizer`, `use_fast = 'mistral' != 'slow'` evaluates to `True`.
5. `AutoTokenizer.from_pretrained(tokenizer_id, use_fast=True, trust_remote_code=...)` returns a regular HuggingFace fast tokenizer, not `MistralTokenizer` from `mistral_common`.
6. `prompt_len`, `actual_output_lens`, and the derived throughput numbers are all computed against the wrong tokenizer.
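A tiny illustration of step 4's collapse (pure Python, no model download needed):

```python
# How the new helper maps each advertised --tokenizer-mode value onto use_fast:
for mode in ("auto", "slow", "mistral", "custom"):
    print(f"{mode:8s} -> use_fast={mode != 'slow'}")
# auto     -> use_fast=True
# slow     -> use_fast=False
# mistral  -> use_fast=True   (silently treated like 'auto')
# custom   -> use_fast=True   (silently treated like 'auto')
```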
**Why current code doesn't prevent this**

Argparse's `choices` validation only confirms the string is one of the four allowed values; it doesn't ensure the receiving function actually distinguishes them. There is no assertion, log line, or branch in the new `get_tokenizer` that observes `tokenizer_mode` outside of the `!= 'slow'` comparison.
**Impact**

- `**kwargs` drop: latent; no current caller (`_init_tokenizer_worker`, `main`, `sample_random_requests` via the worker init) passes extras, so there's no observable bug today. This is the kind of API-contract mismatch a refuter rightly flagged as hypothetical; it's flagged here only because it appears alongside the substantive mistral/custom regression and is trivially fixable in the same edit.
**How to fix**

Pick one of:

- Option A: restrict the argparse `choices` for `--tokenizer-mode` to `['auto', 'slow']`, or raise `NotImplementedError` for the unsupported modes, so `mistral`/`custom` fail loudly.
- Option B: restore dedicated `mistral` handling (as the prior fallback did via `MistralTokenizer.from_pretrained(...)`) and pass `**kwargs` through to `AutoTokenizer.from_pretrained`.

Option B also forwards `**kwargs`, addressing the secondary signature-shim issue at the same time. Option A is simpler if Mistral/custom are not on the roadmap for this script.
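For concreteness, a minimal sketch of the guard-style fix with `**kwargs` forwarding; this is illustrative only and not code from the PR or from `backend_request_func.py`:

```python
def get_tokenizer(tokenizer_id, tokenizer_mode="auto", trust_remote_code=False, **kwargs):
    """Illustrative variant: fail loudly for modes this helper cannot honor."""
    if tokenizer_mode not in ("auto", "slow"):
        # 'mistral' and 'custom' are still advertised by argparse; refuse them instead of
        # silently returning a fast AutoTokenizer and skewing prompt_len/throughput numbers.
        raise NotImplementedError(
            f"--tokenizer-mode {tokenizer_mode!r} is not supported by this helper"
        )
    from transformers import AutoTokenizer
    return AutoTokenizer.from_pretrained(
        tokenizer_id,
        use_fast=tokenizer_mode != "slow",
        trust_remote_code=trust_remote_code,
        **kwargs,  # forward extras (e.g. revision=..., cache_dir=...) like the prior fallback
    )
```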