# Fix tokenizer mismatch between benchmark client and sglang server #937
```diff
@@ -439,6 +439,76 @@ def get_model(pretrained_model_name_or_path: str) -> str:
     return pretrained_model_name_or_path


+def _fix_tokenizer_for_sglang(tokenizer, model_path):
+    """Fix transformers v5 tokenizer to match sglang server-side behavior.
+
+    Root cause: transformers v5 (>= 5.0) changed how tokenizers are loaded.
+    Specifically, LlamaTokenizerFast.__init__ in v5 rebuilds the pre_tokenizer
+    and decoder from scratch using class-specific components, discarding the
+    originals from tokenizer.json. For models like DeepSeek-R1 that declare
+    LlamaTokenizerFast but actually use a ByteLevel/Sequence tokenizer
+    architecture, v5 incorrectly replaces the original Sequence pre_tokenizer
+    with Metaspace, and the original ByteLevel decoder with Sequence.
+    See: https://github.com/sgl-project/sglang/blob/9238bd08a2895fa3b7ec79ea567e5c27ac951343/python/sglang/srt/utils/hf_transformers_utils.py#L836
+
+    The sglang server applies fixes for this in hf_transformers_utils.py
+    (_fix_v5_tokenizer_components and _fix_v5_add_bos_eos_token), but the
+    benchmark client loads the tokenizer directly via AutoTokenizer without
+    these fixes. This mismatch causes the client to encode text differently
+    from the server -- e.g. a 7000-token prompt on the client becomes ~35000
+    tokens on the server, leading to ~5x TTFT inflation and false performance
+    regressions in benchmarks.
+
+    This function replicates the same fixes so the benchmark client tokenizes
+    identically to the sglang server. It is a no-op on transformers v4.
+    """
+    import json
+    from pathlib import Path
+
+    backend = getattr(tokenizer, "_tokenizer", None)
+    if backend is not None:
+        try:
+            from tokenizers import Tokenizer as RawTokenizer
+            tok_file = Path(model_path) / "tokenizer.json"
+            if tok_file.is_file():
+                raw = RawTokenizer.from_file(str(tok_file))
+                raw_pre = type(raw.pre_tokenizer).__name__ if raw.pre_tokenizer else None
+                loaded_pre = type(backend.pre_tokenizer).__name__ if backend.pre_tokenizer else None
+                if raw_pre and loaded_pre and raw_pre != loaded_pre:
+                    backend.pre_tokenizer = raw.pre_tokenizer
```
> **Review comment on lines +476 to +478** (Contributor)
>
> 🟡 **What the bug is:** at lines 477-478, when a pre_tokenizer type mismatch is detected, the code unconditionally assigns:
>
> ```python
> backend.pre_tokenizer = raw.pre_tokenizer
> backend.decoder = raw.decoder
> ```
>
> The entry condition (`raw_pre and loaded_pre and raw_pre != loaded_pre`) compares only the pre_tokenizer types; it never checks whether `raw.decoder` exists.
>
> **The specific code path:** if `tokenizer.json` defines a pre_tokenizer but no decoder, `raw.decoder` is `None`, and the second assignment replaces the backend's valid decoder with `None`.
>
> **Why existing guards don't prevent it:** the condition guards only the pre_tokenizer comparison; nothing inspects the decoder before it is overwritten.
>
> **Impact:** any tokenizer.json that specifies a pre_tokenizer without a decoder would silently break all token-to-text decoding after this function runs. Since the failure happens after the function returns, it would manifest as garbage output or crashes far from the assignment site, making the root cause very difficult to diagnose.
>
> **Addressing the refutation:** the refutation correctly notes that real-world tokenizer.json files with a pre_tokenizer almost universally also define a decoder (DeepSeek-R1 does have a ByteLevel decoder), so in practice this scenario is unlikely to trigger for the intended targets. However, the defensive None-check is a one-line fix with zero downside, and code that silently corrupts state based on an implicit invariant of external files is fragile. The missing check is a real defect, even if it is low severity.
>
> **Fix:**
>
> ```python
> if raw_pre and loaded_pre and raw_pre != loaded_pre:
>     backend.pre_tokenizer = raw.pre_tokenizer
>     if raw.decoder is not None:
>         backend.decoder = raw.decoder
> ```
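The guard proposed in this review can be demonstrated with plain stand-in objects. This is a sketch assuming simple strings in place of the real `tokenizers` pre_tokenizer/decoder components; `FakeBackend` and `FakeRaw` are hypothetical names, not part of the PR or the `tokenizers` library:

```python
class FakeBackend:
    """Stand-in for the loaded fast-tokenizer backend, which holds a
    wrong pre_tokenizer (injected by transformers v5) but a valid decoder."""
    def __init__(self):
        self.pre_tokenizer = "Metaspace"   # wrong component from v5
        self.decoder = "ByteLevel"         # correct, must not be clobbered

class FakeRaw:
    """Stand-in for tokenizer.json that declares a pre_tokenizer
    but no decoder at all."""
    pre_tokenizer = "Sequence"
    decoder = None

backend, raw = FakeBackend(), FakeRaw()

if backend.pre_tokenizer != raw.pre_tokenizer:
    backend.pre_tokenizer = raw.pre_tokenizer
    if raw.decoder is not None:            # the None-check from the review fix
        backend.decoder = raw.decoder

print(backend.pre_tokenizer)  # Sequence  (restored from tokenizer.json)
print(backend.decoder)        # ByteLevel (preserved, not overwritten with None)
```

Without the inner `if`, the last line would print `None`, reproducing the silent decoder corruption the review describes.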
```diff
+                    backend.decoder = raw.decoder
+        except Exception:
+            pass
+
+    try:
+        config_file = Path(model_path) / "tokenizer_config.json"
+        if config_file.is_file():
+            with open(config_file) as f:
+                config = json.load(f)
+            tok_class = config.get("tokenizer_class", "")
+            bos_eos_classes = {
+                "LlamaTokenizer", "LlamaTokenizerFast",
+                "CodeLlamaTokenizer", "CodeLlamaTokenizerFast",
+                "GemmaTokenizer", "GemmaTokenizerFast", "CohereTokenizerFast",
+            }
+            if tok_class in bos_eos_classes:
+                defaults = {"add_bos_token": True, "add_eos_token": False}
+                changed = False
+                for attr in ("add_bos_token", "add_eos_token"):
+                    val = config.get(attr)
+                    if val is None:
+                        val = defaults.get(attr, False)
+                    if getattr(tokenizer, attr, None) != val:
+                        setattr(tokenizer, f"_{attr}", val)
+                        changed = True
+                if changed and hasattr(tokenizer, "update_post_processor"):
+                    tokenizer.update_post_processor()
+    except Exception:
+        pass
+
+    return tokenizer
+
+
 def get_tokenizer(
     pretrained_model_name_or_path: str,
     tokenizer_mode: str = "auto",
```
```diff
@@ -464,11 +534,12 @@ def get_tokenizer(
             return MistralTokenizer.from_pretrained(
                 str(pretrained_model_name_or_path))
         else:
-            return AutoTokenizer.from_pretrained(
+            tokenizer = AutoTokenizer.from_pretrained(
                 pretrained_model_name_or_path,
                 trust_remote_code=trust_remote_code,
                 **kwargs,
             )
+            return _fix_tokenizer_for_sglang(tokenizer, pretrained_model_name_or_path)
```
```diff
@@ -481,4 +552,4 @@ def get_tokenizer(
 ASYNC_REQUEST_FUNCS = {
     "tensorrt-llm": async_request_trt_llm,
     "scalellm": async_request_openai_completions,
     "sglang": async_request_openai_completions,
-}
+}
```
> **Review comment**
>
> 🔴 The tokenizer fix is silently skipped when a HuggingFace Hub model ID (e.g. `'deepseek-ai/DeepSeek-R1'`) is passed instead of a local directory path. `Path('deepseek-ai/DeepSeek-R1') / 'tokenizer.json'` constructs a relative path that doesn't exist on disk, so both `tok_file.is_file()` and `config_file.is_file()` return `False` and the entire fix is bypassed. Since `get_model()` returns the model ID unchanged when ModelScope is not in use, this covers the common case for most benchmark users -- the tokenizer mismatch this PR is intended to fix will not be corrected for them.
>
> **What the bug is and how it manifests**
>
> The `_fix_tokenizer_for_sglang` function reads tokenizer configuration files from the local filesystem using `Path(model_path) / "tokenizer.json"` and `Path(model_path) / "tokenizer_config.json"`. Both reads are guarded by `.is_file()` checks. If `model_path` is a HuggingFace Hub model ID like `'deepseek-ai/DeepSeek-R1'`, neither path exists on disk, so both checks return `False` and both fixes -- the pre_tokenizer/decoder restoration and the add_bos/add_eos recovery -- are silently skipped. The function returns the tokenizer unchanged.
>
> **The specific code path that triggers it**
>
> In `get_tokenizer()` (line ~517), when `pretrained_model_name_or_path` is not a local directory, `get_model()` is called. Without `VLLM_USE_MODELSCOPE=true`, `get_model()` simply returns the input string unchanged (lines 439-441). `AutoTokenizer.from_pretrained('deepseek-ai/DeepSeek-R1')` then succeeds by downloading to the HuggingFace cache. Finally, `_fix_tokenizer_for_sglang(tokenizer, 'deepseek-ai/DeepSeek-R1')` is called with the Hub ID as `model_path`. Inside, `Path('deepseek-ai/DeepSeek-R1') / 'tokenizer.json'` resolves to the relative path `deepseek-ai/DeepSeek-R1/tokenizer.json` in the current working directory, which does not exist.
>
> **Why existing code doesn't prevent it**
>
> The code has no fallback to resolve the HF cache location. The `is_file()` guards are appropriate for local paths but become a silent no-op for Hub IDs. There is no warning, log message, or exception; the function just returns the unmodified tokenizer. Because `AutoTokenizer.from_pretrained` succeeds silently using the HF cache, callers have no indication the fix was skipped.
>
> **What the impact would be**
>
> Any user who passes a HuggingFace Hub model ID to the benchmark script (the default and most common usage pattern) will get an unfixed tokenizer. The benchmark client will still tokenize text differently from the sglang server -- the exact mismatch this PR intends to resolve. For DeepSeek-R1, this means the client sees ~7,000 tokens but the server sees ~35,000 tokens, causing ~5x TTFT inflation and ~24% throughput regression as described in the PR description. The fix only works if users explicitly pre-download models to a local path.
>
> **How to fix it**
>
> After loading the tokenizer, resolve the local cache path before calling `_fix_tokenizer_for_sglang`. The simplest approach is to use `tokenizer.name_or_path` (which `AutoTokenizer` populates with the resolved cache directory) or call `huggingface_hub.snapshot_download(model_id, local_files_only=True)` to get the cache path without triggering a re-download. For example: `_fix_tokenizer_for_sglang(tokenizer, tokenizer.name_or_path)`.
>
> **Step-by-step proof**
>
> 1. User runs the benchmark with `--model deepseek-ai/DeepSeek-R1` (no local path).
> 2. `get_tokenizer('deepseek-ai/DeepSeek-R1')` is called; `os.path.exists('deepseek-ai/DeepSeek-R1')` → `False`.
> 3. `get_model('deepseek-ai/DeepSeek-R1')` returns `'deepseek-ai/DeepSeek-R1'` (ModelScope not set).
> 4. `AutoTokenizer.from_pretrained('deepseek-ai/DeepSeek-R1')` succeeds; the tokenizer is loaded from `~/.cache/huggingface/hub/...`.
> 5. `_fix_tokenizer_for_sglang(tokenizer, 'deepseek-ai/DeepSeek-R1')` is called; `Path('deepseek-ai/DeepSeek-R1') / 'tokenizer.json'` → `PosixPath('deepseek-ai/DeepSeek-R1/tokenizer.json')`.
> 6. `tok_file.is_file()` → `False`, so the pre_tokenizer/decoder fix is skipped; `config_file.is_file()` → `False`, so the add_bos/add_eos fix is skipped.
> 7. The client tokenizer retains the wrong v5 pre_tokenizer (`Metaspace` instead of `Sequence`) and decoder (`Sequence` instead of `ByteLevel`).
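The silent-skip condition described in this comment comes down to a plain `pathlib` join and is easy to reproduce with the standard library alone. A minimal sketch, assuming it runs from a working directory that does not happen to contain a `deepseek-ai/` folder:

```python
from pathlib import Path

# A Hub model ID is just a string containing a slash; joining it with a
# filename yields a *relative* path under the current working directory,
# not a path into the HuggingFace cache.
tok_file = Path("deepseek-ai/DeepSeek-R1") / "tokenizer.json"
print(tok_file.as_posix())   # deepseek-ai/DeepSeek-R1/tokenizer.json

# Unless the CWD contains that directory, the is_file() guard is False,
# and _fix_tokenizer_for_sglang silently returns the tokenizer unchanged.
print(tok_file.is_file())    # False (assuming no such local directory)
```

This is why the comment suggests resolving the on-disk snapshot directory first (via `tokenizer.name_or_path` or `huggingface_hub.snapshot_download(..., local_files_only=True)`) and passing that resolved path into the fix function instead of the raw model ID.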