After doing more review of @BorisTheBrave's comments on #1133, and reviewing the relevant historical design choices, I've settled on a per-doc approach. This is optimal for fast tokenizers and slightly slower (but still correct) for slow ones. If a slow-tokenizer user reports a real perf regression, we'll revisit.

This change also has a small effect on SentencePiece tokenizers (T5, Mistral). Previously, a BOS token was emitted at every chunk boundary for LLaMA models, and an extra EOS token was emitted at every chunk boundary for T5 models. Any variance perceived by users of those models comes from the new per-doc approach eliminating those unnecessary tokens. In benchmark testing, removing the spurious tokens actually yielded a 20-30% performance improvement for SentencePiece tokenizers.
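For illustration, here is a minimal sketch of the boundary-token behavior described above, assuming a Hugging Face `transformers` tokenizer; the checkpoint name is a stand-in, not something this repo pins, and the asserts assume a BOS-prepending tokenizer like LLaMA's.

```python
# Minimal sketch: per-chunk vs. per-doc encoding of the same document.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

doc = "alpha beta gamma delta"
chunks = ["alpha beta", "gamma delta"]  # whitespace-bound chunks of one doc

# Old behavior: each chunk is encoded independently, so LLaMA's BOS token
# (and, for T5 checkpoints, an extra EOS token) lands at every chunk boundary.
per_chunk_ids = [tid for chunk in chunks for tid in tokenizer.encode(chunk)]

# New behavior: the whole document is encoded once, so special tokens are
# emitted once per document rather than once per chunk.
per_doc_ids = tokenizer.encode(doc)

assert per_chunk_ids.count(tokenizer.bos_token_id) == len(chunks)
assert per_doc_ids.count(tokenizer.bos_token_id) == 1
```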
Description
PR #1201 updated our chunking system to be whitespace-bound, which narrowed the set of cases in which #1133 could appear but did not eliminate the issue entirely. This change closes the remaining gap; a sketch of the chunking behavior follows.
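As a hypothetical sketch of what "whitespace-bound" means here, the function below cuts chunks only at whitespace so no word straddles a boundary; the name and signature are illustrative, not this repo's actual API.

```python
def whitespace_bound_chunks(text: str, max_len: int) -> list[str]:
    """Split `text` into chunks of at most `max_len` characters,
    cutting only at whitespace so no word straddles a boundary."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_len, len(text))
        if end < len(text):
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut  # back up to the last space before the limit
        chunks.append(text[start:end])
        # Skip the boundary space so it isn't duplicated into the next chunk.
        start = end + 1 if end < len(text) and text[end] == " " else end
    return chunks

# e.g. whitespace_bound_chunks("alpha beta gamma", 10) -> ["alpha", "beta gamma"]
```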
Fixes #1133