Create SentencePieceTokenizer from options object #7403
PR Overview
This PR introduces the ability to create a SentencePieceTokenizer from an options object, enabling support for models defined by JSON or similar non-protobuf formats. Key changes include adding a new SentencePieceOptions class with configurable properties, updating existing model implementations and tests to support options-based creation, and adding new constructors and methods to handle JSON-based initialization.
Reviewed Changes
| File | Description |
|---|---|
| src/Microsoft.ML.Tokenizers/Model/SentencePieceOptions.cs | Adds a new options class to configure SentencePiece model properties. |
| test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs | Introduces tests ensuring the created tokenizer from JSON behaves as expected and validates shifted token IDs. |
| src/Microsoft.ML.Tokenizers/Model/SentencePieceUnigramModel.cs | Updates constructor and internal methods to support options-based initialization and renames internal counter for normalized text. |
| src/Microsoft.ML.Tokenizers/Model/SentencePieceBpeModel.cs | Adds a new constructor that validates unsupported normalization data and initializes the vocabulary based on options. |
| src/Microsoft.ML.Tokenizers/Model/SentencePieceBaseModel.cs | Introduces a new constructor from options and changes token ID properties from read-only to mutable. |
| src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs | Adds an overload of the Create method that instantiates a tokenizer using the options object. |
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
Coverage Diff
| Metric | main | #7403 | +/- |
|---|---|---|---|
| Coverage | 68.93% | 68.97% | +0.03% |
| Files | 1480 | 1481 | +1 |
| Lines | 273273 | 273666 | +393 |
| Branches | 28234 | 28287 | +53 |
| Hits | 188393 | 188751 | +358 |
| Misses | 77510 | 77521 | +11 |
| Partials | 7370 | 7394 | +24 |
This PR extends support for the SentencePiece tokenizer, which traditionally loads its data from a protobuf file (tokenizer.model). It adds the ability to create a tokenizer by passing an options object that carries the tokenizer data, such as the vocabulary and normalization settings. Users can therefore initialize a tokenizer without loading a protobuf file and can source the tokenizer data from other formats, such as JSON files.
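
As a rough illustration of the options-based path described above, the sketch below reads a small vocabulary from JSON and passes it to the new `Create` overload. The JSON layout and the tiny vocabulary are hypothetical and exist only for this example. The property names on `SentencePieceOptions` (`ModelType`, `Vocabulary`, `UnknownToken`, `BeginningOfSentenceToken`, `EndOfSentenceToken`), the `SentencePieceModelType.Unigram` value, and the `KeyValuePair<string, float>` element type are assumptions about the API this PR adds and should be checked against `SentencePieceOptions.cs`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;
using Microsoft.ML.Tokenizers;

// Hypothetical JSON layout: an array of [token, score] pairs listed in token ID order.
// "\u2581" is the SentencePiece whitespace marker; the JSON parser decodes the escape.
string json = """
[["<unk>", 0.0], ["<s>", 0.0], ["</s>", 0.0], ["\u2581hello", -1.5], ["\u2581world", -2.0]]
""";

// Convert the JSON pairs into (token, score) entries, preserving their order.
using JsonDocument doc = JsonDocument.Parse(json);
List<KeyValuePair<string, float>> vocabulary = doc.RootElement
    .EnumerateArray()
    .Select(e => new KeyValuePair<string, float>(e[0].GetString()!, e[1].GetSingle()))
    .ToList();

// Assumed property names; verify them against the options class added by this PR.
var options = new SentencePieceOptions
{
    ModelType = SentencePieceModelType.Unigram,
    Vocabulary = vocabulary,
    UnknownToken = "<unk>",
    BeginningOfSentenceToken = "<s>",
    EndOfSentenceToken = "</s>",
};

// No protobuf file is read here: the tokenizer is built from the options object alone.
SentencePieceTokenizer tokenizer = SentencePieceTokenizer.Create(options);
IReadOnlyList<int> ids = tokenizer.EncodeToIds("hello world");
Console.WriteLine(string.Join(", ", ids));
```

A real model would supply its full vocabulary and, where the model defines one, its normalization data. The five-entry vocabulary here is only meant to show the shape of the call.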