tarekgh (Member) commented Feb 27, 2025

We have extended support for the SentencePiece tokenizer, which traditionally loads its data from a protobuf file (tokenizer.model). This update makes it possible to create a tokenizer object by passing an options object that carries the tokenizer data, such as the vocabulary and normalization data. Users can therefore initialize a tokenizer without loading a protobuf file and can source the tokenizer data from other formats, such as JSON files.
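
As an illustration of the idea, here is a minimal, library-independent sketch in Python of sourcing tokenizer data from JSON into an options object. It is not the Microsoft.ML.Tokenizers API: the `TokenizerOptions` class, its field names, the `options_from_json` helper, and the JSON layout (a `model.vocab` list of `[token, score]` pairs with an optional `unk_id`) are all assumptions made for the example.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TokenizerOptions:
    """Hypothetical stand-in for an options object; not the real .NET class."""
    vocabulary: List[Tuple[str, float]] = field(default_factory=list)
    unknown_token: str = "<unk>"
    # Raw normalization data, if the JSON source provides any (assumed field).
    normalization_data: Optional[bytes] = None


def options_from_json(text: str) -> TokenizerOptions:
    """Build options from JSON text, assuming a model.vocab list of [token, score]."""
    model = json.loads(text)["model"]
    vocab = [(str(token), float(score)) for token, score in model["vocab"]]
    unk_id = model.get("unk_id")
    # Fall back to a conventional default when the JSON names no unknown token.
    unknown = vocab[unk_id][0] if unk_id is not None else "<unk>"
    return TokenizerOptions(vocabulary=vocab, unknown_token=unknown)


# Assumed sample input; the escaped \u2581 is the SentencePiece word-boundary marker.
SAMPLE_JSON = """
{"model": {"type": "Unigram", "unk_id": 0,
           "vocab": [["<unk>", 0.0], ["\\u2581hello", -1.5], ["\\u2581world", -2.25]]}}
"""

options = options_from_json(SAMPLE_JSON)
print(len(options.vocabulary), options.unknown_token)
```

In the .NET library itself, the corresponding step would be to populate the new options class from the parsed data and pass it to the new `Create` overload that this PR adds.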

Copilot AI review requested due to automatic review settings February 27, 2025 02:50
tarekgh commented Feb 27, 2025

Copilot AI left a comment

PR Overview

This PR introduces the ability to create a SentencePieceTokenizer from an options object, enabling support for models defined by JSON or similar non-protobuf formats. Key changes include adding a new SentencePieceOptions class with configurable properties, updating existing model implementations and tests to support options-based creation, and adding new constructors and methods to handle JSON-based initialization.

Reviewed Changes

| File | Description |
| --- | --- |
| src/Microsoft.ML.Tokenizers/Model/SentencePieceOptions.cs | Adds a new options class to configure SentencePiece model properties. |
| test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs | Introduces tests ensuring the created tokenizer from JSON behaves as expected and validates shifted token IDs. |
| src/Microsoft.ML.Tokenizers/Model/SentencePieceUnigramModel.cs | Updates constructor and internal methods to support options-based initialization and renames internal counter for normalized text. |
| src/Microsoft.ML.Tokenizers/Model/SentencePieceBpeModel.cs | Adds a new constructor that validates unsupported normalization data and initializes the vocabulary based on options. |
| src/Microsoft.ML.Tokenizers/Model/SentencePieceBaseModel.cs | Introduces a new constructor from options and changes token ID properties from read-only to mutable. |
| src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs | Adds an overload of the Create method that instantiates a tokenizer using the options object. |

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

codecov bot commented Feb 27, 2025

Codecov Report

Attention: Patch coverage is 73.39667% with 112 lines in your changes missing coverage. Please review.

Project coverage is 68.97%. Comparing base (2bd88b9) to head (f97b783).
Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...osoft.ML.Tokenizers/Model/SentencePieceBpeModel.cs | 0.00% | 33 Missing ⚠️ |
| ...t.ML.Tokenizers/Model/SentencePieceUnigramModel.cs | 52.45% | 23 Missing and 6 partials ⚠️ |
| test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs | 88.80% | 16 Missing and 12 partials ⚠️ |
| ...soft.ML.Tokenizers/Model/SentencePieceBaseModel.cs | 66.66% | 10 Missing and 6 partials ⚠️ |
| ...soft.ML.Tokenizers/Model/SentencePieceTokenizer.cs | 60.00% | 4 Missing and 2 partials ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7403      +/-   ##
==========================================
+ Coverage   68.93%   68.97%   +0.03%     
==========================================
  Files        1480     1481       +1     
  Lines      273273   273666     +393     
  Branches    28234    28287      +53     
==========================================
+ Hits       188393   188751     +358     
- Misses      77510    77521      +11     
- Partials     7370     7394      +24     

| Flag | Coverage Δ |
| --- | --- |
| Debug | 68.97% <73.39%> (+0.03%) ⬆️ |
| production | 63.27% <50.87%> (+0.02%) ⬆️ |
| test | 89.46% <88.80%> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
| --- | --- |
| ...rosoft.ML.Tokenizers/Model/SentencePieceOptions.cs | 100.00% <100.00%> (ø) |
| ...soft.ML.Tokenizers/Model/SentencePieceTokenizer.cs | 85.26% <60.00%> (-4.74%) ⬇️ |
| ...soft.ML.Tokenizers/Model/SentencePieceBaseModel.cs | 77.83% <66.66%> (-1.21%) ⬇️ |
| test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs | 93.93% <88.80%> (-3.17%) ⬇️ |
| ...t.ML.Tokenizers/Model/SentencePieceUnigramModel.cs | 65.99% <52.45%> (+5.80%) ⬆️ |
| ...osoft.ML.Tokenizers/Model/SentencePieceBpeModel.cs | 74.51% <0.00%> (-2.93%) ⬇️ |

... and 4 files with indirect coverage changes

@tarekgh tarekgh merged commit 0807bd8 into dotnet:main Feb 27, 2025
25 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Mar 30, 2025