Add Byte Pair Encoding (BPE) class for subword tokenization #3056

Merged
davisking merged 7 commits into davisking:master from Cydral:master
Mar 23, 2025
@Cydral (Contributor) commented Feb 15, 2025

Description:

This PR introduces a new bpe_tokenizer class to Dlib, implementing the Byte Pair Encoding (BPE) algorithm for subword tokenization. BPE is a widely used technique in natural language processing (NLP) for handling out-of-vocabulary words and keeping the vocabulary small while still being able to represent arbitrary text.

Key Features:

  • BPE Algorithm: Implements the BPE algorithm as described in Sennrich et al., 2016.
  • Special Tokens: Supports predefined special tokens (e.g., <text>, <url>, <image>) for marking specific elements in the text.
  • Training and Encoding: Provides methods for training the tokenizer on a text corpus, encoding text into subword tokens, and decoding tokens back into text.
  • Serialization: Supports saving and loading the tokenizer model and vocabulary for reuse.
  • Multi-Threaded: Uses multiple threads to compute frequency statistics efficiently during training.
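
For readers unfamiliar with BPE, the training loop behind these features can be sketched as follows. This is a minimal, single-threaded illustration that assumes a byte-level base vocabulary (ids 0 to 255) and is not the code added by this PR; the names bpe_model_sketch, merge_pair_sketch and train_bpe_sketch are hypothetical and do not exist in Dlib.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical type for illustration only (not part of Dlib).
// Token ids 0..255 are raw bytes; merges[i] produces token id 256 + i.
struct bpe_model_sketch {
    std::vector<std::pair<int, int>> merges;
};

// Replace every non-overlapping occurrence of `pair` with `new_id`, scanning left to right.
inline std::vector<int> merge_pair_sketch(const std::vector<int>& ids,
                                          const std::pair<int, int>& pair,
                                          int new_id) {
    std::vector<int> out;
    out.reserve(ids.size());
    for (std::size_t i = 0; i < ids.size(); ++i) {
        if (i + 1 < ids.size() && ids[i] == pair.first && ids[i + 1] == pair.second) {
            out.push_back(new_id);
            ++i;  // skip the second element of the merged pair
        } else {
            out.push_back(ids[i]);
        }
    }
    return out;
}

// Learn up to `max_merges` merges by repeatedly fusing the most frequent adjacent pair.
inline bpe_model_sketch train_bpe_sketch(const std::string& corpus, int max_merges) {
    bpe_model_sketch model;
    std::vector<int> ids;
    for (unsigned char c : corpus) ids.push_back(c);

    for (int step = 0; step < max_merges; ++step) {
        // Count adjacent pairs in the current token sequence.
        std::map<std::pair<int, int>, int> counts;
        for (std::size_t i = 0; i + 1 < ids.size(); ++i)
            ++counts[{ids[i], ids[i + 1]}];

        // Pick the most frequent pair; ties go to the smallest pair in map order.
        std::pair<int, int> best{0, 0};
        int best_count = 0;
        for (const auto& kv : counts) {
            if (kv.second > best_count) {
                best = kv.first;
                best_count = kv.second;
            }
        }
        if (best_count < 2) break;  // nothing repeats, so no merge is worthwhile

        const int new_id = 256 + static_cast<int>(model.merges.size());
        ids = merge_pair_sketch(ids, best, new_id);
        model.merges.push_back(best);
    }
    return model;
}
```

A production tokenizer would typically stop at a target vocabulary size rather than a merge count, and could split the pair counting across threads; both are left out here to keep the sketch short.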

Usage:

dlib::bpe_tokenizer tokenizer;
tokenizer.train(corpus_text, target_vocab_size, true); // Train on a text corpus
std::vector<int> tokens = tokenizer.encode("Sample text to tokenize."); // Encode text
std::string decoded_text = tokenizer.decode(tokens); // Decode tokens back to text
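
To make the encode/decode round trip in the usage above concrete, here is a small standalone sketch of how learned merges can be applied to new text and then undone. It assumes a byte-level base vocabulary and an ordered merge list, and it is not Dlib's implementation; merge_list_sketch, encode_sketch and decode_sketch are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical alias for illustration only: merges[i] yields token id 256 + i.
using merge_list_sketch = std::vector<std::pair<int, int>>;

// Encode: start from raw bytes, then apply each merge in the order it was learned.
inline std::vector<int> encode_sketch(const std::string& text,
                                      const merge_list_sketch& merges) {
    std::vector<int> ids;
    for (unsigned char c : text) ids.push_back(c);

    for (std::size_t m = 0; m < merges.size(); ++m) {
        const int new_id = 256 + static_cast<int>(m);
        std::vector<int> out;
        out.reserve(ids.size());
        for (std::size_t i = 0; i < ids.size(); ++i) {
            if (i + 1 < ids.size() &&
                ids[i] == merges[m].first && ids[i + 1] == merges[m].second) {
                out.push_back(new_id);
                ++i;  // consume both halves of the pair
            } else {
                out.push_back(ids[i]);
            }
        }
        ids.swap(out);
    }
    return ids;
}

// Recursively expand a token id back into the bytes it stands for.
inline void append_bytes_sketch(int id, const merge_list_sketch& merges, std::string& out) {
    if (id < 256) {
        out.push_back(static_cast<char>(id));
        return;
    }
    const std::pair<int, int>& parts = merges[static_cast<std::size_t>(id - 256)];
    append_bytes_sketch(parts.first, merges, out);
    append_bytes_sketch(parts.second, merges, out);
}

// Decode: concatenate the byte expansion of every token.
inline std::string decode_sketch(const std::vector<int>& ids,
                                 const merge_list_sketch& merges) {
    std::string out;
    for (int id : ids) append_bytes_sketch(id, merges, out);
    return out;
}
```

Because every byte is itself a token, text containing pairs never seen during training still encodes (as plain byte ids) and decodes losslessly, which is how BPE avoids out-of-vocabulary failures.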

- Implement BPE (Byte Pair Encoding) tokenization
- Add training and encoding methods
- Include unit tests
@Cydral Cydral changed the title Add Byte Pair Encoding Class for Subword Tokenization Add Byte Pair Encoding (BPE) class for subword tokenization Feb 15, 2025
@davisking (Owner)

Nice, this is great. Sorry it took so long for me to get back to this.

@davisking davisking merged commit 1cd0634 into davisking:master Mar 23, 2025
10 checks passed
Repository owner deleted a comment from dlib-issue-bot Mar 24, 2025
@Cydral (Contributor, Author) commented Mar 24, 2025

No problem. I think this is another great new feature for our library.

@davisking (Owner)

> No problem. I think this is another great new feature for our library.

Indeed 😁

davisking pushed a commit to kSkip/dlib that referenced this pull request Apr 19, 2025
…g#3056)

* Add new BPE_Tokenizer class to Dlib

- Implement BPE (Byte Pair Encoding) tokenization
- Add training and encoding methods
- Include unit tests

* Update

* Update

* Last update: optimize BPE tokenizer encoding with parallel paragraph processing

* Use of “in-memory” files to avoid leaving any traces on disk during the test.

* Add one DLIB_TEST_MSG() test per encoded/decoded string

* Add bpe_tokenizer_abstract.h for documentation and integration