light-tokenizer

A parallelized BPE tokenizer built from scratch as part of Stanford's CS336 assignment.

No HuggingFace. No SentencePiece. Just raw Python and a lot of profiling.

What's here

  • train.py - BPE training with multiprocessing for pre-tokenization (see the sketch after this list)
  • tokenizer.py - CLI for BPE encoding and decoding
  • trained-tokenizers/ - Trained vocabulary and merge files for TinyStories (10K) and OpenWebText (32K)
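
A rough sketch of the parallel pre-tokenization idea is below: split the corpus into chunks, count pre-tokens per chunk in worker processes, and merge the counts. The function names and the simplified split pattern are illustrative assumptions, not the actual interfaces in train.py.

# Illustrative sketch of parallel pre-tokenization (not the code in train.py)
import multiprocessing as mp
import re
from collections import Counter

# Simplified GPT-2-style split pattern; real implementations typically use the
# `regex` module with Unicode categories.
PRETOKEN_RE = re.compile(r"'(?:[sdmt]|ll|ve|re)| ?\w+| ?[^\s\w]+|\s+(?!\S)|\s+")

def count_pretokens(chunk: str) -> Counter:
    # Count pre-token occurrences in one chunk of the corpus.
    return Counter(PRETOKEN_RE.findall(chunk))

def parallel_pretokenize(chunks: list[str], workers: int = 8) -> Counter:
    # Fan chunks out across worker processes and merge the per-chunk counts.
    # (On spawn-based platforms, call this under `if __name__ == "__main__":`.)
    totals: Counter = Counter()
    with mp.Pool(workers) as pool:
        for counts in pool.imap_unordered(count_pretokens, chunks):
            totals.update(counts)
    return totals

The BPE merge loop then runs over these aggregated pre-token counts rather than the raw text, which is why pre-tokenization is the natural step to parallelize.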

Quick start

# Train a tokenizer
python train.py --input sample-data/TinyStoriesV2-GPT4-valid.txt --vocab-size 10000

# Encode text
python tokenizer.py --encode "Hello world" --vocab trained-tokenizers/TinyStories/vocab.json --merges trained-tokenizers/TinyStories/merges.txt

# Decode tokens
python tokenizer.py --decode "15496 995" --vocab trained-tokenizers/TinyStories/vocab.json --merges trained-tokenizers/TinyStories/merges.txt
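
For intuition, encoding a single pre-token is a greedy loop over the learned merges: start from raw bytes and repeatedly merge the adjacent pair with the lowest merge rank. The sketch below shows that standard BPE loop in isolation, with an assumed merges dict mapping byte pairs to ranks; it is not lifted from tokenizer.py. Decoding is the reverse lookup: map each token id back to its bytes and concatenate.

# Core BPE merge loop for one pre-token (illustrative, not tokenizer.py's code)
def bpe_encode(pretoken: bytes, merges: dict[tuple[bytes, bytes], int]) -> list[bytes]:
    parts = [bytes([b]) for b in pretoken]
    while len(parts) > 1:
        # Rank every adjacent pair that appears in the learned merges.
        ranked = [(merges[pair], i)
                  for i, pair in enumerate(zip(parts, parts[1:]))
                  if pair in merges]
        if not ranked:
            break
        _, i = min(ranked)  # lowest rank = earliest-learned merge wins
        parts[i:i + 2] = [parts[i] + parts[i + 1]]
    return parts

# Example with made-up ranks: bpe_encode(b"Hello", {(b"H", b"e"): 0, (b"l", b"l"): 1})
# returns [b"He", b"ll", b"o"]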

Performance

Profiled with Scalene.

Compression Ratios

Evaluated on validation sets (see the note after this list for how the ratio is computed):

  • OpenWebText (32K vocab): 4.37
  • TinyStories (10K vocab): 4.12
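
Assuming the ratio is defined as input bytes per output token (the usual convention for this metric), it can be computed like so; the helper below is illustrative, not a script in this repo:

# Hypothetical helper: bytes-per-token compression ratio
def compression_ratio(text: str, token_ids: list[int]) -> float:
    return len(text.encode("utf-8")) / len(token_ids)

Under that definition, a ratio of 4.37 means each token covers about 4.37 bytes of the validation text on average.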

Blog post

I wrote about the whole process here: Building a BPE Tokenizer from Scratch.
