Add local json hf tokenization demos #798
Conversation
Walks through obtaining dialogsum, materializing a local HuggingFace tokenizer directory (tokenizer.json + tokenizer_config.json) from a flat JSON list of custom tokens with byte_fallback=true, tokenizing via `prepare.py --method huggingface`, and kicking off a default training run on the dialogsum dataset. https://claude.ai/code/session_01KE24Xpn7PhCAjens27u4YS
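The "materialize a local tokenizer directory" step can be sketched roughly as follows. This is an illustrative reconstruction, not the demo's verbatim code: the token list, output directory name, and tokenizer_config.json keys are assumptions.

```python
import json
import os
from tokenizers import Tokenizer, models, decoders

# Stand-in for the flat JSON list of custom tokens the demo reads from disk.
custom_tokens = ["Hello", "summarize", "▁the"]

# Seed the vocab with <unk> plus the 256 byte tokens that byte_fallback needs.
vocab = {"<unk>": 0}
for i in range(256):
    vocab[f"<0x{i:02X}>"] = len(vocab)
for t in custom_tokens:
    if t not in vocab:
        vocab[t] = len(vocab)

tok = Tokenizer(models.BPE(
    vocab=vocab,
    merges=[],
    unk_token="<unk>",
    fuse_unk=True,
    byte_fallback=True,
))
tok.decoder = decoders.ByteFallback()

# Write the two files AutoTokenizer expects in a local fast-tokenizer dir.
out_dir = "hf_local_tok"
os.makedirs(out_dir, exist_ok=True)
tok.save(os.path.join(out_dir, "tokenizer.json"))
with open(os.path.join(out_dir, "tokenizer_config.json"), "w", encoding="utf-8") as f:
    json.dump({"tokenizer_class": "PreTrainedTokenizerFast", "unk_token": "<unk>"}, f, indent=2)
```

The resulting directory can then be passed to `prepare.py --method huggingface --hf_tokenizer_name <dir>`.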
prepare.py stamps --hf_tokenizer_name verbatim into meta.pkl. A relative path like './hf_local_tok' works at prepare time (cwd = data/dialogsum) but at train time (cwd = repo root) AutoTokenizer treats it as a Hub repo id and trips HF's repo-id regex validator. Pass the realpath instead, and re-tokenize when meta.pkl was written with a different path. https://claude.ai/code/session_01KE24Xpn7PhCAjens27u4YS
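The fix is mechanical: resolve the path before handing it to prepare.py. A minimal illustration (the directory name is hypothetical):

```python
import os

# './hf_local_tok' only resolves while cwd is data/dialogsum; once training
# runs from the repo root the string no longer names a directory, so
# AutoTokenizer falls back to treating it as a Hub repo id and rejects it.
rel = "./hf_local_tok"               # relative path that breaks at train time
tok_dir_abs = os.path.realpath(rel)  # what --hf_tokenizer_name should receive
print(os.path.isabs(tok_dir_abs))    # True, regardless of later cwd changes
```

An absolute path survives the cwd change between prepare time and train time.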
Adds data/template/premade_vocab_sets/letters_punctuation.json with ASCII letters, digits, whitespace (including "\n", "\t", "\r"), and common punctuation, formatted as a flat JSON array of strings (the shape consumed by JsonByteTokenizerWithByteFallback and the local HF tokenizer builder). Adds demos/hf_letters_punct_dialogsum_demo.sh, which snapshots the premade vocab next to the dataset, builds a local HF tokenizer directory with byte_fallback=true, tokenizes dialogsum via --method huggingface, and starts a default training run. https://claude.ai/code/session_01KE24Xpn7PhCAjens27u4YS
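For reference, a vocab file of that shape can be regenerated from the standard library. This is a sketch of the format only; the checked-in letters_punctuation.json is the source of truth for the exact contents.

```python
import json
import string

# Flat JSON array of single-character strings: whitespace (including the
# explicit "\n", "\t", "\r"), ASCII letters, digits, and punctuation.
tokens = ["\n", "\t", "\r", " "]
tokens += list(string.ascii_letters)
tokens += list(string.digits)
tokens += list(string.punctuation)

with open("letters_punctuation.json", "w", encoding="utf-8") as f:
    json.dump(tokens, f, ensure_ascii=False, indent=2)
```

Any character outside this set is handled at tokenization time by byte fallback rather than by the vocab itself.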
Pull request overview
Adds two end-to-end demos showing how to build and use local HuggingFace fast tokenizers from JSON vocabularies (including byte fallback) and run the existing prepare.py + train.py pipeline on dialogsum.
Changes:
- Add a demo that writes an inline JSON vocab, builds a local HF tokenizer with byte_fallback=true, tokenizes dialogsum, and trains.
- Add a companion demo that uses a premade “letters + punctuation” JSON vocab to build a similar local HF tokenizer and run the same pipeline.
- Add the premade JSON vocab set (letters_punctuation.json) under the template vocab sets.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| demos/hf_local_json_byte_fallback_demo.sh | New demo script building a local HF tokenizer from an inline JSON vocab + byte fallback, then running prepare/train. |
| demos/hf_letters_punct_dialogsum_demo.sh | New demo script building a local HF tokenizer from a premade JSON vocab + byte fallback, then running prepare/train. |
| data/template/premade_vocab_sets/letters_punctuation.json | Adds a premade minimal character-level vocab (whitespace + letters/digits/punct). |
```python
tok = Tokenizer(models.BPE(
    vocab=vocab,
    merges=[],
    unk_token="<unk>",
    fuse_unk=True,
    byte_fallback=True,
))
```
The BPE model is created with merges=[], so it can only emit tokens that exist at the initial symbol level (typically single characters/bytes). Multi-character entries in custom_tokens like "▁the" or "Please summarize the following:" will never be produced by the model, which undermines the stated goal of "priming the custom-piece region".
Consider adding custom_tokens as added tokens (e.g., via the tokenizer's added-token mechanism) or generating the necessary merge rules (and intermediate symbols) so these multi-character pieces can actually be emitted.
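One way to apply that fix with the tokenizers library is the added-token mechanism. A minimal sketch, assuming added tokens are acceptable for the custom pieces (token strings here are illustrative):

```python
from tokenizers import Tokenizer, models

# Minimal vocab: just <unk>. Multi-character pieces go in as added tokens
# instead of relying on BPE merges the model does not have.
tok = Tokenizer(models.BPE(vocab={"<unk>": 0}, merges=[], unk_token="<unk>"))
tok.add_tokens(["Please summarize the following:", "▁the"])

# The whole phrase matches a single added token rather than degrading to <unk>.
enc = tok.encode("Please summarize the following:")
print(len(enc.ids), enc.tokens)
```

Added tokens are matched before the model runs, so they are emitted even with merges=[].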
```python
probe = "Hello 👋! #U: Please summarize the following:\n안녕하세요 — Привет."
enc = tok.encode(probe)
dec = tok.decode(enc.ids)
print(f"[hf-local-json] probe len(ids)={len(enc.ids)} round-trip={dec!r}")
```
The "smoke-test" only prints the decoded text; it won’t fail the script if the round-trip is incorrect. To make failures surface immediately (as the comment suggests), add an explicit check (e.g., compare dec vs probe and exit non-zero on mismatch).
Suggested change:

```python
print(f"[hf-local-json] probe len(ids)={len(enc.ids)} round-trip={dec!r}")
if dec != probe:
    raise SystemExit(
        "[hf-local-json] ERROR: tokenizer round-trip mismatch: "
        f"expected={probe!r} got={dec!r}"
    )
```
```python
probe = "Hi Bob!\nHow was your weekend? 👋 (Let's talk — 안녕.)"
enc = tok.encode(probe)
dec = tok.decode(enc.ids)
print(f"[hf-letters-punct] probe len(ids)={len(enc.ids)} round-trip={dec!r}")
```
The "smoke-test" prints the decoded output but doesn’t assert that the tokenizer round-trips correctly. If the decoder/model setup is wrong, this demo will continue and may fail later in less obvious ways; add an explicit dec == probe check (exit non-zero on mismatch) so errors are caught immediately.
Suggested change:

```python
print(f"[hf-letters-punct] probe len(ids)={len(enc.ids)} round-trip={dec!r}")
if dec != probe:
    import sys
    print("[hf-letters-punct] ERROR: tokenizer round-trip mismatch", file=sys.stderr)
    print(f"[hf-letters-punct] expected={probe!r}", file=sys.stderr)
    print(f"[hf-letters-punct] actual ={dec!r}", file=sys.stderr)
    sys.exit(1)
```
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
```python
assert isinstance(custom_tokens, list), "JSON vocab must be a flat list of strings"
```
Avoid using assert for validating the JSON vocab shape in a user-facing/demo script. assert can be disabled with Python optimizations (-O) and yields a generic AssertionError. Replace this with an explicit runtime check that raises a ValueError/TypeError with a clear message (and also validate that all entries are strings).
Suggested change:

```python
if not isinstance(custom_tokens, list):
    raise ValueError("JSON vocab must be a flat list of strings")
for i, tok in enumerate(custom_tokens):
    if not isinstance(tok, str):
        raise TypeError(
            f"JSON vocab entries must all be strings; found {type(tok).__name__} at index {i}"
        )
```
```python
# 5) Train with defaults on dialogsum. The HF tokenizer snapshot written by
#    prepare.py at data/dialogsum/hf_tokenizer/ makes sample.py reloadable
#    without network access (sample.py:1191).
```
The comment here suggests prepare.py’s saved snapshot at data/dialogsum/hf_tokenizer makes sample/train reloadable offline. However meta.pkl currently stores hf_tokenizer_path as a relative path (e.g., "hf_tokenizer"), and train.py/sample.py are typically run from the repo root, so that relative path won’t resolve. Either adjust the comment to reflect that reload will come from hf_tokenizer_name (your absolute TOK_DIR_ABS), or update the flow to ensure meta.pkl contains a repo-root-relative/absolute hf_tokenizer_path that can be resolved at runtime.
Suggested change:

```python
# 5) Train with defaults on dialogsum. This demo passes an absolute local
#    tokenizer directory via --hf_tokenizer_name, and prepare.py persists that
#    value in meta.pkl; that is the path train.py/sample.py can later reuse
#    offline. A local snapshot is also written under data/dialogsum/hf_tokenizer/.
```
```sh
need_tokenize=0
if [ ! -f "train.bin" ] || [ ! -f "val.bin" ] || [ ! -f "meta.pkl" ]; then
  need_tokenize=1
else
  stored_name="$(python3 -c "import pickle,sys; m=pickle.load(open('meta.pkl','rb')); print(m.get('hf_tokenizer_name',''))" 2>/dev/null || true)"
  if [ "${stored_name}" != "${TOK_DIR_ABS}" ]; then
    echo -e "${YELLOW}[STALE]${RESET} meta.pkl stored hf_tokenizer_name='${stored_name}', expected '${TOK_DIR_ABS}'. Re-tokenizing."
    rm -f train.bin val.bin meta.pkl
    rm -rf hf_tokenizer
    need_tokenize=1
  fi
fi

if [ "${need_tokenize}" -eq 1 ]; then
  echo -e "${GREEN}[TOKENIZE]${RESET} prepare.py --method huggingface --hf_tokenizer_name ${TOK_DIR_ABS}"
  python3 prepare.py \
    -t input.txt \
    --method huggingface \
    --hf_tokenizer_name "${TOK_DIR_ABS}" \
    --hf_use_fast \
    -T
```
This demo also tokenizes directly into data/dialogsum/{train.bin,val.bin,meta.pkl} and conditionally deletes them if the stored hf_tokenizer_name differs. That’s risky because it can overwrite or delete an existing dialogsum preprocessing run. Prefer writing demo outputs into a dedicated subdirectory (or separate dataset folder) by passing explicit output paths / using prepare.py’s subdir flags, and avoid removing the canonical artifacts.
```python
with open(json_vocab_path, "r", encoding="utf-8") as f:
    custom_tokens = json.load(f)
assert isinstance(custom_tokens, list), "JSON vocab must be a flat list of strings"
```
Same as the other demo: avoid assert for validating the JSON vocab contents. Use an explicit check that raises a clear exception, and verify each element is a string to prevent later failures inside tokenizers/transformers.
Suggested change:

```python
if not isinstance(custom_tokens, list):
    raise ValueError("JSON vocab must be a flat list of strings")
for i, tok in enumerate(custom_tokens):
    if not isinstance(tok, str):
        raise ValueError(
            f"JSON vocab element at index {i} must be a string, got {type(tok).__name__}"
        )
```
```sh
  if [ "${stored_name}" != "${TOK_DIR_ABS}" ]; then
    echo -e "${YELLOW}[STALE]${RESET} meta.pkl stored hf_tokenizer_name='${stored_name}', expected '${TOK_DIR_ABS}'. Re-tokenizing."
    rm -f train.bin val.bin meta.pkl
    rm -rf hf_tokenizer
    need_tokenize=1
  fi
fi

if [ "${need_tokenize}" -eq 1 ]; then
  echo -e "${GREEN}[TOKENIZE]${RESET} prepare.py --method huggingface --hf_tokenizer_name ${TOK_DIR_ABS}"
  python3 prepare.py \
    -t input.txt \
    --method huggingface \
    --hf_tokenizer_name "${TOK_DIR_ABS}" \
    --hf_use_fast \
    -T
```
This demo deletes/recreates data/dialogsum/{train.bin,val.bin,meta.pkl} in-place (and even removes the hf_tokenizer snapshot dir). That can clobber an existing tokenization for dialogsum that users may rely on. Prefer writing all artifacts into a dedicated demo subdirectory (like other demos do) by passing explicit --train_output/--val_output/--meta_output_path or using prepare.py’s --output_tokenization_subdir/--output_subdir_suffix, and avoid rm -f of the canonical files.
No description provided.