Add local json hf tokenization demos#798

Open
klei22 wants to merge 3 commits into ReaLLMASIC:master from klei22:add-local-json-hf-tokenization-demos

Conversation

Collaborator

@klei22 klei22 commented Apr 15, 2026

No description provided.

claude added 3 commits April 15, 2026 00:06
Walks through obtaining dialogsum, materializing a local HuggingFace
tokenizer directory (tokenizer.json + tokenizer_config.json) from a
flat JSON list of custom tokens with byte_fallback=true, tokenizing
via `prepare.py --method huggingface`, and kicking off a default
training run on the dialogsum dataset.

https://claude.ai/code/session_01KE24Xpn7PhCAjens27u4YS
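The vocab-construction step this commit describes can be sketched with the standard library alone. This is a hedged illustration, not the PR's code: the special-token names and the id ordering (unk first, then the `<0x00>`..`<0xFF>` byte-fallback pieces, then custom tokens) are assumptions about what a byte_fallback BPE `tokenizer.json` expects.

```python
# Hedged sketch (not the PR's code): turn a flat JSON list of custom
# tokens into the id mapping a byte_fallback BPE tokenizer.json needs.
import json

def build_vocab(custom_tokens):
    vocab = {"<unk>": 0}                 # an unk token must exist for fuse_unk
    for b in range(256):                 # byte_fallback expects <0x00>..<0xFF>
        vocab[f"<0x{b:02X}>"] = len(vocab)
    for tok in custom_tokens:
        if tok not in vocab:             # skip duplicates, keep first id
            vocab[tok] = len(vocab)
    return vocab

custom_tokens = json.loads('["\\n", " ", "a", "\\u2581the"]')
vocab = build_vocab(custom_tokens)
print(len(vocab))  # 1 unk + 256 byte pieces + 4 custom tokens
```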
prepare.py stamps --hf_tokenizer_name verbatim into meta.pkl. A
relative path like './hf_local_tok' works at prepare time (cwd =
data/dialogsum) but at train time (cwd = repo root) AutoTokenizer
treats it as a Hub repo id and trips HF's repo-id regex validator.
Pass the realpath instead, and re-tokenize when meta.pkl was written
with a different path.

https://claude.ai/code/session_01KE24Xpn7PhCAjens27u4YS
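The path pitfall this commit fixes can be reproduced in a few lines. Paths below are illustrative; the point is that a relative string stamped verbatim into meta.pkl only resolves from prepare.py's working directory, while a realpath resolves from anywhere and cannot be mistaken for a Hub repo id.

```python
# Sketch of the relative-vs-absolute tokenizer path issue. The
# directory name is illustrative; it need not exist for realpath.
import os

relative = "./hf_local_tok"            # resolves only when cwd == data/dialogsum
absolute = os.path.realpath(relative)  # what the demo passes instead

assert not os.path.isabs(relative)
assert os.path.isabs(absolute)
print(absolute)
```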
Adds data/template/premade_vocab_sets/letters_punctuation.json with
ASCII letters, digits, whitespace (including "\n", "\t", "\r"), and
common punctuation, formatted as a flat JSON array of strings (the
shape consumed by JsonByteTokenizerWithByteFallback and the local HF
tokenizer builder). Adds demos/hf_letters_punct_dialogsum_demo.sh,
which snapshots the premade vocab next to the dataset, builds a local
HF tokenizer directory with byte_fallback=true, tokenizes dialogsum
via --method huggingface, and starts a default training run.

https://claude.ai/code/session_01KE24Xpn7PhCAjens27u4YS
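The flat-array shape described above can be sketched as follows. The entries here are a small illustrative subset assembled from the commit message's description, not the actual contents of letters_punctuation.json.

```python
# Illustrative subset of the flat JSON array shape: whitespace,
# ASCII letters, digits, common punctuation. Not the real file.
import json
import string

vocab = ["\n", "\t", "\r", " "]
vocab += list(string.ascii_letters)   # a-z, A-Z
vocab += list(string.digits)          # 0-9
vocab += list(".,;:!?'\"()-")         # common punctuation

snapshot = json.dumps(vocab, ensure_ascii=False)
assert isinstance(json.loads(snapshot), list)
assert all(isinstance(t, str) for t in json.loads(snapshot))
print(len(vocab))  # 4 + 52 + 10 + 11 entries
```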
Copilot AI review requested due to automatic review settings April 15, 2026 05:17

Copilot AI left a comment


Pull request overview

Adds two end-to-end demos showing how to build and use local HuggingFace fast tokenizers from JSON vocabularies (including byte fallback) and run the existing prepare.py + train.py pipeline on dialogsum.

Changes:

  • Add a demo that writes an inline JSON vocab, builds a local HF tokenizer with byte_fallback=true, tokenizes dialogsum, and trains.
  • Add a companion demo that uses a premade “letters + punctuation” JSON vocab to build a similar local HF tokenizer and run the same pipeline.
  • Add the premade JSON vocab set (letters_punctuation.json) under the template vocab sets.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
demos/hf_local_json_byte_fallback_demo.sh New demo script building a local HF tokenizer from an inline JSON vocab + byte fallback, then running prepare/train.
demos/hf_letters_punct_dialogsum_demo.sh New demo script building a local HF tokenizer from a premade JSON vocab + byte fallback, then running prepare/train.
data/template/premade_vocab_sets/letters_punctuation.json Adds a premade minimal character-level vocab (whitespace + letters/digits/punct).


Comment on lines +162 to +168
tok = Tokenizer(models.BPE(
    vocab=vocab,
    merges=[],
    unk_token="<unk>",
    fuse_unk=True,
    byte_fallback=True,
))

Copilot AI Apr 15, 2026


The BPE model is created with merges=[], so it can only emit tokens that exist at the initial symbol level (typically single characters/bytes). Multi-character entries in custom_tokens like "▁the" or "Please summarize the following:" will never be produced by the model, which undermines the stated goal of "priming the custom-piece region".

Consider adding custom_tokens as added tokens (e.g., via the tokenizer's added-token mechanism) or generating the necessary merge rules (and intermediate symbols) so these multi-character pieces can actually be emitted.
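The second option the comment mentions can be sketched by generating left-to-right merge rules for each multi-character piece. This is an illustration of the idea, not the PR's code, and it assumes a real generator would also add each intermediate prefix (e.g. "th") to the vocab.

```python
# Sketch: derive BPE merge rules so a multi-character token is
# reachable from single characters, merging left to right.
def merges_for(token):
    merges = []
    left = token[0]
    for ch in token[1:]:
        merges.append((left, ch))  # each pair must be mergeable in order
        left += ch                 # the growing prefix ("t" -> "th" -> "the")
    return merges

print(merges_for("the"))  # [('t', 'h'), ('th', 'e')]
```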

probe = "Hello 👋! #U: Please summarize the following:\n안녕하세요 — Привет."
enc = tok.encode(probe)
dec = tok.decode(enc.ids)
print(f"[hf-local-json] probe len(ids)={len(enc.ids)} round-trip={dec!r}")

Copilot AI Apr 15, 2026


The "smoke-test" only prints the decoded text; it won’t fail the script if the round-trip is incorrect. To make failures surface immediately (as the comment suggests), add an explicit check (e.g., compare dec vs probe and exit non-zero on mismatch).

Suggested change
print(f"[hf-local-json] probe len(ids)={len(enc.ids)} round-trip={dec!r}")
print(f"[hf-local-json] probe len(ids)={len(enc.ids)} round-trip={dec!r}")
if dec != probe:
    raise SystemExit(
        "[hf-local-json] ERROR: tokenizer round-trip mismatch: "
        f"expected={probe!r} got={dec!r}"
    )

probe = "Hi Bob!\nHow was your weekend? 👋 (Let's talk — 안녕.)"
enc = tok.encode(probe)
dec = tok.decode(enc.ids)
print(f"[hf-letters-punct] probe len(ids)={len(enc.ids)} round-trip={dec!r}")

Copilot AI Apr 15, 2026


The "smoke-test" prints the decoded output but doesn’t assert that the tokenizer round-trips correctly. If the decoder/model setup is wrong, this demo will continue and may fail later in less obvious ways; add an explicit dec == probe check (exit non-zero on mismatch) so errors are caught immediately.

Suggested change
print(f"[hf-letters-punct] probe len(ids)={len(enc.ids)} round-trip={dec!r}")
print(f"[hf-letters-punct] probe len(ids)={len(enc.ids)} round-trip={dec!r}")
if dec != probe:
    import sys
    print("[hf-letters-punct] ERROR: tokenizer round-trip mismatch", file=sys.stderr)
    print(f"[hf-letters-punct] expected={probe!r}", file=sys.stderr)
    print(f"[hf-letters-punct] actual ={dec!r}", file=sys.stderr)
    sys.exit(1)


Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.



Comment on lines +142 to +143
assert isinstance(custom_tokens, list), "JSON vocab must be a flat list of strings"


Copilot AI Apr 15, 2026


Avoid using assert for validating the JSON vocab shape in a user-facing/demo script. assert can be disabled with Python optimizations (-O) and yields a generic AssertionError. Replace this with an explicit runtime check that raises a ValueError/TypeError with a clear message (and also validate that all entries are strings).

Suggested change
assert isinstance(custom_tokens, list), "JSON vocab must be a flat list of strings"
if not isinstance(custom_tokens, list):
    raise ValueError("JSON vocab must be a flat list of strings")
for i, tok in enumerate(custom_tokens):
    if not isinstance(tok, str):
        raise TypeError(
            f"JSON vocab entries must all be strings; found {type(tok).__name__} at index {i}"
        )

Comment on lines +245 to +247
# 5) Train with defaults on dialogsum. The HF tokenizer snapshot written by
# prepare.py at data/dialogsum/hf_tokenizer/ makes sample.py reloadable
# without network access (sample.py:1191).

Copilot AI Apr 15, 2026


The comment here suggests prepare.py’s saved snapshot at data/dialogsum/hf_tokenizer makes sample/train reloadable offline. However meta.pkl currently stores hf_tokenizer_path as a relative path (e.g., "hf_tokenizer"), and train.py/sample.py are typically run from the repo root, so that relative path won’t resolve. Either adjust the comment to reflect that reload will come from hf_tokenizer_name (your absolute TOK_DIR_ABS), or update the flow to ensure meta.pkl contains a repo-root-relative/absolute hf_tokenizer_path that can be resolved at runtime.

Suggested change
# 5) Train with defaults on dialogsum. The HF tokenizer snapshot written by
# prepare.py at data/dialogsum/hf_tokenizer/ makes sample.py reloadable
# without network access (sample.py:1191).
# 5) Train with defaults on dialogsum. This demo passes an absolute local
# tokenizer directory via --hf_tokenizer_name, and prepare.py persists that
# value in meta.pkl; that is the path train.py/sample.py can later reuse
# offline. A local snapshot is also written under data/dialogsum/hf_tokenizer/.

Comment on lines +151 to +171
need_tokenize=0
if [ ! -f "train.bin" ] || [ ! -f "val.bin" ] || [ ! -f "meta.pkl" ]; then
  need_tokenize=1
else
  stored_name="$(python3 -c "import pickle,sys; m=pickle.load(open('meta.pkl','rb')); print(m.get('hf_tokenizer_name',''))" 2>/dev/null || true)"
  if [ "${stored_name}" != "${TOK_DIR_ABS}" ]; then
    echo -e "${YELLOW}[STALE]${RESET} meta.pkl stored hf_tokenizer_name='${stored_name}', expected '${TOK_DIR_ABS}'. Re-tokenizing."
    rm -f train.bin val.bin meta.pkl
    rm -rf hf_tokenizer
    need_tokenize=1
  fi
fi

if [ "${need_tokenize}" -eq 1 ]; then
  echo -e "${GREEN}[TOKENIZE]${RESET} prepare.py --method huggingface --hf_tokenizer_name ${TOK_DIR_ABS}"
  python3 prepare.py \
    -t input.txt \
    --method huggingface \
    --hf_tokenizer_name "${TOK_DIR_ABS}" \
    --hf_use_fast \
    -T

Copilot AI Apr 15, 2026


This demo also tokenizes directly into data/dialogsum/{train.bin,val.bin,meta.pkl} and conditionally deletes them if the stored hf_tokenizer_name differs. That’s risky because it can overwrite or delete an existing dialogsum preprocessing run. Prefer writing demo outputs into a dedicated subdirectory (or separate dataset folder) by passing explicit output paths / using prepare.py’s subdir flags, and avoid removing the canonical artifacts.


with open(json_vocab_path, "r", encoding="utf-8") as f:
    custom_tokens = json.load(f)
assert isinstance(custom_tokens, list), "JSON vocab must be a flat list of strings"

Copilot AI Apr 15, 2026


Same as the other demo: avoid assert for validating the JSON vocab contents. Use an explicit check that raises a clear exception, and verify each element is a string to prevent later failures inside tokenizers/transformers.

Suggested change
assert isinstance(custom_tokens, list), "JSON vocab must be a flat list of strings"
if not isinstance(custom_tokens, list):
    raise ValueError("JSON vocab must be a flat list of strings")
for i, tok in enumerate(custom_tokens):
    if not isinstance(tok, str):
        raise ValueError(
            f"JSON vocab element at index {i} must be a string, got {type(tok).__name__}"
        )

Comment on lines +223 to +238
  if [ "${stored_name}" != "${TOK_DIR_ABS}" ]; then
    echo -e "${YELLOW}[STALE]${RESET} meta.pkl stored hf_tokenizer_name='${stored_name}', expected '${TOK_DIR_ABS}'. Re-tokenizing."
    rm -f train.bin val.bin meta.pkl
    rm -rf hf_tokenizer
    need_tokenize=1
  fi
fi

if [ "${need_tokenize}" -eq 1 ]; then
  echo -e "${GREEN}[TOKENIZE]${RESET} prepare.py --method huggingface --hf_tokenizer_name ${TOK_DIR_ABS}"
  python3 prepare.py \
    -t input.txt \
    --method huggingface \
    --hf_tokenizer_name "${TOK_DIR_ABS}" \
    --hf_use_fast \
    -T

Copilot AI Apr 15, 2026


This demo deletes/recreates data/dialogsum/{train.bin,val.bin,meta.pkl} in-place (and even removes the hf_tokenizer snapshot dir). That can clobber an existing tokenization for dialogsum that users may rely on. Prefer writing all artifacts into a dedicated demo subdirectory (like other demos do) by passing explicit --train_output/--val_output/--meta_output_path or using prepare.py’s --output_tokenization_subdir/--output_subdir_suffix, and avoid rm -f of the canonical files.
