Add local json hf tokenization demos #798
Conversation
Walks through obtaining dialogsum, materializing a local HuggingFace tokenizer directory (tokenizer.json + tokenizer_config.json) from a flat JSON list of custom tokens with byte_fallback=true, tokenizing via `prepare.py --method huggingface`, and kicking off a default training run on the dialogsum dataset. https://claude.ai/code/session_01KE24Xpn7PhCAjens27u4YS
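The "materialize a local tokenizer directory" step can be sketched roughly as follows. This is an illustrative reconstruction, not the demo's verbatim code: the token list, output directory name, and tokenizer_config.json keys are assumptions.

```python
import json
import os
from tokenizers import Tokenizer, models, decoders

# Stand-in for the flat JSON list of custom tokens the demo reads from disk.
custom_tokens = ["Hello", "summarize", "▁the"]

# Seed the vocab with <unk> plus the 256 byte tokens that byte_fallback needs.
vocab = {"<unk>": 0}
for i in range(256):
    vocab[f"<0x{i:02X}>"] = len(vocab)
for t in custom_tokens:
    if t not in vocab:
        vocab[t] = len(vocab)

tok = Tokenizer(models.BPE(
    vocab=vocab,
    merges=[],
    unk_token="<unk>",
    fuse_unk=True,
    byte_fallback=True,
))
tok.decoder = decoders.ByteFallback()

# Write the two files AutoTokenizer expects in a local fast-tokenizer dir.
out_dir = "hf_local_tok"
os.makedirs(out_dir, exist_ok=True)
tok.save(os.path.join(out_dir, "tokenizer.json"))
with open(os.path.join(out_dir, "tokenizer_config.json"), "w", encoding="utf-8") as f:
    json.dump({"tokenizer_class": "PreTrainedTokenizerFast", "unk_token": "<unk>"}, f, indent=2)
```

The resulting directory can then be passed to `prepare.py --method huggingface --hf_tokenizer_name <dir>`.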
prepare.py stamps --hf_tokenizer_name verbatim into meta.pkl. A relative path like './hf_local_tok' works at prepare time (cwd = data/dialogsum) but at train time (cwd = repo root) AutoTokenizer treats it as a Hub repo id and trips HF's repo-id regex validator. Pass the realpath instead, and re-tokenize when meta.pkl was written with a different path. https://claude.ai/code/session_01KE24Xpn7PhCAjens27u4YS
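The fix is mechanical: resolve the path before handing it to prepare.py. A minimal illustration (the directory name is hypothetical):

```python
import os

# './hf_local_tok' only resolves while cwd is data/dialogsum; once training
# runs from the repo root the string no longer names a directory, so
# AutoTokenizer falls back to treating it as a Hub repo id and rejects it.
rel = "./hf_local_tok"               # relative path that breaks at train time
tok_dir_abs = os.path.realpath(rel)  # what --hf_tokenizer_name should receive
print(os.path.isabs(tok_dir_abs))    # True, regardless of later cwd changes
```

An absolute path survives the cwd change between prepare time and train time.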
Adds data/template/premade_vocab_sets/letters_punctuation.json with ASCII letters, digits, whitespace (including "\n", "\t", "\r"), and common punctuation, formatted as a flat JSON array of strings (the shape consumed by JsonByteTokenizerWithByteFallback and the local HF tokenizer builder). Adds demos/hf_letters_punct_dialogsum_demo.sh, which snapshots the premade vocab next to the dataset, builds a local HF tokenizer directory with byte_fallback=true, tokenizes dialogsum via --method huggingface, and starts a default training run. https://claude.ai/code/session_01KE24Xpn7PhCAjens27u4YS
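For reference, a vocab file of that shape can be regenerated from the standard library. This is a sketch of the format only; the checked-in letters_punctuation.json is the source of truth for the exact contents.

```python
import json
import string

# Flat JSON array of single-character strings: whitespace (including the
# explicit "\n", "\t", "\r"), ASCII letters, digits, and punctuation.
tokens = ["\n", "\t", "\r", " "]
tokens += list(string.ascii_letters)
tokens += list(string.digits)
tokens += list(string.punctuation)

with open("letters_punctuation.json", "w", encoding="utf-8") as f:
    json.dump(tokens, f, ensure_ascii=False, indent=2)
```

Any character outside this set is handled at tokenization time by byte fallback rather than by the vocab itself.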
Pull request overview
Adds two end-to-end demos showing how to build and use local HuggingFace fast tokenizers from JSON vocabularies (including byte fallback) and run the existing prepare.py + train.py pipeline on dialogsum.
Changes:
- Add a demo that writes an inline JSON vocab, builds a local HF tokenizer with byte_fallback=true, tokenizes dialogsum, and trains.
- Add a companion demo that uses a premade “letters + punctuation” JSON vocab to build a similar local HF tokenizer and run the same pipeline.
- Add the premade JSON vocab set (letters_punctuation.json) under the template vocab sets.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| demos/hf_local_json_byte_fallback_demo.sh | New demo script building a local HF tokenizer from an inline JSON vocab + byte fallback, then running prepare/train. |
| demos/hf_letters_punct_dialogsum_demo.sh | New demo script building a local HF tokenizer from a premade JSON vocab + byte fallback, then running prepare/train. |
| data/template/premade_vocab_sets/letters_punctuation.json | Adds a premade minimal character-level vocab (whitespace + letters/digits/punct). |
```python
tok = Tokenizer(models.BPE(
    vocab=vocab,
    merges=[],
    unk_token="<unk>",
    fuse_unk=True,
    byte_fallback=True,
))
```
The BPE model is created with merges=[], so it can only emit tokens that exist at the initial symbol level (typically single characters/bytes). Multi-character entries in custom_tokens like "▁the" or "Please summarize the following:" will never be produced by the model, which undermines the stated goal of "priming the custom-piece region".
Consider adding custom_tokens as added tokens (e.g., via the tokenizer's added-token mechanism) or generating the necessary merge rules (and intermediate symbols) so these multi-character pieces can actually be emitted.
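One way to apply that fix with the tokenizers library is the added-token mechanism. A minimal sketch, assuming added tokens are acceptable for the custom pieces (token strings here are illustrative):

```python
from tokenizers import Tokenizer, models

# Minimal vocab: just <unk>. Multi-character pieces go in as added tokens
# instead of relying on BPE merges the model does not have.
tok = Tokenizer(models.BPE(vocab={"<unk>": 0}, merges=[], unk_token="<unk>"))
tok.add_tokens(["Please summarize the following:", "▁the"])

# The whole phrase matches a single added token rather than degrading to <unk>.
enc = tok.encode("Please summarize the following:")
print(len(enc.ids), enc.tokens)
```

Added tokens are matched before the model runs, so they are emitted even with merges=[].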
```python
probe = "Hello 👋! #U: Please summarize the following:\n안녕하세요 — Привет."
enc = tok.encode(probe)
dec = tok.decode(enc.ids)
print(f"[hf-local-json] probe len(ids)={len(enc.ids)} round-trip={dec!r}")
```
The "smoke-test" only prints the decoded text; it won’t fail the script if the round-trip is incorrect. To make failures surface immediately (as the comment suggests), add an explicit check (e.g., compare dec vs probe and exit non-zero on mismatch).
Suggested change:

```python
print(f"[hf-local-json] probe len(ids)={len(enc.ids)} round-trip={dec!r}")
if dec != probe:
    raise SystemExit(
        "[hf-local-json] ERROR: tokenizer round-trip mismatch: "
        f"expected={probe!r} got={dec!r}"
    )
```
```python
probe = "Hi Bob!\nHow was your weekend? 👋 (Let's talk — 안녕.)"
enc = tok.encode(probe)
dec = tok.decode(enc.ids)
print(f"[hf-letters-punct] probe len(ids)={len(enc.ids)} round-trip={dec!r}")
```
The "smoke-test" prints the decoded output but doesn’t assert that the tokenizer round-trips correctly. If the decoder/model setup is wrong, this demo will continue and may fail later in less obvious ways; add an explicit dec == probe check (exit non-zero on mismatch) so errors are caught immediately.
Suggested change:

```python
print(f"[hf-letters-punct] probe len(ids)={len(enc.ids)} round-trip={dec!r}")
if dec != probe:
    import sys
    print("[hf-letters-punct] ERROR: tokenizer round-trip mismatch", file=sys.stderr)
    print(f"[hf-letters-punct] expected={probe!r}", file=sys.stderr)
    print(f"[hf-letters-punct] actual ={dec!r}", file=sys.stderr)
    sys.exit(1)
```
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
```python
assert isinstance(custom_tokens, list), "JSON vocab must be a flat list of strings"
```
Avoid using assert for validating the JSON vocab shape in a user-facing/demo script. assert can be disabled with Python optimizations (-O) and yields a generic AssertionError. Replace this with an explicit runtime check that raises a ValueError/TypeError with a clear message (and also validate that all entries are strings).
Suggested change:

```python
if not isinstance(custom_tokens, list):
    raise ValueError("JSON vocab must be a flat list of strings")
for i, tok in enumerate(custom_tokens):
    if not isinstance(tok, str):
        raise TypeError(
            f"JSON vocab entries must all be strings; found {type(tok).__name__} at index {i}"
        )
```
```python
# 5) Train with defaults on dialogsum. The HF tokenizer snapshot written by
#    prepare.py at data/dialogsum/hf_tokenizer/ makes sample.py reloadable
#    without network access (sample.py:1191).
```
The comment here suggests prepare.py’s saved snapshot at data/dialogsum/hf_tokenizer makes sample/train reloadable offline. However meta.pkl currently stores hf_tokenizer_path as a relative path (e.g., "hf_tokenizer"), and train.py/sample.py are typically run from the repo root, so that relative path won’t resolve. Either adjust the comment to reflect that reload will come from hf_tokenizer_name (your absolute TOK_DIR_ABS), or update the flow to ensure meta.pkl contains a repo-root-relative/absolute hf_tokenizer_path that can be resolved at runtime.
Suggested change:

```python
# 5) Train with defaults on dialogsum. This demo passes an absolute local
#    tokenizer directory via --hf_tokenizer_name, and prepare.py persists that
#    value in meta.pkl; that is the path train.py/sample.py can later reuse
#    offline. A local snapshot is also written under data/dialogsum/hf_tokenizer/.
```
```sh
need_tokenize=0
if [ ! -f "train.bin" ] || [ ! -f "val.bin" ] || [ ! -f "meta.pkl" ]; then
  need_tokenize=1
else
  stored_name="$(python3 -c "import pickle,sys; m=pickle.load(open('meta.pkl','rb')); print(m.get('hf_tokenizer_name',''))" 2>/dev/null || true)"
  if [ "${stored_name}" != "${TOK_DIR_ABS}" ]; then
    echo -e "${YELLOW}[STALE]${RESET} meta.pkl stored hf_tokenizer_name='${stored_name}', expected '${TOK_DIR_ABS}'. Re-tokenizing."
    rm -f train.bin val.bin meta.pkl
    rm -rf hf_tokenizer
    need_tokenize=1
  fi
fi

if [ "${need_tokenize}" -eq 1 ]; then
  echo -e "${GREEN}[TOKENIZE]${RESET} prepare.py --method huggingface --hf_tokenizer_name ${TOK_DIR_ABS}"
  python3 prepare.py \
    -t input.txt \
    --method huggingface \
    --hf_tokenizer_name "${TOK_DIR_ABS}" \
    --hf_use_fast \
    -T
```
This demo also tokenizes directly into data/dialogsum/{train.bin,val.bin,meta.pkl} and conditionally deletes them if the stored hf_tokenizer_name differs. That’s risky because it can overwrite or delete an existing dialogsum preprocessing run. Prefer writing demo outputs into a dedicated subdirectory (or separate dataset folder) by passing explicit output paths / using prepare.py’s subdir flags, and avoid removing the canonical artifacts.
```python
with open(json_vocab_path, "r", encoding="utf-8") as f:
    custom_tokens = json.load(f)
assert isinstance(custom_tokens, list), "JSON vocab must be a flat list of strings"
```
Same as the other demo: avoid assert for validating the JSON vocab contents. Use an explicit check that raises a clear exception, and verify each element is a string to prevent later failures inside tokenizers/transformers.
Suggested change:

```python
if not isinstance(custom_tokens, list):
    raise ValueError("JSON vocab must be a flat list of strings")
for i, tok in enumerate(custom_tokens):
    if not isinstance(tok, str):
        raise ValueError(
            f"JSON vocab element at index {i} must be a string, got {type(tok).__name__}"
        )
```
```sh
  if [ "${stored_name}" != "${TOK_DIR_ABS}" ]; then
    echo -e "${YELLOW}[STALE]${RESET} meta.pkl stored hf_tokenizer_name='${stored_name}', expected '${TOK_DIR_ABS}'. Re-tokenizing."
    rm -f train.bin val.bin meta.pkl
    rm -rf hf_tokenizer
    need_tokenize=1
  fi
fi

if [ "${need_tokenize}" -eq 1 ]; then
  echo -e "${GREEN}[TOKENIZE]${RESET} prepare.py --method huggingface --hf_tokenizer_name ${TOK_DIR_ABS}"
  python3 prepare.py \
    -t input.txt \
    --method huggingface \
    --hf_tokenizer_name "${TOK_DIR_ABS}" \
    --hf_use_fast \
    -T
```
This demo deletes/recreates data/dialogsum/{train.bin,val.bin,meta.pkl} in-place (and even removes the hf_tokenizer snapshot dir). That can clobber an existing tokenization for dialogsum that users may rely on. Prefer writing all artifacts into a dedicated demo subdirectory (like other demos do) by passing explicit --train_output/--val_output/--meta_output_path or using prepare.py’s --output_tokenization_subdir/--output_subdir_suffix, and avoid rm -f of the canonical files.
No description provided.