Conversation
- Rename colorize_dataset.py -> token_prediction_colorizer.py for clarity
- Update references in demos/dataset_colorization.sh and docstring
- Add gemma_tokenization_highlighter.py: terminal + interactive HTML visualization of Gemma 270M tokenizer segmentation with hover tooltips and switchable colour modes (token_id, char_length, byte_length)
- Add test_gemma_en_ko_heatmap.py: fetches EN-KO pairs from Helsinki-NLP/opus-100, produces side-by-side tokenization heatmaps with summary statistics (token counts, bytes/token ratios)

https://claude.ai/code/session_01GfYcQ7Rp4bhDwcwoBYowRV
Both gemma_tokenization_highlighter.py and test_gemma_en_ko_heatmap.py now support --inference to load the full Gemma 270M model and colour tokens by next-token prediction quality:

- rank: green = rank 1 (top prediction), red = rank >= --rank_red
- probability: green = 1.0 softmax confidence, red = 0.0 (absolute)
- minmax: red = lowest prob in sequence, green = highest (relative)

HTML tooltips show rank and probability on hover. Summary tables include avg rank and avg probability columns when inference is enabled.

https://claude.ai/code/session_01GfYcQ7Rp4bhDwcwoBYowRV
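The absolute probability mode amounts to a red-to-green interpolation over the softmax confidence. A minimal sketch of that mapping (`rg_gradient` here is a stand-in for illustration, not the PR's actual helper):

```python
def rg_gradient(t: float) -> str:
    """Map t in [0, 1] to a hex colour: 0.0 -> red, 1.0 -> green."""
    t = max(0.0, min(1.0, t))  # clamp out-of-range inputs
    r = int(255 * (1 - t))
    g = int(255 * t)
    return f"#{r:02x}{g:02x}00"

# probability mode: colour each token directly by its softmax confidence
print(rg_gradient(1.0))  # "#00ff00" -> green, fully confident prediction
print(rg_gradient(0.0))  # "#ff0000" -> red, zero probability
```

The minmax mode would feed this same gradient the sequence-relative value `(p - p_min) / (p_max - p_min)` instead of the raw probability.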
- gemma_tokenization_highlighter_demo.sh: runs all 6 colour modes (token_id, char_length, byte_length, rank, probability, minmax) on mixed EN/KO/accented text with both terminal and HTML output
- gemma_en_ko_heatmap_demo.sh: runs all 6 colour modes on 5 EN-KO translation pairs from Helsinki-NLP/opus-100

https://claude.ai/code/session_01GfYcQ7Rp4bhDwcwoBYowRV
Pull request overview
Adds Gemma-focused tokenization visualization utilities (terminal + interactive HTML heatmaps) and updates demos/docs to use the current dataset colorization script name.
Changes:
- Add a Gemma tokenizer highlighter for arbitrary text with optional next-token inference colouring.
- Add an EN↔KO translation-pair heatmap generator (tokenization + optional inference overlays) with HTML output.
- Add demo scripts and update existing demo/docs references to token_prediction_colorizer.py.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| token_prediction_colorizer.py | Updates top-of-file usage example to reference the correct script name. |
| huggingface_model/gemma/270M/test_gemma_en_ko_heatmap.py | New EN↔KO tokenization + inference heatmap generator (terminal + HTML). |
| huggingface_model/gemma/270M/gemma_tokenization_highlighter.py | New single-text tokenization highlighter with optional inference-based colouring. |
| demos/gemma_tokenization_highlighter_demo.sh | Demo runner for all colour modes of the new highlighter. |
| demos/gemma_en_ko_heatmap_demo.sh | Demo runner for all colour modes of the new EN↔KO heatmap tool. |
| demos/dataset_colorization.sh | Updates demo to call token_prediction_colorizer.py instead of the old script name. |
Comment on lines +607 to +610:

```javascript
html += `<span class="tok" style="background:${{bg}};color:${{fg}}" ` +
        `data-tid="${{t.id}}" data-tok="${{escHtml(t.tok)}}" data-orig="${{escHtml(t.orig)}}" ` +
        `data-chars="${{t.chars}}" data-bytes="${{t.bytes}}" ` +
        `data-rank="${{t.rank}}" data-prob="${{t.prob}}">${{escHtml(display)}}</span>`;
```
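The doubled braces (`${{...}}`) in this JavaScript indicate it is emitted from inside a Python f-string, where `{{` escapes to a literal `{`. A minimal reproduction of the pattern (the `t_id` variable is hypothetical):

```python
t_id = 42
# Inside a Python f-string, {{ and }} emit literal braces, so ${{t.id}}
# becomes the JavaScript template placeholder ${t.id}, while {t_id}
# is interpolated by Python before the page is written out.
js = f'html += `<span data-tid="${{t.id}}">token {t_id}</span>`;'
print(js)
```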
Comment on lines +190 to +196:

```python
if mode == "byte_length":
    bls = [len(o.encode("utf-8")) for _, _, o in spans]
    lo, hi = (min(bls), max(bls)) if bls else (0, 1)
    return [_rg_gradient(1 - (bl - lo) / (hi - lo + 1e-9)) for bl in bls]

if inference_results is None:
    # fallback to byte_length
```
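The byte_length branch is a min-max normalisation over per-token UTF-8 byte lengths; the `1e-9` guards against division by zero when every token has the same length. A self-contained walk-through with hypothetical `(token_id, token_string, original_segment)` spans:

```python
# Hypothetical spans; Hangul syllables are 3 bytes each in UTF-8,
# so the Korean segment is longer in bytes than in characters.
spans = [(0, "Hello", "Hello"), (1, " 안녕", " 안녕")]

bls = [len(o.encode("utf-8")) for _, _, o in spans]
lo, hi = (min(bls), max(bls)) if bls else (0, 1)
norm = [1 - (bl - lo) / (hi - lo + 1e-9) for bl in bls]

print(bls)   # [5, 7]: "Hello" is 5 bytes, " 안녕" is 1 + 3 + 3 = 7
print(norm)  # ~1.0 for the shortest token (green), ~0.0 for the longest (red)
```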
Comment on lines +134 to +141:

```python
for i in range(1, len(token_ids)):
    step_logits = logits[i - 1]
    probs = F.softmax(step_logits, dim=-1)
    target_id = token_ids[i]
    prob = probs[target_id].item()
    rank = int((step_logits > step_logits[target_id]).sum().item()) + 1
    results.append({"rank": rank, "probability": prob})
```
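The rank here counts how many vocabulary logits strictly exceed the target token's logit, plus one, so tied logits share the best rank. The snippet uses torch tensor ops; the same logic in plain Python over a toy logit vector, as an illustration:

```python
import math

def rank_and_prob(step_logits, target_id):
    """Rank (1 = top prediction) and softmax probability of the target token."""
    # Numerically stable softmax over the toy logit vector.
    m = max(step_logits)
    exps = [math.exp(x - m) for x in step_logits]
    prob = exps[target_id] / sum(exps)
    # Count logits strictly greater than the target's; +1 gives a 1-based rank.
    rank = sum(1 for x in step_logits if x > step_logits[target_id]) + 1
    return rank, prob

logits = [2.0, 0.5, 3.0, 1.0]   # hypothetical logits over a 4-token vocabulary
print(rank_and_prob(logits, 2))  # token 2 has the highest logit -> rank 1
print(rank_and_prob(logits, 1))  # three logits exceed 0.5 -> rank 4
```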
```python
import colorsys
import html as html_lib
import io
import math
```
Comment on lines +52 to +56:

```python
def _hue_hex(hue: float, s: float = 0.7, l: float = 0.5) -> str:
    r, g, b = colorsys.hls_to_rgb(hue, l, s)
    return f"#{int(r*255):02x}{int(g*255):02x}{int(b*255):02x}"


# ---------------------------------------------------------------------------


def tokenize_text(text: str, tokenizer) -> List[Tuple[int, str, str]]:
    """Return [(token_id, token_string, original_segment), ...]."""
```
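A runnable reproduction of the `_hue_hex` helper shown above, usable for spot-checking colours in isolation. Note that `colorsys.hls_to_rgb` takes its arguments in `(h, l, s)` order, lightness before saturation, which is why the call swaps the keyword defaults:

```python
import colorsys

def hue_hex(hue: float, s: float = 0.7, l: float = 0.5) -> str:
    """Hex colour from an HLS hue in [0, 1); mirrors _hue_hex in the diff."""
    r, g, b = colorsys.hls_to_rgb(hue, l, s)  # argument order is h, l, s
    return f"#{int(r*255):02x}{int(g*255):02x}{int(b*255):02x}"

print(hue_hex(0.0))    # hue 0   -> a red
print(hue_hex(1 / 3))  # hue 1/3 -> a green
```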
Comment on lines +137 to +145:

```python
for i in range(1, len(token_ids)):
    # logits[i-1] predicts token at position i
    step_logits = logits[i - 1]
    probs = F.softmax(step_logits, dim=-1)
    target_id = token_ids[i]
    prob = probs[target_id].item()
    rank = int((step_logits > step_logits[target_id]).sum().item()) + 1
    results.append({"rank": rank, "probability": prob})
```
Comment on lines +110 to +114:

```python
def run_inference(
    token_ids: List[int],
    model,
    tokenizer,
    device: str = "cpu",
```
```python
def _token_id_colour(token_id: int, vocab_size: int) -> str:
    """Deterministic colour based on token ID (golden-ratio hashing)."""
    golden = 0.618033988749895
    hue = (token_id * golden) % 1.0
```
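Golden-ratio hashing multiplies the token ID by the conjugate golden ratio and keeps the fractional part, which spreads successive IDs nearly uniformly around the hue wheel. A self-contained sketch combining this with a hue-to-hex step (the hex conversion is an assumption modelled on the diff, not the PR's exact code path):

```python
import colorsys

GOLDEN = 0.618033988749895  # conjugate golden ratio

def token_id_colour(token_id: int) -> str:
    """Deterministic colour from a token ID via golden-ratio hue hashing."""
    hue = (token_id * GOLDEN) % 1.0
    r, g, b = colorsys.hls_to_rgb(hue, 0.5, 0.7)
    return f"#{int(r*255):02x}{int(g*255):02x}{int(b*255):02x}"

# Consecutive token IDs land far apart on the hue wheel, so adjacent
# tokens in a rendered sequence stay visually distinct.
for tid in range(3):
    print(tid, token_id_colour(tid))
```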
```python
    text: str,
    tokenizer,
) -> List[Tuple[int, str, str]]:
    """Return list of (token_id, token_string, original_text_segment)."""
```
No description provided.