Conversation
- Rename colorize_dataset.py -> token_prediction_colorizer.py for clarity
- Update references in demos/dataset_colorization.sh and docstring
- Add gemma_tokenization_highlighter.py: terminal + interactive HTML visualization of Gemma 270M tokenizer segmentation with hover tooltips and switchable colour modes (token_id, char_length, byte_length)
- Add test_gemma_en_ko_heatmap.py: fetches EN-KO pairs from Helsinki-NLP/opus-100, produces side-by-side tokenization heatmaps with summary statistics (token counts, bytes/token ratios)

https://claude.ai/code/session_01GfYcQ7Rp4bhDwcwoBYowRV
Both gemma_tokenization_highlighter.py and test_gemma_en_ko_heatmap.py now support --inference to load the full Gemma 270M model and colour tokens by next-token prediction quality:

- rank: green = rank 1 (top prediction), red = rank >= --rank_red
- probability: green = 1.0 softmax confidence, red = 0.0 (absolute)
- minmax: red = lowest prob in sequence, green = highest (relative)

HTML tooltips show rank and probability on hover. Summary tables include avg rank and avg probability columns when inference is enabled.

https://claude.ai/code/session_01GfYcQ7Rp4bhDwcwoBYowRV
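The absolute probability mode amounts to a red-to-green interpolation over the softmax confidence. A minimal sketch of that mapping (`rg_gradient` here is a stand-in for illustration, not the PR's actual helper):

```python
def rg_gradient(t: float) -> str:
    """Map t in [0, 1] to a hex colour: 0.0 -> red, 1.0 -> green."""
    t = max(0.0, min(1.0, t))  # clamp out-of-range inputs
    r = int(255 * (1 - t))
    g = int(255 * t)
    return f"#{r:02x}{g:02x}00"

# probability mode: colour each token directly by its softmax confidence
print(rg_gradient(1.0))  # "#00ff00" -> green, fully confident prediction
print(rg_gradient(0.0))  # "#ff0000" -> red, zero probability
```

The minmax mode would feed this same gradient the sequence-relative value `(p - p_min) / (p_max - p_min)` instead of the raw probability.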
- gemma_tokenization_highlighter_demo.sh: runs all 6 colour modes (token_id, char_length, byte_length, rank, probability, minmax) on mixed EN/KO/accented text with both terminal and HTML output
- gemma_en_ko_heatmap_demo.sh: runs all 6 colour modes on 5 EN-KO translation pairs from Helsinki-NLP/opus-100

https://claude.ai/code/session_01GfYcQ7Rp4bhDwcwoBYowRV
Pull request overview
Adds Gemma-focused tokenization visualization utilities (terminal + interactive HTML heatmaps) and updates demos/docs to use the current dataset colorization script name.
Changes:
- Add a Gemma tokenizer highlighter for arbitrary text with optional next-token inference colouring.
- Add an EN↔KO translation-pair heatmap generator (tokenization + optional inference overlays) with HTML output.
- Add demo scripts and update existing demo/docs references to token_prediction_colorizer.py.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| token_prediction_colorizer.py | Updates top-of-file usage example to reference the correct script name. |
| huggingface_model/gemma/270M/test_gemma_en_ko_heatmap.py | New EN↔KO tokenization + inference heatmap generator (terminal + HTML). |
| huggingface_model/gemma/270M/gemma_tokenization_highlighter.py | New single-text tokenization highlighter with optional inference-based colouring. |
| demos/gemma_tokenization_highlighter_demo.sh | Demo runner for all colour modes of the new highlighter. |
| demos/gemma_en_ko_heatmap_demo.sh | Demo runner for all colour modes of the new EN↔KO heatmap tool. |
| demos/dataset_colorization.sh | Updates demo to call token_prediction_colorizer.py instead of the old script name. |
Comment on lines +607 to +610:

```javascript
html += `<span class="tok" style="background:${{bg}};color:${{fg}}" ` +
        `data-tid="${{t.id}}" data-tok="${{escHtml(t.tok)}}" data-orig="${{escHtml(t.orig)}}" ` +
        `data-chars="${{t.chars}}" data-bytes="${{t.bytes}}" ` +
        `data-rank="${{t.rank}}" data-prob="${{t.prob}}">${{escHtml(display)}}</span>`;
```
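The doubled braces (`${{...}}`) in this JavaScript indicate it is emitted from inside a Python f-string, where `{{` escapes to a literal `{`. A minimal reproduction of the pattern (the `t_id` variable is hypothetical):

```python
t_id = 42
# Inside a Python f-string, {{ and }} emit literal braces, so ${{t.id}}
# becomes the JavaScript template placeholder ${t.id}, while {t_id}
# is interpolated by Python before the page is written out.
js = f'html += `<span data-tid="${{t.id}}">token {t_id}</span>`;'
print(js)
```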
Comment on lines +190 to +196:

```python
if mode == "byte_length":
    bls = [len(o.encode("utf-8")) for _, _, o in spans]
    lo, hi = (min(bls), max(bls)) if bls else (0, 1)
    return [_rg_gradient(1 - (bl - lo) / (hi - lo + 1e-9)) for bl in bls]

if inference_results is None:
    # fallback to byte_length
```
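The byte_length branch is a min-max normalisation over per-token UTF-8 byte lengths; the `1e-9` guards against division by zero when every token has the same length. A self-contained walk-through with hypothetical `(token_id, token_string, original_segment)` spans:

```python
# Hypothetical spans; Hangul syllables are 3 bytes each in UTF-8,
# so the Korean segment is longer in bytes than in characters.
spans = [(0, "Hello", "Hello"), (1, " 안녕", " 안녕")]

bls = [len(o.encode("utf-8")) for _, _, o in spans]
lo, hi = (min(bls), max(bls)) if bls else (0, 1)
norm = [1 - (bl - lo) / (hi - lo + 1e-9) for bl in bls]

print(bls)   # [5, 7]: "Hello" is 5 bytes, " 안녕" is 1 + 3 + 3 = 7
print(norm)  # ~1.0 for the shortest token (green), ~0.0 for the longest (red)
```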
Comment on lines +134 to +141:

```python
for i in range(1, len(token_ids)):
    step_logits = logits[i - 1]
    probs = F.softmax(step_logits, dim=-1)
    target_id = token_ids[i]
    prob = probs[target_id].item()
    rank = int((step_logits > step_logits[target_id]).sum().item()) + 1
    results.append({"rank": rank, "probability": prob})
```
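The rank here counts how many vocabulary logits strictly exceed the target token's logit, plus one, so tied logits share the best rank. The snippet uses torch tensor ops; the same logic in plain Python over a toy logit vector, as an illustration:

```python
import math

def rank_and_prob(step_logits, target_id):
    """Rank (1 = top prediction) and softmax probability of the target token."""
    # Numerically stable softmax over the toy logit vector.
    m = max(step_logits)
    exps = [math.exp(x - m) for x in step_logits]
    prob = exps[target_id] / sum(exps)
    # Count logits strictly greater than the target's; +1 gives a 1-based rank.
    rank = sum(1 for x in step_logits if x > step_logits[target_id]) + 1
    return rank, prob

logits = [2.0, 0.5, 3.0, 1.0]   # hypothetical logits over a 4-token vocabulary
print(rank_and_prob(logits, 2))  # token 2 has the highest logit -> rank 1
print(rank_and_prob(logits, 1))  # three logits exceed 0.5 -> rank 4
```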
```python
import colorsys
import html as html_lib
import io
import math
```
Comment on lines +52 to +56:

```python
def _hue_hex(hue: float, s: float = 0.7, l: float = 0.5) -> str:
    r, g, b = colorsys.hls_to_rgb(hue, l, s)
    return f"#{int(r*255):02x}{int(g*255):02x}{int(b*255):02x}"


# ---------------------------------------------------------------------------


def tokenize_text(text: str, tokenizer) -> List[Tuple[int, str, str]]:
    """Return [(token_id, token_string, original_segment), ...]."""
```
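A runnable reproduction of the `_hue_hex` helper shown above, usable for spot-checking colours in isolation. Note that `colorsys.hls_to_rgb` takes its arguments in `(h, l, s)` order, lightness before saturation, which is why the call swaps the keyword defaults:

```python
import colorsys

def hue_hex(hue: float, s: float = 0.7, l: float = 0.5) -> str:
    """Hex colour from an HLS hue in [0, 1); mirrors _hue_hex in the diff."""
    r, g, b = colorsys.hls_to_rgb(hue, l, s)  # argument order is h, l, s
    return f"#{int(r*255):02x}{int(g*255):02x}{int(b*255):02x}"

print(hue_hex(0.0))    # hue 0   -> a red
print(hue_hex(1 / 3))  # hue 1/3 -> a green
```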
Comment on lines +137 to +145:

```python
for i in range(1, len(token_ids)):
    # logits[i-1] predicts token at position i
    step_logits = logits[i - 1]
    probs = F.softmax(step_logits, dim=-1)
    target_id = token_ids[i]
    prob = probs[target_id].item()
    rank = int((step_logits > step_logits[target_id]).sum().item()) + 1
    results.append({"rank": rank, "probability": prob})
```
Comment on lines +110 to +114:

```python
def run_inference(
    token_ids: List[int],
    model,
    tokenizer,
    device: str = "cpu",
```
```python
def _token_id_colour(token_id: int, vocab_size: int) -> str:
    """Deterministic colour based on token ID (golden-ratio hashing)."""
    golden = 0.618033988749895
    hue = (token_id * golden) % 1.0
```
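Golden-ratio hashing multiplies the token ID by the conjugate golden ratio and keeps the fractional part, which spreads successive IDs nearly uniformly around the hue wheel. A self-contained sketch combining this with a hue-to-hex step (the hex conversion is an assumption modelled on the diff, not the PR's exact code path):

```python
import colorsys

GOLDEN = 0.618033988749895  # conjugate golden ratio

def token_id_colour(token_id: int) -> str:
    """Deterministic colour from a token ID via golden-ratio hue hashing."""
    hue = (token_id * GOLDEN) % 1.0
    r, g, b = colorsys.hls_to_rgb(hue, 0.5, 0.7)
    return f"#{int(r*255):02x}{int(g*255):02x}{int(b*255):02x}"

# Consecutive token IDs land far apart on the hue wheel, so adjacent
# tokens in a rendered sequence stay visually distinct.
for tid in range(3):
    print(tid, token_id_colour(tid))
```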
```python
    text: str,
    tokenizer,
) -> List[Tuple[int, str, str]]:
    """Return list of (token_id, token_string, original_text_segment)."""
```
No description provided.