
Add gemma heatmap #774

Open

klei22 wants to merge 3 commits into ReaLLMASIC:master from klei22:add-gemma-heatmap

Conversation

Collaborator

@klei22 klei22 commented Mar 19, 2026

No description provided.

claude added 3 commits March 17, 2026 02:18
- Rename colorize_dataset.py -> token_prediction_colorizer.py for clarity
- Update references in demos/dataset_colorization.sh and docstring
- Add gemma_tokenization_highlighter.py: terminal + interactive HTML
  visualization of Gemma 270M tokenizer segmentation with hover tooltips
  and switchable colour modes (token_id, char_length, byte_length)
- Add test_gemma_en_ko_heatmap.py: fetches EN-KO pairs from
  Helsinki-NLP/opus-100, produces side-by-side tokenization heatmaps
  with summary statistics (token counts, bytes/token ratios)

https://claude.ai/code/session_01GfYcQ7Rp4bhDwcwoBYowRV
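The bytes/token ratio named in those summary statistics is a simple quotient; a minimal sketch (the function name is mine, not from the PR):

```python
def bytes_per_token(text: str, token_count: int) -> float:
    """Average UTF-8 bytes covered per token: a rough measure of how
    coarsely or finely the tokenizer segments a given script."""
    return len(text.encode("utf-8")) / token_count

print(bytes_per_token("hello world", 2))  # 11 ASCII bytes over 2 tokens
print(bytes_per_token("안녕하세요", 5))    # 15 UTF-8 bytes over 5 tokens
```

Comparing this ratio between the EN and KO sides of a pair shows which language the tokenizer segments more coarsely.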
Both gemma_tokenization_highlighter.py and test_gemma_en_ko_heatmap.py
now support --inference to load the full Gemma 270M model and colour
tokens by next-token prediction quality:

- rank: green = rank 1 (top prediction), red = rank >= --rank_red
- probability: green = 1.0 softmax confidence, red = 0.0 (absolute)
- minmax: red = lowest prob in sequence, green = highest (relative)

HTML tooltips show rank and probability on hover. Summary tables include
avg rank and avg probability columns when inference is enabled.

https://claude.ai/code/session_01GfYcQ7Rp4bhDwcwoBYowRV
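One plausible reading of the three inference colour modes above is a single normalisation step to [0, 1] (0.0 red, 1.0 green); the function name and the `rank_red` default here are assumptions, not taken from the PR code:

```python
def inference_score(mode: str, rank: int, prob: float,
                    seq_probs: list, rank_red: int = 10) -> float:
    """Normalise a token's next-token prediction quality to [0, 1]."""
    if mode == "rank":
        # rank 1 -> 1.0 (green); rank >= rank_red -> 0.0 (red)
        return max(0.0, 1.0 - (rank - 1) / (rank_red - 1))
    if mode == "probability":
        # absolute softmax confidence, used as-is
        return prob
    # "minmax": rescale relative to this sequence only
    lo, hi = min(seq_probs), max(seq_probs)
    return (prob - lo) / (hi - lo + 1e-9)

print(inference_score("rank", 1, 0.9, []))    # top prediction -> 1.0
print(inference_score("rank", 25, 0.01, []))  # deep rank -> 0.0
```

The minmax mode is useful when every token in a sequence has low absolute probability but their relative ordering is still informative.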
- gemma_tokenization_highlighter_demo.sh: runs all 6 colour modes
  (token_id, char_length, byte_length, rank, probability, minmax)
  on mixed EN/KO/accented text with both terminal and HTML output
- gemma_en_ko_heatmap_demo.sh: runs all 6 colour modes on 5
  EN-KO translation pairs from Helsinki-NLP/opus-100

https://claude.ai/code/session_01GfYcQ7Rp4bhDwcwoBYowRV

Copilot AI left a comment


Pull request overview

Adds Gemma-focused tokenization visualization utilities (terminal + interactive HTML heatmaps) and updates demos/docs to use the current dataset colorization script name.

Changes:

  • Add a Gemma tokenizer highlighter for arbitrary text with optional next-token inference colouring.
  • Add an EN↔KO translation-pair heatmap generator (tokenization + optional inference overlays) with HTML output.
  • Add demo scripts and update existing demo/docs references to token_prediction_colorizer.py.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.

Summary per file:

  • token_prediction_colorizer.py: updates the top-of-file usage example to reference the correct script name.
  • huggingface_model/gemma/270M/test_gemma_en_ko_heatmap.py: new EN↔KO tokenization + inference heatmap generator (terminal + HTML).
  • huggingface_model/gemma/270M/gemma_tokenization_highlighter.py: new single-text tokenization highlighter with optional inference-based colouring.
  • demos/gemma_tokenization_highlighter_demo.sh: demo runner for all colour modes of the new highlighter.
  • demos/gemma_en_ko_heatmap_demo.sh: demo runner for all colour modes of the new EN↔KO heatmap tool.
  • demos/dataset_colorization.sh: updates the demo to call token_prediction_colorizer.py instead of the old script name.


Comment on lines +607 to +610
html += `<span class="tok" style="background:${{bg}};color:${{fg}}" ` +
        `data-tid="${{t.id}}" data-tok="${{escHtml(t.tok)}}" data-orig="${{escHtml(t.orig)}}" ` +
        `data-chars="${{t.chars}}" data-bytes="${{t.bytes}}" ` +
        `data-rank="${{t.rank}}" data-prob="${{t.prob}}">${{escHtml(display)}}</span>`;
Comment on lines +190 to +196
    if mode == "byte_length":
        bls = [len(o.encode("utf-8")) for _, _, o in spans]
        lo, hi = (min(bls), max(bls)) if bls else (0, 1)
        return [_rg_gradient(1 - (bl - lo) / (hi - lo + 1e-9)) for bl in bls]

    if inference_results is None:
        # fallback to byte_length
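The `_rg_gradient` helper the excerpt calls is not shown in the review. A self-contained stand-in, assuming (as the `1 - …` normalisation suggests) that 1.0 maps to green and 0.0 to red, together with the same min-max logic:

```python
def _rg_gradient(x: float) -> str:
    """Stand-in red-to-green gradient: 0.0 -> red, 1.0 -> green."""
    x = max(0.0, min(1.0, x))  # clamp to the valid range
    return f"#{round(255 * (1 - x)):02x}{round(255 * x):02x}00"

def byte_length_colours(segments: list) -> list:
    """Shortest token (fewest UTF-8 bytes) -> green, longest -> red."""
    bls = [len(s.encode("utf-8")) for s in segments]
    lo, hi = (min(bls), max(bls)) if bls else (0, 1)
    # the 1e-9 guards against division by zero when all lengths are equal
    return [_rg_gradient(1 - (bl - lo) / (hi - lo + 1e-9)) for bl in bls]

print(byte_length_colours(["a", "한국"]))  # 1 byte vs 6 bytes
```

The PR's actual gradient may interpolate differently; only the endpoints are implied by the quoted code.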
Comment on lines +134 to +141
    for i in range(1, len(token_ids)):
        step_logits = logits[i - 1]
        probs = F.softmax(step_logits, dim=-1)
        target_id = token_ids[i]
        prob = probs[target_id].item()
        rank = int((step_logits > step_logits[target_id]).sum().item()) + 1
        results.append({"rank": rank, "probability": prob})
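The rank line above counts how many vocabulary logits strictly beat the target's logit, then adds one. A torch-free sketch of the same arithmetic (helper names are mine):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a plain list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rank_and_prob(logits, target_id):
    """rank = 1 + number of logits strictly greater than the target's;
    ties share the better rank, matching the excerpt's comparison."""
    rank = 1 + sum(1 for x in logits if x > logits[target_id])
    prob = softmax(logits)[target_id]
    return rank, prob

# The top-scoring token gets rank 1
print(rank_and_prob([2.0, 0.5, 1.0], target_id=0))
```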

import colorsys
import html as html_lib
import io
import math
Comment on lines +52 to +56
def _hue_hex(hue: float, s: float = 0.7, l: float = 0.5) -> str:
    r, g, b = colorsys.hls_to_rgb(hue, l, s)
    return f"#{int(r*255):02x}{int(g*255):02x}{int(b*255):02x}"


# ---------------------------------------------------------------------------

def tokenize_text(text: str, tokenizer) -> List[Tuple[int, str, str]]:
    """Return [(token_id, token_string, original_segment), ...]."""
Comment on lines +137 to +145
    for i in range(1, len(token_ids)):
        # logits[i-1] predicts token at position i
        step_logits = logits[i - 1]
        probs = F.softmax(step_logits, dim=-1)
        target_id = token_ids[i]
        prob = probs[target_id].item()
        rank = int((step_logits > step_logits[target_id]).sum().item()) + 1
        results.append({"rank": rank, "probability": prob})

Comment on lines +110 to +114

def run_inference(
    token_ids: List[int],
    model,
    tokenizer,
    device: str = "cpu",

def _token_id_colour(token_id: int, vocab_size: int) -> str:
    """Deterministic colour based on token ID (golden-ratio hashing)."""
    golden = 0.618033988749895
    hue = (token_id * golden) % 1.0

    text: str,
    tokenizer,
) -> List[Tuple[int, str, str]]:
    """Return list of (token_id, token_string, original_text_segment)."""
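Golden-ratio hashing, as in `_token_id_colour`, pushes consecutive token IDs far apart on the hue wheel, so adjacent tokens rarely share a colour. A runnable sketch reusing the `_hue_hex` helper quoted earlier (the `vocab_size` parameter is dropped since the hash itself does not need it):

```python
import colorsys

def _hue_hex(hue: float, s: float = 0.7, l: float = 0.5) -> str:
    r, g, b = colorsys.hls_to_rgb(hue, l, s)
    return f"#{int(r*255):02x}{int(g*255):02x}{int(b*255):02x}"

def token_id_colour(token_id: int) -> str:
    """Multiply by the golden-ratio conjugate and keep the fractional part
    as a hue: deterministic per ID, well spread for consecutive IDs."""
    golden = 0.618033988749895
    return _hue_hex((token_id * golden) % 1.0)

# Consecutive IDs land at visibly different hues
for tid in range(5):
    print(tid, token_id_colour(tid))
```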


3 participants