Conversation
Pull request overview
Adds an end-to-end experimentation workflow for Gemma 270M EN→ES translation to compare attention activation variants (Softmax vs ReLU variants) including gradual blending, plus utilities for plotting and benchmarking.
Changes:
- Extended `finetune.py` with configurable dataset/prompt params and experimental attention modes (`softmax`, `sum`, `gradual_blend`), including alpha scheduling and optional output-norm blending.
- Added runnable demo + utilities: a staged fine-tuning shell script, a validation-loss plotting script, and a translation benchmarking script.
- Expanded `README.md` with step-by-step instructions for the new experiment paths and utilities.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| huggingface_model/gemma/270M/finetune.py | Adds attention-mode experimentation (monkey-patched softmax), alpha scheduling callback, configurable prompts/dataset args, and optional norm blending. |
| huggingface_model/gemma/270M/demo_gradual_blend_en_es.sh | Provides a staged warmup→gradual-blend runnable workflow with baseline/plot/benchmark command templates. |
| huggingface_model/gemma/270M/benchmark_en_es_translation.py | Adds a lightweight checkpoint evaluation script (exact match + optional BLEU/chrF). |
| huggingface_model/gemma/270M/plot_validation_loss.py | Adds a utility to plot eval loss curves from Trainer state logs across runs. |
| huggingface_model/gemma/270M/README.md | Documents the gradual blend recipe, baselines, plotting, and benchmarking. |
```python
dataset = load_dataset(args.dataset_name, args.dataset_config, split=args.dataset_split)
train_test_split = dataset.train_test_split(test_size=0.1)
```
train_test_split(test_size=0.1) is currently unseeded, so train/eval splits (and therefore reported eval_loss/benchmarks) will vary run-to-run. Consider adding a --seed CLI arg and passing it to both dataset.train_test_split(..., seed=...) and TrainingArguments(seed=..., data_seed=...) to make experiments reproducible/comparable.
Suggested change:

```python
seed = getattr(args, "seed", 42)
dataset = load_dataset(args.dataset_name, args.dataset_config, split=args.dataset_split)
train_test_split = dataset.train_test_split(test_size=0.1, seed=seed)
```
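To make the point concrete, here is a pure-Python stand-in for the seeded split (a hypothetical helper, not the `datasets` implementation): seeding the shuffle makes the partition deterministic, so eval metrics are comparable across runs.

```python
import random

def train_test_split_indices(n, test_size, seed=None):
    # Minimal stand-in for datasets.Dataset.train_test_split:
    # shuffle row indices, then carve off the last `test_size`
    # fraction as the eval split.
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = n - int(n * test_size)
    return indices[:cut], indices[cut:]
```

With `seed=None` the shuffle draws fresh entropy each run, which is exactly the run-to-run variance described above; any fixed integer removes it.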
```python
def update(self, step: int):
    progress = min(step, self.anneal_steps) / float(self.anneal_steps)
    self.alpha = self.alpha_start + (self.alpha_end - self.alpha_start) * progress
    self.alpha = max(self.alpha, 0.0)  # clamp; cannot go below zero
```
BlendController.update() clamps alpha only on the lower bound. If a user passes --alpha_start/--alpha_end outside [0, 1], the blend weights can become negative or >1 (e.g., 1 - alpha), which is likely unintended for a convex interpolation. Consider validating inputs (raise) or clamping alpha to [0, 1] when attention_mode=gradual_blend.
Suggested change:

```python
self.alpha = min(max(self.alpha, 0.0), 1.0)  # clamp to valid blend range
```
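Pulling the snippets together, here is a self-contained sketch of `BlendController` with both fixes applied (input validation plus two-sided clamping); the constructor signature is assumed from the `--alpha_start`/`--alpha_end` flags mentioned above.

```python
class BlendController:
    # Linearly anneals a blend weight alpha from alpha_start to alpha_end
    # over anneal_steps, then holds it constant. Values are clamped to
    # [0, 1] so both alpha and 1 - alpha stay valid convex weights.
    def __init__(self, alpha_start: float, alpha_end: float, anneal_steps: int):
        if not (0.0 <= alpha_start <= 1.0 and 0.0 <= alpha_end <= 1.0):
            raise ValueError("alpha_start and alpha_end must lie in [0, 1]")
        self.alpha_start = alpha_start
        self.alpha_end = alpha_end
        self.anneal_steps = max(anneal_steps, 1)
        self.alpha = alpha_start

    def update(self, step: int):
        progress = min(step, self.anneal_steps) / float(self.anneal_steps)
        self.alpha = self.alpha_start + (self.alpha_end - self.alpha_start) * progress
        self.alpha = min(max(self.alpha, 0.0), 1.0)  # clamp to valid blend range
```

Raising in the constructor catches bad CLI input once at startup, rather than clamping silently on every step.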
```python
def _patch_output_norms(model, blend_controller: BlendController, enable_norm_blend: bool):
    if not enable_norm_blend:
        return []

    patched = []
    target_fragments = ("post_attention_layernorm", "post_feedforward_layernorm")
    for name, module in model.named_modules():
        if not any(fragment in name for fragment in target_fragments):
            continue
```
When --blend_output_norm is enabled, _patch_output_norms() silently does nothing if no module names match the configured fragments. This can make runs look like they're using norm blending when they aren't. Consider emitting a warning or raising when enable_norm_blend is true but patched remains empty (and/or logging how many modules were patched).
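One way to surface that failure mode is to fail loudly when nothing matched; a hypothetical variant, duck-typed against anything exposing `named_modules()`:

```python
import logging

logger = logging.getLogger(__name__)

def patch_output_norms_checked(model, target_fragments=(
        "post_attention_layernorm", "post_feedforward_layernorm")):
    # Hypothetical variant of _patch_output_norms: collect matching module
    # names and raise if the architecture has none, instead of silently
    # skipping norm blending.
    patched = [name for name, _ in model.named_modules()
               if any(fragment in name for fragment in target_fragments)]
    if not patched:
        raise RuntimeError(
            "--blend_output_norm was enabled but no modules matched "
            f"{target_fragments}; check the model architecture.")
    logger.info("Norm blending patched %d modules", len(patched))
    return patched
```

Logging the patch count also gives a cheap sanity check in run logs even when the happy path is taken.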
This pull request introduces a comprehensive workflow for experimenting with gradual attention mechanism blending (Softmax → ReLU variants) in Gemma 270M for English→Spanish translation. It adds a runnable demo script, benchmarking utilities, and major enhancements to the fine-tuning script, enabling flexible experimentation with attention activations, output norm blending, and improved reproducibility. The documentation is updated to provide clear walkthroughs for all supported experiment paths.
Key changes include:

Experimentation Workflow & Utilities
- `demo_gradual_blend_en_es.sh`: a ready-to-run shell script that orchestrates staged fine-tuning (Softmax warmup, gradual blend, and baselines) and provides commented commands for plotting and benchmarking.
- `benchmark_en_es_translation.py`: evaluates translation checkpoints with exact-match, BLEU, and chrF metrics, supporting quick quality assessment of experimental runs.

Fine-tuning Enhancements
- Updated `finetune.py` to support configurable attention mechanisms (`softmax`, `sum`, `gradual_blend`), including alpha scheduling, output norm blending, and activation variants (`relumax`, `relu2max`). This includes new classes (`BlendController`, `AlphaScheduleCallback`) and monkey-patching logic for attention and output norm blending.

Documentation
- Expanded `README.md` with detailed instructions for running gradual blend, baselines, plotting validation loss, and benchmarking translation quality, making the experimentation process transparent and reproducible.
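The core idea behind `gradual_blend` can be sketched on a single row of attention scores: a convex combination of the softmax weights and a ReLU-based variant, with alpha annealed from 1 toward 0 over training. The `relu_norm` definition below is an assumption for illustration; the PR's `sum`/`relumax`/`relu2max` variants may normalize differently.

```python
import math

def softmax(scores):
    # Numerically stable softmax over one attention row.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relu_norm(scores, seq_len):
    # Assumed ReLU variant: relu(score) / seq_len (one common formulation).
    return [max(s, 0.0) / seq_len for s in scores]

def blended_attention_weights(scores, alpha, seq_len):
    # gradual_blend-style convex combination:
    #   alpha * softmax(scores) + (1 - alpha) * relu_variant(scores)
    sm = softmax(scores)
    rl = relu_norm(scores, seq_len)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(sm, rl)]
```

At `alpha=1.0` this reduces exactly to standard softmax attention (the warmup stage); as alpha anneals toward 0 the model is eased onto the ReLU variant rather than switched abruptly.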