Gemma270m flare #799

Open
klei22 wants to merge 2 commits into ReaLLMASIC:master from klei22:gemma270m-flare

Conversation

klei22 (Collaborator) commented on Apr 15, 2026

This pull request introduces a comprehensive workflow for experimenting with gradual attention mechanism blending (Softmax → ReLU variants) in Gemma 270M for English→Spanish translation. It adds a runnable demo script, benchmarking utilities, and major enhancements to the fine-tuning script, enabling flexible experimentation with attention activations, output norm blending, and improved reproducibility. The documentation is updated to provide clear walkthroughs for all supported experiment paths.

Key changes include:

Experimentation Workflow & Utilities

  • Added demo_gradual_blend_en_es.sh, a ready-to-run shell script that orchestrates staged fine-tuning (Softmax warmup, gradual blend, and baselines) and provides commented commands for plotting and benchmarking.
  • Introduced benchmark_en_es_translation.py for evaluating translation checkpoints with exact-match, BLEU, and chrF metrics, supporting quick quality assessment of experimental runs.
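
To make the metrics concrete, here is a minimal sketch of the exact-match portion of such a benchmark. The function names are illustrative, not taken from the PR's `benchmark_en_es_translation.py`; BLEU and chrF would typically come from an external library such as sacrebleu.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count as misses."""
    return " ".join(text.lower().split())

def exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that match their reference after normalization."""
    if not predictions:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(predictions)

# Minor casing/whitespace differences are forgiven; wrong words are not.
score = exact_match(["Hola mundo", "el gato  negro"], ["hola mundo", "el gato negro"])
```

Exact match is a strict lower bound on quality (a perfectly good paraphrase scores zero), which is why the script also reports BLEU and chrF.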

Fine-tuning Enhancements

  • Refactored finetune.py to support configurable attention mechanisms (softmax, sum, gradual_blend), including alpha scheduling, output norm blending, and activation variants (relumax, relu2max). This includes new classes (BlendController, AlphaScheduleCallback) and monkey-patching logic for both attention and output norms.
  • Made the fine-tuning process fully configurable via command-line arguments (dataset, language names, output dirs, etc.), and improved prompt formatting for both training and evaluation.

Documentation

  • Significantly expanded README.md with detailed instructions for running gradual blend, baselines, plotting validation loss, and benchmarking translation quality, making the experimentation process transparent and reproducible.



Copilot AI left a comment


Pull request overview

Adds an end-to-end experimentation workflow for Gemma 270M EN→ES translation to compare attention activation variants (Softmax vs ReLU variants) including gradual blending, plus utilities for plotting and benchmarking.

Changes:

  • Extended finetune.py with configurable dataset/prompt params and experimental attention modes (softmax, sum, gradual_blend) including alpha scheduling and optional output-norm blending.
  • Added runnable demo + utilities: a staged fine-tuning shell script, a validation-loss plotting script, and a translation benchmarking script.
  • Expanded README.md with step-by-step instructions for the new experiment paths and utilities.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| huggingface_model/gemma/270M/finetune.py | Adds attention-mode experimentation (monkey-patched softmax), alpha scheduling callback, configurable prompts/dataset args, and optional norm blending. |
| huggingface_model/gemma/270M/demo_gradual_blend_en_es.sh | Provides a staged warmup → gradual-blend runnable workflow with baseline/plot/benchmark command templates. |
| huggingface_model/gemma/270M/benchmark_en_es_translation.py | Adds a lightweight checkpoint evaluation script (exact match + optional BLEU/chrF). |
| huggingface_model/gemma/270M/plot_validation_loss.py | Adds a utility to plot eval-loss curves from Trainer state logs across runs. |
| huggingface_model/gemma/270M/README.md | Documents the gradual blend recipe, baselines, plotting, and benchmarking. |


Comment on lines +165 to +166

```python
dataset = load_dataset(args.dataset_name, args.dataset_config, split=args.dataset_split)
train_test_split = dataset.train_test_split(test_size=0.1)
```
Copilot AI Apr 15, 2026


`train_test_split(test_size=0.1)` is currently unseeded, so train/eval splits (and therefore reported `eval_loss` and benchmark scores) will vary run to run. Consider adding a `--seed` CLI arg and passing it to both `dataset.train_test_split(..., seed=...)` and `TrainingArguments(seed=..., data_seed=...)` so experiments are reproducible and comparable.

Suggested change

```diff
-dataset = load_dataset(args.dataset_name, args.dataset_config, split=args.dataset_split)
-train_test_split = dataset.train_test_split(test_size=0.1)
+seed = getattr(args, "seed", 42)
+dataset = load_dataset(args.dataset_name, args.dataset_config, split=args.dataset_split)
+train_test_split = dataset.train_test_split(test_size=0.1, seed=seed)
```
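
The reviewer's point can be seen with a plain-Python sketch. The helper below is illustrative, not code from `finetune.py`; it mirrors what `datasets`' `train_test_split(..., seed=...)` guarantees: a split produced with the same seed is identical across runs, so every run evaluates on the same examples.

```python
import random

def split_indices(n: int, test_size: float, seed: int) -> tuple[list[int], list[int]]:
    """Deterministically shuffle indices with a seeded RNG, then cut off a test slice."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # seeded RNG -> same permutation every run
    cut = int(n * test_size)
    return idx[cut:], idx[:cut]  # (train, test)

train_a, test_a = split_indices(100, 0.1, seed=42)
train_b, test_b = split_indices(100, 0.1, seed=42)
# Same seed -> identical split, so eval_loss is comparable across runs.
```

Without the seed, each run shuffles differently and eval-loss differences between attention modes can be an artifact of the split rather than the model.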

```python
def update(self, step: int):
    progress = min(step, self.anneal_steps) / float(self.anneal_steps)
    self.alpha = self.alpha_start + (self.alpha_end - self.alpha_start) * progress
    self.alpha = max(self.alpha, 0.0)  # clamp; cannot go below zero
```
Copilot AI Apr 15, 2026


`BlendController.update()` clamps `alpha` only on the lower bound. If a user passes `--alpha_start`/`--alpha_end` outside [0, 1], the blend weights can become negative or greater than 1 (e.g., `1 - alpha`), which is likely unintended for a convex interpolation. Consider validating inputs (raising) or clamping `alpha` to [0, 1] when `attention_mode=gradual_blend`.

Suggested change

```diff
-self.alpha = max(self.alpha, 0.0)  # clamp; cannot go below zero
+self.alpha = min(max(self.alpha, 0.0), 1.0)  # clamp to valid blend range
```
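
Putting the schedule and the two-sided clamp together, here is an illustrative reimplementation of the scheduling logic (only the arithmetic is shown; the PR's actual class also drives the patched attention and norm modules):

```python
class BlendController:
    """Linear alpha schedule with the reviewer's suggested two-sided clamp.
    alpha anneals from alpha_start to alpha_end over anneal_steps, then holds."""

    def __init__(self, alpha_start: float, alpha_end: float, anneal_steps: int):
        if anneal_steps <= 0:
            raise ValueError("anneal_steps must be positive")
        self.alpha_start = alpha_start
        self.alpha_end = alpha_end
        self.anneal_steps = anneal_steps
        self.alpha = alpha_start

    def update(self, step: int) -> float:
        # Progress saturates at 1.0 once annealing is done.
        progress = min(step, self.anneal_steps) / float(self.anneal_steps)
        alpha = self.alpha_start + (self.alpha_end - self.alpha_start) * progress
        self.alpha = min(max(alpha, 0.0), 1.0)  # clamp to the valid blend range
        return self.alpha

# Even a misconfigured alpha_end of -0.5 now stays inside [0, 1].
ctrl = BlendController(alpha_start=1.0, alpha_end=-0.5, anneal_steps=10)
```

With the clamp, `alpha` and `1 - alpha` always form a valid convex combination of the softmax and ReLU branches, regardless of CLI input.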

Comment on lines +114 to +122

```python
def _patch_output_norms(model, blend_controller: BlendController, enable_norm_blend: bool):
    if not enable_norm_blend:
        return []

    patched = []
    target_fragments = ("post_attention_layernorm", "post_feedforward_layernorm")
    for name, module in model.named_modules():
        if not any(fragment in name for fragment in target_fragments):
            continue
```
Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When `--blend_output_norm` is enabled, `_patch_output_norms()` silently does nothing if no module names match the configured fragments. This can make runs look like they are using norm blending when they are not. Consider emitting a warning or raising when `enable_norm_blend` is true but `patched` remains empty (and/or logging how many modules were patched).
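
A minimal sketch of the suggested fix, with a duck-typed stand-in for `torch.nn.Module` so the example runs standalone. The function name and signature here are illustrative, not the PR's exact code:

```python
import warnings

def patch_output_norms(model, target_fragments=("post_attention_layernorm",
                                                "post_feedforward_layernorm")):
    """Collect modules whose names match the target fragments, warning loudly
    when nothing matched instead of silently patching zero modules."""
    patched = [name for name, _ in model.named_modules()
               if any(frag in name for frag in target_fragments)]
    if not patched:
        warnings.warn(
            f"norm blending enabled but no modules matched {target_fragments}; "
            "blending is effectively off.")
    return patched

class FakeModel:
    """Stand-in exposing named_modules() like a torch.nn.Module, for the sketch."""
    def __init__(self, names):
        self._names = names
    def named_modules(self):
        return [(name, object()) for name in self._names]
```

Logging `len(patched)` at startup serves the same purpose: a run that claims norm blending but patched zero modules becomes visible immediately instead of after a wasted training job.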
