Conversation
Pull request overview
Adds an end-to-end experimentation workflow for Gemma 270M EN→ES translation to compare attention activation variants (Softmax vs ReLU variants) including gradual blending, plus utilities for plotting and benchmarking.
Changes:
- Extended `finetune.py` with configurable dataset/prompt params and experimental attention modes (`softmax`, `sum`, `gradual_blend`), including alpha scheduling and optional output-norm blending.
- Added runnable demo + utilities: a staged fine-tuning shell script, a validation-loss plotting script, and a translation benchmarking script.
- Expanded `README.md` with step-by-step instructions for the new experiment paths and utilities.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| huggingface_model/gemma/270M/finetune.py | Adds attention-mode experimentation (monkey-patched softmax), alpha scheduling callback, configurable prompts/dataset args, and optional norm blending. |
| huggingface_model/gemma/270M/demo_gradual_blend_en_es.sh | Provides a staged warmup→gradual-blend runnable workflow with baseline/plot/benchmark command templates. |
| huggingface_model/gemma/270M/benchmark_en_es_translation.py | Adds a lightweight checkpoint evaluation script (exact match + optional BLEU/chrF). |
| huggingface_model/gemma/270M/plot_validation_loss.py | Adds a utility to plot eval loss curves from Trainer state logs across runs. |
| huggingface_model/gemma/270M/README.md | Documents the gradual blend recipe, baselines, plotting, and benchmarking. |
```python
dataset = load_dataset(args.dataset_name, args.dataset_config, split=args.dataset_split)
train_test_split = dataset.train_test_split(test_size=0.1)
```
train_test_split(test_size=0.1) is currently unseeded, so train/eval splits (and therefore reported eval_loss/benchmarks) will vary run-to-run. Consider adding a --seed CLI arg and passing it to both dataset.train_test_split(..., seed=...) and TrainingArguments(seed=..., data_seed=...) to make experiments reproducible/comparable.
Suggested change:

```python
seed = getattr(args, "seed", 42)
dataset = load_dataset(args.dataset_name, args.dataset_config, split=args.dataset_split)
train_test_split = dataset.train_test_split(test_size=0.1, seed=seed)
```
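To make the point concrete, here is a pure-Python stand-in for the seeded split (a hypothetical helper, not the `datasets` implementation): seeding the shuffle makes the partition deterministic, so eval metrics are comparable across runs.

```python
import random

def train_test_split_indices(n, test_size, seed=None):
    # Minimal stand-in for datasets.Dataset.train_test_split:
    # shuffle row indices, then carve off the last `test_size`
    # fraction as the eval split.
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = n - int(n * test_size)
    return indices[:cut], indices[cut:]
```

With `seed=None` the shuffle draws fresh entropy each run, which is exactly the run-to-run variance described above; any fixed integer removes it.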
```python
def update(self, step: int):
    progress = min(step, self.anneal_steps) / float(self.anneal_steps)
    self.alpha = self.alpha_start + (self.alpha_end - self.alpha_start) * progress
    self.alpha = max(self.alpha, 0.0)  # clamp; cannot go below zero
```
BlendController.update() clamps alpha only on the lower bound. If a user passes --alpha_start/--alpha_end outside [0, 1], the blend weights can become negative or >1 (e.g., 1 - alpha), which is likely unintended for a convex interpolation. Consider validating inputs (raise) or clamping alpha to [0, 1] when attention_mode=gradual_blend.
Suggested change:

```python
self.alpha = min(max(self.alpha, 0.0), 1.0)  # clamp to valid blend range
```
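Pulling the snippets together, here is a self-contained sketch of `BlendController` with both fixes applied (input validation plus two-sided clamping); the constructor signature is assumed from the `--alpha_start`/`--alpha_end` flags mentioned above.

```python
class BlendController:
    # Linearly anneals a blend weight alpha from alpha_start to alpha_end
    # over anneal_steps, then holds it constant. Values are clamped to
    # [0, 1] so both alpha and 1 - alpha stay valid convex weights.
    def __init__(self, alpha_start: float, alpha_end: float, anneal_steps: int):
        if not (0.0 <= alpha_start <= 1.0 and 0.0 <= alpha_end <= 1.0):
            raise ValueError("alpha_start and alpha_end must lie in [0, 1]")
        self.alpha_start = alpha_start
        self.alpha_end = alpha_end
        self.anneal_steps = max(anneal_steps, 1)
        self.alpha = alpha_start

    def update(self, step: int):
        progress = min(step, self.anneal_steps) / float(self.anneal_steps)
        self.alpha = self.alpha_start + (self.alpha_end - self.alpha_start) * progress
        self.alpha = min(max(self.alpha, 0.0), 1.0)  # clamp to valid blend range
```

Raising in the constructor catches bad CLI input once at startup, rather than clamping silently on every step.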
```python
def _patch_output_norms(model, blend_controller: BlendController, enable_norm_blend: bool):
    if not enable_norm_blend:
        return []

    patched = []
    target_fragments = ("post_attention_layernorm", "post_feedforward_layernorm")
    for name, module in model.named_modules():
        if not any(fragment in name for fragment in target_fragments):
            continue
```
When --blend_output_norm is enabled, _patch_output_norms() silently does nothing if no module names match the configured fragments. This can make runs look like they're using norm blending when they aren't. Consider emitting a warning or raising when enable_norm_blend is true but patched remains empty (and/or logging how many modules were patched).
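One way to surface that failure mode is to fail loudly when nothing matched; a hypothetical variant, duck-typed against anything exposing `named_modules()`:

```python
import logging

logger = logging.getLogger(__name__)

def patch_output_norms_checked(model, target_fragments=(
        "post_attention_layernorm", "post_feedforward_layernorm")):
    # Hypothetical variant of _patch_output_norms: collect matching module
    # names and raise if the architecture has none, instead of silently
    # skipping norm blending.
    patched = [name for name, _ in model.named_modules()
               if any(fragment in name for fragment in target_fragments)]
    if not patched:
        raise RuntimeError(
            "--blend_output_norm was enabled but no modules matched "
            f"{target_fragments}; check the model architecture.")
    logger.info("Norm blending patched %d modules", len(patched))
    return patched
```

Logging the patch count also gives a cheap sanity check in run logs even when the happy path is taken.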
This pull request introduces a comprehensive workflow for experimenting with gradual attention mechanism blending (Softmax → ReLU variants) in Gemma 270M for English→Spanish translation. It adds a runnable demo script, benchmarking utilities, and major enhancements to the fine-tuning script, enabling flexible experimentation with attention activations, output norm blending, and improved reproducibility. The documentation is updated to provide clear walkthroughs for all supported experiment paths.
Key changes include:

Experimentation Workflow & Utilities
- `demo_gradual_blend_en_es.sh`: a ready-to-run shell script that orchestrates staged fine-tuning (Softmax warmup, gradual blend, and baselines) and provides commented commands for plotting and benchmarking.
- `benchmark_en_es_translation.py`: evaluates translation checkpoints with exact-match, BLEU, and chrF metrics, supporting quick quality assessment of experimental runs.

Fine-tuning Enhancements
- Updated `finetune.py` to support configurable attention mechanisms (`softmax`, `sum`, `gradual_blend`), including alpha scheduling, output norm blending, and activation variants (`relumax`, `relu2max`). This includes new classes (`BlendController`, `AlphaScheduleCallback`) and monkey-patching logic for attention and output norm blending.

Documentation
- Expanded `README.md` with detailed instructions for running gradual blend, baselines, plotting validation loss, and benchmarking translation quality, making the experimentation process transparent and reproducible.
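The core idea behind `gradual_blend` can be sketched on a single row of attention scores: a convex combination of the softmax weights and a ReLU-based variant, with alpha annealed from 1 toward 0 over training. The `relu_norm` definition below is an assumption for illustration; the PR's `sum`/`relumax`/`relu2max` variants may normalize differently.

```python
import math

def softmax(scores):
    # Numerically stable softmax over one attention row.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relu_norm(scores, seq_len):
    # Assumed ReLU variant: relu(score) / seq_len (one common formulation).
    return [max(s, 0.0) / seq_len for s in scores]

def blended_attention_weights(scores, alpha, seq_len):
    # gradual_blend-style convex combination:
    #   alpha * softmax(scores) + (1 - alpha) * relu_variant(scores)
    sm = softmax(scores)
    rl = relu_norm(scores, seq_len)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(sm, rl)]
```

At `alpha=1.0` this reduces exactly to standard softmax attention (the warmup stage); as alpha anneals toward 0 the model is eased onto the ReLU variant rather than switched abruptly.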