
Add create new mlp variation with two gates #795

Open

klei22 wants to merge 3 commits into ReaLLMASIC:master from klei22:add-create-new-mlp-variation-with-two-gates

Conversation

klei22 (Collaborator) commented Apr 11, 2026

This pull request introduces a new MLP variant called swiglu_2gate_pre_act, expands the configuration and experiment setup to compare this and other MLP activation variants under parameter-matched conditions, and makes minor improvements to argument parsing and configuration handling. The main focus is on enabling and evaluating the new two-gate SwiGLU pre-activation architecture alongside other variants.

Key changes:

New MLP variant and integration

  • Implemented the SwiGLUTwoGatesPreAct class in mlp_variations.py, a SwiGLU variant that applies two gates before the non-linearity, including all relevant quantization, normalization, and offset logic. It is registered as swiglu_2gate_pre_act in the activation dictionary and in the MLP instantiation logic.
  • Added "swiglu_2gate_pre_act" to the list of supported MLP variants in the argument parser in train_args.py, so it can be selected via CLI/config.
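As a rough illustration of the dataflow described above, here is a minimal NumPy sketch of a two-gate pre-activation SwiGLU forward pass. The function name, weight names, and shapes are hypothetical stand-ins, not the actual `SwiGLUTwoGatesPreAct` implementation; the key point it shows is that both gates multiply the main up-projection *before* the SiLU non-linearity, whereas standard SwiGLU combines a single gate with the activated branch.

```python
import numpy as np

def swiglu_2gate_pre_act_forward(x, W_main, W_gate1, W_gate2, W_down):
    """Hypothetical sketch of a two-gate pre-activation SwiGLU MLP.

    Both gates are applied to the main up-projection before the
    non-linearity, then the result is activated and down-projected.
    """
    x_main = x @ W_main                   # main up-projection
    gate1 = x @ W_gate1                   # first gate (linear, no activation)
    gate2 = x @ W_gate2                   # second gate (linear, no activation)
    pre_act = (x_main * gate1) * gate2    # gates applied pre-activation
    activated = pre_act * (1.0 / (1.0 + np.exp(-pre_act)))  # SiLU(z) = z * sigmoid(z)
    return activated @ W_down             # down-projection back to model dim

# Toy shapes: batch 2, model dim 4, hidden dim 8 (illustrative only).
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4))
W_main = rng.standard_normal((4, 8))
W_gate1 = rng.standard_normal((4, 8))
W_gate2 = rng.standard_normal((4, 8))
W_down = rng.standard_normal((8, 4))
y = swiglu_2gate_pre_act_forward(x, W_main, W_gate1, W_gate2, W_down)
print(y.shape)  # (2, 4)
```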

Experimental configuration and comparison

  • Added a new experiment YAML file mlp_equal_params_vs_swiglu_minipile.yaml that sets up a comprehensive comparison of regular SwiGLU, dual-path, and parameter-matched plain MLP variants (with various activations) on the minipile dataset. This includes rationale for parameter matching, and defines multiple named groups for systematic exploration.
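The YAML itself defines the experiment groups; as a back-of-the-envelope illustration of why parameter matching requires widening the plain MLP, the sketch below counts weight-matrix parameters (biases ignored, dimensions chosen arbitrarily). A plain MLP has two d×h projections, standard SwiGLU three (up, gate, down), and the two-gate variant four, so a parameter-matched plain MLP needs roughly a 1.5× wider hidden dimension than SwiGLU at the same model width.

```python
def mlp_params(d_model, d_hidden, n_up, n_down=1):
    """Weight parameters for an MLP with n_up up-projections and
    n_down down-projections, each of size d_model x d_hidden."""
    return (n_up + n_down) * d_model * d_hidden

d, h = 512, 2048  # illustrative dimensions, not values from the YAML

plain = mlp_params(d, h, n_up=1)      # up + down           -> 2*d*h
swiglu = mlp_params(d, h, n_up=2)     # up + gate + down    -> 3*d*h
two_gate = mlp_params(d, h, n_up=3)   # up + 2 gates + down -> 4*d*h

# Hidden width a plain MLP needs to match SwiGLU's parameter count:
h_matched = swiglu // (2 * d)         # 3*d*h / (2*d) = 1.5*h
print(plain, swiglu, two_gate, h_matched)
```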

Configuration and usability improvements

  • Changed the default device argument in train_args.py from 'cuda' to 'cuda:0' for more explicit device selection.
  • Added support for l2_norm_print_dims in the MLP config initialization for potential debugging or logging.

Copilot AI left a comment

Pull request overview

This PR adds a new MLP variant (swiglu_2gate_pre_act) to the model-variation system and wires it into CLI/config + exploration tooling to enable parameter-matched comparisons against existing MLP/SwiGLU variants.

Changes:

  • Introduces SwiGLUTwoGatesPreAct and registers it as swiglu_2gate_pre_act in the MLP factory.
  • Extends CLI argument choices to allow selecting the new MLP variant and makes the default --device more explicit (cuda:0).
  • Adds a new exploration YAML to run parameter-matched sweeps comparing SwiGLU, the new 2-gate variant, and plain MLP activations on minipile.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| variations/mlp_variations.py | Implements and registers the new 2-gate pre-activation SwiGLU MLP module. |
| train_args.py | Exposes the new MLP variant in CLI choices and updates the default device string. |
| explorations/mlp_equal_params_vs_swiglu_minipile.yaml | Adds an experiment grid to compare MLP activation variants under (approximately) parameter-matched settings. |


Comment on lines +767 to +775

```python
if self.quantization_mlp_dict["quantize_mlp_act_activation_input"]:
    num_bits = self.quantization_mlp_dict["quantize_mlp_act_activation_input_bits"]
    quant_method = self.quantization_mlp_dict["activations_quant_method"]
    x_main = fake_quantize_act(self, "mlp_act_activation_input", x_main, num_bits, quant_method, iter_num)

gate1 = self._up_project(x, self.c_fc_gate1)
gate2 = self._up_project(x, self.c_fc_gate2)

x_out = (x_main * gate1) * gate2
```

Copilot AI Apr 11, 2026
In SwiGLUTwoGatesPreAct.forward, quantize_mlp_act_activation_input is applied to x_main, but the activation function is actually applied to x_out (after multiplying by both gates). This means the “activation input” quantization is quantizing the wrong tensor for this variant and will make quantization runs inconsistent with other MLPs (e.g., DualPathSwiglu quantizes the true pre-activation tensor). Apply this quantization step to x_out right before calling self.activation_variant(...) (and consider whether the gates themselves should be quantized under the same flag).

Suggested change

```diff
-if self.quantization_mlp_dict["quantize_mlp_act_activation_input"]:
-    num_bits = self.quantization_mlp_dict["quantize_mlp_act_activation_input_bits"]
-    quant_method = self.quantization_mlp_dict["activations_quant_method"]
-    x_main = fake_quantize_act(self, "mlp_act_activation_input", x_main, num_bits, quant_method, iter_num)
-gate1 = self._up_project(x, self.c_fc_gate1)
-gate2 = self._up_project(x, self.c_fc_gate2)
-x_out = (x_main * gate1) * gate2
+gate1 = self._up_project(x, self.c_fc_gate1)
+gate2 = self._up_project(x, self.c_fc_gate2)
+x_out = (x_main * gate1) * gate2
+if self.quantization_mlp_dict["quantize_mlp_act_activation_input"]:
+    num_bits = self.quantization_mlp_dict["quantize_mlp_act_activation_input_bits"]
+    quant_method = self.quantization_mlp_dict["activations_quant_method"]
+    x_out = fake_quantize_act(self, "mlp_act_activation_input", x_out, num_bits, quant_method, iter_num)
```

2 participants