This document explains the new hierarchical architecture introduced in StackWise, which provides a clearer and more intuitive naming system for transformer components.
The new architecture follows a physical analogy that makes the structure more intuitive:
```
Rack (Final Model)
├── Stack 1 (Collection of Blocks)
│   ├── Block 1 (Standard Transformer Block)
│   ├── Block 2 (Standard Transformer Block)
│   └── Block 3 (Standard Transformer Block)
├── Stack 2 (Collection of Blocks)
│   ├── Block 4 (Standard Transformer Block)
│   ├── Block 5 (Standard Transformer Block)
│   └── Block 6 (Standard Transformer Block)
└── ... (More Stacks)
```
A Block is the standard transformer block containing:
- Self-attention mechanism (MHA, GQA, MLA, or kernel-based)
- Feed-forward network (SwiGLU with optional frozen up-projections)
- Layer normalization (pre-norm style)
- Residual connections (around both attention and FFN)
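The components above can be sketched as a minimal pre-norm block in PyTorch. This is an illustrative stand-in using plain multi-head attention, not StackWise's actual `Block` implementation (which also supports GQA/MLA variants and frozen up-projections):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreNormBlock(nn.Module):
    """Illustrative pre-norm transformer block: norm -> attention -> residual,
    then norm -> SwiGLU FFN -> residual."""

    def __init__(self, d_model: int, d_ff: int, n_heads: int):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn_norm = nn.LayerNorm(d_model)
        # SwiGLU FFN: gate and up projections, SiLU gating, down projection
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out  # residual around attention
        h = self.ffn_norm(x)
        x = x + self.w_down(F.silu(self.w_gate(h)) * self.w_up(h))  # residual around FFN
        return x

block = PreNormBlock(d_model=512, d_ff=2048, n_heads=8)
x = torch.randn(2, 16, 512)
print(block(x).shape)  # torch.Size([2, 16, 512])
```

Note the pre-norm placement: normalization is applied before each sub-layer, and the residual adds the sub-layer output back to the unnormalized input.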
```python
from model.architecture import Block

# Create a single block
block = Block(
    d_model=512,
    d_ff=2048,
    n_heads=8,
    n_kv_heads=2,
    attention_type="gqa",
    attention_mode="bidirectional"
)
```

A Stack is a collection of multiple Blocks, useful for:
- Block-wise training: Train groups of blocks together
- Memory management: Organize blocks into logical groups
- Fusion training: Train multiple blocks with frozen/trainable options
```python
from model.architecture import Stack

# Create a stack with multiple blocks
blocks = [Block(...) for _ in range(4)]
stack = Stack(blocks, stack_id=0)
```

A Rack is the final model containing:
- Input embeddings
- Multiple Stacks of Blocks
- Output layer (language model head)
- Positional encoding (RoPE if enabled)
```python
from model.architecture import Rack

# Create the complete model
stacks = [Stack(...), Stack(...)]
rack = Rack(
    stacks=stacks,
    vocab_size=50000,
    d_model=512,
    tie_embeddings=True
)
```

The new architecture supports three training modes:
Train each block independently:

```python
from training.architecture_trainer import ArchitectureTrainer

config.training.training_architecture = "blockwise"
trainer = ArchitectureTrainer(config)
results = trainer.train_architecture(rack, dataloader)
```

Train each stack independently:

```python
config.training.training_architecture = "stackwise"
trainer = ArchitectureTrainer(config)
results = trainer.train_architecture(rack, dataloader)
```

Train the entire model together:

```python
config.training.training_architecture = "rackwise"
trainer = ArchitectureTrainer(config)
results = trainer.train_architecture(rack, dataloader)
```

The configuration system has been updated to support the new architecture:
```yaml
model:
  # Architecture configuration
  architecture:
    n_stacks: 2          # Number of stacks
    blocks_per_stack: 4  # Number of blocks per stack

training:
  # Training architecture modes
  training_architecture: "blockwise"  # blockwise | stackwise | rackwise
```

```python
from model.architecture import create_rack_from_config
from config.base import StackWiseConfig

# Load configuration
config = StackWiseConfig.from_yaml("config.yaml")

# Create rack
rack = create_rack_from_config(config.to_dict())
```

```python
from training.architecture_trainer import ArchitectureTrainer

# Create trainer
trainer = ArchitectureTrainer(config)

# Train the architecture
results = trainer.train_architecture(rack, dataloader)
```

```python
# Forward pass
input_ids = torch.randint(0, 50000, (batch_size, seq_len))
logits = rack(input_ids)

# Get model information
print(f"Parameters: {rack.get_parameter_count():,}")
print(f"Stacks: {len(rack.stacks)}")
print(f"Total blocks: {sum(len(stack.blocks) for stack in rack.stacks)}")
```

- Block: Standard transformer block (what was previously called "layer")
- Stack: Collection of blocks (logical grouping)
- Rack: Complete model (final assembly)
- Hierarchical structure makes the model easier to understand
- Clear separation between individual components and complete model
- Intuitive naming that matches physical hardware analogy
- Block-wise: Train each block independently (formerly called "layer-wise" training)
- Stack-wise: Train groups of blocks together (formerly called "block-wise" training)
- Rack-wise: Train the entire model together (end-to-end training)
- Stacks can be frozen/unfrozen independently
- Blocks can be trained individually to reduce memory usage
- Support for different training strategies per stack
The new architecture is backward compatible:
- `MLGKALayer` → `Block` (with attention + FFN + layer norm + residual)
- `model_layers` → `rack.stacks[].blocks`
- `layerwise_training` → `mode` (with values "layerwise", "blockwise", "fused")
- `blockwise_training` → `stackwise_training`
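For old configurations, the key renames can be applied mechanically. The helper below is a hypothetical sketch (not part of StackWise) that maps the two unambiguous old modes and the `block_size` key to the new layout; note the mode names shift one level up the hierarchy, so old "layerwise" becomes new "blockwise":

```python
def migrate_training_config(old_training: dict) -> dict:
    """Hypothetical helper: rewrite an old training config dict into the
    new architecture-centric keys. Modes outside the mapping (e.g. old
    "fused" setups) should be migrated by hand."""
    mode_map = {"layerwise": "blockwise", "blockwise": "stackwise"}
    mode = old_training["mode"]
    if mode not in mode_map:
        raise ValueError(f"migrate mode {mode!r} by hand")
    new = {"training": {"training_architecture": mode_map[mode]}}
    if "block_size" in old_training:
        new["architecture"] = {"blocks_per_stack": old_training["block_size"]}
    return new

print(migrate_training_config({"mode": "layerwise", "block_size": 4}))
# {'training': {'training_architecture': 'blockwise'}, 'architecture': {'blocks_per_stack': 4}}
```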
```yaml
# Old configuration
training:
  mode: "layerwise"
  block_size: 4
```

```yaml
# New configuration
training:
  training_architecture: "blockwise"  # or "stackwise" or "rackwise"

architecture:
  # Use architecture.n_stacks and architecture.blocks_per_stack instead
  n_stacks: 2
  blocks_per_stack: 4
```

See the following examples for detailed usage:
- `examples/architecture_example.py` - Basic architecture usage
- `examples/new_architecture_training.py` - Training examples
- `examples/gpt2_fusion/` - GPT-2 fusion training with new architecture
The new architecture enables several future enhancements:
- Stack-specific Training: Different training strategies per stack
- Dynamic Stacking: Add/remove stacks during training
- Stack Fusion: Merge multiple stacks into one
- Stack Quantization: Different quantization per stack
- Stack Caching: Efficient caching per stack
The new Block/Stack/Rack architecture provides:
- ✅ Clearer naming that matches physical hardware analogy
- ✅ Better organization with hierarchical structure
- ✅ Flexible training modes for different scenarios
- ✅ Backward compatibility with existing code
- ✅ Future extensibility for advanced features
This architecture makes StackWise more intuitive to use while maintaining all the powerful features of the original system.