This document provides a comprehensive guide to the StackWise architecture system, covering the Block-Stack-Rack paradigm, configuration management, and helper functions.
StackWise uses a hierarchical architecture that gives fine-grained control over training:
```
Rack (Complete Model)
├── Stack 1 (Collection of Blocks)
│   ├── Block 1 (Standard Transformer Block)
│   ├── Block 2 (Standard Transformer Block)
│   └── Block 3 (Standard Transformer Block)
├── Stack 2 (Collection of Blocks)
│   ├── Block 4 (Standard Transformer Block)
│   ├── Block 5 (Standard Transformer Block)
│   └── Block 6 (Standard Transformer Block)
└── ... (More Stacks)
```
A Block is the standard transformer block containing:
- Self-attention mechanism (MHA, GQA, MLA, or kernel-based)
- Feed-forward network (SwiGLU with optional frozen up-projections)
- Layer normalization (pre-norm style)
- Residual connections around both attention and FFN (sketched below)
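Conceptually, the pre-norm layout and residual connections listed above give a forward pass like the following sketch (illustrative pseudocode only; `block_forward` is not the actual `Block` implementation):

```python
def block_forward(x, self_attention, ffn, norm1, norm2):
    # Pre-norm: normalize before each sub-layer, with a residual around each
    x = x + self_attention(norm1(x))  # residual around self-attention
    x = x + ffn(norm2(x))             # residual around the feed-forward network
    return x
```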
Creating a single block:

```python
from model.architecture import Block

# Create a single block
block = Block(
    d_model=512,
    d_ff=2048,
    n_heads=8,
    n_kv_heads=2,
    attention_type="gqa",
    attention_mode="bidirectional"
)
```

A Stack is a collection of multiple Blocks, useful for:
- Block-wise training: Train groups of blocks together
- Memory management: Organize blocks into logical groups
- Fusion training: Train multiple blocks with frozen/trainable options
```python
from model.architecture import Stack

# Create a stack with multiple blocks
blocks = [Block(...) for _ in range(4)]
stack = Stack(blocks, stack_id=0)
```

A Rack is the final model containing:
- Input embeddings
- Multiple Stacks of Blocks
- Output layer (language model head)
- Positional encoding (RoPE if enabled)
```python
from model.architecture import Rack

# Create the complete model
stacks = [Stack(...), Stack(...)]
rack = Rack(
    stacks=stacks,
    vocab_size=50000,
    d_model=512,
    tie_embeddings=True
)
```

In the YAML configuration, the stack layout is specified under `model.architecture`:

```yaml
model:
  # Model dimensions
  d_model: 4096
  n_heads: 32
  n_kv_heads: 8
  d_ff: 14336

  # Architecture configuration
  architecture:
    n_stacks: 2          # Number of stacks
    blocks_per_stack: 4  # Number of blocks per stack
```

The corresponding configuration dataclasses:

```python
from dataclasses import dataclass, field

@dataclass
class ArchitectureConfig(BaseConfig):
    """Architecture configuration for stacks and blocks."""
    n_stacks: int = 2
    blocks_per_stack: int = 4

@dataclass
class ModelConfig(BaseConfig):
    """Model architecture configuration."""
    # ... other fields ...
    architecture: ArchitectureConfig = field(default_factory=ArchitectureConfig)
```

The `create_block_spec` helper builds a reusable block specification:

```python
from model.architecture import create_block_spec
# Create a block specification
block_spec = create_block_spec(
    d_model=512,
    d_ff=2048,
    n_heads=8,
    n_kv_heads=2,
    attention_type="gqa",
    attention_mode="bidirectional"
)
```

The `create_stack_from_spec` helper builds a stack of identical blocks from a spec:

```python
from model.architecture import create_stack_from_spec
# Create a stack with identical blocks
stack = create_stack_from_spec(
    stack_id=0,
    n_blocks=4,
    block_spec=block_spec,
    freeze_blocks=False
)
```

For simple, uniform models, `create_simple_rack` builds a complete rack in one call:

```python
from model.architecture import create_simple_rack
# Create a simple rack: 2 stacks, 4 blocks per stack
rack = create_simple_rack(
    n_stacks=2,
    blocks_per_stack=4,
    d_model=512,
    d_ff=2048,
    n_heads=8,
    vocab_size=10000
)
```

For heterogeneous models, `create_rack_from_specs` accepts per-stack specifications:

```python
from model.architecture import create_rack_from_specs
# Create stack specifications
stack_specs = [
    {
        "stack_id": 0,
        "n_blocks": 4,
        "block_spec": block_spec,
        "freeze_blocks": False
    },
    {
        "stack_id": 1,
        "n_blocks": 4,
        "block_spec": block_spec,
        "freeze_blocks": True  # Frozen stack
    }
]

# Create rack
rack = create_rack_from_specs(
    vocab_size=10000,
    d_model=512,
    stack_specs=stack_specs,
    tie_embeddings=True
)
```

Finally, `create_rack_from_config` builds a rack directly from a loaded configuration:

```python
from model.architecture import create_rack_from_config
from config.base import StackWiseConfig

# Load configuration and create rack
config = StackWiseConfig.from_yaml("config.yaml")
rack = create_rack_from_config(config.to_dict())
```

The architecture supports three training modes:
Block-wise: train each block independently:

```python
from training.architecture_trainer import ArchitectureTrainer

config.training.training_architecture = "blockwise"
trainer = ArchitectureTrainer(config)
results = trainer.train_architecture(rack, dataloader)
```

Stack-wise: train each stack independently:
config.training.training_architecture = "stackwise"
trainer = ArchitectureTrainer(config)
results = trainer.train_architecture(rack, dataloader)Train the entire model together:
config.training.training_architecture = "rackwise"
trainer = ArchitectureTrainer(config)
results = trainer.train_architecture(rack, dataloader)# 2 stacks, 4 blocks per stack = 8 total blocks
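Under the hood, block-wise training presumably cycles through the blocks, updating one at a time while the rest stay fixed. A rough conceptual sketch follows, assuming `Rack` and `Block` are standard `torch.nn.Module`s; this is not the actual `ArchitectureTrainer` internals, and `train_one_block` is a hypothetical helper:

```python
# Conceptual sketch of block-wise training (illustrative only).
# Each block is trained in turn while all other parameters are frozen.
for stack in rack.stacks:
    for block in stack.blocks:
        for p in rack.parameters():
            p.requires_grad = False         # freeze everything
        for p in block.parameters():
            p.requires_grad = True          # unfreeze the current block
        train_one_block(block, dataloader)  # hypothetical training step
```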
The following examples show common rack configurations. A basic rack:

```python
# 2 stacks, 4 blocks per stack = 8 total blocks
rack = create_simple_rack(
    n_stacks=2,
    blocks_per_stack=4,
    d_model=512,
    d_ff=2048,
    n_heads=8,
    vocab_size=10000
)
```

Mixing block types across stacks, e.g. an encoder-decoder layout:

```python
# Different block types for different stacks
encoder_spec = create_block_spec(
    d_model=256, attention_mode="bidirectional"
)
decoder_spec = create_block_spec(
    d_model=256, attention_mode="causal"
)

# Create stack specifications
stack_specs = [
    {"stack_id": 0, "n_blocks": 3, "block_spec": encoder_spec},
    {"stack_id": 1, "n_blocks": 3, "block_spec": decoder_spec}
]

# Create rack
rack = create_rack_from_specs(5000, 256, stack_specs)
```

Scaling the same pattern up or down:

```python
# Small model
small = create_simple_rack(n_stacks=1, blocks_per_stack=2, ...)

# Medium model
medium = create_simple_rack(n_stacks=2, blocks_per_stack=3, ...)

# Large model
large = create_simple_rack(n_stacks=3, blocks_per_stack=4, ...)
```

Choosing the training granularity via the stack layout:

```python
# Layer-wise: 8 stacks, 1 block each
layerwise = create_simple_rack(n_stacks=8, blocks_per_stack=1, ...)

# Block-wise: 2 stacks, 4 blocks each
blockwise = create_simple_rack(n_stacks=2, blocks_per_stack=4, ...)

# Stack-wise: 1 stack, 8 blocks
stackwise = create_simple_rack(n_stacks=1, blocks_per_stack=8, ...)
```

An architecture tuned for training:

```python
# Architecture optimized for training
block_spec = create_block_spec(
    d_model=512, d_ff=2048, n_heads=8,
    attention_type="gqa", attention_mode="bidirectional",
    dropout=0.1, freeze_up_proj=True
)
stack_specs = [
    {"stack_id": 0, "n_blocks": 4, "block_spec": block_spec, "freeze_blocks": False},
    {"stack_id": 1, "n_blocks": 4, "block_spec": block_spec, "freeze_blocks": False}
]
rack = create_rack_from_specs(10000, 512, stack_specs)
```

A typical end-to-end workflow: load the configuration, build the rack, train it, then run a forward pass.

```python
from model.architecture import create_rack_from_config
from config.base import StackWiseConfig

# Load configuration
config = StackWiseConfig.from_yaml("config.yaml")

# Create rack
rack = create_rack_from_config(config.to_dict())
```

```python
from training.architecture_trainer import ArchitectureTrainer
# Create trainer
trainer = ArchitectureTrainer(config)

# Train the architecture
results = trainer.train_architecture(rack, dataloader)
```

```python
import torch

# Forward pass
batch_size, seq_len = 2, 128  # example shapes
input_ids = torch.randint(0, 50000, (batch_size, seq_len))
logits = rack(input_ids)

# Get model information
print(f"Parameters: {rack.get_parameter_count():,}")
print(f"Stacks: {len(rack.stacks)}")
print(f"Total blocks: {sum(len(stack.blocks) for stack in rack.stacks)}")
```

To recap the terminology:

- Block: Standard transformer block
- Stack: Collection of blocks (logical grouping)
- Rack: Complete model (final assembly)
- Hierarchical structure makes the model easier to understand
- Clear separation between individual components and complete model
- Intuitive naming that matches the physical hardware analogy
- Block-wise: Train each block independently (layer-wise training)
- Stack-wise: Train groups of blocks together (block-wise training)
- Rack-wise: Train the entire model together (end-to-end training)
- Stacks can be frozen/unfrozen independently (see the sketch after this list)
- Blocks can be trained individually to reduce memory usage
- Support for different training strategies per stack
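As a concrete illustration of the freezing flexibility above, here is a minimal sketch, assuming `Stack` is a standard `torch.nn.Module`; the actual API may expose dedicated freeze/unfreeze methods, and `set_stack_trainable` is a hypothetical helper:

```python
import torch

# Hypothetical helper: freeze or unfreeze all parameters of a stack.
# Assumes Stack subclasses torch.nn.Module, so .parameters() exists.
def set_stack_trainable(stack: torch.nn.Module, trainable: bool) -> None:
    for param in stack.parameters():
        param.requires_grad = trainable

set_stack_trainable(rack.stacks[1], trainable=False)  # freeze the second stack
set_stack_trainable(rack.stacks[1], trainable=True)   # unfreeze it again
```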
- No redundancy: Only specify what you need
- Clear intent: `n_stacks` and `blocks_per_stack` are self-explanatory
- Less confusion: no need to calculate total blocks by hand
```python
# Simple cases
rack = create_simple_rack(n_stacks=2, blocks_per_stack=4, ...)

# Complex cases
rack = create_rack_from_specs(vocab_size, d_model, stack_specs)
```

The architecture enables several future enhancements:
- Stack-specific Training: Different training strategies per stack
- Dynamic Stacking: Add/remove stacks during training
- Stack Fusion: Merge multiple stacks into one
- Stack Quantization: Different quantization per stack
- Stack Caching: Efficient caching per stack
The StackWise architecture provides:
- ✅ Clearer naming that matches the physical hardware analogy
- ✅ Better organization with hierarchical structure
- ✅ Flexible training modes for different scenarios
- ✅ Intuitive configuration with simplified parameters
- ✅ Helper functions that reduce boilerplate code
- ✅ Memory efficiency for large-model training
- ✅ Future extensibility for advanced features
This architecture makes StackWise more intuitive to use while maintaining all the powerful features needed for revolutionary layer-wise transformer training.