SoftGrad

A lightweight, educational deep learning framework built on MLX for Apple Silicon. SoftGrad provides a clean, intuitive API for building and training neural networks while maintaining full transparency into the forward and backward pass computations.

Philosophy

SoftGrad is designed to help you understand deep learning by implementing it from scratch:

Explicit gradients: See exactly how backpropagation flows through each layer
Clean abstractions: Simple, readable code that mirrors mathematical definitions
Native MLX: Runs on Apple Silicon's GPU and unified memory through MLX
Educational focus: Learn by building real models that actually work

Features

Core Layers: Linear, Conv2d, MaxPool2d, Embedding, CausalSelfAttention
Structural Layers: Sequential, Parallel, Residual, ProjectionResidual
Normalization Layers: LayerNorm, BatchNorm
Activations: ReLU, LeakyReLU, Softmax, and custom function support
Loss Functions: Cross Entropy, Binary Cross Entropy, Mean Squared Error
Optimizers: SGD, AdamW, Lion
Checkpointing: Save and load model weights
MLX Interop: Use MLX models directly or load PyTorch weights

Quick Start

Installation

# Clone the repository
git clone https://github.com/AlexPetrusca/softgrad.git
cd softgrad

# Install dependencies
pip install -r requirements.txt

Hello World: Training a Simple Network

import mlx.core as mx
from softgrad import Network
from softgrad.layer.core import Linear, Activation
from softgrad.function.activation import relu
from softgrad.optim import SGD
from softgrad.function.loss import cross_entropy_loss

# Build network
network = Network(input_shape=784)
network.add_layer(Linear(256))
network.add_layer(Activation(relu))
network.add_layer(Linear(128))
network.add_layer(Activation(relu))
network.add_layer(Linear(10))

# Setup optimizer
optimizer = SGD(eta=0.01, momentum=0.9)
optimizer.bind_loss_fn(cross_entropy_loss)
optimizer.bind_network(network)

# Training loop
# `dataloader` is not defined here: supply any iterable of (x_batch, y_batch) pairs
for epoch in range(10):
    for x_batch, y_batch in dataloader:
        optimizer.step(x_batch, y_batch)
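
The training loop above iterates over a `dataloader` that the snippet does not define. One way to supply it is a small generator that shuffles the data once per epoch and yields (x_batch, y_batch) pairs. The sketch below uses NumPy so that it runs without MLX; `iterate_minibatches` is a hypothetical helper, not part of SoftGrad, and with SoftGrad each batch would presumably be converted to an mx.array before being passed to optimizer.step.

```python
import numpy as np


def iterate_minibatches(x, y, batch_size, seed=0):
    """Yield shuffled (x_batch, y_batch) pairs that cover the dataset once.

    Hypothetical helper for illustration; not part of SoftGrad.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))          # sample order for this epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        yield x[idx], y[idx]                 # the last batch may be smaller


# Toy data shaped like flattened 28x28 images with 10 classes
x = np.zeros((100, 784), dtype=np.float32)
y = np.arange(100) % 10

batches = list(iterate_minibatches(x, y, batch_size=32))
batch_shapes = [xb.shape for xb, _ in batches]
```

Calling the generator again with a different `seed` each epoch gives a fresh shuffle while still visiting every sample exactly once.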

Examples

A few examples of what SoftGrad can do.

1. Image Classification with CNN

from softgrad import Network
from softgrad.layer.conv import Conv2d, MaxPool2d
from softgrad.layer.core import Linear, Activation
from softgrad.layer.transform import Flatten
from softgrad.function.activation import relu

# Build a simple CNN
network = Network(input_shape=(32, 32, 3))

# Convolutional layers
network.add_layer(Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1))
network.add_layer(Activation(relu))
network.add_layer(MaxPool2d(kernel_size=2, stride=2))

network.add_layer(Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1))
network.add_layer(Activation(relu))
network.add_layer(MaxPool2d(kernel_size=2, stride=2))

# Classification head
network.add_layer(Flatten())
network.add_layer(Linear(256))
network.add_layer(Activation(relu))
network.add_layer(Linear(10))
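
To see what Flatten hands to the classification head, it helps to trace the spatial size through the network above. The sketch below applies the standard output-size formula for convolution and pooling windows. It assumes the Conv2d layers use a stride of 1, which the example does not state, and `window_out` is a hypothetical helper rather than a SoftGrad function.

```python
def window_out(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * padding - kernel_size) // stride + 1


size = 32                                        # input is 32 x 32 x 3
size = window_out(size, 3, padding=1)            # Conv2d(3 -> 32): still 32 x 32
size = window_out(size, 2, stride=2)             # MaxPool2d: 16 x 16
size = window_out(size, 3, padding=1)            # Conv2d(32 -> 64): still 16 x 16
size = window_out(size, 2, stride=2)             # MaxPool2d: 8 x 8

channels = 64
flat_features = size * size * channels           # what Flatten() produces per sample
```

Under these assumptions the first Linear(256) layer receives 8 * 8 * 64 = 4096 features per image.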

2. Transformer for Language Modeling (GPT)

from softgrad import Network
from softgrad.function.activation import Relu
from softgrad.function.core import Concatenate, Add
from softgrad.layer.attn import CausalSelfAttention
from softgrad.layer.core import Linear, Activation, Embedding
from softgrad.layer.core import Sequential, Parallel, Residual
from softgrad.layer.norm import LayerNorm
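from softgrad.layer.transform.PositionIndices import PositionIndices

# Hyperparameters that the classes below read from module scope
# (illustrative values, not taken from the source)
vocab_size = 65
block_size = 64
n_embd = 128
n_head = 4
n_layer = 4
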
class FeedForward(Sequential):
    """Position-wise MLP with expansion and non-linearity"""

    def __init__(self, n_embd):
        super().__init__([
            Linear(4 * n_embd),
            Activation(Relu()),
            Linear(n_embd)
        ])


class MultiHeadAttention(Sequential):
    """Multiple heads of causal self-attention in parallel"""

    def __init__(self, num_heads, head_size, block_size):
        super().__init__([
            Parallel([
                CausalSelfAttention(n_embd, head_size, block_size) # heads
                for _ in range(num_heads)
            ], Concatenate()),
            Linear(n_embd)  # projection
        ])

class TransformerBlock(Sequential):
    """Transformer block: communication followed by computation"""

    def __init__(self, n_embd, n_head):
        super().__init__([
            Residual(Sequential([
                LayerNorm(),
                MultiHeadAttention(n_head, n_embd // n_head, block_size)
            ])),
            Residual(Sequential([
                LayerNorm(),
                FeedForward(n_embd)
            ]))
        ])

network = Network(input_shape=(block_size,))

# Token and positional embeddings
network.add_layer(Parallel([
    Embedding(vocab_size, n_embd),  
    Sequential([
        PositionIndices(),
        Embedding(block_size, n_embd)
    ])
], Add()))

# Transformer blocks
network.add_layer(Sequential([
    TransformerBlock(n_embd, n_head)
    for _ in range(n_layer)
]))

# Language-model head: final norm, then projection to vocabulary logits
network.add_layer(LayerNorm())
network.add_layer(Linear(vocab_size))

See examples/transformer/minimal_transformer.py for a complete GPT-style transformer trained on Shakespeare.
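
For readers who want to see what a single causal self-attention head computes, here is a minimal NumPy sketch for one sequence. It is not SoftGrad's CausalSelfAttention implementation; the function name, the weight arguments, and the absence of bias terms are all assumptions made for illustration.

```python
import numpy as np


def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention for one sequence.

    x: (T, n_embd); w_q, w_k, w_v: (n_embd, head_size).
    Returns the (T, head_size) output and the (T, T) attention weights.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])           # scaled dot products
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)        # hide later positions
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ v, weights


rng = np.random.default_rng(0)
T, n_embd_demo, head_size = 5, 8, 4
x = rng.normal(size=(T, n_embd_demo))
w_q, w_k, w_v = (rng.normal(size=(n_embd_demo, head_size)) for _ in range(3))
out, weights = causal_self_attention(x, w_q, w_k, w_v)

# Changing the last token must leave all earlier outputs untouched
x_changed = x.copy()
x_changed[-1] += 1.0
out_changed, _ = causal_self_attention(x_changed, w_q, w_k, w_v)
```

The mask is what makes the head causal: position t only mixes values from positions 0 through t, so the outputs for earlier tokens do not depend on later ones.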

3. DeepDream with VGG16

from examples.deepdream import deep_dream_octaves
from load_vgg16 import load_vgg16_pretrained

# Load pretrained VGG16
vgg16 = load_vgg16_pretrained()

# Generate DeepDream
deep_dream_octaves(
    img_path="input.png",
    output_path="output.png",
    layer_names=['conv4_3', 'conv5_2'],
    octaves=4,
    n_iterations=10
)

See examples/deepdream/ for complete DeepDream implementation.
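
The core idea behind DeepDream is gradient ascent on the input: the image is nudged so that the activations of a chosen layer grow. The toy sketch below shows only that idea, using a random matrix in place of a pretrained layer and a flat vector in place of an image. It is not the implementation in examples/deepdream/, and the objective, step size, and normalisation are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64))      # stand-in for a pretrained layer
img = rng.normal(size=64) * 0.1    # stand-in for a flattened image


def objective_and_grad(x):
    """Return 0.5 * ||relu(W x)||^2 and its gradient with respect to x."""
    act = np.maximum(W @ x, 0.0)
    return 0.5 * np.sum(act ** 2), W.T @ act


history = []
for _ in range(10):
    value, grad = objective_and_grad(img)
    history.append(value)
    # Normalised ascent step: move the input so the activations grow
    img = img + 0.05 * grad / (np.abs(grad).mean() + 1e-8)

final_value, _ = objective_and_grad(img)
```

Dividing the gradient by its mean absolute value keeps the step size roughly constant from one iteration to the next, whatever the scale of the activations.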

Architecture

Forward and Backward Flow

Every layer implements three core methods:

class Layer:
    def _link(self):
        """Initialize parameters based on input shape"""
        
    def _forward(self, x_in: mx.array) -> mx.array:
        """Compute forward pass"""
        
    def _backward(self, dx_out: mx.array) -> mx.array:
        """Compute backward pass (gradient w.r.t. input)"""

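As an illustration of the three hooks, here is a standalone ReLU layer written with NumPy so that it runs without MLX or SoftGrad. It is a sketch only: it does not subclass SoftGrad's Layer, `_link` takes the input shape as an argument (an assumption made to keep the class self-contained), and the `_mask` attribute is a hypothetical name.

```python
import numpy as np


class ReLULayer:
    """Standalone stand-in for a layer with the three hooks described above."""

    def _link(self, input_shape):
        # ReLU has no parameters; the output shape equals the input shape
        self.output_shape = input_shape

    def _forward(self, x_in):
        self._mask = x_in > 0          # remember which inputs were positive
        return np.where(self._mask, x_in, 0.0)

    def _backward(self, dx_out):
        # The gradient passes through only where the input was positive
        return dx_out * self._mask


layer = ReLULayer()
layer._link(input_shape=(2,))
x = np.array([[-1.0, 2.0], [3.0, -4.0]])
y = layer._forward(x)
dx_in = layer._backward(np.ones_like(x))
```

The backward method returns the gradient with respect to the layer's input, which is what the previous layer in the network receives as its own dx_out.
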
Parameter Management

Parameters are stored with explicit gradient tracking:

# Setting parameters
layer.params["W"] = mx.array(weights)
layer.params["b"] = mx.array(bias)

# Accessing gradients (automatic "d" prefix)
weight_grad = layer.params["dW"]
bias_grad = layer.params["db"]
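
One way such a store could work is sketched below: setting a parameter also allocates a zero gradient buffer, which is then reachable under the same name with a "d" prefix. `ParamStore` is a hypothetical class written with NumPy for illustration; SoftGrad's actual parameter container may be organised differently.

```python
import numpy as np


class ParamStore:
    """Maps "W" to a parameter and "dW" to its gradient buffer."""

    def __init__(self):
        self._values = {}   # parameter name -> array
        self._grads = {}    # parameter name -> gradient array

    def __setitem__(self, key, value):
        if key.startswith("d") and key[1:] in self._values:
            self._grads[key[1:]] = value              # writing "dW" updates the gradient
        else:
            self._values[key] = value
            self._grads[key] = np.zeros_like(value)   # allocate a matching gradient

    def __getitem__(self, key):
        if key in self._values:
            return self._values[key]
        if key.startswith("d") and key[1:] in self._grads:
            return self._grads[key[1:]]
        raise KeyError(key)

    def zero_grad(self):
        for name, value in self._values.items():
            self._grads[name] = np.zeros_like(value)


params = ParamStore()
params["W"] = np.ones((3, 2))
params["dW"] += 0.5                    # accumulate into the gradient buffer
grad_after_update = params["dW"].copy()
params.zero_grad()
```

Keeping values and gradients in separate dictionaries avoids ambiguity for parameters whose own names happen to start with "d".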

Context Saving

Layers automatically save forward pass values for backward computation:

def _forward(self, x_in: mx.array) -> mx.array:
    x_out = x_in @ self.params["W"] + self.params["b"]
    # Context automatically stores x_in and x_out
    return x_out

def _backward(self, dx_out: mx.array) -> mx.array:
    # Access saved values
    x_in = self.ctx.x_in
    self.params["dW"] += x_in.T @ dx_out
    self.params["db"] += dx_out.sum(axis=0)  # bias gradient: sum over the batch axis
    return dx_out @ self.params["W"].T
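
A quick way to convince yourself that these backward formulas are right is to compare them with finite differences. The NumPy sketch below does that for a linear layer with 2-D inputs. The weight and input gradients use the formulas from `_backward`, and the bias gradient is the sum of dx_out over the batch axis. The `loss` and `numeric_grad` helpers are hypothetical names introduced for this check.

```python
import numpy as np

rng = np.random.default_rng(0)
x_in = rng.normal(size=(5, 4))
W = rng.normal(size=(4, 3))
b = rng.normal(size=(3,))
dx_out = rng.normal(size=(5, 3))      # pretend upstream gradient


def loss():
    # Scalar whose gradient with respect to the layer output is exactly dx_out
    return np.sum((x_in @ W + b) * dx_out)


# Analytic gradients
dW = x_in.T @ dx_out
db = dx_out.sum(axis=0)
dx_in = dx_out @ W.T


def numeric_grad(f, arr, eps=1e-6):
    """Central finite differences of scalar f() with respect to arr (in place)."""
    grad = np.zeros_like(arr)
    for idx in np.ndindex(arr.shape):
        old = arr[idx]
        arr[idx] = old + eps
        plus = f()
        arr[idx] = old - eps
        minus = f()
        arr[idx] = old                 # restore the original entry
        grad[idx] = (plus - minus) / (2 * eps)
    return grad


dW_num = numeric_grad(loss, W)
db_num = numeric_grad(loss, b)
dx_num = numeric_grad(loss, x_in)
```

If a hand-written backward pass disagrees with the numerical estimate by more than rounding error, the analytic formula is the first place to look.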

Gradient Accumulation

Gradients accumulate in the "d"-prefixed entries across backward passes until they are explicitly reset:

# In backward pass
self.params["dW"] += gradient  # Accumulate

# After optimizer step
layer.params.zero_grad()  # Reset for next batch
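
Accumulating with += works because the weight gradient of a linear layer is a sum over samples, so two backward passes over half batches add up to one pass over the full batch. The NumPy sketch below checks this. It assumes the upstream gradient is not averaged over the batch; with a mean-reduced loss, each partial gradient would need to be scaled accordingly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
dx_out = rng.normal(size=(8, 3))   # upstream gradient, one row per sample

# One backward pass over the full batch
dW_full = x.T @ dx_out

# Two backward passes over half batches, accumulated with +=
dW_acc = np.zeros((4, 3))
for rows in (slice(0, 4), slice(4, 8)):
    dW_acc += x[rows].T @ dx_out[rows]
```

This is also why the gradient buffers have to be reset after each optimizer step: otherwise the next update would include gradients from batches that were already applied.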

Advanced Features

Loading PyTorch Weights

from softgrad.util.pytorch_loader import load_pytorch_weights_into_network

# Automatic layer mapping
network.load_from_pytorch(pytorch_model.features)
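
Loading PyTorch weights usually involves a layout change. PyTorch stores Linear weights as (out_features, in_features) and Conv2d weights as (C_out, C_in, KH, KW), whereas a layer that computes x @ W needs (in_features, out_features) and MLX convolutions expect channels-last weights of shape (C_out, KH, KW, C_in). The NumPy sketch below shows those two conversions. Whether SoftGrad's loader performs exactly these steps is an assumption, and both helper names are hypothetical.

```python
import numpy as np


def linear_weight_from_pytorch(weight):
    """(out_features, in_features) -> (in_features, out_features) for x @ W."""
    return np.ascontiguousarray(weight.T)


def conv2d_weight_from_pytorch(weight):
    """(C_out, C_in, KH, KW) -> (C_out, KH, KW, C_in), channels last."""
    return np.ascontiguousarray(weight.transpose(0, 2, 3, 1))


# Linear: PyTorch computes x @ weight.T, so the transposed matrix gives the same result
torch_linear = np.arange(6, dtype=np.float32).reshape(2, 3)   # out=2, in=3
x = np.ones((1, 3), dtype=np.float32)
converted_linear = linear_weight_from_pytorch(torch_linear)
same_output = np.allclose(x @ converted_linear, x @ torch_linear.T)

# Conv2d: distinct sizes make the axis permutation easy to read
torch_conv = np.zeros((8, 3, 5, 7), dtype=np.float32)         # C_out, C_in, KH, KW
converted_conv = conv2d_weight_from_pytorch(torch_conv)
```

Bias vectors have the same shape in both layouts and can be copied as they are.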

Using MLX Models Directly (MLX Interop)

from mlx import nn
from softgrad import Network
from softgrad.layer.shim import MLX

# Wrap any MLX model
mlx_model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)

network = Network(input_shape=784)
network.add_layer(MLX(mlx_model))

Contributing

Contributions are welcome!

Acknowledgments

Built on MLX by Apple
DeepDream implementation based on Google's original work
GPT implementations based on Andrej Karpathy's minGPT and nanoGPT

⭐ If you find this project helpful, please consider starring it!

Why SoftGrad?

Because understanding comes from building. This framework is intentionally simple, readable, and educational. Every abstraction serves a pedagogical purpose. If you want to truly understand how neural networks work under the hood, build them yourself with SoftGrad.

Happy learning! 🚀
