An educational implementation of GPT from scratch using NumPy, heavily inspired by deep-learning-from-scratch-3, nanoGPT, and PyTorch.
- Automatic differentiation is the core algorithm that lets neural networks perform backpropagation by applying the chain rule.
- To implement the chain rule, we need to satisfy the following requirements:
- During forward propagation,
- A function gets input tensors and returns output tensors.
- The function should refer to the input tensors and output tensors.
- The output tensor should also refer to the function.
- During backward propagation,
- The output tensor holds a reference to the function that generated it.
- The function computes the gradient of the output tensor with respect to the input tensor.
- If a tensor is fed into multiple functions, the gradients flowing back to it should be accumulated (summed).
- To meet the requirements, we define the following concepts:
- `Creator` - A function that generates a tensor.
  - When backpropagating, each output tensor refers to its `Creator` so that the creator can compute the gradient of the output tensor. A minimal code sketch follows the diagram below.
```mermaid
graph LR
    Tensor1[Tensor] --> Function[Function]
    Function --> Tensor2[Tensor]
    Function -. "inputs" .-> Tensor1
    Function -. "outputs" .-> Tensor2
    Tensor2 -. "creator" .-> Function
```
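As a rough illustration of these references, here is a minimal standalone sketch (hypothetical names, not this repository's actual classes): the function records its inputs and outputs during the forward pass, and the output records its creator so the gradient can be computed by walking backward.

```python
import numpy as np

class Tensor:
    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        self.creator = None   # the Function that produced this tensor
        self.grad = None

class Square:
    def __call__(self, x: Tensor) -> Tensor:
        y = Tensor(x.data ** 2)
        self.inputs = (x,)    # forward: the function refers to its inputs
        self.outputs = (y,)   # ... and to its outputs
        y.creator = self      # the output refers back to its creator
        return y

    def backward(self, gy: np.ndarray) -> np.ndarray:
        # Chain rule: dy/dx = 2x, multiplied by the upstream gradient gy.
        return 2 * self.inputs[0].data * gy

x = Tensor(3.0)
y = Square()(x)
x.grad = y.creator.backward(np.array(1.0))  # walk back through the creator
print(x.grad)  # 6.0
```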
- `Generation` - Each node (tensor or function) has a generation number that records the order in which it was created.
  - Starting from 0, the generation number increases as nodes are created.
  - When backpropagating, backward computations are performed in the reverse order of the generation numbers.
  - Why is it needed?
    - Gradients at each generation should be fully accumulated before being passed back to functions at earlier generations.
    - E.g., in the following diagram, `dy/da` is the sum of `dy/db * db/da` and `dy/dc * dc/da`.
    - But there is a risk that node `a` passes these gradients back one by one, which would make the backpropagation through `Func1` run twice.
    - To avoid this, we compute gradients of higher-generation tensors first and accumulate the gradients of tensors that were fed into multiple functions (a code sketch follows the diagram below).
```mermaid
graph LR
    x((x @ 0)) --> Func1[Func1 @ 0]
    Func1 --> a((a @ 1))
    a --> Func2[Func2 @ 1]
    a --> Func3[Func3 @ 1]
    Func2 --> b((b @ 2))
    Func3 --> c((c @ 2))
    b --> Func4[Func4 @ 2]
    c --> Func4[Func4 @ 2]
    Func4 --> y((y @ 3))
```
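Below is a minimal, self-contained sketch of generation-ordered backpropagation (illustrative names only, not this repository's API). Functions wait in a max-heap keyed by generation, so `Func2`/`Func3` are processed before `Func1`, and the gradients flowing into `a` are summed before they are passed back to `x`.

```python
import heapq
import numpy as np

class Tensor:
    def __init__(self, data, creator=None):
        self.data = np.asarray(data, dtype=float)
        self.grad = None
        self.creator = creator
        # A tensor is one generation after the function that created it.
        self.generation = creator.generation + 1 if creator else 0

class Add:  # y = a + b; dy/da = dy/db = 1
    def __call__(self, a, b):
        self.inputs = (a, b)
        self.generation = max(a.generation, b.generation)
        out = Tensor(a.data + b.data, creator=self)
        self.outputs = (out,)
        return out
    def backward(self, gy):
        return (gy, gy)

class Mul:  # y = a * b; dy/da = b, dy/db = a
    def __call__(self, a, b):
        self.inputs = (a, b)
        self.generation = max(a.generation, b.generation)
        out = Tensor(a.data * b.data, creator=self)
        self.outputs = (out,)
        return out
    def backward(self, gy):
        a, b = self.inputs
        return (gy * b.data, gy * a.data)

def backward(y):
    y.grad = np.ones_like(y.data)
    heap, seen = [], set()

    def push(f):
        if f is not None and id(f) not in seen:
            seen.add(id(f))
            # Negate the generation so the heap pops the highest generation first.
            heapq.heappush(heap, (-f.generation, id(f), f))

    push(y.creator)
    while heap:
        _, _, f = heapq.heappop(heap)
        gys = [out.grad for out in f.outputs]
        for x, gx in zip(f.inputs, f.backward(*gys)):
            # Accumulate when the same tensor feeds multiple functions.
            x.grad = gx if x.grad is None else x.grad + gx
            push(x.creator)

# Same shape as the diagram: x@0 -> Func1@0 -> a@1 -> {Func2, Func3}@1 -> b,c@2 -> Func4@2 -> y@3
x = Tensor(2.0)
a = Add()(x, x)     # a = 2x
b = Mul()(a, a)     # b = a^2
c = Add()(a, a)     # c = 2a
y = Add()(b, c)     # y = 4x^2 + 4x, so dy/dx = 8x + 4
backward(y)
print(x.grad)       # 20.0 at x = 2
```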
- Memory Optimization
  - As you can see in the diagram under `Creator`, the function and its output refer to each other (a circular reference).
  - To avoid this, we use `weakref.ref` for the outputs of the function (a minimal illustration follows this list).
  - Using `weakref.ref` for output tensors eliminates circular references, reducing memory usage by 72% (from 136.4 MiB to 38.2 MiB) in our test (measure_memory.py).
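A minimal demonstration of the idea (hypothetical class names, not the repository's exact code): the function keeps only weak references to its outputs, so deleting the output tensor actually frees it even though the function object is still alive.

```python
import weakref

class Tensor:
    def __init__(self, data):
        self.data = data
        self.creator = None

class AddOne:
    def __call__(self, x: Tensor) -> Tensor:
        y = Tensor(x.data + 1)
        y.creator = self                   # strong reference: output -> function
        self.inputs = (x,)                 # strong reference: function -> inputs
        self.outputs = (weakref.ref(y),)   # weak reference: function -> output (no cycle)
        return y

x = Tensor(1)
f = AddOne()
y = f(x)
print(f.outputs[0]() is y)  # True: call the weakref to dereference it
del y                       # no strong cycle, so y is freed immediately under CPython refcounting
print(f.outputs[0]())       # None
```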
There are two fundamental paradigms for implementing automatic differentiation and building neural networks:
In the Define-and-Run paradigm, the entire computational graph is constructed before execution.
- The computational graph is fully defined, analyzed, and compiled before any data flows through it
- Once defined, the same graph processes all input data without changing its structure
- Examples: TensorFlow 1.x, Theano, Caffe
Key characteristics:
- Clear separation between graph definition and execution phases
- Graph optimization happens during compilation, improving performance
- Easier to deploy to production environments and non-Python platforms
- Well-suited for distributed computing across multiple devices
- More difficult to debug as errors often occur in the compiled graph
- Less flexible for dynamic computational patterns
In the Define-by-Run paradigm, the computational graph is constructed dynamically during execution.
- The computational graph is created on-the-fly as operations are performed
- Each forward pass can potentially create a different graph structure
- Examples: PyTorch, TensorFlow Eager mode, Chainer
Key characteristics:
- No separate compilation step; operations execute immediately
- More Pythonic approach, using native control flow statements
- Easier to debug as errors occur at the point of operation
- More intuitive development experience with standard Python tools
- Supports dynamic models where graph structure depends on input data
- May sacrifice some optimization opportunities available in static graphs
| Feature | Define-and-Run | Define-by-Run |
|---|---|---|
| Graph Construction | Pre-compiled static graph | Dynamic graph built during execution |
| Programming Style | Domain-specific language | Native Python code |
| Debugging | Harder (debugging compiled graph) | Easier (standard Python debugging) |
| Performance | Potentially better (global optimization) | Potentially worse (less optimization) |
| Flexibility | Less flexible for dynamic patterns | More flexible for dynamic patterns |
| Deployment | Easier to deploy to production/devices | More dependent on Python runtime |
| Learning Curve | Steeper | More intuitive for Python developers |
| Use Cases | Production deployment, mobile/edge devices | Research, rapid prototyping |
| Examples | TensorFlow 1.x, Theano | PyTorch, TensorFlow Eager |
This implementation follows the Define-by-Run paradigm, similar to PyTorch.
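For example, because the graph is recorded while ordinary Python executes, plain control flow can change which operations are traced on each call. A hedged sketch using this library's PyTorch-like API (only operations shown elsewhere in this README are assumed):

```python
import torch

def f(x, use_square: bool):
    # Ordinary Python control flow decides which ops run,
    # so the recorded graph can differ from call to call.
    if use_square:
        return x**2 + 1
    return -x

x = torch.tensor(3.0, requires_grad=True)
y = f(x, use_square=True)   # this call records: power -> add
y.backward()
print(x.grad)               # 6.0 (d/dx of x^2 + 1 at x = 3)
```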
This library implements a subset of PyTorch's tensor operations from scratch using NumPy as the underlying computation engine. All operations support automatic differentiation.
- Creation: `torch.tensor()` - Create tensors from Python scalars, lists, or NumPy arrays
- Arithmetic: Addition, Subtraction, Multiplication, Division, Negation, Power
- Mathematical: `square()`, `exp()`, `cos()`, `sin()`, `tanh()`
- In-place Operations: TBD
- Autograd is the core algorithm that computes backpropagation for neural networks by following the chain rule.
```python
# Example of autograd
import torch

x = torch.tensor([2.0], requires_grad=True)
y = x**2 + 1   # Forward pass
y.backward()   # Backward pass
print(x.grad)  # Access gradients
```

- Using this, we can implement gradient descent:
- Code

```python
# Source: https://github.com/oreilly-japan/deep-learning-from-scratch-3/blob/master/steps/step28.py
import torch
from torch import Tensor


def f(x: torch.Tensor) -> torch.Tensor:
    """f(x) = x^4 - 2x^2"""
    y = x**4 - 2 * x**2
    return y


x = torch.tensor(2.0, requires_grad=True)
lr = 0.01
iters = 1000

for i in range(iters):
    print(i, x)

    y = f(x)
    x.grad = None
    y.backward()

    with torch.no_grad():
        x -= lr * x.grad
```
- Output

```
0 tensor(2.0, dtype=torch.float64, requires_grad=True)
1 tensor(1.76, dtype=torch.float64, requires_grad=True)
2 tensor(1.61232896, dtype=torch.float64, requires_grad=True)
3 tensor(1.5091654023014192, dtype=torch.float64, requires_grad=True)
4 tensor(1.4320422081467723, dtype=torch.float64, requires_grad=True)
5 tensor(1.3718537670818505, dtype=torch.float64, requires_grad=True)
...
386 tensor(1.0000000000000036, dtype=torch.float64, requires_grad=True)
387 tensor(1.0000000000000033, dtype=torch.float64, requires_grad=True)
388 tensor(1.000000000000003, dtype=torch.float64, requires_grad=True)
389 tensor(1.0000000000000029, dtype=torch.float64, requires_grad=True)
390 tensor(1.0000000000000027, dtype=torch.float64, requires_grad=True)
...
```
The library provides robust tools for visualizing computational graphs, which helps you understand the flow of tensors through operations:
- Generate DOT representations of computational graphs
- Visualize graphs in multiple formats (PNG, SVG, PDF, etc.)
- Display tensor shapes, names, and data types
- Follow the computation chain backward from outputs to inputs
- Graphviz must be installed on your system for generating image files
- Installation instructions
  - macOS: `brew install graphviz`
  - Ubuntu/Debian: `apt-get install graphviz`
  - Windows: Download the installer from the Graphviz website
- Code

```python
# Source: https://github.com/oreilly-japan/deep-learning-from-scratch-3/blob/master/steps/step26.py
import torch
from torch.utils import plot_dot_graph
from tests.torch.complex_funcs import goldstein

x = torch.tensor(1.0, requires_grad=True, name="x")
y = torch.tensor(1.0, requires_grad=True, name="y")
z: torch.Tensor = goldstein(x, y)
z.backward()

x.name = "x"
y.name = "y"
z.name = "z"
plot_dot_graph(z, "goldstein.png", verbose=True)
```
- Output (the rendered graph, saved as `goldstein.png`)
- The visualization can be exported in any format supported by Graphviz:

```python
# DOT file (raw graph definition)
plot_dot_graph(z, "graph.dot", verbose=True)

# Image formats
plot_dot_graph(z, "graph.png", verbose=True)  # PNG
plot_dot_graph(z, "graph.svg", verbose=True)  # SVG
plot_dot_graph(z, "graph.pdf", verbose=True)  # PDF
```
- The `verbose` parameter controls the amount of information displayed:

```python
# Minimal visualization (just the graph structure)
plot_dot_graph(z, "minimal_graph.png", verbose=False)

# Detailed visualization (with tensor names, shapes, and dtypes)
plot_dot_graph(z, "detailed_graph.png", verbose=True)
```
- Understanding the Graph
- Orange nodes: Tensor objects (inputs and intermediate values)
- Green nodes: Named tensor objects (when verbose=True and tensor has a name)
- Blue boxes: Operations (functions that transform tensors)
- Arrows: Data flow direction between operations and tensors
Higher-order derivatives involve taking the derivative of a function multiple times. For instance, the second derivative is the derivative of the first derivative, the third derivative is the derivative of the second derivative, and so on. These are crucial in various mathematical and scientific applications, including optimization algorithms and analyzing function curvature.
To support higher-order differentiation, the backward pass must also build a computational graph, just like the forward pass.
- When we call `Tensor.backward(create_graph=True)`, the operations performed during gradient computation extend the computational graph. For that, we toggle `enable_backprop` to `create_graph` using the `using_config` context manager (a rough sketch follows this list).
- Gradients, which are themselves nodes in the graph, should also be tensors, which allows us to call `backward()` again on the resulting gradient tensor.
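A simplified, hypothetical sketch of how such a switch can work (the actual `using_config` in this repository may differ): a module-level `Config` flag tells functions whether to record the graph, and the context manager temporarily overrides it.

```python
import contextlib

class Config:
    enable_backprop = True  # global switch: record the graph only when True

@contextlib.contextmanager
def using_config(name: str, value: bool):
    # Temporarily override a Config attribute, restoring it on exit.
    old_value = getattr(Config, name)
    setattr(Config, name, value)
    try:
        yield
    finally:
        setattr(Config, name, old_value)

def no_grad():
    return using_config("enable_backprop", False)

# Inside Tensor.backward(create_graph=...), the gradient computation could run as:
# with using_config("enable_backprop", create_graph):
#     gxs = f.backward(*gys)   # these ops extend the graph only if create_graph=True
```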
Here are computational graphs illustrating the forward and backward propagation of sin(x):
- When `create_graph` is set to `False` (default),
```mermaid
graph LR
    subgraph x
        x_data((data))
        x_grad((grad))
    end
    subgraph y
        y_data((data))
        y_grad((grad))
    end
    subgraph gy
        gy_data((data))
        gy_grad((grad))
    end
    subgraph gx
        gx_data((data))
        gx_grad((grad))
    end
    x --> sin[sin / forward] --> y
    x_grad -.-> gx
    y_grad -.-> gy
```
- When `create_graph` is set to `True`,
```mermaid
graph LR
    subgraph x
        x_data((data))
        x_grad((grad))
    end
    subgraph y
        y_data((data))
        y_grad((grad))
    end
    subgraph cosine_of_x
        cos_data((data))
        cos_grad((grad))
    end
    subgraph "gx"
        gx_data((data))
        gx_grad((grad))
    end
    subgraph "gy"
        gy_data((data))
        gy_grad((grad))
    end
    x --> sin[sin / forward] --> y
    x --> cos[cos / backward] --> cosine_of_x
    cosine_of_x --> mul[mul / backward]
    gy --> mul
    mul --> gx
    x_grad -.-> gx
    y_grad -.-> gy
```
In the graphs:
- The forward pass computes `y = sin(x)`.
- Both graphs compute the derivative `gx` ($ f'(x) = \cos(x) \cdot \frac{dy}{dy} $).
- The only difference is that the `create_graph=True` graph also constructs the backward computational graph (the `cos` and `mul` operations shown) that represents this derivative calculation.
Based on that, we can compute higher-order derivatives of the sin function.
- Code

```python
# Source: https://github.com/oreilly-japan/deep-learning-from-scratch-3/blob/master/steps/step34.py
import matplotlib.pyplot as plt
import numpy as np
import torch

x = torch.tensor(np.linspace(-7, 7, 200), requires_grad=True)
y = torch.sin(x)
y.backward(create_graph=True)

logs = [y._data]

for i in range(3):
    logs.append(x.grad._data)
    gx = x.grad
    x.grad = None
    gx.backward(create_graph=True)

labels = ["y=sin(x)", "y'", "y''", "y'''"]
for i, v in enumerate(logs):
    plt.plot(x._data, logs[i], label=labels[i])
plt.legend(loc="lower right")
plt.savefig("sin_derivatives.png")
```
- Output (the plot saved as `sin_derivatives.png`)
Newton's method is an iterative optimization algorithm that uses second-order derivative information to find the minimum of a function. The update rule is:

$$ x \leftarrow x - \frac{f'(x)}{f''(x)} $$
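As a concrete check (standard calculus, included here for reference), the test function used below is $ f(x) = x^4 - 2x^2 $, so

$$ f'(x) = 4x^3 - 4x, \qquad f''(x) = 12x^2 - 4, $$

and the first Newton step from $ x_0 = 2 $ is $ 2 - \frac{24}{44} \approx 1.4545 $, which matches the first update in the output below.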
This requires computing both the first derivative $ f'(x) $ and the second derivative $ f''(x) $. Our implementation of higher-order derivatives allows this directly:
- Code

```python
# Source: https://github.com/oreilly-japan/deep-learning-from-scratch-3/blob/master/steps/step33.py
import torch


def f(x: torch.Tensor) -> torch.Tensor:
    """f(x) = x^4 - 2x^2"""
    y = x**4 - 2 * x**2
    return y


x = torch.tensor(2.0, requires_grad=True)
iters = 10

for i in range(iters):
    print(i, x)

    # Compute first derivative: f'(x)
    y = f(x)
    x.grad = None
    y.backward(create_graph=True)  # Keep graph for second derivative

    # Compute second derivative: f''(x)
    gx = x.grad    # gx = f'(x)
    x.grad = None  # Clear previous grad
    gx.backward()  # Compute gradient of gx w.r.t x
    gx2 = x.grad   # gx2 = f''(x)

    # Update x using Newton's method
    with torch.no_grad():
        x -= gx / gx2
```
- Output

```
0 tensor(2.0, dtype=torch.float64, requires_grad=True)
1 tensor(1.4545454545454546, dtype=torch.float64, requires_grad=True)
2 tensor(1.1510467893775467, dtype=torch.float64, requires_grad=True)
3 tensor(1.0253259289766978, dtype=torch.float64, requires_grad=True)
4 tensor(1.0009084519430513, dtype=torch.float64, requires_grad=True)
5 tensor(1.0000012353089454, dtype=torch.float64, requires_grad=True)
6 tensor(1.000000000002289, dtype=torch.float64, requires_grad=True)
7 tensor(1.0, dtype=torch.float64, requires_grad=True)
8 tensor(1.0, dtype=torch.float64, requires_grad=True)
9 tensor(1.0, dtype=torch.float64, requires_grad=True)
```
- Pro: As you can see above, Newton's method can converge faster than gradient descent.
- Con: Computing the full Hessian matrix (for multivariate functions) can be computationally expensive. For an input vector $ \textbf{x} \in \mathbb{R}^n $, the Hessian matrix takes $ \mathcal{O}(n^2) $ space to store all of its elements and $ \mathcal{O}(n^3) $ time to compute its inverse.
- Alternative: To mitigate the computational cost, several quasi-Newton methods have been introduced that use approximations of the derivatives in place of the exact ones. L-BFGS, one of the quasi-Newton methods, approximates the Hessian (or its inverse) using only first-order gradient information, offering a balance between the rapid convergence of Newton's method and the lower computational cost of first-order methods. PyTorch provides an implementation in `torch.optim.LBFGS` (see the sketch below).
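For reference, a minimal sketch of L-BFGS in the real PyTorch library (not this repository's NumPy implementation); `torch.optim.LBFGS` requires a closure that re-evaluates the loss, because the optimizer may need several function evaluations per step.

```python
import torch  # the real PyTorch package

def f(x: torch.Tensor) -> torch.Tensor:
    return x**4 - 2 * x**2

x = torch.tensor(2.0, requires_grad=True)
optimizer = torch.optim.LBFGS([x], lr=0.1)

def closure():
    # Re-evaluate the objective and its gradient for the optimizer.
    optimizer.zero_grad()
    loss = f(x)
    loss.backward()
    return loss

for i in range(10):
    optimizer.step(closure)  # L-BFGS may call the closure several times per step
    print(i, x.item())
```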