Plast: A High-Performance Deep Learning Engine from Scratch

Plast is a zero-dependency deep learning framework written from scratch in C and CUDA. It bridges high-level AI research and low-level systems engineering with a training engine designed to compete with the core implementations of established frameworks.

Engineering Excellence & Technical Depth

1. Production-Ready Autograd Engine

Plast implements a sophisticated Automatic Differentiation engine based on Dynamic Computational Graphs.

Topological Optimization: Intelligently schedules operations via DAG-based topological sorting to minimize memory footprint.
Robust Gradient Propagation: Advanced accumulation logic that correctly handles complex edge cases like node re-use and non-contiguous gradient flow.
Extensible Operator API: A clean, decoupled architecture allows for seamless integration of new mathematical operations without graph-level changes.
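
The scheduling and accumulation ideas above can be sketched on scalars. This is only an illustration under stated assumptions, not Plast's actual API: the `Node` struct, the `Op` enum, and the `leaf`, `binary`, `topo_sort`, and `backward` functions are hypothetical names, and real tensors would carry shapes and buffers instead of a single `double`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar graph node; Plast's real structures may differ. */
typedef enum { OP_LEAF, OP_ADD, OP_MUL } Op;

typedef struct Node {
    Op op;
    double value;
    double grad;          /* accumulated dRoot/dNode */
    struct Node *in[2];   /* inputs; NULL for leaves */
    int visited;          /* set by topo_sort; one backward pass per graph */
} Node;

#define MAX_NODES 64

static Node leaf(double v) {
    Node n = { OP_LEAF, v, 0.0, { NULL, NULL }, 0 };
    return n;
}

static Node binary(Op op, Node *a, Node *b) {
    Node n = { op, 0.0, 0.0, { a, b }, 0 };
    n.value = (op == OP_ADD) ? a->value + b->value : a->value * b->value;
    return n;
}

/* Post-order DFS: a node is appended only after all of its inputs. */
static void topo_sort(Node *n, Node **order, size_t *count) {
    if (n == NULL || n->visited) return;
    n->visited = 1;
    topo_sort(n->in[0], order, count);
    topo_sort(n->in[1], order, count);
    assert(*count < MAX_NODES);
    order[(*count)++] = n;
}

/* Walk the order backwards so a node's gradient is complete before it is
 * pushed to its inputs. Using += lets a re-used node sum every contribution. */
static void backward(Node *root) {
    Node *order[MAX_NODES];
    size_t count = 0;
    topo_sort(root, order, &count);
    root->grad = 1.0;
    for (size_t i = count; i-- > 0; ) {
        Node *n = order[i];
        if (n->op == OP_ADD) {
            n->in[0]->grad += n->grad;
            n->in[1]->grad += n->grad;
        } else if (n->op == OP_MUL) {
            n->in[0]->grad += n->grad * n->in[1]->value;
            n->in[1]->grad += n->grad * n->in[0]->value;
        }
    }
}
```

For y = x*x + x at x = 3, building the graph with `binary` and calling `backward` on y leaves `x.grad` at 7, which matches 2x + 1. The same leaf feeds three edges, so overwriting instead of accumulating would give the wrong answer.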

2. High-Performance Numerical Kernels

Engineered for raw throughput across heterogeneous hardware:

Massively Parallel CUDA Kernels: Tiled matrix operations with shared memory optimizations, designed to exploit maximum warp occupancy and memory coalescing on NVIDIA GPUs.
SIMD Optimized CPU Backends: Leverages AVX/NEON intrinsics and OpenMP multi-threading to achieve near-theoretical peak performance on modern CPUs.
Stride-Aware Logic: Efficiently handles arbitrary tensor layouts (slices, transposes, views) through zero-copy "virtual" tensors and localized packing utilities.
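
To make the stride-aware point concrete, here is a minimal sketch of a zero-copy 2-D view with a packing helper. The `View2D` type and the `view_*` functions are hypothetical names chosen for this example, not taken from the Plast source, and strides are assumed to be counted in elements rather than bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 2-D view: it borrows the buffer and owns nothing. */
typedef struct {
    float *data;
    size_t rows, cols;
    ptrdiff_t row_stride, col_stride;   /* in elements, not bytes */
} View2D;

static View2D view_from_contiguous(float *data, size_t rows, size_t cols) {
    View2D v = { data, rows, cols, (ptrdiff_t)cols, 1 };
    return v;
}

/* Address of element (r, c), computed purely from the strides. */
static float *view_at(View2D v, size_t r, size_t c) {
    assert(r < v.rows && c < v.cols);
    return v.data + (ptrdiff_t)r * v.row_stride + (ptrdiff_t)c * v.col_stride;
}

/* Transpose swaps shape and strides; no element is moved. */
static View2D view_transpose(View2D v) {
    View2D t = { v.data, v.cols, v.rows, v.col_stride, v.row_stride };
    return t;
}

/* A slice offsets the base pointer and keeps the parent's strides. */
static View2D view_slice(View2D v, size_t r0, size_t nrows,
                         size_t c0, size_t ncols) {
    assert(r0 + nrows <= v.rows && c0 + ncols <= v.cols);
    View2D s = { v.data + (ptrdiff_t)r0 * v.row_stride
                        + (ptrdiff_t)c0 * v.col_stride,
                 nrows, ncols, v.row_stride, v.col_stride };
    return s;
}

/* Packing: copy a view into a dense row-major buffer of rows*cols floats,
 * for kernels that need contiguous input. */
static void view_pack(View2D v, float *dst) {
    for (size_t r = 0; r < v.rows; ++r)
        for (size_t c = 0; c < v.cols; ++c)
            dst[r * v.cols + c] = *view_at(v, r, c);
}
```

Because a view shares its parent's buffer, a write through the transposed view is visible in the original tensor. Packing is the only step here that copies data.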

3. Industrial-Grade Memory Management

Plast avoids the overhead of garbage collection and per-tensor heap allocation through a custom arena allocation system:

Zero-Latency Training: Pre-allocates memory for the entire computation graph, eliminating the need for malloc/free during the critical training path.
Deterministic Memory Profile: Perfect for resource-constrained environments where predictable memory behavior is non-negotiable.
Cross-Device Unified Interface: A consistent memory management layer across CPU and GPU memory spaces.
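
A bump-pointer arena is one common way to get this behavior. The sketch below is CPU-only and uses hypothetical names (`Arena`, `arena_init`, `arena_alloc`, `arena_reset`, `arena_destroy`); it is not Plast's real interface. A device-side arena could presumably apply the same bookkeeping to a buffer obtained from `cudaMalloc`, but that is an assumption and is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical bump-pointer arena: one upfront malloc, then pointer math. */
typedef struct {
    uint8_t *base;
    size_t capacity;
    size_t offset;    /* bytes handed out so far, including padding */
} Arena;

/* Returns 1 on success, 0 if the backing allocation failed. */
static int arena_init(Arena *a, size_t capacity) {
    a->base = (uint8_t *)malloc(capacity);
    a->capacity = capacity;
    a->offset = 0;
    return a->base != NULL;
}

/* Returns size bytes aligned to align (a power of two), or NULL when the
 * arena is exhausted. Never calls malloc; a failed call changes nothing. */
static void *arena_alloc(Arena *a, size_t size, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    uintptr_t addr = (uintptr_t)a->base + a->offset;
    uintptr_t aligned = (addr + align - 1) & ~(uintptr_t)(align - 1);
    size_t start = (size_t)(aligned - (uintptr_t)a->base);
    if (start > a->capacity || size > a->capacity - start) return NULL;
    a->offset = start + size;
    return a->base + start;
}

/* Releases every allocation at once, e.g. between training steps. */
static void arena_reset(Arena *a) { a->offset = 0; }

static void arena_destroy(Arena *a) {
    free(a->base);
    a->base = NULL;
    a->capacity = 0;
    a->offset = 0;
}
```

With this design, per-step cost is a few integer operations per tensor, and peak memory is fixed by the capacity passed to `arena_init`, which is what makes the memory profile predictable.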

System Architecture

graph TD
    A[User API] --> B[Dynamic Computation Graph]
    B --> C[Optimized Task Scheduler]
    C --> D[Hardware-Specific Backends]
    D --> E[AVX/SIMD CPU Kernels]
    D --> F[NVIDIA CUDA Kernels]
    G[Arena Memory Manager] --> B
    G --> D
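
One plausible way to realize the "Hardware-Specific Backends" box is a table of function pointers per device, so the scheduler calls the same operation regardless of where it runs. The `Device` enum, `Backend` struct, `cpu_add`, and `backend_for` below are hypothetical and only illustrate the pattern; the actual dispatch mechanism in Plast may look different.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device tags and per-device operation table. */
typedef enum { DEVICE_CPU, DEVICE_CUDA } Device;

typedef struct {
    const char *name;
    void (*add)(const float *a, const float *b, float *out, size_t n);
} Backend;

/* Plain scalar reference kernel; a SIMD version would keep this signature. */
static void cpu_add(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

static const Backend CPU_BACKEND = { "cpu", cpu_add };

/* Returns the backend for a device, or NULL if this sketch does not
 * include it (only the CPU backend is defined here). */
static const Backend *backend_for(Device d) {
    return d == DEVICE_CPU ? &CPU_BACKEND : NULL;
}
```

Callers look up a backend once and invoke `be->add(a, b, out, n)`. Adding a GPU path would mean registering a second table whose entries launch kernels, without touching the graph code.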

Roadmap & Future Vision

Python Ecosystem Integration: CFFI/Pybind11 bindings for seamless interoperability with the broader AI ecosystem.
Mixed Precision Training: Support for FP16 and BF16 to accelerate training on modern Tensor Cores.
Distributed High-Performance Computing: Cluster-scale training support via MPI and NCCL.
Layer-Level Abstractions: High-level modules for Conv2D, Transformers, and BatchNorm.

