A small research-oriented inference engine for GPT-style models.
This repo builds a Python-callable shared library (the x module) that loads safetensors weights and runs a simple transformer stack.
- Build the C++ inference engine as a shared library (x.so) via CMake.
- Load a model from weights/<model_name> (e.g., weights/gpt2).
- Expose a minimal Python API via pybind11.
From the repo root:

    python test.py

This will:
- build the shared module (if it hasn’t been built yet)
- load weights/gpt2
- run a short generation and print the returned tokens
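The "build if it hasn't been built yet" step can be sketched as below. This is not the actual contents of test.py: the helper names `module_path`, `needs_build`, and `build` are hypothetical, the sketch assumes the built module ends up as x.so in the repo root, and the commands simply mirror the manual CMake steps in this README.

```python
import subprocess
from pathlib import Path


def module_path(repo_root: Path) -> Path:
    """Expected location of the built module (assumption: x.so in the repo root)."""
    return repo_root / "x.so"


def needs_build(repo_root: Path) -> bool:
    """True when no built module is present yet."""
    return not module_path(repo_root).exists()


def build(repo_root: Path) -> None:
    """Configure and build with CMake, mirroring the manual build steps."""
    build_dir = repo_root / "build"
    build_dir.mkdir(exist_ok=True)
    # Configure: equivalent to `cd build && cmake ../model_app`
    subprocess.run(["cmake", str(repo_root / "model_app")], cwd=build_dir, check=True)
    # Build: equivalent to `cmake --build . --config Release`
    subprocess.run(
        ["cmake", "--build", ".", "--config", "Release"], cwd=build_dir, check=True
    )
```

A script could then call `build(Path("."))` only when `needs_build(Path("."))` returns True, so repeated runs skip the compile step.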
- CMake (>= 3.14)
- A C++17 compiler (g++, clang++, etc.)
- Python 3 with development headers (e.g., python3-dev / python3-devel)
- pybind11 (can be installed via pip)
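A quick way to check some of these prerequisites from Python is sketched below. The helper names `missing_tools` and `has_pybind11` are hypothetical, and the check is deliberately shallow: it only looks for executables on PATH and for an importable pybind11 package. It does not verify the CMake version, C++17 support, or the presence of Python development headers.

```python
import importlib.util
import shutil


def missing_tools(tools, which=shutil.which):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [t for t in tools if which(t) is None]


def has_pybind11() -> bool:
    """True if the pybind11 Python package is importable."""
    return importlib.util.find_spec("pybind11") is not None
```

For example, `missing_tools(["cmake", "g++"])` returns an empty list when both are installed; substitute `clang++` if that is the compiler you use.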
Install pybind11:

    pip install pybind11

Then build from a clean build directory:

    rm -rf build
    mkdir -p build
    cd build
    cmake ../model_app
    cmake --build . --config Release

Optional: enable AMX + OpenMP
If you have an AMX-capable CPU and want to exercise the AMX-accelerated tiled GEMM path, pass compiler flags to enable AMX and OpenMP. For example:

    cmake -DCMAKE_CXX_FLAGS="-fopenmp -mavx512f -mavx512bw -mavx512vbmi -mavx512vnni -mamx" ../model_app
    cmake --build . --config Release

This produces the shared module x.so in build/ (and the repo root in this repo layout).
Note: CMakeLists.txt sets the target name to x and forces PREFIX "" so the output is x.so (not libx.so).
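Before enabling the optional AMX flags described above, it can help to confirm that the CPU advertises AMX at all. The sketch below is Linux-only and reads /proc/cpuinfo, where the kernel reports AMX support as the flags amx_tile, amx_int8, and amx_bf16. The function names are hypothetical, and which of these features the engine's GEMM path actually needs is not stated in this README, so the sketch only reports what is present.

```python
from pathlib import Path

# Feature flag names as they appear in /proc/cpuinfo on Linux.
AMX_FLAGS = {"amx_tile", "amx_int8", "amx_bf16"}


def amx_flags_in(cpuinfo_text: str) -> set:
    """Return the AMX flags listed on the first 'flags' line of cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            _, _, value = line.partition(":")
            return AMX_FLAGS & set(value.split())
    return set()


def cpu_amx_flags(path: str = "/proc/cpuinfo") -> set:
    """AMX flags reported for this machine, or an empty set if the file is absent."""
    p = Path(path)
    return amx_flags_in(p.read_text()) if p.exists() else set()
```

An empty result from `cpu_amx_flags()` suggests building without the AMX flags.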
If you modify any C++ source (e.g., model_app/src/** or headers in model_app/include/**):

    cd build
    cmake --build . --config Release

If you add new source files or change CMake configuration, rerun CMake:

    cd build
    cmake ../model_app
    cmake --build . --config Release

Example (similar to test.py):

    from x import engine

    # Create the engine
    eng = engine()

    # Initialize by pointing to a model directory under `weights/`
    eng.initialize("gpt2", "./weights/gpt2")

    # Run generation
    tokens = eng.generate("Hello", max_tokens=16)
    print(tokens)

This repo includes a weights/gpt2/ directory with a sample quantized GPT-2 model. To use a different model, replace or add a folder under weights/ with the same file structure.
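When several model folders live under weights/, a small helper can list the candidates before one is passed to the engine. This is a sketch only: `list_models` is a hypothetical name, and because the required file structure is not spelled out here, the helper merely checks that a folder contains at least one .safetensors file.

```python
from pathlib import Path


def list_models(weights_dir):
    """Names of sub-folders of `weights_dir` holding at least one .safetensors file."""
    root = Path(weights_dir)
    if not root.is_dir():
        return []
    return sorted(
        d.name
        for d in root.iterdir()
        if d.is_dir() and any(d.glob("*.safetensors"))
    )
```

Each returned name can be used the same way as "gpt2" in the example above, with the matching path under ./weights/ as the second argument to `initialize`.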
- The current loader supports safetensors files and relies on a lightweight header parser in model_app/src/loader/loader.cpp.
- The runtime is intentionally minimal and primarily for experimentation.
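For readers unfamiliar with what a safetensors header parser has to do, the sketch below follows the published safetensors layout: an 8-byte little-endian unsigned length, then that many bytes of UTF-8 JSON describing each tensor (dtype, shape, data_offsets), then the raw tensor bytes. It illustrates the file format in Python and is not a port of loader.cpp; the function names are hypothetical.

```python
import json
import struct


def parse_safetensors_header(blob: bytes) -> dict:
    """Parse the JSON header at the start of a safetensors file."""
    if len(blob) < 8:
        raise ValueError("file too short for the 8-byte header length")
    (header_len,) = struct.unpack("<Q", blob[:8])
    header_bytes = blob[8:8 + header_len]
    if len(header_bytes) != header_len:
        raise ValueError("truncated header")
    return json.loads(header_bytes.decode("utf-8"))


def tensor_bytes(blob: bytes, header: dict, name: str) -> bytes:
    """Raw bytes of one tensor; data_offsets are relative to the end of the header."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    begin, end = header[name]["data_offsets"]
    data_start = 8 + header_len
    return blob[data_start + begin:data_start + end]
```

Reading only the header this way is cheap, which makes it a convenient check that a folder under weights/ contains the tensors you expect before the engine loads it.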