YOLOv8n inference engine on CPU. 159 KB library, 1.50× faster than ONNX Runtime, hand-written AVX-512 GEMM kernels. No Python, no CUDA.
NexusInfer is a hand-written YOLOv8n inference runtime in pure C, with custom AVX-512 GEMM kernels. It does not depend on ONNX Runtime, OpenVINO, NCNN, or TFLite. The goal: a GPU-less surveillance box should comfortably run object detection on dozens of cameras.
Part of the NexusEye stack.
Each step is a real commit in the repo. Final inference latency for YOLOv8n@320×320 is 8.5 ms on a single i5-11500 core, which translates to 117 FPS single-thread.
ONNX Runtime is the strongest mainstream CPU baseline. NexusInfer beats it by hand-tuning specifically for the YOLOv8n graph (specialized inference graph, packed weights in column-panel format, fused bias+SiLU+residual).
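The fused bias+SiLU epilogue is the kind of specialization that pays off here. A minimal scalar sketch of such an epilogue (function and parameter names are hypothetical; the real kernel applies the same math with AVX-512 intrinsics while the tile is still hot):

```c
#include <math.h>

// SiLU(x) = x * sigmoid(x) = x / (1 + e^-x), YOLOv8's activation.
static inline float silu(float x) {
    return x / (1.0f + expf(-x));
}

// Hypothetical scalar epilogue: once the GEMM microkernel finishes a
// C tile, add the per-output-channel bias and apply SiLU in place,
// so no intermediate activation buffer is ever written.
static void epilogue_bias_silu(float *c_tile, const float *bias,
                               int rows, int cols, int ldc) {
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            c_tile[i * ldc + j] = silu(c_tile[i * ldc + j] + bias[i]);
}
```

The full component breakdown: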
| Component | Implementation |
|---|---|
| GEMM (FP32) | 6×32 microkernel, AVX-512, KC blocking, packed weights |
| GEMM (INT8) | AVX-512 VNNI, vpdpbusd u8×s8→s32, 4×8c8 tiling |
| im2col | implicit (fused into GEMM) or parallel explicit |
| Activation | fused bias + SiLU into GEMM, no intermediate buffer |
| Pooling / upsample | AVX-512 maxpool, NEON-ready (port on roadmap Q3'26) |
| Residual | fused into conv (no separate add) |
| Weights format | pre-packed column-panel binary, ~5 MB for YOLOv8n@320 |
| Threading | pthread pool, slice-parallel + intra-op parallelism |
| Runtime SIMD dispatch | CPUID at init: AVX-512+VNNI / AVX2 / SSE2 fallback |
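For the dispatch row above, here is a minimal sketch of feature-based kernel selection. It uses GCC/Clang's `__builtin_cpu_supports` for brevity (the actual runtime may query CPUID directly), the kernel names are hypothetical, and the signatures are simplified: the real INT8 VNNI path would take u8/s8 operands rather than floats.

```c
#include <stddef.h>

typedef void (*gemm_fn)(const float *a, const float *b, float *c,
                        size_t m, size_t n, size_t k);

// Kernel variants, each compiled with its own -m flags (hypothetical names).
extern void gemm_avx512_vnni(const float*, const float*, float*, size_t, size_t, size_t);
extern void gemm_avx512(const float*, const float*, float*, size_t, size_t, size_t);
extern void gemm_avx2(const float*, const float*, float*, size_t, size_t, size_t);
extern void gemm_sse2(const float*, const float*, float*, size_t, size_t, size_t);

// Pick the widest kernel the CPU supports, once, at context creation.
static gemm_fn select_gemm(void) {
    __builtin_cpu_init();                       // populate CPU feature flags
    if (__builtin_cpu_supports("avx512vnni")) return gemm_avx512_vnni;
    if (__builtin_cpu_supports("avx512f"))    return gemm_avx512;
    if (__builtin_cpu_supports("avx2"))       return gemm_avx2;
    return gemm_sse2;                           // baseline: all x86-64 CPUs
}
```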
On top of the raw engine, a typical NVR pipeline gates inference on motion and smooths boxes with a tracker:

```c
nn2_ctx* ctx = nn2_create("yolov8n_320.bin", 0);
motion_gate_t gate = {.kf_skip_threshold = 0.05f, .post_motion_frames = 30};
kalman_tracker_t* tracker = kalman_tracker_create();

// given: total_frames, per-frame metadata, and each frame's RGB buffer
for (int frame = 0; frame < total_frames; frame++) {
    if (motion_gate_skip(&gate, frame_metadata)) continue; // skip static periods

    Detection dets[100];
    int n = nn2_detect(ctx, rgb_data, w, h, dets, 100);
    kalman_update(tracker, dets, n); // smooth bboxes
}
```

In this combined pipeline the amortized cost is 0.56 ms/frame on a typical NVR stream, which works out to ~70 cameras @ 25 fps on a single GPU-less machine (70 × 25 = 1,750 frames/s, and 1,750 × 0.56 ms ≈ 0.98 s of compute per second). Versus naive inference on every frame, that is 14.8× faster.
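For context on the gating call above, here is a minimal sketch of what `motion_gate_skip` could look like, assuming the frame metadata reduces to a single motion score (the real signature takes a metadata struct, and any field beyond the two initialized in the snippet is an assumption):

```c
// Hypothetical gate: skip inference while the motion score stays below
// the threshold, but keep detecting for post_motion_frames after motion
// ends so tracks can settle. The 'cooldown' field is an assumption.
typedef struct {
    float kf_skip_threshold;   // score below this => scene treated as static
    int   post_motion_frames;  // keep detecting this long after motion stops
    int   cooldown;            // internal: frames left in post-motion window
} motion_gate_t;

static int motion_gate_skip(motion_gate_t *g, float motion_score) {
    if (motion_score >= g->kf_skip_threshold) {
        g->cooldown = g->post_motion_frames;  // motion seen: reset window
        return 0;                             // run inference
    }
    if (g->cooldown > 0) {
        g->cooldown--;                        // recently moved: still detect
        return 0;
    }
    return 1;                                 // static scene: skip this frame
}
```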
Quick start with a prebuilt binary:

```sh
wget https://github.com/facex-engine/nexusinfer/releases/latest/download/nexusinfer-linux-x64.tar.gz
tar xzf nexusinfer-linux-x64.tar.gz
cd nexusinfer
./nexusinfer-img weights/yolov8n_320.bin sample_frame.jpg 2
# output:
# Image: sample_frame.jpg (1101x945)
# Detections: person (xx.x%) @ (...)
# Benchmark: 2.6 ms (388 FPS) over 50 runs
```

Or build from source:

```sh
git clone https://github.com/facex-engine/nexusinfer
cd nexusinfer
make CC=gcc       # → libnexusinfer.a (159 KB) + nexusinfer-img CLI
make export-320   # export YOLOv8n@320 weights (requires Python + ultralytics)
```

To reproduce the ONNX Runtime comparison:

```sh
pip install onnxruntime numpy
python tools/bench_compare.py
# Runs ORT and NexusInfer on the same inputs,
# prints latencies + delta detections.
```

Library usage:

```c
#include <stdio.h>
#include <stdint.h>
#include "nn2.h"
nn2_ctx* ctx = nn2_create("yolov8n_320.bin", /* threads */ 2);
if (!ctx) { fprintf(stderr, "load failed\n"); return 1; }

uint8_t* rgb = load_image_rgb(path, &w, &h); // your loader

Detection dets[100];
int n = nn2_detect(ctx, rgb, w, h, dets, 100);
for (int i = 0; i < n; i++) {
    printf("%s (%.1f%%) @ (%d,%d)-(%d,%d)\n",
           coco_classes[dets[i].class_id], dets[i].confidence * 100,
           dets[i].x1, dets[i].y1, dets[i].x2, dets[i].y2);
}
nn2_free(ctx);
```

Linking: `gcc your_app.c libnexusinfer.a -lm -lpthread`.
| Model | Status | Bench latency (i5-11500) |
|---|---|---|
| YOLOv8n @ 320×320 | ✓ live | 8.5 ms (117 FPS) |
| YOLOv8n @ 256×256 | ✓ live | 6.5 ms (154 FPS) |
| YOLOv8n @ 640×640 | ✓ live | ~28 ms |
| NanoDet | ✓ live | ~7 ms (320×320) |
| YOLOv8s/m/l/x | planned | — |
| FaceX | separate repo | see FaceX |
MIT — see LICENSE.
Baurzhan Atynov (@bauratynov)
| Component | What it does | Size |
|---|---|---|
| NexusDecode | H.264 decode without FFmpeg | 497 KB |
| NexusInfer (you are here) | YOLO inference on CPU, ONNX Runtime replacement | 159 KB |
| NexusSense | Compressed-domain analytics | 56 KB |
| FaceX | Face embedding INT8 on CPU | 180 KB |