YOLOv8n inference engine on CPU. 159 KB library, 1.50× faster than ONNX Runtime, hand-written AVX-512 GEMM kernels. No Python, no CUDA.
NexusInfer is a hand-written YOLOv8n inference runtime in pure C, with custom AVX-512 GEMM kernels. It does not depend on ONNX Runtime, OpenVINO, NCNN, or TFLite. The goal: a GPU-less surveillance box should comfortably run object detection on dozens of cameras.
Part of the NexusEye stack.
Each step is a real commit in the repo. Final inference latency for YOLOv8n@320×320 is 8.5 ms on a single i5-11500 core, which translates to 117 FPS single-thread.
ONNX Runtime is the strongest mainstream CPU baseline. NexusInfer beats it by hand-tuning specifically for the YOLOv8n graph (specialized inference graph, packed weights in column-panel format, fused bias+SiLU+residual).
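The fused bias+SiLU epilogue is the kind of specialization that pays off here. A minimal scalar sketch of such an epilogue (function and parameter names are hypothetical; the real kernel applies the same math with AVX-512 intrinsics while the tile is still hot):

```c
#include <math.h>

// SiLU(x) = x * sigmoid(x) = x / (1 + e^-x), YOLOv8's activation.
static inline float silu(float x) {
    return x / (1.0f + expf(-x));
}

// Hypothetical scalar epilogue: once the GEMM microkernel finishes a
// C tile, add the per-output-channel bias and apply SiLU in place,
// so no intermediate activation buffer is ever written.
static void epilogue_bias_silu(float *c_tile, const float *bias,
                               int rows, int cols, int ldc) {
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            c_tile[i * ldc + j] = silu(c_tile[i * ldc + j] + bias[i]);
}
```

The full component breakdown: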
| Component | Implementation |
|---|---|
| GEMM (FP32) | 6×32 microkernel, AVX-512, KC blocking, packed weights |
| GEMM (INT8) | AVX-512 VNNI, vpdpbusd u8×s8→s32, 4×8c8 tiling |
| im2col | implicit (fused into GEMM) or parallel explicit |
| Activation | fused bias + SiLU into GEMM, no intermediate buffer |
| Pooling / upsample | AVX-512 maxpool, NEON-ready (port on roadmap Q3'26) |
| Residual | fused into conv (no separate add) |
| Weights format | pre-packed column-panel binary, ~5 MB for YOLOv8n@320 |
| Threading | pthread pool, slice-parallel + intra-op parallelism |
| Runtime SIMD dispatch | CPUID at init: AVX-512+VNNI / AVX2 / SSE2 fallback |
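For the dispatch row above, here is a minimal sketch of feature-based kernel selection. It uses GCC/Clang's `__builtin_cpu_supports` for brevity (the actual runtime may query CPUID directly), the kernel names are hypothetical, and the signatures are simplified: the real INT8 VNNI path would take u8/s8 operands rather than floats.

```c
#include <stddef.h>

typedef void (*gemm_fn)(const float *a, const float *b, float *c,
                        size_t m, size_t n, size_t k);

// Kernel variants, each compiled with its own -m flags (hypothetical names).
extern void gemm_avx512_vnni(const float*, const float*, float*, size_t, size_t, size_t);
extern void gemm_avx512(const float*, const float*, float*, size_t, size_t, size_t);
extern void gemm_avx2(const float*, const float*, float*, size_t, size_t, size_t);
extern void gemm_sse2(const float*, const float*, float*, size_t, size_t, size_t);

// Pick the widest kernel the CPU supports, once, at context creation.
static gemm_fn select_gemm(void) {
    __builtin_cpu_init();                       // populate CPU feature flags
    if (__builtin_cpu_supports("avx512vnni")) return gemm_avx512_vnni;
    if (__builtin_cpu_supports("avx512f"))    return gemm_avx512;
    if (__builtin_cpu_supports("avx2"))       return gemm_avx2;
    return gemm_sse2;                           // baseline: all x86-64 CPUs
}
```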
On top of the raw engine, a typical NVR pipeline gates inference on motion and smooths boxes with a tracker:

```c
nn2_ctx* ctx = nn2_create("yolov8n_320.bin", 0);
motion_gate_t gate = {.kf_skip_threshold = 0.05f, .post_motion_frames = 30};
kalman_tracker_t* tracker = kalman_tracker_create();

// given: total_frames, per-frame metadata, and each frame's RGB buffer
for (int frame = 0; frame < total_frames; frame++) {
    if (motion_gate_skip(&gate, frame_metadata)) continue; // skip static periods

    Detection dets[100];
    int n = nn2_detect(ctx, rgb_data, w, h, dets, 100);
    kalman_update(tracker, dets, n); // smooth bboxes
}
```

In this combined pipeline the amortized cost is 0.56 ms/frame on a typical NVR stream, which works out to ~70 cameras @ 25 fps on a single GPU-less machine (70 × 25 = 1,750 frames/s, and 1,750 × 0.56 ms ≈ 0.98 s of compute per second). Versus naive inference on every frame, that is 14.8× faster.
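For context on the gating call above, here is a minimal sketch of what `motion_gate_skip` could look like, assuming the frame metadata reduces to a single motion score (the real signature takes a metadata struct, and any field beyond the two initialized in the snippet is an assumption):

```c
// Hypothetical gate: skip inference while the motion score stays below
// the threshold, but keep detecting for post_motion_frames after motion
// ends so tracks can settle. The 'cooldown' field is an assumption.
typedef struct {
    float kf_skip_threshold;   // score below this => scene treated as static
    int   post_motion_frames;  // keep detecting this long after motion stops
    int   cooldown;            // internal: frames left in post-motion window
} motion_gate_t;

static int motion_gate_skip(motion_gate_t *g, float motion_score) {
    if (motion_score >= g->kf_skip_threshold) {
        g->cooldown = g->post_motion_frames;  // motion seen: reset window
        return 0;                             // run inference
    }
    if (g->cooldown > 0) {
        g->cooldown--;                        // recently moved: still detect
        return 0;
    }
    return 1;                                 // static scene: skip this frame
}
```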
Quick start with a prebuilt binary:

```sh
wget https://github.com/facex-engine/nexusinfer/releases/latest/download/nexusinfer-linux-x64.tar.gz
tar xzf nexusinfer-linux-x64.tar.gz
cd nexusinfer
./nexusinfer-img weights/yolov8n_320.bin sample_frame.jpg 2
# output:
# Image: sample_frame.jpg (1101x945)
# Detections: person (xx.x%) @ (...)
# Benchmark: 2.6 ms (388 FPS) over 50 runs
```

Or build from source:

```sh
git clone https://github.com/facex-engine/nexusinfer
cd nexusinfer
make CC=gcc       # → libnexusinfer.a (159 KB) + nexusinfer-img CLI
make export-320   # export YOLOv8n@320 weights (requires Python + ultralytics)
```

To reproduce the ONNX Runtime comparison:

```sh
pip install onnxruntime numpy
python tools/bench_compare.py
# Runs ORT and NexusInfer on the same inputs,
# prints latencies + delta detections.
```

Library usage:

```c
#include <stdio.h>
#include <stdint.h>
#include "nn2.h"
nn2_ctx* ctx = nn2_create("yolov8n_320.bin", /* threads */ 2);
if (!ctx) { fprintf(stderr, "load failed\n"); return 1; }

uint8_t* rgb = load_image_rgb(path, &w, &h); // your loader

Detection dets[100];
int n = nn2_detect(ctx, rgb, w, h, dets, 100);
for (int i = 0; i < n; i++) {
    printf("%s (%.1f%%) @ (%d,%d)-(%d,%d)\n",
           coco_classes[dets[i].class_id], dets[i].confidence * 100,
           dets[i].x1, dets[i].y1, dets[i].x2, dets[i].y2);
}
nn2_free(ctx);
```

Linking: `gcc your_app.c libnexusinfer.a -lm -lpthread`.
| Model | Status | Bench latency (i5-11500) |
|---|---|---|
| YOLOv8n @ 320×320 | ✓ live | 8.5 ms (117 FPS) |
| YOLOv8n @ 256×256 | ✓ live | 6.5 ms (154 FPS) |
| YOLOv8n @ 640×640 | ✓ live | ~28 ms |
| NanoDet | ✓ live | ~7 ms (320×320) |
| YOLOv8s/m/l/x | planned | — |
| FaceX | separate repo | see FaceX |
MIT — see LICENSE.
Baurzhan Atynov (@bauratynov)
| Component | What it does | Size |
|---|---|---|
| NexusDecode | H.264 decode without FFmpeg | 497 KB |
| NexusInfer (you are here) | YOLO inference on CPU, ONNX Runtime replacement | 159 KB |
| NexusSense | Compressed-domain analytics | 56 KB |
| FaceX | Face embedding INT8 on CPU | 180 KB |