
NexusInfer

YOLOv8n inference engine on CPU. 159 KB library, 1.50× faster than ONNX Runtime, hand-written AVX-512 GEMM kernels. No Python, no CUDA.


NexusInfer is a hand-written YOLOv8n inference runtime in pure C with custom AVX-512 GEMM kernels. It does not depend on ONNX Runtime, OpenVINO, NCNN, or TFLite. The goal: a GPU-less surveillance server should comfortably run object detection on dozens of cameras.

Part of the NexusEye stack.


Optimization journey: 113× speedup over naive C

[Chart: optimization journey, per-step speedup over naive C]

Each step is a real commit in the repo. Final inference latency for YOLOv8n@320×320 is 8.5 ms on a single i5-11500 core, which translates to 117 FPS single-thread.
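
For scale, "naive C" here means the textbook triple-loop GEMM that every convolution is lowered to before any optimization. A minimal sketch of that kind of baseline (illustrative; not necessarily the repo's actual starting commit):

void gemm_naive(int M, int N, int K,
                const float *A,   /* M x K, row-major */
                const float *B,   /* K x N, row-major */
                float *C)         /* M x N, row-major */
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            float acc = 0.0f;
            for (int k = 0; k < K; k++)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
}

Cache blocking, weight packing, SIMD vectorization, and op fusion then recover the 113× step by step.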

Latency vs alternatives

[Chart: inference latency, NexusInfer vs ONNX Runtime]

ONNX Runtime is the strongest mainstream CPU baseline. NexusInfer beats it by hand-tuning specifically for the YOLOv8n graph (specialized inference graph, packed weights in column-panel format, fused bias+SiLU+residual).
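
The bias+SiLU fusion is the easiest of these to picture: instead of writing the GEMM result to memory and re-reading it for bias and activation, the epilogue runs while the output tile is still hot. A scalar sketch of the fusion idea (illustrative only; names and tile layout here are assumptions, not the repo's kernel):

#include <math.h>

static inline float silu(float x)            /* SiLU: x * sigmoid(x) */
{
    return x / (1.0f + expf(-x));
}

/* Applied to each finished output tile before it is stored, so no
   intermediate activation buffer ever exists. rows = output channels. */
static void epilogue_bias_silu(float *tile, const float *bias,
                               int rows, int cols, int ldc)
{
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            tile[r * ldc + c] = silu(tile[r * ldc + c] + bias[r]);
}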


What's inside

- GEMM (FP32): 6×32 microkernel, AVX-512, KC blocking, packed weights
- GEMM (INT8): AVX-512 VNNI, vpdpbusd u8×s8→s32, 4×8c8 tiling (see the sketch after this list)
- im2col: implicit (fused into GEMM) or parallel explicit
- Activation: bias + SiLU fused into GEMM, no intermediate buffer
- Pooling / upsample: AVX-512 maxpool; NEON port on the roadmap for Q3'26
- Residual: fused into the preceding conv (no separate add pass)
- Weights format: pre-packed column-panel binary, ~5 MB for YOLOv8n@320
- Threading: pthread pool, slice-parallel plus intra-op parallelism
- Runtime SIMD dispatch: CPUID at init selects AVX-512+VNNI, AVX2, or SSE2 fallback
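
The core of the INT8 path is the vpdpbusd instruction named above. A minimal sketch of how it is typically used (tile shape here is a simplified 1×16 strip, not the repo's 4×8c8 layout; compile with -mavx512vnni):

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* One dot-product strip: 16 output columns, K unrolled by 4.
   Each _mm512_dpbusd_epi32 multiplies 64 u8 activations by 64 s8
   weights and accumulates 4-deep dot products into 16 s32 lanes.
   k must be a multiple of 4; b points at weights packed as
   [k/4][16 cols][4 rows] s8. */
static __m512i dot_u8s8_16cols(const uint8_t *a, const int8_t *b, int k)
{
    __m512i acc = _mm512_setzero_si512();
    for (int i = 0; i < k; i += 4) {
        int32_t a4;
        memcpy(&a4, a + i, sizeof a4);                 /* 4 consecutive u8 */
        __m512i va = _mm512_set1_epi32(a4);            /* broadcast to 16 lanes */
        __m512i vb = _mm512_loadu_si512(b + i * 16);   /* 16 cols x 4 rows s8 */
        acc = _mm512_dpbusd_epi32(acc, va, vb);        /* u8*s8, accumulate s32 */
    }
    return acc;                                        /* 16 s32 partial sums */
}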

Smart NVR pipeline

nn2_ctx* ctx = nn2_create("yolov8n_320.bin", 0);
motion_gate_t gate = {.kf_skip_threshold = 0.05f, .post_motion_frames = 30};
kalman_tracker_t* tracker = kalman_tracker_create();

// rgb_data, w, h and frame_metadata come from your decoder (e.g. NexusDecode)
for (int frame = 0; frame < total_frames; frame++) {
    if (motion_gate_skip(&gate, frame_metadata)) continue;   // skip static periods
    Detection dets[100];
    int n = nn2_detect(ctx, rgb_data, w, h, dets, 100);
    kalman_update(tracker, dets, n);                          // smooth bboxes
}

In this combined pipeline the amortized cost is 0.56 ms/frame on a typical NVR stream. At 25 fps per camera that is ~70 cameras on a single GPU-less machine, 14.8× faster than running pure inference on every frame.
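
The gating is what makes the amortization work; its internals live in the repo. Purely as an illustration of the idea, under assumed semantics for the two fields configured above (this is a hypothetical sketch, not the repo's motion_gate_skip):

typedef struct {
    float kf_skip_threshold;    /* assumed: min motion score that counts as motion */
    int   post_motion_frames;   /* assumed: keep detecting this long after motion */
    int   countdown;            /* frames of detection still owed */
} gate_sketch_t;

/* Hypothetical gate: returns 1 (skip inference) once no motion has
   been seen for post_motion_frames frames. */
static int gate_skip_sketch(gate_sketch_t *g, float motion_score)
{
    if (motion_score > g->kf_skip_threshold)
        g->countdown = g->post_motion_frames;    /* motion: (re)arm the window */
    else if (g->countdown > 0)
        g->countdown--;                          /* cooling down after motion */
    return g->countdown == 0;
}

On a stream that is static most of the time, nearly every frame returns early here, which is how 8.5 ms per detection averages down to 0.56 ms per frame.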


Quick start

Linux x86-64 (prebuilt binary)

wget https://github.com/facex-engine/nexusinfer/releases/latest/download/nexusinfer-linux-x64.tar.gz
tar xzf nexusinfer-linux-x64.tar.gz
cd nexusinfer
./nexusinfer-img weights/yolov8n_320.bin sample_frame.jpg 2

# output:
# Image: sample_frame.jpg (1101x945)
# Detections: person (xx.x%) @ (...)
# Benchmark: 2.6 ms (388 FPS) over 50 runs

Build from source

git clone https://github.com/facex-engine/nexusinfer
cd nexusinfer
make CC=gcc       # → libnexusinfer.a (159 KB) + nexusinfer-img CLI
make export-320   # export YOLOv8n@320 weights (requires Python + ultralytics)

Compare with ONNX Runtime yourself

pip install onnxruntime numpy
python tools/bench_compare.py
# Runs ORT and NexusInfer on the same inputs,
# prints latencies + delta detections.

Use as a library

#include "nn2.h"

nn2_ctx* ctx = nn2_create("yolov8n_320.bin", /* threads */ 2);
if (!ctx) { fprintf(stderr, "load failed\n"); return 1; }

uint8_t* rgb = load_image_rgb(path, &w, &h);     // your loader
Detection dets[100];
int n = nn2_detect(ctx, rgb, w, h, dets, 100);

for (int i = 0; i < n; i++) {
    printf("%s (%.1f%%) @ (%d,%d)-(%d,%d)\n",
           coco_classes[dets[i].class_id], dets[i].confidence * 100,
           dets[i].x1, dets[i].y1, dets[i].x2, dets[i].y2);
}

nn2_free(ctx);

Linking: gcc your_app.c libnexusinfer.a -lm -lpthread.


Supported models

Benchmarked on an i5-11500:

- YOLOv8n @ 320×320: live, 8.5 ms (117 FPS)
- YOLOv8n @ 256×256: live, 6.5 ms (154 FPS)
- YOLOv8n @ 640×640: live, ~28 ms
- NanoDet @ 320×320: live, ~7 ms
- YOLOv8s/m/l/x: planned
- FaceX: separate model, see facex

License

MIT — see LICENSE.

Author

Baurzhan Atynov (@bauratynov)

Part of the NexusEye stack

- NexusDecode: H.264 decode without FFmpeg (497 KB)
- NexusInfer (you are here): YOLO inference on CPU, ONNX Runtime replacement (159 KB)
- NexusSense: compressed-domain analytics (56 KB)
- FaceX: face embedding, INT8 on CPU (180 KB)
