CrispEmbed

Lightweight text embedding inference via ggml. No Python runtime, no ONNX. 10 architectures: BERT, XLM-R, MPNet, NomicBERT, ModernBERT, GTE v1.5, DeBERTa-v2, Qwen3, Gemma3, SPLADE. GPU acceleration (CUDA/Vulkan/Metal), BLAS (OpenBLAS/MKL).

Multi-vector retrieval: dense, sparse (SPLADE/BGE-M3), ColBERT multi-vector, cross-encoder rerankers, bi-encoder reranking — all in one binary, all GPU-accelerated.

9.5x faster than FastEmbed (ONNX) on MiniLM-L6. Python/Rust/Dart APIs. iOS (Metal) + Android (Vulkan) builds. 29 HuggingFace model repos.

Part of the Crisp ecosystem

| Project | Role |
| --- | --- |
| CrispEmbed | This repo: text embedding engine (ggml), dense + sparse + ColBERT + reranking |
| CrispASR | Speech recognition engine (ggml): 11 ASR backends, same philosophy for audio |
| CrisperWeaver | Flutter transcription app powered by CrispASR: desktop + mobile, fully offline |
| Susurrus | Python ASR GUI with 9 backends (faster-whisper, mlx-whisper, voxtral, ...) |

Status

23+ models verified (cos>=0.96), 43+ in registry:

| Model | Type | Dim | F32 CosSim | Q8_0 | Q4_K |
| --- | --- | --- | --- | --- | --- |
| all-MiniLM-L6-v2 | BERT | 384 | 0.999999 | 0.9995 | 0.97 |
| gte-small | BERT | 384 | 1.000000 | 0.9998 | 0.99 |
| arctic-embed-xs | BERT | 384 | 1.000000 | 0.9999 | 0.99 |
| multilingual-e5-small | XLM-R | 384 | 1.000000 | 0.9999 | 0.99 |
| PIXIE-Rune-v1.0 | XLM-R | 1024 | 0.999993 | 0.9991 | 0.95 |
| arctic-embed-l-v2 | XLM-R | 1024 | 0.999993 | 0.9989 | 0.95 |
| Octen-Embedding-0.6B | Qwen3 | 1024 | 0.999891 | 0.9995 | 0.97 |
| F2LLM-v2-0.6B | Qwen3 | 1024 | 0.999420 | 0.9952 | -- |
| Jina v5 Nano | Qwen3 | 768 | 0.999020 | 0.9983 | -- |
| Jina v5 Small | Qwen3 | 1024 | 0.999941 | 0.9997 | 0.97 |
| Harrier-OSS-v1-0.6B | Qwen3 | 1024 | 0.999959 | 0.9999 | 0.99 |
| Qwen3-Embedding-0.6B | Qwen3 | 1024 | 0.999895 | 0.9996 | 0.97 |
| Harrier-OSS-v1-270M | Gemma3 | 640 | 0.999948 | 0.9998 | 0.99 |
| all-mpnet-base-v2 | MPNet | 768 | 0.999997 | 0.9998 | 0.99 |
| nomic-embed-text-v1.5 | NomicBERT | 768 | 0.999441 | 0.9994 | -- |
| multilingual-e5-base | XLM-R | 768 | 0.999995 | 0.9999 | 0.99 |
| multilingual-e5-large | XLM-R | 1024 | 0.999997 | 0.9999 | 0.99 |
| granite-embedding-278m | XLM-R | 768 | 0.999984 | 0.9999 | 0.99 |
| granite-embedding-107m | XLM-R | 384 | 0.999986 | 0.9999 | 0.99 |
| bge-small-en-v1.5 | BERT | 384 | 0.999999 | 0.9999 | 0.99 |
| bge-base-en-v1.5 | BERT | 768 | 0.999994 | 0.9999 | 0.99 |
| bge-large-en-v1.5 | BERT | 1024 | 0.999992 | 0.9999 | 0.99 |
| mxbai-embed-large-v1 | BERT | 1024 | 1.000032 | 0.9999 | 0.99 |

Q8_0 = all PASS (cos > 0.99). Q4_K = most PASS; -- = use Q5_K or Q8_0 for this model.
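
The CosSim columns compare each CrispEmbed embedding against a reference embedding for the same text. As a minimal sketch of that kind of check (the function names and toy vectors here are illustrative assumptions, not the project's test harness), cosine similarity and a pass threshold can be computed like this:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def passes(reference, candidate, threshold=0.99):
    """True if the candidate stays above the similarity threshold."""
    return cosine_similarity(reference, candidate) > threshold

# Toy vectors standing in for a reference embedding and a quantized one.
reference = [0.12, -0.48, 0.33, 0.80]
quantized = [0.11, -0.47, 0.34, 0.79]
print(passes(reference, quantized))
```

In reduced-precision arithmetic a computed value can land marginally above 1.0 through rounding, which is one plausible reading of an entry such as 1.000032.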

Performance (Apple M1, Metal):

| Engine | Single text | Batch (10) |
| --- | --- | --- |
| CrispEmbed Python (ctypes) | 3.6 ms / 280 t/s | 12.7 ms / 787 t/s |
| fastembed-rs (Rust ONNX) | 3.8 ms / 263 t/s | 18.9 ms / 528 t/s |
| HuggingFace (PyTorch) | 12.2 ms / 82 t/s | 29.8 ms / 335 t/s |
| CrispEmbed Server (HTTP) | 21.3 ms / 46 t/s | 32.9 ms / 303 t/s |

Model: all-MiniLM-L6-v2. See PERFORMANCE.md for full multi-model benchmarks.
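
The two numbers in each cell are related by simple arithmetic, assuming "t/s" means texts per second: throughput is the batch size divided by the latency in seconds. A small worked example (the helper name is hypothetical):

```python
def texts_per_second(latency_ms, batch_size=1):
    """Throughput implied by a latency measurement for one batch."""
    return batch_size / (latency_ms / 1000.0)

# Batch of 10 texts in 12.7 ms, as in the CrispEmbed Python row.
print(round(texts_per_second(12.7, batch_size=10)))  # 787
```

Some cells differ from this formula by a unit or two, presumably because the published latencies are themselves rounded.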

Ollama-compatible: All 13 models export as Ollama-compatible GGUFs. Works with our Ollama fork (adds XLM-R, Viterbi SentencePiece tokenizer, GELU_ERF, multi-tokenizer BERT support).

Quick start

# Clone with submodule
git clone --recursive https://github.com/CrispStrobe/CrispEmbed
cd CrispEmbed

# Build (CPU)
cmake -S . -B build
cmake --build build -j

# Encode text
./build/crispembed -m model.gguf "Hello world"

# Matryoshka truncation (e.g. 128 dims from a 384-dim model)
./build/crispembed -m model.gguf -d 128 "Hello world"

# Start server (model loaded once, fast repeated queries)
./build/crispembed-server -m model.gguf --port 8080
curl -X POST http://localhost:8080/embed \
    -d '{"texts": ["Hello world"]}'
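
Matryoshka truncation, as used by the -d 128 example above, keeps only the leading dimensions of an embedding. A minimal sketch of the idea follows. The re-normalisation step is an assumption based on common practice; this README does not state whether CrispEmbed re-normalises after truncating, and truncate_embedding is a hypothetical helper, not part of the CrispEmbed API.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale to unit L2 norm."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]

full = [3.0, 4.0, 12.0]  # toy 3-dim "embedding"
print(truncate_embedding(full, 2))
```

Truncation only gives useful vectors for models trained with a Matryoshka objective; for other models the leading dimensions carry no special meaning.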

Building

Linux / macOS

# CPU only (default)
cmake -S . -B build && cmake --build build -j

# With OpenBLAS acceleration
cmake -S . -B build -DGGML_BLAS=ON && cmake --build build -j

# With Intel MKL
cmake -S . -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=Intel10_64lp && cmake --build build -j

# With CUDA (NVIDIA GPU)
cmake -S . -B build -DGGML_CUDA=ON && cmake --build build -j

# With Vulkan (cross-platform GPU)
cmake -S . -B build -DGGML_VULKAN=ON && cmake --build build -j

# macOS with Metal (recommended)
./build-macos.sh              # Metal + Accelerate + embedded shaders
./build-macos.sh --cpu        # CPU only, no Metal
./build-macos.sh --shared     # Also build shared lib for Python

Windows

Requires Visual Studio 2022 Build Tools + Ninja.

:: CPU build
build-windows.bat

:: Vulkan GPU build (needs Vulkan SDK)
build-vulkan.bat

:: CUDA GPU build (needs CUDA Toolkit)
build-cuda.bat

If you get "ggml does not contain a CMakeLists.txt", run:

git submodule update --init --recursive

Dependencies

  • Required: C++17 compiler, CMake 3.14+
  • Optional: OpenBLAS (apt install libopenblas-dev), Intel MKL, CUDA Toolkit, Vulkan SDK

Converting models

# BERT / XLM-R encoder models
pip install torch transformers gguf
python models/convert-bert-to-gguf.py \
    --model sentence-transformers/all-MiniLM-L6-v2 \
    --output all-MiniLM-L6-v2.gguf

# Qwen3 / Gemma3 decoder models
python models/convert-decoder-embed-to-gguf.py \
    --model Octen/Octen-Embedding-0.6B \
    --output octen-0.6b.gguf

# Quantize (Q8_0 recommended, Q4_K for max compression)
./build/crispembed-quantize model.gguf model-q8_0.gguf q8_0
./build/crispembed-quantize model.gguf model-q4_k.gguf q4_k

Pre-converted models: HuggingFace cstr/

Quantization

| Type | Compression | Quality (cos vs F32) | Notes |
| --- | --- | --- | --- |
| Q8_0 | ~3.8x | >0.995 | Recommended default |
| Q5_K | ~5x | >0.98 | Good balance |
| Q4_K | ~5.5x | >0.95 | Max compression |
| Q6_K | ~4.5x | >0.99 | Premium quality |

Embedding tables are quantized to Q8_0 even in Q4_K mode, because they are quality-sensitive.
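
To make the Q8_0 row concrete: ggml's Q8_0 format stores weights in blocks of 32 values, each block holding one 16-bit scale plus 32 signed 8-bit integers. The sketch below imitates that scheme in plain Python for illustration only. It keeps the scale as a Python float rather than fp16, and it is not the code path used by crispembed-quantize.

```python
BLOCK_SIZE = 32  # values per Q8_0 block

def quantize_q8_0(values):
    """Split values into blocks of 32 and store (scale, int8 quants) per block."""
    if len(values) % BLOCK_SIZE != 0:
        raise ValueError("length must be a multiple of 32")
    blocks = []
    for start in range(0, len(values), BLOCK_SIZE):
        chunk = values[start:start + BLOCK_SIZE]
        scale = max(abs(v) for v in chunk) / 127.0
        if scale == 0.0:
            quants = [0] * BLOCK_SIZE
        else:
            # Clamp guards against rounding pushing a value past +/-127.
            quants = [max(-127, min(127, round(v / scale))) for v in chunk]
        blocks.append((scale, quants))
    return blocks

def dequantize_q8_0(blocks):
    """Reconstruct approximate floats from (scale, quants) blocks."""
    return [scale * q for scale, quants in blocks for q in quants]

# Storage per block: 32 float32 values versus a 2-byte scale plus 32 int8 values.
bytes_f32 = BLOCK_SIZE * 4
bytes_q8_0 = 2 + BLOCK_SIZE
print(round(bytes_f32 / bytes_q8_0, 2))  # 3.76
```

The 128-to-34-byte ratio per block is where a figure close to the table's ~3.8x comes from.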

BGE-M3 / Sparse / ColBERT / Reranker

CrispEmbed supports all three BGE-M3 retrieval modalities plus cross-encoder rerankers.

# Convert BGE-M3 (writes sparse_linear.weight + colbert_linear.weight into GGUF)
pip install torch transformers gguf FlagEmbedding
python models/convert-bert-to-gguf.py --model BAAI/bge-m3 --output bge-m3.gguf --crisp

# Validate all three heads against FlagEmbedding ground truth
python tests/test_bgem3.py --gguf bge-m3.gguf --lib build/libcrispembed.so

from crispembed import CrispEmbed

model = CrispEmbed("bge-m3.gguf")

# Dense (L2-normalised)
vec = model.encode("Hello world")                   # shape (1024,)

# Sparse (SPLADE-style term weights)
if model.has_sparse():
    sparse = model.encode_sparse("Hello world")     # {token_id: weight}

# ColBERT multi-vector
if model.has_colbert():
    multi = model.encode_multivec("Hello world")    # (n_tokens, 128)

Cross-encoder rerankers:

reranker = CrispEmbed("bge-reranker-v2-m3.gguf")
score = reranker.rerank("query text", "document text")   # raw logit
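
Once sparse and ColBERT outputs are available, query and document still have to be scored against each other. The sketch below shows the two usual scoring rules on toy data: a dot product over shared token ids for sparse weights, and MaxSim late interaction for multi-vector output. Both functions are hypothetical helpers written for this illustration, not CrispEmbed APIs.

```python
def sparse_score(query_weights, doc_weights):
    """Sum of weight products over token ids present in both maps."""
    return sum(w * doc_weights[t] for t, w in query_weights.items()
               if t in doc_weights)

def colbert_maxsim(query_vecs, doc_vecs):
    """For each query token take its best dot product with any doc token, then sum."""
    total = 0.0
    for q in query_vecs:
        total += max(sum(a * b for a, b in zip(q, d)) for d in doc_vecs)
    return total

# Toy sparse maps {token_id: weight} and toy 2-dim token vectors.
print(sparse_score({1: 0.5, 2: 1.0}, {2: 0.4, 3: 0.9}))
print(colbert_maxsim([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.6, 0.8]]))
```

Some implementations additionally divide the MaxSim sum by the number of query tokens so that scores are comparable across queries of different length.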

Python

Requires the shared library (--shared flag or -DCRISPEMBED_BUILD_SHARED=ON).

from crispembed import CrispEmbed

model = CrispEmbed("all-MiniLM-L6-v2.gguf")

# Single text
vec = model.encode("Hello world")      # shape (384,)

# Batch — single C call, true batched Metal/GPU inference
vectors = model.encode(["Hello world", "Goodbye world"])
print(vectors.shape)  # (2, 384)

# Matryoshka dimension truncation
model.set_dim(128)
vec128 = model.encode("Hello world")   # shape (128,)

# Prompt prefix (for models that need it)
model.set_prefix("query: ")           # auto-prepended before tokenization

# Sparse (BGE-M3)
model = CrispEmbed("bge-m3.gguf")
if model.has_sparse:
    sparse = model.encode_sparse("Hello world")   # {token_id: weight}

# ColBERT multi-vector
if model.has_colbert:
    multi = model.encode_multivec("Hello world")   # (n_tokens, 128)

# Cross-encoder reranking
reranker = CrispEmbed("bge-reranker-v2-m3.gguf")
score = reranker.rerank("query", "document")       # raw logit

# Bi-encoder reranking (any embedding model, cosine similarity)
results = model.rerank_biencoder("query", ["doc1", "doc2", "doc3"], top_n=2)
for r in results:
    print(f"  [{r['index']}] {r['score']:.4f}: {r['document']}")
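
Conceptually, bi-encoder reranking embeds the query and each document separately and sorts documents by cosine similarity. The following is a rough pure-Python sketch of that idea on precomputed vectors. rerank_by_cosine is a hypothetical stand-in, not the library's implementation; the result keys simply mirror the ones printed above.

```python
import math

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rerank_by_cosine(query_vec, doc_vecs, documents, top_n=None):
    """Score each document against the query and return the best first."""
    scored = [
        {"index": i, "score": _cosine(query_vec, vec), "document": doc}
        for i, (vec, doc) in enumerate(zip(doc_vecs, documents))
    ]
    scored.sort(key=lambda r: r["score"], reverse=True)
    return scored if top_n is None else scored[:top_n]

# Toy 2-dim embeddings standing in for real model output.
query = [1.0, 0.0]
docs = ["doc1", "doc2", "doc3"]
vecs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
for r in rerank_by_cosine(query, vecs, docs, top_n=2):
    print(r["index"], round(r["score"], 4), r["document"])
```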

Rust

[dependencies]
crispembed = { git = "https://github.com/CrispStrobe/CrispEmbed" }

use crispembed::CrispEmbed;

let mut model = CrispEmbed::new("model.gguf", 0)?;
let vec = model.encode("Hello world");

// Prompt prefix
model.set_prefix("query: ");

// Sparse + ColBERT (BGE-M3)
if model.has_sparse() {
    let sparse = model.encode_sparse("query");   // Vec<(i32, f32)>
}
if model.has_colbert() {
    let multi = model.encode_multivec("query");  // Vec<Vec<f32>>
}

// Bi-encoder reranking (cosine similarity)
let ranked = model.rerank_biencoder("query", &["doc1", "doc2"], Some(2));
for (idx, score) in &ranked {
    println!("  doc {} score {:.4}", idx, score);
}

Dart / Flutter

# pubspec.yaml
dependencies:
  crispembed:
    path: path/to/CrispEmbed/flutter/crispembed

import 'package:crispembed/crispembed.dart';

final model = CrispEmbed('model.gguf');

// Dense encoding
final vec = model.encode('Hello world');           // Float32List(384)
final batch = model.encodeBatch(['Hello', 'World']); // List<Float32List>

// Matryoshka truncation + prefix
model.setDim(128);
model.setPrefix('query: ');

// Bi-encoder reranking
final ranked = model.rerankBiencoder('query', ['doc1', 'doc2']);

// Sparse / ColBERT / cross-encoder (BGE-M3, rerankers)
if (model.hasSparse) {
  final sparse = model.encodeSparse('text');  // Map<int, double>
}

model.dispose();

Works on iOS (Metal GPU), Android (Vulkan/NEON), macOS, Linux, Windows.

Mobile (iOS / Android)

# iOS — xcframework with Metal GPU acceleration
./build-ios.sh                    # arm64 device + simulator
./build-ios.sh --device           # device only

# Android — NDK cross-compilation
./build-android.sh                # arm64-v8a + armeabi-v7a + x86_64
./build-android.sh --abi arm64-v8a --vulkan  # single ABI with Vulkan GPU

Output:

  • iOS: build-ios/CrispEmbed.xcframework/
  • Android: build-android/<abi>/libcrispembed.so

Benchmarking

./benchmark.sh                          # single model, all engines
./benchmark.sh --multi                  # 6 models, all engines
./benchmark.sh -n 100 --skip-fastembed  # CrispEmbed + HF only, 100 runs

# RAG retrieval quality benchmark
python tests/bench_rag.py --lib build/libcrispembed.so --gguf model.gguf

# Reranking benchmark
python tests/bench_rerank.py --lib build/libcrispembed.so \
    --embed-gguf model.gguf --reranker-gguf reranker.gguf

Compares CrispEmbed (CLI, Python ctypes, HTTP server) against HuggingFace sentence-transformers, FastEmbed (ONNX), and fastembed-rs (Rust ONNX). Auto-creates a .bench-venv for Python dependencies.

Architecture

BERT encoder (all-MiniLM, gte, arctic-embed-xs):

  • Token + Position + Type embeddings → Post-LN transformer → Mean/CLS pooling
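
The final pooling step turns per-token hidden states into one vector. A minimal sketch of mean and CLS pooling followed by L2 normalisation, on plain Python lists, is shown below. The function names are illustrative and the real implementation runs inside ggml graphs.

```python
import math

def mean_pool(hidden_states, attention_mask):
    """Average the hidden states of non-padding tokens."""
    kept = [h for h, m in zip(hidden_states, attention_mask) if m]
    dim = len(hidden_states[0])
    return [sum(h[j] for h in kept) / len(kept) for j in range(dim)]

def cls_pool(hidden_states):
    """Use the hidden state of the first ([CLS]) token."""
    return list(hidden_states[0])

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Three tokens with 2-dim hidden states; the last token is padding.
states = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(l2_normalize(mean_pool(states, mask)))
print(l2_normalize(cls_pool(states)))
```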

XLM-R encoder (PIXIE-Rune, multilingual-e5, arctic-embed-l-v2):

  • Token + Position(+offset) embeddings → Post-LN transformer → CLS/Mean pooling
  • SentencePiece Unigram tokenizer (Viterbi DP)
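
Viterbi decoding for a Unigram tokenizer picks the segmentation whose pieces have the highest total log-probability. The sketch below runs that dynamic program over a toy vocabulary. The vocabulary, scores, and the single-character unknown fallback are made up for illustration and do not reflect any real SentencePiece model.

```python
import math

def viterbi_segment(text, log_probs, unk_log_prob=-20.0):
    """Return the highest-scoring split of `text` into vocabulary pieces."""
    n = len(text)
    max_len = max(len(piece) for piece in log_probs)
    best = [-math.inf] * (n + 1)   # best[i]: best score for text[:i]
    back = [0] * (n + 1)           # back[i]: start of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs:
                score = best[start] + log_probs[piece]
            elif end - start == 1:
                score = best[start] + unk_log_prob  # unknown single character
            else:
                continue
            if score > best[end]:
                best[end], back[end] = score, start
    pieces, end = [], n
    while end > 0:                 # walk the back-pointers
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1]

vocab = {"▁hello": -2.0, "▁hel": -3.0, "lo": -3.0, "▁": -1.5,
         "h": -4.0, "e": -4.0, "l": -4.0, "o": -4.0}
print(viterbi_segment("▁helloo", vocab))
```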

BGE-M3 multi-modal (BAAI/bge-m3):

  • Same BERT encoder trunk with three output heads:
    • Dense: mean-pool → L2 normalize → float[1024]
    • Sparse: Linear(H,1) + ReLU → scatter via input_ids → {token_id: weight}
    • ColBERT: Linear(H,128) → per-token L2 normalize → float[n_tokens][128]
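
The sparse head described above can be sketched in a few lines: project each token's hidden state to a scalar, apply ReLU, and scatter the result to that token's id. Two details in this sketch are assumptions rather than statements from this README: repeated tokens keep their maximum weight, and special tokens are skipped. sparse_head is a hypothetical name.

```python
def sparse_head(hidden_states, input_ids, weight, bias=0.0, special_ids=()):
    """Linear(H,1) + ReLU per token, scattered into {token_id: weight}."""
    out = {}
    for hidden, token_id in zip(hidden_states, input_ids):
        if token_id in special_ids:
            continue
        value = sum(h * w for h, w in zip(hidden, weight)) + bias
        value = max(0.0, value)                    # ReLU
        if value > out.get(token_id, 0.0):         # keep max for repeated tokens
            out[token_id] = value
    return out

# Toy input: 4 tokens, hidden size 2; token 5 appears twice.
hidden = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [-1.0, 0.0]]
ids = [5, 7, 5, 9]
print(sparse_head(hidden, ids, weight=[0.5, 0.25]))
```

Token 9 drops out of the result because its projected value is negative and ReLU maps it to zero.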

MPNet encoder (all-mpnet-base-v2):

  • Token + Position(+offset) embeddings → Post-LN transformer with relative position bias → Mean pooling
  • T5-style logarithmic bucket relative attention bias (32 buckets × n_heads)
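
The logarithmic bucketing maps a signed token offset to one of 32 bucket ids: small offsets get their own bucket, larger ones share log-spaced buckets. The sketch below follows the bidirectional T5-style scheme. The max_distance of 128 and the convention relative_position = key_pos - query_pos are assumptions taken from common reference implementations, not from this README.

```python
import math

def relative_position_bucket(relative_position, num_buckets=32, max_distance=128):
    """Map a signed offset to a bucket id in [0, num_buckets)."""
    n = -relative_position
    half = num_buckets // 2            # one half of the buckets per direction
    bucket = half if n < 0 else 0
    n = abs(n)
    max_exact = half // 2              # offsets below this get exact buckets
    if n < max_exact:
        return bucket + n
    # Larger offsets share logarithmically spaced buckets up to max_distance.
    log_ratio = math.log(n / max_exact) / math.log(max_distance / max_exact)
    large = max_exact + int(log_ratio * (half - max_exact))
    return bucket + min(large, half - 1)

for offset in (0, -1, 1, -7, -8, -200, 200):
    print(offset, relative_position_bucket(offset))
```

Each bucket id then indexes a learned bias table of shape 32 x n_heads, and the looked-up bias is added to the attention scores.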

NomicBERT encoder (nomic-embed-text-v1.5):

  • Token embeddings (no position) + RoPE → Post-LN transformer + SwiGLU FFN → Mean pooling
  • Rotary position embeddings (same as decoder path), no absolute position embeddings
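
Rotary position embeddings encode position by rotating pairs of query/key components through position-dependent angles. A minimal sketch follows, assuming the interleaved pairing (x[2i], x[2i+1]) and a base of 10000. Models differ on the pairing layout and base, so treat both as assumptions.

```python
import math

def apply_rope(vec, position, base=10000.0):
    """Rotate each consecutive pair of components by position * base**(-i/d)."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("dimension must be even")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.1]
# The attention score depends only on the offset between positions (2 in both cases).
print(dot(apply_rope(q, 3), apply_rope(k, 1)))
print(dot(apply_rope(q, 5), apply_rope(k, 3)))
```

The two printed scores agree up to floating-point error, which is the property that lets RoPE replace absolute position embeddings.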

Cross-encoder reranker (BGE-reranker-v2-m3, ms-marco-MiniLM, mxbai-rerank, etc.):

  • [CLS] query [SEP] document [SEP] pair tokenization → CLS hidden state → Linear(H,1) → scalar score
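
The pair layout and scoring head can be sketched as follows. The token ids, the 512-token limit, the segment ids, and the choice to truncate only the document are illustrative assumptions; build_pair and score_from_cls are hypothetical helpers, not CrispEmbed functions.

```python
def build_pair(query_ids, doc_ids, cls_id, sep_id, max_len=512):
    """Lay out [CLS] query [SEP] document [SEP], truncating the document."""
    budget = max_len - len(query_ids) - 3   # room left for document tokens
    if budget < 0:
        raise ValueError("query too long for max_len")
    doc_ids = doc_ids[:budget]
    ids = [cls_id] + query_ids + [sep_id] + doc_ids + [sep_id]
    type_ids = [0] * (len(query_ids) + 2) + [1] * (len(doc_ids) + 1)
    return ids, type_ids

def score_from_cls(cls_hidden, weight, bias=0.0):
    """Linear(H,1) on the CLS hidden state gives the raw relevance logit."""
    return sum(h * w for h, w in zip(cls_hidden, weight)) + bias

# Example ids only; real ids come from the model's tokenizer.
ids, type_ids = build_pair([11, 12], [21, 22, 23], cls_id=101, sep_id=102)
print(ids)
print(type_ids)
print(score_from_cls([0.5, -1.0], weight=[2.0, 1.0], bias=0.25))
```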

Qwen3 decoder (Octen, F2LLM, Jina v5, Harrier-0.6B, Qwen3-Embed):

  • Token embeddings + RoPE → RMSNorm + GQA with causal mask + SwiGLU → Last-token pooling

Gemma3 decoder (Harrier-270M):

  • Token embeddings * sqrt(H) + RoPE → Gemma3 RMSNorm(1+w) + GQA + GeGLU → Last-token pooling
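
Two pieces of the decoder paths are easy to illustrate in isolation: RMSNorm, including the Gemma-style variant that scales by (1 + w) instead of w, and last-token pooling. The sketch below assumes right-padded inputs and an epsilon of 1e-6. Both are assumptions, and the function names are hypothetical.

```python
import math

def rms_norm(x, weight, eps=1e-6, gemma_style=False):
    """Divide by the root mean square, then scale by w (or 1 + w for Gemma)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    scale = [(1.0 + w) if gemma_style else w for w in weight]
    return [v / rms * s for v, s in zip(x, scale)]

def last_token_pool(hidden_states, attention_mask):
    """Use the hidden state of the last non-padding token."""
    last = max(i for i, m in enumerate(attention_mask) if m)
    return list(hidden_states[last])

x = [3.0, 4.0]
# A zero Gemma-style weight behaves like a unit weight in the standard form.
print(rms_norm(x, [1.0, 1.0]))
print(rms_norm(x, [0.0, 0.0], gemma_style=True))
print(last_token_pool([[1.0], [2.0], [3.0]], [1, 1, 0]))
```

With a causal mask only the last real token has attended to the whole input, which is why decoder-based embedding models pool from that position.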

All via ggml graphs with GPU dispatch (ggml_backend_sched). See PLAN.md, LEARNINGS.md, PERFORMANCE.md.

Credits
