
Cogni-ML

Crystal machine learning library with native Apple Silicon GPU acceleration.

Cogni-ML is currently two things:

  • A general Crystal ML toolkit: tensors, autograd, NN layers, optimizers, GGUF readers, and llama.cpp bindings.
  • A native Metal inference lab for GGUF models, with production-oriented work on nomic-embed-text-v2-moe embeddings and Qwen 3.5 text generation.

Highlights

  • Native Metal embedding pipeline for nomic-embed-text-v2-moe.
  • Native Qwen 3.5 9B GGUF inference path for Apple Silicon Metal.
  • Q4_K/Q5_K/Q6_K/Q8_0 quantized matmul kernels.
  • Chunked Qwen 3.5 prefill, decode wave scheduling, prompt-state cache restore, and exact speculative decode harnesses.
  • ComputeGraph wave scheduling with offset-aware barrier optimization.
  • Crystal autograd engine, NN layers, and Adam/AdamW optimizers.
  • llama.cpp FFI bindings for general GGUF model access.

Architecture

src/ml/
  core/         Tensor, Shape, MetalBuffer
  autograd/     Variable, GradFn backward pass
  nn/           Linear, LayerNorm, MultiHeadAttention, ViT
  optim/        Adam/AdamW
  llm/          llama.cpp FFI bindings
  gguf/         GGUF reader, tokenizer, dequantization, Qwen35, NomicBertMoE
  metal/        Device, ComputeEncoder, ComputeGraph, GraphEncoder

Qwen 3.5 Native Metal

The native Qwen path targets Qwen3.5-9B-Q4_K_M.gguf on Apple Silicon. The code supports:

  • Qwen 3.5 GGUF metadata and tokenizer loading.
  • Q4_K, Q5_K, Q6_K, and Q8_0 quantized projections.
  • Full-attention layers with GQA, partial RoPE, KV cache writes, and fused output projection.
  • DeltaNet/recurrent layers with GPU-resident recurrent state and chunked prefill scan.
  • Chunked prefill with final-token top1 shortcut.
  • Decode wave scheduling to reduce command-buffer boundaries.
  • Exact prompt-state save/restore and longest-prefix prompt cache.
  • Exact speculative decode harnesses:
    • neural draft with Qwen 3.5 0.8B Q8_0,
    • n-gram/cache draft for repeated/generated-template text,
    • target-verifier chunks with row-batched top1 for larger accepted chunks.

The 9B Q4_K_M path is the primary verified target. Qwen 3.6 27B is a scale-up target, but it should be treated as experimental until local correctness and performance runs are completed.
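
The exact speculative scheme described above can be illustrated abstractly: a cheap draft proposes a chunk of tokens, the target computes its own greedy token at every position, and the longest matching prefix is accepted, with the target's token emitted at the first mismatch. A minimal Python sketch of that verification loop (illustrative only, not the library's code):

```python
def verify_chunk(draft_tokens, target_greedy):
    """Exact greedy verification: target_greedy[i] is the target's
    greedy choice at position i given the accepted context.
    Returns the tokens actually emitted this cycle."""
    accepted = []
    for d, t in zip(draft_tokens, target_greedy):
        if d == t:
            accepted.append(d)      # draft matched the target exactly
        else:
            accepted.append(t)      # emit the target's token and stop
            break
    else:
        # Whole chunk accepted; the verifier scores one extra position,
        # so a "bonus" target token can be emitted for free.
        if len(target_greedy) > len(draft_tokens):
            accepted.append(target_greedy[len(draft_tokens)])
    return accepted

# Example: draft gets 3 of 4 tokens right
print(verify_chunk([5, 9, 2, 7], [5, 9, 2, 4, 1]))  # [5, 9, 2, 4]
```

Because every emitted token is either a verified draft token or the target's own greedy choice, the output is exactly what plain greedy target decoding would produce.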

Model Layout

The developer CLIs default to local LM Studio / llama.cpp-style paths:

~/.cache/lm-studio/models/lmstudio-community/Qwen3.5-9B-GGUF/Qwen3.5-9B-Q4_K_M.gguf
~/.cache/lm-studio/models/lmstudio-community/Qwen3.5-0.8B-GGUF/Qwen3.5-0.8B-Q8_0.gguf
~/SrcArchives/AI/llama.cpp/build/bin/llama-tokenize
~/SrcArchives/AI/llama.cpp/build/bin/llama-bench

Most benchmark/probe CLIs also accept --model, --target, --draft, --tokenizer-bin, or environment overrides. bin/qwen35_generate.cr is intentionally a small demo and is configured through the constants at the top of the file.

Build Qwen CLIs

Build the CPU-only GGUF/Qwen metadata smoke on Linux, CUDA hosts, or any environment where Metal is unavailable:

crystal build -Dcpu_only bin/qwen35_gguf_info.cr -o build/qwen35_gguf_info
./build/qwen35_gguf_info --model /path/to/Qwen3.5-9B-Q4_K_M.gguf
./build/qwen35_gguf_info --model /path/to/Qwen3.5-0.8B-Q8_0.gguf --load-weights

This entrypoint intentionally does not run inference. It verifies GGUF parsing, Qwen 3.5/3.6 hparams, tensor inventory, and the structured Qwen35Weights loader without pulling the Metal bridge into a Linux build.

Build the minimal Crystal CUDA Driver API smoke on NVIDIA/Linux hosts:

crystal build bin/cuda_driver_smoke.cr -o build/cuda_driver_smoke
./build/cuda_driver_smoke 4096

The CUDA smoke is a backend boundary probe only: it links libcuda, loads embedded PTX, launches a vector-add kernel, and checks the result.

CUDA probe code uses src/ml/cuda/driver.cr for reusable CUDA context, module/function, launch, copy, synchronize, and device-buffer ownership. This layer is intentionally small: it owns the raw CUDA Driver API lifecycle and calls, while higher-level layer execution stays probe-local until the CUDA backend split is promoted. It also provides ML::CUDA::ResidentSequenceRunner, a thin lifecycle facade for resident-sequence probes with explicit upload_weights, reset_sequence, run_sequence, and read_outputs phases.

src/ml/cuda/qwen_recurrent_layer_runner.cr is the first Qwen-specific runner extraction: it owns one recurrent layer's CUDA modules, device buffers, kernel parameters, weight upload, sequence reset, token launch graph, and output readback. QwenRecurrentLayerRunner::Weights.load owns GGUF tensor lookup, tensor shape/type validation, and raw weight reads for the runner, including recurrent-layer ffn_down tensors stored as either Q4_K or Q6_K. CPU-reference comparison intentionally remains in the probe.

Build the first quantized CUDA correctness probe on NVIDIA/Linux hosts:

crystal build -Dcpu_only bin/cuda_q8_gemv_probe.cr -o build/cuda_q8_gemv_probe
./build/cuda_q8_gemv_probe \
  --model /path/to/Qwen3.5-0.8B-Q8_0.gguf \
  --tensor blk.0.ffn_up.weight \
  --kernel warp4 \
  --reps 100 \
  --warmup 10

cuda_q8_gemv_probe loads a real GGUF Q8_0 tensor, launches a Crystal-driven CUDA Driver API GEMV kernel over the raw GGUF block layout, and compares against the existing CPU QuantMatmul reference. --kernel scalar keeps the first one-thread-per-output-row correctness kernel; the default --kernel warp4 maps four output rows to four warps per thread block and is the current faster probe shape. This is still a standalone backend-boundary probe, not an optimized Qwen CUDA inference path yet. The current full qwen35_generate CLI remains Metal-first.
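
The Q8_0 block layout the probe reads over is simple enough to sketch: each 34-byte block is one f16 scale followed by 32 signed 8-bit quants, and the dequantized value is scale times quant. An illustrative Python sketch (this is the ggml Q8_0 convention, not the probe's actual code):

```python
import struct

def dequant_q8_0(raw: bytes):
    """Dequantize raw GGUF Q8_0 blocks (34 bytes each: f16 d + 32 int8)."""
    out = []
    for off in range(0, len(raw), 34):
        (d,) = struct.unpack_from("<e", raw, off)      # f16 block scale
        qs = struct.unpack_from("<32b", raw, off + 2)  # 32 signed quants
        out.extend(d * q for q in qs)
    return out

# Round-trip one block with scale 0.5 and quants -16..15
block = struct.pack("<e", 0.5) + struct.pack("<32b", *range(-16, 16))
vals = dequant_q8_0(block)
print(vals[0], vals[-1])  # -8.0 7.5
```

A GEMV kernel over this layout does the same per-block scale-times-quant accumulation without ever materializing the dequantized weights.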

Build the first Q4_K CUDA correctness probe for Qwen 9B/27B-style target tensors:

crystal build -Dcpu_only bin/cuda_q4k_gemv_probe.cr -o build/cuda_q4k_gemv_probe
./build/cuda_q4k_gemv_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --tensor blk.0.attn_gate.weight \
  --kernel warp4 \
  --reps 20 \
  --warmup 3

cuda_q4k_gemv_probe uses the raw GGUF Q4_K block layout (d, dmin, 12-byte packed scales/mins, 128-byte packed nibbles) and checks the CUDA output against the CPU QuantMatmul Q4_K reference. --kernel scalar keeps the first correctness kernel; the default --kernel warp4 maps four output rows to four warps per block and is the current faster probe shape.
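
The 12-byte packed scales/mins field mentioned above holds eight 6-bit scales and eight 6-bit mins. A hedged Python sketch of the unpacking, following the ggml get_scale_min_k4 convention as we understand it (illustrative; verify against the ggml reference before relying on it):

```python
def unpack_q4k_scales(packed: bytes):
    """Unpack 8 six-bit scales and 8 six-bit mins from a 12-byte
    Q4_K scales field (ggml get_scale_min_k4 layout)."""
    scales, mins = [], []
    for j in range(8):
        if j < 4:
            sc = packed[j] & 63
            mn = packed[j + 4] & 63
        else:
            # low 4 bits live in bytes 8..11, high 2 bits in bytes 0..7
            sc = (packed[j + 4] & 0x0F) | ((packed[j - 4] >> 6) << 4)
            mn = (packed[j + 4] >> 4) | ((packed[j] >> 6) << 4)
        scales.append(sc)
        mins.append(mn)
    return scales, mins

def pack_q4k_scales(scales, mins):
    """Inverse packing, for round-trip checking (all values must be < 64)."""
    b = bytearray(12)
    for j in range(4):
        b[j] = (scales[j] & 63) | ((scales[j + 4] >> 4) << 6)
        b[j + 4] = (mins[j] & 63) | ((mins[j + 4] >> 4) << 6)
        b[j + 8] = (scales[j + 4] & 0x0F) | ((mins[j + 4] & 0x0F) << 4)
    return bytes(b)

sc = [1, 2, 3, 4, 33, 34, 35, 63]
mn = [5, 6, 7, 8, 47, 48, 49, 50]
assert unpack_q4k_scales(pack_q4k_scales(sc, mn)) == (sc, mn)
```

Each sub-block value then dequantizes as d * sc * q - dmin * mn, which is why the kernel must unpack both series before touching the nibbles.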

Build the Q6_K CUDA correctness/speed probe for Q4_K_M tensors that remain in Q6_K:

crystal build -Dcpu_only bin/cuda_q6k_gemv_probe.cr -o build/cuda_q6k_gemv_probe
./build/cuda_q6k_gemv_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --tensor blk.0.ffn_down.weight \
  --kernel warp4 \
  --reps 10 \
  --warmup 2

cuda_q6k_gemv_probe covers the GGUF Q6_K block layout (ql, qh, signed scales, d) used by output/value/down projections in mixed-quant target models. Like the Q4_K/Q8_0 probes, it is a standalone backend primitive check; full CUDA Qwen execution is still a separate backend split.

Build the first GPU-resident FFN sequence probe:

crystal build -Dcpu_only bin/cuda_ffn_sequence_probe.cr -o build/cuda_ffn_sequence_probe
./build/cuda_ffn_sequence_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --layer 0 \
  --reps 10 \
  --warmup 2

cuda_ffn_sequence_probe composes the checked CUDA primitives as Q4_K ffn_gate + Q4_K ffn_up -> SwiGLU -> Q6_K ffn_down while keeping the input, intermediate activations, and output projection input GPU-resident. Only the final hidden vector is copied back for comparison against the CPU QuantMatmul FFN reference.
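
The composed FFN shape (gate and up projections, SwiGLU, then the down projection) can be sketched in plain Python; quantization and GPU residency are omitted, so this is only the mathematical structure the probe checks:

```python
import math

def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def silu(v):
    return v / (1.0 + math.exp(-v))

def ffn(w_gate, w_up, w_down, x):
    """SwiGLU FFN: down( silu(gate(x)) * up(x) )."""
    g = matvec(w_gate, x)
    u = matvec(w_up, x)
    h = [silu(gi) * ui for gi, ui in zip(g, u)]   # elementwise SwiGLU
    return matvec(w_down, h)

# Tiny example: 2-dim hidden, 3-dim intermediate
w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_up   = [[2.0, 0.0], [0.0, 2.0], [1.0, 0.0]]
w_down = [[1.0, 1.0, 1.0], [0.5, 0.5, 0.5]]
print(ffn(w_gate, w_up, w_down, [1.0, -1.0]))  # ≈ [2.0, 1.0]
```

In the probe, only the final vector returned by the down projection is copied back to the host; g, u, and h stay GPU-resident.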

Build the full-attention input projection bundle probe:

crystal build -Dcpu_only bin/cuda_attn_projection_probe.cr -o build/cuda_attn_projection_probe
./build/cuda_attn_projection_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --layer 3 \
  --reps 10 \
  --warmup 2

cuda_attn_projection_probe runs Q4_K attn_q + Q4_K attn_k + Q6_K attn_v from one GPU-resident hidden vector and copies Q/K/V back only after all projections complete. It targets full-attention layers such as blk.3 in Qwen3.5 9B. The probe now routes through ML::CUDA::QwenFullAttnProjectionRunner, supports --tokens N, and keeps Q/K/V outputs GPU-resident until the final correctness readback. It is the reusable input-projection boundary for future full-attention/KV CUDA work, not a complete full-attention layer runner yet.

Build the full-attention Q/K normalization + RoPE + KV-cache boundary probe:

crystal build -Dcpu_only bin/cuda_full_attn_kv_probe.cr -o build/cuda_full_attn_kv_probe
./build/cuda_full_attn_kv_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --layer 3 \
  --tokens 4 \
  --start-pos 2 \
  --max-seq 12

cuda_full_attn_kv_probe now routes through ML::CUDA::QwenFullAttnLayerRunner, a residual-hidden-to-final-hidden wrapper around the projection runner and ML::CUDA::QwenFullAttnKVRunner. The composed layer works as follows:

  • The projection runner can apply the initial attn_norm on CUDA from residual hidden states before Q/K/V projection.
  • Q is split into normalized/RoPE'd Q and gate; K is RMSNormed/RoPE'd; K/V rows are appended to a CUDA-resident cache at start_pos.
  • A correctness-first serial CUDA kernel computes GQA scores, softmax, value reduction, and Q-gate multiplication.
  • The resident gated attention output is projected through attn_output.weight, and the layer tail runs residual add, post-attention RMSNorm, FFN gate/up/SwiGLU/down, and the final residual.

The probe checks Q, gate, K, gated attention output, projected attention output, final hidden, K-cache, and V-cache against the CPU Qwen reference. This is now a clean one-layer semantics probe with device input/output hooks for mixed-stack composition, but it is not yet an end-to-end Linux decode path: full/recurrent stack scheduling, logits/top1, tokenizer/sampling, restored nonzero-prefix KV, and a faster attention kernel remain separate gates.
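
The correctness-first serial attention kernel computes, per query head, scores over the cached keys of its shared KV head, a softmax, and a value-weighted sum. An illustrative Python sketch of that GQA reduction (RoPE, normalization, and gating omitted; not the kernel's actual code):

```python
import math

def gqa_attend(q_heads, k_cache, v_cache, group):
    """Serial GQA: `group` query heads share each KV head.
    q_heads: [n_q_heads][head_dim]; k_cache/v_cache: [n_kv_heads][seq][head_dim]."""
    out = []
    for h, q in enumerate(q_heads):
        kv = h // group                       # KV head shared by this query head
        scale = 1.0 / math.sqrt(len(q))
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in k_cache[kv]]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]   # numerically stable softmax
        z = sum(exps)
        probs = [e / z for e in exps]
        out.append([sum(p * v[d] for p, v in zip(probs, v_cache[kv]))
                    for d in range(len(q))])
    return out

# Two query heads share one KV head (group=2); identical cached keys
# give equal weights, so each head returns the average of the values.
k = [[[1.0, 0.0], [1.0, 0.0]]]
v = [[[2.0, 0.0], [4.0, 2.0]]]
print(gqa_attend([[1.0, 0.0], [0.0, 1.0]], k, v, group=2))
```

The real kernel additionally multiplies the result by the per-head Q gate before the output projection.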

Build the Q5_K CUDA recurrent-QKV probe:

crystal build -Dcpu_only bin/cuda_q5k_gemv_probe.cr -o build/cuda_q5k_gemv_probe
./build/cuda_q5k_gemv_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --tensor blk.0.attn_qkv.weight \
  --reps 10 \
  --warmup 2

cuda_q5k_gemv_probe covers the GGUF Q5_K block layout used by recurrent-layer combined attn_qkv.weight tensors in the current Qwen3.5 9B Q4_K_M file.

Build the recurrent-layer projection bundle probe:

crystal build -Dcpu_only bin/cuda_recurrent_projection_probe.cr -o build/cuda_recurrent_projection_probe
./build/cuda_recurrent_projection_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --layer 0 \
  --reps 10 \
  --warmup 2

cuda_recurrent_projection_probe runs Q5_K attn_qkv + Q4_K attn_gate + Q4_K ssm_alpha + Q4_K ssm_beta from one GPU-resident hidden vector and copies the four outputs back only after all kernels complete. It is the first CUDA recurrent projection-bundle proof; DeltaNet recurrence, convolution, state updates, and ssm_out remain separate work.

Build the synthetic DeltaNet output slice probe:

crystal build -Dcpu_only bin/cuda_deltanet_output_probe.cr -o build/cuda_deltanet_output_probe
./build/cuda_deltanet_output_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --layer 0 \
  --reps 10 \
  --warmup 2

cuda_deltanet_output_probe runs a synthetic CUDA DeltaNet state update, applies post RMSNorm/SiLU gating on GPU, and feeds the result directly into the real Q4_K ssm_out.weight projection. It is a stateful boundary probe, not a full recurrent layer: recurrent conv prep, alpha/beta transforms, residuals, and FFN remain separate work.

Build the recurrent prep/output slice probe:

crystal build -Dcpu_only bin/cuda_recurrent_prep_output_probe.cr -o build/cuda_recurrent_prep_output_probe
./build/cuda_recurrent_prep_output_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --layer 0 \
  --tokens 4 \
  --reps 10 \
  --warmup 2

cuda_recurrent_prep_output_probe now composes one full recurrent-layer token slice: input RMSNorm, the real recurrent projection bundle (attn_qkv, attn_gate, ssm_alpha, ssm_beta), recurrent conv prep, alpha/beta transforms, DeltaNet, post RMSNorm/SiLU, Q4_K ssm_out, residual add, post-attention RMSNorm, Q4_K FFN gate/up, SwiGLU, Q6_K FFN down, and the final residual. --tokens N runs a GPU-resident sequence through persistent conv/SSM state and compares all token outputs plus final recurrent states against the CPU reference.

The probe now separates one-time weight upload from per-sequence input/state reset and prints weight_upload_ms; the timed cuda_ms_per_token excludes the persistent weight upload. GGUF recurrent-layer tensor loading is routed through QwenRecurrentLayerRunner::Weights.load, so the probe no longer manually passes every raw tensor into the runner constructor. It is still a standalone one-layer probe, not an end-to-end Linux decoder.

Build the recurrent multi-layer stack scaffold:

crystal build -Dcpu_only bin/cuda_recurrent_stack_probe.cr -o build/cuda_recurrent_stack_probe
./build/cuda_recurrent_stack_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --layers 0,2,4 \
  --tokens 2

cuda_recurrent_stack_probe chains multiple QwenRecurrentLayerRunner instances and compares the final hidden sequence plus each layer's recurrent conv/SSM state against the CPU reference. The default path hands each recurrent layer's CUDA output buffer directly to the next layer's CUDA input; --host-handoff keeps the older host-copy route as a debug oracle. This is still a recurrent-only scaffold, not an end-to-end Linux decoder.

Build the mixed recurrent/full-attention CUDA stack probe:

crystal build -Dcpu_only bin/cuda_mixed_stack_probe.cr -o build/cuda_mixed_stack_probe
./build/cuda_mixed_stack_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --layers 0,1,2,3,4 \
  --tokens 2 \
  --start-pos 2 \
  --max-seq 12

cuda_mixed_stack_probe composes QwenRecurrentLayerRunner and QwenFullAttnLayerRunner in model layer order with device-resident hidden handoff across recurrent/full-attention boundaries, then runs QwenOutputHeadRunner for output RMSNorm, quantized lm-head projection, and resident top1. The layer/head loop is now owned by ML::CUDA::QwenMixedStackRunner, which is the first model-slice decode-state object for CUDA.

By default the probe copies back only the CUDA top1 id/value plus hidden/state debug outputs; pass --read-logits to also copy full logits for attribution, and --profile-phases to insert per-layer/head synchronizations and print attribution lines. It compares the final hidden sequence, top1, recurrent conv/SSM states, and full-attention KV cache rows against the CPU reference.

This is the first mixed-stack CUDA correctness scaffold through resident top1; it still stops before tokenizer/sampling, repeated full-model decode ownership, and an optimized topK/sampling kernel. The current resident top1 is a simple two-phase partial-scan/reduce kernel: correct, but not yet promoted as a speed-optimized head.
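
The resident top1 is a two-phase partial-scan/reduce: thread groups each find a local argmax over a chunk of the logits, then a second pass reduces the partial winners. A Python sketch of the same shape (illustrative only, not the kernel's code):

```python
def top1_two_phase(logits, chunk=4):
    """Phase 1: per-chunk partial argmax. Phase 2: reduce the partials."""
    partials = []
    for start in range(0, len(logits), chunk):
        best = max(range(start, min(start + chunk, len(logits))),
                   key=lambda i: logits[i])
        partials.append((logits[best], best))   # (value, index) per chunk
    val, idx = max(partials)                    # reduce: highest value wins
    return idx, val

logits = [0.1, 2.5, -1.0, 0.3, 7.2, 0.0, 7.1, 3.3]
print(top1_two_phase(logits))  # (4, 7.2)
```

On the GPU, phase 1 maps to one workgroup per chunk with a shared-memory reduction, and phase 2 is a single small reduction over the per-chunk winners, so only one id/value pair ever needs to be copied back.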

Build the Metal bridge once:

make build/bridge.o

Build the practical generation demo:

crystal build --release --no-debug \
  --link-flags="$(pwd)/build/bridge.o -framework Metal -framework Foundation -lc++" \
  bin/qwen35_generate.cr \
  -o build/qwen35_generate

Run greedy generation:

./build/qwen35_generate "The capital of France is" 64

Enable exact n-gram speculative decode for repeated text:

QWEN35_NGRAM_DECODE=1 ./build/qwen35_generate "The capital of France is" 64

Use the conservative automatic decode policy:

QWEN35_DECODE_POLICY=auto ./build/qwen35_generate "The capital of France is" 64

auto is the product-safe proposal-aware profile: it uses exact n-gram/cache proposals only when they are large enough to amortize verification, enables the candidate-shape risk gate, and otherwise falls back to exact target decoding without invoking the neural draft model.
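
The auto profile's routing can be sketched as a small decision function. This is an illustrative rendering of the policy described above (the function and parameter names are hypothetical, not the CLI's code): use a cheap n-gram proposal only when it is large enough to amortize verification and passes the risk gate; otherwise fall back to exact target decoding, never the neural draft.

```python
def choose_decode_route(proposal_len, risky, min_candidates=8):
    """Illustrative 'auto' routing: accept a cheap n-gram proposal only
    when it is large enough and not flagged by the candidate-shape risk
    gate; otherwise exact target-only decode (never the neural draft)."""
    if proposal_len >= min_candidates and not risky:
        return "ngram-verify"
    return "target-only"

print(choose_decode_route(12, risky=False))  # ngram-verify
print(choose_decode_route(12, risky=True))   # target-only
print(choose_decode_route(3, risky=False))   # target-only
```

Because every route ends in exact greedy verification or exact target decoding, the policy changes only speed, never output.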

Enable exact neural speculative decode with the Qwen 3.5 0.8B draft:

QWEN35_SPECULATIVE_DECODE=1 \
QWEN35_HEAD_FULL_ROWS_GUARDED=1 \
./build/qwen35_generate "The capital of France is" 64

Enable exact prompt cache:

QWEN35_PROMPT_CACHE=1 \
QWEN35_SESSION_ID=demo \
./build/qwen35_generate "The capital of France is" 64
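
The prompt cache's longest-prefix lookup can be illustrated simply: among cached token sequences, pick the one sharing the longest token-id prefix with the new prompt, so only the remaining suffix needs prefill. Illustrative Python with hypothetical helper names (not the library's API):

```python
def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def longest_prefix_entry(cache_keys, prompt_ids):
    """Return (index, prefix_len) of the cached prompt sharing the
    longest token-id prefix with prompt_ids; (-1, 0) if none match."""
    best, best_len = -1, 0
    for i, key in enumerate(cache_keys):
        p = common_prefix_len(key, prompt_ids)
        if p > best_len:
            best, best_len = i, p
    return best, best_len

cache = [[1, 2, 3], [1, 2, 4, 5], [9, 9]]
print(longest_prefix_entry(cache, [1, 2, 4, 5, 6]))  # (1, 4)
```

The restored state covers the matched prefix exactly, so the result is bit-identical to recomputing the whole prompt.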

Useful Qwen environment switches:

| Variable | Effect |
| --- | --- |
| QWEN35_PROMPT_CACHE=1 | Enable exact prompt-state cache lookup/save in qwen35_generate. |
| QWEN35_PROMPT_CACHE_ROOT=/path | Override the prompt-cache artifact root. |
| QWEN35_PREPARE_STATE_OFF=1 | Disable eager Metal state-buffer preparation in qwen35_generate. By default the CLI prepares KV/DeltaNet buffers before timing prompt ingest. |
| QWEN35_DECODE_POLICY=greedy/ngram/speculative/auto | Explicit decode-mode selector. auto chooses the exact fail-closed n-gram path with risk gating; an explicit policy overrides legacy mode envs. |
| QWEN35_TRACE_STEPS_OFF=1 | Suppress per-token/per-cycle trace lines in qwen35_generate while keeping summaries and final output. |
| QWEN35_QUIET=1 | Alias for suppressing per-step traces in qwen35_generate; useful for cleaner local timing. |
| QWEN35_NGRAM_DECODE=1 | Enable exact n-gram speculative decode in qwen35_generate. |
| QWEN35_NGRAM_GAMMA=32 | Maximum n-gram verifier chunk size. |
| QWEN35_NGRAM_MIN=6 | Minimum repeated suffix length before n-gram drafting. |
| QWEN35_NGRAM_MAX=8 | Maximum suffix length to search for n-gram drafting. |
| QWEN35_NGRAM_MIN_CANDIDATES=N | Skip n-gram proposals shorter than N candidates. In auto, the default is 8; explicit ngram keeps the old default of 0 unless set. |
| QWEN35_NGRAM_STAGE_MIN=N | Split only n-gram verifier chunks with at least N candidates into staged subchunks. In auto, the default is QWEN35_NGRAM_GAMMA + 1, so the common full chunk is kept intact unless overridden. Explicit ngram keeps the old default of 0 unless set. |
| QWEN35_NGRAM_RISK_GATE=0/1 | In auto, the exact candidate-shape risk gate is enabled by default; set 0 to disable. Explicit ngram keeps the old default (off) unless set to 1. |
| QWEN35_NGRAM_RISK_MIN_SIZE=16 | Candidate-size threshold used by the n-gram risk gate. Independent of QWEN35_NGRAM_STAGE_MIN, so staging can be disabled without weakening fail-closed risk checks. |
| QWEN35_NGRAM_RECURSIVE_OFF=1 | Disable recursive n-gram extension through scratch history. |
| QWEN35_NGRAM_DISABLE_AFTER_REJECT_OFF=1 | Exploration mode: keep trying n-gram chunks after the first rejection. |
| QWEN35_NGRAM_REPLAY_ON_REJECT=1 | Research/fast-path mode: skip n-gram target-state backups and rebuild the exact target state only after a non-final n-gram reject. Use with the default auto risk gate; it can regress badly when a large bad n-gram chunk is forced through verification. |
| QWEN35_SPECULATIVE_DECODE=1 | Enable exact neural speculative decode in qwen35_generate using the 0.8B draft. |
| QWEN35_DRAFT_MODEL=/path | Override the Qwen 3.5 draft GGUF used by neural speculative decode. |
| QWEN35_SPEC_GAMMA=4 | Initial neural draft chunk size in qwen35_generate. |
| QWEN35_SPEC_MAX_GAMMA=32 | Maximum adaptive neural draft chunk size. |
| QWEN35_SPEC_PLAIN_FALLBACK_OFF=1 | Disable target-only fallback after low-gamma speculative rejection. Useful for A/B experiments; the default fallback is faster on rejection-heavy prompts. |
| QWEN35_SPEC_PLAIN_FALLBACK_GAMMA=2 | Gamma threshold at or below which rejected neural speculative decode falls back to target-only generation. |
| QWEN35_SPEC_BOOTSTRAP_GAMMA=N | Default-off neural speculative jump after a fully accepted initial chunk. Can help 100%-accept runs; may regress prompts that reject after an accepted prefix. |
| QWEN35_SPEC_SINGLE_FAST_OFF=1 | Disable the exact gamma=1 accepted-token fast path in neural speculative decode. Mostly useful when target-only fallback is disabled for A/B experiments. |
| QWEN35_SPEC_VERIFY=chunk-inplace/hybrid/serial | Choose the neural speculative verifier strategy. The default chunk-inplace is best for high-accept prompts; hybrid can help first-cycle partial-reject prompts. |
| QWEN35_SPEC_SKIP_DRAFT_BEFORE_FALLBACK_OFF=1 | Disable the exact optimization that skips draft resync work when a rejection is guaranteed to enter target-only fallback. |
| QWEN35_SPEC_SKIP_DRAFT_BACKUP_BEFORE_FALLBACK_OFF=1 | Disable the matching draft-backup skip before fallback-bound speculative chunks. |
| QWEN35_HEAD_FULL_ROWS_GUARDED=1 | Experimental speculative-verifier accelerator for large accepted chunks; uses a margin guard and exact fallback for low-margin rows. |
| QWEN35_HEAD_FULL_ROWS_MARGIN=0.25 | Margin threshold for the guarded full-row verifier route. Higher is safer but falls back more often. |
| QWEN35_FFN_DOWN_ADD_FUSED_OFF=1 | Disable decode-wave FFN-down residual-add fusion for Q4/Q6 target and Q8 draft experiments. |
| QWEN35_Q4K_PAIR_H16_MIN_BATCH=64 | Tune the prefill Q4 gate/up shared H16 conversion threshold. The current default enables sharing from pp64 upward after a refreshed A/B showed a small exact win. |
| QWEN35_REC_PROJ_SHARED_H16_OFF=1 | Disable the exact recurrent prefill projection optimization that shares one H16 input conversion between Q5 qkv and Q4 gate GEMMs. |
| QWEN35_PREFILL_CHUNK_OFF=1 | Force the older non-chunked prefill path. |
| QWEN35_DECODE_WAVE_OFF=1 | Force the older non-wave decode path. |

Library Integration

The current Qwen API is low-level and intended for native inference experiments:

require "ml/gguf/qwen35_cpu"
require "ml/gguf/qwen35_weights"

model = "/path/to/Qwen3.5-9B-Q4_K_M.gguf"
weights = ML::GGUF::Qwen35Weights.from_gguf(model)
state = ML::GGUF::Qwen35CPU::State.new(weights.hparams, max_seq: 1024)

prompt_ids = [760_i32, 6511_i32, 314_i32, 9338_i32, 13_i32]
next_id, next_logit = ML::GGUF::Qwen35CPU.prefill_tokens_top1(weights, prompt_ids, 0, state)

64.times do |i|
  puts next_id
  next_id, next_logit = ML::GGUF::Qwen35CPU.forward_top1(weights, next_id, prompt_ids.size + i, state)
end

When linking an executable that uses Metal, include the bridge object and Apple frameworks:

crystal build your_app.cr \
  --link-flags="$(pwd)/build/bridge.o -framework Metal -framework Foundation -lc++"

For CPU-only builds:

crystal build -Dcpu_only your_app.cr

Benchmark Against llama.cpp

Build the matched benchmark:

crystal build --release --no-debug \
  --link-flags="$(pwd)/build/bridge.o -framework Metal -framework Foundation -lc++" \
  bin/benchmark_qwen_vs_llama.cr \
  -o build/benchmark_qwen_vs_llama

Run a normal first-run prefill/decode comparison:

./build/benchmark_qwen_vs_llama \
  --model ~/.cache/lm-studio/models/lmstudio-community/Qwen3.5-9B-GGUF/Qwen3.5-9B-Q4_K_M.gguf \
  --llama-bench ~/SrcArchives/AI/llama.cpp/build/bin/llama-bench \
  --prompt=64 \
  --gen=64 \
  --reps=5 \
  --warmup=2

For publishable measurements, wait for a quiet host:

./build/benchmark_qwen_vs_llama \
  --prompt=64 \
  --gen=64 \
  --reps=5 \
  --warmup=2 \
  --wait-quiet-ms=60000 \
  --require-quiet

Additional benchmark modes:

# Fresh State per repetition, but Metal state buffers are prepared before
# the timed prefill. This measures prompt ingest without first-touch buffer
# allocation/zeroing in the timed region.
./build/benchmark_qwen_vs_llama --native-prefill-prepare-state

# State buffers allocated once, then reset between reps.
./build/benchmark_qwen_vs_llama --native-prefill-prealloc

# Exact prompt-cache restore after one seeded native prefill.
./build/benchmark_qwen_vs_llama --native-prefill-cache

Fresh local M2 Max 64GB relaxed-load snapshot after the shared-H16 recurrent projection cleanup, Qwen 3.5 9B Q4_K_M, llama.cpp llama-bench, prompt=64, gen=64, reps=3, warmup=1, flash-attention off:

| Mode | cogni-ml | llama.cpp | Gap |
| --- | --- | --- | --- |
| First-run prefill | 426.70 tok/s p50 | 455.10 tok/s avg | -6.24% |
| Fresh state, prepared Metal buffers | 449.73 tok/s p50 | 464.78 tok/s avg | -3.24% |
| Prefill with preallocated state | 448.60 tok/s p50 | 465.91 tok/s avg | -3.71% |
| Prompt-cache restore | 1350.65 tok/s p50 | 465.80 tok/s avg | +189.97% |
| Plain greedy decode, first-run bench | 48.67 tok/s p50 | 46.67 tok/s avg | +4.29% |
| Plain greedy decode, prepared-state bench | 48.59 tok/s p50 | 46.43 tok/s avg | +4.67% |
| Plain greedy decode, preallocated bench | 48.51 tok/s p50 | 46.63 tok/s avg | +4.05% |
| Plain greedy decode, prompt-cache bench | 48.52 tok/s p50 | 46.58 tok/s avg | +4.18% |

Notes:

  • The table is a local engineering snapshot, not a lab-clean public benchmark.
  • First-run prefill is still behind llama.cpp on this machine. The native wins currently come from state reuse, prompt-cache restore, and exact speculative decode.
  • --native-prefill-prepare-state uses a fresh State per repetition but calls Qwen35CPU.prepare_state_metal! before timing. This is useful for server-style latency where a session object can be prepared before the prompt arrives.
  • --native-prefill-cache measures exact restore of a previously computed prompt state; it is not a first-run prefill replacement.
  • Short decode runs are noisy on a desktop system. The plain decode rows above are intentionally all shown: treat plain decode as parity-to-faster, not as a stable public margin without a quiet rerun.

Speculative Decode Harnesses

Neural draft harness with Qwen 3.5 0.8B Q8_0:

crystal build --release --no-debug \
  --link-flags="$(pwd)/build/bridge.o -framework Metal -framework Foundation -lc++" \
  bin/qwen35_speculative_accept.cr \
  -o build/qwen35_speculative_accept

./build/qwen35_speculative_accept \
  --target ~/.cache/lm-studio/models/lmstudio-community/Qwen3.5-9B-GGUF/Qwen3.5-9B-Q4_K_M.gguf \
  --draft ~/.cache/lm-studio/models/lmstudio-community/Qwen3.5-0.8B-GGUF/Qwen3.5-0.8B-Q8_0.gguf \
  --tokens 64 \
  --ngram \
  "The capital of France is"

Cheap-proposal-only policy for repeated/template spans:

QWEN35_SPEC_NGRAM_MIN_CANDIDATES=8 \
./build/qwen35_speculative_accept \
  --target ~/.cache/lm-studio/models/lmstudio-community/Qwen3.6-27B-GGUF/Qwen3.6-27B-Q4_K_M.gguf \
  --draft ~/.cache/lm-studio/models/lmstudio-community/Qwen3.5-0.8B-GGUF/Qwen3.5-0.8B-Q8_0.gguf \
  --tokens 32 \
  --ngram \
  --ngram-risk-gate \
  --ngram-target-only \
  "alpha beta gamma alpha beta gamma alpha beta gamma alpha"

Target-only n-gram speculative harness:

crystal build --release --no-debug \
  --link-flags="$(pwd)/build/bridge.o -framework Metal -framework Foundation -lc++" \
  bin/qwen35_ngram_speculative.cr \
  -o build/qwen35_ngram_speculative

./build/qwen35_ngram_speculative \
  --tokens 64 \
  --gamma 32 \
  --min-ngram 6 \
  "The capital of France is"

Both harnesses replay/check exact greedy target output by default unless their CLI explicitly says otherwise.
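
The n-gram draft itself needs no model: when the last few tokens repeat an earlier span of the context, the tokens that followed that span are proposed as candidates for exact verification. A minimal sketch (illustrative only; the demo thresholds here are smaller than the harness defaults of min 6 / max 8 / gamma 32):

```python
def ngram_draft(history, min_n=2, max_n=4, gamma=8):
    """Propose up to `gamma` tokens by matching the longest recent
    suffix (length min_n..max_n) against earlier history."""
    for n in range(min(max_n, len(history) - 1), min_n - 1, -1):
        suffix = history[-n:]
        # search for the most recent earlier occurrence of the suffix
        for start in range(len(history) - n - 1, -1, -1):
            if history[start:start + n] == suffix:
                follow = history[start + n:start + n + gamma]
                if follow:
                    return follow
    return []

text = [7, 8, 9, 1, 2, 3, 7, 8, 9]
print(ngram_draft(text))  # [1, 2, 3, 7, 8, 9]
```

Proposals are only ever suggestions; the target verifies every candidate chunk exactly, so a bad match costs speed but never correctness.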

Fresh local speculative smoke, same M2 Max 64GB and Qwen 3.5 9B target:

| Mode / prompt | Effective speed | Plain target | Notes |
| --- | --- | --- | --- |
| Neural draft, "The capital of France is" | 15.38 ms/tok, 65.01 tok/s | 21.98 ms/tok, 45.49 tok/s | 100% accepted, 64/64 candidates |
| Neural draft, "def fibonacci(n):" | 21.06 ms/tok, 47.48 tok/s | 21.71 ms/tok, 46.07 tok/s | falls back after rejection; small but safe win |
| N-gram + neural, "The capital of France is" | 10.10 ms/tok, 98.98 tok/s | 21.91 ms/tok, 45.64 tok/s | repeated-text path, 48/48 n-gram candidates accepted |
| Experimental guarded full-row verifier + neural, "The capital of France is" | 14.32 ms/tok, 69.82 tok/s | 22.36 ms/tok, 44.73 tok/s | QWEN35_HEAD_FULL_ROWS_GUARDED=1, 0 fallback rows in this run |
| Experimental guarded full-row verifier + n-gram + neural, "The capital of France is" | 9.20 ms/tok, 108.64 tok/s | noisy target run | QWEN35_HEAD_FULL_ROWS_GUARDED=1, 48/48 n-gram candidates accepted |

Speculative decode caveats:

  • The speculative paths are exact greedy verification paths, not approximate sampling shortcuts.
  • Neural speculative speed depends on draft acceptance. High-accept prompts are faster; rejection-heavy prompts quickly fall back to plain target decode.
  • In qwen35_generate, neural speculative decode is useful for longer high-accept generations. In a local 64-token smoke, "The capital of France is" measured 20.40 ms/tok greedy, 16.61 ms/tok neural speculative, and 15.10 ms/tok neural speculative with guarded full-row verification. A 32-token smoke was slower due to fixed draft/verifier overhead.
  • N-gram speculation is a workload-specialized path for repeated/generated-template text. QWEN35_DECODE_POLICY=auto is the recommended product profile: risk-gated, min_candidates=8, no neural fallback, and exact target-only fallback outside clean cheap-copy spans.
  • In the research acceptance harness, --ngram-target-only / QWEN35_SPEC_NGRAM_TARGET_ONLY=1 skips neural draft fallback after the cheap n-gram proposal source and uses exact target-only steps instead. On a 27B mixed JSONL gate it beat neural default on 5/7 prompts with paired ratio 0.909x when combined with --ngram-risk-gate, but its average speed was near plain target decode; the real win is on clean repeated spans.
  • The n-gram risk gate now also catches small-period prefix overruns such as IP/YAML tails. A focused 27B probe kept clean alpha beta gamma repeats at ~2.58x while turning a YAML overrun from 0.80x into fail-closed target-only 1.006x.
  • The productization smoke after the structured-tail fix kept clean 27B repeats fast (~2.55x over plain), made a YAML-like overrun fail closed (1.003x), and had ngram_target_only_risk beat neural default on the tiny paired suite (0.815x ratio). Treat this as opt-in CLI evidence, not a broad default claim.
  • QWEN35_NGRAM_REPLAY_ON_REJECT=1 is exact but deliberately opt-in. It removes rollback-copy overhead on high-confidence accepted n-gram chunks; on the local 27B+0.8B repeat8 harness it improved ngram_router16_risk from 30.85 to 30.21 ms/tok. A forced no-risk YAML reject regressed from 71.53 to 95.86 ms/tok, so this is not a broad default.
  • N-gram verifier chunks temporarily disable guarded full-row verification even if QWEN35_HEAD_FULL_ROWS_GUARDED=1, because partial n-gram rejection exposed a close-row guard failure during adversarial CLI testing.
  • QWEN35_HEAD_FULL_ROWS_GUARDED=1 is still an experimental research switch. The harness checks final output against plain greedy target output, but the route is not broad-defaulted because it relies on a full-row F16 top1 margin guard.
  • These numbers are effective decode throughput after prompt prefill; they do not make first-run prefill faster.

Native Metal Embeddings

The embedding path targets nomic-embed-text-v2-moe with a fully native Metal compute pipeline.

require "ml"
require "ml/gguf/nomic_bert"
require "ml/gguf/metal_backend"
require "ml/metal/compute_graph"

ML::Metal::Device.init!
model = ML::GGUF::NomicBertMoE.from_gguf("path/to/model.gguf", ML::GGUF::MetalBackend.new)

embedding = model.embed("Your text here")

Embedding Performance

Apple M2 Max, 38 GPU cores:

| Tokens | Latency |
| --- | --- |
| 20 | 14 ms |
| 94 | 16 ms |
| 196 | 33 ms |
| 433 | 70 ms |

Embedding Pipeline Internals

  • simdgroup-matrix GEMM for Q5_K/Q6_K dequant+multiply.
  • Batched expert GEMM for MoE experts.
  • ComputeGraph wave scheduling with offset-aware dependency analysis.
  • Fused QKV split/RoPE, gate/softmax/top-k, scatter, and norm kernels.
  • GPU-driven dispatch where useful.
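
The wave scheduling mentioned above groups kernels into barrier-separated waves by dependency depth: an op's wave is one past the deepest wave among its producers, so independent ops share a wave and a single barrier separates consecutive waves. An illustrative Python sketch of that assignment (buffer-offset awareness omitted; not the ComputeGraph implementation):

```python
def schedule_waves(ops):
    """ops: {name: [dependency names]}. Returns a wave index per op,
    where wave(op) = 1 + max(wave of deps); ops that share a wave can
    be encoded without a barrier between them."""
    wave = {}

    def assign(name):
        if name not in wave:
            deps = ops[name]
            wave[name] = 0 if not deps else 1 + max(assign(d) for d in deps)
        return wave[name]

    for name in ops:
        assign(name)
    return wave

graph = {
    "qkv": [], "gate": [],        # independent: share wave 0
    "attn": ["qkv"],
    "out": ["attn", "gate"],
}
print(schedule_waves(graph))  # {'qkv': 0, 'gate': 0, 'attn': 1, 'out': 2}
```

Offset-aware dependency analysis refines this by letting ops that touch disjoint regions of the same buffer stay in the same wave instead of forcing a barrier.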

Supported Models

| Model | Format | Status |
| --- | --- | --- |
| Qwen3.5-9B | GGUF Q4_K_M | Native Metal text generation path, active optimization target. |
| Qwen3.5-0.8B | GGUF Q8_0 | Native draft model path for speculative decode harnesses. |
| Qwen3.6-27B | GGUF Q4_K_M (target) | Planned/experimental scale-up target. |
| nomic-embed-text-v2-moe | GGUF Q5_K_M | Native Metal embedding pipeline. |
| BERT-like encoders | GGUF | Via NomicBertMoE when the architecture matches. |
| Other Llama/Qwen/Mistral-style models | GGUF | Via llama.cpp bindings. |

Installation

# shard.yml
dependencies:
  cogni-ml:
    github: skuznetsov/cogni-ml
    version: ~> 0.40.0

Build And Test

make build
make spec

CPU-only:

make build_cpu
make spec_cpu

llama.cpp helper targets:

make llama
make llama_env

The Makefile searches common local, Homebrew, and system library locations for libllama. Override with LLAMA_DIR, LLAMA_BUILD, or LLAMA_LIB_DIR if needed.

Quick Start

Tensor + Autograd

require "ml"

x = ML::Autograd::Variable.rand(2, 3, requires_grad: true, device: ML::Tensor::Device::CPU)
layer = ML::NN::Linear.new(3, 4, device: ML::Tensor::Device::CPU)

out = layer.forward(x)
loss = out.mean
loss.backward

opt = ML::Optim::Adam.new(layer.parameters)
opt.step
opt.zero_grad
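
The Adam step used above follows the standard update rule; sketched numerically in Python for a single parameter (illustrative math, not the library's API):

```python
import math

def adam_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a single parameter p with gradient g."""
    m = b1 * m + (1 - b1) * g          # first-moment EMA
    v = b2 * v + (1 - b2) * g * g      # second-moment EMA
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

p, m, v = 1.0, 0.0, 0.0
p, m, v = adam_step(p, 1.0, m, v, t=1)
print(round(p, 6))  # first step moves p by ~lr regardless of gradient scale
```

AdamW differs only in applying weight decay directly to p instead of folding it into the gradient.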

LLM Inference Through llama.cpp

require "ml/llm/llama"

ML::LLM.init
model = ML::LLM::Model.new("path/to/model.gguf")
gen = ML::LLM::Generator.new(model)
puts gen.ask("What is Crystal?", max_tokens: 100)
ML::LLM.cleanup

GGUF Embeddings

require "ml"
require "ml/gguf/nomic_bert"
require "ml/gguf/metal_backend"
require "ml/metal/compute_graph"

ML::Metal::Device.init!
model = ML::GGUF::NomicBertMoE.from_gguf(
  "nomic-embed-text-v2-moe.Q5_K_M.gguf",
  ML::GGUF::MetalBackend.new
)

vec = model.embed("Crystal programming language")
puts "dim=#{vec.size}"

vecs = model.embed_batch(["Hello", "World", "Crystal"])

Metal Kernels

| Kernel | Purpose |
| --- | --- |
| gemm_q4k.metal | Q4_K GEMV/GEMM paths for Qwen. |
| gemm_q56k.metal | Q5_K/Q6_K/Q8_0 GEMV, top1, and helper kernels for Qwen. |
| gemm_mm.metal | simdgroup-matrix GEMM for Q5_K/Q6_K and batched expert variants. |
| gemm_simd.metal | Scalar SIMD GEMM fallback. |
| ffn_qwen35.metal | Qwen FFN, add, RMSNorm, and activation helpers. |
| delta_net.metal | Qwen 3.5 DeltaNet/recurrent kernels. |
| fullattn_qwen35.metal | Qwen full-attention prefill/decode helpers. |
| attn_decode_qwen35.metal | Qwen gated attention decode. |
| attention_matmul.metal | Flash-style attention matrix helpers. |
| bert_fp16.metal | Nomic/BERT fused ops. |
| nn.metal | General NN ops. |

Platform Support

| Platform | GPU | CPU | Status |
| --- | --- | --- | --- |
| macOS Apple Silicon | Metal | Yes | Primary target. |
| macOS Intel | Metal | Yes | Supported for general Metal paths; Qwen performance focus is Apple Silicon. |
| Linux | Experimental CUDA probes | Yes | Use -Dcpu_only for GGUF/metadata and CUDA probe CLIs; full Qwen generation is still Metal-first. |
| FreeBSD | No native Metal | Untested | CPU-only; not a primary CI target. |

NVIDIA/CUDA support is currently an experimental backend-probe track, not a full decoder. The Qwen native generation path remains Metal-first.

Build Flags

| Flag | Effect |
| --- | --- |
| -Dcpu_only | Disable Metal and build pure CPU paths. |
| -Duse_gguf | Enable GGUF model loading where applicable. |

License

MIT
