Crystal machine learning library with native Apple Silicon GPU acceleration.

Cogni-ML is currently two things:

- A general Crystal ML toolkit: tensors, autograd, NN layers, optimizers, GGUF readers, and llama.cpp bindings.
- A native Metal inference lab for GGUF models, with production-oriented work on nomic-embed-text-v2-moe embeddings and Qwen 3.5 text generation.

Highlights:

- Native Metal embedding pipeline for nomic-embed-text-v2-moe.
- Native Qwen 3.5 9B GGUF inference path for Apple Silicon Metal.
- Q4_K/Q5_K/Q6_K/Q8_0 quantized matmul kernels.
- Chunked Qwen 3.5 prefill, decode wave scheduling, prompt-state cache restore, and exact speculative decode harnesses.
- ComputeGraph wave scheduling with offset-aware barrier optimization.
- Crystal autograd engine, NN layers, and Adam/AdamW optimizers.
- llama.cpp FFI bindings for general GGUF model access.
```
src/ml/
  core/      Tensor, Shape, MetalBuffer
  autograd/  Variable, GradFn backward pass
  nn/        Linear, LayerNorm, MultiHeadAttention, ViT
  optim/     Adam/AdamW
  llm/       llama.cpp FFI bindings
  gguf/      GGUF reader, tokenizer, dequantization, Qwen35, NomicBertMoE
  metal/     Device, ComputeEncoder, ComputeGraph, GraphEncoder
```
The native Qwen path targets Qwen3.5-9B-Q4_K_M.gguf on Apple Silicon. The code supports:
- Qwen 3.5 GGUF metadata and tokenizer loading.
- Q4_K, Q5_K, Q6_K, and Q8_0 quantized projections.
- Full-attention layers with GQA, partial RoPE, KV cache writes, and fused output projection.
- DeltaNet/recurrent layers with GPU-resident recurrent state and chunked prefill scan.
- Chunked prefill with final-token top1 shortcut.
- Decode wave scheduling to reduce command-buffer boundaries.
- Exact prompt-state save/restore and longest-prefix prompt cache.
- Exact speculative decode harnesses:
  - neural draft with Qwen 3.5 0.8B Q8_0,
  - n-gram/cache draft for repeated/generated-template text,
  - target-verifier chunks with row-batched top1 for larger accepted chunks.
The 9B Q4_K_M path is the primary verified target. Qwen 3.6 27B is a scale-up target, but it should be treated as experimental until local correctness and performance runs are completed.
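All the speculative harnesses above share one contract: a cheap source proposes a chunk of candidate tokens, the target model scores them, and only the longest prefix matching the target's own greedy choices is accepted, so the output is bit-identical to plain greedy decoding. A language-agnostic sketch of that contract (Python; the function names are illustrative, not the library's API):

```python
def speculative_decode_step(target_top1, proposed, context):
    """Exact greedy speculative verification: accept the longest prefix of
    `proposed` that the target model would have generated itself.

    target_top1(context) -> the target's greedy next token for `context`.
    Returns (accepted_tokens, bonus_token). Output is identical to plain
    greedy decoding; only the number of target passes per token changes.
    """
    accepted = []
    for tok in proposed:
        expected = target_top1(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's own token and stop.
            return accepted, expected
        accepted.append(tok)
    # Whole chunk accepted; the verifier pass also yields one free token.
    return accepted, target_top1(context + accepted)
```

In the real harnesses the target scores all proposed positions in one batched verifier pass; the per-token loop here is only for clarity.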
The developer CLIs default to local LM Studio / llama.cpp-style paths:

```
~/.cache/lm-studio/models/lmstudio-community/Qwen3.5-9B-GGUF/Qwen3.5-9B-Q4_K_M.gguf
~/.cache/lm-studio/models/lmstudio-community/Qwen3.5-0.8B-GGUF/Qwen3.5-0.8B-Q8_0.gguf
~/SrcArchives/AI/llama.cpp/build/bin/llama-tokenize
~/SrcArchives/AI/llama.cpp/build/bin/llama-bench
```
Most benchmark/probe CLIs also accept `--model`, `--target`, `--draft`, `--tokenizer-bin`, or environment overrides. `bin/qwen35_generate.cr` is intentionally a small demo and currently takes its configuration from constants at the top of the file.
Build the CPU-only GGUF/Qwen metadata smoke on Linux, CUDA hosts, or any environment where Metal is unavailable:

```sh
crystal build -Dcpu_only bin/qwen35_gguf_info.cr -o build/qwen35_gguf_info
./build/qwen35_gguf_info --model /path/to/Qwen3.5-9B-Q4_K_M.gguf
./build/qwen35_gguf_info --model /path/to/Qwen3.5-0.8B-Q8_0.gguf --load-weights
```

This entrypoint intentionally does not run inference. It verifies GGUF parsing, Qwen 3.5/3.6 hparams, tensor inventory, and the structured Qwen35Weights loader without pulling the Metal bridge into a Linux build.
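For orientation, the fixed-size header a GGUF info tool parses first is small: magic, version, tensor count, and metadata key/value count, all little-endian. A minimal hedged sketch of that header parse (Python; the real reader lives in `src/ml/gguf/`):

```python
import struct

def parse_gguf_header(data: bytes) -> dict:
    """Parse the fixed 24-byte GGUF header: 4-byte magic "GGUF",
    u32 version, u64 tensor_count, u64 metadata_kv_count (little-endian)."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}

# Synthetic header: version 3, 2 tensors, 5 metadata entries.
header = struct.pack("<4sIQQ", b"GGUF", 3, 2, 5)
```

The metadata key/value entries and tensor infos that follow the header are variable-length; the sketch stops where the interesting parsing starts.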
Build the minimal Crystal CUDA Driver API smoke on NVIDIA/Linux hosts:

```sh
crystal build bin/cuda_driver_smoke.cr -o build/cuda_driver_smoke
./build/cuda_driver_smoke 4096
```

The CUDA smoke is a backend boundary probe only: it links libcuda, loads embedded PTX, launches a vector-add kernel, and checks the result.

CUDA probe code uses `src/ml/cuda/driver.cr` for reusable CUDA context, module/function, launch, copy, synchronize, and device-buffer ownership. This layer is intentionally small: it owns the raw CUDA Driver API lifecycle and calls, while higher-level layer execution stays probe-local until the CUDA backend split is promoted.

It also provides `ML::CUDA::ResidentSequenceRunner`, a thin lifecycle facade for resident sequence probes with explicit `upload_weights`, `reset_sequence`, `run_sequence`, and `read_outputs` phases.

`src/ml/cuda/qwen_recurrent_layer_runner.cr` is the first Qwen-specific runner extraction: it owns one recurrent layer's CUDA modules, device buffers, kernel parameters, weight upload, sequence reset, token launch graph, and output readback. `QwenRecurrentLayerRunner::Weights.load` owns GGUF tensor lookup, tensor shape/type validation, and raw weight reads for the runner, including recurrent-layer ffn_down tensors stored as either Q4_K or Q6_K. CPU-reference comparison intentionally remains in the probe.
Build the first quantized CUDA correctness probe on NVIDIA/Linux hosts:

```sh
crystal build -Dcpu_only bin/cuda_q8_gemv_probe.cr -o build/cuda_q8_gemv_probe
./build/cuda_q8_gemv_probe \
  --model /path/to/Qwen3.5-0.8B-Q8_0.gguf \
  --tensor blk.0.ffn_up.weight \
  --kernel warp4 \
  --reps 100 \
  --warmup 10
```

cuda_q8_gemv_probe loads a real GGUF Q8_0 tensor, launches a Crystal-driven CUDA Driver API GEMV kernel over the raw GGUF block layout, and compares against the existing CPU QuantMatmul reference. `--kernel scalar` keeps the first one-thread-per-output-row correctness kernel; the default `--kernel warp4` maps four output rows to four warps per thread block and is the current faster probe shape. This is still a standalone backend-boundary probe, not an optimized Qwen CUDA inference path yet. The current full qwen35_generate CLI remains Metal-first.
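The CPU QuantMatmul reference the probe compares against reduces to per-block dequantized dot products. A sketch of the Q8_0 layout (one f16 scale plus 32 signed bytes per 32-value block) and one reference GEMV row (Python/numpy; names are illustrative, not the library's API):

```python
import struct
import numpy as np

Q8_0_BLOCK = 32       # values per block
Q8_0_BYTES = 2 + 32   # f16 scale + 32 int8 quants = 34 bytes

def dequant_q8_0_row(raw: bytes, n: int) -> np.ndarray:
    """Dequantize one Q8_0 weight row of n values: x[i] = d * q[i]."""
    out = np.empty(n, dtype=np.float32)
    for b in range(n // Q8_0_BLOCK):
        off = b * Q8_0_BYTES
        (d,) = struct.unpack_from("<e", raw, off)  # f16 block scale
        q = np.frombuffer(raw, dtype=np.int8, count=Q8_0_BLOCK, offset=off + 2)
        out[b * Q8_0_BLOCK:(b + 1) * Q8_0_BLOCK] = d * q.astype(np.float32)
    return out

def gemv_row(raw_row: bytes, x: np.ndarray) -> float:
    """One output element of the GEMV: dot(dequant(row), x)."""
    return float(dequant_q8_0_row(raw_row, x.size) @ x)
```

The warp4 kernel computes the same sums, but with each block's partial dot product spread across warp lanes instead of this serial loop.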
Build the first Q4_K CUDA correctness probe for Qwen 9B/27B-style target tensors:

```sh
crystal build -Dcpu_only bin/cuda_q4k_gemv_probe.cr -o build/cuda_q4k_gemv_probe
./build/cuda_q4k_gemv_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --tensor blk.0.attn_gate.weight \
  --kernel warp4 \
  --reps 20 \
  --warmup 3
```

cuda_q4k_gemv_probe uses the raw GGUF Q4_K block layout (d, dmin, 12-byte packed scales/mins, 128-byte packed nibbles) and checks the CUDA output against the CPU QuantMatmul Q4_K reference. `--kernel scalar` keeps the first correctness kernel; the default `--kernel warp4` maps four output rows to four warps per block and is the current faster probe shape.
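The fiddly part of that Q4_K layout is the 12-byte scales/mins field, which packs eight 6-bit (scale, min) pairs; each 32-value subblock then dequantizes as `x = d*sc*q - dmin*mn`. A hedged sketch of the pair unpacking, following my reading of the ggml reference scheme (verify against the upstream `get_scale_min_k4` before relying on it):

```python
def q4k_scale_min(scales: bytes, j: int):
    """Unpack the j-th (scale, min) 6-bit pair, j in 0..7, from a Q4_K
    superblock's 12-byte packed scales/mins field.

    Pairs 0..3 live in the low 6 bits of bytes j and j+4; pairs 4..7
    stitch their high bits from the top 2 bits of bytes j-4 and j."""
    if j < 4:
        sc = scales[j] & 63
        mn = scales[j + 4] & 63
    else:
        sc = (scales[j + 4] & 0x0F) | ((scales[j - 4] >> 6) << 4)
        mn = (scales[j + 4] >> 4) | ((scales[j] >> 6) << 4)
    return sc, mn
```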
Build the Q6_K CUDA correctness/speed probe for Q4_K_M tensors that remain in Q6_K:

```sh
crystal build -Dcpu_only bin/cuda_q6k_gemv_probe.cr -o build/cuda_q6k_gemv_probe
./build/cuda_q6k_gemv_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --tensor blk.0.ffn_down.weight \
  --kernel warp4 \
  --reps 10 \
  --warmup 2
```

cuda_q6k_gemv_probe covers the GGUF Q6_K block layout (ql, qh, signed scales, d) used by output/value/down projections in mixed-quant target models. Like the Q4_K/Q8_0 probes, it is a standalone backend primitive check; full CUDA Qwen execution is still a separate backend split.
Build the first GPU-resident FFN sequence probe:

```sh
crystal build -Dcpu_only bin/cuda_ffn_sequence_probe.cr -o build/cuda_ffn_sequence_probe
./build/cuda_ffn_sequence_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --layer 0 \
  --reps 10 \
  --warmup 2
```

cuda_ffn_sequence_probe composes the checked CUDA primitives as Q4_K ffn_gate + Q4_K ffn_up -> SwiGLU -> Q6_K ffn_down while keeping the input, intermediate activations, and output-projection input GPU-resident. Only the final hidden vector is copied back for comparison against the CPU QuantMatmul FFN reference.
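The composition the probe checks is the standard SwiGLU FFN. A numpy reference of the same math (illustrative only; the probe itself runs quantized CUDA kernels and never materializes dequantized weight matrices):

```python
import numpy as np

def silu(x):
    # SiLU / swish activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Qwen-style FFN: down @ (silu(gate @ x) * (up @ x)).
    The CUDA probe runs this exact composition with Q4_K gate/up and
    Q6_K down weights, keeping all intermediates GPU-resident."""
    return w_down @ (silu(w_gate @ x) * (w_up @ x))
```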
Build the full-attention input projection bundle probe:

```sh
crystal build -Dcpu_only bin/cuda_attn_projection_probe.cr -o build/cuda_attn_projection_probe
./build/cuda_attn_projection_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --layer 3 \
  --reps 10 \
  --warmup 2
```

cuda_attn_projection_probe runs Q4_K attn_q + Q4_K attn_k + Q6_K attn_v from one GPU-resident hidden vector and copies Q/K/V back only after all projections complete. It targets full-attention layers such as blk.3 in Qwen3.5 9B.

The probe now routes through `ML::CUDA::QwenFullAttnProjectionRunner`, supports `--tokens N`, and keeps Q/K/V outputs GPU-resident until the final correctness readback. It is the reusable input-projection boundary for future full-attention/KV CUDA work, not a complete full-attention layer runner yet.
Build the full-attention Q/K normalization + RoPE + KV-cache boundary probe:

```sh
crystal build -Dcpu_only bin/cuda_full_attn_kv_probe.cr -o build/cuda_full_attn_kv_probe
./build/cuda_full_attn_kv_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --layer 3 \
  --tokens 4 \
  --start-pos 2 \
  --max-seq 12
```

cuda_full_attn_kv_probe now routes through `ML::CUDA::QwenFullAttnLayerRunner`, a residual-hidden-to-final-hidden wrapper around the projection runner and `ML::CUDA::QwenFullAttnKVRunner`. The projection runner can apply the initial attn_norm on CUDA from residual hidden states before the Q/K/V projection. Q is split into a normalized, RoPE'd query and a gate; K is RMSNormed and RoPE'd; K/V rows are appended to a CUDA-resident cache at start_pos. A correctness-first serial CUDA kernel computes GQA scores, softmax, value reduction, and Q-gate multiplication; the resident gated attention output is projected through attn_output.weight; and the layer tail runs residual add, post-attention RMSNorm, FFN gate/up/SwiGLU/down, and the final residual. The probe checks Q, gate, K, gated attention output, projected attention output, final hidden, K-cache, and V-cache against the CPU Qwen reference. This is now a clean one-layer semantics probe with device input/output hooks for mixed-stack composition, but it is not yet an end-to-end Linux decode path: full/recurrent stack scheduling, logits/top1, tokenizer/sampling, restored nonzero-prefix KV, and a faster attention kernel remain separate gates.
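The grouped-query attention core that the serial correctness kernel computes can be written in a few lines of numpy. A hedged single-token decode sketch (head counts, shapes, and the group mapping are the standard GQA convention; the real kernel also applies the Q-gate and runs in-place on the resident cache):

```python
import numpy as np

def gqa_decode_step(q, k_cache, v_cache, n_q_heads, n_kv_heads):
    """One grouped-query attention decode step.
    q: (n_q_heads, head_dim); k_cache/v_cache: (n_kv_heads, seq, head_dim).
    Each query head reads the KV head of its group: kv = qh // group."""
    group = n_q_heads // n_kv_heads
    head_dim = q.shape[1]
    out = np.empty_like(q)
    for qh in range(n_q_heads):
        kv = qh // group
        scores = k_cache[kv] @ q[qh] / np.sqrt(head_dim)   # (seq,)
        w = np.exp(scores - scores.max())
        w /= w.sum()                                       # softmax over positions
        out[qh] = w @ v_cache[kv]                          # weighted value sum
    return out
```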
Build the Q5_K CUDA recurrent-QKV probe:

```sh
crystal build -Dcpu_only bin/cuda_q5k_gemv_probe.cr -o build/cuda_q5k_gemv_probe
./build/cuda_q5k_gemv_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --tensor blk.0.attn_qkv.weight \
  --reps 10 \
  --warmup 2
```

cuda_q5k_gemv_probe covers the GGUF Q5_K block layout used by recurrent-layer combined attn_qkv.weight tensors in the current Qwen3.5 9B Q4_K_M file.
Build the recurrent-layer projection bundle probe:

```sh
crystal build -Dcpu_only bin/cuda_recurrent_projection_probe.cr -o build/cuda_recurrent_projection_probe
./build/cuda_recurrent_projection_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --layer 0 \
  --reps 10 \
  --warmup 2
```

cuda_recurrent_projection_probe runs Q5_K attn_qkv + Q4_K attn_gate + Q4_K ssm_alpha + Q4_K ssm_beta from one GPU-resident hidden vector and copies the four outputs back only after all kernels complete. It is the first CUDA recurrent projection-bundle proof; DeltaNet recurrence, convolution, state updates, and ssm_out remain separate work.
Build the synthetic DeltaNet output slice probe:

```sh
crystal build -Dcpu_only bin/cuda_deltanet_output_probe.cr -o build/cuda_deltanet_output_probe
./build/cuda_deltanet_output_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --layer 0 \
  --reps 10 \
  --warmup 2
```

cuda_deltanet_output_probe runs a synthetic CUDA DeltaNet state update, applies post RMSNorm/SiLU gating on GPU, and feeds the result directly into the real Q4_K ssm_out.weight projection. It is a stateful boundary probe, not a full recurrent layer: recurrent conv prep, alpha/beta transforms, residuals, and FFN remain separate work.
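For intuition about what "GPU-resident recurrent state" means here, the shape of a linear-recurrent token step can be sketched as follows. This is a deliberately schematic placeholder, not the exact DeltaNet recurrence implemented in delta_net.metal or the CUDA probe; the point is only that a matrix state `S` persists across tokens and is read out against the current query:

```python
import numpy as np

def recurrent_state_step(S, k, v, q, alpha, beta):
    """Schematic gated linear-recurrent update (illustrative only):
      S <- alpha * S + beta * outer(k, v)   # decay old state, write new pair
      y  = S.T @ q                          # read the state out with the query
    In the real path S never leaves the GPU between tokens; only the
    per-token output y (and eventually final state) is read back."""
    S = alpha * S + beta * np.outer(k, v)
    return S, S.T @ q
```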
Build the recurrent prep/output slice probe:

```sh
crystal build -Dcpu_only bin/cuda_recurrent_prep_output_probe.cr -o build/cuda_recurrent_prep_output_probe
./build/cuda_recurrent_prep_output_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --layer 0 \
  --tokens 4 \
  --reps 10 \
  --warmup 2
```

cuda_recurrent_prep_output_probe now composes one full recurrent-layer token slice: input RMSNorm, the real recurrent projection bundle (attn_qkv, attn_gate, ssm_alpha, ssm_beta), recurrent conv prep, alpha/beta transforms, DeltaNet, post RMSNorm/SiLU, Q4_K ssm_out, residual add, post-attention RMSNorm, Q4_K FFN gate/up, SwiGLU, Q6_K FFN down, and the final residual. `--tokens N` runs a GPU-resident sequence through persistent conv/SSM state and compares all token outputs plus the final recurrent states against the CPU reference. The probe separates one-time weight upload from per-sequence input/state reset and prints weight_upload_ms; the timed cuda_ms_per_token excludes the persistent weight upload. GGUF recurrent-layer tensor loading is routed through `QwenRecurrentLayerRunner::Weights.load`, so the probe no longer manually passes every raw tensor into the runner constructor. It is still a standalone one-layer probe, not an end-to-end Linux decoder.
Build the recurrent multi-layer stack scaffold:

```sh
crystal build -Dcpu_only bin/cuda_recurrent_stack_probe.cr -o build/cuda_recurrent_stack_probe
./build/cuda_recurrent_stack_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --layers 0,2,4 \
  --tokens 2
```

cuda_recurrent_stack_probe chains multiple QwenRecurrentLayerRunner instances and compares the final hidden sequence plus each layer's recurrent conv/SSM state against the CPU reference. The default path hands each recurrent layer's CUDA output buffer directly to the next layer's CUDA input; `--host-handoff` keeps the older host-copy route as a debug oracle. This is still a recurrent-only scaffold, not an end-to-end Linux decoder.
Build the mixed recurrent/full-attention CUDA stack probe:

```sh
crystal build -Dcpu_only bin/cuda_mixed_stack_probe.cr -o build/cuda_mixed_stack_probe
./build/cuda_mixed_stack_probe \
  --model /path/to/Qwen3.5-9B-Q4_K_M.gguf \
  --layers 0,1,2,3,4 \
  --tokens 2 \
  --start-pos 2 \
  --max-seq 12
```

cuda_mixed_stack_probe composes QwenRecurrentLayerRunner and QwenFullAttnLayerRunner in model layer order with device-resident hidden handoff across recurrent/full-attention boundaries, then runs QwenOutputHeadRunner for output RMSNorm, quantized lm-head projection, and resident top1. The layer/head loop is now owned by `ML::CUDA::QwenMixedStackRunner`, the first model-slice decode-state object for CUDA. By default the probe copies back only the CUDA top1 id/value plus hidden/state debug outputs; pass `--read-logits` to also copy full logits for attribution, and `--profile-phases` to insert per-layer/head synchronizations and print attribution lines. It compares the final hidden sequence, top1, recurrent conv/SSM states, and full-attention KV cache rows against the CPU reference. This is the first mixed-stack CUDA correctness scaffold through resident top1; it still stops before tokenizer/sampling, repeated full-model decode ownership, and an optimized topK/sampling kernel. The current resident top1 is a simple two-phase partial-scan/reduce kernel: correct, but not yet promoted as a speed-optimized head.
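The two-phase partial-scan/reduce top1 mentioned above has a simple serial analogue: phase one finds a (max, argmax) partial per fixed-size chunk of the logits, phase two reduces the partials. A hedged Python sketch of that kernel shape (the chunk size and tie-breaking here are illustrative):

```python
import numpy as np

def two_phase_top1(logits, chunk=1024):
    """Two-phase partial-scan/reduce argmax, mirroring the resident top1
    kernel shape: phase 1 produces one (argmax, max) partial per chunk,
    phase 2 reduces the partials. Ties resolve to the lowest index."""
    best_ids, best_vals = [], []
    for start in range(0, logits.size, chunk):          # phase 1: partials
        part = logits[start:start + chunk]
        i = int(np.argmax(part))
        best_ids.append(start + i)
        best_vals.append(part[i])
    j = int(np.argmax(best_vals))                       # phase 2: reduce
    return best_ids[j], float(best_vals[j])
```

On the GPU, phase one is one thread block per chunk with a shared-memory reduction; phase two is a single small reduction over the per-block partials.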
Build the Metal bridge once:

```sh
make build/bridge.o
```

Build the practical generation demo:

```sh
crystal build --release --no-debug \
  --link-flags="$(pwd)/build/bridge.o -framework Metal -framework Foundation -lc++" \
  bin/qwen35_generate.cr \
  -o build/qwen35_generate
```

Run greedy generation:

```sh
./build/qwen35_generate "The capital of France is" 64
```

Enable exact n-gram speculative decode for repeated text:

```sh
QWEN35_NGRAM_DECODE=1 ./build/qwen35_generate "The capital of France is" 64
```

Use the conservative automatic decode policy:

```sh
QWEN35_DECODE_POLICY=auto ./build/qwen35_generate "The capital of France is" 64
```

`auto` is the product-safe proposal-aware profile: it uses exact n-gram/cache proposals only when they are large enough to amortize verification, enables the candidate-shape risk gate, and otherwise falls back to exact target decoding without invoking the neural draft model.

Enable exact neural speculative decode with the Qwen 3.5 0.8B draft:

```sh
QWEN35_SPECULATIVE_DECODE=1 \
QWEN35_HEAD_FULL_ROWS_GUARDED=1 \
./build/qwen35_generate "The capital of France is" 64
```

Enable the exact prompt cache:

```sh
QWEN35_PROMPT_CACHE=1 \
QWEN35_SESSION_ID=demo \
./build/qwen35_generate "The capital of France is" 64
```

Useful Qwen environment switches:
| Variable | Effect |
|---|---|
| `QWEN35_PROMPT_CACHE=1` | Enable exact prompt-state cache lookup/save in `qwen35_generate`. |
| `QWEN35_PROMPT_CACHE_ROOT=/path` | Override the prompt-cache artifact root. |
| `QWEN35_PREPARE_STATE_OFF=1` | Disable eager Metal state-buffer preparation in `qwen35_generate`. By default the CLI prepares KV/DeltaNet buffers before timing prompt ingest. |
| `QWEN35_DECODE_POLICY=greedy\|ngram\|speculative\|auto` | Explicit decode-mode selector. `auto` chooses the exact fail-closed n-gram path with risk gating; an explicit policy overrides legacy mode envs. |
| `QWEN35_TRACE_STEPS_OFF=1` | Suppress per-token/per-cycle trace lines in `qwen35_generate` while keeping summaries and final output. |
| `QWEN35_QUIET=1` | Alias for suppressing per-step traces in `qwen35_generate`; useful for cleaner local timing. |
| `QWEN35_NGRAM_DECODE=1` | Enable exact n-gram speculative decode in `qwen35_generate`. |
| `QWEN35_NGRAM_GAMMA=32` | Maximum n-gram verifier chunk size. |
| `QWEN35_NGRAM_MIN=6` | Minimum repeated suffix length before n-gram drafting. |
| `QWEN35_NGRAM_MAX=8` | Maximum suffix length to search for n-gram drafting. |
| `QWEN35_NGRAM_MIN_CANDIDATES=N` | Skip n-gram proposals shorter than N candidates. In `auto`, the default is 8; explicit `ngram` keeps the old default of 0 unless set. |
| `QWEN35_NGRAM_STAGE_MIN=N` | Split only n-gram verifier chunks with at least N candidates into staged subchunks. In `auto`, the default is `QWEN35_NGRAM_GAMMA + 1`, so the common full chunk is kept intact unless overridden. Explicit `ngram` keeps the old default of 0 unless set. |
| `QWEN35_NGRAM_RISK_GATE=0\|1` | In `auto`, the exact candidate-shape risk gate is enabled by default; set 0 to disable. Explicit `ngram` keeps the old default (off) unless set to 1. |
| `QWEN35_NGRAM_RISK_MIN_SIZE=16` | Candidate-size threshold used by the n-gram risk gate. Independent of `QWEN35_NGRAM_STAGE_MIN`, so staging can be disabled without weakening fail-closed risk checks. |
| `QWEN35_NGRAM_RECURSIVE_OFF=1` | Disable recursive n-gram extension through scratch history. |
| `QWEN35_NGRAM_DISABLE_AFTER_REJECT_OFF=1` | Exploration mode: keep trying n-gram chunks after the first rejection. |
| `QWEN35_NGRAM_REPLAY_ON_REJECT=1` | Research/fast-path mode: skip n-gram target-state backups and rebuild the exact target state only after a non-final n-gram reject. Use with the default `auto` risk gate; it can regress badly when a large bad n-gram chunk is forced through verification. |
| `QWEN35_SPECULATIVE_DECODE=1` | Enable exact neural speculative decode in `qwen35_generate` using the 0.8B draft. |
| `QWEN35_DRAFT_MODEL=/path` | Override the Qwen 3.5 draft GGUF used by neural speculative decode. |
| `QWEN35_SPEC_GAMMA=4` | Initial neural draft chunk size in `qwen35_generate`. |
| `QWEN35_SPEC_MAX_GAMMA=32` | Maximum adaptive neural draft chunk size. |
| `QWEN35_SPEC_PLAIN_FALLBACK_OFF=1` | Disable target-only fallback after low-gamma speculative rejection. Useful for A/B experiments; the default fallback is faster on rejection-heavy prompts. |
| `QWEN35_SPEC_PLAIN_FALLBACK_GAMMA=2` | Gamma threshold at or below which rejected neural speculative decode falls back to target-only generation. |
| `QWEN35_SPEC_BOOTSTRAP_GAMMA=N` | Default-off neural speculative jump after a fully accepted initial chunk. Can help 100%-accept runs; may regress prompts that reject after an accepted prefix. |
| `QWEN35_SPEC_SINGLE_FAST_OFF=1` | Disable the exact gamma=1 accepted-token fast path in neural speculative decode. Mostly useful when target-only fallback is disabled for A/B experiments. |
| `QWEN35_SPEC_VERIFY=chunk-inplace\|hybrid\|serial` | Choose the neural speculative verifier strategy. The default `chunk-inplace` is best for high-accept prompts; `hybrid` can help first-cycle partial-reject prompts. |
| `QWEN35_SPEC_SKIP_DRAFT_BEFORE_FALLBACK_OFF=1` | Disable the exact optimization that skips draft resync work when a rejection is guaranteed to enter target-only fallback. |
| `QWEN35_SPEC_SKIP_DRAFT_BACKUP_BEFORE_FALLBACK_OFF=1` | Disable the matching draft-backup skip before fallback-bound speculative chunks. |
| `QWEN35_HEAD_FULL_ROWS_GUARDED=1` | Experimental speculative-verifier accelerator for large accepted chunks; uses a margin guard and exact fallback for low-margin rows. |
| `QWEN35_HEAD_FULL_ROWS_MARGIN=0.25` | Margin threshold for the guarded full-row verifier route. Higher is safer but falls back more often. |
| `QWEN35_FFN_DOWN_ADD_FUSED_OFF=1` | Disable decode-wave FFN-down residual-add fusion for Q4/Q6 target and Q8 draft experiments. |
| `QWEN35_Q4K_PAIR_H16_MIN_BATCH=64` | Tune the prefill Q4 gate/up shared H16 conversion threshold. The current default enables sharing from pp64 upward after a refreshed A/B showed a small exact win. |
| `QWEN35_REC_PROJ_SHARED_H16_OFF=1` | Disable the exact recurrent prefill projection optimization that shares one H16 input conversion between the Q5 qkv and Q4 gate GEMMs. |
| `QWEN35_PREFILL_CHUNK_OFF=1` | Force the older non-chunked prefill path. |
| `QWEN35_DECODE_WAVE_OFF=1` | Force the older non-wave decode path. |
The current Qwen API is low-level and intended for native inference experiments:

```crystal
require "ml/gguf/qwen35_cpu"
require "ml/gguf/qwen35_weights"

model = "/path/to/Qwen3.5-9B-Q4_K_M.gguf"
weights = ML::GGUF::Qwen35Weights.from_gguf(model)
state = ML::GGUF::Qwen35CPU::State.new(weights.hparams, max_seq: 1024)
prompt_ids = [760_i32, 6511_i32, 314_i32, 9338_i32, 13_i32]

next_id, next_logit = ML::GGUF::Qwen35CPU.prefill_tokens_top1(weights, prompt_ids, 0, state)
64.times do |i|
  puts next_id
  next_id, next_logit = ML::GGUF::Qwen35CPU.forward_top1(weights, next_id, prompt_ids.size + i, state)
end
```

When linking an executable that uses Metal, include the bridge object and the Apple frameworks:

```sh
crystal build your_app.cr \
  --link-flags="$(pwd)/build/bridge.o -framework Metal -framework Foundation -lc++"
```

For CPU-only builds:

```sh
crystal build -Dcpu_only your_app.cr
```

Build the matched benchmark:
```sh
crystal build --release --no-debug \
  --link-flags="$(pwd)/build/bridge.o -framework Metal -framework Foundation -lc++" \
  bin/benchmark_qwen_vs_llama.cr \
  -o build/benchmark_qwen_vs_llama
```

Run a normal first-run prefill/decode comparison:

```sh
./build/benchmark_qwen_vs_llama \
  --model ~/.cache/lm-studio/models/lmstudio-community/Qwen3.5-9B-GGUF/Qwen3.5-9B-Q4_K_M.gguf \
  --llama-bench ~/SrcArchives/AI/llama.cpp/build/bin/llama-bench \
  --prompt=64 \
  --gen=64 \
  --reps=5 \
  --warmup=2
```

For publishable measurements, wait for a quiet host:

```sh
./build/benchmark_qwen_vs_llama \
  --prompt=64 \
  --gen=64 \
  --reps=5 \
  --warmup=2 \
  --wait-quiet-ms=60000 \
  --require-quiet
```

Additional benchmark modes:

```sh
# Fresh State per repetition, but Metal state buffers are prepared before
# the timed prefill. This measures prompt ingest without first-touch buffer
# allocation/zeroing in the timed region.
./build/benchmark_qwen_vs_llama --native-prefill-prepare-state

# State buffers allocated once, then reset between reps.
./build/benchmark_qwen_vs_llama --native-prefill-prealloc

# Exact prompt-cache restore after one seeded native prefill.
./build/benchmark_qwen_vs_llama --native-prefill-cache
```

Fresh local M2 Max 64GB relaxed-load snapshot after the shared-H16 recurrent projection cleanup: Qwen 3.5 9B Q4_K_M, llama.cpp llama-bench, prompt=64, gen=64, reps=3, warmup=1, flash attention off:
| Mode | cogni-ml | llama.cpp | Gap |
|---|---|---|---|
| First-run prefill | 426.70 tok/s p50 | 455.10 tok/s avg | -6.24% |
| Fresh state, prepared Metal buffers | 449.73 tok/s p50 | 464.78 tok/s avg | -3.24% |
| Prefill with preallocated state | 448.60 tok/s p50 | 465.91 tok/s avg | -3.71% |
| Prompt-cache restore | 1350.65 tok/s p50 | 465.80 tok/s avg | +189.97% |
| Plain greedy decode, first-run bench | 48.67 tok/s p50 | 46.67 tok/s avg | +4.29% |
| Plain greedy decode, prepared-state bench | 48.59 tok/s p50 | 46.43 tok/s avg | +4.67% |
| Plain greedy decode, preallocated bench | 48.51 tok/s p50 | 46.63 tok/s avg | +4.05% |
| Plain greedy decode, prompt-cache bench | 48.52 tok/s p50 | 46.58 tok/s avg | +4.18% |
Notes:

- The table is a local engineering snapshot, not a lab-clean public benchmark.
- First-run prefill is still behind llama.cpp on this machine. The native wins currently come from state reuse, prompt-cache restore, and exact speculative decode.
- `--native-prefill-prepare-state` uses a fresh `State` per repetition but calls `Qwen35CPU.prepare_state_metal!` before timing. This is useful for server-style latency where a session object can be prepared before the prompt arrives. `--native-prefill-cache` measures exact restore of a previously computed prompt state; it is not a first-run prefill replacement.
- Short decode runs are noisy on a desktop system. The plain decode rows above are intentionally all shown: treat plain decode as parity-to-faster, not as a stable public margin without a quiet rerun.
Neural draft harness with Qwen 3.5 0.8B Q8_0:

```sh
crystal build --release --no-debug \
  --link-flags="$(pwd)/build/bridge.o -framework Metal -framework Foundation -lc++" \
  bin/qwen35_speculative_accept.cr \
  -o build/qwen35_speculative_accept

./build/qwen35_speculative_accept \
  --target ~/.cache/lm-studio/models/lmstudio-community/Qwen3.5-9B-GGUF/Qwen3.5-9B-Q4_K_M.gguf \
  --draft ~/.cache/lm-studio/models/lmstudio-community/Qwen3.5-0.8B-GGUF/Qwen3.5-0.8B-Q8_0.gguf \
  --tokens 64 \
  --ngram \
  "The capital of France is"
```

Cheap-proposal-only policy for repeated/template spans:

```sh
QWEN35_SPEC_NGRAM_MIN_CANDIDATES=8 \
./build/qwen35_speculative_accept \
  --target ~/.cache/lm-studio/models/lmstudio-community/Qwen3.6-27B-GGUF/Qwen3.6-27B-Q4_K_M.gguf \
  --draft ~/.cache/lm-studio/models/lmstudio-community/Qwen3.5-0.8B-GGUF/Qwen3.5-0.8B-Q8_0.gguf \
  --tokens 32 \
  --ngram \
  --ngram-risk-gate \
  --ngram-target-only \
  "alpha beta gamma alpha beta gamma alpha beta gamma alpha"
```

Target-only n-gram speculative harness:

```sh
crystal build --release --no-debug \
  --link-flags="$(pwd)/build/bridge.o -framework Metal -framework Foundation -lc++" \
  bin/qwen35_ngram_speculative.cr \
  -o build/qwen35_ngram_speculative

./build/qwen35_ngram_speculative \
  --tokens 64 \
  --gamma 32 \
  --min-ngram 6 \
  "The capital of France is"
```

Both harnesses replay and check exact greedy target output by default unless their CLI explicitly says otherwise.
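The n-gram draft source behind these harnesses is conceptually simple: if the most recent tokens repeat an earlier span, propose the tokens that followed that earlier span and let the target verify them. A hedged sketch (Python; the parameters mirror the `--min-ngram`/`--gamma` flags, but the function itself is illustrative, not the library's implementation):

```python
def ngram_propose(history, min_n=6, max_n=8, gamma=32):
    """Exact-verification n-gram drafting: if the last n tokens
    (n in min_n..max_n, longest suffix first) reappeared earlier in
    `history`, propose up to `gamma` tokens that followed the earlier
    occurrence. The target model still verifies every proposed token,
    so a bad proposal costs time but never changes the output."""
    for n in range(min_n, min(max_n, len(history) - 1) + 1)[::-1]:
        suffix = history[-n:]
        # Scan earlier occurrences, most recent first.
        for start in range(len(history) - n - 1, -1, -1):
            if history[start:start + n] == suffix:
                cont = history[start + n:start + n + gamma]
                if cont:
                    return cont
    return []
```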
Fresh local speculative smoke, same M2 Max 64GB and Qwen 3.5 9B target:

| Mode / prompt | Effective speed | Plain target | Notes |
|---|---|---|---|
| Neural draft, `The capital of France is` | 15.38 ms/tok, 65.01 tok/s | 21.98 ms/tok, 45.49 tok/s | 100% accepted, 64/64 candidates |
| Neural draft, `def fibonacci(n):` | 21.06 ms/tok, 47.48 tok/s | 21.71 ms/tok, 46.07 tok/s | Falls back after rejection; small but safe win |
| N-gram + neural, `The capital of France is` | 10.10 ms/tok, 98.98 tok/s | 21.91 ms/tok, 45.64 tok/s | Repeated-text path, 48/48 n-gram candidates accepted |
| Experimental guarded full-row verifier + neural, `The capital of France is` | 14.32 ms/tok, 69.82 tok/s | 22.36 ms/tok, 44.73 tok/s | `QWEN35_HEAD_FULL_ROWS_GUARDED=1`, 0 fallback rows in this run |
| Experimental guarded full-row verifier + n-gram + neural, `The capital of France is` | 9.20 ms/tok, 108.64 tok/s | Noisy target run | `QWEN35_HEAD_FULL_ROWS_GUARDED=1`, 48/48 n-gram candidates accepted |
Speculative decode caveats:

- The speculative paths are exact greedy verification paths, not approximate sampling shortcuts.
- Neural speculative speed depends on draft acceptance. High-accept prompts are faster; rejection-heavy prompts quickly fall back to plain target decode.
- In `qwen35_generate`, neural speculative decode is useful for longer high-accept generations. In a local 64-token smoke, `The capital of France is` measured 20.40 ms/tok greedy, 16.61 ms/tok neural speculative, and 15.10 ms/tok neural speculative with guarded full-row verification. A 32-token smoke was slower due to fixed draft/verifier overhead.
- N-gram speculation is a workload-specialized path for repeated/generated-template text. `QWEN35_DECODE_POLICY=auto` is the recommended product profile: risk-gated, `min_candidates=8`, no neural fallback, and exact target-only fallback outside clean cheap-copy spans.
- In the research acceptance harness, `--ngram-target-only` / `QWEN35_SPEC_NGRAM_TARGET_ONLY=1` skips neural draft fallback after the cheap n-gram proposal source and uses exact target-only steps instead. On a 27B mixed JSONL gate it beat the neural default on 5/7 prompts with a paired ratio of 0.909x when combined with `--ngram-risk-gate`, but its average speed was near plain target decode; the real win is on clean repeated spans.
- The n-gram risk gate now also catches small-period prefix overruns such as IP/YAML tails. A focused 27B probe kept clean `alpha beta gamma` repeats at ~2.58x while turning a YAML overrun from 0.80x into a fail-closed target-only 1.006x.
- The productization smoke after the structured-tail fix kept clean 27B repeats fast (~2.55x over plain), made a YAML-like overrun fail closed (1.003x), and had `ngram_target_only_risk` beat the neural default on the tiny paired suite (0.815x ratio). Treat this as opt-in CLI evidence, not a broad default claim.
- `QWEN35_NGRAM_REPLAY_ON_REJECT=1` is exact but deliberately opt-in. It removes rollback-copy overhead on high-confidence accepted n-gram chunks; on the local 27B+0.8B repeat8 harness it improved `ngram_router16_risk` from 30.85 to 30.21 ms/tok. A forced no-risk YAML reject regressed from 71.53 to 95.86 ms/tok, so this is not a broad default.
- N-gram verifier chunks temporarily disable guarded full-row verification even if `QWEN35_HEAD_FULL_ROWS_GUARDED=1`, because partial n-gram rejection exposed a close-row guard failure during adversarial CLI testing.
- `QWEN35_HEAD_FULL_ROWS_GUARDED=1` is still an experimental research switch. The harness checks final output against plain greedy target output, but the route is not broad-defaulted because it relies on a full-row F16 top1 margin guard.
- These numbers are effective decode throughput after prompt prefill; they do not make first-run prefill faster.
The embedding path targets nomic-embed-text-v2-moe with a fully native Metal compute pipeline.

```crystal
require "ml"
require "ml/gguf/nomic_bert"
require "ml/gguf/metal_backend"
require "ml/metal/compute_graph"

ML::Metal::Device.init!
model = ML::GGUF::NomicBertMoE.from_gguf("path/to/model.gguf", ML::GGUF::MetalBackend.new)
embedding = model.embed("Your text here")
```

Apple M2 Max, 38 GPU cores:
| Tokens | Latency |
|---|---|
| 20 | 14 ms |
| 94 | 16 ms |
| 196 | 33 ms |
| 433 | 70 ms |
- simdgroup-matrix GEMM for Q5_K/Q6_K dequant+multiply.
- Batched expert GEMM for MoE experts.
- ComputeGraph wave scheduling with offset-aware dependency analysis.
- Fused QKV split/RoPE, gate/softmax/top-k, scatter, and norm kernels.
- GPU-driven dispatch where useful.
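The wave-scheduling idea behind the ComputeGraph can be illustrated with a toy model: consecutive ops join one wave (no barrier between them) as long as their buffer byte ranges do not conflict; only a true read-after-write, write-after-write, or write-after-read overlap forces a new wave. A hedged sketch of that offset-aware check (Python; this is an illustrative model, not the actual ComputeGraph implementation):

```python
def schedule_waves(ops):
    """Greedy offset-aware wave scheduling over ops in submission order.
    Each op is (reads, writes); each access is (buffer_id, offset, size).
    An op joins the current wave only if none of its ranges conflict with
    ranges already in the wave; otherwise it starts a new wave (a barrier).
    Returns the number of waves (fewer waves = fewer barriers)."""
    def overlap(a, b):
        # Same buffer and byte ranges intersect.
        return a[0] == b[0] and a[1] < b[1] + b[2] and b[1] < a[1] + a[2]

    waves = []  # each wave: (reads, writes) accumulated so far
    for reads, writes in ops:
        placed = False
        if waves:
            w_reads, w_writes = waves[-1]
            conflict = (
                any(overlap(r, w) for r in reads for w in w_writes)            # RAW
                or any(overlap(w, x) for w in writes for x in w_writes + w_reads)  # WAW/WAR
            )
            if not conflict:
                w_reads.extend(reads)
                w_writes.extend(writes)
                placed = True
        if not placed:
            waves.append((list(reads), list(writes)))
    return len(waves)
```

Two kernels writing disjoint halves of one buffer share a wave; a third kernel reading one of those halves then opens a second wave. An offset-blind scheduler would have needed three.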
| Model | Format | Status |
|---|---|---|
| Qwen3.5-9B | GGUF Q4_K_M | Native Metal text generation path, active optimization target. |
| Qwen3.5-0.8B | GGUF Q8_0 | Native draft model path for speculative decode harnesses. |
| Qwen3.6-27B | GGUF Q4_K_M target | Planned/experimental scale-up target. |
| nomic-embed-text-v2-moe | GGUF Q5_K_M | Native Metal embedding pipeline. |
| BERT-like encoders | GGUF | Via NomicBertMoE when the architecture matches. |
| Other Llama/Qwen/Mistral-style models | GGUF | Via llama.cpp bindings. |
```yaml
# shard.yml
dependencies:
  cogni-ml:
    github: skuznetsov/cogni-ml
    version: ~> 0.40.0
```

```sh
make build
make spec
```

CPU-only:

```sh
make build_cpu
make spec_cpu
```

llama.cpp helper targets:

```sh
make llama
make llama_env
```

The Makefile searches common local, Homebrew, and system library locations for libllama. Override with `LLAMA_DIR`, `LLAMA_BUILD`, or `LLAMA_LIB_DIR` if needed.
```crystal
require "ml"

x = ML::Autograd::Variable.rand(2, 3, requires_grad: true, device: ML::Tensor::Device::CPU)
layer = ML::NN::Linear.new(3, 4, device: ML::Tensor::Device::CPU)
out = layer.forward(x)
loss = out.mean
loss.backward

opt = ML::Optim::Adam.new(layer.parameters)
opt.step
opt.zero_grad
```

```crystal
require "ml/llm/llama"

ML::LLM.init
model = ML::LLM::Model.new("path/to/model.gguf")
gen = ML::LLM::Generator.new(model)
puts gen.ask("What is Crystal?", max_tokens: 100)
ML::LLM.cleanup
```

```crystal
require "ml"
require "ml/gguf/nomic_bert"
require "ml/gguf/metal_backend"
require "ml/metal/compute_graph"

ML::Metal::Device.init!
model = ML::GGUF::NomicBertMoE.from_gguf(
  "nomic-embed-text-v2-moe.Q5_K_M.gguf",
  ML::GGUF::MetalBackend.new
)
vec = model.embed("Crystal programming language")
puts "dim=#{vec.size}"
vecs = model.embed_batch(["Hello", "World", "Crystal"])
```

| Kernel | Purpose |
|---|---|
| `gemm_q4k.metal` | Q4_K GEMV/GEMM paths for Qwen. |
| `gemm_q56k.metal` | Q5_K/Q6_K/Q8_0 GEMV, top1, and helper kernels for Qwen. |
| `gemm_mm.metal` | simdgroup-matrix GEMM for Q5_K/Q6_K and batched expert variants. |
| `gemm_simd.metal` | Scalar SIMD GEMM fallback. |
| `ffn_qwen35.metal` | Qwen FFN, add, RMSNorm, and activation helpers. |
| `delta_net.metal` | Qwen 3.5 DeltaNet/recurrent kernels. |
| `fullattn_qwen35.metal` | Qwen full-attention prefill/decode helpers. |
| `attn_decode_qwen35.metal` | Qwen gated attention decode. |
| `attention_matmul.metal` | Flash-style attention matrix helpers. |
| `bert_fp16.metal` | Nomic/BERT fused ops. |
| `nn.metal` | General NN ops. |
| Platform | GPU | CPU | Status |
|---|---|---|---|
| macOS Apple Silicon | Metal | Yes | Primary target. |
| macOS Intel | Metal | Yes | Supported for general Metal paths; Qwen performance focus is Apple Silicon. |
| Linux | Experimental CUDA probes | Yes | Use `-Dcpu_only` for GGUF/metadata and CUDA probe CLIs; full Qwen generation is still Metal-first. |
| FreeBSD | No native Metal | Untested CPU-only | Not a primary CI target. |
NVIDIA/CUDA support is currently an experimental backend-probe track, not a full decoder. The Qwen native generation path remains Metal-first.
| Flag | Effect |
|---|---|
| `-Dcpu_only` | Disable Metal and build pure CPU paths. |
| `-Duse_gguf` | Enable GGUF model loading where applicable. |
MIT