Comprehensive benchmarking suite for evaluating local LLM models on real-world coding and debugging tasks. Tests models on production mining pool code with bugs ranging from easy threading issues to expert-level crypto byte order vulnerabilities.
| Test | Difficulty | Checks | Description |
|---|---|---|---|
| ZMQ Listener | Easy | 2 | Threading bug in ZMQ socket listener |
| PPLNS Mining Pool | Hard | 3 | Config mismatch, hardcoded values, unused variables |
| Payment System | Expert | 4 | Race condition, SQL injection, float precision, atomicity |
| Stratum Protocol | Nightmare | 5 | Byte order/endianness, info leak, stale data, memory leak, input validation |
| HiveOS Wrapper | Practical | 8 | Multi-file creation (manifest, config, run, stats scripts) |
Total: 22 checks across 5 practical debugging tests
Platform: 8x NVIDIA RTX A4000 (16GB each), AMD EPYC 7532, PCIe Gen4 Software: vLLM 0.14, SGLang 0.3.2, Python 3.12, CUDA 12.8 Date: February 12-15, 2026
| # | Model | Quant | TP | Quality | Single t/s | Peak t/s | Context | Framework |
|---|---|---|---|---|---|---|---|---|
| 1 | Devstral-2-123B | AWQ | 8 | 100% (22/22) | 41 | 300 @ C=32 | 32K | vLLM |
| 2 | Nemotron-3-Nano-30B | BF16 | 8 | 100% (22/22) | 262 | 1307 @ C=64 | 16K | vLLM |
| 3 | Qwen3-Coder-30B-A3B | AWQ | 4 | 100% (22/22) | 184 | 1025 @ C=32 | 32K | vLLM |
| 4 | GLM-4.7-Flash | AWQ | 4 | 100% (22/22) | 101 | 566 @ C=8 | 65K | SGLang |
| 5 | GLM-4.5-Air | AWQ FP16Mix | 8 | 95.5% (21/22) | 87 | 724 @ C=64 | 8K | vLLM |
| 6 | Magistral-Small-2509 | AWQ | 8 | 95.5% (21/22) | 144 | 1470 @ C=64 | 32K | vLLM |
| 7 | Magistral-Small-2506 | AWQ | 8 | 95.5% (21/22) | 156 | 1831 @ C=32 | 32K | vLLM |
| 8 | QwQ-32B | AWQ | 8 | 95.5% (21/22) | 102 | 733 @ C=64 | 16K | vLLM |
| 9 | Qwen3-32B | AWQ | 8 | 95.5% (21/22) | 78 | 1013 @ C=32 | 32K | vLLM |
| 10 | EXAONE-4.0-32B | GPTQ g32 | 8 | 95.5% (21/22) | 110 | 719 @ C=64 | 131K | vLLM |
| 11 | Qwen3-30B-A3B | AWQ | 4 | 95.5% (21/22) | 178 | 1575 @ C=32 | 32K | vLLM |
| 12 | Devstral-Small-2-24B | AWQ | 8 | 95.5% (21/22) | 148 | 1452 @ C=32 | 32K | vLLM |
| 13 | Seed-OSS-36B | AWQ | 8 | 90.9% (20/22) | 88 | 1163 @ C=32 | 32K | vLLM |
| 14 | Qwen3-30B-A3B-Thinking | AWQ | 4 | 81.8% (18/22) | 160 | 1031 @ C=32 | 32K | vLLM |
| 15 | Nanbeige4.1-3B | BF16 | 4 | 77.3% (17/22) | 187 | 1239 @ C=64 | 131K | vLLM |
| 16 | DS-R1-Distill-Qwen-32B | AWQ | 8 | 54.5% (12/22) | 78 | 992 @ C=32 | 32K | vLLM |
| 17 | DS-R1-Distill-Llama-70B | AWQ | 8 | 45.5% (10/22) | 57 | 540 @ C=32 | 16K | vLLM |
| 18 | GPT-OSS-20B | MXFP4 | 8 | 40.9% (9/22) | 52 | 933 @ C=16 | 8K | vLLM |
| Category | Model | Why |
|---|---|---|
| Best Quality | Devstral-2-123B | 100% quality (22/22). 5/5 Stratum also achieved by: Nemotron, Qwen3-Coder, GLM-4.7-Flash, GLM-4.5-Air |
| Best Overall | Nemotron-3-Nano-30B | 100% quality + hidden reasoning + 262 t/s single + 1307 t/s peak |
| Best Code Model | Qwen3-Coder-30B-A3B | 100% quality, purpose-built for code tasks |
| Best Throughput | Magistral-Small-2506 AWQ | 1831 t/s peak, 95.5% quality, only 14GB |
| Best Long Context | Magistral-Small-2509 | 131K context, 95.5% quality, working reasoning |
| Best Reasoning | Nemotron-3-Nano-30B | 100% quality, hidden <think> mode enabled by default, 262 t/s |
| Best Reasoning (Runner-up) | QwQ-32B | 95.5% quality with <think> mode, 102 t/s |
| Model | ZMQ (2) | PPLNS (3) | Payment (4) | Stratum (5) | HiveOS (8) | Total |
|---|---|---|---|---|---|---|
| Devstral-2-123B | 2/2 | 3/3 | 4/4 | 5/5 | 8/8 | 22/22 |
| Nemotron-3-Nano-30B | 2/2 | 3/3 | 4/4 | 5/5 | 8/8 | 22/22 |
| Qwen3-Coder-30B-A3B | 2/2 | 3/3 | 4/4 | 5/5 | 8/8 | 22/22 |
| GLM-4.7-Flash | 2/2 | 3/3 | 4/4 | 5/5 | 8/8 | 22/22 |
| QwQ-32B | 2/2 | 3/3 | 4/4 | 4/5 | 8/8 | 21/22 |
| GLM-4.5-Air | 2/2 | 3/3 | 3/4 | 5/5 | 8/8 | 21/22 |
| Magistral-Small-2509 | 2/2 | 3/3 | 4/4 | 4/5 | 8/8 | 21/22 |
| Magistral-Small-2506 | 2/2 | 3/3 | 4/4 | 4/5 | 8/8 | 21/22 |
| Qwen3-32B | 2/2 | 3/3 | 4/4 | 4/5 | 8/8 | 21/22 |
| EXAONE-4.0-32B | 2/2 | 3/3 | 4/4 | 4/5 | 8/8 | 21/22 |
| Qwen3-30B-A3B | 2/2 | 3/3 | 4/4 | 4/5 | 8/8 | 21/22 |
| Devstral-Small-2-24B | 2/2 | 3/3 | 4/4 | 4/5 | 8/8 | 21/22 |
| Seed-OSS-36B | 2/2 | 3/3 | 4/4 | 3/5 | 8/8 | 20/22 |
| Qwen3-30B-A3B-Think | 2/2 | 2/3 | 4/4 | 4/5 | 6/8 | 18/22 |
| Nanbeige4.1-3B | 2/2 | 3/3 | 3/4 | 4/5 | 5/8 | 17/22 |
| DS-R1-Distill-Qwen-32B | 2/2 | 2/3 | 0/4 | 0/5 | 8/8 | 12/22 |
| DS-R1-Distill-Llama-70B | 2/2 | 0/3 | 0/4 | 0/5 | 8/8 | 10/22 |
| GPT-OSS-20B | 0/2 | 3/3 | 0/4 | 0/5 | 6/8 | 9/22 |
| Concurrency | Nemotron | QwQ-32B | Mag-2506 | Mag-2509 | Qwen3-30B | Devstral-S | Nanbeige-3B | Seed-OSS | Qwen3-Coder | GLM-4.5 | Qwen3-32B | EXAONE | Devstral-2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 262 | 102 | 156 | 144 | 178 | 180 | 249 | 78 | 184 | 119 | 78 | 110 | 41 |
| 2 | 253 | 108 | 275 | 387 | 347 | 186 | 251 | 168 | 333 | 122 | 170 | 142 | 82 |
| 4 | 280 | 136 | 524 | 281 | 621 | 425 | 279 | 373 | 462 | 150 | 331 | 161 | 124 |
| 8 | 426 | 234 | 839 | 486 | 869 | 691 | 453 | 616 | 589 | 241 | 571 | 279 | 185 |
| 16 | 637 | 416 | 1266 | 827 | 1208 | 1051 | 737 | 923 | 840 | 350 | 825 | 444 | 271 |
| 32 | 479 | 611 | 1831 | 1238 | 1575 | 1452 | 1044 | 1163 | 1025 | 523 | 1013 | 636 | 300 |
| 64 | 1307 | 733 | - | 1470 | - | - | 1239 | - | - | 724 | - | 719 | - |
| Model | Issue |
|---|---|
| GPT-OSS-120B "AWQ" | twhitworth repo: only 6/36 layers present, FP16 not AWQ. No viable quant exists. |
--enable-expert-parallel → 95.5%, 5/5 Stratum |
|
| Qwen3-Next-80B-A3B AWQ | 2 KV heads (max TP=2), 24.5GB/GPU exceeds 16GB |
| Qwen3-Coder-Next AWQ | 2 KV heads (max TP=2), needs vLLM 0.15+ |
| EXAONE-4.0-32B AWQ/GPTQ g128 | Marlin min_thread_k=128 alignment fails at TP>2 (3424%128≠0). Fixed with custom GPTQ g32 |
| Qwen3-480B-Coder AWQ | 236GB model, 128GB VRAM. At TP=8: 29.5GB/GPU needed vs 14.7GB available. Needs ~500GB VRAM. |
1. Hidden Reasoning: Nemotron Was Thinking All Along
Nemotron-3-Nano-30B has enable_thinking=True as the DEFAULT in its chat template. It uses <think>/</think> tags (DeepSeek R1 format) and was reasoning during all benchmarks. This makes it the only 100% quality reasoning model - and the fastest. The top 4 models by quality (100%) include 3 non-reasoning (Devstral-2, Qwen3-Coder, GLM-4.7-Flash) and 1 hidden reasoner (Nemotron).
Only 4 models found the byte order/endianness bug in the Stratum protocol test. This is the strongest differentiator between 95.5% and 100% quality models.
3. Nemotron-3-Nano-30B: Best Overall (Hidden Reasoning Champion)
Mamba+MoE hybrid architecture achieves 100% quality AND fastest single-request speed (262 t/s). At 59GB BF16, it uses ~7.4GB/GPU at TP=8. Has hidden <think> reasoning enabled by default in chat template - was thinking during ALL benchmarks. QwQ-32B AWQ (95.5%, 102 t/s) is a solid reasoning alternative.
At only 14GB (24B params), it achieves 1831 t/s peak - highest of any model. With only ~1.75GB/GPU for weights, it has massive KV cache headroom.
After patching two vLLM bugs (flashinfer positional args + compressed-tensors false rejection), fp8 KV cache works for all quant formats:
- Devstral-2-123B: 16K -> 32K context
- Seed-OSS-36B: 32K -> 64K context
- Magistral: 1.25M tokens, 38x concurrency
Tested 3 models on SGLang - all performed worse than vLLM (Nemotron -47% peak, Qwen3-Coder quality collapsed to 54.5%). Exception: GLM-4.7-Flash works ONLY on SGLang.
EXAONE TP=2+PP=4 (8 GPUs): quality dropped 18pp, throughput dropped 26%. Only gained context length. Pipeline latency overhead outweighs memory savings.
At only 3B parameters (~6GB BF16), Nanbeige4.1-3B scores 77.3% - matching EXAONE-4.0-32B (10x larger). Uses <think> reasoning tags, 131K context, and achieves 187 t/s single-request. Fits on a single 16GB GPU. Best quality-per-parameter ratio in the benchmark.
- EXAONE-4.0-32B: Stuck at TP=2 with AWQ/GPTQ g128 (Marlin alignment). Custom GPTQ g32 +
--dtype float16enables TP=8: 110 t/s (+67%), quality preserved. - Magistral-Small-2509: Multimodal model (
Mistral3ForConditionalGeneration) had no community AWQ. Extracted text model from multimodal wrapper, AWQ quantized with fragmentation fix. Result: 144 t/s (+64% vs BF16), 1470 t/s peak (+37%), quality preserved at 95.5%.
| Model | Q Heads | KV Heads | Max TP | Type | Notes |
|---|---|---|---|---|---|
| Devstral-2-123B | 96 | 8 | 8 | Dense | 123B, tight VRAM |
| Magistral-Small | 32 | 8 | 8 | Dense | 24B, both 2506/2509 |
| QwQ-32B | 40 | 8 | 8 | Dense | Qwen2-based reasoning |
| Qwen3-32B | 64 | 8 | 8 | Dense | Reasoning |
| Seed-OSS-36B | 80 | 8 | 8 | Dense | Reasoning |
| Nemotron-3-Nano-30B | 32 | 8 | 8 | Mamba+MoE | --trust-remote-code |
| Devstral-Small-2-24B | 32 | 8 | 8 | Dense | Mistral3 |
| GPT-OSS-20B | 64 | 8 | 8 | MoE | enforce-eager only |
| Qwen3-30B-A3B | 32 | 4 | 4 | MoE | 128 experts, 8 active |
| Qwen3-Coder-30B-A3B | 32 | 4 | 4 | MoE | 128 experts, 8 active |
| GLM-4.5-Air | 96 | 8 | 8 | MoE | 128E/8A, needs --enable-expert-parallel, QuantTrio FP16Mix |
| GLM-4.7-Flash | 20 | 20 | 4 | MoE | SGLang only |
| Nanbeige4.1-3B | 20 | 4 | 4 | Dense | 3B reasoning, fits single GPU |
| EXAONE-4.0-32B | 40 | 8 | 8 | Dense | Custom GPTQ g32 + float16 for TP=8 |
# Qwen3-Coder-30B-A3B x2 replicas (100% quality, 184 t/s each)
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ \
--tensor-parallel-size 4 --gpu-memory-utilization 0.90 \
--max-num-seqs 16 --max-model-len 32768 --port 8000
CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ \
--tensor-parallel-size 4 --gpu-memory-utilization 0.90 \
--max-num-seqs 16 --max-model-len 32768 --port 8001# Magistral-Small-2509 BF16 (95.5%, 131K context, [THINK] reasoning)
vllm serve mistralai/Magistral-Small-2509 \
--tensor-parallel-size 8 --gpu-memory-utilization 0.90 \
--max-num-seqs 48 --tokenizer-mode mistral --config-format mistral \
--load-format mistral --reasoning-parser mistral --port 8000# Magistral-Small-2506 AWQ (1831 t/s peak, 95.5% quality)
vllm serve abhishekchohan/Magistral-Small-2506-AWQ \
--tensor-parallel-size 8 --gpu-memory-utilization 0.90 \
--max-num-seqs 48 --max-model-len 32768 --port 8000# Devstral-2-123B with fp8 KV (100% quality, 32K context)
vllm serve cyankiwi/Devstral-2-123B-Instruct-2512-AWQ-4bit \
--tensor-parallel-size 8 --gpu-memory-utilization 0.90 \
--max-num-seqs 16 --max-model-len 32768 \
--kv-cache-dtype fp8_e5m2 --port 8000Two bugs in vLLM 0.14 had to be patched for fp8 KV cache to work across all quant formats. Status as of vLLM 0.20.0:
| Patch | Upstream status | Location |
|---|---|---|
| FlashInfer positional arg fix | Fixed in 0.18.0+ (still fixed at 0.20.0) | archive/patches/ |
| Compressed-tensors false fp8 KV rejection | Still broken at 0.20.0 | projects/turboquant/patches/ |
See the README in each patches dir for the diff and apply instructions.
# Install
pip install pyyaml requests
# Quality bench (5 prompts, 22 pass/fail checks against rubric)
python compare_models.py --model <model-id> --api-url http://localhost:8000/v1
# With parallel throughput test
python compare_models.py --model <model-id> --api-url http://localhost:8000/v1 --parallel
# With server config metadata
python compare_models.py --model <model-id> --api-url http://localhost:8000/v1 \
--parallel --tp 8 --framework vllm --quant-method awq \
--output ./results/8xA4000
# Decode-rate benchmark (single-stream, isolates decode from prefill cost)
python decode_rate_bench.py --api-url http://localhost:8000/v1 --model <model-id> \
--tag "config description" --targets 500 2000 4000 8000 14000| Tool | Purpose | Output |
|---|---|---|
compare_models.py |
"Is this model worth running?" | Quality score (22-check rubric) + ballpark throughput |
decode_rate_bench.py |
"How fast is this kernel/config?" | Pure single-stream decode rate (t/s) at varying context, separated TTFT, handles reasoning_content deltas |
parallel_benchmark.py |
Multi-stream peak throughput | Concurrency-scaling t/s curve |
long_context_test.py |
Functional test at long context | Pass/fail at target ctx |
tool_call_benchmark/ |
Multi-step SSH tool-call reliability | See subdir README |
decode_rate_bench.py is the canonical "how fast does it decode?" tool — use it when comparing kernels, KV-quant settings, or autotune results. compare_models.py reports throughput as a side effect of quality testing, but its numbers are distorted by reasoning-token accounting on thinking models. For honest decode-rate comparisons, prefer decode_rate_bench.py.
For the multi-step SSH tool-call reliability bench, see tool_call_benchmark/README.md.
llm-bench/
├── compare_models.py # main benchmark harness (v2.1) — quality scoring
├── decode_rate_bench.py # single-stream decode-rate bench, SSE-streaming, separates TTFT
├── parallel_benchmark.py # standalone throughput benchmark (concurrency-scaling)
├── long_context_test.py # long-context functional test
├── models.yaml # model configurations
├── CONSOLIDATED-FINDINGS.md # detailed analysis and findings (8x A4000 sweep)
├── MODEL-RESULTS-2026.md # cross-model results summary
├── practical/ # individual test scripts (mining pool scenarios)
├── tool_call_benchmark/ # multi-step SSH tool-call reliability bench
├── docs/ # quickstarts, install guide, planned bench notes
├── results/ # generic bench output
│ ├── 8xA4000/ # 8x RTX A4000 sweep
│ ├── compare/ # quality comparison runs
│ ├── gen4/ # PCIe Gen4 test results
│ ├── parallel/ # throughput-only runs
│ └── tp8/ # TP=8 scaling runs
├── archive/
│ └── patches/ # vLLM patches now obsolete upstream
└── projects/
└── turboquant/ # TurboQuant KV-compression project
├── nemo-tq-benchmark.md
├── patches/ # vLLM patches still needed for TQ workflow
├── results/ # V100/A4000 TQ-specific runs
└── tool-call-results/ # tool-call bench: TQ vs BF16/fp8 comparisons
The repo holds two kinds of content:
- The bench utility — runners, configs, generic test suites (root,
tool_call_benchmark/,practical/,results/) - Projects that consume the utility — under
projects/<name>/. Currently justturboquant/. New projects (other quant methods, model evals) should follow the same pattern.
archive/ holds material no longer in active use but kept for reproducibility.
| Model | Quality | Throughput |
|---|---|---|
| Seed-OSS-36B AWQ | 100% (22/22) | 38.4 t/s |
| Qwen3-30B-A3B AWQ | 100% (22/22) | 31.2 t/s |
| Devstral-Small-24B | 95.5% (21/22) | 53.6 t/s |
| Model | Quant | Standard | BWA-MEM2 | Throughput | KV |
|---|---|---|---|---|---|
| Qwen3.6-35B-A3B | AWQ | 22/21 (105%) | 18/30 (60%) | 49.4 t/s @ 14K | TQ-t3nc (155K ctx) |
| Seed-OSS-36B | GPTQ | 95.5% (21/22) | — | 48.3 t/s | fp16 |
| Seed-OSS-36B | AWQ | 95.5% (21/22) | — | 7.0 t/s | fp16 |
V100 Findings:
- GPTQ → AWQ parity restored on Volta. The historical "GPTQ 7× faster than AWQ on V100" was caused by AWQ falling back to Triton (Marlin requires SM75+). Porting InternLM/lmdeploy's TurboMind SM70 m8n8k4 WMMA GEMM kernels (via 1CatAI/1Cat-vLLM) closes the gap and makes modern AWQ-quantized models — including MoE — viable on V100.
- TurboQuant + TurboMind stack runs Qwen3.6-35B-A3B at 49.4 t/s with 155K context on a single V100 32GB. tq-t3nc (3-bit MSE keys, 3-bit values, norm-correction; ~5× KV compression) preserves quality through the practical bench; flash-decode-via-dequant-scratch path is opt-in (
VLLM_TQ_FLASH_DECODE=1) and gains another +4% at long context. - BWA-MEM2 score 18/30 (Qwen3.6) is +6 above the historical 35B-class band (8-12/30). Outperforms Qwen3.5-122B-A10B (15/30) — model 4× larger — on this domain test. See
docs/bwamem2-benchmark.mdfor the rubric. - Reasoning models need budget headroom on the practical bench. The default
max_tokens=4096is too small for Qwen3.6's deep thinking; the model uses the entire budget reasoning before producing an answer (finish_reason: length, content empty). Either disable thinking viachat_template_kwargs={"enable_thinking": False}for parity with non-reasoning baselines, or bumpmax_tokensto 8K+. Qwen3.6 with thinking disabled scores 22/21 (perfect+).
MIT License
Created: January 2026 | Updated: February 2026 For: Mining Pool Development & AI Model Evaluation