Skip to content

MidasMining/llm-bench

Repository files navigation

LLM Benchmark Suite v2.1

Comprehensive benchmarking suite for evaluating local LLM models on real-world coding and debugging tasks. Tests models on production mining pool code with bugs ranging from easy threading issues to expert-level crypto byte order vulnerabilities.

Test Suite

Test Difficulty Checks Description
ZMQ Listener Easy 2 Threading bug in ZMQ socket listener
PPLNS Mining Pool Hard 3 Config mismatch, hardcoded values, unused variables
Payment System Expert 4 Race condition, SQL injection, float precision, atomicity
Stratum Protocol Nightmare 5 Byte order/endianness, info leak, stale data, memory leak, input validation
HiveOS Wrapper Practical 8 Multi-file creation (manifest, config, run, stats scripts)

Total: 22 checks across 5 practical debugging tests


8x RTX A4000 Results (128GB VRAM)

Platform: 8x NVIDIA RTX A4000 (16GB each), AMD EPYC 7532, PCIe Gen4 Software: vLLM 0.14, SGLang 0.3.2, Python 3.12, CUDA 12.8 Date: February 12-15, 2026

Model Rankings

# Model Quant TP Quality Single t/s Peak t/s Context Framework
1 Devstral-2-123B AWQ 8 100% (22/22) 41 300 @ C=32 32K vLLM
2 Nemotron-3-Nano-30B BF16 8 100% (22/22) 262 1307 @ C=64 16K vLLM
3 Qwen3-Coder-30B-A3B AWQ 4 100% (22/22) 184 1025 @ C=32 32K vLLM
4 GLM-4.7-Flash AWQ 4 100% (22/22) 101 566 @ C=8 65K SGLang
5 GLM-4.5-Air AWQ FP16Mix 8 95.5% (21/22) 87 724 @ C=64 8K vLLM
6 Magistral-Small-2509 AWQ 8 95.5% (21/22) 144 1470 @ C=64 32K vLLM
7 Magistral-Small-2506 AWQ 8 95.5% (21/22) 156 1831 @ C=32 32K vLLM
8 QwQ-32B AWQ 8 95.5% (21/22) 102 733 @ C=64 16K vLLM
9 Qwen3-32B AWQ 8 95.5% (21/22) 78 1013 @ C=32 32K vLLM
10 EXAONE-4.0-32B GPTQ g32 8 95.5% (21/22) 110 719 @ C=64 131K vLLM
11 Qwen3-30B-A3B AWQ 4 95.5% (21/22) 178 1575 @ C=32 32K vLLM
12 Devstral-Small-2-24B AWQ 8 95.5% (21/22) 148 1452 @ C=32 32K vLLM
13 Seed-OSS-36B AWQ 8 90.9% (20/22) 88 1163 @ C=32 32K vLLM
14 Qwen3-30B-A3B-Thinking AWQ 4 81.8% (18/22) 160 1031 @ C=32 32K vLLM
15 Nanbeige4.1-3B BF16 4 77.3% (17/22) 187 1239 @ C=64 131K vLLM
16 DS-R1-Distill-Qwen-32B AWQ 8 54.5% (12/22) 78 992 @ C=32 32K vLLM
17 DS-R1-Distill-Llama-70B AWQ 8 45.5% (10/22) 57 540 @ C=32 16K vLLM
18 GPT-OSS-20B MXFP4 8 40.9% (9/22) 52 933 @ C=16 8K vLLM

Category Winners

Category Model Why
Best Quality Devstral-2-123B 100% quality (22/22). 5/5 Stratum also achieved by: Nemotron, Qwen3-Coder, GLM-4.7-Flash, GLM-4.5-Air
Best Overall Nemotron-3-Nano-30B 100% quality + hidden reasoning + 262 t/s single + 1307 t/s peak
Best Code Model Qwen3-Coder-30B-A3B 100% quality, purpose-built for code tasks
Best Throughput Magistral-Small-2506 AWQ 1831 t/s peak, 95.5% quality, only 14GB
Best Long Context Magistral-Small-2509 131K context, 95.5% quality, working reasoning
Best Reasoning Nemotron-3-Nano-30B 100% quality, hidden <think> mode enabled by default, 262 t/s
Best Reasoning (Runner-up) QwQ-32B 95.5% quality with <think> mode, 102 t/s

Detailed Test Results

Model ZMQ (2) PPLNS (3) Payment (4) Stratum (5) HiveOS (8) Total
Devstral-2-123B 2/2 3/3 4/4 5/5 8/8 22/22
Nemotron-3-Nano-30B 2/2 3/3 4/4 5/5 8/8 22/22
Qwen3-Coder-30B-A3B 2/2 3/3 4/4 5/5 8/8 22/22
GLM-4.7-Flash 2/2 3/3 4/4 5/5 8/8 22/22
QwQ-32B 2/2 3/3 4/4 4/5 8/8 21/22
GLM-4.5-Air 2/2 3/3 3/4 5/5 8/8 21/22
Magistral-Small-2509 2/2 3/3 4/4 4/5 8/8 21/22
Magistral-Small-2506 2/2 3/3 4/4 4/5 8/8 21/22
Qwen3-32B 2/2 3/3 4/4 4/5 8/8 21/22
EXAONE-4.0-32B 2/2 3/3 4/4 4/5 8/8 21/22
Qwen3-30B-A3B 2/2 3/3 4/4 4/5 8/8 21/22
Devstral-Small-2-24B 2/2 3/3 4/4 4/5 8/8 21/22
Seed-OSS-36B 2/2 3/3 4/4 3/5 8/8 20/22
Qwen3-30B-A3B-Think 2/2 2/3 4/4 4/5 6/8 18/22
Nanbeige4.1-3B 2/2 3/3 3/4 4/5 5/8 17/22
DS-R1-Distill-Qwen-32B 2/2 2/3 0/4 0/5 8/8 12/22
DS-R1-Distill-Llama-70B 2/2 0/3 0/4 0/5 8/8 10/22
GPT-OSS-20B 0/2 3/3 0/4 0/5 6/8 9/22

Throughput Scaling (tokens/second)

Concurrency Nemotron QwQ-32B Mag-2506 Mag-2509 Qwen3-30B Devstral-S Nanbeige-3B Seed-OSS Qwen3-Coder GLM-4.5 Qwen3-32B EXAONE Devstral-2
1 262 102 156 144 178 180 249 78 184 119 78 110 41
2 253 108 275 387 347 186 251 168 333 122 170 142 82
4 280 136 524 281 621 425 279 373 462 150 331 161 124
8 426 234 839 486 869 691 453 616 589 241 571 279 185
16 637 416 1266 827 1208 1051 737 923 840 350 825 444 271
32 479 611 1831 1238 1575 1452 1044 1163 1025 523 1013 636 300
64 1307 733 - 1470 - - 1239 - - 724 - 719 -

Failed / Incompatible Models

Model Issue
GPT-OSS-120B "AWQ" twhitworth repo: only 6/36 layers present, FP16 not AWQ. No viable quant exists.
GLM-4.5-Air AWQ Marlin error FIXED: QuantTrio FP16Mix + --enable-expert-parallel → 95.5%, 5/5 Stratum
Qwen3-Next-80B-A3B AWQ 2 KV heads (max TP=2), 24.5GB/GPU exceeds 16GB
Qwen3-Coder-Next AWQ 2 KV heads (max TP=2), needs vLLM 0.15+
EXAONE-4.0-32B AWQ/GPTQ g128 Marlin min_thread_k=128 alignment fails at TP>2 (3424%128≠0). Fixed with custom GPTQ g32
Qwen3-480B-Coder AWQ 236GB model, 128GB VRAM. At TP=8: 29.5GB/GPU needed vs 14.7GB available. Needs ~500GB VRAM.

Key Findings

1. Hidden Reasoning: Nemotron Was Thinking All Along

Nemotron-3-Nano-30B has enable_thinking=True as the DEFAULT in its chat template. It uses <think>/</think> tags (DeepSeek R1 format) and was reasoning during all benchmarks. This makes it the only 100% quality reasoning model - and the fastest. The top 4 models by quality (100%) include 3 non-reasoning (Devstral-2, Qwen3-Coder, GLM-4.7-Flash) and 1 hidden reasoner (Nemotron).

2. The Hardest Test: Stratum Byte Order

Only 4 models found the byte order/endianness bug in the Stratum protocol test. This is the strongest differentiator between 95.5% and 100% quality models.

3. Nemotron-3-Nano-30B: Best Overall (Hidden Reasoning Champion)

Mamba+MoE hybrid architecture achieves 100% quality AND fastest single-request speed (262 t/s). At 59GB BF16, it uses ~7.4GB/GPU at TP=8. Has hidden <think> reasoning enabled by default in chat template - was thinking during ALL benchmarks. QwQ-32B AWQ (95.5%, 102 t/s) is a solid reasoning alternative.

4. Magistral-2506 AWQ: Throughput King

At only 14GB (24B params), it achieves 1831 t/s peak - highest of any model. With only ~1.75GB/GPU for weights, it has massive KV cache headroom.

5. fp8 KV Cache Doubles Capacity

After patching two vLLM bugs (flashinfer positional args + compressed-tensors false rejection), fp8 KV cache works for all quant formats:

  • Devstral-2-123B: 16K -> 32K context
  • Seed-OSS-36B: 32K -> 64K context
  • Magistral: 1.25M tokens, 38x concurrency

6. SGLang vs vLLM: vLLM Wins

Tested 3 models on SGLang - all performed worse than vLLM (Nemotron -47% peak, Qwen3-Coder quality collapsed to 54.5%). Exception: GLM-4.7-Flash works ONLY on SGLang.

7. Pipeline Parallelism Not Beneficial

EXAONE TP=2+PP=4 (8 GPUs): quality dropped 18pp, throughput dropped 26%. Only gained context length. Pipeline latency overhead outweighs memory savings.

8. Nanbeige4.1-3B: 3B Reasoning Model Punches Above Its Weight

At only 3B parameters (~6GB BF16), Nanbeige4.1-3B scores 77.3% - matching EXAONE-4.0-32B (10x larger). Uses <think> reasoning tags, 131K context, and achieves 187 t/s single-request. Fits on a single 16GB GPU. Best quality-per-parameter ratio in the benchmark.

9. Custom Quantization Unlocks Performance

  • EXAONE-4.0-32B: Stuck at TP=2 with AWQ/GPTQ g128 (Marlin alignment). Custom GPTQ g32 + --dtype float16 enables TP=8: 110 t/s (+67%), quality preserved.
  • Magistral-Small-2509: Multimodal model (Mistral3ForConditionalGeneration) had no community AWQ. Extracted text model from multimodal wrapper, AWQ quantized with fragmentation fix. Result: 144 t/s (+64% vs BF16), 1470 t/s peak (+37%), quality preserved at 95.5%.

TP Compatibility Reference

Model Q Heads KV Heads Max TP Type Notes
Devstral-2-123B 96 8 8 Dense 123B, tight VRAM
Magistral-Small 32 8 8 Dense 24B, both 2506/2509
QwQ-32B 40 8 8 Dense Qwen2-based reasoning
Qwen3-32B 64 8 8 Dense Reasoning
Seed-OSS-36B 80 8 8 Dense Reasoning
Nemotron-3-Nano-30B 32 8 8 Mamba+MoE --trust-remote-code
Devstral-Small-2-24B 32 8 8 Dense Mistral3
GPT-OSS-20B 64 8 8 MoE enforce-eager only
Qwen3-30B-A3B 32 4 4 MoE 128 experts, 8 active
Qwen3-Coder-30B-A3B 32 4 4 MoE 128 experts, 8 active
GLM-4.5-Air 96 8 8 MoE 128E/8A, needs --enable-expert-parallel, QuantTrio FP16Mix
GLM-4.7-Flash 20 20 4 MoE SGLang only
Nanbeige4.1-3B 20 4 4 Dense 3B reasoning, fits single GPU
EXAONE-4.0-32B 40 8 8 Dense Custom GPTQ g32 + float16 for TP=8

Recommended Configurations

Daily Coding (quality-first, C=1-2)

# Qwen3-Coder-30B-A3B x2 replicas (100% quality, 184 t/s each)
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ \
  --tensor-parallel-size 4 --gpu-memory-utilization 0.90 \
  --max-num-seqs 16 --max-model-len 32768 --port 8000

CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ \
  --tensor-parallel-size 4 --gpu-memory-utilization 0.90 \
  --max-num-seqs 16 --max-model-len 32768 --port 8001

Long Context Reasoning

# Magistral-Small-2509 BF16 (95.5%, 131K context, [THINK] reasoning)
vllm serve mistralai/Magistral-Small-2509 \
  --tensor-parallel-size 8 --gpu-memory-utilization 0.90 \
  --max-num-seqs 48 --tokenizer-mode mistral --config-format mistral \
  --load-format mistral --reasoning-parser mistral --port 8000

High-Throughput Batch

# Magistral-Small-2506 AWQ (1831 t/s peak, 95.5% quality)
vllm serve abhishekchohan/Magistral-Small-2506-AWQ \
  --tensor-parallel-size 8 --gpu-memory-utilization 0.90 \
  --max-num-seqs 48 --max-model-len 32768 --port 8000

Maximum Quality

# Devstral-2-123B with fp8 KV (100% quality, 32K context)
vllm serve cyankiwi/Devstral-2-123B-Instruct-2512-AWQ-4bit \
  --tensor-parallel-size 8 --gpu-memory-utilization 0.90 \
  --max-num-seqs 16 --max-model-len 32768 \
  --kv-cache-dtype fp8_e5m2 --port 8000

vLLM Patches

Two bugs in vLLM 0.14 had to be patched for fp8 KV cache to work across all quant formats. Status as of vLLM 0.20.0:

Patch Upstream status Location
FlashInfer positional arg fix Fixed in 0.18.0+ (still fixed at 0.20.0) archive/patches/
Compressed-tensors false fp8 KV rejection Still broken at 0.20.0 projects/turboquant/patches/

See the README in each patches dir for the diff and apply instructions.


Quick Start

# Install
pip install pyyaml requests

# Quality bench (5 prompts, 22 pass/fail checks against rubric)
python compare_models.py --model <model-id> --api-url http://localhost:8000/v1

# With parallel throughput test
python compare_models.py --model <model-id> --api-url http://localhost:8000/v1 --parallel

# With server config metadata
python compare_models.py --model <model-id> --api-url http://localhost:8000/v1 \
  --parallel --tp 8 --framework vllm --quant-method awq \
  --output ./results/8xA4000

# Decode-rate benchmark (single-stream, isolates decode from prefill cost)
python decode_rate_bench.py --api-url http://localhost:8000/v1 --model <model-id> \
  --tag "config description" --targets 500 2000 4000 8000 14000

When to use which

Tool Purpose Output
compare_models.py "Is this model worth running?" Quality score (22-check rubric) + ballpark throughput
decode_rate_bench.py "How fast is this kernel/config?" Pure single-stream decode rate (t/s) at varying context, separated TTFT, handles reasoning_content deltas
parallel_benchmark.py Multi-stream peak throughput Concurrency-scaling t/s curve
long_context_test.py Functional test at long context Pass/fail at target ctx
tool_call_benchmark/ Multi-step SSH tool-call reliability See subdir README

decode_rate_bench.py is the canonical "how fast does it decode?" tool — use it when comparing kernels, KV-quant settings, or autotune results. compare_models.py reports throughput as a side effect of quality testing, but its numbers are distorted by reasoning-token accounting on thinking models. For honest decode-rate comparisons, prefer decode_rate_bench.py.

For the multi-step SSH tool-call reliability bench, see tool_call_benchmark/README.md.

File Structure

llm-bench/
├── compare_models.py              # main benchmark harness (v2.1) — quality scoring
├── decode_rate_bench.py           # single-stream decode-rate bench, SSE-streaming, separates TTFT
├── parallel_benchmark.py          # standalone throughput benchmark (concurrency-scaling)
├── long_context_test.py           # long-context functional test
├── models.yaml                    # model configurations
├── CONSOLIDATED-FINDINGS.md       # detailed analysis and findings (8x A4000 sweep)
├── MODEL-RESULTS-2026.md          # cross-model results summary
├── practical/                     # individual test scripts (mining pool scenarios)
├── tool_call_benchmark/           # multi-step SSH tool-call reliability bench
├── docs/                          # quickstarts, install guide, planned bench notes
├── results/                       # generic bench output
│   ├── 8xA4000/                   # 8x RTX A4000 sweep
│   ├── compare/                   # quality comparison runs
│   ├── gen4/                      # PCIe Gen4 test results
│   ├── parallel/                  # throughput-only runs
│   └── tp8/                       # TP=8 scaling runs
├── archive/
│   └── patches/                   # vLLM patches now obsolete upstream
└── projects/
    └── turboquant/                # TurboQuant KV-compression project
        ├── nemo-tq-benchmark.md
        ├── patches/               # vLLM patches still needed for TQ workflow
        ├── results/               # V100/A4000 TQ-specific runs
        └── tool-call-results/     # tool-call bench: TQ vs BF16/fp8 comparisons

The repo holds two kinds of content:

  • The bench utility — runners, configs, generic test suites (root, tool_call_benchmark/, practical/, results/)
  • Projects that consume the utility — under projects/<name>/. Currently just turboquant/. New projects (other quant methods, model evals) should follow the same pattern.

archive/ holds material no longer in active use but kept for reproducibility.

Historical Results

RTX 5090 (Single GPU, 32GB)

Model Quality Throughput
Seed-OSS-36B AWQ 100% (22/22) 38.4 t/s
Qwen3-30B-A3B AWQ 100% (22/22) 31.2 t/s
Devstral-Small-24B 95.5% (21/22) 53.6 t/s

Tesla V100 (Single GPU, 32GB)

Model Quant Standard BWA-MEM2 Throughput KV
Qwen3.6-35B-A3B AWQ 22/21 (105%) 18/30 (60%) 49.4 t/s @ 14K TQ-t3nc (155K ctx)
Seed-OSS-36B GPTQ 95.5% (21/22) 48.3 t/s fp16
Seed-OSS-36B AWQ 95.5% (21/22) 7.0 t/s fp16

V100 Findings:

  1. GPTQ → AWQ parity restored on Volta. The historical "GPTQ 7× faster than AWQ on V100" was caused by AWQ falling back to Triton (Marlin requires SM75+). Porting InternLM/lmdeploy's TurboMind SM70 m8n8k4 WMMA GEMM kernels (via 1CatAI/1Cat-vLLM) closes the gap and makes modern AWQ-quantized models — including MoE — viable on V100.
  2. TurboQuant + TurboMind stack runs Qwen3.6-35B-A3B at 49.4 t/s with 155K context on a single V100 32GB. tq-t3nc (3-bit MSE keys, 3-bit values, norm-correction; ~5× KV compression) preserves quality through the practical bench; flash-decode-via-dequant-scratch path is opt-in (VLLM_TQ_FLASH_DECODE=1) and gains another +4% at long context.
  3. BWA-MEM2 score 18/30 (Qwen3.6) is +6 above the historical 35B-class band (8-12/30). Outperforms Qwen3.5-122B-A10B (15/30) — model 4× larger — on this domain test. See docs/bwamem2-benchmark.md for the rubric.
  4. Reasoning models need budget headroom on the practical bench. The default max_tokens=4096 is too small for Qwen3.6's deep thinking; the model uses the entire budget reasoning before producing an answer (finish_reason: length, content empty). Either disable thinking via chat_template_kwargs={"enable_thinking": False} for parity with non-reasoning baselines, or bump max_tokens to 8K+. Qwen3.6 with thinking disabled scores 22/21 (perfect+).

License

MIT License

Created: January 2026 | Updated: February 2026 For: Mining Pool Development & AI Model Evaluation

About

custom benchmark for llms

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages