LLM Benchmark Suite v2.1

Comprehensive benchmarking suite for evaluating local LLM models on real-world coding and debugging tasks. Tests models on production mining pool code with bugs ranging from easy threading issues to expert-level crypto byte order vulnerabilities.

Test Suite

Test	Difficulty	Checks	Description
ZMQ Listener	Easy	2	Threading bug in ZMQ socket listener
PPLNS Mining Pool	Hard	3	Config mismatch, hardcoded values, unused variables
Payment System	Expert	4	Race condition, SQL injection, float precision, atomicity
Stratum Protocol	Nightmare	5	Byte order/endianness, info leak, stale data, memory leak, input validation
HiveOS Wrapper	Practical	8	Multi-file creation (manifest, config, run, stats scripts)

Total: 22 checks across 5 practical debugging tests

8x RTX A4000 Results (128GB VRAM)

Platform: 8x NVIDIA RTX A4000 (16GB each), AMD EPYC 7532, PCIe Gen4 Software: vLLM 0.14, SGLang 0.3.2, Python 3.12, CUDA 12.8 Date: February 12-15, 2026

Model Rankings

#	Model	Quant	TP	Quality	Single t/s	Peak t/s	Context	Framework
1	Devstral-2-123B	AWQ	8	100% (22/22)	41	300 @ C=32	32K	vLLM
2	Nemotron-3-Nano-30B	BF16	8	100% (22/22)	262	1307 @ C=64	16K	vLLM
3	Qwen3-Coder-30B-A3B	AWQ	4	100% (22/22)	184	1025 @ C=32	32K	vLLM
4	GLM-4.7-Flash	AWQ	4	100% (22/22)	101	566 @ C=8	65K	SGLang
5	GLM-4.5-Air	AWQ FP16Mix	8	95.5% (21/22)	87	724 @ C=64	8K	vLLM
6	Magistral-Small-2509	AWQ	8	95.5% (21/22)	144	1470 @ C=64	32K	vLLM
7	Magistral-Small-2506	AWQ	8	95.5% (21/22)	156	1831 @ C=32	32K	vLLM
8	QwQ-32B	AWQ	8	95.5% (21/22)	102	733 @ C=64	16K	vLLM
9	Qwen3-32B	AWQ	8	95.5% (21/22)	78	1013 @ C=32	32K	vLLM
10	EXAONE-4.0-32B	GPTQ g32	8	95.5% (21/22)	110	719 @ C=64	131K	vLLM
11	Qwen3-30B-A3B	AWQ	4	95.5% (21/22)	178	1575 @ C=32	32K	vLLM
12	Devstral-Small-2-24B	AWQ	8	95.5% (21/22)	148	1452 @ C=32	32K	vLLM
13	Seed-OSS-36B	AWQ	8	90.9% (20/22)	88	1163 @ C=32	32K	vLLM
14	Qwen3-30B-A3B-Thinking	AWQ	4	81.8% (18/22)	160	1031 @ C=32	32K	vLLM
15	Nanbeige4.1-3B	BF16	4	77.3% (17/22)	187	1239 @ C=64	131K	vLLM
16	DS-R1-Distill-Qwen-32B	AWQ	8	54.5% (12/22)	78	992 @ C=32	32K	vLLM
17	DS-R1-Distill-Llama-70B	AWQ	8	45.5% (10/22)	57	540 @ C=32	16K	vLLM
18	GPT-OSS-20B	MXFP4	8	40.9% (9/22)	52	933 @ C=16	8K	vLLM

Category Winners

Category	Model	Why
Best Quality	Devstral-2-123B	100% quality (22/22). 5/5 Stratum also achieved by: Nemotron, Qwen3-Coder, GLM-4.7-Flash, GLM-4.5-Air
Best Overall	Nemotron-3-Nano-30B	100% quality + hidden reasoning + 262 t/s single + 1307 t/s peak
Best Code Model	Qwen3-Coder-30B-A3B	100% quality, purpose-built for code tasks
Best Throughput	Magistral-Small-2506 AWQ	1831 t/s peak, 95.5% quality, only 14GB
Best Long Context	Magistral-Small-2509	131K context, 95.5% quality, working reasoning
Best Reasoning	Nemotron-3-Nano-30B	100% quality, hidden `<think>` mode enabled by default, 262 t/s
Best Reasoning (Runner-up)	QwQ-32B	95.5% quality with `<think>` mode, 102 t/s

Detailed Test Results

Model	ZMQ (2)	PPLNS (3)	Payment (4)	Stratum (5)	HiveOS (8)	Total
Devstral-2-123B	2/2	3/3	4/4	5/5	8/8	22/22
Nemotron-3-Nano-30B	2/2	3/3	4/4	5/5	8/8	22/22
Qwen3-Coder-30B-A3B	2/2	3/3	4/4	5/5	8/8	22/22
GLM-4.7-Flash	2/2	3/3	4/4	5/5	8/8	22/22
QwQ-32B	2/2	3/3	4/4	4/5	8/8	21/22
GLM-4.5-Air	2/2	3/3	3/4	5/5	8/8	21/22
Magistral-Small-2509	2/2	3/3	4/4	4/5	8/8	21/22
Magistral-Small-2506	2/2	3/3	4/4	4/5	8/8	21/22
Qwen3-32B	2/2	3/3	4/4	4/5	8/8	21/22
EXAONE-4.0-32B	2/2	3/3	4/4	4/5	8/8	21/22
Qwen3-30B-A3B	2/2	3/3	4/4	4/5	8/8	21/22
Devstral-Small-2-24B	2/2	3/3	4/4	4/5	8/8	21/22
Seed-OSS-36B	2/2	3/3	4/4	3/5	8/8	20/22
Qwen3-30B-A3B-Think	2/2	2/3	4/4	4/5	6/8	18/22
Nanbeige4.1-3B	2/2	3/3	3/4	4/5	5/8	17/22
DS-R1-Distill-Qwen-32B	2/2	2/3	0/4	0/5	8/8	12/22
DS-R1-Distill-Llama-70B	2/2	0/3	0/4	0/5	8/8	10/22
GPT-OSS-20B	0/2	3/3	0/4	0/5	6/8	9/22

Throughput Scaling (tokens/second)

Concurrency	Nemotron	QwQ-32B	Mag-2506	Mag-2509	Qwen3-30B	Devstral-S	Nanbeige-3B	Seed-OSS	Qwen3-Coder	GLM-4.5	Qwen3-32B	EXAONE	Devstral-2
1	262	102	156	144	178	180	249	78	184	119	78	110	41
2	253	108	275	387	347	186	251	168	333	122	170	142	82
4	280	136	524	281	621	425	279	373	462	150	331	161	124
8	426	234	839	486	869	691	453	616	589	241	571	279	185
16	637	416	1266	827	1208	1051	737	923	840	350	825	444	271
32	479	611	1831	1238	1575	1452	1044	1163	1025	523	1013	636	300
64	1307	733	-	1470	-	-	1239	-	-	724	-	719	-

Failed / Incompatible Models

Model	Issue
GPT-OSS-120B "AWQ"	twhitworth repo: only 6/36 layers present, FP16 not AWQ. No viable quant exists.
~~GLM-4.5-Air AWQ~~	~~Marlin error~~ FIXED: QuantTrio FP16Mix + `--enable-expert-parallel` → 95.5%, 5/5 Stratum
Qwen3-Next-80B-A3B AWQ	2 KV heads (max TP=2), 24.5GB/GPU exceeds 16GB
Qwen3-Coder-Next AWQ	2 KV heads (max TP=2), needs vLLM 0.15+
EXAONE-4.0-32B AWQ/GPTQ g128	Marlin min_thread_k=128 alignment fails at TP>2 (3424%128≠0). Fixed with custom GPTQ g32
Qwen3-480B-Coder AWQ	236GB model, 128GB VRAM. At TP=8: 29.5GB/GPU needed vs 14.7GB available. Needs ~500GB VRAM.

Key Findings

1. Hidden Reasoning: Nemotron Was Thinking All Along

Nemotron-3-Nano-30B has enable_thinking=True as the DEFAULT in its chat template. It uses <think>/</think> tags (DeepSeek R1 format) and was reasoning during all benchmarks. This makes it the only 100% quality reasoning model - and the fastest. The top 4 models by quality (100%) include 3 non-reasoning (Devstral-2, Qwen3-Coder, GLM-4.7-Flash) and 1 hidden reasoner (Nemotron).

2. The Hardest Test: Stratum Byte Order

Only 4 models found the byte order/endianness bug in the Stratum protocol test. This is the strongest differentiator between 95.5% and 100% quality models.

3. Nemotron-3-Nano-30B: Best Overall (Hidden Reasoning Champion)

Mamba+MoE hybrid architecture achieves 100% quality AND fastest single-request speed (262 t/s). At 59GB BF16, it uses ~7.4GB/GPU at TP=8. Has hidden <think> reasoning enabled by default in chat template - was thinking during ALL benchmarks. QwQ-32B AWQ (95.5%, 102 t/s) is a solid reasoning alternative.

4. Magistral-2506 AWQ: Throughput King

At only 14GB (24B params), it achieves 1831 t/s peak - highest of any model. With only ~1.75GB/GPU for weights, it has massive KV cache headroom.

5. fp8 KV Cache Doubles Capacity

After patching two vLLM bugs (flashinfer positional args + compressed-tensors false rejection), fp8 KV cache works for all quant formats:

Devstral-2-123B: 16K -> 32K context
Seed-OSS-36B: 32K -> 64K context
Magistral: 1.25M tokens, 38x concurrency

6. SGLang vs vLLM: vLLM Wins

Tested 3 models on SGLang - all performed worse than vLLM (Nemotron -47% peak, Qwen3-Coder quality collapsed to 54.5%). Exception: GLM-4.7-Flash works ONLY on SGLang.

7. Pipeline Parallelism Not Beneficial

EXAONE TP=2+PP=4 (8 GPUs): quality dropped 18pp, throughput dropped 26%. Only gained context length. Pipeline latency overhead outweighs memory savings.

8. Nanbeige4.1-3B: 3B Reasoning Model Punches Above Its Weight

At only 3B parameters (~6GB BF16), Nanbeige4.1-3B scores 77.3% - matching EXAONE-4.0-32B (10x larger). Uses <think> reasoning tags, 131K context, and achieves 187 t/s single-request. Fits on a single 16GB GPU. Best quality-per-parameter ratio in the benchmark.

9. Custom Quantization Unlocks Performance

EXAONE-4.0-32B: Stuck at TP=2 with AWQ/GPTQ g128 (Marlin alignment). Custom GPTQ g32 + --dtype float16 enables TP=8: 110 t/s (+67%), quality preserved.
Magistral-Small-2509: Multimodal model (Mistral3ForConditionalGeneration) had no community AWQ. Extracted text model from multimodal wrapper, AWQ quantized with fragmentation fix. Result: 144 t/s (+64% vs BF16), 1470 t/s peak (+37%), quality preserved at 95.5%.

TP Compatibility Reference

Model	Q Heads	KV Heads	Max TP	Type	Notes
Devstral-2-123B	96	8	8	Dense	123B, tight VRAM
Magistral-Small	32	8	8	Dense	24B, both 2506/2509
QwQ-32B	40	8	8	Dense	Qwen2-based reasoning
Qwen3-32B	64	8	8	Dense	Reasoning
Seed-OSS-36B	80	8	8	Dense	Reasoning
Nemotron-3-Nano-30B	32	8	8	Mamba+MoE	--trust-remote-code
Devstral-Small-2-24B	32	8	8	Dense	Mistral3
GPT-OSS-20B	64	8	8	MoE	enforce-eager only
Qwen3-30B-A3B	32	4	4	MoE	128 experts, 8 active
Qwen3-Coder-30B-A3B	32	4	4	MoE	128 experts, 8 active
GLM-4.5-Air	96	8	8	MoE	128E/8A, needs --enable-expert-parallel, QuantTrio FP16Mix
GLM-4.7-Flash	20	20	4	MoE	SGLang only
Nanbeige4.1-3B	20	4	4	Dense	3B reasoning, fits single GPU
EXAONE-4.0-32B	40	8	8	Dense	Custom GPTQ g32 + float16 for TP=8

Recommended Configurations

Daily Coding (quality-first, C=1-2)

# Qwen3-Coder-30B-A3B x2 replicas (100% quality, 184 t/s each)
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ \
  --tensor-parallel-size 4 --gpu-memory-utilization 0.90 \
  --max-num-seqs 16 --max-model-len 32768 --port 8000

CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ \
  --tensor-parallel-size 4 --gpu-memory-utilization 0.90 \
  --max-num-seqs 16 --max-model-len 32768 --port 8001

Long Context Reasoning

# Magistral-Small-2509 BF16 (95.5%, 131K context, [THINK] reasoning)
vllm serve mistralai/Magistral-Small-2509 \
  --tensor-parallel-size 8 --gpu-memory-utilization 0.90 \
  --max-num-seqs 48 --tokenizer-mode mistral --config-format mistral \
  --load-format mistral --reasoning-parser mistral --port 8000

High-Throughput Batch

# Magistral-Small-2506 AWQ (1831 t/s peak, 95.5% quality)
vllm serve abhishekchohan/Magistral-Small-2506-AWQ \
  --tensor-parallel-size 8 --gpu-memory-utilization 0.90 \
  --max-num-seqs 48 --max-model-len 32768 --port 8000

Maximum Quality

# Devstral-2-123B with fp8 KV (100% quality, 32K context)
vllm serve cyankiwi/Devstral-2-123B-Instruct-2512-AWQ-4bit \
  --tensor-parallel-size 8 --gpu-memory-utilization 0.90 \
  --max-num-seqs 16 --max-model-len 32768 \
  --kv-cache-dtype fp8_e5m2 --port 8000

vLLM Patches

Two bugs in vLLM 0.14 had to be patched for fp8 KV cache to work across all quant formats. Status as of vLLM 0.20.0:

Patch	Upstream status	Location
FlashInfer positional arg fix	Fixed in 0.18.0+ (still fixed at 0.20.0)	`archive/patches/`
Compressed-tensors false fp8 KV rejection	Still broken at 0.20.0	`projects/turboquant/patches/`

See the README in each patches dir for the diff and apply instructions.

Quick Start

# Install
pip install pyyaml requests

# Quality bench (5 prompts, 22 pass/fail checks against rubric)
python compare_models.py --model <model-id> --api-url http://localhost:8000/v1

# With parallel throughput test
python compare_models.py --model <model-id> --api-url http://localhost:8000/v1 --parallel

# With server config metadata
python compare_models.py --model <model-id> --api-url http://localhost:8000/v1 \
  --parallel --tp 8 --framework vllm --quant-method awq \
  --output ./results/8xA4000

# Decode-rate benchmark (single-stream, isolates decode from prefill cost)
python decode_rate_bench.py --api-url http://localhost:8000/v1 --model <model-id> \
  --tag "config description" --targets 500 2000 4000 8000 14000

When to use which

Tool	Purpose	Output
`compare_models.py`	"Is this model worth running?"	Quality score (22-check rubric) + ballpark throughput
`decode_rate_bench.py`	"How fast is this kernel/config?"	Pure single-stream decode rate (t/s) at varying context, separated TTFT, handles `reasoning_content` deltas
`parallel_benchmark.py`	Multi-stream peak throughput	Concurrency-scaling t/s curve
`long_context_test.py`	Functional test at long context	Pass/fail at target ctx
`tool_call_benchmark/`	Multi-step SSH tool-call reliability	See subdir README

decode_rate_bench.py is the canonical "how fast does it decode?" tool — use it when comparing kernels, KV-quant settings, or autotune results. compare_models.py reports throughput as a side effect of quality testing, but its numbers are distorted by reasoning-token accounting on thinking models. For honest decode-rate comparisons, prefer decode_rate_bench.py.

For the multi-step SSH tool-call reliability bench, see tool_call_benchmark/README.md.

File Structure

llm-bench/
├── compare_models.py              # main benchmark harness (v2.1) — quality scoring
├── decode_rate_bench.py           # single-stream decode-rate bench, SSE-streaming, separates TTFT
├── parallel_benchmark.py          # standalone throughput benchmark (concurrency-scaling)
├── long_context_test.py           # long-context functional test
├── models.yaml                    # model configurations
├── CONSOLIDATED-FINDINGS.md       # detailed analysis and findings (8x A4000 sweep)
├── MODEL-RESULTS-2026.md          # cross-model results summary
├── practical/                     # individual test scripts (mining pool scenarios)
├── tool_call_benchmark/           # multi-step SSH tool-call reliability bench
├── docs/                          # quickstarts, install guide, planned bench notes
├── results/                       # generic bench output
│   ├── 8xA4000/                   # 8x RTX A4000 sweep
│   ├── compare/                   # quality comparison runs
│   ├── gen4/                      # PCIe Gen4 test results
│   ├── parallel/                  # throughput-only runs
│   └── tp8/                       # TP=8 scaling runs
├── archive/
│   └── patches/                   # vLLM patches now obsolete upstream
└── projects/
    └── turboquant/                # TurboQuant KV-compression project
        ├── nemo-tq-benchmark.md
        ├── patches/               # vLLM patches still needed for TQ workflow
        ├── results/               # V100/A4000 TQ-specific runs
        └── tool-call-results/     # tool-call bench: TQ vs BF16/fp8 comparisons

The repo holds two kinds of content:

The bench utility — runners, configs, generic test suites (root, tool_call_benchmark/, practical/, results/)
Projects that consume the utility — under projects/<name>/. Currently just turboquant/. New projects (other quant methods, model evals) should follow the same pattern.

archive/ holds material no longer in active use but kept for reproducibility.

Historical Results

RTX 5090 (Single GPU, 32GB)

Model	Quality	Throughput
Seed-OSS-36B AWQ	100% (22/22)	38.4 t/s
Qwen3-30B-A3B AWQ	100% (22/22)	31.2 t/s
Devstral-Small-24B	95.5% (21/22)	53.6 t/s

Tesla V100 (Single GPU, 32GB)

Model	Quant	Standard	BWA-MEM2	Throughput	KV
Qwen3.6-35B-A3B	AWQ	22/21 (105%)	18/30 (60%)	49.4 t/s @ 14K	TQ-t3nc (155K ctx)
Seed-OSS-36B	GPTQ	95.5% (21/22)	—	48.3 t/s	fp16
Seed-OSS-36B	AWQ	95.5% (21/22)	—	7.0 t/s	fp16

V100 Findings:

GPTQ → AWQ parity restored on Volta. The historical "GPTQ 7× faster than AWQ on V100" was caused by AWQ falling back to Triton (Marlin requires SM75+). Porting InternLM/lmdeploy's TurboMind SM70 m8n8k4 WMMA GEMM kernels (via 1CatAI/1Cat-vLLM) closes the gap and makes modern AWQ-quantized models — including MoE — viable on V100.
TurboQuant + TurboMind stack runs Qwen3.6-35B-A3B at 49.4 t/s with 155K context on a single V100 32GB. tq-t3nc (3-bit MSE keys, 3-bit values, norm-correction; ~5× KV compression) preserves quality through the practical bench; flash-decode-via-dequant-scratch path is opt-in (VLLM_TQ_FLASH_DECODE=1) and gains another +4% at long context.
BWA-MEM2 score 18/30 (Qwen3.6) is +6 above the historical 35B-class band (8-12/30). Outperforms Qwen3.5-122B-A10B (15/30) — model 4× larger — on this domain test. See docs/bwamem2-benchmark.md for the rubric.
Reasoning models need budget headroom on the practical bench. The default max_tokens=4096 is too small for Qwen3.6's deep thinking; the model uses the entire budget reasoning before producing an answer (finish_reason: length, content empty). Either disable thinking via chat_template_kwargs={"enable_thinking": False} for parity with non-reasoning baselines, or bump max_tokens to 8K+. Qwen3.6 with thinking disabled scores 22/21 (perfect+).

License

MIT License

Created: January 2026 | Updated: February 2026 For: Mining Pool Development & AI Model Evaluation

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
archive/patches		archive/patches
coding_benchmark		coding_benchmark
docs		docs
practical		practical
projects/turboquant		projects/turboquant
results		results
tool_call_benchmark		tool_call_benchmark
.gitignore		.gitignore
CONSOLIDATED-FINDINGS.md		CONSOLIDATED-FINDINGS.md
MODEL-RESULTS-2026.md		MODEL-RESULTS-2026.md
README.md		README.md
compare_models.py		compare_models.py
decode_rate_bench.py		decode_rate_bench.py
long_context_test.py		long_context_test.py
models.yaml		models.yaml
parallel_benchmark.py		parallel_benchmark.py

Folders and files

Latest commit

History

Repository files navigation

LLM Benchmark Suite v2.1

Test Suite

8x RTX A4000 Results (128GB VRAM)

Model Rankings

Category Winners

Detailed Test Results

Throughput Scaling (tokens/second)

Failed / Incompatible Models

Key Findings

1. Hidden Reasoning: Nemotron Was Thinking All Along

2. The Hardest Test: Stratum Byte Order

3. Nemotron-3-Nano-30B: Best Overall (Hidden Reasoning Champion)

4. Magistral-2506 AWQ: Throughput King

5. fp8 KV Cache Doubles Capacity

6. SGLang vs vLLM: vLLM Wins

7. Pipeline Parallelism Not Beneficial

8. Nanbeige4.1-3B: 3B Reasoning Model Punches Above Its Weight

9. Custom Quantization Unlocks Performance

TP Compatibility Reference

Recommended Configurations

Daily Coding (quality-first, C=1-2)

Long Context Reasoning

High-Throughput Batch

Maximum Quality

vLLM Patches

Quick Start

When to use which

File Structure

Historical Results

RTX 5090 (Single GPU, 32GB)

Tesla V100 (Single GPU, 32GB)

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages