
[cuda] Fix mmq/mma path #1

Merged
khosravipasha merged 1 commit into prism from mmq on Mar 19, 2026

Conversation

@khosravipasha
Collaborator

Fixes the prompt-processing MMQ kernels for Q1_0 and Q1_0_g128; previously these types fell back to cuBLAS, which is much slower.

Copilot AI review requested due to automatic review settings on March 19, 2026 at 03:51
@khosravipasha merged commit bc8122e into prism on Mar 19, 2026
52 of 84 checks passed

Copilot AI left a comment


Pull request overview

Enables the CUDA MMQ (MMA) path for Q1_0 and Q1_0_g128 to avoid slow cuBLAS fallback during prompt processing.

Changes:

  • Add/enable MMQ template instantiations and runtime dispatch for Q1_0 and Q1_0_g128 (see the instance-file sketch after this list).
  • Implement/adjust Q1_0 and Q1_0_g128 MMA tile loading logic and explicitly disable the DP4A path for these types.
  • Update model ftype display strings and extend the CUDA template-instance generator to include Q1_0 variants.
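As a rough illustration of what those instance files contain: upstream llama.cpp keeps one tiny .cu file per quantization type so the heavy MMQ templates compile in parallel translation units. A sketch of the Q1_0 instance, assuming the fork reuses upstream's DECL_MMQ_CASE macro from mmq.cuh (GGML_TYPE_Q1_0 itself is fork-specific):

```cuda
// ggml/src/ggml-cuda/template-instances/mmq-instance-q1_0.cu (sketch)
// Emitted by generate_cu_files.py; one file per type keeps build times sane.
#include "../mmq.cuh"           // defines DECL_MMQ_CASE and the MMQ kernels

DECL_MMQ_CASE(GGML_TYPE_Q1_0);  // instantiates mul_mat_q_case<GGML_TYPE_Q1_0>
```

The g128 variant would be the same two lines with GGML_TYPE_Q1_0_g128 substituted.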

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/llama-model-loader.cpp | Simplifies displayed ftype names for Q1_0 and Q1_0_g128. |
| ggml/src/ggml-cuda/template-instances/mmq-instance-q1_0.cu | Adds MMQ instantiation for GGML_TYPE_Q1_0. |
| ggml/src/ggml-cuda/template-instances/mmq-instance-q1_0_g128.cu | Adds MMQ instantiation for GGML_TYPE_Q1_0_g128. |
| ggml/src/ggml-cuda/template-instances/generate_cu_files.py | Generates MMQ instance files for Q1_0 and Q1_0_g128. |
| ggml/src/ggml-cuda/quantize.cu | Minor formatting-only change. |
| ggml/src/ggml-cuda/mmq.cuh | Adds MMA tile loaders for Q1_0/Q1_0_g128 and disables their DP4A vec-dot path. |
| ggml/src/ggml-cuda/mmq.cu | Enables MMQ dispatch/eligibility for Q1_0/Q1_0_g128 with a Turing+ MMA guard (see the sketch after this table). |
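The mmq.cu eligibility change presumably extends upstream llama.cpp's ggml_cuda_should_use_mmq, which gates MMQ per type and compute capability. A hedged sketch of what a Turing+ MMA guard for the new types could look like (the function name and GGML_CUDA_CC_TURING exist upstream; the Q1_0 cases and the exact predicate are assumptions about this fork):

```cuda
// Sketch only: Q1_0/Q1_0_g128 ship MMA tile loaders but no DP4A vec-dot,
// so they are MMQ-eligible only on Turing (sm_75) or newer, where the mma
// tensor-core instructions are available; older GPUs keep the cuBLAS path.
bool ggml_cuda_should_use_mmq(enum ggml_type type, int cc, int64_t ne11) {
    switch (type) {
        case GGML_TYPE_Q1_0:
        case GGML_TYPE_Q1_0_g128:
            return cc >= GGML_CUDA_CC_TURING; // MMA-only: no pre-Turing fallback
        default:
            break;
    }
    // ... existing per-type logic, including ne11-based batch-size heuristics ...
    return false;
}
```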


Comment thread: ggml/src/ggml-cuda/mmq.cu

```diff
@@ -6,12 +6,12 @@
 static void ggml_cuda_mul_mat_q_switch_type(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
     switch (args.type_x) {
-        // TODO: Q1_0/Q1_0_g128 MMQ disabled due to accuracy issues; for now commenting these to use cuBLAS fallback
```
Comment thread: ggml/src/ggml-cuda/mmq.cu
Comment on lines 277 to +279

```diff
-        // TODO: Q1_0 and Q1_0_g128 MMQ implementation exists but is currently disabled due to accuracy issues
-        // case GGML_TYPE_Q1_0:
-        // case GGML_TYPE_Q1_0_g128:
+        case GGML_TYPE_Q1_0:
+        case GGML_TYPE_Q1_0_g128:
```
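For context, the case bodies in that switch dispatch to the per-type template instantiations; with the two cases re-enabled, the relevant fragment plausibly reads as below, following upstream llama.cpp's mmq.cu structure (mul_mat_q_case and GGML_ABORT are upstream names; the Q1_0 types are fork-specific):

```cuda
static void ggml_cuda_mul_mat_q_switch_type(ggml_backend_cuda_context & ctx,
                                            const mmq_args & args, cudaStream_t stream) {
    switch (args.type_x) {
        case GGML_TYPE_Q1_0:
            mul_mat_q_case<GGML_TYPE_Q1_0>(ctx, args, stream);      // MMQ, not cuBLAS
            break;
        case GGML_TYPE_Q1_0_g128:
            mul_mat_q_case<GGML_TYPE_Q1_0_g128>(ctx, args, stream); // MMQ, not cuBLAS
            break;
        // ... cases for the remaining quantization types ...
        default:
            GGML_ABORT("fatal error: unsupported type");
    }
}
```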
@khosravipasha deleted the mmq branch on March 24, 2026 at 21:14
bricklc pushed a commit to bricklc/prism-ml-llama.cpp that referenced this pull request Apr 25, 2026
Codex post-commit review found:
1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes
2. SET_ROWS kernel turbo3-specific but instantiated for turbo4
3. Tail block drop for non-128 head dims

Fixed PrismML-Eng#3 (TURBO_D). PrismML-Eng#1 and PrismML-Eng#2 don't affect turbo3+dk128 path.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
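Finding 1 above (turbo4 arrays sized by the turbo3 group-size constant) is the class of bug that compile-time assertions catch cheaply. A generic sketch, with every name except QK_TURBO3 invented for illustration:

```cuda
// Hypothetical sketch: size per-family buffers with that family's own
// constant, and pin cross-family assumptions with static_assert so a
// re-valued constant (QK_TURBO3 becoming 32) fails the build instead of
// silently shrinking turbo4 arrays.
#define QK_TURBO3 32   // per the commit message ("now 32")
#define QK_TURBO4 64   // assumed turbo4 group size, illustration only

__global__ void dequant_turbo4(const unsigned char * qs, float * dst) {
    static_assert(QK_TURBO4 % 32 == 0, "turbo4 group size must stay warp-aligned");
    float tmp[QK_TURBO4];  // pre-fix this was sized by QK_TURBO3: too small
    for (int i = 0; i < QK_TURBO4; ++i) {
        tmp[i] = (float) qs[blockIdx.x * QK_TURBO4 + i];  // stand-in for real dequant
    }
    for (int i = 0; i < QK_TURBO4; ++i) {
        dst[blockIdx.x * QK_TURBO4 + i] = tmp[i];
    }
}
```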
bricklc pushed a commit to bricklc/prism-ml-llama.cpp that referenced this pull request Apr 25, 2026
Complete experiment log:
  PrismML-Eng#1  4-mag LUT:           15.1 at 8K (BEST, +38%)
  PrismML-Eng#2  Batched extract:     13.7 (+25%)
  PrismML-Eng#3  Inline FA block:     13.5 (I-cache pressure)
  PrismML-Eng#4  Deferred norm:       12.9 (loses ILP)
  PrismML-Eng#5  2-pair half2:        12.0 (ternary overhead)
  PrismML-Eng#6  Select chain:        11.9 (branches kill)
  PrismML-Eng#7  Bit-arithmetic:      11.6 (ALU too heavy)
  PrismML-Eng#8  FMA branchless:      11.4 (ALU still too heavy)
  PrismML-Eng#9  Named-reg ternary:   10.3 (branches worst)
  PrismML-Eng#10 Main (8-LUT):        10.95 (baseline)
  PrismML-Eng#11 Non-vec FA:          10.2 (wrong kernel)
  Ceiling:                 24.5 (no dequant)

Apple8 hardware truth:
  1 divergent constant read < 7 ALU ops (even with fma)
  Branches cost MORE than divergent constant reads
  Array indexing ALWAYS spills on Metal
  4 constant addresses is the sweet spot

The 4-mag LUT is the dequant-level ceiling on Apple Silicon.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
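The winning "4-mag LUT" above describes a dequant in which each 2-bit weight code selects one of four magnitudes held at fixed constant-memory addresses, so a lookup costs one (possibly divergent) constant read, with no branches and no spill-prone local array. A conceptual transcription into CUDA syntax (the original kernels are Metal; the magnitudes and names here are placeholders):

```cuda
// Conceptual sketch only, transcribed from a Metal idea for illustration.
// Four values at fixed constant addresses: each lookup is a single constant
// read, beating both branchy selects and heavier ALU reconstruction per the
// measurements in the log above.
#include <cstdint>

__constant__ float kMag[4] = { -1.0f, -0.25f, 0.25f, 1.0f }; // placeholder magnitudes

__device__ __forceinline__ float dequant2(uint32_t packed, int i) {
    return kMag[(packed >> (2 * i)) & 0x3u];  // 2-bit code -> one constant read
}
```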
