
vulkan: add peak performance tuning for ARM Mali GPUs (G720)#18493

Draft
Gong-Mi wants to merge 8 commits into ggml-org:master from Gong-Mi:mali-g720-tuning

Conversation

@Gong-Mi

@Gong-Mi Gong-Mi commented Dec 30, 2025

Description

This PR implements specialized performance tuning for ARM Mali G720 GPUs (vendor ID 0x13B5) in the Vulkan backend.

Changes

  • Implement specialized warptile configurations for Mali G720 to optimize compute throughput.
  • Force the FP32 path by disabling FP16/BF16, based on experimental findings showing peak performance on this architecture.
  • Limit suballocation block size to 256MB to improve memory stability on mobile devices within the Termux environment.

Performance Findings (Termux Native Build)

  • 1B Models (e.g., Llama 3.2 1B): Show significant performance advantages when fully offloaded to the GPU.
  • 4B/8B Models: Require partial CPU offloading to manage memory constraints effectively on typical mobile RAM configurations.
  • Benchmark Command: llama-bench -m models/llama-3.2-1b.gguf -p 512 -n 128 -t 4
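The partial-offload point above can be explored with a sweep over `-ngl` (a sketch only: the model path and layer counts are illustrative placeholders; `-m`/`-p`/`-n`/`-t`/`-ngl` are standard llama-bench flags):

```shell
# Sweep GPU layer counts for a larger model to find a CPU/GPU split that fits
# mobile RAM. Paths and layer counts here are illustrative, not measured values.
BENCH=./build/bin/llama-bench
MODEL=models/llama-3.1-8b-q4_0.gguf
for ngl in 0 8 16 24; do
  if [ -x "$BENCH" ]; then
    "$BENCH" -m "$MODEL" -p 512 -n 128 -t 4 -ngl "$ngl"
  else
    echo "would run: llama-bench -ngl $ngl"   # binary not built in this sketch
  fi
done
```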

These optimizations aim to provide a more usable experience for Android users running LLMs locally via Vulkan on modern Dimensity/Mali-based SoCs.

@Gong-Mi Gong-Mi requested a review from 0cc4m as a code owner December 30, 2025 16:05
@github-actions github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Dec 30, 2025
Comment thread ggml/src/ggml-vulkan/ggml-vulkan.cpp Outdated
Comment thread ggml/src/ggml-vulkan/ggml-vulkan.cpp Outdated
Comment thread ggml/src/ggml-vulkan/ggml-vulkan.cpp Outdated
Gong-Mi added a commit to Gong-Mi/llama.cpp that referenced this pull request Dec 31, 2025
This report details the experimental findings and code optimizations performed on the ARM Mali G720 GPU (MediaTek Dimensity 9300) within the Termux environment.

## 1. Benchmarking Results (Model: Gemma 3 4B Q4_0)

**Environment**: Termux Native Build, `llama-bench`, 4 threads, `-ngl 99`
**Device**: Redmi K70 Ultra (24GB RAM)

| Mode | Configuration | Stability | PP (t/s) | TG (t/s) | Conclusion |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **Forcing FP32 (Optimal)** | **Disable FP16/BF16, Enable INT8** | **Optimal** | **68.18** | **11.19** | **Fastest PP (+125% vs FP16)** |
| **Full FP16** | Enable FP16 Storage & Compute | Stable | 30.26 | 11.70 | TG slightly faster, PP very slow |
| **Mixed Precision**| FP16 Storage + FP32 Compute | Stable | 122.99* | 33.01* | *Tested on 1B model, performance degraded vs Pure FP32 |
| **Forced BF16** | Force enable on unsupported driver| **Crash** | N/A | N/A | Driver Segfault (VK_KHR_shader_bfloat16 missing/broken) |

> **Key Finding**: The Mali G720 driver/hardware implementation for llama.cpp's Vulkan shaders is significantly more efficient using the **FP32 path**. Forcing FP32 avoids costly conversion overheads (pack/unpack) likely present in the FP16 path due to compiler limitations (lack of native 16-bit registers), doubling prompt processing speed.

---

## 2. Memory Architecture Optimizations (Zero-Copy)

Exploration of memory flags confirmed the benefits of native "Zero-Copy" on Mali's Unified Memory Architecture (UMA).

| Configuration | UMA Status | Prefer Host Mem | Performance Impact | Note |
| :--- | :--- | :--- | :--- | :--- |
| **Default** | DISABLED | NO | Baseline | Logic bug prevented UMA detection early enough |
| **Corrected** | ENABLED | NO | Slight Drop | `HOST_VISIBLE` memory might have slight cache overhead |
| **Optimized** | **ENABLED** | **YES** | **~+2-3%** | Ensures driver allocates native host-visible memory (`CL_MEM_ALLOC_HOST_PTR` equivalent) |

**Clarification on Zero-Copy Benefits:**
- **Goal**: Enable direct driver allocation of host-visible memory (`prefer_host_memory = true`).
- **Effect**: It **does not** significantly reduce latency or improve performance speed.
- **Power Consumption**: It effectively reduces the power spike during the initial data loading/prompt processing phase by eliminating CPU-to-GPU memory copies.
- **Limitations**: It **does not** solve the high power consumption problem during the token generation phase, which is dominated by memory bandwidth constraints.
- **Context**: Verified on a high-spec device (K70U with 24GB RAM); benefits on lower-RAM devices may vary but the architecture remains UMA.

---

## 3. Implemented Code Optimizations (Refactored per Upstream Suggestions)

Based on feedback from NVIDIA architect `@jeffbolznv`, the following changes were implemented in `ggml-vulkan.cpp`:

| Optimization | Implementation Detail | Purpose |
| :--- | :--- | :--- |
| **Macro Definition** | `#define VK_VENDOR_ID_ARM 0x13B5` | Standardized vendor identification, removing hardcoded values. |
| **Initialization Refactor**| Logic moved to `ggml_vk_get_device` | Ensures device features (FP16/INT8) are correctly overridden at initialization. |
| **Memory Management** | `suballocation_block_size = 256MB` | Prevents OOM and improves allocation stability in Termux. |
| **INT8 Acceleration** | `device->integer_dot_product = true` | Forcibly enables hardware-level 8-bit integer dot product support (confirmed supported by driver). |
| **Zero-Copy Force** | `prefer_host_memory = true` | Forces allocation of Host-Visible/Device-Local memory for true Zero-Copy. |
| **Shader Tuning** | Custom `warptile` parameters | Optimized parallel workgroup sizes specifically for Mali architecture. |

---

## 4. Final Best Practice for Mali G-Series (G720)
The experiments confirm that for modern Mali G-series (G720), the best performance balance is achieved by:
1.  **Forcing the FP32 computation path** to avoid massive conversion overhead.
2.  **Enabling INT8 dot product** for quantized model acceleration.
3.  **Enabling `prefer_host_memory`** to leverage UMA/Zero-Copy capabilities fully.
4.  **Reducing suballocation blocks** to 256MB for better mobile RAM compatibility.

---

## 5. Pull Request Status
These changes are incorporated into PR **ggml-org#18493** on the official `llama.cpp` repository.
- **Local status**: Verified, compiled, and benchmarked (FP32 + INT8 + Zero-Copy).
- **Upstream alignment**: Code adheres to requested architectural patterns (macros, init location).
Gong-Mi added a commit to Gong-Mi/llama.cpp that referenced this pull request Dec 31, 2025
Gong-Mi added a commit to Gong-Mi/llama.cpp that referenced this pull request Dec 31, 2025
Gong-Mi added a commit to Gong-Mi/llama.cpp that referenced this pull request Dec 31, 2025
This report details the experimental findings and code optimizations performed on the ARM Mali G720 GPU (MediaTek Dimensity 9300) within the Termux environment.

## 1. Benchmarking Results (Model: Gemma 3 4B Q4_0)

**Environment**: Termux Native Build, `llama-bench`, 4 threads, `-ngl 99`
**Device**: Redmi K70 Ultra (24GB RAM)

| Mode | Configuration | Stability | Prompt Processing (t/s) | Token Generation (t/s) | Conclusion |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **Forcing FP32 (Optimal)** | **Disable FP16/BF16** | **Optimal** | **68.18** | **11.19** | **Fastest PP (+125% vs FP16)** |
| **Full FP16** | Enable FP16 Storage & Compute | Stable | 30.26 | 11.70 | TG slightly faster, PP very slow |
| **Mixed Precision**| FP16 Storage + FP32 Compute | Stable | 122.99* | 33.01* | *Tested on 1B model, performance degraded vs Pure FP32 |
| **Forced BF16** | Force enable on unsupported driver| **Crash** | N/A | N/A | Driver Segfault (VK_KHR_shader_bfloat16 missing/broken) |

> **Key Finding**: The Mali G720 driver/hardware implementation for llama.cpp's Vulkan shaders is significantly more efficient using the **FP32 path**. Forcing FP32 avoids costly conversion overheads (pack/unpack) likely present in the FP16 path due to compiler limitations (lack of native 16-bit registers), doubling prompt processing speed.

---

## 2. Memory Architecture Optimizations (Zero-Copy)

Exploration of memory flags confirmed the benefits of native "Zero-Copy" on Mali's Unified Memory Architecture (UMA).

| Configuration | UMA Status | Prefer Host Mem | Performance Impact | Note |
| :--- | :--- | :--- | :--- | :--- |
| **Default** | DISABLED | NO | Baseline | Logic bug prevented UMA detection early enough |
| **Corrected** | ENABLED | NO | Slight Drop | `HOST_VISIBLE` memory might have slight cache overhead |
| **Optimized** | **ENABLED** | **YES** | **~+2-3%** | Ensures driver allocates native host-visible memory (`CL_MEM_ALLOC_HOST_PTR` equivalent) |

**Clarification on Zero-Copy Benefits:**
- **Goal**: Enable direct driver allocation of host-visible memory (`prefer_host_memory = true`).
- **Effect**: It **does not** significantly reduce latency or improve performance speed.
- **Power Consumption**: It effectively reduces the power spike during the initial data loading/prompt processing phase by eliminating CPU-to-GPU memory copies.
- **Limitations**: It **does not** solve the high power consumption problem during the token generation phase, which is dominated by memory bandwidth constraints.
- **Context**: Verified on a high-spec device (K70U with 24GB RAM); benefits on lower-RAM devices may vary but the architecture remains UMA.

---

## 3. Implemented Code Optimizations (Refactored per Upstream Suggestions)

Based on feedback from NVIDIA architect `@jeffbolznv`, the following changes were implemented in `ggml-vulkan.cpp`:

| Optimization | Implementation Detail | Purpose |
| :--- | :--- | :--- |
| **Macro Definition** | `#define VK_VENDOR_ID_ARM 0x13B5` | Standardized vendor identification, removing hardcoded values. |
| **Initialization Refactor**| Logic moved to `ggml_vk_get_device` | Ensures device features (FP16/INT8) are correctly overridden at initialization. |
| **Memory Management** | `suballocation_block_size = 256MB` | Prevents OOM and improves allocation stability in Termux. |
| **Zero-Copy Force** | `prefer_host_memory = true` | Forces allocation of Host-Visible/Device-Local memory for true Zero-Copy. |
| **Shader Tuning** | Custom `warptile` parameters | Optimized parallel workgroup sizes specifically for Mali architecture. |

---

## 4. Final Best Practice for Mali G-Series (G720)
The experiments confirm that for modern Mali G-series (G720), the best performance balance is achieved by:
1.  **Forcing the FP32 computation path** to avoid massive conversion overhead.
2.  **Relying on Native Driver Detection for INT8** (Removed forced override for better compatibility with G78/older models).
3.  **Enabling `prefer_host_memory`** to leverage UMA/Zero-Copy capabilities fully.
4.  **Reducing suballocation blocks** to 256MB for better mobile RAM compatibility.

---

## 5. Pull Request Status
These changes are incorporated into PR **ggml-org#18493** on the official `llama.cpp` repository.
- **Local status**: Verified, compiled, and benchmarked (FP32 + Zero-Copy).
- **Upstream alignment**: Code adheres to requested architectural patterns (macros, init location).
@Gong-Mi Gong-Mi requested a review from jeffbolznv January 1, 2026 11:35
Comment thread ggml/src/ggml-vulkan/ggml-vulkan.cpp Outdated
Comment thread ggml/src/ggml-vulkan/ggml-vulkan.cpp Outdated

fp16_storage = false;
fp16_compute = false;
bfloat16_support = false;
Contributor

Which of these are necessary and why?

Contributor

I'd still like to better understand what these are accomplishing. Which shaders are slower with fp16? You can use GGML_VK_PERF_LOGGER to help identify them.
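For reference, the suggested profiling step might look like this (a sketch: the binary and model paths are placeholders; `GGML_VK_PERF_LOGGER` is the environment variable named in the comment above):

```shell
# Enable the Vulkan performance logger mentioned in the review comment, then
# run a short bench to compare per-shader timings between fp16 and fp32 paths.
export GGML_VK_PERF_LOGGER=1
BENCH=./build/bin/llama-bench
if [ -x "$BENCH" ]; then
  "$BENCH" -m models/llama-3.2-1b.gguf -p 512 -n 32
else
  echo "perf logger enabled: $GGML_VK_PERF_LOGGER"  # binary not built here
fi
```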

Author

OK

Author

The changes have been squashed into a single commit; please review.

Author

Run vkpeak first to determine the device's actual compute throughput and which computation paths it uses, before taking these numbers as a reference.

Comment thread ggml/src/ggml-vulkan/ggml-vulkan.cpp Outdated
Comment thread ggml/src/ggml-vulkan/ggml-vulkan.cpp Outdated
Gong-Mi added a commit to Gong-Mi/llama.cpp that referenced this pull request Jan 1, 2026
@Gong-Mi Gong-Mi force-pushed the mali-g720-tuning branch 2 times, most recently from e09331b to ebcec19 Compare January 2, 2026 00:07
@Gong-Mi Gong-Mi requested a review from jeffbolznv January 2, 2026 00:25
@Gong-Mi
Author

Gong-Mi commented Jan 2, 2026

./build/bin/llama-bench -m models/gemma-3-4b-it-qat-Q4_0.gguf -p 512 -n 128 -b 512
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G720-Immortalis MC12 (Mali-G720-Immortalis MC12) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 16 | shared memory: 32768 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | n_batch | test | t/s |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | Vulkan | 99 | 512 | pp512 | 205.39 ± 1.93 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | Vulkan | 99 | 512 | tg128 | 12.68 ± 1.01 |

build: 56b8f27 (7611)

Comment thread ggml/src/ggml-vulkan/ggml-vulkan.cpp Outdated
Contributor

@0cc4m 0cc4m left a comment


Please rebase to resolve the conflict.

Comment thread ggml/src/ggml-vulkan/ggml-vulkan.cpp Outdated
Comment thread ggml/src/ggml-vulkan/ggml-vulkan.cpp Outdated
Comment thread ggml/src/ggml-vulkan/ggml-vulkan.cpp Outdated
@Gong-Mi
Author

Gong-Mi commented Jan 7, 2026

Resolved conflicts and cleaned up code as requested (removed empty blocks, moved tuning logic).
Verified performance on Mali G720 (llama-3.2-1b):

  • Prompt Processing: ~86.69 t/s (+9.6%)
  • Token Generation: ~29.15 t/s (+3.9%)

Ready for merge.

@Gong-Mi Gong-Mi requested review from 0cc4m and jeffbolznv January 7, 2026 14:23
Comment thread ggml/src/ggml-vulkan/ggml-vulkan.cpp Outdated
Gong-Mi added a commit to Gong-Mi/llama.cpp that referenced this pull request Feb 1, 2026
Moves the logic that disables large matrix multiplication for ARM and Qualcomm devices from ggml_vk_load_shaders to the device initialization switch block.

This fixes stability issues (silent calculation errors) on Mali G720/Immortalis MC12 while adhering to the code structure requested in PR ggml-org#18493 discussion.
@Gong-Mi Gong-Mi force-pushed the mali-g720-tuning branch 3 times, most recently from 0764c47 to 9b86eb6 Compare February 2, 2026 05:10
@0cc4m
Contributor

0cc4m commented Feb 5, 2026

Please stop making large changes, it is impossible to review a moving target. Also, make sure to follow https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md, your recent responses were definitely AI and your code changes may also be. This does not give me confidence that the PR actually does what it claims to do.

@Gong-Mi
Author

Gong-Mi commented Feb 5, 2026

> Please stop making large changes, it is impossible to review a moving target. Also, make sure to follow https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md, your recent responses were definitely AI and your code changes may also be. This does not give me confidence that the PR actually does what it claims to do.

Yes, I know. After this past month, is there anything about the situation I still don't understand?
Who expects Android, Raspberry Pi, or set-top boxes to be able to run LLM services?

By that logic, you should also comment out the Qualcomm Hexagon support.
There are only a few viable paths; you should also think about how to handle the backends.
The only way to stop the other side is to prove that this approach is the best; then people naturally go quiet.

Android devices only last about 5 years; the Android devices I have on hand will be retired in two years.

This project is just a hobby; making money is the job.

@Gong-Mi
Author

Gong-Mi commented Feb 7, 2026

> Please stop making large changes, it is impossible to review a moving target. Also, make sure to follow https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md, your recent responses were definitely AI and your code changes may also be. This does not give me confidence that the PR actually does what it claims to do.

I'm sorry. I've tried many approaches recently to reach a conclusion, and it seems I have to use zero-copy in Android mode.

One of the main purposes of opening this PR was to solve the problem of high memory usage in the Vulkan backend (UMA architecture), but you said you supported me and the discussion shifted from that finding to tuning.

Now I'm fairly sure the problem is AHB, i.e. `VK_ANDROID_external_memory_android_hardware_buffer`:

@Gong-Mi
Author

Gong-Mi commented Feb 7, 2026

> Please stop making large changes, it is impossible to review a moving target. Also, make sure to follow https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md, your recent responses were definitely AI and your code changes may also be. This does not give me confidence that the PR actually does what it claims to do.

If my behavior bothers you, feel free to change this back into a draft. At present this is the best answer I can give. I'm sorry for taking up so much of your time.

@0cc4m
Contributor

0cc4m commented Feb 8, 2026

It's fine, if you still need to work on the PR, please mark it as draft. I can still help if you have questions. When you're ready for merge, let me know and I'll do a complete review.

@Gong-Mi Gong-Mi marked this pull request as draft February 10, 2026 02:30
@Gong-Mi Gong-Mi marked this pull request as ready for review February 10, 2026 05:32
@Gong-Mi
Author

Gong-Mi commented Feb 10, 2026

This is preparatory work. I need to make sure this change is acceptable before I continue.

@Gong-Mi
Author

Gong-Mi commented Feb 10, 2026

I'll keep the status quo until you give me your opinion. This submission splits the change into several specific steps.

@Gong-Mi
Author

Gong-Mi commented Feb 10, 2026

This submission only corrects the GPU memory type.

@Gong-Mi Gong-Mi marked this pull request as draft February 10, 2026 06:37
@SuperPauly

@Gong-Mi Any updates?

@BlindDeveloper

@Gong-Mi Any updates?


Labels

ggml (changes relating to the ggml tensor library for machine learning), Vulkan (Issues specific to the Vulkan backend)


5 participants