vulkan: add peak performance tuning for ARM Mali GPUs (G720) #18493
Gong-Mi wants to merge 8 commits into ggml-org:master
Conversation
This report details the experimental findings and code optimizations performed on the ARM Mali G720 GPU (MediaTek Dimensity 9300) within the Termux environment.

## 1. Benchmarking Results (Model: Gemma 3 4B Q4_0)

**Environment**: Termux Native Build, `llama-bench`, 4 threads, `-ngl 99`
**Device**: Redmi K70 Ultra (24GB RAM)

| Mode | Configuration | Stability | PP (t/s) | TG (t/s) | Conclusion |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **Forcing FP32 (Optimal)** | Disable FP16/BF16, Enable INT8 | **Optimal** | **68.18** | **11.19** | **Fastest PP (+125% vs FP16)** |
| **Full FP16** | Enable FP16 Storage & Compute | Stable | 30.26 | 11.70 | TG slightly faster, PP very slow |
| **Mixed Precision** | FP16 Storage + FP32 Compute | Stable | 122.99* | 33.01* | *Tested on 1B model; performance degraded vs pure FP32 |
| **Forced BF16** | Force enable on unsupported driver | **Crash** | N/A | N/A | Driver segfault (`VK_KHR_shader_bfloat16` missing/broken) |

> **Key Finding**: The Mali G720 driver/hardware implementation of llama.cpp's Vulkan shaders is significantly more efficient on the **FP32 path**. Forcing FP32 avoids costly conversion overheads (pack/unpack) likely present in the FP16 path due to compiler limitations (lack of native 16-bit registers), doubling prompt-processing speed.

---

## 2. Memory Architecture Optimizations (Zero-Copy)

Exploration of memory flags confirmed the benefits of native "Zero-Copy" on Mali's Unified Memory Architecture (UMA).

| Configuration | UMA Status | Prefer Host Mem | Performance Impact | Note |
| :--- | :--- | :--- | :--- | :--- |
| **Default** | DISABLED | NO | Baseline | Logic bug prevented UMA detection early enough |
| **Corrected** | ENABLED | NO | Slight drop | `HOST_VISIBLE` memory might have slight cache overhead |
| **Optimized** | **ENABLED** | **YES** | **~+2-3%** | Ensures driver allocates native host-visible memory (`CL_MEM_ALLOC_HOST_PTR` equivalent) |

**Clarification on Zero-Copy Benefits:**

- **Goal**: Enable direct driver allocation of host-visible memory (`prefer_host_memory = true`).
- **Effect**: It **does not** significantly reduce latency or improve speed.
- **Power consumption**: It effectively reduces the power spike during the initial data-loading/prompt-processing phase by eliminating CPU-to-GPU memory copies.
- **Limitations**: It **does not** solve the high power consumption during the token-generation phase, which is dominated by memory bandwidth constraints.
- **Context**: Verified on a high-spec device (K70U with 24GB RAM); benefits on lower-RAM devices may vary, but the architecture remains UMA.

---

## 3. Implemented Code Optimizations (Refactored per Upstream Suggestions)

Based on feedback from NVIDIA architect `@jeffbolznv`, the following changes were implemented in `ggml-vulkan.cpp`:

| Optimization | Implementation Detail | Purpose |
| :--- | :--- | :--- |
| **Macro Definition** | `#define VK_VENDOR_ID_ARM 0x13B5` | Standardized vendor identification, removing hardcoded values |
| **Initialization Refactor** | Logic moved to `ggml_vk_get_device` | Ensures device features (FP16/INT8) are correctly overridden at initialization |
| **Memory Management** | `suballocation_block_size = 256MB` | Prevents OOM and improves allocation stability in Termux |
| **INT8 Acceleration** | `device->integer_dot_product = true` | Forcibly enables hardware-level 8-bit integer dot-product support (confirmed supported by the driver) |
| **Zero-Copy Force** | `prefer_host_memory = true` | Forces allocation of host-visible/device-local memory for true zero-copy |
| **Shader Tuning** | Custom `warptile` parameters | Optimized parallel workgroup sizes specifically for the Mali architecture |

---

## 4. Final Best Practice for Mali G-Series (G720)

The experiments confirm that for modern Mali G-series (G720), the best performance balance is achieved by:

1. **Forcing the FP32 computation path** to avoid massive conversion overhead.
2. **Enabling INT8 dot product** for quantized-model acceleration.
3. **Enabling `prefer_host_memory`** to leverage UMA/zero-copy capabilities fully.
4. **Reducing suballocation blocks** to 256MB for better mobile RAM compatibility.

---

## 5. Pull Request Status

These changes are incorporated into PR **ggml-org#18493** on the official `llama.cpp` repository.

- **Local status**: Verified, compiled, and benchmarked (FP32 + INT8 + Zero-Copy).
- **Upstream alignment**: Code adheres to requested architectural patterns (macros, init location).
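The overrides in section 3 can be pictured as one vendor-gated configuration step applied during device initialization. The sketch below is illustrative only: the `DeviceCaps` struct and `apply_mali_g720_tuning` helper are hypothetical stand-ins for the actual `ggml-vulkan.cpp` structures, mirroring just the fields named in the table.

```cpp
#include <cstdint>
#include <cstddef>

#define VK_VENDOR_ID_ARM 0x13B5  // vendor-ID macro as added in the PR

// Hypothetical mirror of the device fields the PR touches.
struct DeviceCaps {
    uint32_t vendor_id = 0;
    bool fp16_storage = true;
    bool fp16_compute = true;
    bool bfloat16_support = true;
    bool integer_dot_product = false;
    bool prefer_host_memory = false;
    size_t suballocation_block_size = 1024u * 1024u * 1024u;  // assumed 1 GiB default
};

// Apply the Mali G720 tuning described in the report: force the FP32
// path, enable INT8 dot product, prefer host-visible (UMA) memory, and
// shrink suballocation blocks to 256 MiB. Other vendors are untouched.
void apply_mali_g720_tuning(DeviceCaps &dev) {
    if (dev.vendor_id != VK_VENDOR_ID_ARM) {
        return;
    }
    dev.fp16_storage = false;
    dev.fp16_compute = false;
    dev.bfloat16_support = false;      // forcing BF16 on crashes the driver
    dev.integer_dot_product = true;    // reported as supported by the driver
    dev.prefer_host_memory = true;     // zero-copy on UMA
    dev.suballocation_block_size = 256u * 1024u * 1024u;
}
```

Performing this in the device-initialization path (the report names `ggml_vk_get_device`) rather than later is what guarantees every subsequent feature check sees the overridden values.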
*Force-pushed from 21844aa to f9c613d.*
*Force-pushed from f9c613d to 9d1bb45.*
*Force-pushed from 9d1bb45 to f0ae267.*
This report details the experimental findings and code optimizations performed on the ARM Mali G720 GPU (MediaTek Dimensity 9300) within the Termux environment.

## 1. Benchmarking Results (Model: Gemma 3 4B Q4_0)

**Environment**: Termux Native Build, `llama-bench`, 4 threads, `-ngl 99`
**Device**: Redmi K70 Ultra (24GB RAM)

| Mode | Configuration | Stability | Prompt Processing (t/s) | Token Generation (t/s) | Conclusion |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **Forcing FP32 (Optimal)** | Disable FP16/BF16 | **Optimal** | **68.18** | **11.19** | **Fastest PP (+125% vs FP16)** |
| **Full FP16** | Enable FP16 Storage & Compute | Stable | 30.26 | 11.70 | TG slightly faster, PP very slow |
| **Mixed Precision** | FP16 Storage + FP32 Compute | Stable | 122.99* | 33.01* | *Tested on 1B model; performance degraded vs pure FP32 |
| **Forced BF16** | Force enable on unsupported driver | **Crash** | N/A | N/A | Driver segfault (`VK_KHR_shader_bfloat16` missing/broken) |

> **Key Finding**: The Mali G720 driver/hardware implementation of llama.cpp's Vulkan shaders is significantly more efficient on the **FP32 path**. Forcing FP32 avoids costly conversion overheads (pack/unpack) likely present in the FP16 path due to compiler limitations (lack of native 16-bit registers), doubling prompt-processing speed.

---

## 2. Memory Architecture Optimizations (Zero-Copy)

Exploration of memory flags confirmed the benefits of native "Zero-Copy" on Mali's Unified Memory Architecture (UMA).

| Configuration | UMA Status | Prefer Host Mem | Performance Impact | Note |
| :--- | :--- | :--- | :--- | :--- |
| **Default** | DISABLED | NO | Baseline | Logic bug prevented UMA detection early enough |
| **Corrected** | ENABLED | NO | Slight drop | `HOST_VISIBLE` memory might have slight cache overhead |
| **Optimized** | **ENABLED** | **YES** | **~+2-3%** | Ensures driver allocates native host-visible memory (`CL_MEM_ALLOC_HOST_PTR` equivalent) |

**Clarification on Zero-Copy Benefits:**

- **Goal**: Enable direct driver allocation of host-visible memory (`prefer_host_memory = true`).
- **Effect**: It **does not** significantly reduce latency or improve speed.
- **Power consumption**: It effectively reduces the power spike during the initial data-loading/prompt-processing phase by eliminating CPU-to-GPU memory copies.
- **Limitations**: It **does not** solve the high power consumption during the token-generation phase, which is dominated by memory bandwidth constraints.
- **Context**: Verified on a high-spec device (K70U with 24GB RAM); benefits on lower-RAM devices may vary, but the architecture remains UMA.

---

## 3. Implemented Code Optimizations (Refactored per Upstream Suggestions)

Based on feedback from NVIDIA architect `@jeffbolznv`, the following changes were implemented in `ggml-vulkan.cpp`:

| Optimization | Implementation Detail | Purpose |
| :--- | :--- | :--- |
| **Macro Definition** | `#define VK_VENDOR_ID_ARM 0x13B5` | Standardized vendor identification, removing hardcoded values |
| **Initialization Refactor** | Logic moved to `ggml_vk_get_device` | Ensures device features (FP16/INT8) are correctly overridden at initialization |
| **Memory Management** | `suballocation_block_size = 256MB` | Prevents OOM and improves allocation stability in Termux |
| **Zero-Copy Force** | `prefer_host_memory = true` | Forces allocation of host-visible/device-local memory for true zero-copy |
| **Shader Tuning** | Custom `warptile` parameters | Optimized parallel workgroup sizes specifically for the Mali architecture |

---

## 4. Final Best Practice for Mali G-Series (G720)

The experiments confirm that for modern Mali G-series (G720), the best performance balance is achieved by:

1. **Forcing the FP32 computation path** to avoid massive conversion overhead.
2. **Relying on native driver detection for INT8** (removed the forced override for better compatibility with G78/older models).
3. **Enabling `prefer_host_memory`** to leverage UMA/zero-copy capabilities fully.
4. **Reducing suballocation blocks** to 256MB for better mobile RAM compatibility.

---

## 5. Pull Request Status

These changes are incorporated into PR **ggml-org#18493** on the official `llama.cpp` repository.

- **Local status**: Verified, compiled, and benchmarked (FP32 + Zero-Copy).
- **Upstream alignment**: Code adheres to requested architectural patterns (macros, init location).
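The revised policy changes only point 2 relative to the earlier revision: instead of forcing `integer_dot_product = true`, the flag is taken from whatever the driver actually reports. A hedged sketch of the decision, with hypothetical names (this is not the real `ggml-vulkan.cpp` structure):

```cpp
#include <cstdint>
#include <cstddef>

#define VK_VENDOR_ID_ARM 0x13B5

// Hypothetical summary of the four best-practice settings above.
struct MaliConfig {
    bool force_fp32 = false;
    bool integer_dot_product = false;
    bool prefer_host_memory = false;
    size_t suballocation_block_size = 1024u * 1024u * 1024u;  // assumed default
};

// Build the final Mali G-series configuration. INT8 dot product now
// follows the driver-reported capability rather than a forced override,
// so older parts (e.g. G78) that lack it are not misconfigured.
MaliConfig make_mali_config(uint32_t vendor_id, bool driver_reports_int8_dot) {
    MaliConfig cfg{};
    cfg.integer_dot_product = driver_reports_int8_dot;  // 2. trust the driver
    if (vendor_id == VK_VENDOR_ID_ARM) {
        cfg.force_fp32 = true;                           // 1. skip FP16 conversion overhead
        cfg.prefer_host_memory = true;                   // 3. UMA zero-copy
        cfg.suballocation_block_size = 256u * 1024u * 1024u;  // 4. mobile-friendly blocks
    }
    return cfg;
}
```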
*Force-pushed from f0ae267 to 74192eb.*
The review comment refers to these lines:

```cpp
fp16_storage = false;
fp16_compute = false;
bfloat16_support = false;
```
Which of these are necessary and why?
I'd still like to better understand what these are accomplishing. Which shaders are slower with fp16? You can use `GGML_VK_PERF_LOGGER` to help identify them.
The changes have been merged into one commit; please review.
Run `vkpeak` to determine the device's actual compute throughput and computation method before using these numbers as a reference.
*Force-pushed from 02f0386 to 9134022.*
*Force-pushed from e09331b to ebcec19.*
```
./build/bin/llama-bench -m models/gemma-3-4b-it-qat-Q4_0.gguf -p 512 -n 128 -b 512
```

build: 56b8f27 (7611)
*Force-pushed from 56b8f27 to 0ea52ac.*
**0cc4m** left a comment:
Please rebase to resolve the conflict.
*Force-pushed from 0ea52ac to aeef532.*
Resolved conflicts and cleaned up the code as requested (removed empty blocks, moved tuning logic). Ready for merge.
*Force-pushed from 705e1bf to e6c76f7.*
Moves the logic that disables large matrix multiplication for ARM and Qualcomm devices from `ggml_vk_load_shaders` to the device-initialization switch block. This fixes stability issues (silent calculation errors) on Mali G720/Immortalis MC12 while adhering to the code structure requested in the PR ggml-org#18493 discussion.
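The relocated check can be pictured as a switch on the PCI vendor ID evaluated once at device initialization, so every later shader selection sees a consistent flag. The function name is illustrative; the vendor-ID values are the standard PCI IDs for ARM and Qualcomm.

```cpp
#include <cstdint>

#define VK_VENDOR_ID_ARM      0x13B5  // ARM (Mali/Immortalis)
#define VK_VENDOR_ID_QUALCOMM 0x5143  // Qualcomm (Adreno)

// Decide at device-initialization time whether the large matrix
// multiplication pipelines should be disabled. Evaluating this in the
// init switch block, rather than during shader loading, avoids the
// silent calculation errors observed on Mali G720 / Immortalis MC12.
bool disable_large_matmul(uint32_t vendor_id) {
    switch (vendor_id) {
        case VK_VENDOR_ID_ARM:
        case VK_VENDOR_ID_QUALCOMM:
            return true;
        default:
            return false;
    }
}
```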
*Force-pushed from 0764c47 to 9b86eb6.*
Please stop making large changes; it is impossible to review a moving target. Also, make sure to follow https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md: your recent responses were definitely AI-generated, and your code changes may be as well. This does not give me confidence that the PR actually does what it claims to do.
Yes, I know. After this past month, is there anything about the situation I still don't understand? By that logic, you should also comment out the Qualcomm Hexagon support. Android devices only last about 5 years of use; the Android device I have will be retired in two years anyway. This project is just a hobby; making money is my job.
I'm sorry. I've tried many approaches recently to reach a conclusion: on Android, I have to use zero-copy. One of the main purposes of opening this PR was to solve the high memory usage of the Vulkan backend (on the UMA architecture). You said you supported that, but the discussion turned the finding toward tuning. Now I'm fairly sure the problem is AHB...
If you object to my approach, change this into a draft. At present, in my situation, I can only try to respond. I'm sorry to have wasted so much of your time.
It's fine, if you still need to work on the PR, please mark it as draft. I can still help if you have questions. When you're ready for merge, let me know and I'll do a complete review. |
*Force-pushed from eb81f41 to eb6a3f4.*
- Add a CMake check for `android/hardware_buffer.h`
- Implement runtime detection for integrated GPUs lacking standard zero-copy support
- Enable AHB zero-copy only when standard zero-copy is unavailable on Android
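The bullets above describe a fallback rule that can be sketched as a pure predicate. The names are hypothetical; in the actual backend the first two inputs would come from an `#ifdef __ANDROID__` guard and the CMake-detected header, and the third from querying the driver's memory types.

```cpp
// Decide whether to take the AHardwareBuffer (AHB) zero-copy path.
// AHB is used only as a fallback: on Android, when the build found
// android/hardware_buffer.h, and when the driver does not already
// expose standard host-visible/device-local zero-copy memory.
bool use_ahb_zero_copy(bool is_android,
                       bool have_hardware_buffer_header,   // CMake check result
                       bool standard_zero_copy_available)  // driver capability
{
    return is_android
        && have_hardware_buffer_header
        && !standard_zero_copy_available;
}
```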
*Force-pushed from eb6a3f4 to 11b687a.*
This is preparatory work. I have to make sure this change is acceptable before continuing.
I'll keep things as they are until you give me your opinion. This submission breaks the work into several specific steps.
This submission only corrects the GPU memory type.
@Gong-Mi Any updates? |
## Description

This PR implements specialized performance tuning for ARM Mali G720 GPUs (vendor ID 0x13B5) in the Vulkan backend.

## Changes

## Performance Findings (Termux Native Build)

```
llama-bench -m models/llama-3.2-1b.gguf -p 512 -n 128 -t 4
```

These optimizations aim to provide a more usable experience for Android users running LLMs locally via Vulkan on modern Dimensity/Mali-based SoCs.