ggml-opencl, llama: using reserve() if count already known#7272

Merged
ggerganov merged 1 commit into ggml-org:master from GermanAizek:reserve-vec
May 20, 2024

@GermanAizek
Contributor

This change mostly affects the ggml_cl_mul_mat_q_f32 function.

@mofosyne mofosyne added refactoring Refactoring Review Complexity : High Generally require indepth knowledge of LLMs or GPUs labels May 14, 2024
@github-actions
Contributor

github-actions Bot commented May 14, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 547 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8563.44ms p(95)=20815.89ms fails=, finish reason: stop=478 truncated=69
  • Prompt processing (pp): avg=105.13tk/s p(95)=469.6tk/s
  • Token generation (tg): avg=33.15tk/s p(95)=46.6tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=reserve-vec commit=4ee29e5e1caf29e1bc7b094226faa890ae0e98d6

prompt_tokens_seconds

[Chart: llamacpp:prompt_tokens_seconds over time (x-axis 1716181689 to 1716182321), llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 547 iterations]

predicted_tokens_seconds

[Chart: llamacpp:predicted_tokens_seconds over time (x-axis 1716181689 to 1716182321), llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 547 iterations]

kv_cache_usage_ratio

[Chart: llamacpp:kv_cache_usage_ratio over time (x-axis 1716181689 to 1716182321), llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 547 iterations]

requests_processing

[Chart: llamacpp:requests_processing over time (x-axis 1716181689 to 1716182321), llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 547 iterations]

Comment thread llama.cpp Outdated
Member

Already reserved on line 6060

Contributor Author

Fixed in 4ee29e5.

Comment thread ggml-opencl.cpp Outdated
Member

Better to keep the for loop

Contributor Author

Fixed in 4ee29e5.

@mofosyne mofosyne marked this pull request as draft May 14, 2024 07:32
@GermanAizek GermanAizek marked this pull request as ready for review May 20, 2024 02:25
@ggerganov ggerganov merged commit 213e90e into ggml-org:master May 20, 2024
Comment thread ggml-opencl.cpp
for (int64_t i12 = i02 * r2, e12 = i12 + r2; i12 < e12; i12++) {
int64_t i12 = i02 * r2;
int64_t e12 = i12 + r2;
events.reserve(e12 - i12);
Collaborator

For future reference: events is cleared at the end of this inner loop, so the most it ever holds is 3 elements. Even ignoring the clear(), reserve() does not grow the vector by the specified amount; it increases the capacity to the specified amount. So you would need to reserve events.size() + e12 - i12 instead, if you were to even bother.

Luckily, this file is gone now, so this particular instance doesn't matter. But we should be more careful going forward.
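
To illustrate the point about reserve() taking a total capacity rather than an increment, here is a minimal sketch. The helper names `reserve_additional` and `appends_without_realloc` are hypothetical and do not come from this PR or from llama.cpp, and `int` stands in for the event type used in ggml-opencl.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: make room for `extra` more elements on top of what
// `v` already holds. reserve() takes a total capacity, not an increment,
// so the request has to be size() + extra.
template <typename T>
void reserve_additional(std::vector<T> & v, std::size_t extra) {
    v.reserve(v.size() + extra);
    assert(v.capacity() >= v.size() + extra);
}

// Hypothetical check: returns true if appending `extra` elements after
// reserve_additional() leaves the capacity unchanged, i.e. no reallocation.
inline bool appends_without_realloc(std::vector<int> v, std::size_t extra) {
    reserve_additional(v, extra);
    const std::size_t cap = v.capacity();
    for (std::size_t i = 0; i < extra; ++i) {
        v.push_back(0);
    }
    return v.capacity() == cap;
}
```

Calling `v.reserve(extra)` on a vector that already holds `extra` or more elements changes nothing, which is why the size has to be added. When the vector is cleared on every iteration and only ever holds a few elements, as described above, a single reserve before the loop (or none at all) is likely enough.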

Contributor Author

@cebtenzzre, good catch. You're right: the more reviewers there are, the lower the chance of making a mistake.

Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026
Labels

refactoring Refactoring Review Complexity : High Generally require indepth knowledge of LLMs or GPUs

4 participants