llama-bench : use random tokens to improve accuracy with mixtral #6069

Merged: ggerganov merged 1 commit into master from sl/bench-random-tokens on Mar 15, 2024
Conversation

@slaren (Member) commented on Mar 14, 2024

llama-bench currently does not produce accurate results with mixtral because it fills the entire prompt with the same token (bos). This causes the same experts to be chosen repeatedly, which does not happen during real usage. With this change, llama-bench uses random tokens instead.
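For reference, a minimal sketch of the idea in C++ (the helper name, signature, and seeding are illustrative assumptions, not the PR's exact code): fill the benchmark prompt with token ids drawn uniformly from the vocabulary instead of repeating bos.

```cpp
// Sketch: build a benchmark prompt of n_prompt random tokens instead of
// repeating the bos token, so a MoE model routes to varied experts.
#include <cstdint>
#include <random>
#include <vector>

using llama_token = int32_t; // matches the llama.cpp token type

// Hypothetical helper; llama-bench applies the same idea where it used
// to fill the prompt with llama_token_bos(model).
std::vector<llama_token> make_random_prompt(int n_prompt, int n_vocab) {
    std::mt19937 rng(1234); // fixed seed keeps repeated runs comparable
    std::uniform_int_distribution<int32_t> dist(0, n_vocab - 1);

    std::vector<llama_token> tokens(n_prompt);
    for (auto & t : tokens) {
        t = dist(rng);
    }
    return tokens;
}
```

In llama-bench itself the vocabulary size would come from the loaded model (e.g. via llama_n_vocab); it is passed in here only to keep the sketch self-contained.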

Current llama-bench results in master:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q3_K - Large | 19.03 GiB | 46.70 B | CUDA | 0 | pp 512 | 189.62 ± 0.97 |
| llama 7B Q3_K - Large | 19.03 GiB | 46.70 B | CUDA | 0 | pp 1024 | 182.17 ± 0.54 |
| llama 7B Q3_K - Large | 19.03 GiB | 46.70 B | CUDA | 99 | pp 512 | 613.36 ± 0.81 |
| llama 7B Q3_K - Large | 19.03 GiB | 46.70 B | CUDA | 99 | pp 1024 | 607.84 ± 0.48 |

build: 4755afd (2431)

Using main with a large, representative prompt (extracted from the text of the Frankenstein book) produces these values instead:

With -ngl 0:

llama_print_timings: prompt eval time =    8695.65 ms /   512 tokens (   16.98 ms per token,    58.88 tokens per second)
llama_print_timings: prompt eval time =   17340.24 ms /  1024 tokens (   16.93 ms per token,    59.05 tokens per second)

With -ngl 99:

llama_print_timings: prompt eval time =    1411.63 ms /   512 tokens (    2.76 ms per token,   362.70 tokens per second)
llama_print_timings: prompt eval time =    2811.67 ms /  1024 tokens (    2.75 ms per token,   364.20 tokens per second)

llama-bench after this PR:

Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q3_K - Large | 19.03 GiB | 46.70 B | CUDA | 0 | pp 512 | 61.87 ± 0.26 |
| llama 7B Q3_K - Large | 19.03 GiB | 46.70 B | CUDA | 0 | pp 1024 | 61.56 ± 0.11 |
| llama 7B Q3_K - Large | 19.03 GiB | 46.70 B | CUDA | 99 | pp 512 | 378.69 ± 0.87 |
| llama 7B Q3_K - Large | 19.03 GiB | 46.70 B | CUDA | 99 | pp 1024 | 377.54 ± 1.76 |

The small difference is probably due to the warmup run performed by llama-bench.

Why this is important: a future change will cause all experts to be copied to VRAM during prompt processing, regardless of whether they are actually used, whereas currently only the experts that are used are copied. Accurate benchmarks are needed to understand the performance impact of that change.

@ggerganov (Member) left a comment:

M2 Ultra

master

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B F16 | 86.99 GiB | 46.70 B | Metal | 99 | pp 512 | 302.32 ± 0.54 |
| llama 7B F16 | 86.99 GiB | 46.70 B | Metal | 99 | pp 1024 | 301.49 ± 0.12 |

build: 4755afd (2431)

PR

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B F16 | 86.99 GiB | 46.70 B | Metal | 99 | pp 512 | 275.43 ± 1.19 |
| llama 7B F16 | 86.99 GiB | 46.70 B | Metal | 99 | pp 1024 | 279.04 ± 0.67 |

build: 8281389 (2432)

@ggerganov merged commit b0bc9f4 into master on Mar 15, 2024
@slaren deleted the sl/bench-random-tokens branch on March 15, 2024 at 10:46