feat: Add benchmark for Qwen3 and update readme about benchmark #487
chenghuaWang merged 3 commits into UbiquitousLearning:v2 from
Conversation
Important: Review skipped. Auto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI. You can disable this status message by setting the …

Walkthrough

This PR establishes a complete LLM benchmark tool framework for testing model performance. Changes include: adding ignore rules for virtual environments and model artifacts, defining an abstract benchmark interface with virtual methods, implementing Qwen3 benchmark support with timing instrumentation and performance metrics, and creating comprehensive documentation for the benchmark tool.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Main as Benchmark Runner
    participant Template as BenchmarkTemplate
    participant Model as Qwen3Model

    Main->>Template: init(config_path, model_path, cache_length)
    Template->>Model: Load config & weights
    Model-->>Template: Ready

    rect rgba(200, 220, 255, 0.3)
        Note over Main: For each PP/TG configuration
        Main->>Template: warmup()
        Template->>Model: Generate test input
        Model-->>Template: Complete
        loop 3 benchmark runs
            Main->>Template: clear() - Reset KV cache
            rect rgba(255, 220, 200, 0.3)
                Main->>Template: run(pp, tg) - Time generation
                Template->>Model: streamGenerate with timing
                Note over Model: Measure: TTFT, prefill speed, decode speed
                Model-->>Template: BenchmarkTemplateResult
                Template-->>Main: Metrics
            end
            Main->>Main: Sleep 5s (cooldown)
        end
    end
    Main->>Main: Compute averages & print results
```
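As a companion to the diagram, here is a minimal C++ sketch of that control flow. `BenchmarkTemplate`, `BenchmarkTemplateResult`, `warmup`, `clear`, and `run` are names from this PR; the loop body itself is illustrative, not the PR's actual main.cpp:

```cpp
#include <chrono>
#include <thread>
#include <utility>
#include <vector>

// Illustrative sketch of the benchmark loop the diagram describes; the real
// main.cpp may differ in argument handling and output formatting.
void runSuite(BenchmarkTemplate& benchmark,
              const std::vector<std::pair<int, int>>& pp_tg_pairs) {
  for (const auto& [pp, tg] : pp_tg_pairs) {
    benchmark.warmup();
    std::vector<BenchmarkTemplateResult> results;
    for (int turn = 0; turn < 3; ++turn) {
      benchmark.clear();  // reset KV cache between turns
      results.push_back(benchmark.run(pp, tg));
      std::this_thread::sleep_for(std::chrono::seconds(5));  // cooldown
    }
    // ...average TTFT / prefill speed / decode speed over `results` and print
  }
}
```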
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

The changes span multiple related files with varying complexity. The Qwen3 implementation requires careful review of timing instrumentation and the streaming generation flow. The benchmark orchestration logic in main.cpp involves threading and result aggregation. However, the changes are cohesive and follow a consistent pattern (implementing a benchmark framework), which moderates the overall review burden.

Possibly related PRs
Poem
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 4
🧹 Nitpick comments (8)
tools/mllm-llm-benchmark/README.md (2)
77-83: Add a language to the code fence to satisfy markdownlint (MD040).

````diff
-```
+```text
 MLLM Build Version : abc123def456
 ...
-```
+```
````
183-195: Correct the string lowercasing in the All.hpp example.

`tolower(model_name)` isn't valid for `std::string`. Use `std::transform`.

```diff
- auto normalized_model_name = tolower(model_name);
+ std::string normalized_model_name = model_name;
+ std::transform(normalized_model_name.begin(), normalized_model_name.end(),
+                normalized_model_name.begin(),
+                [](unsigned char c){ return static_cast<char>(std::tolower(c)); });
```

tools/mllm-llm-benchmark/main.cpp (1)
97-107: Avoid shadowing Argparse vars; use clearer names.

Also compute averages using `results.size()` to stay correct if the run count changes.

```diff
- for (auto [pp, tg] : pp_tg_pairs) {
+ for (auto [prompt_len, gen_len] : pp_tg_pairs) {
-   mllm::print(" Prompt Length (PP) :", pp);
-   mllm::print(" Generation Length (TG):", tg);
+   mllm::print(" Prompt Length (PP) :", prompt_len);
+   mllm::print(" Generation Length (TG):", gen_len);
```

And later:

```diff
- auto result = benchmark->run(pp, tg);
+ auto result = benchmark->run(prompt_len, gen_len);
```

And for averages/summary:

```diff
- mllm::print("Configuration: PP=", pp, " TG=", tg);
+ mllm::print("Configuration: PP=", prompt_len, " TG=", gen_len);
- avg_ttft /= 3.0f;
- avg_prefill_speed /= 3.0f;
- avg_decode_speed /= 3.0f;
+ const float denom = results.empty() ? 1.0f : static_cast<float>(results.size());
+ avg_ttft /= denom;
+ avg_prefill_speed /= denom;
+ avg_decode_speed /= denom;
```

tools/mllm-llm-benchmark/models/Qwen3_W4A32_KAI.hpp (4)
96-98: KV cache "clear" may be incomplete; prefer an explicit clear if available.

Resetting the sequence count might not release buffers or indices. Use the cache's clear/flush API if it exists.

```diff
- model_->kvCache().setCurrentSeqCnt(0);
+ // Prefer a full clear if the API provides one
+ // model_->kvCache().clear(); // or equivalent
+ model_->kvCache().setCurrentSeqCnt(0);
```

Run the provided script in the main.cpp comment to list kvCache methods.
124-145: Use steady_clock for timing; high_resolution_clock may be non-monotonic.

steady_clock is more stable for latency metrics.

```diff
- auto prefill_start = std::chrono::high_resolution_clock::now();
- auto decode_start = std::chrono::high_resolution_clock::now();
- auto decode_end = std::chrono::high_resolution_clock::now();
+ using Clock = std::chrono::steady_clock;
+ auto prefill_start = Clock::now();
+ auto decode_start = Clock::now();
+ auto decode_end = Clock::now();
```

The duration computations stay unchanged:

```cpp
auto prefill_duration = std::chrono::duration_cast<std::chrono::microseconds>(decode_start - prefill_start).count();
auto decode_duration = std::chrono::duration_cast<std::chrono::microseconds>(decode_end - decode_start).count();
```
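For context, a hedged sketch of how these microsecond durations typically map onto the reported metrics; `pp`, `tg`, and the duration variables come from the quoted code, while the exact formulas in the PR's Qwen3_W4A32_KAI.hpp may differ:

```cpp
// Illustrative metric computation; the PR's actual formulas may differ
// (e.g., whether the first decoded token counts toward decode speed).
float ttft_ms       = static_cast<float>(prefill_duration) / 1000.0F;
float prefill_speed = static_cast<float>(pp) * 1'000'000.0F
                      / static_cast<float>(prefill_duration);  // tokens/s
float decode_speed  = static_cast<float>(tg) * 1'000'000.0F
                      / static_cast<float>(decode_duration);   // tokens/s
```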
118-123: Confirm the correct generation arg keys.

Some runtimes expect `max_new_tokens` instead of `max_length`. Verify the expected keys to ensure bounded generation.

```diff
- args["max_length"] = mllm::AnyValue(max_len);
+ // If the runtime expects max_new_tokens:
+ // args["max_new_tokens"] = mllm::AnyValue(max_len);
```

Use the verification script in the main.cpp comment to grep for supported keys.
5-8: Drop the unused include.

The header shouldn't pull in heavy includes unnecessarily.

```diff
-#include <thread>
```

tools/mllm-llm-benchmark/models/BenchmarkTemplate.hpp (1)
21-23: Consider explicitly defining special member functions.

As flagged by static analysis, the class defines a virtual destructor but doesn't address copy/move semantics. For abstract interface classes, it's best practice to explicitly delete copy and move operations to prevent object slicing and clarify intent.

Apply this diff to follow the Rule of Five and silence the static-analysis warning:

```diff
 class BenchmarkTemplate {
  public:
   virtual ~BenchmarkTemplate() = default;
+  BenchmarkTemplate(const BenchmarkTemplate&) = delete;
+  BenchmarkTemplate& operator=(const BenchmarkTemplate&) = delete;
+  BenchmarkTemplate(BenchmarkTemplate&&) = delete;
+  BenchmarkTemplate& operator=(BenchmarkTemplate&&) = delete;
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- .gitignore (1 hunks)
- examples/qwen3/main.cpp (1 hunks)
- tools/mllm-llm-benchmark/README.md (1 hunks)
- tools/mllm-llm-benchmark/main.cpp (2 hunks)
- tools/mllm-llm-benchmark/models/BenchmarkTemplate.hpp (1 hunks)
- tools/mllm-llm-benchmark/models/Qwen3_W4A32_KAI.hpp (1 hunks)
🧰 Additional context used
🪛 Clang (14.0.6)
tools/mllm-llm-benchmark/models/Qwen3_W4A32_KAI.hpp
[error] 5-5: 'memory' file not found
(clang-diagnostic-error)
[error] 14-14: invalid case style for class 'Qwen3_W4A32_KAI_Benchmark'
(readability-identifier-naming,-warnings-as-errors)
[error] 16-16: method 'init' can be made static
(readability-convert-member-functions-to-static,-warnings-as-errors)
[error] 16-16: 3 adjacent parameters of 'init' of similar type are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 62-62: variable 'warmup_length' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 63-63: variable 'warmup_gen' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 76-76: variable 'inputs' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 79-79: variable 'args' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 80-80: variable 'max_len' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 100-100: method 'run' can be made static
(readability-convert-member-functions-to-static,-warnings-as-errors)
[error] 100-100: 2 adjacent parameters of 'run' of similar type ('int') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 100-100: parameter name 'pp' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 100-100: parameter name 'tg' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 103-103: floating point literal has suffix 'f', which is not uppercase
(readability-uppercase-literal-suffix,-warnings-as-errors)
[error] 103-103: floating point literal has suffix 'f', which is not uppercase
(readability-uppercase-literal-suffix,-warnings-as-errors)
[error] 103-103: floating point literal has suffix 'f', which is not uppercase
(readability-uppercase-literal-suffix,-warnings-as-errors)
[error] 115-115: variable 'inputs' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 118-118: variable 'args' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 119-119: variable 'max_len' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 146-146: uninitialized record type: 'result'
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 155-155: floating point literal has suffix 'f', which is not uppercase
(readability-uppercase-literal-suffix,-warnings-as-errors)
[error] 162-162: floating point literal has suffix 'f', which is not uppercase
(readability-uppercase-literal-suffix,-warnings-as-errors)
tools/mllm-llm-benchmark/models/BenchmarkTemplate.hpp
[error] 21-21: class 'BenchmarkTemplate' defines a default destructor but does not define a copy constructor, a copy assignment operator, a move constructor or a move assignment operator
(cppcoreguidelines-special-member-functions,-warnings-as-errors)
🪛 markdownlint-cli2 (0.18.1)
tools/mllm-llm-benchmark/README.md
77-77: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🔇 Additional comments (3)
examples/qwen3/main.cpp (1)
63-64: No functional change; OK to keep. Visually separates error output from the perf summary.
tools/mllm-llm-benchmark/models/BenchmarkTemplate.hpp (2)
10-14: Clean benchmark result structure.

The struct design is appropriate for holding benchmark metrics. The documentation clearly explains each field's purpose and units.
25-60: Well-designed benchmark interface.

The abstract interface provides a clear contract for benchmark implementations, covering initialization, warmup, execution, and cleanup. The method signatures are appropriate, and the documentation effectively explains each method's purpose and parameters.
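For readers following along without the source open, the interface described above looks roughly like this. This is a sketch reconstructed from this review and the PR's sequence diagram; the exact signatures live in BenchmarkTemplate.hpp:

```cpp
#include <string>

// Rough shape of the interface as described in this review.
struct BenchmarkTemplateResult {
  float ttft;           // time to first token, in ms
  float prefill_speed;  // tokens/s
  float decode_speed;   // tokens/s
};

class BenchmarkTemplate {
 public:
  virtual ~BenchmarkTemplate() = default;
  // Load config and weights; cache_length bounds the KV cache.
  virtual void init(const std::string& config_path, const std::string& model_path,
                    int cache_length) = 0;
  virtual void warmup() = 0;                                // one untimed generation
  virtual BenchmarkTemplateResult run(int pp, int tg) = 0;  // timed: pp prompt tokens, tg generated tokens
  virtual void clear() = 0;                                 // reset KV cache between runs
};
```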
```cpp
// Actual run for 3 turns and gives avg results. Each turn will sleep for 5 seconds to let the SoC or GPU/NPU cool down.
mllm::print("\n========================================");
mllm::print("Starting Benchmark Tests");
mllm::print("========================================\n");
```
🧩 Analysis chain
Harden CLI: require mandatory args; set thread count; validate PP/TG lists.
- Mark `-n`, `-m`, `-c`, `-pp`, `-tg` as required; fail fast if missing.
- Use `num_threads` if set (hook it to the runtime's thread setter).
- Validate non-empty PP/TG lists and catch `stoi` errors.
🏁 Script executed:
```bash
#!/bin/bash
# Find thread count setter and kvCache APIs across the repo.
rg -nP 'setNumThreads|num_threads|threads\W*set' -C2
rg -n 'kvCache\(\)\.' -C2
# Confirm which gen args are supported
rg -nP 'max_length|max_new_tokens' -C2
```

Length of output: 18650
🏁 Script executed:
```bash
#!/bin/bash
# Examine the main.cpp file to see argument setup and validation
cat tools/mllm-llm-benchmark/main.cpp | head -150
```

Length of output: 6072
🏁 Script executed:
```bash
#!/bin/bash
# Search for PP/TG parsing and usage
rg -n 'pp\.value|tg\.value|\bpp\b|\btg\b' tools/mllm-llm-benchmark/main.cpp -A2 -B2
```

Length of output: 2272
🏁 Script executed:
```bash
#!/bin/bash
# Search for num_threads usage in generation calls
rg -n 'num_threads|generate' tools/mllm-llm-benchmark/main.cpp -A3 -B1
```

Length of output: 502
🏁 Script executed:
```bash
#!/bin/bash
# Check if there's error handling for stoi or parsing
rg -n 'stoi|try|catch|parse|exception' tools/mllm-llm-benchmark/main.cpp
```

Length of output: 1040
🏁 Script executed:
```bash
#!/bin/bash
# Check benchmark initialization and what parameters it accepts
rg -n 'benchmark->init|benchmark->run|benchmark->warmup' tools/mllm-llm-benchmark/main.cpp -A1 -B1
```

Length of output: 324
🏁 Script executed:
```bash
#!/bin/bash
# Search for benchmark class definition and its methods
fd -t f -e hpp -e cpp | xargs grep -l 'class.*Benchmark' | head -5
```

Length of output: 0
🏁 Script executed:
```bash
#!/bin/bash
# Check if num_threads is used anywhere else in the benchmark tool
rg -n 'num_threads' tools/mllm-llm-benchmark/ -C1
```

Length of output: 392
🏁 Script executed:
```bash
#!/bin/bash
# Search for .required() usage pattern in the codebase to confirm it's supported
rg -n '\.required\(\)' --type cpp -C1 | head -30
```

Length of output: 0
🏁 Script executed:
```bash
#!/bin/bash
# Find and examine Argparse.hpp to see available methods
fd -name "Argparse.hpp" -type f
```

Length of output: 230
🏁 Script executed:
```bash
#!/bin/bash
# Examine the Argparse header for method signatures
fd -name "Argparse.hpp" -type f -exec cat {} \;
```

Length of output: 230
🏁 Script executed:
```bash
#!/bin/bash
# Find Argparse.hpp with proper escaping
find . -name "Argparse.hpp" -type f
```

Length of output: 25
🏁 Script executed:
```bash
#!/bin/bash
# Search for Argparse class definition
rg -n 'class Argparse' -A 20 --type cpp --type hpp | head -50
```

Length of output: 27
🏁 Script executed:
```bash
#!/bin/bash
# Check for .required or .mandatory or constraint methods in Argparse
rg -n '\bArgparse::add|\.help\(|\.required\(|\bmandatory\(' --type cpp --type hpp | head -30
```

Length of output: 27
🏁 Script executed:
```bash
#!/bin/bash
# Directly read the Argparse.hpp file
cat ./mllm/utils/Argparse.hpp | head -200
```

Length of output: 5337
🏁 Script executed:
```bash
#!/bin/bash
# Search for method definitions in Argparse.hpp
cat ./mllm/utils/Argparse.hpp | grep -n 'required\|mandatory\|constraint' -A2 -B2
```

Length of output: 672
Harden CLI: mark mandatory args; use thread count; validate PP/TG; handle parse errors.
The following issues need fixing:
- Mark required arguments (`-n`, `-m`, `-c`, `-pp`, `-tg`) using `.required()` on lines 18-23. Argparse supports this and enforces it during `parse()`.
- Hook `num_threads` to the runtime: it is currently retrieved at line 21 but never used. Call `mllm::setMaximumNumThreads(num_threads.get())` after parsing, or pass it to `benchmark->init()` if the method accepts it.
- Validate PP/TG non-empty and handle `stoi` errors (lines 75-83): wrap the parsing loops in try-catch to catch `std::invalid_argument` and `std::out_of_range` from `stoi()`. Also validate that `pp.get()` and `tg.get()` are non-empty before parsing.
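A minimal sketch of the third bullet, assuming the PP/TG flags carry comma-separated integer lists; `parseLengthList` is a hypothetical helper, and `mllm::print` is the logger used elsewhere in this tool:

```cpp
#include <cstdlib>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Illustrative hardening of the PP/TG parsing; the helper itself is
// hypothetical and only shows the validation pattern described above.
std::vector<int> parseLengthList(const std::string& csv, const char* flag) {
  if (csv.empty()) {
    mllm::print("Error:", flag, "must be a non-empty comma-separated list");
    std::exit(1);
  }
  std::vector<int> out;
  std::stringstream ss(csv);
  std::string item;
  while (std::getline(ss, item, ',')) {
    try {
      out.push_back(std::stoi(item));
    } catch (const std::invalid_argument&) {
      mllm::print("Error: invalid number in", flag, ":", item);
      std::exit(1);
    } catch (const std::out_of_range&) {
      mllm::print("Error: value out of range in", flag, ":", item);
      std::exit(1);
    }
  }
  return out;
}
```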
🤖 Prompt for AI Agents
In tools/mllm-llm-benchmark/main.cpp around lines 92-96 and touching argument
definitions at lines 18-23 and parsing/validation at 75-83: mark the CLI args
-n, -m, -c, -pp, -tg as required by calling .required() on their definitions so
parse() enforces presence; after parsing, call
mllm::setMaximumNumThreads(num_threads.get()) (or pass num_threads.get() into
benchmark->init() if preferred) to apply the thread count; before converting pp
and tg strings ensure pp.get() and tg.get() are non-empty and wrap the stoi
loops in try-catch blocks catching std::invalid_argument and std::out_of_range,
logging a clear error and exiting/returning on parse failure.
```cpp
      // Run benchmark
      auto result = benchmark->run(pp, tg);

      // TODO Sleep some times.
      results.push_back(result);

      mllm::print(" TTFT :", result.ttft, "ms");
      mllm::print(" Prefill Speed:", result.prefill_speed, "tokens/s");
      mllm::print(" Decode Speed :", result.decode_speed, "tokens/s");
```
Wrap `run` in try/catch to prevent the entire suite aborting on one failure.

Also log and continue.

```diff
-      // Run benchmark
-      auto result = benchmark->run(pp, tg);
-      results.push_back(result);
+      // Run benchmark
+      try {
+        auto result = benchmark->run(prompt_len, gen_len);
+        results.push_back(result);
+      } catch (const std::exception& ex) {
+        mllm::print(" Error during run: ", ex.what());
+        continue; // proceed with next run
+      }
```
+ }📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Suggested replacement (the prints sit inside the try block so `result` stays in scope, as the AI-agent prompt below describes):

```cpp
      // Run benchmark
      try {
        auto result = benchmark->run(pp, tg);
        results.push_back(result);

        mllm::print(" TTFT :", result.ttft, "ms");
        mllm::print(" Prefill Speed:", result.prefill_speed, "tokens/s");
        mllm::print(" Decode Speed :", result.decode_speed, "tokens/s");
      } catch (const std::exception& ex) {
        mllm::print(" Error during run: ", ex.what());
        continue; // proceed with next run
      }
```
🤖 Prompt for AI Agents
In tools/mllm-llm-benchmark/main.cpp around lines 114-121, wrap the call to
benchmark->run(pp, tg) along with pushing the result and the mllm::print lines
in a try block so a single benchmark failure doesn't abort the whole suite; on
success push_back the result and print speeds as before, on failure catch
std::exception& e and log the error (e.g. with mllm::print including e.what()),
also add a catch(...) to log unknown failures, and simply continue the loop
without pushing a result.
```cpp
  auto param = mllm::load(model_path, mllm::ModelFileVersion::kV2);
  model_->load(param);

  mllm::print("Model initialized successfully");
}
```
Don't hard-code ModelFileVersion; autodetect or make it configurable.

Weights may be V1 or V2. For robustness, detect the version from the file or pass it in.

```diff
-  auto param = mllm::load(model_path, mllm::ModelFileVersion::kV2);
+  // If available, detect the file version; otherwise, expose a parameter or document V2-only.
+  auto param = mllm::load(model_path, /* TODO: detect or inject version */ mllm::ModelFileVersion::kV2);
```

If detection exists (e.g., `mllm::detectModelFileVersion(model_path)`), use it here. Otherwise, document V2-only in the README and CLI help.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In tools/mllm-llm-benchmark/models/Qwen3_W4A32_KAI.hpp around lines 29-33, the
code hard-codes ModelFileVersion::kV2 when loading the model; change this to
either autodetect the version from model_path (e.g., call
mllm::detectModelFileVersion(model_path) and pass the returned enum to
mllm::load) or make the version configurable (read from a CLI flag or config and
use that value), and if no detection API is available update the README and CLI
help text to explicitly state the code only supports V2 files so callers know
the limitation.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
LGTM
Summary by CodeRabbit

New Features
- LLM benchmark tool with an abstract benchmark interface (registered via All.hpp) and a Qwen3 implementation (Qwen3_W4A32_KAI.hpp) reporting TTFT, prefill speed, and decode speed.

Documentation
- README for the benchmark tool.

Chores
- Ignore rules for virtual environments and model artifacts.