
[Mirror] Add Gemma3n multimodal support with MobileNetV5 vision encoder#64

Open
ngxson wants to merge 22 commits into ngxson:master from simrnsingh:feat-gemma3n-vision

Conversation

@ngxson
Owner

@ngxson ngxson commented Dec 22, 2025

Mirror from upstream PR: ggml-org#18256

Summary by CodeRabbit

  • New Features

    • Added multimodal vision+audio support with a MobileNetV5-based vision encoder and Conformer-based audio pathway.
    • Expanded public model/tensor surfaces and writer options to handle vision/audio tensors and larger vocab/embedding sizes.
  • Chores

    • Integrated MobileNetV5 graph, model loading, and tooling updates across build and conversion utilities.

✏️ Tip: You can customize this high-level summary in your review settings.

…ert_hf_to_gguf.py. Add gemma3n to vision projectors in gguf-py/gguf/constants.py.
2. Use available tensor mapping logic
3. Remove redundant chat template replacement of soft tokens placeholder with media placeholder
…struct and definitions to mobilenetv5.cpp

2. Remove unused `clip_is_gemma3n` func declarations and definitions
3. Remove redundant `rescale_image_u8_to_f32` func and use `normalize_image_u8_to_f32` with zero mean and unit std
4. Calculate n_patches using image_size / patch_size
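To illustrate item 3: with zero mean and unit std, a generic normalization helper reduces to a plain rescale, so a dedicated rescale function is redundant. A minimal Python sketch (hypothetical helpers, not the actual clip.cpp code; it assumes the normalize path computes `(px/255 - mean)/std` per channel):

```python
def normalize_u8(pixels, mean, std):
    # Hypothetical per-channel normalization: u8 -> f32 in one pass.
    return [(p / 255.0 - m) / s for p, m, s in zip(pixels, mean, std)]

def rescale_u8(pixels):
    # The redundant helper: plain rescale to [0, 1].
    return [p / 255.0 for p in pixels]

px = [0, 128, 255]
# With zero mean and unit std, normalize == rescale, so one helper suffices.
assert normalize_u8(px, (0.0, 0.0, 0.0), (1.0, 1.0, 1.0)) == rescale_u8(px)
```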
@coderabbitai

coderabbitai bot commented Dec 22, 2025

📝 Walkthrough

Walkthrough

Adds Gemma3n (MobileNetV5) vision/audio multimodal support across the converter, GGUF constants/mappings, and mtmd C++ toolchain: new tensor mappings, converter model classes/overrides, a MobileNetV5 graph implementation, loader wiring, and runtime input/embedding handling.

Changes

  • Python converter & multimodal models (convert_hf_to_gguf.py): Add ConformerAudioModel, Gemma3nVisionModel, Gemma3nVisionAudioModel; new overrides (init, set_vocab, set_gguf_parameters, tensor_force_quant, modify_tensors, custom_map), block tensor mapping, vocab/embedding padding, and audio tensor routing.
  • GGUF constants & mapping tables (gguf-py/gguf/constants.py, gguf-py/gguf/tensor_mapping.py): Add VISION_PROJECTOR_TYPE.GEMMA3N, many new V_*/A_*/MM_* MODEL_TENSOR entries, public name mappings, and extend MODEL_ARCH / TensorNameMap mappings for Gemma3n vision/audio tensors.
  • mtmd graph & MobileNetV5 model (tools/mtmd/models/mobilenetv5.cpp, tools/mtmd/models/models.h): New clip_graph_mobilenetv5 implementation with helpers (rms_norm_2d, pad_same_2d) and builders (edge residual, inverted residual, attention), the MSFA pipeline, and a full build() producing the Gemma3n embedding projection.
  • mtmd integration, headers & model structs (tools/mtmd/clip-impl.h, tools/mtmd/clip-model.h, tools/mtmd/clip.cpp): Add TN_MNV5_* macros, PROJECTOR_TYPE_GEMMA3NV/GEMMA3NA, the mobilenetv5_block struct and mobilenet/MSFA fields in clip_model, and GEMMA3NV branches for graph selection, loading, preprocessing, token sizing, and batch encoding.
  • Build config & runtime tweaks (tools/mtmd/CMakeLists.txt, tools/mtmd/mtmd.cpp): Add models/mobilenetv5.cpp to the build; treat GEMMA3NV like GEMMA3 in token setup and decode logic.
  • Runtime per-layer input changes (src/models/gemma3n-iswa.cpp): Append a vision-embedding extraction/dequantize/pad path in get_per_layer_inputs; reorder the project_per_layer_inputs add sequence.
  • GGUF writer metadata (gguf-py/gguf/gguf_writer.py): Add setters to record clip vision and audio projector types in GGUF metadata keys.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant HF as HuggingFace Model
    participant Converter as Python Converter
    participant GGUF as GGUF Writer / Mappings
    participant Loader as mtmd Loader (C++)
    participant Graph as clip_graph_mobilenetv5
    participant Runtime as Inference Runtime

    HF->>Converter: export tensors & hparams
    Converter->>GGUF: map tensor names (custom_map / block_tensor_mapping), set GGUF params
    Converter-->>Loader: write GGUF with Gemma3n vision/audio tensors
    Loader->>Graph: select clip_graph_mobilenetv5 (PROJECTOR_TYPE_GEMMA3N/GEMMA3NV)
    Loader->>Graph: load mobilenetv5 tensors (stem, blocks, MSFA)
    Graph->>Runtime: build graph (stem → stages → MSFA → embed proj)
    Runtime->>Runtime: preprocess image/audio → encode tokens → run graph → produce multimodal tokens

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

"I hopped through tensors, stitched each seam,
Stem to block, conv to attention beam.
Pixels and pulses now dance in tow,
MobileNet hums and the embeddings grow.
The rabbit winks — multimodal glow!"

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
  • Description check (⚠️ Warning): The PR description is minimal and only references the upstream PR without providing substantive details about the changes, objectives, testing, or integration plan expected in a complete description. Resolution: expand the description to include a summary of key changes (Gemma3n multimodal support, MobileNetV5 integration), testing performed, and any known issues or follow-ups from the upstream PR review.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 16.07%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
  • Title check (✅ Passed): The title accurately describes the primary change: adding Gemma3n multimodal support with MobileNetV5 vision encoder. It is specific, concise, and clearly indicates the main objective of the PR.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8f6dbbe and 60c23c9.

📒 Files selected for processing (1)
  • gguf-py/gguf/constants.py
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".

When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.

Applied to files:

  • gguf-py/gguf/constants.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: ggml-ci-arm64-cpu-low-perf
  • GitHub Check: ggml-ci-arm64-cpu-high-perf
  • GitHub Check: ggml-ci-arm64-cpu-kleidiai
  • GitHub Check: ggml-ci-arm64-cpu-high-perf-sve
  • GitHub Check: openEuler-latest-cmake-cann (x86, 310p, Release)
  • GitHub Check: ios-xcode-build
  • GitHub Check: ubuntu-22-cmake-hip
  • GitHub Check: windows-latest-cmake (llvm-arm64, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/ar...
  • GitHub Check: ubuntu-latest-cmake-rpc
  • GitHub Check: windows-msys2 (CLANG64, clang-x86_64, Release)
  • GitHub Check: windows-latest-cmake-hip
  • GitHub Check: ubuntu-latest-cmake-cuda
  • GitHub Check: macOS-latest-cmake-arm64-webgpu
  • GitHub Check: macOS-latest-cmake-x64
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: server (ADDRESS, RelWithDebInfo)
  • GitHub Check: server (Release, LLAMA_ARG_BACKEND_SAMPLING=1)
  • GitHub Check: server-windows
  • GitHub Check: server (UNDEFINED, RelWithDebInfo)
  • GitHub Check: server (Release)

Comment @coderabbitai help to get the list of available commands and usage tips.

@ngxson ngxson changed the title Add Gemma3n multimodal support with MobileNetV5 vision encoder [Mirror] Add Gemma3n multimodal support with MobileNetV5 vision encoder Dec 22, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (6)
tools/mtmd/models/models.h (1)

69-101: LGTM - MobileNetV5 graph builder declaration follows established patterns.

The clip_graph_mobilenetv5 struct correctly follows the existing pattern of other graph implementations in this file. The helper method declarations align with their implementations in mobilenetv5.cpp (per the code snippets).

The mobilenetv5_block type reference should resolve correctly via the include chain (clip-graph.h → clip-model.h).

Optional: Consider making helper methods private

The helper methods (rms_norm_2d, pad_same_2d, build_edge_residual, build_inverted_residual, build_mobilenet_attn) are implementation details that could be declared as private. However, since other graph implementations in this file also use public methods, this is optional and maintaining consistency with the existing codebase pattern is reasonable.

tools/mtmd/clip-model.h (1)

330-345: Appropriate additions for Gemma3n MobileNetV5 encoder.

The additions to clip_model are well-structured:

  • MobileNetV5 components use std::vector for flexible block management
  • MSFA (Multi-Scale Fusion Adapter) components are properly prefixed and organized
  • Naming conventions are consistent with existing fields

Note: There's an extra blank line at line 346, which may be intentional for readability but could be removed for consistency.

tools/mtmd/clip.cpp (2)

1619-1622: Use tensor name macros instead of hard-coded strings.

For consistency with the rest of the codebase and maintainability, consider defining macros for these tensor names in clip-impl.h:

-                    model.mm_0_w = get_tensor("mm.embedding.weight", false);  // Input embedding
-                    model.mm_1_w = get_tensor("mm.hard_emb_norm.weight", false);  // Hard embedding norm
+                    model.mm_0_w = get_tensor(TN_MM_EMBEDDING, false);  // Input embedding
+                    model.mm_1_w = get_tensor(TN_MM_HARD_EMB_NORM, false);  // Hard embedding norm

This would require adding corresponding macros to clip-impl.h.


1528-1623: Consider adding bounds checking for stage index access.

The dynamic block discovery loop iterates over 4 stages (0-3), but the mobilenet_stage_ends vector is populated based on actual blocks found. When accessing model.mobilenet_stage_ends[2] and [3] later in mobilenetv5.cpp::is_fusion_point(), ensure bounds checks are in place to avoid undefined behavior if fewer than 4 stages contain blocks.

The current check at line 284-288 (if (model.mobilenet_stage_ends.size() >= 4)) in mobilenetv5.cpp handles this, but consider adding a validation after loading:

if (model.mobilenet_stage_ends.size() < 4) {
    LOG_WRN("%s: GEMMA3N expected 4 stages but found %zu\n", 
            __func__, model.mobilenet_stage_ends.size());
}
tools/mtmd/models/mobilenetv5.cpp (2)

329-336: Remove or use the commented-out variable scale_h.

The variable scale_h is declared but commented out. Either use it for validation or remove it to avoid confusion:

                 int scale_w = high_res_w / feat_w;
-                // int scale_h = high_res_h / feat_h;

If height scaling should be validated separately:

int scale_h = high_res_h / feat_h;
if (scale_w != scale_h) {
    LOG_WRN("%s: non-uniform scaling in MSFA (scale_w=%d, scale_h=%d)\n", 
            __func__, scale_w, scale_h);
}

381-381: Consider making target output resolution configurable.

The target output resolution is hard-coded as 16:

const int target_out_res = 16;

Consider making this a model hyperparameter or deriving it from the model configuration to improve flexibility for future MobileNetV5 variants.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8f48807 and 86618c7.

📒 Files selected for processing (11)
  • convert_hf_to_gguf.py
  • gguf-py/gguf/constants.py
  • gguf-py/gguf/tensor_mapping.py
  • src/models/gemma3n-iswa.cpp
  • tools/mtmd/CMakeLists.txt
  • tools/mtmd/clip-impl.h
  • tools/mtmd/clip-model.h
  • tools/mtmd/clip.cpp
  • tools/mtmd/models/mobilenetv5.cpp
  • tools/mtmd/models/models.h
  • tools/mtmd/mtmd.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{cpp,h,hpp}: Always format C++ code using git clang-format before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary

Files:

  • tools/mtmd/models/models.h
  • tools/mtmd/clip-model.h
  • tools/mtmd/clip.cpp
  • tools/mtmd/models/mobilenetv5.cpp
  • src/models/gemma3n-iswa.cpp
  • tools/mtmd/mtmd.cpp
  • tools/mtmd/clip-impl.h
**/*.{cpp,h,hpp,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible

Files:

  • tools/mtmd/models/models.h
  • tools/mtmd/clip-model.h
  • tools/mtmd/clip.cpp
  • tools/mtmd/models/mobilenetv5.cpp
  • gguf-py/gguf/tensor_mapping.py
  • src/models/gemma3n-iswa.cpp
  • tools/mtmd/mtmd.cpp
  • tools/mtmd/clip-impl.h
  • convert_hf_to_gguf.py
  • gguf-py/gguf/constants.py
**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: Always activate the Python virtual environment in .venv and use tools from that environment for Python development
Ensure Python code meets flake8 linting standards with max-line-length=125 as configured in .flake8
Ensure Python code passes pyright type checking as configured in pyrightconfig.json

Files:

  • gguf-py/gguf/tensor_mapping.py
  • convert_hf_to_gguf.py
  • gguf-py/gguf/constants.py
src/**/*.cpp

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Prioritize performance optimization in core library implementations in src/, as this is a performance-critical inference library

Files:

  • src/models/gemma3n-iswa.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.

Applied to files:

  • tools/mtmd/clip.cpp
  • tools/mtmd/mtmd.cpp
  • tools/mtmd/clip-impl.h
  • gguf-py/gguf/constants.py
🧬 Code graph analysis (6)
tools/mtmd/models/models.h (1)
tools/mtmd/models/mobilenetv5.cpp (12)
  • build (252-463)
  • build (252-252)
  • rms_norm_2d (5-20)
  • rms_norm_2d (5-5)
  • pad_same_2d (23-53)
  • pad_same_2d (23-23)
  • build_edge_residual (57-88)
  • build_edge_residual (57-57)
  • build_inverted_residual (90-151)
  • build_inverted_residual (90-90)
  • build_mobilenet_attn (154-250)
  • build_mobilenet_attn (154-154)
tools/mtmd/clip.cpp (3)
common/common.cpp (4)
  • model (1159-1161)
  • model (1159-1159)
  • string_format (399-412)
  • string_format (399-399)
src/llama-model.cpp (2)
  • get_tensor (7044-7054)
  • get_tensor (7044-7044)
tools/server/server-context.cpp (2)
  • params (607-853)
  • params (607-607)
tools/mtmd/models/mobilenetv5.cpp (2)
ggml/src/ggml.c (13)
  • ggml_permute (3700-3752)
  • ggml_rms_norm (3066-3071)
  • ggml_pad_ext (4983-5016)
  • ggml_conv_2d_direct (4702-4736)
  • ggml_gelu (2677-2681)
  • ggml_conv_2d_dw (4637-4658)
  • ggml_reshape_4d (3583-3601)
  • ggml_reshape_3d (3564-3581)
  • ggml_scale (3290-3295)
  • ggml_soft_max (3966-3970)
  • ggml_upscale (4928-4935)
  • ggml_concat (2517-2544)
  • ggml_pool_2d (4852-4878)
tools/mtmd/clip.cpp (9)
  • build_inp_raw (469-474)
  • build_inp_raw (469-469)
  • model (217-219)
  • model (935-1261)
  • model (935-935)
  • model (2038-2051)
  • model (2038-2038)
  • s (2446-2448)
  • s (2446-2446)
gguf-py/gguf/tensor_mapping.py (1)
gguf-py/gguf/constants.py (1)
  • MODEL_TENSOR (465-736)
tools/mtmd/mtmd.cpp (1)
tools/mtmd/clip.cpp (4)
  • ctx (2490-2593)
  • ctx (2490-2490)
  • clip_get_projector_type (3737-3739)
  • clip_get_projector_type (3737-3737)
convert_hf_to_gguf.py (2)
gguf-py/gguf/gguf_writer.py (2)
  • add_clip_projector_type (1076-1077)
  • add_vision_attention_layernorm_eps (1097-1098)
gguf-py/gguf/constants.py (2)
  • VisionProjectorType (3408-3429)
  • GGMLQuantizationType (3283-3315)
🪛 GitHub Actions: flake8 Lint
convert_hf_to_gguf.py

[error] 5974-5974: flake8: E302 expected 2 blank lines, found 1. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8


[error] 6010-6010: flake8: E202 whitespace before ']'. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8


[error] 6011-6011: flake8: E202 whitespace before ']'. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8


[error] 6045-6045: flake8: W504 line break after binary operator. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8

🪛 Ruff (0.14.10)
convert_hf_to_gguf.py

5969-5969: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: android-ndk-build (arm64-snapdragon, --preset arm64-android-snapdragon-release)
  • GitHub Check: windows-msys2 (CLANG64, clang-x86_64, Release)
  • GitHub Check: android-ndk-build (arm64-cpu, -D ANDROID_ABI=arm64-v8a -D ANDROID_PLATFORM=android-31 -D CMAKE_TO...
  • GitHub Check: windows-latest-cmake (llvm-arm64-opencl-adreno, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN...
  • GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
  • GitHub Check: build-linux-cross / debian-13-loongarch64-vulkan-cross
  • GitHub Check: windows-latest-cmake (llvm-arm64, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/ar...
  • GitHub Check: windows-latest-cmake (openblas-x64, x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x6...
  • GitHub Check: ubuntu-latest-cmake-cuda
  • GitHub Check: windows-latest-cmake (cpu-x64 (static), x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmak...
  • GitHub Check: windows-latest-cmake (vulkan-x64, x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD...
  • GitHub Check: android-build
  • GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
  • GitHub Check: ubuntu-latest-llguidance
  • GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: ubuntu-latest-cmake-rpc
  • GitHub Check: pyright type-check
🔇 Additional comments (20)
src/models/gemma3n-iswa.cpp (1)

262-306: Vision input path implementation looks correct, but verify commented-out scaling.

The implementation correctly broadcasts the padding token embedding (token ID 0) across all vision tokens, matching the PyTorch behavior described in comments. The approach of creating zeros via ggml_scale(..., 0.0f) is valid.

A few observations:

  1. Line 305: The sqrtf((float) n_embd_altup) scaling is commented out for vision inputs, while it's applied for text inputs (line 259). Please confirm this difference is intentional per the model specification.

  2. The inp->embd tensor is allocated and marked as input (lines 270-271) but isn't used in the subsequent computation - the zeros are created from per_layer_model_proj projection instead. This appears intentional as the embeddings will be fed separately, but worth confirming the graph input wiring is correct.

tools/mtmd/mtmd.cpp (2)

269-272: LGTM - GEMMA3N correctly inherits GEMMA3's image token handling.

The addition of PROJECTOR_TYPE_GEMMA3N alongside PROJECTOR_TYPE_GEMMA3 correctly sets up the same <start_of_image> and <end_of_image> tokens for the Gemma3n vision path.


861-866: LGTM - Non-causal decode handling extended to GEMMA3N.

The logic correctly includes PROJECTOR_TYPE_GEMMA3N in the non-causal decoding path, maintaining parity with GEMMA3.

Minor observation: Consider extracting the repeated clip_get_projector_type(ctx->ctx_v) call to a local variable for readability, though this is optional given the function is lightweight.

tools/mtmd/CMakeLists.txt (1)

30-30: LGTM - MobileNetV5 source file added to build.

The new models/mobilenetv5.cpp is correctly included in the mtmd library sources. This enables the MobileNetV5-based graph construction for Gemma3n vision support.

gguf-py/gguf/tensor_mapping.py (1)

122-142: LGTM - New Gemma3n vision tensor mappings added.

The new tensor mappings for MobileNetV5-based Gemma3n vision support are correctly structured and follow the existing pattern. The mappings align with the MODEL_TENSOR enums defined in constants.py.

Note: The comments label these as "gemma3n", which is accurate for V_MM_EMBEDDING, V_MM_HARD_EMB_NORM, and V_MM_POST_PROJ_NORM. For V_MM_INP_PROJ and V_MM_SOFT_EMB_NORM, the constants.py comments indicate "gemma3" but this appears to be reusing existing tensor types with new HuggingFace tensor name mappings for gemma3n, which is a valid pattern.

tools/mtmd/clip-model.h (1)

175-212: Well-structured MobileNetV5 block definition.

The mobilenetv5_block struct is comprehensive and well-organized, covering all necessary components for the Gemma3n vision encoder:

  • Stage 0 (Edge Residual) and Stage 1+ (Universal Inverted Residual) convolutions with batch normalization
  • Multi-Query Attention (MQA) components with optional downsampling
  • Layer scale and block normalization

The struct follows the existing naming conventions and patterns in the file.

gguf-py/gguf/constants.py (3)

670-672: LGTM: New gemma3n tensor types properly defined.

The three new tensor types (V_MM_EMBEDDING, V_MM_HARD_EMB_NORM, V_MM_POST_PROJ_NORM) are:

  • Properly prefixed with V_MM_ following the existing naming convention
  • Clearly documented as gemma3n-specific
  • Correctly placed within the MODEL_TENSOR enum

1065-1067: Correct tensor mappings and MMPROJ integration.

The tensor name mappings and MMPROJ architecture additions are properly implemented:

  • String names follow the mm.* convention used for multimodal tensors
  • Tensors are correctly added to MODEL_TENSORS[MODEL_ARCH.MMPROJ] list
  • Consistent with existing patterns in the file

Also applies to: 1166-1168


1947-1981: Complete GEMMA3N architecture tensor list.

The MODEL_ARCH.GEMMA3N tensor list is comprehensive and well-organized:

  • Includes all standard Gemma3 tensors (token embedding, attention, FFN)
  • Properly extends with gemma3n-specific components:
    • Per-layer tensors (PER_LAYER_TOKEN_EMBD, PER_LAYER_MODEL_PROJ, etc.)
    • Altup tensors for alternative upsampling/routing
    • Laurel tensors for layer-wise processing
  • Comments clearly indicate the purpose of specialized tensor groups

This ensures proper serialization and deserialization of Gemma3n models.

tools/mtmd/clip.cpp (4)

3128-3133: Potential issue with n_patches calculation for GEMMA3N.

The calculation n_patches = image_size / patch_size computes a single dimension (e.g., 16 for 256/16), but n_patches typically represents the total number of patches (i.e., (image_size / patch_size)^2). The comment says "MSFA adapter always outputs fixed 16x16 resolution", which suggests the result should be 256, not 16.

If the output is indeed 16x16, the calculation should be:

-            n_patches = ctx->model.hparams.image_size / ctx->model.hparams.patch_size;
+            int n_per_side = ctx->model.hparams.image_size / ctx->model.hparams.patch_size;
+            n_patches = n_per_side * n_per_side;

If this is intentional (e.g., the MSFA outputs a single row of 16 tokens), please add a clarifying comment.
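The distinction the comment raises, in numbers (illustrative values only; the actual hparams come from the GGUF metadata):

```python
image_size, patch_size = 256, 16

n_per_side = image_size // patch_size   # one spatial dimension of the grid
n_patches = n_per_side * n_per_side     # total token count for a square grid

# 256/16 yields 16 patches per side, but 256 total tokens for a 16x16 grid.
assert n_per_side == 16
assert n_patches == 256
```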


791-794: LGTM: GEMMA3N graph builder integration.

The new case for PROJECTOR_TYPE_GEMMA3N correctly uses the dedicated clip_graph_mobilenetv5 builder, consistent with how other projector types are handled.


1148-1155: LGTM: GEMMA3N hparams configuration.

The default n_merge = 1 with an optional override via KEY_PROJ_SCALE_FACTOR follows the established pattern for other projector types.


2859-2867: LGTM: GEMMA3N preprocessing.

The preprocessing correctly resizes the image to a square using bilinear interpolation without padding (false parameter), then normalizes using the configured mean/std values, matching the expected MobileNetV5 input format.

tools/mtmd/models/mobilenetv5.cpp (4)

5-20: LGTM: RMS Norm 2D helper implementation.

The rms_norm_2d helper correctly permutes the tensor to normalize over channels for each spatial position, applies the standard RMSNorm operation, and optionally applies the learned weight before permuting back. The use of ggml_cont after permute ensures the tensor is contiguous for subsequent operations.
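As a plain-Python reference for what the helper computes at each spatial position (assuming the standard RMSNorm formula `x / sqrt(mean(x^2) + eps)` with an optional learned weight; the real code expresses this through ggml permutes):

```python
import math

def rms_norm(channels, weight=None, eps=1e-6):
    # Normalize one spatial position's channel vector by its RMS.
    rms = math.sqrt(sum(c * c for c in channels) / len(channels) + eps)
    out = [c / rms for c in channels]
    if weight is not None:  # optional learned per-channel scale
        out = [o * w for o, w in zip(out, weight)]
    return out

# After normalization the channel vector has (approximately) unit RMS.
y = rms_norm([1.0, 2.0, 3.0])
assert abs(math.sqrt(sum(v * v for v in y) / len(y)) - 1.0) < 1e-3
```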


22-53: LGTM: SAME padding implementation.

The pad_same_2d helper correctly implements TensorFlow/PyTorch-style asymmetric SAME padding. The ceiling division for output size and the asymmetric split of padding (bottom/right gets the extra pixel) matches the expected behavior.
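The asymmetric SAME-padding arithmetic described above, as a standalone one-dimensional sketch (TensorFlow-style; the bottom/right side receives the extra pixel when the total padding is odd):

```python
import math

def same_pad_1d(in_size, kernel, stride, dilation=1):
    # Output size under SAME padding: ceil(in / stride).
    out_size = math.ceil(in_size / stride)
    eff_kernel = (kernel - 1) * dilation + 1
    total = max((out_size - 1) * stride + eff_kernel - in_size, 0)
    before = total // 2        # top/left
    after = total - before     # bottom/right gets the extra pixel
    return before, after

# 224 input, 3x3 kernel, stride 2: pad 0 on top/left, 1 on bottom/right.
assert same_pad_1d(224, 3, 2) == (0, 1)
```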


153-250: LGTM: Attention block implementation.

The build_mobilenet_attn function correctly implements multi-query attention with:

  • Optional input normalization
  • Downsampled K/V paths using depthwise convolutions
  • Proper Q/K/V reshaping and permutation for attention
  • Scaled dot-product attention with softmax
  • Output projection with optional layer scaling and residual connection
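For readers unfamiliar with multi-query attention: its defining trait is that all query heads attend over a single shared K/V head. A toy sketch under that assumption (illustrative names, not the ggml implementation):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def mqa(queries, keys, values):
    # queries: one vector per head; keys/values: one shared K/V head.
    d = len(keys[0])
    outs = []
    for q in queries:  # each query head attends over the same keys
        scores = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                          for k in keys])
        outs.append([sum(w * v[i] for w, v in zip(scores, values))
                     for i in range(len(values[0]))])
    return outs

out = mqa(queries=[[1.0, 0.0], [0.0, 1.0]],
          keys=[[1.0, 0.0], [0.0, 1.0]],
          values=[[1.0], [0.0]])
assert len(out) == 2  # one output per query head, shared K/V
```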

403-463: LGTM: Gemma3n multimodal projection.

The embedding/projection logic correctly:

  1. Permutes and flattens spatial dimensions to sequence format
  2. Applies feature scaling by sqrt(hidden_size)
  3. Applies soft embedding normalization with optional learned weight
  4. Projects to text hidden size via linear layer
  5. Applies post-projection RMSNorm

This matches the expected Gemma3n vision embedder architecture.
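A shape-level sketch of the scaling and projection steps (hypothetical dimensions and names; the two RMSNorm steps are omitted for brevity, and the real graph uses the model's learned weights):

```python
import math

def gemma3n_embed(features, proj):
    # features: [n_tokens][vision_hidden]; proj: [vision_hidden][text_hidden]
    hidden = len(features[0])
    # Step 2: scale features by sqrt(vision hidden size).
    scaled = [[f * math.sqrt(hidden) for f in tok] for tok in features]
    # Step 4: linear projection to the text hidden size.
    return [[sum(t[i] * proj[i][j] for i in range(hidden))
             for j in range(len(proj[0]))] for t in scaled]

# Identity projection: output is the input scaled by sqrt(hidden) = sqrt(2).
toks = gemma3n_embed([[1.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
assert abs(toks[0][0] - math.sqrt(2)) < 1e-12
```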

tools/mtmd/clip-impl.h (2)

156-195: LGTM: MobileNetV5 tensor name macros.

The tensor name macros are well-organized by component (stem, edge residual, inverted residual, attention, MSFA) and follow the established naming conventions. The %d.%d format for stage/block indexing aligns with the dynamic loading logic in clip.cpp.


214-214: LGTM: GEMMA3N projector type registration.

The new PROJECTOR_TYPE_GEMMA3N enum value and its string mapping "gemma3n" are correctly placed and follow the existing pattern.

Also applies to: 245-245

convert_hf_to_gguf.py (1)

522-527: Robust handling of empty tensor_map.mapping for block_count=0 looks good

Using a guarded branch for max_name_len avoids ValueError when gguf.get_tensor_name_map(..., block_count=0) produces an empty mapping (e.g., MobileNetV5-based encoders) and only changes log formatting width. No further changes needed here.
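The guarded-branch pattern being praised can also be expressed with `max(..., default=...)` (a generic sketch, not the exact converter code):

```python
def max_name_len(mapping):
    # An empty mapping (e.g. block_count=0) must not raise ValueError.
    return max((len(name) for name in mapping), default=0)

assert max_name_len(["token_embd.weight", "mm.input_projection.weight"]) == 26
assert max_name_len([]) == 0  # no tensors mapped: fall back to zero width
```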

RESAMPLER = auto()
GLM_EDGE = auto()
MERGER = auto()
GEMMA3N = auto()

⚠️ Potential issue | 🟡 Minor

Add GEMMA3N mapping to VISION_PROJECTOR_TYPE_NAMES.

The GEMMA3N entry was added to VISION_PROJECTOR_TYPE enum but is missing from the VISION_PROJECTOR_TYPE_NAMES dictionary at lines 850-858. This mapping is used to convert the enum value to its string representation.

🔎 Proposed fix

Add the mapping to the VISION_PROJECTOR_TYPE_NAMES dictionary:

 VISION_PROJECTOR_TYPE_NAMES: dict[VISION_PROJECTOR_TYPE, str] = {
     VISION_PROJECTOR_TYPE.MLP:       "mlp",
     VISION_PROJECTOR_TYPE.LDP:       "ldp",
     VISION_PROJECTOR_TYPE.LDPV2:     "ldpv2",
     VISION_PROJECTOR_TYPE.RESAMPLER: "resampler",
     VISION_PROJECTOR_TYPE.GLM_EDGE:  "adapter",
     VISION_PROJECTOR_TYPE.MERGER:    "qwen2vl_merger",
     VISION_PROJECTOR_TYPE.GEMMA3:    "gemma3",
+    VISION_PROJECTOR_TYPE.GEMMA3N:   "gemma3n",
+    VISION_PROJECTOR_TYPE.QWEN3VL:   "qwen3vl_merger",
+    VISION_PROJECTOR_TYPE.COGVLM:    "cogvlm",
 }

Note: QWEN3VL and COGVLM are also missing from this dictionary.

🤖 Prompt for AI Agents
In gguf-py/gguf/constants.py around line 459 (GEMMA3N added to
VISION_PROJECTOR_TYPE) and the VISION_PROJECTOR_TYPE_NAMES dictionary at lines
~850-858, the GEMMA3N enum value is not mapped to its string representation; add
an entry mapping VISION_PROJECTOR_TYPE.GEMMA3N to "gemma3n" in that dictionary.
Also add the missing mappings for QWEN3VL and COGVLM (VISION_PROJECTOR_TYPE.QWEN3VL
-> "qwen3vl_merger" and VISION_PROJECTOR_TYPE.COGVLM -> "cogvlm") so all enum
members have corresponding string names. Ensure formatting/commas match the
surrounding dictionary entries.

@ngxson

This comment was marked as resolved.

@coderabbitai

coderabbitai bot commented Dec 22, 2025

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

@ngxson, I'll save these preferences for mirrored PRs and apply them to this review. Let me fetch the upstream PR details and conduct a thorough review.


🏁 Script executed:

#!/bin/bash
# Fetch the upstream PR details
gh pr view 18256 --repo ggml-org/llama.cpp --json title,body,author

Length of output: 3458


🏁 Script executed:

#!/bin/bash
# Search for suspicious AI-generated code comment patterns
echo "=== Searching for suspicious AI-generated comments ==="
rg -i "CRITICAL:" --type py --type cpp --type c --type h || echo "No CRITICAL comments found"
rg -i "you are correct" --type py --type cpp --type c --type h || echo "No 'you are correct' comments found"
rg -i "TODO.*AI" --type py --type cpp --type c --type h || echo "No AI TODOs found"
rg -i "FIXME.*AI" --type py --type cpp --type c --type h || echo "No AI FIXMEs found"
rg -i "placeholder|stub|temporary" --type py --type cpp --type c --type h -C1 | head -50 || echo "No placeholder/stub comments found"

echo ""
echo "=== Checking for incomplete/commented code ==="
rg "//.*TODO|//.*FIXME|//.*XXX|//.*HACK" --type cpp --type c --type h -C1 | head -30 || echo "No flagged comments found"

Length of output: 10158


🏁 Script executed:

#!/bin/bash
# Review key implementation files for obvious issues

echo "=== Checking convert_hf_to_gguf.py Gemma3nVisionModel ==="
# Look at the Gemma3nVisionModel implementation
ast-grep --pattern 'class Gemma3nVisionModel' | head -100

echo ""
echo "=== Checking for potential issues in mobilenetv5.cpp ==="
# Check for common mistakes: division by zero, uninitialized variables, missing bounds checks
rg -n "/ (0|patch_size|stride)" tools/mtmd/models/mobilenetv5.cpp || echo "No obvious division issues"
rg -n "GGML_ASSERT|assert" tools/mtmd/models/mobilenetv5.cpp | head -20

echo ""
echo "=== Check tensor dimension handling ==="
# Look for potential dimension mismatches
rg -n "ne\[0\]|ne\[1\]|ne\[2\]|ne\[3\]" tools/mtmd/models/mobilenetv5.cpp | head -30

echo ""
echo "=== Check for magic numbers ==="
rg -n "256|512|1024|2048" tools/mtmd/models/mobilenetv5.cpp | head -20

Length of output: 8976


🏁 Script executed:

#!/bin/bash
# Check for more specific issues

echo "=== Check patch_size calculation in convert_hf_to_gguf.py ==="
rg -n "patch_size = image_size //" convert_hf_to_gguf.py -A2 -B2

echo ""
echo "=== Check sqrt usage in patch_size calculation ==="
rg -n "sqrt|math.sqrt" convert_hf_to_gguf.py | grep -i patch || echo "No sqrt in patch calculation"

echo ""
echo "=== Check division operations in mobilenetv5.cpp ==="
rg -n "/ \w+;" tools/mtmd/models/mobilenetv5.cpp | head -20

echo ""
echo "=== Check for potential null pointer dereferences ==="
rg -n "block\.\w+->ne\[" tools/mtmd/models/mobilenetv5.cpp | head -20

echo ""
echo "=== Check gemma3n-iswa.cpp changes ==="
cat src/models/gemma3n-iswa.cpp | grep -A30 "inp->embd" | head -40

Length of output: 3571


🏁 Script executed:

#!/bin/bash
# Check the specific patch_size calculation logic

echo "=== Check image_seq_length and patch_size calculation logic ==="
rg -n "image_seq_length|patch_size" convert_hf_to_gguf.py -A1 -B1 | grep -A3 -B3 "6008\|6009\|6010\|6011"

echo ""
echo "=== Check if sqrt is needed for patch calculation ==="
echo "256 tokens = 16x16 grid, so sqrt(256) = 16"
echo "If image_size = 768 and we want 16x16 patches, patch_size should be 768/16 = 48"
echo "But code does: patch_size = 768 // 256 = 3"
echo ""

echo "=== Verify the calculation in context ==="
rg -n "image_seq_length.*256" convert_hf_to_gguf.py -B5 -A10

echo ""
echo "=== Check clip.cpp for n_patches calculation ==="
rg -n "n_patches.*GEMMA3N" tools/mtmd/clip.cpp -B5 -A5

Length of output: 2673


🏁 Script executed:

#!/bin/bash
# Check n_patches calculation in clip.cpp

echo "=== Find n_patches calculation for GEMMA3N ==="
rg -n "GEMMA3N" tools/mtmd/clip.cpp -B2 -A2 | grep -A5 -B5 "n_patches"

echo ""
echo "=== Check clip_n_output_tokens function ==="
rg -n "clip_n_output_tokens|n_output_tokens" tools/mtmd/clip.cpp -A10 | head -40

echo ""
echo "=== Compare with Gemma3 (non-N) implementation ==="
rg -n "case PROJECTOR_TYPE_GEMMA3:" tools/mtmd/clip.cpp -A5 | head -20

Length of output: 3033


🏁 Script executed:

#!/bin/bash
# Check the full n_patches calculation for GEMMA3N

echo "=== Check complete GEMMA3N n_patches logic ==="
rg -n "case PROJECTOR_TYPE_GEMMA3N:" tools/mtmd/clip.cpp -A10

echo ""
echo "=== Check if patch_size is used correctly ==="
rg -n "patch_size" tools/mtmd/clip.cpp | grep -i gemma3n -A2 -B2

echo ""
echo "=== Look for the formula: image_size / patch_size ==="
rg -n "image_size.*patch_size|patch_size.*image_size" tools/mtmd/clip.cpp

Length of output: 5510


🏁 Script executed:

#!/bin/bash
# Deep dive into potential issues

echo "=== Check if mobilenet block pointers can be null ==="
rg -n "block\.(dw_start_w|dw_mid_w|layer_scale_w)" tools/mtmd/models/mobilenetv5.cpp -B2 -A2

echo ""
echo "=== Check build_inverted_residual usage ==="
rg -n "build_inverted_residual|build_edge_residual|build_mobilenet_attn" tools/mtmd/models/mobilenetv5.cpp | head -30

echo ""
echo "=== Verify model loading sets these pointers ==="
rg -n "dw_start_w|dw_mid_w|layer_scale" tools/mtmd/clip.cpp -A2 | head -50

Length of output: 4258


🏁 Script executed:

#!/bin/bash
# Final checks for issues

echo "=== Check for hardcoded values that should be configurable ==="
rg -n "2048|8.*head|4\s*\*.*hidden" convert_hf_to_gguf.py | grep -A2 -B2 "5990\|5994"

echo ""
echo "=== Check the Gemma3NModel vocab handling for potential issues ==="
rg -n "class Gemma3NModel" convert_hf_to_gguf.py -A100 | grep -A10 -B5 "vocab_size"

echo ""
echo "=== Check if there are any unused imports or variables ==="
rg -n "import.*mobilenet|from.*mobilenet" convert_hf_to_gguf.py

echo ""
echo "=== Check for potential integer overflow in calculations ==="
rg -n "int64_t|int32_t" tools/mtmd/models/mobilenetv5.cpp | head -20

Length of output: 4263


@ngxson, I've completed a thorough review of this mirrored PR from simrnsingh. Here are my findings:

✅ Good News

No suspicious AI-generated code comments were found (no "CRITICAL:", "You are correct here...", etc.). The contributor properly disclosed Claude Code usage for exploration and boilerplates.

🐛 Issues Found

1. Critical: Incorrect patch_size calculation (convert_hf_to_gguf.py, line 6011)

The calculation patch_size = image_size // image_seq_length is semantically wrong:

```python
# Current code:
image_seq_length = 256  # 16x16 grid
image_size = 768
patch_size = image_size // image_seq_length  # = 768 // 256 = 3
```

Problem: A patch_size of 3 pixels would mean dividing a 768px image into 256×256 tiny patches, not 16×16 patches as the comment states.

Should be:

```python
n_per_side = int(image_seq_length ** 0.5)  # sqrt(256) = 16
patch_size = image_size // n_per_side  # = 768 // 16 = 48
```

This matches how Gemma3 (non-N) calculates it at line 5927. The C++ code at clip.cpp:3132 expects n_patches = image_size / patch_size, so the current buggy calculation accidentally produces the correct output (256 tokens) but with wrong semantics.
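The coincidence can be checked numerically; a minimal sketch using the default values quoted above (768px image, 256 tokens):

```python
import math

image_size = 768
image_seq_length = 256  # total token count, intended as a 16x16 grid

# buggy: treats the total token count as patches-per-side
buggy_patch = image_size // image_seq_length                 # 3
# correct: derive patches-per-side from sqrt(total tokens)
correct_patch = image_size // math.isqrt(image_seq_length)   # 768 // 16 = 48

# The C++ side computes n_patches = image_size / patch_size, so the buggy
# value happens to round-trip to 256 tokens despite the wrong semantics:
print(image_size // buggy_patch)           # linear formula with buggy value
# With the correct patch size there are 16 patches per side, 16*16 = 256 total:
print((image_size // correct_patch) ** 2)
```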

2. Hardcoded magic values (convert_hf_to_gguf.py, lines 5990-5995)

```python
def find_vparam(self, keys: list[str], optional: bool = False) -> Any:
    if "intermediate_size" in keys:
        hidden_size = self.hparams_vision.get("hidden_size", 2048)  # Hardcoded default
        return hidden_size * 4  # Hardcoded expansion factor

    if "num_attention_heads" in keys or "num_heads" in keys:
        return 8  # Hardcoded MQA heads
```

These should ideally be read from the vision config if available, rather than hardcoded. While they may be correct for the current Gemma3n model, they're not future-proof.

3. Potential unsafe tensor dimension access (mobilenetv5.cpp, lines 138, 243)

The code accesses block.layer_scale_w->ne[0] inside blocks where layer_scale_w is checked for null, but doesn't verify it's non-null before dereferencing:

```cpp
if (block.layer_scale_w) {
    ggml_tensor * scale_w_reshaped = ggml_reshape_4d(ctx0, block.layer_scale_w,
        1, 1, block.layer_scale_w->ne[0], 1);  // Safe: inside if-block
```

This is actually safe due to the if-check, but similar patterns exist that should be verified.

4. Hardcoded assumption (mobilenetv5.cpp, line 407)

int C = cur->ne[2]; // Should be 2048

This comment reveals a hardcoded assumption about channel count that might not hold for all model variants.

5. Minor: Commented-out code (mobilenetv5.cpp, line 331)

int scale_w = high_res_w / feat_w;
// int scale_h = high_res_h / feat_h;  // Unused

The contributor left commented-out code that should probably be removed.

📝 Recommendations

The patch_size calculation bug should be fixed before merging. The other issues are less critical but should be addressed for code quality.


✏️ Learnings added
Learnt from: ngxson
URL: https://github.com/ngxson/llama.cpp/pull/64

Timestamp: 2025-12-22T23:32:42.587Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".

Learnt from: ngxson
URL: https://github.com/ngxson/llama.cpp/pull/64

Timestamp: 2025-12-22T23:32:42.587Z
Learning: When reviewing mirrored PRs for ngxson/llama.cpp, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.



🧠 Learnings used
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.

…cit mapping: Custom map for double indexed blocks and tensor_mapping.py for rest

- convert_hf_to_gguf.py: Unsqueeze Stem Bias and Layer scale tensors to correct shape while converting to gguf
- mobilenetv5.cpp: Remove explicit reshaping of Stem Bias and Layer scale which are now handled while converting to gguf, replace fprintf with LOG_*
- clip.cpp: Remove unused embedding and hard_emb_norm tensor loading
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (2)
gguf-py/gguf/constants.py (1)

857-865: [Duplicate] Missing VISION_PROJECTOR_TYPE_NAMES mappings for GEMMA3N, QWEN3VL, and COGVLM.

This issue was already flagged in previous reviews. The VISION_PROJECTOR_TYPE_NAMES dictionary is missing mappings for GEMMA3N (line 459), QWEN3VL (line 461), and COGVLM (line 462) enum entries. This will cause a KeyError if code attempts to convert these enum values to their string representations using this dictionary.

Based on previous review, the fix should add:

```python
VISION_PROJECTOR_TYPE.GEMMA3N:   "gemma3n",
VISION_PROJECTOR_TYPE.QWEN3VL:   "qwen3vl_merger",
VISION_PROJECTOR_TYPE.COGVLM:    "cogvlm",
```
convert_hf_to_gguf.py (1)

5966-6091: Fix Gemma3n patch_size computation and note minor cleanups

There are a few points here:

  1. Patch size computation for Gemma3n MobileNetV5 is still wrong
    In Gemma3nVisionModel.set_gguf_parameters:

```python
image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
image_size = self.hparams_vision["image_size"]
self.hparams_vision["patch_size"] = image_size // image_seq_length
```

    With the default Gemma3n setup (768×768, image_seq_length = 256), this yields patch_size = 3, which implies a 256×256 grid and 65,536 patches, while the comment explicitly states “256 tokens = 16×16”. Patch size should be derived from tokens per side, not total tokens.

    Recommended fix (same issue as previously flagged in earlier review; applying it here for the new MobileNetV5 path as well):

    Proposed fix for patch_size in Gemma3nVisionModel

```diff
-        # Image sequence length (256 tokens = 16x16 for Gemma3n)
-        image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
-        image_size = self.hparams_vision["image_size"]
-        self.hparams_vision["patch_size"] = image_size // image_seq_length
+        # Image sequence length (e.g. 256 tokens = 16x16 grid for Gemma3n)
+        image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
+        image_size = self.hparams_vision["image_size"]
+        # Derive patch size from patches-per-side, not total token count
+        n_per_side = int(image_seq_length ** 0.5)
+        if n_per_side * n_per_side != image_seq_length:
+            raise ValueError(
+                f"image_seq_length={image_seq_length} is not a perfect square; "
+                "cannot infer square patch grid for Gemma3n vision encoder"
+            )
+        self.hparams_vision["patch_size"] = image_size // n_per_side
```

This matches the intended 16×16 grid for 256 tokens and keeps `patch_size` consistent with how other vision encoders in this file derive it.


2. **Vocab / embedding handling for Gemma3n text model is a solid improvement**  
- `Gemma3NModel.set_vocab` temporarily removes `vocab_size_per_layer_input` so `_create_vocab_sentencepiece()` uses the full `vocab_size` (including the vision/audio special tokens) and then restores it.  
- The new `modify_tensors` branch pads `embed_tokens.weight` and per-layer embeddings up to `vocab_size`, instead of truncating to `vocab_size_per_layer_input`, which is required for multimodal Gemma3n.

This avoids dropping the 262144–262399 special IDs and aligns the text embeddings with the tokenizer. Behavior looks correct and non‑regressive for pure‑text use.
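The padding behavior can be sketched as follows (illustrative only: the shapes and the 262400 target are the defaults mentioned above, and numpy stands in for the torch tensors the converter actually uses):

```python
import numpy as np

def pad_embed_rows(embed: np.ndarray, vocab_size: int) -> np.ndarray:
    """Zero-pad the first (vocab) dimension up to vocab_size."""
    current = embed.shape[0]
    if current > vocab_size:
        raise ValueError(f"embed rows ({current}) exceed vocab_size ({vocab_size})")
    if current == vocab_size:
        return embed
    pad = np.zeros((vocab_size - current, embed.shape[1]), dtype=embed.dtype)
    return np.concatenate([embed, pad], axis=0)

emb = np.ones((262144, 4), dtype=np.float32)   # text-only rows
padded = pad_embed_rows(emb, 262400)           # room for the 256 vision/audio IDs
print(padded.shape)
```

The padded rows stay zero; the real features for those IDs come from the vision/audio towers at runtime.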

3. **Optional typing/ruff cleanup for class attributes (RUF012)**  
In `Gemma3nVisionModel`:

```python
n_block_keys = []
block_tensor_mapping = { ... }
```

These are effectively class‑level constants. To satisfy RUF012 and make the intent explicit to type checkers, consider:

Optional ClassVar annotation tweak

```diff
-from typing import TYPE_CHECKING, Any, Callable, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast
+from typing import TYPE_CHECKING, Any, Callable, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast, ClassVar
...
-    n_block_keys = []
+    n_block_keys: ClassVar[list[str]] = []
...
-    block_tensor_mapping = {
+    block_tensor_mapping: ClassVar[dict[str, str]] = {
       ...
     }
```

This is purely a typing / tooling nicety; behavior is unchanged.

Overall, once the patch_size formula is corrected, the Gemma3n MobileNetV5 path and the Gemma3n text vocab/embedding logic look structurally sound for the mirrored upstream changes.

Also applies to: 6115-6191

🧹 Nitpick comments (4)
gguf-py/gguf/tensor_mapping.py (1)

123-158: Gemma3n vision tensor mappings look consistent

The new V_MM_* and V_ENC_* entries align with the Gemma3n/MobileNetV5 tensor paths used in convert_hf_to_gguf.py and constants, so TensorNameMap will correctly resolve model.embed_vision.* and model.vision_tower.timm_model.* for Gemma3n.

Note that V_MM_INP_PROJ / V_MM_SOFT_EMB_NORM now have both generic multi_modal_projector.* and Gemma3n-specific model.embed_vision.* synonyms; that’s fine, but if more variants start using these tensors it may be worth documenting this dual use to avoid confusion later.

tools/mtmd/clip.cpp (1)

1528-1620: Verify mobilenetv5_block default initialization and stage boundary assumptions

The GEMMA3N tensor-loading branch dynamically discovers MobileNetV5 blocks per stage and accumulates them in model.mobilenet_blocks, with stage ends recorded in model.mobilenet_stage_ends. This is a good direction, but a couple of edge conditions are worth double‑checking:

  • mobilenetv5_block block; relies on the struct’s members being safely default‑initialized (e.g., ggml_tensor * foo = nullptr; or an explicit ctor). If any members lack default member initializers, they will contain indeterminate values in paths where that sub‑block type is absent (e.g., pure attention blocks vs. pure UIR blocks). Please confirm mobilenetv5_block is defined with default member initializers or add = {} here to value‑initialize it.

  • The for (int blk_idx = 0; ; ++blk_idx) loop for each stage stops at the first blk_idx that yields no tensors. This assumes that all blocks in a stage are densely indexed from 0..N-1 with no gaps. If future variants ever introduce gaps, discovery would silently truncate later blocks. That’s probably fine for current Gemma3n, but worth keeping in mind if more MobileNetV5 variants are added.

  • mobilenet_stage_ends.push_back(model.mobilenet_blocks.size() - 1) stores inclusive global indices. Ensure mobilenetv5.cpp interprets these indices the same way (inclusive vs exclusive) when iterating.

If you confirm mobilenetv5_block is defined with safe defaults and that stage indices are inclusive by design, this loader logic looks solid.

tools/mtmd/models/mobilenetv5.cpp (2)

372-372: Consider making target output resolution configurable.

The hardcoded target_out_res = 16 assumes a fixed output resolution for the MSFA downsampling stage. If the model architecture varies or if different Gemma3n variants use different resolutions, this value should be read from the model config rather than hardcoded.

💡 Suggested approach

Add a field to the model config for MSFA output resolution and read it during model loading, falling back to 16 if not present:

```cpp
// Example usage (adjust based on actual model structure):
const int target_out_res = model.msfa_output_res ? model.msfa_output_res : 16;
```

Alternatively, if 16 is the fixed resolution for all Gemma3n models, consider adding a comment explaining this architectural constraint.

Based on learnings, this is a mirrored PR—please verify with the upstream contributor whether this value should remain fixed or be made configurable.


420-420: Consider extracting RMS norm epsilon as a named constant.

The hardcoded epsilon value 1e-6f appears twice (lines 420, 442) for Gemma3n RMS normalization. Extracting it as a named constant improves maintainability and makes the architectural choice explicit.

💡 Suggested refactor

Near the top of the file or in the class definition:

```cpp
static constexpr float GEMMA3N_RMS_NORM_EPS = 1e-6f;
```

Then use it consistently:

```diff
-    const float eps = 1e-6f; // Gemma3n uses 1e-6
-    cur = ggml_rms_norm(ctx0, cur, eps);
+    cur = ggml_rms_norm(ctx0, cur, GEMMA3N_RMS_NORM_EPS);
```

Also applies to: 442-442

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 86618c7 and e2835e9.

📒 Files selected for processing (5)
  • convert_hf_to_gguf.py
  • gguf-py/gguf/constants.py
  • gguf-py/gguf/tensor_mapping.py
  • tools/mtmd/clip.cpp
  • tools/mtmd/models/mobilenetv5.cpp
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{c,cpp,h,hpp}

📄 CodeRabbit inference engine (AGENTS.md)

Always format C++ code before committing using git clang-format with the project's .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer alignment void * ptr, reference alignment int & ref)

Files:

  • tools/mtmd/models/mobilenetv5.cpp
  • tools/mtmd/clip.cpp
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Always use the Python environment in .venv and run Python tools from that environment
Apply Python linting rules configured in .flake8 (max-line-length=125, excludes examples/tools) and type checking with pyright

Files:

  • gguf-py/gguf/constants.py
  • gguf-py/gguf/tensor_mapping.py
  • convert_hf_to_gguf.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".

When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {ggml/**,src/**/backend*.{c,cpp,h,hpp},tools/server/**} : Backend-related changes (CPU, CUDA, Metal, Vulkan, etc.) and modifications to `tools/server` require AI usage disclosure if significant code is generated

Applied to files:

  • tools/mtmd/models/mobilenetv5.cpp
  • gguf-py/gguf/constants.py
  • tools/mtmd/clip.cpp
  • convert_hf_to_gguf.py
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {include/llama.h,ggml/**/*.h,mtmd/**/*.h} : Public API modifications in `include/llama.h`, `ggml.h`, or `mtmd.h` require AI usage disclosure if significant code is generated

Applied to files:

  • tools/mtmd/models/mobilenetv5.cpp
  • gguf-py/gguf/constants.py
  • tools/mtmd/clip.cpp
  • convert_hf_to_gguf.py
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.

Applied to files:

  • gguf-py/gguf/constants.py
  • tools/mtmd/clip.cpp
🧬 Code graph analysis (4)
tools/mtmd/models/mobilenetv5.cpp (2)
ggml/src/ggml.c (17)
  • ggml_permute (3700-3752)
  • ggml_cont (3461-3465)
  • ggml_rms_norm (3066-3071)
  • ggml_mul (2170-2175)
  • ggml_pad_ext (4983-5016)
  • ggml_conv_2d_direct (4702-4736)
  • ggml_gelu (2677-2681)
  • ggml_add (1969-1974)
  • ggml_reshape_3d (3564-3581)
  • ggml_reshape_4d (3583-3601)
  • ggml_mul_mat (3174-3189)
  • ggml_scale (3290-3295)
  • ggml_soft_max (3966-3970)
  • ggml_upscale (4928-4935)
  • ggml_concat (2517-2544)
  • ggml_pool_2d (4852-4878)
  • ggml_build_forward_expand (6793-6795)
tools/mtmd/clip.cpp (9)
  • build_inp_raw (469-474)
  • build_inp_raw (469-469)
  • model (217-219)
  • model (935-1261)
  • model (935-935)
  • model (2035-2048)
  • model (2035-2035)
  • s (2443-2445)
  • s (2443-2443)
gguf-py/gguf/tensor_mapping.py (1)
gguf-py/gguf/constants.py (1)
  • MODEL_TENSOR (465-743)
tools/mtmd/clip.cpp (2)
common/common.cpp (4)
  • model (1159-1161)
  • model (1159-1159)
  • string_format (399-412)
  • string_format (399-399)
src/llama-model.cpp (2)
  • get_tensor (7044-7054)
  • get_tensor (7044-7044)
convert_hf_to_gguf.py (1)
gguf-py/gguf/constants.py (1)
  • VisionProjectorType (3429-3450)
🪛 Ruff (0.14.10)
convert_hf_to_gguf.py

5969-5969: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


5972-5995: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


6067-6067: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (6)
tools/mtmd/clip.cpp (1)

783-795: GEMMA3N graph dispatch is consistent with new MobilenetV5 graph

Routing PROJECTOR_TYPE_GEMMA3N to clip_graph_mobilenetv5 mirrors how other projector types choose their specialized graphs and keeps mobilenet-specific logic isolated from the generic ViT path. No issues here.

convert_hf_to_gguf.py (1)

522-527: prepare_tensors: safe handling of empty tensor_map looks good

Guarding self.tensor_map.mapping before calling max() fixes the crash when block_count == 0 (e.g., MobileNetV5) while preserving the logging behavior with a reasonable fallback width.

No issues from a correctness or performance perspective.

tools/mtmd/models/mobilenetv5.cpp (4)

5-20: LGTM: RMS normalization helper is correctly implemented.

The permutation logic correctly transforms [W,H,C,B] → [C,H,W,B] for channel-wise normalization, applies RMS norm, optionally multiplies by weight, and restores the original layout. Null check for weight is properly guarded.
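The permute-normalize-permute pattern reduces to an RMS norm over the channel axis at each spatial position; a NumPy sketch under that reading (the 1e-6 epsilon and shapes are illustrative):

```python
import numpy as np

def rms_norm_channels(x, weight=None, eps=1e-6):
    # x: (B, C, H, W); RMS-normalize across the channel axis per (h, w) position
    rms = np.sqrt(np.mean(x * x, axis=1, keepdims=True) + eps)
    y = x / rms
    if weight is not None:            # optional per-channel scale, like the weight tensor
        y = y * weight.reshape(1, -1, 1, 1)
    return y

rng = np.random.default_rng(0)
x = rng.random((1, 8, 4, 4), dtype=np.float32)
y = rms_norm_channels(x)
print(y.shape)
```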


23-53: LGTM: Asymmetric SAME padding correctly implemented.

The calculation matches PyTorch/TensorFlow SAME padding semantics with proper ceil division for output size and asymmetric padding split. The ggml_pad_ext call uses correct dimension ordering for [W,H,C,B] layout.
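The asymmetric split can be sketched in plain Python (an illustration of the SAME-padding semantics described above, not the ggml code itself):

```python
import math

def same_padding_1d(in_size: int, kernel: int, stride: int, dilation: int = 1):
    """Return (pad_before, pad_after) matching TF-style 'SAME' semantics."""
    out_size = math.ceil(in_size / stride)            # ceil division for output size
    eff_k = (kernel - 1) * dilation + 1               # effective (dilated) kernel
    total = max(0, (out_size - 1) * stride + eff_k - in_size)
    return total // 2, total - total // 2             # extra pixel goes after

# 3-wide kernel, stride 2 on an even input pads asymmetrically:
print(same_padding_1d(8, 3, 2))
```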


57-88: LGTM: Block builders handle null pointers safely.

All three block builder functions (build_edge_residual, build_inverted_residual, build_mobilenet_attn) consistently check for null pointers before dereferencing optional weights and biases. Residual connection logic correctly validates dimension matching before addition.

Also applies to: 91-149, 152-246


248-453: LGTM: Main build function integrates all components correctly.

The build() function properly sequences stem convolution, stage/block iteration, fusion point collection, MSFA logic, and Gemma3n embedding projection. Logging statements use proper macros (LOG_WRN, LOG_ERR) as addressed in previous reviews. All optional weight/bias tensors are guarded by null checks before use.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
gguf-py/gguf/constants.py (1)

857-865: GEMMA3N mapping still missing from VISION_PROJECTOR_TYPE_NAMES dictionary.

This is the same issue flagged in the previous review. The GEMMA3N enum value added at line 459 still has no corresponding entry in this dictionary. This mapping is required for the enum-to-string conversion to work correctly.

convert_hf_to_gguf.py (1)

6014-6039: Fix Gemma3n MobileNetV5 patch_size semantics and avoid brittle hard‑coded vision hparams

Two related issues here:

  1. patch_size is still computed from total token count, not per‑side patches (critical)

    • Current code: patch_size = image_size // image_seq_length (e.g., 768 // 256 = 3), which implies n_per_side = 256 and a 65k‑token grid.
    • Semantically, image_seq_length is total patches (e.g., 256 = 16×16). Patch size must be derived from sqrt(image_seq_length) so that both the converter and C++ vision path agree on a 16×16 grid and the correct patch_size (48 for 768×768).
  2. Hard‑coded MobileNetV5 defaults in find_vparam are brittle

    • hidden_size defaulting to 2048 and num_heads forced to 8 will silently be wrong if future Gemma3n variants change these values in their config. It’s safer to read from self.hparams_vision when available and only fall back to defaults if the config is missing them.
Patch: derive `patch_size` from √image_seq_length

```diff
-        # Image sequence length (256 tokens = 16x16 for Gemma3n)
-        image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
-        image_size = self.hparams_vision["image_size"]
-        self.hparams_vision["patch_size"] = image_size // image_seq_length
+        # Image sequence length is total tokens (e.g. 256 = 16×16 grid)
+        image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
+        image_size = self.hparams_vision["image_size"]
+
+        n_per_side = int(image_seq_length ** 0.5)
+        if n_per_side * n_per_side != image_seq_length:
+            raise ValueError(f"image_seq_length={image_seq_length} is not a perfect square")
+
+        # e.g. 768 // 16 = 48 for a 16×16 patch grid
+        self.hparams_vision["patch_size"] = image_size // n_per_side
```
Patch: prefer config‑driven head / FFN sizes in `find_vparam`

```diff
     def find_vparam(self, keys: list[str], optional: bool = False) -> Any:
         """Override to provide hardcoded MobileNetV5 parameters that aren't in config"""
         # Handle empty keys list (n_block_keys) - return 0 for CNN architecture
         if not keys:
             return 0

-        if "intermediate_size" in keys:
-            # Typical expansion is 4x the embedding dimension
-            hidden_size = self.hparams_vision.get("hidden_size", 2048)
-            return hidden_size * 4
-
-        if "num_attention_heads" in keys or "num_heads" in keys:
-            # Multi-Query Attention with 8 heads
-            return 8
+        if "intermediate_size" in keys:
+            assert self.hparams_vision is not None
+            if "intermediate_size" in self.hparams_vision:
+                return self.hparams_vision["intermediate_size"]
+            # Fallback: typical MobileNetV5 expansion is 4× hidden_size
+            hidden_size = self.hparams_vision.get("hidden_size")
+            if hidden_size is not None:
+                return hidden_size * 4
+
+        if any(k in ("num_attention_heads", "num_heads") for k in keys):
+            assert self.hparams_vision is not None
+            for k in ("num_attention_heads", "num_heads"):
+                if k in self.hparams_vision:
+                    return self.hparams_vision[k]
+            # Final fallback if config is missing heads
+            return 8

         # For other parameters, use parent implementation
         return super().find_vparam(keys, optional)
```

Given this is a mirrored PR, you’ll probably want to carry this fix locally and/or ping upstream about the patch_size formula and config‑driven defaults.

🧹 Nitpick comments (3)
gguf-py/gguf/constants.py (1)

1071-1071: Consider clarifying comment for V_MM_SOFT_EMB_NORM.

The comment here shows # gemma3n, but the enum definition at line 669 shows # gemma3. If this tensor is used by both gemma3 and gemma3n architectures, consider using a comment like # gemma3, gemma3n to clarify the shared usage and avoid confusion.

convert_hf_to_gguf.py (2)

5969-5995: Annotate mutable class attributes with ClassVar to satisfy Ruff RUF012

n_block_keys = [] and block_tensor_mapping = {…} are mutable class attributes; Ruff expects them to be annotated as typing.ClassVar[...].

Proposed type annotations for class attributes

Add ClassVar to the typing imports:

```python
from typing import TYPE_CHECKING, Any, Callable, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast, ClassVar
```

Then update the class attributes:

```diff
-    n_block_keys = []
+    n_block_keys: ClassVar[list[str]] = []
@@
-    block_tensor_mapping = {
+    block_tensor_mapping: ClassVar[dict[str, str]] = {
         "model.vision_tower.timm_model.blocks.{bid}.{sid}.conv_exp.weight": "v.blk.{bid}.{sid}.conv_exp.weight",
         ...
     }
```

6171-6226: Embedding padding and ALTUP stacking logic look correct; consider a small guard

This block:

  • Only affects language_model.* tensors; others are skipped, which keeps mmproj / vision clean.
  • Pads embed_tokens.weight and embed_tokens_per_layer up to hparams["vocab_size"], filling new rows with zeros for vision/audio tokens (which get real features from the vision/audio towers anyway).
  • Leaves altup_unembed_projections and altup_projections unpadded and stacks three shard tensors into single [3, …, …] matrices, matching how GGUF expects them.
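The shard stacking can be sketched as follows (shapes are illustrative, and numpy stands in for the torch tensors the converter actually handles):

```python
import numpy as np

# three per-shard projection matrices, e.g. altup_projections.{0,1,2}
shards = [np.full((4, 2), i, dtype=np.float32) for i in range(3)]

# stack into a single [3, ...] tensor, as GGUF expects
stacked = np.stack(shards, axis=0)
print(stacked.shape)
```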

You might consider adding a simple sanity check on the padding path to catch config mismatches earlier (optional):

Optional: assert current vs target vocab sizes when padding
-            vocab_size = self.hparams.get("vocab_size", 262400)
-            current_size = data_torch.shape[0]  # First dimension is vocab_size
+            vocab_size = self.hparams.get("vocab_size", 262400)
+            current_size = data_torch.shape[0]  # first dim is vocab size
+
+            if current_size > vocab_size:
+                raise ValueError(
+                    f"embed tensor rows ({current_size}) exceed vocab_size ({vocab_size})"
+                )
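To make the padding invariant concrete, here is a minimal numpy sketch of the zero-fill-and-guard behavior described above (`pad_vocab_rows` and the toy shapes are illustrative, not the converter's actual API):

```python
import numpy as np


def pad_vocab_rows(embed: np.ndarray, vocab_size: int) -> np.ndarray:
    """Zero-pad the first (vocab) dimension of an embedding matrix up to vocab_size."""
    current_size = embed.shape[0]
    if current_size > vocab_size:
        raise ValueError(f"embed tensor rows ({current_size}) exceed vocab_size ({vocab_size})")
    if current_size == vocab_size:
        return embed
    pad = np.zeros((vocab_size - current_size, *embed.shape[1:]), dtype=embed.dtype)
    return np.concatenate([embed, pad], axis=0)


emb = np.ones((4, 3), dtype=np.float32)   # toy "text-only" embedding table
padded = pad_vocab_rows(emb, 6)           # two extra rows for vision/audio tokens
print(padded.shape)                       # (6, 3); rows 4-5 are all zeros
```

The padded rows stay zero because those token IDs receive real features from the vision/audio towers at runtime, as noted above.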
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e2835e9 and 632e29f.

📒 Files selected for processing (3)
  • convert_hf_to_gguf.py
  • gguf-py/gguf/constants.py
  • tools/mtmd/clip-impl.h
🚧 Files skipped from review as they are similar to previous changes (1)
  • tools/mtmd/clip-impl.h
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Always use the Python environment in .venv and run Python tools from that environment
Apply Python linting rules configured in .flake8 (max-line-length=125, excludes examples/tools) and type checking with pyright

Files:

  • gguf-py/gguf/constants.py
  • convert_hf_to_gguf.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".

When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {ggml/**,src/**/backend*.{c,cpp,h,hpp},tools/server/**} : Backend-related changes (CPU, CUDA, Metal, Vulkan, etc.) and modifications to `tools/server` require AI usage disclosure if significant code is generated

Applied to files:

  • gguf-py/gguf/constants.py
  • convert_hf_to_gguf.py
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {include/llama.h,ggml/**/*.h,mtmd/**/*.h} : Public API modifications in `include/llama.h`, `ggml.h`, or `mtmd.h` require AI usage disclosure if significant code is generated

Applied to files:

  • gguf-py/gguf/constants.py
  • convert_hf_to_gguf.py
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.

Applied to files:

  • gguf-py/gguf/constants.py
🧬 Code graph analysis (1)
convert_hf_to_gguf.py (3)
convert_lora_to_gguf.py (2)
  • modify_tensors (448-473)
  • to (185-186)
gguf-py/gguf/gguf_writer.py (2)
  • add_clip_projector_type (1076-1077)
  • add_vision_attention_layernorm_eps (1097-1098)
gguf-py/gguf/constants.py (1)
  • VisionProjectorType (3429-3450)
🪛 Ruff (0.14.10)
convert_hf_to_gguf.py

5969-5969: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


5972-5995: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


6067-6067: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (3)
convert_hf_to_gguf.py (3)

522-527: Defensive handling of empty tensor_map in prepare_tensors looks good

The fallback max_name_len path for models with block_count == 0 (e.g., MobileNetV5) avoids max() on an empty mapping and only affects logging alignment. No functional risks spotted.


6070-6090: Verify MobileNetV5 tensor remapping and reshaping assumptions

The overall routing logic looks consistent:

  • Non‑Gemma3n tensors are skipped early.
  • Block tensors under model.vision_tower.timm_model.blocks.* are mapped via custom_map into the v.blk.{bid}.{sid}.* namespace.
  • Other MobileNetV5 pieces (stem / msfa / embeddings) fall back to the standard tensor map.
  • Biases and layer_scale.gamma are reshaped from [C] into [1, C, 1, 1], which matches a conv‑style broadcast.

If you haven’t already, it would be worth running a quick shape sanity check against a real Gemma3n vision checkpoint (ensure conv_stem.conv.bias and layer_scale.gamma are 1‑D [C] and that the resulting GGUF tensors have the shapes expected by mobilenetv5.cpp / clip.cpp for those fields).


6115-6134: Gemma3NModel.set_vocab correctly forces use of full vocab_size

Temporarily removing vocab_size_per_layer_input so the vocab builders see the full vocab_size (including vision/audio specials) and then restoring it is a clean way to reconcile:

  • vocab generation: needs 262400 entries
  • per‑layer embeddings: still driven by vocab_size_per_layer_input later

No functional issues seen.

…roadcasting of vision inp_per_layer

- Simplify the vision embeddings path of "get_per_layer_inputs" to output [n_embd_altup, n_layer, 1], broadcastable

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
gguf-py/gguf/constants.py (1)

459-459: GEMMA3N still missing from VISION_PROJECTOR_TYPE_NAMES dictionary.

This issue was previously flagged but remains unresolved. The GEMMA3N entry was added to the VISION_PROJECTOR_TYPE enum at line 459, but the corresponding mapping in VISION_PROJECTOR_TYPE_NAMES dictionary (lines 856-864) is still missing. This will cause runtime errors when converting the enum value to its string representation.

Please add the mapping:

VISION_PROJECTOR_TYPE.GEMMA3N:   "gemma3n",

Note: QWEN3VL and COGVLM mappings are also still missing as noted in the previous review.

🧹 Nitpick comments (1)
src/models/gemma3n-iswa.cpp (1)

263-274: Consider adding dimension validation for the vision embedding path.

The vision embedding path assumes embd_size = model.tok_embd_per_layer->ne[0] equals n_embd_altup * n_layer. If this assumption is violated, the reshape at line 272 will produce incorrect results or fail silently.

💡 Optional: Add assertion to validate dimensions
 // Vision embedding path: use padding token (ID=0) embedding
 const int64_t embd_size = model.tok_embd_per_layer->ne[0];  // n_embd_altup * n_layer
+GGML_ASSERT(embd_size == n_embd_altup * n_layer && "tok_embd_per_layer dimension mismatch");

 // Extract and dequantize padding token embedding (column 0)
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d37c22b and 47b7dd1.

📒 Files selected for processing (3)
  • gguf-py/gguf/constants.py
  • gguf-py/gguf/tensor_mapping.py
  • src/models/gemma3n-iswa.cpp
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{c,cpp,h,hpp}

📄 CodeRabbit inference engine (AGENTS.md)

Always format C++ code before committing using git clang-format with the project's .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer alignment void * ptr, reference alignment int & ref)

Files:

  • src/models/gemma3n-iswa.cpp
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Always use the Python environment in .venv and run Python tools from that environment
Apply Python linting rules configured in .flake8 (max-line-length=125, excludes examples/tools) and type checking with pyright

Files:

  • gguf-py/gguf/tensor_mapping.py
  • gguf-py/gguf/constants.py
🧠 Learnings (4)
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {ggml/**,src/**/backend*.{c,cpp,h,hpp},tools/server/**} : Backend-related changes (CPU, CUDA, Metal, Vulkan, etc.) and modifications to `tools/server` require AI usage disclosure if significant code is generated

Applied to files:

  • src/models/gemma3n-iswa.cpp
  • gguf-py/gguf/constants.py
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {include/llama.h,ggml/**/*.h,mtmd/**/*.h} : Public API modifications in `include/llama.h`, `ggml.h`, or `mtmd.h` require AI usage disclosure if significant code is generated

Applied to files:

  • gguf-py/gguf/constants.py
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.

Applied to files:

  • gguf-py/gguf/constants.py
🧬 Code graph analysis (2)
src/models/gemma3n-iswa.cpp (1)
ggml/src/ggml.c (7)
  • ggml_new_tensor_1d (1747-1752)
  • ggml_set_input (7435-7437)
  • ggml_get_rows (3776-3797)
  • ggml_reshape_3d (3564-3581)
  • ggml_scale (3290-3295)
  • ggml_cpy (3426-3431)
  • ggml_add (1969-1974)
gguf-py/gguf/tensor_mapping.py (1)
gguf-py/gguf/constants.py (1)
  • MODEL_TENSOR (465-742)
🔇 Additional comments (7)
gguf-py/gguf/tensor_mapping.py (1)

123-155: Verify V_MM_INP_PROJ comment and tensor reuse.

The mapping for V_MM_INP_PROJ at lines 129-131 is commented as "# gemma3n", but in constants.py (line 668), V_MM_INP_PROJ is defined with comment "# gemma3". Please verify:

  1. Is V_MM_INP_PROJ being reused between gemma3 and gemma3n, or should gemma3n have a distinct tensor?
  2. If it's reused, consider clarifying the comment or noting the shared usage.

Otherwise, the new gemma3n vision tensor mappings look correct and follow the established pattern.

gguf-py/gguf/constants.py (4)

670-678: LGTM: MODEL_TENSOR enum additions.

The new gemma3n vision tensor enum members are correctly defined with auto() and follow the established naming conventions and patterns.


1070-1079: LGTM: TENSOR_NAMES mappings for gemma3n.

The new TENSOR_NAMES mappings correctly associate the MODEL_TENSOR enum members with their GGUF tensor name strings, following established naming conventions.


1178-1186: LGTM: MODEL_TENSORS additions for MMPROJ arch.

The new gemma3n vision tensors are correctly added to the MODEL_TENSORS[MODEL_ARCH.MMPROJ] list, ensuring they will be recognized during tensor loading and validation.


3428-3428: LGTM: VisionProjectorType.GEMMA3N constant.

The GEMMA3N constant is correctly defined with the appropriate lowercase string value "gemma3n", consistent with other projector type definitions.

src/models/gemma3n-iswa.cpp (2)

253-261: LGTM: Token input path refactoring.

The refactoring correctly moves the input object creation and lifecycle management into the token branch scope. The res->add_input(std::move(inp)) call after processing is appropriately placed.


292-292: Broadcasting order change is intentional and properly implemented.

The addition order change—placing per_layer_proj (shape [n_embd_altup, n_layer, n_tokens]) first and inp_per_layer second—is correct per the PR objectives and supports broadcasting for vision inputs (shape [n_embd_altup, n_layer, 1]). The implementation follows standard conventions (larger tensor first) and the code is properly formatted per project guidelines.
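The shape compatibility the comment relies on can be sanity-checked outside ggml; a numpy sketch with tiny illustrative sizes (note that numpy's shape order is the reverse of ggml's `ne[]` order, and ggml's own broadcasting lives in `ggml_add`):

```python
import numpy as np

# Illustrative sizes only; ggml [n_embd_altup, n_layer, n_tokens] maps to
# numpy shape (n_tokens, n_layer, n_embd_altup)
n_embd_altup, n_layer, n_tokens = 4, 3, 5

per_layer_proj = np.ones((n_tokens, n_layer, n_embd_altup))  # ggml [n_embd_altup, n_layer, n_tokens]
inp_per_layer = np.full((1, n_layer, n_embd_altup), 2.0)     # ggml [n_embd_altup, n_layer, 1]

out = per_layer_proj + inp_per_layer  # the size-1 token axis broadcasts across all tokens
print(out.shape)  # (5, 3, 4)
```

The size-1 token dimension of the vision input is repeated across all `n_tokens` positions, which is exactly what the `[n_embd_altup, n_layer, 1]` vision path needs.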


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
gguf-py/gguf/constants.py (1)

461-472: Bug: VISION_PROJECTOR_TYPE.GEMMA3N is missing from VISION_PROJECTOR_TYPE_NAMES.

You added the enum value (Line 468) but didn’t add the corresponding name in VISION_PROJECTOR_TYPE_NAMES (Line 870+). If code relies on this mapping, GEMMA3N will fail to serialize/deserialize (or raise a KeyError).

Proposed fix
 VISION_PROJECTOR_TYPE_NAMES: dict[VISION_PROJECTOR_TYPE, str] = {
     VISION_PROJECTOR_TYPE.MLP:       "mlp",
     VISION_PROJECTOR_TYPE.LDP:       "ldp",
     VISION_PROJECTOR_TYPE.LDPV2:     "ldpv2",
     VISION_PROJECTOR_TYPE.RESAMPLER: "resampler",
     VISION_PROJECTOR_TYPE.GLM_EDGE:  "adapter",
     VISION_PROJECTOR_TYPE.MERGER:    "qwen2vl_merger",
+    VISION_PROJECTOR_TYPE.GEMMA3N:   "gemma3n",
     VISION_PROJECTOR_TYPE.GEMMA3:    "gemma3",
 }

Also applies to: 870-878

🤖 Fix all issues with AI agents
In @convert_hf_to_gguf.py:
- Around line 530-535: The access to self.tensor_map.mapping in prepare_tensors
is fragile if tensor_map lacks a mapping attribute; change the guard to use
getattr(self.tensor_map, "mapping", None) and treat a falsy result the same as
an empty mapping so max_name_len computation and the fallback to
"vision_encoder.weight," are used safely; update references in prepare_tensors
and any subsequent usage that assumes mapping exists to first assign mapping =
getattr(self.tensor_map, "mapping", None) and use that local variable for checks
and iteration.
- Around line 6193-6212: The current set_vocab method temporarily deletes
self.hparams["vocab_size_per_layer_input"] then calls super().set_vocab(), but
if super().set_vocab() raises an exception the original value is never restored;
wrap the call to super().set_vocab() in a try/finally so that whatever happens
the original vocab_size_per_layer_input (captured from self.hparams) is
re-assigned to self.hparams["vocab_size_per_layer_input"] in the finally block;
keep the existing logic of only deleting/restoring when
vocab_size_per_layer_input is not None and reference the set_vocab method,
self.hparams, vocab_size_per_layer_input, and super().set_vocab() to locate the
change.
- Around line 6044-6125: The patch_size math in
Gemma3nVisionModel.set_gguf_parameters is wrong: replace the linear division
self.hparams_vision["patch_size"] = image_size // image_seq_length with a
square-root based computation (n_per_side = int(sqrt(image_seq_length)) and
patch_size = image_size // n_per_side) so 256 tokens → 16×16 grid and
patch_size=48 for image_size=768; update references in set_gguf_parameters
accordingly. Also update find_vparam to prefer reading num_heads from
self.hparams_vision (e.g., self.hparams_vision.get("num_heads")) and fall back
to 8 only if absent, keeping the existing hidden_size fallback logic.

In @tools/mtmd/clip-model.h:
- Around line 331-347: Remove the unused msfa_concat_conv_w declaration from the
header and fix the unloaded mm_post_proj_norm_w by adding its loading logic
during GEMMA3N model init in clip.cpp (follow the same pattern used for
mobilenet_stem_conv_w / mobilenet_stem_norm_w: call the model tensor-load helper
to assign mm_post_proj_norm_w, check for nullptr and handle gracefully).
Alternatively, if the model truly does not provide that tensor, remove the
conditional check/usage of mm_post_proj_norm_w in mobilenetv5.cpp instead of
loading it. Refer to the symbols mobilenet_blocks, mobilenet_stem_conv_w,
mobilenet_stem_norm_w, mm_post_proj_norm_w, msfa_concat_conv_w, and the
mobilenetv5.cpp/clip.cpp initialization areas when making the change.
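The try/finally restore requested for `set_vocab` above can be sketched in isolation (`with_full_vocab` is a hypothetical stand-in for the `Gemma3NModel.set_vocab` override, not existing code):

```python
def with_full_vocab(hparams: dict, build_vocab) -> None:
    """Temporarily drop vocab_size_per_layer_input so the vocab builder sees the
    full vocab_size, restoring it even if the builder raises."""
    saved = hparams.pop("vocab_size_per_layer_input", None)
    try:
        build_vocab()
    finally:
        if saved is not None:
            hparams["vocab_size_per_layer_input"] = saved


hp = {"vocab_size": 262400, "vocab_size_per_layer_input": 262144}


def failing_builder():
    raise RuntimeError("vocab build failed")


try:
    with_full_vocab(hp, failing_builder)
except RuntimeError:
    pass

print(hp["vocab_size_per_layer_input"])  # restored despite the exception
```

Without the `finally`, a raising builder would leave `hparams` permanently missing the key, which is exactly the hazard the review item flags.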
🧹 Nitpick comments (1)
tools/mtmd/models/models.h (1)

80-111: Consider narrowing the public surface of clip_graph_mobilenetv5.

These look like internal graph-building helpers; making them private (and/or switching from struct to class) would reduce accidental use outside the implementation.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 47b7dd1 and 465e888.

📒 Files selected for processing (10)
  • convert_hf_to_gguf.py
  • gguf-py/gguf/constants.py
  • gguf-py/gguf/tensor_mapping.py
  • src/models/gemma3n-iswa.cpp
  • tools/mtmd/CMakeLists.txt
  • tools/mtmd/clip-impl.h
  • tools/mtmd/clip-model.h
  • tools/mtmd/clip.cpp
  • tools/mtmd/models/models.h
  • tools/mtmd/mtmd.cpp
🚧 Files skipped from review as they are similar to previous changes (4)
  • tools/mtmd/CMakeLists.txt
  • src/models/gemma3n-iswa.cpp
  • tools/mtmd/mtmd.cpp
  • tools/mtmd/clip-impl.h
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.

Applied to files:

  • tools/mtmd/clip.cpp
  • gguf-py/gguf/constants.py
🧬 Code graph analysis (2)
tools/mtmd/models/models.h (1)
tools/mtmd/models/mobilenetv5.cpp (12)
  • build (248-453)
  • build (248-248)
  • rms_norm_2d (5-20)
  • rms_norm_2d (5-5)
  • pad_same_2d (23-53)
  • pad_same_2d (23-23)
  • build_edge_residual (57-88)
  • build_edge_residual (57-57)
  • build_inverted_residual (91-149)
  • build_inverted_residual (91-91)
  • build_mobilenet_attn (152-246)
  • build_mobilenet_attn (152-152)
gguf-py/gguf/tensor_mapping.py (1)
gguf-py/gguf/constants.py (1)
  • MODEL_TENSOR (474-751)
🪛 Ruff (0.14.10)
convert_hf_to_gguf.py

6047-6047: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


6050-6073: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


6145-6145: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (10)
gguf-py/gguf/constants.py (1)

3525-3549: VisionProjectorType.GEMMA3N = "gemma3n" looks consistent with the intended surface.

This matches the new projector type string and aligns with the new tensor set.

gguf-py/gguf/tensor_mapping.py (1)

126-159: Please verify these HF key strings against a real Gemma3n checkpoint.

These new mappings are only as good as the exact parameter names (e.g., model.embed_vision.embedding, model.vision_tower.timm_model.conv_stem.conv). If upstream HF naming differs even slightly, conversion will fail to find tensors.

If you’ve seen multiple naming variants across releases, consider adding extra aliases per tensor (as done elsewhere in this file).

tools/mtmd/models/models.h (1)

79-111: Likely an incorrect or invalid review comment.

tools/mtmd/clip.cpp (6)

791-794: LGTM!

The GEMMA3N case correctly delegates to the clip_graph_mobilenetv5 builder, following the established pattern for other projector types.


1153-1160: LGTM!

The GEMMA3N hparams configuration correctly sets n_merge = 1 since MobileNetV5 handles spatial reduction internally through its architecture rather than through patch merging. The optional override via KEY_PROJ_SCALE_FACTOR provides flexibility.


2961-2969: LGTM!

The GEMMA3N preprocessing correctly resizes to a square image without padding (add_padding = false), which aligns with MobileNetV5 expectations. This is an intentional difference from GEMMA3's behavior.


3631-3631: LGTM!

GEMMA3N correctly falls through with GEMMA3 and other projector types that don't require special input tensor setup beyond the raw image input.


3759-3760: LGTM!

GEMMA3N correctly shares the embedding dimension source (mm_input_proj_w->ne[0]) with GEMMA3, consistent with both using similar projection mechanisms.


1577-1652: The dynamic block discovery correctly handles variable-length mobilenet_stage_ends. The graph builder in clip_graph_mobilenetv5 already protects against fewer-than-4 entries with an explicit size check (if (model.mobilenet_stage_ends.size() >= 4) at line 275 in mobilenetv5.cpp) before accessing indices [2] and [3], and provides fallback logic using total_blocks - 1 when fewer entries exist. No action needed.

Likely an incorrect or invalid review comment.

tools/mtmd/clip-model.h (1)

176-213: LGTM!

The mobilenetv5_block structure is well-organized with clear comments distinguishing Stage 0 (Edge Residual), Stage 1+ (Universal Inverted Residual), and Attention components. All pointers are properly initialized to nullptr.

Comment on lines 530 to +535
     def prepare_tensors(self):
-        max_name_len = max(len(s) for _, s in self.tensor_map.mapping.values()) + len(".weight,")
+        # Handle empty tensor_map for models with block_count=0 (like MobileNetV5)
+        if self.tensor_map.mapping:
+            max_name_len = max(len(s) for _, s in self.tensor_map.mapping.values()) + len(".weight,")
+        else:
+            max_name_len = len("vision_encoder.weight,")  # Default reasonable length

⚠️ Potential issue | 🟡 Minor

Guard for empty tensor_map.mapping is good, but consider hardening the attribute access.

If gguf.TensorNameMap ever changes shape (e.g., no .mapping attr), this will raise at runtime. A tiny defensive getattr(self.tensor_map, "mapping", None) would make this robust.

🤖 Prompt for AI Agents
In @convert_hf_to_gguf.py around lines 530 - 535, The access to
self.tensor_map.mapping in prepare_tensors is fragile if tensor_map lacks a
mapping attribute; change the guard to use getattr(self.tensor_map, "mapping",
None) and treat a falsy result the same as an empty mapping so max_name_len
computation and the fallback to "vision_encoder.weight," are used safely; update
references in prepare_tensors and any subsequent usage that assumes mapping
exists to first assign mapping = getattr(self.tensor_map, "mapping", None) and
use that local variable for checks and iteration.
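A standalone sketch of the hardened guard described above (`compute_max_name_len` and the `SimpleNamespace` stand-ins are illustrative, not the converter's actual API):

```python
from types import SimpleNamespace


def compute_max_name_len(tensor_map) -> int:
    """Tolerates both an empty mapping and a tensor_map object without a
    .mapping attribute, falling back to a default width for log alignment."""
    mapping = getattr(tensor_map, "mapping", None)
    if mapping:
        return max(len(s) for _, s in mapping.values()) + len(".weight,")
    return len("vision_encoder.weight,")  # default reasonable length


print(compute_max_name_len(object()))                     # no .mapping attr -> fallback
print(compute_max_name_len(SimpleNamespace(mapping={})))  # empty mapping -> fallback
print(compute_max_name_len(SimpleNamespace(mapping={"k": (0, "v.stem")})))
```

Since the value only drives log-column alignment, the fallback is harmless even when the attribute is missing entirely.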

Comment on lines +6044 to +6125
@ModelBase.register("Gemma3nForConditionalGeneration", "Gemma3nVisionModel")
class Gemma3nVisionModel(MmprojModel):
    """Vision encoder converter for Gemma3n using MobileNetV5 architecture"""
    n_block_keys = []

    # Double indexed mapping for MobileNetV5 blocks
    block_tensor_mapping = {
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.conv_exp.weight": "v.blk.{bid}.{sid}.conv_exp.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.bn1.weight": "v.blk.{bid}.{sid}.bn1.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.conv_pwl.weight": "v.blk.{bid}.{sid}.conv_pwl.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.bn2.weight": "v.blk.{bid}.{sid}.bn2.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_start.conv.weight": "v.blk.{bid}.{sid}.dw_start.conv.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_start.bn.weight": "v.blk.{bid}.{sid}.dw_start.bn.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_mid.conv.weight": "v.blk.{bid}.{sid}.dw_mid.conv.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_mid.bn.weight": "v.blk.{bid}.{sid}.dw_mid.bn.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.pw_exp.conv.weight": "v.blk.{bid}.{sid}.pw_exp.conv.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.pw_exp.bn.weight": "v.blk.{bid}.{sid}.pw_exp.bn.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.pw_proj.conv.weight": "v.blk.{bid}.{sid}.pw_proj.conv.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.pw_proj.bn.weight": "v.blk.{bid}.{sid}.pw_proj.bn.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.layer_scale.gamma": "v.blk.{bid}.{sid}.layer_scale.gamma",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.query.proj.weight": "v.blk.{bid}.{sid}.attn.query.proj.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.key.proj.weight": "v.blk.{bid}.{sid}.attn.key.proj.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.value.proj.weight": "v.blk.{bid}.{sid}.attn.value.proj.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.output.proj.weight": "v.blk.{bid}.{sid}.attn.output.proj.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.key.down_conv.weight": "v.blk.{bid}.{sid}.attn.key.down_conv.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.key.norm.weight": "v.blk.{bid}.{sid}.attn.key.norm.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.value.down_conv.weight": "v.blk.{bid}.{sid}.attn.value.down_conv.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.value.norm.weight": "v.blk.{bid}.{sid}.attn.value.norm.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.norm.weight": "v.blk.{bid}.{sid}.norm.weight",
    }

    def find_hparam(self, keys: list[str], optional: bool = False) -> Any:
        """Override to return 0 for block count since MobileNetV5 is CNN-based"""
        if not keys:  # If n_block_keys is empty (our case)
            return 0
        # Otherwise use parent implementation
        return super().find_hparam(keys, optional)

    def __init__(self, *args, **kwargs):
        # Parent init will call find_hparam which now returns 0 for empty keys
        super().__init__(*args, **kwargs)

    def find_vparam(self, keys: list[str], optional: bool = False) -> Any:
        """Override to provide hardcoded MobileNetV5 parameters that aren't in config"""
        # Handle empty keys list (n_block_keys) - return 0 for CNN architecture
        if not keys:
            return 0

        if "intermediate_size" in keys:
            # Typical expansion is 4x the embedding dimension
            hidden_size = self.hparams_vision.get("hidden_size", 2048)
            return hidden_size * 4

        if "num_attention_heads" in keys or "num_heads" in keys:
            # Multi-Query Attention with 8 heads
            return 8

        # For other parameters, use parent implementation
        return super().find_vparam(keys, optional)

    def set_gguf_parameters(self):
        # MobileNetV5 does not use normalisation at all
        self.preprocessor_config["image_mean"] = [0.0, 0.0, 0.0]
        self.preprocessor_config["image_std"] = [1.0, 1.0, 1.0]
        self.hparams_vision["image_size"] = self.preprocessor_config.get(
            "size", {"height": 768, "width": 768}
        )["height"]

        # Image sequence length (256 tokens = 16x16 for Gemma3n)
        image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
        image_size = self.hparams_vision["image_size"]
        self.hparams_vision["patch_size"] = image_size // image_seq_length

        # Now call parent which will use the corrected values
        super().set_gguf_parameters()

        # Set projector type to GEMMA3N
        self.gguf_writer.add_clip_projector_type(gguf.VisionProjectorType.GEMMA3N)

        # MobileNetV5 specific parameters
        self.gguf_writer.add_vision_attention_layernorm_eps(self.hparams.get("layer_norm_eps", 1e-6))


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's check the actual file content around the specified lines
head -6130 convert_hf_to_gguf.py | tail -90 | cat -n

Repository: ngxson/llama.cpp

Length of output: 6407


🏁 Script executed:

# Check the parent class to understand the context
rg -A 10 "class MmprojModel" convert_hf_to_gguf.py | head -50

Repository: ngxson/llama.cpp

Length of output: 430


🏁 Script executed:

# Search for similar vision model implementations and how patch_size is typically computed
rg "patch_size.*=" convert_hf_to_gguf.py -B 3 -A 3 | head -100

Repository: ngxson/llama.cpp

Length of output: 1714


🏁 Script executed:

# Check if image_seq_length is used elsewhere to understand the expected pattern
rg "image_seq_length" convert_hf_to_gguf.py -B 2 -A 2

Repository: ngxson/llama.cpp

Length of output: 750


🏁 Script executed:

# Look for how n_per_side or sqrt is used in similar contexts
rg "sqrt\|n_per_side\|\*\*\s*0\.5" convert_hf_to_gguf.py | head -20

Repository: ngxson/llama.cpp

Length of output: 42


Fix patch_size computation for Gemma3n—currently mathematically incorrect.

self.hparams_vision["patch_size"] = image_size // image_seq_length (line 75) treats token count as a linear divisor. For a 16×16 patch grid (256 tokens), the correct formula is n_per_side = sqrt(image_seq_length) and patch_size = image_size // n_per_side. With image_size=768 and image_seq_length=256, the current code produces patch_size=3 instead of 48—a 16× error that propagates downstream. Other vision models in this codebase (Qwen3VL, TinyGemma3) use the correct square-root approach.

Proposed fix
         # Image sequence length (256 tokens = 16x16 for Gemma3n)
         image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
+        n_per_side = int(image_seq_length ** 0.5)
+        if n_per_side * n_per_side != image_seq_length:
+            raise ValueError(f"image_seq_length must be a perfect square, got {image_seq_length}")
         image_size = self.hparams_vision["image_size"]
-        self.hparams_vision["patch_size"] = image_size // image_seq_length
+        self.hparams_vision["patch_size"] = image_size // n_per_side

Additionally, find_vparam() hardcodes num_heads=8 (line 59) with no config fallback, while hidden_size (line 54) reads from config with a default. For consistency, attempt to read num_heads from self.hparams_vision before hardcoding.
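The arithmetic behind the fix, runnable in isolation (`gemma3n_patch_size` is just the proposed computation from the diff above, not an existing helper):

```python
from math import isqrt


def gemma3n_patch_size(image_size: int, image_seq_length: int) -> int:
    n_per_side = isqrt(image_seq_length)  # 256 tokens -> 16x16 grid
    if n_per_side * n_per_side != image_seq_length:
        raise ValueError(f"image_seq_length must be a perfect square, got {image_seq_length}")
    return image_size // n_per_side


print(768 // 256)                    # linear division as in the PR: 3 (16x too small)
print(gemma3n_patch_size(768, 256))  # square-root form: 48
```

With patch_size=48 the encoder sees a 16×16 grid of patches over a 768×768 image, matching the 256-token image sequence length.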

🧰 Tools
🪛 Ruff (0.14.10)

6047-6047: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


6050-6073: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

🤖 Prompt for AI Agents
In @convert_hf_to_gguf.py around lines 6044 - 6125, The patch_size math in
Gemma3nVisionModel.set_gguf_parameters is wrong: replace the linear division
self.hparams_vision["patch_size"] = image_size // image_seq_length with a
square-root based computation (n_per_side = int(sqrt(image_seq_length)) and
patch_size = image_size // n_per_side) so 256 tokens → 16×16 grid and
patch_size=48 for image_size=768; update references in set_gguf_parameters
accordingly. Also update find_vparam to prefer reading num_heads from
self.hparams_vision (e.g., self.hparams_vision.get("num_heads")) and fall back
to 8 only if absent, keeping the existing hidden_size fallback logic.

Comment on lines +6134 to +6169
def custom_map(self, name: str) -> str:
"""Parses names like model.vision_tower.timm_model.blocks.1.2.suffix and applies template mapping."""
parts = name.split(".")
# MobileNet blocks have at least 7 parts: model, vision_tower, timm_model, blocks, bid, sid, and suffix
if len(parts) >= 7:
bid, sid = parts[4], parts[5]
suffix = ".".join(parts[6:])
template = f"model.vision_tower.timm_model.blocks.{{bid}}.{{sid}}.{suffix}"
if template in self.block_tensor_mapping:
return self.block_tensor_mapping[template].format(bid=bid, sid=sid)

raise ValueError(f"Unknown name: {name}")

def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
del bid # unused

# Gemma3n uses
# - model.embed_vision.* for projection layers
# - model.vision_tower.* for vision encoder
# Skip non-vision tensors
if not (name.startswith("model.embed_vision.") or
name.startswith("model.vision_tower.")):
return []

if name.startswith("model.vision_tower.timm_model.blocks."):
# Double-indexed block tensors through custom logic
new_name = self.custom_map(name)
else:
# Route non-repeating tensors (conv_stem, msfa, embedding, etc.) and uncaught names through tensor_mapping.py
new_name = self.map_tensor_name(name)

if new_name.endswith("conv_stem.conv.bias") or new_name.endswith("layer_scale.gamma"):
data_torch = data_torch.unsqueeze(0).unsqueeze(-1).unsqueeze(-1) # [1, C, 1, 1]

yield (new_name, data_torch)


⚠️ Potential issue | 🟡 Minor

Make custom_map() less brittle + simplify reshape semantics.

  • custom_map() raises on any unknown blocks.* tensor (Line 6145). That’s fine for a single known checkpoint, but it makes the converter fragile across MobileNetV5 variants (extra tensors, renamed submodules, etc.). Consider falling back to self.map_tensor_name(name) (or skipping with a warning) when the template isn’t found.
  • unsqueeze chain (Line 6165-6166) is harder to read and easier to get wrong than an explicit reshape.
Possible refactor
@@
-        if new_name.endswith("conv_stem.conv.bias") or new_name.endswith("layer_scale.gamma"):
-            data_torch = data_torch.unsqueeze(0).unsqueeze(-1).unsqueeze(-1) # [1, C, 1, 1]
+        if new_name.endswith("conv_stem.conv.bias") or new_name.endswith("layer_scale.gamma"):
+            data_torch = data_torch.reshape(1, -1, 1, 1)  # [1, C, 1, 1]

Also: n_block_keys = [] and block_tensor_mapping = {...} are mutable class attrs; annotate as ClassVar or use tuples / Mapping to satisfy Ruff RUF012 and prevent accidental mutation.
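A hedged sketch of the suggested fallback behavior, written as a free function so it runs standalone (the one-entry mapping table and the `fallback` parameter are illustrative stand-ins for the class attribute and `self.map_tensor_name`):

```python
import logging

logger = logging.getLogger(__name__)

# illustrative subset of the class-level template table
BLOCK_TENSOR_MAPPING = {
    "model.vision_tower.timm_model.blocks.{bid}.{sid}.conv_exp.weight":
        "v.blk.{bid}.{sid}.conv_exp.weight",
}

def custom_map(name: str, fallback=lambda n: n) -> str:
    parts = name.split(".")
    # model, vision_tower, timm_model, blocks, bid, sid, suffix... = 7+ parts
    if len(parts) >= 7:
        bid, sid = parts[4], parts[5]
        suffix = ".".join(parts[6:])
        template = f"model.vision_tower.timm_model.blocks.{{bid}}.{{sid}}.{suffix}"
        if template in BLOCK_TENSOR_MAPPING:
            return BLOCK_TENSOR_MAPPING[template].format(bid=bid, sid=sid)
    # fall back with a warning instead of raising, so unknown
    # MobileNetV5 variants still have a chance to convert
    logger.warning("no block template for %s, falling back", name)
    return fallback(name)

print(custom_map("model.vision_tower.timm_model.blocks.1.2.conv_exp.weight"))
# v.blk.1.2.conv_exp.weight
```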

🧰 Tools
🪛 Ruff (0.14.10)

6145-6145: Avoid specifying long messages outside the exception class

(TRY003)

Comment on lines +331 to +347
// mobilenetv5 for gemma3n
std::vector<mobilenetv5_block> mobilenet_blocks;
std::vector<int> mobilenet_stage_ends;
ggml_tensor * mobilenet_stem_conv_w = nullptr;
ggml_tensor * mobilenet_stem_conv_b = nullptr;
ggml_tensor * mobilenet_stem_norm_w = nullptr;
ggml_tensor * mm_post_proj_norm_w = nullptr;

// Multi-Scale Fusion Adapter (MSFA) components
ggml_tensor * msfa_concat_conv_w = nullptr;
ggml_tensor * msfa_concat_norm_w = nullptr;
ggml_tensor * msfa_ffn_expand_w = nullptr;
ggml_tensor * msfa_ffn_project_w = nullptr;
ggml_tensor * msfa_ffn_expand_bn = nullptr;
ggml_tensor * msfa_ffn_project_bn = nullptr;



⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

rg -n 'msfa_concat_conv_w|mm_post_proj_norm_w' --type-list | head -20

Repository: ngxson/llama.cpp

Length of output: 176


🏁 Script executed:

rg -n 'msfa_concat_conv_w|mm_post_proj_norm_w'

Repository: ngxson/llama.cpp

Length of output: 395


🏁 Script executed:

rg -n -A 5 -B 5 'msfa_concat_norm_w|msfa_concat_conv_w' tools/mtmd/clip.cpp

Repository: ngxson/llama.cpp

Length of output: 833


🏁 Script executed:

rg -n -A 3 -B 3 'mm_post_proj_norm_w' tools/mtmd/clip.cpp

Repository: ngxson/llama.cpp

Length of output: 42


🏁 Script executed:

# Check what gets loaded in clip.cpp for the GEMMA3N case more broadly
rg -n 'get_tensor.*MNV5' tools/mtmd/clip.cpp | head -30

Repository: ngxson/llama.cpp

Length of output: 3743


🏁 Script executed:

# Check if mm_post_proj_norm_w has a corresponding tensor name constant
rg -n 'MNV5.*PROJ|POST.*PROJ' tools/mtmd/

Repository: ngxson/llama.cpp

Length of output: 1049


🏁 Script executed:

# Search for tensor name constants related to post projection
rg -n 'TN_MNV5.*POST|TN_MNV5.*PROJ_NORM' tools/mtmd/clip-impl.h

Repository: ngxson/llama.cpp

Length of output: 42


🏁 Script executed:

# Check the exact line in mobilenetv5.cpp to understand context
sed -n '435,450p' tools/mtmd/models/mobilenetv5.cpp

Repository: ngxson/llama.cpp

Length of output: 603


Remove unused tensor declaration and fix unloaded tensor bug.

msfa_concat_conv_w is declared but never used anywhere in the codebase and should be removed. More critically, mm_post_proj_norm_w is used in mobilenetv5.cpp:445-447 but is never loaded from the model in clip.cpp, so it will always be nullptr. Either add the loading logic for mm_post_proj_norm_w in the GEMMA3N model initialization (around line 1574 in clip.cpp), or remove the conditional check in mobilenetv5.cpp if the tensor doesn't exist in the model file.

🤖 Prompt for AI Agents
In @tools/mtmd/clip-model.h around lines 331 - 347, Remove the unused
msfa_concat_conv_w declaration from the header and fix the unloaded
mm_post_proj_norm_w by adding its loading logic during GEMMA3N model init in
clip.cpp (follow the same pattern used for mobilenet_stem_conv_w /
mobilenet_stem_norm_w: call the model tensor-load helper to assign
mm_post_proj_norm_w, check for nullptr and handle gracefully). Alternatively, if
the model truly does not provide that tensor, remove the conditional check/usage
of mm_post_proj_norm_w in mobilenetv5.cpp instead of loading it. Refer to the
symbols mobilenet_blocks, mobilenet_stem_conv_w, mobilenet_stem_norm_w,
mm_post_proj_norm_w, msfa_concat_conv_w, and the mobilenetv5.cpp/clip.cpp
initialization areas when making the change.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 14

🤖 Fix all issues with AI agents
In @convert_hf_to_gguf.py:
- Around line 6051-6090: The current __init__ and find_hparam logic force
hparams_vision["n_layers"]=0 and unconditionally set
hparams_vision["intermediate_size"]=hidden_size*4 and
hparams_vision["num_attention_heads"]=8, which is brittle; change this to derive
values from a provided vision_config (or require vision_config keys) by: in
find_hparam/__init__ validate presence of required keys in self.hparams_vision
or a passed vision_config, use dict.setdefault for intermediate_size and
num_attention_heads only if the corresponding hidden_size/num_attention_heads
exist, and otherwise raise a clear error or log a fatal message so missing
vision metadata fails loudly; update references to find_hparam, __init__,
hparams_vision, intermediate_size, and num_attention_heads accordingly.
- Around line 6098-6102: The computation of patch_size is incorrect: instead of
dividing image_size by image_seq_length, compute patches_per_side =
int(math.sqrt(image_seq_length)), validate that patches_per_side**2 ==
image_seq_length and image_size % patches_per_side == 0, then set
self.hparams_vision["patch_size"] = image_size // patches_per_side; if
validations fail, raise a clear error (or log and exit) mentioning
image_seq_length and image_size so callers can fix the config (touch variables:
image_seq_length from self.preprocessor_config, image_size and patch_size in
self.hparams_vision).
- Around line 6229-6250: The padding code treats both token embeddings and
per-layer embeddings the same, but embed_tokens_per_layer tensors have shape
[embedding_dim, n_vocab], so padding must be applied on axis 1 for per-layer
tensors instead of axis 0; update the block that checks "embed_tokens.weight" or
"embed_tokens_per_layer" to branch when "per_layer" in name: for regular token
embeddings keep current_size = data_torch.shape[0] and pad with zeros of shape
(padding_size, data_torch.shape[1]) concatenated dim=0; for per-layer embeddings
set current_size = data_torch.shape[1], compute padding_size = vocab_size -
current_size, create padding zeros of shape (data_torch.shape[0], padding_size)
and concatenate dim=1; adjust the logger message accordingly and keep moving
data_torch to CPU before padding and returning (self.map_tensor_name(name),
data_torch).

In @tools/mtmd/clip.cpp:
- Around line 3233-3238: For PROJECTOR_TYPE_GEMMA3N in clip_n_output_tokens(),
n_patches is being set to the number of patches per side
(ctx->model.hparams.image_size / ctx->model.hparams.patch_size) but must be the
total token count (per_side squared); change the assignment so n_patches =
per_side * per_side (e.g., compute per_side = ctx->model.hparams.image_size /
ctx->model.hparams.patch_size and then n_patches = per_side * per_side) to
return 16×16=256 tokens for GEMMA3N and satisfy the downstream sanity check.
- Around line 1154-1160: The comment for PROJECTOR_TYPE_GEMMA3N is misleading:
MobileNetV5 does not fully bypass resizing because preprocessing still
force-resizes the input; update the inline comment near the hparams.n_merge
assignment (and the get_u32 call) to state that Gemma3n/MobileNetV5 expects 256
tokens (16x16), we set n_merge = 1, and note that preprocessing still performs a
forced resize (see the preprocessing logic) so the model's internal resizing
does not eliminate external preprocessing. Keep the behavior unchanged, just
correct and clarify the comment text.

In @tools/mtmd/models/mobilenetv5.cpp:
- Around line 5-20: In clip_graph_mobilenetv5::rms_norm_2d add a defensive null
check for the inp parameter before any dereference (e.g., before calling
ggml_permute); if inp is null, return nullptr (or an appropriate
error/early-exit tensor) to avoid a null-pointer dereference, keeping existing
behavior for weight unchanged and ensuring the function returns a valid
ggml_tensor* in the error case.
- Around line 91-149: The function build_inverted_residual uses the inp pointer
without validation; add an immediate null check at the top of
build_inverted_residual for the inp parameter and handle it safely (e.g., return
nullptr or propagate an error) instead of dereferencing a null pointer so the
rest of the function (uses of inp->ne[...] and residual addition) are not
executed when inp is null.
- Around line 248-260: The build() function uses model.mobilenet_stem_conv_w
without validation; add a null-check at the start of the stem block (before
calling ggml_conv_2d_direct) to detect missing stem weights
(model.mobilenet_stem_conv_w == nullptr) and handle it by logging an
error/throwing or returning nullptr from build() to avoid dereferencing; ensure
downstream code does not assume cur was created if the check fails and keep
existing handling for mobilenet_stem_conv_b and mobilenet_stem_norm_w unchanged.
- Around line 23-53: In pad_same_2d, add a null check for the input pointer inp
at the start of the function and return or handle the error if inp is null; also
validate stride_h and stride_w are > 0 before using them (e.g., return early or
assert/log error) to avoid division by zero when computing oh and ow; update
references to inp, stride_h, and stride_w in pad_same_2d accordingly so the
function fails fast on invalid inputs instead of dereferencing a null pointer or
performing division by zero.
- Around line 57-88: The function build_edge_residual assumes inp and block
weight tensors exist; add explicit null checks at the top of
build_edge_residual: if inp is null return nullptr (or inp as appropriate) to
avoid dereferencing, and verify block.s0_conv_exp_w and block.s0_conv_pwl_w
before calling ggml_conv_2d_direct (and before passing them to rms_norm_2d); if
either weight is null, skip the corresponding conv/pwl steps or return nullptr
consistently so callers can handle the error. Ensure all early exits use the
same convention as the surrounding codebase (nullptr or original inp) and
reference the symbols build_edge_residual, block.s0_conv_exp_w,
block.s0_conv_pwl_w, ggml_conv_2d_direct, and rms_norm_2d when making the
checks.
- Around line 152-246: The function build_mobilenet_attn may dereference null
pointers (inp and several block weight tensors); add defensive null checks at
the start of build_mobilenet_attn to validate inp and before using each required
weight (block.attn_q_w, block.attn_k_w, block.attn_v_w, block.attn_o_w) and
return a safe fallback (e.g., inp or nullptr) or propagate an error if any are
null; also guard uses of optional downsample/norm tensors (block.attn_k_dw_w,
block.attn_v_dw_w, block.attn_k_norm_w, block.attn_v_norm_w,
block.layer_scale_w) so they are only accessed when non-null to avoid
null-pointer deref.
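The per-layer padding fix listed above for convert_hf_to_gguf.py can be sketched as a standalone function; numpy is used here as a stand-in for torch, and the vocab sizes are the 262144/262400 values assumed by the review:

```python
import numpy as np

def pad_vocab(name: str, data: np.ndarray, vocab_size: int) -> np.ndarray:
    # per-layer embeddings are [embedding_dim, n_vocab], regular token
    # embeddings are [n_vocab, embedding_dim], so the padded axis differs
    axis = 1 if "per_layer" in name else 0
    current = data.shape[axis]
    if current >= vocab_size:
        return data
    pad_widths = [(0, 0), (0, 0)]
    pad_widths[axis] = (0, vocab_size - current)
    return np.pad(data, pad_widths)  # zero-pad the vocab axis only

tok = pad_vocab("model.embed_tokens.weight", np.ones((262144, 8)), 262400)
per_layer = pad_vocab("model.embed_tokens_per_layer.weight", np.ones((8, 262144)), 262400)
print(tok.shape, per_layer.shape)  # (262400, 8) (8, 262400)
```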
🧹 Nitpick comments (5)
tools/mtmd/clip-model.h (2)

176-214: mobilenetv5_block layout is clear; consider adding tiny helpers to prevent invalid combos.
As-is, blocks can be “partially populated” (e.g., both Edge+Attention), which may be valid, but it’s easy to mis-handle later; small predicates like is_edge() / is_uir() / is_attn() would make the execution path safer/cleaner in mobilenetv5.cpp.


331-346: Use an index-safe type for mobilenet_stage_ends and keep loader/header consistent.
std::vector<int> mobilenet_stage_ends will truncate on very large models and doesn’t match the size_t indices used in logs/compute. Prefer std::vector<size_t> (or std::vector<int32_t> if you truly want a bounded range) and update the push sites in clip.cpp accordingly.

convert_hf_to_gguf.py (2)

6051-6074: Minor: annotate block_tensor_mapping as ClassVar + keep exception style consistent.
This matches Ruff RUF012 / TRY003 and avoids signaling “instance state”.

Proposed tweak
@@
-from typing import TYPE_CHECKING, Any, Callable, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast
+from typing import TYPE_CHECKING, Any, Callable, ClassVar, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast
@@
-    block_tensor_mapping = {
+    block_tensor_mapping: ClassVar[dict[str, str]] = {
@@
-        raise ValueError(f"Unknown name: {name}")
+        raise ValueError("Unknown MobileNetV5 tensor name") from None

Also applies to: 6116-6128


6174-6193: Use try/finally when temporarily deleting vocab_size_per_layer_input.
As written, an exception in super().set_vocab() can leave self.hparams mutated.

Proposed fix
-        vocab_size_per_layer_input = self.hparams.get("vocab_size_per_layer_input")
-
-        # Temporarily remove vocab_size_per_layer_input to force using vocab_size
-        if vocab_size_per_layer_input is not None:
-            del self.hparams["vocab_size_per_layer_input"]
-
-        # Call parent set_vocab which will now use vocab_size (262400)
-        super().set_vocab()
-
-        # Restore vocab_size_per_layer_input for later use
-        if vocab_size_per_layer_input is not None:
-            self.hparams["vocab_size_per_layer_input"] = vocab_size_per_layer_input
+        vocab_size_per_layer_input = self.hparams.pop("vocab_size_per_layer_input", None)
+        try:
+            # Call parent set_vocab which will now use vocab_size (262400)
+            super().set_vocab()
+        finally:
+            if vocab_size_per_layer_input is not None:
+                self.hparams["vocab_size_per_layer_input"] = vocab_size_per_layer_input
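The pop/try/finally pattern above can be exercised in isolation; in this sketch a plain dict stands in for `self.hparams` and `failing_set_vocab` simulates an exception inside `super().set_vocab()`:

```python
def with_forced_vocab(hparams: dict, set_vocab) -> None:
    # temporarily hide vocab_size_per_layer_input so set_vocab() sees only vocab_size
    saved = hparams.pop("vocab_size_per_layer_input", None)
    try:
        set_vocab()
    finally:
        # restore even if set_vocab() raised, so hparams is never left mutated
        if saved is not None:
            hparams["vocab_size_per_layer_input"] = saved

def failing_set_vocab():
    raise RuntimeError("boom")

hparams = {"vocab_size": 262400, "vocab_size_per_layer_input": 262144}
try:
    with_forced_vocab(hparams, failing_set_vocab)
except RuntimeError:
    pass
print(hparams["vocab_size_per_layer_input"])  # 262144: restored despite the exception
```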
tools/mtmd/models/mobilenetv5.cpp (1)

420-420: Consider extracting epsilon constant.

The epsilon value 1e-6f is used in multiple RMS norm operations (lines 420 and 442). Extracting this to a named constant would improve maintainability and ensure consistency.

♻️ Suggested refactor

At the top of the file or in a constants section:

static constexpr float GEMMA3N_RMS_NORM_EPS = 1e-6f;

Then use throughout:

-        const float eps = 1e-6f; // Gemma3n uses 1e-6
-        cur = ggml_rms_norm(ctx0, cur, eps);
+        cur = ggml_rms_norm(ctx0, cur, GEMMA3N_RMS_NORM_EPS);

Also applies to: 442-442

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 465e888 and bfbb315.

📒 Files selected for processing (6)
  • convert_hf_to_gguf.py
  • tools/mtmd/clip-model.h
  • tools/mtmd/clip.cpp
  • tools/mtmd/models/mobilenetv5.cpp
  • tools/mtmd/models/models.h
  • tools/mtmd/mtmd.cpp
🚧 Files skipped from review as they are similar to previous changes (2)
  • tools/mtmd/mtmd.cpp
  • tools/mtmd/models/models.h
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".

When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.

Applied to files:

  • tools/mtmd/clip.cpp
🧬 Code graph analysis (2)
convert_hf_to_gguf.py (2)
gguf-py/gguf/gguf_writer.py (2)
  • add_clip_projector_type (1086-1087)
  • add_vision_attention_layernorm_eps (1107-1108)
gguf-py/gguf/constants.py (2)
  • VisionProjectorType (3525-3548)
  • GGMLQuantizationType (3400-3432)
tools/mtmd/models/mobilenetv5.cpp (1)
ggml/src/ggml.c (15)
  • ggml_permute (3702-3754)
  • ggml_cont (3463-3467)
  • ggml_rms_norm (3068-3073)
  • ggml_mul (2172-2177)
  • ggml_pad_ext (4985-5018)
  • ggml_conv_2d_direct (4704-4738)
  • ggml_gelu (2679-2683)
  • ggml_add (1971-1976)
  • ggml_mul_mat (3176-3191)
  • ggml_scale (3292-3297)
  • ggml_soft_max (3968-3972)
  • ggml_upscale (4930-4937)
  • ggml_concat (2519-2546)
  • ggml_pool_2d (4854-4880)
  • ggml_build_forward_expand (6795-6797)
🪛 Ruff (0.14.10)
convert_hf_to_gguf.py

6051-6074: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


6127-6127: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: windows-2022-cmake-cuda (12.4)
  • GitHub Check: windows-msys2 (CLANG64, clang-x86_64, Release)
  • GitHub Check: ubuntu-24-cmake-vulkan-deb
  • GitHub Check: server-windows
🔇 Additional comments (7)
tools/mtmd/clip.cpp (3)

783-795: GEMMA3N graph builder wiring looks correct.
Selecting clip_graph_mobilenetv5 for PROJECTOR_TYPE_GEMMA3N is the right integration point.


3631-3644: GEMMA3N “no extra inputs” path: please confirm the MobileNetV5 graph doesn’t require positions/masks.
You added PROJECTOR_TYPE_GEMMA3N to the do-nothing branch (Line 3631). If MobileNetV5/MSFA needs any runtime inputs beyond inp_raw, they’ll be missing here.


3759-3761: GEMMA3N mmproj embedding sizing: LGTM.
Returning mm_input_proj_w->ne[0] is consistent with Gemma3 and keeps caller logic uniform.

convert_hf_to_gguf.py (2)

530-535: Good defensive handling for empty tensor_map (prevents max() crash).
This keeps tensor logging robust for block_count=0 models like MobileNetV5.


6146-6148: The [1, C, 1, 1] reshape is necessary and correct for GGML broadcasting.

The tensors are reshaped from shape [C] to [1, C, 1, 1] to properly broadcast with the convolution output shape [C, H, W, N] in GGML operations (lines 256 and 138/240 in mobilenetv5.cpp). The reshape matches C++ expectations—no issues.

tools/mtmd/models/mobilenetv5.cpp (2)

298-392: MSFA implementation looks solid.

The Multi-Scale Fusion Adapter logic correctly:

  • Guards against empty intermediate features (line 299)
  • Resizes features to match target resolution
  • Warns about non-integer scaling (lines 325-327)
  • Conditionally applies all optional transformations (expand, project, norms)

401-407: The permutation sequence at lines 403-404 is intentionally designed to transform spatial dimensions before flattening to tokens. The code includes an explicit comment explaining that it reshapes from PyTorch's (Batch, Seq, Hidden) convention to GGML's (Hidden, Seq, Batch) format, and the final shape [C, W*H, B] aligns with this mapping.

However, the codebase does not include the PyTorch model implementation or explicit validation that confirms the width-major token ordering (from the [C, W, H, B] → [C, W*H, B] transformation) matches Gemma3N's expected token traversal. The conversion script handles tensor weight mapping but does not validate forward-pass token ordering. To fully verify this matches the upstream PyTorch model, you would need to compare model outputs between this GGML implementation and the original PyTorch implementation.

Comment on lines +6051 to +6090
block_tensor_mapping = {
"model.vision_tower.timm_model.blocks.{bid}.{sid}.conv_exp.weight": "v.blk.{bid}.{sid}.conv_exp.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.bn1.weight": "v.blk.{bid}.{sid}.bn1.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.conv_pwl.weight": "v.blk.{bid}.{sid}.conv_pwl.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.bn2.weight": "v.blk.{bid}.{sid}.bn2.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_start.conv.weight": "v.blk.{bid}.{sid}.dw_start.conv.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_start.bn.weight": "v.blk.{bid}.{sid}.dw_start.bn.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_mid.conv.weight": "v.blk.{bid}.{sid}.dw_mid.conv.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_mid.bn.weight": "v.blk.{bid}.{sid}.dw_mid.bn.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.pw_exp.conv.weight": "v.blk.{bid}.{sid}.pw_exp.conv.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.pw_exp.bn.weight": "v.blk.{bid}.{sid}.pw_exp.bn.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.pw_proj.conv.weight": "v.blk.{bid}.{sid}.pw_proj.conv.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.pw_proj.bn.weight": "v.blk.{bid}.{sid}.pw_proj.bn.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.layer_scale.gamma": "v.blk.{bid}.{sid}.layer_scale.gamma",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.query.proj.weight": "v.blk.{bid}.{sid}.attn.query.proj.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.key.proj.weight": "v.blk.{bid}.{sid}.attn.key.proj.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.value.proj.weight": "v.blk.{bid}.{sid}.attn.value.proj.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.output.proj.weight": "v.blk.{bid}.{sid}.attn.output.proj.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.key.down_conv.weight": "v.blk.{bid}.{sid}.attn.key.down_conv.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.key.norm.weight": "v.blk.{bid}.{sid}.attn.key.norm.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.value.down_conv.weight": "v.blk.{bid}.{sid}.attn.value.down_conv.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.value.norm.weight": "v.blk.{bid}.{sid}.attn.value.norm.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.norm.weight": "v.blk.{bid}.{sid}.norm.weight",
}

def find_hparam(self, keys: Iterable[str], optional: bool = False) -> Any:
# force n_layers to 0 in __init__()
# we have to do this because self.hparams_vision is not yet accessible for modification inside __init__()
if "n_layers" in list(keys):
return 0
return super().find_hparam(keys, optional)

def __init__(self, *args, **kwargs):
# Parent init will call find_hparam which now returns 0 for empty keys
super().__init__(*args, **kwargs)
assert self.hparams_vision is not None
self.hparams_vision["n_layers"] = 0
self.hparams_vision["intermediate_size"] = self.hparams_vision.get("hidden_size", 2048) * 4
self.hparams_vision["num_attention_heads"] = self.hparams_vision.get("num_attention_heads", 8)


⚠️ Potential issue | 🟠 Major

Avoid brittle hardcoded vision defaults; derive from vision_config (or fail loudly).
intermediate_size = hidden_size * 4 and num_attention_heads = 8 as unconditional fallbacks can silently produce mismatched GGUF metadata if Gemma3n variants change. Consider setdefault() with strict validation (e.g., require hidden_size present) and/or log when falling back. Please relay upstream.

🧰 Tools
🪛 Ruff (0.14.10)

6051-6074: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

🤖 Prompt for AI Agents
In @convert_hf_to_gguf.py around lines 6051 - 6090, The current __init__ and
find_hparam logic force hparams_vision["n_layers"]=0 and unconditionally set
hparams_vision["intermediate_size"]=hidden_size*4 and
hparams_vision["num_attention_heads"]=8, which is brittle; change this to derive
values from a provided vision_config (or require vision_config keys) by: in
find_hparam/__init__ validate presence of required keys in self.hparams_vision
or a passed vision_config, use dict.setdefault for intermediate_size and
num_attention_heads only if the corresponding hidden_size/num_attention_heads
exist, and otherwise raise a clear error or log a fatal message so missing
vision metadata fails loudly; update references to find_hparam, __init__,
hparams_vision, intermediate_size, and num_attention_heads accordingly.

Comment on lines +6098 to +6102
# Image sequence length (256 tokens = 16x16 for Gemma3n)
image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
image_size = self.hparams_vision["image_size"]
self.hparams_vision["patch_size"] = image_size // image_seq_length


⚠️ Potential issue | 🔴 Critical

Fix patch_size computation (currently semantically incorrect / fragile).
patch_size = image_size // image_seq_length treats a token count as a linear dimension. For a 16×16 grid (image_seq_length=256), derive patch_size via sqrt(image_seq_length) (patches per side), and validate squareness/divisibility. Please relay this upstream.

Proposed fix
         # Image sequence length (256 tokens = 16x16 for Gemma3n)
         image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
         image_size = self.hparams_vision["image_size"]
-        self.hparams_vision["patch_size"] = image_size // image_seq_length
+        n_per_side = int(math.isqrt(image_seq_length))
+        if n_per_side * n_per_side != image_seq_length:
+            raise ValueError(f"image_seq_length must be a perfect square, got {image_seq_length}")
+        if image_size % n_per_side != 0:
+            raise ValueError(f"image_size ({image_size}) must be divisible by sqrt(image_seq_length) ({n_per_side})")
+        self.hparams_vision["patch_size"] = image_size // n_per_side

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In @convert_hf_to_gguf.py around lines 6098 - 6102, The computation of
patch_size is incorrect: instead of dividing image_size by image_seq_length,
compute patches_per_side = int(math.sqrt(image_seq_length)), validate that
patches_per_side**2 == image_seq_length and image_size % patches_per_side == 0,
then set self.hparams_vision["patch_size"] = image_size // patches_per_side; if
validations fail, raise a clear error (or log and exit) mentioning
image_seq_length and image_size so callers can fix the config (touch variables:
image_seq_length from self.preprocessor_config, image_size and patch_size in
self.hparams_vision).

Comment on lines +6229 to +6250
# Pad token embeddings for vision/audio special tokens (262144-262399)
if "embed_tokens.weight" in name or "embed_tokens_per_layer" in name:
# Move to CPU to avoid meta device issues during padding
data_torch = data_torch.to(device="cpu")

vocab_size = self.hparams.get("vocab_size", 262400)
current_size = data_torch.shape[0] # First dimension is vocab_size

if current_size < vocab_size:
# Pad with zeros for vision/audio tokens (they get embeddings from vision tower)
padding_size = vocab_size - current_size
tensor_type = "per-layer embeddings" if "per_layer" in name else "token embeddings"
logger.info(f"Padding {tensor_type} shape {list(data_torch.shape)} from {current_size} to {vocab_size} (adding {padding_size} vision/audio token slots)")

# Create padding with zeros (vision tokens won't use these embeddings)
padding = torch.zeros((padding_size, data_torch.shape[1]), dtype=data_torch.dtype, device=data_torch.device)
data_torch = torch.cat([data_torch, padding], dim=0)

# Continue with normal processing
name = name.replace("language_model.", "")
return [(self.map_tensor_name(name), data_torch)]


⚠️ Potential issue | 🔴 Critical


Fix dimension mismatch: embed_tokens_per_layer has shape [embedding_dim, n_vocab], not [vocab, dim].

The padding logic assumes shape[0] is vocab_size and pads with (padding_size, shape[1]), but per-layer embeddings have reversed dimensions compared to regular token embeddings. According to the model definition (src/llama-model.cpp:4166), tok_embd_per_layer is shaped as {n_embd_altup * n_layer, n_vocab}, meaning embedding dimension comes first. This causes the padding to be applied to the wrong axis, corrupting the tensor. Add a separate code path for "per_layer" in name to handle the transposed case, or validate tensor rank/shape with an assertion before accessing shape[1].

🤖 Prompt for AI Agents
In @convert_hf_to_gguf.py around lines 6229 - 6250, The padding code treats both
token embeddings and per-layer embeddings the same, but embed_tokens_per_layer
tensors have shape [embedding_dim, n_vocab], so padding must be applied on axis
1 for per-layer tensors instead of axis 0; update the block that checks
"embed_tokens.weight" or "embed_tokens_per_layer" to branch when "per_layer" in
name: for regular token embeddings keep current_size = data_torch.shape[0] and
pad with zeros of shape (padding_size, data_torch.shape[1]) concatenated dim=0;
for per-layer embeddings set current_size = data_torch.shape[1], compute
padding_size = vocab_size - current_size, create padding zeros of shape
(data_torch.shape[0], padding_size) and concatenate dim=1; adjust the logger
message accordingly and keep moving data_torch to CPU before padding and
returning (self.map_tensor_name(name), data_torch).

Comment on lines +1563 to +1655
case PROJECTOR_TYPE_GEMMA3N:
    {
        model.mobilenet_stem_conv_w = get_tensor(TN_MNV5_STEM_CONV, false);
        model.mobilenet_stem_conv_b = get_tensor(TN_MNV5_STEM_BIAS, false);
        model.mobilenet_stem_norm_w = get_tensor(TN_MNV5_STEM_BN, false);

        model.msfa_ffn_expand_w   = get_tensor(TN_MNV5_MSFA_FFN_EXP_W, false);
        model.msfa_ffn_expand_bn  = get_tensor(TN_MNV5_MSFA_FFN_EXP_BN, false); // Consume BN if present but likely folded
        model.msfa_ffn_project_w  = get_tensor(TN_MNV5_MSFA_FFN_PROJ_W, false);
        model.msfa_ffn_project_bn = get_tensor(TN_MNV5_MSFA_FFN_PROJ_BN, false);

        model.msfa_concat_norm_w = get_tensor(TN_MNV5_MSFA_NORM, false);

        // Dynamically load blocks stage by stage
        for (int stage = 0; stage < 4; ++stage) {
            int blocks_found_in_stage = 0;

            for (int blk_idx = 0; ; ++blk_idx) {
                bool found_block = false;
                mobilenetv5_block block;

                // 1. Check for Edge Residual (S0)
                block.s0_conv_exp_w = get_tensor(string_format(TN_MNV5_BLK_S0_EXP_W, stage, blk_idx), false);
                if (block.s0_conv_exp_w) {
                    found_block = true;
                    block.s0_bn1_w      = get_tensor(string_format(TN_MNV5_BLK_S0_BN1_W, stage, blk_idx), false);
                    block.s0_conv_pwl_w = get_tensor(string_format(TN_MNV5_BLK_S0_PWL_W, stage, blk_idx), false);
                    block.s0_bn2_w      = get_tensor(string_format(TN_MNV5_BLK_S0_BN2_W, stage, blk_idx), false);
                }
                // 2. Check for UIR (Universal Inverted Residual)
                else {
                    // Check for dw_start OR pw_exp (some UIR blocks skip dw_start)
                    block.dw_start_w = get_tensor(string_format(TN_MNV5_BLK_DW_START_W, stage, blk_idx), false);
                    block.pw_exp_w   = get_tensor(string_format(TN_MNV5_BLK_PW_EXP_W, stage, blk_idx), false);

                    if (block.dw_start_w || block.pw_exp_w) {
                        found_block = true;
                        if (block.dw_start_w) {
                            block.dw_start_bn_w = get_tensor(string_format(TN_MNV5_BLK_DW_START_BN, stage, blk_idx), false);
                        }
                        if (block.pw_exp_w) {
                            block.pw_exp_bn_w = get_tensor(string_format(TN_MNV5_BLK_PW_EXP_BN, stage, blk_idx), false);
                        }
                        block.dw_mid_w = get_tensor(string_format(TN_MNV5_BLK_DW_MID_W, stage, blk_idx), false);
                        if (block.dw_mid_w) {
                            block.dw_mid_bn_w = get_tensor(string_format(TN_MNV5_BLK_DW_MID_BN, stage, blk_idx), false);
                        }
                        block.pw_proj_w = get_tensor(string_format(TN_MNV5_BLK_PW_PROJ_W, stage, blk_idx), false);
                        if (block.pw_proj_w) {
                            block.pw_proj_bn_w = get_tensor(string_format(TN_MNV5_BLK_PW_PROJ_BN, stage, blk_idx), false);
                        }
                        block.layer_scale_w = get_tensor(string_format(TN_MNV5_BLK_LAYER_SCALE, stage, blk_idx), false);
                    }
                }

                // 3. Check for Attention (MQA)
                // Even if UIR/Edge check failed, this might be a pure attention block
                ggml_tensor * attn_q_check = get_tensor(string_format(TN_MNV5_ATTN_Q_W, stage, blk_idx), false);
                if (attn_q_check) {
                    found_block = true;
                    block.attn_q_w      = attn_q_check;
                    block.attn_k_w      = get_tensor(string_format(TN_MNV5_ATTN_K_W, stage, blk_idx), false);
                    block.attn_v_w      = get_tensor(string_format(TN_MNV5_ATTN_V_W, stage, blk_idx), false);
                    block.attn_o_w      = get_tensor(string_format(TN_MNV5_ATTN_O_W, stage, blk_idx), false);
                    block.attn_k_dw_w   = get_tensor(string_format(TN_MNV5_ATTN_K_DW, stage, blk_idx), false);
                    block.attn_k_norm_w = get_tensor(string_format(TN_MNV5_ATTN_K_NORM, stage, blk_idx), false);
                    block.attn_v_dw_w   = get_tensor(string_format(TN_MNV5_ATTN_V_DW, stage, blk_idx), false);
                    block.attn_v_norm_w = get_tensor(string_format(TN_MNV5_ATTN_V_NORM, stage, blk_idx), false);
                    block.attn_norm_w   = get_tensor(string_format(TN_MNV5_ATTN_NORM, stage, blk_idx), false);
                    // Note: Attention blocks also have layer_scale, load it if not already loaded by UIR check
                    if (!block.layer_scale_w) {
                        block.layer_scale_w = get_tensor(string_format(TN_MNV5_BLK_LAYER_SCALE, stage, blk_idx), false);
                    }
                }

                if (found_block) {
                    model.mobilenet_blocks.push_back(block);
                    blocks_found_in_stage++;
                } else {
                    // End of blocks for this stage
                    break;
                }
            }

            // Track where this stage ends in the flat vector
            if (blocks_found_in_stage > 0) {
                model.mobilenet_stage_ends.push_back(model.mobilenet_blocks.size() - 1);
                LOG_INF("%s: Stage %d ended at global block index %zu\n", __func__, stage, model.mobilenet_blocks.size() - 1);
            }
        }
        model.mm_input_proj_w    = get_tensor(TN_MM_INP_PROJ);
        model.mm_soft_emb_norm_w = get_tensor(TN_MM_SOFT_EMB_N);
    } break;
⚠️ Potential issue | 🟠 Major


GEMMA3N tensor loading: make required tensors explicitly required, load missing fields, unify stage indexing type, and bound block iteration.

  • model.mobilenet_stem_conv_w (line 1565) is marked optional but used unconditionally at line 254 in mobilenetv5.cpp without null check, risking a null dereference.
  • Declared fields msfa_concat_conv_w and mm_post_proj_norm_w are not loaded in this section; only msfa_concat_norm_w is loaded (line 1574). If these are part of the model, they should be populated.
  • mobilenet_stage_ends is declared as std::vector<int> (header) but receives size_t values at line 1649 — an implicit narrowing conversion that silently truncates on 64-bit systems if the block count ever exceeds INT_MAX.
  • The block iteration loop (line 1580) has no upper bound; it relies only on tensor lookup failure to terminate, which could loop pathologically on malformed GGUF files.
Suggested fixes
 case PROJECTOR_TYPE_GEMMA3N:
     {
-        model.mobilenet_stem_conv_w = get_tensor(TN_MNV5_STEM_CONV, false);
+        model.mobilenet_stem_conv_w = get_tensor(TN_MNV5_STEM_CONV, true);
         model.mobilenet_stem_conv_b = get_tensor(TN_MNV5_STEM_BIAS, false);
         model.mobilenet_stem_norm_w = get_tensor(TN_MNV5_STEM_BN, false);

         model.msfa_ffn_expand_w  = get_tensor(TN_MNV5_MSFA_FFN_EXP_W, false);
         model.msfa_ffn_expand_bn = get_tensor(TN_MNV5_MSFA_FFN_EXP_BN, false);
         model.msfa_ffn_project_w = get_tensor(TN_MNV5_MSFA_FFN_PROJ_W, false);
         model.msfa_ffn_project_bn = get_tensor(TN_MNV5_MSFA_FFN_PROJ_BN, false);

+        model.msfa_concat_conv_w = get_tensor(TN_MNV5_MSFA_CONCAT_CONV_W, false);
         model.msfa_concat_norm_w = get_tensor(TN_MNV5_MSFA_NORM, false);
+        model.mm_post_proj_norm_w = get_tensor(TN_MM_POST_PROJ_NORM, false);

         // Dynamically load blocks stage by stage
         for (int stage = 0; stage < 4; ++stage) {
             int blocks_found_in_stage = 0;

-            for (int blk_idx = 0; ; ++blk_idx) {
+            for (int blk_idx = 0; blk_idx < 256; ++blk_idx) {
                 bool found_block = false;
                 mobilenetv5_block block;

Comment on lines +23 to +53
ggml_tensor * clip_graph_mobilenetv5::pad_same_2d(ggml_tensor * inp, int kernel_h, int kernel_w, int stride_h, int stride_w, int dilation_h, int dilation_w) {
    const int64_t ih = inp->ne[1]; // height
    const int64_t iw = inp->ne[0]; // width

    // Calculate output size (ceil division)
    const int64_t oh = (ih + stride_h - 1) / stride_h;
    const int64_t ow = (iw + stride_w - 1) / stride_w;

    // Calculate padding needed
    const int64_t pad_h = std::max((int64_t)0, (oh - 1) * stride_h + (kernel_h - 1) * dilation_h + 1 - ih);
    const int64_t pad_w = std::max((int64_t)0, (ow - 1) * stride_w + (kernel_w - 1) * dilation_w + 1 - iw);

    // Split padding asymmetrically
    const int pad_h_top    = pad_h / 2;
    const int pad_h_bottom = pad_h - pad_h_top;
    const int pad_w_left   = pad_w / 2;
    const int pad_w_right  = pad_w - pad_w_left;

    // Apply padding if needed
    // ggml_pad_ext: (ctx, tensor, lp0, rp0, lp1, rp1, lp2, rp2, lp3, rp3)
    // For [W, H, C, B]: p0=width, p1=height, p2=channels, p3=batch
    if (pad_h > 0 || pad_w > 0) {
        inp = ggml_pad_ext(ctx0, inp,
            pad_w_left, pad_w_right,  // width padding (dim 0)
            pad_h_top,  pad_h_bottom, // height padding (dim 1)
            0, 0,                     // no channel padding (dim 2)
            0, 0);                    // no batch padding (dim 3)
    }

    return inp;
}
⚠️ Potential issue | 🔴 Critical

Add null check and validate stride parameters.

Two issues:

  1. inp is dereferenced at lines 24-25 without a null check
  2. Division by stride_h and stride_w at lines 28-29 could cause division by zero if these parameters are 0
🛡️ Suggested defensive checks
 ggml_tensor* clip_graph_mobilenetv5::pad_same_2d(ggml_tensor* inp, int kernel_h, int kernel_w, int stride_h, int stride_w, int dilation_h, int dilation_w) {
+    if (!inp || stride_h <= 0 || stride_w <= 0) {
+        return nullptr;
+    }
+
     const int64_t ih = inp->ne[1];  // height
     const int64_t iw = inp->ne[0];  // width
🤖 Prompt for AI Agents
In @tools/mtmd/models/mobilenetv5.cpp around lines 23 - 53, In pad_same_2d, add
a null check for the input pointer inp at the start of the function and return
or handle the error if inp is null; also validate stride_h and stride_w are > 0
before using them (e.g., return early or assert/log error) to avoid division by
zero when computing oh and ow; update references to inp, stride_h, and stride_w
in pad_same_2d accordingly so the function fails fast on invalid inputs instead
of dereferencing a null pointer or performing division by zero.
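The "SAME" padding arithmetic itself is easy to verify in isolation. A plain-Python sketch of the per-dimension computation, including the stride guard the review requests (`same_pad_1d` is a hypothetical helper, not part of mobilenetv5.cpp):

```python
def same_pad_1d(size, kernel, stride, dilation=1):
    """Return (pad_before, pad_after) for TF-style SAME padding along one dim."""
    if stride <= 0:
        raise ValueError("stride must be positive")  # guard against division by zero
    out = (size + stride - 1) // stride              # ceil(size / stride)
    pad = max(0, (out - 1) * stride + (kernel - 1) * dilation + 1 - size)
    before = pad // 2                                # smaller half goes first,
    return before, pad - before                      # extra pixel goes after

# 3x3 stride-2 stem on a 224-wide input: the single extra pixel lands on the right
print(same_pad_1d(224, 3, 2))  # (0, 1)
```

This mirrors the asymmetric split in pad_same_2d: when the total padding is odd, the trailing side receives the extra element, matching TF/timm `Conv2dSame` behavior.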

Comment on lines +57 to +88
ggml_tensor * clip_graph_mobilenetv5::build_edge_residual(ggml_tensor * inp, const mobilenetv5_block & block, int stride) {
    ggml_tensor * cur = inp;

    // 1. Expansion Conv (3x3)
    if (stride == 2) {
        // Case: Downsampling (Block 0)
        // Replicates Conv2dSame(kernel=3, stride=2)
        cur = pad_same_2d(cur, 3, 3, stride, stride);
        cur = ggml_conv_2d_direct(ctx0, block.s0_conv_exp_w, cur, stride, stride, 0, 0, 1, 1);
    } else {
        // Case: Normal 3x3 Block (Block 1, 2)
        // Replicates Conv2d(kernel=3, stride=1, padding=1)
        cur = ggml_conv_2d_direct(ctx0, block.s0_conv_exp_w, cur, stride, stride, 1, 1, 1, 1);
    }

    // BN + Activation
    if (block.s0_bn1_w) cur = rms_norm_2d(cur, block.s0_bn1_w);
    cur = ggml_gelu(ctx0, cur);

    // 2. Pointwise Linear Conv (1x1)
    // 1x1 Convs usually have padding=0 and stride=1
    cur = ggml_conv_2d_direct(ctx0, block.s0_conv_pwl_w, cur, 1, 1, 0, 0, 1, 1);
    if (block.s0_bn2_w) cur = rms_norm_2d(cur, block.s0_bn2_w);

    // 3. Residual Connection
    // Only apply residual if spatial dimensions and channels match (stride 1)
    if (stride == 1 && inp->ne[2] == cur->ne[2] && inp->ne[0] == cur->ne[0]) {
        cur = ggml_add(ctx0, cur, inp);
    }

    return cur;
}
⚠️ Potential issue | 🔴 Critical

Add null checks for input tensor and required block tensors.

The function assumes inp and required block tensors (s0_conv_exp_w, s0_conv_pwl_w) are non-null. Accessing these without validation could cause null-pointer dereferences:

  • inp used at line 58
  • block.s0_conv_exp_w at lines 65, 69
  • block.s0_conv_pwl_w at line 78
🛡️ Suggested validation
 ggml_tensor * clip_graph_mobilenetv5::build_edge_residual(ggml_tensor * inp, const mobilenetv5_block & block, int stride) {
+    if (!inp || !block.s0_conv_exp_w || !block.s0_conv_pwl_w) {
+        return nullptr;
+    }
+
     ggml_tensor * cur = inp;
🤖 Prompt for AI Agents
In @tools/mtmd/models/mobilenetv5.cpp around lines 57 - 88, The function
build_edge_residual assumes inp and block weight tensors exist; add explicit
null checks at the top of build_edge_residual: if inp is null return nullptr (or
inp as appropriate) to avoid dereferencing, and verify block.s0_conv_exp_w and
block.s0_conv_pwl_w before calling ggml_conv_2d_direct (and before passing them
to rms_norm_2d); if either weight is null, skip the corresponding conv/pwl steps
or return nullptr consistently so callers can handle the error. Ensure all early
exits use the same convention as the surrounding codebase (nullptr or original
inp) and reference the symbols build_edge_residual, block.s0_conv_exp_w,
block.s0_conv_pwl_w, ggml_conv_2d_direct, and rms_norm_2d when making the
checks.

Comment on lines +91 to +149
ggml_tensor * clip_graph_mobilenetv5::build_inverted_residual(ggml_tensor * inp, const mobilenetv5_block & block, int stride) {
ggml_tensor * cur = inp;

// 1. Depthwise Start (Optional)
// NOTE: dw_start always has stride=1 (no downsampling here)
if (block.dw_start_w) {
int k = block.dw_start_w->ne[0]; // 3 or 5
int p = k / 2;
cur = ggml_conv_2d_dw(ctx0, block.dw_start_w, cur, 1, 1, p, p, 1, 1);
if (block.dw_start_bn_w) cur = rms_norm_2d(cur, block.dw_start_bn_w);
}

// 2. Pointwise Expansion (1x1)
if (block.pw_exp_w) {
// Standard 1x1 conv, pad=0, stride=1
cur = ggml_conv_2d_direct(ctx0, block.pw_exp_w, cur, 1, 1, 0, 0, 1, 1);
if (block.pw_exp_bn_w) cur = rms_norm_2d(cur, block.pw_exp_bn_w);
cur = ggml_gelu(ctx0, cur);
}

// 3. Depthwise Mid (Optional)
// NOTE: dw_mid is where downsampling happens (stride=2 for first block of stage)
if (block.dw_mid_w) {
int k = block.dw_mid_w->ne[0]; // 3 or 5

if (stride > 1) {
// Case: Stride 2 (Downsample) -> Use Asymmetric "Same" Padding
cur = pad_same_2d(cur, k, k, stride, stride);
cur = ggml_conv_2d_dw(ctx0, block.dw_mid_w, cur, stride, stride, 0, 0, 1, 1); // pad=0
} else {
// Case: Stride 1 -> Use Standard Symmetric Padding
int p = k / 2;
cur = ggml_conv_2d_dw(ctx0, block.dw_mid_w, cur, stride, stride, p, p, 1, 1);
}

if (block.dw_mid_bn_w) cur = rms_norm_2d(cur, block.dw_mid_bn_w);
cur = ggml_gelu(ctx0, cur);
}

// 4. Pointwise Projection (1x1)
if (block.pw_proj_w) {
cur = ggml_conv_2d_direct(ctx0, block.pw_proj_w, cur, 1, 1, 0, 0, 1, 1);
if (block.pw_proj_bn_w) cur = rms_norm_2d(cur, block.pw_proj_bn_w);
}

// Apply Layer Scaling if present
if (block.layer_scale_w) {
cur = ggml_mul(ctx0, cur, block.layer_scale_w);
}

// 5. Residual Connection
bool same_spatial = (inp->ne[0] == cur->ne[0]) && (inp->ne[1] == cur->ne[1]);
bool same_channel = (inp->ne[2] == cur->ne[2]);
if (same_spatial && same_channel) {
cur = ggml_add(ctx0, cur, inp);
}

return cur;
}
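For reference, pad_same_2d presumably implements TF-style "SAME" padding, which becomes asymmetric for stride 2. A minimal Python sketch of the padding arithmetic (the function name and exact semantics are assumptions, not the actual mtmd helper):

```python
import math

def same_pad_2d(size: int, k: int, s: int) -> tuple[int, int]:
    """TF-style 'SAME' padding amounts (begin, end) for one spatial dimension."""
    out = math.ceil(size / s)
    total = max((out - 1) * s + k - size, 0)
    return total // 2, total - total // 2

# A stride-2 3x3 conv on a 224px input needs asymmetric padding: (0, 1).
# At stride 1 the result degenerates to the symmetric p = k // 2 case used above.
print(same_pad_2d(224, 3, 2))  # -> (0, 1)
print(same_pad_2d(224, 3, 1))  # -> (1, 1)
```

This is why the stride-2 path above pads explicitly and then calls the conv with pad=0, while the stride-1 path can use plain symmetric padding.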

⚠️ Potential issue | 🔴 Critical

Add null check for inp parameter.

The inp tensor is used at line 92 without null validation, which could lead to a null-pointer dereference.

🛡️ Suggested check
 ggml_tensor * clip_graph_mobilenetv5::build_inverted_residual(ggml_tensor * inp, const mobilenetv5_block & block, int stride) {
+    if (!inp) {
+        return nullptr;
+    }
+
     ggml_tensor * cur = inp;
🤖 Prompt for AI Agents
In @tools/mtmd/models/mobilenetv5.cpp around lines 91 - 149, The function
build_inverted_residual uses the inp pointer without validation; add an
immediate null check at the top of build_inverted_residual for the inp parameter
and handle it safely (e.g., return nullptr or propagate an error) instead of
dereferencing a null pointer so the rest of the function (uses of inp->ne[...]
and residual addition) are not executed when inp is null.

Comment on lines +248 to +260
ggml_cgraph * clip_graph_mobilenetv5::build() {
ggml_tensor * inp = build_inp_raw();

// 1. Stem - Conv2dSame(3, 64, kernel_size=(3, 3), stride=(2, 2))
ggml_tensor * cur = pad_same_2d(inp, 3, 3, 2, 2); // Apply SAME padding

cur = ggml_conv_2d_direct(ctx0, model.mobilenet_stem_conv_w, cur, 2, 2, 0, 0, 1, 1); // padding=0
if (model.mobilenet_stem_conv_b) {
cur = ggml_add(ctx0, cur, model.mobilenet_stem_conv_b);
}
if (model.mobilenet_stem_norm_w) cur = rms_norm_2d(cur, model.mobilenet_stem_norm_w);
cur = ggml_gelu(ctx0, cur);


⚠️ Potential issue | 🔴 Critical

Validate required stem convolution weight.

The model.mobilenet_stem_conv_w tensor is used at line 254 without a null check. If model loading is incomplete, this will cause a null-pointer dereference.

🛡️ Suggested validation
 ggml_cgraph * clip_graph_mobilenetv5::build() {
     ggml_tensor * inp = build_inp_raw();
 
     // 1. Stem - Conv2dSame(3, 64, kernel_size=(3, 3), stride=(2, 2))
     ggml_tensor * cur = pad_same_2d(inp, 3, 3, 2, 2);  // Apply SAME padding
 
+    if (!model.mobilenet_stem_conv_w) {
+        return nullptr;
+    }
+
     cur = ggml_conv_2d_direct(ctx0, model.mobilenet_stem_conv_w, cur, 2, 2, 0, 0, 1, 1);  // padding=0
🤖 Prompt for AI Agents
In @tools/mtmd/models/mobilenetv5.cpp around lines 248 - 260, The build()
function uses model.mobilenet_stem_conv_w without validation; add a null-check
at the start of the stem block (before calling ggml_conv_2d_direct) to detect
missing stem weights (model.mobilenet_stem_conv_w == nullptr) and handle it by
logging an error/throwing or returning nullptr from build() to avoid
dereferencing; ensure downstream code does not assume cur was created if the
check fails and keep existing handling for mobilenet_stem_conv_b and
mobilenet_stem_norm_w unchanged.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
gguf-py/gguf/constants.py (1)

881-889: Critical: Missing GEMMA3N entry in VISION_PROJECTOR_TYPE_NAMES.

Line 468 adds VISION_PROJECTOR_TYPE.GEMMA3N to the enum, but the VISION_PROJECTOR_TYPE_NAMES dict does not include a corresponding mapping. This will cause a KeyError when code attempts to look up the string name for VISION_PROJECTOR_TYPE.GEMMA3N.

🐛 Proposed fix
 VISION_PROJECTOR_TYPE_NAMES: dict[VISION_PROJECTOR_TYPE, str] = {
     VISION_PROJECTOR_TYPE.MLP:       "mlp",
     VISION_PROJECTOR_TYPE.LDP:       "ldp",
     VISION_PROJECTOR_TYPE.LDPV2:     "ldpv2",
     VISION_PROJECTOR_TYPE.RESAMPLER: "resampler",
     VISION_PROJECTOR_TYPE.GLM_EDGE:  "adapter",
     VISION_PROJECTOR_TYPE.MERGER:    "qwen2vl_merger",
+    VISION_PROJECTOR_TYPE.GEMMA3N:   "gemma3n",
     VISION_PROJECTOR_TYPE.GEMMA3:    "gemma3",
 }
🤖 Fix all issues with AI agents
In @convert_hf_to_gguf.py:
- Around line 6139-6165: The patch_size calculation in the __init__ method is
wrong: don't divide image_size by image_seq_length; instead compute n_per_side =
int(sqrt(image_seq_length)) (or math.isqrt(image_seq_length) for exact integer
math) and set self.hparams_vision["patch_size"] = image_size // n_per_side so
256 tokens -> n_per_side=16 -> patch_size=image_size//16; ensure math is
available/imported if you use math.isqrt/math.sqrt and handle non-perfect-square
image_seq_length by using integer floor.
- Around line 6045-6105: The condition in ConformerAudioModel.tensor_force_quant
incorrectly applies F32 to any name containing ".conv" because of operator
precedence; change the test so that the quantization is forced only when the
tensor is a conv weight — i.e., require that (".conv" in name or "_conv" in
name) AND ".weight" in name. Update the conditional in
ConformerAudioModel.tensor_force_quant accordingly (use parentheses or reorder
the logic) so only conv weight tensors return gguf.GGMLQuantizationType.F32;
leave ConformerAudioModel.is_audio_tensor and the fallback to
super().tensor_force_quant unchanged.
- Around line 6108-6137: Mark the mutable class attribute block_tensor_mapping
on Gemma3nVisionAudioModel as a ClassVar to avoid mutable-class-attr pitfalls:
import ClassVar and Dict from typing and change the declaration to something
like block_tensor_mapping: ClassVar[Dict[str, str]] = { ... } so static
analyzers and linters know it’s not an instance attribute.
- Around line 6199-6203: The current modify_tensors replacement can produce
double "layers" (e.g., "conformer.layers.layers..."); change the logic in
modify_tensors (and keep using ConformerAudioModel.is_audio_tensor) to detect
whether the incoming name contains "model.audio_tower.conformer.layers." and, if
so, replace that exact substring with "conformer.layers.", otherwise replace
"model.audio_tower.conformer." with "conformer.layers." so the result always
matches the expected "conformer.layers.{bid}..." keys used by batchnorm folding.

In @tools/mtmd/clip.cpp:
- Around line 3242-3247: The GEMMA3N branch incorrectly sets n_patches to
ctx->model.hparams.image_size / ctx->model.hparams.patch_size (patches per side)
instead of total tokens; change the calculation in the PROJECTOR_TYPE_GEMMA3N
case to compute total patches/tokens as (image_size / patch_size) squared (e.g.,
n_patches = pow(ctx->model.hparams.image_size / ctx->model.hparams.patch_size,
2) or multiply the quotient by itself) so the value matches the 16×16 = 256
claim and is robust to a corrected patch_size.
- Around line 1584-1655: The local mobilenetv5_block variable is
default-uninitialized causing UB when reading members like layer_scale_w before
assignment; fix by zero-initializing the struct instance at creation (e.g.,
value-initialize mobilenetv5_block so all pointers/flags are null/zero), or
explicitly initialize all members you later read (layer_scale_w and any
pointer/flag fields) before any get_tensor checks, so that pushing to
model.mobilenet_blocks uses a fully-initialized block.
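Two of the numeric fixes above (the isqrt-based patch_size in the converter and the squared n_patches in clip.cpp) amount to the following; the hparam values here are illustrative assumptions, not the model's actual config:

```python
import math

image_size, image_seq_length = 768, 256  # assumed example hparams

# converter side: 256 tokens -> 16 patches per side -> patch_size = 768 // 16
n_per_side = math.isqrt(image_seq_length)
patch_size = image_size // n_per_side

# clip.cpp side: total tokens must be (image_size / patch_size) squared,
# not the per-side count
n_patches = (image_size // patch_size) ** 2

print(n_per_side, patch_size, n_patches)  # -> 16 48 256
```

Note that n_patches round-trips back to image_seq_length, which is the invariant the clip.cpp fix restores.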
🧹 Nitpick comments (3)
gguf-py/gguf/tensor_mapping.py (1)

1609-1795: Non-{bid} keys in block_mappings_cfg are easy to accidentally add; consider keeping gemma3n non-block tensors in mappings_cfg only.
Not a blocker, but it avoids repeated per-layer inserts and makes it clearer which tensors are truly block-indexed.

convert_hf_to_gguf.py (2)

6247-6265: Gemma3NModel.set_vocab temporary override looks correct, but consider a non-mutating approach.

The delete/restore pattern works, but mutating self.hparams mid-conversion is fragile if anything throws in super().set_vocab(). A small refactor to use a shallow copy (or a try/finally) would make this safer.


6301-6322: Verify padding logic for embed_tokens_per_layer*: assumes vocab is axis 0.

This code pads data_torch.shape[0] up to vocab_size. That’s only correct if the tensor is [vocab, dim] (or a per-layer tensor instance still shaped [vocab, dim]). If the tensor is stacked (e.g. [n_layers, vocab, dim]), this will pad the wrong dimension silently.

Minimal defensive check idea
         if "embed_tokens.weight" in name or "embed_tokens_per_layer" in name:
             # Move to CPU to avoid meta device issues during padding
             data_torch = data_torch.to(device="cpu")
 
             vocab_size = self.hparams.get("vocab_size", 262400)
-            current_size = data_torch.shape[0]  # First dimension is vocab_size
+            if data_torch.ndim != 2:
+                raise ValueError(f"Unexpected embedding tensor rank for {name}: shape={tuple(data_torch.shape)}")
+            current_size = data_torch.shape[0]
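The padding step being guarded above can be sketched in pure Python (no torch) for a [vocab, dim] tensor; the shapes and values are made up for illustration:

```python
def pad_vocab(rows: list[list[float]], vocab_size: int) -> list[list[float]]:
    """Zero-pad axis 0 of a [current, dim] embedding up to vocab_size rows."""
    if not rows:
        return rows
    dim = len(rows[0])
    missing = vocab_size - len(rows)
    assert missing >= 0, "current size exceeds target vocab_size"
    return rows + [[0.0] * dim for _ in range(missing)]

emb = [[1.0, 2.0], [3.0, 4.0]]       # pretend [current=2, dim=2] embedding
padded = pad_vocab(emb, 4)
print(len(padded))                   # -> 4
print(padded[2])                     # -> [0.0, 0.0]
```

The rank check matters precisely because this only pads the right dimension when axis 0 really is the vocab axis; a stacked [n_layers, vocab, dim] tensor would be padded silently on the wrong axis.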
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bfbb315 and 6a68b35.

📒 Files selected for processing (4)
  • convert_hf_to_gguf.py
  • gguf-py/gguf/constants.py
  • gguf-py/gguf/tensor_mapping.py
  • tools/mtmd/clip.cpp
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".

When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.

Applied to files:

  • tools/mtmd/clip.cpp
  • gguf-py/gguf/constants.py
🧬 Code graph analysis (2)
gguf-py/gguf/tensor_mapping.py (1)
gguf-py/gguf/constants.py (1)
  • MODEL_TENSOR (474-762)
convert_hf_to_gguf.py (2)
gguf-py/gguf/constants.py (2)
  • GGMLQuantizationType (3433-3465)
  • VisionProjectorType (3558-3581)
gguf-py/gguf/gguf_writer.py (4)
  • add_clip_projector_type (1086-1087)
  • add_vision_attention_layernorm_eps (1107-1108)
  • add_audio_num_mel_bins (1189-1190)
  • add_audio_attention_layernorm_eps (1186-1187)
🪛 Ruff (0.14.10)
convert_hf_to_gguf.py

6054-6054: Parenthesize a and b expressions when chaining and and or together, to make the precedence clear

Parenthesize the and subexpression

(RUF021)


6114-6137: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


6197-6197: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
  • GitHub Check: ggml-ci-x64-cpu-high-perf
  • GitHub Check: ggml-ci-x64-cpu-low-perf
  • GitHub Check: ggml-ci-arm64-cpu-low-perf
  • GitHub Check: macOS-latest-cmake-visionos
  • GitHub Check: ubuntu-22-cmake-musa
  • GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: macOS-latest-cmake-arm64-webgpu
  • GitHub Check: macOS-latest-cmake-x64
  • GitHub Check: ubuntu-24-wasm-webgpu
  • GitHub Check: ubuntu-24-cmake-vulkan-deb
  • GitHub Check: check-requirements
  • GitHub Check: server (Release, LLAMA_ARG_BACKEND_SAMPLING=1)
  • GitHub Check: server (Release)
  • GitHub Check: server (ADDRESS, RelWithDebInfo)
  • GitHub Check: server (UNDEFINED, RelWithDebInfo)
  • GitHub Check: server-windows
🔇 Additional comments (8)
tools/mtmd/clip.cpp (3)

783-795: GEMMA3N graph builder wiring looks straightforward.
Routing PROJECTOR_TYPE_GEMMA3N to clip_graph_mobilenetv5 matches the intended architecture split.


2970-2978: Verify GEMMA3N preprocess should warp-to-square (no padding) vs pad-to-square (Gemma3 behavior).
Right now GEMMA3 uses padding by default, while GEMMA3N explicitly disables it (add_padding=false), which changes aspect-ratio handling and can impact accuracy.


1349-1351: hparams.n_layer = 0 + “skip audio” for GEMMA3N: OK, but please sanity-check downstream assumptions.
This is fine if all GEMMA3N code paths avoid model.layers[...] and audio init, but it’s worth validating that no shared helpers still assume n_layer > 0.

Also applies to: 2115-2139

gguf-py/gguf/tensor_mapping.py (1)

126-159: Paths verified against converter implementation—no issues found.

The GEMMA3N tensor mappings at lines 127–157 match the converter's expected HF module structure. The converter code (Gemma3nVisionAudioModel.modify_tensors) explicitly validates both model.embed_vision.* and model.vision_tower.* prefixes, confirming these paths exist in the loaded model. No duplicate keys or typos detected.

Consider adding periodic validation (e.g., during model conversion testing) to catch any upstream naming drift in future GEMMA3N model updates.

gguf-py/gguf/constants.py (1)

392-392: LGTM: Gemma3N constant additions follow existing patterns.

The additions for MODEL_ARCH.GEMMA3N, MODEL_TENSOR entries, tensor name mappings, and VisionProjectorType.GEMMA3N all follow the established patterns and conventions used by other model architectures in this file.

Also applies to: 468-468, 679-687, 715-746, 811-811, 1095-1104, 1133-1163, 1214-1222, 1250-1289, 2040-2074, 3560-3560

convert_hf_to_gguf.py (3)

530-536: Good guard for empty tensor_map.mapping, but consider avoiding a “magic” default name-length.

This is fine for preventing max() on empty mapping; the fallback length is only used for log alignment. If you want it future-proof, consider deriving max_name_len from actual new_name values as you iterate (one-pass) rather than hardcoding "vision_encoder.weight,".


10163-10178: Skip condition extension in LFM2Model.modify_tensors seems fine.

Including ConformerAudioModel.is_audio_tensor(name) in the skip path helps avoid accidentally pulling audio weights into the text model conversion.


10305-10323: LFM2AudioModel wiring looks consistent with ConformerAudioModel.

No specific concerns in this snippet beyond the shared ConformerAudioModel issues noted above (quantization predicate + batchnorm folding expectations).

Comment on lines +6045 to +6105
class ConformerAudioModel(MmprojModel):
_batch_norm_tensors: list[dict[str, Tensor]] | None = None

@staticmethod
def is_audio_tensor(name: str):
return any(p in name for p in ["audio", "codebook", "conformer", "depth_embedding", "depthformer", "depth_linear"])

def tensor_force_quant(self, name, new_name, bid, n_dims):
if ConformerAudioModel.is_audio_tensor(name):
if ".conv" in name or "_conv" in name and ".weight" in name:
return gguf.GGMLQuantizationType.F32
return super().tensor_force_quant(name, new_name, bid, n_dims)

def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
# skip language model tensors
if name.startswith("lfm."):
return []

# for training only
if any(p in name for p in ["audio_loss_weight"]):
return []

# for audio output
if any(p in name for p in ["codebook_offsets", "depth_embeddings", "depth_linear", "depthformer"]):
return []

# fold running_mean, running_var and eps into weight and bias for batch_norm
if "batch_norm" in name:
if self._batch_norm_tensors is None:
self._batch_norm_tensors = [{} for _ in range(self.block_count)]
assert bid is not None
self._batch_norm_tensors[bid][name] = data_torch

if len(self._batch_norm_tensors[bid]) < 5:
return []

weight = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.weight"]
bias = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.bias"]
running_mean = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.running_mean"]
running_var = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.running_var"]
eps = 1e-5 # default value

a = weight / torch.sqrt(running_var + eps)
b = bias - running_mean * a
return [
(self.map_tensor_name(f"conformer.layers.{bid}.conv.batch_norm.weight"), a),
(self.map_tensor_name(f"conformer.layers.{bid}.conv.batch_norm.bias"), b),
]

# reshape conv weights
if name.startswith("conformer.pre_encode.conv.") and name.endswith(".bias"):
data_torch = data_torch[:, None, None]
if "conv.depthwise_conv" in name and name.endswith(".weight"):
assert data_torch.shape[1] == 1
data_torch = data_torch.reshape(data_torch.shape[0], data_torch.shape[2])
if "conv.pointwise_conv" in name and name.endswith(".weight"):
assert data_torch.shape[2] == 1
data_torch = data_torch.reshape(data_torch.shape[0], data_torch.shape[1])

return [(self.map_tensor_name(name), data_torch)]


⚠️ Potential issue | 🟡 Minor

Fix boolean precedence in ConformerAudioModel.tensor_force_quant (currently quantizes too broadly).

".conv" in name or "_conv" in name and ".weight" in name is parsed as (".conv" in name) or ("_conv" in name and ".weight" in name). That likely forces F32 for any tensor containing ".conv" (including biases), which is not intended.
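Python's and binds tighter than or, so the original predicate admits any name containing ".conv" regardless of the ".weight" suffix. A quick standalone check with an illustrative tensor name:

```python
# illustrative bias tensor name; not necessarily a real checkpoint key
name = "conformer.layers.0.conv.pointwise_conv1.bias"

# original: parsed as (".conv" in name) or ("_conv" in name and ".weight" in name)
buggy = ".conv" in name or "_conv" in name and ".weight" in name
# intended: only conv *weights* are forced to F32
fixed = (".conv" in name or "_conv" in name) and name.endswith(".weight")

print(buggy)  # -> True (the bias still matches)
print(fixed)  # -> False
```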

Proposed fix
 class ConformerAudioModel(MmprojModel):
@@
     def tensor_force_quant(self, name, new_name, bid, n_dims):
         if ConformerAudioModel.is_audio_tensor(name):
-            if ".conv" in name or "_conv" in name and ".weight" in name:
+            if ((".conv" in name or "_conv" in name) and name.endswith(".weight")):
                 return gguf.GGMLQuantizationType.F32
         return super().tensor_force_quant(name, new_name, bid, n_dims)
🧰 Tools
🪛 Ruff (0.14.10)

6054-6054: Parenthesize a and b expressions when chaining and and or together, to make the precedence clear

Parenthesize the and subexpression

(RUF021)

🤖 Prompt for AI Agents
In @convert_hf_to_gguf.py around lines 6045 - 6105, The condition in
ConformerAudioModel.tensor_force_quant incorrectly applies F32 to any name
containing ".conv" because of operator precedence; change the test so that the
quantization is forced only when the tensor is a conv weight — i.e., require
that (".conv" in name or "_conv" in name) AND ".weight" in name. Update the
conditional in ConformerAudioModel.tensor_force_quant accordingly (use
parentheses or reorder the logic) so only conv weight tensors return
gguf.GGMLQuantizationType.F32; leave ConformerAudioModel.is_audio_tensor and the
fallback to super().tensor_force_quant unchanged.
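Aside from the predicate, the batch-norm folding itself (a = w / sqrt(var + eps), b = bias - mean * a) can be sanity-checked numerically: the folded affine must reproduce the batch-norm output exactly. A scalar check with made-up values:

```python
import math

# made-up batch-norm parameters for a single channel
w, bias, mean, var, eps = 2.0, 0.5, 1.0, 4.0, 1e-5

# folded affine, mirroring the converter's folding
a = w / math.sqrt(var + eps)
b = bias - mean * a

x = 3.0
bn_out = w * (x - mean) / math.sqrt(var + eps) + bias  # reference batch-norm
folded_out = a * x + b                                 # folded weight/bias

print(abs(bn_out - folded_out) < 1e-9)  # -> True
```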

Comment on lines +6108 to +6137
class Gemma3nVisionAudioModel(ConformerAudioModel):
has_audio_encoder = True
has_vision_encoder = True

# Double indexed mapping for MobileNetV5 blocks (not supported by tensor_mapping.py)
# This is the only known model having this, so we prefer implementing it outside of tensor_mapping.py
block_tensor_mapping = {
"model.vision_tower.timm_model.blocks.{bid}.{sid}.conv_exp.weight": "v.blk.{bid}.{sid}.conv_exp.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.bn1.weight": "v.blk.{bid}.{sid}.bn1.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.conv_pwl.weight": "v.blk.{bid}.{sid}.conv_pwl.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.bn2.weight": "v.blk.{bid}.{sid}.bn2.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_start.conv.weight": "v.blk.{bid}.{sid}.dw_start.conv.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_start.bn.weight": "v.blk.{bid}.{sid}.dw_start.bn.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_mid.conv.weight": "v.blk.{bid}.{sid}.dw_mid.conv.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_mid.bn.weight": "v.blk.{bid}.{sid}.dw_mid.bn.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.pw_exp.conv.weight": "v.blk.{bid}.{sid}.pw_exp.conv.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.pw_exp.bn.weight": "v.blk.{bid}.{sid}.pw_exp.bn.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.pw_proj.conv.weight": "v.blk.{bid}.{sid}.pw_proj.conv.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.pw_proj.bn.weight": "v.blk.{bid}.{sid}.pw_proj.bn.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.layer_scale.gamma": "v.blk.{bid}.{sid}.layer_scale.gamma",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.query.proj.weight": "v.blk.{bid}.{sid}.attn.query.proj.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.key.proj.weight": "v.blk.{bid}.{sid}.attn.key.proj.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.value.proj.weight": "v.blk.{bid}.{sid}.attn.value.proj.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.output.proj.weight": "v.blk.{bid}.{sid}.attn.output.proj.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.key.down_conv.weight": "v.blk.{bid}.{sid}.attn.key.down_conv.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.key.norm.weight": "v.blk.{bid}.{sid}.attn.key.norm.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.value.down_conv.weight": "v.blk.{bid}.{sid}.attn.value.down_conv.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.value.norm.weight": "v.blk.{bid}.{sid}.attn.value.norm.weight",
"model.vision_tower.timm_model.blocks.{bid}.{sid}.norm.weight": "v.blk.{bid}.{sid}.norm.weight",
}

⚠️ Potential issue | 🟡 Minor

Annotate block_tensor_mapping as ClassVar to avoid mutable-class-attr pitfalls.

Static analysis is right here: this is a mutable class attribute and should be explicitly marked as such.

Proposed fix
-from typing import TYPE_CHECKING, Any, Callable, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast
+from typing import TYPE_CHECKING, Any, Callable, ClassVar, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast
@@
 class Gemma3nVisionAudioModel(ConformerAudioModel):
@@
-    block_tensor_mapping = {
+    block_tensor_mapping: ClassVar[dict[str, str]] = {
         "model.vision_tower.timm_model.blocks.{bid}.{sid}.conv_exp.weight":             "v.blk.{bid}.{sid}.conv_exp.weight",
         ...
     }

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.14.10)

6114-6137: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

🤖 Prompt for AI Agents
In @convert_hf_to_gguf.py around lines 6108 - 6137, Mark the mutable class
attribute block_tensor_mapping on Gemma3nVisionAudioModel as a ClassVar to avoid
mutable-class-attr pitfalls: import ClassVar and Dict from typing and change the
declaration to something like block_tensor_mapping: ClassVar[Dict[str, str]] = {
... } so static analyzers and linters know it’s not an instance attribute.

Comment on lines +6199 to +6203
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
if (ConformerAudioModel.is_audio_tensor(name)):
name = name.replace("model.audio_tower.conformer.", "conformer.layers.")
return super().modify_tensors(data_torch, name, bid)


⚠️ Potential issue | 🟠 Major

Verify audio tensor prefix rewrite; it may produce conformer.layers.layers.<bid>....

If the incoming name is the common model.audio_tower.conformer.layers.<i>..., the current replacement of model.audio_tower.conformer. with conformer.layers. will yield conformer.layers.layers.<i>..., which won’t match your batchnorm folding keys (conformer.layers.{bid}.conv.batch_norm.*) and may break tensor mapping.
Suggested safer rewrite (adjust once you confirm actual HF tensor prefixes)
-        if (ConformerAudioModel.is_audio_tensor(name)):
-            name = name.replace("model.audio_tower.conformer.", "conformer.layers.")
+        if (ConformerAudioModel.is_audio_tensor(name)):
+            if name.startswith("model.audio_tower.conformer.layers."):
+                name = name.replace("model.audio_tower.conformer.layers.", "conformer.layers.", 1)
+            elif name.startswith("model.audio_tower.conformer."):
+                name = name.replace("model.audio_tower.conformer.", "conformer.", 1)
             return super().modify_tensors(data_torch, name, bid)
🤖 Prompt for AI Agents
In @convert_hf_to_gguf.py around lines 6199 - 6203: the current modify_tensors
replacement can produce a doubled segment (e.g., "conformer.layers.layers...");
change the logic in modify_tensors (still gated on
ConformerAudioModel.is_audio_tensor) to check whether the incoming name
starts with "model.audio_tower.conformer.layers." and, if so, replace that exact
substring with "conformer.layers.", otherwise replace
"model.audio_tower.conformer." with "conformer.", so the result always
matches the expected "conformer.layers.{bid}..." keys used by batchnorm folding.
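
To make the failure mode concrete, here is a small sketch of the guarded rewrite suggested above; rewrite_audio_name is an illustrative helper under the assumed HF prefixes, not the converter's actual API:

```python
def rewrite_audio_name(name: str) -> str:
    # Replace the full "...conformer.layers." prefix first so the rewritten
    # name never ends up with a doubled "layers." segment.
    if name.startswith("model.audio_tower.conformer.layers."):
        return name.replace("model.audio_tower.conformer.layers.", "conformer.layers.", 1)
    if name.startswith("model.audio_tower.conformer."):
        return name.replace("model.audio_tower.conformer.", "conformer.", 1)
    return name

# The naive single replace doubles the segment:
naive = "model.audio_tower.conformer.layers.3.conv.batch_norm.weight".replace(
    "model.audio_tower.conformer.", "conformer.layers.")
# naive == "conformer.layers.layers.3.conv.batch_norm.weight" (no key match)

fixed = rewrite_audio_name("model.audio_tower.conformer.layers.3.conv.batch_norm.weight")
# fixed == "conformer.layers.3.conv.batch_norm.weight"
```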

Comment on lines +1584 to +1655
for (int blk_idx = 0; ; ++blk_idx) {
bool found_block = false;
mobilenetv5_block block;

// 1. Check for Edge Residual (S0)
block.s0_conv_exp_w = get_tensor(string_format(TN_MNV5_BLK_S0_EXP_W, stage, blk_idx), false);
if (block.s0_conv_exp_w) {
found_block = true;
block.s0_bn1_w = get_tensor(string_format(TN_MNV5_BLK_S0_BN1_W, stage, blk_idx), false);
block.s0_conv_pwl_w = get_tensor(string_format(TN_MNV5_BLK_S0_PWL_W, stage, blk_idx), false);
block.s0_bn2_w = get_tensor(string_format(TN_MNV5_BLK_S0_BN2_W, stage, blk_idx), false);
}
// 2. Check for UIR (Universal Inverted Residual)
else {
// Check for dw_start OR pw_exp (some UIR blocks skip dw_start)
block.dw_start_w = get_tensor(string_format(TN_MNV5_BLK_DW_START_W, stage, blk_idx), false);
block.pw_exp_w = get_tensor(string_format(TN_MNV5_BLK_PW_EXP_W, stage, blk_idx), false);

if (block.dw_start_w || block.pw_exp_w) {
found_block = true;
if (block.dw_start_w) {
block.dw_start_bn_w = get_tensor(string_format(TN_MNV5_BLK_DW_START_BN, stage, blk_idx), false);
}
if (block.pw_exp_w) {
block.pw_exp_bn_w = get_tensor(string_format(TN_MNV5_BLK_PW_EXP_BN, stage, blk_idx), false);
}
block.dw_mid_w = get_tensor(string_format(TN_MNV5_BLK_DW_MID_W, stage, blk_idx), false);
if (block.dw_mid_w) {
block.dw_mid_bn_w = get_tensor(string_format(TN_MNV5_BLK_DW_MID_BN, stage, blk_idx), false);
}
block.pw_proj_w = get_tensor(string_format(TN_MNV5_BLK_PW_PROJ_W, stage, blk_idx), false);
if (block.pw_proj_w) {
block.pw_proj_bn_w = get_tensor(string_format(TN_MNV5_BLK_PW_PROJ_BN, stage, blk_idx), false);
}
block.layer_scale_w = get_tensor(string_format(TN_MNV5_BLK_LAYER_SCALE, stage, blk_idx), false);
}
}

// 3. Check for Attention (MQA)
// Even if UIR/Edge check failed, this might be a pure attention block
ggml_tensor* attn_q_check = get_tensor(string_format(TN_MNV5_ATTN_Q_W, stage, blk_idx), false);
if (attn_q_check) {
found_block = true;
block.attn_q_w = attn_q_check;
block.attn_k_w = get_tensor(string_format(TN_MNV5_ATTN_K_W, stage, blk_idx), false);
block.attn_v_w = get_tensor(string_format(TN_MNV5_ATTN_V_W, stage, blk_idx), false);
block.attn_o_w = get_tensor(string_format(TN_MNV5_ATTN_O_W, stage, blk_idx), false);
block.attn_k_dw_w = get_tensor(string_format(TN_MNV5_ATTN_K_DW, stage, blk_idx), false);
block.attn_k_norm_w = get_tensor(string_format(TN_MNV5_ATTN_K_NORM, stage, blk_idx), false);
block.attn_v_dw_w = get_tensor(string_format(TN_MNV5_ATTN_V_DW, stage, blk_idx), false);
block.attn_v_norm_w = get_tensor(string_format(TN_MNV5_ATTN_V_NORM, stage, blk_idx), false);
block.attn_norm_w = get_tensor(string_format(TN_MNV5_ATTN_NORM, stage, blk_idx), false);
// Note: Attention blocks also have layer_scale, load it if not already loaded by UIR check
if (!block.layer_scale_w) {
block.layer_scale_w = get_tensor(string_format(TN_MNV5_BLK_LAYER_SCALE, stage, blk_idx), false);
}
}

if (found_block) {
model.mobilenet_blocks.push_back(block);
blocks_found_in_stage++;
} else {
// End of blocks for this stage
break;
}
}

// Track where this stage ends in the flat vector
if (blocks_found_in_stage > 0) {
model.mobilenet_stage_ends.push_back(model.mobilenet_blocks.size() - 1);
LOG_INF("%s: Stage %d ended at global block index %zu\n", __func__, stage, model.mobilenet_blocks.size() - 1);
}

⚠️ Potential issue | 🔴 Critical

Critical: mobilenetv5_block block; is uninitialized (UB) before checking fields / pushing to vector.
You later read fields like block.layer_scale_w (and may push partially filled structs), which is undefined behavior unless the struct’s members have default initializers.

Proposed fix
-                            mobilenetv5_block block;
+                            mobilenetv5_block block{};
🤖 Prompt for AI Agents
In @tools/mtmd/clip.cpp around lines 1584 - 1655, The local mobilenetv5_block
variable is default-uninitialized causing UB when reading members like
layer_scale_w before assignment; fix by zero-initializing the struct instance at
creation (e.g., value-initialize mobilenetv5_block so all pointers/flags are
null/zero), or explicitly initialize all members you later read (layer_scale_w
and any pointer/flag fields) before any get_tensor checks, so that pushing to
model.mobilenet_blocks uses a fully-initialized block.

Comment on lines +3242 to +3247
case PROJECTOR_TYPE_GEMMA3N:
{
// MobileNetV5 MSFA adapter always outputs fixed 16x16 resolution
// regardless of input size (see architecture description)
n_patches = ctx->model.hparams.image_size / ctx->model.hparams.patch_size;
} break;

⚠️ Potential issue | 🔴 Critical

Critical: GEMMA3N token count calculation is wrong (returns “patches per side”, not tokens).
The in-code comment implies 256 tokens (16×16), but image_size / patch_size is only 16 for a correct (768, 48) setup. This currently looks coupled to the known converter patch_size bug; it will break once patch_size is fixed semantically.

Proposed fix (matches 16x16 claim)
         case PROJECTOR_TYPE_GEMMA3N:
             {
                 // MobileNetV5 MSFA adapter always outputs fixed 16x16 resolution
                 // regardless of input size (see architecture description)
-                n_patches = ctx->model.hparams.image_size / ctx->model.hparams.patch_size;
+                const int n_side = ctx->model.hparams.image_size / ctx->model.hparams.patch_size;
+                GGML_ASSERT(n_side > 0);
+                n_patches = n_side * n_side;
             } break;
🤖 Prompt for AI Agents
In @tools/mtmd/clip.cpp around lines 3242 - 3247, The GEMMA3N branch incorrectly
sets n_patches to ctx->model.hparams.image_size / ctx->model.hparams.patch_size
(patches per side) instead of total tokens; change the calculation in the
PROJECTOR_TYPE_GEMMA3N case to compute total patches/tokens as (image_size /
patch_size) squared (e.g., n_patches = pow(ctx->model.hparams.image_size /
ctx->model.hparams.patch_size, 2) or multiply the quotient by itself) so the
value matches the 16×16 = 256 claim and is robust to a corrected patch_size.
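
A sketch of the corrected token count, assuming the (768, 48) hparams setup discussed above; the function name is illustrative, not the C++ code's:

```python
def gemma3n_n_patches(image_size: int, patch_size: int) -> int:
    # Patches per side, then squared for the total token count
    # (the MSFA adapter's claimed fixed 16x16 output => 256 tokens).
    n_side = image_size // patch_size
    assert n_side > 0
    return n_side * n_side
```

With image_size=768 and patch_size=48 this yields 16 per side and 256 total, matching the 16×16 claim rather than the per-side count.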

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
convert_hf_to_gguf.py (1)

6281-6311: Padding embeddings: applying padding to wrong dimension in per-layer embeddings.

The tensor shape for embed_tokens_per_layer is [n_embd_altup * n_layer, n_vocab], with vocab size in shape[1], not shape[0]. The current padding logic at line 6307 concatenates along dim=0 and uses data_torch.shape[1] for the second dimension, which pads the embedding dimension instead of the vocabulary dimension. The padding should be applied along dim=1 to correctly extend the vocabulary axis. Additionally, add an explicit assertion for the tensor rank (should be 2D) to prevent unhandled shape mismatches.
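
A shape-only sketch of padding along the vocabulary axis (dim=1) of a [n_embd_altup * n_layer, n_vocab] tensor, using plain nested lists as an illustrative stand-in for torch tensors:

```python
def pad_vocab_dim(rows, n_vocab_target):
    # rows: rank-2 data shaped [n_embd_altup * n_layer, n_vocab];
    # the vocabulary axis is dim=1, so padding must extend each row,
    # not append new rows (which would grow the embedding axis instead).
    assert rows and all(len(r) == len(rows[0]) for r in rows), "expect rank-2 input"
    n_pad = n_vocab_target - len(rows[0])
    assert n_pad >= 0
    return [r + [0.0] * n_pad for r in rows]
```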

gguf-py/gguf/constants.py (1)

463-474: Remove orphaned GEMMA3N enum value that mismatches C++ implementation.

VISION_PROJECTOR_TYPE.GEMMA3N exists in the Python enum but is never used anywhere in the codebase. The actual implementation splits Gemma3n into two separate projector types in C++: PROJECTOR_TYPE_GEMMA3NV (vision, "gemma3nv") and PROJECTOR_TYPE_GEMMA3NA (audio, "gemma3na"), which are correctly exposed via the VisionProjectorType class as GEMMA3NV and GEMMA3NA.

Remove GEMMA3N from the VISION_PROJECTOR_TYPE enum and MODEL_ARCH enum since it has no corresponding implementation. The Python codebase should only define enum values that map to actual projector types used by convert_hf_to_gguf.py or loaded by downstream consumers.

🤖 Fix all issues with AI agents
In @convert_hf_to_gguf.py:
- Around line 6058-6091: In modify_tensors, avoid hardcoding eps=1e-5 when
folding batch_norm; instead attempt to obtain eps from the model config (or a
provided attribute on the converter) before falling back to the default,
validate it is a small positive float, and emit a warning via the converter
logger if the config value is missing so the user is aware of the silent numeric
change; update references around self._batch_norm_tensors handling and the
computation of a = weight / torch.sqrt(running_var + eps) to use the chosen eps
and ensure map_tensor_name and block_count logic is unchanged.
- Around line 530-536: The fallback for max_name_len in prepare_tensors uses a
model-specific literal; change it to a shorter generic constant or derive it
from available keys to avoid embedding model names: when self.tensor_map.mapping
is empty, set max_name_len to a small fixed value (e.g., len("encoder.weight,"))
or compute max(len(k) for k in self.model_tensors.keys()) + len(".weight,") if
self.model_tensors exists, ensuring you reference the prepare_tensors method,
self.tensor_map.mapping and self.model_tensors when making the replacement.
- Around line 6235-6254: The current set_vocab method temporarily deletes
self.hparams["vocab_size_per_layer_input"] but does not guarantee restoration if
super().set_vocab() raises; wrap the call to super().set_vocab() in a
try/finally block so that vocab_size_per_layer_input (the saved variable) is
always restored to self.hparams after the call, ensuring no permanent mutation
of self.hparams even on exceptions; reference the set_vocab method, the local
variable vocab_size_per_layer_input, self.hparams, and the call to
super().set_vocab() when applying the change.
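
The restore-on-exception pattern suggested above can be sketched as follows; HParamsOwner is a hypothetical stand-in for the converter class, not its real name:

```python
class HParamsOwner:
    def __init__(self):
        self.hparams = {"vocab_size_per_layer_input": 262144}

    def set_vocab(self, parent_set_vocab):
        # Temporarily remove the key, but guarantee restoration even if the
        # parent implementation raises.
        saved = self.hparams.pop("vocab_size_per_layer_input")
        try:
            parent_set_vocab()
        finally:
            self.hparams["vocab_size_per_layer_input"] = saved
```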

In @tools/mtmd/clip.cpp:
- Around line 3242-3247: n_patches is computed incorrectly for
PROJECTOR_TYPE_GEMMA3NV: instead of producing 16x16 (=256) tokens the code
divides image_size by patch_size only once; update the logic in the
PROJECTOR_TYPE_GEMMA3NV branch (the block that assigns n_patches) to yield the
total number of patches, either by setting n_patches to the fixed constant 256
if the adapter truly always outputs 16x16, or by computing (image_size /
patch_size) squared (i.e. multiply the per-dimension count by itself) using
ctx->model.hparams.image_size and ctx->model.hparams.patch_size so the result
reflects total tokens correctly.
- Around line 1567-1659: Summary: Add post-load validation to ensure MobileNetV5
blocks were actually discovered and tensor name patterns align with the
converter. After the per-stage loading loop, record per-stage counts (e.g., add
a local vector<int> stage_block_counts and increment with blocks_found_in_stage
inside the existing loop), then validate: assert
model.mobilenet_stage_ends.size() == 4 (or log error if not), verify each
stage_block_counts[stage] > 0 (log which stage is empty and bail), and check
total model.mobilenet_blocks.size() is within expected bounds (log actual vs
expected and abort on gross mismatch). Also emit a warning listing any missing
key tensor patterns (use TN_MNV5_BLK_S0_EXP_W, TN_MNV5_BLK_DW_START_W,
TN_MNV5_ATTN_Q_W, etc.) so mismatches with clip-impl.h / Python converter can be
diagnosed.

In @tools/mtmd/models/mobilenetv5.cpp:
- Around line 152-246: In build_mobilenet_attn add a divisibility assert before
computing n_head: insert GGML_ASSERT(q->ne[2] % D == 0) to ensure q->ne[2] is
divisible by D, and extend the spatial residual check to include height (require
inp->ne[1] == cur->ne[1] alongside inp->ne[0] and inp->ne[2]) so the residual
only applies when W, H and channels match; also verify the orientation/shape
passed to ggml_mul_mat(ctx0, k, q) and subsequent ggml_soft_max(ctx0, scores) so
they operate on tensors shaped as [D, M, 1, B] (k) and [D, N, n_head, B] (q) (or
transpose them appropriately) to produce scores of shape [D, M, N, B] for the
intended attention before softmax and matmul with v.
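
The batch-norm folding fix suggested for modify_tensors above (eps taken from config with a validated fallback) can be sketched numerically; fold_batch_norm, resolve_eps, and the batch_norm_eps config key are illustrative assumptions, not the converter's actual API:

```python
import math

def resolve_eps(config: dict, default: float = 1e-5) -> float:
    # Prefer the model config's value; fall back to the conventional default.
    eps = config.get("batch_norm_eps", default)
    assert isinstance(eps, float) and 0.0 < eps < 1.0
    return eps

def fold_batch_norm(conv_w, bn_weight, bn_bias, running_mean, running_var, eps):
    # Per-channel folding: a = gamma / sqrt(var + eps) scales the conv weight,
    # and b = beta - a * mean becomes the fused bias.
    a = [w / math.sqrt(v + eps) for w, v in zip(bn_weight, running_var)]
    b = [bb - ai * m for bb, ai, m in zip(bn_bias, a, running_mean)]
    folded_w = [cw * ai for cw, ai in zip(conv_w, a)]
    return folded_w, b
```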
🧹 Nitpick comments (10)
gguf-py/gguf/gguf_writer.py (1)

1086-1091: Clarify how clip.projector_type interacts with the new per-modality projector type keys.

With add_clip_projector_type() plus add_clip_vision_projector_type() / add_clip_audio_projector_type(), GGUFs can now encode projector type in multiple places. To avoid interop issues, it’d help to standardize one of:

  • precedence rules (e.g., prefer per-modality keys when present), and/or
  • producer behavior (e.g., write both legacy Keys.Clip.PROJECTOR_TYPE and the new per-modality key for backward compatibility).

Also applies to: 1172-1176

tools/mtmd/models/mobilenetv5.cpp (5)

5-20: Make rms_norm_2d() call sites independent of a default eps and validate weight broadcasting.

Many call sites pass only (inp, weight); if eps isn’t a defaulted parameter in the class declaration (in tools/mtmd/models/models.h), this won’t compile. Also, ggml_mul() broadcast behavior depends on the exact weight tensor shape (1D vs [C,1,1,1]).

Proposed change (explicit eps at call sites)
-    if (block.s0_bn1_w) cur = rms_norm_2d(cur, block.s0_bn1_w);
+    if (block.s0_bn1_w) cur = rms_norm_2d(cur, block.s0_bn1_w, 1e-6f);

(Repeat similarly for other rms_norm_2d(cur, ...) call sites.)


23-53: pad_same_2d(): avoid narrowing int64_t padding values to int without bounds.

pad_h/pad_w are int64_t but are narrowed to int for left/right/top/bottom. Probably fine for normal image sizes, but this is an easy footgun. If ggml_pad_ext takes int, consider asserting the pads fit, or keep them as int64_t until the call boundary.


57-88: Residual shape check should include height; also don’t rely on implicit eps.

The residual check currently compares channels and width, but not height (ne[1]). If anything ever produces non-square or otherwise mismatched spatial dims, this can add incompatible tensors.

Proposed change (height-aware residual condition)
-    if (stride == 1 && inp->ne[2] == cur->ne[2] && inp->ne[0] == cur->ne[0]) {
+    if (stride == 1 &&
+        inp->ne[0] == cur->ne[0] &&
+        inp->ne[1] == cur->ne[1] &&
+        inp->ne[2] == cur->ne[2]) {
         cur = ggml_add(ctx0, cur, inp);
     }

91-149: Stage stride inference is brittle; prefer per-block stride metadata if available.

stride = is_stage_start(i) ? 2 : 1; assumes every stage start downsamples. If the upstream model ever has a stage that starts with stride=1, this silently builds the wrong graph. If stride exists in the converted config / tensor metadata, use it; otherwise, add asserts keyed off expected shapes.


248-451: MSFA path has hardcoded target_out_res=16 and width-only upscaling; both are brittle.

  • const int target_out_res = 16; should ideally be derived (e.g., sqrt(image_seq_len) or another hparam), otherwise variants won’t work.
  • Upscale uses scale_w only and asserts only high_res_w % feat_w == 0; if height differs too, you can build inconsistent shapes.
Proposed change (height checks + scale_h parity)
-                int scale_w = high_res_w / feat_w;
-                // int scale_h = high_res_h / feat_h;
+                int scale_w = high_res_w / feat_w;
+                int scale_h = high_res_h / feat_h;

-                GGML_ASSERT(high_res_w % feat_w == 0);
+                GGML_ASSERT(high_res_w % feat_w == 0);
+                GGML_ASSERT(high_res_h % feat_h == 0);
+                GGML_ASSERT(scale_w == scale_h); // if ggml_upscale only supports uniform scaling

-                feat = ggml_upscale(ctx0, feat, scale_w, ggml_scale_mode::GGML_SCALE_MODE_NEAREST);
+                feat = ggml_upscale(ctx0, feat, scale_w, ggml_scale_mode::GGML_SCALE_MODE_NEAREST);
convert_hf_to_gguf.py (2)

6100-6125: Annotate block_tensor_mapping as ClassVar (and keep it immutable-by-convention).

This is a constant mapping; make that explicit to satisfy linters and avoid accidental per-instance mutation. (Ruff RUF012)

Proposed fix
+from typing import ClassVar
+
 class Gemma3nVisionAudioModel(ConformerAudioModel):
@@
-    block_tensor_mapping = {
+    block_tensor_mapping: ClassVar[dict[str, str]] = {
         "model.vision_tower.timm_model.blocks.{bid}.{sid}.conv_exp.weight":             "v.blk.{bid}.{sid}.conv_exp.weight",
         ...
     }

6175-6187: custom_map() should validate it’s actually mapping a MobileNet block path.

Right now it assumes parts[4]/parts[5] are {bid}.{sid} whenever len(parts) >= 7, which could mis-map other similarly-long names. Add a quick guard like parts[:4] == ["model","vision_tower","timm_model","blocks"].
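
The guard above can be written as a small predicate (illustrative helper, not the converter's actual method):

```python
def is_mobilenet_block_path(name: str) -> bool:
    # Only treat parts[4]/parts[5] as {bid}.{sid} when the name really is a
    # MobileNet block path, not just any dotted name with >= 7 components.
    parts = name.split(".")
    return len(parts) >= 7 and parts[:4] == ["model", "vision_tower", "timm_model", "blocks"]
```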

tools/mtmd/clip.cpp (2)

2115-2115: TODO: Audio support for Gemma3n

The code skips audio loading for GEMMA3NV with a TODO comment indicating that audio tensors exist in the GGUF but are not yet supported. This is a reasonable temporary workaround.

Consider opening a tracking issue for implementing Gemma3n audio support to ensure this TODO is addressed in a future update.

Also applies to: 2125-2127, 2132-2132


3640-3640: LGTM: Helper function updates with minor note

The additions to switch statements and helper functions are consistent:

  • Line 3640: Correctly groups GEMMA3NV with similar projector types
  • Line 3768: Returns mm_input_proj_w->ne[0] matching GEMMA3 behavior
  • Lines 3812-3820: Correctly excludes GEMMA3NV from mRoPE projectors
  • Lines 3836-3845: Correctly excludes GEMMA3NV from Whisper encoders

Note: Functions clip_is_minicpmv and clip_is_glm are marked with // TODO: remove this function (lines 3799, 3807), indicating they're deprecated. Consider filing a cleanup issue to remove these in a future refactor.

Also applies to: 3768-3768, 3799-3820, 3836-3845

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6a68b35 and 8f6dbbe.

📒 Files selected for processing (8)
  • convert_hf_to_gguf.py
  • gguf-py/gguf/constants.py
  • gguf-py/gguf/gguf_writer.py
  • tools/mtmd/clip-impl.h
  • tools/mtmd/clip.cpp
  • tools/mtmd/clip.h
  • tools/mtmd/models/mobilenetv5.cpp
  • tools/mtmd/mtmd.cpp
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".

When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.

Applied to files:

  • tools/mtmd/clip-impl.h
  • tools/mtmd/clip.cpp
  • gguf-py/gguf/constants.py
  • tools/mtmd/mtmd.cpp
🧬 Code graph analysis (2)
gguf-py/gguf/gguf_writer.py (1)
gguf-py/gguf/constants.py (3)
  • Keys (20-332)
  • ClipVision (284-308)
  • ClipAudio (310-323)
convert_hf_to_gguf.py (3)
gguf-py/gguf/constants.py (1)
  • VisionProjectorType (3560-3584)
ggml/src/ggml-vulkan/ggml-vulkan.cpp (6)
  • b (391-394)
  • b (391-391)
  • b (403-406)
  • b (403-403)
  • b (415-418)
  • b (415-415)
gguf-py/gguf/gguf_writer.py (5)
  • add_clip_vision_projector_type (1089-1090)
  • add_vision_attention_layernorm_eps (1110-1111)
  • add_clip_audio_projector_type (1174-1175)
  • add_audio_num_mel_bins (1195-1196)
  • add_audio_attention_layernorm_eps (1192-1193)
🪛 Ruff (0.14.10)
convert_hf_to_gguf.py

6054-6054: Parenthesize a and b expressions when chaining and and or together, to make the precedence clear

Parenthesize the and subexpression

(RUF021)


6102-6125: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


6186-6186: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: build-linux-cross / debian-13-loongarch64-cpu-cross
  • GitHub Check: build-cmake-pkg / linux
  • GitHub Check: build-linux-cross / ubuntu-24-riscv64-cpu-spacemit-ime-cross
  • GitHub Check: build-linux-cross / debian-13-loongarch64-vulkan-cross
  • GitHub Check: openEuler-latest-cmake-cann (x86, 310p, Release)
  • GitHub Check: openEuler-latest-cmake-cann (x86, 910b, Release)
  • GitHub Check: ubuntu-24-cmake-vulkan-deb
  • GitHub Check: ubuntu-24-wasm-webgpu
  • GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
  • GitHub Check: ubuntu-24-cmake-webgpu
  • GitHub Check: macOS-latest-cmake-x64
  • GitHub Check: ubuntu-24-cmake-vulkan
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: server (UNDEFINED, RelWithDebInfo)
  • GitHub Check: server (ADDRESS, RelWithDebInfo)
  • GitHub Check: server (Release, LLAMA_ARG_BACKEND_SAMPLING=1)
  • GitHub Check: server (Release)
  • GitHub Check: server-windows
  • GitHub Check: pyright type-check
  • GitHub Check: check-requirements
🔇 Additional comments (14)
tools/mtmd/mtmd.cpp (2)

864-874: Non-causal decode selection is vision-only; confirm audio-only behavior is intended.

mtmd_decode_use_non_causal() keys off ctx->proj_type_v() only. For audio-only mmproj files (ctx_v == nullptr), this always returns false. If any audio projector types require non-causal decoding, this will be wrong.


268-316: Verify whether GEMMA3NV uses the same <start_of_image> / <end_of_image> BOI/EOI tokens as GEMMA3.

The code treats both identically (line 269 of mtmd.cpp), but GEMMA3NV uses a fundamentally different vision architecture (MobileNetV5 encoder) compared to GEMMA3's standard projector. Without explicit tokenizer confirmation, this shared token assignment could cause prompt-formatting issues if Gemma3n's tokenizer handles these strings differently.

tools/mtmd/clip.h (1)

105-111: No action needed. The removal of clip_is_gemma3() is safe—no remaining call sites or references exist in the codebase.

convert_hf_to_gguf.py (3)

6188-6211: Verify unsqueeze semantics for conv_stem.conv.bias / layer_scale.gamma.

Converting 1D tensors into [1, C, 1, 1] may be required by your ggml/mtmd loader, but it’s non-obvious and easy to get wrong (esp. for layer_scale which is often applied as a vector). Please double-check the corresponding C++ tensor shapes expected in the MobileNetV5 graph/loader and add a short comment explaining the expected runtime broadcast.


10151-10167: LFM2 multimodal skipping looks fine.

Using ConformerAudioModel.is_audio_tensor() here is a pragmatic way to avoid dragging audio weights into the text GGUF.


10295-10327: LFM2AudioModel wiring is reasonable; confirm block_count discovery for this encoder.

Given MmprojModel.__init__ derives block_count from n_block_keys, please confirm the LFM2 audio encoder config (returned by get_audio_config()) actually contains one of those keys, otherwise initialization may break (or produce a bad tensor map).

gguf-py/gguf/constants.py (2)

278-286: Nice improvement: explicit per-modality projector type keys (vision/audio).
This aligns with mixed-modality models and matches the mtmd side’s clip.vision.projector_type / clip.audio.projector_type split.

Also applies to: 310-312


681-689: New gemma3n tensor IDs/names look coherent; please sanity-check name suffix conventions end-to-end.
Given mtmd expects explicit *.weight / *.bias tensor names, verify that the python-side “base names” (e.g. v.conv_stem.conv, v.conv_stem.bn, v.msfa.norm) are expanded consistently by the writer/loader for all required parameters (esp. norms that may need both weight+bias).

Also applies to: 717-747, 1097-1106, 1135-1165, 1216-1292

tools/mtmd/clip-impl.h (2)

205-237: Projector type wiring for gemma3nv/gemma3na looks correct and consistent within mtmd.
Enum entries and PROJECTOR_TYPE_NAMES additions are straightforward. (Per your prior pattern, keeping QWEN25O as a replaceable placeholder remains fine.)

Also applies to: 239-269


157-196: BN macros use RMS normalization, not BatchNorm—no bias/stats needed.

The concern about missing bias and running statistics is based on a misunderstanding of the normalization type. The code uses rms_norm_2d() for all these "BN" tensors, which implements RMS (Root Mean Square) normalization. RMS norm is a stateless operation that only requires the scale parameter (weight); it does not use bias or running statistics like BatchNorm does. The weight-only macro definitions are correct and complete for this use case.
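For reference, a minimal pure-Python sketch of weight-only RMS normalization (function name and eps are illustrative, not the mtmd API): unlike BatchNorm there is no bias and no running mean/var to load.

```python
import math

def rms_norm(xs, weight, eps=1e-6):
    # Stateless: scale by 1/RMS(x), then by the per-element weight.
    rms = math.sqrt(sum(x * x for x in xs) / len(xs) + eps)
    return [x / rms * w for x, w in zip(xs, weight)]

out = rms_norm([1.0, -2.0, 3.0], [1.0, 1.0, 1.0])
# With unit weights the normalized output has RMS ~= 1:
out_rms = math.sqrt(sum(v * v for v in out) / len(out))
print(out_rms)  # ~1.0
```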

tools/mtmd/clip.cpp (4)

791-794: LGTM: Graph builder routing

The GEMMA3NV routing to clip_graph_mobilenetv5 follows the established pattern for other projector types.


1349-1351: LGTM: Correct architecture-specific handling

Setting n_layer = 0 for GEMMA3NV is appropriate since MobileNetV5 uses a custom block structure instead of standard ViT layers. This prevents the loading loop at lines 1354-1425 from attempting to load non-existent standard layer tensors.


2970-2978: LGTM: Preprocessing path

The GEMMA3NV preprocessing correctly resizes to square without padding (add_padding = false), which differs from GEMMA3's behavior. This architectural difference is appropriate for MobileNetV5.


1153-1160: Code is correct; ensure Python converter bug is addressed separately

The hparams initialization for GEMMA3NV is correct. MobileNetV5 produces 16×16 tokens as stated, and n_merge=1 is appropriate since the encoder handles all spatial downsampling internally (contrasting with GEMMA3's n_merge=4). However, the Python converter has a confirmed bug: it computes patch_size = image_size // image_seq_length (e.g., 768 // 256 = 3) instead of correctly deriving it from the 16×16 grid dimensions (768 // 16 = 48). While this converter bug doesn't directly break the C++ code's n_merge setting, ensure the Python converter is fixed to avoid downstream issues with patch_size-dependent operations.

Comment on lines 530 to 536
def prepare_tensors(self):
max_name_len = max(len(s) for _, s in self.tensor_map.mapping.values()) + len(".weight,")
# Handle empty tensor_map for models with block_count=0 (like MobileNetV5)
if self.tensor_map.mapping:
max_name_len = max(len(s) for _, s in self.tensor_map.mapping.values()) + len(".weight,")
else:
max_name_len = len("vision_encoder.weight,") # Default reasonable length


⚠️ Potential issue | 🟡 Minor

Good guard for block_count=0 models; tighten the fallback log-width constant.

This is only for log formatting, so correctness impact is low. Consider using a shorter constant (or deriving from self.model_tensors keys) to avoid embedding model-specific names into the generic base class.

🤖 Prompt for AI Agents
In @convert_hf_to_gguf.py around lines 530 - 536, The fallback for max_name_len
in prepare_tensors uses a model-specific literal; change it to a shorter generic
constant or derive it from available keys to avoid embedding model names: when
self.tensor_map.mapping is empty, set max_name_len to a small fixed value (e.g.,
len("encoder.weight,")) or compute max(len(k) for k in
self.model_tensors.keys()) + len(".weight,") if self.model_tensors exists,
ensuring you reference the prepare_tensors method, self.tensor_map.mapping and
self.model_tensors when making the replacement.

Comment on lines +6052 to +6056
def tensor_force_quant(self, name, new_name, bid, n_dims):
if ConformerAudioModel.is_audio_tensor(name):
if ".conv" in name or "_conv" in name and ".weight" in name:
return gguf.GGMLQuantizationType.F32
return super().tensor_force_quant(name, new_name, bid, n_dims)

⚠️ Potential issue | 🟡 Minor

Fix boolean-precedence bug in conv quantization predicate.

In Python, "and" binds tighter than "or", so the condition parses as ".conv" in name or ("_conv" in name and ".weight" in name); any non-weight .conv tensor (e.g. a bias) is forced to F32. You likely intended to require ".weight" for both the ".conv" and "_conv" cases. (Ruff RUF021)

Proposed fix
-        if ConformerAudioModel.is_audio_tensor(name):
-            if ".conv" in name or "_conv" in name and ".weight" in name:
-                return gguf.GGMLQuantizationType.F32
+        if ConformerAudioModel.is_audio_tensor(name):
+            if ((".conv" in name) or ("_conv" in name)) and (".weight" in name):
+                return gguf.GGMLQuantizationType.F32
🧰 Tools
🪛 Ruff (0.14.10)

6054-6054: Parenthesize a and b expressions when chaining and and or together, to make the precedence clear

Parenthesize the and subexpression

(RUF021)
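A quick standalone illustration of the precedence trap (tensor names here are made up for the example):

```python
def needs_f32_buggy(name: str) -> bool:
    # Parses as: (".conv" in name) or (("_conv" in name) and (".weight" in name))
    return ".conv" in name or "_conv" in name and ".weight" in name

def needs_f32_fixed(name: str) -> bool:
    return ((".conv" in name) or ("_conv" in name)) and (".weight" in name)

bias_name = "audio.encoder.conv.0.bias"  # hypothetical audio conv bias
print(needs_f32_buggy(bias_name))  # True: the bias slips through to F32
print(needs_f32_fixed(bias_name))  # False: only .weight tensors qualify
```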

Comment on lines +6058 to +6091
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
# fold running_mean, running_var and eps into weight and bias for batch_norm
if "batch_norm" in name:
if self._batch_norm_tensors is None:
self._batch_norm_tensors = [{} for _ in range(self.block_count)]
assert bid is not None
self._batch_norm_tensors[bid][name] = data_torch

if len(self._batch_norm_tensors[bid]) < 5:
return []

weight = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.weight"]
bias = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.bias"]
running_mean = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.running_mean"]
running_var = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.running_var"]
eps = 1e-5 # default value

a = weight / torch.sqrt(running_var + eps)
b = bias - running_mean * a
return [
(self.map_tensor_name(f"conformer.layers.{bid}.conv.batch_norm.weight"), a),
(self.map_tensor_name(f"conformer.layers.{bid}.conv.batch_norm.bias"), b),
]

# reshape conv weights
if name.startswith("conformer.pre_encode.conv.") and name.endswith(".bias"):
data_torch = data_torch[:, None, None]
if "conv.depthwise_conv" in name and name.endswith(".weight"):
assert data_torch.shape[1] == 1
data_torch = data_torch.reshape(data_torch.shape[0], data_torch.shape[2])
if "conv.pointwise_conv" in name and name.endswith(".weight"):
assert data_torch.shape[2] == 1
data_torch = data_torch.reshape(data_torch.shape[0], data_torch.shape[1])


⚠️ Potential issue | 🟡 Minor

BatchNorm folding: avoid hardcoding eps (or at least document/validate it).

BN eps isn’t in the state_dict; hardcoding 1e-5 is a reasonable default, but if the source model uses a different value this silently changes numerics. Suggest: (1) try to read it from config if available, else (2) keep the default but add a warning when folding.

🤖 Prompt for AI Agents
In @convert_hf_to_gguf.py around lines 6058 - 6091, In modify_tensors, avoid
hardcoding eps=1e-5 when folding batch_norm; instead attempt to obtain eps from
the model config (or a provided attribute on the converter) before falling back
to the default, validate it is a small positive float, and emit a warning via
the converter logger if the config value is missing so the user is aware of the
silent numeric change; update references around self._batch_norm_tensors
handling and the computation of a = weight / torch.sqrt(running_var + eps) to
use the chosen eps and ensure map_tensor_name and block_count logic is
unchanged.
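The folding algebra itself can be checked in isolation (pure-Python, per-channel scalars; eps = 1e-5 mirrors the hardcoded default discussed above):

```python
import math

def bn_eval(x, weight, bias, mean, var, eps=1e-5):
    # Eval-mode BatchNorm: normalize with running stats, then affine.
    return (x - mean) / math.sqrt(var + eps) * weight + bias

def bn_folded(x, weight, bias, mean, var, eps=1e-5):
    # Same transform folded into a single scale/shift, as in modify_tensors:
    # a = weight / sqrt(var + eps), b = bias - mean * a
    a = weight / math.sqrt(var + eps)
    b = bias - mean * a
    return x * a + b

params = (0.8, 0.1, 0.3, 1.7)  # weight, bias, running_mean, running_var
for x in (-1.5, 0.0, 2.25):
    assert abs(bn_eval(x, *params) - bn_folded(x, *params)) < 1e-9
print("folded BN matches eval-mode BN")
```

If the source model ever ships a different eps, both sides of this identity shift together, which is exactly why the folded weights silently change numerics.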

Comment on lines +6127 to +6153
def __init__(self, *args, **kwargs):
# Parent init will call find_hparam which now returns 0 for empty keys
super().__init__(*args, **kwargs)
assert self.hparams_vision is not None
self.hparams_vision["n_layers"] = 128 # fake value for audio encoder, vision encoder doesn't use it
self.hparams_vision["intermediate_size"] = self.hparams_vision.get("intermediate_size", 2048) * 4
self.hparams_vision["num_attention_heads"] = self.hparams_vision.get("num_attention_heads", 8)

# MobileNetV5 does not use image_mean/std
self.preprocessor_config["image_mean"] = [0.0 ,0.0 , 0.0]
self.preprocessor_config["image_std"] = [1.0 ,1.0 ,1.0]
self.hparams_vision["image_size"] = self.preprocessor_config.get(
"size", {"height": 768, "width": 768}
)["height"]

# Image sequence length (256 tokens = 16x16 for Gemma3n)
image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
image_size = self.hparams_vision["image_size"]
self.hparams_vision["patch_size"] = image_size // image_seq_length

# remap audio hparams
assert self.hparams_audio is not None
self.hparams_audio["n_layers"] = self.hparams_audio["conf_num_hidden_layers"]
self.hparams_audio["num_attention_heads"] = self.hparams_audio["conf_num_attention_heads"]
self.hparams_audio["feat_in"] = self.hparams_audio["input_feat_size"]
self.hparams_audio["intermediate_size"] = self.hparams_audio.get("intermediate_size", 6144)


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's find and examine the context of this code in the repository
fd "convert_hf_to_gguf.py" --type f

Repository: ngxson/llama.cpp

Length of output: 81


🌐 Web query:

Gemma3n MobileNetV5 vision encoder patch size configuration llama.cpp

💡 Result:

Short answer: llama.cpp expects the Gemma vision encoder patch_size to be 14 (stored in the GGUF as clip.vision.patch_size). See the Gemma3n vision docs for the encoder/image defaults and the llama.cpp logs showing clip.vision.patch_size = 14.

Sources: [Hugging Face Gemma3n docs][1], [llama.cpp issue logs showing clip.vision.patch_size = 14][2].


🌐 Web query:

image_seq_length patch_size 2D grid calculation vision transformer

💡 Result:

Formula:

  • Grid size per side: grid_h = H / P, grid_w = W / P (integer division if using non-overlapping patches).
  • Number of patches (sequence length, excluding class token): N = grid_h * grid_w = (H / P) * (W / P).
  • If a class token is prepended, image_seq_length = N + 1.

Example (H=W=224, P=16):

  • grid = (224/16, 224/16) = (14, 14)
  • N = 14 * 14 = 196
  • image_seq_length (with class token) = 196 + 1 = 197.

Sources:
[1] Torchvision ViT implementation (positional embedding / seq length calculation).
[2] ViT patching explanation / tutorial (patch → (H/P × W/P) patches).


🏁 Script executed:

# Search for Gemma3n related code in the repository
rg "Gemma3n" --type py -A 5 -B 5

Repository: ngxson/llama.cpp

Length of output: 3851


🏁 Script executed:

# Look for vision encoder or patch_size references in the codebase
rg "patch_size|image_seq_length" --type py -B 3 -A 3 | head -100

Repository: ngxson/llama.cpp

Length of output: 8469


🏁 Script executed:

# Search for the specific class or method that contains this __init__
rg "hparams_vision.*=.*None" --type py -B 10 | head -50

Repository: ngxson/llama.cpp

Length of output: 734


patch_size computation must use patches-per-side, not total sequence length.

self.hparams_vision["patch_size"] = image_size // image_seq_length (e.g., 768 // 256 = 3) is incorrect for a 2D patch grid. Since the comment states 256 tokens = 16x16, the patch size should be computed per-side: patch_size = image_size // sqrt(image_seq_length) (e.g., 768 // 16 = 48). This aligns with standard vision transformer patching and matches the correct implementation already present in the same codebase (Tinygemma3 model). Without this fix, downstream token counts and attention operations will be semantically incorrect.

Proposed fix
         # Image sequence length (256 tokens = 16x16 for Gemma3n)
         image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
         image_size = self.hparams_vision["image_size"]
-        self.hparams_vision["patch_size"] = image_size // image_seq_length
+        n_per_side = int(image_seq_length ** 0.5)
+        if n_per_side * n_per_side != image_seq_length:
+            raise ValueError(f"image_seq_length must be a perfect square, got {image_seq_length}")
+        if image_size % n_per_side != 0:
+            raise ValueError(f"image_size {image_size} not divisible by patches-per-side {n_per_side}")
+        self.hparams_vision["patch_size"] = image_size // n_per_side

Also review the hardcoded fallbacks (intermediate_size * 4, num_attention_heads = 8); prefer reading from the vision config when present, with defaults only when missing.
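The arithmetic, worked through with the Gemma3n defaults (768 px input, 256-token output):

```python
image_size = 768
image_seq_length = 256  # documented as a 16x16 grid

n_per_side = int(image_seq_length ** 0.5)   # 16
assert n_per_side * n_per_side == image_seq_length

patch_size_buggy = image_size // image_seq_length  # 3
patch_size_fixed = image_size // n_per_side        # 48

# Only the per-side value round-trips back to the documented token count:
assert (image_size // patch_size_fixed) ** 2 == image_seq_length
print(patch_size_buggy, patch_size_fixed)  # 3 48
```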

Comment on lines 6235 to +6254
def set_vocab(self):
# For Gemma3n multimodal models, we need the FULL vocab_size (262400)
# which includes special tokens from 262144-262399 for vision/audio.
# The vocab_size_per_layer_input (262144) is only the embedding size per layer.
# Temporarily override the hparams lookup order to prioritize vocab_size.

# Store original vocab_size_per_layer_input if it exists
vocab_size_per_layer_input = self.hparams.get("vocab_size_per_layer_input")

# Temporarily remove vocab_size_per_layer_input to force using vocab_size
if vocab_size_per_layer_input is not None:
del self.hparams["vocab_size_per_layer_input"]

# Call parent set_vocab which will now use vocab_size (262400)
super().set_vocab()

# Restore vocab_size_per_layer_input for later use
if vocab_size_per_layer_input is not None:
self.hparams["vocab_size_per_layer_input"] = vocab_size_per_layer_input


⚠️ Potential issue | 🟠 Major

Ensure vocab override is restored with try/finally.

If super().set_vocab() throws, vocab_size_per_layer_input won’t be restored, leaving self.hparams in a mutated state (even if the process continues for any reason, e.g., partial conversions/tests).

Proposed fix
         vocab_size_per_layer_input = self.hparams.get("vocab_size_per_layer_input")
 
-        if vocab_size_per_layer_input is not None:
-            del self.hparams["vocab_size_per_layer_input"]
-
-        # Call parent set_vocab which will now use vocab_size (262400)
-        super().set_vocab()
-
-        # Restore vocab_size_per_layer_input for later use
-        if vocab_size_per_layer_input is not None:
-            self.hparams["vocab_size_per_layer_input"] = vocab_size_per_layer_input
+        if vocab_size_per_layer_input is not None:
+            del self.hparams["vocab_size_per_layer_input"]
+        try:
+            # Call parent set_vocab which will now use vocab_size (262400)
+            super().set_vocab()
+        finally:
+            if vocab_size_per_layer_input is not None:
+                self.hparams["vocab_size_per_layer_input"] = vocab_size_per_layer_input
🤖 Prompt for AI Agents
In @convert_hf_to_gguf.py around lines 6235 - 6254, The current set_vocab method
temporarily deletes self.hparams["vocab_size_per_layer_input"] but does not
guarantee restoration if super().set_vocab() raises; wrap the call to
super().set_vocab() in a try/finally block so that vocab_size_per_layer_input
(the saved variable) is always restored to self.hparams after the call, ensuring
no permanent mutation of self.hparams even on exceptions; reference the
set_vocab method, the local variable vocab_size_per_layer_input, self.hparams,
and the call to super().set_vocab() when applying the change.
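The restore pattern in isolation (a plain dict standing in for self.hparams; the values are the ones quoted above):

```python
def set_vocab_with_override(hparams, parent_set_vocab):
    # Temporarily drop the key, then restore it even if the parent raises.
    saved = hparams.get("vocab_size_per_layer_input")
    if saved is not None:
        del hparams["vocab_size_per_layer_input"]
    try:
        parent_set_vocab(hparams)  # may raise
    finally:
        if saved is not None:
            hparams["vocab_size_per_layer_input"] = saved

hparams = {"vocab_size": 262400, "vocab_size_per_layer_input": 262144}

def failing_parent(h):
    raise RuntimeError("simulated failure inside super().set_vocab()")

try:
    set_vocab_with_override(hparams, failing_parent)
except RuntimeError:
    pass

print(hparams["vocab_size_per_layer_input"])  # 262144: restored despite the error
```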

Comment on lines +1567 to +1659
case PROJECTOR_TYPE_GEMMA3NV:
{
model.mobilenet_stem_conv_w = get_tensor(TN_MNV5_STEM_CONV, false);
model.mobilenet_stem_conv_b = get_tensor(TN_MNV5_STEM_BIAS, false);
model.mobilenet_stem_norm_w = get_tensor(TN_MNV5_STEM_BN, false);

model.msfa_ffn_expand_w = get_tensor(TN_MNV5_MSFA_FFN_EXP_W, false);
model.msfa_ffn_expand_bn = get_tensor(TN_MNV5_MSFA_FFN_EXP_BN, false); // Consume BN if present but likely folded
model.msfa_ffn_project_w = get_tensor(TN_MNV5_MSFA_FFN_PROJ_W, false);
model.msfa_ffn_project_bn = get_tensor(TN_MNV5_MSFA_FFN_PROJ_BN, false);

model.msfa_concat_norm_w = get_tensor(TN_MNV5_MSFA_NORM, false);

// Dynamically load blocks stage by stage
for (int stage = 0; stage < 4; ++stage) {
int blocks_found_in_stage = 0;

for (int blk_idx = 0; ; ++blk_idx) {
bool found_block = false;
mobilenetv5_block block;

// 1. Check for Edge Residual (S0)
block.s0_conv_exp_w = get_tensor(string_format(TN_MNV5_BLK_S0_EXP_W, stage, blk_idx), false);
if (block.s0_conv_exp_w) {
found_block = true;
block.s0_bn1_w = get_tensor(string_format(TN_MNV5_BLK_S0_BN1_W, stage, blk_idx), false);
block.s0_conv_pwl_w = get_tensor(string_format(TN_MNV5_BLK_S0_PWL_W, stage, blk_idx), false);
block.s0_bn2_w = get_tensor(string_format(TN_MNV5_BLK_S0_BN2_W, stage, blk_idx), false);
}
// 2. Check for UIR (Universal Inverted Residual)
else {
// Check for dw_start OR pw_exp (some UIR blocks skip dw_start)
block.dw_start_w = get_tensor(string_format(TN_MNV5_BLK_DW_START_W, stage, blk_idx), false);
block.pw_exp_w = get_tensor(string_format(TN_MNV5_BLK_PW_EXP_W, stage, blk_idx), false);

if (block.dw_start_w || block.pw_exp_w) {
found_block = true;
if (block.dw_start_w) {
block.dw_start_bn_w = get_tensor(string_format(TN_MNV5_BLK_DW_START_BN, stage, blk_idx), false);
}
if (block.pw_exp_w) {
block.pw_exp_bn_w = get_tensor(string_format(TN_MNV5_BLK_PW_EXP_BN, stage, blk_idx), false);
}
block.dw_mid_w = get_tensor(string_format(TN_MNV5_BLK_DW_MID_W, stage, blk_idx), false);
if (block.dw_mid_w) {
block.dw_mid_bn_w = get_tensor(string_format(TN_MNV5_BLK_DW_MID_BN, stage, blk_idx), false);
}
block.pw_proj_w = get_tensor(string_format(TN_MNV5_BLK_PW_PROJ_W, stage, blk_idx), false);
if (block.pw_proj_w) {
block.pw_proj_bn_w = get_tensor(string_format(TN_MNV5_BLK_PW_PROJ_BN, stage, blk_idx), false);
}
block.layer_scale_w = get_tensor(string_format(TN_MNV5_BLK_LAYER_SCALE, stage, blk_idx), false);
}
}

// 3. Check for Attention (MQA)
// Even if UIR/Edge check failed, this might be a pure attention block
ggml_tensor* attn_q_check = get_tensor(string_format(TN_MNV5_ATTN_Q_W, stage, blk_idx), false);
if (attn_q_check) {
found_block = true;
block.attn_q_w = attn_q_check;
block.attn_k_w = get_tensor(string_format(TN_MNV5_ATTN_K_W, stage, blk_idx), false);
block.attn_v_w = get_tensor(string_format(TN_MNV5_ATTN_V_W, stage, blk_idx), false);
block.attn_o_w = get_tensor(string_format(TN_MNV5_ATTN_O_W, stage, blk_idx), false);
block.attn_k_dw_w = get_tensor(string_format(TN_MNV5_ATTN_K_DW, stage, blk_idx), false);
block.attn_k_norm_w = get_tensor(string_format(TN_MNV5_ATTN_K_NORM, stage, blk_idx), false);
block.attn_v_dw_w = get_tensor(string_format(TN_MNV5_ATTN_V_DW, stage, blk_idx), false);
block.attn_v_norm_w = get_tensor(string_format(TN_MNV5_ATTN_V_NORM, stage, blk_idx), false);
block.attn_norm_w = get_tensor(string_format(TN_MNV5_ATTN_NORM, stage, blk_idx), false);
// Note: Attention blocks also have layer_scale, load it if not already loaded by UIR check
if (!block.layer_scale_w) {
block.layer_scale_w = get_tensor(string_format(TN_MNV5_BLK_LAYER_SCALE, stage, blk_idx), false);
}
}

if (found_block) {
model.mobilenet_blocks.push_back(block);
blocks_found_in_stage++;
} else {
// End of blocks for this stage
break;
}
}

// Track where this stage ends in the flat vector
if (blocks_found_in_stage > 0) {
model.mobilenet_stage_ends.push_back(model.mobilenet_blocks.size() - 1);
LOG_INF("%s: Stage %d ended at global block index %zu\n", __func__, stage, model.mobilenet_blocks.size() - 1);
}
}
model.mm_input_proj_w = get_tensor(TN_MM_INP_PROJ);
model.mm_soft_emb_norm_w = get_tensor(TN_MM_SOFT_EMB_N);
} break;

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

rg "TN_MNV5" --type cpp -B 2 -A 2 | head -100

Repository: ngxson/llama.cpp

Length of output: 8387


🏁 Script executed:

fd . -name "*.h" -type f | xargs rg -l "TN_MNV5" | head -5

Repository: ngxson/llama.cpp

Length of output: 350


🏁 Script executed:

rg -n "mobilenet_stage_ends" --type cpp

Repository: ngxson/llama.cpp

Length of output: 693


🏁 Script executed:

rg -n "TN_MNV5_BLK_S0_EXP_W\|TN_MNV5_ATTN_Q_W" --type cpp -B 5

Repository: ngxson/llama.cpp

Length of output: 42


🏁 Script executed:

fd "convert" --type f | grep -i "mobilenet\|clip" | head -10

Repository: ngxson/llama.cpp

Length of output: 42


🏁 Script executed:

rg -l "conv_stem\|blk.*conv_exp" --type py | head -5

Repository: ngxson/llama.cpp

Length of output: 42


🏁 Script executed:

fd "*.py" -type f | xargs rg -l "mobilenet" | head -5

Repository: ngxson/llama.cpp

Length of output: 378


🏁 Script executed:

cat -n tools/mtmd/clip.cpp | sed -n '1650,1660p'

Repository: ngxson/llama.cpp

Length of output: 772


🏁 Script executed:

rg -A 10 "mobilenet_stage_ends.push_back" tools/mtmd/clip.cpp

Repository: ngxson/llama.cpp

Length of output: 674


🏁 Script executed:

find . -name "*.py" -type f | xargs grep -l "gemma3n\|mobilenetv5" 2>/dev/null | head -5

Repository: ngxson/llama.cpp

Length of output: 144


🏁 Script executed:

rg -A 5 "models.mobilenet_blocks" tools/mtmd/models/mobilenetv5.cpp | head -30

Repository: ngxson/llama.cpp

Length of output: 42


🏁 Script executed:

cat -n tools/mtmd/clip.cpp | sed -n '1567,1580p'

Repository: ngxson/llama.cpp

Length of output: 1037


🏁 Script executed:

rg -B 5 -A 15 "blocks_found_in_stage" tools/mtmd/clip.cpp | head -40

Repository: ngxson/llama.cpp

Length of output: 2169


🏁 Script executed:

cat -n gguf-py/gguf/tensor_mapping.py | head -50

Repository: ngxson/llama.cpp

Length of output: 3017


🏁 Script executed:

rg -B 5 -A 5 "gemma.*3n\|GEMMA3NV" gguf-py/gguf/tensor_mapping.py | head -40

Repository: ngxson/llama.cpp

Length of output: 42


🏁 Script executed:

rg "conv_stem\|v\.blk\." gguf-py/gguf/tensor_mapping.py | head -20

Repository: ngxson/llama.cpp

Length of output: 42


🏁 Script executed:

cat -n convert_hf_to_gguf.py | sed -n '1,100p'

Repository: ngxson/llama.cpp

Length of output: 3468


🏁 Script executed:

rg -l "mobilenetv5\|gemma3n" . --type file | head -10

Repository: ngxson/llama.cpp

Length of output: 88


🏁 Script executed:

find . -name "*.py" -type f | xargs grep -l "gemma3n\|gemma-3n" 2>/dev/null

Repository: ngxson/llama.cpp

Length of output: 144


🏁 Script executed:

rg "v\.conv_stem\|v\.blk\." --type py | head -30

Repository: ngxson/llama.cpp

Length of output: 42


🏁 Script executed:

find . -path ./node_modules -prune -o -name "convert*.py" -type f -print | xargs grep -l "PROJECTOR\|gemma" 2>/dev/null | head -5

Repository: ngxson/llama.cpp

Length of output: 114


🏁 Script executed:

cat -n tools/mtmd/models/mobilenetv5.cpp | sed -n '1,100p'

Repository: ngxson/llama.cpp

Length of output: 4671


🏁 Script executed:

rg -A 20 "case PROJECTOR_TYPE_GEMMA3NV" tools/mtmd/clip.cpp | head -60

Repository: ngxson/llama.cpp

Length of output: 3489


🏁 Script executed:

grep -rn "gemma3n\|mobilenetv5" convert_hf_to_gguf.py | head -30

Repository: ngxson/llama.cpp

Length of output: 169


🏁 Script executed:

rg "class Gemma\|mobilenet" convert_hf_to_gguf.py -A 10 | head -50

Repository: ngxson/llama.cpp

Length of output: 42


🏁 Script executed:

cat -n tools/mtmd/clip-model.h | sed -n '320,340p'

Repository: ngxson/llama.cpp

Length of output: 1023


🏁 Script executed:

rg "mobilenetv5_block" tools/mtmd/clip-model.h -B 5 -A 30 | head -60

Repository: ngxson/llama.cpp

Length of output: 2023


🏁 Script executed:

python3 << 'EOF'
import re

# Define tensor name patterns from C++
tensor_patterns_cpp = {
    "TN_MNV5_STEM_CONV": "v.conv_stem.conv.weight",
    "TN_MNV5_STEM_BIAS": "v.conv_stem.conv.bias",
    "TN_MNV5_STEM_BN": "v.conv_stem.bn.weight",
    "TN_MNV5_BLK_S0_EXP_W": "v.blk.%d.%d.conv_exp.weight",
    "TN_MNV5_BLK_S0_BN1_W": "v.blk.%d.%d.bn1.weight",
    "TN_MNV5_ATTN_Q_W": "v.blk.%d.%d.attn.query.proj.weight",
    "TN_MNV5_ATTN_K_W": "v.blk.%d.%d.attn.key.proj.weight",
}

print("Verified tensor name patterns:")
for name, pattern in tensor_patterns_cpp.items():
    print(f"  {name:25} -> {pattern}")
EOF

Repository: ngxson/llama.cpp

Length of output: 146


🏁 Script executed:

rg -A 5 "blocks_found_in_stage\|stage_ends" tools/mtmd/models/mobilenetv5.cpp | head -40

Repository: ngxson/llama.cpp

Length of output: 42


🏁 Script executed:

cat -n tools/mtmd/clip.cpp | sed -n '1640,1660p'

Repository: ngxson/llama.cpp

Length of output: 1246


🏁 Script executed:

rg "expected.*blocks\|assert.*stage\|blocks_found" tools/mtmd/models/mobilenetv5.cpp

Repository: ngxson/llama.cpp

Length of output: 42


🏁 Script executed:

rg -B 2 -A 10 "for.*mobilenet_stage_ends\|mobilenet_blocks" tools/mtmd/models/mobilenetv5.cpp | head -50

Repository: ngxson/llama.cpp

Length of output: 42


🏁 Script executed:

rg -B 5 -A 15 "build_mobilenetv5\|process_mobilenet" tools/mtmd/models/mobilenetv5.cpp | head -60

Repository: ngxson/llama.cpp

Length of output: 42


🏁 Script executed:

cat -n tools/mtmd/models/mobilenetv5.cpp | sed -n '260,290p'

Repository: ngxson/llama.cpp

Length of output: 1404


🏁 Script executed:

rg "mobilenet_blocks.size\|stage_ends.size" tools/mtmd/models/mobilenetv5.cpp

Repository: ngxson/llama.cpp

Length of output: 42


Add post-load validation to verify block discovery completeness

The dynamic MobileNetV5 block loading relies on exact tensor name matching with the Python converter. While the tensor name patterns are well-defined (e.g., v.blk.%d.%d.conv_exp.weight), there is no validation after the loading loop to detect if the expected blocks were actually found for each stage.

If tensor names from the converter don't match the C++ patterns, blocks will silently fail to load, potentially breaking inference. Add assertions or logging after line 1656 to verify:

  • Expected number of blocks were found for each stage
  • All 4 stages have at least one block (unless intentionally sparse)
  • Total blocks count is reasonable for the model architecture

Additionally, verify that the tensor name patterns in clip-impl.h (e.g., TN_MNV5_BLK_S0_EXP_W = "v.blk.%d.%d.conv_exp.weight") match the exact tensor names produced by your Python converter.

🤖 Prompt for AI Agents
In @tools/mtmd/clip.cpp around lines 1567 - 1659, Summary: Add post-load
validation to ensure MobileNetV5 blocks were actually discovered and tensor name
patterns align with the converter. After the per-stage loading loop, record
per-stage counts (e.g., add a local vector<int> stage_block_counts and increment
with blocks_found_in_stage inside the existing loop), then validate: assert
model.mobilenet_stage_ends.size() == 4 (or log error if not), verify each
stage_block_counts[stage] > 0 (log which stage is empty and bail), and check
total model.mobilenet_blocks.size() is within expected bounds (log actual vs
expected and abort on gross mismatch). Also emit a warning listing any missing
key tensor patterns (use TN_MNV5_BLK_S0_EXP_W, TN_MNV5_BLK_DW_START_W,
TN_MNV5_ATTN_Q_W, etc.) so mismatches with clip-impl.h / Python converter can be
diagnosed.

Comment on lines +3242 to +3247
case PROJECTOR_TYPE_GEMMA3NV:
{
// MobileNetV5 MSFA adapter always outputs fixed 16x16 resolution
// regardless of input size (see architecture description)
n_patches = ctx->model.hparams.image_size / ctx->model.hparams.patch_size;
} break;

⚠️ Potential issue | 🔴 Critical

Critical: Incorrect n_patches calculation

The comment states "MobileNetV5 MSFA adapter always outputs fixed 16×16 resolution" (256 tokens), but the code only divides once instead of squaring:

n_patches = ctx->model.hparams.image_size / ctx->model.hparams.patch_size;

Issue: This calculation is semantically incorrect and produces different results depending on the patch_size value:

  • If patch_size = 3 (current buggy Python converter): 768 / 3 = 256 ✓ (accidentally correct)
  • If patch_size = 48 (correct value after Python fix): 768 / 48 = 16 ✗ (should be 256)

This is directly related to the critical bug in the Python converter mentioned in the PR objectives, where patch_size is computed as image_size // image_seq_length instead of image_size // sqrt(image_seq_length).

🔧 Proposed fix

Option 1: Fixed constant (recommended if output is truly always 16×16)

 case PROJECTOR_TYPE_GEMMA3NV:
     {
         // MobileNetV5 MSFA adapter always outputs fixed 16x16 resolution
-        // regardless of input size (see architecture description)
-        n_patches = ctx->model.hparams.image_size / ctx->model.hparams.patch_size;
+        n_patches = 16 * 16; // Fixed 256 tokens output
     } break;

Option 2: Compute from hyperparameters (if variable resolution is possible)

 case PROJECTOR_TYPE_GEMMA3NV:
     {
-        // MobileNetV5 MSFA adapter always outputs fixed 16x16 resolution
-        // regardless of input size (see architecture description)
-        n_patches = ctx->model.hparams.image_size / ctx->model.hparams.patch_size;
+        // MobileNetV5 outputs a grid of patches per side
+        int patches_per_side = ctx->model.hparams.image_size / ctx->model.hparams.patch_size;
+        n_patches = patches_per_side * patches_per_side;
     } break;
🤖 Prompt for AI Agents
In @tools/mtmd/clip.cpp around lines 3242 - 3247, n_patches is computed
incorrectly for PROJECTOR_TYPE_GEMMA3NV: instead of producing 16x16 (=256)
tokens the code divides image_size by patch_size only once; update the logic in
the PROJECTOR_TYPE_GEMMA3NV branch (the block that assigns n_patches) to yield
the total number of patches, either by setting n_patches to the fixed constant
256 if the adapter truly always outputs 16x16, or by computing (image_size /
patch_size) squared (i.e. multiply the per-dimension count by itself) using
ctx->model.hparams.image_size and ctx->model.hparams.patch_size so the result
reflects total tokens correctly.
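The divergence between the two patch_size values, in numbers (Gemma3n: 768 px input, fixed 16x16 MSFA output):

```python
image_size = 768

# Single division, as in the current C++ branch:
assert image_size // 3 == 256    # buggy converter patch_size: right by accident
assert image_size // 48 == 16    # corrected patch_size: now 16, not 256

# Option 2 from the suggested fix: square the per-side count.
patches_per_side = image_size // 48
n_patches = patches_per_side * patches_per_side
print(n_patches)  # 256, matching the fixed 16x16 output
```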

Comment on lines +152 to +246
ggml_tensor * clip_graph_mobilenetv5::build_mobilenet_attn(ggml_tensor * inp, const mobilenetv5_block & block) {
    ggml_tensor * cur = inp;

    // Norm
    if (block.attn_norm_w) {
        cur = rms_norm_2d(cur, block.attn_norm_w, 1e-6f);
    }

    // 1. Q Calculation
    ggml_tensor * q = ggml_conv_2d_direct(ctx0, block.attn_q_w, cur, 1, 1, 0, 0, 1, 1);

    // 2. K Calculation (Downsampled)
    // Uses Conv2dSame(640, 640, kernel_size=(3, 3), stride=(2, 2), groups=640)
    ggml_tensor * k_inp = cur;
    if (block.attn_k_dw_w) {
        int k_size = block.attn_k_dw_w->ne[0]; // Usually 3
        k_inp = pad_same_2d(cur, k_size, k_size, 2, 2); // Apply SAME padding
        k_inp = ggml_conv_2d_dw(ctx0, block.attn_k_dw_w, k_inp, 2, 2, 0, 0, 1, 1); // padding=0
        if (block.attn_k_norm_w) {
            k_inp = rms_norm_2d(k_inp, block.attn_k_norm_w, 1e-6f);
        }
    }
    ggml_tensor * k = ggml_conv_2d_direct(ctx0, block.attn_k_w, k_inp, 1, 1, 0, 0, 1, 1);

    // 3. V Calculation (Downsampled)
    // Uses Conv2dSame(640, 640, kernel_size=(3, 3), stride=(2, 2), groups=640)
    ggml_tensor * v_inp = cur;
    if (block.attn_v_dw_w) {
        int v_size = block.attn_v_dw_w->ne[0]; // Usually 3
        v_inp = pad_same_2d(cur, v_size, v_size, 2, 2); // Apply SAME padding
        v_inp = ggml_conv_2d_dw(ctx0, block.attn_v_dw_w, v_inp, 2, 2, 0, 0, 1, 1); // padding=0
        if (block.attn_v_norm_w) {
            v_inp = rms_norm_2d(v_inp, block.attn_v_norm_w, 1e-6f);
        }
    }
    ggml_tensor * v = ggml_conv_2d_direct(ctx0, block.attn_v_w, v_inp, 1, 1, 0, 0, 1, 1);

    const int W = cur->ne[0]; const int H = cur->ne[1]; const int B = cur->ne[3];
    const int D = k->ne[2]; // Head dimension
    const int n_head = q->ne[2] / D;
    const int N = W * H;

    // Process Q: [W, H, D*n_head, B] -> [D, N, n_head, B]
    q = ggml_reshape_3d(ctx0, q, N, D*n_head, B);
    q = ggml_reshape_4d(ctx0, q, N, D, n_head, B);
    q = ggml_permute(ctx0, q, 1, 0, 2, 3); // [D, N, n_head, B]
    q = ggml_cont(ctx0, q);

    const int Wk = k->ne[0]; const int Hk = k->ne[1];
    const int M = Wk * Hk;

    // Process K: [Wk, Hk, D, B] -> [D, M, 1, B]
    k = ggml_reshape_3d(ctx0, k, M, D, B);
    k = ggml_reshape_4d(ctx0, k, M, D, 1, B);
    k = ggml_permute(ctx0, k, 1, 0, 2, 3); // [D, M, 1, B]
    k = ggml_cont(ctx0, k);

    // Process V: [Wk, Hk, D, B] -> [M, D, 1, B]
    v = ggml_reshape_3d(ctx0, v, M, D, B);
    v = ggml_reshape_4d(ctx0, v, M, D, 1, B);
    v = ggml_cont(ctx0, v); // [M, D, 1, B]

    // Multi-Query Attention
    float scale = 1.0f / sqrtf((float)D);

    // Step 1: Compute Q @ K.T
    ggml_tensor * scores = ggml_mul_mat(ctx0, k, q);

    scores = ggml_scale(ctx0, scores, scale);

    scores = ggml_soft_max(ctx0, scores);

    ggml_tensor * kqv = ggml_mul_mat(ctx0, v, scores);

    kqv = ggml_permute(ctx0, kqv, 1, 0, 2, 3);
    kqv = ggml_cont(ctx0, kqv);

    kqv = ggml_reshape_3d(ctx0, kqv, N, D * n_head, B);
    kqv = ggml_reshape_4d(ctx0, kqv, W, H, D * n_head, B);
    kqv = ggml_cont(ctx0, kqv);

    // Output projection
    cur = ggml_conv_2d_direct(ctx0, block.attn_o_w, kqv, 1, 1, 0, 0, 1, 1);

    // Residual & Layer Scale
    if (inp->ne[0] == cur->ne[0] && inp->ne[2] == cur->ne[2]) {
        if (block.layer_scale_w) {
            cur = ggml_mul(ctx0, cur, block.layer_scale_w);
        }
        cur = ggml_add(ctx0, cur, inp);
    }

    return cur;
}
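The pad_same_2d calls above mirror TensorFlow/timm Conv2dSame semantics for the stride-2, kernel-3 depthwise convolutions. The per-dimension arithmetic can be sketched as follows (a standalone sketch of the padding formula, not the actual helper):

```python
# SAME-padding arithmetic for one spatial dimension: pad so that the conv
# output has ceil(size / stride) elements; the extra pixel (if total padding
# is odd) goes after, matching TF/timm behavior.
def same_pad_1d(size: int, kernel: int, stride: int) -> tuple[int, int]:
    out = (size + stride - 1) // stride                  # ceil(size / stride)
    total = max((out - 1) * stride + kernel - size, 0)   # total padding needed
    return total // 2, total - total // 2                # (before, after)

# 16x16 feature map, 3x3 kernel, stride 2 -> 8x8 output with (0, 1) padding
print(same_pad_1d(16, 3, 2))  # (0, 1)
print(same_pad_1d(15, 3, 2))  # (1, 1)
```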

⚠️ Potential issue | 🟠 Major


Add divisibility assertion and height check for spatial residual.

The code is missing two critical guards:

  1. Line 191 divides by D without asserting divisibility: Add GGML_ASSERT(q->ne[2] % D == 0) (GGML_ASSERT is already used elsewhere in this file at lines 325, 376, 397).
  2. Line 238 residual check only validates width and channels but should also validate height to match the spatial dimension pattern used elsewhere in the file (see line 142: inp->ne[0] == cur->ne[0] && inp->ne[1] == cur->ne[1]).

Additionally, verify that ggml_mul_mat(ctx0, k, q) at line 218 and ggml_soft_max(ctx0, scores) at line 222 have the correct tensor orientation for the intended [M,N,n_head,B] scoring operation.

Proposed changes
     const int D = k->ne[2]; // Head dimension
     const int n_head = q->ne[2] / D;
+    GGML_ASSERT(D > 0);
+    GGML_ASSERT(q->ne[2] % D == 0);
-    if (inp->ne[0] == cur->ne[0] && inp->ne[2] == cur->ne[2]) {
+    if (inp->ne[0] == cur->ne[0] && inp->ne[1] == cur->ne[1] && inp->ne[2] == cur->ne[2]) {
         if (block.layer_scale_w) {
             cur = ggml_mul(ctx0, cur, block.layer_scale_w);
         }
         cur = ggml_add(ctx0, cur, inp);
     }
🤖 Prompt for AI Agents
In tools/mtmd/models/mobilenetv5.cpp around lines 152 - 246, in
build_mobilenet_attn add a divisibility assert before computing n_head: insert
GGML_ASSERT(q->ne[2] % D == 0) to guarantee q->ne[2] is a multiple of D. Extend
the spatial residual check to include height (require inp->ne[1] == cur->ne[1]
alongside inp->ne[0] and inp->ne[2]) so the residual is only applied when W, H
and channels all match. Also verify the shapes passed to ggml_mul_mat(ctx0, k,
q) and the subsequent ggml_soft_max(ctx0, scores): with k shaped [D, M, 1, B]
and q shaped [D, N, n_head, B], the matmul should broadcast over heads and
produce scores of shape [M, N, n_head, B], with the softmax taken over the M
key positions before the final matmul with v.
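The shape flow the review describes can be checked outside ggml with a plain NumPy sketch (all sizes are illustrative; the single shared K/V head broadcast across query heads is what makes this multi-query attention):

```python
import numpy as np

# head dim, query tokens, downsampled KV tokens, heads, batch
D, N, M, n_head, B = 64, 256, 64, 4, 1

q = np.random.randn(B, n_head, N, D)
k = np.random.randn(B, 1, M, D)  # one shared key head, broadcast over n_head
v = np.random.randn(B, 1, M, D)  # one shared value head

scores = (q @ k.transpose(0, 1, 3, 2)) / np.sqrt(D)   # [B, n_head, N, M]
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)        # softmax over M key positions
out = weights @ v                                     # [B, n_head, N, D]

print(scores.shape, out.shape)
```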

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
gguf-py/gguf/constants.py (1)

463-474: Critical: Missing VISION_PROJECTOR_TYPE_NAMES entry for GEMMA3N.

A new VISION_PROJECTOR_TYPE.GEMMA3N enum value is added at line 470, but the corresponding entry is missing from the VISION_PROJECTOR_TYPE_NAMES dictionary (lines 883-891). This will cause a KeyError when code attempts to look up the string name for this projector type.

🔧 Proposed fix

Add the missing mapping to VISION_PROJECTOR_TYPE_NAMES:

 VISION_PROJECTOR_TYPE_NAMES: dict[VISION_PROJECTOR_TYPE, str] = {
     VISION_PROJECTOR_TYPE.MLP:       "mlp",
     VISION_PROJECTOR_TYPE.LDP:       "ldp",
     VISION_PROJECTOR_TYPE.LDPV2:     "ldpv2",
     VISION_PROJECTOR_TYPE.RESAMPLER: "resampler",
     VISION_PROJECTOR_TYPE.GLM_EDGE:  "adapter",
     VISION_PROJECTOR_TYPE.MERGER:    "qwen2vl_merger",
+    VISION_PROJECTOR_TYPE.GEMMA3N:   "gemma3n",
     VISION_PROJECTOR_TYPE.GEMMA3:    "gemma3",
 }

Note: The string value "gemma3n" should match the projector type identifier used in the converter and C++ code.

Also applies to: 883-891
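The failure mode is the usual one for a parallel enum/name table; a minimal sketch (simplified stand-in names, not the real gguf-py classes):

```python
from enum import IntEnum

class VisionProjectorType(IntEnum):  # simplified stand-in for gguf's enum
    MLP = 0
    GEMMA3 = 1
    GEMMA3N = 2  # newly added member

NAMES = {
    VisionProjectorType.MLP: "mlp",
    VisionProjectorType.GEMMA3: "gemma3",
    # GEMMA3N missing here -> any name lookup for it raises KeyError
}

missing = VisionProjectorType.GEMMA3N not in NAMES
NAMES[VisionProjectorType.GEMMA3N] = "gemma3n"  # the proposed fix
print(missing, NAMES[VisionProjectorType.GEMMA3N])  # True gemma3n
```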

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8f6dbbe and 60c23c9.

📒 Files selected for processing (1)
  • gguf-py/gguf/constants.py
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".

When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.

Applied to files:

  • gguf-py/gguf/constants.py
