[Mirror] Add Gemma3n multimodal support with MobileNetV5 vision encoder #64
ngxson wants to merge 22 commits into ngxson:master
Conversation
…ert_hf_to_gguf.py. Add gemma3n to vision projectors in gguf-py/gguf/constants.py.
2. Use available tensor mapping logic 3. Remove redundant chat template replacement of soft tokens placeholder with media placeholder
…struct and definitions to mobilenetv5.cpp 2. Remove unused `clip_is_gemma3n` func declarations and definitions 3. Remove redundant `rescale_image_u8_to_f32` func and use `normalize_image_u8_to_f32` with zero mean and unit std 4. Calculate n_patches using image_size / patch_size
📝 Walkthrough
Adds Gemma3n (MobileNetV5) vision/audio multimodal support across the converter, GGUF constants/mappings, and mtmd C++ toolchain: new tensor mappings, converter model classes/overrides, a MobileNetV5 graph implementation, loader wiring, and runtime input/embedding handling.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant HF as HuggingFace Model
    participant Converter as Python Converter
    participant GGUF as GGUF Writer / Mappings
    participant Loader as mtmd Loader (C++)
    participant Graph as clip_graph_mobilenetv5
    participant Runtime as Inference Runtime
    HF->>Converter: export tensors & hparams
    Converter->>GGUF: map tensor names (custom_map / block_tensor_mapping), set GGUF params
    Converter-->>Loader: write GGUF with Gemma3n vision/audio tensors
    Loader->>Graph: select clip_graph_mobilenetv5 (PROJECTOR_TYPE_GEMMA3N/GEMMA3NV)
    Loader->>Graph: load mobilenetv5 tensors (stem, blocks, MSFA)
    Graph->>Runtime: build graph (stem → stages → MSFA → embed proj)
    Runtime->>Runtime: preprocess image/audio → encode tokens → run graph → produce multimodal tokens
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (warnings)
📜 Recent review details: Configuration used: defaults | Review profile: CHILL | Plan: Pro | 📒 Files selected for processing (1)
🧰 Additional context used: 🧠 Learnings (2): 📓 Common learnings; 📚 Learning: 2025-05-26T09:45:20.653Z (applied to files)
⏰ Context from checks skipped due to timeout of 90000ms (20). You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms).
Actionable comments posted: 5
🧹 Nitpick comments (6)
tools/mtmd/models/models.h (1)

69-101: LGTM - MobileNetV5 graph builder declaration follows established patterns. The `clip_graph_mobilenetv5` struct correctly follows the existing pattern of other graph implementations in this file. The helper method declarations align with their implementations in `mobilenetv5.cpp` (per the code snippets). The `mobilenetv5_block` type reference should resolve correctly via the include chain (`clip-graph.h` → `clip-model.h`).

Optional: Consider making helper methods private. The helper methods (`rms_norm_2d`, `pad_same_2d`, `build_edge_residual`, `build_inverted_residual`, `build_mobilenet_attn`) are implementation details that could be declared as `private`. However, since other graph implementations in this file also use public methods, this is optional and maintaining consistency with the existing codebase pattern is reasonable.

tools/mtmd/clip-model.h (1)
330-345: Appropriate additions for Gemma3n MobileNetV5 encoder. The additions to `clip_model` are well-structured:

- MobileNetV5 components use `std::vector` for flexible block management
- MSFA (Multi-Scale Fusion Adapter) components are properly prefixed and organized
- Naming conventions are consistent with existing fields

Note: There's an extra blank line at line 346, which may be intentional for readability but could be removed for consistency.
tools/mtmd/clip.cpp (2)

1619-1622: Use tensor name macros instead of hard-coded strings. For consistency with the rest of the codebase and maintainability, consider defining macros for these tensor names in `clip-impl.h`:

```diff
- model.mm_0_w = get_tensor("mm.embedding.weight", false);     // Input embedding
- model.mm_1_w = get_tensor("mm.hard_emb_norm.weight", false); // Hard embedding norm
+ model.mm_0_w = get_tensor(TN_MM_EMBEDDING, false);     // Input embedding
+ model.mm_1_w = get_tensor(TN_MM_HARD_EMB_NORM, false); // Hard embedding norm
```

This would require adding corresponding macros to `clip-impl.h`.
1528-1623: Consider adding bounds checking for stage index access. The dynamic block discovery loop iterates over 4 stages (0-3), but the `mobilenet_stage_ends` vector is populated based on actual blocks found. When accessing `model.mobilenet_stage_ends[2]` and `[3]` later in `mobilenetv5.cpp::is_fusion_point()`, ensure bounds checks are in place to avoid undefined behavior if fewer than 4 stages contain blocks.

The current check at line 284-288 (`if (model.mobilenet_stage_ends.size() >= 4)`) in `mobilenetv5.cpp` handles this, but consider adding a validation after loading:

```cpp
if (model.mobilenet_stage_ends.size() < 4) {
    LOG_WRN("%s: GEMMA3N expected 4 stages but found %zu\n", __func__, model.mobilenet_stage_ends.size());
}
```

tools/mtmd/models/mobilenetv5.cpp (2)
329-336: Remove or use the commented-out variable `scale_h`. The variable `scale_h` is declared but commented out. Either use it for validation or remove it to avoid confusion:

```diff
  int scale_w = high_res_w / feat_w;
- // int scale_h = high_res_h / feat_h;
```

If height scaling should be validated separately:

```cpp
int scale_h = high_res_h / feat_h;
if (scale_w != scale_h) {
    LOG_WRN("%s: non-uniform scaling in MSFA (scale_w=%d, scale_h=%d)\n", __func__, scale_w, scale_h);
}
```
381-381: Consider making target output resolution configurable. The target output resolution is hard-coded as 16:

```cpp
const int target_out_res = 16;
```

Consider making this a model hyperparameter or deriving it from the model configuration to improve flexibility for future MobileNetV5 variants.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)

- convert_hf_to_gguf.py
- gguf-py/gguf/constants.py
- gguf-py/gguf/tensor_mapping.py
- src/models/gemma3n-iswa.cpp
- tools/mtmd/CMakeLists.txt
- tools/mtmd/clip-impl.h
- tools/mtmd/clip-model.h
- tools/mtmd/clip.cpp
- tools/mtmd/models/mobilenetv5.cpp
- tools/mtmd/models/models.h
- tools/mtmd/mtmd.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{cpp,h,hpp}: Always format C++ code using `git clang-format` before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files:

- tools/mtmd/models/models.h
- tools/mtmd/clip-model.h
- tools/mtmd/clip.cpp
- tools/mtmd/models/mobilenetv5.cpp
- src/models/gemma3n-iswa.cpp
- tools/mtmd/mtmd.cpp
- tools/mtmd/clip-impl.h
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:

- tools/mtmd/models/models.h
- tools/mtmd/clip-model.h
- tools/mtmd/clip.cpp
- tools/mtmd/models/mobilenetv5.cpp
- gguf-py/gguf/tensor_mapping.py
- src/models/gemma3n-iswa.cpp
- tools/mtmd/mtmd.cpp
- tools/mtmd/clip-impl.h
- convert_hf_to_gguf.py
- gguf-py/gguf/constants.py
**/*.py
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.py: Always activate the Python virtual environment in `.venv` and use tools from that environment for Python development
Ensure Python code meets flake8 linting standards with max-line-length=125 as configured in `.flake8`
Ensure Python code passes pyright type checking as configured in `pyrightconfig.json`
Files:

- gguf-py/gguf/tensor_mapping.py
- convert_hf_to_gguf.py
- gguf-py/gguf/constants.py
src/**/*.cpp
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Prioritize performance optimization in core library implementations in `src/`, as this is a performance-critical inference library
Files:
src/models/gemma3n-iswa.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
Applied to files:
- tools/mtmd/clip.cpp
- tools/mtmd/mtmd.cpp
- tools/mtmd/clip-impl.h
- gguf-py/gguf/constants.py
🧬 Code graph analysis (6)
tools/mtmd/models/models.h (1)
tools/mtmd/models/mobilenetv5.cpp (12)
build (252-463), build (252-252), rms_norm_2d (5-20), rms_norm_2d (5-5), pad_same_2d (23-53), pad_same_2d (23-23), build_edge_residual (57-88), build_edge_residual (57-57), build_inverted_residual (90-151), build_inverted_residual (90-90), build_mobilenet_attn (154-250), build_mobilenet_attn (154-154)
tools/mtmd/clip.cpp (3)
common/common.cpp (4)
model (1159-1161), model (1159-1159), string_format (399-412), string_format (399-399)

src/llama-model.cpp (2)

get_tensor (7044-7054), get_tensor (7044-7044)

tools/server/server-context.cpp (2)

params (607-853), params (607-607)
tools/mtmd/models/mobilenetv5.cpp (2)
ggml/src/ggml.c (13)
ggml_permute (3700-3752), ggml_rms_norm (3066-3071), ggml_pad_ext (4983-5016), ggml_conv_2d_direct (4702-4736), ggml_gelu (2677-2681), ggml_conv_2d_dw (4637-4658), ggml_reshape_4d (3583-3601), ggml_reshape_3d (3564-3581), ggml_scale (3290-3295), ggml_soft_max (3966-3970), ggml_upscale (4928-4935), ggml_concat (2517-2544), ggml_pool_2d (4852-4878)

tools/mtmd/clip.cpp (9)

build_inp_raw (469-474), build_inp_raw (469-469), model (217-219), model (935-1261), model (935-935), model (2038-2051), model (2038-2038), s (2446-2448), s (2446-2446)
gguf-py/gguf/tensor_mapping.py (1)
gguf-py/gguf/constants.py (1)
MODEL_TENSOR (465-736)
tools/mtmd/mtmd.cpp (1)
tools/mtmd/clip.cpp (4)
ctx (2490-2593), ctx (2490-2490), clip_get_projector_type (3737-3739), clip_get_projector_type (3737-3737)
convert_hf_to_gguf.py (2)
gguf-py/gguf/gguf_writer.py (2)
add_clip_projector_type (1076-1077), add_vision_attention_layernorm_eps (1097-1098)

gguf-py/gguf/constants.py (2)

VisionProjectorType (3408-3429), GGMLQuantizationType (3283-3315)
🪛 GitHub Actions: flake8 Lint
convert_hf_to_gguf.py
[error] 5974-5974: flake8: E302 expected 2 blank lines, found 1. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8
[error] 6010-6010: flake8: E202 whitespace before ']'. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8
[error] 6011-6011: flake8: E202 whitespace before ']'. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8
[error] 6045-6045: flake8: W504 line break after binary operator. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8
🪛 Ruff (0.14.10)
convert_hf_to_gguf.py
5969-5969: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: android-ndk-build (arm64-snapdragon, --preset arm64-android-snapdragon-release)
- GitHub Check: windows-msys2 (CLANG64, clang-x86_64, Release)
- GitHub Check: android-ndk-build (arm64-cpu, -D ANDROID_ABI=arm64-v8a -D ANDROID_PLATFORM=android-31 -D CMAKE_TO...
- GitHub Check: windows-latest-cmake (llvm-arm64-opencl-adreno, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN...
- GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
- GitHub Check: build-linux-cross / debian-13-loongarch64-vulkan-cross
- GitHub Check: windows-latest-cmake (llvm-arm64, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/ar...
- GitHub Check: windows-latest-cmake (openblas-x64, x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x6...
- GitHub Check: ubuntu-latest-cmake-cuda
- GitHub Check: windows-latest-cmake (cpu-x64 (static), x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmak...
- GitHub Check: windows-latest-cmake (vulkan-x64, x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD...
- GitHub Check: android-build
- GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
- GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
- GitHub Check: ubuntu-latest-llguidance
- GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
- GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: ubuntu-latest-cmake-rpc
- GitHub Check: pyright type-check
🔇 Additional comments (20)
src/models/gemma3n-iswa.cpp (1)
262-306: Vision input path implementation looks correct, but verify commented-out scaling. The implementation correctly broadcasts the padding token embedding (token ID 0) across all vision tokens, matching the PyTorch behavior described in comments. The approach of creating zeros via `ggml_scale(..., 0.0f)` is valid.

A few observations:

- Line 305: The `sqrtf((float) n_embd_altup)` scaling is commented out for vision inputs, while it's applied for text inputs (line 259). Please confirm this difference is intentional per the model specification.
- The `inp->embd` tensor is allocated and marked as input (lines 270-271) but isn't used in the subsequent computation - the zeros are created from the `per_layer_model_proj` projection instead. This appears intentional as the embeddings will be fed separately, but worth confirming the graph input wiring is correct.

tools/mtmd/mtmd.cpp (2)
269-272: LGTM - GEMMA3N correctly inherits GEMMA3's image token handling. The addition of `PROJECTOR_TYPE_GEMMA3N` alongside `PROJECTOR_TYPE_GEMMA3` correctly sets up the same `<start_of_image>` and `<end_of_image>` tokens for the Gemma3n vision path.

861-866: LGTM - Non-causal decode handling extended to GEMMA3N. The logic correctly includes `PROJECTOR_TYPE_GEMMA3N` in the non-causal decoding path, maintaining parity with GEMMA3.

Minor observation: Consider extracting the repeated `clip_get_projector_type(ctx->ctx_v)` call to a local variable for readability, though this is optional given the function is lightweight.

tools/mtmd/CMakeLists.txt (1)
30-30: LGTM - MobileNetV5 source file added to build. The new `models/mobilenetv5.cpp` is correctly included in the mtmd library sources. This enables the MobileNetV5-based graph construction for Gemma3n vision support.

gguf-py/gguf/tensor_mapping.py (1)
122-142: LGTM - New Gemma3n vision tensor mappings added. The new tensor mappings for MobileNetV5-based Gemma3n vision support are correctly structured and follow the existing pattern. The mappings align with the `MODEL_TENSOR` enums defined in `constants.py`.

Note: The comments label these as "gemma3n", which is accurate for `V_MM_EMBEDDING`, `V_MM_HARD_EMB_NORM`, and `V_MM_POST_PROJ_NORM`. For `V_MM_INP_PROJ` and `V_MM_SOFT_EMB_NORM`, the constants.py comments indicate "gemma3" but this appears to be reusing existing tensor types with new HuggingFace tensor name mappings for gemma3n, which is a valid pattern.

tools/mtmd/clip-model.h (1)
175-212: Well-structured MobileNetV5 block definition. The `mobilenetv5_block` struct is comprehensive and well-organized, covering all necessary components for the Gemma3n vision encoder:

- Stage 0 (Edge Residual) and Stage 1+ (Universal Inverted Residual) convolutions with batch normalization
- Multi-Query Attention (MQA) components with optional downsampling
- Layer scale and block normalization

The struct follows the existing naming conventions and patterns in the file.
gguf-py/gguf/constants.py (3)
670-672: LGTM: New gemma3n tensor types properly defined. The three new tensor types (`V_MM_EMBEDDING`, `V_MM_HARD_EMB_NORM`, `V_MM_POST_PROJ_NORM`) are:

- Properly prefixed with `V_MM_` following the existing naming convention
- Clearly documented as gemma3n-specific
- Correctly placed within the MODEL_TENSOR enum

1065-1067: Correct tensor mappings and MMPROJ integration. The tensor name mappings and MMPROJ architecture additions are properly implemented:

- String names follow the `mm.*` convention used for multimodal tensors
- Tensors are correctly added to the `MODEL_TENSORS[MODEL_ARCH.MMPROJ]` list
- Consistent with existing patterns in the file

Also applies to: 1166-1168

1947-1981: Complete GEMMA3N architecture tensor list. The `MODEL_ARCH.GEMMA3N` tensor list is comprehensive and well-organized:

- Includes all standard Gemma3 tensors (token embedding, attention, FFN)
- Properly extends with gemma3n-specific components:
  - Per-layer tensors (`PER_LAYER_TOKEN_EMBD`, `PER_LAYER_MODEL_PROJ`, etc.)
  - Altup tensors for alternative upsampling/routing
  - Laurel tensors for layer-wise processing
- Comments clearly indicate the purpose of specialized tensor groups

This ensures proper serialization and deserialization of Gemma3n models.
tools/mtmd/clip.cpp (4)
3128-3133: Potential issue with n_patches calculation for GEMMA3N. The calculation `n_patches = image_size / patch_size` computes a single dimension (e.g., 16 for 256/16), but `n_patches` typically represents the total number of patches (i.e., `(image_size / patch_size)^2`). The comment says "MSFA adapter always outputs fixed 16x16 resolution", which suggests the result should be 256, not 16. If the output is indeed 16x16, the calculation should be:

```diff
- n_patches = ctx->model.hparams.image_size / ctx->model.hparams.patch_size;
+ int n_per_side = ctx->model.hparams.image_size / ctx->model.hparams.patch_size;
+ n_patches = n_per_side * n_per_side;
```

If this is intentional (e.g., the MSFA outputs a single row of 16 tokens), please add a clarifying comment.
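To make the two quantities concrete, a quick arithmetic sketch (the 256/16 figures are the review's illustrative values, not numbers read from any model file):

```python
# Per-side patch count vs. total patch count, the distinction the review draws.
image_size = 256   # illustrative
patch_size = 16    # illustrative

n_per_side = image_size // patch_size  # what the current code assigns to n_patches: 16
n_patches = n_per_side * n_per_side    # the usual meaning of n_patches: 256 (a 16x16 grid)
```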
791-794: LGTM: GEMMA3N graph builder integration. The new case for `PROJECTOR_TYPE_GEMMA3N` correctly uses the dedicated `clip_graph_mobilenetv5` builder, consistent with how other projector types are handled.

1148-1155: LGTM: GEMMA3N hparams configuration. The default `n_merge = 1` with an optional override via `KEY_PROJ_SCALE_FACTOR` follows the established pattern for other projector types.

2859-2867: LGTM: GEMMA3N preprocessing. The preprocessing correctly resizes the image to a square using bilinear interpolation without padding (`false` parameter), then normalizes using the configured mean/std values, matching the expected MobileNetV5 input format.

tools/mtmd/models/mobilenetv5.cpp (4)
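As a rough sketch of that normalization step, assuming per-channel mean/std applied after scaling to [0, 1] (the helper name and the 0.5 values below are illustrative, not the actual clip.cpp implementation):

```python
import numpy as np

def normalize_image_u8(img_u8, mean, std):
    # img_u8: (H, W, 3) uint8; mean/std: per-channel floats.
    # mean=0.5, std=0.5 maps [0, 255] to roughly [-1, 1].
    x = img_u8.astype(np.float32) / 255.0
    return (x - mean) / std

img = np.full((4, 4, 3), 255, dtype=np.uint8)
out = normalize_image_u8(img,
                         np.array([0.5] * 3, dtype=np.float32),
                         np.array([0.5] * 3, dtype=np.float32))
```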
5-20: LGTM: RMS Norm 2D helper implementation. The `rms_norm_2d` helper correctly permutes the tensor to normalize over channels for each spatial position, applies the standard RMSNorm operation, and optionally applies the learned weight before permuting back. The use of `ggml_cont` after permute ensures the tensor is contiguous for subsequent operations.
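A minimal numpy sketch of the same idea, normalizing over the channel axis at each spatial position (shapes and the eps value are illustrative, not the ggml implementation):

```python
import numpy as np

def rms_norm_2d(x, weight=None, eps=1e-6):
    # x: (C, H, W). Normalize over channels for every spatial position,
    # mirroring the permute -> rms_norm -> permute-back pattern described above.
    rms = np.sqrt(np.mean(x * x, axis=0, keepdims=True) + eps)
    y = x / rms
    if weight is not None:
        y = y * weight[:, None, None]  # optional learned per-channel scale
    return y

x = np.random.default_rng(0).standard_normal((8, 4, 4)).astype(np.float32)
y = rms_norm_2d(x)
```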
22-53: LGTM: SAME padding implementation. The `pad_same_2d` helper correctly implements TensorFlow/PyTorch-style asymmetric SAME padding. The ceiling division for output size and the asymmetric split of padding (bottom/right gets the extra pixel) matches the expected behavior.
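The padding math can be sketched in isolation; this is a generic TF-style SAME-padding helper for one dimension, not the `pad_same_2d` implementation itself:

```python
import math

def same_pad_1d(in_size, kernel, stride, dilation=1):
    # SAME padding along one dimension: output size = ceil(in / stride).
    # When the total padding is odd, the extra pixel goes to the bottom/right.
    out_size = math.ceil(in_size / stride)
    total = max((out_size - 1) * stride + (kernel - 1) * dilation + 1 - in_size, 0)
    before = total // 2
    return before, total - before
```

For an 8-wide input with a 3x3 kernel and stride 2, the total padding is 1 and all of it lands on the "after" side, which is exactly the asymmetry the review points out.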
153-250: LGTM: Attention block implementation. The `build_mobilenet_attn` function correctly implements multi-query attention with:
- Optional input normalization
- Downsampled K/V paths using depthwise convolutions
- Proper Q/K/V reshaping and permutation for attention
- Scaled dot-product attention with softmax
- Output projection with optional layer scaling and residual connection
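The core multi-query pattern in that list can be sketched in a few lines of numpy; this omits the downsampled K/V paths, convolutions, and projections, and all shapes are illustrative:

```python
import numpy as np

def multi_query_attention(q, k, v):
    # q: (n_heads, T, d); k, v: (T, d) shared by all heads - the defining MQA trick.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                  # (n_heads, T, T)
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v                                   # (n_heads, T, d)

rng = np.random.default_rng(0)
out = multi_query_attention(rng.standard_normal((8, 16, 32)),
                            rng.standard_normal((16, 32)),
                            rng.standard_normal((16, 32)))
```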
403-463: LGTM: Gemma3n multimodal projection. The embedding/projection logic correctly:
- Permutes and flattens spatial dimensions to sequence format
- Applies feature scaling by sqrt(hidden_size)
- Applies soft embedding normalization with optional learned weight
- Projects to text hidden size via linear layer
- Applies post-projection RMSNorm
This matches the expected Gemma3n vision embedder architecture.
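The steps above can be sketched end-to-end in numpy; every weight here is a random placeholder (not a real Gemma3n tensor), and the post-projection norm is shown without a learned weight for brevity:

```python
import numpy as np

def gemma3n_style_project(feat, w_proj, soft_norm_w, eps=1e-6):
    # feat: (C, H, W) encoder output; w_proj: (C, text_hidden).
    c, h, w = feat.shape
    seq = feat.reshape(c, h * w).T                 # flatten spatial dims -> (T, C)
    seq = seq * np.sqrt(float(c))                  # feature scaling by sqrt(hidden_size)
    rms = np.sqrt((seq * seq).mean(-1, keepdims=True) + eps)
    seq = seq / rms * soft_norm_w                  # soft embedding RMS norm
    out = seq @ w_proj                             # linear projection to text hidden size
    rms = np.sqrt((out * out).mean(-1, keepdims=True) + eps)
    return out / rms                               # post-projection RMS norm

rng = np.random.default_rng(0)
tokens = gemma3n_style_project(rng.standard_normal((2048, 16, 16)),
                               rng.standard_normal((2048, 256)) * 0.02,
                               np.ones(2048))
```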
tools/mtmd/clip-impl.h (2)
156-195: LGTM: MobileNetV5 tensor name macros. The tensor name macros are well-organized by component (stem, edge residual, inverted residual, attention, MSFA) and follow the established naming conventions. The `%d.%d` format for stage/block indexing aligns with the dynamic loading logic in `clip.cpp`.

214-214: LGTM: GEMMA3N projector type registration. The new `PROJECTOR_TYPE_GEMMA3N` enum value and its string mapping `"gemma3n"` are correctly placed and follow the existing pattern. Also applies to: 245-245
convert_hf_to_gguf.py (1)
522-527: Robust handling of empty `tensor_map.mapping` for block_count=0 looks good. Using a guarded branch for `max_name_len` avoids a `ValueError` when `gguf.get_tensor_name_map(..., block_count=0)` produces an empty mapping (e.g., MobileNetV5-based encoders) and only changes log formatting width. No further changes needed here.
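The guarded pattern the review refers to, in isolation (the helper name is hypothetical; the converter inlines this logic):

```python
def safe_max_name_len(names):
    # max() over an empty iterable raises ValueError, so fall back to 0
    # when there are no tensor names (e.g. block_count == 0).
    return max(len(n) for n in names) if names else 0
```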
```python
    RESAMPLER = auto()
    GLM_EDGE = auto()
    MERGER = auto()
    GEMMA3N = auto()
```
Add GEMMA3N mapping to VISION_PROJECTOR_TYPE_NAMES.
The GEMMA3N entry was added to VISION_PROJECTOR_TYPE enum but is missing from the VISION_PROJECTOR_TYPE_NAMES dictionary at lines 850-858. This mapping is used to convert the enum value to its string representation.
🔎 Proposed fix
Add the mapping to the VISION_PROJECTOR_TYPE_NAMES dictionary:
```diff
 VISION_PROJECTOR_TYPE_NAMES: dict[VISION_PROJECTOR_TYPE, str] = {
     VISION_PROJECTOR_TYPE.MLP: "mlp",
     VISION_PROJECTOR_TYPE.LDP: "ldp",
     VISION_PROJECTOR_TYPE.LDPV2: "ldpv2",
     VISION_PROJECTOR_TYPE.RESAMPLER: "resampler",
     VISION_PROJECTOR_TYPE.GLM_EDGE: "adapter",
     VISION_PROJECTOR_TYPE.MERGER: "qwen2vl_merger",
     VISION_PROJECTOR_TYPE.GEMMA3: "gemma3",
+    VISION_PROJECTOR_TYPE.GEMMA3N: "gemma3n",
+    VISION_PROJECTOR_TYPE.QWEN3VL: "qwen3vl_merger",
+    VISION_PROJECTOR_TYPE.COGVLM: "cogvlm",
 }
```

Note: QWEN3VL and COGVLM are also missing from this dictionary.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    GEMMA3N = auto()

VISION_PROJECTOR_TYPE_NAMES: dict[VISION_PROJECTOR_TYPE, str] = {
    VISION_PROJECTOR_TYPE.MLP: "mlp",
    VISION_PROJECTOR_TYPE.LDP: "ldp",
    VISION_PROJECTOR_TYPE.LDPV2: "ldpv2",
    VISION_PROJECTOR_TYPE.RESAMPLER: "resampler",
    VISION_PROJECTOR_TYPE.GLM_EDGE: "adapter",
    VISION_PROJECTOR_TYPE.MERGER: "qwen2vl_merger",
    VISION_PROJECTOR_TYPE.GEMMA3: "gemma3",
    VISION_PROJECTOR_TYPE.GEMMA3N: "gemma3n",
    VISION_PROJECTOR_TYPE.QWEN3VL: "qwen3vl_merger",
    VISION_PROJECTOR_TYPE.COGVLM: "cogvlm",
}
```
🧩 Analysis chain

@ngxson, I'll save these preferences for mirrored PRs and apply them to this review. Let me fetch the upstream PR details and conduct a thorough review.

@ngxson, I've completed a thorough review of this mirrored PR from simrnsingh. Here are my findings:

✅ Good News

No suspicious AI-generated code comments were found (no "CRITICAL:", "You are correct here...", etc.). The contributor properly disclosed Claude Code usage for exploration and boilerplates.

🐛 Issues Found

1. Critical: Incorrect patch_size calculation (convert_hf_to_gguf.py, line 6011)

```python
# Current code:
image_seq_length = 256  # 16x16 grid
image_size = 768
patch_size = image_size // image_seq_length  # = 768 // 256 = 3
```

Problem: a `patch_size` of 3 implies a 256x256 grid rather than the expected 16x16. Should be:

```python
n_per_side = int(image_seq_length ** 0.5)  # sqrt(256) = 16
patch_size = image_size // n_per_side      # = 768 // 16 = 48
```

This matches how Gemma3 (non-N) calculates it at line 5927, and is the convention the C++ code at clip.cpp:3132 expects.

2. Hardcoded magic values (convert_hf_to_gguf.py, lines 5990-5995)

```python
def find_vparam(self, keys: list[str], optional: bool = False) -> Any:
    if "intermediate_size" in keys:
        hidden_size = self.hparams_vision.get("hidden_size", 2048)  # Hardcoded default
        return hidden_size * 4  # Hardcoded expansion factor
    if "num_attention_heads" in keys or "num_heads" in keys:
        return 8  # Hardcoded MQA heads
```

These should ideally be read from the vision config if available, rather than hardcoded. While they may be correct for the current Gemma3n model, they're not future-proof.

3. Potential unsafe tensor dimension access (mobilenetv5.cpp, lines 138, 243)

```cpp
if (block.layer_scale_w) {
    ggml_tensor * scale_w_reshaped = ggml_reshape_4d(ctx0, block.layer_scale_w,
        1, 1, block.layer_scale_w->ne[0], 1);  // Safe: inside if-block
```

This is actually safe due to the if-check, but similar patterns exist that should be verified.

4. Hardcoded assumption (mobilenetv5.cpp, line 407)

```cpp
int C = cur->ne[2];  // Should be 2048
```

This comment reveals a hardcoded assumption about channel count that might not hold for all model variants.

5. Minor: Commented-out code (mobilenetv5.cpp, line 331)

```cpp
int scale_w = high_res_w / feat_w;
// int scale_h = high_res_h / feat_h;  // Unused
```

The contributor left commented-out code that should probably be removed.

📝 Recommendations

The patch_size calculation bug should be fixed before merging. The other issues are less critical but should be addressed for code quality.
…cit mapping: Custom map for double indexed blocks and tensor_mapping.py for rest
- convert_hf_to_gguf.py: Unsqueeze Stem Bias and Layer scale tensors to correct shape while converting to gguf
- mobilenetv5.cpp: Remove explicit reshaping of Stem Bias and Layer scale which are now handled while converting to gguf, replace fprintf with LOG_*
- clip.cpp: Remove unused embedding and hard_emb_norm tensor loading
Actionable comments posted: 2
♻️ Duplicate comments (2)
gguf-py/gguf/constants.py (1)
857-865: [Duplicate] Missing VISION_PROJECTOR_TYPE_NAMES mappings for GEMMA3N, QWEN3VL, and COGVLM. This issue was already flagged in previous reviews. The `VISION_PROJECTOR_TYPE_NAMES` dictionary is missing mappings for `GEMMA3N` (line 459), `QWEN3VL` (line 461), and `COGVLM` (line 462) enum entries. This will cause a `KeyError` if code attempts to convert these enum values to their string representations using this dictionary. Based on previous review, the fix should add:

```python
VISION_PROJECTOR_TYPE.GEMMA3N: "gemma3n",
VISION_PROJECTOR_TYPE.QWEN3VL: "qwen3vl_merger",
VISION_PROJECTOR_TYPE.COGVLM: "cogvlm",
```

convert_hf_to_gguf.py (1)
5966-6091: Fix Gemma3n patch_size computation and note minor cleanups. There are a few points here:

1. Patch size computation for Gemma3n MobileNetV5 is still wrong

In `Gemma3nVisionModel.set_gguf_parameters`:

```python
image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
image_size = self.hparams_vision["image_size"]
self.hparams_vision["patch_size"] = image_size // image_seq_length
```

With the default Gemma3n setup (768×768, `image_seq_length = 256`), this yields `patch_size = 3`, which implies a 256×256 grid and 65,536 patches, while the comment explicitly states "256 tokens = 16×16". Patch size should be derived from tokens per side, not total tokens.

Recommended fix (same issue as previously flagged in earlier review; applying it here for the new MobileNetV5 path as well):

Proposed fix for patch_size in Gemma3nVisionModel

```diff
-        # Image sequence length (256 tokens = 16x16 for Gemma3n)
-        image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
-        image_size = self.hparams_vision["image_size"]
-        self.hparams_vision["patch_size"] = image_size // image_seq_length
+        # Image sequence length (e.g. 256 tokens = 16x16 grid for Gemma3n)
+        image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
+        image_size = self.hparams_vision["image_size"]
+        # Derive patch size from patches-per-side, not total token count
+        n_per_side = int(image_seq_length ** 0.5)
+        if n_per_side * n_per_side != image_seq_length:
+            raise ValueError(
+                f"image_seq_length={image_seq_length} is not a perfect square; "
+                "cannot infer square patch grid for Gemma3n vision encoder")
+        self.hparams_vision["patch_size"] = image_size // n_per_side
```

This matches the intended 16×16 grid for 256 tokens and keeps `patch_size` consistent with how other vision encoders in this file derive it.

2. Vocab / embedding handling for Gemma3n text model is a solid improvement

- `Gemma3NModel.set_vocab` temporarily removes `vocab_size_per_layer_input` so `_create_vocab_sentencepiece()` uses the full `vocab_size` (including the vision/audio special tokens) and then restores it.
- The new `modify_tensors` branch pads `embed_tokens.weight` and per-layer embeddings up to `vocab_size`, instead of truncating to `vocab_size_per_layer_input`, which is required for multimodal Gemma3n. This avoids dropping the 262144-262399 special IDs and aligns the text embeddings with the tokenizer.

Behavior looks correct and non-regressive for pure-text use.

3. Optional typing/ruff cleanup for class attributes (RUF012)

In `Gemma3nVisionModel`:

```python
n_block_keys = []
block_tensor_mapping = { ... }
```

These are effectively class-level constants. To satisfy `RUF012` and make the intent explicit to type checkers, consider an optional `ClassVar` annotation tweak:

```diff
-from typing import TYPE_CHECKING, Any, Callable, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast
+from typing import TYPE_CHECKING, Any, Callable, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast, ClassVar
 ...
-    n_block_keys = []
+    n_block_keys: ClassVar[list[str]] = []
 ...
-    block_tensor_mapping = {
+    block_tensor_mapping: ClassVar[dict[str, str]] = {
         ...
     }
```

This is purely a typing / tooling nicety; behavior is unchanged.

Overall, once the `patch_size` formula is corrected, the Gemma3n MobileNetV5 path and the Gemma3n text vocab/embedding logic look structurally sound for the mirrored upstream changes.

Also applies to: 6115-6191
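The padding behavior in point 2 can be sketched generically (tensor names and sizes below are placeholders, not the converter's actual code):

```python
import numpy as np

def pad_rows_to_vocab(embed, vocab_size):
    # Pad with zero rows up to vocab_size instead of truncating, so special
    # token IDs beyond the original row count still have (zero) embeddings.
    n_rows, n_dim = embed.shape
    if n_rows >= vocab_size:
        return embed
    pad = np.zeros((vocab_size - n_rows, n_dim), dtype=embed.dtype)
    return np.concatenate([embed, pad], axis=0)

e = pad_rows_to_vocab(np.ones((10, 4), dtype=np.float32), 12)
```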
🧹 Nitpick comments (4)
gguf-py/gguf/tensor_mapping.py (1)
123-158: Gemma3n vision tensor mappings look consistent. The new V_MM_* and V_ENC_* entries align with the Gemma3n/MobileNetV5 tensor paths used in `convert_hf_to_gguf.py` and constants, so TensorNameMap will correctly resolve `model.embed_vision.*` and `model.vision_tower.timm_model.*` for Gemma3n.

Note that `V_MM_INP_PROJ`/`V_MM_SOFT_EMB_NORM` now have both generic `multi_modal_projector.*` and Gemma3n-specific `model.embed_vision.*` synonyms; that's fine, but if more variants start using these tensors it may be worth documenting this dual use to avoid confusion later.

tools/mtmd/clip.cpp (1)
1528-1620: Verify `mobilenetv5_block` default initialization and stage boundary assumptions

The GEMMA3N tensor-loading branch dynamically discovers MobileNetV5 blocks per stage and accumulates them in `model.mobilenet_blocks`, with stage ends recorded in `model.mobilenet_stage_ends`. This is a good direction, but a couple of edge conditions are worth double‑checking:

- `mobilenetv5_block block;` relies on the struct's members being safely default‑initialized (e.g., `ggml_tensor * foo = nullptr;` or an explicit ctor). If any members lack default member initializers, they will contain indeterminate values in paths where that sub‑block type is absent (e.g., pure attention blocks vs. pure UIR blocks). Please confirm `mobilenetv5_block` is defined with default member initializers or add `= {}` here to value‑initialize it.
- The `for (int blk_idx = 0; ; ++blk_idx)` loop for each stage stops at the first `blk_idx` that yields no tensors. This assumes that all blocks in a stage are densely indexed from 0..N-1 with no gaps. If future variants ever introduce gaps, discovery would silently truncate later blocks. That's probably fine for current Gemma3n, but worth keeping in mind if more MobileNetV5 variants are added.
- `mobilenet_stage_ends.push_back(model.mobilenet_blocks.size() - 1)` stores inclusive global indices. Ensure mobilenetv5.cpp interprets these indices the same way (inclusive vs exclusive) when iterating.

If you confirm `mobilenetv5_block` is defined with safe defaults and that stage indices are inclusive by design, this loader logic looks solid.

tools/mtmd/models/mobilenetv5.cpp (2)
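The dense-index truncation behavior noted above can be sketched in a few lines (a hypothetical Python mirror of the C++ per-stage loop, not the actual loader code):

```python
# Hypothetical mirror of the C++ per-stage discovery loop: collect blocks
# from index 0 upward and stop at the first index that has no tensors.
# A gap (e.g. index 2 missing while index 3 exists) truncates later blocks.
def discover_stage_blocks(tensors: dict, stage: int) -> list:
    blocks = []
    blk_idx = 0
    while (stage, blk_idx) in tensors:
        blocks.append(tensors[(stage, blk_idx)])
        blk_idx += 1
    return blocks
```

With tensors present at indices 0, 1, and 3, the block at index 3 is silently dropped, which is the scenario worth keeping in mind for future MobileNetV5 variants.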
372-372: Consider making the target output resolution configurable.

The hardcoded `target_out_res = 16` assumes a fixed output resolution for the MSFA downsampling stage. If the model architecture varies, or if different Gemma3n variants use different resolutions, this value should be read from the model config rather than hardcoded.

💡 Suggested approach

Add a field to the model config for MSFA output resolution and read it during model loading, falling back to 16 if not present:
```cpp
// Example usage (adjust based on actual model structure):
const int target_out_res = model.msfa_output_res ? model.msfa_output_res : 16;
```

Alternatively, if 16 is the fixed resolution for all Gemma3n models, consider adding a comment explaining this architectural constraint.
Based on learnings, this is a mirrored PR—please verify with the upstream contributor whether this value should remain fixed or be made configurable.
420-420: Consider extracting the RMS norm epsilon as a named constant.

The hardcoded epsilon value `1e-6f` appears twice (lines 420, 442) for Gemma3n RMS normalization. Extracting it as a named constant improves maintainability and makes the architectural choice explicit.

💡 Suggested refactor

Near the top of the file or in the class definition:

```cpp
static constexpr float GEMMA3N_RMS_NORM_EPS = 1e-6f;
```

Then use it consistently:

```diff
-    const float eps = 1e-6f; // Gemma3n uses 1e-6
-    cur = ggml_rms_norm(ctx0, cur, eps);
+    cur = ggml_rms_norm(ctx0, cur, GEMMA3N_RMS_NORM_EPS);
```

Also applies to: 442-442
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- convert_hf_to_gguf.py
- gguf-py/gguf/constants.py
- gguf-py/gguf/tensor_mapping.py
- tools/mtmd/clip.cpp
- tools/mtmd/models/mobilenetv5.cpp
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{c,cpp,h,hpp}
📄 CodeRabbit inference engine (AGENTS.md)
Always format C++ code before committing using `git clang-format` with the project's `.clang-format` configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer alignment `void * ptr`, reference alignment `int & ref`)
Files:
- tools/mtmd/models/mobilenetv5.cpp
- tools/mtmd/clip.cpp
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Always use the Python environment in `.venv` and run Python tools from that environment
Apply Python linting rules configured in `.flake8` (max-line-length=125, excludes examples/tools) and type checking with pyright
Files:
- gguf-py/gguf/constants.py
- gguf-py/gguf/tensor_mapping.py
- convert_hf_to_gguf.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".
When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {ggml/**,src/**/backend*.{c,cpp,h,hpp},tools/server/**} : Backend-related changes (CPU, CUDA, Metal, Vulkan, etc.) and modifications to `tools/server` require AI usage disclosure if significant code is generated
Applied to files:
- tools/mtmd/models/mobilenetv5.cpp
- gguf-py/gguf/constants.py
- tools/mtmd/clip.cpp
- convert_hf_to_gguf.py
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {include/llama.h,ggml/**/*.h,mtmd/**/*.h} : Public API modifications in `include/llama.h`, `ggml.h`, or `mtmd.h` require AI usage disclosure if significant code is generated
Applied to files:
- tools/mtmd/models/mobilenetv5.cpp
- gguf-py/gguf/constants.py
- tools/mtmd/clip.cpp
- convert_hf_to_gguf.py
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
Applied to files:
- gguf-py/gguf/constants.py
- tools/mtmd/clip.cpp
🧬 Code graph analysis (4)
tools/mtmd/models/mobilenetv5.cpp (2)
ggml/src/ggml.c (17)
ggml_permute(3700-3752)ggml_cont(3461-3465)ggml_rms_norm(3066-3071)ggml_mul(2170-2175)ggml_pad_ext(4983-5016)ggml_conv_2d_direct(4702-4736)ggml_gelu(2677-2681)ggml_add(1969-1974)ggml_reshape_3d(3564-3581)ggml_reshape_4d(3583-3601)ggml_mul_mat(3174-3189)ggml_scale(3290-3295)ggml_soft_max(3966-3970)ggml_upscale(4928-4935)ggml_concat(2517-2544)ggml_pool_2d(4852-4878)ggml_build_forward_expand(6793-6795)tools/mtmd/clip.cpp (9)
build_inp_raw(469-474)build_inp_raw(469-469)model(217-219)model(935-1261)model(935-935)model(2035-2048)model(2035-2035)s(2443-2445)s(2443-2443)
gguf-py/gguf/tensor_mapping.py (1)
gguf-py/gguf/constants.py (1)
MODEL_TENSOR(465-743)
tools/mtmd/clip.cpp (2)
common/common.cpp (4)
model(1159-1161)model(1159-1159)string_format(399-412)string_format(399-399)src/llama-model.cpp (2)
get_tensor(7044-7054)get_tensor(7044-7044)
convert_hf_to_gguf.py (1)
gguf-py/gguf/constants.py (1)
VisionProjectorType(3429-3450)
🪛 Ruff (0.14.10)
convert_hf_to_gguf.py
5969-5969: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
5972-5995: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
6067-6067: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (6)
tools/mtmd/clip.cpp (1)
783-795: GEMMA3N graph dispatch is consistent with the new MobileNetV5 graph

Routing `PROJECTOR_TYPE_GEMMA3N` to `clip_graph_mobilenetv5` mirrors how other projector types choose their specialized graphs and keeps mobilenet-specific logic isolated from the generic ViT path. No issues here.

convert_hf_to_gguf.py (1)

522-527: prepare_tensors: safe handling of empty tensor_map looks good

Guarding `self.tensor_map.mapping` before calling `max()` fixes the crash when `block_count == 0` (e.g., MobileNetV5) while preserving the logging behavior with a reasonable fallback width.

No issues from a correctness or performance perspective.
tools/mtmd/models/mobilenetv5.cpp (4)
5-20: LGTM: RMS normalization helper is correctly implemented.

The permutation logic correctly transforms [W,H,C,B] → [C,H,W,B] for channel-wise normalization, applies RMS norm, optionally multiplies by weight, and restores the original layout. The null check for weight is properly guarded.

23-53: LGTM: Asymmetric SAME padding correctly implemented.

The calculation matches PyTorch/TensorFlow SAME padding semantics, with proper ceil division for the output size and an asymmetric padding split. The `ggml_pad_ext` call uses correct dimension ordering for the [W,H,C,B] layout.
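For reference, the SAME-padding arithmetic described above can be expressed as a small self-contained sketch (a Python stand-in for the C++ helper; the function name and signature are illustrative, not the actual `pad_same_2d` API):

```python
import math

# SAME padding for one spatial dimension, matching TF/PyTorch semantics:
# output size is ceil(in / stride); the total padding needed to achieve it
# is split asymmetrically, with the extra unit on the trailing side.
def same_pad_1d(in_size: int, kernel: int, stride: int, dilation: int = 1):
    out_size = math.ceil(in_size / stride)
    eff_kernel = (kernel - 1) * dilation + 1           # dilated kernel extent
    total = max(0, (out_size - 1) * stride + eff_kernel - in_size)
    pad_before = total // 2
    pad_after = total - pad_before                     # trailing side gets the extra unit
    return pad_before, pad_after
```

For example, a 768-wide input with a 3×3 kernel and stride 2 needs (0, 1) padding, while stride 1 needs the symmetric (1, 1).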
57-88: LGTM: Block builders handle null pointers safely.

All three block builder functions (`build_edge_residual`, `build_inverted_residual`, `build_mobilenet_attn`) consistently check for null pointers before dereferencing optional weights and biases. The residual connection logic correctly validates dimension matching before addition.

Also applies to: 91-149, 152-246

248-453: LGTM: Main build function integrates all components correctly.

The `build()` function properly sequences stem convolution, stage/block iteration, fusion point collection, MSFA logic, and the Gemma3n embedding projection. Logging statements use the proper macros (LOG_WRN, LOG_ERR) as addressed in previous reviews. All optional weight/bias tensors are guarded by null checks before use.
…th already existing terminology
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
gguf-py/gguf/constants.py (1)
857-865: GEMMA3N mapping still missing from VISION_PROJECTOR_TYPE_NAMES dictionary.

This is the same issue flagged in the previous review. The GEMMA3N enum value added at line 459 still has no corresponding entry in this dictionary. This mapping is required for the enum-to-string conversion to work correctly.
convert_hf_to_gguf.py (1)
6014-6039: Fix Gemma3n MobileNetV5 `patch_size` semantics and avoid brittle hard‑coded vision hparams

Two related issues here:

1. `patch_size` is still computed from the total token count, not per‑side patches (critical)

- Current code: `patch_size = image_size // image_seq_length` (e.g., `768 // 256 = 3`), which implies `n_per_side = 256` and a 65k‑token grid.
- Semantically, `image_seq_length` is the total patch count (e.g., 256 = 16×16). Patch size must be derived from `sqrt(image_seq_length)` so that both the converter and the C++ vision path agree on a 16×16 grid and the correct `patch_size` (48 for 768×768).

2. Hard‑coded MobileNetV5 defaults in `find_vparam` are brittle

`hidden_size` defaulting to 2048 and `num_heads` forced to 8 will silently be wrong if future Gemma3n variants change these values in their config. It's safer to read from `self.hparams_vision` when available and only fall back to defaults if the config is missing them.

Patch: derive `patch_size` from √image_seq_length
```diff
-        # Image sequence length (256 tokens = 16x16 for Gemma3n)
-        image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
-        image_size = self.hparams_vision["image_size"]
-        self.hparams_vision["patch_size"] = image_size // image_seq_length
+        # Image sequence length is total tokens (e.g. 256 = 16×16 grid)
+        image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
+        image_size = self.hparams_vision["image_size"]
+
+        n_per_side = int(image_seq_length ** 0.5)
+        if n_per_side * n_per_side != image_seq_length:
+            raise ValueError(f"image_seq_length={image_seq_length} is not a perfect square")
+
+        # e.g. 768 // 16 = 48 for a 16×16 patch grid
+        self.hparams_vision["patch_size"] = image_size // n_per_side
```

Patch: prefer config‑driven head / FFN sizes in `find_vparam`
```diff
     def find_vparam(self, keys: list[str], optional: bool = False) -> Any:
         """Override to provide hardcoded MobileNetV5 parameters that aren't in config"""
         # Handle empty keys list (n_block_keys) - return 0 for CNN architecture
         if not keys:
             return 0

-        if "intermediate_size" in keys:
-            # Typical expansion is 4x the embedding dimension
-            hidden_size = self.hparams_vision.get("hidden_size", 2048)
-            return hidden_size * 4
-
-        if "num_attention_heads" in keys or "num_heads" in keys:
-            # Multi-Query Attention with 8 heads
-            return 8
+        if "intermediate_size" in keys:
+            assert self.hparams_vision is not None
+            if "intermediate_size" in self.hparams_vision:
+                return self.hparams_vision["intermediate_size"]
+            # Fallback: typical MobileNetV5 expansion is 4× hidden_size
+            hidden_size = self.hparams_vision.get("hidden_size")
+            if hidden_size is not None:
+                return hidden_size * 4
+
+        if any(k in ("num_attention_heads", "num_heads") for k in keys):
+            assert self.hparams_vision is not None
+            for k in ("num_attention_heads", "num_heads"):
+                if k in self.hparams_vision:
+                    return self.hparams_vision[k]
+            # Final fallback if config is missing heads
+            return 8

         # For other parameters, use parent implementation
         return super().find_vparam(keys, optional)
```

Given this is a mirrored PR, you'll probably want to carry this fix locally and/or ping upstream about the patch_size formula and config‑driven defaults.
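The intended semantics can be distilled into a minimal standalone sketch (a hypothetical helper, not part of the converter) showing why the square-root derivation matters:

```python
import math

# Derive the per-side patch size from the *total* patch count of a square grid.
# The buggy form divides by the total count (768 // 256 = 3); the correct
# form divides by the per-side count (768 // 16 = 48).
def derive_patch_size(image_size: int, image_seq_length: int) -> int:
    n_per_side = math.isqrt(image_seq_length)
    if n_per_side * n_per_side != image_seq_length:
        raise ValueError(f"image_seq_length={image_seq_length} is not a perfect square")
    return image_size // n_per_side
```

With `image_size=768` and `image_seq_length=256`, this yields 48, matching the 16×16 grid the C++ vision path expects.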
🧹 Nitpick comments (3)
gguf-py/gguf/constants.py (1)
1071-1071: Consider clarifying the comment for V_MM_SOFT_EMB_NORM.

The comment here shows `# gemma3n`, but the enum definition at line 669 shows `# gemma3`. If this tensor is used by both gemma3 and gemma3n architectures, consider using a comment like `# gemma3, gemma3n` to clarify the shared usage and avoid confusion.

convert_hf_to_gguf.py (2)
5969-5995: Annotate mutable class attributes with `ClassVar` to satisfy Ruff RUF012

`n_block_keys = []` and `block_tensor_mapping = {…}` are mutable class attributes; Ruff expects them to be annotated as `typing.ClassVar[...]`.

Proposed type annotations for class attributes

Add `ClassVar` to the typing imports:

```python
from typing import TYPE_CHECKING, Any, Callable, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast, ClassVar
```

Then update the class attributes:

```diff
-    n_block_keys = []
+    n_block_keys: ClassVar[list[str]] = []
@@
-    block_tensor_mapping = {
+    block_tensor_mapping: ClassVar[dict[str, str]] = {
         "model.vision_tower.timm_model.blocks.{bid}.{sid}.conv_exp.weight": "v.blk.{bid}.{sid}.conv_exp.weight",
         ...
     }
```
6171-6226: Embedding padding and ALTUP stacking logic look correct; consider a small guard

This block:

- Only affects `language_model.*` tensors; others are skipped, which keeps mmproj / vision clean.
- Pads `embed_tokens.weight` and `embed_tokens_per_layer` up to `hparams["vocab_size"]`, filling new rows with zeros for vision/audio tokens (which get real features from the vision/audio towers anyway).
- Leaves `altup_unembed_projections` and `altup_projections` unpadded and stacks three shard tensors into single `[3, …, …]` matrices, matching how GGUF expects them.

You might consider adding a simple sanity check on the padding path to catch config mismatches earlier (optional):

Optional: assert current vs target vocab sizes when padding

```diff
-        vocab_size = self.hparams.get("vocab_size", 262400)
-        current_size = data_torch.shape[0]  # First dimension is vocab_size
+        vocab_size = self.hparams.get("vocab_size", 262400)
+        current_size = data_torch.shape[0]  # first dim is vocab size
+
+        if current_size > vocab_size:
+            raise ValueError(
+                f"embed tensor rows ({current_size}) exceed vocab_size ({vocab_size})"
+            )
```
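A toy version of the padding path (plain Python lists standing in for tensors; names are illustrative, not the converter's actual code) shows the guard in context:

```python
# Pad an embedding matrix (rows = token IDs) up to vocab_size by appending
# zero rows for newly added vision/audio token IDs.
def pad_embeddings(embed: list, vocab_size: int) -> list:
    current_size = len(embed)          # first dimension is vocab size
    if current_size > vocab_size:
        raise ValueError(f"embed rows ({current_size}) exceed vocab_size ({vocab_size})")
    n_embd = len(embed[0])
    return embed + [[0.0] * n_embd for _ in range(vocab_size - current_size)]
```

The zero rows are harmless placeholders: at runtime those token positions receive real features from the vision/audio towers.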
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- convert_hf_to_gguf.py
- gguf-py/gguf/constants.py
- tools/mtmd/clip-impl.h
🚧 Files skipped from review as they are similar to previous changes (1)
- tools/mtmd/clip-impl.h
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Always use the Python environment in `.venv` and run Python tools from that environment
Apply Python linting rules configured in `.flake8` (max-line-length=125, excludes examples/tools) and type checking with pyright
Files:
- gguf-py/gguf/constants.py
- convert_hf_to_gguf.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".
When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {ggml/**,src/**/backend*.{c,cpp,h,hpp},tools/server/**} : Backend-related changes (CPU, CUDA, Metal, Vulkan, etc.) and modifications to `tools/server` require AI usage disclosure if significant code is generated
Applied to files:
- gguf-py/gguf/constants.py
- convert_hf_to_gguf.py
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {include/llama.h,ggml/**/*.h,mtmd/**/*.h} : Public API modifications in `include/llama.h`, `ggml.h`, or `mtmd.h` require AI usage disclosure if significant code is generated
Applied to files:
- gguf-py/gguf/constants.py
- convert_hf_to_gguf.py
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
Applied to files:
gguf-py/gguf/constants.py
🧬 Code graph analysis (1)
convert_hf_to_gguf.py (3)
convert_lora_to_gguf.py (2)
modify_tensors(448-473)to(185-186)gguf-py/gguf/gguf_writer.py (2)
add_clip_projector_type(1076-1077)add_vision_attention_layernorm_eps(1097-1098)gguf-py/gguf/constants.py (1)
VisionProjectorType(3429-3450)
🪛 Ruff (0.14.10)
convert_hf_to_gguf.py
5969-5969: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
5972-5995: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
6067-6067: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (3)
convert_hf_to_gguf.py (3)
522-527: Defensive handling of empty `tensor_map` in `prepare_tensors` looks good

The fallback `max_name_len` path for models with `block_count == 0` (e.g., MobileNetV5) avoids `max()` on an empty mapping and only affects logging alignment. No functional risks spotted.
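The guarded-fallback shape can be illustrated with a small hypothetical sketch (names and the fallback string are illustrative; the real code computes a log-column width):

```python
FALLBACK_NAME = "vision_encoder.weight,"  # illustrative fallback width source

# Treat a missing or empty `mapping` attribute the same way: fall back to a
# fixed width instead of calling max() on an empty sequence.
def compute_max_name_len(tensor_map) -> int:
    mapping = getattr(tensor_map, "mapping", None)
    if not mapping:
        return len(FALLBACK_NAME)
    return max(len(name) for name in mapping)
```

Using `getattr(..., "mapping", None)` also covers the case where the attribute itself is absent, not just empty.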
6070-6090: Verify MobileNetV5 tensor remapping and reshaping assumptions

The overall routing logic looks consistent:

- Non‑Gemma3n tensors are skipped early.
- Block tensors under `model.vision_tower.timm_model.blocks.*` are mapped via `custom_map` into the `v.blk.{bid}.{sid}.*` namespace.
- Other MobileNetV5 pieces (stem / msfa / embeddings) fall back to the standard tensor map.
- Biases and `layer_scale.gamma` are reshaped from `[C]` into `[1, C, 1, 1]`, which matches a conv‑style broadcast.

If you haven't already, it would be worth running a quick shape sanity check against a real Gemma3n vision checkpoint (ensure `conv_stem.conv.bias` and `layer_scale.gamma` are 1‑D `[C]` and that the resulting GGUF tensors have the shapes expected by mobilenetv5.cpp / clip.cpp for those fields).
6115-6134: `Gemma3NModel.set_vocab` correctly forces use of the full `vocab_size`

Temporarily removing `vocab_size_per_layer_input` so the vocab builders see the full `vocab_size` (including vision/audio specials) and then restoring it is a clean way to reconcile:

- vocab generation: needs 262400 entries
- per‑layer embeddings: still driven by `vocab_size_per_layer_input` later

No functional issues seen.
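The remove-then-restore dance benefits from a try/finally so the hparam survives an exception in the parent call, as a later review comment also suggests; a minimal sketch (hypothetical function names, not the converter's actual API):

```python
# Temporarily drop a key from hparams so a callback sees the full vocab_size,
# restoring the key afterwards even if the callback raises.
def with_full_vocab(hparams: dict, build_vocab) -> None:
    saved = hparams.get("vocab_size_per_layer_input")
    if saved is not None:
        del hparams["vocab_size_per_layer_input"]
    try:
        build_vocab()
    finally:
        if saved is not None:
            hparams["vocab_size_per_layer_input"] = saved
```

The finally block guarantees the per-layer vocab size is back in place for the embedding logic that runs later, whether or not vocab building succeeds.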
…roadcasting of vision inp_per_layer - Simplify the vision embeddings path of "get_per_layer_inputs" to output [n_embd_altup, n_layer, 1], broadcastable
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (1)
gguf-py/gguf/constants.py (1)
459-459: GEMMA3N still missing from VISION_PROJECTOR_TYPE_NAMES dictionary.

This issue was previously flagged but remains unresolved. The `GEMMA3N` entry was added to the `VISION_PROJECTOR_TYPE` enum at line 459, but the corresponding mapping in the `VISION_PROJECTOR_TYPE_NAMES` dictionary (lines 856-864) is still missing. This will cause runtime errors when converting the enum value to its string representation.

Please add the mapping:

```python
    VISION_PROJECTOR_TYPE.GEMMA3N: "gemma3n",
```

Note: `QWEN3VL` and `COGVLM` mappings are also still missing, as noted in the previous review.
🧹 Nitpick comments (1)
src/models/gemma3n-iswa.cpp (1)
263-274: Consider adding dimension validation for the vision embedding path.

The vision embedding path assumes `embd_size = model.tok_embd_per_layer->ne[0]` equals `n_embd_altup * n_layer`. If this assumption is violated, the reshape at line 272 will produce incorrect results or fail silently.

💡 Optional: Add assertion to validate dimensions

```diff
 // Vision embedding path: use padding token (ID=0) embedding
 const int64_t embd_size = model.tok_embd_per_layer->ne[0]; // n_embd_altup * n_layer
+GGML_ASSERT(embd_size == n_embd_altup * n_layer && "tok_embd_per_layer dimension mismatch");
 // Extract and dequantize padding token embedding (column 0)
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- gguf-py/gguf/constants.py
- gguf-py/gguf/tensor_mapping.py
- src/models/gemma3n-iswa.cpp
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{c,cpp,h,hpp}
📄 CodeRabbit inference engine (AGENTS.md)
Always format C++ code before committing using `git clang-format` with the project's `.clang-format` configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer alignment `void * ptr`, reference alignment `int & ref`)
Files:
src/models/gemma3n-iswa.cpp
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Always use the Python environment in `.venv` and run Python tools from that environment
Apply Python linting rules configured in `.flake8` (max-line-length=125, excludes examples/tools) and type checking with pyright
Files:
- gguf-py/gguf/tensor_mapping.py
- gguf-py/gguf/constants.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".
When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {ggml/**,src/**/backend*.{c,cpp,h,hpp},tools/server/**} : Backend-related changes (CPU, CUDA, Metal, Vulkan, etc.) and modifications to `tools/server` require AI usage disclosure if significant code is generated
Applied to files:
- src/models/gemma3n-iswa.cpp
- gguf-py/gguf/constants.py
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {include/llama.h,ggml/**/*.h,mtmd/**/*.h} : Public API modifications in `include/llama.h`, `ggml.h`, or `mtmd.h` require AI usage disclosure if significant code is generated
Applied to files:
gguf-py/gguf/constants.py
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
Applied to files:
gguf-py/gguf/constants.py
🧬 Code graph analysis (2)
src/models/gemma3n-iswa.cpp (1)
ggml/src/ggml.c (7)
ggml_new_tensor_1d(1747-1752)ggml_set_input(7435-7437)ggml_get_rows(3776-3797)ggml_reshape_3d(3564-3581)ggml_scale(3290-3295)ggml_cpy(3426-3431)ggml_add(1969-1974)
gguf-py/gguf/tensor_mapping.py (1)
gguf-py/gguf/constants.py (1)
MODEL_TENSOR(465-742)
🔇 Additional comments (7)
gguf-py/gguf/tensor_mapping.py (1)
123-155: Verify the V_MM_INP_PROJ comment and tensor reuse.

The mapping for `V_MM_INP_PROJ` at lines 129-131 is commented as `# gemma3n`, but in `constants.py` (line 668), `V_MM_INP_PROJ` is defined with the comment `# gemma3`. Please verify:

- Is `V_MM_INP_PROJ` being reused between gemma3 and gemma3n, or should gemma3n have a distinct tensor?
- If it's reused, consider clarifying the comment or noting the shared usage.
Otherwise, the new gemma3n vision tensor mappings look correct and follow the established pattern.
gguf-py/gguf/constants.py (4)
670-678: LGTM: MODEL_TENSOR enum additions.

The new gemma3n vision tensor enum members are correctly defined with `auto()` and follow the established naming conventions and patterns.

1070-1079: LGTM: TENSOR_NAMES mappings for gemma3n.

The new TENSOR_NAMES mappings correctly associate the MODEL_TENSOR enum members with their GGUF tensor name strings, following established naming conventions.

1178-1186: LGTM: MODEL_TENSORS additions for MMPROJ arch.

The new gemma3n vision tensors are correctly added to the `MODEL_TENSORS[MODEL_ARCH.MMPROJ]` list, ensuring they will be recognized during tensor loading and validation.

3428-3428: LGTM: VisionProjectorType.GEMMA3N constant.

The `GEMMA3N` constant is correctly defined with the appropriate lowercase string value "gemma3n", consistent with other projector type definitions.

src/models/gemma3n-iswa.cpp (2)

253-261: LGTM: Token input path refactoring.

The refactoring correctly moves the input object creation and lifecycle management into the token branch scope. The `res->add_input(std::move(inp))` call after processing is appropriately placed.

292-292: Broadcasting order change is intentional and properly implemented.

The addition order change, placing `per_layer_proj` (shape `[n_embd_altup, n_layer, n_tokens]`) first and `inp_per_layer` second, is correct per the PR objectives and supports broadcasting for vision inputs (shape `[n_embd_altup, n_layer, 1]`). The implementation follows standard conventions (larger tensor first) and the code is properly formatted per project guidelines.
There was a problem hiding this comment.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
gguf-py/gguf/constants.py (1)
461-472: Bug: `VISION_PROJECTOR_TYPE.GEMMA3N` is missing from `VISION_PROJECTOR_TYPE_NAMES`.

You added the enum value (line 468) but didn't add the corresponding name in `VISION_PROJECTOR_TYPE_NAMES` (line 870+). If code relies on this mapping, GEMMA3N will fail to serialize/deserialize (or raise a KeyError).

Proposed fix

```diff
 VISION_PROJECTOR_TYPE_NAMES: dict[VISION_PROJECTOR_TYPE, str] = {
     VISION_PROJECTOR_TYPE.MLP:       "mlp",
     VISION_PROJECTOR_TYPE.LDP:       "ldp",
     VISION_PROJECTOR_TYPE.LDPV2:     "ldpv2",
     VISION_PROJECTOR_TYPE.RESAMPLER: "resampler",
     VISION_PROJECTOR_TYPE.GLM_EDGE:  "adapter",
     VISION_PROJECTOR_TYPE.MERGER:    "qwen2vl_merger",
+    VISION_PROJECTOR_TYPE.GEMMA3N:   "gemma3n",
     VISION_PROJECTOR_TYPE.GEMMA3:    "gemma3",
 }
```

Also applies to: 870-878
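A toy reproduction of the failure mode (hypothetical enum, not the real gguf-py classes), for anyone verifying the fix:

```python
from enum import IntEnum, auto

# An enum member that is missing from its name-lookup dict raises KeyError
# at conversion time, which is exactly the bug flagged above.
class ProjType(IntEnum):
    MLP = auto()
    GEMMA3N = auto()

PROJ_NAMES = {ProjType.MLP: "mlp"}  # GEMMA3N mapping intentionally omitted
```

Looking up `PROJ_NAMES[ProjType.GEMMA3N]` raises `KeyError`, mirroring what would happen when serializing the new projector type without the dictionary entry.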
🤖 Fix all issues with AI agents
In @convert_hf_to_gguf.py:
- Around line 530-535: The access to self.tensor_map.mapping in prepare_tensors
is fragile if tensor_map lacks a mapping attribute; change the guard to use
getattr(self.tensor_map, "mapping", None) and treat a falsy result the same as
an empty mapping so max_name_len computation and the fallback to
"vision_encoder.weight," are used safely; update references in prepare_tensors
and any subsequent usage that assumes mapping exists to first assign mapping =
getattr(self.tensor_map, "mapping", None) and use that local variable for checks
and iteration.
- Around line 6193-6212: The current set_vocab method temporarily deletes
self.hparams["vocab_size_per_layer_input"] then calls super().set_vocab(), but
if super().set_vocab() raises an exception the original value is never restored;
wrap the call to super().set_vocab() in a try/finally so that whatever happens
the original vocab_size_per_layer_input (captured from self.hparams) is
re-assigned to self.hparams["vocab_size_per_layer_input"] in the finally block;
keep the existing logic of only deleting/restoring when
vocab_size_per_layer_input is not None and reference the set_vocab method,
self.hparams, vocab_size_per_layer_input, and super().set_vocab() to locate the
change.
- Around line 6044-6125: The patch_size math in
Gemma3nVisionModel.set_gguf_parameters is wrong: replace the linear division
self.hparams_vision["patch_size"] = image_size // image_seq_length with a
square-root based computation (n_per_side = int(sqrt(image_seq_length)) and
patch_size = image_size // n_per_side) so 256 tokens → 16×16 grid and
patch_size=48 for image_size=768; update references in set_gguf_parameters
accordingly. Also update find_vparam to prefer reading num_heads from
self.hparams_vision (e.g., self.hparams_vision.get("num_heads")) and fall back
to 8 only if absent, keeping the existing hidden_size fallback logic.
In @tools/mtmd/clip-model.h:
- Around line 331-347: Remove the unused msfa_concat_conv_w declaration from the
header and fix the unloaded mm_post_proj_norm_w by adding its loading logic
during GEMMA3N model init in clip.cpp (follow the same pattern used for
mobilenet_stem_conv_w / mobilenet_stem_norm_w: call the model tensor-load helper
to assign mm_post_proj_norm_w, check for nullptr and handle gracefully).
Alternatively, if the model truly does not provide that tensor, remove the
conditional check/usage of mm_post_proj_norm_w in mobilenetv5.cpp instead of
loading it. Refer to the symbols mobilenet_blocks, mobilenet_stem_conv_w,
mobilenet_stem_norm_w, mm_post_proj_norm_w, msfa_concat_conv_w, and the
mobilenetv5.cpp/clip.cpp initialization areas when making the change.
🧹 Nitpick comments (1)
tools/mtmd/models/models.h (1)
80-111: Consider narrowing the public surface of `clip_graph_mobilenetv5`.

These look like internal graph-building helpers; making them `private` (and/or switching `struct` → `class`) would reduce accidental use outside the implementation.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- convert_hf_to_gguf.py
- gguf-py/gguf/constants.py
- gguf-py/gguf/tensor_mapping.py
- src/models/gemma3n-iswa.cpp
- tools/mtmd/CMakeLists.txt
- tools/mtmd/clip-impl.h
- tools/mtmd/clip-model.h
- tools/mtmd/clip.cpp
- tools/mtmd/models/models.h
- tools/mtmd/mtmd.cpp
🚧 Files skipped from review as they are similar to previous changes (4)
- tools/mtmd/CMakeLists.txt
- src/models/gemma3n-iswa.cpp
- tools/mtmd/mtmd.cpp
- tools/mtmd/clip-impl.h
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".
When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
Applied to files:
tools/mtmd/clip.cpp, gguf-py/gguf/constants.py
🧬 Code graph analysis (2)
tools/mtmd/models/models.h (1)
tools/mtmd/models/mobilenetv5.cpp (12)
build(248-453), build(248-248), rms_norm_2d(5-20), rms_norm_2d(5-5), pad_same_2d(23-53), pad_same_2d(23-23), build_edge_residual(57-88), build_edge_residual(57-57), build_inverted_residual(91-149), build_inverted_residual(91-91), build_mobilenet_attn(152-246), build_mobilenet_attn(152-152)
gguf-py/gguf/tensor_mapping.py (1)
gguf-py/gguf/constants.py (1)
MODEL_TENSOR(474-751)
🪛 Ruff (0.14.10)
convert_hf_to_gguf.py
6047-6047: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
6050-6073: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
6145-6145: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (10)
gguf-py/gguf/constants.py (1)
3525-3549: VisionProjectorType.GEMMA3N = "gemma3n" looks consistent with the intended surface. This matches the new projector type string and aligns with the new tensor set.
gguf-py/gguf/tensor_mapping.py (1)
126-159: Please verify these HF key strings against a real Gemma3n checkpoint. These new mappings are only as good as the exact parameter names (e.g., model.embed_vision.embedding, model.vision_tower.timm_model.conv_stem.conv). If upstream HF naming differs even slightly, conversion will fail to find tensors. If you’ve seen multiple naming variants across releases, consider adding extra aliases per tensor (as done elsewhere in this file).
tools/mtmd/models/models.h (1)
79-111: > Likely an incorrect or invalid review comment.
tools/mtmd/clip.cpp (6)
791-794: LGTM! The GEMMA3N case correctly delegates to the clip_graph_mobilenetv5 builder, following the established pattern for other projector types.
1153-1160: LGTM! The GEMMA3N hparams configuration correctly sets n_merge = 1 since MobileNetV5 handles spatial reduction internally through its architecture rather than through patch merging. The optional override via KEY_PROJ_SCALE_FACTOR provides flexibility.
2961-2969: LGTM! The GEMMA3N preprocessing correctly resizes to a square image without padding (add_padding = false), which aligns with MobileNetV5 expectations. This is an intentional difference from GEMMA3's behavior.
3631-3631: LGTM! GEMMA3N correctly falls through with GEMMA3 and other projector types that don't require special input tensor setup beyond the raw image input.
3759-3760: LGTM! GEMMA3N correctly shares the embedding dimension source (mm_input_proj_w->ne[0]) with GEMMA3, consistent with both using similar projection mechanisms.
1577-1652: The dynamic block discovery correctly handles variable-length mobilenet_stage_ends. The graph builder in clip_graph_mobilenetv5 already protects against fewer-than-4 entries with an explicit size check (if (model.mobilenet_stage_ends.size() >= 4) at line 275 in mobilenetv5.cpp) before accessing indices [2] and [3], and provides fallback logic using total_blocks - 1 when fewer entries exist. No action needed.
Likely an incorrect or invalid review comment.
tools/mtmd/clip-model.h (1)
176-213: LGTM! The mobilenetv5_block structure is well-organized with clear comments distinguishing Stage 0 (Edge Residual), Stage 1+ (Universal Inverted Residual), and Attention components. All pointers are properly initialized to nullptr.
    def prepare_tensors(self):
-        max_name_len = max(len(s) for _, s in self.tensor_map.mapping.values()) + len(".weight,")
+        # Handle empty tensor_map for models with block_count=0 (like MobileNetV5)
+        if self.tensor_map.mapping:
+            max_name_len = max(len(s) for _, s in self.tensor_map.mapping.values()) + len(".weight,")
+        else:
+            max_name_len = len("vision_encoder.weight,")  # Default reasonable length
Guard for empty tensor_map.mapping is good, but consider hardening the attribute access.
If gguf.TensorNameMap ever changes shape (e.g., no .mapping attr), this will raise at runtime. A tiny defensive getattr(self.tensor_map, "mapping", None) would make this robust.
🤖 Prompt for AI Agents
In @convert_hf_to_gguf.py around lines 530 - 535, The access to
self.tensor_map.mapping in prepare_tensors is fragile if tensor_map lacks a
mapping attribute; change the guard to use getattr(self.tensor_map, "mapping",
None) and treat a falsy result the same as an empty mapping so max_name_len
computation and the fallback to "vision_encoder.weight," are used safely; update
references in prepare_tensors and any subsequent usage that assumes mapping
exists to first assign mapping = getattr(self.tensor_map, "mapping", None) and
use that local variable for checks and iteration.
convert_hf_to_gguf.py
@ModelBase.register("Gemma3nForConditionalGeneration", "Gemma3nVisionModel")
class Gemma3nVisionModel(MmprojModel):
    """Vision encoder converter for Gemma3n using MobileNetV5 architecture"""
    n_block_keys = []

    # Double indexed mapping for MobileNetV5 blocks
    block_tensor_mapping = {
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.conv_exp.weight": "v.blk.{bid}.{sid}.conv_exp.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.bn1.weight": "v.blk.{bid}.{sid}.bn1.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.conv_pwl.weight": "v.blk.{bid}.{sid}.conv_pwl.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.bn2.weight": "v.blk.{bid}.{sid}.bn2.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_start.conv.weight": "v.blk.{bid}.{sid}.dw_start.conv.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_start.bn.weight": "v.blk.{bid}.{sid}.dw_start.bn.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_mid.conv.weight": "v.blk.{bid}.{sid}.dw_mid.conv.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_mid.bn.weight": "v.blk.{bid}.{sid}.dw_mid.bn.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.pw_exp.conv.weight": "v.blk.{bid}.{sid}.pw_exp.conv.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.pw_exp.bn.weight": "v.blk.{bid}.{sid}.pw_exp.bn.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.pw_proj.conv.weight": "v.blk.{bid}.{sid}.pw_proj.conv.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.pw_proj.bn.weight": "v.blk.{bid}.{sid}.pw_proj.bn.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.layer_scale.gamma": "v.blk.{bid}.{sid}.layer_scale.gamma",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.query.proj.weight": "v.blk.{bid}.{sid}.attn.query.proj.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.key.proj.weight": "v.blk.{bid}.{sid}.attn.key.proj.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.value.proj.weight": "v.blk.{bid}.{sid}.attn.value.proj.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.output.proj.weight": "v.blk.{bid}.{sid}.attn.output.proj.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.key.down_conv.weight": "v.blk.{bid}.{sid}.attn.key.down_conv.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.key.norm.weight": "v.blk.{bid}.{sid}.attn.key.norm.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.value.down_conv.weight": "v.blk.{bid}.{sid}.attn.value.down_conv.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.value.norm.weight": "v.blk.{bid}.{sid}.attn.value.norm.weight",
        "model.vision_tower.timm_model.blocks.{bid}.{sid}.norm.weight": "v.blk.{bid}.{sid}.norm.weight",
    }

    def find_hparam(self, keys: list[str], optional: bool = False) -> Any:
        """Override to return 0 for block count since MobileNetV5 is CNN-based"""
        if not keys:  # If n_block_keys is empty (our case)
            return 0
        # Otherwise use parent implementation
        return super().find_hparam(keys, optional)

    def __init__(self, *args, **kwargs):
        # Parent init will call find_hparam which now returns 0 for empty keys
        super().__init__(*args, **kwargs)

    def find_vparam(self, keys: list[str], optional: bool = False) -> Any:
        """Override to provide hardcoded MobileNetV5 parameters that aren't in config"""
        # Handle empty keys list (n_block_keys) - return 0 for CNN architecture
        if not keys:
            return 0

        if "intermediate_size" in keys:
            # Typical expansion is 4x the embedding dimension
            hidden_size = self.hparams_vision.get("hidden_size", 2048)
            return hidden_size * 4

        if "num_attention_heads" in keys or "num_heads" in keys:
            # Multi-Query Attention with 8 heads
            return 8

        # For other parameters, use parent implementation
        return super().find_vparam(keys, optional)

    def set_gguf_parameters(self):
        # MobileNetV5 does not use normalisation at all
        self.preprocessor_config["image_mean"] = [0.0, 0.0, 0.0]
        self.preprocessor_config["image_std"] = [1.0, 1.0, 1.0]
        self.hparams_vision["image_size"] = self.preprocessor_config.get(
            "size", {"height": 768, "width": 768}
        )["height"]

        # Image sequence length (256 tokens = 16x16 for Gemma3n)
        image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
        image_size = self.hparams_vision["image_size"]
        self.hparams_vision["patch_size"] = image_size // image_seq_length

        # Now call parent which will use the corrected values
        super().set_gguf_parameters()

        # Set projector type to GEMMA3N
        self.gguf_writer.add_clip_projector_type(gguf.VisionProjectorType.GEMMA3N)

        # MobileNetV5 specific parameters
        self.gguf_writer.add_vision_attention_layernorm_eps(self.hparams.get("layer_norm_eps", 1e-6))
🧩 Analysis chain
🏁 Script executed:
# First, let's check the actual file content around the specified lines
head -6130 convert_hf_to_gguf.py | tail -90 | cat -n
Repository: ngxson/llama.cpp
Length of output: 6407
🏁 Script executed:
# Check the parent class to understand the context
rg -A 10 "class MmprojModel" convert_hf_to_gguf.py | head -50
Repository: ngxson/llama.cpp
Length of output: 430
🏁 Script executed:
# Search for similar vision model implementations and how patch_size is typically computed
rg "patch_size.*=" convert_hf_to_gguf.py -B 3 -A 3 | head -100
Repository: ngxson/llama.cpp
Length of output: 1714
🏁 Script executed:
# Check if image_seq_length is used elsewhere to understand the expected pattern
rg "image_seq_length" convert_hf_to_gguf.py -B 2 -A 2
Repository: ngxson/llama.cpp
Length of output: 750
🏁 Script executed:
# Look for how n_per_side or sqrt is used in similar contexts
rg "sqrt\|n_per_side\|\*\*\s*0\.5" convert_hf_to_gguf.py | head -20
Repository: ngxson/llama.cpp
Length of output: 42
Fix patch_size computation for Gemma3n—currently mathematically incorrect.
self.hparams_vision["patch_size"] = image_size // image_seq_length (line 75) treats token count as a linear divisor. For a 16×16 patch grid (256 tokens), the correct formula is n_per_side = sqrt(image_seq_length) and patch_size = image_size // n_per_side. With image_size=768 and image_seq_length=256, the current code produces patch_size=3 instead of 48—a 16× error that propagates downstream. Other vision models in this codebase (Qwen3VL, TinyGemma3) use the correct square-root approach.
Proposed fix
# Image sequence length (256 tokens = 16x16 for Gemma3n)
image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
+ n_per_side = int(image_seq_length ** 0.5)
+ if n_per_side * n_per_side != image_seq_length:
+ raise ValueError(f"image_seq_length must be a perfect square, got {image_seq_length}")
image_size = self.hparams_vision["image_size"]
- self.hparams_vision["patch_size"] = image_size // image_seq_length
+ self.hparams_vision["patch_size"] = image_size // n_per_side

Additionally, find_vparam() hardcodes num_heads=8 (line 59) with no config fallback, while hidden_size (line 54) reads from config with a default. For consistency, attempt to read num_heads from self.hparams_vision before hardcoding.
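To make the square-root fix concrete, here is a minimal standalone sketch; gemma3n_patch_size is an illustrative name, not converter code:

```python
import math

def gemma3n_patch_size(image_size: int, image_seq_length: int) -> int:
    # 256 tokens form a 16x16 grid; each patch spans image_size / n_per_side pixels
    n_per_side = math.isqrt(image_seq_length)
    if n_per_side * n_per_side != image_seq_length:
        raise ValueError("image_seq_length must be a perfect square")
    return image_size // n_per_side

print(gemma3n_patch_size(768, 256))  # 48
print(768 // 256)                    # 3 (the buggy linear division)
```

With image_size=768 and image_seq_length=256, only the square-root path yields the 48-pixel patches that a 16x16 token grid implies.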
🧰 Tools
🪛 Ruff (0.14.10)
6047-6047: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
6050-6073: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
🤖 Prompt for AI Agents
In @convert_hf_to_gguf.py around lines 6044 - 6125, The patch_size math in
Gemma3nVisionModel.set_gguf_parameters is wrong: replace the linear division
self.hparams_vision["patch_size"] = image_size // image_seq_length with a
square-root based computation (n_per_side = int(sqrt(image_seq_length)) and
patch_size = image_size // n_per_side) so 256 tokens → 16×16 grid and
patch_size=48 for image_size=768; update references in set_gguf_parameters
accordingly. Also update find_vparam to prefer reading num_heads from
self.hparams_vision (e.g., self.hparams_vision.get("num_heads")) and fall back
to 8 only if absent, keeping the existing hidden_size fallback logic.
| def custom_map(self, name: str) -> str: | ||
| """Parses names like model.vision_tower.timm_model.blocks.1.2.suffix and applies template mapping.""" | ||
| parts = name.split(".") | ||
| # MobileNet blocks have at least 7 parts: model, vision_tower, timm_model, blocks, bid, sid, and suffix | ||
| if len(parts) >= 7: | ||
| bid, sid = parts[4], parts[5] | ||
| suffix = ".".join(parts[6:]) | ||
| template = f"model.vision_tower.timm_model.blocks.{{bid}}.{{sid}}.{suffix}" | ||
| if template in self.block_tensor_mapping: | ||
| return self.block_tensor_mapping[template].format(bid=bid, sid=sid) | ||
|
|
||
| raise ValueError(f"Unknown name: {name}") | ||
|
|
||
| def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]: | ||
| del bid # unused | ||
|
|
||
| # Gemma3n uses | ||
| # - model.embed_vision.* for projection layers | ||
| # - model.vision_tower.* for vision encoder | ||
| # Skip non-vision tensors | ||
| if not (name.startswith("model.embed_vision.") or | ||
| name.startswith("model.vision_tower.")): | ||
| return [] | ||
|
|
||
| if name.startswith("model.vision_tower.timm_model.blocks."): | ||
| # Double-indexed block tensors through custom logic | ||
| new_name = self.custom_map(name) | ||
| else: | ||
| # Route non-repeating (conv_stem, msfa, embedding, etc.) and un-catched through tensor_mapping.py | ||
| new_name = self.map_tensor_name(name) | ||
|
|
||
| if new_name.endswith("conv_stem.conv.bias") or new_name.endswith("layer_scale.gamma"): | ||
| data_torch = data_torch.unsqueeze(0).unsqueeze(-1).unsqueeze(-1) # [1, C, 1, 1] | ||
|
|
||
| yield (new_name, data_torch) | ||
|
|
Make custom_map() less brittle + simplify reshape semantics.
custom_map() raises on any unknown blocks.* tensor (Line 6145). That’s fine for a single known checkpoint, but it makes the converter fragile across MobileNetV5 variants (extra tensors, renamed submodules, etc.). Consider falling back to self.map_tensor_name(name) (or skipping with a warning) when the template isn’t found. The unsqueeze chain (Lines 6165-6166) is harder to read and easier to get wrong than an explicit reshape.
Possible refactor
@@
- if new_name.endswith("conv_stem.conv.bias") or new_name.endswith("layer_scale.gamma"):
- data_torch = data_torch.unsqueeze(0).unsqueeze(-1).unsqueeze(-1) # [1, C, 1, 1]
+ if new_name.endswith("conv_stem.conv.bias") or new_name.endswith("layer_scale.gamma"):
+         data_torch = data_torch.reshape(1, -1, 1, 1)  # [1, C, 1, 1]

Also: n_block_keys = [] and block_tensor_mapping = {...} are mutable class attrs; annotate as ClassVar or use tuples / Mapping to satisfy Ruff RUF012 and prevent accidental mutation.
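The shape equivalence behind this refactor can be checked without torch; the helper below mimics unsqueeze on plain shape tuples (an illustrative model of torch semantics, not library code):

```python
def unsqueeze(shape: tuple, dim: int) -> tuple:
    # mimic torch.Tensor.unsqueeze on a shape tuple; dim may be negative
    dims = list(shape)
    if dim < 0:
        dim += len(dims) + 1
    dims.insert(dim, 1)
    return tuple(dims)

c = 64
chained = unsqueeze(unsqueeze(unsqueeze((c,), 0), -1), -1)
print(chained)  # (1, 64, 1, 1) -- the same target shape reshape(1, -1, 1, 1) states directly
```

Both spellings produce a [1, C, 1, 1] tensor; the reshape just names the target shape in one step.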
🧰 Tools
🪛 Ruff (0.14.10)
6145-6145: Avoid specifying long messages outside the exception class
(TRY003)
// mobilenetv5 for gemma3n
std::vector<mobilenetv5_block> mobilenet_blocks;
std::vector<int> mobilenet_stage_ends;
ggml_tensor * mobilenet_stem_conv_w = nullptr;
ggml_tensor * mobilenet_stem_conv_b = nullptr;
ggml_tensor * mobilenet_stem_norm_w = nullptr;
ggml_tensor * mm_post_proj_norm_w = nullptr;

// Multi-Scale Fusion Adapter (MSFA) components
ggml_tensor * msfa_concat_conv_w = nullptr;
ggml_tensor * msfa_concat_norm_w = nullptr;
ggml_tensor * msfa_ffn_expand_w = nullptr;
ggml_tensor * msfa_ffn_project_w = nullptr;
ggml_tensor * msfa_ffn_expand_bn = nullptr;
ggml_tensor * msfa_ffn_project_bn = nullptr;
🧩 Analysis chain
🏁 Script executed:
rg -n 'msfa_concat_conv_w|mm_post_proj_norm_w' --type-list | head -20
Repository: ngxson/llama.cpp
Length of output: 176
🏁 Script executed:
rg -n 'msfa_concat_conv_w|mm_post_proj_norm_w'
Repository: ngxson/llama.cpp
Length of output: 395
🏁 Script executed:
rg -n -A 5 -B 5 'msfa_concat_norm_w|msfa_concat_conv_w' tools/mtmd/clip.cpp
Repository: ngxson/llama.cpp
Length of output: 833
🏁 Script executed:
rg -n -A 3 -B 3 'mm_post_proj_norm_w' tools/mtmd/clip.cpp
Repository: ngxson/llama.cpp
Length of output: 42
🏁 Script executed:
# Check what gets loaded in clip.cpp for the GEMMA3N case more broadly
rg -n 'get_tensor.*MNV5' tools/mtmd/clip.cpp | head -30
Repository: ngxson/llama.cpp
Length of output: 3743
🏁 Script executed:
# Check if mm_post_proj_norm_w has a corresponding tensor name constant
rg -n 'MNV5.*PROJ|POST.*PROJ' tools/mtmd/
Repository: ngxson/llama.cpp
Length of output: 1049
🏁 Script executed:
# Search for tensor name constants related to post projection
rg -n 'TN_MNV5.*POST|TN_MNV5.*PROJ_NORM' tools/mtmd/clip-impl.h
Repository: ngxson/llama.cpp
Length of output: 42
🏁 Script executed:
# Check the exact line in mobilenetv5.cpp to understand context
sed -n '435,450p' tools/mtmd/models/mobilenetv5.cpp
Repository: ngxson/llama.cpp
Length of output: 603
Remove unused tensor declaration and fix unloaded tensor bug.
msfa_concat_conv_w is declared but never used anywhere in the codebase and should be removed. More critically, mm_post_proj_norm_w is used in mobilenetv5.cpp:445-447 but is never loaded from the model in clip.cpp, so it will always be nullptr. Either add the loading logic for mm_post_proj_norm_w in the GEMMA3N model initialization (around line 1574 in clip.cpp), or remove the conditional check in mobilenetv5.cpp if the tensor doesn't exist in the model file.
🤖 Prompt for AI Agents
In @tools/mtmd/clip-model.h around lines 331 - 347, Remove the unused
msfa_concat_conv_w declaration from the header and fix the unloaded
mm_post_proj_norm_w by adding its loading logic during GEMMA3N model init in
clip.cpp (follow the same pattern used for mobilenet_stem_conv_w /
mobilenet_stem_norm_w: call the model tensor-load helper to assign
mm_post_proj_norm_w, check for nullptr and handle gracefully). Alternatively, if
the model truly does not provide that tensor, remove the conditional check/usage
of mm_post_proj_norm_w in mobilenetv5.cpp instead of loading it. Refer to the
symbols mobilenet_blocks, mobilenet_stem_conv_w, mobilenet_stem_norm_w,
mm_post_proj_norm_w, msfa_concat_conv_w, and the mobilenetv5.cpp/clip.cpp
initialization areas when making the change.
Actionable comments posted: 14
🤖 Fix all issues with AI agents
In @convert_hf_to_gguf.py:
- Around line 6051-6090: The current __init__ and find_hparam logic force
hparams_vision["n_layers"]=0 and unconditionally set
hparams_vision["intermediate_size"]=hidden_size*4 and
hparams_vision["num_attention_heads"]=8, which is brittle; change this to derive
values from a provided vision_config (or require vision_config keys) by: in
find_hparam/__init__ validate presence of required keys in self.hparams_vision
or a passed vision_config, use dict.setdefault for intermediate_size and
num_attention_heads only if the corresponding hidden_size/num_attention_heads
exist, and otherwise raise a clear error or log a fatal message so missing
vision metadata fails loudly; update references to find_hparam, __init__,
hparams_vision, intermediate_size, and num_attention_heads accordingly.
- Around line 6098-6102: The computation of patch_size is incorrect: instead of
dividing image_size by image_seq_length, compute patches_per_side =
int(math.sqrt(image_seq_length)), validate that patches_per_side**2 ==
image_seq_length and image_size % patches_per_side == 0, then set
self.hparams_vision["patch_size"] = image_size // patches_per_side; if
validations fail, raise a clear error (or log and exit) mentioning
image_seq_length and image_size so callers can fix the config (touch variables:
image_seq_length from self.preprocessor_config, image_size and patch_size in
self.hparams_vision).
- Around line 6229-6250: The padding code treats both token embeddings and
per-layer embeddings the same, but embed_tokens_per_layer tensors have shape
[embedding_dim, n_vocab], so padding must be applied on axis 1 for per-layer
tensors instead of axis 0; update the block that checks "embed_tokens.weight" or
"embed_tokens_per_layer" to branch when "per_layer" in name: for regular token
embeddings keep current_size = data_torch.shape[0] and pad with zeros of shape
(padding_size, data_torch.shape[1]) concatenated dim=0; for per-layer embeddings
set current_size = data_torch.shape[1], compute padding_size = vocab_size -
current_size, create padding zeros of shape (data_torch.shape[0], padding_size)
and concatenate dim=1; adjust the logger message accordingly and keep moving
data_torch to CPU before padding and returning (self.map_tensor_name(name),
data_torch).
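The axis distinction described in the prompt above can be sketched with plain nested lists; pad_vocab_axis is a hypothetical helper, not converter code:

```python
def pad_vocab_axis(mat: list[list[float]], vocab_size: int, per_layer: bool) -> list[list[float]]:
    """Zero-pad the vocab axis of a 2D matrix.

    Regular embed_tokens.weight is [n_vocab, dim]: pad rows (axis 0).
    Per-layer embeddings are [embedding_dim, n_vocab]: pad columns (axis 1).
    """
    if per_layer:
        pad = vocab_size - len(mat[0])
        return [row + [0.0] * pad for row in mat]
    dim = len(mat[0])
    pad = vocab_size - len(mat)
    return mat + [[0.0] * dim for _ in range(pad)]

tok = pad_vocab_axis([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], 4, per_layer=False)  # 2 tokens, dim 3
per = pad_vocab_axis([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], 4, per_layer=True)  # dim 3, 2 tokens
print(len(tok), len(tok[0]))  # 4 3
print(len(per), len(per[0]))  # 3 4
```

Padding the wrong axis silently changes the embedding dimension instead of the vocab size, which is why the branch on "per_layer" matters.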
In @tools/mtmd/clip.cpp:
- Around line 3233-3238: For PROJECTOR_TYPE_GEMMA3N in clip_n_output_tokens(),
n_patches is being set to the number of patches per side
(ctx->model.hparams.image_size / ctx->model.hparams.patch_size) but must be the
total token count (per_side squared); change the assignment so n_patches =
per_side * per_side (e.g., compute per_side = ctx->model.hparams.image_size /
ctx->model.hparams.patch_size and then n_patches = per_side * per_side) to
return 16×16=256 tokens for GEMMA3N and satisfy the downstream sanity check.
- Around line 1154-1160: The comment for PROJECTOR_TYPE_GEMMA3N is misleading:
MobileNetV5 does not fully bypass resizing because preprocessing still
force-resizes the input; update the inline comment near the hparams.n_merge
assignment (and the get_u32 call) to state that Gemma3n/MobileNetV5 expects 256
tokens (16x16), we set n_merge = 1, and note that preprocessing still performs a
forced resize (see the preprocessing logic) so the model's internal resizing
does not eliminate external preprocessing. Keep the behavior unchanged, just
correct and clarify the comment text.
In @tools/mtmd/models/mobilenetv5.cpp:
- Around line 5-20: In clip_graph_mobilenetv5::rms_norm_2d add a defensive null
check for the inp parameter before any dereference (e.g., before calling
ggml_permute); if inp is null, return nullptr (or an appropriate
error/early-exit tensor) to avoid a null-pointer dereference, keeping existing
behavior for weight unchanged and ensuring the function returns a valid
ggml_tensor* in the error case.
- Around line 91-149: The function build_inverted_residual uses the inp pointer
without validation; add an immediate null check at the top of
build_inverted_residual for the inp parameter and handle it safely (e.g., return
nullptr or propagate an error) instead of dereferencing a null pointer so the
rest of the function (uses of inp->ne[...] and residual addition) are not
executed when inp is null.
- Around line 248-260: The build() function uses model.mobilenet_stem_conv_w
without validation; add a null-check at the start of the stem block (before
calling ggml_conv_2d_direct) to detect missing stem weights
(model.mobilenet_stem_conv_w == nullptr) and handle it by logging an
error/throwing or returning nullptr from build() to avoid dereferencing; ensure
downstream code does not assume cur was created if the check fails and keep
existing handling for mobilenet_stem_conv_b and mobilenet_stem_norm_w unchanged.
- Around line 23-53: In pad_same_2d, add a null check for the input pointer inp
at the start of the function and return or handle the error if inp is null; also
validate stride_h and stride_w are > 0 before using them (e.g., return early or
assert/log error) to avoid division by zero when computing oh and ow; update
references to inp, stride_h, and stride_w in pad_same_2d accordingly so the
function fails fast on invalid inputs instead of dereferencing a null pointer or
performing division by zero.
- Around line 57-88: The function build_edge_residual assumes inp and block
weight tensors exist; add explicit null checks at the top of
build_edge_residual: if inp is null return nullptr (or inp as appropriate) to
avoid dereferencing, and verify block.s0_conv_exp_w and block.s0_conv_pwl_w
before calling ggml_conv_2d_direct (and before passing them to rms_norm_2d); if
either weight is null, skip the corresponding conv/pwl steps or return nullptr
consistently so callers can handle the error. Ensure all early exits use the
same convention as the surrounding codebase (nullptr or original inp) and
reference the symbols build_edge_residual, block.s0_conv_exp_w,
block.s0_conv_pwl_w, ggml_conv_2d_direct, and rms_norm_2d when making the
checks.
- Around line 152-246: The function build_mobilenet_attn may dereference null
pointers (inp and several block weight tensors); add defensive null checks at
the start of build_mobilenet_attn to validate inp and before using each required
weight (block.attn_q_w, block.attn_k_w, block.attn_v_w, block.attn_o_w) and
return a safe fallback (e.g., inp or nullptr) or propagate an error if any are
null; also guard uses of optional downsample/norm tensors (block.attn_k_dw_w,
block.attn_v_dw_w, block.attn_k_norm_w, block.attn_v_norm_w,
block.layer_scale_w) so they are only accessed when non-null to avoid
null-pointer deref.
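For pad_same_2d, the stride guard requested above pairs with the usual TF-style "SAME" padding arithmetic; the sketch below is generic and makes no claims about the actual ggml implementation:

```python
def same_pad_2d(in_h: int, in_w: int, k_h: int, k_w: int, stride_h: int, stride_w: int):
    # fail fast instead of dividing by zero in the output-size computation
    if stride_h <= 0 or stride_w <= 0:
        raise ValueError("strides must be positive")
    out_h = -(-in_h // stride_h)  # ceil division
    out_w = -(-in_w // stride_w)
    pad_h = max((out_h - 1) * stride_h + k_h - in_h, 0)  # total pad along height
    pad_w = max((out_w - 1) * stride_w + k_w - in_w, 0)  # total pad along width
    return pad_h, pad_w

print(same_pad_2d(7, 7, 3, 3, 2, 2))  # (2, 2)
```

The guard turns an undefined division by zero into a clear error at the call site.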
🧹 Nitpick comments (5)
tools/mtmd/clip-model.h (2)
176-214: mobilenetv5_block layout is clear; consider adding tiny helpers to prevent invalid combos.
As-is, blocks can be “partially populated” (e.g., both Edge+Attention), which may be valid, but it’s easy to mis-handle later; small predicates like is_edge() / is_uir() / is_attn() would make the execution path safer/cleaner in mobilenetv5.cpp.
331-346: Use an index-safe type for mobilenet_stage_ends and keep loader/header consistent.
std::vector<int> mobilenet_stage_ends will truncate on very large models and doesn’t match the size_t indices used in logs/compute. Prefer std::vector<size_t> (or std::vector<int32_t> if you truly want a bounded range) and update the push sites in clip.cpp accordingly.
convert_hf_to_gguf.py (2)
6051-6074: Minor: annotate block_tensor_mapping as ClassVar + keep exception style consistent.
This matches Ruff RUF012 / TRY003 and avoids signaling “instance state”.
Proposed tweak
@@
-from typing import TYPE_CHECKING, Any, Callable, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast
+from typing import TYPE_CHECKING, Any, Callable, ClassVar, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast
@@
-    block_tensor_mapping = {
+    block_tensor_mapping: ClassVar[dict[str, str]] = {
@@
-        raise ValueError(f"Unknown name: {name}")
+        raise ValueError("Unknown MobileNetV5 tensor name") from None
Also applies to: 6116-6128
6174-6193: Use try/finally when temporarily deleting vocab_size_per_layer_input.
As written, an exception in super().set_vocab() can leave self.hparams mutated.
Proposed fix
-        vocab_size_per_layer_input = self.hparams.get("vocab_size_per_layer_input")
-
-        # Temporarily remove vocab_size_per_layer_input to force using vocab_size
-        if vocab_size_per_layer_input is not None:
-            del self.hparams["vocab_size_per_layer_input"]
-
-        # Call parent set_vocab which will now use vocab_size (262400)
-        super().set_vocab()
-
-        # Restore vocab_size_per_layer_input for later use
-        if vocab_size_per_layer_input is not None:
-            self.hparams["vocab_size_per_layer_input"] = vocab_size_per_layer_input
+        vocab_size_per_layer_input = self.hparams.pop("vocab_size_per_layer_input", None)
+        try:
+            # Call parent set_vocab which will now use vocab_size (262400)
+            super().set_vocab()
+        finally:
+            if vocab_size_per_layer_input is not None:
+                self.hparams["vocab_size_per_layer_input"] = vocab_size_per_layer_input
tools/mtmd/models/mobilenetv5.cpp (1)
420-420: Consider extracting epsilon constant.
The epsilon value 1e-6f is used in multiple RMS norm operations (lines 420 and 442). Extracting this to a named constant would improve maintainability and ensure consistency.
♻️ Suggested refactor
At the top of the file or in a constants section:
static constexpr float GEMMA3N_RMS_NORM_EPS = 1e-6f;
Then use throughout:
-    const float eps = 1e-6f; // Gemma3n uses 1e-6
-    cur = ggml_rms_norm(ctx0, cur, eps);
+    cur = ggml_rms_norm(ctx0, cur, GEMMA3N_RMS_NORM_EPS);
Also applies to: 442-442
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- convert_hf_to_gguf.py
- tools/mtmd/clip-model.h
- tools/mtmd/clip.cpp
- tools/mtmd/models/mobilenetv5.cpp
- tools/mtmd/models/models.h
- tools/mtmd/mtmd.cpp
🚧 Files skipped from review as they are similar to previous changes (2)
- tools/mtmd/mtmd.cpp
- tools/mtmd/models/models.h
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".
When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
Applied to files:
tools/mtmd/clip.cpp
🧬 Code graph analysis (2)
convert_hf_to_gguf.py (2)
gguf-py/gguf/gguf_writer.py (2)
add_clip_projector_type (1086-1087)
add_vision_attention_layernorm_eps (1107-1108)
gguf-py/gguf/constants.py (2)
VisionProjectorType (3525-3548)
GGMLQuantizationType (3400-3432)
tools/mtmd/models/mobilenetv5.cpp (1)
ggml/src/ggml.c (15)
ggml_permute (3702-3754)
ggml_cont (3463-3467)
ggml_rms_norm (3068-3073)
ggml_mul (2172-2177)
ggml_pad_ext (4985-5018)
ggml_conv_2d_direct (4704-4738)
ggml_gelu (2679-2683)
ggml_add (1971-1976)
ggml_mul_mat (3176-3191)
ggml_scale (3292-3297)
ggml_soft_max (3968-3972)
ggml_upscale (4930-4937)
ggml_concat (2519-2546)
ggml_pool_2d (4854-4880)
ggml_build_forward_expand (6795-6797)
🪛 Ruff (0.14.10)
convert_hf_to_gguf.py
6051-6074: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
6127-6127: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: windows-2022-cmake-cuda (12.4)
- GitHub Check: windows-msys2 (CLANG64, clang-x86_64, Release)
- GitHub Check: ubuntu-24-cmake-vulkan-deb
- GitHub Check: server-windows
🔇 Additional comments (7)
tools/mtmd/clip.cpp (3)
783-795: GEMMA3N graph builder wiring looks correct.
Selecting clip_graph_mobilenetv5 for PROJECTOR_TYPE_GEMMA3N is the right integration point.
3631-3644: GEMMA3N “no extra inputs” path: please confirm the MobileNetV5 graph doesn’t require positions/masks.
You added PROJECTOR_TYPE_GEMMA3N to the do-nothing branch (line 3631). If MobileNetV5/MSFA needs any runtime inputs beyond inp_raw, they'll be missing here.
3759-3761: GEMMA3N mmproj embedding sizing: LGTM.
Returning mm_input_proj_w->ne[0] is consistent with Gemma3 and keeps caller logic uniform.
convert_hf_to_gguf.py (2)
530-535: Good defensive handling for empty tensor_map (prevents max() crash).
This keeps tensor logging robust for block_count=0 models like MobileNetV5.
6146-6148: The [1, C, 1, 1] reshape is necessary and correct for GGML broadcasting.
The tensors are reshaped from shape [C] to [1, C, 1, 1] to properly broadcast with the convolution output shape [C, H, W, N] in GGML operations (lines 256 and 138/240 in mobilenetv5.cpp). The reshape matches C++ expectations; no issues.
tools/mtmd/models/mobilenetv5.cpp (2)
298-392: MSFA implementation looks solid.
The Multi-Scale Fusion Adapter logic correctly:
- Guards against empty intermediate features (line 299)
- Resizes features to match target resolution
- Warns about non-integer scaling (lines 325-327)
- Conditionally applies all optional transformations (expand, project, norms)
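The non-integer scaling guard in the third point can be illustrated with a small sketch. Hedged: `msfa_upscale_factor` is a hypothetical helper name; the real check lives in mobilenetv5.cpp and operates on ggml tensors, not plain ints.

```python
def msfa_upscale_factor(feat_hw: int, target_hw: int) -> int:
    """Integer upscale factor from a feature-map side length to the MSFA
    target side length, warning when the ratio is not integral (the
    upscale then silently floors)."""
    if target_hw % feat_hw != 0:
        print(f"warning: non-integer MSFA scale {target_hw}/{feat_hw}")
    return max(1, target_hw // feat_hw)

print(msfa_upscale_factor(8, 16))   # exact 2x upscale -> 2
print(msfa_upscale_factor(6, 16))   # warns, floors 16/6 -> 2
```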
401-407: The permutation sequence at lines 403-404 is intentionally designed to transform spatial dimensions before flattening to tokens. The code includes an explicit comment explaining that it reshapes from PyTorch's (Batch, Seq, Hidden) convention to GGML's (Hidden, Seq, Batch) format, and the final shape [C, W*H, B] aligns with this mapping.
However, the codebase does not include the PyTorch model implementation or explicit validation that confirms the width-major token ordering (from the [C, W, H, B] → [C, W*H, B] transformation) matches Gemma3N's expected token traversal. The conversion script handles tensor weight mapping but does not validate forward-pass token ordering. To fully verify this matches the upstream PyTorch model, you would need to compare model outputs between this GGML implementation and the original PyTorch implementation.
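The token ordering in question can be pinned down with a tiny sketch. This is an illustration of GGML's first-dim-fastest flatten, not code from the PR:

```python
# Merging GGML's [C, W, H, B] into [C, W*H, B] walks W (the first merged
# dimension) fastest, so token t corresponds to (y, x) = divmod(t, W):
# row-major traversal with x varying fastest.
H, W = 2, 3
row_major = [(y, x) for y in range(H) for x in range(W)]
assert [divmod(t, W) for t in range(H * W)] == row_major
print(row_major)  # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
```

Comparing this traversal against the upstream PyTorch flatten order is exactly the validation the comment above asks for.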
block_tensor_mapping = {
    "model.vision_tower.timm_model.blocks.{bid}.{sid}.conv_exp.weight": "v.blk.{bid}.{sid}.conv_exp.weight",
    "model.vision_tower.timm_model.blocks.{bid}.{sid}.bn1.weight": "v.blk.{bid}.{sid}.bn1.weight",
    "model.vision_tower.timm_model.blocks.{bid}.{sid}.conv_pwl.weight": "v.blk.{bid}.{sid}.conv_pwl.weight",
    "model.vision_tower.timm_model.blocks.{bid}.{sid}.bn2.weight": "v.blk.{bid}.{sid}.bn2.weight",
    "model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_start.conv.weight": "v.blk.{bid}.{sid}.dw_start.conv.weight",
    "model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_start.bn.weight": "v.blk.{bid}.{sid}.dw_start.bn.weight",
    "model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_mid.conv.weight": "v.blk.{bid}.{sid}.dw_mid.conv.weight",
    "model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_mid.bn.weight": "v.blk.{bid}.{sid}.dw_mid.bn.weight",
    "model.vision_tower.timm_model.blocks.{bid}.{sid}.pw_exp.conv.weight": "v.blk.{bid}.{sid}.pw_exp.conv.weight",
    "model.vision_tower.timm_model.blocks.{bid}.{sid}.pw_exp.bn.weight": "v.blk.{bid}.{sid}.pw_exp.bn.weight",
    "model.vision_tower.timm_model.blocks.{bid}.{sid}.pw_proj.conv.weight": "v.blk.{bid}.{sid}.pw_proj.conv.weight",
    "model.vision_tower.timm_model.blocks.{bid}.{sid}.pw_proj.bn.weight": "v.blk.{bid}.{sid}.pw_proj.bn.weight",
    "model.vision_tower.timm_model.blocks.{bid}.{sid}.layer_scale.gamma": "v.blk.{bid}.{sid}.layer_scale.gamma",
    "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.query.proj.weight": "v.blk.{bid}.{sid}.attn.query.proj.weight",
    "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.key.proj.weight": "v.blk.{bid}.{sid}.attn.key.proj.weight",
    "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.value.proj.weight": "v.blk.{bid}.{sid}.attn.value.proj.weight",
    "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.output.proj.weight": "v.blk.{bid}.{sid}.attn.output.proj.weight",
    "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.key.down_conv.weight": "v.blk.{bid}.{sid}.attn.key.down_conv.weight",
    "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.key.norm.weight": "v.blk.{bid}.{sid}.attn.key.norm.weight",
    "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.value.down_conv.weight": "v.blk.{bid}.{sid}.attn.value.down_conv.weight",
    "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.value.norm.weight": "v.blk.{bid}.{sid}.attn.value.norm.weight",
    "model.vision_tower.timm_model.blocks.{bid}.{sid}.norm.weight": "v.blk.{bid}.{sid}.norm.weight",
}

def find_hparam(self, keys: Iterable[str], optional: bool = False) -> Any:
    # force n_layers to 0 in __init__()
    # we have to do this because self.hparams_vision is not yet accessible for modification inside __init__()
    if "n_layers" in list(keys):
        return 0
    return super().find_hparam(keys, optional)

def __init__(self, *args, **kwargs):
    # Parent init will call find_hparam which now returns 0 for empty keys
    super().__init__(*args, **kwargs)
    assert self.hparams_vision is not None
    self.hparams_vision["n_layers"] = 0
    self.hparams_vision["intermediate_size"] = self.hparams_vision.get("hidden_size", 2048) * 4
    self.hparams_vision["num_attention_heads"] = self.hparams_vision.get("num_attention_heads", 8)
Avoid brittle hardcoded vision defaults; derive from vision_config (or fail loudly).
intermediate_size = hidden_size * 4 and num_attention_heads = 8 as unconditional fallbacks can silently produce mismatched GGUF metadata if Gemma3n variants change. Consider setdefault() with strict validation (e.g., require hidden_size present) and/or log when falling back. Please relay upstream.
🤖 Prompt for AI Agents
In @convert_hf_to_gguf.py around lines 6051 - 6090, The current __init__ and
find_hparam logic force hparams_vision["n_layers"]=0 and unconditionally set
hparams_vision["intermediate_size"]=hidden_size*4 and
hparams_vision["num_attention_heads"]=8, which is brittle; change this to derive
values from a provided vision_config (or require vision_config keys) by: in
find_hparam/__init__ validate presence of required keys in self.hparams_vision
or a passed vision_config, use dict.setdefault for intermediate_size and
num_attention_heads only if the corresponding hidden_size/num_attention_heads
exist, and otherwise raise a clear error or log a fatal message so missing
vision metadata fails loudly; update references to find_hparam, __init__,
hparams_vision, intermediate_size, and num_attention_heads accordingly.
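A minimal sketch of that stricter approach. Assumptions: `fill_vision_hparams` is a hypothetical free function used for illustration; the real logic would live in the converter's `__init__` and mutate `self.hparams_vision` in place.

```python
def fill_vision_hparams(hv: dict) -> dict:
    """Derive Gemma3n vision hparams, failing loudly instead of guessing."""
    if "hidden_size" not in hv:
        raise ValueError("vision_config must provide hidden_size")
    hv["n_layers"] = 0  # MobileNetV5 has no uniform transformer layer stack
    # setdefault keeps any value the config already supplies,
    # falling back only when the key is absent
    hv.setdefault("intermediate_size", hv["hidden_size"] * 4)
    hv.setdefault("num_attention_heads", 8)
    return hv

print(fill_vision_hparams({"hidden_size": 2048}))
```

With this shape, a config that ships its own `intermediate_size` is respected, and a config missing `hidden_size` fails at conversion time rather than producing mismatched GGUF metadata.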
# Image sequence length (256 tokens = 16x16 for Gemma3n)
image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
image_size = self.hparams_vision["image_size"]
self.hparams_vision["patch_size"] = image_size // image_seq_length
Fix patch_size computation (currently semantically incorrect / fragile).
patch_size = image_size // image_seq_length treats a token count as a linear dimension. For a 16×16 grid (image_seq_length=256), derive patch_size via sqrt(image_seq_length) (patches per side), and validate squareness/divisibility. Please relay this upstream.
Proposed fix
# Image sequence length (256 tokens = 16x16 for Gemma3n)
image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
image_size = self.hparams_vision["image_size"]
- self.hparams_vision["patch_size"] = image_size // image_seq_length
+ n_per_side = int(math.isqrt(image_seq_length))
+ if n_per_side * n_per_side != image_seq_length:
+ raise ValueError(f"image_seq_length must be a perfect square, got {image_seq_length}")
+ if image_size % n_per_side != 0:
+ raise ValueError(f"image_size ({image_size}) must be divisible by sqrt(image_seq_length) ({n_per_side})")
+ self.hparams_vision["patch_size"] = image_size // n_per_side

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In @convert_hf_to_gguf.py around lines 6098 - 6102, The computation of
patch_size is incorrect: instead of dividing image_size by image_seq_length,
compute patches_per_side = int(math.sqrt(image_seq_length)), validate that
patches_per_side**2 == image_seq_length and image_size % patches_per_side == 0,
then set self.hparams_vision["patch_size"] = image_size // patches_per_side; if
validations fail, raise a clear error (or log and exit) mentioning
image_seq_length and image_size so callers can fix the config (touch variables:
image_seq_length from self.preprocessor_config, image_size and patch_size in
self.hparams_vision).
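Worked through with concrete numbers (the 768 px input resolution is an assumption for illustration; `image_seq_length=256` comes from the comment above):

```python
import math

image_seq_length = 256   # 16x16 token grid
image_size = 768         # illustrative input resolution (assumption)

n_per_side = math.isqrt(image_seq_length)        # 16 patches per side
assert n_per_side * n_per_side == image_seq_length
assert image_size % n_per_side == 0
patch_size = image_size // n_per_side            # 48 px per patch

# The flagged expression divides a length by a token *count* instead:
buggy_patch_size = image_size // image_seq_length  # 3 px
print(patch_size, buggy_patch_size)  # 48 3
```

With the corrected formula, `image_size / patch_size = 16` per side, so `n_patches = 16 * 16 = 256` matches the expected sequence length.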
# Pad token embeddings for vision/audio special tokens (262144-262399)
if "embed_tokens.weight" in name or "embed_tokens_per_layer" in name:
    # Move to CPU to avoid meta device issues during padding
    data_torch = data_torch.to(device="cpu")

    vocab_size = self.hparams.get("vocab_size", 262400)
    current_size = data_torch.shape[0]  # First dimension is vocab_size

    if current_size < vocab_size:
        # Pad with zeros for vision/audio tokens (they get embeddings from vision tower)
        padding_size = vocab_size - current_size
        tensor_type = "per-layer embeddings" if "per_layer" in name else "token embeddings"
        logger.info(f"Padding {tensor_type} shape {list(data_torch.shape)} from {current_size} to {vocab_size} (adding {padding_size} vision/audio token slots)")

        # Create padding with zeros (vision tokens won't use these embeddings)
        padding = torch.zeros((padding_size, data_torch.shape[1]), dtype=data_torch.dtype, device=data_torch.device)
        data_torch = torch.cat([data_torch, padding], dim=0)

    # Continue with normal processing
    name = name.replace("language_model.", "")
    return [(self.map_tensor_name(name), data_torch)]
Fix dimension mismatch: embed_tokens_per_layer has shape [embedding_dim, n_vocab], not [vocab, dim].
The padding logic assumes shape[0] is vocab_size and pads with (padding_size, shape[1]), but per-layer embeddings have reversed dimensions compared to regular token embeddings. According to the model definition (src/llama-model.cpp:4166), tok_embd_per_layer is shaped as {n_embd_altup * n_layer, n_vocab}, meaning embedding dimension comes first. This causes the padding to be applied to the wrong axis, corrupting the tensor. Add a separate code path for "per_layer" in name to handle the transposed case, or validate tensor rank/shape with an assertion before accessing shape[1].
🤖 Prompt for AI Agents
In @convert_hf_to_gguf.py around lines 6229 - 6250, The padding code treats both
token embeddings and per-layer embeddings the same, but embed_tokens_per_layer
tensors have shape [embedding_dim, n_vocab], so padding must be applied on axis
1 for per-layer tensors instead of axis 0; update the block that checks
"embed_tokens.weight" or "embed_tokens_per_layer" to branch when "per_layer" in
name: for regular token embeddings keep current_size = data_torch.shape[0] and
pad with zeros of shape (padding_size, data_torch.shape[1]) concatenated dim=0;
for per-layer embeddings set current_size = data_torch.shape[1], compute
padding_size = vocab_size - current_size, create padding zeros of shape
(data_torch.shape[0], padding_size) and concatenate dim=1; adjust the logger
message accordingly and keep moving data_torch to CPU before padding and
returning (self.map_tensor_name(name), data_torch).
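The suggested branch can be sketched at the shape level. Hedged: `padded_shape` is a hypothetical helper illustrating only the axis choice, not the converter's code; the actual fix would `torch.cat` zeros along the selected axis.

```python
def padded_shape(shape: tuple, name: str, vocab_size: int) -> tuple:
    """Pick the vocab axis to pad: regular token embeddings are
    [vocab, dim], while Gemma3n per-layer embeddings are stored
    transposed as [dim, vocab] (per src/llama-model.cpp)."""
    vocab_axis = 1 if "per_layer" in name else 0
    out = list(shape)
    out[vocab_axis] = max(out[vocab_axis], vocab_size)
    return tuple(out)

print(padded_shape((262144, 2048), "model.embed_tokens.weight", 262400))
# (262400, 2048) -- padded along axis 0
print(padded_shape((2560, 262144), "model.embed_tokens_per_layer.weight", 262400))
# (2560, 262400) -- padded along axis 1
```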
tools/mtmd/clip.cpp
Outdated
case PROJECTOR_TYPE_GEMMA3N:
    {
        model.mobilenet_stem_conv_w = get_tensor(TN_MNV5_STEM_CONV, false);
        model.mobilenet_stem_conv_b = get_tensor(TN_MNV5_STEM_BIAS, false);
        model.mobilenet_stem_norm_w = get_tensor(TN_MNV5_STEM_BN, false);

        model.msfa_ffn_expand_w = get_tensor(TN_MNV5_MSFA_FFN_EXP_W, false);
        model.msfa_ffn_expand_bn = get_tensor(TN_MNV5_MSFA_FFN_EXP_BN, false); // Consume BN if present but likely folded
        model.msfa_ffn_project_w = get_tensor(TN_MNV5_MSFA_FFN_PROJ_W, false);
        model.msfa_ffn_project_bn = get_tensor(TN_MNV5_MSFA_FFN_PROJ_BN, false);

        model.msfa_concat_norm_w = get_tensor(TN_MNV5_MSFA_NORM, false);

        // Dynamically load blocks stage by stage
        for (int stage = 0; stage < 4; ++stage) {
            int blocks_found_in_stage = 0;

            for (int blk_idx = 0; ; ++blk_idx) {
                bool found_block = false;
                mobilenetv5_block block;

                // 1. Check for Edge Residual (S0)
                block.s0_conv_exp_w = get_tensor(string_format(TN_MNV5_BLK_S0_EXP_W, stage, blk_idx), false);
                if (block.s0_conv_exp_w) {
                    found_block = true;
                    block.s0_bn1_w = get_tensor(string_format(TN_MNV5_BLK_S0_BN1_W, stage, blk_idx), false);
                    block.s0_conv_pwl_w = get_tensor(string_format(TN_MNV5_BLK_S0_PWL_W, stage, blk_idx), false);
                    block.s0_bn2_w = get_tensor(string_format(TN_MNV5_BLK_S0_BN2_W, stage, blk_idx), false);
                }
                // 2. Check for UIR (Universal Inverted Residual)
                else {
                    // Check for dw_start OR pw_exp (some UIR blocks skip dw_start)
                    block.dw_start_w = get_tensor(string_format(TN_MNV5_BLK_DW_START_W, stage, blk_idx), false);
                    block.pw_exp_w = get_tensor(string_format(TN_MNV5_BLK_PW_EXP_W, stage, blk_idx), false);

                    if (block.dw_start_w || block.pw_exp_w) {
                        found_block = true;
                        if (block.dw_start_w) {
                            block.dw_start_bn_w = get_tensor(string_format(TN_MNV5_BLK_DW_START_BN, stage, blk_idx), false);
                        }
                        if (block.pw_exp_w) {
                            block.pw_exp_bn_w = get_tensor(string_format(TN_MNV5_BLK_PW_EXP_BN, stage, blk_idx), false);
                        }
                        block.dw_mid_w = get_tensor(string_format(TN_MNV5_BLK_DW_MID_W, stage, blk_idx), false);
                        if (block.dw_mid_w) {
                            block.dw_mid_bn_w = get_tensor(string_format(TN_MNV5_BLK_DW_MID_BN, stage, blk_idx), false);
                        }
                        block.pw_proj_w = get_tensor(string_format(TN_MNV5_BLK_PW_PROJ_W, stage, blk_idx), false);
                        if (block.pw_proj_w) {
                            block.pw_proj_bn_w = get_tensor(string_format(TN_MNV5_BLK_PW_PROJ_BN, stage, blk_idx), false);
                        }
                        block.layer_scale_w = get_tensor(string_format(TN_MNV5_BLK_LAYER_SCALE, stage, blk_idx), false);
                    }
                }

                // 3. Check for Attention (MQA)
                // Even if UIR/Edge check failed, this might be a pure attention block
                ggml_tensor * attn_q_check = get_tensor(string_format(TN_MNV5_ATTN_Q_W, stage, blk_idx), false);
                if (attn_q_check) {
                    found_block = true;
                    block.attn_q_w = attn_q_check;
                    block.attn_k_w = get_tensor(string_format(TN_MNV5_ATTN_K_W, stage, blk_idx), false);
                    block.attn_v_w = get_tensor(string_format(TN_MNV5_ATTN_V_W, stage, blk_idx), false);
                    block.attn_o_w = get_tensor(string_format(TN_MNV5_ATTN_O_W, stage, blk_idx), false);
                    block.attn_k_dw_w = get_tensor(string_format(TN_MNV5_ATTN_K_DW, stage, blk_idx), false);
                    block.attn_k_norm_w = get_tensor(string_format(TN_MNV5_ATTN_K_NORM, stage, blk_idx), false);
                    block.attn_v_dw_w = get_tensor(string_format(TN_MNV5_ATTN_V_DW, stage, blk_idx), false);
                    block.attn_v_norm_w = get_tensor(string_format(TN_MNV5_ATTN_V_NORM, stage, blk_idx), false);
                    block.attn_norm_w = get_tensor(string_format(TN_MNV5_ATTN_NORM, stage, blk_idx), false);
                    // Note: Attention blocks also have layer_scale, load it if not already loaded by UIR check
                    if (!block.layer_scale_w) {
                        block.layer_scale_w = get_tensor(string_format(TN_MNV5_BLK_LAYER_SCALE, stage, blk_idx), false);
                    }
                }

                if (found_block) {
                    model.mobilenet_blocks.push_back(block);
                    blocks_found_in_stage++;
                } else {
                    // End of blocks for this stage
                    break;
                }
            }

            // Track where this stage ends in the flat vector
            if (blocks_found_in_stage > 0) {
                model.mobilenet_stage_ends.push_back(model.mobilenet_blocks.size() - 1);
                LOG_INF("%s: Stage %d ended at global block index %zu\n", __func__, stage, model.mobilenet_blocks.size() - 1);
            }
        }
        model.mm_input_proj_w = get_tensor(TN_MM_INP_PROJ);
        model.mm_soft_emb_norm_w = get_tensor(TN_MM_SOFT_EMB_N);
    } break;
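The discovery scheme in this loader can be mimicked in a few lines of Python: probe optional tensors per (stage, block) and stop a stage at the first index where nothing matches. Hedged: `discover_blocks` is a hypothetical sketch over GGUF tensor names, not part of the codebase.

```python
def discover_blocks(names: set, n_stages: int = 4) -> list:
    """Enumerate (stage, block) pairs present in a set of GGUF tensor
    names, mirroring the C++ loop: any of the block-defining tensors
    (edge-residual, UIR, or attention) marks a block as present."""
    markers = ("conv_exp.weight", "dw_start.conv.weight",
               "pw_exp.conv.weight", "attn.query.proj.weight")
    blocks = []
    for stage in range(n_stages):
        blk = 0
        while any(f"v.blk.{stage}.{blk}.{m}" in names for m in markers):
            blocks.append((stage, blk))
            blk += 1  # next block in this stage
    return blocks

names = {"v.blk.0.0.conv_exp.weight",
         "v.blk.0.1.pw_exp.conv.weight",
         "v.blk.1.0.attn.query.proj.weight"}
print(discover_blocks(names))  # [(0, 0), (0, 1), (1, 0)]
```

One consequence of this probe-until-miss design (in both the sketch and the C++): a gap in block indices silently truncates a stage, which is worth keeping in mind when reviewing the loader.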
GEMMA3N tensor loading: make required tensors explicitly required, load missing fields, unify stage indexing type, and bound block iteration.
- `model.mobilenet_stem_conv_w` (line 1565) is marked optional but used unconditionally at line 254 in mobilenetv5.cpp without a null check, risking a null dereference.
- Declared fields `msfa_concat_conv_w` and `mm_post_proj_norm_w` are not loaded in this section; only `msfa_concat_norm_w` is loaded (line 1574). If these are part of the model, they should be populated.
- `mobilenet_stage_ends` is declared as `std::vector<int>` (header) but receives `size_t` values at line 1649, causing implicit truncation on 64-bit systems.
- The block iteration loop (line 1580) has no upper bound; it relies only on tensor lookup failure to terminate, which could loop pathologically on malformed GGUF files.
Suggested fixes
case PROJECTOR_TYPE_GEMMA3N:
{
- model.mobilenet_stem_conv_w = get_tensor(TN_MNV5_STEM_CONV, false);
+ model.mobilenet_stem_conv_w = get_tensor(TN_MNV5_STEM_CONV, true);
model.mobilenet_stem_conv_b = get_tensor(TN_MNV5_STEM_BIAS, false);
model.mobilenet_stem_norm_w = get_tensor(TN_MNV5_STEM_BN, false);
model.msfa_ffn_expand_w = get_tensor(TN_MNV5_MSFA_FFN_EXP_W, false);
model.msfa_ffn_expand_bn = get_tensor(TN_MNV5_MSFA_FFN_EXP_BN, false);
model.msfa_ffn_project_w = get_tensor(TN_MNV5_MSFA_FFN_PROJ_W, false);
model.msfa_ffn_project_bn = get_tensor(TN_MNV5_MSFA_FFN_PROJ_BN, false);
+ model.msfa_concat_conv_w = get_tensor(TN_MNV5_MSFA_CONCAT_CONV_W, false);
model.msfa_concat_norm_w = get_tensor(TN_MNV5_MSFA_NORM, false);
+ model.mm_post_proj_norm_w = get_tensor(TN_MM_POST_PROJ_NORM, false);
// Dynamically load blocks stage by stage
for (int stage = 0; stage < 4; ++stage) {
int blocks_found_in_stage = 0;
- for (int blk_idx = 0; ; ++blk_idx) {
+ for (int blk_idx = 0; blk_idx < 256; ++blk_idx) {
bool found_block = false;
mobilenetv5_block block;
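The `size_t`-to-`int` truncation flagged above can be demonstrated outside C++ with a small Python sketch using `ctypes` (values are illustrative; `to_int32` is a hypothetical helper mimicking the implicit narrowing):

```python
import ctypes

# std::vector<int> stores 32-bit ints on typical platforms; assigning a
# size_t (64-bit unsigned) value implicitly keeps only the low 32 bits.
def to_int32(value: int) -> int:
    return ctypes.c_int32(value).value

print(to_int32(41))          # 41 (small sizes are unaffected)
print(to_int32(2**32 + 41))  # 41 (upper bits silently dropped)
print(to_int32(2**31))       # -2147483648 (wraps negative)
```

In practice block counts stay tiny, so the truncation is latent rather than observable, but unifying the type removes the warning and the edge case.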
ggml_tensor* clip_graph_mobilenetv5::pad_same_2d(ggml_tensor* inp, int kernel_h, int kernel_w, int stride_h, int stride_w, int dilation_h, int dilation_w) {
    const int64_t ih = inp->ne[1]; // height
    const int64_t iw = inp->ne[0]; // width

    // Calculate output size (ceil division)
    const int64_t oh = (ih + stride_h - 1) / stride_h;
    const int64_t ow = (iw + stride_w - 1) / stride_w;

    // Calculate padding needed
    const int64_t pad_h = std::max((int64_t)0, (oh - 1) * stride_h + (kernel_h - 1) * dilation_h + 1 - ih);
    const int64_t pad_w = std::max((int64_t)0, (ow - 1) * stride_w + (kernel_w - 1) * dilation_w + 1 - iw);

    // Split padding asymmetrically
    const int pad_h_top    = pad_h / 2;
    const int pad_h_bottom = pad_h - pad_h_top;
    const int pad_w_left   = pad_w / 2;
    const int pad_w_right  = pad_w - pad_w_left;

    // Apply padding if needed
    // ggml_pad_ext: (ctx, tensor, lp0, rp0, lp1, rp1, lp2, rp2, lp3, rp3)
    // For [W, H, C, B]: p0=width, p1=height, p2=channels, p3=batch
    if (pad_h > 0 || pad_w > 0) {
        inp = ggml_pad_ext(ctx0, inp,
            pad_w_left, pad_w_right, // width padding (dim 0)
            pad_h_top, pad_h_bottom, // height padding (dim 1)
            0, 0,                    // no channel padding (dim 2)
            0, 0);                   // no batch padding (dim 3)
    }

    return inp;
}
Add null check and validate stride parameters.
Two issues:
- `inp` is dereferenced at lines 24-25 without a null check
- Division by `stride_h` and `stride_w` at lines 28-29 could cause division by zero if these parameters are 0
🛡️ Suggested defensive checks
ggml_tensor* clip_graph_mobilenetv5::pad_same_2d(ggml_tensor* inp, int kernel_h, int kernel_w, int stride_h, int stride_w, int dilation_h, int dilation_w) {
+ if (!inp || stride_h <= 0 || stride_w <= 0) {
+ return nullptr;
+ }
+
const int64_t ih = inp->ne[1]; // height
    const int64_t iw = inp->ne[0]; // width
🤖 Prompt for AI Agents
In @tools/mtmd/models/mobilenetv5.cpp around lines 23 - 53, In pad_same_2d, add
a null check for the input pointer inp at the start of the function and return
or handle the error if inp is null; also validate stride_h and stride_w are > 0
before using them (e.g., return early or assert/log error) to avoid division by
zero when computing oh and ow; update references to inp, stride_h, and stride_w
in pad_same_2d accordingly so the function fails fast on invalid inputs instead
of dereferencing a null pointer or performing division by zero.
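The padding arithmetic in `pad_same_2d` can be checked numerically with a small Python sketch (the 768-px input and 3x3/stride-2 stem values are illustrative, not taken from the diff):

```python
import math

def same_pad_2d(size, kernel, stride, dilation=1):
    # "Same" semantics: output size is ceil(size / stride).
    out = math.ceil(size / stride)
    # Total padding so the convolution yields exactly `out` positions.
    pad = max(0, (out - 1) * stride + (kernel - 1) * dilation + 1 - size)
    # Asymmetric split: the extra pixel (if any) goes to the bottom/right.
    lo = pad // 2
    hi = pad - lo
    return out, lo, hi

# Stem conv: 3x3 kernel, stride 2 on a 768-px side
print(same_pad_2d(768, 3, 2))  # (384, 0, 1)
```

Note how an even input with stride 2 gets all of its single pad pixel on the bottom/right, which is exactly the asymmetry `ggml_pad_ext` has to express and a symmetric pad argument cannot.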
ggml_tensor * clip_graph_mobilenetv5::build_edge_residual(ggml_tensor * inp, const mobilenetv5_block & block, int stride) {
    ggml_tensor * cur = inp;

    // 1. Expansion Conv (3x3)
    if (stride == 2) {
        // Case: Downsampling (Block 0)
        // Replicates Conv2dSame(kernel=3, stride=2)
        cur = pad_same_2d(cur, 3, 3, stride, stride);
        cur = ggml_conv_2d_direct(ctx0, block.s0_conv_exp_w, cur, stride, stride, 0, 0, 1, 1);
    } else {
        // Case: Normal 3x3 Block (Block 1, 2)
        // Replicates Conv2d(kernel=3, stride=1, padding=1)
        cur = ggml_conv_2d_direct(ctx0, block.s0_conv_exp_w, cur, stride, stride, 1, 1, 1, 1);
    }

    // BN + Activation
    if (block.s0_bn1_w) cur = rms_norm_2d(cur, block.s0_bn1_w);
    cur = ggml_gelu(ctx0, cur);

    // 2. Pointwise Linear Conv (1x1)
    // 1x1 Convs usually have padding=0 and stride=1
    cur = ggml_conv_2d_direct(ctx0, block.s0_conv_pwl_w, cur, 1, 1, 0, 0, 1, 1);
    if (block.s0_bn2_w) cur = rms_norm_2d(cur, block.s0_bn2_w);

    // 3. Residual Connection
    // Only apply residual if spatial dimensions and channels match (stride 1)
    if (stride == 1 && inp->ne[2] == cur->ne[2] && inp->ne[0] == cur->ne[0]) {
        cur = ggml_add(ctx0, cur, inp);
    }

    return cur;
}
Add null checks for input tensor and required block tensors.
The function assumes `inp` and required block tensors (`s0_conv_exp_w`, `s0_conv_pwl_w`) are non-null. Accessing these without validation could cause null-pointer dereferences:
- `inp` used at line 58
- `block.s0_conv_exp_w` at lines 65, 69
- `block.s0_conv_pwl_w` at line 78
🛡️ Suggested validation
ggml_tensor * clip_graph_mobilenetv5::build_edge_residual(ggml_tensor * inp, const mobilenetv5_block & block, int stride) {
+ if (!inp || !block.s0_conv_exp_w || !block.s0_conv_pwl_w) {
+ return nullptr;
+ }
+
    ggml_tensor * cur = inp;
🤖 Prompt for AI Agents
In @tools/mtmd/models/mobilenetv5.cpp around lines 57 - 88, The function
build_edge_residual assumes inp and block weight tensors exist; add explicit
null checks at the top of build_edge_residual: if inp is null return nullptr (or
inp as appropriate) to avoid dereferencing, and verify block.s0_conv_exp_w and
block.s0_conv_pwl_w before calling ggml_conv_2d_direct (and before passing them
to rms_norm_2d); if either weight is null, skip the corresponding conv/pwl steps
or return nullptr consistently so callers can handle the error. Ensure all early
exits use the same convention as the surrounding codebase (nullptr or original
inp) and reference the symbols build_edge_residual, block.s0_conv_exp_w,
block.s0_conv_pwl_w, ggml_conv_2d_direct, and rms_norm_2d when making the
checks.
ggml_tensor * clip_graph_mobilenetv5::build_inverted_residual(ggml_tensor * inp, const mobilenetv5_block & block, int stride) {
    ggml_tensor * cur = inp;

    // 1. Depthwise Start (Optional)
    // NOTE: dw_start always has stride=1 (no downsampling here)
    if (block.dw_start_w) {
        int k = block.dw_start_w->ne[0]; // 3 or 5
        int p = k / 2;
        cur = ggml_conv_2d_dw(ctx0, block.dw_start_w, cur, 1, 1, p, p, 1, 1);
        if (block.dw_start_bn_w) cur = rms_norm_2d(cur, block.dw_start_bn_w);
    }

    // 2. Pointwise Expansion (1x1)
    if (block.pw_exp_w) {
        // Standard 1x1 conv, pad=0, stride=1
        cur = ggml_conv_2d_direct(ctx0, block.pw_exp_w, cur, 1, 1, 0, 0, 1, 1);
        if (block.pw_exp_bn_w) cur = rms_norm_2d(cur, block.pw_exp_bn_w);
        cur = ggml_gelu(ctx0, cur);
    }

    // 3. Depthwise Mid (Optional)
    // NOTE: dw_mid is where downsampling happens (stride=2 for first block of stage)
    if (block.dw_mid_w) {
        int k = block.dw_mid_w->ne[0]; // 3 or 5

        if (stride > 1) {
            // Case: Stride 2 (Downsample) -> Use Asymmetric "Same" Padding
            cur = pad_same_2d(cur, k, k, stride, stride);
            cur = ggml_conv_2d_dw(ctx0, block.dw_mid_w, cur, stride, stride, 0, 0, 1, 1); // pad=0
        } else {
            // Case: Stride 1 -> Use Standard Symmetric Padding
            int p = k / 2;
            cur = ggml_conv_2d_dw(ctx0, block.dw_mid_w, cur, stride, stride, p, p, 1, 1);
        }

        if (block.dw_mid_bn_w) cur = rms_norm_2d(cur, block.dw_mid_bn_w);
        cur = ggml_gelu(ctx0, cur);
    }

    // 4. Pointwise Projection (1x1)
    if (block.pw_proj_w) {
        cur = ggml_conv_2d_direct(ctx0, block.pw_proj_w, cur, 1, 1, 0, 0, 1, 1);
        if (block.pw_proj_bn_w) cur = rms_norm_2d(cur, block.pw_proj_bn_w);
    }

    // Apply Layer Scaling if present
    if (block.layer_scale_w) {
        cur = ggml_mul(ctx0, cur, block.layer_scale_w);
    }

    // 5. Residual Connection
    bool same_spatial = (inp->ne[0] == cur->ne[0]) && (inp->ne[1] == cur->ne[1]);
    bool same_channel = (inp->ne[2] == cur->ne[2]);
    if (same_spatial && same_channel) {
        cur = ggml_add(ctx0, cur, inp);
    }

    return cur;
}
Add null check for inp parameter.
The inp tensor is used at line 92 without null validation, which could lead to a null-pointer dereference.
🛡️ Suggested check
ggml_tensor * clip_graph_mobilenetv5::build_inverted_residual(ggml_tensor * inp, const mobilenetv5_block & block, int stride) {
+ if (!inp) {
+ return nullptr;
+ }
+
    ggml_tensor * cur = inp;
🤖 Prompt for AI Agents
In @tools/mtmd/models/mobilenetv5.cpp around lines 91 - 149, The function
build_inverted_residual uses the inp pointer without validation; add an
immediate null check at the top of build_inverted_residual for the inp parameter
and handle it safely (e.g., return nullptr or propagate an error) instead of
dereferencing a null pointer so the rest of the function (uses of inp->ne[...]
and residual addition) are not executed when inp is null.
ggml_cgraph * clip_graph_mobilenetv5::build() {
    ggml_tensor * inp = build_inp_raw();

    // 1. Stem - Conv2dSame(3, 64, kernel_size=(3, 3), stride=(2, 2))
    ggml_tensor * cur = pad_same_2d(inp, 3, 3, 2, 2); // Apply SAME padding

    cur = ggml_conv_2d_direct(ctx0, model.mobilenet_stem_conv_w, cur, 2, 2, 0, 0, 1, 1); // padding=0
    if (model.mobilenet_stem_conv_b) {
        cur = ggml_add(ctx0, cur, model.mobilenet_stem_conv_b);
    }
    if (model.mobilenet_stem_norm_w) cur = rms_norm_2d(cur, model.mobilenet_stem_norm_w);
    cur = ggml_gelu(ctx0, cur);
Validate required stem convolution weight.
The model.mobilenet_stem_conv_w tensor is used at line 254 without a null check. If model loading is incomplete, this will cause a null-pointer dereference.
🛡️ Suggested validation
ggml_cgraph * clip_graph_mobilenetv5::build() {
ggml_tensor * inp = build_inp_raw();
// 1. Stem - Conv2dSame(3, 64, kernel_size=(3, 3), stride=(2, 2))
ggml_tensor * cur = pad_same_2d(inp, 3, 3, 2, 2); // Apply SAME padding
+ if (!model.mobilenet_stem_conv_w) {
+ return nullptr;
+ }
+
    cur = ggml_conv_2d_direct(ctx0, model.mobilenet_stem_conv_w, cur, 2, 2, 0, 0, 1, 1); // padding=0
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
ggml_cgraph * clip_graph_mobilenetv5::build() {
    ggml_tensor * inp = build_inp_raw();

    // 1. Stem - Conv2dSame(3, 64, kernel_size=(3, 3), stride=(2, 2))
    ggml_tensor * cur = pad_same_2d(inp, 3, 3, 2, 2); // Apply SAME padding
    if (!model.mobilenet_stem_conv_w) {
        return nullptr;
    }

    cur = ggml_conv_2d_direct(ctx0, model.mobilenet_stem_conv_w, cur, 2, 2, 0, 0, 1, 1); // padding=0
    if (model.mobilenet_stem_conv_b) {
        cur = ggml_add(ctx0, cur, model.mobilenet_stem_conv_b);
    }
    if (model.mobilenet_stem_norm_w) cur = rms_norm_2d(cur, model.mobilenet_stem_norm_w);
    cur = ggml_gelu(ctx0, cur);
🤖 Prompt for AI Agents
In @tools/mtmd/models/mobilenetv5.cpp around lines 248 - 260, The build()
function uses model.mobilenet_stem_conv_w without validation; add a null-check
at the start of the stem block (before calling ggml_conv_2d_direct) to detect
missing stem weights (model.mobilenet_stem_conv_w == nullptr) and handle it by
logging an error/throwing or returning nullptr from build() to avoid
dereferencing; ensure downstream code does not assume cur was created if the
check fails and keep existing handling for mobilenet_stem_conv_b and
mobilenet_stem_norm_w unchanged.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
gguf-py/gguf/constants.py (1)
881-889: Critical: Missing GEMMA3N entry in `VISION_PROJECTOR_TYPE_NAMES`.
Line 468 adds `VISION_PROJECTOR_TYPE.GEMMA3N` to the enum, but the `VISION_PROJECTOR_TYPE_NAMES` dict does not include a corresponding mapping. This will cause a `KeyError` when code attempts to look up the string name for `VISION_PROJECTOR_TYPE.GEMMA3N`.
🐛 Proposed fix
 VISION_PROJECTOR_TYPE_NAMES: dict[VISION_PROJECTOR_TYPE, str] = {
     VISION_PROJECTOR_TYPE.MLP:       "mlp",
     VISION_PROJECTOR_TYPE.LDP:       "ldp",
     VISION_PROJECTOR_TYPE.LDPV2:     "ldpv2",
     VISION_PROJECTOR_TYPE.RESAMPLER: "resampler",
     VISION_PROJECTOR_TYPE.GLM_EDGE:  "adapter",
     VISION_PROJECTOR_TYPE.MERGER:    "qwen2vl_merger",
+    VISION_PROJECTOR_TYPE.GEMMA3N:   "gemma3n",
     VISION_PROJECTOR_TYPE.GEMMA3:    "gemma3",
 }
🤖 Fix all issues with AI agents
In @convert_hf_to_gguf.py:
- Around line 6139-6165: The patch_size calculation in the __init__ method is
wrong: don't divide image_size by image_seq_length; instead compute n_per_side =
int(sqrt(image_seq_length)) (or math.isqrt(image_seq_length) for exact integer
math) and set self.hparams_vision["patch_size"] = image_size // n_per_side so
256 tokens -> n_per_side=16 -> patch_size=image_size//16; ensure math is
available/imported if you use math.isqrt/math.sqrt and handle non-perfect-square
image_seq_length by using integer floor.
- Around line 6045-6105: The condition in ConformerAudioModel.tensor_force_quant
incorrectly applies F32 to any name containing ".conv" because of operator
precedence; change the test so that the quantization is forced only when the
tensor is a conv weight — i.e., require that (".conv" in name or "_conv" in
name) AND ".weight" in name. Update the conditional in
ConformerAudioModel.tensor_force_quant accordingly (use parentheses or reorder
the logic) so only conv weight tensors return gguf.GGMLQuantizationType.F32;
leave ConformerAudioModel.is_audio_tensor and the fallback to
super().tensor_force_quant unchanged.
- Around line 6108-6137: Mark the mutable class attribute block_tensor_mapping
on Gemma3nVisionAudioModel as a ClassVar to avoid mutable-class-attr pitfalls:
import ClassVar and Dict from typing and change the declaration to something
like block_tensor_mapping: ClassVar[Dict[str, str]] = { ... } so static
analyzers and linters know it’s not an instance attribute.
- Around line 6199-6203: The current modify_tensors replacement can produce
double "layers" (e.g., "conformer.layers.layers..."); change the logic in
modify_tensors (and keep using ConformerAudioModel.is_audio_tensor) to detect
whether the incoming name contains "model.audio_tower.conformer.layers." and, if
so, replace that exact substring with "conformer.layers.", otherwise replace
"model.audio_tower.conformer." with "conformer.layers." so the result always
matches the expected "conformer.layers.{bid}..." keys used by batchnorm folding.
In @tools/mtmd/clip.cpp:
- Around line 3242-3247: The GEMMA3N branch incorrectly sets n_patches to
ctx->model.hparams.image_size / ctx->model.hparams.patch_size (patches per side)
instead of total tokens; change the calculation in the PROJECTOR_TYPE_GEMMA3N
case to compute total patches/tokens as (image_size / patch_size) squared (e.g.,
n_patches = pow(ctx->model.hparams.image_size / ctx->model.hparams.patch_size,
2) or multiply the quotient by itself) so the value matches the 16×16 = 256
claim and is robust to a corrected patch_size.
- Around line 1584-1655: The local mobilenetv5_block variable is
default-uninitialized causing UB when reading members like layer_scale_w before
assignment; fix by zero-initializing the struct instance at creation (e.g.,
value-initialize mobilenetv5_block so all pointers/flags are null/zero), or
explicitly initialize all members you later read (layer_scale_w and any
pointer/flag fields) before any get_tensor checks, so that pushing to
model.mobilenet_blocks uses a fully-initialized block.
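The patch-size and n_patches fixes described above can be sketched in plain Python (the 768-px image size and 256-token sequence length are the values discussed for Gemma3n; treat them as assumptions, and `gemma3n_patch_params` as a hypothetical helper):

```python
import math

def gemma3n_patch_params(image_size, image_seq_length):
    # Tokens form an n x n grid, so take the integer square root.
    n_per_side = math.isqrt(image_seq_length)
    assert n_per_side * n_per_side == image_seq_length, "expects a square token grid"
    patch_size = image_size // n_per_side
    # Total tokens = patches per side, squared (not patches per side alone).
    n_patches = (image_size // patch_size) ** 2
    return patch_size, n_patches

print(gemma3n_patch_params(768, 256))  # (48, 256)
```

Computing `n_patches` as the square keeps the C++ side consistent with the converter even if `patch_size` is later corrected.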
🧹 Nitpick comments (3)
gguf-py/gguf/tensor_mapping.py (1)
1609-1795: Non-`{bid}` keys in `block_mappings_cfg` are easy to accidentally add; consider keeping gemma3n non-block tensors in `mappings_cfg` only.
Not a blocker, but it avoids repeated per-layer inserts and makes it clearer which tensors are truly block-indexed.
convert_hf_to_gguf.py (2)
6247-6265: `Gemma3NModel.set_vocab` temporary override looks correct, but consider a non-mutating approach.
The delete/restore pattern works, but mutating `self.hparams` mid-conversion is fragile if anything throws in `super().set_vocab()`. A small refactor to use a shallow copy (or a try/finally) would make this safer.
6301-6322: Verify padding logic for `embed_tokens_per_layer*`: assumes vocab is axis 0.
This code pads `data_torch.shape[0]` up to `vocab_size`. That's only correct if the tensor is `[vocab, dim]` (or a per-layer tensor instance still shaped `[vocab, dim]`). If the tensor is stacked (e.g. `[n_layers, vocab, dim]`), this will pad the wrong dimension silently.
Minimal defensive check idea
 if "embed_tokens.weight" in name or "embed_tokens_per_layer" in name:
     # Move to CPU to avoid meta device issues during padding
     data_torch = data_torch.to(device="cpu")
     vocab_size = self.hparams.get("vocab_size", 262400)
-    current_size = data_torch.shape[0]  # First dimension is vocab_size
+    if data_torch.ndim != 2:
+        raise ValueError(f"Unexpected embedding tensor rank for {name}: shape={tuple(data_torch.shape)}")
+    current_size = data_torch.shape[0]
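The axis-0 assumption behind the embedding padding can be illustrated with a minimal list-based Python sketch (no torch; the helper name `pad_vocab_rows` is hypothetical):

```python
def pad_vocab_rows(mat, vocab_size):
    # mat is assumed to be [vocab, dim]; append zero rows up to vocab_size.
    # A stacked [n_layers, vocab, dim] tensor would need padding on axis 1,
    # which this helper would silently get wrong -- hence the rank check idea.
    if len(mat) >= vocab_size:
        return mat
    dim = len(mat[0])
    return mat + [[0.0] * dim for _ in range(vocab_size - len(mat))]

emb = [[1.0, 2.0], [3.0, 4.0]]   # current_size = 2, dim = 2
padded = pad_vocab_rows(emb, 5)
print(len(padded))  # 5
```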
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- convert_hf_to_gguf.py
- gguf-py/gguf/constants.py
- gguf-py/gguf/tensor_mapping.py
- tools/mtmd/clip.cpp
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".
When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
Applied to files:
- tools/mtmd/clip.cpp
- gguf-py/gguf/constants.py
🧬 Code graph analysis (2)
gguf-py/gguf/tensor_mapping.py (1)
gguf-py/gguf/constants.py (1)
- MODEL_TENSOR (474-762)
convert_hf_to_gguf.py (2)
gguf-py/gguf/constants.py (2)
- GGMLQuantizationType (3433-3465)
- VisionProjectorType (3558-3581)
gguf-py/gguf/gguf_writer.py (4)
- add_clip_projector_type (1086-1087)
- add_vision_attention_layernorm_eps (1107-1108)
- add_audio_num_mel_bins (1189-1190)
- add_audio_attention_layernorm_eps (1186-1187)
🪛 Ruff (0.14.10)
convert_hf_to_gguf.py
6054-6054: Parenthesize `a and b` expressions when chaining `and` and `or` together, to make the precedence clear
Parenthesize the `and` subexpression
(RUF021)
6114-6137: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
6197-6197: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
- GitHub Check: ggml-ci-x64-cpu-high-perf
- GitHub Check: ggml-ci-x64-cpu-low-perf
- GitHub Check: ggml-ci-arm64-cpu-low-perf
- GitHub Check: macOS-latest-cmake-visionos
- GitHub Check: ubuntu-22-cmake-musa
- GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: macOS-latest-cmake-arm64-webgpu
- GitHub Check: macOS-latest-cmake-x64
- GitHub Check: ubuntu-24-wasm-webgpu
- GitHub Check: ubuntu-24-cmake-vulkan-deb
- GitHub Check: check-requirements
- GitHub Check: server (Release, LLAMA_ARG_BACKEND_SAMPLING=1)
- GitHub Check: server (Release)
- GitHub Check: server (ADDRESS, RelWithDebInfo)
- GitHub Check: server (UNDEFINED, RelWithDebInfo)
- GitHub Check: server-windows
🔇 Additional comments (8)
tools/mtmd/clip.cpp (3)
783-795: GEMMA3N graph builder wiring looks straightforward.
Routing `PROJECTOR_TYPE_GEMMA3N` to `clip_graph_mobilenetv5` matches the intended architecture split.
2970-2978: Verify GEMMA3N preprocess should warp-to-square (no padding) vs pad-to-square (Gemma3 behavior).
Right now GEMMA3 uses padding by default, while GEMMA3N explicitly disables it (add_padding=false), which changes aspect-ratio handling and can impact accuracy.
1349-1351: `hparams.n_layer = 0` + "skip audio" for GEMMA3N: OK, but please sanity-check downstream assumptions.
This is fine if all GEMMA3N code paths avoid `model.layers[...]` and audio init, but it's worth validating that no shared helpers still assume `n_layer > 0`.
Also applies to: 2115-2139
gguf-py/gguf/tensor_mapping.py (1)
126-159: Paths verified against converter implementation; no issues found.
The GEMMA3N tensor mappings at lines 127-157 match the converter's expected HF module structure. The converter code (`Gemma3nVisionAudioModel.modify_tensors`) explicitly validates both `model.embed_vision.*` and `model.vision_tower.*` prefixes, confirming these paths exist in the loaded model. No duplicate keys or typos detected.
Consider adding periodic validation (e.g., during model conversion testing) to catch any upstream naming drift in future GEMMA3N model updates.
gguf-py/gguf/constants.py (1)
392-392: LGTM: Gemma3N constant additions follow existing patterns.
The additions for `MODEL_ARCH.GEMMA3N`, `MODEL_TENSOR` entries, tensor name mappings, and `VisionProjectorType.GEMMA3N` all follow the established patterns and conventions used by other model architectures in this file.
Also applies to: 468-468, 679-687, 715-746, 811-811, 1095-1104, 1133-1163, 1214-1222, 1250-1289, 2040-2074, 3560-3560
convert_hf_to_gguf.py (3)
530-536: Good guard for empty `tensor_map.mapping`, but consider avoiding a "magic" default name-length.
This is fine for preventing `max()` on an empty mapping; the fallback length is only used for log alignment. If you want it future-proof, consider deriving `max_name_len` from actual `new_name` values as you iterate (one pass) rather than hardcoding `"vision_encoder.weight,"`.
10163-10178: Skip condition extension in `LFM2Model.modify_tensors` seems fine.
Including `ConformerAudioModel.is_audio_tensor(name)` in the skip path helps avoid accidentally pulling audio weights into the text model conversion.
10305-10323: `LFM2AudioModel` wiring looks consistent with `ConformerAudioModel`.
No specific concerns in this snippet beyond the shared `ConformerAudioModel` issues noted above (quantization predicate + batchnorm folding expectations).
class ConformerAudioModel(MmprojModel):
    _batch_norm_tensors: list[dict[str, Tensor]] | None = None

    @staticmethod
    def is_audio_tensor(name: str):
        return any(p in name for p in ["audio", "codebook", "conformer", "depth_embedding", "depthformer", "depth_linear"])

    def tensor_force_quant(self, name, new_name, bid, n_dims):
        if ConformerAudioModel.is_audio_tensor(name):
            if ".conv" in name or "_conv" in name and ".weight" in name:
                return gguf.GGMLQuantizationType.F32
        return super().tensor_force_quant(name, new_name, bid, n_dims)

    def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
        # skip language model tensors
        if name.startswith("lfm."):
            return []

        # for training only
        if any(p in name for p in ["audio_loss_weight"]):
            return []

        # for audio output
        if any(p in name for p in ["codebook_offsets", "depth_embeddings", "depth_linear", "depthformer"]):
            return []

        # fold running_mean, running_var and eps into weight and bias for batch_norm
        if "batch_norm" in name:
            if self._batch_norm_tensors is None:
                self._batch_norm_tensors = [{} for _ in range(self.block_count)]
            assert bid is not None
            self._batch_norm_tensors[bid][name] = data_torch

            if len(self._batch_norm_tensors[bid]) < 5:
                return []

            weight = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.weight"]
            bias = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.bias"]
            running_mean = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.running_mean"]
            running_var = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.running_var"]
            eps = 1e-5  # default value

            a = weight / torch.sqrt(running_var + eps)
            b = bias - running_mean * a
            return [
                (self.map_tensor_name(f"conformer.layers.{bid}.conv.batch_norm.weight"), a),
                (self.map_tensor_name(f"conformer.layers.{bid}.conv.batch_norm.bias"), b),
            ]

        # reshape conv weights
        if name.startswith("conformer.pre_encode.conv.") and name.endswith(".bias"):
            data_torch = data_torch[:, None, None]
        if "conv.depthwise_conv" in name and name.endswith(".weight"):
            assert data_torch.shape[1] == 1
            data_torch = data_torch.reshape(data_torch.shape[0], data_torch.shape[2])
        if "conv.pointwise_conv" in name and name.endswith(".weight"):
            assert data_torch.shape[2] == 1
            data_torch = data_torch.reshape(data_torch.shape[0], data_torch.shape[1])

        return [(self.map_tensor_name(name), data_torch)]
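The batch-norm folding in `modify_tensors` (a = w / sqrt(var + eps), b = bias - mean * a) can be sanity-checked with a dependency-free Python sketch (plain lists instead of torch tensors; the values are random but seeded):

```python
import math
import random

def fold_bn(weight, bias, mean, var, eps=1e-5):
    # y = weight * (x - mean) / sqrt(var + eps) + bias  ==  a * x + b
    a = [w / math.sqrt(v + eps) for w, v in zip(weight, var)]
    b = [bi - m * ai for bi, m, ai in zip(bias, mean, a)]
    return a, b

random.seed(0)
n = 4
w   = [random.uniform(0.5, 1.5) for _ in range(n)]
bi  = [random.uniform(-1.0, 1.0) for _ in range(n)]
mu  = [random.uniform(-1.0, 1.0) for _ in range(n)]
var = [random.uniform(0.1, 2.0) for _ in range(n)]
a, b = fold_bn(w, bi, mu, var)

x = [random.uniform(-3.0, 3.0) for _ in range(n)]
bn     = [w[i] * (x[i] - mu[i]) / math.sqrt(var[i] + 1e-5) + bi[i] for i in range(n)]
folded = [a[i] * x[i] + b[i] for i in range(n)]
assert all(abs(p - q) < 1e-9 for p, q in zip(bn, folded))
```

The folding is exact in infinite precision, so any mismatch at inference time would point at the hard-coded `eps = 1e-5` differing from the model config rather than at the algebra.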
Fix boolean precedence in ConformerAudioModel.tensor_force_quant (currently quantizes too broadly).
".conv" in name or "_conv" in name and ".weight" in name is parsed as (".conv" in name) or ("_conv" in name and ".weight" in name). That likely forces F32 for any tensor containing ".conv" (including biases), which is not intended.
Proposed fix
class ConformerAudioModel(MmprojModel):
@@
def tensor_force_quant(self, name, new_name, bid, n_dims):
if ConformerAudioModel.is_audio_tensor(name):
- if ".conv" in name or "_conv" in name and ".weight" in name:
+ if ((".conv" in name or "_conv" in name) and name.endswith(".weight")):
return gguf.GGMLQuantizationType.F32
return super().tensor_force_quant(name, new_name, bid, n_dims)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| class ConformerAudioModel(MmprojModel): | |
| _batch_norm_tensors: list[dict[str, Tensor]] | None = None | |
| @staticmethod | |
| def is_audio_tensor(name: str): | |
| return any(p in name for p in ["audio", "codebook", "conformer", "depth_embedding", "depthformer", "depth_linear"]) | |
| def tensor_force_quant(self, name, new_name, bid, n_dims): | |
| if ConformerAudioModel.is_audio_tensor(name): | |
| if ((".conv" in name or "_conv" in name) and name.endswith(".weight")): | |
| return gguf.GGMLQuantizationType.F32 | |
| return super().tensor_force_quant(name, new_name, bid, n_dims) | |
| def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]: | |
| # skip language model tensors | |
| if name.startswith("lfm."): | |
| return [] | |
| # for training only | |
| if any(p in name for p in ["audio_loss_weight"]): | |
| return [] | |
| # for audio output | |
| if any(p in name for p in ["codebook_offsets", "depth_embeddings", "depth_linear", "depthformer"]): | |
| return [] | |
| # fold running_mean, running_var and eps into weight and bias for batch_norm | |
| if "batch_norm" in name: | |
| if self._batch_norm_tensors is None: | |
| self._batch_norm_tensors = [{} for _ in range(self.block_count)] | |
| assert bid is not None | |
| self._batch_norm_tensors[bid][name] = data_torch | |
| if len(self._batch_norm_tensors[bid]) < 5: | |
| return [] | |
| weight = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.weight"] | |
| bias = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.bias"] | |
| running_mean = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.running_mean"] | |
| running_var = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.running_var"] | |
| eps = 1e-5 # default value | |
| a = weight / torch.sqrt(running_var + eps) | |
| b = bias - running_mean * a | |
| return [ | |
| (self.map_tensor_name(f"conformer.layers.{bid}.conv.batch_norm.weight"), a), | |
| (self.map_tensor_name(f"conformer.layers.{bid}.conv.batch_norm.bias"), b), | |
| ] | |
| # reshape conv weights | |
| if name.startswith("conformer.pre_encode.conv.") and name.endswith(".bias"): | |
| data_torch = data_torch[:, None, None] | |
| if "conv.depthwise_conv" in name and name.endswith(".weight"): | |
| assert data_torch.shape[1] == 1 | |
| data_torch = data_torch.reshape(data_torch.shape[0], data_torch.shape[2]) | |
| if "conv.pointwise_conv" in name and name.endswith(".weight"): | |
| assert data_torch.shape[2] == 1 | |
| data_torch = data_torch.reshape(data_torch.shape[0], data_torch.shape[1]) | |
| return [(self.map_tensor_name(name), data_torch)] |
🧰 Tools
🪛 Ruff (0.14.10)
6054-6054: Parenthesize a and b expressions when chaining and and or together, to make the precedence clear
Parenthesize the and subexpression
(RUF021)
🤖 Prompt for AI Agents
In @convert_hf_to_gguf.py around lines 6045 - 6105, The condition in
ConformerAudioModel.tensor_force_quant incorrectly applies F32 to any name
containing ".conv" because of operator precedence; change the test so that the
quantization is forced only when the tensor is a conv weight — i.e., require
that (".conv" in name or "_conv" in name) AND ".weight" in name. Update the
conditional in ConformerAudioModel.tensor_force_quant accordingly (use
parentheses or reorder the logic) so only conv weight tensors return
gguf.GGMLQuantizationType.F32; leave ConformerAudioModel.is_audio_tensor and the
fallback to super().tensor_force_quant unchanged.
| class Gemma3nVisionAudioModel(ConformerAudioModel): | ||
| has_audio_encoder = True | ||
| has_vision_encoder = True | ||
| # Double indexed mapping for MobileNetV5 blocks (not supported by tensor_mapping.py) | ||
| # This is the only known model having this, so we prefer implementing it outside of tensor_mapping.py | ||
| block_tensor_mapping = { | ||
| "model.vision_tower.timm_model.blocks.{bid}.{sid}.conv_exp.weight": "v.blk.{bid}.{sid}.conv_exp.weight", | ||
| "model.vision_tower.timm_model.blocks.{bid}.{sid}.bn1.weight": "v.blk.{bid}.{sid}.bn1.weight", | ||
| "model.vision_tower.timm_model.blocks.{bid}.{sid}.conv_pwl.weight": "v.blk.{bid}.{sid}.conv_pwl.weight", | ||
| "model.vision_tower.timm_model.blocks.{bid}.{sid}.bn2.weight": "v.blk.{bid}.{sid}.bn2.weight", | ||
| "model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_start.conv.weight": "v.blk.{bid}.{sid}.dw_start.conv.weight", | ||
| "model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_start.bn.weight": "v.blk.{bid}.{sid}.dw_start.bn.weight", | ||
| "model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_mid.conv.weight": "v.blk.{bid}.{sid}.dw_mid.conv.weight", | ||
| "model.vision_tower.timm_model.blocks.{bid}.{sid}.dw_mid.bn.weight": "v.blk.{bid}.{sid}.dw_mid.bn.weight", | ||
| "model.vision_tower.timm_model.blocks.{bid}.{sid}.pw_exp.conv.weight": "v.blk.{bid}.{sid}.pw_exp.conv.weight", | ||
| "model.vision_tower.timm_model.blocks.{bid}.{sid}.pw_exp.bn.weight": "v.blk.{bid}.{sid}.pw_exp.bn.weight", | ||
| "model.vision_tower.timm_model.blocks.{bid}.{sid}.pw_proj.conv.weight": "v.blk.{bid}.{sid}.pw_proj.conv.weight", | ||
| "model.vision_tower.timm_model.blocks.{bid}.{sid}.pw_proj.bn.weight": "v.blk.{bid}.{sid}.pw_proj.bn.weight", | ||
| "model.vision_tower.timm_model.blocks.{bid}.{sid}.layer_scale.gamma": "v.blk.{bid}.{sid}.layer_scale.gamma", | ||
| "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.query.proj.weight": "v.blk.{bid}.{sid}.attn.query.proj.weight", | ||
| "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.key.proj.weight": "v.blk.{bid}.{sid}.attn.key.proj.weight", | ||
| "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.value.proj.weight": "v.blk.{bid}.{sid}.attn.value.proj.weight", | ||
| "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.output.proj.weight": "v.blk.{bid}.{sid}.attn.output.proj.weight", | ||
| "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.key.down_conv.weight": "v.blk.{bid}.{sid}.attn.key.down_conv.weight", | ||
| "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.key.norm.weight": "v.blk.{bid}.{sid}.attn.key.norm.weight", | ||
| "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.value.down_conv.weight": "v.blk.{bid}.{sid}.attn.value.down_conv.weight", | ||
| "model.vision_tower.timm_model.blocks.{bid}.{sid}.attn.value.norm.weight": "v.blk.{bid}.{sid}.attn.value.norm.weight", | ||
| "model.vision_tower.timm_model.blocks.{bid}.{sid}.norm.weight": "v.blk.{bid}.{sid}.norm.weight", | ||
| } |
Annotate `block_tensor_mapping` as `ClassVar` to avoid mutable-class-attr pitfalls.
Static analysis is right here: this is a mutable class attribute and should be explicitly marked as such.
Proposed fix
-from typing import TYPE_CHECKING, Any, Callable, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast
+from typing import TYPE_CHECKING, Any, Callable, ClassVar, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast
@@
class Gemma3nVisionAudioModel(ConformerAudioModel):
@@
- block_tensor_mapping = {
+ block_tensor_mapping: ClassVar[dict[str, str]] = {
"model.vision_tower.timm_model.blocks.{bid}.{sid}.conv_exp.weight": "v.blk.{bid}.{sid}.conv_exp.weight",
...
}
Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.14.10)
6114-6137: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
🤖 Prompt for AI Agents
In @convert_hf_to_gguf.py around lines 6108 - 6137, Mark the mutable class
attribute block_tensor_mapping on Gemma3nVisionAudioModel as a ClassVar to avoid
mutable-class-attr pitfalls: import ClassVar and Dict from typing and change the
declaration to something like block_tensor_mapping: ClassVar[Dict[str, str]] = {
... } so static analyzers and linters know it’s not an instance attribute.
| def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]: | ||
| if (ConformerAudioModel.is_audio_tensor(name)): | ||
| name = name.replace("model.audio_tower.conformer.", "conformer.layers.") | ||
| return super().modify_tensors(data_torch, name, bid) | ||
Verify the audio tensor prefix rewrite; it may produce `conformer.layers.layers.<bid>...`.
If the incoming name is the common `model.audio_tower.conformer.layers.<i>...`, the current replacement of `model.audio_tower.conformer.` with `conformer.layers.` will yield `conformer.layers.layers.<i>...`, which won't match the batch-norm folding keys (`conformer.layers.{bid}.conv.batch_norm.*`) and may break tensor mapping.
Suggested safer rewrite (adjust once you confirm actual HF tensor prefixes)
- if (ConformerAudioModel.is_audio_tensor(name)):
- name = name.replace("model.audio_tower.conformer.", "conformer.layers.")
+ if (ConformerAudioModel.is_audio_tensor(name)):
+ if name.startswith("model.audio_tower.conformer.layers."):
+ name = name.replace("model.audio_tower.conformer.layers.", "conformer.layers.", 1)
+ elif name.startswith("model.audio_tower.conformer."):
+ name = name.replace("model.audio_tower.conformer.", "conformer.", 1)
return super().modify_tensors(data_torch, name, bid)
🤖 Prompt for AI Agents
In @convert_hf_to_gguf.py around lines 6199 - 6203, The current modify_tensors
replacement can produce double "layers" (e.g., "conformer.layers.layers...");
change the logic in modify_tensors (and keep using
ConformerAudioModel.is_audio_tensor) to detect whether the incoming name
contains "model.audio_tower.conformer.layers." and, if so, replace that exact
substring with "conformer.layers.", otherwise replace
"model.audio_tower.conformer." with "conformer.layers." so the result always
matches the expected "conformer.layers.{bid}..." keys used by batchnorm folding.
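The rewrite pitfall is easy to show in isolation (the HF prefix below is taken from the review text and should be verified against the actual checkpoint):

```python
name = "model.audio_tower.conformer.layers.3.conv.batch_norm.weight"

# Naive replacement duplicates "layers."
naive = name.replace("model.audio_tower.conformer.", "conformer.layers.")
print(naive)  # conformer.layers.layers.3.conv.batch_norm.weight

# Replacing the full "...conformer.layers." prefix avoids the duplication
safe = name.replace("model.audio_tower.conformer.layers.", "conformer.layers.", 1)
print(safe)   # conformer.layers.3.conv.batch_norm.weight
```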
| for (int blk_idx = 0; ; ++blk_idx) { | ||
| bool found_block = false; | ||
| mobilenetv5_block block; | ||
| // 1. Check for Edge Residual (S0) | ||
| block.s0_conv_exp_w = get_tensor(string_format(TN_MNV5_BLK_S0_EXP_W, stage, blk_idx), false); | ||
| if (block.s0_conv_exp_w) { | ||
| found_block = true; | ||
| block.s0_bn1_w = get_tensor(string_format(TN_MNV5_BLK_S0_BN1_W, stage, blk_idx), false); | ||
| block.s0_conv_pwl_w = get_tensor(string_format(TN_MNV5_BLK_S0_PWL_W, stage, blk_idx), false); | ||
| block.s0_bn2_w = get_tensor(string_format(TN_MNV5_BLK_S0_BN2_W, stage, blk_idx), false); | ||
| } | ||
| // 2. Check for UIR (Universal Inverted Residual) | ||
| else { | ||
| // Check for dw_start OR pw_exp (some UIR blocks skip dw_start) | ||
| block.dw_start_w = get_tensor(string_format(TN_MNV5_BLK_DW_START_W, stage, blk_idx), false); | ||
| block.pw_exp_w = get_tensor(string_format(TN_MNV5_BLK_PW_EXP_W, stage, blk_idx), false); | ||
| if (block.dw_start_w || block.pw_exp_w) { | ||
| found_block = true; | ||
| if (block.dw_start_w) { | ||
| block.dw_start_bn_w = get_tensor(string_format(TN_MNV5_BLK_DW_START_BN, stage, blk_idx), false); | ||
| } | ||
| if (block.pw_exp_w) { | ||
| block.pw_exp_bn_w = get_tensor(string_format(TN_MNV5_BLK_PW_EXP_BN, stage, blk_idx), false); | ||
| } | ||
| block.dw_mid_w = get_tensor(string_format(TN_MNV5_BLK_DW_MID_W, stage, blk_idx), false); | ||
| if (block.dw_mid_w) { | ||
| block.dw_mid_bn_w = get_tensor(string_format(TN_MNV5_BLK_DW_MID_BN, stage, blk_idx), false); | ||
| } | ||
| block.pw_proj_w = get_tensor(string_format(TN_MNV5_BLK_PW_PROJ_W, stage, blk_idx), false); | ||
| if (block.pw_proj_w) { | ||
| block.pw_proj_bn_w = get_tensor(string_format(TN_MNV5_BLK_PW_PROJ_BN, stage, blk_idx), false); | ||
| } | ||
| block.layer_scale_w = get_tensor(string_format(TN_MNV5_BLK_LAYER_SCALE, stage, blk_idx), false); | ||
| } | ||
| } | ||
| // 3. Check for Attention (MQA) | ||
| // Even if UIR/Edge check failed, this might be a pure attention block | ||
| ggml_tensor* attn_q_check = get_tensor(string_format(TN_MNV5_ATTN_Q_W, stage, blk_idx), false); | ||
| if (attn_q_check) { | ||
| found_block = true; | ||
| block.attn_q_w = attn_q_check; | ||
| block.attn_k_w = get_tensor(string_format(TN_MNV5_ATTN_K_W, stage, blk_idx), false); | ||
| block.attn_v_w = get_tensor(string_format(TN_MNV5_ATTN_V_W, stage, blk_idx), false); | ||
| block.attn_o_w = get_tensor(string_format(TN_MNV5_ATTN_O_W, stage, blk_idx), false); | ||
| block.attn_k_dw_w = get_tensor(string_format(TN_MNV5_ATTN_K_DW, stage, blk_idx), false); | ||
| block.attn_k_norm_w = get_tensor(string_format(TN_MNV5_ATTN_K_NORM, stage, blk_idx), false); | ||
| block.attn_v_dw_w = get_tensor(string_format(TN_MNV5_ATTN_V_DW, stage, blk_idx), false); | ||
| block.attn_v_norm_w = get_tensor(string_format(TN_MNV5_ATTN_V_NORM, stage, blk_idx), false); | ||
| block.attn_norm_w = get_tensor(string_format(TN_MNV5_ATTN_NORM, stage, blk_idx), false); | ||
| // Note: Attention blocks also have layer_scale, load it if not already loaded by UIR check | ||
| if (!block.layer_scale_w) { | ||
| block.layer_scale_w = get_tensor(string_format(TN_MNV5_BLK_LAYER_SCALE, stage, blk_idx), false); | ||
| } | ||
| } | ||
|
| if (found_block) { | ||
| model.mobilenet_blocks.push_back(block); | ||
| blocks_found_in_stage++; | ||
| } else { | ||
| // End of blocks for this stage | ||
| break; | ||
| } | ||
| } | ||
| // Track where this stage ends in the flat vector | ||
| if (blocks_found_in_stage > 0) { | ||
| model.mobilenet_stage_ends.push_back(model.mobilenet_blocks.size() - 1); | ||
| LOG_INF("%s: Stage %d ended at global block index %zu\n", __func__, stage, model.mobilenet_blocks.size() - 1); | ||
| } |
Critical: `mobilenetv5_block block;` is uninitialized (UB) before fields are checked or the struct is pushed to the vector.
Fields such as `block.layer_scale_w` are read later (and partially-filled structs may be pushed), which is undefined behavior unless the struct's members have default initializers.
Proposed fix
- mobilenetv5_block block;
+ mobilenetv5_block block{};📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| for (int blk_idx = 0; ; ++blk_idx) { | |
| bool found_block = false; | |
| mobilenetv5_block block{}; | |
| // 1. Check for Edge Residual (S0) | |
| block.s0_conv_exp_w = get_tensor(string_format(TN_MNV5_BLK_S0_EXP_W, stage, blk_idx), false); | |
| if (block.s0_conv_exp_w) { | |
| found_block = true; | |
| block.s0_bn1_w = get_tensor(string_format(TN_MNV5_BLK_S0_BN1_W, stage, blk_idx), false); | |
| block.s0_conv_pwl_w = get_tensor(string_format(TN_MNV5_BLK_S0_PWL_W, stage, blk_idx), false); | |
| block.s0_bn2_w = get_tensor(string_format(TN_MNV5_BLK_S0_BN2_W, stage, blk_idx), false); | |
| } | |
| // 2. Check for UIR (Universal Inverted Residual) | |
| else { | |
| // Check for dw_start OR pw_exp (some UIR blocks skip dw_start) | |
| block.dw_start_w = get_tensor(string_format(TN_MNV5_BLK_DW_START_W, stage, blk_idx), false); | |
| block.pw_exp_w = get_tensor(string_format(TN_MNV5_BLK_PW_EXP_W, stage, blk_idx), false); | |
| if (block.dw_start_w || block.pw_exp_w) { | |
| found_block = true; | |
| if (block.dw_start_w) { | |
| block.dw_start_bn_w = get_tensor(string_format(TN_MNV5_BLK_DW_START_BN, stage, blk_idx), false); | |
| } | |
| if (block.pw_exp_w) { | |
| block.pw_exp_bn_w = get_tensor(string_format(TN_MNV5_BLK_PW_EXP_BN, stage, blk_idx), false); | |
| } | |
| block.dw_mid_w = get_tensor(string_format(TN_MNV5_BLK_DW_MID_W, stage, blk_idx), false); | |
| if (block.dw_mid_w) { | |
| block.dw_mid_bn_w = get_tensor(string_format(TN_MNV5_BLK_DW_MID_BN, stage, blk_idx), false); | |
| } | |
| block.pw_proj_w = get_tensor(string_format(TN_MNV5_BLK_PW_PROJ_W, stage, blk_idx), false); | |
| if (block.pw_proj_w) { | |
| block.pw_proj_bn_w = get_tensor(string_format(TN_MNV5_BLK_PW_PROJ_BN, stage, blk_idx), false); | |
| } | |
| block.layer_scale_w = get_tensor(string_format(TN_MNV5_BLK_LAYER_SCALE, stage, blk_idx), false); | |
| } | |
| } | |
| // 3. Check for Attention (MQA) | |
| // Even if UIR/Edge check failed, this might be a pure attention block | |
| ggml_tensor* attn_q_check = get_tensor(string_format(TN_MNV5_ATTN_Q_W, stage, blk_idx), false); | |
| if (attn_q_check) { | |
| found_block = true; | |
| block.attn_q_w = attn_q_check; | |
| block.attn_k_w = get_tensor(string_format(TN_MNV5_ATTN_K_W, stage, blk_idx), false); | |
| block.attn_v_w = get_tensor(string_format(TN_MNV5_ATTN_V_W, stage, blk_idx), false); | |
| block.attn_o_w = get_tensor(string_format(TN_MNV5_ATTN_O_W, stage, blk_idx), false); | |
| block.attn_k_dw_w = get_tensor(string_format(TN_MNV5_ATTN_K_DW, stage, blk_idx), false); | |
| block.attn_k_norm_w = get_tensor(string_format(TN_MNV5_ATTN_K_NORM, stage, blk_idx), false); | |
| block.attn_v_dw_w = get_tensor(string_format(TN_MNV5_ATTN_V_DW, stage, blk_idx), false); | |
| block.attn_v_norm_w = get_tensor(string_format(TN_MNV5_ATTN_V_NORM, stage, blk_idx), false); | |
| block.attn_norm_w = get_tensor(string_format(TN_MNV5_ATTN_NORM, stage, blk_idx), false); | |
| // Note: Attention blocks also have layer_scale, load it if not already loaded by UIR check | |
| if (!block.layer_scale_w) { | |
| block.layer_scale_w = get_tensor(string_format(TN_MNV5_BLK_LAYER_SCALE, stage, blk_idx), false); | |
| } | |
| } | |
| if (found_block) { | |
| model.mobilenet_blocks.push_back(block); | |
| blocks_found_in_stage++; | |
| } else { | |
| // End of blocks for this stage | |
| break; | |
| } | |
| } | |
| // Track where this stage ends in the flat vector | |
| if (blocks_found_in_stage > 0) { | |
| model.mobilenet_stage_ends.push_back(model.mobilenet_blocks.size() - 1); | |
| LOG_INF("%s: Stage %d ended at global block index %zu\n", __func__, stage, model.mobilenet_blocks.size() - 1); | |
| } |
🤖 Prompt for AI Agents
In @tools/mtmd/clip.cpp around lines 1584 - 1655, The local mobilenetv5_block
variable is default-uninitialized causing UB when reading members like
layer_scale_w before assignment; fix by zero-initializing the struct instance at
creation (e.g., value-initialize mobilenetv5_block so all pointers/flags are
null/zero), or explicitly initialize all members you later read (layer_scale_w
and any pointer/flag fields) before any get_tensor checks, so that pushing to
model.mobilenet_blocks uses a fully-initialized block.
tools/mtmd/clip.cpp
Outdated
| case PROJECTOR_TYPE_GEMMA3N: | ||
| { | ||
| // MobileNetV5 MSFA adapter always outputs fixed 16x16 resolution | ||
| // regardless of input size (see architecture description) | ||
| n_patches = ctx->model.hparams.image_size / ctx->model.hparams.patch_size; | ||
| } break; |
Critical: GEMMA3N token count calculation is wrong (returns "patches per side", not total tokens).
The comment claims a fixed 16×16 (256-token) output, but `image_size / patch_size` is 16 for a correct (768, 48) setup, so only the per-side count is returned. This currently appears coupled to the known converter `patch_size` bug; it will break once `patch_size` is fixed semantically.
Proposed fix (matches 16x16 claim)
case PROJECTOR_TYPE_GEMMA3N:
{
// MobileNetV5 MSFA adapter always outputs fixed 16x16 resolution
// regardless of input size (see architecture description)
- n_patches = ctx->model.hparams.image_size / ctx->model.hparams.patch_size;
+ const int n_side = ctx->model.hparams.image_size / ctx->model.hparams.patch_size;
+ GGML_ASSERT(n_side > 0);
+ n_patches = n_side * n_side;
} break;
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| case PROJECTOR_TYPE_GEMMA3N: | |
| { | |
| // MobileNetV5 MSFA adapter always outputs fixed 16x16 resolution | |
| // regardless of input size (see architecture description) | |
| const int n_side = ctx->model.hparams.image_size / ctx->model.hparams.patch_size; | |
| GGML_ASSERT(n_side > 0); | |
| n_patches = n_side * n_side; | |
| } break; |
🤖 Prompt for AI Agents
In @tools/mtmd/clip.cpp around lines 3242 - 3247, The GEMMA3N branch incorrectly
sets n_patches to ctx->model.hparams.image_size / ctx->model.hparams.patch_size
(patches per side) instead of total tokens; change the calculation in the
PROJECTOR_TYPE_GEMMA3N case to compute total patches/tokens as (image_size /
patch_size) squared (e.g., n_patches = pow(ctx->model.hparams.image_size /
ctx->model.hparams.patch_size, 2) or multiply the quotient by itself) so the
value matches the 16×16 = 256 claim and is robust to a corrected patch_size.
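The arithmetic behind the suggested fix, using the (768, 48) figures quoted in the review:

```python
image_size, patch_size = 768, 48
n_side = image_size // patch_size   # patches per side, what the current code returns
n_patches = n_side * n_side         # total tokens, what the comment's 16x16 claim implies
print(n_side, n_patches)  # 16 256
```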
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
convert_hf_to_gguf.py (1)
6281-6311: Padding embeddings: padding is applied to the wrong dimension in per-layer embeddings. The tensor shape for `embed_tokens_per_layer` is `[n_embd_altup * n_layer, n_vocab]`, with vocab size in `shape[1]`, not `shape[0]`. The current padding logic at line 6307 concatenates along `dim=0` and uses `data_torch.shape[1]` for the second dimension, which pads the embedding dimension instead of the vocabulary dimension. The padding should be applied along `dim=1` to correctly extend the vocabulary axis. Additionally, add an explicit assertion that the tensor is 2D to prevent unhandled shape mismatches.
gguf-py/gguf/constants.py (1)
463-474: Remove orphaned `GEMMA3N` enum value that mismatches the C++ implementation.
`VISION_PROJECTOR_TYPE.GEMMA3N` exists in the Python enum but is never used anywhere in the codebase. The actual implementation splits Gemma3n into two separate projector types in C++: `PROJECTOR_TYPE_GEMMA3NV` (vision, `"gemma3nv"`) and `PROJECTOR_TYPE_GEMMA3NA` (audio, `"gemma3na"`), which are correctly exposed via the `VisionProjectorType` class as `GEMMA3NV` and `GEMMA3NA`.
Remove `GEMMA3N` from the `VISION_PROJECTOR_TYPE` enum and the `MODEL_ARCH` enum since it has no corresponding implementation. The Python codebase should only define enum values that map to actual projector types used by `convert_hf_to_gguf.py` or loaded by downstream consumers.
🤖 Fix all issues with AI agents
In @convert_hf_to_gguf.py:
- Around line 6058-6091: In modify_tensors, avoid hardcoding eps=1e-5 when
folding batch_norm; instead attempt to obtain eps from the model config (or a
provided attribute on the converter) before falling back to the default,
validate it is a small positive float, and emit a warning via the converter
logger if the config value is missing so the user is aware of the silent numeric
change; update references around self._batch_norm_tensors handling and the
computation of a = weight / torch.sqrt(running_var + eps) to use the chosen eps
and ensure map_tensor_name and block_count logic is unchanged.
- Around line 530-536: The fallback for max_name_len in prepare_tensors uses a
model-specific literal; change it to a shorter generic constant or derive it
from available keys to avoid embedding model names: when self.tensor_map.mapping
is empty, set max_name_len to a small fixed value (e.g., len("encoder.weight,"))
or compute max(len(k) for k in self.model_tensors.keys()) + len(".weight,") if
self.model_tensors exists, ensuring you reference the prepare_tensors method,
self.tensor_map.mapping and self.model_tensors when making the replacement.
- Around line 6235-6254: The current set_vocab method temporarily deletes
self.hparams["vocab_size_per_layer_input"] but does not guarantee restoration if
super().set_vocab() raises; wrap the call to super().set_vocab() in a
try/finally block so that vocab_size_per_layer_input (the saved variable) is
always restored to self.hparams after the call, ensuring no permanent mutation
of self.hparams even on exceptions; reference the set_vocab method, the local
variable vocab_size_per_layer_input, self.hparams, and the call to
super().set_vocab() when applying the change.
In @tools/mtmd/clip.cpp:
- Around line 3242-3247: n_patches is computed incorrectly for
PROJECTOR_TYPE_GEMMA3NV: instead of producing 16x16 (=256) tokens the code
divides image_size by patch_size only once; update the logic in the
PROJECTOR_TYPE_GEMMA3NV branch (the block that assigns n_patches) to yield the
total number of patches, either by setting n_patches to the fixed constant 256
if the adapter truly always outputs 16x16, or by computing (image_size /
patch_size) squared (i.e. multiply the per-dimension count by itself) using
ctx->model.hparams.image_size and ctx->model.hparams.patch_size so the result
reflects total tokens correctly.
- Around line 1567-1659: Summary: Add post-load validation to ensure MobileNetV5
blocks were actually discovered and tensor name patterns align with the
converter. After the per-stage loading loop, record per-stage counts (e.g., add
a local vector<int> stage_block_counts and increment with blocks_found_in_stage
inside the existing loop), then validate: assert
model.mobilenet_stage_ends.size() == 4 (or log error if not), verify each
stage_block_counts[stage] > 0 (log which stage is empty and bail), and check
total model.mobilenet_blocks.size() is within expected bounds (log actual vs
expected and abort on gross mismatch). Also emit a warning listing any missing
key tensor patterns (use TN_MNV5_BLK_S0_EXP_W, TN_MNV5_BLK_DW_START_W,
TN_MNV5_ATTN_Q_W, etc.) so mismatches with clip-impl.h / Python converter can be
diagnosed.
In @tools/mtmd/models/mobilenetv5.cpp:
- Around line 152-246: In build_mobilenet_attn add a divisibility assert before
computing n_head: insert GGML_ASSERT(q->ne[2] % D == 0) to ensure q->ne[2] is
divisible by D, and extend the spatial residual check to include height (require
inp->ne[1] == cur->ne[1] alongside inp->ne[0] and inp->ne[2]) so the residual
only applies when W, H and channels match; also verify the orientation/shape
passed to ggml_mul_mat(ctx0, k, q) and subsequent ggml_soft_max(ctx0, scores) so
they operate on tensors shaped as [D, M, 1, B] (k) and [D, N, n_head, B] (q) (or
transpose them appropriately) to produce scores of shape [D, M, N, B] for the
intended attention before softmax and matmul with v.
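Taking the n_patches item above as an example, the per-side vs. total-token arithmetic is small enough to sanity-check directly (a sketch using the 768-px input and 16x16 output grid the review assumes for Gemma3n):

```python
# Assumed Gemma3n values from the review: 768-px square input, 16x16 token grid.
image_size = 768
patch_size = 48  # effective per-side downsampling of the MobileNetV5 encoder

per_side = image_size // patch_size  # 16 tokens per side
n_patches = per_side * per_side      # total tokens: the per-dimension count must be squared
```

Dividing `image_size` by `patch_size` only once yields 16, not the 256 total tokens the adapter produces.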
🧹 Nitpick comments (10)
gguf-py/gguf/gguf_writer.py (1)
1086-1091: Clarify how `clip.projector_type` interacts with the new per-modality projector type keys.

With `add_clip_projector_type()` plus `add_clip_vision_projector_type()`/`add_clip_audio_projector_type()`, GGUFs can now encode the projector type in multiple places. To avoid interop issues, it'd help to standardize one of:
- precedence rules (e.g., prefer per-modality keys when present), and/or
- producer behavior (e.g., write both the legacy `Keys.Clip.PROJECTOR_TYPE` and the new per-modality key for backward compatibility).

Also applies to: 1172-1176
tools/mtmd/models/mobilenetv5.cpp (5)
5-20: Make `rms_norm_2d()` call sites independent of a default `eps` and validate weight broadcasting.

Many call sites pass only `(inp, weight)`; if `eps` isn't a defaulted parameter in the class declaration (in `tools/mtmd/models/models.h`), this won't compile. Also, `ggml_mul()` broadcast behavior depends on the exact weight tensor shape (1D vs `[C,1,1,1]`).

Proposed change (explicit eps at call sites):

- if (block.s0_bn1_w) cur = rms_norm_2d(cur, block.s0_bn1_w);
+ if (block.s0_bn1_w) cur = rms_norm_2d(cur, block.s0_bn1_w, 1e-6f);

(Repeat similarly for other `rms_norm_2d(cur, ...)` call sites.)
23-53: `pad_same_2d()`: avoid narrowing `int64_t` padding values to `int` without bounds checks.

`pad_h`/`pad_w` are `int64_t` but are narrowed to `int` for left/right/top/bottom. Probably fine for normal image sizes, but this is an easy footgun. If `ggml_pad_ext` takes `int`, consider asserting the pads fit, or keep them as `int64_t` until the call boundary.
57-88: Residual shape check should include height; also don't rely on an implicit `eps`.

The residual check currently compares channels and width, but not height (`ne[1]`). If anything ever produces non-square or otherwise mismatched spatial dims, this can add incompatible tensors.

Proposed change (height-aware residual condition):

- if (stride == 1 && inp->ne[2] == cur->ne[2] && inp->ne[0] == cur->ne[0]) {
+ if (stride == 1 &&
+     inp->ne[0] == cur->ne[0] &&
+     inp->ne[1] == cur->ne[1] &&
+     inp->ne[2] == cur->ne[2]) {
      cur = ggml_add(ctx0, cur, inp);
  }
91-149: Stage stride inference is brittle; prefer per-block stride metadata if available.

`stride = is_stage_start(i) ? 2 : 1;` assumes every stage start downsamples. If the upstream model ever has a stage that starts with stride=1, this silently builds the wrong graph. If stride exists in the converted config / tensor metadata, use it; otherwise, add asserts keyed off expected shapes.
248-451: MSFA path has a hardcoded `target_out_res = 16` and width-only upscaling; both are brittle.

- `const int target_out_res = 16;` should ideally be derived (e.g., `sqrt(image_seq_len)` or another hparam), otherwise variants won't work.
- Upscale uses `scale_w` only and asserts only `high_res_w % feat_w == 0`; if height differs too, you can build inconsistent shapes.

Proposed change (height checks + scale_h parity):

- int scale_w = high_res_w / feat_w;
- // int scale_h = high_res_h / feat_h;
+ int scale_w = high_res_w / feat_w;
+ int scale_h = high_res_h / feat_h;
- GGML_ASSERT(high_res_w % feat_w == 0);
+ GGML_ASSERT(high_res_w % feat_w == 0);
+ GGML_ASSERT(high_res_h % feat_h == 0);
+ GGML_ASSERT(scale_w == scale_h); // if ggml_upscale only supports uniform scaling
- feat = ggml_upscale(ctx0, feat, scale_w, ggml_scale_mode::GGML_SCALE_MODE_NEAREST);
+ feat = ggml_upscale(ctx0, feat, scale_w, ggml_scale_mode::GGML_SCALE_MODE_NEAREST);

convert_hf_to_gguf.py (2)
6100-6125: Annotate `block_tensor_mapping` as `ClassVar` (and keep it immutable by convention).

This is a constant mapping; make that explicit to satisfy linters and avoid accidental per-instance mutation. (Ruff RUF012)

Proposed fix:

+from typing import ClassVar
+
 class Gemma3nVisionAudioModel(ConformerAudioModel):
@@
-    block_tensor_mapping = {
+    block_tensor_mapping: ClassVar[dict[str, str]] = {
         "model.vision_tower.timm_model.blocks.{bid}.{sid}.conv_exp.weight": "v.blk.{bid}.{sid}.conv_exp.weight",
         ...
     }
6175-6187: `custom_map()` should validate it's actually mapping a MobileNet block path.

Right now it assumes `parts[4]`/`parts[5]` are `{bid}.{sid}` whenever `len(parts) >= 7`, which could mis-map other similarly long names. Add a quick guard like `parts[:4] == ["model", "vision_tower", "timm_model", "blocks"]`.

tools/mtmd/clip.cpp (2)
2115-2115: TODO: Audio support for Gemma3n

The code skips audio loading for GEMMA3NV with a TODO comment indicating that audio tensors exist in the GGUF but are not yet supported. This is a reasonable temporary workaround.
Consider opening a tracking issue for implementing Gemma3n audio support to ensure this TODO is addressed in a future update.
Also applies to: 2125-2127, 2132-2132
3640-3640: LGTM: Helper function updates with minor note

The additions to switch statements and helper functions are consistent:
- Line 3640: Correctly groups GEMMA3NV with similar projector types
- Line 3768: Returns `mm_input_proj_w->ne[0]`, matching GEMMA3 behavior
- Lines 3812-3820: Correctly excludes GEMMA3NV from mRoPE projectors
- Lines 3836-3845: Correctly excludes GEMMA3NV from Whisper encoders

Note: Functions `clip_is_minicpmv` and `clip_is_glm` are marked with `// TODO: remove this function` (lines 3799, 3807), indicating they're deprecated. Consider filing a cleanup issue to remove these in a future refactor.

Also applies to: 3768-3768, 3799-3820, 3836-3845
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- convert_hf_to_gguf.py
- gguf-py/gguf/constants.py
- gguf-py/gguf/gguf_writer.py
- tools/mtmd/clip-impl.h
- tools/mtmd/clip.cpp
- tools/mtmd/clip.h
- tools/mtmd/models/mobilenetv5.cpp
- tools/mtmd/mtmd.cpp
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".
When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
Applied to files:
tools/mtmd/clip-impl.h, tools/mtmd/clip.cpp, gguf-py/gguf/constants.py, tools/mtmd/mtmd.cpp
🧬 Code graph analysis (2)
gguf-py/gguf/gguf_writer.py (1)
- gguf-py/gguf/constants.py (3): `Keys` (20-332), `ClipVision` (284-308), `ClipAudio` (310-323)

convert_hf_to_gguf.py (3)
- gguf-py/gguf/constants.py (1): `VisionProjectorType` (3560-3584)
- ggml/src/ggml-vulkan/ggml-vulkan.cpp (6): `b` (391-394), `b` (391-391), `b` (403-406), `b` (403-403), `b` (415-418), `b` (415-415)
- gguf-py/gguf/gguf_writer.py (5): `add_clip_vision_projector_type` (1089-1090), `add_vision_attention_layernorm_eps` (1110-1111), `add_clip_audio_projector_type` (1174-1175), `add_audio_num_mel_bins` (1195-1196), `add_audio_attention_layernorm_eps` (1192-1193)
🪛 Ruff (0.14.10)
convert_hf_to_gguf.py
6054-6054: Parenthesize a and b expressions when chaining and and or together, to make the precedence clear
Parenthesize the and subexpression
(RUF021)
6102-6125: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
6186-6186: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: build-linux-cross / debian-13-loongarch64-cpu-cross
- GitHub Check: build-cmake-pkg / linux
- GitHub Check: build-linux-cross / ubuntu-24-riscv64-cpu-spacemit-ime-cross
- GitHub Check: build-linux-cross / debian-13-loongarch64-vulkan-cross
- GitHub Check: openEuler-latest-cmake-cann (x86, 310p, Release)
- GitHub Check: openEuler-latest-cmake-cann (x86, 910b, Release)
- GitHub Check: ubuntu-24-cmake-vulkan-deb
- GitHub Check: ubuntu-24-wasm-webgpu
- GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
- GitHub Check: ubuntu-24-cmake-webgpu
- GitHub Check: macOS-latest-cmake-x64
- GitHub Check: ubuntu-24-cmake-vulkan
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: server (UNDEFINED, RelWithDebInfo)
- GitHub Check: server (ADDRESS, RelWithDebInfo)
- GitHub Check: server (Release, LLAMA_ARG_BACKEND_SAMPLING=1)
- GitHub Check: server (Release)
- GitHub Check: server-windows
- GitHub Check: pyright type-check
- GitHub Check: check-requirements
🔇 Additional comments (14)
tools/mtmd/mtmd.cpp (2)

864-874: Non-causal decode selection is vision-only; confirm audio-only behavior is intended.

`mtmd_decode_use_non_causal()` keys off `ctx->proj_type_v()` only. For audio-only mmproj files (`ctx_v == nullptr`), this always returns `false`. If any audio projector types require non-causal decoding, this will be wrong.

268-316: Verify whether GEMMA3NV uses the same `<start_of_image>`/`<end_of_image>` BOI/EOI tokens as GEMMA3.

The code treats both identically (line 269 of mtmd.cpp), but GEMMA3NV uses a fundamentally different vision architecture (MobileNetV5 encoder) compared to GEMMA3's standard projector. Without explicit tokenizer confirmation, this shared token assignment could cause prompt-formatting issues if Gemma3n's tokenizer handles these strings differently.
tools/mtmd/clip.h (1)

105-111: No action needed. The removal of `clip_is_gemma3()` is safe; no remaining call sites or references exist in the codebase.

convert_hf_to_gguf.py (3)
6188-6211: Verify unsqueeze semantics for `conv_stem.conv.bias` / `layer_scale.gamma`.

Converting 1D tensors into `[1, C, 1, 1]` may be required by your ggml/mtmd loader, but it's non-obvious and easy to get wrong (especially for layer_scale, which is often applied as a vector). Please double-check the corresponding C++ tensor shapes expected in the MobileNetV5 graph/loader and add a short comment explaining the expected runtime broadcast.
10151-10167: LFM2 multimodal skipping looks fine.

Using `ConformerAudioModel.is_audio_tensor()` here is a pragmatic way to avoid dragging audio weights into the text GGUF.
10295-10327: LFM2AudioModel wiring is reasonable; confirm `block_count` discovery for this encoder.

Given `MmprojModel.__init__` derives `block_count` from `n_block_keys`, please confirm the LFM2 audio encoder config (returned by `get_audio_config()`) actually contains one of those keys, otherwise initialization may break (or produce a bad tensor map).
gguf-py/gguf/constants.py (2)

278-286: Nice improvement: explicit per-modality projector type keys (vision/audio).

This aligns with mixed-modality models and matches the mtmd side's `clip.vision.projector_type` / `clip.audio.projector_type` split.

Also applies to: 310-312
681-689: New gemma3n tensor IDs/names look coherent; please sanity-check name suffix conventions end-to-end.

Given mtmd expects explicit `*.weight`/`*.bias` tensor names, verify that the python-side "base names" (e.g. `v.conv_stem.conv`, `v.conv_stem.bn`, `v.msfa.norm`) are expanded consistently by the writer/loader for all required parameters (especially norms that may need both weight and bias).

Also applies to: 717-747, 1097-1106, 1135-1165, 1216-1292
tools/mtmd/clip-impl.h (2)
205-237: Projector type wiring for gemma3nv/gemma3na looks correct and consistent within mtmd.

Enum entries and `PROJECTOR_TYPE_NAMES` additions are straightforward. (Per your prior pattern, keeping QWEN25O as a replaceable placeholder remains fine.)

Also applies to: 239-269
157-196: BN macros use RMS normalization, not BatchNorm; no bias/stats needed.

The concern about missing bias and running statistics is based on a misunderstanding of the normalization type. The code uses `rms_norm_2d()` for all these "BN" tensors, which implements RMS (Root Mean Square) normalization. RMS norm is a stateless operation that only requires the scale parameter (weight); it does not use bias or running statistics like BatchNorm does. The weight-only macro definitions are correct and complete for this use case.

tools/mtmd/clip.cpp (4)
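The weight-only RMS claim above can be checked numerically; a minimal sketch with illustrative toy values (no bias, no running statistics, only a per-channel scale):

```python
import math

# Toy 1-D "channel" vector and scale weights; illustrative values only.
x = [1.0, -2.0, 3.0]
weight = [0.5, 1.5, 1.0]
eps = 1e-6

# RMS norm: divide by the root-mean-square of x, then apply the learned scale.
rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
y = [w * v / rms for w, v in zip(weight, x)]
```

Unlike BatchNorm, nothing here depends on `running_mean`/`running_var`, which is why the weight-only tensor macros suffice.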
791-794: LGTM: Graph builder routing

The GEMMA3NV routing to `clip_graph_mobilenetv5` follows the established pattern for other projector types.
1349-1351: LGTM: Correct architecture-specific handling

Setting `n_layer = 0` for GEMMA3NV is appropriate since MobileNetV5 uses a custom block structure instead of standard ViT layers. This prevents the loading loop at lines 1354-1425 from attempting to load non-existent standard layer tensors.
2970-2978: LGTM: Preprocessing path

The GEMMA3NV preprocessing correctly resizes to square without padding (`add_padding = false`), which differs from GEMMA3's behavior. This architectural difference is appropriate for MobileNetV5.
1153-1160: Code is correct; ensure Python converter bug is addressed separately

The hparams initialization for GEMMA3NV is correct. MobileNetV5 produces 16x16 tokens as stated, and n_merge=1 is appropriate since the encoder handles all spatial downsampling internally (contrasting with GEMMA3's n_merge=4). However, the Python converter has a confirmed bug: it computes `patch_size = image_size // image_seq_length` (e.g., 768 // 256 = 3) instead of correctly deriving it from the 16x16 grid dimensions (768 // 16 = 48). While this converter bug doesn't directly break the C++ code's n_merge setting, ensure the Python converter is fixed to avoid downstream issues with patch_size-dependent operations.
def prepare_tensors(self):
    # Handle empty tensor_map for models with block_count=0 (like MobileNetV5)
    if self.tensor_map.mapping:
        max_name_len = max(len(s) for _, s in self.tensor_map.mapping.values()) + len(".weight,")
    else:
        max_name_len = len("vision_encoder.weight,")  # Default reasonable length
Good guard for block_count=0 models; tighten the fallback log-width constant.
This is only for log formatting, so correctness impact is low. Consider using a shorter constant (or deriving from self.model_tensors keys) to avoid embedding model-specific names into the generic base class.
🤖 Prompt for AI Agents
In @convert_hf_to_gguf.py around lines 530 - 536, The fallback for max_name_len
in prepare_tensors uses a model-specific literal; change it to a shorter generic
constant or derive it from available keys to avoid embedding model names: when
self.tensor_map.mapping is empty, set max_name_len to a small fixed value (e.g.,
len("encoder.weight,")) or compute max(len(k) for k in
self.model_tensors.keys()) + len(".weight,") if self.model_tensors exists,
ensuring you reference the prepare_tensors method, self.tensor_map.mapping and
self.model_tensors when making the replacement.
def tensor_force_quant(self, name, new_name, bid, n_dims):
    if ConformerAudioModel.is_audio_tensor(name):
        if ".conv" in name or "_conv" in name and ".weight" in name:
            return gguf.GGMLQuantizationType.F32
    return super().tensor_force_quant(name, new_name, bid, n_dims)
Fix boolean-precedence bug in conv quantization predicate.
`or`/`and` precedence makes the condition read differently than it looks; likely you intended to require `".weight"` for both the `".conv"` and `"_conv"` cases. (Ruff RUF021)
Proposed fix
- if ConformerAudioModel.is_audio_tensor(name):
- if ".conv" in name or "_conv" in name and ".weight" in name:
- return gguf.GGMLQuantizationType.F32
+ if ConformerAudioModel.is_audio_tensor(name):
+ if ((".conv" in name) or ("_conv" in name)) and (".weight" in name):
+            return gguf.GGMLQuantizationType.F32

🧰 Tools
🪛 Ruff (0.14.10)
6054-6054: Parenthesize a and b expressions when chaining and and or together, to make the precedence clear
Parenthesize the and subexpression
(RUF021)
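To see why the original predicate misfires, a quick precedence check on a hypothetical tensor name (Python parses `A or B and C` as `A or (B and C)`):

```python
# Hypothetical tensor name: contains ".conv" but is NOT a .weight tensor.
name = "audio.conv_block.bias"

# Buggy form: the "and" binds to the second operand only, so the first
# ".conv" match short-circuits the whole expression to True.
buggy = ".conv" in name or "_conv" in name and ".weight" in name

# Parenthesized form: ".weight" is required for both conv patterns.
fixed = ((".conv" in name) or ("_conv" in name)) and (".weight" in name)
```

With the buggy form, a conv bias tensor would be forced to F32 even though only conv weights were intended.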
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
    # fold running_mean, running_var and eps into weight and bias for batch_norm
    if "batch_norm" in name:
        if self._batch_norm_tensors is None:
            self._batch_norm_tensors = [{} for _ in range(self.block_count)]
        assert bid is not None
        self._batch_norm_tensors[bid][name] = data_torch

        if len(self._batch_norm_tensors[bid]) < 5:
            return []

        weight = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.weight"]
        bias = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.bias"]
        running_mean = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.running_mean"]
        running_var = self._batch_norm_tensors[bid][f"conformer.layers.{bid}.conv.batch_norm.running_var"]
        eps = 1e-5  # default value

        a = weight / torch.sqrt(running_var + eps)
        b = bias - running_mean * a
        return [
            (self.map_tensor_name(f"conformer.layers.{bid}.conv.batch_norm.weight"), a),
            (self.map_tensor_name(f"conformer.layers.{bid}.conv.batch_norm.bias"), b),
        ]

    # reshape conv weights
    if name.startswith("conformer.pre_encode.conv.") and name.endswith(".bias"):
        data_torch = data_torch[:, None, None]
    if "conv.depthwise_conv" in name and name.endswith(".weight"):
        assert data_torch.shape[1] == 1
        data_torch = data_torch.reshape(data_torch.shape[0], data_torch.shape[2])
    if "conv.pointwise_conv" in name and name.endswith(".weight"):
        assert data_torch.shape[2] == 1
        data_torch = data_torch.reshape(data_torch.shape[0], data_torch.shape[1])
BatchNorm folding: avoid hardcoding eps (or at least document/validate it).
BN eps isn’t in the state_dict; hardcoding 1e-5 is a reasonable default, but if the source model uses a different value this silently changes numerics. Suggest: (1) try to read it from config if available, else (2) keep the default but add a warning when folding.
🤖 Prompt for AI Agents
In @convert_hf_to_gguf.py around lines 6058 - 6091, In modify_tensors, avoid
hardcoding eps=1e-5 when folding batch_norm; instead attempt to obtain eps from
the model config (or a provided attribute on the converter) before falling back
to the default, validate it is a small positive float, and emit a warning via
the converter logger if the config value is missing so the user is aware of the
silent numeric change; update references around self._batch_norm_tensors
handling and the computation of a = weight / torch.sqrt(running_var + eps) to
use the chosen eps and ensure map_tensor_name and block_count logic is
unchanged.
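Whatever eps ends up being used, the folding algebra itself is easy to verify: BN(x) = weight * (x - mean) / sqrt(var + eps) + bias collapses to the affine form a*x + b used by the converter. A sketch with illustrative scalar values standing in for one channel:

```python
import math

# Toy scalars standing in for one channel's BatchNorm parameters (illustrative values).
weight, bias = 2.0, 0.5
running_mean, running_var = 1.0, 4.0
eps = 1e-5

# Folded affine form produced by the converter: y = a*x + b
a = weight / math.sqrt(running_var + eps)
b = bias - running_mean * a

# Compare against the unfolded BatchNorm inference formula at a sample point.
x = 3.0
bn_out = weight * (x - running_mean) / math.sqrt(running_var + eps) + bias
folded_out = a * x + b
```

The two outputs agree up to floating-point rounding, which is exactly why a wrong eps silently shifts numerics rather than erroring out.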
def __init__(self, *args, **kwargs):
    # Parent init will call find_hparam which now returns 0 for empty keys
    super().__init__(*args, **kwargs)
    assert self.hparams_vision is not None
    self.hparams_vision["n_layers"] = 128  # fake value for audio encoder, vision encoder doesn't use it
    self.hparams_vision["intermediate_size"] = self.hparams_vision.get("intermediate_size", 2048) * 4
    self.hparams_vision["num_attention_heads"] = self.hparams_vision.get("num_attention_heads", 8)

    # MobileNetV5 does not use image_mean/std
    self.preprocessor_config["image_mean"] = [0.0, 0.0, 0.0]
    self.preprocessor_config["image_std"] = [1.0, 1.0, 1.0]
    self.hparams_vision["image_size"] = self.preprocessor_config.get(
        "size", {"height": 768, "width": 768}
    )["height"]

    # Image sequence length (256 tokens = 16x16 for Gemma3n)
    image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
    image_size = self.hparams_vision["image_size"]
    self.hparams_vision["patch_size"] = image_size // image_seq_length

    # remap audio hparams
    assert self.hparams_audio is not None
    self.hparams_audio["n_layers"] = self.hparams_audio["conf_num_hidden_layers"]
    self.hparams_audio["num_attention_heads"] = self.hparams_audio["conf_num_attention_heads"]
    self.hparams_audio["feat_in"] = self.hparams_audio["input_feat_size"]
    self.hparams_audio["intermediate_size"] = self.hparams_audio.get("intermediate_size", 6144)
🧩 Analysis chain
🏁 Script executed:
# First, let's find and examine the context of this code in the repository
fd "convert_hf_to_gguf.py" --type f

Repository: ngxson/llama.cpp
Length of output: 81
🌐 Web query:
Gemma3n MobileNetV5 vision encoder patch size configuration llama.cpp
💡 Result:
Short answer: llama.cpp expects the Gemma vision encoder patch_size to be 14 (stored in the GGUF as clip.vision.patch_size). See the Gemma3n vision docs for the encoder/image defaults and the llama.cpp logs showing clip.vision.patch_size = 14.
Sources: [Hugging Face Gemma3n docs][1], [llama.cpp issue logs showing clip.vision.patch_size = 14][2].
🌐 Web query:
image_seq_length patch_size 2D grid calculation vision transformer
💡 Result:
Formula:
- Grid size per side: grid_h = H / P, grid_w = W / P (integer division if using non-overlapping patches).
- Number of patches (sequence length, excluding class token): N = grid_h * grid_w = (H / P) * (W / P).
- If a class token is prepended, image_seq_length = N + 1.
Example (H=W=224, P=16):
- grid = (224/16, 224/16) = (14, 14)
- N = 14 * 14 = 196
- image_seq_length (with class token) = 196 + 1 = 197.
Sources:
[1] Torchvision ViT implementation (positional embedding / seq length calculation).
[2] ViT patching explanation / tutorial (patch → (H/P × W/P) patches).
🏁 Script executed:
# Search for Gemma3n related code in the repository
rg "Gemma3n" --type py -A 5 -B 5

Repository: ngxson/llama.cpp
Length of output: 3851
🏁 Script executed:
# Look for vision encoder or patch_size references in the codebase
rg "patch_size|image_seq_length" --type py -B 3 -A 3 | head -100

Repository: ngxson/llama.cpp
Length of output: 8469
🏁 Script executed:
# Search for the specific class or method that contains this __init__
rg "hparams_vision.*=.*None" --type py -B 10 | head -50

Repository: ngxson/llama.cpp
Length of output: 734
patch_size computation must use patches-per-side, not total sequence length.
self.hparams_vision["patch_size"] = image_size // image_seq_length (e.g., 768 // 256 = 3) is incorrect for a 2D patch grid. Since the comment states 256 tokens = 16x16, the patch size should be computed per-side: patch_size = image_size // sqrt(image_seq_length) (e.g., 768 // 16 = 48). This aligns with standard vision transformer patching and matches the correct implementation already present in the same codebase (Tinygemma3 model). Without this fix, downstream token counts and attention operations will be semantically incorrect.
Proposed fix
# Image sequence length (256 tokens = 16x16 for Gemma3n)
image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
image_size = self.hparams_vision["image_size"]
- self.hparams_vision["patch_size"] = image_size // image_seq_length
+ n_per_side = int(image_seq_length ** 0.5)
+ if n_per_side * n_per_side != image_seq_length:
+ raise ValueError(f"image_seq_length must be a perfect square, got {image_seq_length}")
+ if image_size % n_per_side != 0:
+ raise ValueError(f"image_size {image_size} not divisible by patches-per-side {n_per_side}")
+    self.hparams_vision["patch_size"] = image_size // n_per_side

Also review the hardcoded fallbacks (`intermediate_size * 4`, `num_attention_heads = 8`); prefer reading from the vision config when present, with defaults only when missing.
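The per-side derivation suggested above is easy to demonstrate with the review's assumed Gemma3n preprocessor values (768-px input, 256-token sequence):

```python
# Assumed Gemma3n preprocessor values: 768-px input, 256-token sequence (16x16 grid).
image_size = 768
image_seq_length = 256

wrong = image_size // image_seq_length     # 3: divides by the TOTAL token count
n_per_side = int(image_seq_length ** 0.5)  # 16: tokens per side of the 2D grid
assert n_per_side * n_per_side == image_seq_length
patch_size = image_size // n_per_side      # 48: the correct per-side patch size
```

The buggy form produces a patch_size of 3, which no downstream grid arithmetic can reconcile with a 16x16 output.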
def set_vocab(self):
    # For Gemma3n multimodal models, we need the FULL vocab_size (262400)
    # which includes special tokens from 262144-262399 for vision/audio.
    # The vocab_size_per_layer_input (262144) is only the embedding size per layer.
    # Temporarily override the hparams lookup order to prioritize vocab_size.

    # Store original vocab_size_per_layer_input if it exists
    vocab_size_per_layer_input = self.hparams.get("vocab_size_per_layer_input")

    # Temporarily remove vocab_size_per_layer_input to force using vocab_size
    if vocab_size_per_layer_input is not None:
        del self.hparams["vocab_size_per_layer_input"]

    # Call parent set_vocab which will now use vocab_size (262400)
    super().set_vocab()

    # Restore vocab_size_per_layer_input for later use
    if vocab_size_per_layer_input is not None:
        self.hparams["vocab_size_per_layer_input"] = vocab_size_per_layer_input
Ensure vocab override is restored with try/finally.
If super().set_vocab() throws, vocab_size_per_layer_input won’t be restored, leaving self.hparams in a mutated state (even if the process continues for any reason, e.g., partial conversions/tests).
Proposed fix
vocab_size_per_layer_input = self.hparams.get("vocab_size_per_layer_input")
- if vocab_size_per_layer_input is not None:
- del self.hparams["vocab_size_per_layer_input"]
-
- # Call parent set_vocab which will now use vocab_size (262400)
- super().set_vocab()
-
- # Restore vocab_size_per_layer_input for later use
- if vocab_size_per_layer_input is not None:
- self.hparams["vocab_size_per_layer_input"] = vocab_size_per_layer_input
+ if vocab_size_per_layer_input is not None:
+ del self.hparams["vocab_size_per_layer_input"]
+ try:
+ # Call parent set_vocab which will now use vocab_size (262400)
+ super().set_vocab()
+ finally:
+ if vocab_size_per_layer_input is not None:
+            self.hparams["vocab_size_per_layer_input"] = vocab_size_per_layer_input

🤖 Prompt for AI Agents
In @convert_hf_to_gguf.py around lines 6235 - 6254, The current set_vocab method
temporarily deletes self.hparams["vocab_size_per_layer_input"] but does not
guarantee restoration if super().set_vocab() raises; wrap the call to
super().set_vocab() in a try/finally block so that vocab_size_per_layer_input
(the saved variable) is always restored to self.hparams after the call, ensuring
no permanent mutation of self.hparams even on exceptions; reference the
set_vocab method, the local variable vocab_size_per_layer_input, self.hparams,
and the call to super().set_vocab() when applying the change.
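The restore-on-exception pattern being requested can be sketched in isolation (with a stand-in `hparams` dict and a simulated failure in place of the real `super().set_vocab()` call):

```python
# Minimal sketch of the delete/restore pattern with a try/finally guard.
hparams = {"vocab_size": 262400, "vocab_size_per_layer_input": 262144}

def set_vocab(fail: bool) -> None:
    saved = hparams.get("vocab_size_per_layer_input")
    if saved is not None:
        del hparams["vocab_size_per_layer_input"]
    try:
        # Stand-in for super().set_vocab(), which may raise.
        if fail:
            raise RuntimeError("simulated failure in super().set_vocab()")
    finally:
        # Restoration happens whether or not the body raised.
        if saved is not None:
            hparams["vocab_size_per_layer_input"] = saved

try:
    set_vocab(fail=True)
except RuntimeError:
    pass
```

Without the `finally`, the simulated exception would leave `hparams` permanently missing its key.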
case PROJECTOR_TYPE_GEMMA3NV:
    {
        model.mobilenet_stem_conv_w = get_tensor(TN_MNV5_STEM_CONV, false);
        model.mobilenet_stem_conv_b = get_tensor(TN_MNV5_STEM_BIAS, false);
        model.mobilenet_stem_norm_w = get_tensor(TN_MNV5_STEM_BN, false);

        model.msfa_ffn_expand_w   = get_tensor(TN_MNV5_MSFA_FFN_EXP_W, false);
        model.msfa_ffn_expand_bn  = get_tensor(TN_MNV5_MSFA_FFN_EXP_BN, false); // Consume BN if present but likely folded
        model.msfa_ffn_project_w  = get_tensor(TN_MNV5_MSFA_FFN_PROJ_W, false);
        model.msfa_ffn_project_bn = get_tensor(TN_MNV5_MSFA_FFN_PROJ_BN, false);

        model.msfa_concat_norm_w = get_tensor(TN_MNV5_MSFA_NORM, false);

        // Dynamically load blocks stage by stage
        for (int stage = 0; stage < 4; ++stage) {
            int blocks_found_in_stage = 0;

            for (int blk_idx = 0; ; ++blk_idx) {
                bool found_block = false;
                mobilenetv5_block block;

                // 1. Check for Edge Residual (S0)
                block.s0_conv_exp_w = get_tensor(string_format(TN_MNV5_BLK_S0_EXP_W, stage, blk_idx), false);
                if (block.s0_conv_exp_w) {
                    found_block = true;
                    block.s0_bn1_w      = get_tensor(string_format(TN_MNV5_BLK_S0_BN1_W, stage, blk_idx), false);
                    block.s0_conv_pwl_w = get_tensor(string_format(TN_MNV5_BLK_S0_PWL_W, stage, blk_idx), false);
                    block.s0_bn2_w      = get_tensor(string_format(TN_MNV5_BLK_S0_BN2_W, stage, blk_idx), false);
                }
                // 2. Check for UIR (Universal Inverted Residual)
                else {
                    // Check for dw_start OR pw_exp (some UIR blocks skip dw_start)
                    block.dw_start_w = get_tensor(string_format(TN_MNV5_BLK_DW_START_W, stage, blk_idx), false);
                    block.pw_exp_w   = get_tensor(string_format(TN_MNV5_BLK_PW_EXP_W, stage, blk_idx), false);

                    if (block.dw_start_w || block.pw_exp_w) {
                        found_block = true;
                        if (block.dw_start_w) {
                            block.dw_start_bn_w = get_tensor(string_format(TN_MNV5_BLK_DW_START_BN, stage, blk_idx), false);
                        }
                        if (block.pw_exp_w) {
                            block.pw_exp_bn_w = get_tensor(string_format(TN_MNV5_BLK_PW_EXP_BN, stage, blk_idx), false);
                        }
                        block.dw_mid_w = get_tensor(string_format(TN_MNV5_BLK_DW_MID_W, stage, blk_idx), false);
                        if (block.dw_mid_w) {
                            block.dw_mid_bn_w = get_tensor(string_format(TN_MNV5_BLK_DW_MID_BN, stage, blk_idx), false);
                        }
                        block.pw_proj_w = get_tensor(string_format(TN_MNV5_BLK_PW_PROJ_W, stage, blk_idx), false);
                        if (block.pw_proj_w) {
                            block.pw_proj_bn_w = get_tensor(string_format(TN_MNV5_BLK_PW_PROJ_BN, stage, blk_idx), false);
                        }
                        block.layer_scale_w = get_tensor(string_format(TN_MNV5_BLK_LAYER_SCALE, stage, blk_idx), false);
                    }
                }

                // 3. Check for Attention (MQA)
                // Even if UIR/Edge check failed, this might be a pure attention block
                ggml_tensor* attn_q_check = get_tensor(string_format(TN_MNV5_ATTN_Q_W, stage, blk_idx), false);
                if (attn_q_check) {
                    found_block = true;
                    block.attn_q_w      = attn_q_check;
                    block.attn_k_w      = get_tensor(string_format(TN_MNV5_ATTN_K_W, stage, blk_idx), false);
                    block.attn_v_w      = get_tensor(string_format(TN_MNV5_ATTN_V_W, stage, blk_idx), false);
                    block.attn_o_w      = get_tensor(string_format(TN_MNV5_ATTN_O_W, stage, blk_idx), false);
                    block.attn_k_dw_w   = get_tensor(string_format(TN_MNV5_ATTN_K_DW, stage, blk_idx), false);
                    block.attn_k_norm_w = get_tensor(string_format(TN_MNV5_ATTN_K_NORM, stage, blk_idx), false);
                    block.attn_v_dw_w   = get_tensor(string_format(TN_MNV5_ATTN_V_DW, stage, blk_idx), false);
                    block.attn_v_norm_w = get_tensor(string_format(TN_MNV5_ATTN_V_NORM, stage, blk_idx), false);
                    block.attn_norm_w   = get_tensor(string_format(TN_MNV5_ATTN_NORM, stage, blk_idx), false);
                    // Note: Attention blocks also have layer_scale, load it if not already loaded by UIR check
                    if (!block.layer_scale_w) {
                        block.layer_scale_w = get_tensor(string_format(TN_MNV5_BLK_LAYER_SCALE, stage, blk_idx), false);
                    }
                }

                if (found_block) {
                    model.mobilenet_blocks.push_back(block);
                    blocks_found_in_stage++;
                } else {
                    // End of blocks for this stage
                    break;
                }
            }

            // Track where this stage ends in the flat vector
            if (blocks_found_in_stage > 0) {
                model.mobilenet_stage_ends.push_back(model.mobilenet_blocks.size() - 1);
                LOG_INF("%s: Stage %d ended at global block index %zu\n", __func__, stage, model.mobilenet_blocks.size() - 1);
            }
        }
        model.mm_input_proj_w    = get_tensor(TN_MM_INP_PROJ);
        model.mm_soft_emb_norm_w = get_tensor(TN_MM_SOFT_EMB_N);
    } break;
🧩 Analysis chain
🏁 Script executed:
rg "TN_MNV5" --type cpp -B 2 -A 2 | head -100
Repository: ngxson/llama.cpp
Length of output: 8387
🏁 Script executed:
fd . -name "*.h" -type f | xargs rg -l "TN_MNV5" | head -5
Repository: ngxson/llama.cpp
Length of output: 350
🏁 Script executed:
rg -n "mobilenet_stage_ends" --type cpp
Repository: ngxson/llama.cpp
Length of output: 693
🏁 Script executed:
rg -n "TN_MNV5_BLK_S0_EXP_W\|TN_MNV5_ATTN_Q_W" --type cpp -B 5
Repository: ngxson/llama.cpp
Length of output: 42
🏁 Script executed:
fd "convert" --type f | grep -i "mobilenet\|clip" | head -10
Repository: ngxson/llama.cpp
Length of output: 42
🏁 Script executed:
rg -l "conv_stem\|blk.*conv_exp" --type py | head -5
Repository: ngxson/llama.cpp
Length of output: 42
🏁 Script executed:
fd "*.py" -type f | xargs rg -l "mobilenet" | head -5
Repository: ngxson/llama.cpp
Length of output: 378
🏁 Script executed:
cat -n tools/mtmd/clip.cpp | sed -n '1650,1660p'
Repository: ngxson/llama.cpp
Length of output: 772
🏁 Script executed:
rg -A 10 "mobilenet_stage_ends.push_back" tools/mtmd/clip.cpp
Repository: ngxson/llama.cpp
Length of output: 674
🏁 Script executed:
find . -name "*.py" -type f | xargs grep -l "gemma3n\|mobilenetv5" 2>/dev/null | head -5
Repository: ngxson/llama.cpp
Length of output: 144
🏁 Script executed:
rg -A 5 "models.mobilenet_blocks" tools/mtmd/models/mobilenetv5.cpp | head -30
Repository: ngxson/llama.cpp
Length of output: 42
🏁 Script executed:
cat -n tools/mtmd/clip.cpp | sed -n '1567,1580p'
Repository: ngxson/llama.cpp
Length of output: 1037
🏁 Script executed:
rg -B 5 -A 15 "blocks_found_in_stage" tools/mtmd/clip.cpp | head -40
Repository: ngxson/llama.cpp
Length of output: 2169
🏁 Script executed:
cat -n gguf-py/gguf/tensor_mapping.py | head -50
Repository: ngxson/llama.cpp
Length of output: 3017
🏁 Script executed:
rg -B 5 -A 5 "gemma.*3n\|GEMMA3NV" gguf-py/gguf/tensor_mapping.py | head -40
Repository: ngxson/llama.cpp
Length of output: 42
🏁 Script executed:
rg "conv_stem\|v\.blk\." gguf-py/gguf/tensor_mapping.py | head -20
Repository: ngxson/llama.cpp
Length of output: 42
🏁 Script executed:
cat -n convert_hf_to_gguf.py | sed -n '1,100p'
Repository: ngxson/llama.cpp
Length of output: 3468
🏁 Script executed:
rg -l "mobilenetv5\|gemma3n" . --type file | head -10
Repository: ngxson/llama.cpp
Length of output: 88
🏁 Script executed:
find . -name "*.py" -type f | xargs grep -l "gemma3n\|gemma-3n" 2>/dev/null
Repository: ngxson/llama.cpp
Length of output: 144
🏁 Script executed:
rg "v\.conv_stem\|v\.blk\." --type py | head -30
Repository: ngxson/llama.cpp
Length of output: 42
🏁 Script executed:
find . -path ./node_modules -prune -o -name "convert*.py" -type f -print | xargs grep -l "PROJECTOR\|gemma" 2>/dev/null | head -5
Repository: ngxson/llama.cpp
Length of output: 114
🏁 Script executed:
cat -n tools/mtmd/models/mobilenetv5.cpp | sed -n '1,100p'
Repository: ngxson/llama.cpp
Length of output: 4671
🏁 Script executed:
rg -A 20 "case PROJECTOR_TYPE_GEMMA3NV" tools/mtmd/clip.cpp | head -60
Repository: ngxson/llama.cpp
Length of output: 3489
🏁 Script executed:
grep -rn "gemma3n\|mobilenetv5" convert_hf_to_gguf.py | head -30
Repository: ngxson/llama.cpp
Length of output: 169
🏁 Script executed:
rg "class Gemma\|mobilenet" convert_hf_to_gguf.py -A 10 | head -50
Repository: ngxson/llama.cpp
Length of output: 42
🏁 Script executed:
cat -n tools/mtmd/clip-model.h | sed -n '320,340p'
Repository: ngxson/llama.cpp
Length of output: 1023
🏁 Script executed:
rg "mobilenetv5_block" tools/mtmd/clip-model.h -B 5 -A 30 | head -60
Repository: ngxson/llama.cpp
Length of output: 2023
🏁 Script executed:
python3 << 'EOF'
import re
# Define tensor name patterns from C++
tensor_patterns_cpp = {
    "TN_MNV5_STEM_CONV": "v.conv_stem.conv.weight",
    "TN_MNV5_STEM_BIAS": "v.conv_stem.conv.bias",
    "TN_MNV5_STEM_BN": "v.conv_stem.bn.weight",
    "TN_MNV5_BLK_S0_EXP_W": "v.blk.%d.%d.conv_exp.weight",
    "TN_MNV5_BLK_S0_BN1_W": "v.blk.%d.%d.bn1.weight",
    "TN_MNV5_ATTN_Q_W": "v.blk.%d.%d.attn.query.proj.weight",
    "TN_MNV5_ATTN_K_W": "v.blk.%d.%d.attn.key.proj.weight",
}
print("Verified tensor name patterns:")
for name, pattern in tensor_patterns_cpp.items():
    print(f"  {name:25} -> {pattern}")
EOF
Repository: ngxson/llama.cpp
Length of output: 146
🏁 Script executed:
rg -A 5 "blocks_found_in_stage\|stage_ends" tools/mtmd/models/mobilenetv5.cpp | head -40
Repository: ngxson/llama.cpp
Length of output: 42
🏁 Script executed:
cat -n tools/mtmd/clip.cpp | sed -n '1640,1660p'
Repository: ngxson/llama.cpp
Length of output: 1246
🏁 Script executed:
rg "expected.*blocks\|assert.*stage\|blocks_found" tools/mtmd/models/mobilenetv5.cpp
Repository: ngxson/llama.cpp
Length of output: 42
🏁 Script executed:
rg -B 2 -A 10 "for.*mobilenet_stage_ends\|mobilenet_blocks" tools/mtmd/models/mobilenetv5.cpp | head -50
Repository: ngxson/llama.cpp
Length of output: 42
🏁 Script executed:
rg -B 5 -A 15 "build_mobilenetv5\|process_mobilenet" tools/mtmd/models/mobilenetv5.cpp | head -60
Repository: ngxson/llama.cpp
Length of output: 42
🏁 Script executed:
cat -n tools/mtmd/models/mobilenetv5.cpp | sed -n '260,290p'
Repository: ngxson/llama.cpp
Length of output: 1404
🏁 Script executed:
rg "mobilenet_blocks.size\|stage_ends.size" tools/mtmd/models/mobilenetv5.cpp
Repository: ngxson/llama.cpp
Length of output: 42
Add post-load validation to verify block discovery completeness
The dynamic MobileNetV5 block loading relies on exact tensor name matching with the Python converter. While the tensor name patterns are well-defined (e.g., v.blk.%d.%d.conv_exp.weight), there is no validation after the loading loop to detect if the expected blocks were actually found for each stage.
If tensor names from the converter don't match the C++ patterns, blocks will silently fail to load, potentially breaking inference. Add assertions or logging after line 1656 to verify:
- Expected number of blocks were found for each stage
- All 4 stages have at least one block (unless intentionally sparse)
- Total blocks count is reasonable for the model architecture
Additionally, verify that the tensor name patterns in clip-impl.h (e.g., TN_MNV5_BLK_S0_EXP_W = "v.blk.%d.%d.conv_exp.weight") match the exact tensor names produced by your Python converter.
🤖 Prompt for AI Agents
In @tools/mtmd/clip.cpp around lines 1567 - 1659, Summary: Add post-load
validation to ensure MobileNetV5 blocks were actually discovered and tensor name
patterns align with the converter. After the per-stage loading loop, record
per-stage counts (e.g., add a local vector<int> stage_block_counts and increment
with blocks_found_in_stage inside the existing loop), then validate: assert
model.mobilenet_stage_ends.size() == 4 (or log error if not), verify each
stage_block_counts[stage] > 0 (log which stage is empty and bail), and check
total model.mobilenet_blocks.size() is within expected bounds (log actual vs
expected and abort on gross mismatch). Also emit a warning listing any missing
key tensor patterns (use TN_MNV5_BLK_S0_EXP_W, TN_MNV5_BLK_DW_START_W,
TN_MNV5_ATTN_Q_W, etc.) so mismatches with clip-impl.h / Python converter can be
diagnosed.
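The validation described above is straightforward to model outside of ggml. Below is a minimal, hypothetical Python sketch of the suggested per-stage bookkeeping; the function name, the stage counts, and the `max_total_blocks` bound are illustrative assumptions, not values taken from the real C++ loader:

```python
# Hypothetical sketch of the suggested post-load validation.
# `stage_block_counts` stands in for the per-stage counters the C++ loader
# would accumulate via blocks_found_in_stage; bounds are illustrative only.

def validate_stages(stage_block_counts, n_stages=4, max_total_blocks=128):
    # All expected stages must have been discovered.
    if len(stage_block_counts) != n_stages:
        raise ValueError(f"expected {n_stages} stages, found {len(stage_block_counts)}")
    # No stage may silently load zero blocks (likely a tensor-name mismatch).
    for stage, count in enumerate(stage_block_counts):
        if count == 0:
            raise ValueError(f"stage {stage} loaded zero blocks (tensor name mismatch?)")
    # Total block count should be plausible for the architecture.
    total = sum(stage_block_counts)
    if total > max_total_blocks:
        raise ValueError(f"unreasonable total block count {total}")
    return total

print(validate_stages([3, 5, 8, 4]))  # → 20
```

In the real fix this logic would live in C++ after the per-stage loading loop, logging the missing tensor patterns rather than raising.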
case PROJECTOR_TYPE_GEMMA3NV:
    {
        // MobileNetV5 MSFA adapter always outputs fixed 16x16 resolution
        // regardless of input size (see architecture description)
        n_patches = ctx->model.hparams.image_size / ctx->model.hparams.patch_size;
    } break;
Critical: Incorrect n_patches calculation
The comment states "MobileNetV5 MSFA adapter always outputs fixed 16×16 resolution" (256 tokens), but the code only divides once instead of squaring:
n_patches = ctx->model.hparams.image_size / ctx->model.hparams.patch_size;

Issue: This calculation is semantically incorrect and produces different results depending on the patch_size value:
- If patch_size = 3 (current buggy Python converter): 768 / 3 = 256 ✓ (accidentally correct)
- If patch_size = 48 (correct value after Python fix): 768 / 48 = 16 ✗ (should be 256)
This is directly related to the critical bug in the Python converter mentioned in the PR objectives, where patch_size is computed as image_size // image_seq_length instead of image_size // sqrt(image_seq_length).
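The arithmetic behind this can be checked directly. The sketch below assumes Gemma3n's image_size = 768 and image_seq_length = 256 (a 16×16 token grid), as discussed in this comment:

```python
import math

image_size = 768
image_seq_length = 256  # 16x16 tokens out of the MSFA adapter

# Buggy converter: patch_size = image_size // image_seq_length
buggy_patch = image_size // image_seq_length             # 3
# With the single division in the C++ branch this is accidentally 256:
print(image_size // buggy_patch)                         # → 256

# Fixed converter: patch_size = image_size // sqrt(image_seq_length)
patch_size = image_size // math.isqrt(image_seq_length)  # 48
patches_per_side = image_size // patch_size              # 16
# Only the squared count gives the correct token total again:
print(patches_per_side * patches_per_side)               # → 256
```

This is why the two bugs cancel today: fixing the converter alone (patch_size = 48) would break the single-division C++ code unless n_patches is also squared or pinned to 256.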
🔧 Proposed fix
Option 1: Fixed constant (recommended if output is truly always 16×16)
case PROJECTOR_TYPE_GEMMA3NV:
{
// MobileNetV5 MSFA adapter always outputs fixed 16x16 resolution
- // regardless of input size (see architecture description)
- n_patches = ctx->model.hparams.image_size / ctx->model.hparams.patch_size;
+ n_patches = 16 * 16; // Fixed 256 tokens output
} break;

Option 2: Compute from hyperparameters (if variable resolution is possible)
case PROJECTOR_TYPE_GEMMA3NV:
{
- // MobileNetV5 MSFA adapter always outputs fixed 16x16 resolution
- // regardless of input size (see architecture description)
- n_patches = ctx->model.hparams.image_size / ctx->model.hparams.patch_size;
+ // MobileNetV5 outputs a grid of patches per side
+ int patches_per_side = ctx->model.hparams.image_size / ctx->model.hparams.patch_size;
+ n_patches = patches_per_side * patches_per_side;
} break;

🤖 Prompt for AI Agents
In @tools/mtmd/clip.cpp around lines 3242 - 3247, n_patches is computed
incorrectly for PROJECTOR_TYPE_GEMMA3NV: instead of producing 16x16 (=256)
tokens the code divides image_size by patch_size only once; update the logic in
the PROJECTOR_TYPE_GEMMA3NV branch (the block that assigns n_patches) to yield
the total number of patches, either by setting n_patches to the fixed constant
256 if the adapter truly always outputs 16x16, or by computing (image_size /
patch_size) squared (i.e. multiply the per-dimension count by itself) using
ctx->model.hparams.image_size and ctx->model.hparams.patch_size so the result
reflects total tokens correctly.
ggml_tensor * clip_graph_mobilenetv5::build_mobilenet_attn(ggml_tensor * inp, const mobilenetv5_block & block) {
    ggml_tensor * cur = inp;

    // Norm
    if (block.attn_norm_w) {
        cur = rms_norm_2d(cur, block.attn_norm_w, 1e-6f);
    }

    // 1. Q Calculation
    ggml_tensor * q = ggml_conv_2d_direct(ctx0, block.attn_q_w, cur, 1, 1, 0, 0, 1, 1);

    // 2. K Calculation (Downsampled)
    // Uses Conv2dSame(640, 640, kernel_size=(3, 3), stride=(2, 2), groups=640)
    ggml_tensor * k_inp = cur;
    if (block.attn_k_dw_w) {
        int k_size = block.attn_k_dw_w->ne[0]; // Usually 3
        k_inp = pad_same_2d(cur, k_size, k_size, 2, 2); // Apply SAME padding
        k_inp = ggml_conv_2d_dw(ctx0, block.attn_k_dw_w, k_inp, 2, 2, 0, 0, 1, 1); // padding=0
        if (block.attn_k_norm_w) {
            k_inp = rms_norm_2d(k_inp, block.attn_k_norm_w, 1e-6f);
        }
    }
    ggml_tensor * k = ggml_conv_2d_direct(ctx0, block.attn_k_w, k_inp, 1, 1, 0, 0, 1, 1);

    // 3. V Calculation (Downsampled)
    // Uses Conv2dSame(640, 640, kernel_size=(3, 3), stride=(2, 2), groups=640)
    ggml_tensor * v_inp = cur;
    if (block.attn_v_dw_w) {
        int v_size = block.attn_v_dw_w->ne[0]; // Usually 3
        v_inp = pad_same_2d(cur, v_size, v_size, 2, 2); // Apply SAME padding
        v_inp = ggml_conv_2d_dw(ctx0, block.attn_v_dw_w, v_inp, 2, 2, 0, 0, 1, 1); // padding=0
        if (block.attn_v_norm_w) {
            v_inp = rms_norm_2d(v_inp, block.attn_v_norm_w, 1e-6f);
        }
    }
    ggml_tensor * v = ggml_conv_2d_direct(ctx0, block.attn_v_w, v_inp, 1, 1, 0, 0, 1, 1);

    const int W = cur->ne[0]; const int H = cur->ne[1]; const int B = cur->ne[3];
    const int D = k->ne[2]; // Head dimension
    const int n_head = q->ne[2] / D;
    const int N = W * H;

    // Process Q: [W, H, D*n_head, B] -> [D, N, n_head, B]
    q = ggml_reshape_3d(ctx0, q, N, D*n_head, B);
    q = ggml_reshape_4d(ctx0, q, N, D, n_head, B);
    q = ggml_permute(ctx0, q, 1, 0, 2, 3); // [D, N, n_head, B]
    q = ggml_cont(ctx0, q);

    const int Wk = k->ne[0]; const int Hk = k->ne[1];
    const int M = Wk * Hk;

    // Process K: [Wk, Hk, D, B] -> [D, M, 1, B]
    k = ggml_reshape_3d(ctx0, k, M, D, B);
    k = ggml_reshape_4d(ctx0, k, M, D, 1, B);
    k = ggml_permute(ctx0, k, 1, 0, 2, 3); // [D, M, 1, B]
    k = ggml_cont(ctx0, k);

    // Process V: [Wk, Hk, D, B] -> [M, D, 1, B]
    v = ggml_reshape_3d(ctx0, v, M, D, B);
    v = ggml_reshape_4d(ctx0, v, M, D, 1, B);
    v = ggml_cont(ctx0, v); // [M, D, 1, B]

    // Multi-Query Attention
    float scale = 1.0f / sqrtf((float)D);

    // Step 1: Compute Q @ K.T
    ggml_tensor * scores = ggml_mul_mat(ctx0, k, q);

    scores = ggml_scale(ctx0, scores, scale);

    scores = ggml_soft_max(ctx0, scores);

    ggml_tensor * kqv = ggml_mul_mat(ctx0, v, scores);

    kqv = ggml_permute(ctx0, kqv, 1, 0, 2, 3);
    kqv = ggml_cont(ctx0, kqv);

    kqv = ggml_reshape_3d(ctx0, kqv, N, D * n_head, B);
    kqv = ggml_reshape_4d(ctx0, kqv, W, H, D * n_head, B);
    kqv = ggml_cont(ctx0, kqv);

    // Output projection
    cur = ggml_conv_2d_direct(ctx0, block.attn_o_w, kqv, 1, 1, 0, 0, 1, 1);

    // Residual & Layer Scale
    if (inp->ne[0] == cur->ne[0] && inp->ne[2] == cur->ne[2]) {
        if (block.layer_scale_w) {
            cur = ggml_mul(ctx0, cur, block.layer_scale_w);
        }
        cur = ggml_add(ctx0, cur, inp);
    }

    return cur;
}
🧩 Analysis chain
🏁 Script executed:
cat -n tools/mtmd/models/mobilenetv5.cpp | sed -n '152,246p'
Repository: ngxson/llama.cpp
Length of output: 4169
🏁 Script executed:
grep -n "GGML_ASSERT" tools/mtmd/models/mobilenetv5.cpp | head -20
Repository: ngxson/llama.cpp
Length of output: 222
🏁 Script executed:
rg "n_head.*=.*ne\[.*\].*/" --type cpp -A 2 -B 2 | head -30
Repository: ngxson/llama.cpp
Length of output: 420
🏁 Script executed:
grep -n "ne\[0\].*==.*ne\[0\].*&&.*ne\[1\].*==.*ne\[1\]" tools/mtmd/models/mobilenetv5.cpp
Repository: ngxson/llama.cpp
Length of output: 145
Add divisibility assertion and height check for spatial residual.
The code is missing two critical guards:
- Line 191 divides by D without asserting divisibility: add GGML_ASSERT(q->ne[2] % D == 0) (GGML_ASSERT is already used elsewhere in this file at lines 325, 376, 397).
- Line 238 residual check only validates width and channels but should also validate height to match the spatial dimension pattern used elsewhere in the file (see line 142: inp->ne[0] == cur->ne[0] && inp->ne[1] == cur->ne[1]).
Additionally, verify that ggml_mul_mat(ctx0, k, q) at line 218 and ggml_soft_max(ctx0, scores) at line 222 have the correct tensor orientation for the intended [M,N,n_head,B] scoring operation.
Proposed changes
const int D = k->ne[2]; // Head dimension
const int n_head = q->ne[2] / D;
+ GGML_ASSERT(D > 0);
+ GGML_ASSERT(q->ne[2] % D == 0);

- if (inp->ne[0] == cur->ne[0] && inp->ne[2] == cur->ne[2]) {
+ if (inp->ne[0] == cur->ne[0] && inp->ne[1] == cur->ne[1] && inp->ne[2] == cur->ne[2]) {
      if (block.layer_scale_w) {
          cur = ggml_mul(ctx0, cur, block.layer_scale_w);
      }
      cur = ggml_add(ctx0, cur, inp);
  }

🤖 Prompt for AI Agents
In @tools/mtmd/models/mobilenetv5.cpp around lines 152 - 246, In
build_mobilenet_attn add a divisibility assert before computing n_head: insert
GGML_ASSERT(q->ne[2] % D == 0) to ensure q->ne[2] is divisible by D, and extend
the spatial residual check to include height (require inp->ne[1] == cur->ne[1]
alongside inp->ne[0] and inp->ne[2]) so the residual only applies when W, H and
channels match; also verify the orientation/shape passed to ggml_mul_mat(ctx0,
k, q) and subsequent ggml_soft_max(ctx0, scores) so they operate on tensors
shaped as [D, M, 1, B] (k) and [D, N, n_head, B] (q) (or transpose them
appropriately) to produce scores of shape [D, M, N, B] for the intended
attention before softmax and matmul with v.
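The shape bookkeeping in the MQA path can be sanity-checked without ggml. The helper below is a hypothetical sketch mirroring the derivation D = k->ne[2], n_head = q->ne[2] / D, with SAME-padded stride-2 downsampling for K/V; all concrete sizes are illustrative, not taken from the model:

```python
def mqa_shapes(W, H, q_channels, D, stride=2):
    # Mirrors the derivation in build_mobilenet_attn:
    # q has D*n_head channels; k/v are downsampled by `stride` with SAME padding,
    # so the single shared K/V head covers ceil(W/s) * ceil(H/s) positions (MQA).
    assert q_channels % D == 0, "q channels must be divisible by head dim D"
    n_head = q_channels // D
    N = W * H                                   # query positions
    Wk, Hk = -(-W // stride), -(-H // stride)   # ceil-div: SAME padding keeps ceil(W/s)
    M = Wk * Hk                                 # shared key/value positions
    scores_shape = (M, N, n_head)               # per batch: softmax runs over the M axis
    return n_head, N, M, scores_shape

print(mqa_shapes(16, 16, 640, 64))  # → (10, 256, 64, (64, 256, 10))
```

The assert on `q_channels % D` is the Python analogue of the GGML_ASSERT suggested above; without it, a converter-side channel mismatch would silently truncate n_head.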
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
gguf-py/gguf/constants.py (1)
463-474: Critical: Missing VISION_PROJECTOR_TYPE_NAMES entry for GEMMA3N.

A new VISION_PROJECTOR_TYPE.GEMMA3N enum value is added at line 470, but the corresponding entry is missing from the VISION_PROJECTOR_TYPE_NAMES dictionary (lines 883-891). This will cause a KeyError when code attempts to look up the string name for this projector type.

🔧 Proposed fix

Add the missing mapping to VISION_PROJECTOR_TYPE_NAMES:

VISION_PROJECTOR_TYPE_NAMES: dict[VISION_PROJECTOR_TYPE, str] = {
    VISION_PROJECTOR_TYPE.MLP:       "mlp",
    VISION_PROJECTOR_TYPE.LDP:       "ldp",
    VISION_PROJECTOR_TYPE.LDPV2:     "ldpv2",
    VISION_PROJECTOR_TYPE.RESAMPLER: "resampler",
    VISION_PROJECTOR_TYPE.GLM_EDGE:  "adapter",
    VISION_PROJECTOR_TYPE.MERGER:    "qwen2vl_merger",
+   VISION_PROJECTOR_TYPE.GEMMA3N:   "gemma3n",
    VISION_PROJECTOR_TYPE.GEMMA3:    "gemma3",
}

Note: The string value "gemma3n" should match the projector type identifier used in the converter and C++ code.

Also applies to: 883-891
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
gguf-py/gguf/constants.py
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".
When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
Applied to files:
gguf-py/gguf/constants.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: ggml-ci-arm64-cpu-low-perf
- GitHub Check: ggml-ci-arm64-cpu-high-perf
- GitHub Check: ggml-ci-arm64-cpu-kleidiai
- GitHub Check: ggml-ci-arm64-cpu-high-perf-sve
- GitHub Check: openEuler-latest-cmake-cann (x86, 310p, Release)
- GitHub Check: ios-xcode-build
- GitHub Check: ubuntu-22-cmake-hip
- GitHub Check: windows-latest-cmake (llvm-arm64, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/ar...
- GitHub Check: ubuntu-latest-cmake-rpc
- GitHub Check: windows-msys2 (CLANG64, clang-x86_64, Release)
- GitHub Check: windows-latest-cmake-hip
- GitHub Check: ubuntu-latest-cmake-cuda
- GitHub Check: macOS-latest-cmake-arm64-webgpu
- GitHub Check: macOS-latest-cmake-x64
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: server (ADDRESS, RelWithDebInfo)
- GitHub Check: server (Release, LLAMA_ARG_BACKEND_SAMPLING=1)
- GitHub Check: server-windows
- GitHub Check: server (UNDEFINED, RelWithDebInfo)
- GitHub Check: server (Release)
Mirror from upstream PR: ggml-org#18256
Summary by CodeRabbit
New Features
Chores