Introduce whitelisted folders for external data validation #27352

Draft
yuslepukhin wants to merge 9 commits into main from yuslepukhin/whitelist_data_folders

Conversation

@yuslepukhin
Member

Description

Many customers reported that they prefer to store external data in locations other than the model folder.
A previous security change (PR #26776) disabled that possibility.
This PR introduces a new API that sets a whitelisted-folders option. Data stored under those folders or their subfolders would still be allowed.
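
The acceptance rule in the description (data under the model folder, a whitelisted folder, or any of their subfolders is allowed) can be sketched in Python. This is only an illustration of the idea, not the PR's C++ implementation: the function name `is_external_data_allowed` and its parameters are hypothetical, and resolving paths before comparing is an assumption about how such a check might behave.

```python
from pathlib import Path


def is_external_data_allowed(data_path, model_dir, allowed_dirs=()):
    """Hypothetical check: is data_path under model_dir or an allowed folder?

    Relative data paths are interpreted relative to model_dir; absolute
    paths are used as given.
    """
    candidate = (Path(model_dir) / data_path).resolve()
    for base in (model_dir, *allowed_dirs):
        try:
            # relative_to compares whole path components, so "/shared2"
            # is not treated as being inside "/shared".
            candidate.relative_to(Path(base).resolve())
            return True
        except ValueError:
            continue
    return False
```

Comparing whole path components rather than raw string prefixes is a deliberate choice in this sketch, since a plain prefix test would accept sibling folders whose names merely start with an allowed folder's name.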

Motivation and Context

qdrant/fastembed#603

Contributor

Copilot AI left a comment

Pull request overview

Adds support for loading TensorProto external data from user-configured “whitelisted” directories (in addition to the model directory), addressing the prior security hardening that restricted external data to the model folder.

Changes:

  • Introduces a new C/C++ SessionOptions API to configure semicolon-separated whitelisted external-data directories.
  • Adds parsing/validation logic for whitelist paths and extends external data path validation to allow matches under whitelisted folders.
  • Updates graph/session code paths and expands unit tests for whitelist behavior.
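
To make the semicolon-separated option format concrete, here is a minimal Python sketch of how such a string might be split into folder paths. The name `parse_allowed_folders` is hypothetical, and trimming whitespace and skipping empty entries are assumptions made for this example; the PR's `ParseWhiteListedPaths` may handle those cases differently.

```python
from pathlib import Path


def parse_allowed_folders(value):
    """Split a semicolon-separated string into resolved folder paths.

    Assumption for this sketch: blank entries are skipped and
    surrounding whitespace is trimmed.
    """
    folders = []
    for part in value.split(";"):
        part = part.strip()
        if not part:
            continue
        folders.append(Path(part).resolve())
    return folders
```

For example, `parse_allowed_folders("/data/a; /data/b;;")` yields two entries under these assumptions.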

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| onnxruntime/test/ir/graph_test.cc | Updates call site for new Graph::ConvertInitializersIntoOrtValues signature. |
| onnxruntime/test/framework/tensorutils_test.cc | Adds extensive tests for whitelist-aware path validation and whitelist parsing. |
| onnxruntime/core/session/provider_bridge_ort.cc | Removes shared-provider host bridge for external data path validation. |
| onnxruntime/core/session/ort_apis.h | Adds C API entrypoint declaration for SessionOptionsSetWhiteListedDataFolders. |
| onnxruntime/core/session/onnxruntime_c_api.cc | Registers the new C API function pointer and updates version asserts (currently problematic). |
| onnxruntime/core/session/inference_session.cc | Parses whitelist from SessionOptions and passes it into initializer conversion. |
| onnxruntime/core/session/abi_session_options.cc | Implements SessionOptionsSetWhiteListedDataFolders (currently rejects nullptr). |
| onnxruntime/core/providers/shared_library/provider_interfaces.h | Removes ProviderHost virtual for validating external data paths. |
| onnxruntime/core/providers/shared_library/provider_api.h | Removes shared-provider wrapper for external data path validation. |
| onnxruntime/core/graph/graph.cc | Extends initializer conversion to validate external paths against whitelist. |
| onnxruntime/core/framework/tensorprotoutils.h | Adds ParseWhiteListedPaths and extends ValidateExternalDataPath signature. |
| onnxruntime/core/framework/tensorprotoutils.cc | Implements whitelist parsing and whitelist-aware external data path validation. |
| onnxruntime/core/framework/session_options.h | Adds SessionOptions::whitelisted_data_folders storage. |
| include/onnxruntime/core/session/onnxruntime_cxx_inline.h | Adds C++ wrapper SessionOptions::SetWhiteListedDataFolders implementation. |
| include/onnxruntime/core/session/onnxruntime_cxx_api.h | Adds C++ wrapper SessionOptions::SetWhiteListedDataFolders declaration. |
| include/onnxruntime/core/session/onnxruntime_c_api.h | Adds public C API declaration/docs for SessionOptionsSetWhiteListedDataFolders (currently mismatched). |
| include/onnxruntime/core/graph/graph.h | Changes public Graph API signature for ConvertInitializersIntoOrtValues. |

@xenova
Contributor

xenova commented Feb 17, 2026

I'd also like to note that I have been encountering more errors like this with the Hugging Face cache system.

onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : External data path validation failed for initializer: vision_model.embeddings.patch_embedding.weight. Error: tensorprotoutils.cc:347 ValidateExternalDataPath External data path: "vision_encoder.onnx_data" escapes model directory: ".../.cache/huggingface/hub/models--onnx-community--granite-docling-258M-ONNX/snapshots/e8602580df77443fc3421cf3bae0601da601e5c6/onnx"

Since this is quite a popular way to use models, hopefully there is a way to fix this.
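
The thread does not state why the check fails for files in the Hugging Face cache. One possible explanation, offered here purely as an assumption, is that the cache's snapshot folders contain symlinks pointing into a separate blobs folder, so a validator that resolves symlinks would see the data file outside the model directory. The sketch below reproduces that layout in a temporary directory; the helper name `escapes_after_resolving` and the folder names are hypothetical, and it needs a platform where creating symlinks is permitted.

```python
import os
import tempfile
from pathlib import Path


def escapes_after_resolving(path, base_dir):
    """True if path sits lexically under base_dir but resolves outside it."""
    path, base_dir = Path(path), Path(base_dir)
    try:
        # Lexical check only: normalise ".." without touching the filesystem.
        Path(os.path.normpath(path)).relative_to(os.path.normpath(base_dir))
    except ValueError:
        return False
    try:
        # Filesystem check: follow symlinks before comparing.
        path.resolve().relative_to(base_dir.resolve())
    except ValueError:
        return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    # Assumed cache-like layout: real bytes live in blobs/,
    # and the snapshot folder only holds a symlink to them.
    blob = root / "blobs" / "abc123"
    blob.parent.mkdir(parents=True)
    blob.write_bytes(b"weights")
    onnx_dir = root / "snapshots" / "rev" / "onnx"
    onnx_dir.mkdir(parents=True)
    link = onnx_dir / "vision_encoder.onnx_data"
    link.symlink_to(blob)
    symlink_escapes = escapes_after_resolving(link, onnx_dir)
    print(symlink_escapes)
```

If this reading is right, the file name looks like it is inside the model directory while its resolved target is not, which would match the "escapes model directory" wording in the error above.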


Example reproduction (which used to work correctly before that update, from https://huggingface.co/onnx-community/granite-docling-258M-ONNX):

from transformers import AutoConfig, AutoProcessor
from transformers.image_utils import load_image
from huggingface_hub import hf_hub_download
import onnxruntime
import numpy as np


# 1. Load models
## Load config and processor
model_id = "onnx-community/granite-docling-258M-ONNX"
config = AutoConfig.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

## Download models from the Hugging Face Hub
vision_model_path = hf_hub_download(model_id, subfolder="onnx", filename="vision_encoder.onnx")         # graph
hf_hub_download(model_id, subfolder="onnx", filename="vision_encoder.onnx_data")                        # weights
embed_model_path = hf_hub_download(model_id, subfolder="onnx", filename="embed_tokens.onnx")            # graph
hf_hub_download(model_id, subfolder="onnx", filename="embed_tokens.onnx_data")                          # weights
decoder_model_path = hf_hub_download(model_id, subfolder="onnx", filename="decoder_model_merged.onnx")  # graph
hf_hub_download(model_id, subfolder="onnx", filename="decoder_model_merged.onnx_data")                  # weights

## Load sessions
vision_session = onnxruntime.InferenceSession(vision_model_path)
embed_session = onnxruntime.InferenceSession(embed_model_path)
decoder_session = onnxruntime.InferenceSession(decoder_model_path)

## Set config values
num_key_value_heads = config.text_config.num_key_value_heads
head_dim = config.text_config.head_dim
num_hidden_layers = config.text_config.num_hidden_layers
eos_token_id = config.text_config.eos_token_id
image_token_id = config.image_token_id


# 2. Prepare inputs
## Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."}
        ]
    },
]

## Load image and apply processor
image = load_image("https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/assets/new_arxiv.png")
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="np")

## Prepare decoder inputs
batch_size = inputs['input_ids'].shape[0]
past_key_values = {
    f'past_key_values.{layer}.{kv}': np.zeros([batch_size, num_key_value_heads, 0, head_dim], dtype=np.float32)
    for layer in range(num_hidden_layers)
    for kv in ('key', 'value')
}
image_features = None
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']


# 3. Generation loop
max_new_tokens = 4096
generated_tokens = np.array([[]], dtype=np.int64)
for i in range(max_new_tokens):
  inputs_embeds = embed_session.run(None, {'input_ids': input_ids})[0]

  if image_features is None:
    ## Only compute vision features if not already computed
    image_features = vision_session.run(None, dict(
        pixel_values=inputs['pixel_values'],
        pixel_attention_mask=inputs['pixel_attention_mask'].astype(np.bool_)
    ))[0]

    ## Merge text and vision embeddings
    inputs_embeds[inputs['input_ids'] == image_token_id] = image_features.reshape(-1, image_features.shape[-1])

  logits, *present_key_values = decoder_session.run(None, dict(
      inputs_embeds=inputs_embeds,
      attention_mask=attention_mask,
      **past_key_values,
  ))

  ## Update values for next generation loop
  input_ids = logits[:, -1].argmax(-1, keepdims=True)
  attention_mask = np.concatenate([attention_mask, np.ones((batch_size, 1), dtype=attention_mask.dtype)], axis=-1)
  for j, key in enumerate(past_key_values):
    past_key_values[key] = present_key_values[j]

  generated_tokens = np.concatenate([generated_tokens, input_ids], axis=-1)
  if (input_ids == eos_token_id).all():
    break

  ## (Optional) Streaming
  print(processor.decode(input_ids[0]), end='')
print()


# 4. Do something with the final output
print(processor.batch_decode(generated_tokens, skip_special_tokens=False)[0])

hariharans29 previously approved these changes Feb 17, 2026
@tianleiwu tianleiwu marked this pull request as draft February 18, 2026 02:14
@tianleiwu
Contributor

tianleiwu commented Feb 18, 2026

Please do not merge. There is another simple fix that can solve the issue. See #27374.
