Trust internal scikit-learn types needed for GB/HGB models #508

Closed
cakedev0 wants to merge 4 commits into skops-dev:main from cakedev0:trust_gb_hgb_internals
Conversation

@cakedev0 cakedev0 commented Mar 30, 2026

This PR description was mostly written by AI. I reviewed it before submission.

Reference Issues/PRs

This will make #498 very easy, roughly a four-line change. #498 is not a bug, but I'd say it is still a significant source of friction for mlflow users and probably doesn't help them adopt skops.

What does this implement/fix? Explain your changes.

This PR adds support for persisting/loading the internal scikit-learn types needed by GradientBoosting and HistGradientBoosting models without surfacing them as untrusted types.

The implementation introduces a sklearn-specific internal-object path in skops/io/_sklearn.py for a small allowlist of trusted sklearn internals used by GB/HGB models.

CyHalfMultinomialLoss is serialized under the qualified module path sklearn._loss._loss.CyHalfMultinomialLoss rather than _loss.CyHalfMultinomialLoss so that trust is anchored under the sklearn. prefix. Maybe this is overkill?
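
To illustrate why the qualified path matters, here is a minimal sketch. It is not the skops implementation. It assumes a type's trusted name is derived from `__module__` and `__qualname__`; the helpers `type_name` and `anchored_type_name`, the `REMAP` table, and the `FakeCyLoss` stand-in class are all hypothetical.

```python
def type_name(cls: type) -> str:
    """Build a dotted name from the module and qualname the class reports."""
    return f"{cls.__module__}.{cls.__qualname__}"


def anchored_type_name(cls: type, remap: dict[str, str]) -> str:
    """Return the remapped name if the reported one is in `remap`, else the reported name."""
    name = type_name(cls)
    return remap.get(name, name)


# Stand-in for a compiled class whose __module__ lacks the package prefix.
FakeCyLoss = type("CyHalfMultinomialLoss", (), {"__module__": "_loss"})

# Hypothetical table mapping reported names to fully qualified names.
REMAP = {"_loss.CyHalfMultinomialLoss": "sklearn._loss._loss.CyHalfMultinomialLoss"}

reported = type_name(FakeCyLoss)
anchored = anchored_type_name(FakeCyLoss, REMAP)
print(reported)
print(anchored)
```

With a mapping like this, a prefix check on `sklearn.` can apply uniformly to every trusted name, at the cost of maintaining the table.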

The PR also adds targeted regression tests in skops/io/tests/test_persist.py covering GradientBoosting / HistGradientBoosting variants and checking that CyHalfMultinomialLoss is serialized under the qualified sklearn module path.

AI usage disclosure:

  • The code changes were developed with AI assistance in an interactive back-and-forth session.
  • I asked the assistant to probe which trusted types were needed for GradientBoosting/HistGradientBoosting, explain the relevant serialization/trust code paths, review security tradeoffs, and iterate on the implementation based on my feedback.
  • I reviewed the proposed changes, asked follow-up questions about the design and security properties, and requested specific adjustments such as using a qualified sklearn path for CyHalfMultinomialLoss and tightening/cleaning the sklearn-internal trust path.

@cakedev0 cakedev0 marked this pull request as ready for review March 30, 2026 17:29
@cakedev0
Author

I asked Codex to audit the added scikit-learn classes; here is its report (TL;DR: these are all simple, numeric-oriented classes, and are safe to trust).

Details

Per-class notes

Stateless numeric wrappers

These classes are thin mathematical wrappers with no meaningful mutable state and no custom deserialization hooks. In practice they only expose NumPy / SciPy computations, so trusting them does not add an obvious code-execution surface.

| Classes | What they do | Safety review |
| --- | --- | --- |
| sklearn._loss.link.IdentityLink, sklearn._loss.link.LogLink, sklearn._loss.link.LogitLink, sklearn._loss.link.HalfLogitLink, sklearn._loss.link.MultinomialLogit | Link functions used by the sklearn loss objects. | Safe: stateless numeric wrappers only; no eval / exec, no dynamic import, no custom __setstate__, no attacker-controlled callable invocation on load. |

Small passive data containers

These objects mainly carry parameters or bounds. The main caveat is that skops restores them through __new__ plus attribute injection, so constructor-time validation is bypassed.

| Classes | What they do | Safety review |
| --- | --- | --- |
| sklearn._loss.link.Interval | Dataclass storing lower/upper bounds and inclusiveness flags for valid value ranges. | Safe from code execution: only scalar fields and range checks; malformed values could violate invariants, but there is no code-exec hook. |
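
The caveat about bypassed constructor validation can be sketched as follows. This is an illustration only: `Bounds` is a hypothetical stand-in for a small bounds container, and `restore` is a hypothetical helper mimicking a `__new__`-plus-attribute-injection restore; neither is taken from skops or scikit-learn.

```python
from dataclasses import dataclass


@dataclass
class Bounds:
    """Hypothetical stand-in for a small bounds container."""

    low: float
    high: float

    def __post_init__(self) -> None:
        # Constructor-time validation: runs only when __init__ is called.
        if self.low > self.high:
            raise ValueError("low must not exceed high")


def restore(cls: type, state: dict):
    """Rebuild an instance without calling __init__, then inject attributes."""
    obj = cls.__new__(cls)
    obj.__dict__.update(state)
    return obj


# The normal constructor rejects an inverted range...
try:
    Bounds(low=1.0, high=0.0)
    constructor_rejected = False
except ValueError:
    constructor_rejected = True

# ...but the restore path accepts it, because __post_init__ never runs.
restored = restore(Bounds, {"low": 1.0, "high": 0.0})
print(constructor_rejected, restored)
```

The restored object holds invalid values but no code ran on its behalf during the restore, which is the distinction the review draws between broken invariants and code execution.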

Python loss wrappers around trusted numeric backends

These classes wrap Cython loss kernels plus link objects and a few scalar flags. They do not define custom deserialization logic; in the skops path they are rebuilt by restoring plain attributes (closs, link, flags, intervals, sometimes quantile / n_classes).

| Classes | What they do | Safety review |
| --- | --- | --- |
| sklearn._loss.loss.HalfSquaredError, sklearn._loss.loss.AbsoluteError, sklearn._loss.loss.PinballLoss, sklearn._loss.loss.HuberLoss, sklearn._loss.loss.HalfPoissonLoss, sklearn._loss.loss.HalfGammaLoss, sklearn._loss.loss.HalfBinomialLoss, sklearn._loss.loss.HalfMultinomialLoss, sklearn._loss.loss.ExponentialLoss | Regression / classification loss objects used by GB / HGB and related estimators. | Safe: attribute-only restore pattern, no custom __setstate__, no dynamic execution logic; malformed trusted state could still cause wrong predictions or runtime errors because constructor validation is not re-run. |

Cython backend object

| Classes | What they do | Safety review |
| --- | --- | --- |
| sklearn._loss._loss.CyHalfMultinomialLoss | Cython implementation of multiclass loss / gradient / probability kernels. | Safe in the skops path: effectively stateless here (__getstate__ is None), and skops bypasses Cython's pickle helper by reconstructing the instance via __new__, so there is no attacker-controlled function call during load. |

HistGradientBoosting internals with explicit state restoration

These are the only reviewed classes with meaningful state-restoration behavior. I checked those methods directly.

| Classes | What they do | Safety review |
| --- | --- | --- |
| sklearn.ensemble._hist_gradient_boosting.binning._BinMapper | Stores HGB binning metadata and maps raw features to integer bins. | Safe: deserialization goes through BaseEstimator.__setstate__, which only pops _sklearn_version, warns on version mismatch, then updates __dict__; no dynamic execution. |
| sklearn.ensemble._hist_gradient_boosting.predictor.TreePredictor | Stores HGB tree arrays and runs fast prediction / partial dependence kernels. | Safe: custom __setstate__ only restores state and casts nodes to PREDICTOR_RECORD_DTYPE for cross-bitness compatibility; no import/eval/callback behavior. |
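
The state-restoration pattern described for _BinMapper (pop a version marker, warn on mismatch, update `__dict__`) can be sketched as below. This is a simplified illustration under stated assumptions, not scikit-learn's code: the class `VersionedState`, the `_version` key, and `CURRENT_VERSION` are hypothetical.

```python
import warnings

CURRENT_VERSION = "2.0"  # hypothetical library version


class VersionedState:
    """Hypothetical class with a version-checking state-restore hook."""

    def __init__(self, n_bins: int) -> None:
        self.n_bins = n_bins

    def __getstate__(self) -> dict:
        # Record the version that produced this state.
        return {**self.__dict__, "_version": CURRENT_VERSION}

    def __setstate__(self, state: dict) -> None:
        state = dict(state)  # avoid mutating the caller's dict
        saved = state.pop("_version", "unknown")
        if saved != CURRENT_VERSION:
            warnings.warn(f"state saved with version {saved}", UserWarning)
        # Plain attribute restore: no import, eval, or callback.
        self.__dict__.update(state)


# Round trip with a matching version: restore without calling __init__.
clone = VersionedState.__new__(VersionedState)
clone.__setstate__(VersionedState(n_bins=255).__getstate__())

# State from an older version triggers a warning but is still restored.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old = VersionedState.__new__(VersionedState)
    old.__setstate__({"n_bins": 16, "_version": "1.0"})
```

The hook only moves data into the instance, so the worst outcome of a malformed state in this sketch is a wrong attribute value rather than executed code.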

Member

@adrinjalali adrinjalali left a comment

Questions:

  • can these types not be persisted at all right now? Do they have to have a new node?
  • can these not be simply added to trusted types in the trusted file?

@cakedev0
Author

> can these types not be persisted at all right now?

They can, but only with many untrusted types, which is surprising for such common scikit-learn models. And when using skops with mlflow, this is really inconvenient because you have to pass the trusted types list at log_model time rather than at load time. So basically, you have to do:
mlflow.sklearn.log_model(model, serialization_format="skops", skops_trusted_types=skops.io.get_untrusted_types(data=skops.io.dumps(model))).

> Do they have to have a new node?

I don't think so. But maybe mapping _loss.CyHalfMultinomialLoss to sklearn._loss._loss.CyHalfMultinomialLoss is more convenient to do with a dedicated node (but this mapping is not necessary, just extra-caution I'd say).

> Can these not be simply added to trusted types in the trusted file?

The trusted file mostly holds global default trust for broad, public-ish categories: primitives, containers, dtypes, ufuncs, and sklearn estimators discovered via all_estimators(). So it felt off to add version-sensitive private scikit-learn types there.

IIRC, Codex proposed this design, I challenged it and got convinced by this argument.

@cakedev0
Author

Note: before scikit-learn 1.4, GB/HGB used different internals, so they still have untrusted types. For now, I decided to skip the tests on those versions, but I could probably add the necessary types instead. As you prefer.

@adrinjalali
Member

You can add extra trusted classes in specific loader nodes, but it doesn't really matter; they can be in the global trusted file. That's much better than adding new loader / dumper states just for those classes.

Member

@adrinjalali adrinjalali left a comment

@copilot please apply my reviews in this PR.

Comment thread skops/io/_sklearn.py
Comment on lines +179 to +186
if not all(
    type_name.startswith("sklearn.")
    for type_name in TRUSTED_SKLEARN_INTERNAL_TYPE_NAMES
):
    raise RuntimeError(
        "All trusted sklearn internal type names must start with 'sklearn.'."
    )

Member

This is more of a test; import shouldn't raise. Alternatively, we can filter out anything here that doesn't start with sklearn.
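
The filtering alternative suggested in this comment could look roughly like the sketch below. It assumes a module-level list of candidate names; `CANDIDATE_TYPE_NAMES` and `keep_sklearn_only` are hypothetical names, not identifiers from the PR.

```python
# Hypothetical candidate list; the last entry stands for a type that
# reports an unqualified module name and so falls outside the prefix.
CANDIDATE_TYPE_NAMES = [
    "sklearn._loss.link.Interval",
    "sklearn._loss.loss.HalfSquaredError",
    "_loss.CyHalfMultinomialLoss",
]


def keep_sklearn_only(names: list[str], prefix: str = "sklearn.") -> list[str]:
    """Return only the names under `prefix`, preserving order."""
    return [name for name in names if name.startswith(prefix)]


# Filtering at import time drops stray entries instead of raising.
TRUSTED_NAMES = keep_sklearn_only(CANDIDATE_TYPE_NAMES)
print(TRUSTED_NAMES)
```

Filtering keeps import safe, while the original all-or-nothing check fits better in the test suite, where a failure is the desired signal.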

Comment thread skops/io/_sklearn.py
)


def sklearn_internal_object_get_state(
Member

we don't need to add a new node to support these objects. They can be simply trusted in _trusted.py, and added to a node's trusted types in the appropriate nodes.

estimator.fit(X, y)

dumped = dumps(estimator)
with ZipFile(io.BytesIO(dumped), "r") as zip_file:
Member

Not sure going through the zip file is a good idea; we should save / load and check that the loaded object is correct, with correctly loaded attributes.

Copilot AI added a commit that referenced this pull request Apr 16, 2026
Address review comments from PR #508:
- Add sklearn internal types to _trusted_types.py instead of creating
  a new SklearnInternalObjectNode (review comment 2)
- Include internal types in ObjectNode's trusted defaults
- Add tests that save/load and verify correctness (review comment 3)
- Filter non-sklearn-prefixed types where appropriate (review comment 1)

Agent-Logs-Url: https://github.com/skops-dev/skops/sessions/6dadf905-6073-4057-bf3d-faed31a559fe

Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com>
@adrinjalali
Member

Checking alternative implementation here: #513

adrinjalali added a commit that referenced this pull request Apr 17, 2026
* Trust internal sklearn types needed for GB/HGB models

Address review comments from PR #508:
- Add sklearn internal types to _trusted_types.py instead of creating
  a new SklearnInternalObjectNode (review comment 2)
- Include internal types in ObjectNode's trusted defaults
- Add tests that save/load and verify correctness (review comment 3)
- Filter non-sklearn-prefixed types where appropriate (review comment 1)

Agent-Logs-Url: https://github.com/skops-dev/skops/sessions/6dadf905-6073-4057-bf3d-faed31a559fe

Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com>

* Address code review: improve comments for clarity

Agent-Logs-Url: https://github.com/skops-dev/skops/sessions/6dadf905-6073-4057-bf3d-faed31a559fe

Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com>

* Address review: filter sklearn. prefix, remove CyHalfMultinomialLoss, drop warning suppression

- Filter SKLEARN_INTERNAL_TYPE_NAMES to only include types starting
  with 'sklearn.' to avoid trusting monkey-patched types from other
  packages
- Remove CyHalfMultinomialLoss from auto-trusted list since its
  __module__ reports '_loss' instead of 'sklearn._loss._loss' (Cython
  build issue on sklearn side)
- Add detailed comment explaining the sklearn Cython module name issue
- Remove unnecessary warnings.catch_warnings() in test — no warnings
  are produced during fitting of these estimators
- Update test to use get_untrusted_types() + loads(trusted=...) pattern

Agent-Logs-Url: https://github.com/skops-dev/skops/sessions/ccbb4abe-8ea6-442c-a6dc-41b9489d35cf

Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com>

* Fix misleading comment in test

Agent-Logs-Url: https://github.com/skops-dev/skops/sessions/ccbb4abe-8ea6-442c-a6dc-41b9489d35cf

Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com>

* Add all Cy* types to trusted types; conditionally test based on __module__ correctness

Agent-Logs-Url: https://github.com/skops-dev/skops/sessions/a65530ea-7c30-419d-a36e-e6c6a423c5f7

Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com>

* Rename _cy_module_is_correct to cy_module_is_correct (local variable convention)

Agent-Logs-Url: https://github.com/skops-dev/skops/sessions/a65530ea-7c30-419d-a36e-e6c6a423c5f7

Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com>

* Fix CI: apply black formatting and update pixi.lock

Agent-Logs-Url: https://github.com/skops-dev/skops/sessions/9bf5c6f9-7729-4d13-a37a-18a6b934d296

Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com>

* update mechanism

* pre-commit update

* ...

* merge issues

* fix loss node

* docstring

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com>
Co-authored-by: adrinjalali <adrin.jalali@gmail.com>
@adrinjalali
Member

closed in a different PR
