Trust internal scikit-learn types needed for GB/HGB models by cakedev0 · Pull Request #508 · skops-dev/skops

cakedev0 · 2026-03-30T17:18:07Z

This PR description was mostly written by AI. I reviewed it before submission.

Reference Issues/PRs

This will make #498 very easy, something like a 4-line change. #498 is not a bug, but I'd say it's still a big source of friction for mlflow users and probably doesn't help them adopt skops.

What does this implement/fix? Explain your changes.

This PR adds support for persisting/loading the internal scikit-learn types needed by GradientBoosting and HistGradientBoosting models without surfacing them as untrusted types.

The implementation introduces a sklearn-specific internal-object path in skops/io/_sklearn.py for a small allowlist of trusted sklearn internals used by GB/HGB models.

CyHalfMultinomialLoss is serialized under the qualified module path sklearn._loss._loss.CyHalfMultinomialLoss rather than _loss.CyHalfMultinomialLoss, so that trust is anchored under the `sklearn.` prefix. Maybe this is overkill?

The PR also adds targeted regression tests in skops/io/tests/test_persist.py covering GradientBoosting / HistGradientBoosting variants and checking that CyHalfMultinomialLoss is serialized under the qualified sklearn module path.

AI usage disclosure:

The code changes were developed with AI assistance in an interactive back-and-forth session.
I asked the assistant to probe which trusted types were needed for GradientBoosting/HistGradientBoosting, explain the relevant serialization/trust code paths, review security tradeoffs, and iterate on the implementation based on my feedback.
I reviewed the proposed changes, asked follow-up questions about the design and security properties, and requested specific adjustments such as using a qualified sklearn path for CyHalfMultinomialLoss and tightening/cleaning the sklearn-internal trust path.

cakedev0 · 2026-03-31T08:39:38Z

I asked Codex to audit the added scikit-learn classes; here is its report (TL;DR: these are all simple, numeric-oriented classes and are safe to trust).

Details

Per-class notes

Stateless numeric wrappers

These classes are thin mathematical wrappers with no meaningful mutable state and no custom deserialization hooks. In practice they only expose NumPy / SciPy computations, so trusting them does not add an obvious code-execution surface.

| Classes | What they do | Safety review |
| --- | --- | --- |
| `sklearn._loss.link.IdentityLink`, `sklearn._loss.link.LogLink`, `sklearn._loss.link.LogitLink`, `sklearn._loss.link.HalfLogitLink`, `sklearn._loss.link.MultinomialLogit` | Link functions used by the sklearn loss objects. | Safe: stateless numeric wrappers only; no `eval` / `exec`, no dynamic import, no custom `__setstate__`, no attacker-controlled callable invocation on load. |

Small passive data containers

These objects mainly carry parameters or bounds. The main caveat is that skops restores them through __new__ plus attribute injection, so constructor-time validation is bypassed.

| Classes | What they do | Safety review |
| --- | --- | --- |
| `sklearn._loss.link.Interval` | Dataclass storing lower/upper bounds and inclusiveness flags for valid value ranges. | Safe from code execution: only scalar fields and range checks; malformed values could violate invariants, but there is no code-exec hook. |
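The caveat about bypassed constructor validation can be illustrated with a minimal sketch. The `Bounds` dataclass and the `restore` helper below are hypothetical stand-ins, not skops or scikit-learn code; they only show the general effect of rebuilding an object through `__new__` plus attribute injection.

```python
from dataclasses import dataclass


@dataclass
class Bounds:
    """Hypothetical stand-in for an interval-like dataclass."""

    low: float
    high: float

    def __post_init__(self):
        # Constructor-time validation: runs only when __init__ is called.
        if self.low > self.high:
            raise ValueError("low must not exceed high")


def restore(cls, attrs):
    """Rebuild an instance without calling __init__ (hypothetical helper)."""
    obj = cls.__new__(cls)
    obj.__dict__.update(attrs)
    return obj


# Normal construction rejects the malformed range.
try:
    Bounds(low=2.0, high=1.0)
    rejected = False
except ValueError:
    rejected = True

# Attribute injection accepts the same values without any check.
restored = restore(Bounds, {"low": 2.0, "high": 1.0})
```

The restored object holds an invalid range, which matches the review's point: invariants can be violated, but no code beyond plain attribute assignment runs during the restore.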

Python loss wrappers around trusted numeric backends

These classes wrap Cython loss kernels plus link objects and a few scalar flags. They do not define custom deserialization logic; in the skops path they are rebuilt by restoring plain attributes (closs, link, flags, intervals, sometimes quantile / n_classes).

| Classes | What they do | Safety review |
| --- | --- | --- |
| `sklearn._loss.loss.HalfSquaredError`, `sklearn._loss.loss.AbsoluteError`, `sklearn._loss.loss.PinballLoss`, `sklearn._loss.loss.HuberLoss`, `sklearn._loss.loss.HalfPoissonLoss`, `sklearn._loss.loss.HalfGammaLoss`, `sklearn._loss.loss.HalfBinomialLoss`, `sklearn._loss.loss.HalfMultinomialLoss`, `sklearn._loss.loss.ExponentialLoss` | Regression / classification loss objects used by GB / HGB and related estimators. | Safe: attribute-only restore pattern, no custom `__setstate__`, no dynamic execution logic; malformed trusted state could still cause wrong predictions or runtime errors because constructor validation is not re-run. |

Cython backend object

| Classes | What they do | Safety review |
| --- | --- | --- |
| `sklearn._loss._loss.CyHalfMultinomialLoss` | Cython implementation of multiclass loss / gradient / probability kernels. | Safe in the skops path: effectively stateless here (`__getstate__` is `None`), and skops bypasses Cython's pickle helper by reconstructing the instance via `__new__`, so there is no attacker-controlled function call during load. |

HistGradientBoosting internals with explicit state restoration

These are the only reviewed classes with meaningful state-restoration behavior. I checked those methods directly.

| Classes | What they do | Safety review |
| --- | --- | --- |
| `sklearn.ensemble._hist_gradient_boosting.binning._BinMapper` | Stores HGB binning metadata and maps raw features to integer bins. | Safe: deserialization goes through `BaseEstimator.__setstate__`, which only pops `_sklearn_version`, warns on version mismatch, then updates `__dict__`; no dynamic execution. |
| `sklearn.ensemble._hist_gradient_boosting.predictor.TreePredictor` | Stores HGB tree arrays and runs fast prediction / partial dependence kernels. | Safe: custom `__setstate__` only restores state and casts `nodes` to `PREDICTOR_RECORD_DTYPE` for cross-bitness compatibility; no import/eval/callback behavior. |
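As a rough illustration of the "pop a version key, warn on mismatch, update `__dict__`" pattern described for `_BinMapper`, here is a simplified sketch. It is not scikit-learn's implementation: the `Estimator` class, the `_lib_version` key, and `CURRENT_VERSION` are all hypothetical names chosen for the example.

```python
import warnings

CURRENT_VERSION = "1.0"  # hypothetical library version


class Estimator:
    """Hypothetical class mimicking a passive __setstate__."""

    def __setstate__(self, state):
        state = dict(state)  # copy so the caller's dict is left untouched
        saved = state.pop("_lib_version", None)
        if saved != CURRENT_VERSION:
            warnings.warn(
                f"state saved with version {saved!r}, "
                f"loading with {CURRENT_VERSION!r}"
            )
        # Plain attribute restore: nothing in the state is called or imported.
        self.__dict__.update(state)


est = Estimator.__new__(Estimator)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    est.__setstate__({"_lib_version": "0.9", "n_bins": 256})
```

In this sketch the only side effects of loading are a warning and attribute assignment, which is the property the audit relies on.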

adrinjalali

Questions:

can these types not be persisted at all right now? Do they have to have a new node?
can these not be simply added to trusted types in the trusted file?

cakedev0 · 2026-04-13T15:01:24Z

can these types not be persisted at all right now?

They can, but with many untrusted types, which is surprising for such common scikit-learn models. And when trying to use skops with mlflow, it's really inconvenient because you have to pass the trusted types list at log_model time rather than at load time. So basically, you have to do:
mlflow.sklearn.log_model(model, serialization_format="skops", skops_trusted_types=skops.io.get_untrusted_types(data=skops.io.dumps(model))).

Do they have to have a new node?

I don't think so. But maybe mapping _loss.CyHalfMultinomialLoss to sklearn._loss._loss.CyHalfMultinomialLoss is more convenient to do with a dedicated node (this mapping is not necessary though, just extra caution I'd say).
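A minimal sketch of what such a mapping could look like, assuming a simple alias table keyed on the module path the class reports. `MODULE_ALIASES` and `qualified_type_name` are hypothetical names for illustration and are not part of skops.

```python
# Hypothetical alias table: module path reported by the class -> qualified path.
MODULE_ALIASES = {"_loss": "sklearn._loss._loss"}


def qualified_type_name(module: str, qualname: str) -> str:
    """Build the type name to store, rewriting known bare module paths."""
    module = MODULE_ALIASES.get(module, module)
    return f"{module}.{qualname}"


name = qualified_type_name("_loss", "CyHalfMultinomialLoss")
unchanged = qualified_type_name("numpy", "ndarray")
```

With this kind of rewrite, the stored name always sits under the `sklearn.` prefix, so a prefix-based trust check can cover it.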

Can these not be simply added to trusted types in the trusted file?

The trusted file is mostly a global default trust list for public-ish broad categories: primitives, containers, dtypes, ufuncs, and sklearn estimators discovered via all_estimators(), so it felt off to add some version-sensitive private scikit-learn types there.

IIRC, Codex proposed this design, I challenged it and got convinced by this argument.

cakedev0 · 2026-04-13T15:04:47Z

Note: before scikit-learn 1.4, GB/HGB rely on different internal types, so they still have untrusted types. For now, I decided to skip the tests on those versions, but I could probably add the necessary types instead. As you prefer.

adrinjalali · 2026-04-13T16:35:59Z

You can add extra trusted classes in specific loader nodes, but it doesn't really matter; they can be in the global trusted file. That's much better than adding new loader / dumper states just for those classes.

adrinjalali

@copilot please apply my reviews in this PR.

adrinjalali · 2026-04-16T11:22:43Z

+if not all(
+    type_name.startswith("sklearn.")
+    for type_name in TRUSTED_SKLEARN_INTERNAL_TYPE_NAMES
+):
+    raise RuntimeError(
+        "All trusted sklearn internal type names must start with 'sklearn.'."
+    )
+


This is more of a test; importing the module shouldn't raise. Alternatively, we can filter out here anything that doesn't start with `sklearn.`.
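The filtering alternative could be sketched roughly as follows. `CANDIDATE_TYPE_NAMES` and `only_sklearn` are hypothetical names used for illustration; the actual constant and its contents in the PR may differ.

```python
# Hypothetical candidate list; the third entry has no "sklearn." prefix.
CANDIDATE_TYPE_NAMES = [
    "sklearn._loss.link.IdentityLink",
    "sklearn._loss.loss.HalfSquaredError",
    "_loss.CyHalfMultinomialLoss",
]


def only_sklearn(type_names):
    """Keep only names anchored under the sklearn package."""
    return [name for name in type_names if name.startswith("sklearn.")]


# Filtering at import time drops unexpected names instead of raising.
TRUSTED = only_sklearn(CANDIDATE_TYPE_NAMES)
```

The prefix assertion itself can then live in the test suite, where a failure is reported without breaking imports for users.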

adrinjalali · 2026-04-16T11:23:11Z

        )


+def sklearn_internal_object_get_state(


We don't need to add a new node to support these objects. They can simply be trusted in _trusted.py and added to the trusted types of the appropriate nodes.

adrinjalali · 2026-04-16T11:24:23Z

+        estimator.fit(X, y)
+
+    dumped = dumps(estimator)
+    with ZipFile(io.BytesIO(dumped), "r") as zip_file:


Not sure if going through the zip file is a good idea. We should save / load and check that the loaded object is correct, with correctly loaded attributes.

Address review comments from PR #508:
- Add sklearn internal types to _trusted_types.py instead of creating a new SklearnInternalObjectNode (review comment 2)
- Include internal types in ObjectNode's trusted defaults
- Add tests that save/load and verify correctness (review comment 3)
- Filter non-sklearn-prefixed types where appropriate (review comment 1)

Agent-Logs-Url: https://github.com/skops-dev/skops/sessions/6dadf905-6073-4057-bf3d-faed31a559fe
Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com>

adrinjalali · 2026-04-16T12:10:26Z

Checking alternative implementation here: #513

* Trust internal sklearn types needed for GB/HGB models

  Address review comments from PR #508:
  - Add sklearn internal types to _trusted_types.py instead of creating a new SklearnInternalObjectNode (review comment 2)
  - Include internal types in ObjectNode's trusted defaults
  - Add tests that save/load and verify correctness (review comment 3)
  - Filter non-sklearn-prefixed types where appropriate (review comment 1)

  Agent-Logs-Url: https://github.com/skops-dev/skops/sessions/6dadf905-6073-4057-bf3d-faed31a559fe
  Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com>

* Address code review: improve comments for clarity

  Agent-Logs-Url: https://github.com/skops-dev/skops/sessions/6dadf905-6073-4057-bf3d-faed31a559fe
  Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com>

* Address review: filter sklearn. prefix, remove CyHalfMultinomialLoss, drop warning suppression

  - Filter SKLEARN_INTERNAL_TYPE_NAMES to only include types starting with 'sklearn.' to avoid trusting monkey-patched types from other packages
  - Remove CyHalfMultinomialLoss from auto-trusted list since its __module__ reports '_loss' instead of 'sklearn._loss._loss' (Cython build issue on sklearn side)
  - Add detailed comment explaining the sklearn Cython module name issue
  - Remove unnecessary warnings.catch_warnings() in test; no warnings are produced during fitting of these estimators
  - Update test to use get_untrusted_types() + loads(trusted=...) pattern

  Agent-Logs-Url: https://github.com/skops-dev/skops/sessions/ccbb4abe-8ea6-442c-a6dc-41b9489d35cf
  Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com>

* Fix misleading comment in test

  Agent-Logs-Url: https://github.com/skops-dev/skops/sessions/ccbb4abe-8ea6-442c-a6dc-41b9489d35cf
  Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com>

* Add all Cy* types to trusted types; conditionally test based on __module__ correctness

  Agent-Logs-Url: https://github.com/skops-dev/skops/sessions/a65530ea-7c30-419d-a36e-e6c6a423c5f7
  Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com>

* Rename _cy_module_is_correct to cy_module_is_correct (local variable convention)

  Agent-Logs-Url: https://github.com/skops-dev/skops/sessions/a65530ea-7c30-419d-a36e-e6c6a423c5f7
  Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com>

* Fix CI: apply black formatting and update pixi.lock

  Agent-Logs-Url: https://github.com/skops-dev/skops/sessions/9bf5c6f9-7729-4d13-a37a-18a6b934d296
  Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com>

* update mechanism
* pre-commit update
* ...
* merge issues
* fix loss node
* docstring

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com>
Co-authored-by: adrinjalali <adrin.jalali@gmail.com>

adrinjalali · 2026-04-22T13:28:27Z

closed in a different PR

cakedev0 added 2 commits March 30, 2026 18:26

Trust internal scikit-learn types needed for GB/HGB models

856e6c4

clean-up

68ae655

cakedev0 marked this pull request as ready for review March 30, 2026 17:29

adrinjalali reviewed Apr 13, 2026

cakedev0 added 2 commits April 13, 2026 17:04

skip new tests for sklearn 1.2/1.3

7c7744b

Merge branch 'main' into trust_gb_hgb_internals

58e3795

adrinjalali reviewed Apr 16, 2026

Copilot AI mentioned this pull request Apr 16, 2026

Trust internal scikit-learn types needed for GB/HGB models #513

Merged

4 tasks

adrinjalali closed this Apr 22, 2026
