Trust internal scikit-learn types needed for GB/HGB models#508
Trust internal scikit-learn types needed for GB/HGB models#508cakedev0 wants to merge 4 commits intoskops-dev:mainfrom
Conversation
|
I asked Codex to audit added scikit-learn classes, here is its report (TLDR: those are all simple, numeric-oriented classes, and are safe to trust). DetailsPer-class notesStateless numeric wrappersThese classes are thin mathematical wrappers with no meaningful mutable state and no custom deserialization hooks. In practice they only expose NumPy / SciPy computations, so trusting them does not add an obvious code-execution surface.
Small passive data containersThese objects mainly carry parameters or bounds. The main caveat is that skops restores them through
Python loss wrappers around trusted numeric backendsThese classes wrap Cython loss kernels plus link objects and a few scalar flags. They do not define custom deserialization logic; in the skops path they are rebuilt by restoring plain attributes (
Cython backend object
HistGradientBoosting internals with explicit state restorationThese are the only reviewed classes with meaningful state-restoration behavior. I checked those methods directly.
|
adrinjalali
left a comment
There was a problem hiding this comment.
Questions:
- can these types not be persisted at all right now? Do they have to have a new node?
- can these not be simply added to trusted types in the trusted file?
They can, but with many untrusted types which is surprising for those very common scikit-learn models. And when trying to use skops with mlflow, it's really inconvenient because you have to pass the trusted types list at
I don't think so. But maybe mapping
The trusted file is mostly global default trust for public-ish broad categories: primitives, containers, dtypes, ufuncs, and sklearn estimators discovered via IIRC, Codex proposed this design, I challenged it and got convinced by this argument. |
|
Note: before scikit-learn 1.4, GB/HGB use different internal things, so they still have untrusted types. For now, I decided to skip the tests, but I could probably add the necessary types instead. As you prefer. |
|
You can add extra trusted classes in specific loader nodes, but it doesn't really matter, they can be in the global trusted file. That's much better than adding a new loader / dumper states just for those classes. |
adrinjalali
left a comment
There was a problem hiding this comment.
@copilot please apply my reviews in this PR.
| if not all( | ||
| type_name.startswith("sklearn.") | ||
| for type_name in TRUSTED_SKLEARN_INTERNAL_TYPE_NAMES | ||
| ): | ||
| raise RuntimeError( | ||
| "All trusted sklearn internal type names must start with 'sklearn.'." | ||
| ) | ||
|
|
There was a problem hiding this comment.
this is more of a test. Import shouldn't raise. Alternatively, we can filter out here anything which doesn't start with sklearn.
| ) | ||
|
|
||
|
|
||
| def sklearn_internal_object_get_state( |
There was a problem hiding this comment.
we don't need to add a new node to support these objects. They can be simply trusted in _trusted.py, and added to a node's trusted types in the appropriate nodes.
| estimator.fit(X, y) | ||
|
|
||
| dumped = dumps(estimator) | ||
| with ZipFile(io.BytesIO(dumped), "r") as zip_file: |
There was a problem hiding this comment.
not sure if going through the zip file is a good idea, we should save / load and check if the loaded object is correct, with correct loaded attributes.
Address review comments from PR #508: - Add sklearn internal types to _trusted_types.py instead of creating a new SklearnInternalObjectNode (review comment 2) - Include internal types in ObjectNode's trusted defaults - Add tests that save/load and verify correctness (review comment 3) - Filter non-sklearn-prefixed types where appropriate (review comment 1) Agent-Logs-Url: https://github.com/skops-dev/skops/sessions/6dadf905-6073-4057-bf3d-faed31a559fe Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com>
|
Checking alternative implementation here: #513 |
* Trust internal sklearn types needed for GB/HGB models Address review comments from PR #508: - Add sklearn internal types to _trusted_types.py instead of creating a new SklearnInternalObjectNode (review comment 2) - Include internal types in ObjectNode's trusted defaults - Add tests that save/load and verify correctness (review comment 3) - Filter non-sklearn-prefixed types where appropriate (review comment 1) Agent-Logs-Url: https://github.com/skops-dev/skops/sessions/6dadf905-6073-4057-bf3d-faed31a559fe Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com> * Address code review: improve comments for clarity Agent-Logs-Url: https://github.com/skops-dev/skops/sessions/6dadf905-6073-4057-bf3d-faed31a559fe Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com> * Address review: filter sklearn. prefix, remove CyHalfMultinomialLoss, drop warning suppression - Filter SKLEARN_INTERNAL_TYPE_NAMES to only include types starting with 'sklearn.' to avoid trusting monkey-patched types from other packages - Remove CyHalfMultinomialLoss from auto-trusted list since its __module__ reports '_loss' instead of 'sklearn._loss._loss' (Cython build issue on sklearn side) - Add detailed comment explaining the sklearn Cython module name issue - Remove unnecessary warnings.catch_warnings() in test — no warnings are produced during fitting of these estimators - Update test to use get_untrusted_types() + loads(trusted=...) pattern Agent-Logs-Url: https://github.com/skops-dev/skops/sessions/ccbb4abe-8ea6-442c-a6dc-41b9489d35cf Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com> * Fix misleading comment in test Agent-Logs-Url: https://github.com/skops-dev/skops/sessions/ccbb4abe-8ea6-442c-a6dc-41b9489d35cf Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com> * Add all Cy* types to trusted types; conditionally test based on __module__ correctness Agent-Logs-Url: https://github.com/skops-dev/skops/sessions/a65530ea-7c30-419d-a36e-e6c6a423c5f7 Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com> * Rename _cy_module_is_correct to cy_module_is_correct (local variable convention) Agent-Logs-Url: https://github.com/skops-dev/skops/sessions/a65530ea-7c30-419d-a36e-e6c6a423c5f7 Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com> * Fix CI: apply black formatting and update pixi.lock Agent-Logs-Url: https://github.com/skops-dev/skops/sessions/9bf5c6f9-7729-4d13-a37a-18a6b934d296 Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com> * update mechanism * pre-commit update * ... * merge issues * fix loss node * docstring --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com> Co-authored-by: adrinjalali <adrin.jalali@gmail.com>
|
closed in a different PR |
This PR description was mostly written by AI. I reviewed it before submission.
Reference Issues/PRs
This will make #498 very easy, like a 4 lines change. #498 is not a bug, but I'd say it's still a big friction for mlflow users and probably doesn't help them adopting
skops.What does this implement/fix? Explain your changes.
This PR adds support for persisting/loading the internal scikit-learn types needed by GradientBoosting and HistGradientBoosting models without surfacing them as untrusted types.
The implementation introduces a sklearn-specific internal-object path in
skops/io/_sklearn.pyfor a small allowlist of trusted sklearn internals used by GB/HGB models.CyHalfMultinomialLossis serialized under the qualified module pathsklearn._loss._loss.CyHalfMultinomialLossrather than_loss.CyHalfMultinomialLossso that trust is anchored undersklearn.. Maybe this is overkilled?The PR also adds targeted regression tests in
skops/io/tests/test_persist.pycovering GradientBoosting / HistGradientBoosting variants and checking thatCyHalfMultinomialLossis serialized under the qualified sklearn module path.AI usage disclosure:
CyHalfMultinomialLossand tightening/cleaning the sklearn-internal trust path.