From 2b7da6c3aab8b7dc8cae71b98b68871aebb35283 Mon Sep 17 00:00:00 2001
From: adrinjalali
Date: Thu, 15 Dec 2022 21:35:24 +0100
Subject: [PATCH 1/3] DOC rework persistence user guide

---
 docs/persistence.rst | 112 ++++++++++++++++++++++++-------------------
 1 file changed, 63 insertions(+), 49 deletions(-)

diff --git a/docs/persistence.rst b/docs/persistence.rst
index 19959f85..01e804d0 100644
--- a/docs/persistence.rst
+++ b/docs/persistence.rst
@@ -5,22 +5,21 @@ Secure persistence with skops
 
 .. warning::
 
-    This feature is very early in development, which means the API is
-    unstable and it is **not secure** at the moment. Therefore, use the same
-    caution as you would for ``pickle``: Don't load from sources that you
-    don't trust. In the future, more security will be added.
+    This feature is heavily under development, which means the API is unstable
+    and there might be security issues at the moment. Therefore, use caution
+    when loading files from sources you don't trust.
 
 Skops offers a way to save and load sklearn models without using :mod:`pickle`.
-The ``pickle`` module is not secure, but with skops, you can securely save and
-load sklearn models without using ``pickle``.
+The ``pickle`` module is not secure, but with skops, you can [more] securely
+save and load models without using ``pickle``.
 
 ``Pickle`` is the standard serialization format for sklearn and for Python in
-general. One of the main advantages of ``pickle`` is that it can be used for
-almost all Python code but this flexibility also makes it inherently insecure.
-This is because loading certain types of objects requires the ability to run
-arbitrary code, which can be misused for malicious purposes. For example, an
-attacker can use it to steal secrets from your machine or install a virus. As
-the `Python docs
+general (``cloudpickle`` and ``joblib`` use the same format). One of the main
+advantages of ``pickle`` is that it can be used for almost all Python objects
+but this flexibility also makes it inherently insecure. This is because loading
+certain types of objects requires the ability to run arbitrary code, which can
+be misused for malicious purposes. For example, an attacker can use it to steal
+secrets from your machine or install a virus. As the `Python docs
 `__ say:
 
 .. warning::
@@ -31,8 +30,12 @@ the `Python docs
     untrusted source, or that could have been tampered with.
 
 In contrast to ``pickle``, the :func:`skops.io.dump` and :func:`skops.io.load`
-functions cannot be used to save arbitrary Python code, but they bypass
-``pickle`` and are thus more secure.
+functions have a more limited scope, while preventing users from running
+arbitrary code or loading unknown and malicious objects.
+
+When loading a file, :func:`skops.io.load`/:func:`skops.io.loads` will read
+traverse the input and check for known and unknown types, and will only
+construct those objects if they are trusted, either by default or by the user.
 
 Usage
 -----
@@ -42,15 +45,22 @@ The code snippet below illustrates how to use :func:`skops.io.dump` and
 
 .. code:: python
 
-    from sklearn.linear_model import LogisticRegression
+    from xgboost.sklearn import XGBClassifier
+    from sklearn.model_selection import GridSearchCV, train_test_split
+    from sklearn.datasets import load_iris
     from skops.io import dump, load
 
-    clf = LogisticRegression(random_state=0, solver="liblinear")
-    clf.fit(X_train, y_train)
-    dump(clf, "my-logistic-regression.skops")
+    X, y = load_iris(return_X_y=True)
+    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
+    param_grid = {"tree_method": ["exact", "approx", "hist"]}
+    clf = GridSearchCV(XGBClassifier(), param_grid=param_grid).fit(X_train, y_train)
+    print(clf.score(X_test, y_test))
+    0.9666666666666667
+    dump(clf, "my-model.skops")
     # ...
-    loaded = load("my-logistic-regression.skops", trusted=True)
-    loaded.predict(X_test)
+    loaded = load("my-model.skops", trusted=True)
+    print(loaded.score(X_test, y_test))
+    0.9666666666666667
 
     # in memory
     from skops.io import dumps, loads
@@ -64,28 +74,35 @@ using :func:`skops.io.get_untrusted_types`:
 .. code:: python
 
     from skops.io import get_untrusted_types
-    unknown_types = get_untrusted_types(file="my-logistic-regression.skops")
+    unknown_types = get_untrusted_types(file="my-model.skops")
     print(unknown_types)
+    ['numpy.float64', 'numpy.int64', 'sklearn.metrics._scorer._passthrough_scorer',
+    'xgboost.core.Booster', 'xgboost.sklearn.XGBClassifier']
+
+Note that everything in the above list is safe to load. We already have many
+types included as trusted by default, and some of the above values might be
+added to that list in the future.
 
 Once you check the list and you validate that everything in the list is safe,
 you can load the file with ``trusted=unknown_types``:
 
 .. code:: python
 
-    loaded = load("my-logistic-regression.skops", trusted=unknown_types)
+    loaded = load("my-model.skops", trusted=unknown_types)
 
 At the moment, we support the vast majority of sklearn estimators. This
 includes complex use cases such as :class:`sklearn.pipeline.Pipeline`,
-:class:`sklearn.model_selection.GridSearchCV`, classes using Cython code, such
-as :class:`sklearn.tree.DecisionTreeClassifier`, and more. If you discover an
-sklearn estimator that does not work, please open an issue on the skops `GitHub
-page `_ and let us know.
-
-In contrast to ``pickle``, skops cannot persist arbitrary Python code. This
-means if you have custom functions (say, a custom function to be used with
+:class:`sklearn.model_selection.GridSearchCV`, classes using objects defined in
+Cython such such as :class:`sklearn.tree.DecisionTreeClassifier`, and more. If
+you discover an sklearn estimator that does not work, please open an issue on
+the skops `GitHub page `__ and let
+us know.
+
+At the moment, ``skops`` cannot persist arbitrary Python code. This means if
+you have custom functions (say, a custom function to be used with
 :class:`sklearn.preprocessing.FunctionTransformer`), it will not work. However,
-most ``numpy`` and ``scipy`` functions should work. Therefore, you can actually
-save built-in functions like ``numpy.sqrt``.
+most ``numpy`` and ``scipy`` functions should work. Therefore, you can save
+objects having references to functions such as ``numpy.sqrt``.
 
 Supported libraries
 -------------------
@@ -96,7 +113,7 @@ most types from **numpy** and **scipy** should be supported, such as
 (sparse) arrays, dtypes, random generators, and ufuncs.
 
 Apart from this core, we plan to support machine learning libraries commonly
-used be the community. So far, those are:
+used be the community. So far, we have tested the following libraries:
 
 - `LightGBM `_ (scikit-learn API)
 - `XGBoost `_ (scikit-learn API)
@@ -104,24 +121,21 @@ used be the community. So far, those are:
 
 If you run into a problem using any of the mentioned libraries, this could mean
 there is a bug in skops. Please open an issue on `our issue tracker
-`_ (but please check first if a
+`__ (but please check first if a
 corresponding issue already exists).
 
 Roadmap
 -------
-
-Currently, it is still possible to run insecure code when using skops
-persistence. For example, it's possible to load a save file that evaluates
-arbitrary code using :func:`eval`. However, we have concrete plans on how to
-mitigate this, so please stay updated.
-
-On top of trying to support persisting all relevant sklearn objects, we plan on
-making persistence extensible for other libraries. As a user, this means that
-if you trust a certain library, you will be able to tell skops to load code
-from that library. As a library author, there will be a clear path of what
-needs to be done to add secure persistence to your library, such that skops can
-save and load code from your library.
-
-To follow what features are currently planned, filter for the `"persistence"
-label `_ in our GitHub
-issues.
+There needs to be more testing to harden the loader and make sure we don't run
+arbitrary code when it's not intended. However, the safety mechanisms already
+in place should prevent most cases of abuse.
+
+At the moment most persisting and loading arbitrary C extension types is not
+possible unless a python object wraps around them and handles persistance and
+loading via ``__getstate__`` and ``__setstate__``. We plan to develop an API
+which would help third party libraries to make their C extension types
+``skops`` compatible.
+
+You can check on our `"issue tracker
+`__ which features are
+planned for the near future.

From 90a61f14baa0813c914ff116d5ae6495c5ddd712 Mon Sep 17 00:00:00 2001
From: adrinjalali
Date: Fri, 16 Dec 2022 12:16:59 +0100
Subject: [PATCH 2/3] apply suggestions

---
 docs/persistence.rst | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/docs/persistence.rst b/docs/persistence.rst
index 01e804d0..1ed4cf5b 100644
--- a/docs/persistence.rst
+++ b/docs/persistence.rst
@@ -33,15 +33,16 @@ In contrast to ``pickle``, the :func:`skops.io.dump` and :func:`skops.io.load`
 functions have a more limited scope, while preventing users from running
 arbitrary code or loading unknown and malicious objects.
 
-When loading a file, :func:`skops.io.load`/:func:`skops.io.loads` will read
-traverse the input and check for known and unknown types, and will only
-construct those objects if they are trusted, either by default or by the user.
+When loading a file, :func:`skops.io.load`/:func:`skops.io.loads` will traverse
+the input, check for known and unknown types, and will only construct those
+objects if they are trusted, either by default or by the user.
 
 Usage
 -----
 
 The code snippet below illustrates how to use :func:`skops.io.dump` and
-:func:`skops.io.load`:
+:func:`skops.io.load`. Note that one needs `XGBoost
+`__ installed to run this:
 
 .. code:: python
 
@@ -93,10 +94,10 @@ you can load the file with ``trusted=unknown_types``:
 At the moment, we support the vast majority of sklearn estimators. This
 includes complex use cases such as :class:`sklearn.pipeline.Pipeline`,
 :class:`sklearn.model_selection.GridSearchCV`, classes using objects defined in
-Cython such such as :class:`sklearn.tree.DecisionTreeClassifier`, and more. If
-you discover an sklearn estimator that does not work, please open an issue on
-the skops `GitHub page `__ and let
-us know.
+Cython such as :class:`sklearn.tree.DecisionTreeClassifier`, and more. If you
+discover an sklearn estimator that does not work, please open an issue on the
+skops `GitHub page `__ and let us
+know.
 
 At the moment, ``skops`` cannot persist arbitrary Python code. This means if
 you have custom functions (say, a custom function to be used with
@@ -130,8 +131,8 @@ There needs to be more testing to harden the loader and make sure we don't run
 arbitrary code when it's not intended. However, the safety mechanisms already
 in place should prevent most cases of abuse.
 
-At the moment most persisting and loading arbitrary C extension types is not
-possible unless a python object wraps around them and handles persistance and
+At the moment, persisting and loading arbitrary C extension types is not
+possible, unless a python object wraps around them and handles persistence and
 loading via ``__getstate__`` and ``__setstate__``. We plan to develop an API
 which would help third party libraries to make their C extension types
 ``skops`` compatible.

From 7e74865cf35500575c6be5c4209798503930cf0b Mon Sep 17 00:00:00 2001
From: adrinjalali
Date: Fri, 16 Dec 2022 12:58:32 +0100
Subject: [PATCH 3/3] add a link to the space

---
 docs/persistence.rst | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/docs/persistence.rst b/docs/persistence.rst
index 1ed4cf5b..0a3a7dfd 100644
--- a/docs/persistence.rst
+++ b/docs/persistence.rst
@@ -37,6 +37,11 @@ When loading a file, :func:`skops.io.load`/:func:`skops.io.loads` will traverse
 the input, check for known and unknown types, and will only construct those
 objects if they are trusted, either by default or by the user.
 
+.. note::
+    You can try out converting your existing pickle files to the skops format
+    using this Space on Hugging Face Hub:
+    `pickle-to-skops `__.
+
 Usage
 -----
 
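
Reviewer note on the patched guide: it says that loading a pickle can run arbitrary code, and that the skops loader only constructs types that are trusted. The sketch below illustrates both ideas using only the Python standard library. It is an analogy, not how skops is implemented: ``Payload``, ``AllowListUnpickler``, ``TRUSTED`` and ``restricted_loads`` are hypothetical names introduced for illustration, and the allow-list approach follows the "restricting globals" pattern from the Python ``pickle`` documentation.

```python
import io
import pickle


class Payload:
    """Hypothetical object whose pickle asks the loader to call eval()."""

    def __reduce__(self):
        # Instead of storing state, the pickle stores "call eval('6 * 7')".
        # A real attacker would put something far less harmless here.
        return (eval, ("6 * 7",))


malicious_bytes = pickle.dumps(Payload())

# Plain pickle executes the embedded call while loading.
result = pickle.loads(malicious_bytes)
print(result)


class AllowListUnpickler(pickle.Unpickler):
    """Resolve only globals that were explicitly marked as trusted."""

    # Illustrative allow-list; an assumption for this sketch only.
    TRUSTED = {("builtins", "list"), ("builtins", "dict")}

    def find_class(self, module, name):
        # Called whenever the pickle stream references a global by name.
        if (module, name) not in self.TRUSTED:
            raise pickle.UnpicklingError(f"untrusted type: {module}.{name}")
        return super().find_class(module, name)


def restricted_loads(data):
    """Load pickle bytes, refusing any global outside the allow-list."""
    return AllowListUnpickler(io.BytesIO(data)).load()


try:
    restricted_loads(malicious_bytes)
    blocked = False
except pickle.UnpicklingError as exc:
    blocked = True
    print(exc)
```

The first load returns the value computed by ``eval``, which shows that the loader ran code chosen by whoever produced the bytes. The restricted loader refuses the same bytes because ``builtins.eval`` is not on its allow-list. This is roughly the decision a user makes when reviewing the output of :func:`skops.io.get_untrusted_types` before passing it as ``trusted``.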