120 changes: 70 additions & 50 deletions docs/persistence.rst
@@ -5,22 +5,21 @@ Secure persistence with skops

.. warning::

This feature is very early in development, which means the API is
unstable and it is **not secure** at the moment. Therefore, use the same
caution as you would for ``pickle``: Don't load from sources that you
don't trust. In the future, more security will be added.
This feature is heavily under development, which means the API is unstable
and there might be security issues at the moment. Therefore, use caution
when loading files from sources you don't trust.

Skops offers a way to save and load sklearn models without using :mod:`pickle`.
The ``pickle`` module is not secure, but with skops, you can securely save and
load sklearn models without using ``pickle``.
The ``pickle`` module is not secure, but with skops, you can [more] securely
save and load models without using ``pickle``.

``Pickle`` is the standard serialization format for sklearn and for Python in
general. One of the main advantages of ``pickle`` is that it can be used for
almost all Python code but this flexibility also makes it inherently insecure.
This is because loading certain types of objects requires the ability to run
arbitrary code, which can be misused for malicious purposes. For example, an
attacker can use it to steal secrets from your machine or install a virus. As
the `Python docs
general (``cloudpickle`` and ``joblib`` use the same format). One of the main
advantages of ``pickle`` is that it can be used for almost all Python objects
but this flexibility also makes it inherently insecure. This is because loading
certain types of objects requires the ability to run arbitrary code, which can
be misused for malicious purposes. For example, an attacker can use it to steal
secrets from your machine or install a virus. As the `Python docs
<https://docs.python.org/3/library/pickle.html#module-pickle>`__ say:

.. warning::
@@ -31,26 +30,43 @@ the `Python docs
untrusted source, or that could have been tampered with.
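
The risk described above can be illustrated with a deliberately harmless
sketch that uses only the standard library. The ``Payload`` class is a
hypothetical name invented for this example. Its ``__reduce__`` method tells
``pickle`` which callable to invoke at load time, so a crafted file can make
the loader call a function chosen by the attacker:

```python
import pickle


class Payload:
    """Hypothetical class, used only for this illustration."""

    def __reduce__(self):
        # pickle stores this (callable, args) pair and calls it when loading.
        # A real attacker would pick something far worse than print.
        return (print, ("this ran while unpickling",))


data = pickle.dumps(Payload())

# Loading calls print(...) and returns its result; no Payload is rebuilt.
result = pickle.loads(data)
```

Here ``result`` is whatever the stored callable returned, not a ``Payload``
instance, which shows that the loader executed code rather than merely
restoring data.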

In contrast to ``pickle``, the :func:`skops.io.dump` and :func:`skops.io.load`
functions cannot be used to save arbitrary Python code, but they bypass
``pickle`` and are thus more secure.
functions have a more limited scope, while preventing users from running
arbitrary code or loading unknown and malicious objects.

When loading a file, :func:`skops.io.load`/:func:`skops.io.loads` will traverse
the input, check for known and unknown types, and will only construct those
objects if they are trusted, either by default or by the user.

.. note::
You can try out converting your existing pickle files to the skops format
using this Space on Hugging Face Hub:
`pickle-to-skops <https://huggingface.co/spaces/scikit-learn/pickle-to-skops>`__.

Usage
-----

The code snippet below illustrates how to use :func:`skops.io.dump` and
:func:`skops.io.load`:
:func:`skops.io.load`. Note that one needs `XGBoost
<https://xgboost.readthedocs.io/en/stable/>`__ installed to run this:

.. code:: python

from sklearn.linear_model import LogisticRegression
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.datasets import load_iris
from skops.io import dump, load

clf = LogisticRegression(random_state=0, solver="liblinear")
clf.fit(X_train, y_train)
dump(clf, "my-logistic-regression.skops")
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
param_grid = {"tree_method": ["exact", "approx", "hist"]}
clf = GridSearchCV(XGBClassifier(), param_grid=param_grid).fit(X_train, y_train)
print(clf.score(X_test, y_test))
0.9666666666666667
dump(clf, "my-model.skops")
# ...
loaded = load("my-logistic-regression.skops", trusted=True)
loaded.predict(X_test)
loaded = load("my-model.skops", trusted=True)
print(loaded.score(X_test, y_test))
0.9666666666666667

# in memory
from skops.io import dumps, loads
@@ -64,28 +80,35 @@ using :func:`skops.io.get_untrusted_types`:
.. code:: python

from skops.io import get_untrusted_types
unknown_types = get_untrusted_types(file="my-logistic-regression.skops")
unknown_types = get_untrusted_types(file="my-model.skops")
print(unknown_types)
['numpy.float64', 'numpy.int64', 'sklearn.metrics._scorer._passthrough_scorer',
'xgboost.core.Booster', 'xgboost.sklearn.XGBClassifier']

Note that everything in the above list is safe to load. We already have many
types included as trusted by default, and some of the above values might be
added to that list in the future.

Once you check the list and you validate that everything in the list is safe,
you can load the file with ``trusted=unknown_types``:

.. code:: python

loaded = load("my-logistic-regression.skops", trusted=unknown_types)
loaded = load("my-model.skops", trusted=unknown_types)
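
One way to make the review step repeatable is to compare the reported types
against a list you have already checked by hand. The sketch below uses only
plain Python. ``REVIEWED_TYPES`` and ``find_unreviewed`` are hypothetical names
invented for this example and are not part of skops, and the type names are
sample strings rather than the output of a real file:

```python
# Hypothetical allow-list: types that a person has already reviewed.
REVIEWED_TYPES = {
    "numpy.float64",
    "numpy.int64",
    "sklearn.metrics._scorer._passthrough_scorer",
    "xgboost.core.Booster",
    "xgboost.sklearn.XGBClassifier",
}


def find_unreviewed(type_names, reviewed=REVIEWED_TYPES):
    """Return the type names that are not in the reviewed set, sorted."""
    return sorted(set(type_names) - set(reviewed))


# Sample input standing in for the list returned by get_untrusted_types.
reported = ["numpy.float64", "xgboost.core.Booster", "some_package.Unknown"]
unreviewed = find_unreviewed(reported)
print(unreviewed)  # ['some_package.Unknown']
```

With such a helper, one would pass the list on to ``load`` only when
``unreviewed`` is empty, and inspect the remaining names otherwise.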

At the moment, we support the vast majority of sklearn estimators. This
includes complex use cases such as :class:`sklearn.pipeline.Pipeline`,
:class:`sklearn.model_selection.GridSearchCV`, classes using Cython code, such
as :class:`sklearn.tree.DecisionTreeClassifier`, and more. If you discover an
sklearn estimator that does not work, please open an issue on the skops `GitHub
page <https://github.com/skops-dev/skops/issues>`_ and let us know.

In contrast to ``pickle``, skops cannot persist arbitrary Python code. This
means if you have custom functions (say, a custom function to be used with
:class:`sklearn.model_selection.GridSearchCV`, classes using objects defined in
Cython such as :class:`sklearn.tree.DecisionTreeClassifier`, and more. If you
discover an sklearn estimator that does not work, please open an issue on the
skops `GitHub page <https://github.com/skops-dev/skops/issues>`__ and let us
know.

At the moment, ``skops`` cannot persist arbitrary Python code. This means if
you have custom functions (say, a custom function to be used with
:class:`sklearn.preprocessing.FunctionTransformer`), it will not work. However,
most ``numpy`` and ``scipy`` functions should work. Therefore, you can actually
save built-in functions like ``numpy.sqrt``.
most ``numpy`` and ``scipy`` functions should work. Therefore, you can save
objects having references to functions such as ``numpy.sqrt``.

Supported libraries
-------------------
@@ -96,32 +119,29 @@ most types from **numpy** and **scipy** should be supported, such as (sparse)
arrays, dtypes, random generators, and ufuncs.

Apart from this core, we plan to support machine learning libraries commonly
used by the community. So far, those are:
used by the community. So far, we have tested the following libraries:

- `LightGBM <https://lightgbm.readthedocs.io/>`_ (scikit-learn API)
- `XGBoost <https://xgboost.readthedocs.io/en/stable/>`_ (scikit-learn API)
- `CatBoost <https://catboost.ai/en/docs/>`_

If you run into a problem using any of the mentioned libraries, this could mean
there is a bug in skops. Please open an issue on `our issue tracker
<https://github.com/skops-dev/skops/issues>`_ (but please check first if a
<https://github.com/skops-dev/skops/issues>`__ (but please check first if a
corresponding issue already exists).

Roadmap
-------

Currently, it is still possible to run insecure code when using skops
persistence. For example, it's possible to load a save file that evaluates
arbitrary code using :func:`eval`. However, we have concrete plans on how to
mitigate this, so please stay updated.

On top of trying to support persisting all relevant sklearn objects, we plan on
making persistence extensible for other libraries. As a user, this means that
if you trust a certain library, you will be able to tell skops to load code
from that library. As a library author, there will be a clear path of what
needs to be done to add secure persistence to your library, such that skops can
save and load code from your library.

To follow what features are currently planned, filter for the `"persistence"
label <https://github.com/skops-dev/skops/labels/persistence>`_ in our GitHub
issues.
There needs to be more testing to harden the loader and make sure we don't run
arbitrary code when it's not intended. However, the safety mechanisms already
in place should prevent most cases of abuse.

At the moment, persisting and loading arbitrary C extension types is not
possible, unless a Python object wraps around them and handles persistence and
loading via ``__getstate__`` and ``__setstate__``. We plan to develop an API
which would help third-party libraries to make their C extension types
``skops`` compatible.

You can check on our `issue tracker
<https://github.com/skops-dev/skops/labels/persistence>`__ which features are
planned for the near future.