ENH Persistence: Add support for LightGBM and XGBoost#244
Conversation
Partially solves skops-dev#226

Description

LightGBM already worked out of the box, so for this, only tests were added. For XGBoost, extra code was added to support it.

Implementation

Dependencies were extended to include xgboost and lightgbm for testing, so that they are covered by CI. If those libraries are not installed, the tests are gracefully skipped.

For XGBoost, the xgboost-specific persistence format was chosen, which can be used through estimator.save_model and estimator.load_model. I could not verify (yet) that this format is 100% secure. However, since it's a binary JSON file (.ubj), it should be as secure as JSON. More information here: https://xgboost.readthedocs.io/en/stable/tutorials/saving_model.html

The actual implementation for xgboost mirrors the one for numpy arrays, which also stores its data in an external file. However, I did not apply any caching here, as it seems it wouldn't be useful.

I decided to create a new test file, test_external.py, for the new tests, instead of cramming more into test_persist.py. Since some functions from test_persist.py were required here, I moved them to a new _utils.py module.

Interestingly, I found some potential bugs where some values returned by get_params may differ for xgboost models after loading. This is independent of skops. Those edge cases are documented but were excluded from tests, as it appears that they should be inconsequential.
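The graceful skipping mentioned above is typically done by probing for the optional dependency before the tests run; here is a stdlib-only sketch (the `has_module` helper is illustrative, not the actual skops test code):

```python
import importlib.util


def has_module(name: str) -> bool:
    # find_spec returns None when the top-level package is not installed,
    # without actually importing it
    return importlib.util.find_spec(name) is not None


# In a pytest suite, flags like these would feed a skipif marker (or one
# would call pytest.importorskip), so a missing optional dependency skips
# the tests instead of failing them.
HAS_XGBOOST = has_module("xgboost")
HAS_LIGHTGBM = has_module("lightgbm")
```

This keeps CI green on environments without the optional libraries while still exercising them wherever they are installed.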
1.7 not supported for Python 3.7
Fails when no GPU is available and I couldn't find a way to detect from Python if a GPU is available or not (short of trying and catching the error).
Apparently, on Windows, np.int32 is used, not np.int64.
I removed the xgboost bug (?) docstring, since I couldn't find how to exclude it from running with doctest. I'm replicating it here so that it doesn't get lost. Ideally, someone with more experience in xgboost could comment on whether this really is a bug or not. If so, we could submit it to the xgboost issue tracker. >>> import xgboost
>>> print(xgboost.__version__)
1.7.1
>>> estimator = xgboost.XGBClassifier(tree_method=None)
>>> X, y = [[0, 1], [2, 3]], [0, 1]
>>> estimator.fit(X, y)
XGBClassifier(...)
>>> print(estimator.tree_method)
None
>>> print(estimator.get_params()["tree_method"])
exact
>>> # after save/load roundtrip, tree_method's get_params value changes from exact to auto
>>> import tempfile
>>> tmp_file = f"{tempfile.mkdtemp()}.ubj"
>>> estimator.save_model(tmp_file)
>>> estimator.load_model(tmp_file)
>>> print(estimator.tree_method)
None
>>> print(estimator.get_params()["tree_method"])
auto
>>> # a similar thing happens with gpu_id
>>> estimator = xgboost.XGBClassifier(tree_method='gpu_hist', booster='gbtree')
>>> estimator.fit(X, y)
XGBClassifier(...)
>>> print(estimator.gpu_id)
None
>>> print(estimator.get_params()["gpu_id"])
0
>>> # after save/load roundtrip, gpu_id's get_params value changes from 0 to -1
>>> estimator.save_model(tmp_file)
>>> # for gpu_id, the estimator needs to be re-initialized for the effect to occur
>>> estimator = xgboost.XGBClassifier()
>>> estimator.load_model(tmp_file)
>>> print(estimator.gpu_id)
None
>>> print(estimator.get_params()["gpu_id"])
-1
@@ -0,0 +1,145 @@
import warnings
Everything in here is unmodified, just moved around.
CI is green, so ready for review @skops-dev/maintainers. I would take a look at catboost separately to keep the PR smaller. I included lightgbm here since it didn't require any code changes.
adrinjalali
left a comment
Could you maybe explain why going through the __dict__ wouldn't work for xgboost? I suppose because the attributes are not C extension types, and rather pure C types?
I would like us to understand what save/load of xgboost does, before adding this. We should document that somewhere.
Apart from this core, we plan to support machine learning libraries commonly
used by the community. So far, those are:

- `LightGBM <lightgbm.readthedocs.io/>`_ (scikit-learn API)
shouldn't we add xgboost here as a result of this PR?
# list of constructors is hard-coded for higher security
constructors = {
    "XGBClassifier": xgboost.XGBClassifier,
    "XGBRegressor": xgboost.XGBRegressor,
    "XGBRFClassifier": xgboost.XGBRFClassifier,
    "XGBRFRegressor": xgboost.XGBRFRegressor,
    "XGBRanker": xgboost.XGBRanker,
}
why can't we use the type saved in __class__ here?
My thinking was that it's more secure, but we could also store the class itself and use that here.
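The hard-coded lookup can be sketched as follows; the placeholder classes stand in for the real xgboost estimators, and `construct` is a hypothetical helper showing why unknown names get rejected rather than resolved dynamically:

```python
# Sketch: dispatch on a stored class name through an explicit allow-list,
# instead of instantiating whatever type name appears in the persisted state.


class XGBClassifier:  # placeholder for xgboost.XGBClassifier
    pass


class XGBRegressor:  # placeholder for xgboost.XGBRegressor
    pass


CONSTRUCTORS = {
    "XGBClassifier": XGBClassifier,
    "XGBRegressor": XGBRegressor,
}


def construct(class_name: str):
    try:
        cls = CONSTRUCTORS[class_name]
    except KeyError:
        # names outside the allow-list are rejected, never imported/eval'd
        raise ValueError(f"Unknown or untrusted class: {class_name}")
    return cls()
```

Storing `__class__` directly would also work, but the explicit mapping means a tampered file cannot smuggle in an arbitrary type name.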
try:
    # xgboost doesn't allow saving to a memory buffer, so take a roundtrip
    # through a temp file
    tmp_file = f"{tempfile.mkdtemp()}.ubj"
can we use a context manager to make sure it's cleaned up after the fact?
I'll take a look.
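The reviewer's suggestion can be sketched with `tempfile.TemporaryDirectory`; the `save_via_tempfile` helper and `FakeEstimator` below are illustrative, not the actual skops code — only the `save_model(path)` file API matches xgboost's:

```python
import os
import tempfile


def save_via_tempfile(estimator) -> bytes:
    # the context manager guarantees the directory (and the model file in
    # it) is removed even if save_model raises
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_file = os.path.join(tmp_dir, "model.ubj")
        estimator.save_model(tmp_file)
        with open(tmp_file, "rb") as f:
            return f.read()


class FakeEstimator:
    # stand-in for an xgboost estimator; mimics save_model's file-path API
    def save_model(self, path):
        with open(path, "wb") as f:
            f.write(b"model-bytes")
```

Compared to a bare `tempfile.mkdtemp()`, nothing is left behind on disk after the bytes are read back.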
if self.type == "xgboost-format":  # fitted
    # load_model works with bytearray, so no temp file necessary here
    instance.load_model(bytearray(self.children["content"].getvalue()))
I really don't like this. I can't easily figure out what this actually does, and if it's anywhere near being safe
This is the equivalent of what we do with numpy arrays (in fact, the whole approach is analogous):
class NdArrayNode(Node):
def __init__(...):
...
if self.type == "numpy":
self.children = {
"content": io.BytesIO(load_context.src.read(state["file"]))
}
def _construct(self):
if self.type == "numpy":
content = np.load(self.children["content"], allow_pickle=False)
...The difference is that, unfortunately, load_model cannot work with bytes, instead the content first needs to be converted to a bytearray instead.
I could also load the model from a temp file, but I thought this approach would be better.
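The bytes-vs-bytearray distinction being discussed can be shown with the standard library alone (the payload here is a stand-in for real model content):

```python
import io

# stand-in for the persisted model content held in a BytesIO child node
content = io.BytesIO(b"\x00binary model payload")

raw = content.getvalue()   # .getvalue() yields immutable bytes
mutable = bytearray(raw)   # mutable copy, the container load_model expects

assert isinstance(raw, bytes) and isinstance(mutable, bytearray)
assert bytes(mutable) == raw  # same payload, different container type
```

The conversion copies the buffer but introduces no parsing or code execution of its own; it only changes the container type.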
BenjaminBossan
left a comment
Could you maybe explain why going through the __dict__ wouldn't work for xgboost? I suppose because the attributes are not C extension types, and rather pure C types?
So without special treatment, xgboost goes through ObjectNode. Since it has __getstate__ and __setstate__, those are being used. After we call the latter, there is some kind of validation going on:
This leads us to the C++ part of the code base, which I honestly didn't even want to inspect.
My thinking is that trying to make that approach work is too error prone; we might get something to run by tinkering with the attributes, but we couldn't be sure that it would work with other parameters. However, the save_model way is officially recommended, is (presumably) well tested, and is stable across versions, so it's more robust for users than if we were directly storing attributes.
I would like us to understand what save/load of xgboost does, before adding this. We should document that somewhere.
Regarding security, I couldn't find anything being mentioned explicitly, but the general advantages are explained in the linked document. As I said, since it's json format, I don't see how that would not be safe.
So looking at this again, I don't understand why you think using It seems the issue we have is that we don't support
My interpretation of the xgboost documentation was that their custom format is the most portable. Say they add a private attribute from version X to version Y: it seems like their format would take care of correctly dealing with the missing attribute, whereas something like pickle would fail. What I cannot say with certainty is: if we use the output of their

I think this problem touches a more general topic when dealing with 3rd party packages: should we try to implement methods as generically as possible, or should we make use of custom serialization formats if they are offered by a package? In this case, should we use

*we should probably support
Added support for bytes and bytearray, which is sufficient to dump and load xgboost. The xgboost-specific code was commented out for now. To serialize bytes in json, base64 encoding was used.
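A minimal sketch of the base64 approach for round-tripping bytes through JSON (the helper names and the `"__bytes__"` key are hypothetical, not the actual skops schema):

```python
import base64
import json


def dumps_bytes(data: bytes) -> str:
    # JSON has no bytes type, so encode the payload as base64 text first
    return json.dumps({"__bytes__": base64.b64encode(data).decode("ascii")})


def loads_bytes(payload: str) -> bytes:
    # decode the base64 text back into the original bytes
    obj = json.loads(payload)
    return base64.b64decode(obj["__bytes__"])


blob = b"\x00\x01model bytes\xff"
assert loads_bytes(dumps_bytes(blob)) == blob
```

Base64 keeps the schema valid UTF-8 JSON at the cost of a size increase over the raw binary.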
@adrinjalali I tested your suggestion to add support for
I agree. But I think for now we can do with only supporting bytes/bytearray, and figure out supporting custom formats of 3rd party libraries later. One thing I wonder: should we store bytes in a binary file instead? I feel like it'd make save/load much faster, and keep the schema json file cleaner.
Okay, then let's start with that.
It'll probably depend. In this case, maybe; but the additional I/O is probably a big overhead if we have multiple small byte objects.
I wanted to suggest having a threshold for deciding whether they go in the schema or in the file, but then I realized another thing to think of: if we allow binary data in the json schema, then people can create
I'm not against your proposal; I just want to understand what the concern is. Is it only about readability? You could argue that even if there is a big blob in the json, it's only one line, so it doesn't distract much (it could slow down some editors though).
Slowing down an editor is one thing; loading a massive json into memory is another. I'm not sure how people could exploit this, but I kinda feel like keeping them in separate files is cleaner. Imagine people having images as bytes, and we'd be keeping them in the json file all together. Besides, I don't think there are many cases where we'd encounter small byte arrays. So my guess is that in most cases those byte arrays are large rather than small, and therefore having all of them in files would speed things up on average.
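To make the size argument concrete: base64 expands binary data by roughly a third (4 output characters per 3 input bytes), which is one reason to prefer a separate binary file for large blobs. A small stdlib-only illustration:

```python
import base64

payload = bytes(range(256)) * 4       # 1024 bytes of raw binary data
encoded = base64.b64encode(payload)   # what would land inside the JSON text

# 1024 bytes -> 1368 base64 characters, about a 33% inflation
overhead = len(encoded) / len(payload)
```

For a raw binary side-car file there is no such inflation, at the cost of one extra file per blob.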
@adrinjalali I made the changes as you suggested, please review again. For good measure, I checked whether adding the bytes support is sufficient to make catboost work, and indeed it is, so I added tests for it as well.
    trusted: bool | Sequence[str] = False,
) -> None:
    super().__init__(state, load_context, trusted)
    self.trusted = self._get_trusted(trusted, [])
shouldn't we trust bytes and bytearray?
assert_params_equal(estimator.get_params(), loaded.get_params())


def check_method_outputs(estimator, X, trusted):
isn't this common between these tests and scikit-learn ones? Would be nice to avoid duplication as much as we can here.
It's very close. For my understanding, is the suggestion to put this method into tests/_utils.py and use it in test_persist.py too?
These discrepancies occur regardless of skops, so they're a problem in
xgboost itself. We assume that this has no practical consequences and thus
avoid testing these cases.
is there an upstream issue for this on the xgboost side? If not, it'd be nice to open one and link to it here.
As I commented, I could open an issue, but ideally someone who knows xgboost well could comment on whether it's really a bug or not. If there is no one around, I can just open one anyway.
yeah I think the devs on the xgboost repo can decide if it's a bug which they wanna fix or not.
- Refactor method output testing to be a utility function
- Add bytes and bytearray to trusted types
…jaminBossan/skops into persistence-support-lightgbm-xgboost
@adrinjalali I addressed your comments, please review again. Regarding the issue of forgetting to add bytes/bytearray to trusted: it's a bit of a sign that our test coverage could be improved, right? As is, those oversights were not caught by the standard set of tests we have.
Tests failing @BenjaminBossan
@adrinjalali sorry, I made a last-minute change and rolled it back, introducing the bug. Tests are green now.
adrinjalali
left a comment
Otherwise happy to merge.
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
EDITED
Resolves #226
Description
LightGBM already worked out of the box, so for this, only tests were added. For XGBoost and CatBoost, extra code was added to support them.
Implementation
Dependencies were extended to include xgboost, lightgbm, and catboost for testing, so that they should be covered by CI. If those libraries are not installed, the tests should be gracefully skipped.
For xgboost, the xgboost-specific persistence format was chosen, which can be used through estimator.save_model and estimator.load_model. I could not verify (yet) that this format is 100% secure. However, since it's a binary JSON file (.ubj), it should be as secure as JSON. More information here: https://xgboost.readthedocs.io/en/stable/tutorials/saving_model.html

The actual implementation for xgboost mirrors the one for numpy arrays, which also stores its data in an external file. However, I did not apply any caching here, as it seems it wouldn't be useful.

I decided to create a new test file, test_external.py, for the new tests, instead of cramming more into test_persist.py. Since some functions from test_persist.py were required here, I moved them to a new _utils.py module.

Interestingly, I found some potential bugs where some values returned by get_params() may differ for xgboost models after loading. This is independent of skops. Those edge cases are documented but were excluded from tests, as it appears that they should be inconsequential. We may want to report them.