ENH Persistence: Add support for LightGBM and XGBoost#244
Conversation
Partially solves skops-dev#226

Description

LightGBM already worked out of the box, so for this, only tests were added. For XGBoost, extra code was added to support it.

Implementation

Dependencies were extended to include xgboost and lightgbm for testing, so that they are covered by CI. If those libraries are not installed, the tests are gracefully skipped.

For XGBoost, the xgboost-specific persistence format was chosen, which can be used through estimator.save_model and estimator.load_model. I could not verify (yet) that this format is 100% secure. However, since it's a binary JSON file (.ubj), it should be as secure as JSON. More information here: https://xgboost.readthedocs.io/en/stable/tutorials/saving_model.html

The actual implementation for xgboost mirrors the one for numpy arrays, which also stores its data in an external file. However, I did not apply any caching here, as it seems it wouldn't be useful.

I decided to create a new test file, test_external.py, for the new tests, instead of cramming more into test_persist.py. Since some functions from test_persist.py were required here, I moved them to a new _utils.py module.

Interestingly, I found some potential bugs where some values returned by get_params may differ for xgboost models after loading. This is independent of skops. Those edge cases are documented but were excluded from tests, as it appears that they should be inconsequential.
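The graceful skipping mentioned above is typically done by probing for the optional dependency before the tests run; here is a stdlib-only sketch (the `has_module` helper is illustrative, not the actual skops test code):

```python
import importlib.util


def has_module(name: str) -> bool:
    # find_spec returns None when the top-level package is not installed,
    # without actually importing it
    return importlib.util.find_spec(name) is not None


# In a pytest suite, flags like these would feed a skipif marker (or one
# would call pytest.importorskip), so a missing optional dependency skips
# the tests instead of failing them.
HAS_XGBOOST = has_module("xgboost")
HAS_LIGHTGBM = has_module("lightgbm")
```

This keeps CI green on environments without the optional libraries while still exercising them wherever they are installed.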
1.7 not supported for Python 3.7
Fails when no GPU is available and I couldn't find a way to detect from Python if a GPU is available or not (short of trying and catching the error).
Apparently, on Windows, np.int32 is used, not np.int64.
I removed the xgboost bug (?) docstring, since I couldn't find how to exclude it from running with doctest. I'm replicating it here so that it doesn't get lost. Ideally, someone with more experience in xgboost could comment on whether this really is a bug or not. If so, we could submit it to the xgboost issue tracker. >>> import xgboost
>>> print(xgboost.__version__)
1.7.1
>>> estimator = xgboost.XGBClassifier(tree_method=None)
>>> X, y = [[0, 1], [2, 3]], [0, 1]
>>> estimator.fit(X, y)
XGBClassifier(...)
>>> print(estimator.tree_method)
None
>>> print(estimator.get_params()["tree_method"])
exact
>>> # after save/load roundtrip, tree_method's get_params value changes from exact to auto
>>> import tempfile
>>> tmp_file = f"{tempfile.mkdtemp()}.ubj"
>>> estimator.save_model(tmp_file)
>>> estimator.load_model(tmp_file)
>>> print(estimator.tree_method)
None
>>> print(estimator.get_params()["tree_method"])
auto
>>> # a similar thing happens with gpu_id
>>> estimator = xgboost.XGBClassifier(tree_method='gpu_hist', booster='gbtree')
>>> estimator.fit(X, y)
XGBClassifier(...)
>>> print(estimator.gpu_id)
None
>>> print(estimator.get_params()["gpu_id"])
0
>>> # after save/load roundtrip, gpu_id's get_params value changes from 0 to -1
>>> estimator.save_model(tmp_file)
>>> # for gpu_id, the estimator needs to be re-initialized for the effect to occur
>>> estimator = xgboost.XGBClassifier()
>>> estimator.load_model(tmp_file)
>>> print(estimator.gpu_id)
None
>>> print(estimator.get_params()["gpu_id"])
-1
@@ -0,0 +1,145 @@
import warnings
Everything in here is unmodified, just moved around.
CI is green, so ready for review @skops-dev/maintainers. I would take a look at catboost separately to keep the PR smaller. I included lightgbm here since it didn't require any code changes.
adrinjalali
left a comment
Could you maybe explain why going through the __dict__ wouldn't work for xgboost? I suppose because the attributes are not C extension types, and rather pure C types?
I would like us to understand what save/load of xgboost does, before adding this. We should document that somewhere.
Apart from this core, we plan to support machine learning libraries commonly
used by the community. So far, those are:

- `LightGBM <lightgbm.readthedocs.io/>`_ (scikit-learn API)
shouldn't we add xgboost here as a result of this PR?
# list of constructors is hard-coded for higher security
constructors = {
    "XGBClassifier": xgboost.XGBClassifier,
    "XGBRegressor": xgboost.XGBRegressor,
    "XGBRFClassifier": xgboost.XGBRFClassifier,
    "XGBRFRegressor": xgboost.XGBRFRegressor,
    "XGBRanker": xgboost.XGBRanker,
}
why can't we use the type saved in __class__ here?
My thinking was that it's more secure, but we could also store the class itself and use that here.
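The hard-coded lookup can be sketched as follows; the placeholder classes stand in for the real xgboost estimators, and `construct` is a hypothetical helper showing why unknown names get rejected rather than resolved dynamically:

```python
# Sketch: dispatch on a stored class name through an explicit allow-list,
# instead of instantiating whatever type name appears in the persisted state.


class XGBClassifier:  # placeholder for xgboost.XGBClassifier
    pass


class XGBRegressor:  # placeholder for xgboost.XGBRegressor
    pass


CONSTRUCTORS = {
    "XGBClassifier": XGBClassifier,
    "XGBRegressor": XGBRegressor,
}


def construct(class_name: str):
    try:
        cls = CONSTRUCTORS[class_name]
    except KeyError:
        # names outside the allow-list are rejected, never imported/eval'd
        raise ValueError(f"Unknown or untrusted class: {class_name}")
    return cls()
```

Storing `__class__` directly would also work, but the explicit mapping means a tampered file cannot smuggle in an arbitrary type name.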
try:
    # xgboost doesn't allow saving to a memory buffer, so take a roundtrip
    # through a temp file
    tmp_file = f"{tempfile.mkdtemp()}.ubj"
can we use a context manager to make sure it's cleaned up after the fact?
I'll take a look.
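The reviewer's suggestion can be sketched with `tempfile.TemporaryDirectory`; the `save_via_tempfile` helper and `FakeEstimator` below are illustrative, not the actual skops code — only the `save_model(path)` file API matches xgboost's:

```python
import os
import tempfile


def save_via_tempfile(estimator) -> bytes:
    # the context manager guarantees the directory (and the model file in
    # it) is removed even if save_model raises
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_file = os.path.join(tmp_dir, "model.ubj")
        estimator.save_model(tmp_file)
        with open(tmp_file, "rb") as f:
            return f.read()


class FakeEstimator:
    # stand-in for an xgboost estimator; mimics save_model's file-path API
    def save_model(self, path):
        with open(path, "wb") as f:
            f.write(b"model-bytes")
```

Compared to a bare `tempfile.mkdtemp()`, nothing is left behind on disk after the bytes are read back.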
if self.type == "xgboost-format":  # fitted
    # load_model works with bytearray, so no temp file necessary here
    instance.load_model(bytearray(self.children["content"].getvalue()))
I really don't like this. I can't easily figure out what this actually does, and if it's anywhere near being safe
This is the equivalent of what we do with numpy arrays (in fact, the whole approach is analogous):
class NdArrayNode(Node):
def __init__(...):
...
if self.type == "numpy":
self.children = {
"content": io.BytesIO(load_context.src.read(state["file"]))
}
def _construct(self):
if self.type == "numpy":
content = np.load(self.children["content"], allow_pickle=False)
...The difference is that, unfortunately, load_model cannot work with bytes, instead the content first needs to be converted to a bytearray instead.
I could also load the model from a temp file, but I thought this approach would be better.
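The bytes-vs-bytearray distinction being discussed can be shown with the standard library alone (the payload here is a stand-in for real model content):

```python
import io

# stand-in for the persisted model content held in a BytesIO child node
content = io.BytesIO(b"\x00binary model payload")

raw = content.getvalue()   # .getvalue() yields immutable bytes
mutable = bytearray(raw)   # mutable copy, the container load_model expects

assert isinstance(raw, bytes) and isinstance(mutable, bytearray)
assert bytes(mutable) == raw  # same payload, different container type
```

The conversion copies the buffer but introduces no parsing or code execution of its own; it only changes the container type.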
BenjaminBossan
left a comment
Could you maybe explain why going through the __dict__ wouldn't work for xgboost? I suppose because the attributes are not C extension types, and rather pure C types?
So without special treatment, xgboost goes through ObjectNode. Since it has __getstate__ and __setstate__, those are being used. After we call the latter, there is some kind of validation going on:
This leads us to the C++ part of the code base, which I honestly didn't even want to inspect.
My thinking is that trying to make that approach work is too error prone; we might get something to run by tinkering with the attributes, but we couldn't be sure that it would work with other parameters. However, the save_model way is officially recommended, is (presumably) well tested, and is stable across versions, so it's more robust for users than if we were directly storing attributes.
I would like us to understand what save/load of xgboost does, before adding this. We should document that somewhere.
Regarding security, I couldn't find anything being mentioned explicitly, but the general advantages are explained in the linked document. As I said, since it's json format, I don't see how that would not be safe.
So looking at this again, I don't understand why you think using It seems the issue we have is that we don't support
My interpretation of the xgboost documentation was that their custom format is the most portable. Say they add a private attribute from version X to version Y: it seems like their format would take care of correctly dealing with the missing attribute, whereas something like pickle would fail. What I cannot say with certainty is: if we use the output of their

I think this problem touches a more general topic when dealing with 3rd party packages: should we try to implement methods as generically as possible, or should we make use of custom serialization formats if they are offered by a package? In this case, should we use

*we should probably support
Added support for bytes and bytearray, which is sufficient to dump and load xgboost. The xgboost-specific code was commented out for now. To serialize bytes in json, base64 encoding was used.
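A minimal sketch of the base64 approach for round-tripping bytes through JSON (the helper names and the `"__bytes__"` key are hypothetical, not the actual skops schema):

```python
import base64
import json


def dumps_bytes(data: bytes) -> str:
    # JSON has no bytes type, so encode the payload as base64 text first
    return json.dumps({"__bytes__": base64.b64encode(data).decode("ascii")})


def loads_bytes(payload: str) -> bytes:
    # decode the base64 text back into the original bytes
    obj = json.loads(payload)
    return base64.b64decode(obj["__bytes__"])


blob = b"\x00\x01model bytes\xff"
assert loads_bytes(dumps_bytes(blob)) == blob
```

Base64 keeps the schema valid UTF-8 JSON at the cost of a size increase over the raw binary.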
@adrinjalali I tested your suggestion to add support for
I agree. But I think for now we can do with only supporting bytes/bytearray, and figure out supporting custom formats of 3rd party libraries later. One thing I wonder: should we store bytes in a binary file instead? I feel like it'd make save/load much faster, and keep the schema json file cleaner.
Okay, then let's start with that.
It'll probably depend. In this case, maybe; but the additional I/O is probably a big overhead if we have multiple small byte objects.
I wanted to suggest having a threshold for deciding whether they go in the schema or in the file, but then I realized another thing to think of: if we allow binary data in the json schema, then people can create
I'm not against your proposal; I just want to understand what the concern is. Is it only about readability? You could argue that even if there is a big blob in the json, it's only one line, so it doesn't distract much (it could slow down some editors though).
Slowing down an editor is one thing; loading a massive json into memory is another. I'm not sure how people could exploit this, but I kinda feel like keeping them in separate files is cleaner. Imagine people having images as bytes, and we'd be keeping them in the json file all together. Besides, I don't think there are many cases where we'd encounter small byte arrays. So my guess is that in most cases those byte arrays are large rather than small, and therefore having all of them in files would speed things up on average.
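To make the size argument concrete: base64 expands binary data by roughly a third (4 output characters per 3 input bytes), which is one reason to prefer a separate binary file for large blobs. A small stdlib-only illustration:

```python
import base64

payload = bytes(range(256)) * 4       # 1024 bytes of raw binary data
encoded = base64.b64encode(payload)   # what would land inside the JSON text

# 1024 bytes -> 1368 base64 characters, about a 33% inflation
overhead = len(encoded) / len(payload)
```

For a raw binary side-car file there is no such inflation, at the cost of one extra file per blob.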
@adrinjalali I made the changes as you suggested, please review again. For good measure, I checked whether adding the bytes support is sufficient to make catboost work, and indeed it is, so I added tests for it as well.
    trusted: bool | Sequence[str] = False,
) -> None:
    super().__init__(state, load_context, trusted)
    self.trusted = self._get_trusted(trusted, [])
shouldn't we trust bytes and bytearray?
assert_params_equal(estimator.get_params(), loaded.get_params())


def check_method_outputs(estimator, X, trusted):
isn't this common between these tests and scikit-learn ones? Would be nice to avoid duplication as much as we can here.
It's very close. For my understanding, is the suggestion to put this method into tests/_utils.py and use it in test_persist.py too?
These discrepancies occur regardless of skops, so they're a problem in
xgboost itself. We assume that this has no practical consequences and thus
avoid testing these cases.
is there an upstream issue for this on the xgboost side? If not, it'd be nice to open one and link to it here.
As I commented, I could open an issue, but ideally someone who knows xgboost well could comment on whether it's really a bug or not. If there is no one around, I can just open one anyway.
yeah I think the devs on the xgboost repo can decide if it's a bug which they wanna fix or not.
- Refactor method output testing to be a utility function
- Add bytes and bytearray to trusted types
…jaminBossan/skops into persistence-support-lightgbm-xgboost
@adrinjalali I addressed your comments, please review again. Regarding the issue of forgetting to add bytes/bytearray to trusted: it's a bit of a sign that our test coverage could be improved, right? As is, those oversights were not caught by the standard set of tests we have.
Tests failing @BenjaminBossan
@adrinjalali sorry, I made a last-minute change and rolled it back, introducing the bug. Tests are green now.
adrinjalali
left a comment
Otherwise happy to merge.
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
EDITED
Resolves #226
Description
LightGBM already worked out of the box, so for this, only tests were added. For XGBoost and CatBoost, extra code was added to support them.
Implementation
Dependencies were extended to include xgboost, lightgbm, and catboost for testing, so that they should be covered by CI. If those libraries are not installed, the tests should be gracefully skipped.
For xgboost, the xgboost-specific persistence format was chosen, which can be used through estimator.save_model and estimator.load_model. I could not verify (yet) that this format is 100% secure. However, since it's a binary JSON file (.ubj), it should be as secure as JSON. More information here: https://xgboost.readthedocs.io/en/stable/tutorials/saving_model.html

The actual implementation for xgboost mirrors the one for numpy arrays, which also stores its data in an external file. However, I did not apply any caching here, as it seems it wouldn't be useful.

I decided to create a new test file, test_external.py, for the new tests, instead of cramming more into test_persist.py. Since some functions from test_persist.py were required here, I moved them to a new _utils.py module.

Interestingly, I found some potential bugs where some values returned by get_params() may differ for xgboost models after loading. This is independent of skops. Those edge cases are documented but were excluded from tests, as it appears that they should be inconsequential. We may want to report them.