FIX: Support operators persistence#287
Conversation
Fixes skops-dev#283 Description Some of the operator functions can be quite useful sklearn users, especially in conjunction with FunctionTransformer. This is because it often allows to avoid writing small functions or using lambda, which can't be persisted. However, since some of these functions are actually classes defined in CPython that need special treatment, our code didn't work with them. This fix consists of two parts. On the one hand, functions like methodcaller now have a special dispatch type that correctly knows how to initialize them. On the other hand, we had a problem with some funcs like operator.add. Since they take 2 arguments, it would be useful for them to be passed without initialization in conjunction with partial. However, their type is builtin_function_or_method. By default, they would be handled by object_get_state, but that doesn't work correctly with uninitialized objects. This has now been fixed. Open questions To get the type of operator.add et al, we need builtin_function_or_method. Right now, I just do builtin_function_or_method = type(len). Is there a better way? Regarding operators like attrgetter and methodcaller, could there be a security implication by allowing them? Sure, users can decide not to trust that type, but whether it is trustworthy or not depends on its arguments. Right now, we just tell them that they need to trust _operator.attrgetter, which could seem harmless.
|
Ready for review @skops-dev/maintainers |
adrinjalali
left a comment
There was a problem hiding this comment.
Another alternative is to call __new__ with the param we need.
Do you mean changing the line
in
I tried it and it can be made to work. Here are a list of changes:
Overall, it's the same amount of code but with more indirection, and more to rewrite should we ever get rid of the |
adrinjalali
left a comment
There was a problem hiding this comment.
One issue I noticed, is that operator module includes too much. It sounds like an innocent module, but then it has things like call, setitem, deltem, and inplace manipulators. I wanna make sure those are not supported out of the box, and certainly not as trusted.
BenjaminBossan
left a comment
There was a problem hiding this comment.
One issue I noticed, is that
operatormodule includes too much. It sounds like an innocent module, but then it has things likecall,setitem,delitem, and inplace manipulators. I wanna make sure those are not supported out of the box, and certainly not as trusted.
As is, users need to trust those, we haven't added them to the list of automatically trusted modules. Do you want me to write a test that checks that loading fails if they are not set as trusted by the user?
|
A test to make sure those are not trusted would be nice. Generally, making sure things we know shouldn't be trusted are not trusted is a nice thing. I'm happy to do it here or in a separate PR, as you wish. |
The bug was that the attrs of the operator were stored as is, instead of storing their skops state. This worked in this instance because the attrs were just a list of str, i.e. primitive types. But still we should always store to (and restore from) the skops state. Another unit test was added that tests that operators are not trusted by default.
|
@adrinjalali I added tests as suggested. I also found a somewhat embarrassing bug, not sure if we could have added some checks to prevent this type of bug in the future. |
adrinjalali
left a comment
There was a problem hiding this comment.
Otherwise LGTM.
As for the bug you found and the test for it, it's a good question.
i* operators and for Python 3.11+, call.
adrinjalali
left a comment
There was a problem hiding this comment.
Hope this doesn't blow up 😁
What do you mean? Security-wise? Maybe we can better document that users shouldn't blindly |
Fixes #283
Description
Some of the operator functions can be quite useful sklearn users, especially in conjunction with
FunctionTransformer. This is because it often allows to avoid writing small functions or usinglambda, which can't be persisted. However, since some of these functions are actually classes defined in CPython that need special treatment, our code didn't work with them.This fix consists of two parts.
On the one hand, functions like
methodcallernow have a special dispatch type that knows how to correctly initialize them.On the other hand, we had a problem with some types like
operator.add. Since they take 2 arguments, it would be useful for them to be passed without initialization in conjunction withpartial. However, their type isbuiltin_function_or_method. By default, they would be handled byobject_get_state, but that doesn't work correctly with uninitialized objects. This has now been fixed by dispatching totype_get_state.Open questions
To get the type of
operator.addet al., we needbuiltin_function_or_method. Right now, I just dobuiltin_function_or_method = type(len). Is there a better way?Regarding operators like
attrgetterandmethodcaller, could there be a security implication by allowing them? Sure, users can decide not to trust that type, but whether it is trustworthy or not depends on its arguments. Right now, we just tell them that they need to trust_operator.attrgetter, which could seem harmless. The same argument can of course be made foreval& co, but there, users will be much more likely to recognize their danger.