FIX: Decision Tree Visualization#386
FIX: Decision Tree Visualization#386adrinjalali merged 12 commits intoskops-dev:mainfrom reidjohnson:fix-decisiontree-visualize
Conversation
Return empty if node is a constructor.
Swap LogisticRegression or RandomForestRegressor
| ])), | ||
| ])), | ||
| ("clf", LogisticRegression(random_state=0, solver="liblinear")), | ||
| ("clf", RandomForestRegressor(random_state=0)), |
There was a problem hiding this comment.
Yeah, good point. One challenge here is that I don't think it'll be straightforward without a minor refactor of the pipeline fixture. The simplest approaches seem to be to parameterize the fixture or to use a GridSearchCV. Any preferences or suggestions?
There was a problem hiding this comment.
You could also simply have a separate test only for RandomForestRegressor and RandomForestClsasifier.
There was a problem hiding this comment.
Added separate test for DecisionTreeClassifier and DecisionTreeRegressor.
| if node_name == "constructor": | ||
| return |
There was a problem hiding this comment.
this seems a bit shady, it might skip things that the user should see. Can you show here an example of what's inside this node in our case?
One would also need to update the node_name docstring if we're skipping here.
There was a problem hiding this comment.
Very sympathetic to the concerns. Suspect you all know better here, so would defer to your recommendations.
As far as I am aware, sklearn tree-based models are the only ones that produce a "constructor" object of this type (presumably as an artifact of how skops handles C Extension types).
Taking the example:
from sklearn.tree import DecisionTreeClassifier
from skops.io import dumps, loads, visualize
dumped = dumps(DecisionTreeClassifier().fit([[0, 1], [2, 3], [4, 5]], [0, 1, 2]))
loaded = loads(dumped)
visualize(dumped)The skipped node is <class 'sklearn.tree._tree.Tree'>. Skipping processing of this object, the code outputs:
root: sklearn.tree._classes.DecisionTreeClassifier
└──
attrs: builtins.dict├──
criterion: json-type("gini")├──
splitter: json-type("best")├──
max_depth: json-type(null)├──
min_samples_split: json-type(2)├──
min_samples_leaf: json-type(1)├──
min_weight_fraction_leaf: json-type(0.0)├──
max_features: json-type(null)├──
max_leaf_nodes: json-type(null)├──
random_state: json-type(null)├──
min_impurity_decrease: json-type(0.0)├──
class_weight: json-type(null)├──
ccp_alpha: json-type(0.0)├──
n_features_in_: json-type(2)├──
n_outputs_: json-type(1)├──
classes_: numpy.ndarray├──
n_classes_: numpy.int64├──
max_features_: json-type(2)├──
tree_: sklearn.tree._tree.Tree
│
├──
attrs: builtins.dict
│ │
├──
max_depth: json-type(2)
│ │
├──
node_count: json-type(5)
│ │
├──
nodes: numpy.ndarray
│ │
└──
values: numpy.ndarray
│
├──
args: builtins.tuple
│ │
├──
content: json-type(2)
│ │
├──
content: numpy.ndarray
│ │
└──
content: json-type(1)└──
_sklearn_version: json-type("1.3.0")
There was a problem hiding this comment.
I see. I think the best solution here would be to actually handle type, as in, in the visualized tree print what the type is, rather than skipping it. I'm sure there would be other cases where we'd encounter types as well.
There was a problem hiding this comment.
Gave this a first pass. So if isinstance(node, type), then print information about the type rather than skipping.
adrinjalali
left a comment
There was a problem hiding this comment.
This also needs a changelog.
| is_self_safe=False, | ||
| is_safe=False, |
There was a problem hiding this comment.
here we should defer to the information in the Node instead of assuming it to be unsafe all the time. the Tree constructor for instance, is safe.
There was a problem hiding this comment.
Right, I agree. However the challenge I'm running into is that it's not a valid Node object, so it does not have the necessary attributes:
AttributeError: type object 'sklearn.tree._tree.Tree' has no attribute 'is_self_safe'
If I understand correctly, it's because the constructor class is directly set as a child node here.
There was a problem hiding this comment.
so probably the parent when doing the walk_tree should send this info to the child.
| if "is_self_safe" in kwargs: | ||
| is_self_safe = kwargs["is_self_safe"] | ||
| if "is_safe" in kwargs: | ||
| is_safe = kwargs["is_safe"] |
There was a problem hiding this comment.
I think it'd make sense to pass these as actual args rather than **kwargs, and they'd need to be added to the docstring
adrinjalali
left a comment
There was a problem hiding this comment.
This looks quite nice, thank you @reidjohnson
BenjaminBossan
left a comment
There was a problem hiding this comment.
Hmm, the extension of the walk_tree API to include is_self_safe and is_safe made me skeptical, as it looks like we have a leaky abstraction here. I tried to understand why that is necessary and I think I found a deeper underlying issue.
I think the correct solution right now is that we should not visualize the constructor. It is not an attribute of the object, just a utility class we add to construct it, so its presence is not really needed. Therefore, I think we should actually just skip the node info on the constructor.
This should be perfectly safe with the current code, since we only have two examples where we should have type(node) is type, namely TreeNode and SGDNode, where we ensure that the constructor is indeed safe. However, this may of course change in the future. Therefore, we could do a check that the type is either a Tree or in ALLOWED_SGD_LOSSES.
If we make this change, then we can roll back the changes in the signature of walk_tree and avoid the leaky abstraction.
As a more long term solution, I think we might want to consider something like this: the children of a ReduceNode should not directly contain the constructor or any type. Instead, we should wrap the constructor into a TypeNode, which is capable of determining if the constructor is safe or not.
If we make that change, the walk_tree function would, if it encounters a ReduceNode, just rely on ReduceNode.is_safe, which under the hood calls is_safe on the TypeNode.
WDYT @adrinjalali?
|
@BenjaminBossan Thanks for the comments. Updated with wrapping constructor into a |
adrinjalali
left a comment
There was a problem hiding this comment.
This is much nicer, but we should handle this change in protocol the same way it's done via the files in the io.old folder.
| "constructor": TypeNode( | ||
| { | ||
| "__class__": constructor.__name__, | ||
| "__module__": constructor.__module__, |
There was a problem hiding this comment.
__module__ doesn't always exist, we should use get_module instead.
|
The failing test seems to be that in the latest sklearn version, they added a new parameter:
|
BenjaminBossan
left a comment
There was a problem hiding this comment.
Thanks a lot, this LGTM.
I think the codecov complaints are false positives, so if @adrinjalali approves of this final version, we can merge.
adrinjalali
left a comment
There was a problem hiding this comment.
Thank you so much for the PR @reidjohnson , and thank you for being patient with our reviews.
|
Thank you for the reviews, have been happy to contribute. |
Fixes issue #385.
The root cause of the bug appears to be that the base sklearn tree constructor object is not wrapped as a valid skops
Nodetype, but is still included in the visualized object tree.This PR fixes the issue by appropriately processing this node.