Conversation
sandervh14
left a comment
Hey Sam,
I left some important review comments, but I can't review again later this afternoon, so I hereby approve. I trust that if you think some of my comments require a change, you'll apply them; if you think no changes are required, you can merge.
  ax.xaxis.set_tick_params(labelsize=14)
  plt.setp(ax.get_xticklabels(),
-          rotation=45, ha="right", rotation_mode="anchor")
+          rotation=90, ha="right", rotation_mode="anchor")
Doesn't that make it less readable? Is it because the 45° angle didn't display nicely for a dataset you were evaluating - one with even longer bin labels than I had, for example? Also, with rotation=90, ha="center" might be better.
I reconsidered and put it back to rotation=45. 😀 I did indeed have one plot with a very large number of labels, where vertical rotation made it more or less readable. But I agree to keep it like this, as it makes the typical cases optimally readable.
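For reference, a minimal sketch of the rotation option that was kept (the figure, labels, and data here are made up for illustration; the actual cobra plotting code differs):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.bar(["[0, 10)", "[10, 20)", "[20, 30)"], [5, 3, 7])
ax.xaxis.set_tick_params(labelsize=14)

# rotation=45 with ha="right" anchors each label's right end at its tick,
# which keeps long bin labels readable for typical numbers of bins;
# with rotation=90, ha="center" would center labels under their ticks instead
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
```

With very many bins, rotation=90 trades some readability of individual labels for less horizontal overlap, which is the trade-off discussed above.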
  # Set labels & ticks
  ax2.set_xlabel('{} bins'.format(variable), fontsize=16)
- ax2.xaxis.set_tick_params(rotation=45, labelsize=14)
+ ax2.xaxis.set_tick_params(rotation=90, labelsize=14)
If you modify the X-axis label settings per my comment above, this will also be impacted.
- def _find_next_best_model(self, train_data: pd.DataFrame,
+ def _find_next_best_model(self,
+                           train_data: pd.DataFrame,
Just calling it "basetable" or "basetable_parts" might be better; "train_data" could suggest to people - when they're not scrolling down - that only train data is used for feature selection.
Agreed - I also thought about changing it, but instead I emphasized in the documentation of ForwardFeatureSelection.fit() what the train_data argument should be, and added an assertion to make sure it includes both a train & selection split.
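The assertion described above might look roughly like this (the column name "split", the function name, and the toy data are assumptions for illustration; the actual cobra code may differ):

```python
import pandas as pd

def check_basetable_splits(basetable: pd.DataFrame) -> None:
    """Fail fast if the basetable lacks a train or selection split."""
    splits = set(basetable["split"].unique())
    missing = {"train", "selection"} - splits
    assert not missing, (
        "train_data must contain both a 'train' and a 'selection' split; "
        "missing: {}".format(missing)
    )

# A basetable containing both splits passes the check silently
df = pd.DataFrame({"split": ["train", "selection", "train"],
                   "age": [25, 40, 33]})
check_basetable_splits(df)
```

Failing fast here is what prevents the silent misuse the comment above warns about: a caller passing only train rows gets an immediate AssertionError instead of a quietly wrong feature selection.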
-                   split="train"))
+         .evaluate(sel_data[current_predictors + [pred]],
+                   sel_data[target_column_name],
+                   split="selection"))
It seems that, since sel_data is now passed instead of train_data, split="selection" is redundant: the default argument of evaluate is apparently split=None, and evaluation would then run fine on the selection split alone, since that is the only data split available in sel_data. Though, if you want to leave the split="selection" argument there explicitly for absolute clarity, I can follow that reasoning.
Indeed, I put it there for clarity; that way it also populates the _eval_metrics_by_split model attribute. But your reasoning is correct.
  category_size_threshold: int = 5,
  p_value_threshold: float = 0.001,
  scale_contingency_table: bool = True,
  forced_categories: dict = {}):
I'm also used to this style; is it bad for linters or something?
To be honest, I don't know what the recommended style is. I started doing it at some point, then saw that PyCharm prefers it with spaces, but kept on going. At some point we'll enforce consistent code formatting as put forward in #54.
I'd use what PyCharm suggests; it normally checks PEP 8 rules, and linters do as well. We'll see when we tackle #54.
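For what it's worth, PEP 8 does address this case: when an argument combines an annotation with a default value, spaces around the `=` are recommended (`category_size_threshold: int = 5`), while unannotated defaults take no spaces (`threshold=5`), which matches what PyCharm suggests. A separate caveat in the snippet above is the mutable default `{}`; a common pattern is to default to None instead. A small illustrative sketch (hypothetical function, not the actual cobra code):

```python
def preprocess(category_size_threshold: int = 5,
               forced_categories: dict = None) -> dict:
    # A literal {} default is created once at function definition time and
    # shared across all calls; defaulting to None and creating the dict
    # inside the body avoids state leaking between calls.
    if forced_categories is None:
        forced_categories = {}
    forced_categories.setdefault("other", [])
    return forced_categories

# Each call gets a fresh dict, so no state leaks between calls:
first = preprocess()
second = preprocess()
assert first is not second
```

With `forced_categories: dict = {}`, `first` and `second` would be the very same dict object, and mutations from one call would be visible in the next.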
Set of small (documentation) fixes and (!) a reconfiguration of the forward feature selection such that it selects the model based on the "selection" data split, after training the model on the "train" data split. Could the reviewer (@sandervh14) thoroughly check the commits which mention 'ffs(election)'?
This also now correctly retrieves the model coefficients for linear regression (before, it only took the first element - a confusion arising because for logistic regression the coefficients are an array of arrays).
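The shape difference described above is indeed how scikit-learn behaves: LinearRegression.coef_ is 1-D with shape (n_features,) for a single target, while LogisticRegression.coef_ is always 2-D with shape (n_classes or 1, n_features), so its per-feature coefficients live in coef_[0]. A small sketch on toy data (not the cobra models themselves):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])

lin = LinearRegression().fit(X, np.array([0.0, 1.0, 2.0, 3.0]))
log = LogisticRegression().fit(X, np.array([0, 0, 1, 1]))

# Linear regression: 1-D coefficient array, index it directly
assert lin.coef_.shape == (1,)

# Logistic regression: 2-D array ("an array of arrays"),
# so taking element [0] yields the per-feature coefficients
assert log.coef_.shape == (1, 1)
```

This is why blindly taking `coef_[0]` works for logistic regression but, for linear regression, returns only the first coefficient rather than the whole vector.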