Implement SKLearn interface #272
change signatures of `fit` and `predict` to take arguments that default to attributes
Co-authored-by: Brad Ochocki Szasz <bochocki@mozilla.com>
Co-authored-by: Julio Cezar Moscon <jcmoscon@gmail.com>
def _auto_tuning(
    self, observed_df, segment_settings: SegmentModelSettings
) -> Dict[str, float]:
def _auto_tuning(self, observed_df) -> ProphetForecast:
This is a useful idea. My favorite model tuning fact is that random search is usually more efficient than grid search, especially across higher-dimensional spaces. The KPI Prophet models have been tuned with random search. If this method is computationally intensive or long-running, switching to random search could be a quick improvement.
Yeah I think a big improvement would be to refactor to use a standardized parameter optimization library like Optuna or something so we'd get different methods for free
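A minimal sketch of the random-search idea from the comments above, using only the Python standard library. The objective function, parameter ranges, and helper names are hypothetical stand-ins. The parameter names echo Prophet's `changepoint_prior_scale` and `seasonality_prior_scale`, but no Prophet model is fit here; a real tuner would score each candidate with something like a cross-validated forecast error.

```python
import math
import random


def toy_loss(params):
    """Hypothetical stand-in for a cross-validated forecast error."""
    # The minimum sits at changepoint_prior_scale=0.05, seasonality_prior_scale=1.0.
    return (
        (math.log10(params["changepoint_prior_scale"]) - math.log10(0.05)) ** 2
        + (math.log10(params["seasonality_prior_scale"]) - math.log10(1.0)) ** 2
    )


def random_search(loss, space, n_trials, seed=0):
    """Sample each parameter log-uniformly from its (low, high) range."""
    rng = random.Random(seed)
    best_params, best_loss = None, math.inf
    for _ in range(n_trials):
        candidate = {
            name: 10 ** rng.uniform(math.log10(low), math.log10(high))
            for name, (low, high) in space.items()
        }
        candidate_loss = loss(candidate)
        if candidate_loss < best_loss:
            best_params, best_loss = candidate, candidate_loss
    return best_params, best_loss


# Assumed ranges, chosen only for illustration.
search_space = {
    "changepoint_prior_scale": (0.001, 0.5),
    "seasonality_prior_scale": (0.01, 10.0),
}
best_params, best_loss = random_search(toy_loss, search_space, n_trials=50)
```

Unlike a grid, every trial here draws a fresh value for each parameter, so the trial budget is not spent re-testing the same few values of a parameter that turns out not to matter.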
if self.holidays == []:
    self.holidays = None
    self.holidays_raw = None
elif not self.holidays:
    self.holidays_raw = None
What are these different conditionals checking for? The first one is clearly checking for an empty list. Is the second one checking if self.holidays is None?
The idea was that it's possible to have an empty list for holidays in the config, and in that case things would get screwed up in the elif case. So this ensures that an empty list is treated the same as just leaving it out of the config.
The whole holidays_raw and holidays thing is annoying. It comes down to the fact we have an intermediate class to parse the holidays in the config into the dataframe (oh prophet...) that is expected. And if I want to easily create new prophet classes it's easier to distinguish between the "raw" non-df holiday format and the 'not raw' df format that can be passed to prophet under the holidays keyword argument. It's annoying but I couldn't think of a cleaner way.
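To make the two branches concrete, here is a small standalone sketch of the same checks on plain values. `normalize_holidays` is a hypothetical helper, not a function from this PR; it only mirrors the conditionals quoted above so the behavior for `[]`, `None`, and a non-empty list can be compared side by side.

```python
def normalize_holidays(holidays, holidays_raw):
    """Mirror the quoted conditionals on plain values (hypothetical helper)."""
    if holidays == []:
        # An empty list in the config is treated like a missing entry.
        holidays = None
        holidays_raw = None
    elif not holidays:
        # Reached for other falsy values, most notably None.
        holidays_raw = None
    return holidays, holidays_raw


# Made-up inputs for illustration only.
empty_case = normalize_holidays([], "raw")
missing_case = normalize_holidays(None, "raw")
populated_case = normalize_holidays(["new_year"], "raw")
```

In this sketch both the empty list and `None` end up as `(None, None)`, while a non-empty list passes through unchanged.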
if self.growth == "logistic":
    self.logistic_growth_floor = observed_df["y"].min() * 0.5
    observed_df["floor"] = self.logistic_growth_floor
    self.logistic_growth_cap = observed_df["y"].max() * 1.5
    observed_df["cap"] = self.logistic_growth_cap
Is this used anywhere? These floor/cap scalars seem quite arbitrary, and I'd be hesitant to use them as-is. The way this is parameterized, the scalars also can't be changed.
To expand on this a little: logistic growth is convenient for some models because it enables exponential growth at values near the floor and saturating growth at values near the cap. For intermediate regions, the growth rate is roughly linear.
If the distance between the bounds is too large, you lose the dynamic benefits of a logistic curve and most of the growth will be modeled as roughly linear.
Yeah it is used in the funnel forecast
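The point about bound width can be illustrated with a short sketch. It assumes the standard logistic curve between a floor and a cap. The function names, the `floor_scale` and `cap_scale` parameters, and the sample values are hypothetical; they only show one way the 0.5 and 1.5 multipliers could be exposed as arguments instead of being fixed in the method body.

```python
import math


def logistic_bounds(y_values, floor_scale=0.5, cap_scale=1.5):
    """Derive floor and cap from observed values with adjustable multipliers."""
    return min(y_values) * floor_scale, max(y_values) * cap_scale


def logistic_trend(t, floor, cap, rate=1.0, midpoint=0.0):
    """Standard logistic curve that saturates at `floor` and `cap`."""
    return floor + (cap - floor) / (1.0 + math.exp(-rate * (t - midpoint)))


observed = [100.0, 120.0, 150.0, 200.0]  # made-up values for illustration

# Defaults reproduce the multipliers in the quoted code.
floor, cap = logistic_bounds(observed)

# Tighter bounds keep more of the observed range in the curved regions.
tight_floor, tight_cap = logistic_bounds(observed, floor_scale=0.9, cap_scale=1.1)

# The curve passes through the midpoint between floor and cap at t == midpoint.
mid_value = logistic_trend(0.0, floor, cap)
```

With the default multipliers the observed values (100 to 200) sit well inside the 50 to 300 band, which is the situation the comment above describes: most of the data falls in the roughly linear middle of the curve.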
bochocki left a comment
I’ve left some comments. Overall, this looks good. However, I am slightly concerned that the code complexity seems to have increased significantly, which might make it harder to maintain without a dedicated developer. It would be helpful to consider ways to simplify or document the code further for ease of future maintenance.
From the notebook, it seems like the KPI forecasts remain the same before and after these changes. If that’s confirmed, I think we’re in a good position to move forward with merging once the outstanding comments are addressed.
WRT code complexity: Yeah, that is the definite downside to trying to "promote" models with segments so they'd be easier to use. I can take another pass at documenting/commenting so it's easier to work with, and can brainstorm ways to clean it up. We could also meet to try and come up with something if you think that'd be useful.
bochocki left a comment
Thanks for the quick turnaround! I don't have any good ideas for reducing the code complexity right now, but maybe it's something we can keep in the back of our minds to improve slowly over time.
Another thing I want to look into is trying to use Darts (https://unit8co.github.io/darts/), which might eliminate a lot of the wrapper code around Prophet, and maybe some of the stuff for handling data too.
Darts does look neat! I tried to evaluate it as part of the KPI model selection exercise that we used to decide on Prophet, but at the time they didn't have M1 support and that was enough of a blocker for local development that I didn't explore it further.
Changes:
- `kpi_forecasting.py` before passing data to model classes
- `BaseEnsembleForecast`, created to deal with segmented models like FunnelForecast used to implement
- `ProphetAutotunerForecast`, created to implement automated hyperparameter tuning
- `FunnelForecast` recreated as a `BaseEnsembleForecast` that uses a `ProphetAutotunerForecast` as the base model
- `summarize` and `write_` functions, along with all the functions called within them, moved outside of classes
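As a rough illustration of the structure described in the change list, the sketch below shows a scikit-learn-style `fit`/`predict` pair on a base model, plus an ensemble that fits one copy of the base model per segment. All class names, the `country` key, and the mean-based model are hypothetical simplifications. They are not the PR's `BaseEnsembleForecast` or `ProphetAutotunerForecast`, and plain dictionaries stand in for dataframes to keep the example dependency-free.

```python
import copy
from collections import defaultdict


class MeanForecast:
    """Toy base model: predicts the mean of the training values."""

    def fit(self, rows):
        values = [row["y"] for row in rows]
        self.mean_ = sum(values) / len(values)
        return self  # returning self follows the scikit-learn convention

    def predict(self, rows):
        return [self.mean_ for _ in rows]


class SegmentEnsemble:
    """Fits an independent copy of `base_model` for each segment value."""

    def __init__(self, base_model, segment_key):
        self.base_model = base_model
        self.segment_key = segment_key

    def fit(self, rows):
        grouped = defaultdict(list)
        for row in rows:
            grouped[row[self.segment_key]].append(row)
        # Deep-copy the unfitted base model so segments do not share state.
        self.models_ = {
            segment: copy.deepcopy(self.base_model).fit(segment_rows)
            for segment, segment_rows in grouped.items()
        }
        return self

    def predict(self, rows):
        # Route each row to the model fitted on its segment.
        return [
            self.models_[row[self.segment_key]].predict([row])[0] for row in rows
        ]


# Made-up training data for illustration only.
train = [
    {"country": "US", "y": 10.0},
    {"country": "US", "y": 20.0},
    {"country": "DE", "y": 4.0},
]
ensemble = SegmentEnsemble(MeanForecast(), segment_key="country").fit(train)
predictions = ensemble.predict([{"country": "DE"}, {"country": "US"}])
```

Because the ensemble only relies on `fit` and `predict`, any base model with the same two methods could be swapped in without changing the segment-handling code.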