Implement SKLearn interface by jaredsnyder · Pull Request #272 · mozilla/docker-etl

jaredsnyder · 2024-08-23T13:08:47Z

Changes:

data pull via metric hub removed from all classes
start_date and end_date attributes removed from all classes, time filtering will now happen in kpi_forecasting.py before passing data to model classes
New class, BaseEnsembleForecast, created to handle segmented models like the one FunnelForecast used to implement
New class, ProphetAutotunerForecast, created to implement automated hyperparameter tuning
FunnelForecast recreated as a BaseEnsembleForecast that uses a ProphetAutotunerForecast as the base model
summarize and write_functions, along with all the functions called within them, moved outside of classes
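
To illustrate the ensemble idea described in the list above, here is a minimal sketch of fitting one copy of a base model per segment behind an sklearn-style `fit`/`predict` interface. It is not the code in this PR: `MeanForecast`, `SegmentedEnsemble`, the `segment_key` parameter, and the list-of-dicts data format are all hypothetical stand-ins chosen to keep the example dependency-free.

```python
from collections import defaultdict
from copy import deepcopy
from typing import Dict, Hashable, List


class MeanForecast:
    """Hypothetical stand-in base model with an sklearn-style interface."""

    def fit(self, rows: List[dict]) -> "MeanForecast":
        values = [row["y"] for row in rows]
        self.mean_ = sum(values) / len(values)
        return self

    def predict(self, rows: List[dict]) -> List[float]:
        # Predict the training mean for every requested row.
        return [self.mean_ for _ in rows]


class SegmentedEnsemble:
    """Fits an independent copy of the base model for each segment value."""

    def __init__(self, base_model, segment_key: str):
        self.base_model = base_model
        self.segment_key = segment_key

    def fit(self, rows: List[dict]) -> "SegmentedEnsemble":
        # Group the observed rows by their segment value.
        grouped: Dict[Hashable, List[dict]] = defaultdict(list)
        for row in rows:
            grouped[row[self.segment_key]].append(row)
        # Deep-copy the base model so segments never share fitted state.
        self.models_ = {
            segment: deepcopy(self.base_model).fit(segment_rows)
            for segment, segment_rows in grouped.items()
        }
        return self

    def predict(self, rows: List[dict]) -> List[float]:
        # Route each row to the model fitted for its segment.
        return [
            self.models_[row[self.segment_key]].predict([row])[0] for row in rows
        ]
```

Under these assumptions, swapping the base model (for example, for a tuned Prophet wrapper) would not require changes to the ensemble class, which is the appeal of keeping both behind the same two methods.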

Checklist for reviewer:

Commits should reference a bug or github issue, if relevant (if a bug is referenced, the pull request should include the bug number in the title)
Scan the PR and verify that no changes (particularly to .circleci/config.yml) will cause environment variables (particularly credentials) to be exposed in test logs
Ensure the container image will be using permissions granted to telemetry-airflow responsibly.


bochocki · 2024-09-25T23:18:17Z

-    def _auto_tuning(
-        self, observed_df, segment_settings: SegmentModelSettings
-    ) -> Dict[str, float]:
+    def _auto_tuning(self, observed_df) -> ProphetForecast:


This is a useful idea. My favorite model tuning fact is that random search is usually more efficient than grid search, especially across higher-dimensional spaces. The KPI Prophet models have been tuned with random search. If this method is computationally intensive or long-running, switching to random search could be a quick improvement.

A concise explanation on Stack Exchange

Figure 11.2 in Deep Learning

Yeah I think a big improvement would be to refactor to use a standardized parameter optimization library like Optuna or something so we'd get different methods for free
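
A small sketch of why random search tends to cover a space better than a grid for the same budget: with nine trials, a 3x3 grid only ever tries three distinct values per parameter, while nine random draws try nine. The helper names and the ranges below are illustrative assumptions, not values used for the KPI models; `changepoint_prior_scale` and `seasonality_prior_scale` are Prophet's own parameter names.

```python
import itertools
import random
from typing import Dict, List, Sequence, Tuple


def grid_candidates(space: Dict[str, Sequence[float]]) -> List[Dict[str, float]]:
    """Every combination of the listed values (grid search)."""
    names = list(space)
    combos = itertools.product(*(space[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def random_candidates(
    bounds: Dict[str, Tuple[float, float]], n_trials: int, seed: int = 0
) -> List[Dict[str, float]]:
    """Independent uniform draws within each parameter's bounds (random search)."""
    rng = random.Random(seed)
    return [
        {name: rng.uniform(low, high) for name, (low, high) in bounds.items()}
        for _ in range(n_trials)
    ]


# Illustrative ranges only (assumed for this example).
grid = grid_candidates(
    {
        "changepoint_prior_scale": [0.001, 0.05, 0.5],
        "seasonality_prior_scale": [0.01, 1.0, 10.0],
    }
)
rand = random_candidates(
    {
        "changepoint_prior_scale": (0.001, 0.5),
        "seasonality_prior_scale": (0.01, 10.0),
    },
    n_trials=len(grid),
)

# Same number of model fits, but far more distinct values tried per parameter.
grid_distinct = len({c["changepoint_prior_scale"] for c in grid})
rand_distinct = len({c["changepoint_prior_scale"] for c in rand})
```

If only one of the two parameters matters much, the grid spends two thirds of its fits re-testing values it has already seen for that parameter. A library such as Optuna would replace both helpers.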

bochocki · 2024-09-25T23:33:03Z

+        if self.holidays == []:
+            self.holidays = None
+            self.holidays_raw = None
+        elif not self.holidays:
+            self.holidays_raw = None


What are these different conditionals checking for? The first one is clearly checking for an empty list. Is the second one checking if self.holidays is None?

The idea was that it's possible to have an empty list for holidays in the config, and in that case things would get screwed up in the elif case. So this ensures that an empty list is treated the same as just leaving it out of the config.

The whole holidays_raw and holidays thing is annoying. It comes down to the fact we have an intermediate class to parse the holidays in the config into the dataframe (oh prophet...) that is expected. And if I want to easily create new prophet classes it's easier to distinguish between the "raw" non-df holiday format and the 'not raw' df format that can be passed to prophet under the holidays keyword argument. It's annoying but I couldn't think of a cleaner way.
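
One way to make the intent of the two conditionals explicit is a single truthiness check, sketched below. This assumes that at this point `holidays` is only ever `None` or a plain list of holiday names; it would not be safe for a DataFrame, whose truthiness is ambiguous. `normalize_holidays` is a hypothetical helper, not a function in this PR.

```python
from typing import List, Optional


def normalize_holidays(holidays: Optional[List[str]]) -> Optional[List[str]]:
    """Treat an empty list the same as a missing config entry."""
    # Both None and [] are falsy, so one check covers both cases.
    return holidays if holidays else None
```

With this helper, a config that sets `holidays: []` and a config that omits the key would both end up as `None` before any holiday parsing happens.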

bochocki · 2024-09-25T23:39:21Z

+        if self.growth == "logistic":
+            self.logistic_growth_floor = observed_df["y"].min() * 0.5
+            observed_df["floor"] = self.logistic_growth_floor
+            self.logistic_growth_cap = observed_df["y"].max() * 1.5
+            observed_df["cap"] = self.logistic_growth_cap


Is this used anywhere? These floor/cap scalars seem quite arbitrary, I'd be hesitant to use the scalars as-is, and the way this is parameterized the scalars aren't changeable.

To expand on this a little: logistic growth is convenient for some models because it enables exponential growth at values near the floor and saturating growth at values near the cap. For intermediate regions, the growth rate is roughly linear.

If the distance between the bounds is too large, you lose the dynamic benefits of a logistic curve and most of the growth will be modeled as roughly linear.
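
A rough numeric illustration of the point about bounds, using a generic logistic curve rather than Prophet's internal piecewise parameterization. The function names, `rate`, and `midpoint` are assumptions made for this sketch.

```python
import math


def logistic(t: float, floor: float, cap: float,
             rate: float = 1.0, midpoint: float = 0.0) -> float:
    """Generic logistic curve that saturates at `floor` and `cap`."""
    return floor + (cap - floor) / (1.0 + math.exp(-rate * (t - midpoint)))


def logistic_slope(t: float, floor: float, cap: float,
                   rate: float = 1.0, midpoint: float = 0.0) -> float:
    """Derivative of the curve: small near either bound, largest midway."""
    y = logistic(t, floor, cap, rate, midpoint)
    return rate * (y - floor) * (cap - y) / (cap - floor)


# Slope near the floor, at the midpoint, and near the cap.
slopes = [logistic_slope(t, floor=50.0, cap=150.0) for t in (-6.0, 0.0, 6.0)]
```

The slope is close to zero at both ends and peaks in the middle, where it changes slowly. If the observed data only ever occupy that middle stretch because the bounds sit far away, the fitted trend is hard to distinguish from a linear one.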

Yeah it is used in the funnel forecast
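
If the scalars stay, one option is to expose them as arguments with the current values as defaults, so they can be overridden per model. This is a sketch only: `logistic_bounds`, `floor_scale`, and `cap_scale` are hypothetical names, and the input is a plain sequence rather than the `observed_df["y"]` column used in the diff.

```python
from typing import Sequence, Tuple


def logistic_bounds(
    y: Sequence[float], floor_scale: float = 0.5, cap_scale: float = 1.5
) -> Tuple[float, float]:
    """Derive floor and cap from the observed range using adjustable scalars."""
    if not y:
        raise ValueError("y must contain at least one observation")
    floor = min(y) * floor_scale
    cap = max(y) * cap_scale
    # Scaling can invert the bounds, e.g. when all observations are negative.
    if floor >= cap:
        raise ValueError(f"floor ({floor}) must be below cap ({cap})")
    return floor, cap
```

The defaults reproduce the 0.5 and 1.5 multipliers from the diff, so existing behavior would be unchanged unless a caller passes tighter scalars.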

bochocki

I’ve left some comments; overall, this looks good. However, I am slightly concerned that the code complexity seems to have increased significantly. This might make it harder to maintain without a dedicated developer. It would be helpful to consider ways to simplify or document the code further for ease of future maintenance.

From the notebook, it seems like the KPI forecasts remain the same before and after these changes. If that’s confirmed, I think we’re in a good position to move forward with merging once the outstanding comments are addressed.


jaredsnyder · 2024-09-26T15:12:14Z

WRT code complexity: Yeah that is the definite downside to trying to "promote" models with segments so they'd be easier to use. I can take another pass at documenting/commenting so it's easier to work with, and can brainstorm ways to clean it up. We could also meet to try and come up with something if you think that'd be useful

bochocki

Thanks for the quick turnaround! I don't have any good ideas for reducing the code complexity right now, but maybe it's something we can keep in the back of our minds to improve slowly over time.

jaredsnyder · 2024-09-26T16:21:25Z

Another thing I want to look into is trying to use Darts (https://unit8co.github.io/darts/), which might eliminate a lot of the wrapper code around Prophet, and maybe some of the stuff for handling data too.

bochocki · 2024-09-26T16:45:39Z

Darts does look neat! I tried to evaluate it as part of the KPI model selection exercise that we used to decide on prophet, but at the time they didn't have M1 support and that was enough of a blocker for local development that I didn't explore it further.

This reverts commit b5740d8.

This reverts commit b5740d8. (cherry picked from commit 73e76df)

This reverts commit 69d120a.

jaredsnyder and others added 30 commits July 23, 2024 14:41

refactored base_forecast and prophet_forecast to enable easier testing

2a60eef

Apply suggestions from code review

340fabf

change signatures of `fit` and `predict` to take arguments that default to attributes Co-authored-by: Brad Ochocki Szasz <bochocki@mozilla.com>

add test for fit

6c7d3f2

revert signatures

38e721d

made timezone-aware stamps naive

9b17337

finished base_forecast tests

90a822e

added tests for prophet class

72fabef

linting

1ece1dd

fixed divide by zero

606e2e4

linting again

585f2ca

adding tests to funnel_forecast

97bd46c

Merge branch 'main' into kpi_forecasting_funnel_unit_tests

0e0ea91

added tests for funnel_forecast

c35247d

Merge branch 'main' into kpi_forecasting_funnel_unit_tests

e54d2c3

feat(workday):remove unwanted fields (#249)

6ab0527

Co-authored-by: Julio Cezar Moscon <jcmoscon@gmail.com>

fix(exit):Added sys.exit() call (#250)

07e5388

Co-authored-by: Julio Cezar Moscon <jcmoscon@gmail.com>

fix issue with call to _get_crossvalidation_metric

b102a7a

fixed type check

0726287

Merge branch 'main' into kpi_forecasting_funnel_unit_tests

65f8e27

added string case to aggregate_to_period and added tests

d8db825

merge main

6b6dac6

update

2358ee3

revert file

83aa229

added more tests to prophet_forecast

d5a0e63

removed DotMap

b3edd10

modified README to make it match better between FunnelForecast and Pr…

fd1435b

…ophetForecast

Update jobs/kpi-forecasting/kpi_forecasting/models/base_forecast.py

f551f4c

Co-authored-by: Brad Ochocki Szasz <bochocki@mozilla.com>

Brad easy fixes

1a63912

remove magic year

6a8c90c

removed DotMap

963a116