Develop and compare parametric and machine learning models to explain and predict prescription drug sales volume. Estimate marketing response (elasticity, marginal effects) for Detailing and DTCA spending, and produce actionable managerial recommendations.
| Column | Type | Description |
|---|---|---|
| Class | Categorical | Therapeutic drug class |
| Agent | Categorical | Individual drug / brand |
| Year | Integer | Observation year |
| Detailing | Numeric | Sales force activity (visits or spend index) |
| DTCA | Numeric | Direct-to-Consumer Advertising spend |
| Actual Sales | Numeric | Target variable – prescription drug sales volume |
File: data/MMM_Drug Data.xlsx
The dataset has a panel structure: each row is one Agent × Year combination. Multiple agents belong to the same therapeutic class.
- Temporal split (leakage-free): earlier years → training; most recent year(s) → test.
- No random shuffling – this is time-series / panel data.
- Log(1 + x) transforms for Sales, Detailing, DTCA (handles right skew and zeros).
- Year trend variable and year-fixed-effect dummies.
- Interaction terms: Detailing × DTCA, Class × Detailing, Class × DTCA.
- 1-year lagged sales (per agent, computed before split to prevent leakage).
| Model | Key feature |
|---|---|
| P1 OLS Baseline | Raw-scale benchmark |
| P2 Log-Log | Direct elasticity interpretation |
| P3 Log-Log + Class FE | Controls for class heterogeneity |
| P4 Log-Log + Two-Way FE | Preferred spec – controls class & year shocks |
| P5 + Interactions | Class-specific marketing slopes |
| P6 Ridge | Regularised for robustness |
| P7 Lasso | Variable selection |
Random Forest, Gradient Boosting, XGBoost, LightGBM, Elastic Net. All trained with time-series cross-validation to prevent temporal leakage.
R², RMSE, MAE, MAPE. AIC/BIC for OLS models.
- Elasticities from log-log OLS (direct coefficient read-off).
- Marginal effects at sample means.
- Response curves (diminishing returns check).
- Class-level heterogeneity in marketing response.
- Approximate ROI / marginal response proxy.
data/MMM_Drug Data.xlsx
python -m venv venv
# Windows
venv\Scripts\activate
# macOS / Linux
source venv/bin/activatepip install -r requirements.txtpython main.pyAll outputs are saved to /outputs (intermediate) and /report_assets (report-ready).
jupyter lab notebooks/831/
├── data/
│ └── MMM_Drug Data.xlsx ← place your data file here
├── notebooks/
│ └── exploration.ipynb ← interactive EDA (optional)
├── src/
│ ├── __init__.py
│ ├── utils.py ← paths, constants, shared helpers
│ ├── data_loader.py ← load Excel, validate, inspect
│ ├── preprocessing.py ← train/test split, categorical encoding
│ ├── eda.py ← EDA charts and summary stats
│ ├── feature_engineering.py ← log transforms, interactions, lags
│ ├── parametric_models.py ← OLS, log-log, FE, regularised
│ ├── ml_models.py ← RF, GBM, XGBoost, LightGBM, ElasticNet
│ ├── evaluation.py ← metrics, comparison table, residual plots
│ ├── interpretation.py ← elasticities, marginal effects, PDPs
│ └── reporting.py ← polished outputs, memo, markdown draft
├── outputs/ ← diagnostic charts and intermediate CSVs
├── report_assets/ ← report-ready charts, tables, and text
├── main.py ← end-to-end pipeline runner
├── requirements.txt
└── README.md
-
Log-log functional form is appropriate (justified by EDA – right-skewed distributions and roughly linear log-log scatter plots). The
log1ptransform is used to handle zero spend values safely. -
Temporal train/test split is the only leakage-safe strategy for panel data with a time dimension. Random k-fold is explicitly avoided.
-
Class fixed effects are included to absorb baseline sales differences across therapeutic categories. This is equivalent to a within-class comparison.
-
Year fixed effects absorb macro-level demand shocks common to all agents in a given year (e.g., generic entry, regulatory changes, market expansion).
-
Agent identity is not one-hot encoded by default because the number of agents may be large relative to the number of observations, risking overfitting. Agent-level information is captured via optional target encoding (
agent_fe). -
Endogeneity: firms may increase detailing and DTCA spending in markets or for drugs where they expect higher demand (reverse causality). The OLS estimates should be interpreted as associations, not causal effects, unless a valid instrument is available.
-
Elasticity interpretation from OLS is valid under the assumption that the log-log specification correctly describes the true data-generating process.
-
Positive elasticity on log(Detailing): a 1% increase in detailing is associated with approximately
β%increase in sales, holding class, year, and DTCA constant. -
Diminishing returns: if the response curve is concave or the polynomial term on log(Detailing)² has a negative coefficient, diminishing returns are present.
-
Interaction Detailing × DTCA: a positive coefficient means the two channels are complementary (each enhances the effect of the other). A negative coefficient means they are substitutes (one channel is more effective when the other is low).
-
AIC/BIC: prefer the model with lower AIC/BIC when test set R² is similar.
- Verify EDA charts in
/outputs/eda_*.pngmatch the dataset. - Check that train/test year split printed at runtime is sensible.
- Review
/report_assets/key_findings.txtand fill in [placeholders]. - Edit
/report_assets/report_draft.mdwith actual results and context. - Confirm elasticity signs are economically plausible (positive, typically < 1).
- If any marketing variable is insignificant, discuss in the Limitations section.
- Check residual plots for patterns (heteroskedasticity, outliers).
- Align the chosen "best model" recommendation with your group's written argument.
See requirements.txt. Core: pandas, numpy, statsmodels, scikit-learn,
matplotlib, seaborn, xgboost, lightgbm, shap.
If xgboost, lightgbm, or shap are not installed, the pipeline gracefully skips
those steps with a warning rather than crashing.