Skip to content

HanSun103/AIOS-SDI

Repository files navigation

Marketing Mix Model – Prescription Drug Sales

MMAI 831 | Group Assignment


Project Objective

Develop and compare parametric and machine learning models to explain and predict prescription drug sales volume. Estimate marketing response (elasticity, marginal effects) for Detailing and DTCA spending, and produce actionable managerial recommendations.


Dataset Description

Column Type Description
Class Categorical Therapeutic drug class
Agent Categorical Individual drug / brand
Year Integer Observation year
Detailing Numeric Sales force activity (visits or spend index)
DTCA Numeric Direct-to-Consumer Advertising spend
Actual Sales Numeric Target variable – prescription drug sales volume

File: data/MMM_Drug Data.xlsx

The dataset has a panel structure: each row is one Agent × Year combination. Multiple agents belong to the same therapeutic class.


Methodology

Train/Test Split

  • Temporal split (leakage-free): earlier years → training; most recent year(s) → test.
  • No random shuffling – this is time-series / panel data.

Feature Engineering

  • Log(1 + x) transforms for Sales, Detailing, DTCA (handles right skew and zeros).
  • Year trend variable and year-fixed-effect dummies.
  • Interaction terms: Detailing × DTCA, Class × Detailing, Class × DTCA.
  • 1-year lagged sales (per agent, computed before split to prevent leakage).

Parametric Models (P1–P7)

Model Key feature
P1 OLS Baseline Raw-scale benchmark
P2 Log-Log Direct elasticity interpretation
P3 Log-Log + Class FE Controls for class heterogeneity
P4 Log-Log + Two-Way FE Preferred spec – controls class & year shocks
P5 + Interactions Class-specific marketing slopes
P6 Ridge Regularised for robustness
P7 Lasso Variable selection

Machine Learning Models (M1–M5)

Random Forest, Gradient Boosting, XGBoost, LightGBM, Elastic Net. All trained with time-series cross-validation to prevent temporal leakage.

Evaluation Metrics

R², RMSE, MAE, MAPE. AIC/BIC for OLS models.

Business Interpretation

  • Elasticities from log-log OLS (direct coefficient read-off).
  • Marginal effects at sample means.
  • Response curves (diminishing returns check).
  • Class-level heterogeneity in marketing response.
  • Approximate ROI / marginal response proxy.

How to Run

1. Place the data file

data/MMM_Drug Data.xlsx

2. Create a virtual environment (recommended)

python -m venv venv
# Windows
venv\Scripts\activate
# macOS / Linux
source venv/bin/activate

3. Install dependencies

pip install -r requirements.txt

4. Run the full pipeline

python main.py

All outputs are saved to /outputs (intermediate) and /report_assets (report-ready).

5. Optional: explore in Jupyter

jupyter lab notebooks/

Folder Structure

831/
├── data/
│   └── MMM_Drug Data.xlsx        ← place your data file here
├── notebooks/
│   └── exploration.ipynb         ← interactive EDA (optional)
├── src/
│   ├── __init__.py
│   ├── utils.py                  ← paths, constants, shared helpers
│   ├── data_loader.py            ← load Excel, validate, inspect
│   ├── preprocessing.py          ← train/test split, categorical encoding
│   ├── eda.py                    ← EDA charts and summary stats
│   ├── feature_engineering.py    ← log transforms, interactions, lags
│   ├── parametric_models.py      ← OLS, log-log, FE, regularised
│   ├── ml_models.py              ← RF, GBM, XGBoost, LightGBM, ElasticNet
│   ├── evaluation.py             ← metrics, comparison table, residual plots
│   ├── interpretation.py         ← elasticities, marginal effects, PDPs
│   └── reporting.py              ← polished outputs, memo, markdown draft
├── outputs/                      ← diagnostic charts and intermediate CSVs
├── report_assets/                ← report-ready charts, tables, and text
├── main.py                       ← end-to-end pipeline runner
├── requirements.txt
└── README.md

Key Assumptions

  1. Log-log functional form is appropriate (justified by EDA – right-skewed distributions and roughly linear log-log scatter plots). The log1p transform is used to handle zero spend values safely.

  2. Temporal train/test split is the only leakage-safe strategy for panel data with a time dimension. Random k-fold is explicitly avoided.

  3. Class fixed effects are included to absorb baseline sales differences across therapeutic categories. This is equivalent to a within-class comparison.

  4. Year fixed effects absorb macro-level demand shocks common to all agents in a given year (e.g., generic entry, regulatory changes, market expansion).

  5. Agent identity is not one-hot encoded by default because the number of agents may be large relative to the number of observations, risking overfitting. Agent-level information is captured via optional target encoding (agent_fe).

  6. Endogeneity: firms may increase detailing and DTCA spending in markets or for drugs where they expect higher demand (reverse causality). The OLS estimates should be interpreted as associations, not causal effects, unless a valid instrument is available.

  7. Elasticity interpretation from OLS is valid under the assumption that the log-log specification correctly describes the true data-generating process.


Interpretation Notes

  • Positive elasticity on log(Detailing): a 1% increase in detailing is associated with approximately β% increase in sales, holding class, year, and DTCA constant.

  • Diminishing returns: if the response curve is concave or the polynomial term on log(Detailing)² has a negative coefficient, diminishing returns are present.

  • Interaction Detailing × DTCA: a positive coefficient means the two channels are complementary (each enhances the effect of the other). A negative coefficient means they are substitutes (one channel is more effective when the other is low).

  • AIC/BIC: prefer the model with lower AIC/BIC when test set R² is similar.


Manual Review Checklist (⚠ Before Submission)

  • Verify EDA charts in /outputs/eda_*.png match the dataset.
  • Check that train/test year split printed at runtime is sensible.
  • Review /report_assets/key_findings.txt and fill in [placeholders].
  • Edit /report_assets/report_draft.md with actual results and context.
  • Confirm elasticity signs are economically plausible (positive, typically < 1).
  • If any marketing variable is insignificant, discuss in the Limitations section.
  • Check residual plots for patterns (heteroskedasticity, outliers).
  • Align the chosen "best model" recommendation with your group's written argument.

Dependencies

See requirements.txt. Core: pandas, numpy, statsmodels, scikit-learn, matplotlib, seaborn, xgboost, lightgbm, shap.

If xgboost, lightgbm, or shap are not installed, the pipeline gracefully skips those steps with a warning rather than crashing.

About

MMMs-Develop and compare parametric and machine learning models to explain and predict prescription drug sales volume. Estimate marketing response (elasticity, marginal effects) for Detailing and DTCA spending, and produce actionable managerial recommendations.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors