Predicting absolute 21-day forward returns on S&P 500 stocks with macro
features, and comparing the result head-to-head with cross-sectional ranking
in ml-cross-sectional.
The feature pipeline is independent from ml-cross-sectional. Raw data sources and low-level indicators overlap, but the feature matrix is rebuilt from scratch: cross-sectional z-scoring is removed, macro / beta / sector features are added, and the target is the continuous 21-day return rather than a cross-sectional rank.
Repo 2 proved that cross-sectional ranking ("which 100 will lead and which 100 will lag?") can be learned from price-volume features alone, because the relative comparison nets out market direction. This repo asks the harder version: can the same features predict the absolute forward return? That requires the model to know where the market is going, not just which stocks beat the median, so the feature design has to change — and, as notebooks 04 and 05 document, the answer is "barely, and with worse portfolio-construction properties than the ranker."
| Model | OOS MAE | OOS Pearson |
|---|---|---|
| linear_ridge | 0.079 | +0.09 |
| linear_lasso | 0.078 | +0.10 |
| lgbm_regressor | 0.089 | +0.13 |
| xgb_regressor | 0.092 | +0.11 |
| hist_mean | 0.074 | +0.00 |
Full OOS 2020–2024, all S&P 500 names. hist_mean is a per-symbol
training-period mean — note that it wins on MAE. Any learned model
must be judged on Pearson / IC, not MAE, because the target is heavy-tailed
enough that a constant prediction scores a deceptively low absolute error.
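A toy illustration of why the constant baseline wins on MAE while carrying no information — synthetic heavy-tailed "returns", not the repo's pipeline; all numbers below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Heavy-tailed "21-day return" target: mostly small moves, occasional blowups.
y = rng.standard_t(df=3, size=10_000) * 0.04

# A noisy but genuinely informative signal.
signal = 0.5 * y + rng.normal(0, 0.08, size=y.shape)

const_pred = np.full_like(y, np.median(y))        # hist_mean-style baseline
mae_const = np.mean(np.abs(y - const_pred))
mae_model = np.mean(np.abs(y - signal))
ic_model = np.corrcoef(signal, y)[0, 1]           # the constant's IC is 0 by definition

# The constant wins on MAE; the informative signal wins on correlation.
print(mae_const < mae_model, ic_model > 0.2)      # True True
```

The same inversion drives the table above: judge learned models on IC, not MAE.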
vs. ranking (notebook 05): when we pick the top-20 stocks each month
from xgb_regressor and from Repo 2's xgb_ranker (same model family on
both sides, to isolate target formulation), the baskets overlap only
≈ 0.19 by Jaccard on average over 60 rebalances — roughly 6–7 shared
names out of 20. The regressor's top-20 takes a deeper 2022 drawdown
(−29.6% vs −22.7% intra-year), but over the full 2020–2024 window the
two equity curves alternate leadership year by year — the measured mean
basket beta is actually slightly higher for the ranker (1.43 vs 1.32),
so the "ranking is more robust" claim is confined to the 2022 regime and
the first structural argument below (market-beta dominance), not a
universal beta-concentration story.
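The basket-overlap metric is plain Jaccard on the two top-20 sets. A minimal sketch with hypothetical tickers (6 shared names gives Jaccard 6/34 ≈ 0.176, near the measured 0.19):

```python
def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| between two baskets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical top-20 baskets sharing 6 names.
shared = {"AAA", "BBB", "CCC", "DDD", "EEE", "FFF"}
regressor_top20 = {f"REG{i:02d}" for i in range(14)} | shared
ranker_top20    = {f"RNK{i:02d}" for i in range(14)} | shared

print(round(jaccard(regressor_top20, ranker_top20), 3))  # 6 / 34 = 0.176
```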
- Universe. Current S&P 500 constituents (502 names), 2015-01 to 2025-07. Survivorship is acknowledged — results are an upper bound.
- Target. `fwd_ret_21d = close[t+21] / close[t] - 1`, raw continuous.
- Features (33 cols).
  - Stock (11): `mom_12_1`, `reversal_1w`, `ret_{21,63,126,252}d`, `vol_{20,60}d`, `rsi_14`, `macd_hist`, `volume_z_60`.
  - Macro (10), all lagged one business day: VIX level + 20d change, 10Y yield + 20d change, term slope (10Y−2Y), BAA credit spread (Moody's BAA − 10Y), S&P 3M / 12M trailing return + 60d vol, 6M fed-funds move count.
  - Exposure (12): 252d rolling beta vs ^GSPC, 11 GICS sector dummies.
- Models. Ridge / Lasso on standardised + median-imputed features; LightGBM & XGBoost regressors with RMSE loss; per-symbol HistMean as the zero-skill bar.
- Validation. Annual expanding-window walk-forward, OOS 2020–2024.
- Evaluation. MAE / RMSE / direction accuracy / Pearson / Spearman IC; threshold-long strategy with 5 bps one-way costs; Jaccard + signal correlation vs Repo 2.
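The validation scheme reduces to yearly expanding-window splits: train on everything before the test year, test on that calendar year. A schematic sketch (not the repo's `walk_forward_years` signature):

```python
import pandas as pd

def walk_forward_years(dates: pd.Series, first_test_year: int = 2020,
                       last_test_year: int = 2024):
    """Yield (train_mask, test_mask) pairs: train strictly before the
    test year, test on that calendar year. Note: with a 21-day forward
    target and no purging, the last ~21 training days overlap the test
    year's returns — the acknowledged leakage trade-off."""
    years = pd.to_datetime(dates).dt.year
    for test_year in range(first_test_year, last_test_year + 1):
        yield years < test_year, years == test_year

dates = pd.Series(pd.bdate_range("2015-01-01", "2024-12-31"))
folds = list(walk_forward_years(dates))
print(len(folds))  # 5 folds: test years 2020 ... 2024
```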
| # | Notebook | What it shows |
|---|---|---|
| 01 | 01_regression_eda.ipynb | Target σ ≈ 0.08, fat tails, per-stock R² vs market ≈ 0.3 — why macro matters |
| 02 | 02_training_walkforward.ipynb | Cross-model table + year-by-year + MAE vs VIX regime |
| 03 | 03_error_analysis.ipynb | Per-sector MAE, high/low-VIX split, worst 20 predictions |
| 04 | 04_threshold_strategy.ipynb | Long when pred > τ; τ sweep; net equity curves vs SPX |
| 05 | 05_vs_ranking.ipynb | Head-to-head with Repo 2: daily Spearman, top-20 Jaccard, drawdown behaviour |
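The threshold rule in notebook 04 reduces to: go long every name whose predicted return exceeds τ, equal-weight, and charge costs on turnover. A minimal sketch with synthetic predictions — the cost treatment here (per-name one-way cost amortised over the basket) is a simplification, not the notebook's exact accounting:

```python
import numpy as np

def threshold_long_returns(preds, rets, tau=0.02, cost_bps=5.0):
    """Per-rebalance net returns of an equal-weight basket of names
    with pred > tau; one-way costs charged on each position change."""
    cost = cost_bps / 1e4
    prev = np.zeros(preds.shape[1], dtype=bool)
    out = []
    for p, r in zip(preds, rets):
        held = p > tau
        gross = r[held].mean() if held.any() else 0.0
        traded = np.sum(held != prev)          # entries + exits this rebalance
        n = max(held.sum(), 1)
        out.append(gross - cost * traded / n)  # amortise costs over the basket
        prev = held
    return np.array(out)

rng = np.random.default_rng(1)
preds = rng.normal(0.01, 0.03, size=(60, 500))     # 60 rebalances x 500 names
rets = 0.2 * preds + rng.normal(0, 0.06, size=preds.shape)
print(threshold_long_returns(preds, rets).shape)   # (60,)
```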
Notebooks are built from `scripts/build_0N_*.py` — source diffs stay in
Python, not ipynb JSON. Re-run the build script and then
`jupyter nbconvert --execute --ExecutePreprocessor.kernel_name=ml-return-forecast`
to regenerate a notebook with outputs.
Absolute-return regression has three structural disadvantages vs. ranking:
- Market beta dominates the target. The median stock's 21-day return has R² ≈ 0.3 against the contemporaneous market return. A model that doesn't explicitly carry macro / beta features is learning the market, not the stock; a model that does carry them inherits macro look-ahead risks.
- Target distribution is fat-tailed. Squared-loss regressors over-fit outlier months (COVID, 2022). The per-fold MAE swings by 40% with regime — look at notebook 02's year breakdown, not the headline row.
- Thresholds don't beat quantiles. Notebook 04's τ sweep does not improve monotonically with τ: the top-predicted names aren't reliably better than the mass of positively-predicted names, because the regressor's "magnitude" is noisy. A proper ranker (Repo 2) uses top-quintile / long-short instead, which is more robust by construction.
Combined, these are the quantitative version of the industry folklore that signal research is dominated by ranking. Repo 5 exists to put numbers on that folklore.
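The first point — market beta dominating the target — can be checked directly by regressing each stock's 21-day return on the contemporaneous market return and inspecting R². A sketch on synthetic data calibrated so the median R² lands near the ~0.3 the EDA reports:

```python
import numpy as np

rng = np.random.default_rng(2)
T, N = 500, 50
mkt = rng.normal(0, 0.04, size=T)                       # market 21d returns
betas = rng.uniform(0.6, 1.6, size=N)                   # per-stock betas
stock = mkt[:, None] * betas + rng.normal(0, 0.067, size=(T, N))

def r2_vs_market(y, x):
    """R^2 of a univariate OLS of y on x (with intercept)."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return 1 - resid.var() / y.var()

r2 = np.array([r2_vs_market(stock[:, i], mkt) for i in range(N)])
print(round(float(np.median(r2)), 2))  # median near 0.3 under this calibration
```

With a third of per-stock variance explained by the market, an absolute-return model must get the market leg right before any stock-picking signal matters.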
- Survivorship bias: universe is the current S&P 500. Names that were delisted or removed between 2015–2024 are invisible.
- Credit spread choice: FRED's public CSV endpoint for ICE's BAMLH0A0HYM2 (HY OAS) only returns ~2 years due to a licensing change. BAA10Y (Moody's BAA − 10Y) is used instead — a reasonable IG-spread proxy that covers the full window.
- Macro look-ahead: every macro series is lagged one business day. Some series (e.g. FEDFUNDS) are monthly and forward-filled — the look-ahead guard is conservative but not airtight.
- Timing convention: predictions are assumed to be acted on at the close of day t (same-day close-to-close frame), so beta_252d uses unlagged returns up to t. Macro series, which are released at a different cadence than equity prices, are shifted by one business day as an extra safety margin rather than to match this frame.
- Sector snapshot: GICS sector is the current assignment, not a point-in-time mapping.
conda create -n ml-return-forecast python=3.13
conda activate ml-return-forecast
pip install -e .
# register the kernel so nbconvert executes notebooks in the right env
python -m ipykernel install --user --name ml-return-forecast
# data (writes to data/raw/)
python scripts/download_data.py
python scripts/download_macro.py
# features (writes to data/processed/)
python scripts/build_features.py
# train OOS 2020-2024
python scripts/train.py # writes reports/predictions/oos_2020_2024.parquet
# regenerate any notebook
python scripts/build_04_threshold_strategy.py
python -m jupyter nbconvert --to notebook --execute \
--ExecutePreprocessor.kernel_name=ml-return-forecast \
    notebooks/04_threshold_strategy.ipynb --output 04_threshold_strategy.ipynb

ml-return-forecast/
├── data/
│ ├── raw/ # sp500_ohlcv_*.parquet, macro_*.parquet, sp500_sectors.csv
│ └── processed/ # features_*.parquet
├── notebooks/ # 01–05, executed
├── reports/
│ └── predictions/ # oos_2020_2024.parquet
├── scripts/
│ ├── download_data.py
│ ├── download_macro.py
│ ├── build_features.py
│ ├── train.py
│ └── build_0{1-5}_*.py # notebook source-of-truth
└── src/mlr/
├── features_stock.py
├── features_macro.py
├── features.py # assembly + beta + sector + target
├── model.py # 4 wrapper classes, 5 model instantiations
└── validation.py # walk_forward_years
Cross-sectional absolute-return regression (direct benchmark)
- Gu, S., Kelly, B., & Xiu, D. (2020). Empirical asset pricing via machine
  learning. Review of Financial Studies, 33(5), 2223–2273.
  doi:10.1093/rfs/hhaa009 — predicts absolute monthly US equity returns with
  94 firm characteristics plus 8 macro predictors, comparing linear / tree /
  neural models. This repo is a scaled-down version of the same setup
  (11 stock + 10 macro + 12 exposure features, 21-day horizon,
  Ridge / Lasso / LGBM / XGB), and its pairing with ml-cross-sectional is
  the direct ranking-vs-regression comparison GKX does not make explicitly.
Macro predictability of returns (why macro doesn't save the regression)
- Welch, I., & Goyal, A. (2008). A comprehensive look at the empirical performance of equity premium prediction. Review of Financial Studies, 21(4), 1455–1508. doi:10.1093/rfs/hhm014 — finds that the canonical macro predictor set (term spread, credit spread, dividend yield, etc.) offers almost no reliable out-of-sample forecast power for the aggregate market premium. The variables we feed as per-stock macro features here (VIX, 10Y, term slope, BAA credit spread, S&P trailing return/vol) are drawn from the same pool. We use them cross-sectionally rather than to time the index, but notebook 02's year-by-year MAE swings and notebook 04's flat threshold sweep are consistent with the W&G finding that these series carry less forward information than their contemporaneous correlation suggests.
Validation methodology
- López de Prado, M. (2018). Advances in financial machine learning. Wiley. Chapter 7 argues for purging + embargo (with CPCV as the recommended scheme) in financial cross-validation. We use plain annual expanding-window walk-forward with no purging — the same deliberate deviation made in Repo 2, justifiable at a 21-day target horizon and annual retrain where fold-to-fold IC noise dominates leakage, but a design choice a production setup should revisit.
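For reference, the purging step the chapter recommends would be a small change to the yearly splits: drop the last `horizon` business days of each training window so no training label overlaps the test year. A schematic sketch, not the repo's implementation:

```python
import pandas as pd

def purged_walk_forward(dates: pd.Series, test_years, horizon_days: int = 21):
    """Expanding-window yearly folds with a purge: training ends
    `horizon_days` business days before the test year starts, so no
    21-day training label reaches into the test period."""
    dates = pd.to_datetime(dates)
    for year in test_years:
        test_start = pd.Timestamp(f"{year}-01-01")
        purge_start = test_start - pd.tseries.offsets.BDay(horizon_days)
        train = dates < purge_start
        test = (dates >= test_start) & (dates < pd.Timestamp(f"{year + 1}-01-01"))
        yield train, test

dates = pd.Series(pd.bdate_range("2015-01-01", "2024-12-31"))
train, test = next(purged_walk_forward(dates, [2020]))
# The purged train set ends ~21 business days before 2020 begins.
print(dates[train].max() < pd.Timestamp("2020-01-01") - pd.tseries.offsets.BDay(20))
```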