Retail-forum (Reddit) sentiment as an alternative-data factor on the S&P 500: does retail-forum sentiment predict forward returns, and does it carry incremental alpha after controlling for classic factors (momentum, short-term reversal)?
Status: shipped 2026-04-05. M1–M6 complete; 4 notebooks executed end-to-end. See `plan_repo3.md` for the execution breakdown.
Across all of 2021 on the leukipp four-subreddit corpus
(~55k linked documents, 76k (doc, ticker) rows, 470 S&P 500 names):
| Variant | Mean rank-IC vs fwd 21d | HAC t-stat (lag 5) |
|---|---|---|
| `sum_sent_x_attn` | +0.015 | +1.53 |
| `mean_sent` | +0.012 | +1.51 |
| `attn_log` | −0.020 | −1.07 |
| `shrinkage_sent` | +0.014 | +1.66 |
Pooled signal does not clear the HAC |t| ≥ 2 bar but is in the right direction (positive sentiment → positive forward returns).
The interesting structure is per-subreddit:
| Variant | r/wallstreetbets | r/stocks | r/investing | r/options |
|---|---|---|---|---|
| `sum_sent_x_attn` | +2.00 | +0.96 | −2.41 | +0.15 |
| `shrinkage_sent` | +1.94 | +0.97 | −3.09 | +0.17 |
| `mean_sent` | +1.79 | +1.08 | −2.68 | −0.48 |
(HAC t-stats on daily rank-IC vs 21d fwd return)
— r/investing is a statistically significant contrarian signal in 2021, r/wallstreetbets is borderline positive, and r/stocks and r/options are noise.
But this contrast collapses once we control for momentum + reversal
(notebook 03). Per-subreddit pooled HAC OLS of fwd21 ~ sent + mom + rev:
| Subreddit | β_sent | t_sent | n_obs |
|---|---|---|---|
| r/wallstreetbets | −0.000 | −0.22 | 11,202 |
| r/stocks | +0.002 | +1.25 | 9,531 |
| r/investing | +0.002 | +0.30 | 2,462 |
| r/options | +0.001 | +0.18 | 4,438 |
Sentiment is essentially a price-momentum echo — retail forums hype names that have been moving, and the momentum signal already captures that information.
Backtests (monthly rebalance, 5 bps per side, signal window 2021):

| Variant | L/S Sharpe | L/S ann ret | LO Sharpe | LO ann ret |
|---|---|---|---|---|
| `sum_sent_x_attn` | −1.07 | −12.6% | +1.57 | +30.3% |
| `mean_sent` | … | … | +2.30 | +45.3% |
| `attn_log` | … | … | +1.65 | +42.1% |
| `shrinkage_sent` | … | … | +1.69 | +33.2% |
Long-only top-quintile basket beats 2021 SPX (≈ +27%) with smaller MDD, but is heavily SPX-correlated. The L/S leg loses money because the signal is so concentrated on a few hot meme names that shorting the bottom quintile creates a large negative-beta drag in a strong bull year. Practical reading: any tradable edge here is on the long side, on names retail is already piling into — i.e. it is follow-through, not contrarian alpha.
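A hedged sketch of the quintile construction: rank by signal at each month-end, go long the top quintile and short the bottom, and charge 5 bps per side on names traded. The cost accounting and wide date × ticker layout are assumptions, not the repo's backtester:

```python
import numpy as np
import pandas as pd

def quintile_ls_returns(signal: pd.DataFrame, fwd_ret: pd.DataFrame,
                        cost_bps: float = 5.0) -> pd.Series:
    """Monthly long/short quintile returns, net of a simple turnover cost.

    signal / fwd_ret: month-end DataFrames indexed by date, columns = tickers;
    fwd_ret holds each name's return over the following holding month.
    """
    out = {}
    prev_long: set = set()
    prev_short: set = set()
    for dt in signal.index:
        ranks = signal.loc[dt].dropna().rank(pct=True)
        long = set(ranks[ranks >= 0.8].index)   # top quintile
        short = set(ranks[ranks <= 0.2].index)  # bottom quintile
        gross = (fwd_ret.loc[dt, list(long)].mean()
                 - fwd_ret.loc[dt, list(short)].mean())
        # Names entering or leaving either leg pay cost_bps per side,
        # spread over the book size (a simplification).
        traded = len(long ^ prev_long) + len(short ^ prev_short)
        cost = cost_bps / 1e4 * traded / max(len(long) + len(short), 1)
        out[dt] = gross - cost
        prev_long, prev_short = long, short
    return pd.Series(out)
```

Dropping the `short` leg (and its cost) gives the long-only top-quintile variant; the text above explains why the short leg is what sinks the L/S Sharpe in 2021.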
The event study (notebook 04): 139 bullish mention spikes (mentions ≥ ticker p99 + polarity > 0) deliver mean abnormal returns of +1.17% (t+1) → +1.52% (t+10), concentrated in WSB (CAR(+5) = +1.43%) vs r/investing (+0.27%). Tactical short-horizon meme-momentum effect, not a durable cross-sectional factor.
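The event-study mechanics — align each spike at t = 0, average abnormal returns across events, cumulate over ±10 trading days — can be sketched as follows (the `abn_ret` layout and the skip-at-sample-edge rule are assumptions, not the notebook's exact code):

```python
import numpy as np
import pandas as pd

def event_car(abn_ret: pd.DataFrame, events, window: int = 10) -> pd.Series:
    """Mean cumulative abnormal return from t-window to t+window.

    abn_ret: daily abnormal returns, index = trading dates, columns = tickers.
    events:  iterable of (event_date, ticker) pairs (e.g. bullish mention spikes).
    """
    paths = []
    for dt, tk in events:
        i = abn_ret.index.get_loc(dt)
        if i - window < 0 or i + window >= len(abn_ret):
            continue  # event too close to the sample edge to form a full window
        paths.append(abn_ret[tk].iloc[i - window:i + window + 1].to_numpy())
    mean_path = np.nanmean(np.vstack(paths), axis=0)
    # Cumulate the event-time mean path; index is the day offset from the spike.
    return pd.Series(np.cumsum(mean_path), index=range(-window, window + 1))
```

CAR(+5) in the text corresponds to the value of this series at offset +5.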
Scope note (2026-03-28 revision). US equities only, single-archive scope. Post data comes from one public Kaggle archive, `leukipp/reddit-finance-data`, covering full-year 2021 submissions across four subreddits: `wallstreetbets` (775k), `stocks` (76k), `investing` (42k), `options` (29k) — ~920k documents total. The originally planned larger archive (`curiel/rwallstreetbets-posts-and-comments`, 3M posts+comments, 2023-06 → 2025-04) was dropped after pre-flight scoring failed repeatedly on both local CPU (non-linear slowdown to a >20 h ETA) and a Kaggle kernel (P100 CUDA kernel incompatibility with the preinstalled PyTorch; accelerator pinning did not take effect). The decision trade-off is explained in the "Failure discussion" section below. No live collector in this repo (Reddit PRAW forward collection was descoped — see `plan_repo3.md`). Taiwan (PTT) and crypto sentiment are deferred — see `SDD.md`.
Retail-forum chatter is noisy, biased, and self-promotional, yet 2021 made it clear that it can drive prices in the short term. Two follow-up questions matter for a portfolio repo:
- Is there a tradable cross-sectional sentiment factor once costs and ticker-recognition errors are accounted for?
- Is any of its alpha incremental to the classic momentum / reversal factors from `classic-factors`?
A single-year, cross-subreddit view (WSB meme crowd vs. r/stocks vs. r/investing vs. r/options) gives the narrative a natural control: if the sentiment factor shows very different strength between WSB and r/investing, that is evidence of a sub-population effect rather than universal retail-forum alpha.
- Data: one Kaggle archive — `leukipp/reddit-finance-data`, 2021 full year, submissions only, across four subreddits (WSB, stocks, investing, options). S&P 500 OHLCV via `qtools.data.loaders.us`.
- Entity linking: regex `\$?[A-Z]{2,5}\b` → S&P 500 whitelist → common-word blacklist (CEO / YOLO / FY / …). Four S&P 500 tickers — `COO` (Cooper), `DD` (DuPont), `IT` (Gartner), `MAR` (Marriott) — are deliberately blacklisted because their all-caps form is dominated by slang / calendar usage; these four companies are invisible to this pipeline by design, documented in `src/alt_sentiment/entity_linking.py`. Precision + recall spot-checked on 100 hand-labelled posts in notebook 01.
- Sentiment model: `ProsusAI/finbert` (HuggingFace, CPU batch). Truncated to 512 tokens. Scored once per document, then fanned out to every ticker linked to that document (cost is O(documents), not O(documents × tickers)).
- Daily factor: four variants reported side by side to avoid cherry-picking a friendly formula:
  - `sum((pos − neg) · log1p(mentions))` — combined sentiment × attention
  - `mean(pos − neg)` — pure sentiment
  - `log1p(doc_count)` — pure attention
  - `mean(pos − neg) · n/(n+10)` — shrinkage toward zero for thin names
- Evaluation: rank IC with Newey-West / HAC t-stat (the daily signal has autocorrelation; a plain t-stat overstates significance), p-value, quintile long-short and long-only (monthly rebalance, US_EQUITY cost), incremental OLS vs momentum + reversal, event study ±10 trading days on mention-spike pump events (GameStop 2021-01 as a dedicated case study).
- Cross-subreddit analysis: the same factor is computed per subreddit in addition to the pooled version, so the WSB-vs-rest contrast is explicit.
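The four variants reduce to a short pandas aggregation over a per-document polarity table. This sketch assumes a `['date', 'ticker', 'pos', 'neg']` schema and reads the combined variant as mean polarity × log-attention, which may differ from the repo's exact formula:

```python
import numpy as np
import pandas as pd

def daily_variants(docs: pd.DataFrame, k: float = 10.0) -> pd.DataFrame:
    """Aggregate per-document FinBERT scores into the four daily factor variants."""
    docs = docs.assign(polarity=docs["pos"] - docs["neg"])
    g = docs.groupby(["date", "ticker"])
    n = g.size()                      # mentions (doc count) per (date, ticker)
    mean_sent = g["polarity"].mean()
    return pd.DataFrame({
        "mean_sent": mean_sent,                       # pure sentiment
        "attn_log": np.log1p(n),                      # pure attention
        "sum_sent_x_attn": mean_sent * np.log1p(n),   # sentiment x attention (one reading)
        "shrinkage_sent": mean_sent * n / (n + k),    # shrink thin names toward 0
    })
```

The shrinkage variant keeps a one-mention name at roughly 1/11 of its raw polarity, which is the point: thin names carry almost no weight until attention builds.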
```
alt-data-sentiment/
├── README.md
├── pyproject.toml
├── scripts/
│   ├── download_kaggle_wsb.py   # Kaggle archive pull
│   ├── download_prices.py       # S&P 500 OHLCV via qtools
│   ├── score_sentiment.py       # FinBERT batch scorer
│   └── build_0N_*.py            # notebook generators
├── notebooks/
│   ├── 01_data_quality.ipynb
│   ├── 02_sentiment_factor.ipynb
│   ├── 03_vs_classic_factors.ipynb
│   └── 04_event_study.ipynb
├── src/alt_sentiment/
│   ├── entity_linking.py        # ticker extraction + whitelist/blacklist
│   ├── sentiment.py             # FinBERT wrapper
│   ├── factor.py                # daily aggregation + IC + quintile helpers
│   └── loaders/
│       ├── leukipp.py           # active loader
│       └── curiel.py            # parked — see failure discussion
├── reports/figures/
└── data/                        # gitignored
```
- Single-year sample: 2021 was a meme-stock year. Cross-subreddit contrasts are robust within 2021, but external validity to 2022+ is assumed, not demonstrated, in this repo.
- Selection bias: WSB is a vocal subset of retail, heavily skewed to meme / high-beta names. The four-subreddit design partly probes this, but all four skew bullish and US-centric.
- Ticker recognition errors: the blacklist is hand-maintained. Four real S&P 500 names (`COO`, `DD`, `IT`, `MAR`) are deliberately excluded; the remaining false-positive surface (e.g. `NOW`, `LOW` written as emphasis vs. the companies) is left in, to be spot-checked in notebook 01 via both precision AND recall.
- Survivorship / universe freeze: the S&P 500 constituent list is a snapshot taken when `scripts/download_prices.py` was first run. Stocks added to or removed from the index after that snapshot are handled consistently, not historically accurately. Same universe-freeze simplification as `ml-cross-sectional` and `ml-return-forecast`.
- Kaggle archive, not live: no forward collector in this repo.
- FinBERT domain gap: trained on financial news, not social slang — "to the moon 🚀" and similar are routinely mis-scored. A spot check on 20 typical WSB lines is included in notebook 01.
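For concreteness, the whitelist → blacklist linking step described earlier can be sketched like this. The regex and the blacklisted tickers come from the text above; the whitelist slice and function name are purely illustrative:

```python
import re

TICKER_RE = re.compile(r"\$?[A-Z]{2,5}\b")           # the regex from the pipeline
SP500 = {"GME", "AMC", "NOW", "LOW", "AAPL"}          # stand-in whitelist slice
BLACKLIST = {"CEO", "YOLO", "FY", "COO", "DD", "IT", "MAR"}

def link_tickers(text: str) -> set[str]:
    """All-caps candidates -> S&P 500 whitelist -> common-word blacklist."""
    hits = {m.lstrip("$") for m in TICKER_RE.findall(text)}
    return (hits & SP500) - BLACKLIST
```

Note how `NOW` and `LOW` survive the filter by design: they are real index names, so the false-positive risk from emphatic all-caps usage has to be measured (notebook 01) rather than filtered away.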
The original plan had two archives: leukipp (2021) and
curiel/rwallstreetbets-posts-and-comments (3M rows, 2023-06 → 2025-04),
analysed as two regime case studies. During M3 we pre-flighted FinBERT
scoring on curiel and hit two hard stops:
- Local CPU run — the first ~10 chunks scored at ~9 min each, then the per-chunk time blew up non-linearly (chunk 13 took 77 min). The later 2024-2025 chunks have both a much higher ticker-mention density (7-8% of rows vs 0.5% in the EDA slice) AND longer text (comments approach the 512-token FinBERT cap), so each chunk needs ~15× the FinBERT compute of the benchmarked early chunks. Extrapolated full-run: >20 h. Not viable for a portfolio repo.
- Kaggle GPU run — we packaged the scoring as a Kaggle notebook (generator in `scripts/kaggle/`) and pushed via the Kaggle CLI with `enable_gpu: true` + `--accelerator "GPU T4 x2"`. Kaggle nevertheless assigned a Tesla P100 (compute capability sm_60), which the preinstalled PyTorch no longer ships kernels for (PyTorch 2.4+ dropped sm_60). A runtime compute-capability check then fell back to CPU, but the CPU fallback also took >90 min and our polling loop timed out.
The pragmatic call: focus on leukipp (920k submissions, short texts, fits on a laptop CPU in 30-60 min), and document this history instead of shipping half-curiel-half-leukipp with mismatched regimes. The cross-subreddit angle within 2021 — WSB meme crowd vs r/investing long-term-bias vs r/options leverage bias — is a genuinely richer comparison than a cleaner one-year WSB file would have been.
```bash
conda create -n alt-data-sentiment python=3.13 -y
conda activate alt-data-sentiment
pip install -e .

# Archive (requires KAGGLE_API_TOKEN in .env, see .env.example)
python scripts/download_kaggle_wsb.py   # leukipp (+ parked curiel pull)
python scripts/download_prices.py       # 503 S&P 500 symbols, 2020-2025

# Sentiment scoring — leukipp via Kaggle T4/CPU (preferred, see below)
# Push + poll + download from Kaggle:
python scripts/kaggle/run_on_kaggle.py --dataset leukipp
# OR run locally on CPU (slower):
python scripts/score_sentiment.py --dataset leukipp --batch-size 16

# Build & execute the four notebooks (each takes 1-3 min after scoring)
for n in 01_data_quality 02_sentiment_factor 03_vs_classic_factors 04_event_study; do
  python scripts/build_${n}.py
  jupyter nbconvert --to notebook --execute --inplace notebooks/${n}.ipynb
done
```

- `01_data_quality.ipynb` — raw archive coverage by subreddit, link-rate after entity linking, top ticker frequencies, naive bot heuristic, FinBERT class-probability distribution, 10-row text spot check.
- `02_sentiment_factor.ipynb` — four factor variants × Newey-West HAC IC × monthly L/S × long-only × per-subreddit IC matrix. The money-shot subreddit-contrast result lives here.
- `03_vs_classic_factors.ipynb` — pairwise correlation, monthly OLS, pooled HAC regression of fwd21 on sentiment + momentum + reversal. Per-subreddit incremental check.
- `04_event_study.ipynb` — ±10 trading-day CAR around bullish mention spikes, per subreddit, plus a single-name case-study panel.