matthiola0/alt-data-sentiment

alt-data-sentiment

Languages: English · 繁體中文

Retail-forum (Reddit) sentiment as an alternative-data factor on the S&P 500: does retail-forum sentiment predict forward returns, and does it carry incremental alpha after controlling for classic factors (momentum, short-term reversal)?

Status: shipped 2026-04-05. M1–M6 complete; 4 notebooks executed end-to-end. See plan_repo3.md for the execution breakdown.

Headline result

Across all of 2021 on the leukipp four-subreddit corpus (~55k linked documents, 76k (doc, ticker) rows, 470 S&P 500 names):

| Variant | Mean rank-IC vs fwd 21d | HAC t-stat (lag 5) |
| --- | --- | --- |
| sum_sent_x_attn | +0.015 | +1.53 |
| mean_sent | +0.012 | +1.51 |
| attn_log | −0.020 | −1.07 |
| shrinkage_sent | +0.014 | +1.66 |

No pooled variant clears the HAC |t| ≥ 2 bar. The three sentiment variants do have the expected sign (positive sentiment → positive forward returns), while pure attention (attn_log) has a negative rank-IC.
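The two statistics in the table can be sketched in plain Python: a per-day Spearman rank-IC, and a t-stat on the mean of the daily IC series using a Newey-West (Bartlett kernel) variance with 5 lags. This is a minimal illustration, not the code in src/alt_sentiment/factor.py; the function names are hypothetical, and a constant signal or constant IC series would need extra handling for the zero-variance case.

```python
import math
from statistics import mean


def rank(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def rank_ic(signal, fwd_ret):
    """Spearman rank correlation for one day's cross-section."""
    return pearson(rank(signal), rank(fwd_ret))


def hac_tstat(series, lags=5):
    """t-stat of the series mean with a Newey-West long-run variance."""
    n = len(series)
    mu = mean(series)
    dev = [v - mu for v in series]
    lrv = sum(d * d for d in dev) / n  # lag-0 autocovariance
    for k in range(1, lags + 1):
        weight = 1 - k / (lags + 1)  # Bartlett kernel weight
        gamma_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / n
        lrv += 2 * weight * gamma_k
    return mu / math.sqrt(lrv / n)
```

With lags=0 this reduces to the plain t-stat; on a positively autocorrelated IC series the HAC version is smaller in magnitude, which is why the plain t-stat overstates significance.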

The interesting structure is per-subreddit:

| Variant | r/wallstreetbets | r/stocks | r/investing | r/options |
| --- | --- | --- | --- | --- |
| sum_sent_x_attn | +2.00 | +0.96 | −2.41 | +0.15 |
| shrinkage_sent | +1.94 | +0.97 | −3.09 | +0.17 |
| mean_sent | +1.79 | +1.08 | −2.68 | −0.48 |

(HAC t-stats on daily rank-IC vs 21d fwd return)

In 2021, r/investing is a statistically significant contrarian signal, r/wallstreetbets is borderline positive, and r/stocks and r/options are noise.

But this contrast collapses once we control for momentum + reversal (notebook 03). Per-subreddit pooled HAC OLS of fwd21 ~ sent + mom + rev:

| Subreddit | β_sent | t_sent | n_obs |
| --- | --- | --- | --- |
| r/wallstreetbets | −0.000 | −0.22 | 11,202 |
| r/stocks | +0.002 | +1.25 | 9,531 |
| r/investing | +0.002 | +0.30 | 2,462 |
| r/options | +0.001 | +0.18 | 4,438 |

Sentiment is essentially a price-momentum echo — retail forums hype names that have been moving, and the momentum signal already captures that information.

Backtests (monthly rebalance, 5 bps per side, signal window 2021):

| Variant | L/S Sharpe | L/S ann ret | LO Sharpe | LO ann ret |
| --- | --- | --- | --- | --- |
| sum_sent_x_attn | −1.07 | −12.6% | +1.57 | +30.3% |
| mean_sent | | | +2.30 | +45.3% |
| attn_log | | | +1.65 | +42.1% |
| shrinkage_sent | | | +1.69 | +33.2% |

Long-only top-quintile basket beats 2021 SPX (≈ +27%) with smaller MDD, but is heavily SPX-correlated. The L/S leg loses money because the signal is so concentrated on a few hot meme names that shorting the bottom quintile creates a large negative-beta drag in a strong bull year. Practical reading: any tradable edge here is on the long side, on names retail is already piling into — i.e. it is follow-through, not contrarian alpha.

The event study (notebook 04): 139 bullish mention spikes (daily mentions ≥ the ticker's own 99th percentile, and polarity > 0) deliver mean abnormal returns of +1.17% at t+1, rising to +1.52% at t+10. The effect is concentrated in WSB (CAR(+5) = +1.43%) rather than r/investing (+0.27%). This reads as a tactical, short-horizon meme-momentum effect, not a durable cross-sectional factor.
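As a rough sketch of the event definition and the CAR arithmetic, assuming a market-adjusted abnormal return (stock return minus benchmark return) and a full-sample percentile per ticker. Neither choice is confirmed by this README, and the helper names are hypothetical.

```python
def percentile(values, q):
    """Linear-interpolation percentile, q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def bullish_spike_days(mentions, polarity, q=99):
    """Day indices where mentions >= the ticker's q-th percentile and polarity > 0."""
    threshold = percentile(mentions, q)
    return [
        i
        for i, (m, p) in enumerate(zip(mentions, polarity))
        if m >= threshold and p > 0
    ]


def car(stock_ret, market_ret, event_idx, horizon):
    """Cumulative abnormal return over days t+1 .. t+horizon (simple sum)."""
    window = range(event_idx + 1, event_idx + 1 + horizon)
    return sum(stock_ret[t] - market_ret[t] for t in window)
```

Averaging car(...) across all events at each horizon gives the mean path reported above. A full-sample percentile uses information from after the event date, so a production version would likely want a trailing window instead.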

Scope note (2026-03-28 revision). US equities only, single-archive scope. Post data comes from one public Kaggle archive, leukipp/reddit-finance-data, covering full-year 2021 submissions across four subreddits: wallstreetbets (775k), stocks (76k), investing (42k), options (29k), for ~920k documents in total. The originally planned larger archive (curiel/rwallstreetbets-posts-and-comments, 3M posts+comments, 2023-06 → 2025-04) was dropped after pre-flight scoring failed repeatedly, both on a local CPU (non-linear slowdown to >20 h ETA) and in a Kaggle kernel (P100 CUDA kernel incompatibility with the preinstalled PyTorch; accelerator pinning did not take effect). The trade-off behind that decision is explained in the "Failure discussion" section below. There is no live collector in this repo (Reddit PRAW forward-collection was descoped; see plan_repo3.md). Taiwan (PTT) and crypto sentiment are deferred; see SDD.md.

The question

Retail-forum chatter is noisy, biased, and self-promotional, yet 2021 made it clear that it can drive prices in the short term. Two follow-up questions matter for a portfolio repo:

  1. Is there a tradable cross-sectional sentiment factor once costs and ticker-recognition errors are accounted for?
  2. Is any of its alpha incremental to the classic momentum / reversal factors from classic-factors?

A single-year, cross-subreddit view (WSB meme crowd vs. r/stocks vs. r/investing vs. r/options) gives the narrative a natural control: if the sentiment factor shows very different strength between WSB and r/investing, that is evidence of a sub-population effect rather than universal retail-forum alpha.

Method outline

  • Data: one Kaggle archive — leukipp/reddit-finance-data, 2021 full year, submissions only across four subreddits (WSB, stocks, investing, options). S&P 500 OHLCV via qtools.data.loaders.us.
  • Entity linking: regex \$?[A-Z]{2,5}\b → S&P 500 whitelist → common-word blacklist (CEO / YOLO / FY / …). Four S&P 500 tickers — COO (Cooper), DD (DuPont), IT (Gartner), MAR (Marriott) — are deliberately blacklisted because their all-caps form is dominated by slang / calendar usage; these four companies are invisible to this pipeline by design, documented in src/alt_sentiment/entity_linking.py. Precision + recall spot-checked on 100 hand-labelled posts in notebook 01.
  • Sentiment model: ProsusAI/finbert (HuggingFace, CPU batch). Truncated to 512 tokens. Scored once per document, then fanned out to every ticker linked to that document (cost is O(documents), not O(documents × tickers)).
  • Daily factor: four variants reported side by side to avoid cherry-picking a friendly formula —
    1. sum((pos − neg) · log1p(mentions)) — combined sentiment × attention
    2. mean(pos − neg) — pure sentiment
    3. log1p(doc_count) — pure attention
    4. mean(pos − neg) · n/(n+10) — shrinkage toward zero for thin names
  • Evaluation: Rank IC with Newey-West / HAC t-stat (daily signal has autocorrelation; plain t-stat overstates significance), p-value, quintile long-short and long-only (monthly rebalance, US_EQUITY cost), incremental OLS vs momentum + reversal, event study ±10 trading days on mention-spike pump events (GameStop 2021-01 as a dedicated case study).
  • Cross-subreddit analysis: the same factor is computed per subreddit in addition to the pooled version, so the WSB-vs-rest contrast is explicit.
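The four daily variants listed above can be written out in a few lines. This is a sketch under stated assumptions rather than the code in src/alt_sentiment/factor.py: each input row is one (doc, ticker) pair, `mentions` is taken to be how often the ticker appears in that document, and n is the number of documents for the (date, ticker) group. The function name and row layout are hypothetical.

```python
import math
from collections import defaultdict


def daily_factors(rows, k=10):
    """rows: iterable of (date, ticker, pos, neg, mentions), one per (doc, ticker)."""
    groups = defaultdict(list)
    for date, ticker, pos, neg, mentions in rows:
        # The FinBERT score is per document and reused for every linked ticker.
        groups[(date, ticker)].append((pos - neg, mentions))

    out = {}
    for key, docs in groups.items():
        n = len(docs)
        mean_sent = sum(pol for pol, _ in docs) / n
        out[key] = {
            "sum_sent_x_attn": sum(pol * math.log1p(m) for pol, m in docs),
            "mean_sent": mean_sent,
            "attn_log": math.log1p(n),
            # Shrinks thinly covered names toward zero: n / (n + k) -> 1 as n grows.
            "shrinkage_sent": mean_sent * n / (n + k),
        }
    return out
```

For a ticker with two documents and a mean polarity of 0.15, the shrinkage variant is 0.15 × 2/12 = 0.025, while a name with 90 documents would keep 90% of its mean polarity.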

Layout

alt-data-sentiment/
├── README.md
├── pyproject.toml
├── scripts/
│   ├── download_kaggle_wsb.py      # Kaggle archive pull
│   ├── download_prices.py          # S&P 500 OHLCV via qtools
│   ├── score_sentiment.py          # FinBERT batch scorer
│   └── build_0N_*.py               # notebook generators
├── notebooks/
│   ├── 01_data_quality.ipynb
│   ├── 02_sentiment_factor.ipynb
│   ├── 03_vs_classic_factors.ipynb
│   └── 04_event_study.ipynb
├── src/alt_sentiment/
│   ├── entity_linking.py           # ticker extraction + whitelist/blacklist
│   ├── sentiment.py                # FinBERT wrapper
│   ├── factor.py                   # daily aggregation + IC + quintile helpers
│   └── loaders/
│       ├── leukipp.py              # active loader
│       └── curiel.py               # parked — see failure discussion
├── reports/figures/
└── data/                           # gitignored

Honest caveats (to be expanded in the final README)

  • Single-year sample: 2021 was a meme-stock year. Cross-subreddit contrasts are robust within 2021 but external validity to 2022+ is assumed, not demonstrated, in this repo.
  • Selection bias: WSB is a vocal subset of retail, heavily skewed to meme / high-beta names. The four-subreddit design partly probes this, but all four skew bullish and US-centric.
  • Ticker recognition errors: the blacklist is hand-maintained. Four real S&P 500 names (COO, DD, IT, MAR) are deliberately excluded; the remaining false-positive surface (e.g. NOW, LOW written as emphasis vs. the companies) is left in, to be spot-checked in notebook 01 via both precision AND recall.
  • Survivorship / universe freeze: the S&P 500 constituent list is a snapshot taken when scripts/download_prices.py was first run. Stocks added to or removed from the index after that snapshot are handled consistently, not historically accurately. Same universe-freeze simplification as ml-cross-sectional and ml-return-forecast.
  • Kaggle archive, not live: no forward collector in this repo.
  • FinBERT domain gap: trained on financial news, not social slang — "to the moon 🚀" and similar are routinely mis-scored. A spot check on 20 typical WSB lines is included in notebook 01.
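To make the ticker-recognition caveat concrete, here is a minimal sketch of the regex → whitelist → blacklist chain, using the pattern quoted in the method outline. The whitelist and blacklist below are tiny illustrative samples (the real lists live in src/alt_sentiment/entity_linking.py), and the function names are hypothetical.

```python
import re

# Pattern copied from the method outline: optional "$", then 2-5 capitals.
TICKER_RE = re.compile(r"\$?[A-Z]{2,5}\b")

# Illustrative samples only, not the real lists.
WHITELIST = {"AAPL", "TSLA", "MSFT", "NOW", "LOW", "DD", "IT"}
BLACKLIST = {"CEO", "YOLO", "FY", "COO", "DD", "IT", "MAR"}


def extract_tickers(text, whitelist=WHITELIST, blacklist=BLACKLIST):
    """Return the set of whitelisted, non-blacklisted symbols found in text."""
    found = set()
    for match in TICKER_RE.findall(text):
        symbol = match.lstrip("$")
        if symbol in whitelist and symbol not in blacklist:
            found.add(symbol)
    return found


def precision_recall(predicted, gold):
    """Set-based precision and recall for one hand-labelled post."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 1.0
    recall = tp / len(gold) if gold else 1.0
    return precision, recall
```

On "buy NOW before it dips" the extractor returns {"NOW"} even though no company is meant, which is the kind of false positive the blacklist leaves in and that only a hand-labelled precision check can quantify.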

Failure discussion — why only one dataset

The original plan had two archives: leukipp (2021) and curiel/rwallstreetbets-posts-and-comments (3M rows, 2023-06 → 2025-04), analysed as two regime case studies. During M3 we pre-flighted FinBERT scoring on curiel and hit two hard stops:

  1. Local CPU run — the first ~10 chunks scored at ~9 min each, then the per-chunk time blew up non-linearly (chunk 13 took 77 min). The later 2024-2025 chunks have both a much higher ticker-mention density (7-8% of rows vs 0.5% in the EDA slice) AND longer text (comments approach the 512-token FinBERT cap), so each chunk needs ~15× the FinBERT compute of the benchmarked early chunks. Extrapolated full-run: >20 h. Not viable for a portfolio repo.
  2. Kaggle GPU run — we packaged the scoring as a Kaggle notebook (generator in scripts/kaggle/) and pushed via the Kaggle CLI with enable_gpu: true + --accelerator "GPU T4 x2". Kaggle nevertheless assigned a Tesla P100 (compute capability sm_60), which the preinstalled PyTorch no longer ships kernels for (PyTorch 2.4+ dropped sm_60). A runtime compute-capability check then fell back to CPU, but the CPU fallback also took >90 min and our polling loop timed out.
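A runtime compute-capability check of the kind described in item 2 might look like the sketch below. It is an assumption about the approach, not the code in scripts/kaggle/: it compares the device's capability against the architectures the installed PyTorch build reports, using an exact-match rule that ignores PTX and same-major compatibility and is therefore stricter than PyTorch itself. The helper names are hypothetical.

```python
def arch_tag(capability):
    """(6, 0) -> "sm_60"."""
    major, minor = capability
    return f"sm_{major}{minor}"


def pick_device(capability, compiled_archs):
    """Return "cuda" only if the build lists kernels for this exact architecture."""
    if capability is None:
        return "cpu"
    return "cuda" if arch_tag(capability) in compiled_archs else "cpu"


def detect_device():
    """Query PyTorch if it is installed; fall back to CPU otherwise."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if not torch.cuda.is_available():
        return "cpu"
    capability = torch.cuda.get_device_capability(0)
    return pick_device(capability, torch.cuda.get_arch_list())
```

Running such a check before the first batch makes the CPU fallback an explicit, logged decision, so a mis-assigned accelerator can be caught early rather than after a long polling timeout.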

The pragmatic call: focus on leukipp (920k submissions, short texts, fits on a laptop CPU in 30-60 min), and document this history instead of shipping half-curiel-half-leukipp with mismatched regimes. The cross-subreddit angle within 2021 — WSB meme crowd vs r/investing long-term-bias vs r/options leverage bias — is a genuinely richer comparison than a cleaner one-year WSB file would have been.

Quickstart

conda create -n alt-data-sentiment python=3.13 -y
conda activate alt-data-sentiment
pip install -e .

# Archive (requires KAGGLE_API_TOKEN in .env, see .env.example)
python scripts/download_kaggle_wsb.py     # leukipp (+ parked curiel pull)
python scripts/download_prices.py         # 503 S&P 500 symbols, 2020-2025

# Sentiment scoring — leukipp via Kaggle T4/CPU (preferred, see below)
#   Push + poll + download from Kaggle:
python scripts/kaggle/run_on_kaggle.py --dataset leukipp
#   OR run locally on CPU (slower):
python scripts/score_sentiment.py --dataset leukipp --batch-size 16

# Build & execute the four notebooks (each takes 1-3 min after scoring)
for n in 01_data_quality 02_sentiment_factor 03_vs_classic_factors 04_event_study; do
    python scripts/build_${n}.py
    jupyter nbconvert --to notebook --execute --inplace notebooks/${n}.ipynb
done

Notebook tour

  • 01_data_quality.ipynb — raw archive coverage by subreddit, link-rate after entity linking, top ticker frequencies, naive bot heuristic, FinBERT class-probability distribution, 10-row text spot check.
  • 02_sentiment_factor.ipynb — four factor variants × Newey-West HAC IC × monthly L/S × long-only × per-subreddit IC matrix. The money-shot subreddit-contrast result lives here.
  • 03_vs_classic_factors.ipynb — pairwise correlation, monthly OLS, pooled HAC regression of fwd21 on sentiment + momentum + reversal. Per-subreddit incremental check.
  • 04_event_study.ipynb — ±10 trading-day CAR around bullish mention spikes, per subreddit, plus a single-name case-study panel.

About

Reddit retail-forum sentiment as an alt-data factor on the S&P 500: FinBERT scoring, cross-subreddit IC, classic-factor incremental regression, event study (2021)
