alt-data-sentiment

Languages: English · 繁體中文

Retail-forum (Reddit) sentiment as an alternative-data factor on the S&P 500: does retail-forum sentiment predict forward returns, and does it carry incremental alpha after controlling for classic factors (momentum, short-term reversal)?

Status: shipped 2026-04-05. M1–M6 complete; 4 notebooks executed end-to-end. See plan_repo3.md for the execution breakdown.

Headline result

Across all of 2021 on the leukipp four-subreddit corpus (~55k linked documents, 76k (doc, ticker) rows, 470 S&P 500 names):

Variant	Mean rank-IC vs fwd 21d	HAC t-stat (lag 5)
`sum_sent_x_attn`	+0.015	+1.53
`mean_sent`	+0.012	+1.51
`attn_log`	−0.020	−1.07
`shrinkage_sent`	+0.014	+1.66

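No pooled variant clears the HAC |t| ≥ 2 bar. The three sentiment-based variants point in the expected direction (positive sentiment → positive forward returns); the pure-attention variant `attn_log` is negative and also insignificant.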
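To make the two table columns concrete, here is a minimal stdlib-only sketch of a daily Spearman rank-IC and a Newey-West (Bartlett kernel) t-stat on the mean of the daily IC series. The names `rank_ic` and `hac_tstat` are illustrative assumptions, not the helpers in src/alt_sentiment/factor.py, and the repo's implementation may differ in details such as small-sample corrections.

```python
import math


def _ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_ic(signal, fwd_ret):
    """Spearman correlation between one day's signal and forward returns."""
    rs, rr = _ranks(signal), _ranks(fwd_ret)
    n = len(rs)
    ms, mr = sum(rs) / n, sum(rr) / n
    cov = sum((a - ms) * (b - mr) for a, b in zip(rs, rr))
    vs = sum((a - ms) ** 2 for a in rs)
    vr = sum((b - mr) ** 2 for b in rr)
    return cov / math.sqrt(vs * vr)


def hac_tstat(series, lag=5):
    """t-stat of the series mean using a Newey-West long-run variance."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    lrv = sum(d * d for d in dev) / n  # lag-0 autocovariance
    for k in range(1, lag + 1):
        gamma_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / n
        lrv += 2 * (1 - k / (lag + 1)) * gamma_k  # Bartlett weight
    return mean / math.sqrt(lrv / n)
```

With `lag=0` this reduces to a plain t-stat (population variance). On a positively autocorrelated IC series the HAC version is smaller in magnitude, which is why the plain t-stat overstates significance for a daily signal.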

The interesting structure is per-subreddit:

Variant	r/wallstreetbets	r/stocks	r/investing	r/options
`sum_sent_x_attn`	+2.00	+0.96	−2.41	+0.15
`shrinkage_sent`	+1.94	+0.97	−3.09	+0.17
`mean_sent`	+1.79	+1.08	−2.68	−0.48

(HAC t-stats on daily rank-IC vs 21d fwd return)

r/investing is a statistically significant contrarian signal in 2021, r/wallstreetbets is borderline positive, and r/stocks and r/options are noise.

But this contrast collapses once we control for momentum + reversal (notebook 03). Per-subreddit pooled HAC OLS of fwd21 ~ sent + mom + rev:

Subreddit	β_sent	t_sent	n_obs
r/wallstreetbets	−0.000	−0.22	11,202
r/stocks	+0.002	+1.25	9,531
r/investing	+0.002	+0.30	2,462
r/options	+0.001	+0.18	4,438

Sentiment is essentially a price-momentum echo — retail forums hype names that have been moving, and the momentum signal already captures that information.
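The "momentum echo" mechanism can be illustrated on synthetic data. The sketch below is not the notebook 03 regression: the data are simulated, the coefficients (0.5, −0.1) and noise levels are made up for illustration, and only OLS point estimates are computed (no HAC standard errors). Sentiment is constructed as momentum plus noise, and forward returns depend on momentum and reversal only.

```python
import random


def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(r[i] * r[j] for r in X) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(X, y))]
        for i in range(k)
    ]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Synthetic panel (assumption: all parameters below are illustrative only).
rng = random.Random(0)
n = 5000
mom = [rng.gauss(0, 1) for _ in range(n)]
rev = [rng.gauss(0, 1) for _ in range(n)]
sent = [m + rng.gauss(0, 1) for m in mom]  # sentiment echoes momentum
fwd = [0.5 * m - 0.1 * r + rng.gauss(0, 0.5) for m, r in zip(mom, rev)]

# fwd ~ sent: sentiment looks predictive on its own.
b_uni = ols([[1.0, s] for s in sent], fwd)
# fwd ~ sent + mom + rev: the sentiment coefficient shrinks toward zero.
b_multi = ols([[1.0, s, m, r] for s, m, r in zip(sent, mom, rev)], fwd)

print("beta_sent alone:        ", round(b_uni[1], 3))
print("beta_sent with controls:", round(b_multi[1], 3))
```

In this toy setup the univariate sentiment slope is clearly positive while the controlled slope is close to zero, which is the same qualitative pattern as the per-subreddit table above.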

Backtests (monthly rebalance, 5 bps per side, signal window 2021):

Variant	L/S Sharpe	L/S ann ret	LO Sharpe	LO ann ret
`sum_sent_x_attn`	−1.07	−12.6%	+1.57	+30.3%
`mean_sent`	…	…	+2.30	+45.3%
`attn_log`	…	…	+1.65	+42.1%
`shrinkage_sent`	…	…	+1.69	+33.2%

Long-only top-quintile basket beats 2021 SPX (≈ +27%) with smaller MDD, but is heavily SPX-correlated. The L/S leg loses money because the signal is so concentrated on a few hot meme names that shorting the bottom quintile creates a large negative-beta drag in a strong bull year. Practical reading: any tradable edge here is on the long side, on names retail is already piling into — i.e. it is follow-through, not contrarian alpha.
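A rough sketch of one rebalance period shows how the long-only and long-short legs relate. It assumes equal-weighted quintiles and charges the per-side cost on the fraction of each leg that is bought or sold; both choices, and the function names, are assumptions for illustration rather than the repo's US_EQUITY cost model.

```python
def quintile_legs(signal):
    """Return (top, bottom) quintile ticker sets for a {ticker: signal} dict."""
    ranked = sorted(signal, key=signal.get)
    q = max(1, len(ranked) // 5)
    return set(ranked[-q:]), set(ranked[:q])


def leg_cost(new, old, cost_per_side):
    """Cost as a fraction of leg capital: per-side cost x (weight bought + weight sold)."""
    bought = len(new - old) / len(new)
    sold = len(old - new) / len(old) if old else 0.0
    return cost_per_side * (bought + sold)


def period_return(signal, fwd_ret, prev_top=frozenset(), prev_bottom=frozenset(),
                  cost_bps=5.0):
    """Net long-only and long-short returns for one rebalance period."""
    top, bottom = quintile_legs(signal)
    cost = cost_bps / 1e4
    long_ret = sum(fwd_ret[t] for t in top) / len(top)
    short_ret = sum(fwd_ret[t] for t in bottom) / len(bottom)
    lo = long_ret - leg_cost(top, prev_top, cost)
    ls = (long_ret - short_ret
          - leg_cost(top, prev_top, cost)
          - leg_cost(bottom, prev_bottom, cost))
    return lo, ls, top, bottom


signal = {f"T{i}": float(i) for i in range(10)}
fwd_ret = {f"T{i}": 0.01 * i for i in range(10)}
lo, ls, top, bottom = period_return(signal, fwd_ret)
```

The long-short return subtracts the bottom quintile's return, so in a year where almost everything rises the short leg is a drag even when the ranking itself is informative.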

The event study (notebook 04): 139 bullish mention spikes (mentions ≥ ticker p99 + polarity > 0) deliver mean abnormal returns of +1.17% (t+1) → +1.52% (t+10), concentrated in WSB (CAR(+5) = +1.43%) vs r/investing (+0.27%). Tactical short-horizon meme-momentum effect, not a durable cross-sectional factor.
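The spike rule and the cumulative abnormal return can be sketched as follows. Two things here are assumptions, not confirmed by the notebook: abnormal return is taken as simple market-adjusted return (stock minus benchmark), and the p99 threshold is a nearest-rank percentile over the ticker's full sample, which has look-ahead if used for live trading.

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer in 1..100."""
    s = sorted(values)
    rank = -(-pct * len(s) // 100)  # ceiling division, avoids float rounding
    return s[max(0, rank - 1)]


def bullish_spikes(mentions, polarity, pct=99):
    """Indices of days where mentions >= the ticker's p99 and polarity > 0."""
    thr = percentile(mentions, pct)
    return [i for i, (m, p) in enumerate(zip(mentions, polarity))
            if m >= thr and p > 0]


def car(stock_ret, bench_ret, event_idx, horizon):
    """Cumulative market-adjusted return over days t+1 .. t+horizon."""
    window = range(event_idx + 1, event_idx + 1 + horizon)
    return sum(stock_ret[t] - bench_ret[t] for t in window)


# Toy series for one ticker: three high-mention days, one of them bearish.
mentions = [1] * 197 + [50, 60, 70]
polarity = [0.1] * 198 + [-0.3, 0.4]
events = bullish_spikes(mentions, polarity)
```

Averaging `car(...)` across all events at each horizon gives the mean abnormal-return path reported for the event study.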

Scope note (2026-03-28 revision). US equities only, single-archive scope. Post data comes from one public Kaggle archive, leukipp/reddit-finance-data, covering full-year 2021 submissions across four subreddits: wallstreetbets (775k), stocks (76k), investing (42k), options (29k), ~920k documents in total. The originally planned larger archive (curiel/rwallstreetbets-posts-and-comments, 3M posts + comments, 2023-06 → 2025-04) was dropped after pre-flight scoring failed repeatedly, both on local CPU (non-linear slowdown to a >20 h ETA) and in a Kaggle kernel (P100 CUDA kernel incompatibility with the preinstalled PyTorch; accelerator pinning did not take effect). The trade-off behind that decision is explained in the "Failure discussion" section below. There is no live collector in this repo (Reddit PRAW forward collection was descoped; see plan_repo3.md). Taiwan (PTT) and crypto sentiment are deferred; see SDD.md.

The question

Retail-forum chatter is noisy, biased, and self-promotional, yet 2021 made it clear that it can drive prices in the short term. Two follow-up questions matter for a portfolio repo:

Is there a tradable cross-sectional sentiment factor once costs and ticker-recognition errors are accounted for?
Is any of its alpha incremental to the classic momentum / reversal factors from classic-factors?

A single-year, cross-subreddit view (WSB meme crowd vs. r/stocks vs. r/investing vs. r/options) gives the narrative a natural control: if the sentiment factor shows very different strength between WSB and r/investing, that is evidence of a sub-population effect rather than universal retail-forum alpha.

Method outline

Data: one Kaggle archive — leukipp/reddit-finance-data, 2021 full year, submissions only across four subreddits (WSB, stocks, investing, options). S&P 500 OHLCV via qtools.data.loaders.us.
Entity linking: regex \$?[A-Z]{2,5}\b → S&P 500 whitelist → common-word blacklist (CEO / YOLO / FY / …). Four S&P 500 tickers — COO (Cooper), DD (DuPont), IT (Gartner), MAR (Marriott) — are deliberately blacklisted because their all-caps form is dominated by slang / calendar usage; these four companies are invisible to this pipeline by design, documented in src/alt_sentiment/entity_linking.py. Precision + recall spot-checked on 100 hand-labelled posts in notebook 01.
Sentiment model: ProsusAI/finbert (HuggingFace, CPU batch). Truncated to 512 tokens. Scored once per document, then fanned out to every ticker linked to that document (cost is O(documents), not O(documents × tickers)).
Daily factor: four variants reported side by side to avoid cherry-picking a friendly formula —
1. sum((pos − neg) · log1p(mentions)) — combined sentiment × attention
2. mean(pos − neg) — pure sentiment
3. log1p(doc_count) — pure attention
4. mean(pos − neg) · n/(n+10) — shrinkage toward zero for thin names
Evaluation: Rank IC with Newey-West / HAC t-stat (daily signal has autocorrelation; plain t-stat overstates significance), p-value, quintile long-short and long-only (monthly rebalance, US_EQUITY cost), incremental OLS vs momentum + reversal, event study ±10 trading days on mention-spike pump events (GameStop 2021-01 as a dedicated case study).
Cross-subreddit analysis: the same factor is computed per subreddit in addition to the pooled version, so the WSB-vs-rest contrast is explicit.
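The four daily factor variants listed above can be sketched for a single (ticker, day) cell as below. The mapping of formulas to the names `sum_sent_x_attn`, `mean_sent`, `attn_log` and `shrinkage_sent` is inferred from the results tables, and treating `mentions` as a per-document mention count is an assumption; src/alt_sentiment/factor.py is the reference.

```python
import math


def daily_factors(docs, k=10):
    """Aggregate one ticker's documents for one day.

    docs: list of (polarity, mentions) pairs, where polarity = pos - neg
          from FinBERT and mentions is the ticker's mention count in that
          document (assumed definition).
    k:    shrinkage constant from the n / (n + 10) formula.
    """
    n = len(docs)
    if n == 0:
        return None  # no documents linked to this ticker today
    mean_sent = sum(p for p, _ in docs) / n
    return {
        "sum_sent_x_attn": sum(p * math.log1p(m) for p, m in docs),
        "mean_sent": mean_sent,
        "attn_log": math.log1p(n),
        "shrinkage_sent": mean_sent * n / (n + k),
    }


example = daily_factors([(0.5, 1), (-0.1, 3)])
```

With only two documents the shrinkage variant scales the mean polarity by 2/12, so thinly covered names contribute much less than heavily discussed ones.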

Layout

alt-data-sentiment/
├── README.md
├── pyproject.toml
├── scripts/
│   ├── download_kaggle_wsb.py      # Kaggle archive pull
│   ├── download_prices.py          # S&P 500 OHLCV via qtools
│   ├── score_sentiment.py          # FinBERT batch scorer
│   └── build_0N_*.py               # notebook generators
├── notebooks/
│   ├── 01_data_quality.ipynb
│   ├── 02_sentiment_factor.ipynb
│   ├── 03_vs_classic_factors.ipynb
│   └── 04_event_study.ipynb
├── src/alt_sentiment/
│   ├── entity_linking.py           # ticker extraction + whitelist/blacklist
│   ├── sentiment.py                # FinBERT wrapper
│   ├── factor.py                   # daily aggregation + IC + quintile helpers
│   └── loaders/
│       ├── leukipp.py              # active loader
│       └── curiel.py               # parked — see failure discussion
├── reports/figures/
└── data/                           # gitignored

Honest caveats (to be expanded in the final README)

Single-year sample: 2021 was a meme-stock year. Cross-subreddit contrasts are robust within 2021 but external validity to 2022+ is assumed, not demonstrated, in this repo.
Selection bias: WSB is a vocal subset of retail, heavily skewed to meme / high-beta names. The four-subreddit design partly probes this, but all four skew bullish and US-centric.
Ticker recognition errors: the blacklist is hand-maintained. Four real S&P 500 names (COO, DD, IT, MAR) are deliberately excluded; the remaining false-positive surface (e.g. NOW, LOW written as emphasis vs. the companies) is left in, to be spot-checked in notebook 01 via both precision AND recall.
Survivorship / universe freeze: the S&P 500 constituent list is a snapshot taken when scripts/download_prices.py was first run. Index additions and removals are therefore handled consistently (one frozen list throughout) rather than with point-in-time accuracy. This is the same universe-freeze simplification used in ml-cross-sectional and ml-return-forecast.
Kaggle archive, not live: no forward collector in this repo.
FinBERT domain gap: trained on financial news, not social slang — "to the moon 🚀" and similar are routinely mis-scored. A spot check on 20 typical WSB lines is included in notebook 01.
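The ticker-recognition caveat is easier to see with a small sketch of the regex → whitelist → blacklist chain. The regex is copied from the method outline; the `SP500` set is a tiny illustrative subset and the `BLACKLIST` contains only the entries named in this README, so neither reflects the full lists in src/alt_sentiment/entity_linking.py. The function name `link_tickers` is hypothetical.

```python
import re

# Pattern as stated in the method outline. It has no leading word boundary,
# so it can also match the tail of a longer all-caps token; the whitelist
# is what filters such candidates out.
TICKER_RE = re.compile(r"\$?[A-Z]{2,5}\b")

SP500 = {"AAPL", "TSLA", "NOW", "LOW", "IT", "DD", "COO", "MAR"}  # illustrative subset
BLACKLIST = {"CEO", "YOLO", "FY", "COO", "DD", "IT", "MAR"}        # entries named in this README


def link_tickers(text):
    """Return the sorted S&P 500 tickers recognised in a post."""
    candidates = {m.group().lstrip("$") for m in TICKER_RE.finditer(text)}
    return sorted(t for t in candidates if t in SP500 and t not in BLACKLIST)


linked = link_tickers("YOLO on $TSLA, did my DD. AAPL to the moon NOW")
```

Here `DD` is dropped by design, while `NOW` is linked to ServiceNow even though the author meant the adverb. That is the residual false-positive surface the precision and recall spot check is meant to quantify.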

Failure discussion — why only one dataset

The original plan had two archives: leukipp (2021) and curiel/rwallstreetbets-posts-and-comments (3M rows, 2023-06 → 2025-04), analysed as two regime case studies. During M3 we pre-flighted FinBERT scoring on curiel and hit two hard stops:

Local CPU run — the first ~10 chunks scored at ~9 min each, then the per-chunk time blew up non-linearly (chunk 13 took 77 min). The later 2024-2025 chunks have both a much higher ticker-mention density (7-8% of rows vs 0.5% in the EDA slice) AND longer text (comments approach the 512-token FinBERT cap), so each chunk needs ~15× the FinBERT compute of the benchmarked early chunks. Extrapolated full-run: >20 h. Not viable for a portfolio repo.
Kaggle GPU run — we packaged the scoring as a Kaggle notebook (generator in scripts/kaggle/) and pushed via the Kaggle CLI with enable_gpu: true + --accelerator "GPU T4 x2". Kaggle nevertheless assigned a Tesla P100 (compute capability sm_60), which the preinstalled PyTorch no longer ships kernels for (PyTorch 2.4+ dropped sm_60). A runtime compute-capability check then fell back to CPU, but the CPU fallback also took >90 min and our polling loop timed out.
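A runtime compute-capability check of the kind mentioned above might look like the sketch below. The `(7, 0)` minimum is a hypothetical threshold chosen for illustration, and the function names are not taken from scripts/kaggle/; `torch.cuda.is_available()` and `torch.cuda.get_device_capability()` are real PyTorch calls, imported lazily so the sketch also runs where PyTorch is not installed.

```python
def detect_capability():
    """Return the GPU's (major, minor) compute capability, or None if unavailable."""
    try:
        import torch  # optional dependency in this sketch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


def pick_device(capability, min_capability=(7, 0)):
    """Choose 'cuda' only if the GPU meets the assumed minimum capability."""
    if capability is None:
        return "cpu"
    return "cuda" if tuple(capability) >= tuple(min_capability) else "cpu"


if __name__ == "__main__":
    print("scoring on:", pick_device(detect_capability()))
```

Under this assumed threshold a P100, which reports `(6, 0)`, falls back to CPU while a T4, which reports `(7, 5)`, stays on GPU. A silent fallback like this avoids a crash but, as the run above shows, can still blow the time budget.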

The pragmatic call: focus on leukipp (920k submissions, short texts, fits on a laptop CPU in 30-60 min), and document this history instead of shipping half-curiel-half-leukipp with mismatched regimes. The cross-subreddit angle within 2021 — WSB meme crowd vs r/investing long-term-bias vs r/options leverage bias — is a genuinely richer comparison than a cleaner one-year WSB file would have been.

Quickstart

conda create -n alt-data-sentiment python=3.13 -y
conda activate alt-data-sentiment
pip install -e .

# Archive (requires KAGGLE_API_TOKEN in .env, see .env.example)
python scripts/download_kaggle_wsb.py     # leukipp (+ parked curiel pull)
python scripts/download_prices.py         # 503 S&P 500 symbols, 2020-2025

# Sentiment scoring — leukipp via Kaggle T4/CPU (preferred, see below)
#   Push + poll + download from Kaggle:
python scripts/kaggle/run_on_kaggle.py --dataset leukipp
#   OR run locally on CPU (slower):
python scripts/score_sentiment.py --dataset leukipp --batch-size 16

# Build & execute the four notebooks (each takes 1-3 min after scoring)
for n in 01_data_quality 02_sentiment_factor 03_vs_classic_factors 04_event_study; do
    python scripts/build_${n}.py
    jupyter nbconvert --to notebook --execute --inplace notebooks/${n}.ipynb
done

Notebook tour

01_data_quality.ipynb — raw archive coverage by subreddit, link-rate after entity linking, top ticker frequencies, naive bot heuristic, FinBERT class-probability distribution, 10-row text spot check.
02_sentiment_factor.ipynb — four factor variants × Newey-West HAC IC × monthly L/S × long-only × per-subreddit IC matrix. The money-shot subreddit-contrast result lives here.
03_vs_classic_factors.ipynb — pairwise correlation, monthly OLS, pooled HAC regression of fwd21 on sentiment + momentum + reversal. Per-subreddit incremental check.
04_event_study.ipynb — ±10 trading-day CAR around bullish mention spikes, per subreddit, plus a single-name case-study panel.
