Dox-Alpha/CSAlpha

CSAlpha

Cross-sectional alpha signal modeling with grouped factors and stock-wise attention.

CSAlpha is a PyTorch research pipeline for daily cross-sectional stock return forecasting. It builds an A-share multi-factor dataset from Baostock daily bars, trains a grouped-factor encoder with cross-sectional self-attention, and evaluates the resulting alpha signal with IC, RankIC, decile returns, and Top-N long-short portfolios.

This repository is intended as a research and engineering demo for cross-sectional alpha modeling. It is not investment advice.

✨ Highlights

  • Cross-sectional modeling: self-attention runs along the stock dimension, so each stock is scored in the context of the same day's universe.
  • Grouped factor encoder: price, volume/liquidity, momentum, and volatility features are encoded by separate MLP blocks before being merged.
  • Ranking-aware objective: training combines an IC loss, a ListMLE ranking loss, and a lightly weighted masked MSE term.
  • Variable-size daily panels: each training sample is one trading day; batches are padded across days and handled with masks.
  • Evaluation workflow: reports IC/RankIC, ICIR/RankICIR, decile returns, Top-N long-short metrics, turnover estimates, and optional EWMA score smoothing.
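Because each sample is one trading day with a different number of stocks, days must be padded to a common size and tracked with a validity mask before batching. A minimal sketch of that idea (the function name and shapes are illustrative, not the repository's actual API):

```python
import torch

def pad_days(day_panels):
    """Pad a list of (N_i, F) daily feature panels to a common stock count.

    Returns a (B, N_max, F) tensor plus a boolean mask that is True for
    real stocks and False for padded slots.
    """
    n_max = max(p.shape[0] for p in day_panels)
    f = day_panels[0].shape[1]
    batch = torch.zeros(len(day_panels), n_max, f)
    mask = torch.zeros(len(day_panels), n_max, dtype=torch.bool)
    for i, p in enumerate(day_panels):
        batch[i, : p.shape[0]] = p
        mask[i, : p.shape[0]] = True
    return batch, mask
```

Downstream losses and attention layers can then ignore padded slots by consulting the mask.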

📈 Results Snapshot

A lightweight reference result is available in docs/results.md. The full experiments/ directory is intentionally excluded from version control because it contains logs, checkpoints, predictions, and other generated artifacts.

Reference run highlights on the 2024-2025 test split:

  • Smoothed RankIC mean: 0.0562
  • Estimated long-short one-way turnover reduced from 56.80% to 25.47% with EWMA smoothing
  • After-cost smoothed long-short Sharpe: 1.2137 under a 5 bps one-way cost assumption
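One common convention for the turnover and after-cost numbers above is to measure one-way turnover as half the L1 distance between consecutive portfolio weight vectors and to deduct a proportional cost per unit traded. A sketch under that assumed convention (the repository's exact accounting may differ):

```python
import numpy as np

def one_way_turnover(w_prev, w_curr):
    # Half the L1 distance between consecutive weight vectors
    # (a common, but not universal, turnover convention).
    return 0.5 * np.abs(w_curr - w_prev).sum()

def net_return(gross_ret, turnover, cost_bps=5.0):
    # Deduct a proportional cost per unit of one-way turnover.
    return gross_ret - turnover * cost_bps * 1e-4
```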

Figure: long-short net value comparison.

🛠️ Installation

Choose one environment manager.

Option A: uv

uv venv
source .venv/bin/activate
uv pip install -r requirements.txt

Option B: conda

conda create -n csalpha python=3.11 -y
conda activate csalpha
pip install -r requirements.txt

Option C: standard venv

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

The core dependencies are torch, pandas, pyarrow, scipy, pyyaml, tqdm, matplotlib, and baostock.

If the default PyTorch wheel does not match your GPU driver, install a driver-compatible PyTorch wheel first, then install the remaining requirements.

🚀 Quick Start

Run the full pipeline:

bash run.sh

This is equivalent to:

bash scripts/01_download.sh
bash scripts/02_build_features.sh
bash scripts/03_train.sh
bash scripts/04_evaluate.sh

For a smaller smoke run, first download only a subset of stocks, then train a single seed:

python -m src.data.download_baostock --config configs/default.yaml --max_stocks 50
python -m src.data.features --config configs/default.yaml
python -m src.train --config configs/default.yaml --seeds 42 --run-name smoke
python -m src.evaluate --config configs/default.yaml --run smoke

The smoke run checks that the pipeline works, but it is not meant to produce meaningful research results.

🗂️ Data

The default configuration uses the CSI 500 universe from Baostock. To reduce survivorship bias, the downloader samples historical index constituents every data.universe_refresh_days days and takes their union.
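The union-of-constituents idea can be sketched as follows; this is illustrative only (the actual downloader queries Baostock for historical index membership):

```python
def build_universe(constituents_by_date, refresh_days):
    """Union of index constituents sampled every `refresh_days` trading days.

    `constituents_by_date` maps ordered trading dates to constituent lists.
    Taking the union of periodic snapshots keeps stocks that later dropped
    out of the index, which reduces survivorship bias.
    """
    dates = sorted(constituents_by_date)
    universe = set()
    for d in dates[::refresh_days]:
        universe.update(constituents_by_date[d])
    return sorted(universe)
```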

The default date split is:

  • Train: 2013-01-01 to 2021-12-31
  • Validation: 2022-01-01 to 2023-12-31
  • Test: 2024-01-01 to 2025-12-31

Generated files are stored locally and are not committed:

data/raw/          # One parquet file per stock
data/processed/    # One parquet file per trading day
experiments/       # Checkpoints, logs, predictions, summaries, and plots

See data/README.md for the data directory layout.

🧠 Method Overview

Features are computed from adjusted daily OHLCV data. The label is the close-to-close return from t+1 to t+2, while features only use information available at or before day t. This keeps the target aligned with a realistic next-day tradability assumption.
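In pandas terms, the t+1 to t+2 close-to-close label aligned to feature date t might look like this (a sketch of the labeling convention described above, not the repository's actual feature code):

```python
import pandas as pd

def make_label(close: pd.Series) -> pd.Series:
    # Return earned from day t+1's close to day t+2's close,
    # indexed at day t so it only pairs with features known at t.
    return close.shift(-2) / close.shift(-1) - 1.0
```

The last two rows are NaN by construction, since their forward closes are unknown.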

The model architecture is:

daily panel (N stocks, F features)
        |
        |-- GroupMLP(price)
        |-- GroupMLP(volume)
        |-- GroupMLP(momentum)
        |-- GroupMLP(volatility)
        |
      concat -> projection -> cross-sectional self-attention -> regression score
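The diagram above can be sketched as a small PyTorch module. Group slices, widths, and head counts here are illustrative defaults, not the repository's configuration:

```python
import torch
import torch.nn as nn

class GroupedAttentionScorer(nn.Module):
    """Grouped MLP encoders, concat/projection, then attention over stocks."""

    def __init__(self, group_dims, hidden=32, heads=4):
        super().__init__()
        self.group_dims = group_dims
        # One small MLP per feature group (price, volume, momentum, volatility).
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Linear(d, hidden), nn.ReLU()) for d in group_dims
        )
        self.proj = nn.Linear(hidden * len(group_dims), hidden)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x, mask):
        # x: (B, N, F) padded daily panels; mask: (B, N), True for real stocks.
        parts, start = [], 0
        for d, enc in zip(self.group_dims, self.encoders):
            parts.append(enc(x[..., start:start + d]))
            start += d
        h = self.proj(torch.cat(parts, dim=-1))
        # Self-attention runs along the stock dimension; padded slots are
        # excluded via key_padding_mask (which expects True = ignore).
        h, _ = self.attn(h, h, h, key_padding_mask=~mask)
        return self.head(h).squeeze(-1)  # one score per stock
```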

The training loss is:

L = 0.5 * L_IC + 0.4 * L_ListMLE + 0.1 * L_MSE

where L_IC directly optimizes cross-sectional correlation, L_ListMLE encourages correct ranking, and masked MSE provides a small scale anchor.
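For a single day's cross-section, the three terms might be sketched as below. These are standard formulations of IC and ListMLE losses written to match the stated weights; the repository's implementations (masking, batching, numerical details) may differ:

```python
import torch

def ic_loss(scores, returns):
    # Negative cross-sectional Pearson correlation: minimizing this
    # maximizes the daily IC.
    s = scores - scores.mean()
    r = returns - returns.mean()
    return -(s * r).sum() / (s.norm() * r.norm() + 1e-8)

def listmle_loss(scores, returns):
    # ListMLE: negative log-likelihood of the true return ordering
    # under a Plackett-Luce model over the scores.
    order = returns.argsort(descending=True)
    s = scores[order]
    # Suffix log-sum-exp at each rank position via a double flip.
    return (torch.logcumsumexp(s.flip(0), dim=0).flip(0) - s).sum()

def total_loss(scores, returns):
    mse = ((scores - returns) ** 2).mean()
    return 0.5 * ic_loss(scores, returns) + 0.4 * listmle_loss(scores, returns) + 0.1 * mse
```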

📊 Evaluation

src.evaluate loads the checkpoints from a run directory and writes evaluation artifacts into the same run directory. By default, experiments/latest points to the most recent training run.

Typical output:

experiments/exp_YYYYMMDD_HHMMSS/
├── config.yaml
├── train.log
├── checkpoints.json
├── seed_42/best.pt
├── seed_2024/best.pt
├── seed_7/best.pt
├── evaluate.log
├── summary.md
├── test_daily.csv
├── test_daily_gross.csv
├── test_predictions.parquet
├── test_turnover_gross.csv
├── test_turnover_smoothed.csv
├── ls_curve.png
├── ic_timeseries.png
├── ic_cumulative.png
├── group_curves.png
├── drawdown.png
├── turnover_timeseries.png
└── monthly_heatmap.png

The main report is summary.md, which compares raw scores and EWMA-smoothed scores when eval.smooth_span > 1.
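The smoothing step itself is a per-stock exponentially weighted average of scores across days, which is what lowers turnover at a small cost in responsiveness. A sketch assuming a dates-by-stocks score frame (names are illustrative):

```python
import pandas as pd

def smooth_scores(scores: pd.DataFrame, span: int) -> pd.DataFrame:
    # EWMA-smooth each stock's score series across days
    # (rows = trading dates, columns = stocks).
    if span <= 1:
        return scores
    return scores.ewm(span=span, adjust=False).mean()
```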

⚙️ Configuration

Most behavior is controlled in configs/default.yaml, including:

  • universe and date ranges
  • feature groups
  • cross-sectional MAD clipping and z-score settings
  • model width, attention heads, depth, and dropout
  • loss weights
  • seeds, optimizer settings, EMA, early stopping
  • Top-N portfolio settings, transaction cost, and EWMA smoothing
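A hypothetical fragment showing how such a file might be organized; the key names below follow the bullet list above and are not copied from the actual configs/default.yaml:

```yaml
# Illustrative only -- consult configs/default.yaml for the real schema.
data:
  universe: csi500
  train: ["2013-01-01", "2021-12-31"]
model:
  hidden_dim: 64
  attn_heads: 4
  dropout: 0.1
loss:
  ic: 0.5
  listmle: 0.4
  mse: 0.1
eval:
  top_n: 50
  cost_bps: 5
  smooth_span: 5
```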

📦 Project Structure

CSAlpha/
├── configs/default.yaml        # Data ranges, feature groups, model, train, and eval settings
├── data/                       # Local data directory; raw and processed data are git-ignored
├── docs/                       # Lightweight result snapshots and public-facing assets
├── scripts/                    # Step-by-step pipeline wrappers
├── src/
│   ├── data/                   # Download, feature generation, preprocessing, dataset
│   ├── losses/                 # IC, ListMLE, and masked MSE losses
│   ├── metrics/                # Daily IC, RankIC, portfolio, and grouping metrics
│   ├── models/                 # Group MLP and cross-sectional attention model
│   ├── evaluate.py             # Test-set evaluation and report generation
│   ├── train.py                # Multi-seed training with EMA and early stopping
│   └── utils.py
├── run.sh                      # Full pipeline: download -> features -> train -> evaluate
├── requirements.txt
└── README.md

⚠️ Limitations

  • The current feature set is based on daily market data only; fundamentals, industry exposures, and intraday data are not included.
  • The evaluation is a research backtest. It does not fully model execution constraints, market impact, liquidity limits, or limit-up/limit-down fill failures.
  • Industry and size neutralization are not implemented in the default pipeline.
  • Baostock is a convenient public data source, but users should independently verify data quality and licensing before any serious use.

📄 License

This project is released under the MIT License. See LICENSE.

🙏 Acknowledgements

The feature naming and panel-style data organization are partly inspired by Microsoft Qlib.
