Cross-sectional alpha signal modeling with grouped factors and stock-wise attention.
CSAlpha is a PyTorch research pipeline for daily cross-sectional stock return forecasting. It builds an A-share multi-factor dataset from Baostock daily bars, trains a grouped-factor encoder with cross-sectional self-attention, and evaluates the resulting alpha signal with IC, RankIC, decile returns, and Top-N long-short portfolios.
This repository is intended as a research and engineering demo for cross-sectional alpha modeling. It is not investment advice.
- Cross-sectional modeling: self-attention runs along the stock dimension, so each stock is scored in the context of the same day's universe.
- Grouped factor encoder: price, volume/liquidity, momentum, and volatility features are encoded by separate MLP blocks before being merged.
- Ranking-aware objective: training combines IC loss, ListMLE ranking loss, and a light masked MSE term.
- Variable-size daily panels: each training sample is one trading day; batches are padded across days and handled with masks.
- Evaluation workflow: reports IC/RankIC, ICIR/RankICIR, decile returns, Top-N long-short metrics, turnover estimates, and optional EWMA score smoothing.
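The daily RankIC that the evaluation workflow reports can be sketched as a per-day Spearman correlation between scores and realized returns. This is a minimal illustration, not CSAlpha's actual implementation; the column names `date`, `score`, and `label` are assumptions.

```python
import pandas as pd
from scipy.stats import spearmanr

def daily_rank_ic(preds: pd.DataFrame) -> pd.Series:
    """Per-day Spearman correlation between model score and realized return.

    Assumes columns 'date', 'score', 'label'; these names are illustrative
    and may differ from the ones used in src/metrics.
    """
    def _ric(day: pd.DataFrame) -> float:
        rho, _ = spearmanr(day["score"], day["label"])
        return rho

    return preds.groupby("date")[["score", "label"]].apply(_ric)
```

RankICIR is then simply the mean of this series divided by its standard deviation.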
A lightweight reference result is available in docs/results.md. The full experiments/ directory is intentionally excluded from version control because it contains logs, checkpoints, predictions, and other generated artifacts.
Reference run highlights on the 2024-2025 test split:
- Smoothed RankIC mean: 0.0562
- Estimated long-short one-way turnover reduced from 56.80% to 25.47% with EWMA smoothing
- After-cost smoothed long-short Sharpe: 1.2137 under a 5 bps one-way cost assumption
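The one-way turnover number above measures how much of the Top-N book is replaced each day. A minimal sketch under an equal-weight assumption (the function name and set-based holdings representation are illustrative, not the pipeline's actual code):

```python
def one_way_turnover(holdings: list[set[str]]) -> float:
    """Average fraction of an equal-weight Top-N book replaced each day.

    `holdings` is a list of per-day sets of held tickers; equal weights are
    a simplification of the real portfolio logic.
    """
    rates = []
    for prev, cur in zip(holdings, holdings[1:]):
        if not cur:
            continue
        # Fraction of today's book that was not held yesterday.
        rates.append(len(cur - prev) / len(cur))
    return sum(rates) / len(rates) if rates else 0.0
```

Smoother scores change the daily ranking less, which is why EWMA smoothing lowers this estimate.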
Choose one environment manager.
With uv:

```bash
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt
```

With conda:

```bash
conda create -n csalpha python=3.11 -y
conda activate csalpha
pip install -r requirements.txt
```

With venv:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

The core dependencies are torch, pandas, pyarrow, scipy, pyyaml, tqdm, matplotlib, and baostock.
If the default PyTorch wheel does not match your GPU driver, install a driver-compatible PyTorch wheel first, then install the remaining requirements.
Run the full pipeline:
```bash
bash run.sh
```

This is equivalent to:

```bash
bash scripts/01_download.sh
bash scripts/02_build_features.sh
bash scripts/03_train.sh
bash scripts/04_evaluate.sh
```

For a smaller smoke run, first download only a subset of stocks, then train a single seed:

```bash
python -m src.data.download_baostock --config configs/default.yaml --max_stocks 50
python -m src.data.features --config configs/default.yaml
python -m src.train --config configs/default.yaml --seeds 42 --run-name smoke
python -m src.evaluate --config configs/default.yaml --run smoke
```

The smoke run checks that the pipeline works, but it is not meant to produce meaningful research results.
The default configuration uses the CSI 500 universe from Baostock. To reduce survivorship bias, the downloader samples historical index constituents every data.universe_refresh_days days and takes their union.
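The union-of-historical-constituents idea can be sketched as follows. `fetch_constituents` is a hypothetical stand-in for whatever call returns index membership on a given day (e.g. a Baostock query); the actual downloader's interface may differ.

```python
from datetime import date, timedelta

def sample_union_universe(fetch_constituents, start: date, end: date,
                          refresh_days: int) -> set[str]:
    """Union of index constituents sampled every `refresh_days` days.

    `fetch_constituents(d)` is a hypothetical callable returning the set of
    member codes on day `d`. Taking the union across sample dates keeps
    stocks that later dropped out of the index, reducing survivorship bias.
    """
    universe: set[str] = set()
    d = start
    while d <= end:
        universe |= set(fetch_constituents(d))
        d += timedelta(days=refresh_days)
    return universe
```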
The default date split is:
- Train: 2013-01-01 to 2021-12-31
- Validation: 2022-01-01 to 2023-12-31
- Test: 2024-01-01 to 2025-12-31
Generated files are stored locally and are not committed:
```text
data/raw/        # One parquet file per stock
data/processed/  # One parquet file per trading day
experiments/     # Checkpoints, logs, predictions, summaries, and plots
```
See data/README.md for the data directory layout.
Features are computed from adjusted daily OHLCV data. The label is the close-to-close return from t+1 to t+2, while features only use information available at or before day t. This keeps the target aligned with a realistic next-day tradability assumption.
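The t+1 to t+2 label alignment can be sketched with per-stock shifts. Column names here are illustrative, not necessarily those used in src/data:

```python
import pandas as pd

def make_label(bars: pd.DataFrame) -> pd.Series:
    """Close-to-close return from t+1 to t+2, indexed at day t.

    `bars` has columns 'code', 'date', 'close' (illustrative names).
    Trading at t+1's close and marking at t+2's close keeps day-t features
    strictly ahead of the label window, i.e. next-day tradability.
    """
    close = bars.sort_values(["code", "date"]).groupby("code")["close"]
    # Return over [t+1, t+2], shifted back so it sits on row t.
    return close.shift(-2) / close.shift(-1) - 1.0
```

The last two rows of each stock necessarily have no label and are dropped from training.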
The model architecture is:
```text
daily panel (N stocks, F features)
  |
  |-- GroupMLP(price)
  |-- GroupMLP(volume)
  |-- GroupMLP(momentum)
  |-- GroupMLP(volatility)
  |
  concat -> projection -> cross-sectional self-attention -> regression score
```
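A condensed PyTorch sketch of this architecture, with illustrative dimensions and class names (the real model adds depth, dropout, and other details):

```python
import torch
import torch.nn as nn

class CrossSectionalScorer(nn.Module):
    """Grouped factor encoder + attention over the stock dimension (sketch)."""

    def __init__(self, group_dims: dict[str, int], hidden: int = 64, heads: int = 4):
        super().__init__()
        # One small MLP per factor group (price, volume, momentum, volatility).
        self.groups = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(d, hidden), nn.GELU(),
                                nn.Linear(hidden, hidden))
            for name, d in group_dims.items()
        })
        self.proj = nn.Linear(hidden * len(group_dims), hidden)
        # batch_first=True: attention mixes information across the N stocks.
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, feats: dict[str, torch.Tensor],
                pad_mask: torch.Tensor) -> torch.Tensor:
        # feats[name]: (B days, N stocks, d); pad_mask: (B, N), True = padded.
        h = torch.cat([self.groups[n](x) for n, x in feats.items()], dim=-1)
        h = self.proj(h)
        h, _ = self.attn(h, h, h, key_padding_mask=pad_mask)
        return self.head(h).squeeze(-1)  # (B, N) per-stock scores
```

The `key_padding_mask` is how variable-size daily panels survive padding: padded slots never receive attention weight.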
The training loss is:
```text
L = 0.5 * L_IC + 0.4 * L_ListMLE + 0.1 * L_MSE
```
where L_IC directly optimizes cross-sectional correlation, L_ListMLE encourages correct ranking, and masked MSE provides a small scale anchor.
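A single-day, unmasked sketch of these three terms (the real losses handle padding masks and batches of days; the ListMLE variant here is the standard Plackett-Luce form, which may differ in detail from the repo's):

```python
import torch

def ic_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Negative Pearson correlation across one day's cross-section."""
    s = scores - scores.mean()
    y = labels - labels.mean()
    return -(s * y).sum() / (s.norm() * y.norm() + 1e-8)

def listmle_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Negative Plackett-Luce log-likelihood of the label-implied ranking."""
    order = labels.argsort(descending=True)
    s = scores.gather(0, order)
    return (torch.logcumsumexp(s.flip(0), dim=0).flip(0) - s).mean()

def total_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Weights as stated above; mask handling omitted in this sketch.
    mse = torch.mean((scores - labels) ** 2)
    return 0.5 * ic_loss(scores, labels) \
        + 0.4 * listmle_loss(scores, labels) + 0.1 * mse
```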
src.evaluate loads the checkpoints from a run directory and writes evaluation artifacts into the same run directory. By default, experiments/latest points to the most recent training run.
Typical output:
```text
experiments/exp_YYYYMMDD_HHMMSS/
├── config.yaml
├── train.log
├── checkpoints.json
├── seed_42/best.pt
├── seed_2024/best.pt
├── seed_7/best.pt
├── evaluate.log
├── summary.md
├── test_daily.csv
├── test_daily_gross.csv
├── test_predictions.parquet
├── test_turnover_gross.csv
├── test_turnover_smoothed.csv
├── ls_curve.png
├── ic_timeseries.png
├── ic_cumulative.png
├── group_curves.png
├── drawdown.png
├── turnover_timeseries.png
└── monthly_heatmap.png
```
The main report is summary.md, which compares raw scores and EWMA-smoothed scores when eval.smooth_span > 1.
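The score smoothing itself is a per-stock EWMA along time. A minimal sketch with illustrative column names (`code`, `date`, `score`):

```python
import pandas as pd

def smooth_scores(preds: pd.DataFrame, span: int) -> pd.Series:
    """EWMA-smooth each stock's score along time.

    Mirrors what `eval.smooth_span > 1` enables: smoother scores change the
    day-to-day ranking less, trading some IC for lower turnover.
    """
    preds = preds.sort_values("date")
    return preds.groupby("code")["score"].transform(
        lambda s: s.ewm(span=span, adjust=False).mean()
    )
```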
Most behavior is controlled in configs/default.yaml, including:
- universe and date ranges
- feature groups
- cross-sectional MAD clipping and z-score settings
- model width, attention heads, depth, and dropout
- loss weights
- seeds, optimizer settings, EMA, early stopping
- Top-N portfolio settings, transaction cost, and EWMA smoothing
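The cross-sectional MAD clipping and z-scoring mentioned above can be sketched for one day's factor column as follows. The `n_mads` knob mirrors the kind of parameter configs/default.yaml exposes; the exact names there may differ.

```python
import numpy as np

def mad_clip_zscore(x: np.ndarray, n_mads: float = 5.0) -> np.ndarray:
    """Winsorize one day's factor by median absolute deviation, then z-score.

    1.4826 rescales the MAD to be comparable to a standard deviation under
    normality; `n_mads` is an illustrative clipping width.
    """
    med = np.nanmedian(x)
    mad = np.nanmedian(np.abs(x - med))
    lo = med - n_mads * 1.4826 * mad
    hi = med + n_mads * 1.4826 * mad
    clipped = np.clip(x, lo, hi)
    return (clipped - np.nanmean(clipped)) / (np.nanstd(clipped) + 1e-12)
```

Clipping before standardizing keeps a single extreme print from dominating the day's z-scores while preserving rank order.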
```text
CSAlpha/
├── configs/default.yaml   # Data ranges, feature groups, model, train, and eval settings
├── data/                  # Local data directory; raw and processed data are git-ignored
├── docs/                  # Lightweight result snapshots and public-facing assets
├── scripts/               # Step-by-step pipeline wrappers
├── src/
│   ├── data/              # Download, feature generation, preprocessing, dataset
│   ├── losses/            # IC, ListMLE, and masked MSE losses
│   ├── metrics/           # Daily IC, RankIC, portfolio, and grouping metrics
│   ├── models/            # Group MLP and cross-sectional attention model
│   ├── evaluate.py        # Test-set evaluation and report generation
│   ├── train.py           # Multi-seed training with EMA and early stopping
│   └── utils.py
├── run.sh                 # Full pipeline: download -> features -> train -> evaluate
├── requirements.txt
└── README.md
```
- The current feature set is based on daily market data only; fundamentals, industry exposures, and intraday data are not included.
- The evaluation is a research backtest. It does not fully model execution constraints, market impact, liquidity limits, or limit-up/limit-down fill failures.
- Industry and size neutralization are not implemented in the default pipeline.
- Baostock is a convenient public data source, but users should independently verify data quality and licensing before any serious use.
This project is released under the MIT License. See LICENSE.
The feature naming and panel-style data organization are partly inspired by Microsoft Qlib.
