Cross-sectional alpha signal modeling with grouped factors and stock-wise attention.
CSAlpha is a PyTorch research pipeline for daily cross-sectional stock return forecasting. It builds an A-share multi-factor dataset from Baostock daily bars, trains a grouped-factor encoder with cross-sectional self-attention, and evaluates the resulting alpha signal with IC, RankIC, decile returns, and Top-N long-short portfolios.
This repository is intended as a research and engineering demo for cross-sectional alpha modeling. It is not investment advice.
- Cross-sectional modeling: self-attention runs along the stock dimension, so each stock is scored in the context of the same day's universe.
- Grouped factor encoder: price, volume/liquidity, momentum, and volatility features are encoded by separate MLP blocks before being merged.
- Ranking-aware objective: training combines IC loss, ListMLE ranking loss, and a light masked MSE term.
- Variable-size daily panels: each training sample is one trading day; batches are padded across days and handled with masks.
- Evaluation workflow: reports IC/RankIC, ICIR/RankICIR, decile returns, Top-N long-short metrics, turnover estimates, and optional EWMA score smoothing.
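The daily RankIC that the evaluation workflow reports can be sketched as a per-day Spearman correlation between scores and realized returns. This is a minimal illustration, not CSAlpha's actual implementation; the column names `date`, `score`, and `label` are assumptions.

```python
import pandas as pd
from scipy.stats import spearmanr

def daily_rank_ic(preds: pd.DataFrame) -> pd.Series:
    """Per-day Spearman correlation between model score and realized return.

    Assumes columns 'date', 'score', 'label'; these names are illustrative
    and may differ from the ones used in src/metrics.
    """
    def _ric(day: pd.DataFrame) -> float:
        rho, _ = spearmanr(day["score"], day["label"])
        return rho

    return preds.groupby("date")[["score", "label"]].apply(_ric)
```

RankICIR is then simply the mean of this series divided by its standard deviation.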
A lightweight reference result is available in docs/results.md. The full experiments/ directory is intentionally excluded from version control because it contains logs, checkpoints, predictions, and other generated artifacts.
Reference run highlights on the 2024-2025 test split:
- Smoothed RankIC mean: 0.0562
- Estimated long-short one-way turnover reduced from 56.80% to 25.47% with EWMA smoothing
- After-cost smoothed long-short Sharpe: 1.2137 under a 5 bps one-way cost assumption
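The one-way turnover number above measures how much of the Top-N book is replaced each day. A minimal sketch under an equal-weight assumption (the function name and set-based holdings representation are illustrative, not the pipeline's actual code):

```python
def one_way_turnover(holdings: list[set[str]]) -> float:
    """Average fraction of an equal-weight Top-N book replaced each day.

    `holdings` is a list of per-day sets of held tickers; equal weights are
    a simplification of the real portfolio logic.
    """
    rates = []
    for prev, cur in zip(holdings, holdings[1:]):
        if not cur:
            continue
        # Fraction of today's book that was not held yesterday.
        rates.append(len(cur - prev) / len(cur))
    return sum(rates) / len(rates) if rates else 0.0
```

Smoother scores change the daily ranking less, which is why EWMA smoothing lowers this estimate.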
Choose one environment manager.
With uv:

```bash
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt
```

With conda:

```bash
conda create -n csalpha python=3.11 -y
conda activate csalpha
pip install -r requirements.txt
```

With venv:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

The core dependencies are torch, pandas, pyarrow, scipy, pyyaml, tqdm, matplotlib, and baostock.
If the default PyTorch wheel does not match your GPU driver, install a driver-compatible PyTorch wheel first, then install the remaining requirements.
Run the full pipeline:
```bash
bash run.sh
```

This is equivalent to:

```bash
bash scripts/01_download.sh
bash scripts/02_build_features.sh
bash scripts/03_train.sh
bash scripts/04_evaluate.sh
```

For a smaller smoke run, first download only a subset of stocks, then train a single seed:

```bash
python -m src.data.download_baostock --config configs/default.yaml --max_stocks 50
python -m src.data.features --config configs/default.yaml
python -m src.train --config configs/default.yaml --seeds 42 --run-name smoke
python -m src.evaluate --config configs/default.yaml --run smoke
```

The smoke run checks that the pipeline works, but it is not meant to produce meaningful research results.
The default configuration uses the CSI 500 universe from Baostock. To reduce survivorship bias, the downloader samples historical index constituents every data.universe_refresh_days days and takes their union.
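The union-of-historical-constituents idea can be sketched as follows. `fetch_constituents` is a hypothetical stand-in for whatever call returns index membership on a given day (e.g. a Baostock query); the actual downloader's interface may differ.

```python
from datetime import date, timedelta

def sample_union_universe(fetch_constituents, start: date, end: date,
                          refresh_days: int) -> set[str]:
    """Union of index constituents sampled every `refresh_days` days.

    `fetch_constituents(d)` is a hypothetical callable returning the set of
    member codes on day `d`. Taking the union across sample dates keeps
    stocks that later dropped out of the index, reducing survivorship bias.
    """
    universe: set[str] = set()
    d = start
    while d <= end:
        universe |= set(fetch_constituents(d))
        d += timedelta(days=refresh_days)
    return universe
```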
The default date split is:
- Train: 2013-01-01 to 2021-12-31
- Validation: 2022-01-01 to 2023-12-31
- Test: 2024-01-01 to 2025-12-31
Generated files are stored locally and are not committed:
```text
data/raw/        # One parquet file per stock
data/processed/  # One parquet file per trading day
experiments/     # Checkpoints, logs, predictions, summaries, and plots
```
See data/README.md for the data directory layout.
Features are computed from adjusted daily OHLCV data. The label is the close-to-close return from t+1 to t+2, while features only use information available at or before day t. This keeps the target aligned with a realistic next-day tradability assumption.
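The t+1 to t+2 label alignment can be sketched with per-stock shifts. Column names here are illustrative, not necessarily those used in src/data:

```python
import pandas as pd

def make_label(bars: pd.DataFrame) -> pd.Series:
    """Close-to-close return from t+1 to t+2, indexed at day t.

    `bars` has columns 'code', 'date', 'close' (illustrative names).
    Trading at t+1's close and marking at t+2's close keeps day-t features
    strictly ahead of the label window, i.e. next-day tradability.
    """
    close = bars.sort_values(["code", "date"]).groupby("code")["close"]
    # Return over [t+1, t+2], shifted back so it sits on row t.
    return close.shift(-2) / close.shift(-1) - 1.0
```

The last two rows of each stock necessarily have no label and are dropped from training.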
The model architecture is:
```text
daily panel (N stocks, F features)
  |
  |-- GroupMLP(price)
  |-- GroupMLP(volume)
  |-- GroupMLP(momentum)
  |-- GroupMLP(volatility)
  |
  concat -> projection -> cross-sectional self-attention -> regression score
```
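A condensed PyTorch sketch of this architecture, with illustrative dimensions and class names (the real model adds depth, dropout, and other details):

```python
import torch
import torch.nn as nn

class CrossSectionalScorer(nn.Module):
    """Grouped factor encoder + attention over the stock dimension (sketch)."""

    def __init__(self, group_dims: dict[str, int], hidden: int = 64, heads: int = 4):
        super().__init__()
        # One small MLP per factor group (price, volume, momentum, volatility).
        self.groups = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(d, hidden), nn.GELU(),
                                nn.Linear(hidden, hidden))
            for name, d in group_dims.items()
        })
        self.proj = nn.Linear(hidden * len(group_dims), hidden)
        # batch_first=True: attention mixes information across the N stocks.
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, feats: dict[str, torch.Tensor],
                pad_mask: torch.Tensor) -> torch.Tensor:
        # feats[name]: (B days, N stocks, d); pad_mask: (B, N), True = padded.
        h = torch.cat([self.groups[n](x) for n, x in feats.items()], dim=-1)
        h = self.proj(h)
        h, _ = self.attn(h, h, h, key_padding_mask=pad_mask)
        return self.head(h).squeeze(-1)  # (B, N) per-stock scores
```

The `key_padding_mask` is how variable-size daily panels survive padding: padded slots never receive attention weight.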
The training loss is:
```text
L = 0.5 * L_IC + 0.4 * L_ListMLE + 0.1 * L_MSE
```
where L_IC directly optimizes cross-sectional correlation, L_ListMLE encourages correct ranking, and masked MSE provides a small scale anchor.
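A single-day, unmasked sketch of these three terms (the real losses handle padding masks and batches of days; the ListMLE variant here is the standard Plackett-Luce form, which may differ in detail from the repo's):

```python
import torch

def ic_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Negative Pearson correlation across one day's cross-section."""
    s = scores - scores.mean()
    y = labels - labels.mean()
    return -(s * y).sum() / (s.norm() * y.norm() + 1e-8)

def listmle_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Negative Plackett-Luce log-likelihood of the label-implied ranking."""
    order = labels.argsort(descending=True)
    s = scores.gather(0, order)
    return (torch.logcumsumexp(s.flip(0), dim=0).flip(0) - s).mean()

def total_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Weights as stated above; mask handling omitted in this sketch.
    mse = torch.mean((scores - labels) ** 2)
    return 0.5 * ic_loss(scores, labels) \
        + 0.4 * listmle_loss(scores, labels) + 0.1 * mse
```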
src.evaluate loads the checkpoints from a run directory and writes evaluation artifacts into the same run directory. By default, experiments/latest points to the most recent training run.
Typical output:
```text
experiments/exp_YYYYMMDD_HHMMSS/
├── config.yaml
├── train.log
├── checkpoints.json
├── seed_42/best.pt
├── seed_2024/best.pt
├── seed_7/best.pt
├── evaluate.log
├── summary.md
├── test_daily.csv
├── test_daily_gross.csv
├── test_predictions.parquet
├── test_turnover_gross.csv
├── test_turnover_smoothed.csv
├── ls_curve.png
├── ic_timeseries.png
├── ic_cumulative.png
├── group_curves.png
├── drawdown.png
├── turnover_timeseries.png
└── monthly_heatmap.png
```
The main report is summary.md, which compares raw scores and EWMA-smoothed scores when eval.smooth_span > 1.
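The score smoothing itself is a per-stock EWMA along time. A minimal sketch with illustrative column names (`code`, `date`, `score`):

```python
import pandas as pd

def smooth_scores(preds: pd.DataFrame, span: int) -> pd.Series:
    """EWMA-smooth each stock's score along time.

    Mirrors what `eval.smooth_span > 1` enables: smoother scores change the
    day-to-day ranking less, trading some IC for lower turnover.
    """
    preds = preds.sort_values("date")
    return preds.groupby("code")["score"].transform(
        lambda s: s.ewm(span=span, adjust=False).mean()
    )
```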
Most behavior is controlled in configs/default.yaml, including:
- universe and date ranges
- feature groups
- cross-sectional MAD clipping and z-score settings
- model width, attention heads, depth, and dropout
- loss weights
- seeds, optimizer settings, EMA, early stopping
- Top-N portfolio settings, transaction cost, and EWMA smoothing
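The cross-sectional MAD clipping and z-scoring mentioned above can be sketched for one day's factor column as follows. The `n_mads` knob mirrors the kind of parameter configs/default.yaml exposes; the exact names there may differ.

```python
import numpy as np

def mad_clip_zscore(x: np.ndarray, n_mads: float = 5.0) -> np.ndarray:
    """Winsorize one day's factor by median absolute deviation, then z-score.

    1.4826 rescales the MAD to be comparable to a standard deviation under
    normality; `n_mads` is an illustrative clipping width.
    """
    med = np.nanmedian(x)
    mad = np.nanmedian(np.abs(x - med))
    lo = med - n_mads * 1.4826 * mad
    hi = med + n_mads * 1.4826 * mad
    clipped = np.clip(x, lo, hi)
    return (clipped - np.nanmean(clipped)) / (np.nanstd(clipped) + 1e-12)
```

Clipping before standardizing keeps a single extreme print from dominating the day's z-scores while preserving rank order.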
```text
CSAlpha/
├── configs/default.yaml   # Data ranges, feature groups, model, train, and eval settings
├── data/                  # Local data directory; raw and processed data are git-ignored
├── docs/                  # Lightweight result snapshots and public-facing assets
├── scripts/               # Step-by-step pipeline wrappers
├── src/
│   ├── data/              # Download, feature generation, preprocessing, dataset
│   ├── losses/            # IC, ListMLE, and masked MSE losses
│   ├── metrics/           # Daily IC, RankIC, portfolio, and grouping metrics
│   ├── models/            # Group MLP and cross-sectional attention model
│   ├── evaluate.py        # Test-set evaluation and report generation
│   ├── train.py           # Multi-seed training with EMA and early stopping
│   └── utils.py
├── run.sh                 # Full pipeline: download -> features -> train -> evaluate
├── requirements.txt
└── README.md
```
- The current feature set is based on daily market data only; fundamentals, industry exposures, and intraday data are not included.
- The evaluation is a research backtest. It does not fully model execution constraints, market impact, liquidity limits, or limit-up/limit-down fill failures.
- Industry and size neutralization are not implemented in the default pipeline.
- Baostock is a convenient public data source, but users should independently verify data quality and licensing before any serious use.
This project is released under the MIT License. See LICENSE.
The feature naming and panel-style data organization are partly inspired by Microsoft Qlib.
