Statistical Arbitrage Trading System

A production-grade statistical arbitrage (stat-arb) trading system that identifies market mispricings through quantitative factor analysis, portfolio optimization, and systematic execution. The system processes historical market data, generates alpha signals from multiple strategies, optimizes portfolio positions considering transaction costs and risk, and backtests trading strategies through multiple simulation engines.

Overview

This system implements a complete workflow for statistical arbitrage trading:

Data Loading & Preprocessing: Loads and processes market data from multiple sources
Alpha Generation: Calculates predictive signals from 20+ trading strategies
Factor Analysis: Decomposes returns using PCA and Barra risk models
Portfolio Optimization: Maximizes risk-adjusted returns with realistic constraints
Backtesting: Simulates execution across multiple engines with transaction cost modeling

The system is designed for daily rebalancing across roughly 1,400 US equities, with factor-based risk limits and execution cost modeling.

Features

Core Capabilities

Multi-Source Data Integration: Daily/intraday prices, Barra factors, analyst estimates, short locates
20+ Alpha Strategies: PCA decomposition, analyst signals, momentum, mean reversion, order flow
Advanced Optimization: NLP solver with factor risk, transaction costs, and participation constraints
Multiple Simulation Engines: Daily (BSIM), order-level (OSIM), intraday (QSIM), full system (SSIM)
Risk Management: Factor exposure limits, position sizing, sector neutrality
Realistic Execution Modeling: Market impact, slippage, borrow costs, VWAP vs. close fills

Technical Features

HDF5 Caching: Fast data loading with compressed storage
Vectorized Operations: Efficient pandas/numpy operations for large datasets
Rolling Window Analysis: Adaptive factor models with 30-60 day windows
Winsorization: Robust outlier handling at 5-sigma levels
Corporate Action Handling: Automatic adjustment for splits and dividends
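
As a rough illustration of the winsorization feature listed above, the sketch below clips values that lie more than five standard deviations from the mean. The function name `winsorize` and the use of the population standard deviation are assumptions made for this example; the repository's own routine in calc.py may differ. It is written for Python 3, since it relies on the standard `statistics` module.

```python
import statistics

def winsorize(values, n_sigma=5.0):
    """Clip each value to the interval mean +/- n_sigma * std."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population std: an assumption of this sketch
    lower = mean - n_sigma * std
    upper = mean + n_sigma * std
    return [min(max(v, lower), upper) for v in values]

# 99 ordinary observations plus one extreme outlier
data = [0.0] * 99 + [100.0]
clipped = winsorize(data)
print(clipped[-1])  # the outlier is pulled back to mean + 5 * std
```

Note that in small samples a single outlier inflates the standard deviation enough that it may never exceed five sigma, so clipping of this kind only has an effect on reasonably large cross-sections.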

Architecture

Data Flow

Raw Market Data (CSV/SQL)
    ↓
Load & Merge (loaddata.py)
    ↓
Calculate Returns & Features (calc.py)
    ↓
Filter Tradable Universe
    ↓
Generate Alpha Signals (strategy files)
    ↓
Fit Regression Coefficients (regress.py)
    ↓
PCA Decomposition (pca.py) [optional]
    ↓
Portfolio Optimization (opt.py)
    ↓
Simulation Engines (bsim/osim/qsim/ssim)
    ↓
Performance Analysis & Reporting

Key Components

Component	File	Description
Data Loading	`loaddata.py`	Load market data, fundamentals, analyst estimates
Calculations	`calc.py`	Forward returns, volume profiles, winsorization
Regression	`regress.py`	Fit alpha factors to forward returns (WLS)
PCA	`pca.py`	Principal component decomposition
Optimization	`opt.py`	Portfolio optimization with OpenOpt NLP
Big Sim	`bsim.py`	Daily rebalancing backtest
Order Sim	`osim.py`	Order-level execution backtest
Quote Sim	`qsim.py`	Intraday 30-min bar backtest
System Sim	`ssim.py`	Full lifecycle position tracking
Utilities	`util.py`	Helper functions for data merging

Installation

Requirements

Python 2.7 (legacy codebase)
NumPy 1.16.0
Pandas 0.23.4
OpenOpt 0.5628
statsmodels
scikit-learn
matplotlib
lmfit
MySQL connector (optional, for database access)

Setup

# Clone the repository
git clone https://github.com/yourusername/statarb.git
cd statarb

# Install dependencies
pip install -r requirements.txt

# For Cython optimization module (optional)
python setup.py build_ext --inplace

Quick Start

1. Prepare Data Directories

Set the base directories in loaddata.py:

UNIV_BASE_DIR = "/path/to/universe/"
PRICE_BASE_DIR = "/path/to/prices/"
BARRA_BASE_DIR = "/path/to/barra/"
BAR_BASE_DIR = "/path/to/bars/"
EARNINGS_BASE_DIR = "/path/to/earnings/"
LOCATES_BASE_DIR = "/path/to/locates/"
ESTIMATES_BASE_DIR = "/path/to/estimates/"

2. Run a Simple Backtest

# Run BSIM with a single alpha signal
python bsim.py --start=20130101 --end=20130630 \
    --fcast=hl:1:1 \
    --kappa=2e-8 \
    --maxnot=200e6

3. Combine Multiple Alphas

# Combine high-low and beta-adjusted signals
python bsim.py --start=20130101 --end=20130630 \
    --fcast=hl:1:0.6,bd:0.8:0.4 \
    --kappa=2e-8

Data Requirements

Required Data Sources

Universe Files (UNIV_BASE_DIR/YYYY/YYYYMMDD.csv)
- Columns: sid, ticker_root, status, country, currency
Price Files (PRICE_BASE_DIR/YYYY/YYYYMMDD.csv)
- Columns: sid, ticker, open, high, low, close, volume, mkt_cap
Barra Files (BARRA_BASE_DIR/YYYY/YYYYMMDD.csv)
- Risk factors: beta, momentum, size, volatility, etc. (13 factors)
- Industry classifications (58 industries)
Bar Files (BAR_BASE_DIR/YYYY/YYYYMMDD.h5)
- Intraday 30-minute bars with VWAP and volume
- Format: HDF5 with MultiIndex (timestamp, sid)
Locates File (LOCATES_BASE_DIR/borrow.csv)
- Short borrow availability and rates
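
The dated file layout above (BASE_DIR/YYYY/YYYYMMDD.ext) can be sketched as a small path helper. The function `daily_file_path` is hypothetical and not part of loaddata.py; it only shows how a YYYYMMDD date string maps to a file location.

```python
import os

def daily_file_path(base_dir, date, ext="csv"):
    """Build BASE_DIR/YYYY/YYYYMMDD.ext from a YYYYMMDD date string."""
    if len(date) != 8 or not date.isdigit():
        raise ValueError("date must be a YYYYMMDD string, got %r" % (date,))
    # The first four characters of the date select the year subdirectory
    return os.path.join(base_dir, date[:4], "%s.%s" % (date, ext))

price_path = daily_file_path("/path/to/prices/", "20130102")
bar_path = daily_file_path("/path/to/bars/", "20130102", ext="h5")
print(price_path)
print(bar_path)
```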

Universe Filters

Price Range: $2.00 - $500.00 (tradable) / $2.25 - $500.00 (expandable universe)
Min ADV: $1M (tradable) / $5M (expandable universe)
Country: USA
Currency: USD
Market Cap: Top 1,400 stocks by default
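
A minimal sketch of how these filters might be applied to one day of data follows. The function `filter_universe`, the record layout (plain dictionaries), and the inclusive price bounds are assumptions for illustration; the defaults mirror the tradable-universe thresholds listed above.

```python
def filter_universe(stocks, low_price=2.0, high_price=500.0,
                    min_advp=1e6, uni_size=1400):
    """Return the top `uni_size` eligible stocks, ranked by market cap."""
    eligible = [
        s for s in stocks
        if s["country"] == "USA"
        and s["currency"] == "USD"
        and low_price <= s["close"] <= high_price
        and s["advp"] >= min_advp
    ]
    # Largest market caps first, then keep the top N
    eligible.sort(key=lambda s: s["mkt_cap"], reverse=True)
    return eligible[:uni_size]

stocks = [
    {"sid": 1, "close": 50.0, "advp": 8e6, "mkt_cap": 5e9,
     "country": "USA", "currency": "USD"},
    {"sid": 2, "close": 1.5, "advp": 9e6, "mkt_cap": 7e9,
     "country": "USA", "currency": "USD"},   # below the price floor
    {"sid": 3, "close": 30.0, "advp": 5e5, "mkt_cap": 8e9,
     "country": "USA", "currency": "USD"},   # below the ADV floor
    {"sid": 4, "close": 120.0, "advp": 2e6, "mkt_cap": 9e9,
     "country": "USA", "currency": "USD"},
]
tradable = filter_universe(stocks)
print([s["sid"] for s in tradable])
```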

Usage

Portfolio Optimization

The optimization module (opt.py) maximizes:

Utility = Alpha - κ(Specific Risk + Factor Risk) - Slippage - Execution Costs

Key Parameters:

kappa: Risk aversion (2e-8 to 4.3e-5)
max_sumnot: Max total notional ($50M default)
max_posnot: Max position size (0.48% of capital)
slip_nu: Market impact coefficient (0.14-0.18)

Constraints:

Position limits: ±$40k-$1M per stock
Capital limits: $4-50M aggregate notional
Participation: Max 1.5% of ADV
Factor exposure: Limited Barra factor bets
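
The utility above can be evaluated for a given set of positions roughly as follows. This is a simplified sketch, not the objective coded in opt.py: the function `portfolio_utility` and its argument names are hypothetical, risk terms are taken as dollar variances, and slippage is passed in as a precomputed number because the document does not spell out its functional form.

```python
def portfolio_utility(positions, alphas, specific_var, loadings,
                      factor_cov, kappa, trades, exec_fee=0.00015,
                      slippage=0.0):
    """Alpha - kappa * (specific risk + factor risk) - slippage - execution cost."""
    n_factors = len(factor_cov)
    alpha_term = sum(a * p for a, p in zip(alphas, positions))
    specific_risk = sum(v * p * p for v, p in zip(specific_var, positions))
    # Portfolio exposure to each factor: sum_i loadings[i][k] * positions[i]
    exposures = [
        sum(loadings[i][k] * positions[i] for i in range(len(positions)))
        for k in range(n_factors)
    ]
    # Quadratic form exposures' * factor_cov * exposures
    factor_risk = sum(
        exposures[k] * factor_cov[k][l] * exposures[l]
        for k in range(n_factors) for l in range(n_factors)
    )
    exec_cost = exec_fee * sum(abs(t) for t in trades)
    return alpha_term - kappa * (specific_risk + factor_risk) - slippage - exec_cost

# Two-stock, one-factor toy example with a dollar-neutral book
utility = portfolio_utility(
    positions=[100.0, -100.0],
    alphas=[0.01, -0.02],
    specific_var=[0.0004, 0.0004],
    loadings=[[1.0], [1.0]],
    factor_cov=[[0.01]],
    kappa=0.1,
    trades=[100.0, -100.0],
)
print(utility)
```

In the toy example the long and short positions have equal factor loadings, so the factor exposure nets to zero and only specific risk is penalized.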

Simulation Engines

BSIM - Daily Simulation

The most comprehensive daily backtest, driven by optimized positions:

python bsim.py \
    --start=20130101 \
    --end=20130630 \
    --fcast=hl:1:0.5,bd:0.8:0.3,pca:1.2:0.2 \
    --horizon=3 \
    --kappa=2e-8 \
    --maxnot=200e6 \
    --locates=True \
    --vwap=False

Arguments:

--start/--end: Date range (YYYYMMDD)
--fcast: Alpha signals (format: name:multiplier:weight)
--horizon: Forecast horizon in days
--kappa: Risk aversion parameter
--maxnot: Maximum notional
--locates: Use short-locate (borrow availability) data
--vwap: Use VWAP execution (default: close)
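
To make the `name:multiplier:weight` format of `--fcast` concrete, here is a small parsing sketch. The function `parse_fcast` is hypothetical and is not claimed to match the argument handling inside bsim.py; it only illustrates how the comma-separated specification breaks down.

```python
def parse_fcast(spec):
    """Parse 'name:multiplier:weight,...' into (name, multiplier, weight) tuples."""
    forecasts = []
    for item in spec.split(","):
        parts = item.strip().split(":")
        if len(parts) != 3:
            raise ValueError("expected name:multiplier:weight, got %r" % (item,))
        name, mult, weight = parts
        forecasts.append((name, float(mult), float(weight)))
    return forecasts

forecasts = parse_fcast("hl:1:0.6,bd:0.8:0.4")
for name, mult, weight in forecasts:
    print(name, mult, weight)
```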

OSIM - Order Simulation

Order-level backtest with fill strategy analysis:

python osim.py \
    --start=20130101 \
    --end=20130630 \
    --fill=vwap \
    --slipbps=0.0001 \
    --fcast=alpha_files

QSIM - Intraday Simulation

30-minute bar simulation for intraday strategies:

python qsim.py \
    --start=20130101 \
    --end=20130630 \
    --fcast=qhl_intra \
    --horizon=3 \
    --mult=1000 \
    --slipbps=0.0001

SSIM - System Simulation

Full lifecycle with position and cash tracking:

python ssim.py \
    --start=20130101 \
    --end=20131231 \
    --fcast=combined_alpha

Strategies

The system includes 20+ alpha strategies in separate files:

Signal Types

Category	Files	Description
PCA	`pca.py`	Market-neutral returns from PCA decomposition
Beta-Adjusted	`bd.py`, `badj_*.py`	Order flow signals adjusted for beta
High-Low	`hl.py`, `qhl_*.py`	Intraday high-low mean reversion
Analyst	`analyst*.py`, `rating_diff.py`	Analyst rating and estimate changes
Momentum	`mom_year.py`	Annual momentum signals
Volatility	`vadj_*.py`	Volume-adjusted position models
Overnight	`c2o.py`	Close-to-open gap trading
Earnings	`eps.py`, `target.py`	Earnings surprises and analyst price targets
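
As an illustration of the PCA row in the table, the sketch below strips the leading principal component from a (days x stocks) return matrix, leaving residual returns with the dominant common move removed. It uses NumPy's SVD and is an assumption-laden example (`pca_residuals` is a hypothetical name), not the code in pca.py.

```python
import numpy as np

def pca_residuals(returns, n_components=1):
    """Remove the top principal components from a (days x stocks) return matrix."""
    demeaned = returns - returns.mean(axis=0)
    # SVD of the demeaned matrix: rows of vt are the principal directions
    u, s, vt = np.linalg.svd(demeaned, full_matrices=False)
    # Reconstruct the part of the returns explained by the top components
    common = np.dot(u[:, :n_components] * s[:n_components], vt[:n_components])
    return demeaned - common

# Synthetic example: every stock moves only with one common "market" factor
market = np.array([0.01, -0.02, 0.015, 0.005, -0.01])
betas = np.array([0.8, 1.0, 1.2])
returns = np.outer(market, betas)
residuals = pca_residuals(returns)
print(np.abs(residuals).max())  # close to zero: nothing is left after one component
```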

Strategy Development Workflow

Develop Alpha: Create new strategy file with alpha calculation
Fit Coefficients: Use regress.py to fit on in-sample data
Generate Forecasts: Apply to out-of-sample period
Optimize: Run through opt.py to get target positions
Backtest: Simulate with appropriate engine (BSIM/OSIM/QSIM/SSIM)
Analyze: Evaluate Sharpe, drawdown, factor exposures
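
The fit-then-forecast steps of this workflow can be sketched with a one-variable weighted least squares regression. The helper names `fit_wls` and `apply_fit`, the toy data, and the choice of weights are all assumptions; regress.py may use a different specification.

```python
def fit_wls(x, y, w):
    """Weighted least squares fit of y = intercept + slope * x."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    cov_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = cov_xy / var_x
    intercept = y_bar - slope * x_bar
    return intercept, slope

def apply_fit(x, intercept, slope):
    """Turn raw alpha values into return forecasts using fitted coefficients."""
    return [intercept + slope * xi for xi in x]

# In-sample: raw alpha values, realized forward returns, observation weights
alpha_in = [1.0, 2.0, 3.0, 4.0]
fwd_ret_in = [2.0, 4.0, 6.0, 8.0]
weights = [1.0, 1.0, 2.0, 2.0]
intercept, slope = fit_wls(alpha_in, fwd_ret_in, weights)

# Out-of-sample: apply the fitted coefficients to new alpha values
forecast = apply_fit([5.0, 6.0], intercept, slope)
print(intercept, slope, forecast)
```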

Simulation Engines

Comparison

Engine	Use Case	Granularity	Execution Model
BSIM	Daily strategies	Daily	Optimized positions
OSIM	Fill analysis	Order-level	VWAP/mid/close fills
QSIM	Intraday strategies	30-min bars	Time-of-day analysis
SSIM	Full system	Daily + intraday	Complete lifecycle

Output Metrics

All engines provide:

P&L: Daily and cumulative
Sharpe Ratio: Risk-adjusted returns
Drawdown: Maximum peak-to-trough decline
Turnover: Average daily trading as a percentage of capital
Factor Exposures: Barra factor bets over time
Execution Quality: Realized vs. estimated costs

Configuration

Universe Parameters

Edit in loaddata.py:

# Tradable universe
t_low_price = 2.0
t_high_price = 500.0
t_min_advp = 1000000.0  # $1M min ADV

# Expandable universe
e_low_price = 2.25
e_high_price = 500.0
e_min_advp = 5000000.0  # $5M min ADV

# Universe size
uni_size = 1400  # Top N by market cap

Optimization Parameters

Edit in opt.py:

max_sumnot = 50.0e6      # $50M max notional
max_posnot = 0.0048      # 0.48% max per position
kappa = 4.3e-5           # Risk aversion

# Slippage model
slip_alpha = 1.0         # Base cost
slip_beta = 0.6          # Participation power
slip_delta = 0.25        # Participation coefficient
slip_nu = 0.14           # Market impact
execFee = 0.00015        # 1.5 bps execution fee

Factor Configuration

Edit in calc.py:

BARRA_FACTORS = ['country', 'growth', 'size', 'sizenl',
                 'divyild', 'btop', 'earnyild', 'beta',
                 'resvol', 'betanl', 'momentum', 'leverage',
                 'liquidty']

PROP_FACTORS = ['srisk_pct_z', 'rating_mean_z']

Project Structure

statarb/
├── README.md                 # This file
├── requirements.txt          # Python dependencies
├── setup.py                  # Cython build configuration
│
├── loaddata.py              # Data loading and preprocessing
├── calc.py                  # Factor calculations
├── regress.py               # Regression analysis
├── pca.py                   # PCA decomposition
├── opt.py                   # Portfolio optimization
├── util.py                  # Utility functions
│
├── bsim.py                  # Daily simulation engine
├── osim.py                  # Order simulation engine
├── qsim.py                  # Intraday simulation engine
├── ssim.py                  # System simulation engine
│
├── bd.py                    # Beta-adjusted order flow
├── hl.py                    # High-low strategy
├── analyst*.py              # Analyst signal strategies
├── rating_diff.py           # Rating change strategy
├── vadj_*.py                # Volume-adjusted strategies
├── mom_year.py              # Momentum strategy
├── eps.py                   # Earnings surprise strategy
├── target.py                # Price target strategy
├── c2o.py                   # Close-to-open strategy
├── ... (additional strategies)
│
└── salamander/              # Standalone module
    ├── instructions.txt     # Salamander usage guide
    ├── requirements.txt     # Salamander dependencies
    ├── gen_dir.py           # Directory structure generator
    ├── gen_hl.py            # Alpha signal generator
    ├── gen_alpha.py         # Alpha file creator
    ├── bsim.py              # Standalone backtest engine
    ├── simulation.py        # Portfolio simulation
    └── ... (supporting files)

Salamander Module

The salamander/ directory contains a standalone, simplified version of the system for easier deployment and development.

Features

Modular directory structure
Simplified alpha generation pipeline
Standalone backtest engine
Documented workflow in instructions.txt

Usage

# 1. Create directory structure
python3 salamander/gen_dir.py --dir=/path/to/data

# 2. Generate alpha signals from raw data
python3 salamander/gen_hl.py \
    --start=20100630 \
    --end=20130630 \
    --dir=/path/to/data

# 3. Create alpha signal files
python3 salamander/gen_alpha.py \
    --start=20100630 \
    --end=20130630 \
    --dir=/path/to/data

# 4. Run backtest
python3 salamander/bsim.py \
    --start=20130101 \
    --end=20130630 \
    --dir=/path/to/data \
    --fcast=hl:1:1

Directory Structure

data/
├── all/          # Alpha signal files
├── hl/           # High-low strategy files
├── locates/      # Short borrow data (borrow.csv)
├── opt/          # Optimization outputs
├── blotter/      # Trade records
├── raw/          # Raw market data
└── all_graphs/   # Visualization outputs

Performance Metrics

Key Metrics

The system evaluates strategies using:

Sharpe Ratio: Risk-adjusted returns (annualized)
Information Ratio: Active return over a benchmark divided by tracking error (the volatility of that active return)
Maximum Drawdown: Largest peak-to-trough decline
Turnover: Average daily trading as % of capital
Hit Rate: Percentage of profitable days
Factor Exposures: Bets on Barra risk factors
Participation Rate: Trading volume vs. ADV
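
A few of these metrics can be computed from a daily P&L series as sketched below. The function names are hypothetical, the Sharpe ratio is computed on P&L with a zero risk-free rate and 252 trading days per year, and drawdown is measured in P&L units from a starting equity of zero; all of these are assumptions of the example rather than the engines' definitions. The code targets Python 3 because it uses the `statistics` module.

```python
import math
import statistics

def sharpe_ratio(daily_pnl, periods_per_year=252):
    """Annualized mean / sample standard deviation of daily P&L."""
    mean = statistics.mean(daily_pnl)
    std = statistics.stdev(daily_pnl)
    return mean / std * math.sqrt(periods_per_year)

def max_drawdown(daily_pnl):
    """Largest peak-to-trough decline of cumulative P&L."""
    peak = 0.0
    cumulative = 0.0
    worst = 0.0
    for pnl in daily_pnl:
        cumulative += pnl
        peak = max(peak, cumulative)
        worst = max(worst, peak - cumulative)
    return worst

def hit_rate(daily_pnl):
    """Fraction of days with strictly positive P&L."""
    return sum(1 for pnl in daily_pnl if pnl > 0) / float(len(daily_pnl))

daily_pnl = [1.0, -2.0, 3.0, -1.0, 2.0]
print(sharpe_ratio(daily_pnl), max_drawdown(daily_pnl), hit_rate(daily_pnl))
```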

Risk Management

Factor Neutrality: Limits on Barra factor exposures
Sector Limits: Industry concentration constraints
Position Sizing: Market cap and liquidity-based limits
Participation Constraints: Max 1.5% of ADV to minimize impact
Correlation Monitoring: Rolling 30-day cross-security correlations
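
The participation constraint can be pictured as a simple cap on each trade. In the sketch below, `cap_trade` is a hypothetical helper that limits a signed dollar trade to 1.5% of a stock's average daily dollar volume; in the actual system this limit is described as an optimizer constraint, not a post-hoc clip.

```python
def cap_trade(desired_trade, advp, max_participation=0.015):
    """Limit a signed dollar trade to max_participation * average daily dollar volume."""
    limit = max_participation * advp
    return max(-limit, min(limit, desired_trade))

print(cap_trade(100000.0, 1e6))   # buy order cut back to the participation limit
print(cap_trade(-100000.0, 1e6))  # sell order cut back symmetrically
print(cap_trade(5000.0, 1e6))     # small order passes through unchanged
```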

Advanced Topics

Custom Alpha Development

To create a new alpha signal:

Create a new Python file (e.g., my_alpha.py)
Load data using loaddata.py functions
Calculate your alpha signal
Use regress.py to fit coefficients on training data
Generate out-of-sample forecasts
Save to HDF5 or CSV for simulation engines

Example structure:

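from loaddata import *
from calc import *
from regress import *

# start, end, and lookback must be defined before this point
# (the date range and lookback window to load).

# Load data
daily_df = load_prices(start, end, lookback)
barra_df = load_barra(start, end, lookback)  # Barra factors, if the signal needs them

# Calculate alpha (calculate_my_signal stands in for your own function)
daily_df['my_alpha'] = calculate_my_signal(daily_df)

# Fit regression at a 3-day forecast horizon
fits_df = regress_alpha(daily_df, 'my_alpha', horizon=3)

# Generate forecast
forecast_df = apply_coefficients(daily_df, fits_df)

# Save results
dump_alpha(forecast_df, 'my_alpha')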

Multi-Factor Combination

Combine multiple alphas with optimized weights:

python bsim.py \
    --start=20130101 \
    --end=20130630 \
    --fcast=pca:1.0:0.3,hl:1.2:0.25,bd:0.8:0.2,analyst:1.5:0.15,mom:1.0:0.1

Weights should sum to 1.0 for proper risk attribution.
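
One way to picture the combination is a weighted sum of scaled signals, with a check that the weights sum to 1.0. This is an assumption about how multipliers and weights interact (each signal contributes multiplier * weight * value); the function `combine_forecasts` is hypothetical and bsim.py may combine forecasts differently.

```python
def combine_forecasts(signals, specs, tol=1e-9):
    """Combine per-stock signals.

    signals: {name: [per-stock values]}
    specs:   [(name, multiplier, weight), ...]
    """
    total_weight = sum(weight for _, _, weight in specs)
    if abs(total_weight - 1.0) > tol:
        raise ValueError("weights sum to %.4f, expected 1.0" % total_weight)
    n_stocks = len(signals[specs[0][0]])
    combined = [0.0] * n_stocks
    for name, mult, weight in specs:
        for i, value in enumerate(signals[name]):
            # Assumed rule: scale by the multiplier, then blend by the weight
            combined[i] += mult * weight * value
    return combined

specs = [("hl", 1.0, 0.6), ("bd", 0.8, 0.4)]
signals = {"hl": [0.01, -0.02], "bd": [0.005, 0.01]}
combined = combine_forecasts(signals, specs)
print(combined)
```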

Transaction Cost Analysis

The system models realistic costs:

Execution Fees: 1.5 bps fixed
Slippage: Nonlinear function of participation rate
Market Impact: Based on order size vs. ADV
Borrow Costs: For short positions
Opportunity Cost: From delayed fills

Analyze realized vs. estimated costs using the OSIM engine.

Contributing

This is a research codebase. Key areas for improvement:

Python 3 migration
Additional alpha strategies
Enhanced optimization algorithms
Real-time data integration
Machine learning alpha generation
Improved execution modeling

License

Apache 2.0

Contact

For questions and support, please open an issue on GitHub.

Disclaimer: This system is for research and educational purposes. Use at your own risk. Past performance does not guarantee future results. Trading involves substantial risk of loss.
