MicroAlpha Engine

A C++/Python Microstructure Research Engine

MicroAlpha Engine is a reproducible research project for event-driven limit order book prediction built to showcase:

clean microstructure reasoning
C++ feature engineering with Python bindings
modular quant research pipeline design
reproducibility, diagnostics, and artifact-driven experimentation

The goal is not to build a production trading system. The goal is to demonstrate the ability to design and implement a serious research-engineering workflow for high-frequency market data.

TL;DR

This project studies short-horizon midprice direction prediction from LOBSTER order book data using a clean event-driven pipeline.

Main takeaways:

Pooling multiple tickers improves performance, but not uniformly.
Predictability is highly heterogeneous across names.
INTC and MSFT are much easier to predict than AAPL, AMZN, and GOOG under the tested setup.
Label design matters a lot: dropping zero-return events materially changes the effective learning problem.
The strongest predictive features are intuitive microstructure signals:
- best-level queue imbalance
- microprice deviation
- short-horizon accumulated OFI
All features are computed in C++, exposed to Python, and used in a reproducible research pipeline.

A preserved example run is available under:

docs/example_run/2026-04-07_161228_h500_direction_pooled_5t/

This reference run contains:

config snapshot
metrics
per-ticker summaries
diagnostics
figures
CSV tables
run log

Serialized model binaries (.joblib) are excluded from the saved run to keep the repository lightweight.

What this repo contains

Reproducible experiment pipeline for event-driven LOB prediction
C++ feature engine exposed to Python through pybind11
Config-driven experiments with per-run artifact generation
Model comparison between:
- Logistic Regression
- HistGradientBoostingClassifier
Diagnostics and interpretability:
- feature summaries
- per-ticker pooled evaluation
- logistic coefficients
- permutation importance
Dockerized execution
Automated tests
Saved example run artifacts for easy inspection without rerunning the full pipeline

Research question

Can short-horizon direction be predicted from order book state and recent event flow, and how does predictability change when:

training on a single ticker vs pooling multiple tickers
conditioning on price movement vs keeping zero-return events
comparing linear vs nonlinear models

This repo focuses on research quality and engineering quality, not on execution, alpha monetization, or live deployment.

Main findings

1. Pooling helps, but not uniformly

Pooling 5 tickers improved the pooled headline AUC, but per-ticker analysis showed that:

weak names stayed weak or improved only modestly
strong names stayed strong
pooled performance is partly driven by cross-sectional heterogeneity, not just universal transfer

2. Label construction changes the problem

Using binary_drop_ties solves:

direction prediction conditional on price moving

This materially increases apparent predictability for high-tie names like INTC and MSFT.

Using binary_keep_ties_as_zero lowers performance, but does not eliminate the effect, which suggests that the phenomenon is real and not purely a filtering artifact.

3. Core signal is concentrated in a few sensible features

Across coefficient analysis and permutation importance, the main signal comes from:

queue_imbalance_best
microprice_deviation
ofi_roll_sum_50

The nonlinear tree model extracts additional value from:

depth imbalance
short-term volatility
some longer normalized OFI windows

4. Predictability is cross-sectionally heterogeneous

INTC and MSFT are structurally easier than the other names in this sample and at this horizon. The project carefully narrows down what this does not come from:

not obvious leakage
not purely movement prediction
not purely pooling
not purely tie filtering

The exact structural cause remains open, but the effect is documented and characterized carefully.

Pipeline at a glance

Load LOBSTER message/order book data
Compute all features in C++
Create labels from forward midprice delta
Align features and labels
Split each ticker in time
Pool train/test sets across tickers if requested
Train models
Evaluate
Save artifacts, figures, tables, diagnostics, and logs

Main entrypoint:

python -m scripts.run_experiment

Feature set

All features are computed in C++.

Core order-book features

ofi_best
ofi_best_norm
queue_imbalance_best
depth_imbalance_3
depth_imbalance_5
depth_imbalance_10
spread
microprice_deviation

Temporal features

ofi_roll_sum_50
ofi_best_norm_roll_sum_10
ofi_best_norm_roll_sum_50
ofi_best_norm_roll_sum_100
midprice_vol_50
event_intensity_1s

Labels

The project supports multiple binary task formulations.

Direction conditional on movement

task.name = direction
label_mode = binary_drop_ties

Direction with ties kept as class 0

task.name = direction 
label_mode = binary_keep_ties_as_zero

Movement prediction

task.name = movement 
label_mode = binary

Main results in the saved reference run use:

task.name = direction
label_mode = binary_drop_ties
horizon = 500

Models

1. Logistic Regression

standardized
interpretable baseline
coefficients saved to artifacts

2. HistGradientBoostingClassifier

nonlinear tabular model
better at exploiting interactions and nonlinear state structure
permutation importance saved to artifacts

Quickstart

Local C++ build prerequisites

Building the _cpp extension locally requires:

CMake >= 3.18
Python >= 3.12
pybind11 installed in the active Python environment
a C++20-capable compiler
- Windows: Visual Studio Build Tools / MSVC
- Linux: g++
- macOS: Apple Clang / Xcode command line tools

If local native build setup is inconvenient, use the Docker workflow instead.

1. Create environment

python -m venv .venv

Activate it:

Linux/macOS:

source .venv/bin/activate

Windows:

.venv\Scripts\activate

2. Install

pip install --upgrade pip
pip install -e .
pip install -e ".[dev]"

3. Build the C++ extension

Windows PowerShell:

.\scripts\build_cpp.ps1

Linux/macOS

cmake -S cpp -B cpp/build \
  -DPYBIND11_FINDPYTHON=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -Dpybind11_DIR="$(python -m pybind11 --cmakedir)"
cmake --build cpp/build --config Release

4. Put LOBSTER data in place

The config expects CSV files under data/raw/... as referenced in:

config/experiment.yaml

5. Run tests

pytest -q

6. Run experiment

python -m scripts.run_experiment

Artifacts are written under:

artifacts/<run_id>/

Docker

Build:

docker build -t microalpha .

Test:

docker run --rm microalpha pytest -q

Run:

docker run --rm \
  --mount type=bind,source="$(pwd)/data",target=/app/data \
  --mount type=bind,source="$(pwd)/artifacts",target=/app/artifacts \
  microalpha

Windows PowerShell

docker run --rm `
  --mount "type=bind,source=${PWD}\data,target=/app/data" `
  --mount "type=bind,source=${PWD}\artifacts,target=/app/artifacts" `
  microalpha

Notes:

data/ is mounted so the container can read local LOBSTER data
artifacts/ is mounted so outputs persist on the host
code and config are copied into the image at build time for reproducibility

Tests

The test suite covers the core invariants of the project:

config loading smoke test
LOBSTER loader smoke test
label alignment correctness
pooled split / per-ticker test segment consistency
feature matrix shape / feature-name consistency
_cpp extension import smoke test

Run:

pytest -q

Example run artifacts

A preserved example run is stored in:

docs/example_run/2026-04-07_161228_h500_direction_pooled_5t/

This is included so readers can inspect:

what a completed run folder looks like
what files are generated
how metrics / diagnostics / tables / figures are organized

This saved run excludes model .joblib files to keep the repository lightweight.

Reproducibility

The pipeline is designed around:

config-driven experiments
per-run artifact folders
saved config snapshot
logs
deterministic train/test splitting by ticker
explicit pooled test reconstruction for per-ticker pooled evaluation

Each run saves:

config.json
metrics.json
split_summary.json
ticker_summaries.json
ticker_feature_diagnostics.json
pooled_ticker_metrics.json
permutation importance JSON/CSV
plots
run log

Repo structure

microalpha-engine/
├── artifacts/                # GENERATED: per-run artifacts
├── config/
│   └── experiment.yaml       # main experiment configuration
├── cpp/                      # C++ feature engine + bindings build
│   ├── include/
│   │   ├── microalpha/
│   │   │   └── features.hpp
│   ├── src/
│   │   ├── bindings.cpp
│   │   └── features.cpp
│   └── CMakeLists.txt
├── data/                     # local input data (not distributed)
├── docs/
│   └── example_run/            # preserved example run artifacts
├── microalpha/               # Python package
│   ├── config.py
│   ├── diagnostics.py
│   ├── evaluation.py
│   ├── features.py
│   ├── io.py
│   ├── labels.py
│   ├── models.py
│   ├── pipeline.py
│   └── utils.py
├── scripts/
│   ├── build_cpp.ps1         # script to build C++ module
│   └── run_experiment.py     # main orchestrator
├── tests/                    # automated tests
├── Dockerfile
├── LICENSE
├── pyproject.toml
└── README.md

Data note

This repository does not include raw LOBSTER market data.
This project was built using free LOBSTER sample data for AAPL, AMZN, GOOG, INTC, and MSFT at 10 depth levels.

To run the full pipeline, you must supply your own data in the expected directory structure under data/raw/....

Code licensing and data licensing are separate issues.
This repo distributes the code and example artifacts, not the raw proprietary dataset.

License

This project is licensed under the BSD 3-Clause License.
See the LICENSE file for details.

Contact

Gautier Petit

GitHub: gautierpetit
LinkedIn: gautierpetitch

Name		Name	Last commit message	Last commit date
Latest commit History 54 Commits
config		config
cpp		cpp
docs/example_run/2026-04-07_161228_h500_direction_pooled_5t		docs/example_run/2026-04-07_161228_h500_direction_pooled_5t
microalpha		microalpha
scripts		scripts
tests		tests
.dockerignore		.dockerignore
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

MicroAlpha Engine

A C++/Python Microstructure Research Engine

TL;DR

What this repo contains

Research question

Main findings

1. Pooling helps, but not uniformly

2. Label construction changes the problem

3. Core signal is concentrated in a few sensible features

4. Predictability is cross-sectionally heterogeneous

Pipeline at a glance

Feature set

Core order-book features

Temporal features

Labels

Direction conditional on movement

Direction with ties kept as class 0

Movement prediction

Main results in the saved reference run use:

Models

1. Logistic Regression

2. HistGradientBoostingClassifier

Quickstart

Local C++ build prerequisites

1. Create environment

Activate it:

2. Install

3. Build the C++ extension

4. Put LOBSTER data in place

5. Run tests

6. Run experiment

Docker

Tests

Example run artifacts

Reproducibility

Repo structure

Data note

License

Contact

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages