MicroAlpha Engine is a reproducible research project for event-driven limit order book prediction built to showcase:
- clean microstructure reasoning
- C++ feature engineering with Python bindings
- modular quant research pipeline design
- reproducibility, diagnostics, and artifact-driven experimentation
The goal is not to build a production trading system. The goal is to demonstrate the ability to design and implement a serious research-engineering workflow for high-frequency market data.
This project studies short-horizon midprice direction prediction from LOBSTER order book data using a clean event-driven pipeline.
Main takeaways:
- Pooling multiple tickers improves performance, but not uniformly.
- Predictability is highly heterogeneous across names.
- INTC and MSFT are much easier to predict than AAPL, AMZN, and GOOG under the tested setup.
- Label design matters a lot: dropping zero-return events materially changes the effective learning problem.
- The strongest predictive features are intuitive microstructure signals:
  - best-level queue imbalance
  - microprice deviation
  - short-horizon accumulated OFI
- All features are computed in C++, exposed to Python, and used in a reproducible research pipeline.
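As a rough illustration of the first two signals, both can be computed from best-level book state. This is a numpy sketch of the concepts, not the project's C++ implementation; the array values are hypothetical:

```python
import numpy as np

# Hypothetical best-level book snapshots (one row per event).
bid_px = np.array([100.0, 100.0, 99.0])
ask_px = np.array([101.0, 102.0, 100.0])
bid_sz = np.array([300.0, 500.0, 200.0])
ask_sz = np.array([100.0, 500.0, 600.0])

mid = (bid_px + ask_px) / 2.0

# Best-level queue imbalance in [-1, 1]: positive when the bid queue dominates.
queue_imbalance_best = (bid_sz - ask_sz) / (bid_sz + ask_sz)

# Microprice: size-weighted midpoint; its deviation from the midprice
# leans toward the side with the thinner queue.
microprice = (ask_sz * bid_px + bid_sz * ask_px) / (bid_sz + ask_sz)
microprice_deviation = microprice - mid
```

In the first row the bid queue is three times the ask queue, so both signals point up.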
A preserved example run is available under:
docs/example_run/2026-04-07_161228_h500_direction_pooled_5t/
This reference run contains:
- config snapshot
- metrics
- per-ticker summaries
- diagnostics
- figures
- CSV tables
- run log
Serialized model binaries (.joblib) are excluded from the saved run to keep the repository lightweight.
- Reproducible experiment pipeline for event-driven LOB prediction
- C++ feature engine exposed to Python through pybind11
- Config-driven experiments with per-run artifact generation
- Model comparison between:
  - Logistic Regression
  - HistGradientBoostingClassifier
- Diagnostics and interpretability:
  - feature summaries
  - per-ticker pooled evaluation
  - logistic coefficients
  - permutation importance
- Dockerized execution
- Automated tests
- Saved example run artifacts for easy inspection without rerunning the full pipeline
Can short-horizon direction be predicted from order book state and recent event flow, and how does predictability change when:
- training on a single ticker vs pooling multiple tickers
- conditioning on price movement vs keeping zero-return events
- comparing linear vs nonlinear models
This repo focuses on research quality and engineering quality, not on execution, alpha monetization, or live deployment.
Pooling 5 tickers improved the pooled headline AUC, but per-ticker analysis showed that:
- weak names stayed weak or improved only modestly
- strong names stayed strong
- pooled performance is partly driven by cross-sectional heterogeneity, not just universal transfer
Using binary_drop_ties frames the task as direction prediction conditional on the price moving.
This materially increases apparent predictability for high-tie names like INTC and MSFT.
Using binary_keep_ties_as_zero lowers performance, but does not eliminate the effect, which suggests that the phenomenon is real and not purely a filtering artifact.
Across coefficient analysis and permutation importance, the main signal comes from:
- queue_imbalance_best
- microprice_deviation
- ofi_roll_sum_50
The nonlinear tree model extracts additional value from:
- depth imbalance
- short-term volatility
- some longer normalized OFI windows
INTC and MSFT are structurally easier to predict than the other names in this sample and at this horizon. The project carefully rules out several candidate explanations:
- obvious leakage
- pure movement prediction
- pooling alone
- tie filtering alone
The exact structural cause remains open, but the effect is documented and characterized carefully.
- Load LOBSTER message/order book data
- Compute all features in C++
- Create labels from forward midprice delta
- Align features and labels
- Split each ticker in time
- Pool train/test sets across tickers if requested
- Train models
- Evaluate
- Save artifacts, figures, tables, diagnostics, and logs
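The split-then-pool steps above can be sketched as follows. This is a toy illustration with hypothetical data, not the config-driven pipeline itself:

```python
import numpy as np

def time_split(n_events, train_frac=0.7):
    """Chronological split: the first train_frac of events train, the rest test."""
    cut = int(n_events * train_frac)
    return np.arange(cut), np.arange(cut, n_events)

# Hypothetical per-ticker feature matrices (rows are events, in time order).
per_ticker = {
    "INTC": np.random.default_rng(0).normal(size=(100, 3)),
    "MSFT": np.random.default_rng(1).normal(size=(80, 3)),
}

train_parts, test_parts = [], []
for ticker, X in per_ticker.items():
    tr, te = time_split(len(X))   # split each ticker in time, never across time
    train_parts.append(X[tr])
    test_parts.append(X[te])

# Pool train/test sets across tickers (concatenate along the event axis).
X_train = np.vstack(train_parts)
X_test = np.vstack(test_parts)
```

Splitting each ticker chronologically before pooling keeps the test segments strictly out-of-time for every name.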
Main entrypoint:
python -m scripts.run_experiment
All features are computed in C++.
Book-state features:
- ofi_best
- ofi_best_norm
- queue_imbalance_best
- depth_imbalance_3
- depth_imbalance_5
- depth_imbalance_10
- spread
- microprice_deviation
Rolling / flow features:
- ofi_roll_sum_50
- ofi_best_norm_roll_sum_10
- ofi_best_norm_roll_sum_50
- ofi_best_norm_roll_sum_100
- midprice_vol_50
- event_intensity_1s
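The rolling-window features can be illustrated with a cumulative-sum trick. This is a numpy sketch of the idea only; the actual computation lives in the C++ engine:

```python
import numpy as np

def roll_sum(x, window):
    """Trailing rolling sum over the last `window` events (inclusive)."""
    c = np.cumsum(np.insert(x, 0, 0.0))
    out = np.empty(len(x), dtype=float)
    for i in range(len(x)):
        lo = max(0, i + 1 - window)
        out[i] = c[i + 1] - c[lo]   # sum of x[lo .. i] in O(1) per event
    return out

# Hypothetical per-event order flow imbalance at the best level.
ofi_best = np.array([1.0, -2.0, 3.0, 0.5])
ofi_roll_sum_2 = roll_sum(ofi_best, 2)
```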
The project supports multiple binary task formulations.

Direction, ties dropped:
task.name = direction
label_mode = binary_drop_ties

Direction, ties kept as zero:
task.name = direction
label_mode = binary_keep_ties_as_zero

Movement:
task.name = movement
label_mode = binary

Headline configuration:
task.name = direction
label_mode = binary_drop_ties
horizon = 500
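The two direction label modes can be illustrated on a toy forward midprice delta series (a sketch; the variable names mirror the config values, not project code):

```python
import numpy as np

# Hypothetical forward midprice deltas at the chosen horizon.
delta = np.array([0.5, 0.0, -1.0, 0.0, 2.0])

# binary_drop_ties: keep only events where the price moved,
# then label up-moves 1 and down-moves 0.
moved = delta != 0.0
y_drop_ties = (delta[moved] > 0).astype(int)

# binary_keep_ties_as_zero: keep every event; ties count as "not up".
y_keep_ties = (delta > 0).astype(int)
```

Note that the two modes differ in both the label values and the set of events retained, which is why they define materially different learning problems.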
Logistic Regression:
- standardized inputs
- interpretable baseline
- coefficients saved to artifacts

HistGradientBoostingClassifier:
- nonlinear tabular model
- better at exploiting interactions and nonlinear state structure
- permutation importance saved to artifacts
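Permutation importance itself is simple to sketch in numpy: shuffle one feature column at a time and measure the score drop. This is a generic illustration with a toy scorer, not the project's scikit-learn-based implementation:

```python
import numpy as np

def permutation_importance(score_fn, X, y, n_repeats=5, seed=0):
    """Mean drop in score when each column is shuffled; larger drop = more important."""
    rng = np.random.default_rng(seed)
    base = score_fn(X, y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])        # break the feature/label link for column j
            drops.append(base - score_fn(Xp, y))
        importances[j] = np.mean(drops)
    return importances

# Toy setup: labels depend only on column 0, so column 0 should score highest.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)
acc = lambda X_, y_: np.mean((X_[:, 0] > 0).astype(int) == y_)  # "model" uses col 0
imp = permutation_importance(acc, X, y)
```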
Building the _cpp extension locally requires:
- CMake >= 3.18
- Python >= 3.12
- pybind11 installed in the active Python environment
- a C++20-capable compiler:
  - Windows: Visual Studio Build Tools / MSVC
  - Linux: g++
  - macOS: Apple Clang / Xcode command line tools
If local native build setup is inconvenient, use the Docker workflow instead.
python -m venv .venv
Linux/macOS:
source .venv/bin/activate
Windows:
.venv\Scripts\activate
pip install --upgrade pip
pip install -e .
pip install -e ".[dev]"
Windows PowerShell:
.\scripts\build_cpp.ps1
Linux/macOS:
cmake -S cpp -B cpp/build \
-DPYBIND11_FINDPYTHON=ON \
-DCMAKE_BUILD_TYPE=Release \
-Dpybind11_DIR="$(python -m pybind11 --cmakedir)"
cmake --build cpp/build --config Release
The config expects CSV files under data/raw/... as referenced in:
config/experiment.yaml
pytest -q
python -m scripts.run_experiment
Artifacts are written under:
artifacts/<run_id>/
Build:
docker build -t microalpha .
Test:
docker run --rm microalpha pytest -q
Run:
docker run --rm \
--mount type=bind,source="$(pwd)/data",target=/app/data \
--mount type=bind,source="$(pwd)/artifacts",target=/app/artifacts \
microalpha
Windows PowerShell
docker run --rm `
--mount "type=bind,source=${PWD}\data,target=/app/data" `
--mount "type=bind,source=${PWD}\artifacts,target=/app/artifacts" `
microalpha
Notes:
- data/ is mounted so the container can read local LOBSTER data
- artifacts/ is mounted so outputs persist on the host
- code and config are copied into the image at build time for reproducibility
The test suite covers the core invariants of the project:
- config loading smoke test
- LOBSTER loader smoke test
- label alignment correctness
- pooled split / per-ticker test segment consistency
- feature matrix shape / feature-name consistency
- _cpp extension import smoke test
Run:
pytest -q
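A label-alignment test in this spirit might look like the following. This is a hypothetical illustration of the invariant, not a test copied from the suite (the function name is made up):

```python
import numpy as np

def make_direction_labels(mid, horizon):
    """Forward midprice delta labels; the last `horizon` events have no label (NaN)."""
    delta = np.full(len(mid), np.nan)
    delta[:-horizon] = mid[horizon:] - mid[:-horizon]
    return delta

def test_label_alignment():
    mid = np.array([10.0, 10.5, 10.0, 11.0, 11.5])
    delta = make_direction_labels(mid, horizon=2)
    # Event 0 looks exactly 2 events ahead: 10.0 - 10.0 = 0.0
    assert delta[0] == 0.0
    # The final `horizon` events have no forward label.
    assert np.isnan(delta[-2:]).all()

test_label_alignment()
```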
A preserved example run is stored in:
docs/example_run/2026-04-07_161228_h500_direction_pooled_5t/
This is included so readers can inspect:
- what a completed run folder looks like
- what files are generated
- how metrics / diagnostics / tables / figures are organized
This saved run excludes model .joblib files to keep the repository lightweight.
The pipeline is designed around:
- config-driven experiments
- per-run artifact folders
- saved config snapshot
- logs
- deterministic train/test splitting by ticker
- explicit pooled test reconstruction for per-ticker pooled evaluation
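The per-ticker pooled evaluation can be sketched as a group-by over ticker ids kept alongside the pooled test set. The arrays here are hypothetical:

```python
import numpy as np

# Hypothetical pooled test set with ticker ids preserved per event.
tickers = np.array(["INTC", "INTC", "MSFT", "AAPL", "MSFT"])
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0])

# Reconstruct per-ticker accuracy from the single pooled prediction vector.
per_ticker_acc = {
    t: float(np.mean(y_pred[tickers == t] == y_true[tickers == t]))
    for t in np.unique(tickers)
}
```

Keeping the ticker id per event is what makes it possible to attribute pooled performance back to individual names.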
Each run saves:
- config.json
- metrics.json
- split_summary.json
- ticker_summaries.json
- ticker_feature_diagnostics.json
- pooled_ticker_metrics.json
- permutation importance JSON/CSV
- plots
- run log
microalpha-engine/
├── artifacts/ # GENERATED: per-run artifacts
├── config/
│ └── experiment.yaml # main experiment configuration
├── cpp/ # C++ feature engine + bindings build
│ ├── include/
│ │ └── microalpha/
│ │ │ └── features.hpp
│ ├── src/
│ │ ├── bindings.cpp
│ │ └── features.cpp
│ └── CMakeLists.txt
├── data/ # local input data (not distributed)
├── docs/
│ └── example_run/ # preserved example run artifacts
├── microalpha/ # Python package
│ ├── config.py
│ ├── diagnostics.py
│ ├── evaluation.py
│ ├── features.py
│ ├── io.py
│ ├── labels.py
│ ├── models.py
│ ├── pipeline.py
│ └── utils.py
├── scripts/
│ ├── build_cpp.ps1 # script to build C++ module
│ └── run_experiment.py # main orchestrator
├── tests/ # automated tests
├── Dockerfile
├── LICENSE
├── pyproject.toml
└── README.md
This repository does not include raw LOBSTER market data.
This project was built using free LOBSTER sample data for AAPL, AMZN, GOOG, INTC, and MSFT at 10 depth levels.
To run the full pipeline, you must supply your own data in the expected directory structure under data/raw/....
Code licensing and data licensing are separate issues.
This repo distributes the code and example artifacts, not the raw proprietary dataset.
This project is licensed under the BSD 3-Clause License.
See the LICENSE file for details.
Gautier Petit