This package provides an implementation of the Orbformer wave function foundation model.
We also provide the following:
- the Neural Electron Real-space Density model, which can be trained from Orbformer checkpoints
- the Light Atom Curriculum (LAC) dataset and other datasets referenced in the Orbformer paper
- notebooks to reproduce figures from the paper
You will need compute infrastructure on which JAX can be installed. In our experiments, we used an Nvidia A100 GPU running on Linux with CUDA version 12. Some features of this repository require an Nvidia GPU with compute capability 8.0 or later.
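A quick way to check what JAX can see is a snippet along these lines (illustrative only, not part of this repository; the `compute_capability` attribute is only present on recent JAX GPU device objects):

```python
# Illustrative check: list the accelerator devices visible to JAX and, where
# available, their compute capability (8.0 or later is needed for some features).
import jax

for device in jax.devices():
    capability = getattr(device, "compute_capability", "unknown")
    print(device.platform, device.device_kind, capability)
```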
- Clone this repository

  ```bash
  git clone https://github.com/microsoft/oneqmc.git
  cd ./oneqmc
  ```

- Install and activate the provided `conda` environment, which automatically installs JAX

  ```bash
  conda env create -f environment-os.yaml -n oneqmc
  conda activate oneqmc
  ```

- Append the location of the `oneqmc` source files to your `PYTHONPATH`

  ```bash
  export PYTHONPATH=$PYTHONPATH:$PWD/src
  ```

- Test your set-up by running a simple single-point fine-tuning using the LAC checkpoint

  ```bash
  python scripts/transferable.py -d CH4 -c checkpoints/lac/lac.chkpt --discard-sampler-state --max-eq-steps 100
  ```

The main entrypoint for running wave function training and fine-tuning is `transferable.py`.
Many datasets that are mentioned in the paper are provided in the data directory.
The TinyMol dataset can be downloaded from the `deeperwin` repository and processed into our format by running

```bash
python scripts/download_tinymol_dataset.py
```

To fine-tune from the LAC checkpoint, use

```bash
python scripts/transferable.py -d <subdirectory of ./data> -c checkpoints/lac/lac.chkpt --discard-sampler-state -n <number of training steps> -w <output directory>
```

and to train a model from scratch, use

```bash
python scripts/transferable.py -d <subdirectory of ./data> -n <number of training steps> -w <output directory>
```

We recommend using distinct output directories for every training run.
For information on other optional arguments, run `python scripts/transferable.py -h`.
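If it is unclear which values can be passed to `-d`, a small helper along these lines (purely illustrative, not part of the repository) lists the candidate subdirectories of `./data`:

```python
# Illustrative helper: list subdirectories of ./data that could be passed to -d.
from pathlib import Path

data_root = Path("./data")
candidates = sorted(
    str(p.relative_to(data_root)) for p in data_root.rglob("*") if p.is_dir()
)
print("\n".join(candidates))
```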
To create a new dataset for fine-tuning, we recommend using the `qcelemental` format. Given a Python dictionary of the form

```python
structures: dict[str, qcel.models.Molecule] = {name: mol}
```

you can save it in a format that is readable by `oneqmc` via

```python
structures_flat = {k: v.dict(encoding="json") for k, v in structures.items()}
with open("./data/<dataset name>/structures.json", "w") as f_out:
    json.dump(structures_flat, f_out)
```

We also support nested directories as datasets, and dataset names containing `/`.
An alternative .yaml file format is also supported for datasets.
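As a minimal sketch of the JSON workflow above (the dataset name, molecule name, and water geometry below are illustrative, not taken from this repository), a small dataset might be created as follows:

```python
import json
from pathlib import Path

import qcelemental as qcel

# Build a dictionary of structures; the geometry here is purely illustrative.
structures = {
    "water": qcel.models.Molecule.from_data(
        """
        O  0.000000  0.000000  0.117790
        H  0.000000  0.755453 -0.471161
        H  0.000000 -0.755453 -0.471161
        """
    ),
}

# Serialize in the JSON layout described above.
dataset_dir = Path("./data/my_dataset")
dataset_dir.mkdir(parents=True, exist_ok=True)
structures_flat = {k: v.dict(encoding="json") for k, v in structures.items()}
with open(dataset_dir / "structures.json", "w") as f_out:
    json.dump(structures_flat, f_out)
```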
We recommend evaluating energies with a fresh evaluation run and computing a robust mean of the evaluation energies.
A fresh evaluation run from a checkpoint can be launched using the --test argument, e.g.
```bash
python scripts/transferable.py -d <subdirectory of ./data> -n <number of test steps> -c <chkpt> --discard-sampler-state -w <different output directory> --test
```

This will run MCMC and energy evaluation without updating the model parameters.
We recommend using separate -w output directories for training and evaluation.
The saved energies from the evaluation can be accessed via the h5 file

```python
from oneqmc.analysis import read_result
from oneqmc.analysis.energy import robust_mean

# If evaluating on a single GPU, this will have shape [num_mols, num_steps]
energy = read_result(
    test_output_directory,
    keys=["E_loc/mean_elec"],
    subdir="training",
)
# Robust mean of energies for structure 0
rmean_0 = robust_mean(energy[0, :]).squeeze()
```

Numerous other metrics are logged during training and evaluation, and can be accessed in an analogous way.
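As an illustrative extension of the energy snippet above (assuming `energy` has the `[num_mols, num_steps]` shape shown), a robust mean for every structure could be computed like this:

```python
import numpy as np

# Illustrative only: one robust mean per structure in the evaluated dataset.
robust_means = np.array(
    [robust_mean(energy[i, :]).squeeze() for i in range(energy.shape[0])]
)
print(robust_means)
```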
A tensorboard log file is also generated automatically during training.
To store less output, you can pass the arguments --metric-logger-period <int> and --metric-logger <subset of {h5, tb}, possibly empty>.
Checkpoints are automatically generated during any training or fine-tuning run.
If you run the same training command with the same output directory passed to -w, the script will automatically resume from the last checkpoint. This behaviour can be switched off using --no-autoresume. To resume from some other checkpoint, pass the -c argument without using --discard-sampler-state, and this will restore the sampler and optimizer states as well as the model parameters.
To store fewer checkpoints, you can pass the arguments --chkpts-fast-interval and --chkpts-slow-interval.
It is not possible to fine-tune from the LAC checkpoint on a molecule that contains a nucleus heavier than fluorine. For such molecules, we suggest training from scratch or pretraining your own model.
For LAC pretraining Phase 2, we used the following settings, running on 16 GPUs

```bash
python scripts/transferable.py -d lightatomcurriculum/level2 --data-augmentation rotation fuzz --electron-batch-size 1024 --mol-batch-size 16 -n 400000 --max-restarts 200 --multi-system-sampler double-langevin --repeated-sampling-len 40 --max-eq-steps 300 --metric-logger-period 25 -c <phase 1b checkpoint> --discard-sampler-state -w <output directory>
```

Given an alternative pretraining dataset, you can launch a pretraining run by adapting this command. Omit `-c <phase 1b checkpoint> --discard-sampler-state` to begin pretraining from a randomly initialized network.
There are several strategies that can be employed for very large molecules that would naively cause OOM errors.
- Reduce `--electron-batch-size` and set `--mol-batch-size` equal to the number of GPUs.
- Passing `--local-energy-chunk-size` will cause the local energy to be evaluated sequentially in blocks of the given size. For example, running with `--electron-batch-size 512 --local-energy-chunk-size 128` will compute local energies in 4 passes (see the sketch after this list).
- To fine-tune on a single, large molecule using multiple GPUs, use the flag `--repeat-single-mol` and pass a dataset of length 1.
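A small illustrative calculation (not repository code) of how the chunk size maps to the number of local-energy passes in the second bullet above:

```python
import math

# Values mirror the example above: 512 electron samples split into chunks of 128.
electron_batch_size = 512
local_energy_chunk_size = 128

num_passes = math.ceil(electron_batch_size / local_energy_chunk_size)
print(num_passes)  # 4
```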
The notebooks/paper directory contains notebooks that reproduce the plots shown in the paper.
You can extract the electron density from an Orbformer checkpoint by fitting a new model
```bash
python scripts/density.py -d <dataset of size 1> -c <trained orbformer checkpoint> -w <output directory>
```

Note that density models are currently only supported for single molecules, so the dataset should be of size 1. You can filter out individual molecules from a larger dataset using --data-file-whitelist and --data-json-whitelist.
You can run python scripts/density.py -h for information on other arguments that are accepted by this script.
The training code will automatically evaluate the density on Lebedev-Laikov grids of various sizes during training.
If you use this repository, please cite our work. The Orbformer model, checkpoints, and training scheme:
```bibtex
@article{foster2025ab,
  title={An ab initio foundation model of wavefunctions that accurately describes chemical bond breaking},
  author={Adam Foster and Zeno Sch{\"a}tzle and P. Bern{\'a}t Szab{\'o} and Lixue Cheng and Jonas K{\"o}hler and Gino Cassella and Nicholas Gao and Jiawei Li and Frank No{\'e} and Jan Hermann},
  journal={arXiv preprint arXiv:2506.19960},
  year={2025}
}
```

Electron density extraction:

```bibtex
@article{cheng2025highly,
  title={Highly accurate real-space electron densities with neural networks},
  author={Cheng, Lixue and Szab{\'o}, P Bern{\'a}t and Sch{\"a}tzle, Zeno and Kooi, Derk P and K{\"o}hler, Jonas and Giesbertz, Klaas JH and No{\'e}, Frank and Hermann, Jan and Gori-Giorgi, Paola and Foster, Adam},
  journal={The Journal of Chemical Physics},
  volume={162},
  number={3},
  year={2025},
  publisher={AIP Publishing}
}
```

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
