A physics-informed machine learning pipeline for predicting molecular Self-Consistent Field (SCF) energies at the B3LYP/6-31G(2df,p) level of theory. Predictions complete in under 1 second, enabling high-throughput screening of molecular libraries.
- Key Features
- Physics Components
- Files
- Performance
- Installation
- Usage
- Input File Format
- Supported Elements
- For Developers
- Limitations
- Citation
- License
- Acknowledgments
- Fast predictions: < 1 second per molecule (~ 57 mols/sec)
- Physics-informed: Embeds physical laws directly into the model architecture
- Generalizable: Trained on small molecules (≤9 heavy atoms), extends to larger drug-like molecules, tested up to 24 heacy atoms (167% extrapolation)
- Solvation support: Predicts energies in various solvents (water, DMSO, methanol, etc.)
The model incorporates multiple fundamental physical principles organized into hierarchical correction layers:
- B3LYP reference energies for isolated atoms (H, C, N, O, F)
- Provides the foundation for molecular energy prediction via atom counting
Hückel Theory (π-systems)
- π-electron delocalization energy for conjugated and aromatic systems
- HOMO-LUMO gap estimation
- Orbital energy spread for reactivity prediction
Extended Hückel Theory
- Valence orbital ionization potentials (VOIP) for all atoms
- Slater orbital exponents for overlap estimation
- Electronegativity and chemical hardness sums
Lone Pair Interactions
- Explicit counting of lone pairs by atom type and hybridization (N_sp², N_sp³, O_sp², O_sp³, F)
- Lone pair orbital energy contributions
- Adjacent lone pair repulsion (1,2 and 1,3 interactions)
- Hyperconjugation potential (n→σ* donation to adjacent C-H bonds)
- Anomeric effect detection (O-C-O and N-C-O patterns)
Nitrogen Environments
- Pyrrole N (lone pair in π-system)
- Pyridine N (lone pair perpendicular to ring)
- Aniline N (conjugated with aromatic ring)
- Amide N (conjugated with C=O)
- Aliphatic amines (primary, secondary, tertiary)
- Nitrile, nitro, and imine nitrogen
Oxygen Environments
- Carbonyl oxygen (ketone, aldehyde)
- Carboxylic acid and ester oxygen
- Amide oxygen
- Alcohols and phenols
- Aliphatic vs aromatic ethers (critical for methoxy groups)
- Furan oxygen (aromatic)
- Epoxide oxygen (ring strain)
Aromatic-Heteroatom Conjugation
- Electron-donating groups on aromatics (methoxy, hydroxy, amino)
- Electron-withdrawing groups (nitro, cyano, carbonyl)
- Resonance stabilization energy corrections
- Polysubstituted aromatic interactions
- Non-linear scaling with electron count (N^1.3)
- Heteroatom-heteroatom correlation enhancement
- Oxygen pair correlation (particularly strong for multiple O atoms)
- Heteroatom fraction and variance terms
- Linear baseline from atom count, hydrogen count, and bond count
- Bond-specific vibrational frequencies (C-H, N-H, O-H stretches, etc.)
- Gradient boosting residual model for fine corrections
- Proper handling of high-frequency modes
Within Training Domain (≤9 heavy atoms)
- Polynomial size correction learned from residuals
- Size-stratified heteroatom corrections (size bins × O count)
Extrapolation Beyond Training (>9 heavy atoms)
- Learned O-size trend extrapolation
- Quadratic size correction for large molecules
- Heteroatom-size interaction terms (O×size, N×size)
- Framework stabilization from carbon skeleton
- Damping factors for very large molecules (>20 heavy atoms)
- Kirkwood dielectric continuum for electrostatic solvation
- Cavity formation energy from solvent-accessible surface area
- Dispersion interactions based on molecular polarizability
- Explicit hydrogen bond corrections for protic solvents
- Supported solvents: water, methanol, ethanol, DMSO, acetonitrile, chloroform, hexane, and others
- SMARTS-based detection of 18+ functional groups
- Empirical bias corrections learned from training residuals
- Covers ketones, aldehydes, carboxylic acids, esters, amides, nitriles, amines, alcohols, phenols, ethers, nitro groups, and heterocycles
| File | Description | Who needs it |
|---|---|---|
msep_core.py |
Core library with all functions | Everyone |
msep_model.pkl |
Pre-trained model weights | Everyone |
msep_predict.py |
User-facing prediction script | End users |
msep_train.py |
Training script (generates model.pkl) | Developers only |
| Metric | Result (kcal/mol) |
|---|---|
| MAE (overall) | 3.42 |
| MAE (N ≤ 7) | 2.10 |
| MAE (N = 9) | 3.57 |
| RMSE | 5.34 |
| Bias | -0.005 |
| P95 Error | 9.17 |
| ZPVE MAE | 2.19 |
Tested on QM9 dataset (10,000 held-out molecules with ≤9 heavy atoms)
# Install dependencies
pip install numpy pandas scikit-learn requests
# Install RDKit (required)
conda install -c conda-forge rdkit
# OR
pip install rdkit- Download these files to your working directory:
msep_core.pymsep_model.pklmsep_predict.py
# Basic usage
python msep_predict.py compounds.csv
# Specify output file
python msep_predict.py compounds.csv --output results.csv
# Specify model path
python msep_predict.py compounds.csv --model /path/to/msep_model.pkl# Make sure to have `msep_core.py` , `msep_model.pkl` , `msep_predict.py`, and 'Input_compounds.csv' in working directory and all other requirements are installed.
%run msep_predict.py Create a CSV file named Input_compounds.csv with the following columns:
| Column | Required | Description |
|---|---|---|
smiles |
Yes | SMILES string of the molecule |
Compound |
No | Compound name/identifier |
scf |
No | Experimental SCF energy (Hartree) for validation |
solvent |
No | Solvent name (default: vacuum) |
| Compound | smiles | solvent | scf |
|---|---|---|---|
Bufotenin |
CN(C)CCc1c[nH]c2ccc(O)cc12 |
water |
(In Ha, if available) |
Multiple inputs can exist for the same compound if the user has different SCF energies at different conformers. Simply continue to fill out the file, filling all applicable sections (even if repeated).
vacuum/gas(default)watermethanol/meohethanol/etohdmsoacetonitrile/mecndichloromethane/dcm/ch2cl2chloroform/chcl3acetonethfhexanebenzenetoluene
The model supports molecules containing: H, C, N, O, F
Molecules with other elements will return None and be marked as failed.
If you need to retrain the model:
python msep_train.pyThis will:
- Download the QM9 dataset
- Extract features for 50,000 molecules
- Train the ML models
- Save to
msep_model.pkl
Training takes approximately 10-15 minutes.
The model uses a stacked ensemble:
- Huber regression - robust baseline
- Ridge regression with polynomial features
- Gradient boosting (HistGradientBoostingRegressor) - main model
- Refinement GB - captures residuals
- Size/FG corrections - empirical corrections
- Heteroatom corrections - accounts for lone pairs
- Extrapolation corrections - for systems larger than 9 heavy atoms
- Training domain: Best accuracy for molecules with ≤9 heavy atoms
- Elements: Only H, C, N, O, F (current physics features are heavily focused on N, O, weaker in F)
- Conformers: Does not predict conformational energy differences
- Level of theory: Trained on B3LYP/6-31G(2df,p); other methods may differ
- Error by atom type: Heteroatom-rich molecules tend to have higher prediction errors.
If you use this code, please cite:
@software{msep2026,
title={MSEP: Machine Learning SCF Energy Prediction},
year={2026},
url={https://github.com/jkimthe16th/MSEP}
}
MIT License
- QM9 dataset: Ramakrishnan et al., Scientific Data 1, 140022 (2014)
- RDKit: Open-source cheminformatics toolkit
