Data repository for the publication: Data-driven discovery of photoactive quaternary oxides using first-principles machine learning
The high-throughput workflow combines machine learning, data-driven models and first-principles calculations. The overall aim is to filter a search space of 1 million quaternary oxide compositions to identify those that fall within a stated stability window, have a bandgap in the range 1.0 to 2.5 eV, and are composed of earth-abundant elements.
Steps 1 and 2: Machine learning
- Train a Gradient Boosting Regressor (GBR) model to predict bandgap from composition
- Filter newly generated compositions using the GBR model
- Rank compositions by sustainability
- Assign structures
- Apply oxidation state probability filter
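The filtering and ranking steps listed above can be pictured as a simple funnel. The following is only a minimal sketch, not the code used in the notebooks: the `Candidate` class, the placeholder labels, the predicted gaps and the sustainability scores (lower taken to mean more sustainable) are all invented for illustration. Only the 1.0 to 2.5 eV window comes from the description above. Structure assignment and the oxidation state probability filter are not covered.

```python
from dataclasses import dataclass

# Bandgap window stated in the workflow description (eV).
GAP_MIN, GAP_MAX = 1.0, 2.5


@dataclass(frozen=True)
class Candidate:
    label: str             # placeholder name, not a real composition
    predicted_gap: float   # eV, e.g. as predicted by a trained GBR model
    sustainability: float  # hypothetical score; lower = more sustainable


def screen(candidates, gap_min=GAP_MIN, gap_max=GAP_MAX):
    """Keep candidates inside the bandgap window, then rank by score."""
    in_window = [c for c in candidates if gap_min <= c.predicted_gap <= gap_max]
    return sorted(in_window, key=lambda c: c.sustainability)


# Made-up values, purely to exercise the function.
candidates = [
    Candidate("cand_a", 0.4, 1.0),
    Candidate("cand_b", 1.8, 3.0),
    Candidate("cand_c", 2.1, 0.5),
    Candidate("cand_d", 3.2, 0.1),
    Candidate("cand_e", 1.0, 2.0),
]

shortlist = screen(candidates)
for c in shortlist:
    print(c.label, c.predicted_gap, c.sustainability)
```

Filtering before ranking keeps the more expensive later steps restricted to compositions that already satisfy the bandgap criterion.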
Steps 4 and 5: Thermodynamic stability and electronic properties
- Thermodynamic stability calculations with high-throughput Density Functional Theory (DFT)
- Bandgap calculation with hybrid DFT
The required data can be downloaded separately from the above Zenodo DOI link and should be untarred directly into this directory, creating a sub-directory named data. The first notebook also requires a dataset from the Computational Materials Repository (CMR).
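If you prefer to unpack the archive from Python rather than the command line, something like the following should work. It is a sketch under assumptions: the helper name `extract_data` is made up here, and the archive file name depends on what you downloaded from Zenodo, so it is left as a parameter. Only extract archives you trust.

```python
import tarfile
from pathlib import Path


def extract_data(archive, repo_root="."):
    """Untar `archive` into `repo_root` and return the resulting data directory."""
    repo_root = Path(repo_root)
    with tarfile.open(archive) as tar:
        tar.extractall(path=repo_root)
    data_dir = repo_root / "data"
    # The README expects extraction to create a sub-directory named "data".
    if not data_dir.is_dir():
        raise FileNotFoundError(f"expected {data_dir} after extraction")
    return data_dir


# Example call (the archive name is a placeholder, not the real file name):
# extract_data("downloaded_archive.tar.gz", repo_root=".")
```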
The notebooks depend on several Python packages, which can be installed with pip:
pip install pymongo pymatgen matminer scikit-learn smact pandas atomate fireworks
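Before opening the notebooks it can be handy to check which of these packages are actually importable in the current environment. The snippet below is a small convenience sketch, not part of the repository. The helper name `missing_packages` is made up, and the mapping reflects the usual import names (note that `scikit-learn` is imported as `sklearn`).

```python
from importlib.util import find_spec

# pip distribution name -> top-level import name
REQUIRED = {
    "pymongo": "pymongo",
    "pymatgen": "pymatgen",
    "matminer": "matminer",
    "scikit-learn": "sklearn",
    "smact": "smact",
    "pandas": "pandas",
    "atomate": "atomate",
    "fireworks": "fireworks",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if find_spec(module) is None
    ]


print("Missing:", ", ".join(missing_packages()) or "none")
```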
- Some notebooks connect to the Materials Project through its API. Freshly downloaded data may therefore not exactly match the data used for the work in the original paper.
- The GBR model is built from scratch. Due to the randomness deliberately introduced in the training process, the predicted bandgap for a given composition will vary slightly each time a new model is built.
- Many different libraries are used and I am not an expert in all of them: some of the code is probably far from elegant!
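On the GBR randomness noted above: if you want repeatable predictions between runs, one option is to fix the seed in scikit-learn. The sketch below uses synthetic data and hyperparameters chosen only for illustration. `subsample=0.8` is an assumption made here to introduce stochasticity and is not taken from the paper, and the features are random numbers rather than real composition descriptors.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-ins for composition features and bandgaps (not real data).
rng = np.random.default_rng(0)
X = rng.random((200, 5))
weights = np.array([1.0, 0.5, 0.0, 2.0, 0.3])
y = X @ weights + 0.05 * rng.standard_normal(200)


def fit_predict(seed):
    """Train a fresh GBR with the given seed and predict the first 5 rows."""
    model = GradientBoostingRegressor(subsample=0.8, random_state=seed)
    model.fit(X, y)
    return model.predict(X[:5])


same_a = fit_predict(42)
same_b = fit_predict(42)
other = fit_predict(7)

# Identical seeds should reproduce the same predictions.
print("same seed matches:", np.allclose(same_a, same_b))
# A different seed typically shifts the predictions slightly.
print("max difference with another seed:", np.max(np.abs(same_a - other)))
```

Fixing the seed makes a single run repeatable, but it does not make the result match the models trained for the original paper.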