This repository contains the code and data accompanying our paper 'Not as simple as we thought: a rigorous examination of data aggregation in materials informatics' (https://pubs.rsc.org/en/content/articlelanding/2024/dd/d3dd00207a).
- Clone this repository:
git clone https://github.com/FedeOtto/DataIntegrationMI - Install a new
condaenvironment fromdaggrmi_env.yml:conda env create -f daggrmi_env.yml - Activate the new environment:
conda activate daggrmi
All the utilized datasets are stored in the datasets folder. MPDS data can be obtained by running the retrieve_mpds.py script, given that access to the API is provided. For more info visit https://mpds.io/developer/. Examples of data aggregation can still be reproduced using Materials Project (mp) and AFLOW (aflow) data assessed in this work.
The scripts present in this repository allow to reproduce specific figures and analysis outlined in the main paper. In particular:
2_datasets.pyreproduces Fig.1.1_imbalance.pyreproduces Fig. 2.3_AB_augment.pyreproduces the results presented in Table 2.6_self_augment.pyreproduces the plots in Fig. 3.