40 changes: 21 additions & 19 deletions README.rst
@@ -4,15 +4,16 @@

|Code style: black|

|FlashX| |FlowX| |Minimal| |Publish|
|FlashX| |FlowX| |Minimal| |Publish| |Linting|

BoxKit is a library that provides building blocks to parallelize and
scale data science, high performance computing, and machine learning
applications for block-structured datasets. Spatial data from
simulations and experiments can be accessed and managed using tools
available in this library when working with more data analysis oriented
packages like SciKit (https://github.com/scikit-learn/scikit-learn) and
FlowNet (https://github.com/NVIDIA/flownet2-pytorch)
**********
Overview
**********

An overview of BoxKit is available in ``paper/paper.md``, which can be
compiled into a Journal of Open Source Software (JOSS) PDF by running
``make`` in the ``paper`` directory. Please note that the ``Makefile``
requires a functioning Docker service on the machine.

**************
Installation
@@ -63,17 +64,7 @@ source code and is an effective method for debugging. Note that the
*******

BoxKit is undergoing active development and therefore design changes are
frequent, however, the library is divided into two broad categories:

- **Create**: Containing interface for classes/methods to store spatial
data in a rectangular/cubic frame, along with auxiliary tools to
manage irregular geometries composed of unstructured triangular mesh.

- **Utilities**: Containing interface for classes/methods to improve
memory management of data on NUMA nodes.

We are currently setting up use cases for BoxKit, and will update this
section when we are able to demonstrate proof-of-concept.
frequent; we will update this section soon.

**********
Citation
@@ -92,6 +83,15 @@ section when we are able to demonstrate proof-of-concept.
url = {https://doi.org/10.5281/zenodo.7255632}
}

**************
Contribution
**************

Contributions to the source code are encouraged. Developers can create
pull requests from their individual forks to the ``development`` branch.
Please read ``DESIGN.rst`` for an overview of the software design and
the developer guide.

****************
Help & Support
****************
@@ -109,5 +109,7 @@ Please file an issue on the repository page

.. |Publish| image:: https://github.com/akashdhruv/BoxKit/workflows/Publish/badge.svg

.. |Linting| image:: https://github.com/akashdhruv/BoxKit/workflows/Linting/badge.svg

.. |icon| image:: ./media/icon.svg
:width: 30
1 change: 1 addition & 0 deletions media/workflow.drawio


Binary file added media/workflow.png
3 changes: 2 additions & 1 deletion paper/Makefile
@@ -1,2 +1,3 @@
all:
docker run --rm --volume $(PWD):/data --user $(id -u):$(id -g) --env JOURNAL=joss openjournals/inara
	docker run --rm --volume $(PWD):/data --volume $(PWD)/../media:/media \
		--user $$(id -u):$$(id -g) --env JOURNAL=joss openjournals/inara
36 changes: 29 additions & 7 deletions paper/paper.bib
@@ -1,5 +1,5 @@
@article{DUBEY2022,
Title = {Flash-{X}: A multiphysics simulation software instrument},
Title = {Flash-{X}: A Multiphysics Simulation Software Instrument},
Journal = {SoftwareX},
Volume = {19},
Pages = {101168},
@@ -10,14 +10,36 @@ @article{DUBEY2022
Author = {Anshu Dubey and Klaus Weide and Jared O’Neal and Akash Dhruv and Sean Couch and J. Austin Harris and Tom Klosterman and Rajeev Jain and Johann Rudi and Bronson Messer and Michael Pajkos and Jared Carlson and Ran Chu and Mohamed Wahib and Saurabh Chawdhary and Paul M. Ricker and Dongwook Lee and Katie Antypas and Katherine M. Riley and Christopher Daley and Murali Ganapathy and Francis X. Timmes and Dean M. Townsley and Marcos Vanella and John Bachan and Paul M. Rich and Shravan Kumar and Eirik Endeve and W. Raphael Hix and Anthony Mezzacappa and Thomas Papatheodore}
}

@unpublished{DHRUV2023,
@article{DHRUV2023,
Author = {Akash Dhruv},
Title = {A Vortex Damping Outflow Forcing for Multiphase Flows with Sharp Interfacial Jumps},
Year = {2023}
Year = {2023},
Journal = {Under Review}
}

@unpublished{HASSAN2023,
Author = {Shakeel Hassan and Arthur Feeney and Akash Dhruv and Jihoon Kim and Youngjoon Suh and Jaiyoung Ryu and Yoonjin Won and Aparna Chandramowlishwaran},
Title = {{B}ubble{ML}: A Multi-Physics Dataset and Benchmarks for Machine Learning},
Year = {2023}
@dataset{HASSAN2023,
author = {Hassan, Sheikh Md Shakeel and
Feeney, Arthur and
Dhruv, Akash and
Kim, Jihoon and
Suh, Youngjoon and
Ryu, Jaiyoung and
Won, Yoonjin and
Chandramowlishwaran, Aparna},
title = {{BubbleML: A Multi-Physics Dataset and Benchmarks
for Machine Learning}},
month = jun,
year = 2023,
note = {{Only a sample is provided here. Please visit the
GitHub repository for a complete version.}},
publisher = {Zenodo},
version = {1.0},
doi = {10.5281/zenodo.8039787},
url = {https://doi.org/10.5281/zenodo.8039787}
}

@misc{argonne,
author = {{ANL}},
year = 2023,
url = {https://www.anl.gov/topic/business/laboratory-directed-research-and-development-ldrd}
}
103 changes: 53 additions & 50 deletions paper/paper.md
@@ -9,72 +9,75 @@ tags:
authors:
- name: Akash Dhruv
orcid: 0000-0003-4997-321X
affiliation: 1
affiliations:
- name: Argonne National Laboratory, USA
index: 1
date: 15 June 2023
bibliography: paper.bib
---

# Summary

BoxKit is a library that provides building blocks to parallelize and
scale data science, high performance computing, and machine learning
scale data science, statistical analysis, and machine learning
applications for block-structured datasets. Spatial data from
simulations and experiments can be accessed and managed using tools
available in this library when working with more data analysis oriented
packages like SciKit, FlowNet, and OpticalFlow
simulations can be accessed and managed using tools
available in this library when working with Python-based
packages like SciKit, PyTorch, and OpticalFlow.

The library provides a Python interface to efficiently access Adaptive
Mesh Refinement (AMR) data typical of simulation outputs, and leverages
multiprocessing libraries like JobLib and Dask to scale analysis on
Non-Uniform Memory Access (NUMA) and distributed computing architectures.

# Statement of need

Details about why there is software addresses references like
[@DUBEY2022], [@DHRUV2023], and [@HASSAN2023]

# Mathematics

Single dollars ($) are required for inline mathematics e.g. $f(x) = e^{\pi/x}$

Double dollars make self-standing equations:

$$\Theta(x) = \left\{\begin{array}{l}
0\textrm{ if } x < 0\cr
1\textrm{ else}
\end{array}\right.$$

You can also use plain \LaTeX for equations
\begin{equation}\label{eq:fourier}
\hat f(\omega) = \int_{-\infty}^{\infty} f(x) e^{i\omega x} dx
\end{equation}
and refer to \autoref{eq:fourier} from text.

# Citations

Citations to entries in paper.bib should be in
[rMarkdown](http://rmarkdown.rstudio.com/authoring_bibliographies_and_citations.html)
format.

If you want to cite a software repository URL (e.g. something on GitHub without a preferred
citation) then you can do it with the example BibTeX entry below for @fidgit.

For a quick reference, the following citation commands can be used:
- `@author:2001` -> "Author et al. (2001)"
- `[@author:2001]` -> "(Author et al., 2001)"
- `[@author1:2001; @author2:2001]` -> "(Author1 et al., 2001; Author2 et al., 2002)"

# Figures

Figures can be included like this:
![Caption for example figure.\label{fig:example}](figure.png)
and referenced from text using \autoref{fig:example}.

Figure sizes can be customized by adding an optional second parameter:
![Caption for example figure.](figure.png){ width=20% }
Simulation software instruments like Flash-X [@DUBEY2022] store output in
the form of Hierarchical Data Format (HDF5) datasets. Each dataset is often
terabytes (TB) in size and requires cache-efficient techniques to enable its
integration with Python packages. BoxKit data structures act as wrappers around
simulation output stored in HDF5 files and provide metadata for the AMR blocks that
describe the simulation domain. The wrapper objects are lightweight and
represent chunks of data stored on disk, acting as array-like input for Python
functions/methods. This approach allows data to be loaded selectively from disk into
memory in the form of chunks/blocks, which improves cache efficiency. The library also enables
the creation of new datasets for data-intensive workflows, and can be extended beyond its current
application to numerical simulations.
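
The idea of a lightweight wrapper that loads one chunk at a time can be sketched as follows. This is a minimal illustration and not BoxKit's implementation: the `LazyBlock` class, the `write_demo_file` helper, and the `demo` function are hypothetical names, and a flat binary file of doubles stands in for an HDF5 dataset so that the sketch needs only the Python standard library.

```python
import os
import tempfile
from array import array


class LazyBlock:
    """Hypothetical lightweight handle to one block of doubles stored on disk."""

    def __init__(self, path, index, block_size):
        self.path = path
        self.index = index            # position of the block within the file
        self.block_size = block_size  # number of float64 values per block

    def __len__(self):
        return self.block_size

    def load(self):
        """Read only this block's bytes from disk and return them as an array."""
        values = array("d")
        with open(self.path, "rb") as handle:
            # Skip directly to the start of this block; earlier blocks are not read.
            handle.seek(self.index * self.block_size * values.itemsize)
            values.fromfile(handle, self.block_size)
        return values


def write_demo_file(path, num_blocks, block_size):
    """Write num_blocks contiguous blocks; block i is filled with float(i)."""
    with open(path, "wb") as handle:
        for i in range(num_blocks):
            array("d", [float(i)] * block_size).tofile(handle)


def demo():
    """Build a small file, wrap it in lazy handles, and load a single block."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "blocks.bin")
        write_demo_file(path, num_blocks=4, block_size=8)
        blocks = [LazyBlock(path, i, 8) for i in range(4)]
        # Creating the handles touches no data; only block 2 is read here.
        return list(blocks[2].load())
```

Each handle stores only a path, an index, and a size, so a list of handles stays small even when the file on disk is large. Data is read only when `load` is called on a specific block.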

![BoxKit is designed to integrate simulation software instruments like Flash-X
with Python-based machine learning and data analysis packages. Large simulation
datasets (~TB) can leverage BoxKit to improve performance of offline training/analysis.
This mechanism is part of a broader workflow to integrate simulations with machine
learning using a Fortran-Python bridge shown with dotted lines. \label{fig:workflow}](../media/workflow.png)

BoxKit also offers wrappers to scale the deployment of workflows on NUMA and distributed
computing architectures by providing decorators that parallelize Python operations written for a
single data structure so that they operate over a list. This can be understood better using the
workflow described in \autoref{fig:workflow}, which has been applied to data analysis and
machine learning applications in chemical and thermal science engineering [@DHRUV2023; @HASSAN2023].
Output from Flash-X boiling simulations is created and stored on multinode clusters. Processing
this output through BoxKit allows a simple operation on a single block to be scaled to a list of blocks, as
shown below:

```
def operation_on_block(block, *args):
    pass

Action(operation_on_block, no_of_processes, parallel_backend)(
    (block for block in list_of_blocks), *args)
```

The `Action` wrapper converts the function, `operation_on_block`, into a parallel method that
can be deployed on a multinode cluster with the desired backend (JobLib/Dask). BoxKit does not
interfere with the parallelization schemes of target applications like SciKit, OpticalFlow, and PyTorch,
which function independently using available resources.
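
To make the pattern concrete, the sketch below shows one way a wrapper of this kind could be written. It is an assumption-laden illustration rather than BoxKit's `Action`: the `ParallelAction` class and the `block_sum` function are hypothetical, and a standard-library thread pool is swapped in for the JobLib/Dask backends so the example has no external dependencies.

```python
from concurrent.futures import ThreadPoolExecutor


class ParallelAction:
    """Hypothetical stand-in for an Action-style wrapper (not BoxKit's code)."""

    def __init__(self, target, num_workers=2):
        self.target = target            # function written for a single block
        self.num_workers = num_workers  # size of the worker pool

    def __call__(self, blocks, *args):
        """Apply target(block, *args) to every block; results keep input order."""
        with ThreadPoolExecutor(max_workers=self.num_workers) as pool:
            return list(pool.map(lambda block: self.target(block, *args), blocks))


def block_sum(block, scale):
    """Example per-block operation: scaled sum of the values in one block."""
    return scale * sum(block)


# Plain lists stand in for blocks of simulation data.
list_of_blocks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
results = ParallelAction(block_sum, num_workers=2)(
    (block for block in list_of_blocks), 10)
```

The per-block function stays unaware of parallelism; only the wrapper decides how work is distributed, which is what allows the backend to be swapped without touching the operation itself.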

We aim to use BoxKit as part of a broader workflow that integrates Fortran/C++ based applications
with state-of-the-art machine learning packages available in Python, shown with dotted lines in
\autoref{fig:workflow}.

# Acknowledgements

We acknowledge contributions from Brigitta Sipocz, Syrtis Major, and Semyeong
Oh, and support from Kathryn Johnston during the genesis of this project.
We acknowledge support from the Laboratory Directed Research and Development
(LDRD) program at Argonne National Laboratory [@argonne].

# References