MegaMmap: Blurring the Boundary Between Memory and Storage

MegaMmap is a software distributed shared memory (DSM) system that eliminates the traditional boundary between memory and storage for data-intensive HPC workloads. By providing a unified, byte-addressable interface that spans DRAM, NVMe, SSD, and HDD tiers, MegaMmap enables applications to work with datasets larger than memory capacity while maintaining competitive performance.

Key Features

Infinite Memory Abstraction: Present massive datasets as if they were in main memory
Intelligent Tiering: Automatically manage data across DRAM, NVMe, SSD, and HDD tiers
Transactional Memory API: Declare access patterns to optimize prefetching and coherence
Intent-Aware Coherence: Reduce communication overhead with workload-specific optimizations
Persistent Integration: Transparently stage data to/from HDF5, Parquet, and other formats
HPC-Optimized: Designed for scientific simulations, machine learning, and data analytics

Performance Highlights

2.6x DRAM Reduction: Maintain performance with 60% less memory usage
2x Faster than Spark: Outperform cloud-based solutions for memory-intensive workloads
Larger-than-Memory Datasets: Process datasets 2x larger than available memory
45% Code Reduction: Simpler development compared to traditional out-of-core approaches

Architecture

MegaMmap consists of several key components:

Private Cache (pcache): Per-process DRAM cache for low-latency access
Shared Cache (scache): Distributed, tiered cache across all processes
Data Organizer: Intelligent placement based on access patterns and scores
Prefetcher: Overlaps computation with data movement
Transaction System: Declares intent for optimized coherence
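
The relationship between the private and shared caches can be illustrated with a small, self-contained sketch. This is not MegaMmap's implementation: `SharedCacheSketch`, `PrivateCacheSketch`, the read-only `Get()` path, and the LRU eviction policy are all assumptions chosen for brevity (the component list above says placement is score-based, and the real shared cache is distributed and tiered). The sketch only shows the general idea of a bounded per-process cache that falls back to a larger shared tier on a miss.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

using Page = std::vector<char>;

// Hypothetical stand-in for the shared cache: a local map keyed by page id.
// (Assumption: the real scache is distributed; this one lives in one process.)
struct SharedCacheSketch {
  std::unordered_map<std::size_t, Page> pages_;
  std::size_t reads_ = 0;  // How many times a page was fetched from this tier.

  Page Get(std::size_t page_id) {
    ++reads_;
    return pages_.at(page_id);  // Throws std::out_of_range if the page is absent.
  }
};

// Hypothetical bounded private cache. LRU eviction is an assumption made for
// the sketch, not the policy MegaMmap uses.
class PrivateCacheSketch {
 public:
  PrivateCacheSketch(SharedCacheSketch &scache, std::size_t max_pages)
      : scache_(scache), max_pages_(max_pages) {
    assert(max_pages > 0);
  }

  // Return a page, serving it from DRAM when possible.
  const Page &Get(std::size_t page_id) {
    auto it = index_.find(page_id);
    if (it != index_.end()) {
      // Hit: mark the page as most recently used.
      lru_.splice(lru_.begin(), lru_, it->second);
      return it->second->second;
    }
    // Miss: make room by dropping the least recently used page.
    if (lru_.size() == max_pages_) {
      index_.erase(lru_.back().first);
      lru_.pop_back();
    }
    // Fetch from the shared tier and remember the page locally.
    lru_.emplace_front(page_id, scache_.Get(page_id));
    index_[page_id] = lru_.begin();
    return lru_.front().second;
  }

 private:
  using LruList = std::list<std::pair<std::size_t, Page>>;
  SharedCacheSketch &scache_;
  std::size_t max_pages_;
  LruList lru_;  // Front is the most recently used page.
  std::unordered_map<std::size_t, LruList::iterator> index_;
};
```

Repeated reads of the same page touch the shared tier only once while the page stays resident, which is the effect a private cache is meant to provide. Writes, coherence, and prefetching are deliberately left out of the sketch.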

Quick Start

Prerequisites

C++17 compliant compiler (GCC 9.4.1+)
MPI implementation (MPICH 3.4.3+ recommended)
Hermes buffering system
CMake 3.12+

Installation

# Clone the repository
git clone https://github.com/grc-iit/mega_mmap.git
cd mega_mmap

# Install dependencies (requires Spack)
./deps.sh

# Build MegaMmap
mkdir build && cd build
cmake ../ -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$(scspkg pkg root mega_mmap)
make -j8

# Load environment
module load mega_mmap

Basic Usage

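#include <cmath>    // pow
#include <vector>   // std::vector
#include <mega_mmap/vector.h>

// Point3D and NearestCentroid() are application-defined and not shown here.
void KMeansInertia(std::vector<Point3D> &ks) {
    int rank = mpi::get_rank();
    int nprocs = mpi::get_comm_size();
    
    // Create a shared vector from a Parquet file
    mm::Vector<Point3D> pts("/points.parquet");
    pts.BoundMemory(MEGABYTES(1));  // Limit to 1 MB of DRAM
    pts.Pgas(rank, nprocs);          // Partition across processes
    
    // Begin a sequential, read-only transaction over this process's partition
    auto tx = pts.SeqTxBegin(pts.local_off(), pts.local_size(), MM_READ_ONLY);
    
    // Accumulate the squared distance of each local point to its nearest centroid
    float distance = 0;
    for (Point3D p : tx) {
        distance += pow(NearestCentroid(p, ks), 2);
    }
    
    pts.TxEnd();
    // distance now holds this process's partial sum; combine the partial
    // sums across processes (e.g., with an MPI reduction) to get the total.
}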
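
The example above relies on `Point3D` and `NearestCentroid`, which this README does not define, and on each process owning a slice of the vector. The sketch below fills those gaps with plain standard C++ so the arithmetic can be checked without MegaMmap, MPI, or Hermes. Everything in it is an assumption made for illustration: the `Point3D` layout, the Euclidean-distance semantics of `NearestCentroid`, and the `Slice`, `BlockPartition`, and `LocalInertia` helpers. In particular, `BlockPartition` is one common way to split `n` elements across `nprocs` ranks and is not necessarily how `Pgas()` computes `local_off()` and `local_size()`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Assumed layout; the README does not show the real definition.
struct Point3D {
  float x, y, z;
};

// Assumed semantics: Euclidean distance from p to the closest centroid in ks.
float NearestCentroid(const Point3D &p, const std::vector<Point3D> &ks) {
  assert(!ks.empty());
  float best = std::numeric_limits<float>::max();
  for (const Point3D &k : ks) {
    float dx = p.x - k.x, dy = p.y - k.y, dz = p.z - k.z;
    best = std::min(best, std::sqrt(dx * dx + dy * dy + dz * dz));
  }
  return best;
}

// Hypothetical description of the contiguous range owned by one rank.
struct Slice {
  std::size_t off;
  std::size_t size;
};

// Hypothetical block partition of n elements across nprocs ranks.
// The first (n % nprocs) ranks each receive one extra element.
Slice BlockPartition(std::size_t n, int rank, int nprocs) {
  assert(nprocs > 0 && rank >= 0 && rank < nprocs);
  std::size_t p = static_cast<std::size_t>(nprocs);
  std::size_t r = static_cast<std::size_t>(rank);
  std::size_t base = n / p;
  std::size_t rem = n % p;
  return {r * base + std::min(r, rem), base + (r < rem ? 1u : 0u)};
}

// Partial inertia: sum of squared nearest-centroid distances over one slice.
float LocalInertia(const std::vector<Point3D> &pts, Slice s,
                   const std::vector<Point3D> &ks) {
  assert(s.off + s.size <= pts.size());
  float distance = 0;
  for (std::size_t i = s.off; i < s.off + s.size; ++i) {
    float d = NearestCentroid(pts[i], ks);
    distance += d * d;
  }
  return distance;
}
```

Calling `LocalInertia(pts, BlockPartition(pts.size(), rank, nprocs), ks)` on every rank and summing the results gives the same total as a single pass over all points, which is the property the partitioned loop in the example depends on.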

MegaMmap AI Guidelines (for coding agents)

Build & Test Commands

Build: mkdir build && cd build && cmake ../ -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$(scspkg pkg root mega_mmap) && make -j8
Debug Build: Use -DCMAKE_BUILD_TYPE=Debug for debug symbols
Single Test: jarvis pipeline run yaml test/unit/pipelines/{test_name}.yaml (e.g., mm_kmeans_mega.yaml)
All Tests: Tests are defined in test/unit/CMakeLists.txt using the jarvis_test() function

Code Style Guidelines

Language: C++17 standard
Namespaces: Use mm namespace for core library code
Naming:
- Classes: PascalCase (e.g., Vector, Bounds)
- Variables: snake_case with trailing underscore (e.g., window_size_, elmts_per_page_)
- Functions: PascalCase for public methods
- Constants: UPPER_CASE with prefix (e.g., MM_READ_ONLY, MM_PAGE_SIZE)
Headers: Include guard format: MEGAMMAP_INCLUDE_{PATH}_{FILE}_H_
File Organization:
- Headers in include/mega_mmap/
- Executable implementations in benchmark/
- Tests in test/unit/
Dependencies: Uses Hermes, MPI, Arrow, Parquet, YAML-CPP, Catch2, OpenMP
Macros: Use BIT_OPT(u32, n) for bit flags, KILOBYTES(), MEGABYTES() for sizes
Error Handling: Hermes logging via hermes_shm/util/logging.h
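
As a rough illustration of how these conventions might look together, here is a small header-style sketch. It is not a file from the repository: `WindowSketch`, `window_sketch.h`, the guard name derived from it, and the `MM_SKETCH_*` flags are hypothetical, and the `KILOBYTES`, `MEGABYTES`, and `BIT_OPT` definitions are stand-ins written so the sketch compiles on its own (the project takes the real ones from its dependencies, and their exact definitions are assumed here).

```cpp
// Hypothetical file: include/mega_mmap/window_sketch.h
#ifndef MEGAMMAP_INCLUDE_MEGA_MMAP_WINDOW_SKETCH_H_
#define MEGAMMAP_INCLUDE_MEGA_MMAP_WINDOW_SKETCH_H_

#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in macros (assumed definitions) so the sketch is self-contained.
#ifndef KILOBYTES
#define KILOBYTES(n) (static_cast<std::size_t>(n) * 1024)
#endif
#ifndef MEGABYTES
#define MEGABYTES(n) (KILOBYTES(n) * 1024)
#endif
#ifndef BIT_OPT
#define BIT_OPT(T, n) (static_cast<T>(1) << (n))
#endif

namespace mm {  // Core library code lives in the mm namespace.

using u32 = std::uint32_t;

// Constants: UPPER_CASE with a prefix. These two flags are hypothetical.
static constexpr u32 MM_SKETCH_FLAG_A = BIT_OPT(u32, 0);
static constexpr u32 MM_SKETCH_FLAG_B = BIT_OPT(u32, 1);

// Classes: PascalCase.
class WindowSketch {
 public:
  WindowSketch(std::size_t window_size, std::size_t elmt_size)
      : window_size_(window_size), elmts_per_page_(0) {
    assert(elmt_size > 0);
    elmts_per_page_ = window_size_ / elmt_size;
  }

  // Public methods: PascalCase.
  std::size_t GetElmtsPerPage() const { return elmts_per_page_; }
  bool Fits(std::size_t num_elmts) const { return num_elmts <= elmts_per_page_; }

 private:
  // Variables: snake_case with a trailing underscore.
  std::size_t window_size_;
  std::size_t elmts_per_page_;
};

}  // namespace mm

#endif  // MEGAMMAP_INCLUDE_MEGA_MMAP_WINDOW_SKETCH_H_
```

With these stand-ins, `mm::WindowSketch(KILOBYTES(4), sizeof(double))` holds 4096 / 8 elements per page on platforms where `double` is 8 bytes.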

📈 Supported Workloads

MegaMmap has been validated on production HPC applications:

Machine Learning: KMeans clustering, Random Forest classification
Scientific Simulation: Gray-Scott reaction-diffusion models
Data Analytics: DBSCAN clustering on cosmological datasets
Signal Processing: Gadget2 cosmological simulation conversion

🧪 Running Benchmarks

# Run single benchmark
jarvis pipeline run yaml test/unit/pipelines/mm_kmeans_mega.yaml

# Run all benchmarks
cd test/unit && make -j8

📚 Documentation

Paper (SC24) - Complete research paper with evaluation
API Reference - API documentation
Examples - Application examples
Configuration Guide - Configuration parameters

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

National Science Foundation (NSF) Grants CSSI-2104013 and Core-2313154
U.S. Department of Energy (DOE) Contract DE-SC0024593
Chameleon Cloud Testbed for development environment

📞 Contact

For questions, support, or collaborations:

Luke Logan (Author & Maintainer): llogan@illinoistech.edu
Anthony Kougkas: akougkas@illinoistech.edu
Xian-He Sun: sun@illinoistech.edu

Gnosis Research Center - Illinois Institute of Technology

Citation: If you use MegaMmap in your research, please cite our SC24 paper.

@inproceedings{logan2024megammap,
  title={MegaMmap: Blurring the Boundary Between Memory and Storage for Data-Intensive Workloads},
  author={Logan, Luke and Kougkas, Anthony and Sun, Xian-He},
  booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC)},
  year={2024}
}
