An ultra-fast OCI distribution proxy and caching engine designed to scale AI image delivery across massive HPC-K8s clusters.
- 🚀 High Performance: Optimized proxying with intelligent caching, built on Containerd
- 🔄 Dual-Mode Architecture:
  - Proxy Mode: Transparent proxy with blob request optimization
  - SuperNode Mode: Pulls images from the Registry and stores them on distributed shared storage
- 📦 OCI v2 API Compatible: Full support for OCI Registry API v2 specification
- 🌊 JSONL Streaming: Custom streaming protocol for distributed data retrieval (a hypothetical sketch follows this list)
- 📦 Chunked Storage: Concurrent downloads with incremental updates on distributed storage
- 🔍 Smart Caching: On-demand pulling with intelligent cache management
- 🎯 Image Warmup: CRD-based image preheating from Registry for SuperNode mode
- 🐳 Kubernetes Native: Helm charts with RBAC and leader election
- 💾 Distributed Storage: Native support for POSIX-compatible distributed shared storage systems
- 🔧 Containerd Integration: Seamless integration with Containerd runtime
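The JSONL streaming protocol is internal to Reflux and its wire format is not documented here. Purely as a hypothetical sketch of the idea (every field name below is invented for illustration, not taken from the project), a stream of newline-delimited JSON records could advertise chunk availability during a distributed pull:

```jsonl
{"blob": "sha256:9f86d081...", "chunk": 0, "offset": 0, "size": 4194304, "status": "ready"}
{"blob": "sha256:9f86d081...", "chunk": 1, "offset": 4194304, "size": 4194304, "status": "downloading"}
{"blob": "sha256:9f86d081...", "chunk": 2, "offset": 8388608, "size": 4194304, "status": "pending"}
```

Each line is an independent JSON object, so a client can act on chunk updates as they arrive instead of waiting for a complete response.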
Reflux adopts a unified service architecture in which the `--blob-mode` flag selects the role, built on Containerd and distributed shared storage:
```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│ Containerd/Node │────│   Proxy Mode    │────│ SuperNode Mode  │
│                 │    │                 │    │                 │
│ nerdctl/ctr pull│    │ • Transparent   │    │ • Pull from     │
│ nerdctl/ctr push│    │ • Optimized     │    │ • Registry      │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │                      │
                                └──────────────────────┘
                                            │
                                   ┌──────────────────┐
                                   │   Distributed    │
                                   │  Shared Storage  │
                                   │                  │
                                   │ POSIX-compatible │
                                   └──────────────────┘
                                            │
                                   ┌──────────────────┐
                                   │     Registry     │
                                   │  (Harbor/etc.)   │
                                   │                  │
                                   │ Private/Upstream │
                                   └──────────────────┘
```
Production Test Results (400-node AI-HPC Cluster)
- Test Scenario: 400 nodes concurrently pulling a 20 GB AI training model image
- Client Download Bandwidth: 77.4 GB/s aggregate
- SuperNode Upstream Bandwidth: 253 MB/s (Harbor Registry)
- Full Image Download Time: 3 minutes
- Efficiency: ~306x throughput amplification over the upstream link (77.4 GB/s ÷ 253 MB/s ≈ 306)
- Go 1.19+
- Containerd (for containerized deployment)
- Kubernetes 1.19+ (for Kubernetes deployment)
- Distributed shared storage (Ceph, NFS, S3, etc.), required for SuperNode mode
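The shared storage path used throughout the examples below must be the same POSIX mount on every node. As an illustrative sketch (the server name and export path are placeholders, not from the project), an NFS export could be mounted like this:

```bash
# Create the mount point used by the Reflux examples below
sudo mkdir -p /mnt/shared-storage

# Mount the NFS export (placeholder server/export); every SuperNode and
# proxy instance must see the same files at this path
sudo mount -t nfs nfs.example.com:/export/reflux /mnt/shared-storage
```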
```bash
# Clone repository
git clone <repository-url>
cd reflux
# Build proxy server
make build-proxy
# Build controller
make build-controller
# Build all components
make all
```

Proxy Mode:

```bash
./bin/reflux-proxy --blob-mode=proxy --http-port=8080 --upstream=https://harbor.example.com --storage=/mnt/shared-storage/reflux
```

SuperNode Mode:

```bash
# Run proxy server
./bin/reflux-proxy --blob-mode=supernode --http-port=8080 --upstream=https://harbor.example.com --storage=/mnt/shared-storage/reflux
# Run controller (with leader election)
./bin/reflux-controller --blob-mode=supernode --storage=/mnt/shared-storage/reflux --leader-elect
```
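Once a proxy instance is up, its OCI v2 API can be probed directly. These endpoints come from the OCI Distribution Specification; the host and port match the example command above, and the repository/tag are placeholders:

```bash
# API version check endpoint defined by the OCI Distribution Specification
curl -i http://localhost:8080/v2/

# Fetch a manifest through the proxy (placeholder repository and tag)
curl -i -H "Accept: application/vnd.oci.image.manifest.v1+json" \
  http://localhost:8080/v2/library/app/manifests/latest
```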
Reflux is designed to work seamlessly with Containerd. Configure Containerd to use Reflux as a registry mirror:

```toml
# /etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d"
```
```toml
# /etc/containerd/certs.d/harbor.example.com/hosts.toml
server = "https://harbor.example.com"
[host."https://localhost:9980"]
capabilities = ["pull"]
skip_verify = true
[host."https://localhost:9980".header]
x-custom-2 = ["value1", "value2"]
```
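With the hosts.toml in place, containerd routes pulls for harbor.example.com through Reflux. A quick way to verify (the image name is a placeholder; nerdctl reads /etc/containerd/certs.d by default):

```bash
# Restart containerd so the config.toml change takes effect
sudo systemctl restart containerd

# Pull through the mirror; hosts.toml sends the request to Reflux first
sudo nerdctl pull harbor.example.com/library/app:latest
```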
For the Kubernetes deployment, generate a self-signed certificate and store it as a TLS secret:

```bash
# Generate a private key
openssl genrsa -out tls.key 2048
# Generate a Certificate Signing Request (CSR) with the specified subject
openssl req -new -key tls.key -out tls.csr -subj "/CN=reflux"
# Generate a self-signed certificate valid for 365 days
openssl x509 -req -in tls.csr -signkey tls.key -out tls.crt -days 365
# Create a Kubernetes TLS Secret
kubectl create secret tls reflux-tls --cert=tls.crt --key=tls.key -n reflux-system
# Clean up temporary files (optional)
rm tls.key tls.csr tls.crt
```
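Before installing the chart, it is worth confirming the secret exists in the namespace the deployment expects:

```bash
# The secret name and namespace match the kubectl command above
kubectl get secret reflux-tls -n reflux-system
```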
Deploy Reflux with the Helm chart:

```bash
cd deploy/reflux
helm install reflux .
```

Reflux is specifically designed for AI and High-Performance Computing (HPC) environments where:
- Large-scale model images: Efficiently cache and distribute large AI/ML model containers
- Distributed training: Ensure consistent image availability across compute nodes
- High concurrency: Handle simultaneous image pulls during job scheduling
- Network optimization: Reduce bandwidth usage and improve pull speeds
