Merged
15 changes: 15 additions & 0 deletions README.md
@@ -1,3 +1,7 @@
<p align="center">
<img src="madengine.png" alt="madengine Logo" />
</p>

# madengine

[![Python](https://img.shields.io/badge/python-3.8%2B-blue.svg)](https://python.org)
@@ -34,6 +38,7 @@ madengine is a modern CLI tool for running Large Language Models (LLMs) and Deep
- **🎯 Simple Deployment** - Run locally or deploy to Kubernetes/SLURM via configuration
- **🔧 Distributed Launchers** - Full support for torchrun, DeepSpeed, Megatron-LM, TorchTitan, vLLM, SGLang
- **🐳 Container-Native** - Docker-based execution with GPU support (ROCm, CUDA)
- **📂 ROCm Path** - Support for non-default ROCm installs via `--rocm-path` or `ROCM_PATH` (e.g. TheRock, pip)
- **📊 Performance Tools** - Integrated profiling with rocprof/rocprofv3, rocblas, MIOpen, RCCL tracing
- **🎯 ROCprofv3 Profiles** - 8 pre-configured profiles for compute/memory/communication bottleneck analysis
- **🔍 Environment Validation** - TheRock ROCm detection and validation tools
@@ -56,6 +61,14 @@ madengine run --tags dummy \
--additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'
```

If ROCm is not installed under `/opt/rocm` (e.g. a TheRock or pip install), pass `--rocm-path` or set `ROCM_PATH`:

```bash
madengine run --tags dummy --rocm-path /path/to/rocm \
--additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'
# or: export ROCM_PATH=/path/to/rocm && madengine run --tags dummy ...
```

**Results saved to `perf_entry.csv`**

## 📋 Commands
@@ -593,6 +606,8 @@ madengine run --tags model --keep-alive
madengine build --tags model --clean-docker-cache --verbose
```

**ROCm not in /opt/rocm:** If you use a custom ROCm location (e.g. [TheRock](https://github.com/ROCm/TheRock) or pip), set `ROCM_PATH` or pass `--rocm-path` to `madengine run` so GPU detection and container env use the correct paths.

**Common Issues:**
- **False failures with profiling**: If models show FAILURE but have performance metrics, see [Profiling Troubleshooting](docs/profiling.md#false-failure-detection-with-rocprof)
- **ROCProf log errors**: Messages like `E20251230` are informational logs, not errors (fixed in v2.0+)
6 changes: 6 additions & 0 deletions docs/cli-reference.md
@@ -188,6 +188,7 @@ madengine run [OPTIONS]
|--------|-------|------|---------|-------------|
| `--tags` | `-t` | TEXT | `[]` | Model tags to run (can specify multiple) |
| `--manifest-file` | `-m` | TEXT | `""` | Build manifest file path (for pre-built images) |
| `--rocm-path` | | TEXT | `None` | ROCm installation root. When unset, falls back to the `ROCM_PATH` env variable, then `/opt/rocm`. Use when ROCm is not in `/opt/rocm` (e.g. TheRock, pip). |
| `--registry` | `-r` | TEXT | `None` | Docker registry URL |
| `--timeout` | | INT | `-1` | Timeout in seconds (-1=default 7200s, 0=no timeout) |
| `--additional-context` | `-c` | TEXT | `"{}"` | Additional context as JSON string |
@@ -215,6 +216,10 @@ madengine run [OPTIONS]
madengine run --tags dummy \
--additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'

# Custom ROCm path (when ROCm is not in /opt/rocm, e.g. a TheRock or pip install)
madengine run --tags dummy --rocm-path /path/to/rocm \
--additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'

# Run with pre-built images (manifest-based)
madengine run --manifest-file build_manifest.json

@@ -571,6 +576,7 @@ madengine recognizes these environment variables:
| Variable | Description | Default |
|----------|-------------|---------|
| `MODEL_DIR` | Path to MAD package directory | Auto-detected |
| `ROCM_PATH` | ROCm installation root (used when `--rocm-path` not set) | `/opt/rocm` |
| `MAD_VERBOSE_CONFIG` | Enable verbose configuration logging | `false` |
| `MAD_DOCKERHUB_USER` | Docker Hub username | None |
| `MAD_DOCKERHUB_PASSWORD` | Docker Hub password/token | None |
9 changes: 9 additions & 0 deletions docs/configuration.md
@@ -45,6 +45,15 @@ madengine run --tags model --additional-context-file config.json
- `"UBUNTU"` - Ubuntu Linux
- `"CENTOS"` - CentOS Linux

### ROCm path (run only)

When ROCm is not installed under `/opt/rocm` (e.g. [TheRock](https://github.com/ROCm/TheRock) or pip), set the ROCm root so GPU detection and container environment use the correct paths. Use the **run** command option or environment variable (not JSON context):

- **CLI:** `madengine run --rocm-path /path/to/rocm ...`
- **Environment:** `export ROCM_PATH=/path/to/rocm`

Resolution order: `--rocm-path` → `ROCM_PATH` → `/opt/rocm`. This applies only to the run phase; build does not perform GPU detection.
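
The resolution order above can be sketched in a few lines of Python. This is a minimal illustration rather than madengine's implementation: `resolve_rocm_root` and `DEFAULT_ROCM_ROOT` are hypothetical names, and treating empty or whitespace-only values as unset is an assumption.

```python
import os
from typing import Mapping, Optional

DEFAULT_ROCM_ROOT = "/opt/rocm"  # documented fallback location


def resolve_rocm_root(cli_value: Optional[str] = None,
                      env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the ROCm root: CLI value, then ROCM_PATH, then /opt/rocm."""
    env = os.environ if env is None else env
    for candidate in (cli_value, env.get("ROCM_PATH")):
        # Assumption: None, empty, and whitespace-only values count as "not set".
        if candidate and candidate.strip():
            return candidate.strip()
    return DEFAULT_ROCM_ROOT


print(resolve_rocm_root("/cli/rocm", {"ROCM_PATH": "/env/rocm"}))  # /cli/rocm
print(resolve_rocm_root(None, {"ROCM_PATH": "/env/rocm"}))         # /env/rocm
print(resolve_rocm_root(None, {}))                                 # /opt/rocm
```

Passing the environment as a mapping keeps the sketch easy to test without touching the real process environment.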

## Build Configuration

### Batch Manifest
4 changes: 4 additions & 0 deletions docs/installation.md
@@ -83,6 +83,8 @@ madengine run --tags dummy \
--additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'
```

**Non-default ROCm location:** If ROCm is not under `/opt/rocm` (e.g. [TheRock](https://github.com/ROCm/TheRock) or pip install), set `ROCM_PATH` or use `madengine run --rocm-path /path/to/rocm` so GPU detection and container env use the correct paths.

### NVIDIA CUDA

```bash
@@ -138,6 +140,8 @@ rocm-smi
ls -la /dev/kfd /dev/dri
```

If ROCm is installed in a non-default path (e.g. TheRock or pip), set `export ROCM_PATH=/path/to/rocm` or use `madengine run --rocm-path /path/to/rocm`.

### MAD Package Not Found

Ensure you're running madengine commands from within a MAD package directory:
4 changes: 3 additions & 1 deletion docs/profiling.md
@@ -120,7 +120,9 @@ Collect comprehensive ROCm profiling data:
}
```

- **Output:** ROCm profiler data files
+ **Output:** ROCm profiler data files (e.g. `rpd_output/trace.rpd`).
+
+ **Note:** The rpd pre-script installs build dependencies in the container (e.g. `nlohmann-json3-dev` on Ubuntu) so the rocmProfileData tracer can compile; the first run may take longer while packages are installed.

### ROCprofv3 - Advanced GPU Profiling

17 changes: 17 additions & 0 deletions docs/usage.md
@@ -288,6 +288,22 @@ madengine run --tags model \
- `gpu_vendor`: "AMD", "NVIDIA"
- `guest_os`: "UBUNTU", "CENTOS"

### ROCm path (non-default installs)

When ROCm is not installed under `/opt/rocm` (e.g. [TheRock](https://github.com/ROCm/TheRock) or pip), set the ROCm root so GPU detection and container environment use the correct paths:

```bash
# Via environment variable
export ROCM_PATH=/path/to/rocm
madengine run --tags model --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'

# Via CLI (overrides ROCM_PATH)
madengine run --tags model --rocm-path /path/to/rocm \
--additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'
```

`--rocm-path` applies only to the **run** command (not build). See [CLI Reference - run](cli-reference.md#run---execute-models).

### Deploy to Kubernetes

```bash
@@ -577,6 +593,7 @@ madengine build --tags model --clean-docker-cache --verbose
| Variable | Description | Example |
|----------|-------------|---------|
| `MODEL_DIR` | MAD package directory | `/path/to/MAD` |
| `ROCM_PATH` | ROCm installation root (used when `--rocm-path` is not set). Use when ROCm is not in `/opt/rocm` (e.g. TheRock, pip). | `/path/to/rocm` |
| `MAD_VERBOSE_CONFIG` | Verbose config logging | `"true"` |
| `MAD_DOCKERHUB_USER` | Docker Hub username | `"myusername"` |
| `MAD_DOCKERHUB_PASSWORD` | Docker Hub password | `"mytoken"` |
Binary file added madengine.png
9 changes: 9 additions & 0 deletions src/madengine/cli/commands/run.py
@@ -140,6 +140,13 @@ def run(
help="Remove intermediate perf_entry files after run (keeps perf.csv and perf_super files)",
),
] = False,
rocm_path: Annotated[
Optional[str],
typer.Option(
"--rocm-path",
help="ROCm installation path (overrides ROCM_PATH env; default: /opt/rocm). Use when ROCm is not under /opt/rocm (e.g. TheRock tar/whl).",
),
] = None,
) -> None:
"""
🚀 Run model containers in distributed scenarios.
@@ -199,6 +206,7 @@ def run(
disable_skip_gpu_arch=disable_skip_gpu_arch,
verbose=verbose,
cleanup_perf=cleanup_perf,
rocm_path=rocm_path,
_separate_phases=True,
)

@@ -323,6 +331,7 @@ def run(
disable_skip_gpu_arch=disable_skip_gpu_arch,
verbose=verbose,
cleanup_perf=cleanup_perf,
rocm_path=rocm_path,
_separate_phases=False, # Full workflow uses .live.log (not .run.live.log)
)

17 changes: 17 additions & 0 deletions src/madengine/core/constants.py
@@ -228,3 +228,20 @@ def _get_public_github_rocm_key():


PUBLIC_GITHUB_ROCM_KEY = _get_public_github_rocm_key()


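def get_rocm_path(override=None):
    """Return ROCm installation root directory.

    Resolution order: override (e.g. from CLI) -> ROCM_PATH env -> default /opt/rocm.
    Empty or whitespace-only values are treated as unset.
    Path is normalized to absolute form with no trailing slash.

    Args:
        override: Optional path overriding env and default.

    Returns:
        str: Absolute ROCm root path.
    """
    # Skip unset, empty, and whitespace-only candidates so that ROCM_PATH=""
    # falls back to /opt/rocm instead of resolving to the current directory.
    candidates = (override, os.environ.get("ROCM_PATH"), "/opt/rocm")
    raw = next(str(c).strip() for c in candidates if c and str(c).strip())
    # abspath() already drops trailing separators and leaves "/" intact,
    # so no extra rstrip is needed (rstrip would turn "/" into "").
    return os.path.abspath(os.path.expanduser(raw))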
31 changes: 22 additions & 9 deletions src/madengine/core/context.py
@@ -21,6 +21,7 @@

# third-party modules
from madengine.core.console import Console
from madengine.core.constants import get_rocm_path
from madengine.utils.gpu_validator import validate_rocm_installation, GPUInstallationError, GPUVendor
from madengine.utils.gpu_tool_factory import get_gpu_tool_manager
from madengine.utils.gpu_tool_manager import BaseGPUToolManager
@@ -80,17 +81,20 @@ def __init__(
additional_context: str = None,
additional_context_file: str = None,
build_only_mode: bool = False,
rocm_path: str = None,
) -> None:
"""Constructor of the Context class.

Args:
additional_context: The additional context.
additional_context_file: The additional context file.
build_only_mode: Whether running in build-only mode (no GPU detection).
rocm_path: Optional ROCm installation path (overrides ROCM_PATH env; default /opt/rocm).

Raises:
RuntimeError: If GPU detection fails and not in build-only mode.
"""
self._rocm_path = get_rocm_path(rocm_path)
# Initialize the console
self.console = Console()
self._gpu_context_initialized = False
@@ -252,6 +256,9 @@ def init_gpu_context(self) -> None:
if "MAD_GPU_VENDOR" not in self.ctx["docker_env_vars"]:
self.ctx["docker_env_vars"]["MAD_GPU_VENDOR"] = self.ctx["gpu_vendor"]

self.ctx["rocm_path"] = self._rocm_path
self.ctx["docker_env_vars"]["ROCM_PATH"] = self._rocm_path

if "MAD_SYSTEM_NGPUS" not in self.ctx["docker_env_vars"]:
self.ctx["docker_env_vars"][
"MAD_SYSTEM_NGPUS"
@@ -337,7 +344,7 @@ def _get_tool_manager(self) -> BaseGPUToolManager:
else:
vendor = None # Auto-detect

- self._gpu_tool_manager = get_gpu_tool_manager(vendor)
+ self._gpu_tool_manager = get_gpu_tool_manager(vendor, rocm_path=self._rocm_path)

return self._gpu_tool_manager

@@ -382,8 +389,11 @@ def get_gpu_vendor(self) -> str:
print(f"Warning: nvidia-smi check failed: {e}")

# Check AMD - try amd-smi first, fallback to rocm-smi (PR #54)
- # Increased timeout to 180s for SLURM compute nodes where GPU initialization may be slow
- amd_smi_paths = ["/opt/rocm/bin/amd-smi", "/usr/local/bin/amd-smi"]
+ # Use configurable ROCm path (ROCM_PATH / --rocm-path) for non-default installs
+ amd_smi_paths = [
+ os.path.join(self._rocm_path, "bin", "amd-smi"),
+ "/usr/local/bin/amd-smi",
+ ]
for amd_smi_path in amd_smi_paths:
if os.path.exists(amd_smi_path):
try:
@@ -395,9 +405,10 @@ def get_gpu_vendor(self) -> str:
print(f"Warning: amd-smi check failed for {amd_smi_path}: {e}")

# Fallback to rocm-smi (PR #54)
- if os.path.exists("/opt/rocm/bin/rocm-smi"):
+ rocm_smi_path = os.path.join(self._rocm_path, "bin", "rocm-smi")
+ if os.path.exists(rocm_smi_path):
try:
- result = self.console.sh("/opt/rocm/bin/rocm-smi --showid > /dev/null 2>&1 && echo 'AMD' || echo ''", timeout=180)
+ result = self.console.sh(f"{rocm_smi_path} --showid > /dev/null 2>&1 && echo 'AMD' || echo ''", timeout=180)
if result and result.strip() == "AMD":
return "AMD"
except Exception as e:
@@ -510,14 +521,15 @@ def get_system_gpu_architecture(self) -> str:
"""
if self.ctx["docker_env_vars"]["MAD_GPU_VENDOR"] == "AMD":
try:
- arch = self.console.sh("/opt/rocm/bin/rocminfo |grep -o -m 1 'gfx.*'")
+ rocminfo_path = os.path.join(self._rocm_path, "bin", "rocminfo")
+ arch = self.console.sh(f"{rocminfo_path} |grep -o -m 1 'gfx.*'")
if not arch or arch.strip() == "":
raise RuntimeError("rocminfo returned empty architecture")
return arch
except Exception as e:
raise RuntimeError(
f"Unable to determine AMD GPU architecture. "
- f"Ensure ROCm is installed and rocminfo is accessible at /opt/rocm/bin/rocminfo. "
+ f"Ensure ROCm is installed and rocminfo is accessible (ROCM_PATH={self._rocm_path}). "
f"Error: {e}"
)
elif self.ctx["docker_env_vars"]["MAD_GPU_VENDOR"] == "NVIDIA":
@@ -666,9 +678,10 @@ def get_gpu_renderD_nodes(self) -> typing.Optional[typing.List[int]]:
raise RuntimeError("Tool manager returned None for ROCm version")
except Exception as e:
# Fallback to direct file read
- rocm_version_str = self.console.sh("cat /opt/rocm/.info/version | cut -d'-' -f1")
+ version_file = os.path.join(self._rocm_path, ".info", "version")
+ rocm_version_str = self.console.sh(f"cat {version_file} | cut -d'-' -f1")
if not rocm_version_str or rocm_version_str.strip() == "":
- raise RuntimeError("Failed to retrieve ROCm version from /opt/rocm/.info/version")
+ raise RuntimeError(f"Failed to retrieve ROCm version from {version_file}")

# Parse version safely
try:
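
Several of the commands in this file interpolate the ROCm root into a shell string with an f-string, which would break if the root contains spaces or shell metacharacters. Below is a minimal sketch of one way to quote it, assuming a POSIX shell and Python 3.8+ (for `shlex.join`). `rocm_tool_command` is a hypothetical helper and not part of this PR.

```python
import os
import shlex


def rocm_tool_command(rocm_root: str, tool: str, *args: str) -> str:
    """Build a shell-safe command line for a tool under <rocm_root>/bin."""
    tool_path = os.path.join(rocm_root, "bin", tool)
    # shlex.join quotes every token that needs it (spaces, quotes, etc.).
    return shlex.join([tool_path, *args])


print(rocm_tool_command("/opt/rocm", "rocm-smi", "--showid"))
# /opt/rocm/bin/rocm-smi --showid
print(rocm_tool_command("/home/user/my rocm", "rocminfo"))
# '/home/user/my rocm/bin/rocminfo'
```

Redirections and pipes such as `> /dev/null` or `| grep` would still be appended as plain shell text after the quoted command.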
13 changes: 7 additions & 6 deletions src/madengine/execution/container_runner.py
@@ -19,6 +19,7 @@
from madengine.core.console import Console
from madengine.core.context import Context
from madengine.core.docker import Docker
from madengine.core.constants import get_rocm_path
from madengine.core.timeout import Timeout
from madengine.core.dataprovider import Data
from madengine.utils.ops import PythonicTee, file_print
@@ -907,18 +908,18 @@ def run_container(
# Show GPU info with version-aware tool selection (PR #54)
if gpu_vendor.find("AMD") != -1:
print(f"🎮 Checking AMD GPU status...")
- # Use version-aware SMI tool selection
- # Note: Use amd-smi without arguments to show full status table (same as legacy madengine)
+ rocm_path = self.context.ctx.get("rocm_path") or get_rocm_path()
+ amd_smi_path = os.path.join(rocm_path, "bin", "amd-smi")
+ rocm_smi_path = os.path.join(rocm_path, "bin", "rocm-smi")
try:
tool_manager = self.context._get_tool_manager()
preferred_tool = tool_manager.get_preferred_smi_tool()
if preferred_tool == "amd-smi":
- model_docker.sh("/opt/rocm/bin/amd-smi || /opt/rocm/bin/rocm-smi || true")
+ model_docker.sh(f"{amd_smi_path} || {rocm_smi_path} || true")
else:
- model_docker.sh("/opt/rocm/bin/rocm-smi || /opt/rocm/bin/amd-smi || true")
+ model_docker.sh(f"{rocm_smi_path} || {amd_smi_path} || true")
except Exception:
- # Fallback: try both tools
- model_docker.sh("/opt/rocm/bin/amd-smi || /opt/rocm/bin/rocm-smi || true")
+ model_docker.sh(f"{amd_smi_path} || {rocm_smi_path} || true")
elif gpu_vendor.find("NVIDIA") != -1:
print(f"🎮 Checking NVIDIA GPU status...")
model_docker.sh("/usr/bin/nvidia-smi || true")
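
The three `model_docker.sh(...)` branches in this hunk differ only in which SMI tool is tried first. A hedged sketch of building that fallback chain in one place follows. `smi_status_command` is a hypothetical helper, and the sketch assumes the same ROCm root exists inside the container, as the hunk itself does by reusing the path from the run context.

```python
import os
import shlex


def smi_status_command(rocm_root: str, preferred_tool: str = "amd-smi") -> str:
    """Return '<preferred> || <other> || true' for the two SMI tools."""
    tools = ["amd-smi", "rocm-smi"]
    # Mirror the hunk: anything other than "amd-smi" tries rocm-smi first.
    if preferred_tool != "amd-smi":
        tools.reverse()
    paths = [shlex.quote(os.path.join(rocm_root, "bin", t)) for t in tools]
    # The trailing `true` keeps a failed status check from failing the run.
    return " || ".join(paths + ["true"])


print(smi_status_command("/opt/rocm"))
# /opt/rocm/bin/amd-smi || /opt/rocm/bin/rocm-smi || true
print(smi_status_command("/opt/rocm", "rocm-smi"))
# /opt/rocm/bin/rocm-smi || /opt/rocm/bin/amd-smi || true
```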
5 changes: 5 additions & 0 deletions src/madengine/orchestration/run_orchestrator.py
@@ -29,6 +29,7 @@
create_error_context,
handle_error,
)
from madengine.core.constants import get_rocm_path
from madengine.utils.session_tracker import SessionTracker


@@ -107,9 +108,11 @@ def _init_runtime_context(self):
else:
context_string = None

rocm_path = get_rocm_path(getattr(self.args, "rocm_path", None))
self.context = Context(
additional_context=context_string,
build_only_mode=False,
rocm_path=rocm_path,
)

# Initialize data provider if data config exists
@@ -383,9 +386,11 @@ def _create_manifest_from_local_image(
# Initialize build-only context for manifest generation
# (we need context structure, but skip GPU detection since we're not building)
context_string = repr(self.additional_context) if self.additional_context else None
rocm_path = get_rocm_path(getattr(self.args, "rocm_path", None))
build_context = Context(
additional_context=context_string,
build_only_mode=True,
rocm_path=rocm_path,
)

# Create manifest structure
8 changes: 6 additions & 2 deletions src/madengine/scripts/common/pre_scripts/trace.sh
@@ -24,9 +24,9 @@ case "$tool" in
rpd)
if [ "$os" == 'ubuntu' ]; then
sudo apt update
- sudo apt install -y sqlite3 libsqlite3-dev libfmt-dev python3-pip
+ sudo apt install -y sqlite3 libsqlite3-dev libfmt-dev python3-pip nlohmann-json3-dev
elif [ "$os" == 'centos' ]; then
- sudo yum install -y libsqlite3x-devel.x86_64 fmt-devel python3-pip
+ sudo yum install -y libsqlite3x-devel.x86_64 fmt-devel python3-pip json-devel
else
echo "Unable to detect Host OS in trace pre-script"
fi
@@ -43,6 +43,10 @@ rpd)

# Build RPD tracer locally without system install
cd ./rocmProfileData
# Workaround for upstream rocmProfileData Makefile typo: UStringTable.o -> StringTable.o
if [ -f rpd_tracer/Makefile ]; then
sed -i 's/UStringTable\.o/StringTable.o/g' rpd_tracer/Makefile
fi
make rpd
if [ $? -ne 0 ]; then
echo "Error: Failed to build RPD tracer"