diff --git a/README.md b/README.md
index 08d2c31c..25924fde 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,7 @@
+<p align="center">
+  <img src="madengine.png" alt="madengine logo">
+</p>
+
# madengine
[](https://python.org)
@@ -34,6 +38,7 @@ madengine is a modern CLI tool for running Large Language Models (LLMs) and Deep
- **🎯 Simple Deployment** - Run locally or deploy to Kubernetes/SLURM via configuration
- **🔧 Distributed Launchers** - Full support for torchrun, DeepSpeed, Megatron-LM, TorchTitan, vLLM, SGLang
- **🐳 Container-Native** - Docker-based execution with GPU support (ROCm, CUDA)
+- **📂 ROCm Path** - Support for non-default ROCm installs via `--rocm-path` or `ROCM_PATH` (e.g. TheRock, pip)
- **📊 Performance Tools** - Integrated profiling with rocprof/rocprofv3, rocblas, MIOpen, RCCL tracing
- **🎯 ROCprofv3 Profiles** - 8 pre-configured profiles for compute/memory/communication bottleneck analysis
- **🔍 Environment Validation** - TheRock ROCm detection and validation tools
@@ -56,6 +61,14 @@ madengine run --tags dummy \
--additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'
```
+If ROCm is not installed under `/opt/rocm` (e.g. a TheRock or pip install), use `--rocm-path` or set `ROCM_PATH`:
+
+```bash
+madengine run --tags dummy --rocm-path /path/to/rocm \
+ --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'
+# or: export ROCM_PATH=/path/to/rocm && madengine run --tags dummy ...
+```
+
**Results saved to `perf_entry.csv`**
## 📋 Commands
@@ -593,6 +606,8 @@ madengine run --tags model --keep-alive
madengine build --tags model --clean-docker-cache --verbose
```
+**ROCm not in /opt/rocm:** If you use a custom ROCm location (e.g. [TheRock](https://github.com/ROCm/TheRock) or pip), set `ROCM_PATH` or pass `--rocm-path` to `madengine run` so GPU detection and the container environment use the correct paths.
+
**Common Issues:**
- **False failures with profiling**: If models show FAILURE but have performance metrics, see [Profiling Troubleshooting](docs/profiling.md#false-failure-detection-with-rocprof)
- **ROCProf log errors**: Messages like `E20251230` are informational logs, not errors (fixed in v2.0+)
diff --git a/docs/cli-reference.md b/docs/cli-reference.md
index 5d58f1e6..0e638eec 100644
--- a/docs/cli-reference.md
+++ b/docs/cli-reference.md
@@ -188,6 +188,7 @@ madengine run [OPTIONS]
|--------|-------|------|---------|-------------|
| `--tags` | `-t` | TEXT | `[]` | Model tags to run (can specify multiple) |
| `--manifest-file` | `-m` | TEXT | `""` | Build manifest file path (for pre-built images) |
+| `--rocm-path` | | TEXT | `None` | ROCm installation root (default: `ROCM_PATH` env or `/opt/rocm`). Use when ROCm is not in `/opt/rocm` (e.g. TheRock, pip). |
| `--registry` | `-r` | TEXT | `None` | Docker registry URL |
| `--timeout` | | INT | `-1` | Timeout in seconds (-1=default 7200s, 0=no timeout) |
| `--additional-context` | `-c` | TEXT | `"{}"` | Additional context as JSON string |
@@ -215,6 +216,10 @@ madengine run [OPTIONS]
madengine run --tags dummy \
--additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'
+# Custom ROCm path (when ROCm is not in /opt/rocm, e.g. TheRock or pip install)
+madengine run --tags dummy --rocm-path /path/to/rocm \
+ --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'
+
# Run with pre-built images (manifest-based)
madengine run --manifest-file build_manifest.json
@@ -571,6 +576,7 @@ madengine recognizes these environment variables:
| Variable | Description | Default |
|----------|-------------|---------|
| `MODEL_DIR` | Path to MAD package directory | Auto-detected |
+| `ROCM_PATH` | ROCm installation root (used when `--rocm-path` not set) | `/opt/rocm` |
| `MAD_VERBOSE_CONFIG` | Enable verbose configuration logging | `false` |
| `MAD_DOCKERHUB_USER` | Docker Hub username | None |
| `MAD_DOCKERHUB_PASSWORD` | Docker Hub password/token | None |
diff --git a/docs/configuration.md b/docs/configuration.md
index 8af78bae..dde8e094 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -45,6 +45,15 @@ madengine run --tags model --additional-context-file config.json
- `"UBUNTU"` - Ubuntu Linux
- `"CENTOS"` - CentOS Linux
+### ROCm path (run only)
+
+When ROCm is not installed under `/opt/rocm` (e.g. [TheRock](https://github.com/ROCm/TheRock) or pip), set the ROCm root so GPU detection and container environment use the correct paths. Use the **run** command option or environment variable (not JSON context):
+
+- **CLI:** `madengine run --rocm-path /path/to/rocm ...`
+- **Environment:** `export ROCM_PATH=/path/to/rocm`
+
+Resolution order: `--rocm-path` → `ROCM_PATH` → `/opt/rocm`. This applies only to the run phase; build does not perform GPU detection.
+
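+To confirm which root madengine will use, you can call the packaged resolver directly (quick sanity check; honors `--rocm-path` overrides passed programmatically, the `ROCM_PATH` env var, and the `/opt/rocm` default):
+
+```bash
+python -c "from madengine.core.constants import get_rocm_path; print(get_rocm_path())"
+```
+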
## Build Configuration
### Batch Manifest
diff --git a/docs/installation.md b/docs/installation.md
index d3f79b85..7ff51f1e 100644
--- a/docs/installation.md
+++ b/docs/installation.md
@@ -83,6 +83,8 @@ madengine run --tags dummy \
--additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'
```
+**Non-default ROCm location:** If ROCm is not under `/opt/rocm` (e.g. [TheRock](https://github.com/ROCm/TheRock) or a pip install), set `ROCM_PATH` or use `madengine run --rocm-path /path/to/rocm` so GPU detection and the container environment use the correct paths.
+
### NVIDIA CUDA
```bash
@@ -138,6 +140,8 @@ rocm-smi
ls -la /dev/kfd /dev/dri
```
+If ROCm is installed in a non-default path (e.g. TheRock or pip), set `export ROCM_PATH=/path/to/rocm` or use `madengine run --rocm-path /path/to/rocm`.
+
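+To sanity-check the location, the resolved ROCm root should contain `bin/rocminfo`:
+
+```bash
+# Uses ROCM_PATH if set, otherwise the /opt/rocm default
+ls "${ROCM_PATH:-/opt/rocm}/bin/rocminfo"
+```
+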
### MAD Package Not Found
Ensure you're running madengine commands from within a MAD package directory:
diff --git a/docs/profiling.md b/docs/profiling.md
index 2c870b6b..f4323d69 100644
--- a/docs/profiling.md
+++ b/docs/profiling.md
@@ -120,7 +120,9 @@ Collect comprehensive ROCm profiling data:
}
```
-**Output:** ROCm profiler data files
+**Output:** ROCm profiler data files (e.g. `rpd_output/trace.rpd`).
+
+**Note:** The rpd pre-script installs build dependencies in the container (e.g. `nlohmann-json3-dev` on Ubuntu) so the rocmProfileData tracer can compile; the first run may take longer while packages are installed.
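+
+Because `.rpd` traces are SQLite databases, they can be inspected without extra tooling (file name below matches the example above; table names depend on the rocmProfileData version):
+
+```bash
+sqlite3 rpd_output/trace.rpd ".tables"
+```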
### ROCprofv3 - Advanced GPU Profiling
diff --git a/docs/usage.md b/docs/usage.md
index 89ebd415..c8073c13 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -288,6 +288,22 @@ madengine run --tags model \
- `gpu_vendor`: "AMD", "NVIDIA"
- `guest_os`: "UBUNTU", "CENTOS"
+### ROCm path (non-default installs)
+
+When ROCm is not installed under `/opt/rocm` (e.g. [TheRock](https://github.com/ROCm/TheRock) or pip), set the ROCm root so GPU detection and container environment use the correct paths:
+
+```bash
+# Via environment variable
+export ROCM_PATH=/path/to/rocm
+madengine run --tags model --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'
+
+# Via CLI (overrides ROCM_PATH)
+madengine run --tags model --rocm-path /path/to/rocm \
+ --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'
+```
+
+`--rocm-path` applies only to the **run** command (not build). See [CLI Reference - run](cli-reference.md#run---execute-models).
+
### Deploy to Kubernetes
```bash
@@ -577,6 +593,7 @@ madengine build --tags model --clean-docker-cache --verbose
| Variable | Description | Example |
|----------|-------------|---------|
| `MODEL_DIR` | MAD package directory | `/path/to/MAD` |
+| `ROCM_PATH` | ROCm installation root (used when `--rocm-path` not set). Use when ROCm is not in `/opt/rocm` (e.g. TheRock, pip). | `/path/to/rocm` |
| `MAD_VERBOSE_CONFIG` | Verbose config logging | `"true"` |
| `MAD_DOCKERHUB_USER` | Docker Hub username | `"myusername"` |
| `MAD_DOCKERHUB_PASSWORD` | Docker Hub password | `"mytoken"` |
diff --git a/madengine.png b/madengine.png
new file mode 100755
index 00000000..29e396f2
Binary files /dev/null and b/madengine.png differ
diff --git a/src/madengine/cli/commands/run.py b/src/madengine/cli/commands/run.py
index 90fc16f8..aa1866a7 100644
--- a/src/madengine/cli/commands/run.py
+++ b/src/madengine/cli/commands/run.py
@@ -140,6 +140,13 @@ def run(
help="Remove intermediate perf_entry files after run (keeps perf.csv and perf_super files)",
),
] = False,
+ rocm_path: Annotated[
+ Optional[str],
+ typer.Option(
+ "--rocm-path",
+ help="ROCm installation path (overrides ROCM_PATH env; default: /opt/rocm). Use when ROCm is not under /opt/rocm (e.g. Rock tar/whl).",
+ ),
+ ] = None,
) -> None:
"""
🚀 Run model containers in distributed scenarios.
@@ -199,6 +206,7 @@ def run(
disable_skip_gpu_arch=disable_skip_gpu_arch,
verbose=verbose,
cleanup_perf=cleanup_perf,
+ rocm_path=rocm_path,
_separate_phases=True,
)
@@ -323,6 +331,7 @@ def run(
disable_skip_gpu_arch=disable_skip_gpu_arch,
verbose=verbose,
cleanup_perf=cleanup_perf,
+ rocm_path=rocm_path,
_separate_phases=False, # Full workflow uses .live.log (not .run.live.log)
)
diff --git a/src/madengine/core/constants.py b/src/madengine/core/constants.py
index f86e51fe..c98980a8 100644
--- a/src/madengine/core/constants.py
+++ b/src/madengine/core/constants.py
@@ -228,3 +228,20 @@ def _get_public_github_rocm_key():
PUBLIC_GITHUB_ROCM_KEY = _get_public_github_rocm_key()
+
+
+def get_rocm_path(override=None):
+ """Return ROCm installation root directory.
+
+ Resolution order: override (e.g. from CLI) -> ROCM_PATH env -> default /opt/rocm.
+ Path is normalized to absolute form with no trailing slash.
+
+ Args:
+ override: Optional path overriding env and default.
+
+ Returns:
+ str: Absolute ROCm root path.
+ """
+    # "or" chain treats an empty override/ROCM_PATH as unset; abspath strips any trailing slash
+    raw = override or os.environ.get("ROCM_PATH") or "/opt/rocm"
+    return os.path.abspath(os.path.expanduser(str(raw).strip()))
diff --git a/src/madengine/core/context.py b/src/madengine/core/context.py
index ce463abb..e1d93b61 100644
--- a/src/madengine/core/context.py
+++ b/src/madengine/core/context.py
@@ -21,6 +21,7 @@
# third-party modules
from madengine.core.console import Console
+from madengine.core.constants import get_rocm_path
from madengine.utils.gpu_validator import validate_rocm_installation, GPUInstallationError, GPUVendor
from madengine.utils.gpu_tool_factory import get_gpu_tool_manager
from madengine.utils.gpu_tool_manager import BaseGPUToolManager
@@ -80,6 +81,7 @@ def __init__(
additional_context: str = None,
additional_context_file: str = None,
build_only_mode: bool = False,
+ rocm_path: str = None,
) -> None:
"""Constructor of the Context class.
@@ -87,10 +89,12 @@ def __init__(
additional_context: The additional context.
additional_context_file: The additional context file.
build_only_mode: Whether running in build-only mode (no GPU detection).
+ rocm_path: Optional ROCm installation path (overrides ROCM_PATH env; default /opt/rocm).
Raises:
RuntimeError: If GPU detection fails and not in build-only mode.
"""
+ self._rocm_path = get_rocm_path(rocm_path)
# Initialize the console
self.console = Console()
self._gpu_context_initialized = False
@@ -252,6 +256,9 @@ def init_gpu_context(self) -> None:
if "MAD_GPU_VENDOR" not in self.ctx["docker_env_vars"]:
self.ctx["docker_env_vars"]["MAD_GPU_VENDOR"] = self.ctx["gpu_vendor"]
+ self.ctx["rocm_path"] = self._rocm_path
+ self.ctx["docker_env_vars"]["ROCM_PATH"] = self._rocm_path
+
if "MAD_SYSTEM_NGPUS" not in self.ctx["docker_env_vars"]:
self.ctx["docker_env_vars"][
"MAD_SYSTEM_NGPUS"
@@ -337,7 +344,7 @@ def _get_tool_manager(self) -> BaseGPUToolManager:
else:
vendor = None # Auto-detect
- self._gpu_tool_manager = get_gpu_tool_manager(vendor)
+ self._gpu_tool_manager = get_gpu_tool_manager(vendor, rocm_path=self._rocm_path)
return self._gpu_tool_manager
@@ -382,8 +389,11 @@ def get_gpu_vendor(self) -> str:
print(f"Warning: nvidia-smi check failed: {e}")
# Check AMD - try amd-smi first, fallback to rocm-smi (PR #54)
- # Increased timeout to 180s for SLURM compute nodes where GPU initialization may be slow
- amd_smi_paths = ["/opt/rocm/bin/amd-smi", "/usr/local/bin/amd-smi"]
+ # Use configurable ROCm path (ROCM_PATH / --rocm-path) for non-default installs
+ amd_smi_paths = [
+ os.path.join(self._rocm_path, "bin", "amd-smi"),
+ "/usr/local/bin/amd-smi",
+ ]
for amd_smi_path in amd_smi_paths:
if os.path.exists(amd_smi_path):
try:
@@ -395,9 +405,10 @@ def get_gpu_vendor(self) -> str:
print(f"Warning: amd-smi check failed for {amd_smi_path}: {e}")
# Fallback to rocm-smi (PR #54)
- if os.path.exists("/opt/rocm/bin/rocm-smi"):
+ rocm_smi_path = os.path.join(self._rocm_path, "bin", "rocm-smi")
+ if os.path.exists(rocm_smi_path):
try:
- result = self.console.sh("/opt/rocm/bin/rocm-smi --showid > /dev/null 2>&1 && echo 'AMD' || echo ''", timeout=180)
+ result = self.console.sh(f"{rocm_smi_path} --showid > /dev/null 2>&1 && echo 'AMD' || echo ''", timeout=180)
if result and result.strip() == "AMD":
return "AMD"
except Exception as e:
@@ -510,14 +521,15 @@ def get_system_gpu_architecture(self) -> str:
"""
if self.ctx["docker_env_vars"]["MAD_GPU_VENDOR"] == "AMD":
try:
- arch = self.console.sh("/opt/rocm/bin/rocminfo |grep -o -m 1 'gfx.*'")
+ rocminfo_path = os.path.join(self._rocm_path, "bin", "rocminfo")
+ arch = self.console.sh(f"{rocminfo_path} |grep -o -m 1 'gfx.*'")
if not arch or arch.strip() == "":
raise RuntimeError("rocminfo returned empty architecture")
return arch
except Exception as e:
raise RuntimeError(
f"Unable to determine AMD GPU architecture. "
- f"Ensure ROCm is installed and rocminfo is accessible at /opt/rocm/bin/rocminfo. "
+ f"Ensure ROCm is installed and rocminfo is accessible (ROCM_PATH={self._rocm_path}). "
f"Error: {e}"
)
elif self.ctx["docker_env_vars"]["MAD_GPU_VENDOR"] == "NVIDIA":
@@ -666,9 +678,10 @@ def get_gpu_renderD_nodes(self) -> typing.Optional[typing.List[int]]:
raise RuntimeError("Tool manager returned None for ROCm version")
except Exception as e:
# Fallback to direct file read
- rocm_version_str = self.console.sh("cat /opt/rocm/.info/version | cut -d'-' -f1")
+ version_file = os.path.join(self._rocm_path, ".info", "version")
+ rocm_version_str = self.console.sh(f"cat {version_file} | cut -d'-' -f1")
if not rocm_version_str or rocm_version_str.strip() == "":
- raise RuntimeError("Failed to retrieve ROCm version from /opt/rocm/.info/version")
+ raise RuntimeError(f"Failed to retrieve ROCm version from {version_file}")
# Parse version safely
try:
diff --git a/src/madengine/execution/container_runner.py b/src/madengine/execution/container_runner.py
index fe414e13..c3299049 100644
--- a/src/madengine/execution/container_runner.py
+++ b/src/madengine/execution/container_runner.py
@@ -19,6 +19,7 @@
from madengine.core.console import Console
from madengine.core.context import Context
from madengine.core.docker import Docker
+from madengine.core.constants import get_rocm_path
from madengine.core.timeout import Timeout
from madengine.core.dataprovider import Data
from madengine.utils.ops import PythonicTee, file_print
@@ -907,18 +908,18 @@ def run_container(
# Show GPU info with version-aware tool selection (PR #54)
if gpu_vendor.find("AMD") != -1:
print(f"🎮 Checking AMD GPU status...")
- # Use version-aware SMI tool selection
- # Note: Use amd-smi without arguments to show full status table (same as legacy madengine)
+ rocm_path = self.context.ctx.get("rocm_path") or get_rocm_path()
+ amd_smi_path = os.path.join(rocm_path, "bin", "amd-smi")
+ rocm_smi_path = os.path.join(rocm_path, "bin", "rocm-smi")
try:
tool_manager = self.context._get_tool_manager()
preferred_tool = tool_manager.get_preferred_smi_tool()
if preferred_tool == "amd-smi":
- model_docker.sh("/opt/rocm/bin/amd-smi || /opt/rocm/bin/rocm-smi || true")
+ model_docker.sh(f"{amd_smi_path} || {rocm_smi_path} || true")
else:
- model_docker.sh("/opt/rocm/bin/rocm-smi || /opt/rocm/bin/amd-smi || true")
+ model_docker.sh(f"{rocm_smi_path} || {amd_smi_path} || true")
except Exception:
- # Fallback: try both tools
- model_docker.sh("/opt/rocm/bin/amd-smi || /opt/rocm/bin/rocm-smi || true")
+ model_docker.sh(f"{amd_smi_path} || {rocm_smi_path} || true")
elif gpu_vendor.find("NVIDIA") != -1:
print(f"🎮 Checking NVIDIA GPU status...")
model_docker.sh("/usr/bin/nvidia-smi || true")
diff --git a/src/madengine/orchestration/run_orchestrator.py b/src/madengine/orchestration/run_orchestrator.py
index b1c77a25..e1513bb3 100644
--- a/src/madengine/orchestration/run_orchestrator.py
+++ b/src/madengine/orchestration/run_orchestrator.py
@@ -29,6 +29,7 @@
create_error_context,
handle_error,
)
+from madengine.core.constants import get_rocm_path
from madengine.utils.session_tracker import SessionTracker
@@ -107,9 +108,11 @@ def _init_runtime_context(self):
else:
context_string = None
+ rocm_path = get_rocm_path(getattr(self.args, "rocm_path", None))
self.context = Context(
additional_context=context_string,
build_only_mode=False,
+ rocm_path=rocm_path,
)
# Initialize data provider if data config exists
@@ -383,9 +386,11 @@ def _create_manifest_from_local_image(
# Initialize build-only context for manifest generation
# (we need context structure, but skip GPU detection since we're not building)
context_string = repr(self.additional_context) if self.additional_context else None
+ rocm_path = get_rocm_path(getattr(self.args, "rocm_path", None))
build_context = Context(
additional_context=context_string,
build_only_mode=True,
+ rocm_path=rocm_path,
)
# Create manifest structure
diff --git a/src/madengine/scripts/common/pre_scripts/trace.sh b/src/madengine/scripts/common/pre_scripts/trace.sh
index 7fcffdd1..0d5cb80e 100644
--- a/src/madengine/scripts/common/pre_scripts/trace.sh
+++ b/src/madengine/scripts/common/pre_scripts/trace.sh
@@ -24,9 +24,9 @@ case "$tool" in
rpd)
if [ "$os" == 'ubuntu' ]; then
sudo apt update
- sudo apt install -y sqlite3 libsqlite3-dev libfmt-dev python3-pip
+ sudo apt install -y sqlite3 libsqlite3-dev libfmt-dev python3-pip nlohmann-json3-dev
elif [ "$os" == 'centos' ]; then
- sudo yum install -y libsqlite3x-devel.x86_64 fmt-devel python3-pip
+ sudo yum install -y libsqlite3x-devel.x86_64 fmt-devel python3-pip json-devel
else
echo "Unable to detect Host OS in trace pre-script"
fi
@@ -43,6 +43,10 @@ rpd)
# Build RPD tracer locally without system install
cd ./rocmProfileData
+ # Workaround for upstream rocmProfileData Makefile typo: UStringTable.o -> StringTable.o
+ if [ -f rpd_tracer/Makefile ]; then
+ sed -i 's/UStringTable\.o/StringTable.o/g' rpd_tracer/Makefile
+ fi
make rpd
if [ $? -ne 0 ]; then
echo "Error: Failed to build RPD tracer"
diff --git a/src/madengine/utils/gpu_tool_factory.py b/src/madengine/utils/gpu_tool_factory.py
index 4f8fa60c..b3a0b566 100644
--- a/src/madengine/utils/gpu_tool_factory.py
+++ b/src/madengine/utils/gpu_tool_factory.py
@@ -11,24 +11,29 @@
import logging
from typing import Dict, Optional
+from madengine.core.constants import get_rocm_path
from madengine.utils.gpu_tool_manager import BaseGPUToolManager
from madengine.utils.gpu_validator import GPUVendor, detect_gpu_vendor
logger = logging.getLogger(__name__)
-# Singleton instances per vendor
-_manager_instances: Dict[GPUVendor, BaseGPUToolManager] = {}
+# Singleton instances: key = (vendor, rocm_path) for AMD, (vendor, "") for NVIDIA
+_manager_instances: Dict[tuple, BaseGPUToolManager] = {}
-def get_gpu_tool_manager(vendor: Optional[GPUVendor] = None) -> BaseGPUToolManager:
+def get_gpu_tool_manager(
+ vendor: Optional[GPUVendor] = None,
+ rocm_path: Optional[str] = None,
+) -> BaseGPUToolManager:
"""Get GPU tool manager for the specified vendor.
-
+
This function implements the singleton pattern - only one manager instance
- is created per vendor type and reused across all calls.
-
+ is created per (vendor, rocm_path) and reused across all calls.
+
Args:
vendor: GPU vendor (AMD, NVIDIA, etc.). If None, auto-detects.
-
+ rocm_path: Optional ROCm root path for AMD (default: ROCM_PATH env or /opt/rocm).
+
Returns:
GPU tool manager instance for the specified vendor
@@ -49,23 +54,26 @@ def get_gpu_tool_manager(vendor: Optional[GPUVendor] = None) -> BaseGPUToolManag
"""
# Auto-detect vendor if not specified
if vendor is None:
- vendor = detect_gpu_vendor()
+ vendor = detect_gpu_vendor(rocm_path=rocm_path)
logger.debug(f"Auto-detected GPU vendor: {vendor.value}")
-
- # Check if we already have a singleton instance
- if vendor in _manager_instances:
+
+ # Cache key: (vendor, rocm_path) for AMD so different paths get different managers
+ resolved_rocm = get_rocm_path(rocm_path) if vendor == GPUVendor.AMD else ""
+ cache_key = (vendor, resolved_rocm)
+
+ if cache_key in _manager_instances:
logger.debug(f"Returning cached {vendor.value} tool manager")
- return _manager_instances[vendor]
-
+ return _manager_instances[cache_key]
+
# Create new manager instance based on vendor
if vendor == GPUVendor.AMD:
try:
from madengine.utils.rocm_tool_manager import ROCmToolManager
- manager = ROCmToolManager()
+ manager = ROCmToolManager(rocm_path=rocm_path)
logger.info(f"Created new ROCm tool manager")
except ImportError as e:
raise ImportError(f"Failed to import ROCm tool manager: {e}")
-
+
elif vendor == GPUVendor.NVIDIA:
try:
from madengine.utils.nvidia_tool_manager import NvidiaToolManager
@@ -85,8 +93,8 @@ def get_gpu_tool_manager(vendor: Optional[GPUVendor] = None) -> BaseGPUToolManag
raise ValueError(f"Unsupported GPU vendor: {vendor.value}")
# Cache the manager instance
- _manager_instances[vendor] = manager
-
+ _manager_instances[cache_key] = manager
+
return manager
@@ -108,13 +116,14 @@ def clear_manager_cache() -> None:
logger.debug("Cleared all GPU tool manager instances")
-def get_cached_managers() -> Dict[GPUVendor, BaseGPUToolManager]:
+def get_cached_managers() -> Dict[tuple, BaseGPUToolManager]:
"""Get dictionary of currently cached manager instances.
-
+
Primarily for debugging and testing purposes.
-
+ Keys are (GPUVendor, rocm_path) for AMD, (GPUVendor, "") for NVIDIA.
+
Returns:
- Dictionary mapping GPUVendor to manager instances
+ Dictionary mapping (vendor, rocm_path) to manager instances
"""
return _manager_instances.copy()
diff --git a/src/madengine/utils/gpu_validator.py b/src/madengine/utils/gpu_validator.py
index c542e8c3..7014268a 100644
--- a/src/madengine/utils/gpu_validator.py
+++ b/src/madengine/utils/gpu_validator.py
@@ -14,6 +14,8 @@
from dataclasses import dataclass
from enum import Enum
+from madengine.core.constants import get_rocm_path
+
class GPUVendor(Enum):
"""Supported GPU vendors"""
@@ -43,33 +45,31 @@ def __post_init__(self):
class ROCmValidator:
"""Validator for AMD ROCm installation with tool manager integration"""
-
- # Essential ROCm components to check
- ESSENTIAL_PATHS = {
- 'rocm_root': '/opt/rocm',
- 'hip_path': '/opt/rocm/bin/hipconfig',
- 'rocminfo': '/opt/rocm/bin/rocminfo',
- }
-
- # Optional but recommended components
- RECOMMENDED_PATHS = {
- 'amd_smi': '/opt/rocm/bin/amd-smi',
- 'rocm_smi': '/opt/rocm/bin/rocm-smi',
- }
-
- # KFD (Kernel Fusion Driver) paths
+
+ # KFD (Kernel Fusion Driver) paths - not under ROCm install
KFD_PATHS = {
'kfd_device': '/dev/kfd',
'kfd_topology': '/sys/devices/virtual/kfd/kfd/topology/nodes',
}
-
- def __init__(self, verbose: bool = False):
+
+ def __init__(self, verbose: bool = False, rocm_path: Optional[str] = None):
"""Initialize ROCm validator
-
+
Args:
verbose: If True, print detailed validation progress
+ rocm_path: Optional ROCm root path (default: ROCM_PATH env or /opt/rocm)
"""
self.verbose = verbose
+ self.rocm_path = get_rocm_path(rocm_path)
+ self.ESSENTIAL_PATHS = {
+ 'rocm_root': self.rocm_path,
+ 'hip_path': os.path.join(self.rocm_path, 'bin', 'hipconfig'),
+ 'rocminfo': os.path.join(self.rocm_path, 'bin', 'rocminfo'),
+ }
+ self.RECOMMENDED_PATHS = {
+ 'amd_smi': os.path.join(self.rocm_path, 'bin', 'amd-smi'),
+ 'rocm_smi': os.path.join(self.rocm_path, 'bin', 'rocm-smi'),
+ }
self._tool_manager = None # Lazy initialization
def _run_command(self, cmd: List[str], timeout: int = 10) -> Tuple[bool, str, str]:
@@ -110,7 +110,7 @@ def _get_tool_manager(self):
if self._tool_manager is None:
try:
from madengine.utils.rocm_tool_manager import ROCmToolManager
- self._tool_manager = ROCmToolManager()
+ self._tool_manager = ROCmToolManager(rocm_path=self.rocm_path)
except ImportError as e:
if self.verbose:
print(f"Warning: Could not import ROCmToolManager: {e}")
@@ -140,7 +140,7 @@ def _get_rocm_version(self) -> Optional[str]:
return stdout.split('-')[0] # Remove build suffix
# Try version file
- version_file = '/opt/rocm/.info/version'
+ version_file = os.path.join(self.rocm_path, '.info', 'version')
if os.path.exists(version_file):
try:
with open(version_file, 'r') as f:
@@ -348,9 +348,10 @@ def validate(self) -> GPUValidationResult:
# Generate suggestions based on issues
if result.issues:
- if not self._check_path_exists('/opt/rocm'):
+ if not self._check_path_exists(self.rocm_path):
result.suggestions.append(
- "ROCm does not appear to be installed. Install ROCm: "
+ f"ROCm does not appear to be installed at {self.rocm_path}. "
+ "Set ROCM_PATH if using a non-default install, or install ROCm: "
"https://rocm.docs.amd.com/en/latest/deploy/linux/quick_start.html"
)
@@ -595,39 +596,50 @@ def validate(self) -> GPUValidationResult:
return result
-def detect_gpu_vendor() -> GPUVendor:
+def detect_gpu_vendor(rocm_path: Optional[str] = None) -> GPUVendor:
"""Detect which GPU vendor is present on the system
-
+
+ Args:
+ rocm_path: Optional ROCm root path (default: ROCM_PATH env or /opt/rocm)
+
Returns:
GPUVendor enum value
"""
if os.path.exists("/usr/bin/nvidia-smi"):
return GPUVendor.NVIDIA
- elif os.path.exists("/opt/rocm/bin/rocm-smi") or os.path.exists("/opt/rocm/bin/amd-smi"):
+ rocm = get_rocm_path(rocm_path)
+ if os.path.exists(os.path.join(rocm, "bin", "rocm-smi")) or os.path.exists(os.path.join(rocm, "bin", "amd-smi")):
return GPUVendor.AMD
- else:
- return GPUVendor.UNKNOWN
+ if os.path.exists("/usr/local/bin/amd-smi"):
+ return GPUVendor.AMD
+ return GPUVendor.UNKNOWN
-def validate_gpu_installation(vendor: Optional[GPUVendor] = None, verbose: bool = False, raise_on_error: bool = True) -> GPUValidationResult:
+def validate_gpu_installation(
+ vendor: Optional[GPUVendor] = None,
+ verbose: bool = False,
+ raise_on_error: bool = True,
+ rocm_path: Optional[str] = None,
+) -> GPUValidationResult:
"""Validate GPU installation on the current node
-
+
Args:
vendor: GPU vendor to validate (auto-detected if None)
verbose: Print detailed validation progress
raise_on_error: Raise GPUInstallationError if validation fails
-
+ rocm_path: Optional ROCm root path for AMD (default: ROCM_PATH env or /opt/rocm)
+
Returns:
GPUValidationResult
-
+
Raises:
GPUInstallationError: If validation fails and raise_on_error is True
"""
if vendor is None:
- vendor = detect_gpu_vendor()
-
+ vendor = detect_gpu_vendor(rocm_path=rocm_path)
+
if vendor == GPUVendor.AMD:
- validator = ROCmValidator(verbose=verbose)
+ validator = ROCmValidator(verbose=verbose, rocm_path=rocm_path)
rocm_result = validator.validate()
# Convert ROCmValidationResult to GPUValidationResult
result = GPUValidationResult(
@@ -709,20 +721,27 @@ def _format_error_message(self, result: GPUValidationResult) -> str:
ROCmInstallationError = GPUInstallationError # For backwards compatibility
-def validate_rocm_installation(verbose: bool = False, raise_on_error: bool = True) -> GPUValidationResult:
+def validate_rocm_installation(
+ verbose: bool = False,
+ raise_on_error: bool = True,
+ rocm_path: Optional[str] = None,
+) -> GPUValidationResult:
"""Validate ROCm installation on the current node (backwards compatibility wrapper)
-
+
Args:
verbose: Print detailed validation progress
raise_on_error: Raise GPUInstallationError if validation fails
-
+ rocm_path: Optional ROCm root path (default: ROCM_PATH env or /opt/rocm)
+
Returns:
GPUValidationResult
-
+
Raises:
GPUInstallationError: If validation fails and raise_on_error is True
"""
- return validate_gpu_installation(vendor=GPUVendor.AMD, verbose=verbose, raise_on_error=raise_on_error)
+ return validate_gpu_installation(
+ vendor=GPUVendor.AMD, verbose=verbose, raise_on_error=raise_on_error, rocm_path=rocm_path
+ )
if __name__ == "__main__":
diff --git a/src/madengine/utils/rocm_tool_manager.py b/src/madengine/utils/rocm_tool_manager.py
index 0324d231..439f7da2 100644
--- a/src/madengine/utils/rocm_tool_manager.py
+++ b/src/madengine/utils/rocm_tool_manager.py
@@ -22,6 +22,7 @@
import re
from typing import Dict, List, Optional, Tuple
+from madengine.core.constants import get_rocm_path
from madengine.utils.gpu_tool_manager import BaseGPUToolManager
@@ -43,17 +44,20 @@ class ROCmToolManager(BaseGPUToolManager):
- ROCm < 6.4.1: Use rocm-smi
- If both tools fail: Raise error with debugging information
"""
-
- # Tool paths
- AMD_SMI_PATH = "/opt/rocm/bin/amd-smi"
- ROCM_SMI_PATH = "/opt/rocm/bin/rocm-smi"
- HIPCONFIG_PATH = "/opt/rocm/bin/hipconfig"
- ROCMINFO_PATH = "/opt/rocm/bin/rocminfo"
- ROCM_VERSION_FILE = "/opt/rocm/.info/version"
-
- def __init__(self):
- """Initialize ROCm tool manager."""
+
+ def __init__(self, rocm_path: Optional[str] = None):
+ """Initialize ROCm tool manager.
+
+ Args:
+ rocm_path: Optional ROCm root path (default: ROCM_PATH env or /opt/rocm).
+ """
super().__init__()
+ self.rocm_path = get_rocm_path(rocm_path)
+ self.AMD_SMI_PATH = os.path.join(self.rocm_path, "bin", "amd-smi")
+ self.ROCM_SMI_PATH = os.path.join(self.rocm_path, "bin", "rocm-smi")
+ self.HIPCONFIG_PATH = os.path.join(self.rocm_path, "bin", "hipconfig")
+ self.ROCMINFO_PATH = os.path.join(self.rocm_path, "bin", "rocminfo")
+ self.ROCM_VERSION_FILE = os.path.join(self.rocm_path, ".info", "version")
self._log_debug("Initialized ROCm tool manager")
def get_version(self) -> Optional[str]:
@@ -294,7 +298,7 @@ def get_gpu_count(self) -> int:
f"Unable to determine number of AMD GPUs.\n"
f"Error: {e}\n"
f"Suggestions:\n"
- f"- Verify ROCm installation: ls -la /opt/rocm/bin/\n"
+ f"- Verify ROCm installation: ls -la {self.rocm_path}/bin/\n"
f"- Check GPU accessibility: ls -la /dev/kfd /dev/dri\n"
f"- Ensure user is in 'video' and 'render' groups\n"
f"- See: https://github.com/ROCm/TheRock"
@@ -346,7 +350,7 @@ def get_gpu_product_name(self, gpu_id: int = 0) -> str:
f"Error: {e}\n"
f"Suggestions:\n"
f"- Verify GPU {gpu_id} exists: {self.ROCM_SMI_PATH} --showid\n"
- f"- Check ROCm version: cat /opt/rocm/.info/version\n"
+ f"- Check ROCm version: cat {self.ROCM_VERSION_FILE}\n"
f"- For ROCm >= 6.4.1, ensure amd-smi is installed"
)
diff --git a/tests/fixtures/utils.py b/tests/fixtures/utils.py
index eabbe13a..64b9d50b 100644
--- a/tests/fixtures/utils.py
+++ b/tests/fixtures/utils.py
@@ -45,8 +45,11 @@ def has_gpu() -> bool:
# Ultra-simple file existence check (no subprocess calls)
# This is safe for pytest collection and avoids hanging
nvidia_exists = os.path.exists("/usr/bin/nvidia-smi")
- amd_rocm_exists = os.path.exists("/opt/rocm/bin/rocm-smi") or os.path.exists(
- "/usr/local/bin/rocm-smi"
+ from madengine.core.constants import get_rocm_path
+ rocm_path = get_rocm_path()
+ amd_rocm_exists = (
+ os.path.exists(os.path.join(rocm_path, "bin", "rocm-smi"))
+ or os.path.exists("/usr/local/bin/rocm-smi")
)
_has_gpu_cache = nvidia_exists or amd_rocm_exists
diff --git a/tests/integration/test_gpu_management.py b/tests/integration/test_gpu_management.py
index 8bec767c..ef18a810 100644
--- a/tests/integration/test_gpu_management.py
+++ b/tests/integration/test_gpu_management.py
@@ -292,14 +292,17 @@ def test_get_cached_managers(self):
"""Test getting dictionary of cached managers."""
amd_manager = get_gpu_tool_manager(GPUVendor.AMD)
nvidia_manager = get_gpu_tool_manager(GPUVendor.NVIDIA)
-
+
cached = get_cached_managers()
-
+
assert len(cached) == 2
- assert GPUVendor.AMD in cached
- assert GPUVendor.NVIDIA in cached
- assert cached[GPUVendor.AMD] is amd_manager
- assert cached[GPUVendor.NVIDIA] is nvidia_manager
+ # Cache keys are (vendor, rocm_path): find by vendor
+ amd_keys = [k for k in cached if k[0] == GPUVendor.AMD]
+ nvidia_keys = [k for k in cached if k[0] == GPUVendor.NVIDIA]
+ assert len(amd_keys) == 1
+ assert len(nvidia_keys) == 1
+ assert cached[amd_keys[0]] is amd_manager
+ assert cached[nvidia_keys[0]] is nvidia_manager
diff --git a/tests/unit/test_rocm_path.py b/tests/unit/test_rocm_path.py
new file mode 100644
index 00000000..f33916eb
--- /dev/null
+++ b/tests/unit/test_rocm_path.py
@@ -0,0 +1,102 @@
+"""
+Unit tests for ROCm path (ROCM_PATH / --rocm-path) support.
+
+Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
+"""
+
+import os
+import pytest
+
+from madengine.core.constants import get_rocm_path
+
+
+@pytest.mark.unit
+class TestGetRocmPath:
+ """Test get_rocm_path() resolution."""
+
+ def test_get_rocm_path_default(self, monkeypatch):
+ """Without override or ROCM_PATH, returns default /opt/rocm (normalized)."""
+ monkeypatch.delenv("ROCM_PATH", raising=False)
+ path = get_rocm_path(None)
+ assert path == "/opt/rocm"
+
+ def test_get_rocm_path_override(self):
+ """Override argument takes precedence."""
+ path = get_rocm_path("/custom/rocm")
+ assert path == os.path.abspath("/custom/rocm").rstrip(os.sep)
+
+ def test_get_rocm_path_env(self, monkeypatch):
+ """ROCM_PATH env is used when override is None."""
+ monkeypatch.setenv("ROCM_PATH", "/env/rocm")
+ try:
+ path = get_rocm_path(None)
+ assert path == os.path.abspath("/env/rocm").rstrip(os.sep)
+ finally:
+ monkeypatch.delenv("ROCM_PATH", raising=False)
+
+ def test_get_rocm_path_override_overrides_env(self, monkeypatch):
+ """Override takes precedence over ROCM_PATH env."""
+ monkeypatch.setenv("ROCM_PATH", "/env/rocm")
+ try:
+ path = get_rocm_path("/cli/rocm")
+ assert path == os.path.abspath("/cli/rocm").rstrip(os.sep)
+ finally:
+ monkeypatch.delenv("ROCM_PATH", raising=False)
+
+
+@pytest.mark.unit
+class TestContextRocmPath:
+ """Test Context stores and uses rocm_path."""
+
+ def test_context_build_only_stores_rocm_path(self):
+ """Context with build_only_mode=True and rocm_path sets _rocm_path."""
+ from madengine.core.context import Context
+
+ ctx = Context(build_only_mode=True, rocm_path="/opt/rocm")
+ assert ctx._rocm_path == "/opt/rocm"
+
+ def test_context_runtime_includes_rocm_path_in_ctx(self):
+ """Context in runtime mode includes rocm_path and ROCM_PATH in docker_env_vars."""
+ from madengine.core.context import Context
+ from unittest.mock import patch
+
+ with patch.object(Context, "get_gpu_vendor", return_value="AMD"), \
+ patch.object(Context, "get_system_ngpus", return_value=2), \
+ patch.object(Context, "get_system_gpu_architecture", return_value="gfx90a"), \
+ patch.object(Context, "get_system_gpu_product_name", return_value="MI250"), \
+ patch.object(Context, "get_system_hip_version", return_value="5.4"), \
+ patch.object(Context, "get_docker_gpus", return_value="0-1"), \
+ patch.object(Context, "get_gpu_renderD_nodes", return_value=None):
+ ctx = Context(rocm_path="/my/rocm")
+ assert ctx.ctx.get("rocm_path") == "/my/rocm"
+ assert ctx.ctx["docker_env_vars"].get("ROCM_PATH") == "/my/rocm"
+
+
+@pytest.mark.unit
+class TestRocmToolManagerRocmPath:
+ """Test ROCmToolManager uses configurable rocm_path."""
+
+ def test_rocm_tool_manager_paths_under_rocm_path(self):
+ """ROCmToolManager(rocm_path=X) sets paths under X."""
+ from madengine.utils.rocm_tool_manager import ROCmToolManager
+
+ manager = ROCmToolManager(rocm_path="/custom/rocm")
+ assert manager.rocm_path == "/custom/rocm"
+ assert manager.AMD_SMI_PATH == "/custom/rocm/bin/amd-smi"
+ assert manager.ROCM_SMI_PATH == "/custom/rocm/bin/rocm-smi"
+ assert manager.ROCM_VERSION_FILE == "/custom/rocm/.info/version"
+
+
+@pytest.mark.unit
+class TestRunCommandRocmPath:
+ """Test run command exposes --rocm-path."""
+
+ def test_run_help_includes_rocm_path(self):
+ """madengine run --help mentions --rocm-path."""
+ from typer.testing import CliRunner
+ from madengine.cli import app
+
+ runner = CliRunner()
+ result = runner.invoke(app, ["run", "--help"])
+ assert result.exit_code == 0
+ assert "--rocm-path" in result.output