diff --git a/README.md b/README.md index 08d2c31c..25924fde 100644 --- a/README.md +++ b/README.md @@ -1,3 +1,7 @@ +

+ madengine Logo +

+ # madengine [![Python](https://img.shields.io/badge/python-3.8%2B-blue.svg)](https://python.org) @@ -34,6 +38,7 @@ madengine is a modern CLI tool for running Large Language Models (LLMs) and Deep - **🎯 Simple Deployment** - Run locally or deploy to Kubernetes/SLURM via configuration - **🔧 Distributed Launchers** - Full support for torchrun, DeepSpeed, Megatron-LM, TorchTitan, vLLM, SGLang - **🐳 Container-Native** - Docker-based execution with GPU support (ROCm, CUDA) +- **📂 ROCm Path** - Support for non-default ROCm installs via `--rocm-path` or `ROCM_PATH` (e.g. TheRock, pip) - **📊 Performance Tools** - Integrated profiling with rocprof/rocprofv3, rocblas, MIOpen, RCCL tracing - **🎯 ROCprofv3 Profiles** - 8 pre-configured profiles for compute/memory/communication bottleneck analysis - **🔍 Environment Validation** - TheRock ROCm detection and validation tools @@ -56,6 +61,14 @@ madengine run --tags dummy \ --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}' ``` +If ROCm is not installed under `/opt/rocm` (e.g. TheRock or pip install), use `--rocm-path` or set `ROCM_PATH`: + +```bash +madengine run --tags dummy --rocm-path /path/to/rocm \ + --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}' +# or: export ROCM_PATH=/path/to/rocm && madengine run --tags dummy ... +``` + **Results saved to `perf_entry.csv`** ## 📋 Commands @@ -593,6 +606,8 @@ madengine run --tags model --keep-alive madengine build --tags model --clean-docker-cache --verbose ``` +**ROCm not in /opt/rocm:** If you use a custom ROCm location (e.g. [TheRock](https://github.com/ROCm/TheRock) or pip), set `ROCM_PATH` or pass `--rocm-path` to `madengine run` so GPU detection and container env use the correct paths.
+ **Common Issues:** - **False failures with profiling**: If models show FAILURE but have performance metrics, see [Profiling Troubleshooting](docs/profiling.md#false-failure-detection-with-rocprof) - **ROCProf log errors**: Messages like `E20251230` are informational logs, not errors (fixed in v2.0+) diff --git a/docs/cli-reference.md b/docs/cli-reference.md index 5d58f1e6..0e638eec 100644 --- a/docs/cli-reference.md +++ b/docs/cli-reference.md @@ -188,6 +188,7 @@ madengine run [OPTIONS] |--------|-------|------|---------|-------------| | `--tags` | `-t` | TEXT | `[]` | Model tags to run (can specify multiple) | | `--manifest-file` | `-m` | TEXT | `""` | Build manifest file path (for pre-built images) | +| `--rocm-path` | | TEXT | `None` | ROCm installation root (default: `ROCM_PATH` env or `/opt/rocm`). Use when ROCm is not in `/opt/rocm` (e.g. TheRock, pip). | | `--registry` | `-r` | TEXT | `None` | Docker registry URL | | `--timeout` | | INT | `-1` | Timeout in seconds (-1=default 7200s, 0=no timeout) | | `--additional-context` | `-c` | TEXT | `"{}"` | Additional context as JSON string | @@ -215,6 +216,10 @@ madengine run [OPTIONS] madengine run --tags dummy \ --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}' +# Custom ROCm path (when ROCm is not in /opt/rocm, e.g. 
TheRock or pip install) +madengine run --tags dummy --rocm-path /path/to/rocm \ + --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}' + # Run with pre-built images (manifest-based) madengine run --manifest-file build_manifest.json @@ -571,6 +576,7 @@ madengine recognizes these environment variables: | Variable | Description | Default | |----------|-------------|---------| | `MODEL_DIR` | Path to MAD package directory | Auto-detected | +| `ROCM_PATH` | ROCm installation root (used when `--rocm-path` not set) | `/opt/rocm` | | `MAD_VERBOSE_CONFIG` | Enable verbose configuration logging | `false` | | `MAD_DOCKERHUB_USER` | Docker Hub username | None | | `MAD_DOCKERHUB_PASSWORD` | Docker Hub password/token | None | diff --git a/docs/configuration.md b/docs/configuration.md index 8af78bae..dde8e094 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -45,6 +45,15 @@ madengine run --tags model --additional-context-file config.json - `"UBUNTU"` - Ubuntu Linux - `"CENTOS"` - CentOS Linux +### ROCm path (run only) + +When ROCm is not installed under `/opt/rocm` (e.g. [TheRock](https://github.com/ROCm/TheRock) or pip), set the ROCm root so GPU detection and container environment use the correct paths. Use the **run** command option or environment variable (not JSON context): + +- **CLI:** `madengine run --rocm-path /path/to/rocm ...` +- **Environment:** `export ROCM_PATH=/path/to/rocm` + +Resolution order: `--rocm-path` → `ROCM_PATH` → `/opt/rocm`. This applies only to the run phase; build does not perform GPU detection. + ## Build Configuration ### Batch Manifest diff --git a/docs/installation.md b/docs/installation.md index d3f79b85..7ff51f1e 100644 --- a/docs/installation.md +++ b/docs/installation.md @@ -83,6 +83,8 @@ madengine run --tags dummy \ --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}' ``` +**Non-default ROCm location:** If ROCm is not under `/opt/rocm` (e.g. 
[TheRock](https://github.com/ROCm/TheRock) or pip install), set `ROCM_PATH` or use `madengine run --rocm-path /path/to/rocm` so GPU detection and container env use the correct paths. + ### NVIDIA CUDA ```bash @@ -138,6 +140,8 @@ rocm-smi ls -la /dev/kfd /dev/dri ``` +If ROCm is installed in a non-default path (e.g. TheRock or pip), set `export ROCM_PATH=/path/to/rocm` or use `madengine run --rocm-path /path/to/rocm`. + ### MAD Package Not Found Ensure you're running madengine commands from within a MAD package directory: diff --git a/docs/profiling.md b/docs/profiling.md index 2c870b6b..f4323d69 100644 --- a/docs/profiling.md +++ b/docs/profiling.md @@ -120,7 +120,9 @@ Collect comprehensive ROCm profiling data: } ``` -**Output:** ROCm profiler data files +**Output:** ROCm profiler data files (e.g. `rpd_output/trace.rpd`). + +**Note:** The rpd pre-script installs build dependencies in the container (e.g. `nlohmann-json3-dev` on Ubuntu) so the rocmProfileData tracer can compile; the first run may take longer while packages are installed. ### ROCprofv3 - Advanced GPU Profiling diff --git a/docs/usage.md b/docs/usage.md index 89ebd415..c8073c13 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -288,6 +288,22 @@ madengine run --tags model \ - `gpu_vendor`: "AMD", "NVIDIA" - `guest_os`: "UBUNTU", "CENTOS" +### ROCm path (non-default installs) + +When ROCm is not installed under `/opt/rocm` (e.g. [TheRock](https://github.com/ROCm/TheRock) or pip), set the ROCm root so GPU detection and container environment use the correct paths: + +```bash +# Via environment variable +export ROCM_PATH=/path/to/rocm +madengine run --tags model --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}' + +# Via CLI (overrides ROCM_PATH) +madengine run --tags model --rocm-path /path/to/rocm \ + --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}' +``` + +`--rocm-path` applies only to the **run** command (not build). 
See [CLI Reference - run](cli-reference.md#run---execute-models). + ### Deploy to Kubernetes ```bash @@ -577,6 +593,7 @@ madengine build --tags model --clean-docker-cache --verbose | Variable | Description | Example | |----------|-------------|---------| | `MODEL_DIR` | MAD package directory | `/path/to/MAD` | +| `ROCM_PATH` | ROCm installation root (used when `--rocm-path` not set). Use when ROCm is not in `/opt/rocm` (e.g. TheRock, pip). | `/path/to/rocm` | | `MAD_VERBOSE_CONFIG` | Verbose config logging | `"true"` | | `MAD_DOCKERHUB_USER` | Docker Hub username | `"myusername"` | | `MAD_DOCKERHUB_PASSWORD` | Docker Hub password | `"mytoken"` | diff --git a/madengine.png b/madengine.png new file mode 100755 index 00000000..29e396f2 Binary files /dev/null and b/madengine.png differ diff --git a/src/madengine/cli/commands/run.py b/src/madengine/cli/commands/run.py index 90fc16f8..aa1866a7 100644 --- a/src/madengine/cli/commands/run.py +++ b/src/madengine/cli/commands/run.py @@ -140,6 +140,13 @@ def run( help="Remove intermediate perf_entry files after run (keeps perf.csv and perf_super files)", ), ] = False, + rocm_path: Annotated[ + Optional[str], + typer.Option( + "--rocm-path", + help="ROCm installation path (overrides ROCM_PATH env; default: /opt/rocm). Use when ROCm is not under /opt/rocm (e.g. TheRock tar/whl).", + ), + ] = None, ) -> None: """ 🚀 Run model containers in distributed scenarios. 
@@ -199,6 +206,7 @@ def run( disable_skip_gpu_arch=disable_skip_gpu_arch, verbose=verbose, cleanup_perf=cleanup_perf, + rocm_path=rocm_path, _separate_phases=True, ) @@ -323,6 +331,7 @@ def run( disable_skip_gpu_arch=disable_skip_gpu_arch, verbose=verbose, cleanup_perf=cleanup_perf, + rocm_path=rocm_path, _separate_phases=False, # Full workflow uses .live.log (not .run.live.log) ) diff --git a/src/madengine/core/constants.py b/src/madengine/core/constants.py index f86e51fe..c98980a8 100644 --- a/src/madengine/core/constants.py +++ b/src/madengine/core/constants.py @@ -228,3 +228,20 @@ def _get_public_github_rocm_key(): PUBLIC_GITHUB_ROCM_KEY = _get_public_github_rocm_key() + + +def get_rocm_path(override=None): + """Return ROCm installation root directory. + + Resolution order: override (e.g. from CLI) -> ROCM_PATH env -> default /opt/rocm. + Path is normalized to absolute form with no trailing slash. + + Args: + override: Optional path overriding env and default. + + Returns: + str: Absolute ROCm root path. + """ + raw = override if override else os.environ.get("ROCM_PATH", "/opt/rocm") + path = os.path.abspath(os.path.expanduser(str(raw).strip())) + return path.rstrip(os.sep) diff --git a/src/madengine/core/context.py b/src/madengine/core/context.py index ce463abb..e1d93b61 100644 --- a/src/madengine/core/context.py +++ b/src/madengine/core/context.py @@ -21,6 +21,7 @@ # third-party modules from madengine.core.console import Console +from madengine.core.constants import get_rocm_path from madengine.utils.gpu_validator import validate_rocm_installation, GPUInstallationError, GPUVendor from madengine.utils.gpu_tool_factory import get_gpu_tool_manager from madengine.utils.gpu_tool_manager import BaseGPUToolManager @@ -80,6 +81,7 @@ def __init__( additional_context: str = None, additional_context_file: str = None, build_only_mode: bool = False, + rocm_path: str = None, ) -> None: """Constructor of the Context class. 
@@ -87,10 +89,12 @@ def __init__( additional_context: The additional context. additional_context_file: The additional context file. build_only_mode: Whether running in build-only mode (no GPU detection). + rocm_path: Optional ROCm installation path (overrides ROCM_PATH env; default /opt/rocm). Raises: RuntimeError: If GPU detection fails and not in build-only mode. """ + self._rocm_path = get_rocm_path(rocm_path) # Initialize the console self.console = Console() self._gpu_context_initialized = False @@ -252,6 +256,9 @@ def init_gpu_context(self) -> None: if "MAD_GPU_VENDOR" not in self.ctx["docker_env_vars"]: self.ctx["docker_env_vars"]["MAD_GPU_VENDOR"] = self.ctx["gpu_vendor"] + self.ctx["rocm_path"] = self._rocm_path + self.ctx["docker_env_vars"]["ROCM_PATH"] = self._rocm_path + if "MAD_SYSTEM_NGPUS" not in self.ctx["docker_env_vars"]: self.ctx["docker_env_vars"][ "MAD_SYSTEM_NGPUS" @@ -337,7 +344,7 @@ def _get_tool_manager(self) -> BaseGPUToolManager: else: vendor = None # Auto-detect - self._gpu_tool_manager = get_gpu_tool_manager(vendor) + self._gpu_tool_manager = get_gpu_tool_manager(vendor, rocm_path=self._rocm_path) return self._gpu_tool_manager @@ -382,8 +389,11 @@ def get_gpu_vendor(self) -> str: print(f"Warning: nvidia-smi check failed: {e}") # Check AMD - try amd-smi first, fallback to rocm-smi (PR #54) - # Increased timeout to 180s for SLURM compute nodes where GPU initialization may be slow - amd_smi_paths = ["/opt/rocm/bin/amd-smi", "/usr/local/bin/amd-smi"] + # Use configurable ROCm path (ROCM_PATH / --rocm-path) for non-default installs + amd_smi_paths = [ + os.path.join(self._rocm_path, "bin", "amd-smi"), + "/usr/local/bin/amd-smi", + ] for amd_smi_path in amd_smi_paths: if os.path.exists(amd_smi_path): try: @@ -395,9 +405,10 @@ def get_gpu_vendor(self) -> str: print(f"Warning: amd-smi check failed for {amd_smi_path}: {e}") # Fallback to rocm-smi (PR #54) - if os.path.exists("/opt/rocm/bin/rocm-smi"): + rocm_smi_path = 
os.path.join(self._rocm_path, "bin", "rocm-smi") + if os.path.exists(rocm_smi_path): try: - result = self.console.sh("/opt/rocm/bin/rocm-smi --showid > /dev/null 2>&1 && echo 'AMD' || echo ''", timeout=180) + result = self.console.sh(f"{rocm_smi_path} --showid > /dev/null 2>&1 && echo 'AMD' || echo ''", timeout=180) if result and result.strip() == "AMD": return "AMD" except Exception as e: @@ -510,14 +521,15 @@ def get_system_gpu_architecture(self) -> str: """ if self.ctx["docker_env_vars"]["MAD_GPU_VENDOR"] == "AMD": try: - arch = self.console.sh("/opt/rocm/bin/rocminfo |grep -o -m 1 'gfx.*'") + rocminfo_path = os.path.join(self._rocm_path, "bin", "rocminfo") + arch = self.console.sh(f"{rocminfo_path} |grep -o -m 1 'gfx.*'") if not arch or arch.strip() == "": raise RuntimeError("rocminfo returned empty architecture") return arch except Exception as e: raise RuntimeError( f"Unable to determine AMD GPU architecture. " - f"Ensure ROCm is installed and rocminfo is accessible at /opt/rocm/bin/rocminfo. " + f"Ensure ROCm is installed and rocminfo is accessible (ROCM_PATH={self._rocm_path}). 
" f"Error: {e}" ) elif self.ctx["docker_env_vars"]["MAD_GPU_VENDOR"] == "NVIDIA": @@ -666,9 +678,10 @@ def get_gpu_renderD_nodes(self) -> typing.Optional[typing.List[int]]: raise RuntimeError("Tool manager returned None for ROCm version") except Exception as e: # Fallback to direct file read - rocm_version_str = self.console.sh("cat /opt/rocm/.info/version | cut -d'-' -f1") + version_file = os.path.join(self._rocm_path, ".info", "version") + rocm_version_str = self.console.sh(f"cat {version_file} | cut -d'-' -f1") if not rocm_version_str or rocm_version_str.strip() == "": - raise RuntimeError("Failed to retrieve ROCm version from /opt/rocm/.info/version") + raise RuntimeError(f"Failed to retrieve ROCm version from {version_file}") # Parse version safely try: diff --git a/src/madengine/execution/container_runner.py b/src/madengine/execution/container_runner.py index fe414e13..c3299049 100644 --- a/src/madengine/execution/container_runner.py +++ b/src/madengine/execution/container_runner.py @@ -19,6 +19,7 @@ from madengine.core.console import Console from madengine.core.context import Context from madengine.core.docker import Docker +from madengine.core.constants import get_rocm_path from madengine.core.timeout import Timeout from madengine.core.dataprovider import Data from madengine.utils.ops import PythonicTee, file_print @@ -907,18 +908,18 @@ def run_container( # Show GPU info with version-aware tool selection (PR #54) if gpu_vendor.find("AMD") != -1: print(f"🎮 Checking AMD GPU status...") - # Use version-aware SMI tool selection - # Note: Use amd-smi without arguments to show full status table (same as legacy madengine) + rocm_path = self.context.ctx.get("rocm_path") or get_rocm_path() + amd_smi_path = os.path.join(rocm_path, "bin", "amd-smi") + rocm_smi_path = os.path.join(rocm_path, "bin", "rocm-smi") try: tool_manager = self.context._get_tool_manager() preferred_tool = tool_manager.get_preferred_smi_tool() if preferred_tool == "amd-smi": - 
model_docker.sh("/opt/rocm/bin/amd-smi || /opt/rocm/bin/rocm-smi || true") + model_docker.sh(f"{amd_smi_path} || {rocm_smi_path} || true") else: - model_docker.sh("/opt/rocm/bin/rocm-smi || /opt/rocm/bin/amd-smi || true") + model_docker.sh(f"{rocm_smi_path} || {amd_smi_path} || true") except Exception: - # Fallback: try both tools - model_docker.sh("/opt/rocm/bin/amd-smi || /opt/rocm/bin/rocm-smi || true") + model_docker.sh(f"{amd_smi_path} || {rocm_smi_path} || true") elif gpu_vendor.find("NVIDIA") != -1: print(f"🎮 Checking NVIDIA GPU status...") model_docker.sh("/usr/bin/nvidia-smi || true") diff --git a/src/madengine/orchestration/run_orchestrator.py b/src/madengine/orchestration/run_orchestrator.py index b1c77a25..e1513bb3 100644 --- a/src/madengine/orchestration/run_orchestrator.py +++ b/src/madengine/orchestration/run_orchestrator.py @@ -29,6 +29,7 @@ create_error_context, handle_error, ) +from madengine.core.constants import get_rocm_path from madengine.utils.session_tracker import SessionTracker @@ -107,9 +108,11 @@ def _init_runtime_context(self): else: context_string = None + rocm_path = get_rocm_path(getattr(self.args, "rocm_path", None)) self.context = Context( additional_context=context_string, build_only_mode=False, + rocm_path=rocm_path, ) # Initialize data provider if data config exists @@ -383,9 +386,11 @@ def _create_manifest_from_local_image( # Initialize build-only context for manifest generation # (we need context structure, but skip GPU detection since we're not building) context_string = repr(self.additional_context) if self.additional_context else None + rocm_path = get_rocm_path(getattr(self.args, "rocm_path", None)) build_context = Context( additional_context=context_string, build_only_mode=True, + rocm_path=rocm_path, ) # Create manifest structure diff --git a/src/madengine/scripts/common/pre_scripts/trace.sh b/src/madengine/scripts/common/pre_scripts/trace.sh index 7fcffdd1..0d5cb80e 100644 --- 
a/src/madengine/scripts/common/pre_scripts/trace.sh +++ b/src/madengine/scripts/common/pre_scripts/trace.sh @@ -24,9 +24,9 @@ case "$tool" in rpd) if [ "$os" == 'ubuntu' ]; then sudo apt update - sudo apt install -y sqlite3 libsqlite3-dev libfmt-dev python3-pip + sudo apt install -y sqlite3 libsqlite3-dev libfmt-dev python3-pip nlohmann-json3-dev elif [ "$os" == 'centos' ]; then - sudo yum install -y libsqlite3x-devel.x86_64 fmt-devel python3-pip + sudo yum install -y libsqlite3x-devel.x86_64 fmt-devel python3-pip json-devel else echo "Unable to detect Host OS in trace pre-script" fi @@ -43,6 +43,10 @@ rpd) # Build RPD tracer locally without system install cd ./rocmProfileData + # Workaround for upstream rocmProfileData Makefile typo: UStringTable.o -> StringTable.o + if [ -f rpd_tracer/Makefile ]; then + sed -i 's/UStringTable\.o/StringTable.o/g' rpd_tracer/Makefile + fi make rpd if [ $? -ne 0 ]; then echo "Error: Failed to build RPD tracer" diff --git a/src/madengine/utils/gpu_tool_factory.py b/src/madengine/utils/gpu_tool_factory.py index 4f8fa60c..b3a0b566 100644 --- a/src/madengine/utils/gpu_tool_factory.py +++ b/src/madengine/utils/gpu_tool_factory.py @@ -11,24 +11,29 @@ import logging from typing import Dict, Optional +from madengine.core.constants import get_rocm_path from madengine.utils.gpu_tool_manager import BaseGPUToolManager from madengine.utils.gpu_validator import GPUVendor, detect_gpu_vendor logger = logging.getLogger(__name__) -# Singleton instances per vendor -_manager_instances: Dict[GPUVendor, BaseGPUToolManager] = {} +# Singleton instances: key = (vendor, rocm_path) for AMD, (vendor, "") for NVIDIA +_manager_instances: Dict[tuple, BaseGPUToolManager] = {} -def get_gpu_tool_manager(vendor: Optional[GPUVendor] = None) -> BaseGPUToolManager: +def get_gpu_tool_manager( + vendor: Optional[GPUVendor] = None, + rocm_path: Optional[str] = None, +) -> BaseGPUToolManager: """Get GPU tool manager for the specified vendor. 
- + This function implements the singleton pattern - only one manager instance - is created per vendor type and reused across all calls. - + is created per (vendor, rocm_path) and reused across all calls. + Args: vendor: GPU vendor (AMD, NVIDIA, etc.). If None, auto-detects. - + rocm_path: Optional ROCm root path for AMD (default: ROCM_PATH env or /opt/rocm). + Returns: GPU tool manager instance for the specified vendor @@ -49,23 +54,26 @@ def get_gpu_tool_manager(vendor: Optional[GPUVendor] = None) -> BaseGPUToolManag """ # Auto-detect vendor if not specified if vendor is None: - vendor = detect_gpu_vendor() + vendor = detect_gpu_vendor(rocm_path=rocm_path) logger.debug(f"Auto-detected GPU vendor: {vendor.value}") - - # Check if we already have a singleton instance - if vendor in _manager_instances: + + # Cache key: (vendor, rocm_path) for AMD so different paths get different managers + resolved_rocm = get_rocm_path(rocm_path) if vendor == GPUVendor.AMD else "" + cache_key = (vendor, resolved_rocm) + + if cache_key in _manager_instances: logger.debug(f"Returning cached {vendor.value} tool manager") - return _manager_instances[vendor] - + return _manager_instances[cache_key] + # Create new manager instance based on vendor if vendor == GPUVendor.AMD: try: from madengine.utils.rocm_tool_manager import ROCmToolManager - manager = ROCmToolManager() + manager = ROCmToolManager(rocm_path=rocm_path) logger.info(f"Created new ROCm tool manager") except ImportError as e: raise ImportError(f"Failed to import ROCm tool manager: {e}") - + elif vendor == GPUVendor.NVIDIA: try: from madengine.utils.nvidia_tool_manager import NvidiaToolManager @@ -85,8 +93,8 @@ def get_gpu_tool_manager(vendor: Optional[GPUVendor] = None) -> BaseGPUToolManag raise ValueError(f"Unsupported GPU vendor: {vendor.value}") # Cache the manager instance - _manager_instances[vendor] = manager - + _manager_instances[cache_key] = manager + return manager @@ -108,13 +116,14 @@ def clear_manager_cache() -> 
None: logger.debug("Cleared all GPU tool manager instances") -def get_cached_managers() -> Dict[GPUVendor, BaseGPUToolManager]: +def get_cached_managers() -> Dict[tuple, BaseGPUToolManager]: """Get dictionary of currently cached manager instances. - + Primarily for debugging and testing purposes. - + Keys are (GPUVendor, rocm_path) for AMD, (GPUVendor, "") for NVIDIA. + Returns: - Dictionary mapping GPUVendor to manager instances + Dictionary mapping (vendor, rocm_path) to manager instances """ return _manager_instances.copy() diff --git a/src/madengine/utils/gpu_validator.py b/src/madengine/utils/gpu_validator.py index c542e8c3..7014268a 100644 --- a/src/madengine/utils/gpu_validator.py +++ b/src/madengine/utils/gpu_validator.py @@ -14,6 +14,8 @@ from dataclasses import dataclass from enum import Enum +from madengine.core.constants import get_rocm_path + class GPUVendor(Enum): """Supported GPU vendors""" @@ -43,33 +45,31 @@ def __post_init__(self): class ROCmValidator: """Validator for AMD ROCm installation with tool manager integration""" - - # Essential ROCm components to check - ESSENTIAL_PATHS = { - 'rocm_root': '/opt/rocm', - 'hip_path': '/opt/rocm/bin/hipconfig', - 'rocminfo': '/opt/rocm/bin/rocminfo', - } - - # Optional but recommended components - RECOMMENDED_PATHS = { - 'amd_smi': '/opt/rocm/bin/amd-smi', - 'rocm_smi': '/opt/rocm/bin/rocm-smi', - } - - # KFD (Kernel Fusion Driver) paths + + # KFD (Kernel Fusion Driver) paths - not under ROCm install KFD_PATHS = { 'kfd_device': '/dev/kfd', 'kfd_topology': '/sys/devices/virtual/kfd/kfd/topology/nodes', } - - def __init__(self, verbose: bool = False): + + def __init__(self, verbose: bool = False, rocm_path: Optional[str] = None): """Initialize ROCm validator - + Args: verbose: If True, print detailed validation progress + rocm_path: Optional ROCm root path (default: ROCM_PATH env or /opt/rocm) """ self.verbose = verbose + self.rocm_path = get_rocm_path(rocm_path) + self.ESSENTIAL_PATHS = { + 'rocm_root': 
self.rocm_path, + 'hip_path': os.path.join(self.rocm_path, 'bin', 'hipconfig'), + 'rocminfo': os.path.join(self.rocm_path, 'bin', 'rocminfo'), + } + self.RECOMMENDED_PATHS = { + 'amd_smi': os.path.join(self.rocm_path, 'bin', 'amd-smi'), + 'rocm_smi': os.path.join(self.rocm_path, 'bin', 'rocm-smi'), + } self._tool_manager = None # Lazy initialization def _run_command(self, cmd: List[str], timeout: int = 10) -> Tuple[bool, str, str]: @@ -110,7 +110,7 @@ def _get_tool_manager(self): if self._tool_manager is None: try: from madengine.utils.rocm_tool_manager import ROCmToolManager - self._tool_manager = ROCmToolManager() + self._tool_manager = ROCmToolManager(rocm_path=self.rocm_path) except ImportError as e: if self.verbose: print(f"Warning: Could not import ROCmToolManager: {e}") @@ -140,7 +140,7 @@ def _get_rocm_version(self) -> Optional[str]: return stdout.split('-')[0] # Remove build suffix # Try version file - version_file = '/opt/rocm/.info/version' + version_file = os.path.join(self.rocm_path, '.info', 'version') if os.path.exists(version_file): try: with open(version_file, 'r') as f: @@ -348,9 +348,10 @@ def validate(self) -> GPUValidationResult: # Generate suggestions based on issues if result.issues: - if not self._check_path_exists('/opt/rocm'): + if not self._check_path_exists(self.rocm_path): result.suggestions.append( - "ROCm does not appear to be installed. Install ROCm: " + f"ROCm does not appear to be installed at {self.rocm_path}. 
" + "Set ROCM_PATH if using a non-default install, or install ROCm: " "https://rocm.docs.amd.com/en/latest/deploy/linux/quick_start.html" ) @@ -595,39 +596,50 @@ def validate(self) -> GPUValidationResult: return result -def detect_gpu_vendor() -> GPUVendor: +def detect_gpu_vendor(rocm_path: Optional[str] = None) -> GPUVendor: """Detect which GPU vendor is present on the system - + + Args: + rocm_path: Optional ROCm root path (default: ROCM_PATH env or /opt/rocm) + Returns: GPUVendor enum value """ if os.path.exists("/usr/bin/nvidia-smi"): return GPUVendor.NVIDIA - elif os.path.exists("/opt/rocm/bin/rocm-smi") or os.path.exists("/opt/rocm/bin/amd-smi"): + rocm = get_rocm_path(rocm_path) + if os.path.exists(os.path.join(rocm, "bin", "rocm-smi")) or os.path.exists(os.path.join(rocm, "bin", "amd-smi")): return GPUVendor.AMD - else: - return GPUVendor.UNKNOWN + if os.path.exists("/usr/local/bin/amd-smi"): + return GPUVendor.AMD + return GPUVendor.UNKNOWN -def validate_gpu_installation(vendor: Optional[GPUVendor] = None, verbose: bool = False, raise_on_error: bool = True) -> GPUValidationResult: +def validate_gpu_installation( + vendor: Optional[GPUVendor] = None, + verbose: bool = False, + raise_on_error: bool = True, + rocm_path: Optional[str] = None, +) -> GPUValidationResult: """Validate GPU installation on the current node - + Args: vendor: GPU vendor to validate (auto-detected if None) verbose: Print detailed validation progress raise_on_error: Raise GPUInstallationError if validation fails - + rocm_path: Optional ROCm root path for AMD (default: ROCM_PATH env or /opt/rocm) + Returns: GPUValidationResult - + Raises: GPUInstallationError: If validation fails and raise_on_error is True """ if vendor is None: - vendor = detect_gpu_vendor() - + vendor = detect_gpu_vendor(rocm_path=rocm_path) + if vendor == GPUVendor.AMD: - validator = ROCmValidator(verbose=verbose) + validator = ROCmValidator(verbose=verbose, rocm_path=rocm_path) rocm_result = validator.validate() # 
Convert ROCmValidationResult to GPUValidationResult result = GPUValidationResult( @@ -709,20 +721,27 @@ def _format_error_message(self, result: GPUValidationResult) -> str: ROCmInstallationError = GPUInstallationError # For backwards compatibility -def validate_rocm_installation(verbose: bool = False, raise_on_error: bool = True) -> GPUValidationResult: +def validate_rocm_installation( + verbose: bool = False, + raise_on_error: bool = True, + rocm_path: Optional[str] = None, +) -> GPUValidationResult: """Validate ROCm installation on the current node (backwards compatibility wrapper) - + Args: verbose: Print detailed validation progress raise_on_error: Raise GPUInstallationError if validation fails - + rocm_path: Optional ROCm root path (default: ROCM_PATH env or /opt/rocm) + Returns: GPUValidationResult - + Raises: GPUInstallationError: If validation fails and raise_on_error is True """ - return validate_gpu_installation(vendor=GPUVendor.AMD, verbose=verbose, raise_on_error=raise_on_error) + return validate_gpu_installation( + vendor=GPUVendor.AMD, verbose=verbose, raise_on_error=raise_on_error, rocm_path=rocm_path + ) if __name__ == "__main__": diff --git a/src/madengine/utils/rocm_tool_manager.py b/src/madengine/utils/rocm_tool_manager.py index 0324d231..439f7da2 100644 --- a/src/madengine/utils/rocm_tool_manager.py +++ b/src/madengine/utils/rocm_tool_manager.py @@ -22,6 +22,7 @@ import re from typing import Dict, List, Optional, Tuple +from madengine.core.constants import get_rocm_path from madengine.utils.gpu_tool_manager import BaseGPUToolManager @@ -43,17 +44,20 @@ class ROCmToolManager(BaseGPUToolManager): - ROCm < 6.4.1: Use rocm-smi - If both tools fail: Raise error with debugging information """ - - # Tool paths - AMD_SMI_PATH = "/opt/rocm/bin/amd-smi" - ROCM_SMI_PATH = "/opt/rocm/bin/rocm-smi" - HIPCONFIG_PATH = "/opt/rocm/bin/hipconfig" - ROCMINFO_PATH = "/opt/rocm/bin/rocminfo" - ROCM_VERSION_FILE = "/opt/rocm/.info/version" - - def __init__(self): - 
"""Initialize ROCm tool manager.""" + + def __init__(self, rocm_path: Optional[str] = None): + """Initialize ROCm tool manager. + + Args: + rocm_path: Optional ROCm root path (default: ROCM_PATH env or /opt/rocm). + """ super().__init__() + self.rocm_path = get_rocm_path(rocm_path) + self.AMD_SMI_PATH = os.path.join(self.rocm_path, "bin", "amd-smi") + self.ROCM_SMI_PATH = os.path.join(self.rocm_path, "bin", "rocm-smi") + self.HIPCONFIG_PATH = os.path.join(self.rocm_path, "bin", "hipconfig") + self.ROCMINFO_PATH = os.path.join(self.rocm_path, "bin", "rocminfo") + self.ROCM_VERSION_FILE = os.path.join(self.rocm_path, ".info", "version") self._log_debug("Initialized ROCm tool manager") def get_version(self) -> Optional[str]: @@ -294,7 +298,7 @@ def get_gpu_count(self) -> int: f"Unable to determine number of AMD GPUs.\n" f"Error: {e}\n" f"Suggestions:\n" - f"- Verify ROCm installation: ls -la /opt/rocm/bin/\n" + f"- Verify ROCm installation: ls -la {self.rocm_path}/bin/\n" f"- Check GPU accessibility: ls -la /dev/kfd /dev/dri\n" f"- Ensure user is in 'video' and 'render' groups\n" f"- See: https://github.com/ROCm/TheRock" @@ -346,7 +350,7 @@ def get_gpu_product_name(self, gpu_id: int = 0) -> str: f"Error: {e}\n" f"Suggestions:\n" f"- Verify GPU {gpu_id} exists: {self.ROCM_SMI_PATH} --showid\n" - f"- Check ROCm version: cat /opt/rocm/.info/version\n" + f"- Check ROCm version: cat {self.ROCM_VERSION_FILE}\n" f"- For ROCm >= 6.4.1, ensure amd-smi is installed" ) diff --git a/tests/fixtures/utils.py b/tests/fixtures/utils.py index eabbe13a..64b9d50b 100644 --- a/tests/fixtures/utils.py +++ b/tests/fixtures/utils.py @@ -45,8 +45,11 @@ def has_gpu() -> bool: # Ultra-simple file existence check (no subprocess calls) # This is safe for pytest collection and avoids hanging nvidia_exists = os.path.exists("/usr/bin/nvidia-smi") - amd_rocm_exists = os.path.exists("/opt/rocm/bin/rocm-smi") or os.path.exists( - "/usr/local/bin/rocm-smi" + from madengine.core.constants import 
get_rocm_path + rocm_path = get_rocm_path() + amd_rocm_exists = ( + os.path.exists(os.path.join(rocm_path, "bin", "rocm-smi")) + or os.path.exists("/usr/local/bin/rocm-smi") ) _has_gpu_cache = nvidia_exists or amd_rocm_exists diff --git a/tests/integration/test_gpu_management.py b/tests/integration/test_gpu_management.py index 8bec767c..ef18a810 100644 --- a/tests/integration/test_gpu_management.py +++ b/tests/integration/test_gpu_management.py @@ -292,14 +292,17 @@ def test_get_cached_managers(self): """Test getting dictionary of cached managers.""" amd_manager = get_gpu_tool_manager(GPUVendor.AMD) nvidia_manager = get_gpu_tool_manager(GPUVendor.NVIDIA) - + cached = get_cached_managers() - + assert len(cached) == 2 - assert GPUVendor.AMD in cached - assert GPUVendor.NVIDIA in cached - assert cached[GPUVendor.AMD] is amd_manager - assert cached[GPUVendor.NVIDIA] is nvidia_manager + # Cache keys are (vendor, rocm_path): find by vendor + amd_keys = [k for k in cached if k[0] == GPUVendor.AMD] + nvidia_keys = [k for k in cached if k[0] == GPUVendor.NVIDIA] + assert len(amd_keys) == 1 + assert len(nvidia_keys) == 1 + assert cached[amd_keys[0]] is amd_manager + assert cached[nvidia_keys[0]] is nvidia_manager diff --git a/tests/unit/test_rocm_path.py b/tests/unit/test_rocm_path.py new file mode 100644 index 00000000..f33916eb --- /dev/null +++ b/tests/unit/test_rocm_path.py @@ -0,0 +1,102 @@ +""" +Unit tests for ROCm path (ROCM_PATH / --rocm-path) support. + +Copyright (c) Advanced Micro Devices, Inc. All rights reserved. 
+""" + +import os +import pytest + +from madengine.core.constants import get_rocm_path + + +@pytest.mark.unit +class TestGetRocmPath: + """Test get_rocm_path() resolution.""" + + def test_get_rocm_path_default(self, monkeypatch): + """Without override or ROCM_PATH, returns default /opt/rocm (normalized).""" + monkeypatch.delenv("ROCM_PATH", raising=False) + path = get_rocm_path(None) + assert path == "/opt/rocm" + + def test_get_rocm_path_override(self): + """Override argument takes precedence.""" + path = get_rocm_path("/custom/rocm") + assert path == os.path.abspath("/custom/rocm").rstrip(os.sep) + + def test_get_rocm_path_env(self, monkeypatch): + """ROCM_PATH env is used when override is None.""" + monkeypatch.setenv("ROCM_PATH", "/env/rocm") + try: + path = get_rocm_path(None) + assert path == os.path.abspath("/env/rocm").rstrip(os.sep) + finally: + monkeypatch.delenv("ROCM_PATH", raising=False) + + def test_get_rocm_path_override_overrides_env(self, monkeypatch): + """Override takes precedence over ROCM_PATH env.""" + monkeypatch.setenv("ROCM_PATH", "/env/rocm") + try: + path = get_rocm_path("/cli/rocm") + assert path == os.path.abspath("/cli/rocm").rstrip(os.sep) + finally: + monkeypatch.delenv("ROCM_PATH", raising=False) + + +@pytest.mark.unit +class TestContextRocmPath: + """Test Context stores and uses rocm_path.""" + + def test_context_build_only_stores_rocm_path(self): + """Context with build_only_mode=True and rocm_path sets _rocm_path.""" + from madengine.core.context import Context + + ctx = Context(build_only_mode=True, rocm_path="/opt/rocm") + assert ctx._rocm_path == "/opt/rocm" + + def test_context_runtime_includes_rocm_path_in_ctx(self): + """Context in runtime mode includes rocm_path and ROCM_PATH in docker_env_vars.""" + from madengine.core.context import Context + from unittest.mock import patch + + with patch.object(Context, "get_gpu_vendor", return_value="AMD"), \ + patch.object(Context, "get_system_ngpus", return_value=2), \ + 
patch.object(Context, "get_system_gpu_architecture", return_value="gfx90a"), \ + patch.object(Context, "get_system_gpu_product_name", return_value="MI250"), \ + patch.object(Context, "get_system_hip_version", return_value="5.4"), \ + patch.object(Context, "get_docker_gpus", return_value="0-1"), \ + patch.object(Context, "get_gpu_renderD_nodes", return_value=None): + ctx = Context(rocm_path="/my/rocm") + assert ctx.ctx.get("rocm_path") == "/my/rocm" + assert ctx.ctx["docker_env_vars"].get("ROCM_PATH") == "/my/rocm" + + +@pytest.mark.unit +class TestRocmToolManagerRocmPath: + """Test ROCmToolManager uses configurable rocm_path.""" + + def test_rocm_tool_manager_paths_under_rocm_path(self): + """ROCmToolManager(rocm_path=X) sets paths under X.""" + from madengine.utils.rocm_tool_manager import ROCmToolManager + + manager = ROCmToolManager(rocm_path="/custom/rocm") + assert manager.rocm_path == "/custom/rocm" + assert manager.AMD_SMI_PATH == "/custom/rocm/bin/amd-smi" + assert manager.ROCM_SMI_PATH == "/custom/rocm/bin/rocm-smi" + assert manager.ROCM_VERSION_FILE == "/custom/rocm/.info/version" + + +@pytest.mark.unit +class TestRunCommandRocmPath: + """Test run command exposes --rocm-path.""" + + def test_run_help_includes_rocm_path(self): + """madengine run --help mentions --rocm-path.""" + from typer.testing import CliRunner + from madengine.cli import app + + runner = CliRunner() + result = runner.invoke(app, ["run", "--help"]) + assert result.exit_code == 0 + assert "--rocm-path" in result.output
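The resolution order this patch documents and tests (`--rocm-path` override → `ROCM_PATH` env → `/opt/rocm` default) can be exercised with a minimal standalone sketch. This mirrors `get_rocm_path()` from `madengine/core/constants.py` above; it is renamed `resolve_rocm_path` here so the example does not assume the madengine package is importable:

```python
import os

def resolve_rocm_path(override=None):
    # Mirrors get_rocm_path() in madengine.core.constants:
    # override (e.g. from --rocm-path) -> ROCM_PATH env var -> /opt/rocm,
    # normalized to an absolute path with no trailing separator.
    raw = override if override else os.environ.get("ROCM_PATH", "/opt/rocm")
    path = os.path.abspath(os.path.expanduser(str(raw).strip()))
    return path.rstrip(os.sep)

os.environ.pop("ROCM_PATH", None)
print(resolve_rocm_path())             # default: /opt/rocm

os.environ["ROCM_PATH"] = "/env/rocm/"
print(resolve_rocm_path())             # env wins over default: /env/rocm

print(resolve_rocm_path("/cli/rocm"))  # override wins over env: /cli/rocm
```

This matches the precedence asserted in `tests/unit/test_rocm_path.py`: the CLI flag beats the environment variable, which beats the `/opt/rocm` default.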