ROCm · coketaste · Apr 20, 2026 · Apr 18, 2026 · Apr 18, 2026 · Apr 20, 2026
@@ -0,0 +1,129 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Development Setup
+
+```bash
+# Install in development mode with all dependencies
+pip install -e ".[dev]"
+
+# Optional: install Kubernetes support
+pip install -e ".[all]"
+
+# Setup pre-commit hooks
+pre-commit install
+```
+
+## Commands
+
+```bash
+# Run all tests
+pytest
+
+# Run specific test file
+pytest tests/unit/test_error_handling.py -v
+
+# Run specific test class or function
+pytest tests/unit/test_error_handling.py::TestErrorPatternMatching -v
+
+# Run tests with coverage
+pytest --cov=src/madengine --cov-report=html
+
+# Skip slow tests
+pytest -m "not slow"
+
+# Format code
+black src/ tests/
+isort src/ tests/
+
+# Lint
+flake8 src/ tests/
+
+# Type check
+mypy src/madengine
+
+# Run all pre-commit checks
+pre-commit run --all-files
+```
+
+## Architecture
+
+madengine is a CLI tool for running AI/ML models in local Docker, Kubernetes, and SLURM environments. The entry point is `madengine.cli.app:cli_main` (registered as the `madengine` console script).
+
+### Layer Structure
+
+**CLI Layer** (`src/madengine/cli/`)
+- `app.py` — Typer app wiring, registers 5 commands: `discover`, `build`, `run`, `report`, `database`
+- `commands/` — One file per command (build, run, discover, report, database)
+- `constants.py` — `ExitCode` enum (`SUCCESS=0`, `FAILURE=1`, `BUILD_FAILURE=2`, `RUN_FAILURE=3`, `INVALID_ARGS=4`)
+
+**Orchestration Layer** (`src/madengine/orchestration/`)
+- `build_orchestrator.py` — `BuildOrchestrator`: discovers models, builds Docker images, writes `build_manifest.json`
+- `run_orchestrator.py` — `RunOrchestrator`: reads or triggers builds, infers deployment target, delegates to local or distributed execution
+
+**Core Layer** (`src/madengine/core/`)
+- `context.py` — `Context` class: merges `additional_context` with system detection (GPU vendor, architecture, OS, ROCm path). Uses `ast.literal_eval()` to parse additional_context strings (not `json.loads` — pass Python dict repr, not JSON)
+- `console.py` — `Console`: shell execution wrapper with live output support
+- `docker.py` — Docker command wrapper
+
+**Execution Layer** (`src/madengine/execution/`)
+- `container_runner.py` — `ContainerRunner`: runs models from manifest via `docker run`, writes results to `perf.csv`
+- `docker_builder.py` — `DockerBuilder`: builds images from Dockerfiles
+- `container_runner_helpers.py` — Log error pattern scanning, timeout resolution
+
+**Deployment Layer** (`src/madengine/deployment/`)
+- `factory.py` — `DeploymentFactory`: Factory pattern, registers `SlurmDeployment` and `KubernetesDeployment`
+- `base.py` — `BaseDeployment` abstract class, `DeploymentConfig` dataclass
+- `kubernetes.py` / `slurm.py` — Concrete deployments; target is inferred by Convention over Configuration: presence of `"k8s"` or `"kubernetes"` key → K8s; `"slurm"` key → SLURM; neither → local
+- `presets/` — JSON preset files for K8s/SLURM default configurations; auto-merged with minimal user configs
+- `config_loader.py` — Loads and merges preset JSON with user-supplied config
+
+**Utils** (`src/madengine/utils/`)
+- `discover_models.py` — `DiscoverModels`: three discovery methods: root `models.json`, `scripts/{dir}/models.json`, or `scripts/{dir}/get_models_json.py` (dynamic)
+- `gpu_tool_factory.py` / `gpu_tool_manager.py` — GPU vendor abstraction (AMD/NVIDIA)
+- `gpu_validator.py` — ROCm installation detection, GPU vendor detection
+- `config_parser.py` — `ConfigParser`: parses `--additional-context` and tools config
+
+**Reporting** (`src/madengine/reporting/`)
+- `update_perf_csv.py` — Writes/appends to `perf.csv` and `perf_entry.csv`
+- `csv_to_html.py` / `csv_to_email.py` — Report generation
+
+### Key Data Flows
+
+1. **Build flow**: CLI → `BuildOrchestrator` → `DiscoverModels` (finds models by tags) → `DockerBuilder` (builds images) → writes `build_manifest.json`
+
+2. **Run flow**: CLI → `RunOrchestrator` → loads/generates `build_manifest.json` → infers target → `ContainerRunner` (local) or `DeploymentFactory` (K8s/SLURM) → writes `perf.csv`
+
+3. **`additional_context`**: User-supplied Python-dict string merged into `Context.ctx`. The string is parsed with `ast.literal_eval()`, so it must use Python literal syntax (`True`/`False`/`None`); JSON input works only when it avoids `true`/`false`/`null`. Keys like `k8s`, `slurm`, `distributed`, `tools`, `pre_scripts`, `post_scripts` drive behavior.
+
+4. **Model definition**: Models defined in `models.json` with fields: `name`, `tags`, `dockerfile`, `scripts`, `n_gpus`, `args`, `timeout`, `skip_gpu_arch`, etc.
+
+5. **Script isolation**: During run, `scripts/common/` is populated from the madengine package (pre_scripts, post_scripts, tools) and cleaned up afterwards. The MAD project's own `scripts/` and `docker/` directories are preserved.
+
+### Deployment Target Inference
+
+No explicit `"deploy"` field is needed. Target is inferred from config structure:
+- `"k8s"` or `"kubernetes"` key present → Kubernetes deployment
+- `"slurm"` key present → SLURM deployment
+- Neither → local Docker execution
+
+### Test Structure
+
+```
+tests/
+├── unit/         # Fast isolated tests with mocking
+├── integration/  # End-to-end with real Docker/system calls
+├── e2e/          # Full workflow tests
+└── fixtures/     # Dummy models, scripts, and data for testing
+```
+
+Pytest config is in `pyproject.toml` under `[tool.pytest.ini_options]`. Test markers: `slow`, `integration`.
+
+### Code Style
+
+- Black formatting, 88-character line length
+- isort with `profile = "black"`
+- Google-style docstrings
+- Type hints required for public functions
+- Conventional commits: `feat:`, `fix:`, `docs:`, `test:`, `refactor:`, `style:`, `perf:`, `chore:`
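
The `additional_context` parsing and the deployment target inference described in the CLAUDE.md hunk above can be illustrated with a minimal sketch. The helper names `parse_additional_context` and `infer_target`, the returned target strings, and the `nodes` key are hypothetical and not taken from the repository. Checking the K8s keys before `slurm` is also an assumption, since the notes do not say which wins when both are present.

```python
import ast
from typing import Any, Dict


def parse_additional_context(raw: str) -> Dict[str, Any]:
    """Parse a Python-dict string with ast.literal_eval, as the notes describe."""
    parsed = ast.literal_eval(raw)
    if not isinstance(parsed, dict):
        raise ValueError("additional_context must evaluate to a dict")
    return parsed


def infer_target(ctx: Dict[str, Any]) -> str:
    """Hypothetical helper: map top-level config keys to a target name."""
    if "k8s" in ctx or "kubernetes" in ctx:
        return "kubernetes"
    if "slurm" in ctx:
        return "slurm"
    return "local"  # neither key present: local Docker execution


# 'nodes' is an illustrative key, not a documented madengine option.
print(infer_target(parse_additional_context("{'slurm': {'nodes': 2}}")))
print(infer_target(parse_additional_context("{'tools': []}")))
```

Because `ast.literal_eval()` only accepts Python literals, a string such as `{"debug": true}` is rejected, while `{"debug": True}` parses.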
@@ -16,10 +16,8 @@ dependencies = [
   "GitPython",
   "jsondiff",
   "sqlalchemy",
-  "setuptools-rust",
   "paramiko",
   "tqdm",
-  "pytest",
   "typing-extensions",
   "pymongo",
   "toml",

@@ -8,6 +8,7 @@
 """
 
 import sys
+from importlib.metadata import PackageNotFoundError, version as pkg_version
 
 import typer
 from rich.traceback import install
@@ -55,9 +56,12 @@ def main(
     Built with Typer and Rich for a beautiful, production-ready experience.
     """
     if version:
-        # You might want to get the actual version from your package
+        try:
+            _version = pkg_version("madengine")
+        except PackageNotFoundError:
+            _version = "unknown"
         console.print(
-            "🚀 [bold cyan]madengine[/bold cyan] version [green]2.0.0[/green]"
+            f"🚀 [bold cyan]madengine[/bold cyan] version [green]{_version}[/green]"
         )
         raise typer.Exit()
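
The version lookup in this hunk can be tried in isolation. The sketch below wraps the same `importlib.metadata` calls in a helper; the name `resolve_version` and the placeholder distribution name are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version as pkg_version


def resolve_version(dist_name: str) -> str:
    """Return the installed distribution's version, or "unknown" if absent."""
    try:
        return pkg_version(dist_name)
    except PackageNotFoundError:
        # Raised when no installed distribution has this name, for example
        # when running from a source checkout that was never pip-installed.
        return "unknown"


print(resolve_version("madengine"))
print(resolve_version("no-such-distribution-for-this-sketch"))
```

Reading the version from package metadata keeps the `--version` output tied to whatever is installed, which the removed hard-coded `2.0.0` string could not guarantee.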
 

@@ -194,6 +194,9 @@ def run(
     # Convert -1 (default) to actual default timeout value (7200 seconds = 2 hours)
     if timeout == -1:
         timeout = 7200
+    # 0 means "no timeout" per the help text — map to None so subprocess never expires
+    elif timeout == 0:
+        timeout = None
 
     try:
         # Check if we're doing execution-only or full workflow
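
The sentinel handling added here can be read as a small normalization step. The sketch below restates it as a standalone function; `normalize_timeout` and `DEFAULT_TIMEOUT_SECONDS` are hypothetical names, and passing the result to `subprocess.run` is only an illustration of how a `None` timeout behaves, not a claim about how madengine consumes the value.

```python
import subprocess
import sys
from typing import Optional

DEFAULT_TIMEOUT_SECONDS = 7200  # 2 hours, matching the comment in the hunk


def normalize_timeout(timeout: int) -> Optional[int]:
    """Map CLI sentinel values to what the execution layer should receive."""
    if timeout == -1:
        return DEFAULT_TIMEOUT_SECONDS  # -1: use the default
    if timeout == 0:
        return None  # 0: no timeout at all
    return timeout  # any other value is taken as seconds


# subprocess.run treats timeout=None as "wait without a deadline".
subprocess.run([sys.executable, "-c", "pass"], timeout=normalize_timeout(0), check=True)
```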

@@ -5,11 +5,13 @@
 Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
 """
 
+from enum import IntEnum
+
 
 # Exit codes
-class ExitCode:
+class ExitCode(IntEnum):
     """Exit codes for CLI commands."""
-    
+
     SUCCESS = 0
     FAILURE = 1
     BUILD_FAILURE = 2
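
A short sketch of what the switch to `IntEnum` buys. The first three members appear in this hunk; `RUN_FAILURE` and `INVALID_ARGS` are taken from the `ExitCode` description in CLAUDE.md rather than from the lines shown here.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Exit codes for CLI commands."""

    SUCCESS = 0
    FAILURE = 1
    BUILD_FAILURE = 2
    RUN_FAILURE = 3
    INVALID_ARGS = 4


# Members still compare equal to plain ints, so existing checks keep working.
print(ExitCode.BUILD_FAILURE == 2)  # True
# A raw return code can now be mapped back to a named member.
print(ExitCode(3).name)  # RUN_FAILURE
# Members convert cleanly where an int is expected.
print(int(ExitCode.INVALID_ARGS))  # 4
```

With the previous plain class, the attributes were bare integers, so there was no reverse lookup by value and no iteration over the defined codes.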

@@ -395,6 +395,8 @@ def process_batch_manifest_entries(
 
         # If the model was not built (build_new=false), create an entry for it
         if not build_new:
+            # Initialize with a safe fallback so the except block can always reference it
+            dockerfile_matched = "unknown"
             # Find the model configuration by discovering models with this tag
             try:
                 # Create a temporary args object to discover the model

@@ -0,0 +1,178 @@
+#!/usr/bin/env python3
+"""
+Shared authentication utilities for madengine.
+
+Centralises credential loading logic used by both BuildOrchestrator and
+RunOrchestrator so that fixes and improvements only need to be made once.
+
+Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
+"""
+
+import json
+import os
+import shlex
+from typing import Dict, Optional
+
+from madengine.core.errors import (
+    ConfigurationError,
+    create_error_context,
+    handle_error,
+)
+
+
+def load_credentials() -> Optional[Dict]:
+    """Load credentials from credential.json and environment variables.
+
+    Precedence (highest wins):
+      1. ``MAD_DOCKERHUB_USER`` / ``MAD_DOCKERHUB_PASSWORD`` environment vars
+         (merged into the ``dockerhub`` key of the returned dict)
+      2. ``credential.json`` in the current working directory
+
+    Returns:
+        Credentials dict (keyed by registry name), or ``None`` if no
+        credentials are found.
+    """
+    credentials: Optional[Dict] = None
+
+    credential_file = "credential.json"
+    if os.path.exists(credential_file):
+        try:
+            with open(credential_file) as f:
+                credentials = json.load(f)
+            print(
+                f"Loaded credentials from {credential_file}: "
+                f"{list(credentials.keys())}"
+            )
+        except Exception as e:
+            context = create_error_context(
+                operation="load_credentials",
+                component="auth",
+                file_path=credential_file,
+            )
+            handle_error(
+                ConfigurationError(
+                    f"Could not load credentials: {e}",
+                    context=context,
+                    suggestions=[
+                        "Check that credential.json is readable, valid JSON"
+                    ],
+                )
+            )
+
+    # Environment variables override / supplement file credentials
+    docker_hub_user = os.environ.get("MAD_DOCKERHUB_USER")
+    docker_hub_password = os.environ.get("MAD_DOCKERHUB_PASSWORD")
+    docker_hub_repo = os.environ.get("MAD_DOCKERHUB_REPO")
+
+    if docker_hub_user and docker_hub_password:
+        print("Found Docker Hub credentials in environment variables")
+        if credentials is None:
+            credentials = {}
+        credentials["dockerhub"] = {
+            "username": docker_hub_user,
+            "password": docker_hub_password,
+        }
+        if docker_hub_repo:
+            credentials["dockerhub"]["repository"] = docker_hub_repo
+
+    return credentials
+
+
+def login_to_registry(
+    registry: str,
+    credentials: Optional[Dict],
+    console,
+    rich_console,
+    raise_on_failure: bool = True,
+) -> None:
+    """Log in to a Docker registry.
+
+    This is the single shared implementation used by both DockerBuilder
+    and ContainerRunner.
+
+    Args:
+        registry: Registry URL (e.g., "localhost:5000", "docker.io", or empty
+            for DockerHub).
+        credentials: Credentials dictionary keyed by registry name.
+        console: A ``Console`` instance for shell execution.
+        rich_console: A Rich ``Console`` instance for formatted output.
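+        raise_on_failure: If ``True`` (default), raise on failure:
+            ``RuntimeError`` for a missing registry key or invalid credential
+            format, or the original exception from a failed docker login.
+            Set to ``False`` to log and return instead, allowing the caller
+            to fall back to pulling public images. Empty or missing
+            ``credentials`` always logs a warning and returns.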
+    """
+    if not credentials:
+        rich_console.print(
+            "[yellow]No credentials provided for registry login[/yellow]"
+        )
+        return
+
+    registry_key = registry if registry else "dockerhub"
+
+    # Normalise docker.io → dockerhub
+    if registry and registry.lower() == "docker.io":
+        registry_key = "dockerhub"
+
+    if registry_key not in credentials:
+        error_msg = f"No credentials found for registry: {registry_key}"
+        if registry_key == "dockerhub":
+            error_msg += (
+                "\nPlease add dockerhub credentials to credential.json:\n"
+                "{\n"
+                '  "dockerhub": {\n'
+                '    "repository": "your-repository",\n'
+                '    "username": "your-dockerhub-username",\n'
+                '    "password": "your-dockerhub-password-or-token"\n'
+                "  }\n"
+                "}"
+            )
+        else:
+            error_msg += (
+                f"\nPlease add {registry_key} credentials to credential.json:\n"
+                "{\n"
+                f'  "{registry_key}": {{\n'
+                '    "repository": "your-repository",\n'
+                f'    "username": "your-{registry_key}-username",\n'
+                f'    "password": "your-{registry_key}-password"\n'
+                "  }\n"
+                "}"
+            )
+        rich_console.print(f"[red]{error_msg}[/red]")
+        if raise_on_failure:
+            raise RuntimeError(error_msg)
+        return
+
+    creds = credentials[registry_key]
+
+    if "username" not in creds or "password" not in creds:
+        error_msg = (
+            f"Invalid credentials format for registry: {registry_key}"
+            "\nCredentials must contain 'username' and 'password' fields"
+        )
+        rich_console.print(f"[red]{error_msg}[/red]")
+        if raise_on_failure:
+            raise RuntimeError(error_msg)
+        return
+
+    username = str(creds["username"])
+    password = str(creds["password"])
+
+    quoted_password = shlex.quote(password)
+    quoted_username = shlex.quote(username)
+    login_command = f"printf %s {quoted_password} | docker login"
+    if registry and registry.lower() not in ["docker.io", "dockerhub"]:
+        login_command += f" {shlex.quote(str(registry))}"
+    login_command += f" --username {quoted_username} --password-stdin"
+
+    try:
+        console.sh(login_command, secret=True)
+        rich_console.print(
+            f"[green]Successfully logged in to registry: "
+            f"{registry or 'DockerHub'}[/green]"
+        )
+    except Exception as e:
+        rich_console.print(
+            f"[red]Failed to log in to registry {registry or 'DockerHub'}: {e}[/red]"
+        )
+        if raise_on_failure:
+            raise
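
The command assembly at the end of `login_to_registry` can be checked on its own. The sketch below extracts the same string building into a hypothetical helper, `build_login_command`, so the quoting can be inspected without invoking Docker; the sample registry, username, and password are made up.

```python
import shlex


def build_login_command(registry: str, username: str, password: str) -> str:
    """Assemble the docker login pipeline the same way the hunk above does."""
    command = f"printf %s {shlex.quote(password)} | docker login"
    # Docker Hub is the default registry, so it is not passed explicitly.
    if registry and registry.lower() not in ("docker.io", "dockerhub"):
        command += f" {shlex.quote(registry)}"
    command += f" --username {shlex.quote(username)} --password-stdin"
    return command


cmd = build_login_command("localhost:5000", "ci-bot", "p@ss word'1")
print(cmd)
# shlex.split undoes the quoting, so the password comes back intact.
print(shlex.split(cmd)[2])
```

`shlex.quote` keeps spaces, quotes, and shell metacharacters in the password from being interpreted by the shell, and `--password-stdin` keeps the password off the `docker login` argument list.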