129 changes: 129 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,129 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Development Setup

```bash
# Install in development mode with all dependencies
pip install -e ".[dev]"

# Optional: install Kubernetes support
pip install -e ".[all]"

# Setup pre-commit hooks
pre-commit install
```

## Commands

```bash
# Run all tests
pytest

# Run specific test file
pytest tests/unit/test_error_handling.py -v

# Run specific test class or function
pytest tests/unit/test_error_handling.py::TestErrorPatternMatching -v

# Run tests with coverage
pytest --cov=src/madengine --cov-report=html

# Skip slow tests
pytest -m "not slow"

# Format code
black src/ tests/
isort src/ tests/

# Lint
flake8 src/ tests/

# Type check
mypy src/madengine

# Run all pre-commit checks
pre-commit run --all-files
```

## Architecture

madengine is a CLI tool for running AI/ML models in local Docker, Kubernetes, and SLURM environments. The entry point is `madengine.cli.app:cli_main` (registered as the `madengine` console script).

### Layer Structure

**CLI Layer** (`src/madengine/cli/`)
- `app.py` — Typer app wiring, registers 5 commands: `discover`, `build`, `run`, `report`, `database`
- `commands/` — One file per command (build, run, discover, report, database)
- `constants.py` — `ExitCode` enum (`SUCCESS=0`, `FAILURE=1`, `BUILD_FAILURE=2`, `RUN_FAILURE=3`, `INVALID_ARGS=4`)
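
Since `ExitCode` members carry integer values, they can be compared with plain integers or passed to `sys.exit`. The following is a minimal sketch that mirrors the values listed above; the `exit_code_for` helper is hypothetical and is not part of madengine.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Mirror of the exit codes listed above."""

    SUCCESS = 0
    FAILURE = 1
    BUILD_FAILURE = 2
    RUN_FAILURE = 3
    INVALID_ARGS = 4


def exit_code_for(build_ok: bool, run_ok: bool) -> ExitCode:
    """Hypothetical helper: map stage results to an exit code."""
    if not build_ok:
        return ExitCode.BUILD_FAILURE
    if not run_ok:
        return ExitCode.RUN_FAILURE
    return ExitCode.SUCCESS
```

A caller could then end the process with `sys.exit(exit_code_for(build_ok, run_ok))`, because an `IntEnum` member is accepted wherever an `int` is.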

**Orchestration Layer** (`src/madengine/orchestration/`)
- `build_orchestrator.py` — `BuildOrchestrator`: discovers models, builds Docker images, writes `build_manifest.json`
- `run_orchestrator.py` — `RunOrchestrator`: reads or triggers builds, infers deployment target, delegates to local or distributed execution

**Core Layer** (`src/madengine/core/`)
- `context.py` — `Context` class: merges `additional_context` with system detection (GPU vendor, architecture, OS, ROCm path). Uses `ast.literal_eval()` to parse additional_context strings (not `json.loads` — pass Python dict repr, not JSON)
- `console.py` — `Console`: shell execution wrapper with live output support
- `docker.py` — Docker command wrapper
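
The difference between Python-literal and JSON input noted for `context.py` can be seen with the standard library alone. This sketch does not call madengine; the keys `gpu_vendor`, `guest_os`, and `debug` are illustrative assumptions.

```python
import ast

# A Python dict repr parses cleanly with ast.literal_eval.
python_repr = "{'gpu_vendor': 'AMD', 'guest_os': 'UBUNTU', 'debug': True}"
parsed = ast.literal_eval(python_repr)

# JSON's true/false/null are not Python literals, so this input is rejected.
json_text = '{"debug": true}'
try:
    ast.literal_eval(json_text)
    json_accepted = True
except (ValueError, SyntaxError):
    json_accepted = False
```

In practice this means writing `True`, `False`, and `None` rather than their JSON spellings when building the context string.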

**Execution Layer** (`src/madengine/execution/`)
- `container_runner.py` — `ContainerRunner`: runs models from manifest via `docker run`, writes results to `perf.csv`
- `docker_builder.py` — `DockerBuilder`: builds images from Dockerfiles
- `container_runner_helpers.py` — Log error pattern scanning, timeout resolution

**Deployment Layer** (`src/madengine/deployment/`)
- `factory.py` — `DeploymentFactory`: Factory pattern, registers `SlurmDeployment` and `KubernetesDeployment`
- `base.py` — `BaseDeployment` abstract class, `DeploymentConfig` dataclass
- `kubernetes.py` / `slurm.py` — Concrete deployments; target is inferred by Convention over Configuration: presence of `"k8s"` or `"kubernetes"` key → K8s; `"slurm"` key → SLURM; neither → local
- `presets/` — JSON preset files for K8s/SLURM default configurations; auto-merged with minimal user configs
- `config_loader.py` — Loads and merges preset JSON with user-supplied config
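
One plausible way to merge a preset with a minimal user config is a recursive dictionary merge in which user values win. The sketch below rests on that assumption; the actual logic in `config_loader.py` may differ, and `merge_config` and the example keys are hypothetical.

```python
from copy import deepcopy
from typing import Any, Dict


def merge_config(preset: Dict[str, Any], user: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict: preset values overridden by user values, recursively."""
    merged = deepcopy(preset)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge their contents instead of replacing.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical preset and user config, for illustration only.
preset = {"k8s": {"namespace": "default", "gpu_count": 1}}
user = {"k8s": {"gpu_count": 8}}
result = merge_config(preset, user)
```

Here `result` keeps the preset's `namespace` while taking `gpu_count` from the user config, and neither input dict is mutated.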

**Utils** (`src/madengine/utils/`)
- `discover_models.py` — `DiscoverModels`: three discovery methods: root `models.json`, `scripts/{dir}/models.json`, or `scripts/{dir}/get_models_json.py` (dynamic)
- `gpu_tool_factory.py` / `gpu_tool_manager.py` — GPU vendor abstraction (AMD/NVIDIA)
- `gpu_validator.py` — ROCm installation detection, GPU vendor detection
- `config_parser.py` — `ConfigParser`: parses `--additional-context` and tools config
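
The three discovery locations named for `discover_models.py` can be pictured as a simple filesystem scan. This is only a sketch of where the sources live, assuming each `scripts/` subdirectory is checked for both file names; `find_model_sources` is a hypothetical name and the real `DiscoverModels` class may order or filter results differently.

```python
from pathlib import Path
from typing import List


def find_model_sources(root: Path) -> List[Path]:
    """Collect candidate model definition files under a project root."""
    sources: List[Path] = []

    # 1. Root-level models.json
    root_file = root / "models.json"
    if root_file.is_file():
        sources.append(root_file)

    # 2. and 3. Per-directory static or dynamic definitions under scripts/
    scripts = root / "scripts"
    if scripts.is_dir():
        for sub in sorted(p for p in scripts.iterdir() if p.is_dir()):
            for name in ("models.json", "get_models_json.py"):
                candidate = sub / name
                if candidate.is_file():
                    sources.append(candidate)
    return sources
```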

**Reporting** (`src/madengine/reporting/`)
- `update_perf_csv.py` — Writes/appends to `perf.csv` and `perf_entry.csv`
- `csv_to_html.py` / `csv_to_email.py` — Report generation

### Key Data Flows

1. **Build flow**: CLI → `BuildOrchestrator` → `DiscoverModels` (finds models by tags) → `DockerBuilder` (builds images) → writes `build_manifest.json`

2. **Run flow**: CLI → `RunOrchestrator` → loads/generates `build_manifest.json` → infers target → `ContainerRunner` (local) or `DeploymentFactory` (K8s/SLURM) → writes `perf.csv`

3. **`additional_context`**: User-supplied Python-dict string merged into `Context.ctx`. It is parsed with `ast.literal_eval()`, so it must use Python literal syntax; JSON text parses only when it contains no `true`, `false`, or `null`. Keys like `k8s`, `slurm`, `distributed`, `tools`, `pre_scripts`, `post_scripts` drive behavior.

4. **Model definition**: Models defined in `models.json` with fields: `name`, `tags`, `dockerfile`, `scripts`, `n_gpus`, `args`, `timeout`, `skip_gpu_arch`, etc.

5. **Script isolation**: During run, `scripts/common/` is populated from the madengine package (pre_scripts, post_scripts, tools) and cleaned up afterwards. The MAD project's own `scripts/` and `docker/` directories are preserved.
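
To make the model definition and tag-based selection concrete, here is a small sketch. The field names come from the list above, but every value is invented for illustration, and `select_by_tags` is a hypothetical helper that assumes a model matches when it shares at least one requested tag; madengine's real matching rules may differ.

```python
from typing import Any, Dict, List

# Illustrative entries only; names, paths, and values are made up.
models: List[Dict[str, Any]] = [
    {
        "name": "dummy_model",
        "tags": ["dummy", "smoke"],
        "dockerfile": "docker/dummy",
        "scripts": "scripts/dummy/run.sh",
        "n_gpus": 1,
        "args": "",
        "timeout": 600,
    },
    {
        "name": "other_model",
        "tags": ["nightly"],
        "dockerfile": "docker/other",
        "scripts": "scripts/other/run.sh",
        "n_gpus": 8,
        "args": "--epochs 1",
        "timeout": 7200,
    },
]


def select_by_tags(
    entries: List[Dict[str, Any]], tags: List[str]
) -> List[Dict[str, Any]]:
    """Keep entries that share at least one tag with the request."""
    wanted = set(tags)
    return [m for m in entries if wanted & set(m.get("tags", []))]


selected_names = [m["name"] for m in select_by_tags(models, ["smoke"])]
```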

### Deployment Target Inference

No explicit `"deploy"` field is needed. Target is inferred from config structure:
- `"k8s"` or `"kubernetes"` key present → Kubernetes deployment
- `"slurm"` key present → SLURM deployment
- Neither → local Docker execution
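
The inference rule above amounts to a few key lookups. The sketch below is an assumption-laden illustration: `infer_deployment_target` and its return strings are hypothetical, and checking Kubernetes before SLURM when both keys are present is a choice made here, not something this file specifies.

```python
from typing import Any, Dict


def infer_deployment_target(additional_context: Dict[str, Any]) -> str:
    """Infer the target from which top-level keys are present."""
    if "k8s" in additional_context or "kubernetes" in additional_context:
        return "k8s"
    if "slurm" in additional_context:
        return "slurm"
    # No deployment key at all: run in local Docker.
    return "local"
```

For example, `infer_deployment_target({"slurm": {}})` yields `"slurm"`, while an empty context falls through to `"local"`.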

### Test Structure

```
tests/
├── unit/ # Fast isolated tests with mocking
├── integration/ # End-to-end with real Docker/system calls
├── e2e/ # Full workflow tests
└── fixtures/ # Dummy models, scripts, and data for testing
```

Pytest config is in `pyproject.toml` under `[tool.pytest.ini_options]`. Test markers: `slow`, `integration`.

### Code Style

- Black formatting, 88-character line length
- isort with `profile = "black"`
- Google-style docstrings
- Type hints required for public functions
- Conventional commits: `feat:`, `fix:`, `docs:`, `test:`, `refactor:`, `style:`, `perf:`, `chore:`
2 changes: 0 additions & 2 deletions pyproject.toml
@@ -16,10 +16,8 @@ dependencies = [
"GitPython",
"jsondiff",
"sqlalchemy",
"setuptools-rust",
"paramiko",
"tqdm",
"pytest",
"typing-extensions",
"pymongo",
"toml",
8 changes: 6 additions & 2 deletions src/madengine/cli/app.py
@@ -8,6 +8,7 @@
 """

 import sys
+from importlib.metadata import PackageNotFoundError, version as pkg_version

 import typer
 from rich.traceback import install
@@ -55,9 +56,12 @@ def main(
     Built with Typer and Rich for a beautiful, production-ready experience.
     """
     if version:
-        # You might want to get the actual version from your package
+        try:
+            _version = pkg_version("madengine")
+        except PackageNotFoundError:
+            _version = "unknown"
         console.print(
-            "🚀 [bold cyan]madengine[/bold cyan] version [green]2.0.0[/green]"
+            f"🚀 [bold cyan]madengine[/bold cyan] version [green]{_version}[/green]"
         )
         raise typer.Exit()

3 changes: 3 additions & 0 deletions src/madengine/cli/commands/run.py
@@ -194,6 +194,9 @@ def run(
     # Convert -1 (default) to actual default timeout value (7200 seconds = 2 hours)
     if timeout == -1:
         timeout = 7200
+    # 0 means "no timeout" per the help text — map to None so subprocess never expires
+    elif timeout == 0:
+        timeout = None

     try:
         # Check if we're doing execution-only or full workflow
6 changes: 4 additions & 2 deletions src/madengine/cli/constants.py
@@ -5,11 +5,13 @@
 Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
 """

+from enum import IntEnum
+

 # Exit codes
-class ExitCode:
+class ExitCode(IntEnum):
     """Exit codes for CLI commands."""

     SUCCESS = 0
     FAILURE = 1
     BUILD_FAILURE = 2
2 changes: 2 additions & 0 deletions src/madengine/cli/validators.py
@@ -395,6 +395,8 @@ def process_batch_manifest_entries(

         # If the model was not built (build_new=false), create an entry for it
         if not build_new:
+            # Initialize with a safe fallback so the except block can always reference it
+            dockerfile_matched = "unknown"
             # Find the model configuration by discovering models with this tag
             try:
                 # Create a temporary args object to discover the model
178 changes: 178 additions & 0 deletions src/madengine/core/auth.py
@@ -0,0 +1,178 @@
#!/usr/bin/env python3
"""
Shared authentication utilities for madengine.

Centralises credential loading logic used by both BuildOrchestrator and
RunOrchestrator so that fixes and improvements only need to be made once.

Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
"""

import json
import os
import shlex
from typing import Dict, Optional

from madengine.core.errors import (
    ConfigurationError,
    create_error_context,
    handle_error,
)


def load_credentials() -> Optional[Dict]:
    """Load credentials from credential.json and environment variables.

    Precedence (highest wins):
      1. ``MAD_DOCKERHUB_USER`` / ``MAD_DOCKERHUB_PASSWORD`` environment vars
         (merged into the ``dockerhub`` key of the returned dict)
      2. ``credential.json`` in the current working directory

    Returns:
        Credentials dict (keyed by registry name), or ``None`` if no
        credentials are found.
    """
    credentials: Optional[Dict] = None

    credential_file = "credential.json"
    if os.path.exists(credential_file):
        try:
            with open(credential_file) as f:
                credentials = json.load(f)
            print(
                f"Loaded credentials from {credential_file}: "
                f"{list(credentials.keys())}"
            )
        except Exception as e:
            context = create_error_context(
                operation="load_credentials",
                component="auth",
                file_path=credential_file,
            )
            handle_error(
                ConfigurationError(
                    f"Could not load credentials: {e}",
                    context=context,
                    suggestions=[
                        "Check if credential.json exists and has valid JSON format"
                    ],
                )
            )

    # Environment variables override / supplement file credentials
    docker_hub_user = os.environ.get("MAD_DOCKERHUB_USER")
    docker_hub_password = os.environ.get("MAD_DOCKERHUB_PASSWORD")
    docker_hub_repo = os.environ.get("MAD_DOCKERHUB_REPO")

    if docker_hub_user and docker_hub_password:
        print("Found Docker Hub credentials in environment variables")
        if credentials is None:
            credentials = {}
        credentials["dockerhub"] = {
            "username": docker_hub_user,
            "password": docker_hub_password,
        }
        if docker_hub_repo:
            credentials["dockerhub"]["repository"] = docker_hub_repo

    return credentials


def login_to_registry(
    registry: str,
    credentials: Optional[Dict],
    console,
    rich_console,
    raise_on_failure: bool = True,
) -> None:
    """Login to a Docker registry.

    This is the single shared implementation used by both DockerBuilder
    and ContainerRunner.

    Args:
        registry: Registry URL (e.g., "localhost:5000", "docker.io", or empty
            for DockerHub).
        credentials: Credentials dictionary keyed by registry name.
        console: A ``Console`` instance for shell execution.
        rich_console: A Rich ``Console`` instance for formatted output.
        raise_on_failure: If ``True`` (default), raise ``RuntimeError`` on any
            failure (missing key, invalid format, or docker login error).
            Set to ``False`` to log and return instead, allowing the caller
            to fall back to pulling public images.
    """
    if not credentials:
        rich_console.print(
            "[yellow]No credentials provided for registry login[/yellow]"
        )
        return

    registry_key = registry if registry else "dockerhub"

    # Normalise docker.io → dockerhub
    if registry and registry.lower() == "docker.io":
        registry_key = "dockerhub"

    if registry_key not in credentials:
        error_msg = f"No credentials found for registry: {registry_key}"
        if registry_key == "dockerhub":
            error_msg += (
                f"\nPlease add dockerhub credentials to credential.json:\n"
                "{\n"
                ' "dockerhub": {\n'
                ' "repository": "your-repository",\n'
                ' "username": "your-dockerhub-username",\n'
                ' "password": "your-dockerhub-password-or-token"\n'
                " }\n"
                "}"
            )
        else:
            error_msg += (
                f"\nPlease add {registry_key} credentials to credential.json:\n"
                "{\n"
                f' "{registry_key}": {{\n'
                f' "repository": "your-repository",\n'
                f' "username": "your-{registry_key}-username",\n'
                f' "password": "your-{registry_key}-password"\n'
                " }\n"
                "}"
            )
        rich_console.print(f"[red]{error_msg}[/red]")
        if raise_on_failure:
            raise RuntimeError(error_msg)
        return

    creds = credentials[registry_key]

    if "username" not in creds or "password" not in creds:
        error_msg = (
            f"Invalid credentials format for registry: {registry_key}"
            f"\nCredentials must contain 'username' and 'password' fields"
        )
        rich_console.print(f"[red]{error_msg}[/red]")
        if raise_on_failure:
            raise RuntimeError(error_msg)
        return

    username = str(creds["username"])
    password = str(creds["password"])

    quoted_password = shlex.quote(password)
    quoted_username = shlex.quote(username)
    login_command = f"printf %s {quoted_password} | docker login"
    if registry and registry.lower() not in ["docker.io", "dockerhub"]:
        login_command += f" {shlex.quote(str(registry))}"
    login_command += f" --username {quoted_username} --password-stdin"

    try:
        console.sh(login_command, secret=True)
        rich_console.print(
            f"[green]Successfully logged in to registry: "
            f"{registry or 'DockerHub'}[/green]"
        )
    except Exception as e:
        rich_console.print(
            f"[red]Failed to login to registry {registry}: {e}[/red]"
        )
        if raise_on_failure:
            raise