
Configuration Guide

Complete guide to configuring madengine for various use cases and environments.

Configuration Methods

1. Inline JSON String

madengine run --tags model \
  --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'

2. Configuration File

madengine run --tags model --additional-context-file config.json

config.json:

{
  "gpu_vendor": "AMD",
  "guest_os": "UBUNTU",
  "timeout_multiplier": 2.0
}

Default Configuration Values

madengine provides sensible defaults for common AMD/Ubuntu workflows:

  • gpu_vendor - default: AMD; set to NVIDIA for NVIDIA GPUs
  • guest_os - default: UBUNTU; set to CENTOS for CentOS containers

When Defaults Apply

Defaults are applied during the build command when fields are not explicitly provided:

# Uses defaults: {"gpu_vendor": "AMD", "guest_os": "UBUNTU"}
madengine build --tags model

# Explicit override
madengine build --tags model \
  --additional-context '{"gpu_vendor": "NVIDIA", "guest_os": "CENTOS"}'

When defaults are applied, you'll see an informative message:

ℹ️  Using default values for build configuration:
   • gpu_vendor: AMD (default)
   • guest_os: UBUNTU (default)

💡 To customize, use --additional-context '{"gpu_vendor": "NVIDIA", "guest_os": "CENTOS"}'

Partial Configuration

You can provide one field and let the other default:

# Override only gpu_vendor (guest_os defaults to UBUNTU)
madengine build --tags model \
  --additional-context '{"gpu_vendor": "NVIDIA"}'

# Override only guest_os (gpu_vendor defaults to AMD)
madengine build --tags model \
  --additional-context '{"guest_os": "CENTOS"}'

Production Recommendations

For production deployments:

  • DO explicitly specify all configuration values
  • DO use configuration files for reproducibility (see the example below)
  • ⚠️ AVOID relying on defaults in automated workflows
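
A minimal production config.json following these recommendations (values are illustrative):

{
  "gpu_vendor": "AMD",
  "guest_os": "UBUNTU",
  "timeout_multiplier": 2.0
}

madengine build --tags model --additional-context-file config.json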

Run Command Behavior

The run command does NOT require these values because it can detect the GPU vendor at runtime. Defaults apply only to the build command, where Dockerfile selection requires them.
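
For example, a plain run needs no GPU/OS context; the vendor is detected on the host at runtime:

madengine run --tags model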

Run Phase: Log Error Pattern Scan

After a successful container run, madengine may scan the run log file for fixed substrings (for example RuntimeError:, OutOfMemoryError, Traceback (most recent call last)). If a match is found, the run can be marked FAILURE even when performance metrics exist. This is intended as a safety net for runs whose logs show obvious Python or OOM errors.

Some suites (for example layer unit tests) intentionally print benign RuntimeError: text while pytest still passes. In those cases you can disable the scan or narrow what counts as an error.

Keys can be set in --additional-context / --additional-context-file, or on the model entry in models.json (same keys). When both are set, the runtime context overrides the model entry.

  • log_error_pattern_scan (bool; string/number values are coerced; default: true) - If false, skip substring-based log failure detection entirely and rely on exit codes and other signals.
  • log_error_benign_patterns (array of strings; default: []) - Extra lines to exclude before matching, appended to built-in exclusions such as ROCProf/metrics noise. The model's list is merged first, then the context's list.
  • log_error_patterns (non-empty array of strings; default: built-in list) - If set, replaces the default pattern list. Use only when you need a custom allowlist of failure substrings.

Example — disable scan for a tag (pytest is authoritative):

madengine run --tags my_unit_test_suite \
  --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU", "log_error_pattern_scan": false}'

Example — extra benign substrings (prefer stable strings from real logs):

{
  "gpu_vendor": "AMD",
  "guest_os": "UBUNTU",
  "log_error_benign_patterns": [
    "expected benign fragment from workload log"
  ]
}
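
The same keys can also be placed on the model entry in models.json. A hypothetical entry (every field except the log keys is a placeholder for whatever your entry already contains):

{
  "name": "my_unit_test_suite",
  "log_error_pattern_scan": false,
  "log_error_benign_patterns": [
    "expected benign fragment from workload log"
  ]
}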

Disabling the scan does not change performance metric extraction from the log; it only affects the post-hoc grep used to set has_errors for status.

Basic Configuration

gpu_vendor (case-insensitive):

  • "AMD" - AMD ROCm GPUs
  • "NVIDIA" - NVIDIA CUDA GPUs

guest_os (case-insensitive):

  • "UBUNTU" - Ubuntu Linux
  • "CENTOS" - CentOS Linux

ROCm Path (Run Only)

Host (where madengine runs validation): by default, the ROCm root is auto-detected (traditional /opt/rocm, TheRock rocm-sdk / manifest layout, or ROCM_PATH-like environment hints). Set MAD_AUTO_ROCM_PATH=0 to skip auto-detection and use only legacy resolution (ROCM_PATH, then /opt/rocm).

Overrides (recommended for CI):

  • Additional context (host): top-level "MAD_ROCM_PATH": "/path/to/host/rocm" — controls where madengine looks for host GPU tools (rocminfo, amd-smi, etc.).
  • Additional context (container): "docker_env_vars": { "MAD_ROCM_PATH": "/path/inside/image" } — sets the in-container ROCM_PATH for Docker runs. If omitted, at run time madengine uses the image OCI Env (ROCM_PATH / ROCM_HOME) if present, then an in-container probe, then defaults to /opt/rocm. The host-resolved path is not mirrored into the container.

These two keys are independent, allowing host and container to use different ROCm installations without confusion.
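
A sketch combining both keys (paths are illustrative):

{
  "MAD_ROCM_PATH": "/opt/rocm-6.2.0",
  "docker_env_vars": {
    "MAD_ROCM_PATH": "/opt/rocm"
  }
}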

Precedence (host): top-level MAD_ROCM_PATH → auto-detect (unless disabled) → ROCM_PATH → /opt/rocm.

Precedence (container, local Docker run, AMD): docker_env_vars.MAD_ROCM_PATH (maps to ROCM_PATH for the workload) or explicit ROCM_PATH in docker_env_vars → image OCI Env (ROCM_PATH / ROCM_HOME) → in-image probe → default /opt/rocm with a warning. Implemented in ContainerRunner.run_container after the run image is resolved.

This applies to the run phase; build uses build-only context (no GPU detection) but still honors MAD_ROCM_PATH in context when set.

At the start of each container run, a Run Phase Environment table is printed showing the host and container installation type (apt install or therock), ROCm/CUDA root, and version side by side. See Run phase environment table.

Build Configuration

Batch Manifest

Use batch manifest files for selective builds with per-model configuration:

madengine build --batch-manifest batch.json \
  --registry my-registry.com \
  --additional-context-file config.json

Batch manifest structure (batch.json):

[
  {
    "model_name": "model1",
    "build_new": true,
    "registry": "registry1.io",
    "registry_image": "namespace/model1"
  },
  {
    "model_name": "model2",
    "build_new": false,
    "registry": "registry2.io",
    "registry_image": "namespace/model2"
  }
]

Fields:

  • model_name (string, required): Model tag to include
  • build_new (boolean, optional, default: false): Whether to build this model
    • true: Build the model from source
    • false: Reference existing image without rebuilding
  • registry (string, optional): Per-model registry override
  • registry_image (string, optional): Custom registry image name/namespace

Key Behaviors:

  • Only models with "build_new": true are built
  • Models with "build_new": false are included in output manifest without building
  • Per-model registry overrides the global --registry flag
  • Cannot use --batch-manifest and --tags together (mutually exclusive)

Use Case - CI/CD Incremental Builds:

[
  {"model_name": "changed_model", "build_new": true},
  {"model_name": "stable_model1", "build_new": false},
  {"model_name": "stable_model2", "build_new": false}
]

This allows you to rebuild only changed models while maintaining references to existing stable images in a single manifest.

Docker Configuration

Environment Variables

Pass environment variables to containers:

{
  "docker_env_vars": {
    "HSA_ENABLE_SDMA": "0",
    "PYTORCH_TUNABLEOP_ENABLED": "1",
    "NCCL_DEBUG": "INFO"
  }
}

Custom Base Image

Override Docker base image:

{
  "MAD_CONTAINER_IMAGE": "rocm/pytorch:custom-tag"
}

Or override BASE_DOCKER in the FROM line:

{
  "docker_build_arg": {
    "BASE_DOCKER": "rocm/pytorch:rocm6.1_ubuntu22.04_py3.10"
  }
}

Build Arguments

Pass build-time variables:

{
  "docker_build_arg": {
    "ROCM_VERSION": "6.1",
    "PYTHON_VERSION": "3.10",
    "CUSTOM_ARG": "value"
  }
}

Mount Host Directories

Mount host directories inside containers (keys are container paths, values are host paths):

{
  "docker_mounts": {
    "/data-inside-container": "/data-on-host",
    "/models": "/home/user/models"
  }
}

Select GPUs and CPUs

Specify GPU and CPU subsets:

{
  "docker_gpus": "0,2-4,7",
  "docker_cpus": "0-15,32-47"
}

Format: comma-separated indices with hyphen ranges. For example, "0,2-4,7" selects devices 0, 2, 3, 4, and 7.

Performance Configuration

Timeout Settings

{
  "timeout_multiplier": 2.0
}

Or use the command-line option:

madengine run --tags model --timeout 7200

Local Data Mirroring

Force local data caching:

{
  "mirrorlocal": "/tmp/local_mirror"
}

Or use the command-line option:

madengine run --tags model --force-mirror-local /tmp/mirror

Kubernetes Deployment

Minimal Configuration

{
  "k8s": {
    "gpu_count": 1
  }
}

Automatically applies (see presets under src/madengine/deployment/presets/k8s/):

  • Namespace: default
  • Resource limits based on GPU count
  • Image pull policy: Always (base default)
  • Service account: default
  • GPU vendor detection from context
  • k8s.secrets defaults (see below)
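
To deploy with this minimal configuration, pass it through the standard context flag (a sketch, assuming the usual run entry point):

madengine run --tags model \
  --additional-context '{"k8s": {"gpu_count": 1}}'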

Full Configuration

{
  "k8s": {
    "gpu_count": 2,
    "namespace": "ml-team",
    "gpu_vendor": "AMD",
    "memory": "32Gi",
    "memory_limit": "64Gi",
    "cpu": "16",
    "cpu_limit": "32",
    "service_account": "madengine-sa",
    "image_pull_policy": "Always",
    "ttl_seconds_after_finished": null,
    "allow_privileged_profiling": null,
    "secrets": {
      "strategy": "from_local_credentials",
      "image_pull_secret_names": ["my-registry-secret"],
      "runtime_secret_name": null
    }
  }
}

K8s Options:

  • gpu_count - Number of GPUs (required)
  • namespace - Kubernetes namespace (default: default)
  • gpu_vendor - GPU vendor override (auto-detected from context)
  • memory - Memory request (default: auto-scaled by GPU count)
  • memory_limit - Memory limit (default: 2× memory request)
  • cpu - CPU cores request (default: auto-scaled by GPU count)
  • cpu_limit - CPU cores limit (default: 2× CPU request)
  • service_account - Service account name
  • image_pull_policy - Always, IfNotPresent, or Never
  • ttl_seconds_after_finished - Optional Job TTL in seconds (auto-delete finished Job); null to omit
  • allow_privileged_profiling - null (default) enables an elevated securityContext only when tools/profiling are configured; set true or false to force it on or off
  • secrets.strategy - from_local_credentials (default): create Secret objects from the local credential.json at deploy time; existing: only reference pre-created Secrets (see the example below); omit: create no runtime Secret from the client
  • secrets.image_pull_secret_names - Extra pull-secret names (strings), merged with any secrets created from credential.json when using from_local_credentials
  • secrets.runtime_secret_name - Required for existing (a pre-created opaque Secret with the key credential.json); optional for omit if you still mount a runtime Secret
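
For example, referencing a pre-created Secret with the existing strategy (the Secret name is a placeholder):

{
  "k8s": {
    "gpu_count": 1,
    "secrets": {
      "strategy": "existing",
      "runtime_secret_name": "madengine-credentials"
    }
  }
}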

Multi-Node Kubernetes

{
  "k8s": {
    "gpu_count": 8
  },
  "distributed": {
    "launcher": "torchrun",
    "nnodes": 2,
    "nproc_per_node": 4
  }
}

SLURM Deployment

Basic Configuration

{
  "slurm": {
    "partition": "gpu",
    "gpus_per_node": 4,
    "time": "02:00:00"
  }
}

Full Configuration

{
  "slurm": {
    "partition": "gpu",
    "account": "research_group",
    "qos": "normal",
    "gpus_per_node": 8,
    "nodes": 2,
    "nodelist": "node01,node02",
    "time": "24:00:00",
    "mem": "64G",
    "mail_user": "user@example.com",
    "mail_type": "ALL"
  }
}

Note: nodelist is optional; omit it to let SLURM choose nodes. When set, the job runs only on the listed nodes and node health preflight is skipped.

SLURM Options:

  • partition - SLURM partition name (required)
  • account - Billing account
  • qos - Quality of Service
  • gpus_per_node - GPUs per node (default: 1)
  • nodes - Number of nodes (default: 1)
  • nodelist - Comma-separated node names to run on (e.g. "node01,node02"); when set, job is restricted to these nodes and automatic node health preflight is skipped
  • time - Wall time limit HH:MM:SS (required)
  • mem - Memory per node (e.g., "64G")
  • mail_user - Email for notifications
  • mail_type - Notification types (BEGIN, END, FAIL, ALL)

Multi-Node SLURM

{
  "slurm": {
    "partition": "gpu",
    "nodes": 4,
    "gpus_per_node": 8,
    "time": "48:00:00"
  },
  "distributed": {
    "launcher": "torchrun",
    "nnodes": 4,
    "nproc_per_node": 8
  }
}

Distributed Training

Launcher Configuration

{
  "distributed": {
    "launcher": "torchrun",
    "nnodes": 2,
    "nproc_per_node": 4,
    "master_port": 29500
  }
}

Launcher Options:

  • launcher - Framework name (required)
  • nnodes - Number of nodes
  • nproc_per_node - Processes/GPUs per node
  • master_port - Master communication port (default: 29500)

Supported Launchers:

  • torchrun - PyTorch DDP/FSDP
  • deepspeed - ZeRO optimization
  • megatron - Large transformers (K8s + SLURM)
  • torchtitan - LLM pre-training
  • vllm - LLM inference
  • sglang - Structured generation

See Launchers Guide for details.

TorchTitan Configuration

{
  "distributed": {
    "launcher": "torchtitan",
    "nnodes": 4,
    "nproc_per_node": 8
  },
  "env_vars": {
    "TORCHTITAN_TENSOR_PARALLEL_SIZE": "8",
    "TORCHTITAN_PIPELINE_PARALLEL_SIZE": "4",
    "TORCHTITAN_FSDP_ENABLED": "1"
  }
}

vLLM Configuration

{
  "distributed": {
    "launcher": "vllm",
    "nnodes": 2,
    "nproc_per_node": 4
  },
  "vllm": {
    "tensor_parallel_size": 4,
    "pipeline_parallel_size": 1
  }
}

Profiling Configuration

Basic Profiling

{
  "tools": [
    {"name": "rocprof"}
  ]
}

Custom Tool Configuration

{
  "tools": [
    {
      "name": "rocprof",
      "cmd": "rocprof --timestamp on",
      "env_vars": {
        "NCCL_DEBUG": "INFO"
      }
    }
  ]
}

Multiple Tools (Stackable)

{
  "tools": [
    {"name": "rocprof"},
    {"name": "miopen_trace"},
    {"name": "rocblas_trace"}
  ]
}

Available Tools:

  • rocprof - GPU profiling
  • rpd - ROCm Profiler Data
  • rocblas_trace - rocBLAS library tracing
  • miopen_trace - MIOpen library tracing
  • tensile_trace - Tensile library tracing
  • rccl_trace - RCCL communication tracing
  • gpu_info_power_profiler - Power consumption profiling
  • gpu_info_vram_profiler - VRAM usage profiling

See Profiling Guide for details.

Pre/Post Execution Scripts

Run scripts before and after model execution:

{
  "pre_scripts": [
    {
      "path": "scripts/common/pre_scripts/setup.sh",
      "args": "-v"
    }
  ],
  "encapsulate_script": "scripts/common/wrapper.sh",
  "post_scripts": [
    {
      "path": "scripts/common/post_scripts/cleanup.sh",
      "args": "-r"
    }
  ]
}

Model Arguments

Pass arguments to the model execution script:

{
  "model_args": "--model_name_or_path bigscience/bloom --batch_size 32"
}

Data Provider Configuration

Configure in data.json (MAD package root):

{
  "data_sources": {
    "model_data": {
      "nas": {"path": "/home/datum"},
      "minio": {"path": "s3://datasets/datum"},
      "aws": {"path": "s3://datasets/datum"}
    }
  },
  "mirrorlocal": "/tmp/local_mirror"
}

Credential Configuration

Configure in credential.json (MAD package root):

{
  "dockerhub": {
    "username": "your_username",
    "password": "your_token",
    "repository": "myorg"
  },
  "AMD_GITHUB": {
    "username": "github_username",
    "password": "github_token"
  },
  "MAD_AWS_S3": {
    "username": "aws_access_key",
    "password": "aws_secret_key"
  }
}

Environment Variable Override

export MAD_DOCKERHUB_USER=myusername
export MAD_DOCKERHUB_PASSWORD=mytoken
export MAD_DOCKERHUB_REPO=myorg

Configuration Priority

For Kubernetes/SLURM deployments:

  1. CLI overrides (--additional-context) - Highest
  2. User config file (--additional-context-file)
  3. Profile presets (single-gpu/multi-gpu/multi-node)
  4. GPU vendor presets (AMD/NVIDIA optimizations)
  5. Base defaults (k8s/defaults.json)
  6. Environment variables
  7. Built-in fallbacks - Lowest
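
For example, when both sources set k8s.gpu_count, the inline context wins (a sketch of the precedence above):

# config.json contains {"k8s": {"gpu_count": 1}}
madengine run --tags model \
  --additional-context-file config.json \
  --additional-context '{"k8s": {"gpu_count": 2}}'
# Effective gpu_count: 2 (CLI override is highest)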

Complete Examples

Local GPU Development

{
  "gpu_vendor": "AMD",
  "guest_os": "UBUNTU",
  "docker_gpus": "0",
  "docker_env_vars": {
    "PYTORCH_TUNABLEOP_ENABLED": "1"
  }
}

Kubernetes Single-GPU

{
  "k8s": {
    "gpu_count": 1,
    "namespace": "dev"
  }
}

Kubernetes Multi-GPU Training

{
  "k8s": {
    "gpu_count": 4,
    "memory": "64Gi",
    "cpu": "32"
  },
  "distributed": {
    "launcher": "torchrun",
    "nnodes": 1,
    "nproc_per_node": 4
  }
}

SLURM Multi-Node

{
  "slurm": {
    "partition": "gpu",
    "nodes": 8,
    "gpus_per_node": 8,
    "time": "72:00:00",
    "account": "research_proj"
  },
  "distributed": {
    "launcher": "deepspeed",
    "nnodes": 8,
    "nproc_per_node": 8
  }
}

Production with Profiling

{
  "k8s": {
    "gpu_count": 2,
    "namespace": "production",
    "memory": "32Gi"
  },
  "tools": [
    {"name": "rocprof"},
    {"name": "gpu_info_power_profiler"}
  ],
  "docker_env_vars": {
    "NCCL_DEBUG": "INFO",
    "PYTORCH_TUNABLEOP_ENABLED": "1"
  }
}

Troubleshooting

Configuration Not Applied

# Verify configuration is valid JSON
python -m json.tool config.json

# Use verbose logging
madengine run --tags model \
  --additional-context-file config.json \
  --verbose

Environment Variables Not Set

# Check environment variables
env | grep MAD

# Verify Docker receives env vars
docker inspect container_name | grep -A 10 Env

GPU Vendor Auto-Detection

madengine auto-detects GPU vendor if not specified:

  • Looks for ROCm drivers → AMD
  • Looks for CUDA drivers → NVIDIA
  • Falls back to configuration or fails

Override with explicit configuration:

{
  "gpu_vendor": "AMD"
}

Best Practices

  1. Use configuration files for complex settings
  2. Start with minimal configs and add as needed
  3. Validate JSON syntax before running
  4. Use environment variables for sensitive data
  5. Test locally first before deploying
  6. Enable verbose logging when debugging
  7. Document custom configurations for team use

Next Steps