Merged
12 changes: 11 additions & 1 deletion README.md
@@ -30,6 +30,7 @@ madengine is a modern CLI tool for running Large Language Models (LLMs) and Deep
- [Reporting and Database](#-reporting-and-database)
- [Installation](#-installation)
- [Tips & Best Practices](#-tips--best-practices)
- [Log error pattern scan](#log-error-pattern-scan)
- [Exit codes and CI](#exit-codes-and-ci)
- [Contributing](#-contributing)
- [License](#-license)
@@ -46,6 +47,7 @@ madengine is a modern CLI tool for running Large Language Models (LLMs) and Deep
- **🎯 ROCprofv3 Profiles** - 8 pre-configured profiles for compute/memory/communication bottleneck analysis
- **🔍 Environment Validation** - TheRock ROCm detection and validation tools
- **⚙️ Intelligent Defaults** - Minimal K8s configs with automatic preset application
- **📋 Configurable log scan** - Optional `--additional-context` keys to disable or tune post-run log substring checks (see [Log error pattern scan](#log-error-pattern-scan))

## 🚀 Quick Start

@@ -122,7 +124,7 @@ For detailed command options, see the **[CLI Command Reference](docs/cli-referen
| [Usage Guide](docs/usage.md) | Commands, workflows, and examples ([`--skip-model-run`](docs/usage.md#skip-model-run-after-build)) |
| **[CLI Reference](docs/cli-reference.md)** | **Detailed command options and examples** |
| [Deployment](docs/deployment.md) | Kubernetes and SLURM deployment |
| [Configuration](docs/configuration.md) | Advanced configuration options |
| [Configuration](docs/configuration.md) | Advanced options; [run log error pattern scan](docs/configuration.md#run-phase-log-error-pattern-scan) |
| [Batch Build](docs/batch-build.md) | Selective builds for CI/CD |
| [Launchers](docs/launchers.md) | Distributed training frameworks |
| [Profiling](docs/profiling.md) | Performance analysis tools |
@@ -553,6 +555,13 @@ See [Installation Guide](docs/installation.md) for detailed instructions.
- **Enable verbose logging** (`--verbose`) when debugging issues
- **Use `--live-output`** for real-time monitoring of long-running operations

### Log error pattern scan

After a local Docker run, madengine can scan the captured **run log** for common failure substrings (for example `RuntimeError:`, `CUDA out of memory`, `Traceback (most recent call last)`). This catches hard failures when exit codes are ambiguous. However, some workloads log benign `RuntimeError:` text while their tests still pass, which can produce a false `FAILURE` status.

- **Disable** the scan when another signal is authoritative (e.g. pytest/JUnit inside the image): set `"log_error_pattern_scan": false` in `--additional-context` or in the model entry in `models.json`. See [Configuration — Run phase: log error pattern scan](docs/configuration.md#run-phase-log-error-pattern-scan).
- **Extend exclusions** with `log_error_benign_patterns` (list of strings), or **replace** the default pattern list with `log_error_patterns` (non-empty list of strings) for advanced cases.
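
Inline JSON on a command line is easy to misquote. As a minimal sketch, assuming the context is assembled in a Python wrapper script, the JSON can be serialized with `json.dumps` and quoted with `shlex.quote`. The helper name `build_run_command` and the tag `my_unit_test_suite` are illustrative, not part of madengine.

```python
import json
import shlex


def build_run_command(tag, context):
    """Return a shell-safe `madengine run` command line (hypothetical helper)."""
    # json.dumps writes Python False as lowercase `false`, as JSON requires.
    context_json = json.dumps(context)
    args = ["madengine", "run", "--tags", tag, "--additional-context", context_json]
    # Quote every argument so the JSON survives shell parsing intact.
    return " ".join(shlex.quote(a) for a in args)


context = {
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "log_error_pattern_scan": False,  # pytest/JUnit is authoritative here
}
command = build_run_command("my_unit_test_suite", context)
print(command)
```

The printed string can be pasted into a shell or a CI step; splitting it again with `shlex.split` recovers the original arguments.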

### CI / Jenkins

- **Exit codes:** The CLI uses fixed exit codes (`ExitCode` in `madengine.cli.constants`, e.g. `SUCCESS=0`, `RUN_FAILURE=3`, `INVALID_ARGS=4`). Pipelines should treat **non-zero** as failure; no log scraping is required for pass/fail.
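
A pipeline step that relies only on the exit code could look like the sketch below. The three code values are the ones quoted above; `describe_exit_code` and `run_and_check` are hypothetical helper names, and the demo launches a stand-in Python process instead of madengine so that it runs anywhere.

```python
import subprocess
import sys

# Values quoted from this README; other ExitCode members are not listed here.
KNOWN_EXIT_CODES = {0: "SUCCESS", 3: "RUN_FAILURE", 4: "INVALID_ARGS"}


def describe_exit_code(code):
    """Map an exit code to a label; unknown non-zero codes still count as failure."""
    return KNOWN_EXIT_CODES.get(code, "UNKNOWN_FAILURE")


def run_and_check(argv):
    """Run a command and decide pass/fail from its exit code alone."""
    result = subprocess.run(argv, check=False)
    return result.returncode == 0, describe_exit_code(result.returncode)


# Stand-in command that exits with code 3; in CI this would be the madengine call.
passed, label = run_and_check([sys.executable, "-c", "raise SystemExit(3)"])
print(passed, label)
```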
@@ -598,6 +607,7 @@ madengine build --tags model --clean-docker-cache --verbose

**Common Issues:**
- **False failures with profiling**: If models show FAILURE but have performance metrics, see [Profiling Troubleshooting](docs/profiling.md#false-failure-detection-with-rocprof)
- **False failures from `RuntimeError:` in logs**: If the workload logs expected exception text but tests pass, disable or tune the scan with `log_error_pattern_scan` / `log_error_benign_patterns` — see [Configuration](docs/configuration.md#run-phase-log-error-pattern-scan)
- **ROCProf log errors**: Messages like `E20251230` are informational logs, not errors (fixed in v2.0+)
- **Configuration errors**: Validate JSON with `python -m json.tool your-config.json`

2 changes: 1 addition & 1 deletion docs/README.md
@@ -15,7 +15,7 @@ Complete documentation for madengine - AI model automation and distributed bench

| Guide | Description |
|-------|-------------|
| [Configuration](configuration.md) | Advanced configuration options |
| [Configuration](configuration.md) | Advanced configuration options (includes [run log error pattern scan](configuration.md#run-phase-log-error-pattern-scan)) |
| [Batch Build](batch-build.md) | Selective builds with batch manifests |
| [Deployment](deployment.md) | Kubernetes and SLURM deployment |
| [Launchers](launchers.md) | Multi-node training frameworks |
10 changes: 10 additions & 0 deletions docs/cli-reference.md
@@ -598,6 +598,16 @@ For complex configurations, use JSON files with `--additional-context-file`:

To run on specific nodes, add `"nodelist": "node01,node02"` to the `slurm` section. When set, the job runs only on those nodes and node health preflight is skipped. See [examples/slurm-configs/basic/03-multi-node-basic-nodelist.json](../examples/slurm-configs/basic/03-multi-node-basic-nodelist.json).

### Run phase: log error pattern scan (optional)

These keys apply to **local Docker runs** when madengine post-processes the run log. Use them when substring matches cause false `FAILURE` status (for example benign `RuntimeError:` lines). Full details: [Configuration — Run phase: log error pattern scan](configuration.md#run-phase-log-error-pattern-scan).

| Key | Description |
|-----|-------------|
| `log_error_pattern_scan` | Default `true`. Set `false` to skip grep-based log failure detection. |
| `log_error_benign_patterns` | Array of extra strings to exclude from matching (merged with built-in benign list). |
| `log_error_patterns` | Non-empty array replaces the default substring list (advanced). |
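
The interaction of the three keys can be illustrated with a toy scanner. This is a sketch under stated assumptions, not the madengine implementation: the real check uses grep through a subprocess, the pattern and exclusion lists below are only subsets of the built-in ones, and the sketch assumes that a line containing a benign substring is dropped before matching. The function name `scan_log` and the sample log lines are made up.

```python
# Subsets of the built-in lists, for illustration only.
DEFAULT_PATTERNS = ["RuntimeError:", "CUDA out of memory", "Traceback (most recent call last)"]
BUILTIN_BENIGN = ["rocpd_op:", "rpd_tracer:"]


def scan_log(lines, context):
    """Return the first (pattern, line) match, or None if nothing matches."""
    if context.get("log_error_pattern_scan", True) is False:
        return None  # scan disabled: rely on exit codes and other signals
    # A non-empty custom list replaces the defaults entirely.
    patterns = context.get("log_error_patterns") or DEFAULT_PATTERNS
    # Extra benign substrings are appended to the built-in exclusions.
    benign = BUILTIN_BENIGN + context.get("log_error_benign_patterns", [])
    for line in lines:
        if any(b in line for b in benign):
            continue  # assumed behavior: excluded lines never match
        for pattern in patterns:
            if pattern in line:
                return pattern, line
    return None


log = [
    "test_layer.py::test_bad_shape PASSED",
    "expected: RuntimeError: shape mismatch (negative test)",
]
print(scan_log(log, {}))
print(scan_log(log, {"log_error_benign_patterns": ["(negative test)"]}))
print(scan_log(log, {"log_error_pattern_scan": False}))
print(scan_log(log, {"log_error_patterns": ["Segmentation fault"]}))
```

Only the first call reports a match; the other three show the effect of adding an exclusion, disabling the scan, and replacing the pattern list.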

---

## Environment Variables
35 changes: 35 additions & 0 deletions docs/configuration.md
@@ -84,6 +84,41 @@ For production deployments:
The **run** command does NOT require these values because it can detect GPU vendor at runtime.
Defaults only apply to the **build** command where Dockerfile selection requires them.

## Run phase: log error pattern scan

After a successful container run, madengine may scan the **run log file** for fixed substrings (for example `RuntimeError:`, `OutOfMemoryError`, `Traceback (most recent call last)`). If a match is found, the run can be marked `FAILURE` even when performance metrics exist. The scan is intended as a safety net for logs that show obvious Python or OOM errors.

Some suites (for example layer unit tests) intentionally print benign `RuntimeError:` text while pytest still passes. In those cases you can **disable** the scan or **narrow** what counts as an error.

Keys can be set in `--additional-context` / `--additional-context-file`, or on the **model** entry in `models.json` (same keys). **Runtime context overrides the model** when both are set.

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `log_error_pattern_scan` | bool or string/number (coerced) | `true` | If `false`, skip substring-based log failure detection entirely (rely on exit codes and other signals). The strings `"false"`, `"0"`, `"no"`, and `"off"` and the number `0` are coerced to `false`. |
| `log_error_benign_patterns` | array of strings | `[]` | Extra substrings to **exclude** before matching (appended to built-in exclusions such as ROCProf/metrics noise). Model entries are merged first, then context entries. |
| `log_error_patterns` | array of strings (non-empty) | (built-in list) | If set, **replaces** the default pattern list. Use only when you need a custom list of failure substrings. |
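
The precedence and merge rules can be sketched in a few lines. This is a simplified, standalone re-statement of the helpers this PR adds in `container_runner_helpers.py`; the names `pick` and `to_bool` and the sample values are illustrative only.

```python
def pick(model, context, key, default=None):
    """Runtime context wins over the model entry; otherwise use the default."""
    if key in context:
        return context[key]
    return model.get(key, default)


def to_bool(value, default=True):
    """Coerce JSON/CLI scalars such as "false", 0, or "off" to a bool."""
    if value is None:
        return default
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return value != 0
    if isinstance(value, str):
        text = value.strip().lower()
        if text in ("0", "false", "no", "off", ""):
            return False
        if text in ("1", "true", "yes", "on"):
            return True
    return default  # unrecognized values keep the default


model = {"log_error_pattern_scan": "false", "log_error_benign_patterns": ["model noise"]}
context = {"log_error_pattern_scan": True, "log_error_benign_patterns": ["context noise"]}

# The scalar key is overridden by the context; the benign lists are concatenated.
scan_enabled = to_bool(pick(model, context, "log_error_pattern_scan", True))
extra_benign = (
    model.get("log_error_benign_patterns", [])
    + context.get("log_error_benign_patterns", [])
)
print(scan_enabled, extra_benign)
```

Here the model entry disables the scan with the string `"false"`, but the runtime context re-enables it, while both benign lists contribute entries in model-then-context order.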

**Example — disable scan for a tag (pytest is authoritative):**

```bash
madengine run --tags my_unit_test_suite \
--additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU", "log_error_pattern_scan": false}'
```

**Example — extra benign substrings (prefer stable strings from real logs):**

```json
{
"gpu_vendor": "AMD",
"guest_os": "UBUNTU",
"log_error_benign_patterns": [
"expected benign fragment from workload log"
]
}
```

Disabling the scan does **not** change performance metric extraction from the log; it only affects the post-hoc grep used to set `has_errors` for status.

## Basic Configuration

**gpu_vendor** (case-insensitive):
2 changes: 2 additions & 0 deletions docs/profiling.md
@@ -841,6 +841,8 @@ ROCProf uses glog-style logging where `E` prefix means "Error level log" (not an

**Fixed in:** madengine v2.0+

For false failures **not** caused by ROCProf (for example workloads that print benign `RuntimeError:` text), see [Configuration — Run phase: log error pattern scan](configuration.md#run-phase-log-error-pattern-scan) (`log_error_pattern_scan`, `log_error_benign_patterns`).

**Verification:**
```bash
# Run with profiling - should show SUCCESS status
2 changes: 2 additions & 0 deletions docs/usage.md
@@ -400,6 +400,8 @@ madengine run --tags model --verbose --live-output
madengine run --tags model --keep-alive --verbose --live-output
```

If the run is marked `FAILURE` because the log contains benign substrings (for example `RuntimeError:`) while the workload actually passed, configure [log error pattern scan](configuration.md#run-phase-log-error-pattern-scan) (`log_error_pattern_scan`, `log_error_benign_patterns`).

### Clean Rebuild

```bash
28 changes: 28 additions & 0 deletions src/madengine/cli/validators.py
@@ -170,6 +170,34 @@ def validate_additional_context_structure(context: Dict[str, Any]) -> None:
    if "guest_os" in context and not isinstance(context["guest_os"], str):
        _fail_structure("guest_os", "a string")

    if "log_error_pattern_scan" in context and not isinstance(
        context["log_error_pattern_scan"], (bool, str, int, float, type(None))
    ):
        _fail_structure(
            "log_error_pattern_scan",
            "a boolean, string, number, or null",
        )

    if "log_error_benign_patterns" in context:
        lebp = context["log_error_benign_patterns"]
        if not isinstance(lebp, list) or not all(
            isinstance(x, str) for x in lebp
        ):
            _fail_structure(
                "log_error_benign_patterns",
                "an array of strings",
            )

    if "log_error_patterns" in context:
        lep = context["log_error_patterns"]
        if not isinstance(lep, list) or not lep or not all(
            isinstance(x, str) for x in lep
        ):
            _fail_structure(
                "log_error_patterns",
                "a non-empty array of strings",
            )


def _normalize_docker_build_arg_values(context: Dict[str, Any]) -> None:
dba = context.get("docker_build_arg")
38 changes: 18 additions & 20 deletions src/madengine/execution/container_runner.py
@@ -35,6 +35,7 @@
from madengine.utils.run_details import get_build_number, get_pipeline
from madengine.execution.container_runner_helpers import (
    make_run_log_file_path,
    resolve_log_error_scan_config,
    resolve_run_timeout,
)

@@ -1249,27 +1250,18 @@ def run_container(
# Set status based on performance and error patterns
# First check for obvious failure patterns in the logs
try:
# Check for common failure patterns in the log file
# Note: Patterns should be specific enough to avoid false positives
# from profiling tools (rocprof, etc.) that use "Error:" as log level
error_patterns = [
"OutOfMemoryError",
"HIP out of memory",
"CUDA out of memory",
"RuntimeError:", # More specific with colon
"AssertionError:",
"ValueError:",
"SystemExit",
"failed (exitcode:", # Literal text in logs
"Traceback (most recent call last)", # Python tracebacks
"FAILED",
"Exception:",
"ImportError:",
"ModuleNotFoundError:",
]
scan_logs, error_patterns, extra_benign = (
resolve_log_error_scan_config(
model_info, self.additional_context
)
)

has_errors = False
if log_file_path and os.path.exists(log_file_path):
if (
scan_logs
and log_file_path
and os.path.exists(log_file_path)
):
try:
# Define benign patterns to exclude from error detection
# These are known warnings/info messages that should not trigger failures
@@ -1289,7 +1281,8 @@ def run_container(
"rocpd_op:", # ROCProf operation logs
"rpd_tracer:", # ROCProf tracer logs
]

benign_patterns.extend(extra_benign)

# Check for error patterns in the log (exclude our own grep commands, output messages, and benign patterns).
# Use subprocess (not console.sh) so the check runs silently and does not clutter console output.
for pattern in error_patterns:
@@ -1326,6 +1319,11 @@ def run_container(
pass # Error checking is optional; treat as no match
except Exception:
pass # Error checking is optional
elif not scan_logs:
self.rich_console.print(
"[dim]ℹ️ Log error pattern scan disabled "
"(log_error_pattern_scan).[/dim]"
)

# Status logic: Must have performance AND no errors to be considered success
# Exception: Worker nodes in multi-node training (MAD_COLLECT_METRICS=false)
94 changes: 94 additions & 0 deletions src/madengine/execution/container_runner_helpers.py
@@ -7,6 +7,100 @@

import typing

# Default substrings matched in container run logs post-hoc (see ContainerRunner).
DEFAULT_LOG_ERROR_PATTERNS: typing.Tuple[str, ...] = (
    "OutOfMemoryError",
    "HIP out of memory",
    "CUDA out of memory",
    "RuntimeError:",
    "AssertionError:",
    "ValueError:",
    "SystemExit",
    "failed (exitcode:",
    "Traceback (most recent call last)",
    "FAILED",
    "Exception:",
    "ImportError:",
    "ModuleNotFoundError:",
)


def _coerce_bool(value: typing.Any, *, default: bool) -> bool:
    """Interpret JSON/CLI scalars as bool; fall back to *default* if None."""
    if value is None:
        return default
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return value != 0
    if isinstance(value, str):
        s = value.strip().lower()
        if s in ("0", "false", "no", "off", ""):
            return False
        if s in ("1", "true", "yes", "on"):
            return True
    return default


def _pick_context_over_model(
    model_info: typing.Dict,
    additional_context: typing.Dict,
    key: str,
    default: typing.Any = None,
) -> typing.Any:
    """Resolve key from model_info, overridden by additional_context when present."""
    ctx = additional_context or {}
    mi = model_info or {}
    if key in ctx:
        return ctx[key]
    if key in mi:
        return mi[key]
    return default


def resolve_log_error_scan_config(
    model_info: typing.Dict,
    additional_context: typing.Optional[typing.Dict] = None,
) -> typing.Tuple[bool, typing.List[str], typing.List[str]]:
    """
    Resolve whether to scan run logs for error substrings and which patterns to use.

    Keys (in ``additional_context`` and/or ``model_info``; context wins):

    - ``log_error_pattern_scan`` (default True): set False to skip grep-based failure detection.
    - ``log_error_benign_patterns``: list of extra substrings/regex fragments excluded from matches.
    - ``log_error_patterns``: non-empty list of strings replaces the default error pattern list.

    Returns:
        (scan_enabled, error_patterns, extra_benign_patterns)
    """
    ctx = additional_context if additional_context is not None else {}
    mi = model_info if model_info is not None else {}

    scan_enabled = _coerce_bool(
        _pick_context_over_model(mi, ctx, "log_error_pattern_scan", True),
        default=True,
    )

    raw_benign_mi = mi.get("log_error_benign_patterns")
    raw_benign_ctx = ctx.get("log_error_benign_patterns")
    extra_benign: typing.List[str] = []
    for part in (raw_benign_mi, raw_benign_ctx):
        if isinstance(part, list):
            extra_benign.extend(str(x) for x in part if x is not None)

    custom_patterns = _pick_context_over_model(mi, ctx, "log_error_patterns", None)
    if (
        isinstance(custom_patterns, list)
        and len(custom_patterns) > 0
        and all(isinstance(x, str) for x in custom_patterns)
    ):
        error_patterns = list(custom_patterns)
    else:
        error_patterns = list(DEFAULT_LOG_ERROR_PATTERNS)

    return scan_enabled, error_patterns, extra_benign


def resolve_run_timeout(
model_info: typing.Dict,