ROCm · coketaste · Apr 8, 2026 · Apr 7, 2026 · Apr 8, 2026 · Apr 8, 2026
diff --git a/README.md b/README.md
@@ -30,6 +30,7 @@ madengine is a modern CLI tool for running Large Language Models (LLMs) and Deep
 - [Reporting and Database](#-reporting-and-database)
 - [Installation](#-installation)
 - [Tips & Best Practices](#-tips--best-practices)
+  - [Log error pattern scan](#log-error-pattern-scan)
   - [Exit codes and CI](#exit-codes-and-ci)
 - [Contributing](#-contributing)
 - [License](#-license)
@@ -46,6 +47,7 @@ madengine is a modern CLI tool for running Large Language Models (LLMs) and Deep
 - **🎯 ROCprofv3 Profiles** - 8 pre-configured profiles for compute/memory/communication bottleneck analysis
 - **🔍 Environment Validation** - TheRock ROCm detection and validation tools
 - **⚙️ Intelligent Defaults** - Minimal K8s configs with automatic preset application
+- **📋 Configurable log scan** - Optional `--additional-context` keys to disable or tune post-run log substring checks (see [Log error pattern scan](#log-error-pattern-scan))
 
 ## 🚀 Quick Start
 
@@ -122,7 +124,7 @@ For detailed command options, see the **[CLI Command Reference](docs/cli-referen
 | [Usage Guide](docs/usage.md) | Commands, workflows, and examples ([`--skip-model-run`](docs/usage.md#skip-model-run-after-build)) |
 | **[CLI Reference](docs/cli-reference.md)** | **Detailed command options and examples** |
 | [Deployment](docs/deployment.md) | Kubernetes and SLURM deployment |
-| [Configuration](docs/configuration.md) | Advanced configuration options |
+| [Configuration](docs/configuration.md) | Advanced options; [run log error pattern scan](docs/configuration.md#run-phase-log-error-pattern-scan) |
 | [Batch Build](docs/batch-build.md) | Selective builds for CI/CD |
 | [Launchers](docs/launchers.md) | Distributed training frameworks |
 | [Profiling](docs/profiling.md) | Performance analysis tools |
@@ -553,6 +555,13 @@ See [Installation Guide](docs/installation.md) for detailed instructions.
 - **Enable verbose logging** (`--verbose`) when debugging issues
 - **Use `--live-output`** for real-time monitoring of long-running operations
 
+### Log error pattern scan
+
+After a local Docker run, madengine can scan the captured **run log** for common failure substrings (for example `RuntimeError:`, `CUDA out of memory`, `Traceback (most recent call last)`). The scan catches hard failures when exit codes are ambiguous, but some workloads log benign `RuntimeError:` text while their tests still pass.
+
+- **Disable** the scan when another signal is authoritative (e.g. pytest/JUnit inside the image): set `"log_error_pattern_scan": false` in `--additional-context` or in the model entry in `models.json`. See [Configuration — Run phase: log error pattern scan](docs/configuration.md#run-phase-log-error-pattern-scan).
+- **Extend exclusions** with `log_error_benign_patterns` (list of strings), or **replace** the default pattern list with `log_error_patterns` (non-empty list of strings) for advanced cases.
+
 ### CI / Jenkins
 
 - **Exit codes:** The CLI uses fixed exit codes (`ExitCode` in `madengine.cli.constants`, e.g. `SUCCESS=0`, `RUN_FAILURE=3`, `INVALID_ARGS=4`). Pipelines should treat **non-zero** as failure; no log scraping is required for pass/fail.
@@ -598,6 +607,7 @@ madengine build --tags model --clean-docker-cache --verbose
 
 **Common Issues:**
 - **False failures with profiling**: If models show FAILURE but have performance metrics, see [Profiling Troubleshooting](docs/profiling.md#false-failure-detection-with-rocprof)
+- **False failures from `RuntimeError:` in logs**: If the workload logs expected exception text but tests pass, disable or tune the scan with `log_error_pattern_scan` / `log_error_benign_patterns` — see [Configuration](docs/configuration.md#run-phase-log-error-pattern-scan)
 - **ROCProf log errors**: Messages like `E20251230` are informational logs, not errors (fixed in v2.0+)
 - **Configuration errors**: Validate JSON with `python -m json.tool your-config.json`
 

diff --git a/docs/README.md b/docs/README.md
@@ -15,7 +15,7 @@ Complete documentation for madengine - AI model automation and distributed bench
 
 | Guide | Description |
 |-------|-------------|
-| [Configuration](configuration.md) | Advanced configuration options |
+| [Configuration](configuration.md) | Advanced configuration options (includes [run log error pattern scan](configuration.md#run-phase-log-error-pattern-scan)) |
 | [Batch Build](batch-build.md) | Selective builds with batch manifests |
 | [Deployment](deployment.md) | Kubernetes and SLURM deployment |
 | [Launchers](launchers.md) | Multi-node training frameworks |

diff --git a/docs/cli-reference.md b/docs/cli-reference.md
@@ -598,6 +598,16 @@ For complex configurations, use JSON files with `--additional-context-file`:
 
 To run on specific nodes, add `"nodelist": "node01,node02"` to the `slurm` section. When set, the job runs only on those nodes and node health preflight is skipped. See [examples/slurm-configs/basic/03-multi-node-basic-nodelist.json](../examples/slurm-configs/basic/03-multi-node-basic-nodelist.json).
 
+### Run phase: log error pattern scan (optional)
+
+These keys apply to **local Docker runs** when madengine post-processes the run log. Use them when substring matches cause false `FAILURE` status (for example benign `RuntimeError:` lines). Full details: [Configuration — Run phase: log error pattern scan](configuration.md#run-phase-log-error-pattern-scan).
+
+| Key | Description |
+|-----|-------------|
+| `log_error_pattern_scan` | Default `true`. Set `false` to skip grep-based log failure detection. |
+| `log_error_benign_patterns` | Array of extra strings to exclude from matching (merged with built-in benign list). |
+| `log_error_patterns` | Non-empty array replaces the default substring list (advanced). |
+
 ---
 
 ## Environment Variables

diff --git a/docs/configuration.md b/docs/configuration.md
@@ -84,6 +84,41 @@ For production deployments:
 The **run** command does NOT require these values because it can detect GPU vendor at runtime.
 Defaults only apply to the **build** command where Dockerfile selection requires them.
 
+## Run phase: log error pattern scan
+
+After a successful container run, madengine may scan the **run log file** for fixed substrings (for example `RuntimeError:`, `OutOfMemoryError`, `Traceback (most recent call last)`). If a match is found, the run can be marked `FAILURE` even when performance metrics exist. The scan is intended as a safety net for logs that show obvious Python or out-of-memory errors.
+
+Some suites (for example layer unit tests) intentionally print benign `RuntimeError:` text while pytest still passes. In those cases you can **disable** the scan or **narrow** what counts as an error.
+
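+Keys can be set in `--additional-context` / `--additional-context-file`, or on the **model** entry in `models.json` (same keys). When both are set, the **runtime context overrides the model** for `log_error_pattern_scan` and `log_error_patterns`; `log_error_benign_patterns` lists are merged instead.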
+
+| Key | Type | Default | Description |
+|-----|------|---------|-------------|
+| `log_error_pattern_scan` | bool or string/number (coerced) | `true` | If `false`, skip substring-based log failure detection entirely (rely on exit codes and other signals). |
+| `log_error_benign_patterns` | array of strings | `[]` | Extra substrings; log lines containing them are **excluded** before matching (appended to built-in exclusions such as ROCProf/metrics noise). Model list is merged first, then context list. |
+| `log_error_patterns` | array of strings (non-empty) | (built-in list) | If set, **replaces** the default pattern list. Use only when you need a custom list of failure substrings. |
+
+**Example — disable scan for a tag (pytest is authoritative):**
+
+```bash
+madengine run --tags my_unit_test_suite \
+  --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU", "log_error_pattern_scan": false}'
+```
+
+**Example — extra benign substrings (prefer stable strings from real logs):**
+
+```json
+{
+  "gpu_vendor": "AMD",
+  "guest_os": "UBUNTU",
+  "log_error_benign_patterns": [
+    "expected benign fragment from workload log"
+  ]
+}
+```
+
+Disabling the scan does **not** change performance metric extraction from the log; it only affects the post-hoc grep used to set `has_errors` for status.
+
 ## Basic Configuration
 
 **gpu_vendor** (case-insensitive):
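
To illustrate the keys documented in the `docs/configuration.md` hunk above, here is a minimal sketch of assembling the `--additional-context` argument from Python instead of hand-writing the JSON. The helper name `build_run_command` is hypothetical and not part of madengine; only the `madengine run --tags ... --additional-context ...` shape is taken from the documented example.

```python
import json
import shlex


def build_run_command(tag, context):
    """Hypothetical helper: return a shell-safe `madengine run` command line."""
    # json.dumps turns Python False into JSON false, as the docs example expects.
    payload = json.dumps(context, sort_keys=True)
    return " ".join(
        [
            "madengine",
            "run",
            "--tags",
            shlex.quote(tag),
            "--additional-context",
            shlex.quote(payload),  # single-quotes the JSON for the shell
        ]
    )


context = {
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "log_error_pattern_scan": False,
}
command = build_run_command("my_unit_test_suite", context)
print(command)
```

Generating the JSON this way avoids quoting mistakes when a benign pattern itself contains quotes or shell metacharacters.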

diff --git a/docs/profiling.md b/docs/profiling.md
@@ -841,6 +841,8 @@ ROCProf uses glog-style logging where `E` prefix means "Error level log" (not an
 
 **Fixed in:** madengine v2.0+
 
+For false failures **not** caused by ROCProf (for example workloads that print benign `RuntimeError:` text), see [Configuration — Run phase: log error pattern scan](configuration.md#run-phase-log-error-pattern-scan) (`log_error_pattern_scan`, `log_error_benign_patterns`).
+
 **Verification:**
 ```bash
 # Run with profiling - should show SUCCESS status

diff --git a/docs/usage.md b/docs/usage.md
@@ -400,6 +400,8 @@ madengine run --tags model --verbose --live-output
 madengine run --tags model --keep-alive --verbose --live-output
 ```
 
+If the run is marked `FAILURE` because the log contains benign substrings (for example `RuntimeError:`) while the workload actually passed, configure [log error pattern scan](configuration.md#run-phase-log-error-pattern-scan) (`log_error_pattern_scan`, `log_error_benign_patterns`).
+
 ### Clean Rebuild
 
 ```bash

diff --git a/src/madengine/cli/validators.py b/src/madengine/cli/validators.py
@@ -170,6 +170,34 @@ def validate_additional_context_structure(context: Dict[str, Any]) -> None:
     if "guest_os" in context and not isinstance(context["guest_os"], str):
         _fail_structure("guest_os", "a string")
 
+    if "log_error_pattern_scan" in context and not isinstance(
+        context["log_error_pattern_scan"], (bool, str, int, float, type(None))
+    ):
+        _fail_structure(
+            "log_error_pattern_scan",
+            "a boolean, string, number, or null",
+        )
+
+    if "log_error_benign_patterns" in context:
+        lebp = context["log_error_benign_patterns"]
+        if not isinstance(lebp, list) or not all(
+            isinstance(x, str) for x in lebp
+        ):
+            _fail_structure(
+                "log_error_benign_patterns",
+                "an array of strings",
+            )
+
+    if "log_error_patterns" in context:
+        lep = context["log_error_patterns"]
+        if not isinstance(lep, list) or not lep or not all(
+            isinstance(x, str) for x in lep
+        ):
+            _fail_structure(
+                "log_error_patterns",
+                "a non-empty array of strings",
+            )
+
 
 def _normalize_docker_build_arg_values(context: Dict[str, Any]) -> None:
     dba = context.get("docker_build_arg")
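
The three checks added to `validate_additional_context_structure` can be exercised in isolation with a sketch like the one below. It mirrors the type rules from the hunk above but collects messages instead of calling `_fail_structure`; the function name `log_scan_key_errors` and the message wording are assumptions made for illustration.

```python
from typing import Any, Dict, List


def log_scan_key_errors(context: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: return one message per malformed log-scan key."""
    errors: List[str] = []

    # A missing key yields None, which is accepted just like an explicit null.
    scan = context.get("log_error_pattern_scan")
    if not isinstance(scan, (bool, str, int, float, type(None))):
        errors.append("log_error_pattern_scan: expected a boolean, string, number, or null")

    if "log_error_benign_patterns" in context:
        benign = context["log_error_benign_patterns"]
        # An empty list is allowed here; every element must be a string.
        if not isinstance(benign, list) or not all(isinstance(x, str) for x in benign):
            errors.append("log_error_benign_patterns: expected an array of strings")

    if "log_error_patterns" in context:
        patterns = context["log_error_patterns"]
        # Unlike the benign list, this one must be non-empty.
        if (
            not isinstance(patterns, list)
            or not patterns
            or not all(isinstance(x, str) for x in patterns)
        ):
            errors.append("log_error_patterns: expected a non-empty array of strings")

    return errors


print(log_scan_key_errors({"log_error_pattern_scan": "false"}))
print(log_scan_key_errors({"log_error_patterns": []}))
```

Note the asymmetry the validator encodes: an empty `log_error_benign_patterns` list passes, while an empty `log_error_patterns` list is rejected.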

diff --git a/src/madengine/execution/container_runner.py b/src/madengine/execution/container_runner.py
@@ -35,6 +35,7 @@
 from madengine.utils.run_details import get_build_number, get_pipeline
 from madengine.execution.container_runner_helpers import (
     make_run_log_file_path,
+    resolve_log_error_scan_config,
     resolve_run_timeout,
 )
 
@@ -1249,27 +1250,18 @@ def run_container(
                         # Set status based on performance and error patterns
                         # First check for obvious failure patterns in the logs
                         try:
-                            # Check for common failure patterns in the log file
-                            # Note: Patterns should be specific enough to avoid false positives
-                            # from profiling tools (rocprof, etc.) that use "Error:" as log level
-                            error_patterns = [
-                                "OutOfMemoryError",
-                                "HIP out of memory",
-                                "CUDA out of memory",
-                                "RuntimeError:",  # More specific with colon
-                                "AssertionError:",
-                                "ValueError:",
-                                "SystemExit",
-                                "failed (exitcode:",  # Literal text in logs
-                                "Traceback (most recent call last)",  # Python tracebacks
-                                "FAILED",
-                                "Exception:",
-                                "ImportError:",
-                                "ModuleNotFoundError:",
-                            ]
+                            scan_logs, error_patterns, extra_benign = (
+                                resolve_log_error_scan_config(
+                                    model_info, self.additional_context
+                                )
+                            )
 
                             has_errors = False
-                            if log_file_path and os.path.exists(log_file_path):
+                            if (
+                                scan_logs
+                                and log_file_path
+                                and os.path.exists(log_file_path)
+                            ):
                                 try:
                                     # Define benign patterns to exclude from error detection
                                     # These are known warnings/info messages that should not trigger failures
@@ -1289,7 +1281,8 @@ def run_container(
                                         "rocpd_op:",                          # ROCProf operation logs
                                         "rpd_tracer:",                        # ROCProf tracer logs
                                     ]
-
+                                    benign_patterns.extend(extra_benign)
+
                                     # Check for error patterns in the log (exclude our own grep commands, output messages, and benign patterns).
                                     # Use subprocess (not console.sh) so the check runs silently and does not clutter console output.
                                     for pattern in error_patterns:
@@ -1326,6 +1319,11 @@ def run_container(
                                             pass  # Error checking is optional; treat as no match
                                 except Exception:
                                     pass  # Error checking is optional
+                            elif not scan_logs:
+                                self.rich_console.print(
+                                    "[dim]ℹ️  Log error pattern scan disabled "
+                                    "(log_error_pattern_scan).[/dim]"
+                                )
 
                             # Status logic: Must have performance AND no errors to be considered success
                             # Exception: Worker nodes in multi-node training (MAD_COLLECT_METRICS=false)
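
For readers following the `container_runner.py` change, the intended effect of `extra_benign` can be pictured with a pure-Python sketch. This is not the code in `run_container`, which per its comments shells out to grep via subprocess and is only partly visible in this diff; the sketch assumes plain substring semantics, and `first_error_match` is a hypothetical name.

```python
from typing import Iterable, List, Optional


def first_error_match(
    lines: Iterable[str],
    error_patterns: List[str],
    benign_patterns: List[str],
) -> Optional[str]:
    """Return the first error pattern found on a non-benign line, else None."""
    for line in lines:
        # Lines containing any benign substring are dropped before matching.
        if any(benign in line for benign in benign_patterns):
            continue
        for pattern in error_patterns:
            if pattern in line:
                return pattern
    return None


log = [
    "rocpd_op: RuntimeError: logged by profiler",
    "test_layer PASSED (expected RuntimeError: shape mismatch)",
    "performance: 123.4 samples/sec",
]

built_in_benign = ["rocpd_op:"]
print(first_error_match(log, ["RuntimeError:"], built_in_benign))
print(first_error_match(log, ["RuntimeError:"], built_in_benign + ["expected RuntimeError:"]))
```

The first call reports `RuntimeError:` because only the profiler line is excluded; adding the workload-specific benign substring makes the second call return `None`. In this sketch an empty benign string would exclude every line, so stable, specific fragments are the safer choice.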

diff --git a/src/madengine/execution/container_runner_helpers.py b/src/madengine/execution/container_runner_helpers.py
@@ -7,6 +7,100 @@
 
 import typing
 
+# Default substrings matched in container run logs post-hoc (see ContainerRunner).
+DEFAULT_LOG_ERROR_PATTERNS: typing.Tuple[str, ...] = (
+    "OutOfMemoryError",
+    "HIP out of memory",
+    "CUDA out of memory",
+    "RuntimeError:",
+    "AssertionError:",
+    "ValueError:",
+    "SystemExit",
+    "failed (exitcode:",
+    "Traceback (most recent call last)",
+    "FAILED",
+    "Exception:",
+    "ImportError:",
+    "ModuleNotFoundError:",
+)
+
+
+def _coerce_bool(value: typing.Any, *, default: bool) -> bool:
+    """Interpret JSON/CLI scalars as bool; fall back to *default* if None."""
+    if value is None:
+        return default
+    if isinstance(value, bool):
+        return value
+    if isinstance(value, (int, float)):  # bools already returned above
+        return value != 0
+    if isinstance(value, str):
+        s = value.strip().lower()
+        if s in ("0", "false", "no", "off", ""):
+            return False
+        if s in ("1", "true", "yes", "on"):
+            return True
+    return default
+
+
+def _pick_context_over_model(
+    model_info: typing.Dict,
+    additional_context: typing.Dict,
+    key: str,
+    default: typing.Any = None,
+) -> typing.Any:
+    """Resolve key from model_info, overridden by additional_context when present."""
+    ctx = additional_context or {}
+    mi = model_info or {}
+    if key in ctx:
+        return ctx[key]
+    if key in mi:
+        return mi[key]
+    return default
+
+
+def resolve_log_error_scan_config(
+    model_info: typing.Dict,
+    additional_context: typing.Optional[typing.Dict] = None,
+) -> typing.Tuple[bool, typing.List[str], typing.List[str]]:
+    """
+    Resolve whether to scan run logs for error substrings and which patterns to use.
+
+    Keys (in ``additional_context`` and/or ``model_info``; context wins):
+
+    - ``log_error_pattern_scan`` (default True): set False to skip grep-based failure detection.
+    - ``log_error_benign_patterns``: list of extra substrings/regex fragments excluded from matches.
+    - ``log_error_patterns``: non-empty list of strings replaces the default error pattern list.
+
+    Returns:
+        (scan_enabled, error_patterns, extra_benign_patterns)
+    """
+    ctx = additional_context if additional_context is not None else {}
+    mi = model_info if model_info is not None else {}
+
+    scan_enabled = _coerce_bool(
+        _pick_context_over_model(mi, ctx, "log_error_pattern_scan", True),
+        default=True,
+    )
+
+    raw_benign_mi = mi.get("log_error_benign_patterns")
+    raw_benign_ctx = ctx.get("log_error_benign_patterns")
+    extra_benign: typing.List[str] = []
+    for part in (raw_benign_mi, raw_benign_ctx):
+        if isinstance(part, list):
+            extra_benign.extend(str(x) for x in part if x is not None)
+
+    custom_patterns = _pick_context_over_model(mi, ctx, "log_error_patterns", None)
+    if (
+        isinstance(custom_patterns, list)
+        and len(custom_patterns) > 0
+        and all(isinstance(x, str) for x in custom_patterns)
+    ):
+        error_patterns = list(custom_patterns)
+    else:
+        error_patterns = list(DEFAULT_LOG_ERROR_PATTERNS)
+
+    return scan_enabled, error_patterns, extra_benign
+
 
 def resolve_run_timeout(
     model_info: typing.Dict,
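
The precedence rules in `resolve_log_error_scan_config` can be summarized with a simplified stand-in. The sketch below assumes well-typed inputs (real booleans, lists of strings) and skips the string/number coercion done by `_coerce_bool`; `resolve_sketch` and `SKETCH_DEFAULTS` are illustrative names, and the default list is abbreviated.

```python
from typing import Any, Dict, List, Tuple

# Abbreviated stand-in for DEFAULT_LOG_ERROR_PATTERNS, for illustration only.
SKETCH_DEFAULTS = ["RuntimeError:", "Traceback (most recent call last)"]


def resolve_sketch(
    model: Dict[str, Any], ctx: Dict[str, Any]
) -> Tuple[bool, List[str], List[str]]:
    """Simplified resolution that assumes well-typed values."""
    # Scalar-style keys: a key present in the context shadows the model entry.
    scan = ctx.get("log_error_pattern_scan", model.get("log_error_pattern_scan", True))
    patterns = ctx.get("log_error_patterns", model.get("log_error_patterns"))
    # Benign lists are concatenated: model entries first, then context entries.
    benign = list(model.get("log_error_benign_patterns", [])) + list(
        ctx.get("log_error_benign_patterns", [])
    )
    return bool(scan), list(patterns) if patterns else list(SKETCH_DEFAULTS), benign


model_entry = {
    "log_error_pattern_scan": False,
    "log_error_benign_patterns": ["known noise A"],
}
runtime_ctx = {
    "log_error_pattern_scan": True,
    "log_error_benign_patterns": ["known noise B"],
}
print(resolve_sketch(model_entry, runtime_ctx))
# (True, ['RuntimeError:', 'Traceback (most recent call last)'], ['known noise A', 'known noise B'])
```

Here the runtime context re-enables a scan that the model entry turned off, while both benign lists survive in model-then-context order.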