Use header index to replace fixed column order #77

coketaste merged 1 commit into coketaste/refactor-dis from
Conversation
…performance, metric, and status show in the correct column
Pull request overview
Updates performance result recording/display so appended perf.csv rows align with the existing file’s header schema (instead of relying on a fixed column order), ensuring key fields like model/performance/metric/status appear in the correct columns.
Changes:
- In Kubernetes result collection, adjust successful run entries for the `multiple_results` fallback so CLI display uses per-row perf data (by setting `nodes=[]`).
- In Kubernetes perf.csv writing, detect and reuse an existing `perf.csv` header to map values by column name when appending.
- In CLI utilities, improve formatting/handling of empty values and `nan` model names when rendering tables.
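The header-mapping behavior described above can be sketched as a small standalone function (the name `append_perf_row` and the `perf_data`/`default_headers` parameters are illustrative stand-ins, not the repository's actual API):

```python
import csv
from pathlib import Path

def append_perf_row(perf_csv_path: Path, perf_data: dict, default_headers: list) -> None:
    """Append one result row, reusing an existing perf.csv header when present."""
    existing_header = None
    if perf_csv_path.exists() and perf_csv_path.stat().st_size > 0:
        with open(perf_csv_path, "r", newline="", encoding="utf-8") as rf:
            existing_header = next(csv.reader(rf), None)
    headers = existing_header or default_headers
    with open(perf_csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=headers, extrasaction="ignore", restval="")
        if not existing_header:
            writer.writeheader()
        writer.writerow(perf_data)  # each value lands under its named column
```

Because rows are written by column name rather than position, a repo whose `perf.csv` uses a different column order (or extra columns) still gets `model`/`performance`/`metric`/`status` in the right places.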
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `src/madengine/deployment/kubernetes.py` | Reworks perf.csv appending logic to respect existing header order and tweaks result payload for multi-row fallback display. |
| `src/madengine/cli/utils.py` | Improves table rendering robustness for empty values and nan model fields. |
```python
file_exists = perf_csv_path.exists()
existing_header = None
if file_exists:
    # Read existing header so we write in the same column order (fixes MAD-private
    # and other repos that use a different perf.csv schema with more/different columns)
    with open(perf_csv_path, "r", newline="", encoding="utf-8", errors="replace") as rf:
        reader = csv.reader(rf)
        existing_header = next(reader, None)
headers = existing_header if existing_header else default_headers
if file_exists and existing_header:
    # Build row by name so model/performance/metric/status go to correct columns
```
If `perf.csv` exists but is empty (or the first row is blank), `perf_csv_path.exists()` will be `True` while `existing_header` is `None`/empty, which skips `writeheader()` and appends a data row without any header line. Consider treating an empty/invalid header as a new file (e.g., check `stat().st_size == 0` or validate `existing_header`) and write `default_headers` before appending rows.
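One way to implement this suggestion is a small helper that treats empty files and blank first rows as "no header yet" (a sketch under the diff's variable names, not the repository's actual fix; `resolve_headers` is a hypothetical name):

```python
import csv
from pathlib import Path

def resolve_headers(perf_csv_path: Path, default_headers: list) -> tuple:
    """Return (headers, need_header_row), treating empty/blank files as new files."""
    existing_header = None
    if perf_csv_path.exists() and perf_csv_path.stat().st_size > 0:
        with open(perf_csv_path, "r", newline="", encoding="utf-8", errors="replace") as rf:
            existing_header = next(csv.reader(rf), None)
        # A blank or whitespace-only first row is not a usable header.
        if not existing_header or not any(cell.strip() for cell in existing_header):
            existing_header = None
    return (existing_header or default_headers, existing_header is None)
```

The caller would then call `writer.writeheader()` whenever the second element is `True`, so a zero-byte `perf.csv` still gets a header line before any data rows.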
```diff
 with open(perf_csv_path, "a", newline="", encoding="utf-8") as f:
     writer = csv.DictWriter(f, fieldnames=headers, extrasaction="ignore")
     if not file_exists:
         writer.writeheader()
     # Write data row (only fields in headers will be written)
-    writer.writerow(perf_data)
+    if file_exists and existing_header:
+        csv.writer(f).writerow(row_to_write)
+    else:
+        writer.writerow(row_to_write)
```
This branch builds `writer = csv.DictWriter(...)` but then writes rows using `csv.writer(f).writerow(...)`. Mixing `DictWriter` and `csv.writer` can lead to subtle differences in quoting/dialect and makes the code harder to follow. Consider consistently using `DictWriter.writerow(...)` (it already respects `fieldnames` order and fills missing keys with `restval`) to write both new and existing-header rows.
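A consistent version along the lines the reviewer suggests could look like this (a sketch that assumes `row_to_write` is a dict keyed by column name rather than a pre-ordered list; `append_row` is a hypothetical helper name):

```python
import csv

def append_row(f, headers, row_to_write, write_header):
    """Single DictWriter path: missing keys become restval, extras are dropped."""
    writer = csv.DictWriter(f, fieldnames=headers, extrasaction="ignore", restval="")
    if write_header:
        writer.writeheader()
    writer.writerow(row_to_write)  # same quoting/dialect for every row
```

With one writer object, the header row and all data rows share the same dialect, and column alignment comes for free from `fieldnames`.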
* Changed docker.io to dockerhub
* Fix the test case of context
* Updated README.md
* Fix the unit test of e2e distributed run with profiling
* Fixed the issue of mocked GPU
* Rewrite the unit test GPU version
* Fixed the manifest name error
* Fixed the missing manifest file
* Updated the warning message for missing creds
* Updated the MAD_DOCKERHUB_ creds parsing logic
* Updated README
* Implemented a batch input arg for madengine-cli build
* Enhanced logging system is now active and automatically highlights all Docker operations
* Fix the error "local variable docker_image referenced before assignment"
* Updated the perf dataframe output
* The fixes are backward compatible and maintain existing functionality for truly successful runs while correctly identifying and handling various failure scenarios.
* Fixed the problematic log
* Fixed the error pattern, removed the wrong string
* Fixed the error in test prof
* Updated the interface of mad_cli
* Update README.md
* Ensure that the DistributedOrchestrator.build_phase method and the underlying build logic use the batch_build_metadata argument to perform the correct tagging and pushing for each model.
* Updated the build batch manifest to distributed orchestrator
* Debug the batch manifest
* Update the flow to use per-model registry settings for both build and run phases
* Correct registry image will be used for each model as intended
* The push_image function now accepts and uses the explicit registry_image from batch.json for each model.
* Updated the explicit_registry_image assignment
* Debug the registry info setting
* Updated the function for exporting the build manifest
* Add verbose output for debugging
* Debug the export of the build manifest
* Debug the registry extraction from batch build metadata
* Debug the extraction
* Corrected the content of the synthetic image for which built_new is false in batch mode
* Fixed the type error in additional context
* Debug the parsing of GPU vendor and guest OS
* Correct the pattern of Dockerfile
* Updated the print
* Update the rich print
* Figured out a critical issue: the dual CLI implementation creates a maintenance burden
* Fixed the Dockerfile matching
* Refactored the logic in _process_batch_manifest_entries() to include all fields from the discovered model in the build_manifest
* Added unit tests for new unified error handlers
* Updated README.md
* Implemented a SLURM runner that follows the same comprehensive pattern as the existing SSH, Ansible, and Kubernetes runners, ensuring consistency while highlighting SLURM-specific features like job arrays, HPC module systems, and shared filesystem requirements.
* Fixed the errors in unit tests
* Used Rich console print to replace part of the regular prints, enhancing log formatting and following best practices
* Updated Rich console print to enhance log readability
* Update the new line
* Updated the new line for all sections
* Updated the final table of the dataframe
* Updated the display of the dataframe from head to tail
* Updated the GPU status check
* Cleanup
* Updated README
* Added discover command to mad_cli
* Implemented CLI detection of MAD_CONTAINER_IMAGE in additional context; production-ready and maintains full backward compatibility with existing madengine workflows
* Implemented the core multi-GPU architecture support for Docker image building
* Implemented unit tests for the multi-GPU arch feature
* Debug and fix the unit test of multi-GPU arch
* Debug the issue with the display of the results table
* Enhanced the results table and improved the handling of the GPU arch suffix in the Docker image name
* Creates architecture-specific images with proper naming and metadata, regardless of the underlying Dockerfile configuration.
* Ported changes from coketaste/amd-smi
* Fixed the syntax error
* Revert "ported changes from coketaste/amd-smi". This reverts commit 5444a67.
* Fixed the tools for distributed mode
* Fixed the cleanup
* Fixed the table of results
* Fixed the GPU Product Name
* Fixed the issue in selftest
* Enhanced unit tests and cleanup
* Refactor the architecture and flow
* Updated the REFACTOR Plan
* Updated PLAN
* Update Plan
* Update Plan
* Refactor the new madengine CLI architecture and flow
* Fixed the copy issue of scripts
* Fixed the migration issues and fixed the multi-tag inputs
* Improved error handler and updated unit tests
* Fixed the prescript with prof power and vram
* Fixed the issues of data provider and tags
* Refactored the GPU Tool Manager with a factory to handle AMD ROCm and Nvidia CUDA tools, synced up the changes in legacy madengine, updated their unit tests, and cleaned up the codebase
* Reorganize codebase: move docker_builder, discover_models, update_perf_csv
  - Move docker_builder.py to execution/ (Docker operations)
  - Move discover_models.py to utils/ (shared utility)
  - Move update_perf_csv.py to reporting/ (shared by both CLIs)
  - Create database/ placeholder for future MongoDB API
  - Update all imports across codebase
  - Add comprehensive README documentation
  Tests: 58/59 passing (98.3%). Legacy compatibility: maintained.
* Reorganize file structure and cleanup
  - Move documentation files to better organization
  - Update code references and imports
  - Clean up test fixtures with dummy values only
* Cleanup runners
* Fixed the unit tests for the new madengine CLI and deprecated the unit tests of legacy madengine
* Implemented k8s deployment
* Fixed the perf csv unified format
* Updated the flow of k8s
* Init examples of k8s config for different use cases
* Fixed the data provider for minio
* Fixed the data nas
* Implemented k8s tools package and enabled download and pre-script tools in the flow running on a k8s pod
* Fixed the tools encapsulation
* Added PVC storage support for results generated by workloads on k8s pods
* Implemented torchrun as a runner for multi-GPU and multi-node
* Fixed torchrun on multi-GPU on k8s
* Fixed the tools as pre/post scripts running on multi-GPU k8s
* Fixed the error handler for empty performance due to benchmark failure on k8s
* Fixed the chain tools
* Fixed the multinode on k8s
* Updated k8s config
* Updated the pod creation with PVC
* Clean up
* Refactored context to handle the new madengine CLI, supporting local and deploy cases
* Migrated mad_cli to the new CLI structure design
* Fixed the unit tests for the new madengine CLI
* Fixed the unit tests
* Improved the interface of k8s-configs to simplify the complex config to a minimal one
* Fixed the timeout issue
* Added models for testing distribution with different launchers
* Removed non-existent environment variables; kept only standard MIOpen variables: MIOPEN_FIND_MODE=1 and MIOPEN_USER_DB_PATH=/tmp/.miopen; added cache clearing in kubernetes/job.yaml.j2 before training starts; deletes corrupted cache files (rm -rf ${MIOPEN_USER_DB_PATH}/*); ensures every run starts with a clean cache.
* Added new models dummy_megatron_lm and dummy_deepspeed to validate the implementation of distribution on the cluster
* Implemented a Slurm launcher for running distributed workloads
* Fixed the issue of deployment config
* Fixed the GPU resolution for setting num GPUs
* Fixed the multinode configs and cleaned up old MULTI_NODE args that are no longer used
* Fixed the streaming log
* Refactored torchrun on Slurm with multi-GPU and multi-node
* Fixed the Slurm job template for single and multinode
* Fixed the multinode job
* Fixed the multinode on the Slurm node
* Fixed the multinode Slurm deploy case: no SIGBUS crashes, and worker nodes succeed with a proper status message
* Validate madengine-cli on a compute node
* Fixed the unit tests for the Slurm deploy update
* Created run script for dummy deepspeed
* Implemented inference serving launchers using vllm and sglang for distribution on a Slurm cluster
* Fixed the issue in vllm with the v1 engine
* Debug and test vllm deploy on multinode Slurm with the v1 engine and Ray as launcher; refactor k8s-configs and slurm-configs
* Implemented launchers for vllm and sglang to run workloads on single and multinode Slurm clusters; fixed the slurm-configs
* Fixed the unit tests of skip gpu arch
* Fixed the issue of copying common, and added pyt_huggingface_gpt2 and pyt_huggingface_bert for validating development
* v2.0 development: (1) 6 launchers implemented for multi-GPU and multinode on k8s and Slurm clusters; (2) bug fixes for the error handler; (3) dead-code cleanup; (4) updated project docs
* Updated the gpu_vendor and guest_os fields in config
* Updated README.md and its sections in docs
* Updated the README and cleanup
* Replace sleep with cat
* Updated torchrun with the run.sh pattern on k8s deployment
* Updated Megatron-LM base image to ROCm/megatron-lm:latest
* Implemented report and database commands
* Added the perf superset feature to collect configs and multi-results
* Fixed the k8s PVC issue
* Updated the context saving logic
* Reorganize unit tests, remove redundant and edge cases, add examples of batch-manifest, and update documentation of the batch build feature
* Make a universal solution with docker exec
* Replaced sleep with tail
* Fixed the docker pull issue on a compute node when a layer of the image is corrupted
* Removed MySQL database interface
* Updated the Megatron-LM launcher on both k8s and Slurm
* Updated k8s-configs for megatron-lm and deepspeed
* Updated the config of gradient_accumulation_steps
* Removed legacy madengine and its relatives; use madengine as the unified modern CLI with k8s and SLURM
* Cleanup and enhance the project README
* Fixed the vllm multinode on k8s
* Fixed the format error in kubernetes
* Implemented sglang-disagg launcher for Slurm and k8s on multinode
* Updated project docs to reflect recent changes
* Updated README of project
* Fixed the stack tools for tracing
* Fixed the tools stack for the gpu info power and gpu info vram profilers
* Fixed the error of regex pattern mismatch
* Added a model to test theRock and its performance on the PyTorch framework; added the detect_therock tool
* Fixed the error handling to validate the exact bug
* Updated docs to reflect recent changes
* Fixed the module combining logs of building and running
* Fixed the trace shell script
* Updated trace script
* Fixed the SLURM job setup of HIP_VISIBLE_DEVICES for Ray/vLLM
* Debug the vllm multinode on SLURM
* Fixed the superset upload to mongodb
* Fixed the unit tests checking hardware capability
* Fixed the superset data generation and uploading
* Added a new option to clean perf intermediate files
* Enhanced rocprof v3 profiling: 8 pre-configured profiling profiles for different bottleneck types
  - Hardware counter definitions for compute, memory, and communication analysis
  - Ready-to-use configuration files for single-GPU, multi-GPU, and multi-node setups
  - Perfetto visualization support for timeline analysis
  - Full custom command support via cmd and env_vars fields
  - Automatic rocprof/rocprofv3 detection via existing wrapper script
  - Comprehensive documentation with examples for every scenario
* Fixed the configs, tracing, and wrapper of rocprof v3
* Added profiling-configs and counters configs; enhanced multi-GPU profiling with rocprof v3 and stacked tool chains
* Updated the unit tests of profiling with rocprof v3
* Updated the project docs
* Added unit tests for superset
* Fixed the error of parsing config and gathering multi_results into the superset dataframe
* Fixed the issue of dockerhub auth in pod init for image pulling
* Implemented multinode profiling and tracing on k8s and Slurm clusters; updated the tables of execution results and performance results
* Fixed the multi-results case on k8s infra
* Updated config for profiling
* Added nodelist arg for Slurm, and improved the configuration module
* Updated docs about changes in Slurm
* Added a new rocprofv3 config mode for the tool
* Updated rocprofv3_agent tool config
* Added rocprofv3_agent_full with counter collection or instruction mix
* Updated trace config
* Added rocprofv3 thread tracing
* Updated the config of thread tracing; ROCPROF_ATT_LIBRARY_PATH needs to be set
* Updated tools.json and removed the thread tracing config for rocprof v3
* Fixed the error patterns and modified the silent error check via subprocess
* Updated the additional context and manifest printout
* Updated the interface of madengine report to-html to use --csv-file-path
* madengine v2: ROCm path override + RPD e2e dependency fix (#76)
* feat(madengine): ROCm path override and RPD e2e fix
  ROCM_PATH / --rocm-path:
  - Add get_rocm_path() and wire through Context, ROCmToolManager, gpu_tool_factory, gpu_validator, container_runner, run orchestrator; add --rocm-path to run CLI
  - Unit tests for get_rocm_path, Context, ROCmToolManager, run --help; update fixtures and test_get_cached_managers for (vendor, rocm_path) cache
  RPD e2e:
  - pre_scripts/trace.sh: install nlohmann-json3-dev (Ubuntu) and json-devel (CentOS) so rocmProfileData rpd_tracer build finds nlohmann/json.hpp
* Updated docs
* Added madengine logo icon
* Resize the logo icon
* Updated the icon of madengine
* Updated docs
* Updated README
* Fixed the issue. Appended rows now match the existing header; model, performance, metric, and status show in the correct column (#77). Use header index to replace fixed column order
* Updated README.md
* feat(k8s): align K8s perf reporting with Docker (multi_result + perf_super) (#78)
  K8s orchestrator now uses the same reporting path as single-node Docker when a multiple_results CSV is present, so both produce the same artifacts: perf.csv, perf_entry.json, perf_entry.csv, perf_super.json, perf_entry_super.json/csv, and perf_super.csv.
  - Add Docker-compatible reporting path in _collect_results:
    - _build_common_info_dict() for common_info (args, gpu_arch, etc.)
    - _ensure_perf_csv_exists() so update_perf_csv can read perf.csv
    - Call update_perf_csv, update_perf_super_json, update_perf_super_csv with scripts_base_dir so --config from models.json resolves correctly
  - Multi-node multiple_results: resolve one CSV path per run
    - _resolve_multiple_results_csv(): single pod → use that CSV; multi-pod → merge all pod CSVs with sum/average rules
    - _merge_multi_node_multiple_results_csv(): align rows by index; performance aggregated by metric type (throughput→sum, latency→avg, memory→max); extra columns by _aggregation_for_extra_column (sum/avg/max/first)
    - _aggregation_for_extra_column() for consistent multi-node semantics
  - Keep legacy row-by-row _write_to_perf_csv when the reporting module is unavailable; record failure when no CSV is found
  - job.yaml.j2: no functional change required; the existing copy block and find fallback for multiple_results already support this refactor
* Fixed launcher type issue on k8s
* update model discovery to handle tags in subdirectories for madengineV2 (#85)
* fixed model discovery to support tags in subdirectories
* made additional context text and file mutually exclusive on build and run
* madengine v2 improve code quality (#81)
* refactor(madengine): improve code quality, remove dead code, reduce duplication
  - Remove dead test: delete tests/test_cleanup.py (imported removed madengine.tools.run_models).
  - Constants: add _get_env_or_creds_or_default() in core/constants.py and refactor NAS_NODES, MAD_AWS_S3, MAD_MINIO, PUBLIC_GITHUB_ROCM_KEY to use it; add tests/unit/test_constants.py.
  - Path helpers: add utils/path_utils.py with scripts_base_dir_from() and get_madengine_root(); use in container_runner and kubernetes; add tests/unit/test_path_utils.py.
  - Run-details: add utils/run_details.py with get_pipeline(), get_build_number(), and flatten_tags_in_place(); refactor container_runner and kubernetes to use them; add tests/unit/test_run_details.py.
  - Deployment: add apply_deployment_config() in config_loader and create_jinja_env() in base; refactor slurm and kubernetes init to use them; add TestApplyDeploymentConfig and tests/unit/test_deployment_base.py.
  All changes preserve behavior. Unit test suite (157+ tests) passes.
* test: reorganize unit and integration tests, reduce duplication
  Unit tests (tests/unit/):
  - Remove duplicate get_rocm_path coverage from test_constants (kept in test_rocm_path).
  - Consolidate by topic: test_utils (path_utils + run_details), test_deployment (base + common), test_execution (container_runner_helpers + dockerfile_utils), test_orchestration (image_filtering + orchestrator_logic).
  - Parametrize and merge redundant cases (e.g. falsy inputs, env/default, launcher normalization, timeout behavior).
  - Drop duplicate tests (e.g. gfx_compilation in dockerfile_utils) and trim edge-case bloat; keep important behavior.
  - Reduce from 16 to 12 unit test files; 203 tests passing.
  Integration tests (tests/integration/):
  - Fix 6 failures in multi-GPU-arch tests: use madengine.execution.dockerfile_utils instead of removed DockerBuilder private methods.
  - Merge error tests into test_errors.py (CLI + unified system, dedupe setup_logging and context serialization).
  - Merge batch manifest into test_orchestrator_workflows; merge multi_gpu_arch into test_platform_integration.
  - Reduce from 10 to 7 integration test files; 155 passed, 1 skipped.
* Add shared e2e helpers in tests/fixtures/utils.py: build_run_command(), assert_model_in_perf_csv(), get_run_live_log_path(), get_timeout_seconds_from_log(); DEFAULT_CLEAN_FILES for consistent cleanup lists
  - Use DEFAULT_CLEAN_FILES (and extras) in all e2e modules: test_run_workflows, test_execution_features, test_data_workflows, test_build_workflows, test_profiling_workflows, test_scripting_workflows
  - Parametrize context tests: merge test_dockerfile_picked_on_detected_context_0/1 into test_dockerfile_picked_on_detected_context (ctx_value, expected_performance)
  - Parametrize timeout-in-log tests: replace four separate tests with test_timeout_value_in_log (tags, log_base_name, expected_seconds, extra_args)
  - Leave timeout-fire and other e2e tests unchanged; 60 e2e tests still pass
* Stop tracking run_directory (now in .gitignore)
* Multi-node SLURM: no node writes perf.csv; after the job, collect_results parses each node's .out, aggregates (e.g. sum for samples_per_second), and writes one row. 2N×8G should report about 2× the 1N×8G throughput.
* Refactored slurm_output to slurm_results to improve the management of results collected from each node
* Fixed the workload's output to slurm results
* Refactored the reporting system for local, k8s, and Slurm, respectively, to handle single results and multi-results
* Fixed the multi-results collection for multi-node
* Fixed the missing test duration in the multi-results case
* Improve madengine reliability, error handling, and exit codes for Jenkins
  - Record pre-run failures (e.g. pull/setup) in the performance CSV with status FAILURE and add to failed_runs so the perf table and exit code stay correct.
  - Preserve typer.Exit in cli_main so run/build exit codes (0/1/2/3/4) are passed through to the shell/Jenkins.
  - Use ExitCode.BUILD_FAILURE (2) when BuildError is raised; let BuildError propagate from RunOrchestrator instead of wrapping in MADRuntimeError.
  - Make update_perf_csv create the perf CSV with PERF_CSV_HEADER when the file is missing but a result (exception/single/multiple) is provided.
  - Use shared PERF_CSV_HEADER in container_runner.ensure_perf_csv_exists and fix read_json to use a context manager for the file handle.
  - Add unit tests for exit codes, cli_main typer.Exit preservation, BuildError → BUILD_FAILURE, setup-failure recording to CSV, and update_perf_csv create-when-missing; fix unclosed file in update_perf_csv.
* Cleanup
* Updated project docs
* update model discovery to handle tags in subdirectories (#83)
  ---------
  Co-authored-by: leconcio <leconcio@amd.com>
* feat(discover): support scope/tag to filter by scripts subdir and tag
  Add --tags scope/tag (e.g. MAD-private/inference) so discovery is limited to models under scripts/<scope>/ with the given tag. Exactly one / and no : in the tag enables scoped matching; scope/all selects all models in that scope.
  - Parse scope/tag in DiscoverModels.select_models()
  - Add _model_entry_has_tag() for list or comma-separated tag checks
  - Document scoped tags in discover command help
  - Add unit tests in tests/unit/test_discover_models.py
* Cleanup
* fix: dummy fixtures and core for multi-node and SLURM
  - Update .gitignore, docker.py, slurm.py
  - dummy_vllm: Dockerfile (coreutils), run.sh, run_vllm_inference.py
  - dummy_torchtitan: Dockerfile, run.sh
  - dummy_sglang: README, run.sh, run_sglang_inference.py
* test: consolidate run exit-code smoke tests in test_cli.py
* Updated README of project
* madengine v2 refactor(k8s) secrets handling, launcher mixin, and template updates (#88)
* refactor(k8s): secrets handling, launcher mixin, and template updates
  - Add k8s_secrets helpers for image pull / runtime secrets with configurable strategies; add unit tests.
  - Introduce KubernetesLauncherMixin for torchrun and related launcher command generation; wire into kubernetes deployment.
  - Update K8s Job/ConfigMap templates, k8s defaults preset, and kubernetes.py.
  - Refresh configuration/deployment docs and k8s-configs README.
  - Adjust tests (conftest, dummy credential fixture) for new behavior.
  - Fix CLI merge of --additional-context with --additional-context-file when the CLI value uses Python dict syntax (fallback after JSON parse failure).
* fix(k8s): Path import for vLLM/SGLang commands; public registry secret helper
  - Import Path at module level in kubernetes_launcher_mixin so _generate_vllm_command and _generate_sglang_command do not raise NameError; drop redundant inner import in _generate_torchrun_command.
  - Rename _build_registry_secret_data to build_registry_secret_data and document it as the supported API for dockerconfigjson preview logic.
  - Update kubernetes.py and unit tests accordingly.
* feat(cli): add default values for gpu_vendor and guest_os in build CLI (#89)
* feat(cli): add default values for gpu_vendor and guest_os in build command
  Add sensible defaults (AMD/UBUNTU) to simplify build operations for the common use case while maintaining full backward compatibility and explicit override capability.
* fix(build): wire gpu_vendor/guest_os defaults into Context for Dockerfile filtering
  validate_additional_context already applied AMD/UBUNTU defaults, but the merged dict was not passed to BuildOrchestrator, so Context.filter() saw no vendor keys and could build both AMD and NVIDIA Dockerfiles.
  - Add core/additional_context_defaults.py with apply_build_context_defaults()
  - Refactor CLI validators to use the shared helper
  - build command: pass repr(validated context) and drop duplicate file load
  - BuildOrchestrator: apply defaults after merge; build Context from final additional_context post-ConfigLoader
  - Update unit/integration tests; add tests for defaults helper
* feat(cli): validate additional context and scope GPU arch warnings
  Add structural checks for --additional-context (nested types, known keys) with clear typer.Exit messages and examples. Unify run merge/validate with build when context is non-empty. Remove the duplicate MAD_SYSTEM_GPU_ARCHITECTURE info line from build-only Context init. After model discovery, warn only when a selected Dockerfile needs ARG MAD_SYSTEM_GPU_ARCHITECTURE without a default and the effective arch cannot be resolved (context or Dockerfile). Add dockerfile_requires_explicit_mad_arch_build_arg helper, drop duplicate DockerBuilder methods, and extend unit tests (validators, CLI, Dockerfile heuristic, orchestrator warning, context messages).
* fix(cli): normalize gpu_vendor/guest_os; clean unused imports in validators
  - After validation, write canonical uppercase gpu_vendor and guest_os into the context dict so Context.filter() matches exactly.
  - Align test_validate_additional_context_case_insensitive with that behavior.
  - Drop unused imports: Panel in validators.py; patch/MagicMock in test_validators.py (flake8 F401).
* docs(readme): fix misleading AMD/UBUNTU wording for `madengine run` quick start
* docs(usage): fix misleading AMD/UBUNTU wording for `madengine run` quick start
* Fixed the docker registry url
* Added unit tests for docker builder
* madengine v2: k8s PVC collection hardening (#90)
* fix(k8s): harden collect_results and align PVC defaults with examples
  - kubernetes: wait for terminal pod phase before recording per-node status; read primary workload container exit code by name (not container_statuses[0])
  - presets + PVC templates: align shared/results PVC behavior with nfs-banff / RWX usage
  - examples: update torchrun single- and multi-node JSON configs; document in k8s-configs README
* feat(k8s): per-pod results layout, PVC collection hardening, expand k8s tests
  Kubernetes deployment:
  - Collect PVC artifacts into k8s_results/<job>/<pod>/pvc/ by matching /results/<subdir>/ to full pod names (indexed Job pods, unmapped fallback).
  - Harden collector: short settle delay, retries for listing /results, explicit -c collector on kubectl exec/cp, clearer warnings when collection fails.
  Docs:
  - Document local k8s_results layout (pod.log vs pvc/) and multi-node Service behavior (PyTorch vs Ray notes in k8s-configs README).
  Tests:
  - Rename test_k8s_secrets.py to test_k8s.py; keep secrets/configmap tests and add unit tests for PVC subdir ↔ pod name mapping helpers.
* feat(cli): add --skip-model-run for local full workflows (#91)
  Restore v1-style --skip-model-run on madengine run: model execution is skipped only when this invocation ran a build (_did_build_phase). If an existing --manifest-file is used, the flag is ignored and a warning is printed.
  - CLI: new --skip-model-run, passed through args; workflow panel and success messaging when the run phase is skipped
  - RunOrchestrator: short-circuit before local/distributed execute; empty run summary with skipped_model_run; use (skip_model_run is True) so MagicMock test doubles do not enable the path by accident
  - Tests: unit coverage for skip vs run-only; help text; conftest skip_model_run
* feat(execution): configurable log error pattern scan (#92)
* feat(execution): configurable log error pattern scan (additional_context)
  Allow disabling or tuning post-run log grep via additional_context and/or model_info:
  - log_error_pattern_scan / disable_log_error_scan to skip substring scan
  - log_error_benign_patterns to extend exclusions (e.g. noisy DLM logs)
  - log_error_patterns to override default pattern list
  Extract resolution into container_runner_helpers.resolve_log_error_scan_config; validate new keys in CLI validators; add unit tests.
  Motivation: avoid false FAILURE from benign RuntimeError: lines in suites already validated by pytest/JUnit (e.g. pyt_dl_tool_unit_test / AIMODELS-571).
* feat(execution): configurable log error pattern scan; document and drop duplicate flag
  Add optional additional_context / model keys to control post-run log substring checks (log_error_pattern_scan, log_error_benign_patterns, log_error_patterns). Resolve settings in container_runner_helpers; wire into ContainerRunner; validate CLI types. Remove disable_log_error_scan in favor of log_error_pattern_scan only. Document behavior in README, docs/configuration.md, cli-reference, usage, profiling, and docs index. Cross-link troubleshooting for benign RuntimeError: false failures.
  Tests: extend test_container_runner_helpers; unit suite passes.
* refactor(container): Python log error scan; split benign literal vs regex; test validators (#93)
* feat(execution): configurable log error pattern scan (additional_context)
  Allow disabling or tuning post-run log grep via additional_context and/or model_info:
  - log_error_pattern_scan / disable_log_error_scan to skip substring scan
  - log_error_benign_patterns to extend exclusions (e.g. noisy DLM logs)
  - log_error_patterns to override default pattern list
  Extract resolution into container_runner_helpers.resolve_log_error_scan_config; validate new keys in CLI validators; add unit tests.
  Motivation: avoid false FAILURE from benign RuntimeError: lines in suites already validated by pytest/JUnit (e.g. pyt_dl_tool_unit_test / AIMODELS-571).
* feat(execution): configurable log error pattern scan; document and drop duplicate flag
  Add optional additional_context / model keys to control post-run log substring checks (log_error_pattern_scan, log_error_benign_patterns, log_error_patterns). Resolve settings in container_runner_helpers; wire into ContainerRunner; validate CLI types. Remove disable_log_error_scan in favor of log_error_pattern_scan only. Document behavior in README, docs/configuration.md, cli-reference, usage, profiling, and docs index. Cross-link troubleshooting for benign RuntimeError: false failures.
  Tests: extend test_container_runner_helpers; unit suite passes.
* refactor: scan container logs in Python; validate log_error_* in tests
  Replace sh/grep-based log error detection with in-process scanning to avoid shell quoting and injection issues when patterns come from model_info or additional_context.
  - Add log_text_has_error_pattern() with literal benign substrings vs explicit regex benign rules (ROCProf).
  - Wire ContainerRunner to read the log once and use the helper.
  - Document log_error_benign_patterns as literal substrings in helpers.
  - Extend validate_additional_context tests for log_error_pattern_scan, log_error_benign_patterns, and log_error_patterns (valid + invalid).
* refactor: improve code quality and streamline unit tests (#94)

---------
Co-authored-by: Satya Nikhil <skodukul@amd.com>
Co-authored-by: leconcio <leconcio@amd.com>