Skip to content

Update the scripts and dockers in madengine package#2

Merged
gargrahul merged 1 commit intoROCm:mainfrom
coketaste:coketaste/update
Jun 3, 2025
Merged

Update the scripts and dockers in madengine package#2
gargrahul merged 1 commit intoROCm:mainfrom
coketaste:coketaste/update

Conversation

@coketaste
Copy link
Copy Markdown
Collaborator

Update the scripts and dockers in madengine package

@coketaste coketaste requested a review from gargrahul June 3, 2025 01:23
@coketaste coketaste self-assigned this Jun 3, 2025
@coketaste coketaste requested a review from Rohan138 June 3, 2025 01:23
@gargrahul gargrahul merged commit f3ad109 into ROCm:main Jun 3, 2025
srinivamd added a commit to srinivamd/madengine that referenced this pull request Apr 23, 2026
Extracts duplicated credential loading and Docker registry login logic
from BuildOrchestrator and RunOrchestrator into a shared core/auth.py
module (~120 lines of duplicated login logic removed).

Security improvements over PR ROCm#106:
- Registry password passed via MAD_REGISTRY_PASSWORD env var instead
  of argv (avoids password leaking in process list / shell history)
- docker ps uses --filter name=^/$  with re.escape() to avoid regex
  metachar false positives (see docker.py fix)
- build-arg values wrapped with str() before shlex.quote()

Copilot review comment ROCm#2 fix: login_to_registry typed as
Optional[str] (was str) since callers pass None for DockerHub
and tests call with None.
raviguptaamd added a commit to raviguptaamd/madengine that referenced this pull request May 1, 2026
Address all 9 inline comments from copilot-pull-request-reviewer[bot]:

ROCm#1 build_orchestrator.py — _execute_with_prebuilt_image now keys
   manifest['built_models'] by model_name (not use_image), so multiple
   models that share the same pre-built image are all preserved in the
   manifest.

ROCm#2 build_orchestrator.py — warn when discovered models have differing
   distributed/slurm configs in the prebuilt-image flow; the post-merge
   step still uses models[0]'s config but operators are now told.

ROCm#3 build_orchestrator.py — _execute_build_on_compute() now raises
   ConfigurationError early when registry is None instead of falling
   into registry.replace/.split/.lower with NoneType.

ROCm#4 build_orchestrator.py — credentials-required error now emits
   per-registry hints (docker.io / ghcr.io / gcr.io / quay.io / nvcr.io)
   instead of Docker-Hub-only PAT guidance.

ROCm#5 container_runner.py — document the shell=True trust boundary on the
   inner subprocess.run; cmd is internally constructed and any user
   model_args are routed through shlex-quoted assembly in the caller.

ROCm#6 slurm.py — drop duplicate `from typing import Optional` import.

ROCm#7 slurm.py — slurm_multi wrapper no longer hard-codes
   `#SBATCH --exclusive`; honours self.slurm_config.get('exclusive', True)
   to match the standard SLURM template behaviour.

ROCm#8 slurm_node_selector.py — cleanup_node()'s srun_cmd is now built once
   and includes both --job-name (when provided) and --reservation (when
   set); the second in-try reassignment that dropped --job-name is gone.

ROCm#9 run_orchestrator.py — replace the shallow `merged.update(...)` with
   a real recursive _deep_merge so the comment ("deep-merge") matches the
   behaviour: nested dicts under slurm/k8s/distributed/etc. are merged
   per-leaf, runtime --additional-context still wins on conflicts.

Made-with: Cursor
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants