Slurm utils design #107

@ffelten


Discussion with @markfuge and @g-braeunlich

Main use cases

  • Dataset generation: calling optimize or simulate over many configurations. The current implementation uses the config and design factories to create a parameter_space from slurm.Args.
  • Starting gradient descent from different initial points: only the starting_point of the optimization changes; all other parameters stay unchanged.
  • Running things other than optimize or simulate: render, or even custom code, could run inside the job (e.g., via a callback), with MapReduce-like logic.
  • Grouping small runs into one bigger job to reduce Slurm scheduling overhead. This might be useful for ML model evaluations.
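
To make the use cases above concrete, here is a minimal sketch of a starting-point sweep whose runs are grouped into larger jobs and processed with a map/reduce pattern. Everything in it is hypothetical: `RunArgs`, `sweep_starting_points`, `group_runs`, `map_job` and the toy `descend` callback are illustrative names, not the existing `slurm.Args` API, and the "jobs" run locally instead of being submitted to Slurm.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterator, Sequence


@dataclass(frozen=True)
class RunArgs:
    """Hypothetical arguments of one run (a stand-in, not the real slurm.Args)."""
    starting_point: float
    step_size: float = 0.1
    n_steps: int = 50


def sweep_starting_points(base: RunArgs, starts: Sequence[float]) -> list[RunArgs]:
    """Vary only starting_point; every other field is copied from `base`."""
    return [replace(base, starting_point=s) for s in starts]


def group_runs(runs: Sequence[RunArgs], runs_per_job: int) -> Iterator[list[RunArgs]]:
    """Pack several small runs into one job to reduce scheduling overhead."""
    if runs_per_job < 1:
        raise ValueError("runs_per_job must be at least 1")
    for i in range(0, len(runs), runs_per_job):
        yield list(runs[i:i + runs_per_job])


def map_job(callback: Callable[[RunArgs], float], group: Sequence[RunArgs]) -> list[float]:
    """What a single job would execute: the callback applied to each run in its group."""
    return [callback(args) for args in group]


def descend(args: RunArgs) -> float:
    """Toy callback: gradient descent on f(x) = x**2, returning the final x."""
    x = args.starting_point
    for _ in range(args.n_steps):
        x -= args.step_size * 2 * x  # gradient of x**2 is 2x
    return x


runs = sweep_starting_points(RunArgs(starting_point=0.0), [-2.0, -1.0, 1.0, 2.0, 3.0])
groups = list(group_runs(runs, runs_per_job=2))
mapped = [map_job(descend, g) for g in groups]  # map: one result list per job
results = [x for job in mapped for x in job]    # reduce: flatten into one list
```

Because the callback is an ordinary callable, the same scaffolding would cover optimize, simulate, render, or custom code. Only the reduce step needs to know how the per-job results are combined.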

Additional features

  • Specify a runtime limit for each Slurm job so that jobs that run too long (bad simulations) are cut off. This timeout is job-specific and Euler-specific.
  • Failure tracking: how do we check which run failed, and how do we trace it back to the config/parameters/args that triggered it? What we want for each job: job id, failure reason (error or timeout), and the args used to run it.
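
One possible shape for the per-job record described above, sketched locally with `subprocess` rather than Slurm. `RunRecord`, `run_tracked` and the toy `SIM` script are hypothetical names introduced for illustration. On the cluster, the time limit would more likely come from `sbatch --time`, and the final state from Slurm's accounting (for example `sacct`).

```python
import json
import subprocess
import sys
from dataclasses import dataclass

# Toy "simulation" run in a child process; it reads its args as JSON from argv.
SIM = """
import json, sys, time
a = json.loads(sys.argv[1])
time.sleep(a["sleep"])
if a["fail"]:
    raise RuntimeError("bad simulation")
print("done")
"""


@dataclass
class RunRecord:
    """Hypothetical record linking a job back to the args that triggered it."""
    job_id: str
    args: dict
    status: str  # "ok", "error" or "timeout"
    detail: str = ""


def run_tracked(job_id: str, args: dict, timeout_s: float) -> RunRecord:
    """Run one job with its own timeout and record how it ended."""
    cmd = [sys.executable, "-c", SIM, json.dumps(args)]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return RunRecord(job_id, args, "timeout", f"exceeded {timeout_s}s")
    if proc.returncode != 0:
        lines = proc.stderr.strip().splitlines()
        return RunRecord(job_id, args, "error", lines[-1] if lines else "")
    return RunRecord(job_id, args, "ok", proc.stdout.strip())


jobs = [
    ("job-0", {"sleep": 0, "fail": False}, 10.0),
    ("job-1", {"sleep": 0, "fail": True}, 10.0),
    ("job-2", {"sleep": 30, "fail": False}, 0.5),  # deliberately too slow
]
records = [run_tracked(job_id, args, timeout) for job_id, args, timeout in jobs]
failed = [r for r in records if r.status != "ok"]
```

Each entry in `failed` carries the job id, the failure kind, and the exact args, which is enough to resubmit only the failed runs.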

Found bugs

  • Job array size limit on Euler: job arrays that exceed it get kicked out by Euler.
  • The reduce node gets OOM-killed (fix: ask for a bigger node for the reduce step).
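
A possible workaround for the job array limit, sketched under assumptions: split the tasks into several smaller arrays, each indexed from 0, and pass an offset so that each task can recover its global index. `chunk_job_array` is a hypothetical helper, and the limit of 1000 is a placeholder rather than Euler's actual value. The sketch assumes the limit applies to the array index (as with Slurm's `MaxArraySize` setting), which is why every chunk restarts at 0 instead of continuing the numbering.

```python
from typing import Optional


def chunk_job_array(
    n_tasks: int,
    max_array_size: int,
    max_concurrent: Optional[int] = None,
) -> list[tuple[int, str]]:
    """Return (offset, array_spec) pairs, one per sbatch submission.

    array_spec is a value for `sbatch --array=...`; the optional `%N`
    suffix caps how many tasks of that array run at the same time.
    """
    if n_tasks < 0 or max_array_size < 1:
        raise ValueError("n_tasks must be >= 0 and max_array_size >= 1")
    chunks = []
    for offset in range(0, n_tasks, max_array_size):
        size = min(max_array_size, n_tasks - offset)
        spec = f"0-{size - 1}"
        if max_concurrent is not None:
            spec += f"%{max_concurrent}"
        chunks.append((offset, spec))
    return chunks


def global_index(offset: int, array_task_id: int) -> int:
    """Inside a task: combine the chunk offset with SLURM_ARRAY_TASK_ID."""
    return offset + array_task_id


chunks = chunk_job_array(2500, max_array_size=1000)
```

The offset would have to be handed to each submission (for instance as a script argument), so that a task computes its position in the full parameter space from the offset plus `SLURM_ARRAY_TASK_ID`.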

Labels: enhancement