Frank's feedback on using Slurm #151

@ffelten

Version

  • Did not use the latest version: the HPC-Zaratan branch still needs to be updated to Geri's latest version of the Slurm API.

Experience with the Slurm API

  • The API seems to work quite well once you understand it.
  • The reduce part was not tested.

Pain points

  • Moving from Singularity to Apptainer (see "Support Apptainer", #148).
  • In Airfoil, reset has to be called with different seeds to create different engibench_studies. Why is study_None generated when init is called? (see Frank's branch)
  • More (advanced) usage examples, for instance of the Slurm API, would make the API easier to understand. (see Actions)
  • The container should be cached rather than re-downloaded by each job (see "Slurm scripts caching", #140).
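
To illustrate the caching point, here is a minimal sketch of a "pull once, reuse everywhere" helper. It is not the project's implementation: the function names, the cache layout, and the idea of calling it once before submitting the jobs are all assumptions made for illustration. The only external command used is `apptainer pull <output file> <URI>`.

```python
import hashlib
import subprocess
from pathlib import Path
from typing import Callable


def cached_image_path(cache_dir: str, image_uri: str) -> Path:
    """Map an image URI to a stable file name inside the cache directory."""
    # Hypothetical naming scheme: a short hash of the URI avoids characters
    # such as "/" and ":" in the file name.
    digest = hashlib.sha256(image_uri.encode("utf-8")).hexdigest()[:16]
    return Path(cache_dir) / f"{digest}.sif"


def apptainer_pull(image_uri: str, target: Path) -> None:
    """Download the image with the Apptainer CLI (requires apptainer on PATH)."""
    subprocess.run(["apptainer", "pull", str(target), image_uri], check=True)


def ensure_image(
    cache_dir: str,
    image_uri: str,
    pull: Callable[[str, Path], None] = apptainer_pull,
) -> Path:
    """Return the cached image path, pulling the image only if it is missing."""
    target = cached_image_path(cache_dir, image_uri)
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        pull(image_uri, target)
    return target
```

If something like `ensure_image` runs once on the login node before submission, each job can point at the returned `.sif` file instead of pulling again. Calling it concurrently from many jobs would still race on the download, so this sketch assumes a single caller.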

Gotchas

  • The formula that computes the total number of jobs from the group size can drop some jobs due to rounding. (see Frank's branch)
  • MPI-related tmp files were being written to a memory-bounded folder. This was solved by using Apptainer instead of Singularity. Mounting your own tmp dir also works, but the jobs will clash because they all write to the same tmp. (see Actions)
  • The amount of requested resources differs from the job array cap. The fix was to request the job arrays sequentially. (see Frank's PR)
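
The first two gotchas can be sketched in a few lines. This is an illustration under assumptions, not the code from Frank's branch: `num_jobs`, `split_into_groups`, and `job_tmp_dir` are hypothetical helpers. The environment variables `SLURM_ARRAY_JOB_ID`, `SLURM_ARRAY_TASK_ID`, and `SLURM_JOB_ID` are the ones Slurm sets for running jobs.

```python
from pathlib import Path
from typing import Mapping, Sequence


def num_jobs(n_items: int, group_size: int) -> int:
    """Number of jobs needed to cover n_items, rounding up.

    Floor division (n_items // group_size) silently drops the last,
    partially filled group; ceiling division keeps it.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return (n_items + group_size - 1) // group_size


def split_into_groups(items: Sequence, group_size: int) -> list:
    """Split items into consecutive groups; the last group may be shorter."""
    return [list(items[i:i + group_size]) for i in range(0, len(items), group_size)]


def job_tmp_dir(base: str, env: Mapping[str, str]) -> Path:
    """Build a tmp path that is unique per job (and per array task)."""
    # Array tasks share SLURM_ARRAY_JOB_ID and differ in SLURM_ARRAY_TASK_ID;
    # plain jobs only have SLURM_JOB_ID.
    job_id = env.get("SLURM_ARRAY_JOB_ID") or env.get("SLURM_JOB_ID", "local")
    task_id = env.get("SLURM_ARRAY_TASK_ID", "0")
    return Path(base) / f"job_{job_id}_task_{task_id}"


# 10 parameter sets in groups of 4: floor division would give 2 jobs and
# drop the last 2 parameter sets, while rounding up gives 3 jobs.
groups = split_into_groups(list(range(10)), 4)
assert len(groups) == num_jobs(10, 4) == 3
```

Inside a job, `job_tmp_dir("/scratch/tmp", os.environ)` (with `/scratch/tmp` standing in for whatever directory gets mounted) would give each array task its own folder, which avoids the clash that occurs when all jobs share one tmp.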

Actions
