Main use cases
Dataset generation: calling optimize or simulate over a parameter space. Currently, the config and design factories are used to create a parameter_space using slurm.Args.
Starting gradient descent from different initial points: only the starting_point of the optimization changes; all other parameters stay unchanged.
Running things other than optimize or simulate: render, or even custom code, could run inside the job (e.g., via a callback), with MapReduce-like logic.
Grouping small runs into a bigger job to reduce Slurm scheduling overhead. This might be useful for ML model evaluations.
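The use cases above can be sketched in plain Python, without touching Slurm. This is only an illustration under stated assumptions: `chunked`, `run_group`, `reduce_results`, and `toy_optimize` are hypothetical names, not part of the existing code, and the list of dicts merely stands in for whatever parameter_space slurm.Args actually produces. Each group is what one Slurm job would execute (map), and the results are merged afterwards (reduce).

```python
from itertools import islice
from typing import Any, Callable, Iterable, Iterator


def chunked(items: Iterable[Any], size: int) -> Iterator[list]:
    """Yield consecutive groups of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def run_group(callback: Callable[..., Any], group: list) -> list:
    """Map step: what a single job would run for its group of small runs."""
    return [callback(**args) for args in group]


def reduce_results(per_job_results: Iterable[list]) -> list:
    """Reduce step: flatten the per-job result lists into one list."""
    return [result for job in per_job_results for result in job]


def toy_optimize(starting_point: float, learning_rate: float, steps: int = 50) -> float:
    """Hypothetical callback: gradient descent on f(x) = (x - 1)^2."""
    x = starting_point
    for _ in range(steps):
        x -= learning_rate * 2.0 * (x - 1.0)
    return x


# Multi-start: only starting_point differs between runs (assumed layout).
base = {"learning_rate": 0.1}
parameter_space = [{**base, "starting_point": x0} for x0 in (-2.0, 0.0, 1.0, 3.0, 5.0)]

groups = list(chunked(parameter_space, 2))  # 5 small runs packed into 3 jobs
results = reduce_results(run_group(toy_optimize, g) for g in groups)
```

Because `run_group` takes an arbitrary callback, the same structure would cover render or custom code; the group size is the knob that trades scheduling overhead against per-job runtime.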
Additional features
Specify a runtime limit for Slurm so that jobs that run too long (bad simulations) are cut off. This timeout is job-specific and Euler-specific.
How do we check which run failed, and how do we trace it back to the config/parameters/args that triggered it? What we want: job id, error or timeout, and the args used to run this job.
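A minimal sketch of the record we would want per run (job id, error or timeout, args) could look like the following. Everything here is hypothetical: `RunRecord`, `run_tracked`, and `flaky_simulate` are illustrative names, and a Python `TimeoutError` stands in for a Slurm timeout, which in practice would have to be read from the scheduler's job state rather than caught in-process. `slurm_time` formats a limit in the `days-hours:minutes:seconds` form that Slurm's `--time` option accepts.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class RunRecord:
    job_id: int
    status: str                  # "ok", "error", or "timeout"
    args: dict                   # the args that triggered this run
    error: Optional[str] = None  # error message, if any


def slurm_time(seconds: int) -> str:
    """Format a time limit as days-hours:minutes:seconds."""
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{secs:02d}"


def run_tracked(job_id: int, callback: Callable[..., Any], args: dict) -> RunRecord:
    """Run one job and keep enough information to trace a failure back."""
    try:
        callback(**args)
    except TimeoutError as exc:
        return RunRecord(job_id, "timeout", args, str(exc))
    except Exception as exc:
        return RunRecord(job_id, "error", args, f"{type(exc).__name__}: {exc}")
    return RunRecord(job_id, "ok", args)


def flaky_simulate(velocity: float) -> float:
    """Hypothetical simulation that fails for some inputs."""
    if velocity < 0:
        raise ValueError("velocity must be non-negative")
    if velocity > 100:
        raise TimeoutError("simulation exceeded its time limit")
    return velocity * 2.0


parameter_space = [{"velocity": 1.0}, {"velocity": -3.0}, {"velocity": 500.0}]
records = [run_tracked(i, flaky_simulate, a) for i, a in enumerate(parameter_space)]
failed = [r for r in records if r.status != "ok"]
```

Filtering on `status` then answers "which run failed and with which args" directly from the records.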
Found bugs
Job array size limit on Euler: job arrays that are too large get kicked out by Euler.
The reduce node gets OOM-killed (ask for a bigger node for the reduce step).
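One possible workaround for the array size limit is to split a large parameter space into several smaller job arrays. The sketch below is an assumption, not the current implementation: `split_array` is a hypothetical helper, and the limit of 1000 is a placeholder, not Euler's actual value. Each sub-array reuses indices starting at 0 and carries an offset, so the global run index would be `offset + SLURM_ARRAY_TASK_ID`; this keeps every index below the limit.

```python
def split_array(n_runs: int, max_array_size: int) -> list:
    """Split n_runs into (offset, array_spec) pairs, each within the size limit.

    array_spec is a string such as "0-999", suitable as an `--array` value.
    """
    if n_runs < 0 or max_array_size < 1:
        raise ValueError("n_runs must be >= 0 and max_array_size >= 1")
    specs = []
    for offset in range(0, n_runs, max_array_size):
        count = min(max_array_size, n_runs - offset)
        specs.append((offset, f"0-{count - 1}"))
    return specs


# Placeholder limit of 1000; the real value depends on the cluster configuration.
specs = split_array(2500, 1000)
```

With these numbers, 2500 runs become three submissions, and no single array exceeds the assumed limit.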
Discussion with @markfuge and @g-braeunlich