Rewrite slurm executor #144
Conversation
ffelten left a comment
This looks good to me. What are the next steps on this? Waiting for input from @markfuge and @fgvangessel-umd?
Then I think we can work on solid documentation and classical examples, such as optimizing an array of designs for a given problem. A page on the website would be a good target for this.
I think @fgvangessel-umd would have good input on this, since he has been using this functionality most recently. I passed off my existing work to him as input for his, so he has a good overview and could also provide some use cases for running the containerized environments.
fabfd6a to 5bb5622
Hello guys! It also includes two fixes from the hpc-zaratan branch by @fgvangessel-umd. And: we have doctests now. @ffelten, @fgvangessel-umd, please have a look at the docs! Any feedback is welcome!
And if a Mac user could take a look at why this fails there, that would be great.
ffelten left a comment
Very cool, thanks for the doc, Geri.
I did not see the workflow diagram (GitHub does not render SVGs). But does it explain the "map-reduce/save" flow?
slurm.Args(problem_args={"problem_arg": 1}, design_args={"design_arg": -1}),
slurm.Args(problem_args={"problem_arg": 2}, design_args={"design_arg": -2}),
slurm.Args(problem_args={"problem_arg": 3}, design_args={"design_arg": -3}),
{"config": {"arg": 1.0}, "a": 1, "b": "1"},
I believe making this more concrete (or providing another example), i.e., illustrating it with a design problem, would make it easier to understand the hierarchy in these dicts.
I'm thinking of concrete examples like optimizing beams for different parameters, simulating heat conduction, etc. Things that people will want to do, so that they can just take the example and modify it to get the behavior they need.
We can also point to existing code directly, e.g., dataset_slurm_test.py in photonics?
I will add my dataset generator for the airfoil problem, which can also serve as a concrete example.
For now, a simple example showing optimize() for beams2d is given.
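For readers of this thread, here is a rough sketch of what such a concrete example might look like, combining names that appear elsewhere in this PR (Beams2D, run_job, sbatch_map, volfrac, forcedist); the exact optimize() signature and the construction of the Slurm settings are assumptions, not the merged docs:

```py
# Hypothetical sketch only: optimize Beams2D designs for several volume
# fractions as a Slurm job array. Signatures are assumed from this PR's
# discussion and may differ from the final documentation.
from typing import Any

import numpy as np
from numpy.typing import NDArray

from engibench.problems.beams2d import Beams2D
from engibench.utils import slurm


def run_job(config: dict[str, Any]) -> NDArray[np.float64]:
    # One Slurm array task: build the problem and run one optimization.
    problem = Beams2D()
    return problem.optimize(config=config)  # exact optimize() signature assumed


slurm_args = ...  # cluster-specific Slurm settings; construction omitted here

job = slurm.sbatch_map(
    f=run_job,
    args=[{"volfrac": v, "forcedist": 0.0} for v in (0.1, 0.2, 0.3)],
    slurm_args=slurm_args,
)
job.save("beams2d_results.pkl", slurm_args=slurm_args)
```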
import os
from engibench.utils import slurm
import fake_sbatch
from example_callback import callback
Show what the callback is doing, and mention that you can put any code in there.
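For context, a minimal sketch of what an example_callback module could contain; the callback name comes from the import above, but its signature and body here are assumptions:

```py
# Hypothetical example_callback.py: the callback can run arbitrary code,
# e.g. logging, plotting, or writing intermediate results to disk.
from typing import Any


def callback(result: Any) -> None:
    # Called with a job's result; replace the body with whatever you need.
    print(f"Got result: {result!r}")
```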
**Step 4b:** Save all results in one [pickle](https://docs.python.org/3/library/pickle.html) archive

```py
job.save("results.pkl", slurm_args=slurm_args)
```
How are the slurm_args used when pickling?
Also added an explanation here
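To spell out the point made further down in this thread (the save/reduce step is itself submitted as a Slurm job that depends on the map array): the slurm_args passed to save() presumably configure that dependent job, so it can request its own resources. A sketch under that assumption, with placeholder names:

```py
# Sketch (assumed behavior): slurm_args here configure the dependent save job,
# which Slurm only starts once the whole map array has finished.
from engibench.utils import slurm

job = slurm.sbatch_map(f=run_job, args=args, slurm_args=map_slurm_args)    # placeholders
job.save("results.pkl", slurm_args=save_slurm_args)  # separate settings for the save step
```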
```{note}
In contrast to [sbatch_map()](#engibench.utils.slurm.sbatch_map), [job.save()](#engibench.utils.slurm.SubmittedJobArray.save) and [job.reduce()](#engibench.utils.slurm.SubmittedJobArray.reduce) are blocking.
The only reason that [sbatch_map()](#engibench.utils.slurm.sbatch_map) is non-blocking (unless `wait=True` is passed) is to make it possible to chain this call with [job.save()](#engibench.utils.slurm.SubmittedJobArray.save) / [job.reduce()](#engibench.utils.slurm.SubmittedJobArray.reduce).
```
I don't get this one. You can have sbatch_map blocking and still chain, that's what functional programming usually does?
Thanks. I added an explanation.
Should I just launch the tests?
Yes, it explains the "map-reduce" flow. Does it not render in the browser either (just point your browser to the SVG)?
I assume yes. I am referring to this error here:
Mmmmh, I got the same locally: Now the odd part: So it is there. I quickly tried to normalize the path, so it becomes:
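For readers following along (generic background, not necessarily what was tried here): on macOS, paths under /tmp are symlinks into /private/tmp, so two spellings of the same location can compare unequal even though the file "is there"; resolving both sides before comparing avoids that:

```py
# Generic sketch, not this PR's fix: resolve symlinks and normalize both
# paths before comparing them.
from pathlib import Path


def same_path(a: str, b: str) -> bool:
    return Path(a).resolve() == Path(b).resolve()
```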
and so on.
By default, [sbatch_map()](#engibench.utils.slurm.sbatch_map) submits the job array in the background. That means that the execution flow of the Python script will continue while the jobs are running.
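A short sketch of that background behavior, using the API shown above (the wait=True flag is the one mentioned in the note; do_other_work and the other names are placeholders):

```py
# Sketch: sbatch_map() returns immediately; the script keeps running while the
# Slurm job array executes. job.save() then blocks until all tasks are done.
job = slurm.sbatch_map(f=run_job, args=args, slurm_args=slurm_args)
do_other_work()  # placeholder: runs while the array is still queued or running
job.save("results.pkl", slurm_args=slurm_args)

# Or block immediately on submission:
job = slurm.sbatch_map(f=run_job, args=args, slurm_args=slurm_args, wait=True)
```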
I can also imagine a situation in which we'd want to run sbatch_map commands serially (e.g., to avoid overutilization of HPC resources). In this case we'd want to monitor the sbatch_map id to determine when resources are available to submit the next batch. An example of how I achieved this is copied from my dataset code below:
submitted_jobs = []
for ibatch in range(int(n_sbatch_maps)):
    sim_batch_configs = simulate_configs_designs[ibatch * group_size * n_slurm_array : (ibatch + 1) * group_size * n_slurm_array]
    print(len(sim_batch_configs))
    print(f"Submitting batch {ibatch + 1}/{int(n_sbatch_maps)}")
    job_array = slurm.sbatch_map(
        f=simulate_slurm,
        args=sim_batch_configs,
        slurm_args=slurm_config,
        group_size=group_size,  # Number of jobs to batch in sequence to reduce job array size
        reduce_job=post_process_simulate,
        out=None,
    )
    # Save the job array reference
    submitted_jobs.append(job_array)
    # Wait for this job to complete by calling save()
    # This will submit a dependent job that waits for the array to finish
    print(f"Waiting for batch {ibatch + 1} to complete...")
    job_array.save(f"results_{ibatch}.pkl", slurm_args=slurm_config)
    print(f"Batch {ibatch + 1} completed!")
Wouldn't that be the job of Slurm itself? I.e., Slurm knows the available resources and only runs a job when enough resources are available.
If dependency is the reason for waiting for the completion of a batch of jobs, that is also something Slurm can handle on its own: https://slurm.schedmd.com/sbatch.html (look for --dependency). This is actually already what is done to delay the reduce or save job until the map batch has completed.
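For reference, a minimal sketch of what such a dependency looks like at the sbatch level (generic Slurm usage, not code from this PR; the script name and job id are placeholders):

```py
# Generic Slurm sketch: submit a follow-up job that only starts after a
# previously submitted job array has finished successfully.
import subprocess

array_job_id = "12345"  # placeholder: id returned when the array was submitted
subprocess.run(
    ["sbatch", f"--dependency=afterok:{array_job_id}", "post_process.sh"],
    check=True,
)
```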
What you are describing makes sense, and I would have thought that to be the case. However, I have found that Slurm will allocate and run jobs which, in aggregate, trigger excessive-usage errors on the HPC. When this occurs, all jobs associated with my user ID are killed. Like you say, I would have thought that, based on the requested CPUs and memory, Slurm could avoid this, but that appears not to be the case. This would be a good question to address to Zaratan HPC IT for more information.
OK. On Euler, @markfuge experienced a slightly different issue: Too many small jobs. To this end, I have introduced the group_size argument to sbatch_map. Would this help in your case?
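A sketch of that option, assuming group_size bundles several args into one array task as described in the code comment above (the numbers are hypothetical):

```py
# Hypothetical: 1000 configs, bundled 50 per array task, so the job array has
# 20 tasks instead of 1000 small jobs.
job = slurm.sbatch_map(f=run_job, args=configs, slurm_args=slurm_args, group_size=50)
```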
fgvangessel-umd left a comment
The Slurm API docs are very helpful for understanding what's going on. I think the next steps on my end are to verify that my dataset generator script works on the main branch now that the slurm and container changes have been incorporated.
b109e50 to eda7838
1d4b710 to e17c1cf
New stuff:
@g-braeunlich, do you have ideas on how to move forward with this?
Sorry, no clue why this happens on Mac / Windows.
I'm fine with skipping this test for Windows and Mac. This code is intended for HPC environments anyway.
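A minimal sketch of how such a skip could look with pytest (generic pytest usage; the test name is a placeholder):

```py
# Hypothetical sketch: skip the test on macOS and Windows, since the Slurm
# executor targets HPC (Linux) environments anyway.
import sys

import pytest


@pytest.mark.skipif(sys.platform != "linux", reason="Slurm executor targets Linux HPC environments")
def test_slurm_executor() -> None:
    ...
```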
eef7988 to 9008bb1
Just a few minor comments in the doc. LGTM! Really nice work @g-braeunlich
from engibench.problems.beams2d import Beams2D


def run_job(config: dict[str, Any]) -> NDArray[np.float64]:
Shouldn't the config be passed to optimize? I see in the example that volfrac and forcedist are being passed. These are for optimization, not for problem instantiation (at least they used to be configurable only when calling optimize or simulate).
OK. The arguments are now optimize args.
problem = ExampleProblem(some_arg=True, problem_arg=1)
design = ExampleDesign(design_arg=-1)
problem.simulate(design, config={"sim_arg": 10})
run_job(config={"volfrac": 0.1}, "forcedist": 0.0) # element 1
I think forcedist should be in the dict
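Presumably the intended call, with forcedist inside the config dict (same placeholder values as above):

```py
run_job(config={"volfrac": 0.1, "forcedist": 0.0})  # element 1
```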
Slightly modified the error handling in …