
Rewrite slurm executor #144

Merged
g-braeunlich merged 8 commits into main from slurm on Aug 13, 2025

Conversation

@g-braeunlich
Collaborator

No description provided.

@g-braeunlich g-braeunlich requested a review from ffelten June 30, 2025 12:58
@g-braeunlich g-braeunlich self-assigned this Jun 30, 2025
@g-braeunlich g-braeunlich mentioned this pull request Jun 30, 2025
Collaborator

@ffelten ffelten left a comment

This looks good to me. What are the next steps on this? Waiting for inputs from @markfuge and @fgvangessel-umd ?

Then I think we can work on solid documentation and classic examples, such as optimizing an array of designs for a given problem. A page on the website would be a good target for this.

Comment thread engibench/utils/slurm/run_job.py Outdated
Comment thread engibench/utils/slurm/run_job.py Outdated
Comment thread engibench/utils/slurm/run_job.py Outdated
Comment thread tests/utils/test_slurm.py
Comment thread tests/utils/test_slurm.py
Comment thread engibench/utils/slurm/__init__.py
@markfuge
Member

markfuge commented Jul 1, 2025

I think @fgvangessel-umd would have good input on this, since he has been using this functionality most recently. I passed off my existing work to him as input for his, so he has a good overview and could also provide some use cases for running the containerized environments.

@g-braeunlich
Collaborator Author

g-braeunlich commented Jul 25, 2025

Hello guys
New version pushed. It now includes docs and the ported version of engibench/problems/photonics2d/dataset_slurm_test.py. @markfuge please feel free to adapt this file to your needs. This is just a suggestion.

It also includes 2 fixes from the hpc-zaratan branch by @fgvangessel-umd

And: we have doctests now.

@ffelten , @fgvangessel-umd Please have a look at the docs! Any feedback is welcome!

@g-braeunlich
Collaborator Author

And if a mac user could examine why /Users/runner/work/EngiBench/EngiBench/tests/utils/../tools/fake_sbatch.py is not found and where to look for it instead - that would be awesome!

Collaborator

@ffelten ffelten left a comment

Very cool, thanks for the doc Geri.

I did not see the workflow diagram (GitHub does not render SVGs). But does it explain the "map-reduce/save" flow?

Comment thread docs/utils/slurm.md Outdated
```py
slurm.Args(problem_args={"problem_arg": 1}, design_args={"design_arg": -1}),
slurm.Args(problem_args={"problem_arg": 2}, design_args={"design_arg": -2}),
slurm.Args(problem_args={"problem_arg": 3}, design_args={"design_arg": -3}),
{"config": {"arg": 1.0}, "a": 1, "b": "1"},
```
Collaborator

I believe making this more concrete (or providing another example), i.e., illustrating it with a design problem, would make it easier to understand the hierarchy in these dicts.

I'm thinking about concrete examples like optimizing beams for different parameters, simulating heat conduction, etc. Things that people will want to do, so that they can just take the example and modify it to get the behavior they need.

We can also point to existing code directly, e.g., dataset_slurm_test.py in photonics?

Contributor

I will add my dataset generator for the airfoil problem which can also serve as a concrete example

Collaborator Author

For now, a simple example showing optimize() for beams2d is given.

Comment thread docs/utils/slurm.md Outdated
```py
import os
from engibench.utils import slurm
import fake_sbatch
from example_callback import callback
```
Collaborator

Show what the callback is doing, and mention that you can put any code in there.

Comment thread docs/utils/slurm.md
**Step 4b:** Save all results in one [pickle](https://docs.python.org/3/library/pickle.html) archive:

```py
job.save("results.pkl", slurm_args=slurm_args)
```
Collaborator

How are the slurm_args used when pickling?

Collaborator Author

Also added an explanation here

Comment thread docs/utils/slurm.md Outdated

```{note}
In contrast to [sbatch_map()](#engibench.utils.slurm.sbatch_map), [job.save()](#engibench.utils.slurm.SubmittedJobArray.save) and [job.reduce()](#engibench.utils.slurm.SubmittedJobArray.reduce) are blocking.
The only reason that [sbatch_map()](#engibench.utils.slurm.sbatch_map) is non-blocking (unless `wait=True` is passed) is to make it possible to chain this call with [job.save()](#engibench.utils.slurm.SubmittedJobArray.save) / [job.reduce()](#engibench.utils.slurm.SubmittedJobArray.reduce).
```
Collaborator

I don't get this one. You can have sbatch_map blocking and still chain; that's what functional programming usually does?

Collaborator Author

Thanks. I added an explanation.
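To make the blocking behavior discussed in this thread concrete, the map-then-reduce chaining can be mimicked with a small self-contained stand-in. Note that `local_map`, `local_reduce`, and the thread pool below are illustrative placeholders, not the engibench slurm API:

```python
from concurrent.futures import ThreadPoolExecutor


def local_map(f, args):
    """Illustrative stand-in for sbatch_map: submit work in the
    background and return a handle immediately (non-blocking)."""
    executor = ThreadPoolExecutor()
    return executor, [executor.submit(f, a) for a in args]


def local_reduce(handle, reducer):
    """Illustrative stand-in for job.reduce(): block until all jobs
    have finished, then combine their results."""
    executor, futures = handle
    results = [fut.result() for fut in futures]  # blocks here
    executor.shutdown()
    return reducer(results)


def square(x):
    return x * x


handle = local_map(square, [1, 2, 3])  # returns immediately
print(local_reduce(handle, sum))       # blocks until done, prints 14
```

Because the map step only returns a handle, the script can keep running (or submit further work) between the map and the blocking reduce, which is the chaining the docs describe.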

Comment thread docs/utils/slurm.md
@ffelten
Collaborator

ffelten commented Jul 29, 2025

> And if a mac user could examine why /Users/runner/work/EngiBench/EngiBench/tests/utils/../tools/fake_sbatch.py is not found and where to look for it instead - that would be awesome!

Should I just launch the tests?

@g-braeunlich
Collaborator Author

> Very cool, thanks for the doc Geri.
>
> I did not see the workflow diagram (GitHub does not render SVGs). But does it explain the "map-reduce/save" flow?

Yes, it explains the "map-reduce" flow. Does it not render in the browser either (just point your browser to the svg)?

@g-braeunlich
Collaborator Author

> > And if a mac user could examine why /Users/runner/work/EngiBench/EngiBench/tests/utils/../tools/fake_sbatch.py is not found and where to look for it instead - that would be awesome!
>
> Should I just launch the tests?

I assume yes. I am referring to this error here.

@ffelten
Collaborator

ffelten commented Jul 31, 2025

> > > And if a mac user could examine why /Users/runner/work/EngiBench/EngiBench/tests/utils/../tools/fake_sbatch.py is not found and where to look for it instead - that would be awesome!
> >
> > Should I just launch the tests?
>
> I assume yes. I am referring to this error here.

Mmmmh, I got the same locally:

```
FileNotFoundError: [Errno 2] No such file or directory: '/Users/ffelte/Documents/EngiBench/tests/utils/../tools/fake_sbatch.py'
```

Now the odd part:

```sh
cat /Users/ffelte/Documents/EngiBench/tests/utils/../tools/fake_sbatch.py
```

```py
#!/bin/env python3

import argparse
import shlex
import subprocess


def parse_array_range(s: str) -> tuple[slice, int | None]:
    """Parse a string like 1-3 or 1-3%1000."""
    if "%" in s:
        s, max_jobs_raw = s.split("%", 1)
        max_jobs = int(max_jobs_raw)
    else:
        max_jobs = None

...
```

So it is there.

I quickly tried to normalize the path, so it becomes: /Users/ffelte/Documents/EngiBench/tests/tools/fake_sbatch.py. Still not found.

Comment thread docs/utils/slurm.md
and so on.
By default, [sbatch_map()](#engibench.utils.slurm.sbatch_map) submits the job array in the background. That means that the execution flow of the Python script will continue while the jobs are running.
Contributor

@fgvangessel-umd fgvangessel-umd Aug 6, 2025

I can also imagine a situation in which we'd want to run sbatch_map commands serially (e.g. to avoid overutilization of HPC resources). In this case we'd want to monitor the sbatch_map id to determine when resources are available to submit the next batch. An example of how I achieved this is copied from my dataset code below:

```py
submitted_jobs = []
for ibatch in range(int(n_sbatch_maps)):
    sim_batch_configs = simulate_configs_designs[ibatch * group_size * n_slurm_array : (ibatch + 1) * group_size * n_slurm_array]
    print(len(sim_batch_configs))
    print(f"Submitting batch {ibatch + 1}/{int(n_sbatch_maps)}")

    job_array = slurm.sbatch_map(
        f=simulate_slurm,
        args=sim_batch_configs,
        slurm_args=slurm_config,
        group_size=group_size,  # Number of jobs to batch in sequence to reduce job array size
        reduce_job=post_process_simulate,
        out=None,
    )

    # Save the job array reference
    submitted_jobs.append(job_array)

    # Wait for this job to complete by calling save()
    # This will submit a dependent job that waits for the array to finish
    print(f"Waiting for batch {ibatch + 1} to complete...")
    job_array.save(f"results_{ibatch}.pkl", slurm_args=slurm_config)
    print(f"Batch {ibatch + 1} completed!")
```

Collaborator Author

Wouldn't that be the job of Slurm itself? I.e., Slurm knows the available resources and only runs a job when enough resources are available.
If dependency is the reason for waiting for the completion of a batch of jobs, that is also something Slurm can handle on its own: https://slurm.schedmd.com/sbatch.html (look for --dependency). This is actually already what is done to delay the reduce or save job until the map batch has completed.
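For reference, a minimal sketch of what such a dependency submission could look like when assembled from Python. The job id `12345` and the script name `reduce_job.sh` are made-up placeholders; `--dependency=afterok:<jobid>` is the documented sbatch flag:

```python
# Sketch: build an sbatch command for a reduce job that Slurm itself
# delays until the whole map array (hypothetical job id 12345) has
# completed successfully.
map_job_id = "12345"
cmd = [
    "sbatch",
    f"--dependency=afterok:{map_job_id}",  # run only after the array succeeds
    "reduce_job.sh",
]
print(" ".join(cmd))  # sbatch --dependency=afterok:12345 reduce_job.sh
```

On a real cluster the command would then be launched, e.g. via `subprocess.run(cmd)`.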

Contributor

What you are describing makes sense and I would have thought that to be the case. However, I have found that Slurm will allocate and run jobs which, in aggregate, trigger excessive-usage errors on the HPC. When this occurs, all jobs associated with my user ID are killed. Like you say, I would have thought that, based on the requested CPUs and memory, Slurm could avoid this, but that appears not to be the case. This would be a good question to address to Zaratan HPC IT for more information.

Collaborator Author

OK. On Euler, @markfuge experienced a slightly different issue: too many small jobs. To this end, I have introduced the group_size argument to sbatch_map. Would this help in your case?

Contributor

@fgvangessel-umd fgvangessel-umd left a comment

The Slurm API docs are very helpful for understanding what's going on. I think the next steps on my end are to verify that my dataset generator script works on the main branch now that the slurm and container changes have been incorporated.

@g-braeunlich g-braeunlich force-pushed the slurm branch 3 times, most recently from b109e50 to eda7838 Compare August 7, 2025 13:00
@g-braeunlich g-braeunlich force-pushed the slurm branch 2 times, most recently from 1d4b710 to e17c1cf Compare August 8, 2025 08:16
@g-braeunlich
Collaborator Author

New stuff:

  • Adapt the docs to the new conditions API (this failed on main) -> we should always build the docs, but only publish them on main
  • Improve error handling: JobError now contains a backtrace, and reduce() raises an exception if any of the jobs failed.
  • New docs about error handling

@ffelten
Collaborator

ffelten commented Aug 8, 2025

> > > > And if a mac user could examine why /Users/runner/work/EngiBench/EngiBench/tests/utils/../tools/fake_sbatch.py is not found and where to look for it instead - that would be awesome!
> > >
> > > Should I just launch the tests?
> >
> > I assume yes. I am referring to this error here.
>
> Mmmmh, I got the same locally: FileNotFoundError: [Errno 2] No such file or directory: '/Users/ffelte/Documents/EngiBench/tests/utils/../tools/fake_sbatch.py'
>
> Now the odd part: cat /Users/ffelte/Documents/EngiBench/tests/utils/../tools/fake_sbatch.py
>
> ```py
> #!/bin/env python3
>
> import argparse
> import shlex
> import subprocess
>
>
> def parse_array_range(s: str) -> tuple[slice, int | None]:
>     """Parse a string like 1-3 or 1-3%1000."""
>     if "%" in s:
>         s, max_jobs_raw = s.split("%", 1)
>         max_jobs = int(max_jobs_raw)
>     else:
>         max_jobs = None
>
> ...
> ```
>
> So it is there.
>
> I quickly tried to normalize the path, so it becomes: /Users/ffelte/Documents/EngiBench/tests/tools/fake_sbatch.py. Still not found.

@g-braeunlich do you have ideas on how to move forward with this?

@g-braeunlich
Collaborator Author

Sorry, no clue why this happens on Mac / Windows.
We could skip the tests on Windows / macOS.
The alternative is that a Windows or Mac user debugs this.

@ffelten
Collaborator

ffelten commented Aug 8, 2025

> Sorry, no clue why this happens on Mac / Windows. We could skip the tests on Windows / macOS. The alternative is that a Windows or Mac user debugs this.

I'm fine with skipping this test for Windows and Mac. This code is intended for HPC environments anyway.

@g-braeunlich g-braeunlich force-pushed the slurm branch 3 times, most recently from eef7988 to 9008bb1 Compare August 8, 2025 16:30
Collaborator

@ffelten ffelten left a comment

Just a few minor comments in the doc. LGTM! Really nice work @g-braeunlich

```py
from engibench.problems.beams2d import Beams2D


def run_job(config: dict[str, Any]) -> NDArray[np.float64]:
```
Collaborator

Shouldn't the config be passed to optimize? I see in the example that volfrac and forcedist are being passed. These are for optimization, not for problem instantiation (at least they used to be configurable only when calling optimize or simulate).

Collaborator Author

OK. The arguments are now optimize args.

Comment thread docs/utils/slurm.md Outdated
```py
problem = ExampleProblem(some_arg=True, problem_arg=1)
design = ExampleDesign(design_arg=-1)
problem.simulate(design, config={"sim_arg": 10})
run_job(config={"volfrac": 0.1}, "forcedist": 0.0)  # element 1
```
Collaborator

I think forcedist should be in the dict

Collaborator Author

Thanks.

@g-braeunlich
Collaborator Author

g-braeunlich commented Aug 13, 2025

Slightly modified the error handling in sbatch_map / reduce:
Failed jobs will now first be passed to the reduce callback, giving the callback a chance to handle failed jobs itself.
If the callback raises an error, that error, together with the collected errors of all failed jobs, will be raised by reduce.
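A plain-Python sketch of that flow. `JobError` here is a simplified stand-in for the actual engibench type, and the function names are illustrative, not the real API:

```python
class JobError(Exception):
    """Simplified stand-in for the error type described above."""


def reduce_results(results, reducer):
    # Failed jobs are passed to the reduce callback first, giving it
    # a chance to handle them itself.
    try:
        return reducer(results)
    except Exception as exc:
        # If the callback raises, re-raise together with the collected
        # errors of all failed jobs.
        failures = [r for r in results if isinstance(r, JobError)]
        raise JobError(f"{exc}; {len(failures)} job(s) failed: {failures}") from exc


def sum_successful_only(results):
    # A callback that refuses to handle failed jobs itself.
    if any(isinstance(r, JobError) for r in results):
        raise ValueError("cannot reduce: some jobs failed")
    return sum(results)


try:
    reduce_results([1, JobError("job 2 crashed"), 3], sum_successful_only)
except JobError as err:
    print(err)  # reports the reducer error plus the collected job errors
```

A callback that tolerates failures (e.g. by filtering out `JobError` entries) would simply return a result and no exception would propagate.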

@g-braeunlich g-braeunlich merged commit c4c667a into main Aug 13, 2025
10 checks passed
@g-braeunlich g-braeunlich deleted the slurm branch August 13, 2025 18:38
@ffelten ffelten mentioned this pull request Aug 25, 2025