
Rewrite slurm executor #144

Merged
g-braeunlich merged 8 commits into main from slurm on Aug 13, 2025

Conversation

@g-braeunlich
Collaborator

No description provided.

@g-braeunlich g-braeunlich requested a review from ffelten June 30, 2025 12:58
@g-braeunlich g-braeunlich self-assigned this Jun 30, 2025
@g-braeunlich g-braeunlich mentioned this pull request Jun 30, 2025
Collaborator

@ffelten ffelten left a comment

This looks good to me. What are the next steps on this? Waiting for inputs from @markfuge and @fgvangessel-umd ?

Then I think we can work on solid documentation and classic examples, such as optimizing an array of designs for a given problem. A page on the website would be a good target for this.

Comment thread engibench/utils/slurm/run_job.py Outdated
Comment thread engibench/utils/slurm/run_job.py Outdated
Comment thread engibench/utils/slurm/run_job.py Outdated
Comment thread tests/utils/test_slurm.py
Comment thread tests/utils/test_slurm.py
Comment thread engibench/utils/slurm/__init__.py
@markfuge
Member

markfuge commented Jul 1, 2025

I think @fgvangessel-umd would have good input on this, since he has been using this functionality most recently. I passed off my existing work to him as input for his, so he has a good overview and could also provide some use cases for running the containerized environments.

@g-braeunlich
Collaborator Author

g-braeunlich commented Jul 25, 2025

Hello guys
New version pushed. It now includes docs and the ported version of engibench/problems/photonics2d/dataset_slurm_test.py. @markfuge please feel free to adapt this file to your needs. This is just a suggestion.

It also includes 2 fixes from the hpc-zaratan branch by @fgvangessel-umd

And: we have doctests now.

@ffelten , @fgvangessel-umd Please have a look at the docs! Any feedback is welcome!

@g-braeunlich
Collaborator Author

And if a mac user could examine why /Users/runner/work/EngiBench/EngiBench/tests/utils/../tools/fake_sbatch.py is not found and where to look for it instead - that would be awesome!

Collaborator

@ffelten ffelten left a comment

Very cool, thanks for the doc Geri.

I did not see the workflow diagram (GitHub does not render SVGs). But does it explain the "map-reduce/save" flow?

Comment thread docs/utils/slurm.md Outdated
```py
slurm.Args(problem_args={"problem_arg": 1}, design_args={"design_arg": -1}),
slurm.Args(problem_args={"problem_arg": 2}, design_args={"design_arg": -2}),
slurm.Args(problem_args={"problem_arg": 3}, design_args={"design_arg": -3}),
{"config": {"arg": 1.0}, "a": 1, "b": "1"},
```
Collaborator

I believe making this more concrete (or providing another example), i.e., illustrating it with a design problem, would make it easier to understand the hierarchy in these dicts.

I'm thinking about concrete examples like optimizing beams for different parameters, simulating heat conduction, etc. Things that people will want to do, so that they can just take the example and modify it to get the behavior they need.

We can also point to existing code directly, e.g., dataset_slurm_test.py in photonics?

Contributor

I will add my dataset generator for the airfoil problem which can also serve as a concrete example

Collaborator Author

For now, a simple example showing optimize() for beams2d is given.

Comment thread docs/utils/slurm.md Outdated
```py
import os
from engibench.utils import slurm
import fake_sbatch
from example_callback import callback
```
Collaborator

Show what the callback is doing, and mention that you can put any code in there.

Comment thread docs/utils/slurm.md
**Step 4b:** Save all results in one [pickle](https://docs.python.org/3/library/pickle.html) archive:

```py
job.save("results.pkl", slurm_args=slurm_args)
```
Collaborator

How are the slurm_args used when pickling?

Collaborator Author

Also added an explanation here

Comment thread docs/utils/slurm.md Outdated

```{note}
In contrast to [sbatch_map()](#engibench.utils.slurm.sbatch_map), [job.save()](#engibench.utils.slurm.SubmittedJobArray.save) and [job.reduce()](#engibench.utils.slurm.SubmittedJobArray.reduce) are blocking.
The only reason that [sbatch_map()](#engibench.utils.slurm.sbatch_map) is non-blocking (unless `wait=True` is passed) is to make it possible to chain this call with [job.save()](#engibench.utils.slurm.SubmittedJobArray.save) / [job.reduce()](#engibench.utils.slurm.SubmittedJobArray.reduce).
```
Collaborator

I don't get this one. You can have sbatch_map blocking and still chain; that's what functional programming usually does?

Collaborator Author

Thanks. I added an explanation.
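To make the blocking behavior discussed in this thread concrete, the map-then-reduce chaining can be mimicked with a small self-contained stand-in. Note that `local_map`, `local_reduce`, and the thread pool below are illustrative placeholders, not the engibench slurm API:

```python
from concurrent.futures import ThreadPoolExecutor


def local_map(f, args):
    """Illustrative stand-in for sbatch_map: submit work in the
    background and return a handle immediately (non-blocking)."""
    executor = ThreadPoolExecutor()
    return executor, [executor.submit(f, a) for a in args]


def local_reduce(handle, reducer):
    """Illustrative stand-in for job.reduce(): block until all jobs
    have finished, then combine their results."""
    executor, futures = handle
    results = [fut.result() for fut in futures]  # blocks here
    executor.shutdown()
    return reducer(results)


def square(x):
    return x * x


handle = local_map(square, [1, 2, 3])  # returns immediately
print(local_reduce(handle, sum))       # blocks until done, prints 14
```

Because the map step only returns a handle, the script can keep running (or submit further work) between the map and the blocking reduce, which is the chaining the docs describe.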

Comment thread docs/utils/slurm.md
@ffelten
Collaborator

ffelten commented Jul 29, 2025

> And if a mac user could examine why /Users/runner/work/EngiBench/EngiBench/tests/utils/../tools/fake_sbatch.py is not found and where to look for it instead - that would be awesome!

Should I just launch the tests?

@g-braeunlich
Collaborator Author

> Very cool, thanks for the doc Geri.
>
> I did not see the workflow diagram (GitHub does not render SVGs). But does it explain the "map-reduce/save" flow?

Yes, it explains the "map-reduce" flow. Does it not render in the browser either (just point your browser to the svg)?

@g-braeunlich
Collaborator Author

> > And if a mac user could examine why /Users/runner/work/EngiBench/EngiBench/tests/utils/../tools/fake_sbatch.py is not found and where to look for it instead - that would be awesome!
>
> Should I just launch the tests?

I assume yes. I am referring to this error here.

@ffelten
Collaborator

ffelten commented Jul 31, 2025

> > > And if a mac user could examine why /Users/runner/work/EngiBench/EngiBench/tests/utils/../tools/fake_sbatch.py is not found and where to look for it instead - that would be awesome!
> >
> > Should I just launch the tests?
>
> I assume yes. I am referring to this error here.

Mmmmh, I got the same locally:

```
FileNotFoundError: [Errno 2] No such file or directory: '/Users/ffelte/Documents/EngiBench/tests/utils/../tools/fake_sbatch.py'
```

Now the odd part:

```sh
cat /Users/ffelte/Documents/EngiBench/tests/utils/../tools/fake_sbatch.py
```

```py
#!/bin/env python3

import argparse
import shlex
import subprocess


def parse_array_range(s: str) -> tuple[slice, int | None]:
    """Parse a string like 1-3 or 1-3%1000."""
    if "%" in s:
        s, max_jobs_raw = s.split("%", 1)
        max_jobs = int(max_jobs_raw)
    else:
        max_jobs = None

...
```

So it is there.

I quickly tried to normalize the path, so it becomes: /Users/ffelte/Documents/EngiBench/tests/tools/fake_sbatch.py. Still not found.

Comment thread docs/utils/slurm.md
and so on.
By default, [sbatch_map()](#engibench.utils.slurm.sbatch_map) submits the job array in the background. That means that the execution flow of the Python script will continue while the jobs are running.
Contributor

@fgvangessel-umd fgvangessel-umd Aug 6, 2025

I can also imagine a situation in which we'd want to run sbatch_map commands serially (e.g. to avoid overutilization of HPC resources). In this case we'd want to monitor the sbatch_map id to determine when resources are available to submit the next batch. An example of how I achieved this is copied from my dataset code below:

```py
submitted_jobs = []
for ibatch in range(int(n_sbatch_maps)):
    sim_batch_configs = simulate_configs_designs[ibatch * group_size * n_slurm_array : (ibatch + 1) * group_size * n_slurm_array]
    print(len(sim_batch_configs))
    print(f"Submitting batch {ibatch + 1}/{int(n_sbatch_maps)}")

    job_array = slurm.sbatch_map(
        f=simulate_slurm,
        args=sim_batch_configs,
        slurm_args=slurm_config,
        group_size=group_size,  # Number of jobs to batch in sequence to reduce job array size
        reduce_job=post_process_simulate,
        out=None,
    )

    # Save the job array reference
    submitted_jobs.append(job_array)

    # Wait for this job to complete by calling save()
    # This will submit a dependent job that waits for the array to finish
    print(f"Waiting for batch {ibatch + 1} to complete...")
    job_array.save(f"results_{ibatch}.pkl", slurm_args=slurm_config)
    print(f"Batch {ibatch + 1} completed!")
```

Collaborator Author

Wouldn't that be the job of Slurm itself? I.e., Slurm knows the available resources and only runs a job when enough resources are available.
If dependency is the reason for waiting for the completion of a batch of jobs, that is also something Slurm can handle on its own: https://slurm.schedmd.com/sbatch.html (look for --dependency). This is actually already what is done to delay the reduce or save job until the map batch has completed.
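For reference, a minimal sketch of what such a dependency submission could look like when assembled from Python. The job id `12345` and the script name `reduce_job.sh` are made-up placeholders; `--dependency=afterok:<jobid>` is the documented sbatch flag:

```python
# Sketch: build an sbatch command for a reduce job that Slurm itself
# delays until the whole map array (hypothetical job id 12345) has
# completed successfully.
map_job_id = "12345"
cmd = [
    "sbatch",
    f"--dependency=afterok:{map_job_id}",  # run only after the array succeeds
    "reduce_job.sh",
]
print(" ".join(cmd))  # sbatch --dependency=afterok:12345 reduce_job.sh
```

On a real cluster the command would then be launched, e.g. via `subprocess.run(cmd)`.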

Contributor

What you are describing makes sense and I would have thought that to be the case. However, I have found that Slurm will allocate and run jobs which, in aggregate, trigger excessive-usage errors on the HPC. When this occurs, all jobs associated with my user ID are killed. Like you say, I would have thought that, based on the requested CPUs and memory, Slurm could avoid this, but that appears not to be the case. This would be a good question to address to Zaratan HPC IT for more information.

Collaborator Author

OK. On Euler, @markfuge experienced a slightly different issue: too many small jobs. To this end, I have introduced the group_size argument to sbatch_map. Would this help in your case?

Contributor

@fgvangessel-umd fgvangessel-umd left a comment

The Slurm API docs are very helpful for understanding what's going on. I think the next steps on my end are to verify that my dataset generator script works on the main branch now that the slurm and container changes have been incorporated.

@g-braeunlich g-braeunlich force-pushed the slurm branch 3 times, most recently from b109e50 to eda7838 Compare August 7, 2025 13:00
@g-braeunlich g-braeunlich force-pushed the slurm branch 2 times, most recently from 1d4b710 to e17c1cf Compare August 8, 2025 08:16
@g-braeunlich
Collaborator Author

New stuff:

  • Adapt the docs to the new conditions API (this failed on main) -> we should always build the docs, but only publish them on main
  • Improve error handling: JobError now contains a backtrace, and reduce() raises an exception if any of the jobs failed.
  • New docs about error handling

@ffelten
Collaborator

ffelten commented Aug 8, 2025

> > > > And if a mac user could examine why /Users/runner/work/EngiBench/EngiBench/tests/utils/../tools/fake_sbatch.py is not found and where to look for it instead - that would be awesome!
> > >
> > > Should I just launch the tests?
> >
> > I assume yes. I am referring to this error here.
>
> Mmmmh, I got the same locally: FileNotFoundError: [Errno 2] No such file or directory: '/Users/ffelte/Documents/EngiBench/tests/utils/../tools/fake_sbatch.py'
>
> Now the odd part: cat /Users/ffelte/Documents/EngiBench/tests/utils/../tools/fake_sbatch.py
>
> ```py
> #!/bin/env python3
>
> import argparse
> import shlex
> import subprocess
>
>
> def parse_array_range(s: str) -> tuple[slice, int | None]:
>     """Parse a string like 1-3 or 1-3%1000."""
>     if "%" in s:
>         s, max_jobs_raw = s.split("%", 1)
>         max_jobs = int(max_jobs_raw)
>     else:
>         max_jobs = None
>
> ...
> ```
>
> So it is there.
>
> I quickly tried to normalize the path, so it becomes: /Users/ffelte/Documents/EngiBench/tests/tools/fake_sbatch.py. Still not found.

@g-braeunlich do you have ideas on how to move forward with this?

@g-braeunlich
Collaborator Author

Sorry, no clue why this happens on Mac / Windows.
We could skip the tests on Windows / macOS.
The alternative is that a Windows or Mac user debugs this.

@ffelten
Collaborator

ffelten commented Aug 8, 2025

> Sorry, no clue why this happens on Mac / Windows. We could skip the tests on Windows / macOS. The alternative is that a Windows or Mac user debugs this.

I'm fine with skipping this test for Windows and Mac. This code is intended for HPC environments anyway.

@g-braeunlich g-braeunlich force-pushed the slurm branch 3 times, most recently from eef7988 to 9008bb1 Compare August 8, 2025 16:30
Collaborator

@ffelten ffelten left a comment

Just a few minor comments in the doc. LGTM! Really nice work @g-braeunlich

```py
from engibench.problems.beams2d import Beams2D


def run_job(config: dict[str, Any]) -> NDArray[np.float64]:
```
Collaborator

Shouldn't the config be passed to optimize? I see in the example that volfrac and forcedist are being passed. These are for optimization, not for problem instantiation (at least they used to be configurable only when calling optimize or simulate).

Collaborator Author

OK. The arguments are now optimize args.

Comment thread docs/utils/slurm.md Outdated
```py
problem = ExampleProblem(some_arg=True, problem_arg=1)
design = ExampleDesign(design_arg=-1)
problem.simulate(design, config={"sim_arg": 10})
run_job(config={"volfrac": 0.1}, "forcedist": 0.0)  # element 1
```
Collaborator

I think forcedist should be in the dict

Collaborator Author

Thanks.

@g-braeunlich
Collaborator Author

g-braeunlich commented Aug 13, 2025

Slightly modified the error handling in sbatch_map / reduce:
Failed jobs will now first be passed to the reduce callback, giving the callback a chance to handle failed jobs itself.
If the callback raises an error, that error, together with the collected errors of all failed jobs, will be raised by reduce.
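A plain-Python sketch of that flow. `JobError` here is a simplified stand-in for the actual engibench type, and the function names are illustrative, not the real API:

```python
class JobError(Exception):
    """Simplified stand-in for the error type described above."""


def reduce_results(results, reducer):
    # Failed jobs are passed to the reduce callback first, giving it
    # a chance to handle them itself.
    try:
        return reducer(results)
    except Exception as exc:
        # If the callback raises, re-raise together with the collected
        # errors of all failed jobs.
        failures = [r for r in results if isinstance(r, JobError)]
        raise JobError(f"{exc}; {len(failures)} job(s) failed: {failures}") from exc


def sum_successful_only(results):
    # A callback that refuses to handle failed jobs itself.
    if any(isinstance(r, JobError) for r in results):
        raise ValueError("cannot reduce: some jobs failed")
    return sum(results)


try:
    reduce_results([1, JobError("job 2 crashed"), 3], sum_successful_only)
except JobError as err:
    print(err)  # reports the reducer error plus the collected job errors
```

A callback that tolerates failures (e.g. by filtering out `JobError` entries) would simply return a result and no exception would propagate.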

@g-braeunlich g-braeunlich merged commit c4c667a into main Aug 13, 2025
10 checks passed
@g-braeunlich g-braeunlich deleted the slurm branch August 13, 2025 18:38
@ffelten ffelten mentioned this pull request Aug 25, 2025