From e36ede5b5c3fceb473be13c49102067c543669b8 Mon Sep 17 00:00:00 2001
From: "google-labs-jules[bot]" <161369871+google-labs-jules[bot]@users.noreply.github.com>
Date: Thu, 16 Apr 2026 07:19:39 +0000
Subject: [PATCH] docs: add section on debugging submission commands

Extended docs/trouble_shooting.md to include instructions on how to debug
failed job submissions, especially for HPC Job Executors. Explains error
propagation, manual debugging via the cache directory, and usage of the
error_log_file parameter.

Addresses https://github.com/pyiron/executorlib/issues/959

Co-authored-by: jan-janssen <3854739+jan-janssen@users.noreply.github.com>
---
 docs/trouble_shooting.md | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/docs/trouble_shooting.md b/docs/trouble_shooting.md
index e71487c9..04c7f490 100644
--- a/docs/trouble_shooting.md
+++ b/docs/trouble_shooting.md
@@ -83,6 +83,29 @@ of the [Python simple queuing system adatper (pysqa)](https://pysqa.readthedocs.
 All parameters in the resource dictionary `resource_dict` are optional. When `pydantic` is installed as optional dependency
 the `resource_dict` is validated using `pydantic`.
 
+## Debugging Submission Command
+When using the [HPC Job Executor](https://executorlib.readthedocs.io/en/latest/3-hpc-job.html) the Python functions
+are submitted to the queuing system as individual jobs. If the submission of a job fails, for example due to an
+incorrect `resource_dict` or a configuration issue in the queuing system, `executorlib` captures this error.
+
+In this case, the `future.result()` call will raise the underlying error (e.g., `subprocess.CalledProcessError` or
+`ValueError`). Note that in some environments like Jupyter notebooks, the submission error might occur in a background
+thread, and `future.result()` might hang if the error is not correctly propagated.
+
+To debug a failed submission, you can inspect the cache directory (default `executorlib_cache`). `executorlib` stores the
+input and output for each task as HDF5 files. If a submission fails, you can find the corresponding `_i.h5` file in the
+cache directory and manually try to submit the command to get more detailed error messages.
+
+If you specify an `error_log_file` in the `resource_dict`, `executorlib` will also append any exceptions raised during
+the execution to this file:
+
+```python
+with SlurmJobExecutor(max_workers=2) as p:
+    future = p.submit(calc, 2, j=4, resource_dict={"error_log_file": "error.log"})
+```
+
+The error log will include the function name, arguments, and the full stack trace of the exception.
+
 ## SSH Connection
 While the [Python simple queuing system adatper (pysqa)](https://pysqa.readthedocs.io) provides the option to connect to
 high performance computing (HPC) clusters via SSH, this functionality is not supported for executorlib. The background