
feat: API with scheduler support#121

Merged
ltalirz merged 49 commits into main from feat/api-slurm
Feb 18, 2026

Conversation

@ltalirz
Contributor

@ltalirz ltalirz commented Feb 8, 2026

No description provided.

@github-actions
Contributor

github-actions Bot commented Feb 8, 2026

@ltalirz
Contributor Author

ltalirz commented Feb 9, 2026

@jan-janssen For some reason, my API integration test fails the first time I run it: it keeps polling the /check endpoint until I kill the test (see also CI).

There are no errors in the API logs. When I restart the API locally and run the test again, it passes immediately (i.e. the simulation worked the first time and I can see that it produced the cache files; they are just not picked up the first time).
Note: the behavior is the same if I remove the high-level caching that I introduced (tasks.db), so the problem should be at a lower level.

I'm a bit lost at the moment - if you have any ideas/spot anything, let me know.

Relevant code should be in


workflow:

how it is submitted by the API:

def submit_to_executor(request_data: dict) -> dict:

async def check(task_id: str) -> TaskResponse:

The code is likely overly cautious/complex in some parts (these were attempts to get it working that didn't help and can be removed again later).
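To keep a test from polling /check indefinitely, the polling loop can be bounded by a timeout. The following is only a minimal sketch, not the code in this PR: `poll_until_done`, the `check` callable, and the status strings `"completed"` and `"failed"` are assumptions made for illustration.

```python
import time
from typing import Callable


def poll_until_done(
    check: Callable[[str], dict],
    task_id: str,
    timeout: float = 60.0,
    interval: float = 1.0,
) -> dict:
    """Call check(task_id) until it reports a terminal state or the timeout expires.

    `check` is a hypothetical stand-in for a request to the /check endpoint;
    the status names are assumed, not taken from the API.
    """
    deadline = time.monotonic() + timeout
    while True:
        response = check(task_id)
        if response.get("status") in ("completed", "failed"):
            return response
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"task {task_id} still {response.get('status')!r} after {timeout}s"
            )
        time.sleep(interval)
```

With a bound like this, a task that never leaves the running state shows up as a test failure with a clear message instead of a hanging CI job.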

Comment thread amorphouspy_api/src/amorphouspy_api/jobs.py Outdated
Comment thread amorphouspy_api/src/tests/test_meltquench.py Outdated
@jan-janssen
Contributor

@ltalirz You were a bit too fast. From my perspective there are currently two conflicting tests: the test_check_running_then_complete() test, which wants to wait for the future to complete, and the test_submit_meltquench_and_check() test, which requires the future to return the result directly.
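One way to serve both tests from the same code path is to inspect the future without blocking, so that a finished future yields its result directly and an unfinished one reports that it is still running. This is a sketch under assumptions: `future_to_status` and the response fields are hypothetical, and a plain `concurrent.futures.Future` stands in for whatever the executor returns.

```python
from concurrent.futures import Future


def future_to_status(future: Future) -> dict:
    """Map a Future to a response dict without blocking on it."""
    if future.cancelled():
        return {"status": "cancelled"}
    if not future.done():
        # Still pending or running: the caller should poll again later.
        return {"status": "running", "result": None}
    error = future.exception()  # returns immediately because the future is done
    if error is not None:
        return {"status": "failed", "error": str(error)}
    return {"status": "completed", "result": future.result()}
```

A test that waits for completion can call this repeatedly, while a test that expects an immediate result passes as soon as the future is already done on the first call.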

@ltalirz
Contributor Author

ltalirz commented Feb 9, 2026

There are a few things to be cleaned up. I will likely not have time to finish it today, but should be able to have another look tomorrow.

@jan-janssen One question: in 669a33c we introduced workers with separate subprocesses in order to work around the signal handling of pyiron.

We no longer need this here, correct?
My plan was to delete this now and just make the job submission async (it should anyhow be fast).
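As a rough illustration of what an async submission without separate worker subprocesses could look like, the sketch below hands the submit call to a worker thread so the event loop stays responsive. It is one possible interpretation, not the code in this PR: `submit_async` is a hypothetical helper and `ThreadPoolExecutor` is only a stand-in for the real executor.

```python
import asyncio
from concurrent.futures import Executor, Future, ThreadPoolExecutor


async def submit_async(executor: Executor, fn, *args) -> Future:
    """Submit fn to the executor from a worker thread and return its Future."""
    # executor.submit is expected to be fast; running it via to_thread only
    # guards the event loop in case it is not.
    return await asyncio.to_thread(executor.submit, fn, *args)


async def main() -> int:
    # ThreadPoolExecutor is an assumption standing in for the real backend.
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = await submit_async(pool, pow, 2, 10)
        # wrap_future lets the coroutine await the concurrent.futures.Future.
        return await asyncio.wrap_future(future)
```

If the submit call is reliably fast, calling `executor.submit` directly inside the `async def` endpoint would be simpler and equally valid.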

@jan-janssen
Contributor

jan-janssen commented Feb 9, 2026

> We no longer need this here, correct? My plan was to delete this now and just make the job submission async (it should anyhow be fast).

Yes, that works fine.

As suggested in #124, it seems to be an issue with orphan processes being killed by the testing framework, so the transition to flux as the backend should solve this issue.

@ltalirz
Contributor Author

ltalirz commented Feb 13, 2026

@jan-janssen I fixed the basic logic for submission and /check; I also switched to the TestClusterExecutor.

With that, I now see:

INFO:     127.0.0.1:42466 - "GET /check/83c4627d-411a-443a-ba70-78364bc9e323 HTTP/1.1" 200 OK
Exception in thread Thread-61 (execute_tasks_h5):
Traceback (most recent call last):
  File "/home/runner/miniconda3/envs/test/lib/python3.13/threading.py", line 1044, in _bootstrap_inner
    self.run()
    ~~~~~~~~^^
  File "/home/runner/miniconda3/envs/test/lib/python3.13/threading.py", line 995, in run
    self._target(*self._args, **self._kwargs)
    ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/miniconda3/envs/test/lib/python3.13/site-packages/executorlib/task_scheduler/file/shared.py", line 151, in execute_tasks_h5
    process_dict[k] for k in future_wait_key_lst
    ~~~~~~~~~~~~^^^
KeyError: 'get_structure_dictf69a7790ce10f1dde00df1103f771a09'

I can reproduce this locally as well.

Maybe some of the logic in /check is still not correct...
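The traceback shows a key from future_wait_key_lst that is absent from process_dict at lookup time; why it is absent is not clear from the log. The snippet below is only an illustration of that failure mode with made-up data, not executorlib's code and not the fix; `collect_dependencies` and `missing_keys` are hypothetical helpers.

```python
def collect_dependencies(process_dict: dict, wait_keys: list[str]) -> list:
    """Look up every dependency key; raises KeyError if any key is absent."""
    return [process_dict[k] for k in wait_keys]


def missing_keys(process_dict: dict, wait_keys: list[str]) -> list[str]:
    """Return the dependency keys that are not present in process_dict."""
    return [k for k in wait_keys if k not in process_dict]
```

Checking `missing_keys` before the lookup makes it possible to log which dependency was never registered (or was already removed) instead of failing inside the worker thread.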

@jan-janssen
Contributor

> @jan-janssen I fixed the basic logic for submission and /check; I also switched to the TestClusterExecutor

This is fixed in pyiron/executorlib#913

@ltalirz ltalirz changed the title feat: API with SLURM support feat: API with scheduler support Feb 16, 2026
@ltalirz ltalirz marked this pull request as ready for review February 18, 2026 15:45
2 participants