Add resource manager support for Flux #798
Signed-off-by: Chen Wang <wangvsa@gmail.com>
util/unifyfs/src/unifyfs-rm.c
pclose(pipe_fp);

// remove the trailing ']'
nodelist_str[strlen(nodelist_str)-1] = 0;
Is the trailing ']' still here when only allocating one node?
I think it's missing, actually:
>>: flux alloc -q pdebug -N 1
>>: flux resource list --states=free -no '{nodelist}'
tioga20
If you want to avoid parsing directly, another option is to rely on /bin/hostlist.
>>: /bin/hostlist -e 'tioga[18-19,21,32]'
tioga18,tioga19,tioga21,tioga32
>>: /bin/hostlist -e 'tioga20'
tioga20
Though I think that command might just be available on LLNL systems and is not distributed with flux.
I think I've seen that flux has some python packages that parse hostlists, too. I can dig that up if you're interested.
I'm also fine with parsing the hostlist directly.
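For reference, direct parsing of the condensed format could look roughly like the sketch below. It is not the code from this PR: expand_hostlist is a hypothetical helper name, and the sketch assumes a single bracket group, no zero-padded indices, and a caller-provided output buffer. It mirrors what /bin/hostlist -e prints in the examples above.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of UnifyFS or Flux): expand a condensed
 * hostlist such as "tioga[18-19,21,32]" into "tioga18,tioga19,tioga21,tioga32".
 * Returns the number of hosts written to out, or -1 on malformed input or
 * if out is too small. Assumes one bracket group and no zero padding. */
int expand_hostlist(const char* in, char* out, size_t outsz)
{
    if (outsz == 0) {
        return -1;
    }
    out[0] = '\0';

    const char* lb = strchr(in, '[');
    if (lb == NULL) {
        /* single node, e.g. "tioga20": nothing to expand */
        int n = snprintf(out, outsz, "%s", in);
        return (n < 0 || (size_t)n >= outsz) ? -1 : 1;
    }
    const char* rb = strchr(lb, ']');
    if (rb == NULL) {
        return -1;
    }

    int prefix_len = (int)(lb - in);
    size_t used = 0;
    int count = 0;
    const char* p = lb + 1;
    while (p < rb) {
        char* end;
        long lo = strtol(p, &end, 10); /* strtol skips leading spaces */
        if (end == p) {
            return -1; /* not a number */
        }
        long hi = lo;
        p = end;
        if (*p == '-') {
            /* a range such as 18-19 */
            hi = strtol(p + 1, &end, 10);
            if (end == p + 1 || hi < lo) {
                return -1;
            }
            p = end;
        }
        for (long i = lo; i <= hi; i++) {
            int n = snprintf(out + used, outsz - used, "%s%.*s%ld",
                             count ? "," : "", prefix_len, in, i);
            if (n < 0 || (size_t)n >= outsz - used) {
                return -1; /* output buffer too small */
            }
            used += (size_t)n;
            count++;
        }
        /* skip the separator between entries */
        while (p < rb && (*p == ',' || *p == ' ')) {
            p++;
        }
    }
    return count;
}
```

Because a name without brackets is passed through unchanged, the one-node case (tioga20) needs no special handling by the caller.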
It turns out the last character of the buffer returned by fgets is the newline (\n), not ]. So my code was actually removing the newline instead of ], and the rest of the code just happened to work even with ] left in the buffer. Anyway, I have fixed this, and now I remove the last two characters altogether.
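A more defensive variant, sketched here under assumptions rather than taken from the PR, strips the newline and the bracket separately and only when each is actually present. That would also cover the one-node case discussed above, where the output has no ]. The names read_line_trimmed and strip_trailing_bracket are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read one line from fp into buf and cut it at the
 * first '\r' or '\n', if any. Returns 0 on success, -1 if nothing was read. */
int read_line_trimmed(FILE* fp, char* buf, size_t bufsz)
{
    if (fgets(buf, (int)bufsz, fp) == NULL) {
        return -1;
    }
    /* strcspn returns the index of the first '\r' or '\n', or the string
     * length when neither occurs, so this is safe without a newline too */
    buf[strcspn(buf, "\r\n")] = '\0';
    return 0;
}

/* Hypothetical helper: drop a trailing ']' only when it is really there,
 * so a single-node name such as "tioga20" is left untouched. */
void strip_trailing_bracket(char* s)
{
    size_t len = strlen(s);
    if (len > 0 && s[len - 1] == ']') {
        s[len - 1] = '\0';
    }
}
```

In the PR's setting, fp would presumably be the pipe opened on the flux resource list command shown earlier in this thread.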
Signed-off-by: Chen Wang <wangvsa@gmail.com>
Thanks, @wangvsa
Description
Tioga uses Flux to schedule jobs and has limited support for srun. Tioga also lacks the scontrol command, which we use to retrieve the allocated node list.
This PR adds native support for Flux. It uses flux run to launch clients and servers, and flux resource to retrieve the number of nodes and the node list.
PS: flux resource returns a condensed node list, e.g., tioga[3-10, 12, 14]. The existing parse_hostfile() function can't handle this format, so I added some code to parse it manually.
How Has This Been Tested?
Tested on Tioga with 1, 2, and 4 nodes. Also tested unifyfs-ls, unifyfs-stage, and the stage-in/out features.
Types of changes
Checklist:
TODO
Unlike Slurm, where SLURM_JOBID can be used to detect a Slurm allocation, Flux only sets environment variables such as FLUX_JOB_ID for each flux run job (a Flux job is similar to a Slurm step). At the time unifyfs is executed (batch level), those variables have not been set yet.
A short flux script example:
As a result, I currently use FLUX_EXEC_PATH to determine whether the system has the Flux scheduler. I feel this is not optimal, but I couldn't figure out a better way.
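The detection described here can be sketched as a plain environment check. This is an illustration, not the PR's actual code: the rm_kind enum and detect_resource_manager are hypothetical names, and checking SLURM_JOBID before FLUX_EXEC_PATH is an arbitrary ordering chosen for the example.

```c
#include <stdlib.h>

/* Hypothetical enum for this sketch only */
typedef enum { RM_UNKNOWN, RM_SLURM, RM_FLUX } rm_kind;

/* Return 1 if the environment variable is set to a non-empty value */
static int env_is_set(const char* name)
{
    const char* v = getenv(name);
    return (v != NULL && v[0] != '\0');
}

/* Guess the resource manager from the environment.
 * The precedence (Slurm first, then Flux) is an assumption. */
rm_kind detect_resource_manager(void)
{
    if (env_is_set("SLURM_JOBID")) {
        return RM_SLURM; /* inside a Slurm allocation */
    }
    if (env_is_set("FLUX_EXEC_PATH")) {
        return RM_FLUX; /* Flux appears to be in use */
    }
    return RM_UNKNOWN;
}
```

As noted above, FLUX_EXEC_PATH only indicates that Flux is present in the environment, not that the process is running inside a particular allocation, so this check is weaker than the Slurm one.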