Custom parallel job queue #21

@andnp

Description

GNU parallel is no longer cutting it. A few issues:

  • Need to pass custom signals on to child processes. parallel always swallows these signals.
  • Need finer-grained control over SSH processes. Would love to avoid srun where possible, since interacting with the scheduler is extremely costly. Should be able to replace srun with custom SSH + environment build scripts.
  • Hard to keep things homogeneous across different compute backends. Likely losing access to Compute Canada soon, but don't want these scripts to go to waste! Need some homogeneous way to make use of a Beowulf cluster.
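
For the signal issue, a minimal sketch of what forwarding could look like if the queue spawns its own children from Python. Everything here is an assumption for illustration: the name `run_with_forwarding` is hypothetical, the set of forwarded signals is a guess, and it is POSIX-only. It covers a single local child; relaying across SSH would need more work.

```python
import signal
import subprocess
import sys

# Assumed set of signals worth relaying; adjust to whatever the jobs listen for.
FORWARDED = (signal.SIGUSR1, signal.SIGTERM, signal.SIGINT)


def run_with_forwarding(cmd, signals=FORWARDED):
    """Start cmd, relay the given signals to it, and return its exit code."""
    # A new session keeps terminal-generated signals (e.g. Ctrl-C) from
    # reaching the child directly, so it only sees what we relay.
    child = subprocess.Popen(cmd, start_new_session=True)

    def relay(signum, _frame):
        # Pass the signal through unchanged while the child is still running.
        if child.poll() is None:
            child.send_signal(signum)

    # Install the relay handler and remember what was there before.
    previous = {s: signal.signal(s, relay) for s in signals}
    try:
        return child.wait()
    finally:
        # Restore the original handlers once the child is gone.
        for s, handler in previous.items():
            signal.signal(s, handler)


if __name__ == "__main__":
    sys.exit(run_with_forwarding(sys.argv[1:]))
```

Handlers are installed right after the child is spawned, so a signal arriving in that brief window would not be relayed; a real implementation would probably want to close that gap.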

Notes:

  • Need a way to get num_cpus and hostnames from Slurm when allocated across nodes.
  • Need a way to use MIG when available.
  • Need a way to batch jobs within a process (e.g., so one sub-process can handle a whole batch).
  • Nice-to-have: mark a parameter as "batchable". For instance, we can jax.vmap over stepsizes, but not over neural net sizes. Would be nice to handle that internally, so that we can take advantage of vmap.
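
A rough sketch of two of the notes above, under stated assumptions. The function names (`parse_cpus_per_node`, `allocated_hosts`, `group_batchable`) are hypothetical. The first two rely on the `SLURM_JOB_CPUS_PER_NODE` and `SLURM_JOB_NODELIST` environment variables and on `scontrol show hostnames`, which Slurm provides inside an allocation. The third assumes each job is a flat dict of hashable parameter values.

```python
import os
import subprocess
from collections import defaultdict


def parse_cpus_per_node(spec):
    """Expand a SLURM_JOB_CPUS_PER_NODE value, e.g. '16(x2),8' -> [16, 16, 8]."""
    counts = []
    for part in spec.split(","):
        if "(x" in part:
            # 'CPUS(xN)' means N consecutive nodes with CPUS cores each.
            cpus, repeat = part.rstrip(")").split("(x")
            counts.extend([int(cpus)] * int(repeat))
        else:
            counts.append(int(part))
    return counts


def allocated_hosts():
    """Return {hostname: cpu_count} for the current allocation (needs Slurm)."""
    names = subprocess.run(
        ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]],
        check=True, capture_output=True, text=True,
    ).stdout.split()
    cpus = parse_cpus_per_node(os.environ["SLURM_JOB_CPUS_PER_NODE"])
    return dict(zip(names, cpus))


def group_batchable(configs, batchable):
    """Group configs that differ only in the parameters named in `batchable`."""
    groups = defaultdict(list)
    for cfg in configs:
        # Configs sharing every non-batchable value land in the same group.
        key = tuple(sorted((k, v) for k, v in cfg.items() if k not in batchable))
        groups[key].append(cfg)
    return list(groups.values())


configs = [
    {"stepsize": 0.1, "hidden": 32},
    {"stepsize": 0.01, "hidden": 32},
    {"stepsize": 0.1, "hidden": 64},
]
batches = group_batchable(configs, batchable={"stepsize"})
```

Here `batches` holds two groups: the two `hidden=32` configs together, and the `hidden=64` config on its own. Each group could then go to a single sub-process, which might stack the batchable values and vmap over them. MIG handling is not covered.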
