Proposal
This proposal aims to enhance the current Ray-based training scripts in the isaaclab project, with a focus on improving usability and flexibility for distributed training workflows. The goal is to address several limitations related to custom module support, file management, resource specification, and configuration handling.
Motivation
The existing Ray scripts located in the scripts/reinforcement_learning/ray directory exhibit several usability issues that hinder effective deployment and integration in practical scenarios:
- Lack of support for py_modules prevents users from distributing custom Python packages or third-party libraries to Ray workers, limiting extensibility and customization.
- Model files are written to local or relative directories, such as logs/, which makes it difficult to access trained models, especially in remote or cloud environments where local filesystems are not directly accessible.
- Resource allocation is not intuitive: users often get confused or make mistakes when specifying CPU, GPU, or memory requirements.
- Relying solely on command-line arguments results in long, complex, and error-prone commands that are hard to read and maintain.
I'm often frustrated when I have to manually patch the Ray scripts just to load a custom library, or to retrieve trained models without SSHing into a remote machine.
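For reference, Ray's runtime environment accepts a py_modules field alongside working_dir. The sketch below shows one way the scripts might assemble that dictionary from user input. The helper name build_runtime_env and its signature are hypothetical and are not part of the existing scripts.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any


def build_runtime_env(py_modules: list[str], working_dir: str | None = None) -> dict[str, Any]:
    """Assemble a Ray runtime_env dict that ships extra packages to workers.

    Hypothetical helper: the name and signature are illustrative only.
    """

    def normalize(entry: str) -> str:
        # Remote URIs (for example s3://... or https://...) are passed through as-is.
        if "://" in entry:
            return entry
        # Local paths are made absolute so they do not depend on the caller's cwd.
        return str(Path(entry).expanduser().resolve())

    runtime_env: dict[str, Any] = {}
    if py_modules:
        runtime_env["py_modules"] = [normalize(p) for p in py_modules]
    if working_dir is not None:
        runtime_env["working_dir"] = normalize(working_dir)
    return runtime_env


# Example: one local package directory (assumed path) and nothing else.
env = build_runtime_env(["./my_custom_pkg"])
```

The resulting dictionary could then be passed as ray.init(runtime_env=env), or as the runtime_env argument when submitting a job through Ray Jobs.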
Alternatives
Some potential workarounds include:
- Modifying the Ray scripts manually to add support for py_modules.
- Using external tools (e.g., scp or cloud CLI utilities) to move files after training.
- Writing custom Ray submission scripts to better manage job configuration, resource allocation, and post-training file handling.
While these alternatives can provide temporary relief, they increase complexity and maintenance burden, and do not offer a sustainable or user-friendly solution in the long run.
Additional context
Currently, all training outputs (e.g., model checkpoints and logs) are saved to local working directories used by Ray. This setup works reasonably well in local environments but becomes problematic in remote or cluster-based executions. To improve accessibility and streamline deployment, there should be a built-in way to move or copy these files to mounted storage locations (e.g., network drives, cloud buckets like OSS) after training completes.
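As a rough illustration of the post-training copy step described above, the following sketch copies a run directory to a mounted location using only the standard library. It assumes the destination is already mounted as a regular filesystem path (for example a network drive or a FUSE-mounted bucket). The function name export_run_outputs and both arguments are hypothetical.

```python
import shutil
from pathlib import Path


def export_run_outputs(run_dir: str, export_root: str) -> Path:
    """Copy a finished run's output directory under a mounted storage path.

    Hypothetical helper: assumes export_root is reachable as a normal directory.
    """
    src = Path(run_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"run directory not found: {src}")
    dst = Path(export_root) / src.name
    # dirs_exist_ok lets a repeated export refresh an earlier copy (Python 3.8+).
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst
```

A call such as export_run_outputs("logs/my_run", "/mnt/shared/isaaclab") would leave the checkpoints under /mnt/shared/isaaclab/my_run; both paths here are placeholders.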
Additionally, while the current scripts do allow some level of resource specification, the interface is not intuitive.
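One possible shape for a clearer interface is a small typed resource spec that can be loaded from a config file and overridden from the command line. The sketch below is only an assumption about how that could look: the JobResources class, its field names, the --config flag, and the JSON format are all hypothetical. The keys returned by to_ray_options mirror Ray's num_cpus, num_gpus, and memory (bytes) options.

```python
from __future__ import annotations

import argparse
import json
from dataclasses import dataclass, fields


@dataclass
class JobResources:
    """Per-job resource request; field names are illustrative."""
    num_cpus: float = 1.0
    num_gpus: float = 0.0
    memory_gb: float = 4.0

    def to_ray_options(self) -> dict:
        # Ray expects memory in bytes, so convert from GiB here.
        return {
            "num_cpus": self.num_cpus,
            "num_gpus": self.num_gpus,
            "memory": int(self.memory_gb * 1024**3),
        }


def load_resources(argv: list[str]) -> JobResources:
    """Merge an optional JSON config file with explicit CLI overrides."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", help="path to a JSON file with resource values")
    for f in fields(JobResources):
        parser.add_argument(f"--{f.name.replace('_', '-')}", type=float, default=None)
    args = parser.parse_args(argv)

    values: dict = {}
    if args.config:
        with open(args.config) as fh:
            values.update(json.load(fh))
    # Explicit CLI flags take precedence over the config file.
    for f in fields(JobResources):
        cli_value = getattr(args, f.name)
        if cli_value is not None:
            values[f.name] = cli_value

    unknown = set(values) - {f.name for f in fields(JobResources)}
    if unknown:
        raise ValueError(f"unknown resource keys: {sorted(unknown)}")
    resources = JobResources(**values)
    if min(resources.num_cpus, resources.num_gpus, resources.memory_gb) < 0:
        raise ValueError("resource values must be non-negative")
    return resources
```

With this layout, a long command line shrinks to something like --config job.json --num-gpus 2, and a mistyped key fails early instead of being silently ignored.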
Checklist
Acceptance Criteria
- Add support for py_modules to allow distribution of custom Python packages