Describe the bug
When training RL agents in IsaacLab, vision-based environments produce non-deterministic results across runs, even with a fixed random seed. In contrast, state-based environments are fully reproducible under the same conditions.
This issue was confirmed by running five separate tests with identical settings on each of the following three official IsaacLab environments:
- Isaac-Cartpole-v0 (state-based): Reproducible
- Isaac-Cartpole-RGB-v0 (vision-based): Not reproducible
- Isaac-Cartpole-RGB-ResNet18-v0 (vision-based): Not reproducible
The non-determinism appears to be introduced by the vision processing pipeline, as it is the key difference between the reproducible and non-reproducible environments. However, I have not investigated this in depth, so further analysis is needed to identify the root cause.
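One possible way to narrow down the root cause is to check whether the camera observations themselves differ between runs. The sketch below is only an illustration under stated assumptions: it assumes each frame has already been dumped to a bytes-like object (for a PyTorch tensor, something like `tensor.cpu().numpy().tobytes()`), and the helper names `digest_sequence` and `first_mismatch` are hypothetical, not part of IsaacLab.

```python
import hashlib


def digest_sequence(frames):
    """Return one SHA-256 hex digest per frame; each frame is a bytes-like object."""
    return [hashlib.sha256(frame).hexdigest() for frame in frames]


def first_mismatch(digests_a, digests_b):
    """Return the first step at which two digest sequences differ, or None.

    Only the overlapping prefix of the two sequences is compared.
    """
    for step, (a, b) in enumerate(zip(digests_a, digests_b)):
        if a != b:
            return step
    return None


# Toy data standing in for frames recorded in two runs with the same seed.
run_a = [b"\x00\x01", b"\x02\x03"]
run_b = [b"\x00\x01", b"\x02\x04"]
mismatch_step = first_mismatch(digest_sequence(run_a), digest_sequence(run_b))
```

A mismatch on the very first frame after reset, before any policy action is applied, would point toward the rendering side. Later mismatches are harder to interpret, since frames also depend on the actions taken so far.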
The provided WandB logs show the reward curves from several training executions. As illustrated, the training curves for the vision-based environments show significant divergence. This non-reproducibility occurs even though all experimental settings, including the random seed, were kept identical for each run.
(state-{i}: Isaac-Cartpole-v0, rgb-{i}: Isaac-Cartpole-RGB-v0, resnet-{i}: Isaac-Cartpole-RGB-ResNet18-v0)
Steps to reproduce
- Run the state-based environment five times with a fixed seed:
python scripts/reinforcement_learning/rl_games/train.py --task Isaac-Cartpole-v0 --headless --seed 42 --max_iteration 100
- Run the vision-based environment five times with the same seed:
python scripts/reinforcement_learning/rl_games/train.py --task Isaac-Cartpole-RGB-v0 --enable_cameras --headless --seed 42 --max_iteration 100
- Run the ResNet18 feature-based vision environment five times with the same seed:
python scripts/reinforcement_learning/rl_games/train.py --task Isaac-Cartpole-RGB-ResNet18-v0 --enable_cameras --headless --seed 42 --max_iteration 100
All hyperparameters and environment settings not specified in the CLI arguments default to the values defined in the code.
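The repeated runs above could be scripted roughly as follows. This is a minimal sketch, not part of IsaacLab: `build_command` and `run_all` are hypothetical helper names, and the script is assumed to be launched from the IsaacLab repository root with the same `python` interpreter used in the commands above. The flags are copied verbatim from those commands.

```python
import subprocess

TRAIN_SCRIPT = "scripts/reinforcement_learning/rl_games/train.py"

# (task name, whether the task needs --enable_cameras), as listed in the report
TASKS = [
    ("Isaac-Cartpole-v0", False),
    ("Isaac-Cartpole-RGB-v0", True),
    ("Isaac-Cartpole-RGB-ResNet18-v0", True),
]


def build_command(task, enable_cameras, seed=42, max_iteration=100):
    """Return the argv list for one training run, mirroring the commands above."""
    cmd = ["python", TRAIN_SCRIPT, "--task", task]
    if enable_cameras:
        cmd.append("--enable_cameras")
    cmd += ["--headless", "--seed", str(seed), "--max_iteration", str(max_iteration)]
    return cmd


def run_all(repeats=5):
    """Run every task `repeats` times in sequence; raises on the first failed run."""
    for task, enable_cameras in TASKS:
        for _ in range(repeats):
            subprocess.run(build_command(task, enable_cameras), check=True)
```

Calling `run_all()` would launch the fifteen runs one after another. Nothing is executed at import time, so the command lists can be inspected first.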
System Info
- Commit: f20d74c
- Isaac Sim Version: 4.5
- OS: Ubuntu 22.04
- GPU: RTX A6000
- CUDA: 12.9
- GPU Driver: 575.64.03
Additional context
A note on the logs: For some runs, WandB logging halted before the experiment's completion, despite all runs being executed for an identical number of steps. This does not impact the overall analysis. For the reproducible environment (Isaac-Cartpole-v0), training curves were perfectly identical until the earliest halt. For the non-reproducible environments, the curves had already diverged long before any logging stopped.
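The comparison described in this note can be sketched as below. It assumes the reward values of each run have already been exported from WandB as plain lists of floats (the export step is not shown), and `first_divergence` is a hypothetical helper name. Curves are compared only up to the length of the shortest one, which corresponds to the earliest logging halt.

```python
def first_divergence(curves, tol=0.0):
    """Return the first step at which any curve differs from the first one.

    Only the prefix shared by all curves (up to the shortest curve) is
    compared. Returns None if the curves agree on that whole prefix.
    """
    if len(curves) < 2:
        raise ValueError("need at least two curves to compare")
    shortest = min(len(curve) for curve in curves)
    reference = curves[0]
    for step in range(shortest):
        for curve in curves[1:]:
            if abs(curve[step] - reference[step]) > tol:
                return step
    return None


# Toy data: three runs that agree up to the earliest halt, and two that do not.
identical = [[1.0, 2.0, 3.0], [1.0, 2.0], [1.0, 2.0, 3.0, 4.0]]
diverging = [[1.0, 2.0, 3.0], [1.0, 2.5, 3.0]]
```

With `tol=0.0` the check requires bit-identical logged values, which matches the "perfectly identical" claim for the state-based runs. A small positive `tol` could be used to ignore differences below a chosen threshold.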