```bash
# create the Conda environment
$ conda env create -f block_rl_env.yml
$ conda activate block_rl

# Stable-Baselines3 extras: useful for the training progress bar
$ pip install "stable-baselines3[extra]"
```
Low-level geometry & physics backend. Responsible for:
- keeping the list of blocks (`self.block_list`)
- collision checks & static stability (`is_stable_rbe`)
- dense reward heat-map generation
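The backend's responsibilities can be sketched as below. This is an illustrative mock, not the project's actual API: class and method names are invented, and the naive "supported from below" heuristic stands in for the real `is_stable_rbe` rigid-body-equilibrium check.

```python
import numpy as np

class BlockWorldSketch:
    """Illustrative backend: occupancy grid, block list, collision & stability."""

    def __init__(self, width=64, height=64):
        self.block_list = []                             # placed blocks (row, col, h, w)
        self.grid = np.zeros((height, width), dtype=bool)  # row 0 = ground level

    def collides(self, row, col, h, w):
        """A candidate block is illegal if its footprint overlaps occupied cells."""
        return bool(self.grid[row:row + h, col:col + w].any())

    def supported(self, row, col, w):
        """Naive stand-in for is_stable_rbe: rest on the ground or on a block below."""
        if row == 0:
            return True
        return bool(self.grid[row - 1, col:col + w].any())

    def place(self, row, col, h, w):
        """Place a block if it is collision-free and statically supported."""
        if self.collides(row, col, h, w) or not self.supported(row, col, w):
            return False
        self.grid[row:row + h, col:col + w] = True
        self.block_list.append((row, col, h, w))
        return True
```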
Gymnasium wrapper that the RL agent actually interacts with. It:
- exposes a `Discrete(300)` action space with automatic action-masking
- concatenates state image + reward image into a single flat observation
- offers live Matplotlib rendering (`--render`)
Train an agent with Stable-Baselines3. Key CLI flags (run `-h` for all):

| Flag | Default | Description |
|---|---|---|
| `--task` | `bridge` | Task to learn: `bridge`, `tower`, `double_bridge` |
| `--algo` | `maskppo` | RL algorithm: `maskppo` (masked PPO) or plain `ppo` |
| `--timesteps` | `200_000` | Total training steps (across all envs) |
| `--save-freq` | `10_000` | Checkpoint frequency (steps) for saving models & eval |
| `--logdir` | `runs` | Output directory for checkpoints and TensorBoard logs |
| `--device` | `cpu` | Compute device: `cpu`, `cuda`, or `auto` |
| `--render` | `False` | Render the environment (only works when `--n-envs 1`) |
| `--debug` | `False` | Enable DEBUG-level logging |
| `--progress-bar` | `False` | Show SB3 progress bar during training |
| `--config` | `None` | Path to YAML with extra hyper-parameters (overrides CLI) |
| `-m, --resume-model` | `None` | Path to a `.zip` model to continue training from |
| `-n, --n-envs` | `1` | Number of parallel environments (≥2 uses `SubprocVecEnv`) |
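A `--config` YAML might look like the fragment below. The key names are hypothetical (standard SB3 PPO hyper-parameters); consult `configs/maskppo.yaml` for the actual schema used by this project.

```yaml
# hypothetical hyper-parameter overrides (see configs/maskppo.yaml for the real keys)
learning_rate: 3.0e-4
n_steps: 2048
batch_size: 64
gamma: 0.99
```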
**Train from scratch**

```bash
python train.py --task bridge --algo maskppo --timesteps 100000 --progress-bar --config configs/maskppo.yaml
```

**Resume training**

```bash
python train.py --task bridge --algo maskppo --timesteps 100000 --progress-bar --config configs/maskppo.yaml -m runs/bridge_maskppo_0506204539/best_model/best_model.zip
```

**Monitor training**

To monitor training, run the following command from the main directory:

```bash
tensorboard --logdir runs
```

**Roll out a trained policy** for qualitative inspection:

```bash
python run_policy.py --model runs/bridge_maskppo_0506212053/best_model/best_model.zip --task bridge --algo maskppo --render --debug
```

Here is an example of a rollout:
- **Observation:** 8192-D vector (2 × 64 × 64 images flattened).
- **Action space:** 300 discrete indices.
- **Reward:** sum of overlaps between the newly placed block and Gaussian blobs centred on targets.
- **Masking:** `sb3_contrib.ActionMasker` removes illegal moves before the softmax → faster learning & fewer crashes.
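The Gaussian-blob reward can be sketched as follows (function names and the `sigma` parameter are illustrative; the project's actual heat-map generation may differ):

```python
import numpy as np

def reward_heatmap(targets, size=64, sigma=3.0):
    """Dense reward map: sum of Gaussian blobs centred on target cells."""
    ys, xs = np.mgrid[0:size, 0:size]
    heat = np.zeros((size, size), dtype=np.float32)
    for ty, tx in targets:
        heat += np.exp(-((ys - ty) ** 2 + (xs - tx) ** 2) / (2 * sigma ** 2))
    return heat

def placement_reward(heat, row, col, h, w):
    """Reward for a placed block = overlap of its footprint with the heat-map."""
    return float(heat[row:row + h, col:col + w].sum())
```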
Other SB3 algorithms (SAC, A2C…) will work, but the policy network must be adapted to flat image inputs.
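One way to adapt the policy network is a features extractor that un-flattens the 8192-D observation back into two 64 × 64 channels before a small CNN. The sketch below is a plain PyTorch module with invented layer sizes; to plug it into SB3 you would wrap it in a `BaseFeaturesExtractor` subclass.

```python
import torch
import torch.nn as nn

class FlatImageEncoder(nn.Module):
    """Hypothetical encoder: restore image layout from the flat observation."""

    def __init__(self, features_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=5, stride=2, padding=2),   # 2x64x64 -> 16x32x32
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2, padding=2),  # -> 32x16x16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, features_dim),
        )

    def forward(self, obs):               # obs: (batch, 8192)
        img = obs.view(-1, 2, 64, 64)     # state image + reward image as channels
        return self.cnn(img)
```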