Official codebase for the ICLR 2026 poster paper:
**Hierarchical Value-Decomposed Offline Reinforcement Learning for Whole-Body Control**
Install dependencies using uv:
```bash
GIT_LFS_SKIP_SMUDGE=1 uv sync
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e .
```

Set directories for storing datasets and checkpoints:

```bash
export HF_HOME=???
export HF_LEROBOT_HOME=???
```

These paths specify where HuggingFace assets and converted LeRobot datasets will be stored locally.
You need to prepare two datasets:
- Optimal dataset (high-quality demonstrations)
- Mixed-quality dataset (suboptimal + optimal trajectories)
For each dataset, run:
```bash
CONFIG_NAME=???    # configuration name defined in config.py
NUM_EPISODES=???   # number of episodes per folder
REPO_ID=???        # output dataset name

python examples/r1/convert_r1_data_to_lerobot_rl.py \
    --raw_dirs \
        "path/to/source/data/folder1" \
        "path/to/source/data/folder2" \
        ... \
    --tasks \
        "prompt1" \
        "prompt2" \
        ... \
    --num_episodes ${NUM_EPISODES} \
    --repo_id ${REPO_ID} \
    --no_push_to_hub \
    --success_only
```

After completion, the converted dataset will be created at:

```
${HF_LEROBOT_HOME}/${REPO_ID}
```
Run the following command to compute normalization statistics:
```bash
python scripts/compute_rl_norm_stats.py --config-name ${CONFIG_NAME}
```

This creates:

```
../assets/CONFIG_NAME/REPO_ID/norm_stats.json
```
We use shared normalization statistics for both datasets:
- Compute stats using the optimal dataset
- Copy the generated `norm_stats.json` to the mixed-quality dataset asset directory
The final asset structure used during training should be:
```
../assets/VALUE_CONFIG_NAME/OPTIMAL_REPO_ID      # value model training
../assets/POLICY_CONFIG_NAME/OPTIMAL_REPO_ID     # policy training (optimal)
../assets/POLICY_CONFIG_NAME/SUBOPTIMAL_REPO_ID  # policy training (mixed-quality)
```
All listed assets must share identical normalization statistics.
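Concretely, the copy step above can look like the following sketch. The directory names here are demo placeholders; substitute your actual `CONFIG_NAME`/`REPO_ID` asset directories:

```shell
# Sketch: mirror the optimal dataset's norm stats into the mixed-quality
# asset directory so both datasets share identical statistics.
# (Paths below are placeholders, not this repo's real layout.)
ASSETS=./assets_demo
SRC=$ASSETS/value_config/optimal_repo
DST=$ASSETS/policy_config/mixed_repo
mkdir -p "$SRC" "$DST"
echo '{"mean": 0.0, "std": 1.0}' > "$SRC/norm_stats.json"   # stand-in for a computed file
cp "$SRC/norm_stats.json" "$DST/norm_stats.json"
cmp -s "$SRC/norm_stats.json" "$DST/norm_stats.json" && echo "norm stats identical"
```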
Training consists of two stages:
- Value model training (optimal dataset only)
- Policy model training (value-guided offline RL)
The value model is trained using the optimal dataset.
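For intuition, IQL-style value learning fits V toward an upper expectile of the Q targets with an asymmetric squared loss, so the value function tracks the better actions in the data without ever querying out-of-distribution actions. A minimal sketch (the expectile `tau` here is illustrative, not this repo's setting):

```python
import numpy as np

def expectile_loss(q, v, tau=0.7):
    """Asymmetric squared error: underestimating Q costs more than
    overestimating it, pushing V toward the tau-expectile of Q."""
    diff = q - v
    weight = np.where(diff > 0, tau, 1 - tau)
    return float(np.mean(weight * diff ** 2))

# With tau=0.7, a +1 residual costs 0.7 while a -1 residual costs only 0.3.
loss = expectile_loss(np.array([1.0, 0.0]), np.array([0.0, 1.0]))  # (0.7 + 0.3) / 2 = 0.5
```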
Add a configuration entry to `_CONFIGS` in `config.py`.
Example:
```python
ValueTrainConfig(
    name="value_config",
    model=pi0_value.Pi0ValueConfig(
        action_horizon=20,
        hierarchical_actions=[
            [7, 8, 9],
            [17, 18, 19, 20],
            [0, 1, 2, 3, 4, 5, 6, 10, 11, 12, 13, 14, 15, 16],
        ],
        method="exponential",
        network_type="hierarchical_q",  # set 'q' to disable hierarchy
    ),
    data=LeRobotR1DataConfig(
        repo_id="your_converted_folder_name",
        action_sequence_keys=[
            "action",
            "next_action",
            "reward",
            "terminal",
        ],
        base_config=DataConfig(prompt_from_task=True),
        repack_transforms=_transforms.Group(
            inputs=[
                _transforms.RepackTransform(
                    {
                        "images": {
                            "cam_head": "observation.images.head",
                            "cam_left_wrist": "observation.images.left_wrist",
                            "cam_right_wrist": "observation.images.right_wrist",
                        },
                        "state": "observation.state",
                        "actions": "action",
                        "next_state": "observation.next_state",
                        "next_actions": "next_action",
                        "rewards": "reward",
                        "terminal": "terminal",
                        "prompt": "prompt",
                    }
                )
            ]
        ),
    ),
    num_train_iql_steps=30_001,
    batch_size=1,
    fsdp_devices=1,
    num_workers=1,
)
```

Then launch value model training:

```bash
CONFIG_NAME=iql_r1
METHOD=???
NETWORK_TYPE=???  # 'q' or 'hierarchical_q'
EXP_NAME=???

XLA_PYTHON_CLIENT_MEM_FRACTION=0.8 python scripts/train_iql.py \
    ${CONFIG_NAME} \
    --exp_name ${EXP_NAME} \
    --model.method ${METHOD} \
    --model.network_type ${NETWORK_TYPE} \
    --overwrite
```

After training the value model, train the policy using value-guided AWR.
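For intuition, advantage-weighted regression (AWR) turns policy training into behavior cloning reweighted by exponentiated advantage, so transitions the value model scores highly dominate the loss. A schematic sketch (`beta` and the weight clip are illustrative values, not this repo's hyperparameters):

```python
import numpy as np

def awr_weights(q_values, v_values, beta=1.0, max_weight=20.0):
    """Exponentiated-advantage weights for advantage-weighted regression."""
    advantages = q_values - v_values        # A(s, a) = Q(s, a) - V(s)
    weights = np.exp(advantages / beta)     # favor actions the value model prefers
    return np.minimum(weights, max_weight)  # clip to keep the regression stable

# An action with zero advantage keeps weight 1; negative advantages are down-weighted.
w = awr_weights(np.array([1.0, 0.0, -1.0]), np.zeros(3))
```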
Add a configuration entry to `_CONFIGS` in `config.py`.
Example:
```python
AWRTrainConfig(
    name="policy_config",
    model=pi0.Pi0Config(
        action_horizon=20,
        paligemma_variant="gemma_2b_lora",
        action_expert_variant="gemma_300m_lora",
    ),
    optimal_data=LeRobotR1DataConfig(
        repo_id="optimal_repo_name",
        base_config=DataConfig(prompt_from_task=True),
        repack_transforms=_transforms.Group(
            inputs=[
                _transforms.RepackTransform(
                    {
                        "images": {
                            "cam_head": "observation.images.head",
                            "cam_left_wrist": "observation.images.left_wrist",
                            "cam_right_wrist": "observation.images.right_wrist",
                        },
                        "state": "observation.state",
                        "actions": "action",
                        "prompt": "prompt",
                    }
                )
            ]
        ),
    ),
    suboptimal_data=LeRobotR1DataConfig(
        repo_id="mixed_repo_name",
        base_config=DataConfig(prompt_from_task=True),
        repack_transforms=_transforms.Group(
            inputs=[
                _transforms.RepackTransform(
                    {
                        "images": {
                            "cam_head": "observation.images.head",
                            "cam_left_wrist": "observation.images.left_wrist",
                            "cam_right_wrist": "observation.images.right_wrist",
                        },
                        "state": "observation.state",
                        "actions": "action",
                        "prompt": "prompt",
                    }
                )
            ]
        ),
    ),
    weight_loader=weight_loaders.CheckpointWeightLoader("s3://openpi-assets/checkpoints/pi0_base/params"),
    freeze_filter=pi0.Pi0Config(
        paligemma_variant="gemma_2b_lora",
        action_expert_variant="gemma_300m_lora",
        action_horizon=20,
    ).get_freeze_filter(),
    num_train_steps=30_000,
    batch_size=16,
    fsdp_devices=2,
    num_workers=8,
)
```

Then launch policy training:

```bash
VALUE_CONFIG_NAME=???   # value config used in the previous stage
VALUE_PATH=???          # path to the trained value model checkpoint
EXP_PATH=???
POLICY_CONFIG_NAME=???

XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 python scripts/train_awr.py \
    --policy-config ${POLICY_CONFIG_NAME} \
    --value-config ${VALUE_CONFIG_NAME} \
    --policy-override exp_name=${EXP_PATH} \
    --policy-override overwrite=True \
    --value-override pretrained_path=${VALUE_PATH}
```

- Value learning is performed only on optimal demonstrations.
- Policy learning leverages both optimal and mixed-quality datasets.
- Hierarchical value decomposition can be disabled by setting `network_type = "q"`.
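When defining your own `hierarchical_actions` groups, a quick sanity check is that the groups partition the action dimensions with no overlaps and no gaps. A sketch using the grouping from the value-config example above (21 action dimensions):

```python
# Groups copied from the example value config; together the three lists
# must cover every action dimension exactly once.
hierarchical_actions = [
    [7, 8, 9],
    [17, 18, 19, 20],
    [0, 1, 2, 3, 4, 5, 6, 10, 11, 12, 13, 14, 15, 16],
]
flat = sorted(d for group in hierarchical_actions for d in group)
assert flat == list(range(21)), "groups must partition the action dimensions"
```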