Official codebase for the ICLR 2026 poster paper:
**Hierarchical Value-Decomposed Offline Reinforcement Learning for Whole-Body Control**
Install dependencies using uv:
```bash
GIT_LFS_SKIP_SMUDGE=1 uv sync
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e .
```

Set directories for storing datasets and checkpoints:

```bash
export HF_HOME=???
export HF_LEROBOT_HOME=???
```

These paths specify where HuggingFace assets and converted LeRobot datasets will be stored locally.
You need to prepare two datasets:
- Optimal dataset (high-quality demonstrations)
- Mixed-quality dataset (suboptimal + optimal trajectories)
For each dataset, run:
```bash
CONFIG_NAME=???    # configuration name defined in config.py
NUM_EPISODES=???   # number of episodes per folder
REPO_ID=???        # output dataset name

python examples/r1/convert_r1_data_to_lerobot_rl.py \
    --raw_dirs \
        "path/to/source/data/folder1" \
        "path/to/source/data/folder2" \
        ... \
    --tasks \
        "prompt1" \
        "prompt2" \
        ... \
    --num_episodes ${NUM_EPISODES} \
    --repo_id ${REPO_ID} \
    --no_push_to_hub \
    --success_only
```

After completion, the converted dataset will be created at:

```
${HF_LEROBOT_HOME}/${REPO_ID}
```
Run the following command to compute normalization statistics:
```bash
python scripts/compute_rl_norm_stats.py --config-name ${CONFIG_NAME}
```

This creates:

```
../assets/CONFIG_NAME/REPO_ID/norm_stats.json
```
We use shared normalization statistics for both datasets:
- Compute stats using the optimal dataset
- Copy the generated `norm_stats.json` to the mixed-quality dataset asset directory
The final asset structure used during training should be:
```
../assets/VALUE_CONFIG_NAME/OPTIMAL_REPO_ID      # value model training
../assets/POLICY_CONFIG_NAME/OPTIMAL_REPO_ID     # policy training (optimal)
../assets/POLICY_CONFIG_NAME/SUBOPTIMAL_REPO_ID  # policy training (mixed-quality)
```
All listed assets must share identical normalization statistics.
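Concretely, the copy step above can look like the following sketch. The directory names here are demo placeholders; substitute your actual `CONFIG_NAME`/`REPO_ID` asset directories:

```shell
# Sketch: mirror the optimal dataset's norm stats into the mixed-quality
# asset directory so both datasets share identical statistics.
# (Paths below are placeholders, not this repo's real layout.)
ASSETS=./assets_demo
SRC=$ASSETS/value_config/optimal_repo
DST=$ASSETS/policy_config/mixed_repo
mkdir -p "$SRC" "$DST"
echo '{"mean": 0.0, "std": 1.0}' > "$SRC/norm_stats.json"   # stand-in for a computed file
cp "$SRC/norm_stats.json" "$DST/norm_stats.json"
cmp -s "$SRC/norm_stats.json" "$DST/norm_stats.json" && echo "norm stats identical"
```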
Training consists of two stages:
- Value model training (optimal dataset only)
- Policy model training (value-guided offline RL)
The value model is trained using the optimal dataset.
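For intuition, IQL-style value learning fits V toward an upper expectile of the Q targets with an asymmetric squared loss, so the value function tracks the better actions in the data without ever querying out-of-distribution actions. A minimal sketch (the expectile `tau` here is illustrative, not this repo's setting):

```python
import numpy as np

def expectile_loss(q, v, tau=0.7):
    """Asymmetric squared error: underestimating Q costs more than
    overestimating it, pushing V toward the tau-expectile of Q."""
    diff = q - v
    weight = np.where(diff > 0, tau, 1 - tau)
    return float(np.mean(weight * diff ** 2))

# With tau=0.7, a +1 residual costs 0.7 while a -1 residual costs only 0.3.
loss = expectile_loss(np.array([1.0, 0.0]), np.array([0.0, 1.0]))  # (0.7 + 0.3) / 2 = 0.5
```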
Add a configuration entry to `_CONFIGS` in `config.py`.
Example:
```python
ValueTrainConfig(
    name="value_config",
    model=pi0_value.Pi0ValueConfig(
        action_horizon=20,
        hierarchical_actions=[
            [7, 8, 9],
            [17, 18, 19, 20],
            [0, 1, 2, 3, 4, 5, 6, 10, 11, 12, 13, 14, 15, 16],
        ],
        method="exponential",
        network_type="hierarchical_q",  # set 'q' to disable hierarchy
    ),
    data=LeRobotR1DataConfig(
        repo_id="your_converted_folder_name",
        action_sequence_keys=[
            "action",
            "next_action",
            "reward",
            "terminal",
        ],
        base_config=DataConfig(prompt_from_task=True),
        repack_transforms=_transforms.Group(
            inputs=[
                _transforms.RepackTransform(
                    {
                        "images": {
                            "cam_head": "observation.images.head",
                            "cam_left_wrist": "observation.images.left_wrist",
                            "cam_right_wrist": "observation.images.right_wrist",
                        },
                        "state": "observation.state",
                        "actions": "action",
                        "next_state": "observation.next_state",
                        "next_actions": "next_action",
                        "rewards": "reward",
                        "terminal": "terminal",
                        "prompt": "prompt",
                    }
                )
            ]
        ),
    ),
    num_train_iql_steps=30_001,
    batch_size=1,
    fsdp_devices=1,
    num_workers=1,
)
```

Then launch value model training:

```bash
CONFIG_NAME=iql_r1
METHOD=???
NETWORK_TYPE=???  # 'q' or 'hierarchical_q'
EXP_NAME=???

XLA_PYTHON_CLIENT_MEM_FRACTION=0.8 python scripts/train_iql.py \
    ${CONFIG_NAME} \
    --exp_name ${EXP_NAME} \
    --model.method ${METHOD} \
    --model.network_type ${NETWORK_TYPE} \
    --overwrite
```

After training the value model, train the policy using value-guided AWR.
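For intuition, advantage-weighted regression (AWR) turns policy training into behavior cloning reweighted by exponentiated advantage, so transitions the value model scores highly dominate the loss. A schematic sketch (`beta` and the weight clip are illustrative values, not this repo's hyperparameters):

```python
import numpy as np

def awr_weights(q_values, v_values, beta=1.0, max_weight=20.0):
    """Exponentiated-advantage weights for advantage-weighted regression."""
    advantages = q_values - v_values        # A(s, a) = Q(s, a) - V(s)
    weights = np.exp(advantages / beta)     # favor actions the value model prefers
    return np.minimum(weights, max_weight)  # clip to keep the regression stable

# An action with zero advantage keeps weight 1; negative advantages are down-weighted.
w = awr_weights(np.array([1.0, 0.0, -1.0]), np.zeros(3))
```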
Add a configuration entry to `_CONFIGS` in `config.py`.
Example:
```python
AWRTrainConfig(
    name="policy_config",
    model=pi0.Pi0Config(
        action_horizon=20,
        paligemma_variant="gemma_2b_lora",
        action_expert_variant="gemma_300m_lora",
    ),
    optimal_data=LeRobotR1DataConfig(
        repo_id="optimal_repo_name",
        base_config=DataConfig(prompt_from_task=True),
        repack_transforms=_transforms.Group(
            inputs=[
                _transforms.RepackTransform(
                    {
                        "images": {
                            "cam_head": "observation.images.head",
                            "cam_left_wrist": "observation.images.left_wrist",
                            "cam_right_wrist": "observation.images.right_wrist",
                        },
                        "state": "observation.state",
                        "actions": "action",
                        "prompt": "prompt",
                    }
                )
            ]
        ),
    ),
    suboptimal_data=LeRobotR1DataConfig(
        repo_id="mixed_repo_name",
        base_config=DataConfig(prompt_from_task=True),
        repack_transforms=_transforms.Group(
            inputs=[
                _transforms.RepackTransform(
                    {
                        "images": {
                            "cam_head": "observation.images.head",
                            "cam_left_wrist": "observation.images.left_wrist",
                            "cam_right_wrist": "observation.images.right_wrist",
                        },
                        "state": "observation.state",
                        "actions": "action",
                        "prompt": "prompt",
                    }
                )
            ]
        ),
    ),
    weight_loader=weight_loaders.CheckpointWeightLoader("s3://openpi-assets/checkpoints/pi0_base/params"),
    freeze_filter=pi0.Pi0Config(
        paligemma_variant="gemma_2b_lora",
        action_expert_variant="gemma_300m_lora",
        action_horizon=20,
    ).get_freeze_filter(),
    num_train_steps=30_000,
    batch_size=16,
    fsdp_devices=2,
    num_workers=8,
)
```

Then launch policy training:

```bash
VALUE_CONFIG_NAME=???   # value config used in the previous stage
VALUE_PATH=???          # path to the trained value model checkpoint
EXP_PATH=???
POLICY_CONFIG_NAME=???

XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 python scripts/train_awr.py \
    --policy-config ${POLICY_CONFIG_NAME} \
    --value-config ${VALUE_CONFIG_NAME} \
    --policy-override exp_name=${EXP_PATH} \
    --policy-override overwrite=True \
    --value-override pretrained_path=${VALUE_PATH}
```

- Value learning is performed only on optimal demonstrations.
- Policy learning leverages both optimal and mixed-quality datasets.
- Hierarchical value decomposition can be disabled by setting `network_type = "q"`.
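When defining your own `hierarchical_actions` groups, a quick sanity check is that the groups partition the action dimensions with no overlaps and no gaps. A sketch using the grouping from the value-config example above (21 action dimensions):

```python
# Groups copied from the example value config; together the three lists
# must cover every action dimension exactly once.
hierarchical_actions = [
    [7, 8, 9],
    [17, 18, 19, 20],
    [0, 1, 2, 3, 4, 5, 6, 10, 11, 12, 13, 14, 15, 16],
]
flat = sorted(d for group in hierarchical_actions for d in group)
assert flat == list(range(21)), "groups must partition the action dimensions"
```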