
LAMDA-RL/HVD


Hierarchical Value-Decomposed Offline Reinforcement Learning for Whole-Body Control

Official codebase for the ICLR 2026 poster paper:

Hierarchical Value-Decomposed Offline Reinforcement Learning for Whole-Body Control


Environment Setup

Install dependencies using uv:

GIT_LFS_SKIP_SMUDGE=1 uv sync
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e .

Data Preparation

Set directories for storing datasets and checkpoints:

export HF_HOME=???
export HF_LEROBOT_HOME=???

These paths specify where HuggingFace assets and converted LeRobot datasets will be stored locally.
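For concreteness, a hypothetical setup might look like the following (the paths are placeholders, not the repo's defaults; substitute your own storage locations):

```shell
# Hypothetical storage locations -- replace with your own paths.
export HF_HOME=/data/huggingface        # HuggingFace cache (models, assets)
export HF_LEROBOT_HOME=/data/lerobot    # converted LeRobot datasets

mkdir -p "${HF_HOME}" "${HF_LEROBOT_HOME}"
echo "LeRobot datasets will be written under: ${HF_LEROBOT_HOME}"
```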


Convert Data to LeRobot Format

You need to prepare two datasets:

  • Optimal dataset (high-quality demonstrations)
  • Mixed-quality dataset (suboptimal + optimal trajectories)

For each dataset, run:

CONFIG_NAME=???      # configuration name defined in config.py
NUM_EPISODES=???     # number of episodes per folder
REPO_ID=???          # output dataset name

python examples/r1/convert_r1_data_to_lerobot_rl.py \
    --raw_dirs \
    "path/to/source/data/folder1" \
    "path/to/source/data/folder2" \
        ... \
    --tasks \
    "prompt1" \
    "prompt2" \
        ... \
    --num_episodes ${NUM_EPISODES} \
    --repo_id ${REPO_ID} \
    --no_push_to_hub \
    --success_only

After completion, the converted dataset will be created at:

${HF_LEROBOT_HOME}/${REPO_ID}
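Note that `--raw_dirs` and `--tasks` are parallel lists: each source folder is paired with the prompt at the same position. A minimal sketch of that pairing (folder and prompt names are placeholders; the converter's internals may differ):

```python
# Parallel lists, as passed on the command line (placeholder values).
raw_dirs = ["path/to/source/data/folder1", "path/to/source/data/folder2"]
tasks = ["prompt1", "prompt2"]

# The two lists must align one-to-one: folder i gets prompt i.
assert len(raw_dirs) == len(tasks), "raw_dirs and tasks must have equal length"
for folder, prompt in zip(raw_dirs, tasks):
    print(f"{folder} -> {prompt!r}")
```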

Compute Normalization Statistics

Run the following command to compute normalization statistics:

python scripts/compute_rl_norm_stats.py --config-name ${CONFIG_NAME}

This creates:

../assets/CONFIG_NAME/REPO_ID/norm_stats.json
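Conceptually, the statistics are per-dimension summaries (e.g. mean and standard deviation) of the states and actions, which are later used to normalize inputs during training. A toy sketch of that idea (the actual script's output schema may contain additional fields such as quantiles):

```python
import json

import numpy as np

# Toy stand-ins for dataset arrays of shape [num_steps, dim].
states = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
actions = np.array([[1.0, -1.0], [3.0, 1.0]])

def stats(x: np.ndarray) -> dict:
    """Per-dimension mean and standard deviation, serializable as JSON."""
    return {"mean": x.mean(axis=0).tolist(), "std": x.std(axis=0).tolist()}

norm_stats = {"state": stats(states), "actions": stats(actions)}
print(json.dumps(norm_stats, indent=2))
```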

Important

We use shared normalization statistics for both datasets:

  1. Compute stats using the optimal dataset
  2. Copy the generated norm_stats.json to the mixed-quality dataset asset directory

The final asset structure used during training should be:

../assets/VALUE_CONFIG_NAME/OPTIMAL_REPO_ID      # value model training
../assets/POLICY_CONFIG_NAME/OPTIMAL_REPO_ID     # policy training (optimal)
../assets/POLICY_CONFIG_NAME/SUBOPTIMAL_REPO_ID  # policy training (mixed-quality)

All listed assets must share identical normalization statistics.
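The copy step above can be scripted. A minimal sketch, assuming hypothetical config and repo names (`value_config`, `policy_config`, `optimal_repo`, `mixed_repo`) and a relative `assets/` root; substitute your real values:

```python
import shutil
from pathlib import Path

# Hypothetical asset layout -- replace names with your CONFIG_NAME / REPO_ID.
assets = Path("assets")
src = assets / "value_config" / "optimal_repo" / "norm_stats.json"
dst_dirs = [
    assets / "policy_config" / "optimal_repo",
    assets / "policy_config" / "mixed_repo",
]

# Stand-in stats file, representing the output of compute_rl_norm_stats.py.
src.parent.mkdir(parents=True, exist_ok=True)
src.write_text('{"state": {"mean": [0.0], "std": [1.0]}}')

# Propagate the optimal-dataset stats so every asset dir shares identical stats.
for d in dst_dirs:
    d.mkdir(parents=True, exist_ok=True)
    shutil.copy(src, d / "norm_stats.json")
    print(f"copied -> {d / 'norm_stats.json'}")
```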


Training

Training consists of two stages:

  1. Value model training (optimal dataset only)
  2. Policy model training (value-guided offline RL)

1. Training the Value Model

The value model is trained using the optimal dataset.
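The training script name (`train_iql.py`) suggests the value model is fit with IQL-style expectile regression, where an asymmetric L2 loss pushes V(s) toward an upper expectile of Q(s, a). A sketch of that loss in NumPy (the tau value is illustrative, not the repo's setting):

```python
import numpy as np

def expectile_loss(diff: np.ndarray, tau: float = 0.7) -> np.ndarray:
    """Asymmetric L2 loss used in IQL: L(u) = |tau - 1(u < 0)| * u^2,
    where diff = Q(s, a) - V(s). With tau > 0.5, positive errors are
    penalized more, so V tracks an upper expectile of Q."""
    weight = np.where(diff < 0, 1.0 - tau, tau)
    return weight * diff**2

diff = np.array([-1.0, 0.5, 2.0])
print(expectile_loss(diff))
```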

Configure Value Training

Add a configuration entry to _CONFIGS in config.py.

Example:

ValueTrainConfig(
    name="value_config",
    model=pi0_value.Pi0ValueConfig(
        action_horizon=20,
        hierarchical_actions=[
            [7, 8, 9],
            [17, 18, 19, 20],
            [0, 1, 2, 3, 4, 5, 6, 10, 11, 12, 13, 14, 15, 16],
        ],
        method="exponential",
        network_type="hierarchical_q",  # set 'q' to disable hierarchy
    ),
    data=LeRobotR1DataConfig(
        repo_id="your_converted_folder_name",
        action_sequence_keys=[
            "action",
            "next_action",
            "reward",
            "terminal",
        ],
        base_config=DataConfig(prompt_from_task=True),
        repack_transforms=_transforms.Group(
            inputs=[
                _transforms.RepackTransform(
                    {
                        "images": {
                            "cam_head": "observation.images.head",
                            "cam_left_wrist": "observation.images.left_wrist",
                            "cam_right_wrist": "observation.images.right_wrist",
                        },
                        "state": "observation.state",
                        "actions": "action",
                        "next_state": "observation.next_state",
                        "next_actions": "next_action",
                        "rewards": "reward",
                        "terminal": "terminal",
                        "prompt": "prompt",
                    }
                )
            ]
        ),
    ),
    num_train_iql_steps=30_001,
    batch_size=1,
    fsdp_devices=1,
    num_workers=1,
)
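The `hierarchical_actions` field partitions the action vector into index groups, so that each group can be scored by its own value head when `network_type="hierarchical_q"`. A sketch of the grouping, using the index lists from the config above (which body part each group corresponds to is an assumption not stated in the source):

```python
import numpy as np

# Index groups copied from the config above; the mapping of groups to
# body parts (arms, base, etc.) is an assumption for illustration only.
hierarchical_actions = [
    [7, 8, 9],
    [17, 18, 19, 20],
    [0, 1, 2, 3, 4, 5, 6, 10, 11, 12, 13, 14, 15, 16],
]

action = np.arange(21, dtype=np.float32)  # toy 21-dim action vector

# Under hierarchical_q, each group feeds its own value head;
# with network_type="q" the full 21-dim vector is scored jointly.
groups = [action[idx] for idx in hierarchical_actions]
for i, g in enumerate(groups):
    print(f"group {i}: {len(g)} dims")

# The groups partition the action dimensions: every index appears exactly once.
flat = sorted(i for idx in hierarchical_actions for i in idx)
assert flat == list(range(21))
```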

Start Value Training

CONFIG_NAME=iql_r1
METHOD=???
NETWORK_TYPE=???   # 'q' or 'hierarchical_q'
EXP_NAME=???

XLA_PYTHON_CLIENT_MEM_FRACTION=0.8 python scripts/train_iql.py \
    ${CONFIG_NAME} \
    --exp_name ${EXP_NAME} \
    --model.method ${METHOD} \
    --model.network_type ${NETWORK_TYPE} \
    --overwrite

2. Training the Policy Model

After training the value model, train the policy using value-guided AWR.
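In advantage-weighted regression (AWR), the policy imitates dataset actions with per-sample weights derived from the learned value functions, typically w = exp((Q - V) / beta) with a clip for stability. A minimal sketch of the weighting step (the temperature beta and clip value are illustrative, not the repo's settings):

```python
import numpy as np

def awr_weights(q: np.ndarray, v: np.ndarray, beta: float = 1.0,
                w_max: float = 20.0) -> np.ndarray:
    """Per-sample AWR weights: w = min(exp((Q - V) / beta), w_max).
    beta and w_max are illustrative hyperparameters, not the repo's values."""
    advantage = q - v
    return np.minimum(np.exp(advantage / beta), w_max)

# Toy Q and V estimates for three transitions; higher advantage => larger weight.
q = np.array([1.0, 0.0, 3.0])
v = np.array([0.5, 0.5, 0.5])
print(awr_weights(q, v))
```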

Configure Policy Training

Add a configuration entry to _CONFIGS in config.py.

Example:

AWRTrainConfig(
    name="policy_config",
    model=pi0.Pi0Config(
        action_horizon=20,
        paligemma_variant="gemma_2b_lora",
        action_expert_variant="gemma_300m_lora",
    ),
    optimal_data=LeRobotR1DataConfig(
        repo_id="optimal_repo_name",
        base_config=DataConfig(prompt_from_task=True),
        repack_transforms=_transforms.Group(
            inputs=[   
                _transforms.RepackTransform(
                    {
                        "images": {
                            "cam_head": "observation.images.head",
                            "cam_left_wrist": "observation.images.left_wrist",
                            "cam_right_wrist": "observation.images.right_wrist",
                        },
                        "state": "observation.state", 
                        "actions": "action",
                        "prompt": "prompt",
                    }
                )
            ]
        ),
    ),
    suboptimal_data=LeRobotR1DataConfig(
        repo_id="mixed_repo_name",
        base_config=DataConfig(prompt_from_task=True),
        repack_transforms=_transforms.Group(
            inputs=[
                _transforms.RepackTransform(
                    {
                        "images": {
                            "cam_head": "observation.images.head",
                            "cam_left_wrist": "observation.images.left_wrist",
                            "cam_right_wrist": "observation.images.right_wrist",
                        },
                        "state": "observation.state", 
                        "actions": "action",
                        "prompt": "prompt",
                    }
                )
            ]
        ),
    ),
    weight_loader=weight_loaders.CheckpointWeightLoader("s3://openpi-assets/checkpoints/pi0_base/params"),
    freeze_filter=pi0.Pi0Config(
        paligemma_variant="gemma_2b_lora", 
        action_expert_variant="gemma_300m_lora", 
        action_horizon=20,
    ).get_freeze_filter(),
    num_train_steps=30_000,
    batch_size=16,
    fsdp_devices=2,
    num_workers=8
)

Start Policy Training

VALUE_CONFIG_NAME=???
VALUE_PATH=???
EXP_PATH=???
POLICY_CONFIG_NAME=???

XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 python scripts/train_awr.py \
    --policy-config ${POLICY_CONFIG_NAME} \
    --value-config ${VALUE_CONFIG_NAME} \
    --policy-override exp_name=${EXP_PATH} \
    --policy-override overwrite=True \
    --value-override pretrained_path=${VALUE_PATH}

Notes

  • Value learning is performed only on optimal demonstrations.
  • Policy learning leverages both optimal and mixed-quality datasets.
  • Hierarchical value decomposition can be disabled by setting network_type="q".
