- March 2026: 🔥 We have released the pre-trained BitVLA model! The evaluation results in the table below have been updated to reflect performance after pre-training. You can try the new pre-trained model out of the box.
- June 2025: BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation
- ✅ Paper, pre-trained VLM, and evaluation code
- ✅ Fine-tuned VLA code and models
- ✅ Pre-trained VLA
- 🚧 Pre-training code
| Models | Size | Memory Usage↓ | LIBERO-Spatial | LIBERO-Object | LIBERO-Goal | LIBERO-Long | Avg. |
|---|---|---|---|---|---|---|---|
| Large Models | |||||||
| OpenVLA | 7.5B | 15.1GB (10.79×) | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| CoT-VLA | 8.0B | 16.2GB (11.57×) | 87.5 | 91.6 | 87.6 | 69.0 | 81.1 |
| UniVLA | 8.5B | 17.0GB (12.14×) | 96.5 | 96.8 | 95.6 | 92.0 | 95.2 |
| UnifiedVLA | 8.5B | 17.0GB (12.14×) | 95.4 | 98.8 | 93.6 | 94.0 | 95.5 |
| OpenVLA-OFT | 7.7B | 15.4GB (11.00×) | 97.6 | 98.4 | 97.9 | 94.5 | 97.1 |
| Small Models | |||||||
| SpatialVLA | 4.2B | 8.5GB (6.07×) | 88.2 | 89.9 | 78.6 | 55.5 | 78.1 |
| NORA-Long | 3.8B | 7.5GB (5.36×) | 92.2 | 95.4 | 89.4 | 74.6 | 87.9 |
| 4D-VLA | 4.1B | 8.3GB (5.93×) | 88.9 | 95.2 | 90.9 | 79.1 | 88.6 |
| SmolVLA | 2.3B | 4.6GB (3.29×) | 93.0 | 94.0 | 91.0 | 77.0 | 88.8 |
| GROOT-N1 | 2.2B | 4.4GB (3.14×) | 94.4 | 97.6 | 93.0 | 90.6 | 93.9 |
| π₀ | 3.5B | 7.0GB (5.00×) | 96.8 | 98.8 | 95.8 | 85.2 | 94.2 |
| BitVLA w/o pre-training | 3.0B | 1.4GB (1.00×) | 97.4 | 99.6 | 94.4 | 87.6 | 94.8 |
| 🔥 BitVLA | 3.0B | 1.4GB (1.00×) | 96.6 | 99.0 | 95.4 | 92.8 | 96.0 |
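The relative-memory column above is simply each model's footprint divided by BitVLA's 1.4 GB. A quick sanity check over a few rows of the table:

```python
# Sanity-check the relative-memory column of the table above:
# each entry is the model's memory footprint divided by BitVLA's 1.4 GB.
bitvla_gb = 1.4

models = {
    "OpenVLA": 15.1,
    "CoT-VLA": 16.2,
    "OpenVLA-OFT": 15.4,
    "SmolVLA": 4.6,
}

for name, gb in models.items():
    print(f"{name}: {gb / bitvla_gb:.2f}x")  # e.g. OpenVLA: 10.79x
```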
| Model | Path |
|---|---|
| 🔥 BitVLA - VL & VLA pre-trained | lxsy/bitvla-bf16 |
| BitVLA - VL pre-trained | hongyuw/bitvla-bitsiglipL-224px-bf16 |
| BitVLA finetuned on LIBERO-Spatial | hongyuw/ft-bitvla-bitsiglipL-224px-libero_spatial-bf16 |
| BitVLA finetuned on LIBERO-Object | hongyuw/ft-bitvla-bitsiglipL-224px-libero_object-bf16 |
| BitVLA finetuned on LIBERO-Goal | hongyuw/ft-bitvla-bitsiglipL-224px-libero_goal-bf16 |
| BitVLA finetuned on LIBERO-Long | hongyuw/ft-bitvla-bitsiglipL-224px-libero_long-bf16 |
| BitVLA w/ BF16 SigLIP | hongyuw/bitvla-siglipL-224px-bf16 |
*Note that we release the master weights of BitVLA and perform quantization online. For actual memory savings, you can quantize the weights offline to 1.58-bit precision. We recommend the bitnet.cpp inference framework to accurately measure the reduction in inference cost. A dedicated inference framework and model are coming soon.
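For intuition, 1.58-bit (ternary) online quantization can be sketched with the absmean scheme used by BitNet b1.58-style models: scale each weight tensor by its mean absolute value, round, and clip to {-1, 0, +1}. This is a minimal NumPy illustration, not the repository's actual implementation:

```python
import numpy as np

def absmean_ternary(w, eps=1e-6):
    """Quantize a weight tensor to {-1, 0, +1} with a per-tensor absmean scale
    (BitNet b1.58-style sketch). Dequantize with w ~= w_q * scale."""
    scale = np.abs(w).mean() + eps        # per-tensor absmean scale
    w_q = np.clip(np.round(w / scale), -1, 1)  # round, then clip to ternary range
    return w_q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
w_q, scale = absmean_ternary(w)
print(sorted(np.unique(w_q)))  # only ternary values, e.g. [-1.0, 0.0, 1.0]
```

Storing `w_q` in 2 bits (or packed ternary) plus one scale per tensor is what yields the memory savings measured in the table above.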
We use the LMM-Eval toolkit to evaluate on VQA tasks. We provide a modified transformers repo in which modeling_llava.py and modeling_siglip.py are adapted to support W1.58-A8 quantization.
Evaluation should be run inside the nvidia_24_07 Docker container. Install the packages:
docker run --name nvidia_24_07 --privileged --net=host --ipc=host --gpus=all -v /mnt:/mnt -v /tmp:/tmp -d nvcr.io/nvidia/pytorch:24.07-py3 sleep infinity # only use for multimodal evaluation
docker exec -it nvidia_24_07 bash
git clone https://github.com/ustcwhy/BitVLA.git
cd BitVLA/
bash vl_eval_setup.sh # only use for multimodal evaluation
First, download the BitVLA model from HuggingFace:
git clone https://huggingface.co/hongyuw/bitvla-bitsiglipL-224px-bf16 # BitVLA w/ W1.58-A8 SigLIP-L
git clone https://huggingface.co/hongyuw/bitvla-siglipL-224px-bf16 # BitVLA w/ BF16 SigLIP-L
Then run the following scripts to conduct evaluations:
cd lmms-eval/
bash eval-dense-hf.sh /YOUR_PATH_TO_EXP/bitvla-bitsiglipL-224px-bf16
bash eval-dense-hf.sh /YOUR_PATH_TO_EXP/bitvla-siglipL-224px-bf16
To endow BitVLA with generalizable manipulation priors that transfer across embodiments and environments, we pre-train it with an autoregressive next-action prediction objective following OpenVLA.
Pre-training Details:
- Base model: We use hongyuw/bitvla-bitsiglipL-224px-bf16 as the base model.
- Dataset: Following OpenVLA, we use a curated large-scale corpus based on a subset of the Open X-Embodiment dataset, resulting in ~1M training samples.
- Hyperparameters: We train the model for 200K steps with a total batch size of 2048. The peak learning rates are 3×10⁻⁴ for the LLM and 1×10⁻⁴ for the ViT.
- Compute: The full pre-training takes approximately 14 days on 16 NVIDIA H800 (80GB) GPUs.
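The next-action prediction objective above can be made concrete: OpenVLA-style training uniformly discretizes each continuous action dimension into bins, treats the bin ids as tokens, and minimizes cross-entropy on them. A NumPy sketch under those assumptions (function names are illustrative, not from this codebase):

```python
import numpy as np

def discretize_actions(actions, low=-1.0, high=1.0, n_bins=256):
    """Uniformly bin continuous actions in [low, high] into token ids 0..n_bins-1."""
    clipped = np.clip(actions, low, high)
    ids = np.floor((clipped - low) / (high - low) * n_bins).astype(np.int64)
    return np.minimum(ids, n_bins - 1)  # the boundary value `high` maps to the last bin

def action_token_ce(logits, targets):
    """Mean cross-entropy of target action tokens under per-step logits (T, V)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

tokens = discretize_actions(np.array([-1.0, 0.0, 0.5, 1.0]))
print(tokens)  # [  0 128 192 255]
loss = action_token_ce(np.zeros((4, 256)), tokens)  # uniform logits -> log(256)
```

At inference time the predicted token ids are mapped back to bin centers to recover continuous actions.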
We fine-tune BitVLA with the OFT recipe from OpenVLA-OFT. First, set up the environment as required by that project; see SETUP.md and LIBERO.md for detailed instructions.
conda create -n bitvla python=3.10 -y
conda activate bitvla
pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu124
# or use the provided docker
# docker run --name nvidia_24_07 --privileged --net=host --ipc=host --gpus=all -v /mnt:/mnt -v /tmp:/tmp -d nvcr.io/nvidia/pytorch:24.07-py3 sleep infinity
cd BitVLA
pip install -e openvla-oft/
pip install -e transformers
cd openvla-oft/
# install LIBERO
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
pip install -e LIBERO/
pip install -r experiments/robot/libero/libero_requirements.txt
# install bitvla
pip install -e bitvla/
We adopt the same dataset as OpenVLA-OFT for fine-tuning on LIBERO. You can download it from HuggingFace:
git clone git@hf.co:datasets/openvla/modified_libero_rlds
- 🔥 New pre-trained model (recommended): ready to use out of the box. No additional processing is required; you can directly run our provided scripts.
- 🕰️ Old model: this version was not pre-trained on the Open X-Embodiment dataset. Before use, convert the checkpoint into a format compatible with our codebase:
python convert_ckpt.py /path/to/bitvla-bitsiglipL-224px-bf16
After that, you can fine-tune BitVLA using the provided shell scripts.
sh ft_script/ft_bitvla_libero_spatial.sh
sh ft_script/ft_bitvla_libero_object.sh
sh ft_script/ft_bitvla_libero_goal.sh
sh ft_script/ft_bitvla_libero_long.sh
You can download our fine-tuned BitVLA models from HuggingFace. As an example, run the following script to evaluate on the LIBERO-Spatial suite:
python experiments/robot/libero/run_libero_eval_bitnet.py \
--pretrained_checkpoint /path/to/ft-bitvla-bitsiglipL-224px-libero_spatial-bf16 \
--task_suite_name libero_spatial \
--info_in_path "information you want to show in path" \
--model_family "bitnet"
This repository is built on LMM-Eval, Hugging Face's transformers, OpenVLA-OFT, and OpenVLA.
If you find this repository useful, please consider citing our work:
@article{bitvla,
title={BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation},
author={Hongyu Wang and Chuyan Xiong and Ruiping Wang and Xilin Chen},
year={2025},
eprint={2506.07530},
archivePrefix={arXiv},
primaryClass={cs.RO},
}
This project is licensed under the MIT License.
For help or issues using the models, please open a GitHub issue.