Driven by the rapid evolution of Vision-Action (VA) and Vision-Language-Action (VLA) models, imitation learning has significantly advanced robotic manipulation capabilities. However, evaluation methodologies have lagged behind, hindering the establishment of Trustworthy Evaluation for these behaviors.
Current paradigms rely predominantly on binary success rates, failing to address the critical dimensions of trust:
- Source Authenticity: Distinguishing genuine policy behaviors from human teleoperation.
- Execution Quality: Assessing metrics such as smoothness and safety.
To bridge these gaps, we propose a comprehensive solution comprising the Eval-Actions benchmark and the AutoEval architecture.
- Eval-Actions Benchmark: A dataset supporting trustworthiness analysis. Unlike existing datasets restricted to successful human demonstrations, it integrates VA and VLA policy execution trajectories alongside human teleoperation data, explicitly including failure scenarios.
- AutoEval-S (Small): Leverages Spatio-Temporal Aggregation for semantic assessment, augmented by an auxiliary Kinematic Calibration Signal to refine motion-smoothness assessment (an illustrative smoothness signal is sketched after this list).
- AutoEval-P (Plus): Incorporates the Group Relative Policy Optimization (GRPO) paradigm to enhance logical reasoning capabilities, achieving robust source discrimination (99.6% accuracy).
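To make the execution-quality dimension concrete, the sketch below computes a simple jerk-based smoothness score from a joint-position trajectory. This is a minimal illustration under assumed inputs (a 30 Hz trajectory and an arbitrary squashing function); it is not the actual Kinematic Calibration Signal used by AutoEval-S.

```python
import numpy as np

def jerk_smoothness(joint_positions: np.ndarray, dt: float = 1.0 / 30.0) -> float:
    """Illustrative smoothness score: lower mean squared jerk -> score closer to 1.

    joint_positions: array of shape (T, D), T timesteps and D joints.
    dt: sampling interval in seconds (30 Hz assumed here).
    """
    # Finite-difference derivatives: velocity, acceleration, jerk.
    vel = np.diff(joint_positions, axis=0) / dt
    acc = np.diff(vel, axis=0) / dt
    jerk = np.diff(acc, axis=0) / dt
    # Mean squared jerk across time and joints.
    msj = float(np.mean(jerk ** 2))
    # Squash to (0, 1]: smoother trajectories score higher; the scale is arbitrary.
    return 1.0 / (1.0 + msj)

# Example: a clean sine trajectory scores higher than a noisy one.
t = np.linspace(0, 2 * np.pi, 64)[:, None]
smooth_traj = np.sin(t)
noisy_traj = smooth_traj + 0.05 * np.random.randn(*smooth_traj.shape)
print(jerk_smoothness(smooth_traj), jerk_smoothness(noisy_traj))
```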
- [2026-01-24] Repository created. Dataset coming soon.
Overview of the proposed AutoEval framework. The system processes a robot manipulation video sequence (e.g., 32 frames) alongside kinematic prompts.
- Top (AutoEval-S): Designed for Expert Grading and Rank-Guided tasks. This branch employs a Spatio-Temporal Aggregation Strategy to compress high-frequency motion details into composite visual tokens. It generates structured text predictions; following format decomposition, the model is optimized via Supervised Fine-Tuning (SFT) using Cross-Entropy Loss.
- Bottom (AutoEval-P): Tailored for Chain-of-Thought (CoT) reasoning. This branch adopts the Group Relative Policy Optimization (GRPO) paradigm (Guo et al., 2025). The policy model generates multiple reasoning paths (containing <think> tokens), which are optimized against a hybrid reward function comprising content accuracy ($r_{Content}$) and format constraints ($r_{Format}$) to enhance physical reasoning capabilities.
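A minimal sketch of how such a hybrid reward could be composed is shown below. The weighted sum, the regex-based parsing, and the <answer> tag are assumptions for illustration; the document only specifies that a content term ($r_{Content}$) and a format term ($r_{Format}$) are combined.

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the rollout follows the assumed <think>...</think><answer>...</answer> format."""
    has_think = re.search(r"<think>.*?</think>", response, flags=re.DOTALL) is not None
    has_answer = re.search(r"<answer>.*?</answer>", response, flags=re.DOTALL) is not None
    return 1.0 if (has_think and has_answer) else 0.0

def content_reward(response: str, ground_truth: str) -> float:
    """1.0 if the extracted answer matches the ground-truth label (e.g., 'policy' vs. 'human')."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip().lower() == ground_truth.strip().lower() else 0.0

def hybrid_reward(response: str, ground_truth: str,
                  w_content: float = 0.9, w_format: float = 0.1) -> float:
    """r = w_content * r_Content + w_format * r_Format (weights are illustrative, not the paper's)."""
    return w_content * content_reward(response, ground_truth) + w_format * format_reward(response)

# Example rollout scored against a ground-truth source label.
rollout = "<think>The end-effector path shows hesitation typical of teleoperation.</think><answer>human</answer>"
print(hybrid_reward(rollout, "human"))  # 1.0
```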
Clone the repository and set up the environment:
```bash
git clone https://github.com/YourUsername/AutoEval.git
cd AutoEval

# Create conda environment from environment.yml
conda env create -f environment.yml

# Activate the environment
conda activate autoeval
```

📚 Data

Eval-Actions Benchmark

You can collect data yourself or download our Eval-Actions benchmark. The dataset is structured around three core supervision signals: Expert Grading (EG), Rank-Guided preferences (RG), and Chain-of-Thought (CoT).
(Download links coming soon)
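For orientation, the snippet below shows what a single annotated trajectory might look like under the three supervision signals. All field names, paths, and values are hypothetical placeholders, not the released schema.

```python
# Hypothetical Eval-Actions record; the real schema may differ once the dataset is released.
sample = {
    "video": "trajectories/pick_place/episode_0042.mp4",  # placeholder path
    "source": "vla_policy",            # e.g., "vla_policy", "va_policy", or "human_teleop"
    "success": False,                  # failure scenarios are explicitly included
    "expert_grading": {                # EG: scalar quality scores from expert annotators
        "smoothness": 3,
        "safety": 4,
    },
    "rank_guided": {                   # RG: preference between two trajectories
        "preferred_over": "episode_0017",
    },
    "chain_of_thought": (              # CoT: free-form rationale for the judgment
        "The gripper oscillates before grasping, suggesting an unstable policy rollout."
    ),
}
```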
🛠️ Usage
- Data Preparation
Step 1: Split the Dataset
Split the raw data into training and validation sets. Replace ROOT_DATA with your source path and TARGET_ROOT with your desired output path.
```bash
python process_data/splite_data_big.py \
    --root ROOT_DATA \
    --new_data_root TARGET_ROOT
```

Step 2: Generate Training JSON (Spatio-Temporal Aggregation)
Run the generation script to create the training dataset file (sft_train_2_2_score). By default, this script processes the training split (a conceptual sketch of the aggregation follows Step 3).

```bash
python process_data/generate_llama_big_frames.py
```

Step 3: Generate Validation JSON
To generate the validation dataset (sft_val_2_2_score), modify the script configuration manually:
- Open process_data/generate_llama_big_frames.py.
- Find the variable setting for the data split and change it to val (e.g., set MODEL = 'val').
- Run the script again:
```bash
python process_data/generate_llama_big_frames.py
```
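For intuition, the sketch below illustrates the kind of spatio-temporal aggregation these scripts perform: uniformly sampling frames from a clip and tiling them into composite images so that high-frequency motion fits into fewer visual tokens. The 2x2 grid, uniform sampling, and dummy shapes are assumptions for illustration; see process_data/generate_llama_big_frames.py for the actual implementation.

```python
import numpy as np

def sample_frames(video: np.ndarray, num_frames: int = 32) -> np.ndarray:
    """Uniformly sample num_frames frames from a (T, H, W, C) clip."""
    idx = np.linspace(0, len(video) - 1, num_frames).astype(int)
    return video[idx]

def tile_frames(frames: np.ndarray, grid: int = 2) -> np.ndarray:
    """Tile consecutive frames into grid x grid composite images.

    (N, H, W, C) -> (N // grid**2, grid*H, grid*W, C); N must be divisible by grid**2.
    """
    n, h, w, c = frames.shape
    per_tile = grid * grid
    composites = []
    for start in range(0, n, per_tile):
        block = frames[start:start + per_tile]                   # (grid*grid, H, W, C)
        rows = [np.concatenate(block[r * grid:(r + 1) * grid], axis=1) for r in range(grid)]
        composites.append(np.concatenate(rows, axis=0))          # (grid*H, grid*W, C)
    return np.stack(composites)

# Example: 32 sampled frames -> 8 composite 2x2 tiles handed to the VLM.
video = np.zeros((240, 224, 224, 3), dtype=np.uint8)             # dummy 240-frame clip
composites = tile_frames(sample_frames(video, 32), grid=2)
print(composites.shape)  # (8, 448, 448, 3)
```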
- Training (SFT)
Launch Supervised Fine-Tuning (SFT) using the generated datasets.
```bash
python launch_sft.py \
    --template intern_vl \
    --base_model MODEL_PATH \
    --train_dataset sft_train_2_2_score \
    --val_dataset sft_val_2_2_score
```

Note: Replace MODEL_PATH with the actual path to your base model (e.g., Qwen3-VL-4B-Instruct).
Our code is built upon LlamaFactory and EasyR1. We would like to thank the authors for their excellent work.
