Driven by the rapid evolution of Vision-Action (VA) and Vision-Language-Action (VLA) models, imitation learning has significantly advanced robotic manipulation capabilities. However, evaluation methodologies have lagged behind, hindering the establishment of Trustworthy Evaluation for these behaviors.
Current paradigms rely predominantly on binary success rates, failing to address the critical dimensions of trust:
- Source Authenticity: Distinguishing genuine policy behaviors from human teleoperation.
- Execution Quality: Assessing metrics such as smoothness and safety.
To bridge these gaps, we propose a comprehensive solution comprising the Eval-Actions benchmark and the AutoEval architecture.
- Eval-Actions Benchmark: A dataset supporting trustworthiness analysis. Unlike existing datasets restricted to successful human demonstrations, it integrates VA and VLA policy execution trajectories alongside human teleoperation data, explicitly including failure scenarios.
- AutoEval-S (Small): Leverages Spatio-Temporal Aggregation for semantic assessment, augmented by an auxiliary Kinematic Calibration Signal to refine motion-smoothness assessment (an illustrative smoothness signal is sketched after this list).
- AutoEval-P (Plus): Incorporates the Group Relative Policy Optimization (GRPO) paradigm to enhance logical reasoning capabilities, achieving robust source discrimination (99.6% accuracy).
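To make the execution-quality dimension concrete, the sketch below computes a simple jerk-based smoothness score from a joint-position trajectory. This is a minimal illustration under assumed inputs (a 30 Hz trajectory and an arbitrary squashing function); it is not the actual Kinematic Calibration Signal used by AutoEval-S.

```python
import numpy as np

def jerk_smoothness(joint_positions: np.ndarray, dt: float = 1.0 / 30.0) -> float:
    """Illustrative smoothness score: lower mean squared jerk -> score closer to 1.

    joint_positions: array of shape (T, D), T timesteps and D joints.
    dt: sampling interval in seconds (30 Hz assumed here).
    """
    # Finite-difference derivatives: velocity, acceleration, jerk.
    vel = np.diff(joint_positions, axis=0) / dt
    acc = np.diff(vel, axis=0) / dt
    jerk = np.diff(acc, axis=0) / dt
    # Mean squared jerk across time and joints.
    msj = float(np.mean(jerk ** 2))
    # Squash to (0, 1]: smoother trajectories score higher; the scale is arbitrary.
    return 1.0 / (1.0 + msj)

# Example: a clean sine trajectory scores higher than a noisy one.
t = np.linspace(0, 2 * np.pi, 64)[:, None]
smooth_traj = np.sin(t)
noisy_traj = smooth_traj + 0.05 * np.random.randn(*smooth_traj.shape)
print(jerk_smoothness(smooth_traj), jerk_smoothness(noisy_traj))
```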
- [2026-01-24] Repository created. Dataset coming soon.
Overview of the proposed AutoEval framework. The system processes a robot manipulation video sequence (e.g., 32 frames) alongside kinematic prompts.
- Top (AutoEval-S): Designed for Expert Grading and Rank-Guided tasks. This branch employs a Spatio-Temporal Aggregation Strategy to compress high-frequency motion details into composite visual tokens. It generates structured text predictions; following format decomposition, the model is optimized via Supervised Fine-Tuning (SFT) using Cross-Entropy Loss.
- Bottom (AutoEval-P): Tailored for Chain-of-Thought (CoT) reasoning. This branch adopts the Group Relative Policy Optimization (GRPO) paradigm (Guo et al., 2025). The policy model generates multiple reasoning paths (containing <think> tokens), which are optimized against a hybrid reward function comprising content accuracy ($r_{Content}$) and format constraints ($r_{Format}$) to enhance physical reasoning capabilities.
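A minimal sketch of how such a hybrid reward could be composed is shown below. The weighted sum, the regex-based parsing, and the <answer> tag are assumptions for illustration; the document only specifies that a content term ($r_{Content}$) and a format term ($r_{Format}$) are combined.

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the rollout follows the assumed <think>...</think><answer>...</answer> format."""
    has_think = re.search(r"<think>.*?</think>", response, flags=re.DOTALL) is not None
    has_answer = re.search(r"<answer>.*?</answer>", response, flags=re.DOTALL) is not None
    return 1.0 if (has_think and has_answer) else 0.0

def content_reward(response: str, ground_truth: str) -> float:
    """1.0 if the extracted answer matches the ground-truth label (e.g., 'policy' vs. 'human')."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip().lower() == ground_truth.strip().lower() else 0.0

def hybrid_reward(response: str, ground_truth: str,
                  w_content: float = 0.9, w_format: float = 0.1) -> float:
    """r = w_content * r_Content + w_format * r_Format (weights are illustrative, not the paper's)."""
    return w_content * content_reward(response, ground_truth) + w_format * format_reward(response)

# Example rollout scored against a ground-truth source label.
rollout = "<think>The end-effector path shows hesitation typical of teleoperation.</think><answer>human</answer>"
print(hybrid_reward(rollout, "human"))  # 1.0
```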
Clone the repository and set up the environment:
```bash
git clone https://github.com/YourUsername/AutoEval.git
cd AutoEval

# Create conda environment from environment.yml
conda env create -f environment.yml

# Activate the environment
conda activate autoeval
```

📚 Data

Eval-Actions Benchmark

You can collect data yourself or download our Eval-Actions benchmark. The dataset is structured around three core supervision signals: Expert Grading (EG), Rank-Guided preferences (RG), and Chain-of-Thought (CoT).
(Download links coming soon)
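For orientation, the snippet below shows what a single annotated trajectory might look like under the three supervision signals. All field names, paths, and values are hypothetical placeholders, not the released schema.

```python
# Hypothetical Eval-Actions record; the real schema may differ once the dataset is released.
sample = {
    "video": "trajectories/pick_place/episode_0042.mp4",  # placeholder path
    "source": "vla_policy",            # e.g., "vla_policy", "va_policy", or "human_teleop"
    "success": False,                  # failure scenarios are explicitly included
    "expert_grading": {                # EG: scalar quality scores from expert annotators
        "smoothness": 3,
        "safety": 4,
    },
    "rank_guided": {                   # RG: preference between two trajectories
        "preferred_over": "episode_0017",
    },
    "chain_of_thought": (              # CoT: free-form rationale for the judgment
        "The gripper oscillates before grasping, suggesting an unstable policy rollout."
    ),
}
```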
🛠️ Usage
- Data Preparation
Step 1: Split the Dataset
Split the raw data into training and validation sets. Replace ROOT_DATA with your source path and TARGET_ROOT with your desired output path.
```bash
python process_data/splite_data_big.py \
    --root ROOT_DATA \
    --new_data_root TARGET_ROOT
```

Step 2: Generate Training JSON (Spatio-Temporal Aggregation)
Run the generation script to create the training dataset file (sft_train_2_2_score). By default, this script processes the training split (a conceptual sketch of the aggregation follows Step 3).

```bash
python process_data/generate_llama_big_frames.py
```

Step 3: Generate Validation JSON
To generate the validation dataset (sft_val_2_2_score), modify the script configuration manually:
- Open process_data/generate_llama_big_frames.py.
- Find the variable setting for the data split and change it to val (e.g., set MODEL = 'val').
- Run the script again:
```bash
python process_data/generate_llama_big_frames.py
```
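For intuition, the sketch below illustrates the kind of spatio-temporal aggregation these scripts perform: uniformly sampling frames from a clip and tiling them into composite images so that high-frequency motion fits into fewer visual tokens. The 2x2 grid, uniform sampling, and dummy shapes are assumptions for illustration; see process_data/generate_llama_big_frames.py for the actual implementation.

```python
import numpy as np

def sample_frames(video: np.ndarray, num_frames: int = 32) -> np.ndarray:
    """Uniformly sample num_frames frames from a (T, H, W, C) clip."""
    idx = np.linspace(0, len(video) - 1, num_frames).astype(int)
    return video[idx]

def tile_frames(frames: np.ndarray, grid: int = 2) -> np.ndarray:
    """Tile consecutive frames into grid x grid composite images.

    (N, H, W, C) -> (N // grid**2, grid*H, grid*W, C); N must be divisible by grid**2.
    """
    n, h, w, c = frames.shape
    per_tile = grid * grid
    composites = []
    for start in range(0, n, per_tile):
        block = frames[start:start + per_tile]                   # (grid*grid, H, W, C)
        rows = [np.concatenate(block[r * grid:(r + 1) * grid], axis=1) for r in range(grid)]
        composites.append(np.concatenate(rows, axis=0))          # (grid*H, grid*W, C)
    return np.stack(composites)

# Example: 32 sampled frames -> 8 composite 2x2 tiles handed to the VLM.
video = np.zeros((240, 224, 224, 3), dtype=np.uint8)             # dummy 240-frame clip
composites = tile_frames(sample_frames(video, 32), grid=2)
print(composites.shape)  # (8, 448, 448, 3)
```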
- Training (SFT)
Launch Supervised Fine-Tuning (SFT) using the generated datasets.
```bash
python launch_sft.py \
    --template intern_vl \
    --base_model MODEL_PATH \
    --train_dataset sft_train_2_2_score \
    --val_dataset sft_val_2_2_score
```

Note: Replace MODEL_PATH with the actual path to your base model (e.g., Qwen3-VL-4B-Instruct).
Our code is built upon LlamaFactory and EasyR1. We would like to thank the authors for their excellent work.
