# VideoSSR: Video Self-Supervised Reinforcement Learning

This repository contains the official implementation for the paper "VideoSSR: Video Self-Supervised Reinforcement Learning".
VideoSSR is a novel framework designed to enhance the video understanding capabilities of Multimodal Large Language Models (MLLMs). Instead of relying on prohibitively expensive manually annotated data or biased model-annotated data, VideoSSR harnesses the rich, intrinsic information within videos to generate high-quality, verifiable training data. We introduce three self-supervised pretext tasks: Anomaly Grounding, Object Counting, and Temporal Jigsaw. Building upon these tasks, we construct the VideoSSR-30K dataset and train models with Reinforcement Learning with Verifiable Rewards (RLVR), establishing a potent foundational framework for developing more advanced video understanding in MLLMs.
To rigorously test the capabilities of modern MLLMs on fundamental video understanding, we introduce the Video Intrinsic Understanding Benchmark (VIUBench). This benchmark is systematically constructed from our three self-supervised pretext tasks: Anomaly Grounding, Object Counting, and Temporal Jigsaw. It specifically evaluates a model's ability to reason about intrinsic video properties—such as temporal coherence and fine-grained details—independent of external annotations. Our results show that VIUBench poses a significant challenge even for the most advanced models, highlighting a critical area for improvement and validating the effectiveness of our approach.

## Environment

We recommend directly using the latest version of verl and following its instructions to configure the environment. The key configurations of the environment we used are as follows:

```
vllm==0.11.0
transformers==4.57.0.dev0
torch==2.8.0
torchcodec==0.7.0+cu128
```
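As a quick sanity check after setup, the pinned versions can be verified at runtime. This is an illustrative sketch (the helper below is not part of this repository); it only covers packages queryable via `importlib.metadata`:

```python
# Sketch: check installed packages against the pinned versions listed above.
from importlib.metadata import version, PackageNotFoundError

PINNED = {"vllm": "0.11.0", "torch": "2.8.0"}

def check_versions(pinned):
    """Return {package: (installed, expected)} for every mismatch or missing package."""
    mismatches = {}
    for pkg, expected in pinned.items():
        try:
            installed = version(pkg)
        except PackageNotFoundError:
            installed = None  # package not installed at all
        if installed != expected:
            mismatches[pkg] = (installed, expected)
    return mismatches

# An empty dict means the environment matches the pins.
print(check_versions(PINNED))
```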
## Training

First, download the VideoSSR-30K dataset or build your own training data. Then, run the training script:

```shell
bash ./train/train.sh
```

> **Update:** For better compatibility with Qwen3-VL, we recommend using the latest version of verl. If you do, you only need to copy the reward function from `./verl/verl/utils/reward_score` to the corresponding location.
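The reward functions shipped under `./verl/verl/utils/reward_score` are task-specific. Purely as an illustration of the RLVR idea (verifiable rewards), a binary reward for multiple-choice answers might look like the following sketch; the function name and signature are hypothetical, not the actual verl interface:

```python
# Hypothetical sketch of a verifiable reward: the model earns 1.0 only when
# its final answer matches the ground truth, so the signal needs no annotator.
def compute_score(prediction: str, ground_truth: str) -> float:
    """Return 1.0 for an exact (case-insensitive) answer match, else 0.0."""
    return 1.0 if prediction.strip().upper() == ground_truth.strip().upper() else 0.0

print(compute_score(" c ", "C"))  # 1.0
print(compute_score("B", "C"))    # 0.0
```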
## Evaluation

Run the evaluation scripts:

```shell
python ./eval/vqa.py
python ./eval/vtg.py
```

To facilitate standardized testing, we organize the data for all evaluation tasks into the following JSON format:
```json
{
    "video": "fFjv93ACGo8",
    "question": "...",
    "answer": "C"
}
```

## Dataset Construction

We provide the necessary scripts in the `./pretext_tasks` directory for you to create your own datasets using our self-supervised methods.
The process involves two main stages:
1. **Frame Sampling**

   First, prepare your source videos. Then, use the `sample_frames.py` script to preprocess them and extract frames. This step prepares the visual data in the format required by the task generation scripts.

   ```shell
   # Example usage:
   python ./pretext_tasks/sample_frames.py
   ```

2. **Generating Pretext Task Data**

   Once your frames are sampled, you can use the following scripts to generate training data for each of our self-supervised pretext tasks:

   - `grounding.py`: creates data for the Anomaly Grounding task.
   - `counting.py`: creates data for the Object Counting task.
   - `jigsaw.py`: creates data for the Temporal Jigsaw task.

   ```shell
   python ./pretext_tasks/grounding.py
   python ./pretext_tasks/counting.py
   python ./pretext_tasks/jigsaw.py
   ```
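To make the self-supervised idea concrete, the Temporal Jigsaw task can be sketched as follows. This is a self-contained toy illustration, not the actual `jigsaw.py` implementation: video segments stand in for real frames, and the verifiable answer is the permutation that restores temporal order.

```python
import random

# Toy sketch of Temporal Jigsaw: shuffle the temporal order of video
# segments; the automatically verifiable answer is the permutation
# that restores the original order.
def make_jigsaw_sample(num_segments=4, seed=0):
    rng = random.Random(seed)
    original = list(range(num_segments))  # true temporal order
    shuffled = original[:]
    rng.shuffle(shuffled)
    # answer[i] is the position of original segment i in the shuffled clip,
    # so a model response can be checked without any human annotation.
    answer = [shuffled.index(i) for i in original]
    return {"shuffled_segments": shuffled, "answer": answer}

sample = make_jigsaw_sample()
# Recovering the original order from the answer permutation:
restored = [sample["shuffled_segments"][j] for j in sample["answer"]]
print(restored)  # [0, 1, 2, 3]
```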
## Acknowledgement

This work was built upon verl. We also thank the authors of Visual Jigsaw for the inspiration.


