Zhizhou Zhong · Yicheng Ji · Zhe Kong · Yiying Liu* · Jiarui Wang · Jiasun Feng · Lupeng Liu · Xiangyi Wang · Yanjia Li · Yuqing She · Ying Qin · Huan Li · Shuiyang Mao · Wei Liu · Wenhan Luo✉
*Project Leader ✉Corresponding Author
TL;DR: AnyTalker is an audio-driven framework for generating multi-person talking videos. It features a flexible multi-stream structure that scales the number of identities while ensuring seamless inter-identity interactions.
Video Demos (Generated with the 1.3B model; 14B results here)
| Input Image | Generated Video |
|---|---|
| ![]() | weather_en.mp4 |
| ![]() | 2p-0-en.mp4 |
| ![]() | default.mp4 |
🔥 Nov 30, 2025: We release the AnyTalker weights, inference code, technical report, and project page.
- Inference code
- 1.3B Stage 1 Checkpoint (trained exclusively on single-person data)
- Benchmark for evaluating Interactivity
- Technical report (Coming Soon in a few days!)
- 14B Model (Coming soon to the Video Rebirth Creation Platform)
conda create -n AnyTalker python=3.10
conda activate AnyTalker
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt
pip install ninja
pip install flash_attn==2.8.1 --no-build-isolation
conda install -c conda-forge ffmpeg
or
yum install ffmpeg ffmpeg-devel
or
apt-get install ffmpeg
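Before downloading any weights, a quick sanity check like the one below can confirm the environment is usable; it only assumes the conda environment created above is active.

```bash
# Optional sanity check: verify the CUDA build of PyTorch, the flash_attn
# install, and that ffmpeg is on PATH (all installed in the steps above).
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import flash_attn; print(flash_attn.__version__)"
ffmpeg -version | head -n 1
```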
| Models | Download Link | Notes |
|---|---|---|
| Wan2.1-Fun-V1.1-1.3B-InP | 🤗 Huggingface | Base model |
| wav2vec2-base | 🤗 Huggingface | Audio encoder |
| AnyTalker-1.3B | 🤗 Huggingface | Our weights |
Download models using huggingface-cli:
# !pip install -U "huggingface_hub[cli]"
huggingface-cli download alibaba-pai/Wan2.1-Fun-V1.1-1.3B-InP --local-dir ./checkpoints/Wan2.1-Fun-1.3B-Inp
huggingface-cli download facebook/wav2vec2-base-960h --local-dir ./checkpoints/wav2vec2-base-960h
huggingface-cli download zzz66/AnyTalker-1.3B --local-dir ./checkpoints/AnyTalker
The directory should be organized as follows.
checkpoints/
├── Wan2.1-Fun-1.3B-Inp
├── wav2vec2-base-960h
└── AnyTalker
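As a final check before running inference, the snippet below verifies that all three checkpoints are in place; the folder names are taken from the download commands above.

```bash
# Verify the three checkpoint folders expected by the inference script
# (names follow the huggingface-cli commands above).
for d in Wan2.1-Fun-1.3B-Inp wav2vec2-base-960h AnyTalker; do
  if [ -d "./checkpoints/$d" ]; then echo "OK       $d"; else echo "MISSING  $d"; fi
done
```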
The provided script currently performs 480p inference on a single GPU and automatically switches between single-person and multi-person generation modes according to the number of audio tracks in the input list.
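The jobs themselves are described by the JSON file passed via --batch_gen_json; the authoritative schema is the bundled input_example/good.json, and the field names in the sketch below are placeholders only, meant to illustrate that a one-entry audio list yields single-person generation while a two-entry list yields two-person generation.

```bash
# Illustrative only -- consult input_example/good.json for the real field
# names; "ref_image", "audio_list", and "prompt" here are placeholders.
cat > my_jobs.json <<'EOF'
[
  {"ref_image": "person.png", "audio_list": ["solo.wav"], "prompt": "a person talking"},
  {"ref_image": "two_people.png", "audio_list": ["left.wav", "right.wav"], "prompt": "two people chatting"}
]
EOF
```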
#!/bin/bash
export CUDA_VISIBLE_DEVICES=0
python generate_a2v_batch_multiID.py \
--ckpt_dir="./checkpoints/Wan2.1-Fun-1.3B-Inp" \
--task="a2v-1.3B" \
--size="832*480" \
--batch_gen_json="./input_example/good.json" \
--batch_output="./outputs" \
--post_trained_checkpoint_path="./checkpoints/AnyTalker/1_3B-single-v1.pth" \
--sample_fps=24 \
--sample_guide_scale=4.5 \
--offload_model=True \
--base_seed=44 \
--dit_config="./checkpoints/AnyTalker/config_af2v_1_3B.json" \
--det_thresh=0.15 \
--mode="pad" \
--use_half=True
or
sh infer_a2v_1_3B_batch.sh
- --offload_model: whether to offload the model to CPU after each forward pass, reducing GPU memory usage.
- --det_thresh: detection threshold for the InsightFace model; a lower value improves detection on abstract-style images.
- --sample_guide_scale: recommended value is 4.5; applied to both text and audio.
- --mode: select "pad" if every audio track has already been zero-padded to a common length; select "concat" to have the script concatenate each speaker's clips and zero-pad the non-speaking segments to a uniform length (see the ffmpeg sketch after this list).
- --use_half: whether to enable half-precision (FP16) inference for faster generation.
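For the "pad" mode, each speaker's track must span the full timeline, with silence wherever that speaker is quiet. One way to prepare such tracks with the ffmpeg installed above is sketched below; the file names, the 8-second offset, and the 20-second total duration are placeholders, not values from the paper.

```bash
# Sketch: place each speaker's clip at its start time and zero-pad to a
# shared 20 s timeline (all values are placeholders).
# Left speaker talks first, so only pad the tail:
ffmpeg -i left_clip.wav -af "apad=whole_dur=20" audio_left_padded.wav
# Right speaker starts at 8 s: delay all channels by 8000 ms, then pad:
ffmpeg -i right_clip.wav -af "adelay=delays=8000:all=1,apad=whole_dur=20" audio_right_padded.wav
```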
We provide the benchmark used in our paper to evaluate Interactivity, including the dataset and the metric computation script.
python -m pip install -U yt-dlp
cd ./benchmark
python download.py
The directory should be organized as follows.
benchmark/
├── audio_left # Audio for left speaker (zero-padded to full length)
├── audio_right # Audio for right speaker (zero-padded to full length)
├── speaker_duration.json # Start/end timestamps for each speaker
├── interact_11.mp4 # Example video
└── frames # Reference image supplied as the first video frame
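Since both audio folders hold tracks zero-padded to the full clip length, a quick duration check with ffprobe (bundled with the ffmpeg install above) can confirm the download is intact; left/right pairs should report the same length.

```bash
# Print the duration of every benchmark track; padded left/right pairs
# should have identical lengths.
for f in audio_left/* audio_right/*; do
  echo "$(ffprobe -v error -show_entries format=duration -of csv=p=0 "$f")s  $f"
done
```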
# single video
python calculate_interactivity.py --video interact_11.mp4
# entire directory
python calculate_interactivity.py --dir ./your_dir
The script prints the Interactivity score defined in the paper.
Note: generated videos must keep the exact same names listed in speaker_duration.json.
If you find our work useful in your research, please consider citing:
@article{zhong2025anytalker,
title={AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement},
author={Zhong, Zhizhou and Ji, Yicheng and Kong, Zhe and Liu, YiYing and Wang, Jiarui and Feng, Jiasun and Liu, Lupeng and Wang, Xiangyi and Li, Yanjia and She, Yuqing and Qin, Ying and Li, Huan and Mao, Shuiyang and Liu, Wei and Luo, Wenhan},
journal={arXiv preprint},
year={2025}
}
The models in this repository are licensed under the Apache 2.0 License. We claim no rights over your generated content, granting you the freedom to use it provided that your usage complies with the provisions of this license. You are fully accountable for your use of the models, which must not involve sharing any content that violates applicable laws, causes harm to individuals or groups, disseminates personal information intended for harm, spreads misinformation, or targets vulnerable populations.




