Huge thanks to the original authors of SyncNet for this awesome codebase!
We use SyncNet as a standard evaluation metric for Talking Character Generation models.
To support broader adoption and easier integration into modern pipelines, we're actively maintaining and modernizing this classic implementation.
Check out the updated version here: MoChaBench
If you find our work helpful, please consider citing both the original SyncNet and MoCha in your research. Your support means a lot!
The implementation follows a Hugging Face Diffusers-style structure.
We provide a SyncNetPipeline class, located at eval-lipsync/script/syncnet_pipeline.py.
You can initialize SyncNetPipeline by providing the weights and configs:
```python
pipe = SyncNetPipeline(
    {
        "s3fd_weights": "path to sfd_face.pth",
        "syncnet_weights": "path to syncnet_v2.model",
    },
    device="cuda",  # or "cpu"
)
```
The pipeline offers an inference method that scores a single pair of video and speech. For a fair comparison, the input speech should be a denoised vocal source extracted from your audio; you can use a separator such as Kim_Vocal_2 for general noise removal and Demucs_mdx_extra for music removal.
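As a rough sketch of that preparation step (assuming ffmpeg and the demucs CLI are installed; the file names, the mdx_extra model, and the output layout are illustrative and may differ across versions):

```python
import subprocess

# Extract a mono 16 kHz audio track from the generated video.
subprocess.run(
    ["ffmpeg", "-y", "-i", "video.mp4", "-vn", "-ac", "1", "-ar", "16000", "raw_audio.wav"],
    check=True,
)

# Split vocals from music/noise with demucs (Kim_Vocal_2 via UVR / audio-separator
# is an alternative for general noise removal).
subprocess.run(
    ["demucs", "-n", "mdx_extra", "--two-stems", "vocals", "-o", "separated", "raw_audio.wav"],
    check=True,
)
# The denoised speech is then typically found at separated/mdx_extra/raw_audio/vocals.wav.
```

With a denoised speech track prepared, a single video/speech pair is scored like this: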
```python
av_off, sync_confs, sync_dists, best_conf, min_dist, s3fd_json, has_face = pipe.inference(
    video_path="path to video.mp4",    # RGB video
    audio_path="path to speech.wav",   # speech track (denoised, any ffmpeg-readable format)
    cache_dir="path to store intermediate output",  # optional; omit to auto-clean intermediates
)
```
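The sketch below shows one way these outputs might feed an evaluation run, assuming av_off is the estimated audio-video offset in frames, best_conf the peak sync confidence (higher is better), min_dist the minimum audio-visual feature distance (lower is better), and has_face a face-detection flag; the pair list and metric names are illustrative:

```python
# Hypothetical evaluation set of (generated video, denoised speech) pairs.
pairs = [("gen_0001.mp4", "speech_0001.wav"), ("gen_0002.mp4", "speech_0002.wav")]

confs, dists = [], []
for video, audio in pairs:
    av_off, sync_confs, sync_dists, best_conf, min_dist, s3fd_json, has_face = pipe.inference(
        video_path=video,
        audio_path=audio,
    )
    if not has_face:           # skip clips where no usable face track was detected
        continue
    confs.append(best_conf)    # often reported as Sync-C (higher is better)
    dists.append(min_dist)     # often reported as Sync-D (lower is better)

print(f"Sync-C: {sum(confs) / len(confs):.3f}")
print(f"Sync-D: {sum(dists) / len(dists):.3f}")
```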