
ComfyUI-Kandinsky


A custom node for ComfyUI that integrates Kandinsky 5.0, a powerful family of open-source text-to-video diffusion models.



About The Project

This project brings the state-of-the-art Kandinsky 5.0 T2V Lite text-to-video model into the ComfyUI ecosystem. Kandinsky 5 is a latent diffusion pipeline built on a Flow Matching and Diffusion Transformer (DiT) backbone, capable of generating high-quality video from text prompts.

It leverages a powerful combination of Qwen2.5-VL and CLIP for text conditioning and the HunyuanVideo VAE for latent space encoding, enabling a nuanced understanding of prompts and impressive visual results.

(Example workflow screenshot)

This custom node suite provides all the necessary tools to run the Kandinsky 5 pipeline natively in ComfyUI, including a custom sampler for its specific inference loop and efficient memory management to run on consumer-grade hardware.
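To build intuition for the inference loop described above, here is a minimal, hypothetical sketch of flow-matching sampling with Euler integration. The `toy_velocity` function is a stand-in for the DiT (which in the real pipeline is conditioned on text embeddings); the names and time convention are illustrative, not this node's actual API.

```python
# Minimal, illustrative sketch of a flow-matching inference loop
# (hypothetical helper -- NOT this node's actual sampler API).
# The DiT predicts a velocity field v(x, t); integrating it with
# plain Euler steps carries the latent from noise at t = 1 toward
# a clean sample at t = 0.

def toy_velocity(x, t):
    # Stand-in for the DiT; the real model conditions on text embeddings.
    return [-xi for xi in x]

def flow_match_sample(x, steps, velocity=toy_velocity):
    dt = 1.0 / steps
    t = 1.0
    for _ in range(steps):
        v = velocity(x, t)                          # model forward pass
        x = [xi + dt * vi for xi, vi in zip(x, v)]  # Euler update
        t -= dt
    return x

latent = flow_match_sample([1.0, -2.0, 0.5], steps=50)
```

With this toy velocity the state simply decays toward zero; in the real pipeline the velocity field is learned, and the final latent is decoded to pixels by the HunyuanVideo VAE.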

✨ Key Features:

  • Native Kandinsky 5.0 Integration: runs the full Kandinsky 5 T2V Lite pipeline inside ComfyUI.
  • High-Quality Video Generation: text-to-video output from the Flow Matching DiT backbone.
  • Custom Sampler Node: implements the model's specific inference loop.
  • Efficient Memory Management: designed to run on consumer-grade hardware.
  • Multiple Model Variants: supports SFT (high quality), no-CFG (faster), and distilled (fastest) model versions.
  • Familiar ComfyUI Workflow: loader, text-encode, latent, and sampler nodes follow standard ComfyUI conventions.

(back to top)

🚀 Getting Started

The easiest way to install is via ComfyUI Manager. Search for ComfyUI-Kandinsky and click "Install".

Alternatively, to install manually:

  1. Clone the Repository: Navigate to your ComfyUI/custom_nodes/ directory and clone this repository:

    git clone https://github.com/wildminder/ComfyUI-Kandinsky.git
  2. Install Dependencies: This node relies on packages from the original Kandinsky repository. Navigate into the cloned ComfyUI-Kandinsky directory and install the required dependencies:

    cd ComfyUI-Kandinsky
    pip install -r requirements.txt
  3. Download Models: This node does not automatically download models. You must download the required models and place them in the correct ComfyUI directories. See the Model Zoo table below for links.

    • Place Kandinsky DiT models (.safetensors) in ComfyUI/models/diffusion_models/kandinsky/.
    • Place the HunyuanVideo VAE in ComfyUI/models/vae/.
    • Place the CLIP-L and Qwen2.5-VL text encoders in ComfyUI/models/clip/.
  4. Start/Restart ComfyUI: Launch ComfyUI. The Kandinsky nodes will appear under the Kandinsky category.
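After placing the files, a quick sanity check like the following (a hypothetical helper, not part of this node) can confirm the folders from step 3 exist and are non-empty. Run it from your ComfyUI root.

```python
# Hypothetical helper (not part of this node) to sanity-check that the
# model folders from step 3 exist and contain files.
from pathlib import Path

def check_model_dirs(comfy_root="."):
    """Map each expected model folder to True if it exists and is non-empty."""
    expected = [
        "models/diffusion_models/kandinsky",  # Kandinsky DiT .safetensors
        "models/vae",                         # HunyuanVideo VAE
        "models/clip",                        # CLIP-L and Qwen2.5-VL encoders
    ]
    root = Path(comfy_root)
    return {d: (root / d).is_dir() and any((root / d).iterdir())
            for d in expected}

for folder, ok in check_model_dirs().items():
    print(f"{folder}: {'ok' if ok else 'missing or empty'}")
```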

Model Zoo

The Kandinsky 5 Loader node uses the config name to identify the correct checkpoint file from the kandinsky/ subdirectory in your diffusion_models folder.

Kandinsky DiT Models

| Model | Config Name | Duration | Hugging Face Link |
|---|---|---|---|
| Kandinsky 5.0 T2V Lite SFT 5s | config_5s_sft.yaml | 5s | 🤗 HF |
| Kandinsky 5.0 T2V Lite SFT 10s | config_10s_sft.yaml | 10s | 🤗 HF |
| Kandinsky 5.0 T2V Lite pretrain 5s | config_5s_pretrain.yaml | 5s | 🤗 HF |
| Kandinsky 5.0 T2V Lite pretrain 10s | config_10s_pretrain.yaml | 10s | 🤗 HF |
| Kandinsky 5.0 T2V Lite no-CFG 5s | config_5s_nocfg.yaml | 5s | 🤗 HF |
| Kandinsky 5.0 T2V Lite no-CFG 10s | config_10s_nocfg.yaml | 10s | 🤗 HF |
| Kandinsky 5.0 T2V Lite distill 5s | config_5s_distil.yaml | 5s | 🤗 HF |
| Kandinsky 5.0 T2V Lite distill 10s | config_10s_distil.yaml | 10s | 🤗 HF |
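The config names in the table follow a regular pattern; a small illustrative helper (hypothetical, not the loader's actual resolution code) shows how a duration and variant combine into a config name:

```python
# Illustrative helper showing the config naming pattern from the table
# above (hypothetical -- the loader's real resolution logic may differ).
def config_for_variant(duration_s, variant):
    """e.g. config_for_variant(5, "sft") -> "config_5s_sft.yaml" """
    known = {"sft", "pretrain", "nocfg", "distil"}
    if variant not in known:
        raise ValueError(f"unknown variant: {variant!r}")
    return f"config_{duration_s}s_{variant}.yaml"
```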

Required Dependency Models

These are common models used in many ComfyUI workflows and are required for the Kandinsky pipeline.

| Model | Purpose | Hugging Face Link |
|---|---|---|
| HunyuanVideo VAE | Latent Encoding/Decoding | 🤗 HF |
| HunyuanVideo VAE bf16 | Latent Encoding/Decoding | 🤗 HF (ComfyUI) |
| CLIP-ViT-L-14 | Text Conditioning | 🤗 HF |
| Qwen2.5-VL-7B fp8 scaled | Text Conditioning | 🤗 HF (ComfyUI) |
| Qwen2.5-VL-7B bf16 | Text Conditioning | 🤗 HF (Kijai) |

(back to top)

🛠️ Node Parameters

Note

Output quality depends heavily on the prompt: the result is shaped by both your user prompt and the underlying system prompt applied by the Qwen2.5-VL encoder. Experiment with descriptive phrasing to achieve the best results.

Kandinsky 5 Loader

  • variant: Select the Kandinsky DiT model variant to load. The name corresponds to the config files.

Kandinsky 5 Text Encode

  • clip: The standard CLIP-L model.
  • qwen_vl: The Qwen2.5-VL model. Must be loaded with the qwen_image type in the CLIPLoader node.
  • text: The positive text prompt describing the desired video.
  • negative_text: The negative text prompt describing what to avoid.
  • content_type: Sets the internal prompt template for either video or image generation.

Empty Kandinsky 5 Latent

  • width/height: The dimensions of the video to be generated.
  • time_length: The desired duration of the video in seconds. Set to 0 for single image generation.
  • batch_size: The number of videos to generate in one run.

Kandinsky 5 Sampler

  • seed: The random seed used for creating the initial noise.
  • steps: The number of sampling steps. Should generally match the model type (e.g., 50 for sft models, 16 for distill models).
  • cfg: Classifier-Free Guidance scale. Higher values increase adherence to the prompt.
  • scheduler_scale: Controls the timestep distribution during sampling.
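As background for the cfg parameter, classifier-free guidance combines an unconditional prediction with a text-conditioned one and extrapolates toward the prompt. The sketch below uses toy lists in place of the model's actual outputs and is a hypothetical illustration, not this node's implementation.

```python
# Background sketch of classifier-free guidance (CFG), hypothetical
# helper: the guided prediction moves away from the unconditional
# branch and toward the text-conditioned one by a factor of cfg_scale.
def apply_cfg(uncond, cond, cfg_scale):
    return [u + cfg_scale * (c - u) for u, c in zip(uncond, cond)]

# cfg_scale = 1.0 reproduces the conditional prediction unchanged;
# larger values push harder toward the prompt.
```

Note that CFG requires two model evaluations per step; the no-CFG variants avoid the extra unconditional pass, which is part of why they are faster.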

(back to top)

📊 Performance

Video generation is computationally intensive. As a baseline, generating a 5-second video (768x512) with the pretrain_5s model on an NVIDIA 4070Ti (16GB VRAM) can take approximately 15 minutes. Distilled models will be significantly faster.

(back to top)

⚠️ Risks and Limitations

  • Potential for Misuse: The ability to generate video from text could be misused. Users of this node must not use it to create content that infringes upon the rights of individuals or is intended to mislead or harm. It is strictly forbidden to use this for any illegal or unethical purposes.
  • Technical Limitations: The model may occasionally struggle with very long, complex prompts or maintaining perfect temporal consistency.
  • Language Support: The model is trained primarily on English and has a strong understanding of Russian concepts. Performance on other languages is not guaranteed.
  • This node is released for research and development purposes. Please use it responsibly.

(back to top)

══════════════════════════════════

Beyond the code, I believe in the power of community and continuous learning. You are invited to join 'TokenDiff AI News' and the 'TokenDiff Community Hub'.

TokenDiff AI News


🗞️ AI for every home, creativity for every mind!

TokenDiff Community Hub


💬 Questions, help, and thoughtful discussion.

══════════════════════════════════

License

This custom node is subject to its own repository license. The Kandinsky 5 model and its components are subject to the license provided by the original authors at the AI Forever Kandinsky-5 repository.

(back to top)

Acknowledgments

  • The AI Forever team for creating and open-sourcing the incredible Kandinsky 5 project.
  • Qwen Team for Qwen2.5-VL.
  • OpenAI for CLIP.
  • Tencent for the HunyuanVideo VAE.
  • The ComfyUI team for their powerful and extensible platform.

(back to top)
