
Add Z-Image Turbo demo#115

Open
Honry wants to merge 1 commit into microsoft:main from Honry:z-image-turbo

Conversation


@Honry Honry commented Mar 12, 2026

Models

The base model is Z-Image Turbo.

Z-Image-Turbo – A distilled version of Z-Image that matches or exceeds leading competitors with only 8 NFEs (Number of Function Evaluations). It offers ⚡️sub-second inference latency⚡️ on enterprise-grade H800 GPUs and fits comfortably within 16 GB of VRAM on consumer devices. It excels in photorealistic image generation, bilingual text rendering (English & Chinese), and robust instruction adherence.

It consists of 3 models:

  • Text Encoder: a Qwen3-4B-based model supporting a maximum sequence length of 512.
  • Transformer: Scalable Single-Stream DiT (S3-DiT), where text, image latent, and timestep embeddings are processed together in a single transformer stream.
  • VAE Decoder: Flux VAE
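The "8 NFEs" figure above means the transformer is evaluated only eight times per image. A minimal Euler-style sampling loop illustrates the idea; this is a generic flow-matching sketch, not this demo's actual scheduler, and `denoise` is a dummy stand-in for the S3-DiT transformer:

```javascript
// Hypothetical sketch: an 8-step Euler sampler for a flow-matching model.
// `denoise` stands in for the transformer; the only point illustrated is
// that the model is called exactly `numSteps` times (8 NFEs).
function sampleTurbo(latent, denoise, numSteps = 8) {
  // Linearly spaced timesteps from t=1 (pure noise) down to t=0.
  const ts = Array.from({ length: numSteps + 1 }, (_, i) => 1 - i / numSteps);
  let x = latent.slice(); // copy so the caller's buffer is untouched
  let nfe = 0;
  for (let i = 0; i < numSteps; i++) {
    const dt = ts[i + 1] - ts[i]; // negative step toward t=0
    const v = denoise(x, ts[i]); // one function evaluation
    nfe++;
    x = x.map((xi, j) => xi + dt * v[j]); // Euler update
  }
  return { x, nfe };
}
```

With a constant velocity field `v = 1`, eight steps of size −1/8 move each latent element from 0 to −1, and `nfe` comes back as 8.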

We converted the model to ONNX format with several optimizations:

  • Int4 Quantization Support: Enables MatMulNBits quantization for text encoder and transformer.
  • FP16 Quantization Support: Converts most weights to fp16 except those precision-sensitive ones.
  • OP Fusion: Fuses nodes into high-level ONNX ops: MultiHeadAttention, SimplifiedLayerNormalization, RotaryEmbedding, etc.
  • Model Pruning: Eliminates unused nodes and inputs/outputs from the Text Encoder.
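To see why int4 quantization shrinks the large MatMuls so much, here is a rough size estimate. The block layout (block size 32 along K, one fp16 scale per block, no zero points) is an assumption for illustration, not a statement of the exact layout these models use:

```javascript
// Rough size estimate for an int4 MatMulNBits weight, assuming blocks of
// `blockSize` elements along K with one fp16 scale per block and no zero
// points (layout details are assumptions for illustration, not a spec).
function matmulNBitsBytes(K, N, blockSize = 32) {
  const blocksPerCol = Math.ceil(K / blockSize);
  const packedWeights = N * blocksPerCol * (blockSize / 2); // 2 int4 values per byte
  const scales = N * blocksPerCol * 2; // one fp16 scale per block
  return packedWeights + scales;
}

function fp16Bytes(K, N) {
  return K * N * 2; // 2 bytes per fp16 weight
}

// Example: a 4096x4096 projection.
const q4 = matmulNBitsBytes(4096, 4096); // 9,437,184 bytes ≈ 9.4 MB
const f16 = fp16Bytes(4096, 4096); // 33,554,432 bytes = 32 MiB
```

Under these assumptions an int4 layer is roughly 3.6x smaller than its fp16 counterpart, which is consistent with the q4f16 text encoder and transformer sizes listed below.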

ONNX models have been published at https://huggingface.co/webnn/Z-Image-Turbo.

Model Size:

  • Text Encoder (q4f16): 2.06 GB
  • Transformer (q4f16): 3.44 GB
  • VAE Decoder (fp16): 94.6 MB
  • Safety Checker (fp16): 580 MB
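Adding the listed artifact sizes together gives the total weight footprint a client has to download and hold:

```javascript
// Sum of the published artifact sizes listed above (in GB).
const sizesGB = {
  textEncoder: 2.06,
  transformer: 3.44,
  vaeDecoder: 0.0946, // 94.6 MB
  safetyChecker: 0.58, // 580 MB
};
const totalGB = Object.values(sizesGB).reduce((a, b) => a + b, 0);
// totalGB ≈ 6.17 GB of model weights in total.
```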

Demo

The demo is based on the SDXL-Turbo demo, with some UI adjustments and a pipeline optimized for performance and memory efficiency through the following strategies:

  1. Enhance the pre- and post-processing stages between the models to minimize memory copying, by creating several mini ONNX models that connect all models end to end. Thus we only need to read back the output of the final VAE decoder model.
  2. Use pre-allocated input and output tensors to improve memory efficiency.
  3. Support 512x512 and 1024x1024 resolutions, along with a configurable number of steps. (Note: more steps yield better image quality at the cost of slower performance.)
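Strategy 2 (pre-allocated tensors) can be sketched as a simple buffer pool that hands back the same allocation on every generation step instead of allocating fresh tensors per run. This is an illustrative sketch with hypothetical names, not the demo's actual code:

```javascript
// Hypothetical sketch of pre-allocated I/O buffers: allocate once per
// (name, length) pair, reuse on every subsequent request. Float32Array is
// used here for simplicity in place of fp16-backed GPU tensors.
class TensorPool {
  constructor() {
    this.buffers = new Map();
  }
  get(name, length) {
    let buf = this.buffers.get(name);
    if (!buf || buf.length !== length) {
      buf = new Float32Array(length); // allocate only on first use
      this.buffers.set(name, buf);
    }
    return buf;
  }
}

const pool = new TensorPool();
const a = pool.get('latent', 16 * 64 * 64);
const b = pool.get('latent', 16 * 64 * 64);
// a === b: the second request reuses the same allocation, so repeated
// generation steps create no new garbage.
```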

By default the demo uses the WebGPU EP; the WebNN EP is not available yet because it depends on dynamic shape support (the user prompt is a dynamic-shape input).
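Since the dynamic shape comes from the variable-length prompt, one common workaround for fixed-shape backends (an assumption about what could be done, not what this demo does) is to pad the tokenized prompt to the 512-token maximum and supply an attention mask:

```javascript
// Hypothetical workaround sketch: pad token ids to a fixed maximum length
// so the text encoder sees a static input shape; the attention mask marks
// which positions are real tokens. Names and padId are illustrative.
function padToMax(tokenIds, maxLen = 512, padId = 0) {
  const ids = new Array(maxLen).fill(padId);
  const mask = new Array(maxLen).fill(0);
  for (let i = 0; i < Math.min(tokenIds.length, maxLen); i++) {
    ids[i] = tokenIds[i];
    mask[i] = 1; // real token
  }
  return { ids, mask };
}
```

The trade-off is that the encoder always runs at the full 512-token cost, regardless of prompt length.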

RAM requirements (WebGPU):

  • For 512x512 resolution, the device needs at least 32 GB of RAM.
  • For 1024x1024 resolution, the device needs at least 64 GB of RAM.
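For context, the latent itself is tiny; the RAM figures above are dominated by the model weights and intermediate activations. A back-of-envelope calculation, assuming a Flux-style VAE with an 8x spatial downsample and 16 latent channels (an assumption for illustration):

```javascript
// Back-of-envelope activation sizes, assuming an 8x VAE downsample and
// 16 latent channels (assumptions; this ignores transformer activations,
// which dominate the actual RAM requirements).
function fp16LatentBytes(size) {
  const h = size / 8, w = size / 8;
  return 16 * h * w * 2; // channels * H * W * 2 bytes (fp16)
}
function rgbImageBytes(size) {
  return 3 * size * size * 4; // float32 RGB before uint8 conversion
}

fp16LatentBytes(512);  // 131,072 bytes = 128 KB
fp16LatentBytes(1024); // 524,288 bytes = 512 KB
rgbImageBytes(512);    // 3,145,728 bytes = 3 MB
```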

Preview

https://honry.github.io/webnn-developer-preview/demos/z-image-turbo/

The base model is: https://huggingface.co/Tongyi-MAI/Z-Image-Turbo

This is a JavaScript demo of Z-Image Turbo accelerated by WebNN and WebGPU.

By default it uses WebGPU; WebNN is not available yet because it depends on
dynamic shape support.

ONNX models: https://huggingface.co/webnn/Z-Image-Turbo

Co-authored-by: Belem Zhang <belem.zhang@intel.com>

Honry commented Mar 12, 2026

@fdwr, PTAL, thanks!
