A Gradio web UI for Z-Image-Turbo on Intel XPU (e.g. Intel Arc B580) using native PyTorch XPU support.
| Component | Source | Implementation |
|---|---|---|
| Transformer | Comfy-Org/z_image_turbo (single .safetensors) | modules/transformer.py — self-contained nn modules ported from diffusers ZImagePipeline |
| Scheduler | — | modules/scheduler.py — FlowMatchEulerDiscreteScheduler with exponential shift |
| Text encoder | Tongyi-MAI/Z-Image-Turbo | Qwen3 loaded from local .safetensors with transformers.AutoModel.from_config(...) + load_state_dict(...) |
| VAE | Tongyi-MAI/Z-Image-Turbo | AutoencoderKL loaded from local .safetensors + local config |
- Intel Arc GPU with up-to-date drivers
- Python 3.10+
```
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/xpu
pip install -r requirements.txt
python app.py
```

Open your browser at http://localhost:7860.
All model files must exist locally before launch.
For low-VRAM GPUs, runtime stages are loaded on-demand during generation:
- Text-encoding stage: load the tokenizer and text encoder, encode the prompt, then unload both.
- Transformer stage (preloaded to CPU at startup):
  - Mode `offload` (default): stream blocks CPU→GPU during denoising, minimizing VRAM.
  - Mode `persistent`: keep the full transformer resident on GPU across generations (higher VRAM, lower latency).

  The default mode prioritizes compatibility with lower-VRAM GPUs.
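The `offload` mode can be sketched as follows — a minimal, hypothetical illustration of streaming transformer blocks CPU→GPU one at a time, not the repo's actual `modules/transformer.py` code (`TinyBlock` and `run_offloaded` are invented names):

```python
# Sketch of the "offload" runtime mode: blocks live on CPU and are moved to
# the GPU one at a time, so only one block's weights occupy VRAM at once.
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """Stand-in for one transformer block."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        return x + self.proj(x)

def run_offloaded(blocks: nn.ModuleList, x: torch.Tensor, device: torch.device):
    x = x.to(device)
    for block in blocks:
        block.to(device)   # stream weights CPU -> GPU
        x = block(x)
        block.to("cpu")    # free VRAM before loading the next block
    return x.cpu()

# Fall back to CPU when no XPU device is present.
device = torch.device("xpu") if hasattr(torch, "xpu") and torch.xpu.is_available() else torch.device("cpu")
blocks = nn.ModuleList(TinyBlock(8) for _ in range(3))
out = run_offloaded(blocks, torch.randn(2, 8), device)
```

`persistent` mode corresponds to calling `blocks.to(device)` once and skipping the per-block moves, trading VRAM for latency.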
Expected local layout:
```
models/
  diffusion_models/
    z_image_turbo_bf16.safetensors
  text_encoders/
    qwen_3_4b.safetensors
    config.json            # optional; built-in fallback exists for Z-Image-Turbo
    tokenizer.json
    tokenizer_config.json
    merges.txt
    vocab.json
  vae/
    ae.safetensors
    config.json            # optional; built-in fallback exists for Z-Image-Turbo
```
The tokenizer files (tokenizer.json, tokenizer_config.json, merges.txt, vocab.json) come from the upstream tokenizer/ folder (https://huggingface.co/Tongyi-MAI/Z-Image-Turbo/tree/main/tokenizer). The text encoder and VAE weights are loaded directly from local .safetensors files rather than downloaded via from_pretrained(...).
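Since all model files must exist locally before launch, a pre-launch check like the following sketch could verify the layout above (the `missing_files` helper is illustrative, not part of the repo; the optional config.json files are deliberately excluded):

```python
# Verify that the required model files from the expected layout are present.
from pathlib import Path

REQUIRED = [
    "diffusion_models/z_image_turbo_bf16.safetensors",
    "text_encoders/qwen_3_4b.safetensors",
    "text_encoders/tokenizer.json",
    "text_encoders/tokenizer_config.json",
    "text_encoders/merges.txt",
    "text_encoders/vocab.json",
    "vae/ae.safetensors",
]

def missing_files(root: str = "models") -> list[str]:
    """Return the relative paths of required files that are absent."""
    base = Path(root)
    return [rel for rel in REQUIRED if not (base / rel).exists()]
```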
| Parameter | Default | Description |
|---|---|---|
| Prompt | — | Text description of the image to generate |
| Negative prompt | — | Features to suppress (only used when Guidance scale > 0) |
| Width / Height | 1024 | Output resolution; must be multiples of 16, max 1536 |
| Inference steps | 9 | Recommended by upstream; more steps = higher quality |
| Guidance scale | 0.0 | CFG weight — 0 = turbo mode (no classifier-free guidance) |
| Transformer runtime mode | offload | offload (default, lower VRAM) or persistent (higher VRAM, faster after first load) |
| Seed | -1 | Fixed seed for reproducibility; -1 uses a random seed |
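The resolution and seed rules in the table can be expressed as a small sketch (function names are illustrative, not the app's actual API):

```python
# Width/height must be multiples of 16 and at most 1536; seed == -1 means
# "pick a random seed", anything else is used verbatim for reproducibility.
import random

def validate_resolution(width: int, height: int) -> None:
    for name, value in (("width", width), ("height", height)):
        if value % 16 != 0 or not 16 <= value <= 1536:
            raise ValueError(f"{name} must be a multiple of 16 in [16, 1536], got {value}")

def resolve_seed(seed: int) -> int:
    return random.randrange(2**31) if seed == -1 else seed

validate_resolution(1024, 1024)   # the default resolution passes
assert resolve_seed(42) == 42     # a fixed seed is returned unchanged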
Z-Image-Turbo is a flow-matching text-to-image model by Alibaba Tongyi.
The transformer uses a single-stream architecture with adaLN modulation, RoPE position embeddings, and a Qwen3 text encoder.