UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios
UltraFlux is a diffusion transformer that extends Flux backbones to native 4K synthesis with consistent quality across a wide range of aspect ratios. The project unifies data, architecture, objectives, and optimization so that positional encoding, VAE compression, and loss design reinforce each other rather than compete.
Tian Ye1*‡, Song Fei1*, Lei Zhu1,2†

1The Hong Kong University of Science and Technology (Guangzhou)
2The Hong Kong University of Science and Technology

*Equal Contribution, ‡Project Leader, †Corresponding Author
[2026.04.09] - UltraFlux is selected as CVPR 2026 Highlight (top 3%).
[2026.04.01] - We released the MultiAspect-4K-1M dataset and the filtering pipeline.
[2026.02.21] - UltraFlux is accepted by CVPR'26.
[2025.12.17] - Thanks to the community's help, we fixed the implementation of Resonance alignment for the 2D RoPE.
[2025.11.26] - Thanks to smthemex for developing ComfyUI_UltraFlux T2I&I2I, which enables UltraFlux to run with as little as 8 GB of memory through the GGUF integration!
[2025.11.21] - We released the UltraFlux-v1.1 transformer checkpoint. It is fine-tuned on a carefully curated set of high-aesthetic synthetic images to further improve visual aesthetics and composition quality. Enable it by uncommenting the corresponding lines in inf_ultraflux.py!
[2025.11.20] – We released the UltraFlux-v1 checkpoint, inference code, and the accompanying tech report.
- The script `inf_ultraflux.py` downloads the latest `Owen777/UltraFlux-v1` weights (transformer + VAE) and runs a set of curated prompts.
- Ensure PyTorch, `diffusers`, and CUDA are available, then run `python inf_ultraflux.py`.
- Generated images are saved to `results/ultra_flux_*.jpeg` at 4096×4096 resolution; edit the prompt list or pipeline arguments inside the script to customize inference.
We have released the MultiAspect-4K-1M dataset, together with the filtering pipeline.
Each sample in MultiAspect-4K-1M provides an `image_url` for downloading the image. The metadata also includes bilingual captions, a character tag, VLM-based quality and aesthetic scores, and classical interpretable signals such as flatness and information entropy. To respect image provenance and the original creators, about 98% of the dataset also carries source attribution metadata: `work_url` points to the webpage where the image was originally published, `photographer` gives the creator's name, and `photographer_url` links to the creator's profile or source page.
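A minimal sketch of subset selection over such metadata. Note: apart from `image_url`, `work_url`, `photographer`, and `photographer_url`, the key names used here (`aesthetic_score`, `quality_score`, `flatness`) are assumptions for illustration, not the dataset's actual schema.

```python
# Toy metadata records; score-field names are hypothetical stand-ins
# for the dataset's real keys.
records = [
    {"image_url": "https://example.com/a.jpg", "aesthetic_score": 6.1,
     "quality_score": 0.92, "flatness": 0.05},
    {"image_url": "https://example.com/b.jpg", "aesthetic_score": 4.2,
     "quality_score": 0.55, "flatness": 0.40},
]

def keep(rec, min_aesthetic=5.0, min_quality=0.8, max_flatness=0.2):
    """Keep high-aesthetic, high-quality, non-flat images."""
    return (rec["aesthetic_score"] >= min_aesthetic
            and rec["quality_score"] >= min_quality
            and rec["flatness"] <= max_flatness)

subset = [r["image_url"] for r in records if keep(r)]
```

Thresholds like these can be tuned per training stage to carve out, e.g., a high-aesthetic subset for late-stage fine-tuning.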
Images can be downloaded and filtering scores can be computed with:
# download the image
python tools/download_from_image_url.py "image_url in metadata"
# compute filtering scores
python tools/filtering_pipeline.py /path/to/image.jpg

- 4K positional robustness. Resonance 2D RoPE with YaRN keeps training-window awareness while remaining band-aware and aspect-ratio aware to avoid ghosting.
- Detail-preserving compression. A lightweight, non-adversarial post-training routine sharpens Flux VAE reconstructions at 4K without sacrificing throughput, resolving the usual trade-off between speed and micro-detail.
- 4K-aware objectives. The SNR-Aware Huber Wavelet Training Objective emphasizes high-frequency fidelity in the latent space so gradients stay balanced across timesteps and frequency bands.
- Aesthetic-aware scheduling. Stage-wise Aesthetic Curriculum Learning (SACL) routes high-aesthetic supervision toward high-noise steps, sculpting the model prior where it matters most for vivid detail and alignment.
- Scale and coverage. 1M native-4K and near-4K images with controlled aspect-ratio sampling, so that wide and portrait regimes are equally represented.
- Content balance. A dual-channel collection pipeline debiases landscape-heavy sources toward human-centric content.
- Rich metadata. Every sample includes bilingual captions, subject tags, CLIP/VLM-based quality and aesthetic scores, and classical IQA metrics, enabling targeted subset sampling for specific training stages.
- Backbone. Flux-style DiT trained directly on MultiAspect-4K-1M with token-efficient blocks and Resonance 2D RoPE + YaRN for AR-aware positional encoding.
- Objective. SNR-Aware Huber Wavelet loss aligns gradient magnitudes with 4K statistics, reinforcing high-frequency fidelity under strong VAE compression.
- Curriculum. SACL injects high-aesthetic data primarily into high-noise timesteps so the model’s prior captures human-desired structure early in the trajectory.
- VAE Post-training. A simple, non-adversarial fine-tuning pass boosts 4K reconstruction quality while keeping inference cost low.
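As a rough illustration of the objective above, the sketch below combines a one-level Haar wavelet split with a per-band Huber loss and an SNR-dependent timestep weight. This is not the paper's implementation: the band weights, the Huber `delta`, and the Min-SNR-style clamp are all assumed values chosen for the example.

```python
import torch
import torch.nn.functional as F

def haar_bands(x):
    """One-level Haar split of an NCHW tensor into (LL, LH, HL, HH) bands."""
    a, b = x[..., ::2, :], x[..., 1::2, :]            # even / odd rows
    lo, hi = (a + b) / 2, (a - b) / 2                 # vertical low / high pass
    ll, lh = (lo[..., ::2] + lo[..., 1::2]) / 2, (lo[..., ::2] - lo[..., 1::2]) / 2
    hl, hh = (hi[..., ::2] + hi[..., 1::2]) / 2, (hi[..., ::2] - hi[..., 1::2]) / 2
    return ll, lh, hl, hh

def snr_huber_wavelet_loss(pred, target, snr, hf_weight=2.0, delta=0.5):
    """Huber loss per Haar band, up-weighting the high-frequency bands
    (hf_weight is an assumed constant) and damping low-noise timesteps
    with a Min-SNR-style clamp (assumed form, not the paper's)."""
    weights = [1.0, hf_weight, hf_weight, hf_weight]  # LL, LH, HL, HH
    per_sample = 0.0
    for w, p, t in zip(weights, haar_bands(pred), haar_bands(target)):
        per_sample = per_sample + w * F.huber_loss(
            p, t, delta=delta, reduction="none").mean(dim=(1, 2, 3))
    timestep_w = torch.clamp(snr, max=5.0) / snr      # per-sample SNR weight
    return (timestep_w * per_sample).mean()
```

The high-frequency up-weighting is what pushes gradient mass toward fine detail that strong VAE compression tends to wash out, while the SNR clamp keeps low-noise steps from dominating training.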
UltraFlux surpasses recent native-4K and training-free scaling baselines on standard 4K benchmarks spanning:
- Image fidelity at 4096×4096 and higher
- Aesthetic preference scores
- Text-image alignment metrics across diverse aspect ratios
We will release the full stack upon publication:
- MultiAspect-4K-1M dataset with metadata loaders
- Training pipelines
- Evaluation code covering fidelity, aesthetic, and alignment metrics
To foster research and the open-source community, we plan to open-source the entire project, including training code, inference code, and model weights. Thank you for your patience and support! 🌟
- Release GitHub repo.
- Release inference code (`inf_ultraflux.py`).
- Release training code.
- Release model checkpoints.
- Release arXiv paper.
- Release HuggingFace Space demo.
- Release dataset (MultiAspect-4K-1M).
Stay tuned for links and usage instructions. For updates, please watch this repository or open an issue.
We are grateful to the following projects:
@misc{ye2025ultrafluxdatamodelcodesignhighquality,
title={UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios},
author={Tian Ye and Song Fei and Lei Zhu},
year={2025},
eprint={2511.18050},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.18050},
}