UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios
UltraFlux is a diffusion transformer that extends Flux backbones to native 4K synthesis with consistent quality across a wide range of aspect ratios. The project unifies data, architecture, objectives, and optimization so that positional encoding, VAE compression, and loss design reinforce each other rather than compete.
Tian Ye1*‡, Song Fei1*, Lei Zhu1,2†

1The Hong Kong University of Science and Technology (Guangzhou)
2The Hong Kong University of Science and Technology

*Equal Contribution, ‡Project Leader, †Corresponding Author
[2026.04.09] - UltraFlux is selected as CVPR 2026 Highlight (top 3%).
[2026.04.01] - We released the MultiAspect-4K-1M dataset and the filtering pipeline.
[2026.02.21] - UltraFlux is accepted by CVPR'26.
[2025.12.17] - Thanks to the community's help, we fixed the implementation of Resonance alignment for the 2D RoPE.
[2025.11.26] - Thanks to smthemex for developing ComfyUI_UltraFlux T2I&I2I, which enables UltraFlux to run with as little as 8 GB of memory through the GGUF integration!
[2025.11.21] - We released the UltraFlux-v1.1 transformer checkpoint. It is fine-tuned on a carefully curated set of high-aesthetic synthetic images to further improve visual aesthetics and composition quality. Enable it by uncommenting the corresponding lines in inf_ultraflux.py!
[2025.11.20] – We released the UltraFlux-v1 checkpoint, inference code, and the accompanying tech report.
- The script `inf_ultraflux.py` downloads the latest `Owen777/UltraFlux-v1` weights (transformer + VAE) and runs a set of curated prompts.
- Ensure PyTorch, `diffusers`, and CUDA are available, then run `python inf_ultraflux.py`.
- Generated images are saved to `results/ultra_flux_*.jpeg` at 4096×4096 resolution; edit the prompt list or pipeline arguments inside the script to customize inference.
We have released the MultiAspect-4K-1M dataset, together with the filtering pipeline.
Each sample in MultiAspect-4K-1M provides an `image_url` for downloading the image. The metadata also includes bilingual captions, a character tag, VLM-based quality and aesthetic scores, and classical interpretable signals such as flatness and information entropy. To respect image provenance and the original creators, about 98% of the dataset also carries source attribution metadata: `work_url` points to the webpage where the image was originally published, `photographer` gives the creator's name, and `photographer_url` links to the creator's profile or source page.
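A minimal sketch of subset selection over such metadata. Note: apart from `image_url`, `work_url`, `photographer`, and `photographer_url`, the key names used here (`aesthetic_score`, `quality_score`, `flatness`) are assumptions for illustration, not the dataset's actual schema.

```python
# Toy metadata records; score-field names are hypothetical stand-ins
# for the dataset's real keys.
records = [
    {"image_url": "https://example.com/a.jpg", "aesthetic_score": 6.1,
     "quality_score": 0.92, "flatness": 0.05},
    {"image_url": "https://example.com/b.jpg", "aesthetic_score": 4.2,
     "quality_score": 0.55, "flatness": 0.40},
]

def keep(rec, min_aesthetic=5.0, min_quality=0.8, max_flatness=0.2):
    """Keep high-aesthetic, high-quality, non-flat images."""
    return (rec["aesthetic_score"] >= min_aesthetic
            and rec["quality_score"] >= min_quality
            and rec["flatness"] <= max_flatness)

subset = [r["image_url"] for r in records if keep(r)]
```

Thresholds like these can be tuned per training stage to carve out, e.g., a high-aesthetic subset for late-stage fine-tuning.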
Images can be downloaded and filtering scores can be computed with:
# download the image
python tools/download_from_image_url.py "image_url in metadata"
# compute filtering scores
python tools/filtering_pipeline.py /path/to/image.jpg

- 4K positional robustness. Resonance 2D RoPE with YaRN keeps training-window awareness while remaining band-aware and aspect-ratio aware to avoid ghosting.
- Detail-preserving compression. A lightweight, non-adversarial post-training routine sharpens Flux VAE reconstructions at 4K without sacrificing throughput, resolving the usual trade-off between speed and micro-detail.
- 4K-aware objectives. The SNR-Aware Huber Wavelet Training Objective emphasizes high-frequency fidelity in the latent space so gradients stay balanced across timesteps and frequency bands.
- Aesthetic-aware scheduling. Stage-wise Aesthetic Curriculum Learning (SACL) routes high-aesthetic supervision toward high-noise steps, sculpting the model prior where it matters most for vivid detail and alignment.
- Scale and coverage. 1M native-4K and near-4K images with controlled aspect-ratio sampling, so that wide and portrait regimes are equally represented.
- Content balance. A dual-channel collection pipeline debiases landscape-heavy sources toward human-centric content.
- Rich metadata. Every sample includes bilingual captions, subject tags, CLIP/VLM-based quality and aesthetic scores, and classical IQA metrics, enabling targeted subset sampling for specific training stages.
- Backbone. Flux-style DiT trained directly on MultiAspect-4K-1M with token-efficient blocks and Resonance 2D RoPE + YaRN for AR-aware positional encoding.
- Objective. SNR-Aware Huber Wavelet loss aligns gradient magnitudes with 4K statistics, reinforcing high-frequency fidelity under strong VAE compression.
- Curriculum. SACL injects high-aesthetic data primarily into high-noise timesteps so the model’s prior captures human-desired structure early in the trajectory.
- VAE Post-training. A simple, non-adversarial fine-tuning pass boosts 4K reconstruction quality while keeping inference cost low.
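As a rough illustration of the objective above, the sketch below combines a one-level Haar wavelet split with a per-band Huber loss and an SNR-dependent timestep weight. This is not the paper's implementation: the band weights, the Huber `delta`, and the Min-SNR-style clamp are all assumed values chosen for the example.

```python
import torch
import torch.nn.functional as F

def haar_bands(x):
    """One-level Haar split of an NCHW tensor into (LL, LH, HL, HH) bands."""
    a, b = x[..., ::2, :], x[..., 1::2, :]            # even / odd rows
    lo, hi = (a + b) / 2, (a - b) / 2                 # vertical low / high pass
    ll, lh = (lo[..., ::2] + lo[..., 1::2]) / 2, (lo[..., ::2] - lo[..., 1::2]) / 2
    hl, hh = (hi[..., ::2] + hi[..., 1::2]) / 2, (hi[..., ::2] - hi[..., 1::2]) / 2
    return ll, lh, hl, hh

def snr_huber_wavelet_loss(pred, target, snr, hf_weight=2.0, delta=0.5):
    """Huber loss per Haar band, up-weighting the high-frequency bands
    (hf_weight is an assumed constant) and damping low-noise timesteps
    with a Min-SNR-style clamp (assumed form, not the paper's)."""
    weights = [1.0, hf_weight, hf_weight, hf_weight]  # LL, LH, HL, HH
    per_sample = 0.0
    for w, p, t in zip(weights, haar_bands(pred), haar_bands(target)):
        per_sample = per_sample + w * F.huber_loss(
            p, t, delta=delta, reduction="none").mean(dim=(1, 2, 3))
    timestep_w = torch.clamp(snr, max=5.0) / snr      # per-sample SNR weight
    return (timestep_w * per_sample).mean()
```

The high-frequency up-weighting is what pushes gradient mass toward fine detail that strong VAE compression tends to wash out, while the SNR clamp keeps low-noise steps from dominating training.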
UltraFlux surpasses recent native-4K and training-free scaling baselines on standard 4K benchmarks spanning:
- Image fidelity at 4096×4096 and higher
- Aesthetic preference scores
- Text-image alignment metrics across diverse aspect ratios
We will release the full stack upon publication:
- MultiAspect-4K-1M dataset with metadata loaders
- Training pipelines
- Evaluation code covering fidelity, aesthetic, and alignment metrics
To foster research and the open-source community, we plan to open-source the entire project, including training code, inference code, and model weights. Thank you for your patience and support! 🌟
- Release GitHub repo.
- Release inference code (`inf_ultraflux.py`).
- Release training code.
- Release model checkpoints.
- Release arXiv paper.
- Release HuggingFace Space demo.
- Release dataset (MultiAspect-4K-1M).
Stay tuned for links and usage instructions. For updates, please watch this repository or open an issue.
We are grateful to the following projects:
@misc{ye2025ultrafluxdatamodelcodesignhighquality,
title={UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios},
author={Tian Ye and Song Fei and Lei Zhu},
year={2025},
eprint={2511.18050},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.18050},
}