UltraFlux: Data-Model Co-Design for High-Quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios

Tian Ye1*‡, Song Fei1*, Lei Zhu1,2†

1 The Hong Kong University of Science and Technology (Guangzhou), 2 The Hong Kong University of Science and Technology

*Equal Contribution ‡Project Leader †Corresponding Author

UltraFlux is a diffusion transformer that extends Flux backbones to native 4K synthesis with consistent quality across a wide range of aspect ratios.

The project unifies data, architecture, objectives, and optimization so that positional encoding, VAE compression, and loss design reinforce each other rather than compete.

Figure: UltraFlux native 4K samples across diverse aspect ratios, spanning landscapes, portraits, food, still life, and mosaic-like artistic scenes.

Why UltraFlux?

MultiAspect-4K-1M Dataset

Model & Training Recipe

  1. Backbone. A Flux-style DiT trained directly on MultiAspect-4K-1M, using token-efficient blocks plus Resonance 2D RoPE + YaRN for aspect-ratio-aware positional encoding (see the RoPE sketch after this list).
  2. Objective. An SNR-Aware Huber Wavelet loss aligns gradient magnitudes with 4K statistics, reinforcing high-frequency fidelity under strong VAE compression (see the loss sketch below).
  3. Curriculum. SACL injects high-aesthetic data primarily into high-noise timesteps, so the model's prior captures human-preferred structure early in the denoising trajectory (see the curriculum sketch below).
  4. VAE Post-training. A non-adversarial fine-tuning pass boosts 4K reconstruction quality while keeping inference cost low (see the VAE sketch below).
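To make the positional-encoding design concrete, below is a minimal PyTorch sketch of axis-separable 2D RoPE with YaRN-style band-wise frequency scaling, in the spirit of Resonance 2D RoPE + YaRN. The exact formulation is not reproduced here; `train_len`, the ramp boundaries `low`/`high`, and all function names are illustrative assumptions.

```python
# Sketch only: YaRN-style band-wise frequency scaling for 2D RoPE.
# `train_len`, `low`, and `high` are assumed values, not UltraFlux's.
import torch

def yarn_scale(freqs: torch.Tensor, train_len: int, target_len: int,
               low: float = 1.0, high: float = 32.0) -> torch.Tensor:
    """Interpolate low-frequency bands for longer axes; keep fast bands."""
    s = target_len / train_len                    # extrapolation factor
    wavelen = 2 * torch.pi / freqs                # per-band wavelength
    # ramp = 1: band already cycles within train_len, leave untouched
    # ramp = 0: band is too slow, fully interpolate (divide by s)
    ramp = ((train_len / wavelen - low) / (high - low)).clamp(0.0, 1.0)
    return freqs * (ramp + (1.0 - ramp) / s)

def rope_2d(h: int, w: int, dim: int, train_len: int = 128) -> torch.Tensor:
    """Axis-separable 2D RoPE: half of each head's rotation channels track
    the row index, half the column index, so one frequency table serves
    every aspect ratio."""
    half = dim // 2
    freqs = 1.0 / (10000 ** (torch.arange(0, half, 2).float() / half))
    fy, fx = yarn_scale(freqs, train_len, h), yarn_scale(freqs, train_len, w)
    ay = torch.arange(h)[:, None] * fy[None, :]   # (h, half/2) row angles
    ax = torch.arange(w)[:, None] * fx[None, :]   # (w, half/2) col angles
    ang = torch.cat([ay[:, None, :].expand(h, w, -1),
                     ax[None, :, :].expand(h, w, -1)], dim=-1)
    return torch.polar(torch.ones_like(ang), ang)  # (h, w, dim/2) rotations
```

Queries and keys would then be rotated by these complex phases before attention, exactly as in 1D RoPE.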
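The objective can be pictured as a Huber penalty applied per wavelet subband with an SNR-dependent timestep weight. The sketch below assumes a one-level Haar transform, min-SNR-style clamping, and hand-picked constants (`delta`, `snr_gamma`, `hf_weight`); none of these values come from the paper.

```python
# Hypothetical sketch of an SNR-aware Huber loss on Haar wavelet subbands.
import torch
import torch.nn.functional as F

def haar_dwt(x: torch.Tensor):
    """One-level Haar transform of a (B, C, H, W) tensor (H, W even)
    into LL/LH/HL/HH subbands."""
    a, b = x[..., 0::2, 0::2], x[..., 0::2, 1::2]
    c, d = x[..., 1::2, 0::2], x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a + b - c - d) / 2
    hl = (a - b + c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh

def snr_huber_wavelet_loss(pred, target, snr, delta=0.5, snr_gamma=5.0,
                           hf_weight=2.0):
    """Per-sample Huber loss on each subband, with high-frequency bands
    up-weighted and a min-SNR-style clamp so high-noise timesteps do not
    dominate the gradient."""
    weights = (1.0, hf_weight, hf_weight, hf_weight)   # LL, LH, HL, HH
    per_sample = 0.0
    for w, p, t in zip(weights, haar_dwt(pred), haar_dwt(target)):
        per_sample = per_sample + w * F.huber_loss(
            p, t, reduction="none", delta=delta).mean(dim=(1, 2, 3))
    return (torch.clamp(snr, max=snr_gamma) / snr * per_sample).mean()
```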
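The curriculum idea can be illustrated by biasing which pool a training example is drawn from according to the sampled timestep. The linear ramp and probability bounds below are assumptions standing in for SACL's actual schedule.

```python
# Illustrative aesthetic-biased timestep curriculum in the spirit of SACL.
# `p_floor` and `p_ceil` are assumed bounds, not the published schedule.
import random

def sample_example(hi_aesthetic, general, t, t_max,
                   p_floor=0.2, p_ceil=0.8):
    """Draw one training example: the noisier the timestep t, the more
    likely we pick from the curated high-aesthetic pool, so the global
    structure learned at high noise reflects human-preferred composition."""
    p_hi = p_floor + (p_ceil - p_floor) * (t / t_max)  # ramp with noise level
    pool = hi_aesthetic if random.random() < p_hi else general
    return random.choice(pool)
```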
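A non-adversarial VAE post-training pass might look like the following: freeze the encoder so the latent space the DiT was trained on stays fixed, and fine-tune only the decoder with pixel plus perceptual reconstruction terms and no discriminator. The diffusers-style `encode`/`decode` interface, the LPIPS perceptual term, and the 0.1 weight are all assumptions, not the released recipe.

```python
# Sketch of non-adversarial VAE decoder post-training.
import torch
import lpips  # pip install lpips; perceptual metric assumed for illustration

def posttrain_decoder(vae, loader, steps=10_000, lr=1e-5, device="cuda"):
    perceptual = lpips.LPIPS(net="vgg").to(device).eval()
    opt = torch.optim.AdamW(vae.decoder.parameters(), lr=lr)
    vae.encoder.requires_grad_(False)      # keep the latent space fixed
    for _, x in zip(range(steps), loader):
        x = x.to(device)
        with torch.no_grad():
            z = vae.encode(x).latent_dist.sample()  # diffusers-style API
        rec = vae.decode(z).sample
        loss = (torch.nn.functional.l1_loss(rec, x)
                + 0.1 * perceptual(rec, x).mean())  # 0.1: assumed weight
        opt.zero_grad()
        loss.backward()
        opt.step()
```

Because only the decoder changes, the architecture and therefore inference cost are untouched, consistent with the recipe above.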

Resources

We will release the full stack upon publication.

Contact

For collaboration or questions please reach out to tye610@connect.hkust-gz.edu.cn.