Optimizing Diffusion Model Training Efficiency to Generate High-Resolution Images


Bibliographic Details
Published in: 2025 International Conference on Intelligent Computing and Knowledge Extraction (ICICKE), pp. 1-6
Main authors: Wang, Junhua; Jiang, Yuan
Format: Conference paper
Language: English
Published: IEEE, 6 June 2025
Description
Summary: To address the training-efficiency bottleneck of diffusion models in high-resolution image generation, this paper proposes a method that makes diffusion model training more efficient while still producing high-quality, high-resolution images. The method combines two stages, latent space compression and multi-stage diffusion generation, and builds an architecture that fuses conditional input with the latent representation. A Vector Quantized Variational AutoEncoder (VQ-VAE) compresses high-resolution images by mapping them to a low-dimensional latent space, and the diffusion process is subdivided into multiple stages. To fuse the conditional input with the latent representation, a cross-modal cross-attention mechanism lets the model receive additional conditional input during generation. The paper also combines time-step clustering with a multi-decoder architecture and an adaptive time-step reduction strategy to improve training efficiency while maintaining generation quality. The results show that, after optimization, the training time of the model is reduced (from a maximum of 543 milliseconds to 291 milliseconds) and the memory occupancy rate is reduced (from an average of 42.06% to 22.12%). At the same time, the PSNR (Peak Signal-to-Noise Ratio) of the generated images increases and the FID (Fréchet Inception Distance) decreases, which the authors report as a significant improvement in generation quality.
DOI: 10.1109/ICICKE65317.2025.11136237
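
The summary names two standard building blocks whose mechanics are easy to show in isolation: the nearest-codebook lookup at the heart of a VQ-VAE, and the PSNR metric used to report image quality. The following is a minimal pure-Python sketch of the textbook definitions only, not the paper's implementation. The function names `quantize` and `psnr`, the toy codebook, and the 8-bit peak value of 255 are all assumptions made for illustration.

```python
import math


def quantize(vector, codebook):
    """Vector quantization step: pick the codebook entry nearest to `vector`.

    Returns (index, entry), using squared Euclidean distance.
    `codebook` is a hypothetical list of equal-length float lists.
    """
    best_index = min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vector, codebook[i])),
    )
    return best_index, codebook[best_index]


def psnr(reference, generated, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two flat pixel sequences.

    PSNR = 10 * log10(max_value^2 / MSE). Higher means closer to the reference.
    `max_value` is assumed to be 255 (8-bit images) unless stated otherwise.
    """
    if len(reference) != len(generated) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    # Toy 2-dimensional codebook with three entries (illustrative values only).
    toy_codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
    index, entry = quantize([0.9, 0.8], toy_codebook)
    print("nearest code:", index, entry)

    # Two tiny "images" flattened to pixel lists.
    print("PSNR (dB):", psnr([100, 100], [110, 90]))
```

In a VQ-VAE, the diffusion model would operate on grids of such codebook indices or their embeddings rather than on raw pixels, which is where the reduction in dimensionality comes from. PSNR, by contrast, is computed in pixel space after decoding.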