PSHead: 3D Head Reconstruction from a Single Image with Diffusion Prior and Self‐Enhancement

Bibliographic Details
Published in: Computer Graphics Forum
Main authors: Yang, Jing; Wu, Tianhan; Fogarty, Kyle; Zhong, Fangcheng; Oztireli, Cengiz
Format: Journal Article
Language: English
Published: 01.10.2025
ISSN: 0167-7055, 1467-8659
Online access: Full text

Description
Abstract: Text-to-3D avatar generation has shown that diffusion models trained on general objects can capture head structure. However, image-to-3D avatar generation, which creates a high-fidelity 3D avatar from a single image, remains challenging due to additional constraints: it requires recovering a detailed 3D representation from limited cues while capturing complex facial features such as wrinkles and hair. To address these challenges, we introduce PSHead, a coarse-to-fine framework guided by both object and face priors that produces a Gaussian-based 3D avatar from a single frontal-view reference image. In the coarse stage, we create an initial 3D representation by applying diffusion models trained for general object generation, using Score Distillation Sampling losses over novel views. This approach marks the first integration of text-to-image, image-to-image, and text-to-video diffusion priors, with insights into each module's contribution to learning a 3D representation. In the fine stage, we refine this representation with pretrained face generation models, which denoise rendered images; the refined outputs then serve as supervision to further improve 3D detail fidelity. Leveraging the versatility of 2D object priors, PSHead is robust across a variety of face framings. Our method outperforms existing approaches on in-the-wild images, demonstrating its robustness and ability to capture intricate details without extensive 3D supervision.
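
The coarse stage described in the abstract optimizes the 3D representation with Score Distillation Sampling (SDS) losses over novel views. For orientation, below is a minimal PyTorch sketch of a generic SDS update; the `denoiser` callable and `alphas_cumprod` schedule are hypothetical stand-ins for whichever frozen 2D diffusion prior (text-to-image, image-to-image, or text-to-video) supplies the score, and this is not the authors' implementation.

```python
# Minimal SDS sketch (assumed PyTorch interface; not the PSHead code).
import torch

def sds_loss(rendered: torch.Tensor, denoiser, alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """Surrogate loss whose gradient w.r.t. `rendered` is the SDS gradient w(t) * (eps_pred - eps).

    rendered        (B, C, H, W) differentiable renders of the 3D avatar from novel views
    denoiser        hypothetical callable eps_pred = denoiser(x_t, t); a frozen diffusion prior
    alphas_cumprod  (T,) cumulative product of the diffusion alphas (the noise schedule)
    """
    B, T = rendered.shape[0], alphas_cumprod.shape[0]
    t = torch.randint(0, T, (B,), device=rendered.device)  # random timestep per view
    a_t = alphas_cumprod[t].view(B, 1, 1, 1)

    eps = torch.randn_like(rendered)                        # forward-diffuse the render to step t
    x_t = a_t.sqrt() * rendered + (1.0 - a_t).sqrt() * eps

    with torch.no_grad():                                   # no gradients through the frozen prior
        eps_pred = denoiser(x_t, t)

    grad = (1.0 - a_t) * (eps_pred - eps)                   # weighted score residual
    # Dot-product trick: d(loss)/d(rendered) == grad, so autograd pushes
    # the SDS update through the renderer into the 3D representation.
    return (grad.detach() * rendered).sum()
```

The fine stage then replaces this distillation signal with direct supervision: renders are denoised by a pretrained face generation model, and the refined images serve as regression targets for the 3D representation.
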
DOI: 10.1111/cgf.70279