GLIGEN: Open-Set Grounded Text-to-Image Generation

Detailed bibliography
Published in:	Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 22511-22521
Main authors:	Li, Yuheng; Liu, Haotian; Wu, Qingyang; Mu, Fangzhou; Yang, Jianwei; Gao, Jianfeng; Li, Chunyuan; Lee, Yong Jae
Format:	Conference paper
Language:	English
Published:	IEEE, 1 June 2023
Subjects:	Computational modeling; Computer vision; Graphics; Grounding; Image and video synthesis and generation; Image edge detection; Logic gates; Training data
ISSN:	1063-6919

Abstract	Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated mechanism. Our model achieves open-world grounded text2img generation with caption and bounding box condition inputs, and the grounding ability generalizes well to novel spatial configurations and concepts. GLIGEN's zero-shot performance on COCO and LVIS outperforms existing supervised layout-to-image baselines by a large margin.
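The frozen-weights-plus-gated-layers idea in the abstract can be illustrated with a minimal sketch. The abstract does not state the form of the gate, so this assumes a tanh-gated residual with a learnable scalar `gamma` initialized to zero; `frozen_layer`, `grounding_layer`, `gated_block`, and `weight` are hypothetical names for illustration, not the paper's code.

```python
import math

def frozen_layer(h):
    # Stand-in for a pre-trained layer; its behaviour is fixed (never trained).
    return [2.0 * x for x in h]

def grounding_layer(grounding, weight):
    # Hypothetical new trainable layer: projects grounding features
    # (e.g. an encoded bounding box) with a single learnable weight.
    return [weight * g for g in grounding]

def gated_block(h, grounding, gamma, weight):
    # Assumed gate: output = frozen(h) + tanh(gamma) * new_layer(grounding).
    base = frozen_layer(h)
    gate = math.tanh(gamma)
    update = grounding_layer(grounding, weight)
    return [b + gate * u for b, u in zip(base, update)]

# With gamma = 0 the gate is closed, so the block reproduces the frozen model.
print(gated_block([1.0, 2.0], [0.5, -0.5], gamma=0.0, weight=3.0))
# With gamma != 0 the grounding signal starts to shift the output.
print(gated_block([1.0, 2.0], [0.5, -0.5], gamma=1.0, weight=3.0))
```

Under this assumption, training only `gamma` and `weight` leaves the pre-trained behaviour intact at initialization, which is one way to read the abstract's claim about preserving the model's concept knowledge.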
AuthorAffiliation	Li, Yuheng (University of Wisconsin-Madison); Liu, Haotian (University of Wisconsin-Madison); Wu, Qingyang (Columbia University); Mu, Fangzhou (University of Wisconsin-Madison); Yang, Jianwei (Microsoft); Gao, Jianfeng (Microsoft); Li, Chunyuan (Microsoft); Lee, Yong Jae (University of Wisconsin-Madison)
CODEN	IEEPAD
DOI	10.1109/CVPR52729.2023.02156
Discipline	Applied Sciences
EISBN	9798350301298
EISSN	1063-6919
Genre	orig-research
GrantInformation	NASA, grant 80NSSC21K0295 (funder ID 10.13039/100000104); Institute of Information & communications Technology Planning & Evaluation (IITP), grant 2022-0-00871 (funder ID 10.13039/501100010418); NSF CAREER, grant IIS2150012 (funder ID 10.13039/100000001)
ISICitedReferencesCount	275
IsPeerReviewed	false
IsScholarly	true
PageCount	11
PublicationTitleAbbrev	CVPR
URI	https://ieeexplore.ieee.org/document/10203593