GLIGEN: Open-Set Grounded Text-to-Image Generation

Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of exis...

Bibliographic details
Published in: Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 22511 - 22521
Main authors: Li, Yuheng; Liu, Haotian; Wu, Qingyang; Mu, Fangzhou; Yang, Jianwei; Gao, Jianfeng; Li, Chunyuan; Lee, Yong Jae
Format: Conference paper
Language: English
Published: IEEE, 1 June 2023
Subjects:
ISSN:1063-6919
Abstract Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated mechanism. Our model achieves open-world grounded text2img generation with caption and bounding box condition inputs, and the grounding ability generalizes well to novel spatial configurations and concepts. GLIGEN's zero-shot performance on COCO and LVIS outperforms existing supervised layout-to-image baselines by a large margin.
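The gated injection described in the abstract can be illustrated with a minimal sketch. The abstract says only that the pre-trained weights are frozen and that grounding information enters through new trainable layers "via a gated mechanism"; the residual form below, the tanh gate, the zero initialization, and the names `gated_update` and `gamma` are assumptions made for illustration, not details confirmed by this record.

```python
import math


def gated_update(frozen_features, grounding_update, gamma):
    """Add a gated grounding update to the output of a frozen layer.

    Assumed form (hypothetical): out = frozen + tanh(gamma) * update,
    where gamma is a learnable scalar.
    """
    gate = math.tanh(gamma)
    return [f + gate * u for f, u in zip(frozen_features, grounding_update)]


# Made-up values standing in for real activations.
frozen = [0.5, -1.0, 2.0]   # output of a frozen pre-trained layer
update = [0.2, 0.4, -0.6]   # output of a new trainable grounding layer

closed = gated_update(frozen, update, gamma=0.0)  # gate closed: tanh(0) = 0
opened = gated_update(frozen, update, gamma=1.0)  # gate partly open
```

Under these assumptions, a gate that starts closed leaves the frozen model's output unchanged, so training begins from the pre-trained behavior and the grounding signal is blended in only as the gate parameter moves away from zero.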
Author Liu, Haotian
Yang, Jianwei
Li, Chunyuan
Li, Yuheng
Gao, Jianfeng
Wu, Qingyang
Lee, Yong Jae
Mu, Fangzhou
Author_xml – sequence: 1
  givenname: Yuheng
  surname: Li
  fullname: Li, Yuheng
  organization: University of Wisconsin-Madison
– sequence: 2
  givenname: Haotian
  surname: Liu
  fullname: Liu, Haotian
  organization: University of Wisconsin-Madison
– sequence: 3
  givenname: Qingyang
  surname: Wu
  fullname: Wu, Qingyang
  organization: Columbia University
– sequence: 4
  givenname: Fangzhou
  surname: Mu
  fullname: Mu, Fangzhou
  organization: University of Wisconsin-Madison
– sequence: 5
  givenname: Jianwei
  surname: Yang
  fullname: Yang, Jianwei
  organization: Microsoft
– sequence: 6
  givenname: Jianfeng
  surname: Gao
  fullname: Gao, Jianfeng
  organization: Microsoft
– sequence: 7
  givenname: Chunyuan
  surname: Li
  fullname: Li, Chunyuan
  organization: Microsoft
– sequence: 8
  givenname: Yong Jae
  surname: Lee
  fullname: Lee, Yong Jae
  organization: University of Wisconsin-Madison
CODEN IEEPAD
ContentType Conference Proceeding
DBID 6IE
6IH
CBEJK
RIE
RIO
DOI 10.1109/CVPR52729.2023.02156
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP) 1998-present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Applied Sciences
EISBN 9798350301298
EISSN 1063-6919
EndPage 22521
ExternalDocumentID 10203593
Genre orig-research
GrantInformation_xml – fundername: NASA
  grantid: 80NSSC21K0295
  funderid: 10.13039/100000104
– fundername: Institute of Information & communications Technology Planning & Evaluation(IITP)
  grantid: 2022-0-00871
  funderid: 10.13039/501100010418
– fundername: NSF CAREER
  grantid: IIS2150012
  funderid: 10.13039/100000001
GroupedDBID 6IE
6IH
6IL
6IN
AAWTH
ABLEC
ADZIZ
ALMA_UNASSIGNED_HOLDINGS
BEFXN
BFFAM
BGNUA
BKEBE
BPEOZ
CBEJK
CHZPO
IEGSK
IJVOP
OCL
RIE
RIL
RIO
ID FETCH-LOGICAL-i204t-bccabb57d83513d4957f2c4a83469f26d255a0c916a3c04f0fe43eef6197724e3
IEDL.DBID RIE
ISICitedReferencesCount 275
ISICitedReferencesURI http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001062531306081&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
IngestDate Wed Aug 27 02:56:31 EDT 2025
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-i204t-bccabb57d83513d4957f2c4a83469f26d255a0c916a3c04f0fe43eef6197724e3
PageCount 11
ParticipantIDs ieee_primary_10203593
PublicationCentury 2000
PublicationDate 2023-June
PublicationDateYYYYMMDD 2023-06-01
PublicationDate_xml – month: 06
  year: 2023
  text: 2023-June
PublicationDecade 2020
PublicationTitle Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online)
PublicationTitleAbbrev CVPR
PublicationYear 2023
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssj0003211698
Score 2.7094305
SourceID ieee
SourceType Publisher
StartPage 22511
SubjectTerms Computational modeling
Computer vision
Graphics
Grounding
Image and video synthesis and generation
Image edge detection
Logic gates
Training data
Title GLIGEN: Open-Set Grounded Text-to-Image Generation
URI https://ieeexplore.ieee.org/document/10203593
WOSCitedRecordID wos001062531306081&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
linkProvider IEEE