GLIGEN: Open-Set Grounded Text-to-Image Generation
Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of exis...
| Published in: | Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 22511-22521 |
|---|---|
| Main authors: | Li, Yuheng; Liu, Haotian; Wu, Qingyang; Mu, Fangzhou; Yang, Jianwei; Gao, Jianfeng; Li, Chunyuan; Lee, Yong Jae |
| Format: | Conference paper |
| Language: | English |
| Published: | IEEE, 2023-06-01 |
| Subjects: | |
| ISSN: | 1063-6919 |
| Online access: | Get full text |
| Abstract | Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated mechanism. Our model achieves open-world grounded text2img generation with caption and bounding box condition inputs, and the grounding ability generalizes well to novel spatial configurations and concepts. GLIGEN's zero-shot performance on COCO and LVIS outperforms existing supervised layout-to-image baselines by a large margin. |
|---|---|
| Author | Liu, Haotian; Yang, Jianwei; Li, Chunyuan; Li, Yuheng; Gao, Jianfeng; Wu, Qingyang; Lee, Yong Jae; Mu, Fangzhou |
| Author_xml | 1. Li, Yuheng (University of Wisconsin-Madison); 2. Liu, Haotian (University of Wisconsin-Madison); 3. Wu, Qingyang (Columbia University); 4. Mu, Fangzhou (University of Wisconsin-Madison); 5. Yang, Jianwei (Microsoft); 6. Gao, Jianfeng (Microsoft); 7. Li, Chunyuan (Microsoft); 8. Lee, Yong Jae (University of Wisconsin-Madison) |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding |
| DBID | 6IE 6IH CBEJK RIE RIO |
| DOI | 10.1109/CVPR52729.2023.02156 |
| DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings IEEE Proceedings Order Plan (POP) 1998-present by volume IEEE Xplore All Conference Proceedings IEEE Electronic Library (IEL) IEEE Proceedings Order Plans (POP) 1998-present |
| DatabaseTitleList | |
| Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://ieeexplore.ieee.org/ sourceTypes: Publisher |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Applied Sciences |
| EISBN | 9798350301298 |
| EISSN | 1063-6919 |
| EndPage | 22521 |
| ExternalDocumentID | 10203593 |
| Genre | orig-research |
| GrantInformation_xml | NASA (grant ID 80NSSC21K0295; funder ID 10.13039/100000104); Institute of Information & communications Technology Planning & Evaluation (IITP) (grant ID 2022- 0–00871; funder ID 10.13039/501100010418); NSF CAREER (grant ID IIS2150012; funder ID 10.13039/100000001) |
| GroupedDBID | 6IE 6IH 6IL 6IN AAWTH ABLEC ADZIZ ALMA_UNASSIGNED_HOLDINGS BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK CHZPO IEGSK IJVOP OCL RIE RIL RIO |
| ID | FETCH-LOGICAL-i204t-bccabb57d83513d4957f2c4a83469f26d255a0c916a3c04f0fe43eef6197724e3 |
| IEDL.DBID | RIE |
| ISICitedReferencesCount | 275 |
| ISICitedReferencesURI | http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001062531306081&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| IngestDate | Wed Aug 27 02:56:31 EDT 2025 |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| LinkModel | DirectLink |
| MergedId | FETCHMERGED-LOGICAL-i204t-bccabb57d83513d4957f2c4a83469f26d255a0c916a3c04f0fe43eef6197724e3 |
| PageCount | 11 |
| ParticipantIDs | ieee_primary_10203593 |
| PublicationCentury | 2000 |
| PublicationDate | 2023-June |
| PublicationDateYYYYMMDD | 2023-06-01 |
| PublicationDate_xml | – month: 06 year: 2023 text: 2023-June |
| PublicationDecade | 2020 |
| PublicationTitle | Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online) |
| PublicationTitleAbbrev | CVPR |
| PublicationYear | 2023 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| SSID | ssj0003211698 |
| Score | 2.7094305 |
| SourceID | ieee |
| SourceType | Publisher |
| StartPage | 22511 |
| SubjectTerms | Computational modeling; Computer vision; Graphics; Grounding; Image and video synthesis and generation; Image edge detection; Logic gates; Training data |
| Title | GLIGEN: Open-Set Grounded Text-to-Image Generation |
| URI | https://ieeexplore.ieee.org/document/10203593 |
| WOSCitedRecordID | wos001062531306081&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| hasFullText | 1 |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |
| linkProvider | IEEE |
| openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&rft.genre=proceeding&rft.title=Proceedings+%28IEEE+Computer+Society+Conference+on+Computer+Vision+and+Pattern+Recognition.+Online%29&rft.atitle=GLIGEN%3A+Open-Set+Grounded+Text-to-Image+Generation&rft.au=Li%2C+Yuheng&rft.au=Liu%2C+Haotian&rft.au=Wu%2C+Qingyang&rft.au=Mu%2C+Fangzhou&rft.date=2023-06-01&rft.pub=IEEE&rft.eissn=1063-6919&rft.spage=22511&rft.epage=22521&rft_id=info:doi/10.1109%2FCVPR52729.2023.02156&rft.externalDocID=10203593 |
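
The abstract states that GLIGEN freezes all weights of the pre-trained model and injects grounding information into new trainable layers through a gated mechanism. The following is a minimal sketch of that idea in plain Python, not the authors' implementation. The function names, the toy "layers", the `tanh` gate, and the zero-initialized gate parameter `gamma` are all assumptions made for illustration; the record itself does not specify the form of the gate.

```python
import math


def frozen_layer(x):
    """Stand-in for a pre-trained layer; its behaviour is never updated."""
    return [2.0 * v for v in x]


def grounding_branch(grounding, weight):
    """Hypothetical new trainable branch that projects grounding features."""
    return [weight * g for g in grounding]


def gated_layer(x, grounding, gamma, weight=0.5):
    """Add the grounding branch to the frozen output, scaled by a gate.

    The gate is tanh(gamma) (an assumption for this sketch). With gamma = 0
    the gate is exactly 0, so the output equals the frozen layer's output.
    """
    base = frozen_layer(x)
    extra = grounding_branch(grounding, weight)
    gate = math.tanh(gamma)
    return [b + gate * e for b, e in zip(base, extra)]


# With the gate closed, the pre-trained behaviour is reproduced unchanged.
closed = gated_layer([1.0, 2.0], [4.0, 6.0], gamma=0.0)

# With the gate partly open, grounding information shifts the output.
opened = gated_layer([1.0, 2.0], [4.0, 6.0], gamma=1.0)
```

In this sketch only `gamma` and `weight` would be trained, which mirrors the abstract's point that the pre-trained model's knowledge is preserved while the new layers learn to use the grounding inputs.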