Transformer-based image generation from scene graphs
| Published in: | Computer vision and image understanding, Vol. 233, art. 103721 |
|---|---|
| Main authors: | Sortino, Renato; Palazzo, Simone; Rundo, Francesco; Spampinato, Concetto |
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier Inc, 01.08.2023 |
| ISSN: | 1077-3142, 1090-235X |
| Online access: | Get full text |
| Abstract | Graph-structured scene descriptions can be efficiently used in generative models to control the composition of the generated image. Previous approaches are based on the combination of graph convolutional networks and adversarial methods for layout prediction and image generation, respectively. In this work, we show how employing multi-head attention to encode the graph information, as well as using a transformer-based model in the latent space for image generation, can improve the quality of the sampled data, without the need to employ adversarial models, with the consequent advantage in terms of training stability.
The proposed approach, specifically, is entirely based on transformer architectures, both for encoding scene graphs into intermediate object layouts and for decoding these layouts into images, passing through a lower-dimensional space learned by a vector-quantized variational autoencoder. Our approach shows improved image quality with respect to state-of-the-art methods, as well as a higher degree of diversity among multiple generations from the same scene graph. We evaluate our approach on three public datasets: Visual Genome, COCO, and CLEVR. We achieve an Inception Score of 13.7 and 12.8, and an FID of 52.3 and 60.3, on COCO and Visual Genome, respectively. We perform ablation studies on our contributions to assess the impact of each component. Code is available at https://github.com/perceivelab/trf-sg2im.
Highlights:
• Multi-head attention on graphs with geometric and edge features for layout estimation.
• GPT decoder for conditioned image generation on the latent space.
• Improved IS results on the generated images.
• High robustness to scene graph perturbations.
• Increased diversity of generated images. |
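The layout-estimation step described above applies multi-head attention to scene-graph node embeddings. As an illustrative sketch only (not the authors' code: the shapes, the random projection weights, and the toy node features are all hypothetical placeholders standing in for learned parameters), self-attention over a set of graph-node features can be computed as:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(nodes, num_heads, rng):
    """Minimal multi-head self-attention over node embeddings.

    nodes: (N, d) array, one embedding per object/relation node.
    Random matrices stand in for learned projection weights.
    """
    n, d = nodes.shape
    assert d % num_heads == 0
    dh = d // num_heads
    wq, wk, wv, wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
    q, k, v = nodes @ wq, nodes @ wk, nodes @ wv
    # Split into heads: (N, d) -> (num_heads, N, dh).
    split = lambda x: x.reshape(n, num_heads, dh).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)
    # Scaled dot-product attention per head: (num_heads, N, N).
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))
    # Merge heads back to (N, d) and apply the output projection.
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)
    return out @ wo

rng = np.random.default_rng(0)
nodes = rng.standard_normal((5, 16))   # 5 hypothetical graph nodes
out = multi_head_attention(nodes, num_heads=4, rng=rng)
print(out.shape)  # (5, 16)
```

In the paper's setting each node would additionally carry geometric and edge features, and the attention weights would be trained end to end rather than sampled at random.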
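The FID figures quoted in the abstract (52.3 on COCO, 60.3 on Visual Genome) are instances of the Fréchet distance between Gaussian fits to Inception features of real and generated images. A minimal sketch of that formula under toy statistics (this is not the paper's evaluation pipeline, and the sample data here is synthetic):

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between two Gaussians (the formula behind FID):
    ||mu1 - mu2||^2 + Tr(S1) + Tr(S2) - 2 Tr((S1 S2)^{1/2})."""
    diff = mu1 - mu2
    # Tr((S1 S2)^{1/2}) equals the sum of square roots of the eigenvalues
    # of S1 S2, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)

# Toy feature sets; in practice these would be Inception activations
# for real vs. generated images.
rng = np.random.default_rng(0)
real = rng.standard_normal((1000, 4))
fake = rng.standard_normal((1000, 4)) + 0.5   # shifted distribution
fid = frechet_distance(real.mean(0), np.cov(real, rowvar=False),
                       fake.mean(0), np.cov(fake, rowvar=False))
print(round(fid, 3))
```

Identical statistics give a distance of zero, and the distance grows as the generated-feature distribution drifts from the real one; lower FID therefore indicates generations closer to the data distribution.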
| ArticleNumber | 103721 |
| Author | Spampinato, Concetto; Sortino, Renato; Rundo, Francesco; Palazzo, Simone |
| Author_xml | 1. Sortino, Renato (ORCID: 0000-0002-3906-797X; email: renato.sortino@phd.unict.it), PeRCeiVe Lab at the Department of Electrical Electronical Engineering and Computer Science, University of Catania, Italy
2. Palazzo, Simone (ORCID: 0000-0002-2441-0982), PeRCeiVe Lab at the Department of Electrical Electronical Engineering and Computer Science, University of Catania, Italy
3. Rundo, Francesco, ADG, R&D Power and Discretes, STMicroelectronics, Catania, Italy
4. Spampinato, Concetto (ORCID: 0000-0001-6653-2577), PeRCeiVe Lab at the Department of Electrical Electronical Engineering and Computer Science, University of Catania, Italy |
| ContentType | Journal Article |
| Copyright | 2023 The Authors |
| DOI | 10.1016/j.cviu.2023.103721 |
| Discipline | Applied Sciences; Engineering; Computer Science |
| EISSN | 1090-235X |
| ExternalDocumentID | 10_1016_j_cviu_2023_103721 S1077314223001017 |
| ISICitedReferencesCount | 7 |
| ISSN | 1077-3142 |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Keywords | Scene graphs; Transformers; Generative models; Conditional image generation |
| Language | English |
| License | This is an open access article under the CC BY license. |
| ORCID | 0000-0001-6653-2577 0000-0002-2441-0982 0000-0002-3906-797X |
| OpenAccessLink | https://dx.doi.org/10.1016/j.cviu.2023.103721 |
| PublicationDate | August 2023 |
| PublicationTitle | Computer vision and image understanding |
| PublicationYear | 2023 |
| Publisher | Elsevier Inc |
| SSID | ssj0011491 |
| SourceID | crossref elsevier |
| SourceType | Enrichment Source Index Database Publisher |
| StartPage | 103721 |
| SubjectTerms | Conditional image generation Generative models Scene graphs Transformers |
| Title | Transformer-based image generation from scene graphs |
| URI | https://dx.doi.org/10.1016/j.cviu.2023.103721 |
| Volume | 233 |
| hasFullText | 1 |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |
| journalDatabaseRights | – providerCode: PRVESC databaseName: ScienceDirect Freedom Collection - Elsevier customDbUrl: eissn: 1090-235X dateEnd: 99991231 omitProxy: false ssIdentifier: ssj0011491 issn: 1077-3142 databaseCode: AIEXJ dateStart: 19950101 isFulltext: true titleUrlDefault: https://www.sciencedirect.com providerName: Elsevier |
| linkProvider | Elsevier |
| openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=Transformer-based+image+generation+from+scene+graphs&rft.jtitle=Computer+vision+and+image+understanding&rft.au=Sortino%2C+Renato&rft.au=Palazzo%2C+Simone&rft.au=Rundo%2C+Francesco&rft.au=Spampinato%2C+Concetto&rft.date=2023-08-01&rft.issn=1077-3142&rft.volume=233&rft.spage=103721&rft_id=info:doi/10.1016%2Fj.cviu.2023.103721&rft.externalDBID=n%2Fa&rft.externalDocID=10_1016_j_cviu_2023_103721 |