Transformer-based image generation from scene graphs

Published in: Computer Vision and Image Understanding, Volume 233, Article 103721
Main authors: Sortino, Renato; Palazzo, Simone; Rundo, Francesco; Spampinato, Concetto
Format: Journal Article
Language: English
Published: Elsevier Inc., 1 August 2023
ISSN: 1077-3142; EISSN: 1090-235X
Abstract
Graph-structured scene descriptions can be efficiently used in generative models to control the composition of the generated image. Previous approaches are based on the combination of graph convolutional networks and adversarial methods for layout prediction and image generation, respectively. In this work, we show how employing multi-head attention to encode the graph information, as well as using a transformer-based model in the latent space for image generation, can improve the quality of the sampled data without the need to employ adversarial models, with the subsequent advantage in terms of training stability. The proposed approach, specifically, is entirely based on transformer architectures, both for encoding scene graphs into intermediate object layouts and for decoding these layouts into images, passing through a lower-dimensional space learned by a vector-quantized variational autoencoder. Our approach shows improved image quality with respect to state-of-the-art methods, as well as a higher degree of diversity among multiple generations from the same scene graph. We evaluate our approach on three public datasets: Visual Genome, COCO, and CLEVR. We achieve an Inception Score of 13.7 and 12.8, and an FID of 52.3 and 60.3, on COCO and Visual Genome, respectively. We perform ablation studies on our contributions to assess the impact of each component. Code is available at https://github.com/perceivelab/trf-sg2im.

Highlights:
• Multi-head attention on graphs with geometric and edge features for layout estimation.
• GPT decoder for conditioned image generation on the latent space.
• Improved IS results on the generated images.
• Achieved high robustness to scene graph perturbations.
• Increased diversity of generated images.
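The first stage described in the abstract (multi-head attention over a scene graph to estimate an object layout) can be sketched roughly as follows. This is a minimal NumPy toy under stated assumptions, not the authors' implementation: the function name `multi_head_graph_attention`, the random stand-in weights, and the sigmoid "layout head" producing (x, y, w, h) boxes are all illustrative.

```python
# Illustrative sketch (NOT the paper's code): multi-head attention over
# scene-graph nodes, masked by the graph's adjacency, followed by a toy
# linear head that predicts one bounding box per object.
import numpy as np

rng = np.random.default_rng(0)

def multi_head_graph_attention(x, adj, num_heads=2):
    """x: (N, d) node embeddings; adj: (N, N) adjacency (nonzero where an
    edge or self-loop exists). Returns updated (N, d) node embeddings."""
    n, d = x.shape
    dh = d // num_heads
    # Random projections stand in for learned parameters.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    out = np.zeros_like(x)
    for head in range(num_heads):
        s = slice(head * dh, (head + 1) * dh)
        scores = q[:, s] @ k[:, s].T / np.sqrt(dh)
        scores = np.where(adj > 0, scores, -1e9)  # attend only along graph edges
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)
        out[:, s] = weights @ v[:, s]
    return out

# Toy scene graph: 3 objects, edges 0<->1 and 1<->2, plus self-loops.
x = rng.standard_normal((3, 8))
adj = np.eye(3) + np.diag([1.0, 1.0], k=1) + np.diag([1.0, 1.0], k=-1)

h = multi_head_graph_attention(x, adj)
# A toy linear "layout head" maps each node embedding to an (x, y, w, h)
# box, squashed into (0, 1) by a sigmoid.
W_box = rng.standard_normal((8, 4))
boxes = 1.0 / (1.0 + np.exp(-(h @ W_box)))
print(boxes.shape)  # (3, 4): one box per object
```

In the paper's actual pipeline, the predicted layout then conditions an autoregressive transformer that decodes image tokens in a VQ-VAE latent space; the sketch above only mirrors the graph-encoding idea.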
Article number: 103721

Authors:
1. Renato Sortino (ORCID: 0000-0002-3906-797X; email: renato.sortino@phd.unict.it), PeRCeiVe Lab, Department of Electrical, Electronic and Computer Engineering, University of Catania, Italy
2. Simone Palazzo (ORCID: 0000-0002-2441-0982), PeRCeiVe Lab, Department of Electrical, Electronic and Computer Engineering, University of Catania, Italy
3. Francesco Rundo, ADG, R&D Power and Discretes, STMicroelectronics, Catania, Italy
4. Concetto Spampinato (ORCID: 0000-0001-6653-2577), PeRCeiVe Lab, Department of Electrical, Electronic and Computer Engineering, University of Catania, Italy
Copyright: 2023 The Authors
DOI: 10.1016/j.cviu.2023.103721
Discipline: Applied Sciences; Engineering; Computer Science
Open access: yes; Peer reviewed: yes; Scholarly: yes
Keywords: Scene graphs; Transformers; Generative models; Conditional image generation
License: This is an open access article under the CC BY license.
  start-page: 32
  year: 2017
  ident: 10.1016/j.cviu.2023.103721_b30
  article-title: Visual genome: Connecting language and vision using crowdsourced dense image annotations
  publication-title: Int. J. Comput. Vis.
  doi: 10.1007/s11263-016-0981-7
– volume: 32
  year: 2019
  ident: 10.1016/j.cviu.2023.103721_b32
  article-title: Pastegan: A semi-parametric method to generate image from scene graph
  publication-title: Adv. Neural Inf. Process. Syst.
– year: 2014
  ident: 10.1016/j.cviu.2023.103721_b28
– volume: Vol. 139
  start-page: 8748
  year: 2021
  ident: 10.1016/j.cviu.2023.103721_b38
  article-title: Learning transferable visual models from natural language supervision
StartPage 103721
SubjectTerms Conditional image generation
Generative models
Scene graphs
Transformers
Title Transformer-based image generation from scene graphs
URI https://dx.doi.org/10.1016/j.cviu.2023.103721
Volume 233