Learned Representation-Guided Diffusion Models for Large-Image Generation
| Published in: | Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), Vol. 2024, pp. 8532-8542 |
|---|---|
| Main authors: | Graikos, Alexandros; Yellapragada, Srikar; Le, Minh-Quan; Kapse, Saarthak; Prasanna, Prateek; Saltz, Joel; Samaras, Dimitris (all Stony Brook University) |
| Format: | Conference Proceeding; Journal Article |
| Language: | English |
| Published: | United States: IEEE, 1 June 2024 |
| Topics: | Adaptation models; Annotations; Computational modeling; diffusion models; generative models; Histopathology; Image synthesis; remote sensing; Training; Visualization |
| ISSN: | 1063-6919 |
| DOI: | 10.1109/CVPR52733.2024.00815 |
| Identifiers: | PMID 39606708; PMCID PMC11601131; EISBN 9798350353006 |
| Funding: | NSF (IIS-2123920, IIS-2212046); NCI (5U24CA215109, 1R21CA258493-01A1, UR3CA225021); NCI NIH HHS (R21 CA258493, U24 CA215109, UH3 CA225021) |
| Online access: | https://ieeexplore.ieee.org/document/10655298 ; https://www.ncbi.nlm.nih.gov/pmc/articles/11601131 |
| Abstract | To synthesize high-fidelity samples, diffusion models typically require auxiliary data to guide the generation process. However, it is impractical to procure the painstaking patch-level annotation effort required in specialized domains like histopathology and satellite imagery; it is often performed by domain experts and involves hundreds of millions of patches. Modern-day self-supervised learning (SSL) representations encode rich semantic and visual information. In this paper, we posit that such representations are expressive enough to act as proxies to fine-grained human labels. We introduce a novel approach that trains diffusion models conditioned on embeddings from SSL. Our diffusion models successfully project these features back to high-quality histopathology and remote sensing images. In addition, we construct larger images by assembling spatially consistent patches inferred from SSL embeddings, preserving long-range dependencies. Augmenting real data by generating variations of real images improves downstream classifier accuracy for patch-level and larger, image-scale classification tasks. Our models are effective even on datasets not encountered during training, demonstrating their robustness and generalizability. Generating images from learned embeddings is agnostic to the source of the embeddings. The SSL embeddings used to generate a large image can either be extracted from a reference image, or sampled from an auxiliary model conditioned on any related modality (e.g. class labels, text, genomic data). As proof of concept, we introduce the text-to-large image synthesis paradigm where we successfully synthesize large pathology and satellite images out of text descriptions. (Code is available at this link.) |
|---|---|
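The abstract's central idea is to replace human patch labels with frozen SSL embeddings as the conditioning signal for a diffusion model. The following is a minimal, illustrative sketch of that idea in PyTorch; it is not the authors' implementation. The tiny FiLM-modulated denoiser, tensor shapes, schedule, and all names (`SSLConditionedDenoiser`, `embed_dim`, the random stand-in tensors) are hypothetical placeholders, since the abstract does not specify the architecture.

```python
# Hypothetical sketch: one DDPM training step where the noise predictor is
# conditioned on an SSL embedding instead of a class label. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSLConditionedDenoiser(nn.Module):
    """Toy noise predictor: a small conv net whose features are FiLM-modulated
    by a projection of the SSL embedding (a stand-in for the cross-attention
    conditioning a full U-Net would use)."""
    def __init__(self, embed_dim=384, channels=64):
        super().__init__()
        self.in_conv = nn.Conv2d(3, channels, 3, padding=1)
        self.mid_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.out_conv = nn.Conv2d(channels, 3, 3, padding=1)
        # Map (SSL embedding, timestep) to a per-channel scale and shift.
        self.film = nn.Linear(embed_dim + 1, 2 * channels)

    def forward(self, x_noisy, t, ssl_emb):
        cond = torch.cat([ssl_emb, t.float().unsqueeze(1)], dim=1)
        scale, shift = self.film(cond).chunk(2, dim=1)
        h = F.silu(self.in_conv(x_noisy))
        h = h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
        h = F.silu(self.mid_conv(h))
        return self.out_conv(h)

# Standard DDPM forward-noising schedule.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

model = SSLConditionedDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

patches = torch.randn(8, 3, 64, 64)   # stand-in for histopathology patches
ssl_emb = torch.randn(8, 384)         # stand-in for a frozen SSL encoder's output

# Noise each patch at a random timestep and regress the noise, with the SSL
# embedding as the only guidance signal (the abstract's "proxy" for labels).
t = torch.randint(0, T, (8,))
noise = torch.randn_like(patches)
ab = alpha_bar[t][:, None, None, None]
x_noisy = ab.sqrt() * patches + (1 - ab).sqrt() * noise

opt.zero_grad()
loss = F.mse_loss(model(x_noisy, t, ssl_emb), noise)
loss.backward()
opt.step()
```

At inference, the same conditioning slot is what makes the approach embedding-source-agnostic, as the abstract notes: the embeddings fed to the denoiser can come from a reference image's SSL encoder or be sampled from an auxiliary model conditioned on text or other modalities, and large images are formed by decoding a spatial grid of such embeddings into consistent patches.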