Federated learning for generating synthetic data: a scoping review
| Published in: | International journal of population data science, Vol. 8, No. 1, p. 2158 |
|---|---|
| Main authors: | Little, Claire; Elliot, Mark; Allmendinger, Richard |
| Format: | Journal Article |
| Language: | English |
| Published: | Wales: Swansea University, 01.01.2023 |
| Subjects: | |
| ISSN: | 2399-4908 |
| Online access: | Get full text |
| Abstract | Introduction: Federated Learning (FL) is a decentralised approach to training statistical models, where training is performed across multiple clients, producing one global model. Since the training data remains with each local client and is not shared or exchanged with other clients, the use of FL may reduce privacy and security risks (compared to methods where multiple data sources are pooled) and can also address data access and heterogeneity problems. Synthetic data is artificially generated data that has the same structure and statistical properties as the original but that does not contain any of the original data records, therefore minimising disclosure risk. Using FL to produce synthetic data (which we refer to as "federated synthesis") has the potential to combine data from multiple clients without compromising privacy, allowing access to data that may otherwise be inaccessible in its raw format.
Objectives: The objective was to review current research and practices for using FL to generate synthetic data and determine the extent to which research has been undertaken, the methods and evaluation practices used, and any research gaps.
Methods: A scoping review was conducted to systematically map and describe the published literature on the use of FL to generate synthetic data. Relevant studies were identified through online databases and the findings are described, grouped, and summarised. Information extracted included article characteristics, documenting the type of data that is synthesised, the model architecture and the methods (if any) used to evaluate utility and privacy risk.
Results: A total of 69 articles were included in the scoping review; all were published between 2018 and 2023 with two thirds (46) in 2022. 30% (21) were focussed on synthetic data generation as the main model output (with 6 of these generating tabular data), whereas 59% (41) focussed on data augmentation. Of the 21 performing federated synthesis, all used deep learning methods (predominantly Generative Adversarial Networks) to generate the synthetic data.
Conclusions: Federated synthesis is in its early days but shows promise as a method that can construct a global synthetic dataset without sharing any of the local client data. As a field in its infancy there are areas to explore in terms of the privacy risk associated with the various methods proposed, and more generally in how we measure those risks. |
|---|---|
| Author | Elliot, Mark; Little, Claire; Allmendinger, Richard |
| Author_xml | 1. Little, Claire (ORCID: 0000-0003-4803-3007); 2. Elliot, Mark; 3. Allmendinger, Richard |
| BackLink | https://www.ncbi.nlm.nih.gov/pubmed/38414544$$D View this record in MEDLINE/PubMed |
| CitedBy_id | crossref_primary_10_1051_itmconf_20257804003 crossref_primary_10_1109_OJCOMS_2025_3597019 crossref_primary_10_1002_psp4_70021 crossref_primary_10_1016_j_jafr_2025_102238 crossref_primary_10_1088_1361_6501_ad7878 crossref_primary_10_3934_aci_2024009 crossref_primary_10_3390_info15070379 crossref_primary_10_1108_JSM_11_2023_0441 crossref_primary_10_1186_s12911_024_02731_9 |
| ContentType | Journal Article |
| DBID | AAYXX CITATION CGR CUY CVF ECM EIF NPM 7X8 5PM DOA |
| DOI | 10.23889/ijpds.v8i1.2158 |
| DatabaseName | CrossRef Medline MEDLINE MEDLINE (Ovid) MEDLINE MEDLINE PubMed MEDLINE - Academic PubMed Central (Full Participant titles) DOAJ Directory of Open Access Journals |
| DatabaseTitle | CrossRef MEDLINE Medline Complete MEDLINE with Full Text PubMed MEDLINE (Ovid) MEDLINE - Academic |
| DatabaseTitleList | MEDLINE - Academic CrossRef MEDLINE |
| Database_xml | – sequence: 1 dbid: DOA name: DOAJ Directory of Open Access Journals url: https://www.doaj.org/ sourceTypes: Open Website – sequence: 2 dbid: NPM name: PubMed url: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed sourceTypes: Index Database – sequence: 3 dbid: 7X8 name: MEDLINE - Academic url: https://search.proquest.com/medline sourceTypes: Aggregation Database |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Economics |
| EISSN | 2399-4908 |
| ExternalDocumentID | oai_doaj_org_article_41b11932da354301aa661ae717c0aa78 PMC10898505 38414544 10_23889_ijpds_v8i1_2158 |
| Genre | Journal Article Scoping Review |
| GroupedDBID | AAFWJ AAYXX ADBBV AFPKN ALMA_UNASSIGNED_HOLDINGS BCNDV CITATION GROUPED_DOAJ M~E OK1 RPM CGR CUY CVF ECM EIF NPM 7X8 5PM |
| ID | FETCH-LOGICAL-c463t-f57cd804dab4d4761fa2c5d62199d2852fc6c9c876eb52b4fa6fb94212f6cf483 |
| IEDL.DBID | DOA |
| ISICitedReferencesCount | 13 |
| ISICitedReferencesURI | http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001114361600004&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| ISSN | 2399-4908 |
| IngestDate | Fri Oct 03 12:44:13 EDT 2025; Thu Aug 21 18:35:09 EDT 2025; Fri Jul 11 15:45:00 EDT 2025; Wed Sep 10 03:21:08 EDT 2025; Tue Nov 18 21:26:42 EST 2025; Sat Nov 29 06:19:45 EST 2025 |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 1 |
| Keywords | data utility; data confidentiality; review; synthetic data; federated learning |
| Language | English |
| License | http://creativecommons.org/licenses/by/4.0 This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. |
| LinkModel | DirectLink |
| MergedId | FETCHMERGED-LOGICAL-c463t-f57cd804dab4d4761fa2c5d62199d2852fc6c9c876eb52b4fa6fb94212f6cf483 |
| Notes | ObjectType-Article-2; SourceType-Scholarly Journals-1; ObjectType-Feature-3; ObjectType-Review-1. Statement on conflicts of interest: The authors declare that they have no conflicts to report. |
| ORCID | 0000-0003-4803-3007 |
| OpenAccessLink | https://doaj.org/article/41b11932da354301aa661ae717c0aa78 |
| PMID | 38414544 |
| PQID | 2932938414 |
| PQPubID | 23479 |
| ParticipantIDs | doaj_primary_oai_doaj_org_article_41b11932da354301aa661ae717c0aa78 pubmedcentral_primary_oai_pubmedcentral_nih_gov_10898505 proquest_miscellaneous_2932938414 pubmed_primary_38414544 crossref_citationtrail_10_23889_ijpds_v8i1_2158 crossref_primary_10_23889_ijpds_v8i1_2158 |
| PublicationCentury | 2000 |
| PublicationDate | 2023-01-01 |
| PublicationDateYYYYMMDD | 2023-01-01 |
| PublicationDate_xml | – month: 01 year: 2023 text: 2023-01-01 day: 01 |
| PublicationDecade | 2020 |
| PublicationPlace | Wales |
| PublicationPlace_xml | – name: Wales |
| PublicationTitle | International journal of population data science |
| PublicationTitleAlternate | Int J Popul Data Sci |
| PublicationYear | 2023 |
| Publisher | Swansea University |
| Publisher_xml | – name: Swansea University |
| SSID | ssj0002142761 |
| Score | 2.3206322 |
| SecondaryResourceType | review_article |
| SourceID | doaj; pubmedcentral; proquest; pubmed; crossref |
| SourceType | Open Website; Open Access Repository; Aggregation Database; Index Database; Enrichment Source |
| StartPage | 2158 |
| SubjectTerms | data confidentiality; data utility; Databases, Factual; Disclosure; Evidence Gaps; federated learning; Humans; Interior Design and Furnishings; Medical Records Systems, Computerized; Population Data Science; review; synthetic data |
| Title | Federated learning for generating synthetic data: a scoping review |
| URI | https://www.ncbi.nlm.nih.gov/pubmed/38414544 https://www.proquest.com/docview/2932938414 https://pubmed.ncbi.nlm.nih.gov/PMC10898505 https://doaj.org/article/41b11932da354301aa661ae717c0aa78 |
| Volume | 8 |
| WOSCitedRecordID | wos001114361600004&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| hasFullText | 1 |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |
| journalDatabaseRights | – providerCode: PRVAON databaseName: DOAJ Directory of Open Access Journals customDbUrl: eissn: 2399-4908 dateEnd: 99991231 omitProxy: false ssIdentifier: ssj0002142761 issn: 2399-4908 databaseCode: DOA dateStart: 20170101 isFulltext: true titleUrlDefault: https://www.doaj.org/ providerName: Directory of Open Access Journals – providerCode: PRVHPJ databaseName: ROAD: Directory of Open Access Scholarly Resources customDbUrl: eissn: 2399-4908 dateEnd: 99991231 omitProxy: false ssIdentifier: ssj0002142761 issn: 2399-4908 databaseCode: M~E dateStart: 20170101 isFulltext: true titleUrlDefault: https://road.issn.org providerName: ISSN International Centre |
| linkProvider | Directory of Open Access Journals |
| openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=Federated+learning+for+generating+synthetic+data%3A+a+scoping+review&rft.jtitle=International+journal+of+population+data+science&rft.au=Little%2C+Claire&rft.au=Elliot%2C+Mark&rft.au=Allmendinger%2C+Richard&rft.date=2023-01-01&rft.issn=2399-4908&rft.eissn=2399-4908&rft.volume=8&rft.issue=1&rft.spage=2158&rft_id=info:doi/10.23889%2Fijpds.v8i1.2158&rft.externalDBID=NO_FULL_TEXT |
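
The abstract in this record describes federated learning as training performed across multiple clients that produces one global model, with the training data staying on each client. The following is a minimal sketch of that idea, not the method of any article covered by the review. It assumes federated averaging (weighting each client's model by its sample count) and a one-parameter linear model in place of the deep generative models the review reports. The function names and the toy data are hypothetical.

```python
# Minimal federated-averaging sketch (illustrative assumption; not taken from the record).
# Each client fits y = w * x on its own data; only the weight w leaves the client.

def local_update(w, data, lr=0.01, epochs=5):
    """Run a few gradient-descent steps on one client's private data."""
    for _ in range(epochs):
        # Gradient of the mean squared error for the model y = w * x.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(client_weights, client_sizes):
    """Combine client models into one global model, weighted by sample count."""
    total = sum(client_sizes)
    return sum(w * n for w, n in zip(client_weights, client_sizes)) / total


def train(clients, rounds=50):
    """Server loop: broadcast the global weight, collect local updates, average them."""
    global_w = 0.0
    for _ in range(rounds):
        updates = [local_update(global_w, data) for data in clients]
        global_w = federated_average(updates, [len(d) for d in clients])
    return global_w


# Hypothetical toy data: three clients whose records all follow y = 3x.
clients = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0)],
    [(0.5, 1.5), (1.5, 4.5), (2.5, 7.5)],
]
global_w = train(clients)
```

In this sketch the server only ever sees the scalar weights returned by `local_update`, never the `(x, y)` records. In federated synthesis as the abstract defines it, the shared global model would instead be a generator from which synthetic records are sampled.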