Federated learning for generating synthetic data: a scoping review

Detailed bibliography
Published in: International journal of population data science, Volume 8; Issue 1; p. 2158
Main authors: Little, Claire; Elliot, Mark; Allmendinger, Richard
Format: Journal Article
Language: English
Published: Wales, Swansea University, 01.01.2023
ISSN: 2399-4908
Abstract
Introduction: Federated Learning (FL) is a decentralised approach to training statistical models, where training is performed across multiple clients, producing one global model. Since the training data remains with each local client and is not shared or exchanged with other clients the use of FL may reduce privacy and security risks (compared to methods where multiple data sources are pooled) and can also address data access and heterogeneity problems. Synthetic data is artificially generated data that has the same structure and statistical properties as the original but that does not contain any of the original data records, therefore minimising disclosure risk. Using FL to produce synthetic data (which we refer to as "federated synthesis") has the potential to combine data from multiple clients without compromising privacy, allowing access to data that may otherwise be inaccessible in its raw format.
Objectives: The objective was to review current research and practices for using FL to generate synthetic data and determine the extent to which research has been undertaken, the methods and evaluation practices used, and any research gaps.
Methods: A scoping review was conducted to systematically map and describe the published literature on the use of FL to generate synthetic data. Relevant studies were identified through online databases and the findings are described, grouped, and summarised. Information extracted included article characteristics, documenting the type of data that is synthesised, the model architecture and the methods (if any) used to evaluate utility and privacy risk.
Results: A total of 69 articles were included in the scoping review; all were published between 2018 and 2023 with two thirds (46) in 2022. 30% (21) were focussed on synthetic data generation as the main model output (with 6 of these generating tabular data), whereas 59% (41) focussed on data augmentation. Of the 21 performing federated synthesis, all used deep learning methods (predominantly Generative Adversarial Networks) to generate the synthetic data.
Conclusions: Federated synthesis is in its early days but shows promise as a method that can construct a global synthetic dataset without sharing any of the local client data. As a field in its infancy there are areas to explore in terms of the privacy risk associated with the various methods proposed, and more generally in how we measure those risks.
Author Elliot, Mark
Little, Claire
Allmendinger, Richard
Author_xml – sequence: 1
  givenname: Claire
  orcidid: 0000-0003-4803-3007
  surname: Little
  fullname: Little, Claire
– sequence: 2
  givenname: Mark
  surname: Elliot
  fullname: Elliot, Mark
– sequence: 3
  givenname: Richard
  surname: Allmendinger
  fullname: Allmendinger, Richard
BackLink https://www.ncbi.nlm.nih.gov/pubmed/38414544$$D View this record in MEDLINE/PubMed
CitedBy_id crossref_primary_10_1051_itmconf_20257804003
crossref_primary_10_1109_OJCOMS_2025_3597019
crossref_primary_10_1002_psp4_70021
crossref_primary_10_1016_j_jafr_2025_102238
crossref_primary_10_1088_1361_6501_ad7878
crossref_primary_10_3934_aci_2024009
crossref_primary_10_3390_info15070379
crossref_primary_10_1108_JSM_11_2023_0441
crossref_primary_10_1186_s12911_024_02731_9
ContentType Journal Article
DBID AAYXX
CITATION
CGR
CUY
CVF
ECM
EIF
NPM
7X8
5PM
DOA
DOI 10.23889/ijpds.v8i1.2158
DatabaseName CrossRef
Medline
MEDLINE
MEDLINE (Ovid)
MEDLINE
MEDLINE
PubMed
MEDLINE - Academic
PubMed Central (Full Participant titles)
DOAJ Directory of Open Access Journals
DatabaseTitle CrossRef
MEDLINE
Medline Complete
MEDLINE with Full Text
PubMed
MEDLINE (Ovid)
MEDLINE - Academic
DatabaseTitleList MEDLINE - Academic

CrossRef
MEDLINE
Database_xml – sequence: 1
  dbid: DOA
  name: DOAJ Directory of Open Access Journals
  url: https://www.doaj.org/
  sourceTypes: Open Website
– sequence: 2
  dbid: NPM
  name: PubMed
  url: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
  sourceTypes: Index Database
– sequence: 3
  dbid: 7X8
  name: MEDLINE - Academic
  url: https://search.proquest.com/medline
  sourceTypes: Aggregation Database
DeliveryMethod fulltext_linktorsrc
Discipline Economics
EISSN 2399-4908
ExternalDocumentID oai_doaj_org_article_41b11932da354301aa661ae717c0aa78
PMC10898505
38414544
10_23889_ijpds_v8i1_2158
Genre Journal Article
Scoping Review
GroupedDBID AAFWJ
AAYXX
ADBBV
AFPKN
ALMA_UNASSIGNED_HOLDINGS
BCNDV
CITATION
GROUPED_DOAJ
M~E
OK1
RPM
CGR
CUY
CVF
ECM
EIF
NPM
7X8
5PM
ID FETCH-LOGICAL-c463t-f57cd804dab4d4761fa2c5d62199d2852fc6c9c876eb52b4fa6fb94212f6cf483
IEDL.DBID DOA
ISICitedReferencesCount 13
ISICitedReferencesURI http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001114361600004&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
ISSN 2399-4908
IngestDate Fri Oct 03 12:44:13 EDT 2025
Thu Aug 21 18:35:09 EDT 2025
Fri Jul 11 15:45:00 EDT 2025
Wed Sep 10 03:21:08 EDT 2025
Tue Nov 18 21:26:42 EST 2025
Sat Nov 29 06:19:45 EST 2025
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 1
Keywords data utility
data confidentiality
review
synthetic data
federated learning
Language English
License http://creativecommons.org/licenses/by/4.0
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-c463t-f57cd804dab4d4761fa2c5d62199d2852fc6c9c876eb52b4fa6fb94212f6cf483
Notes ObjectType-Article-2
SourceType-Scholarly Journals-1
ObjectType-Feature-3
content type line 23
ObjectType-Review-1
Statement on conflicts of interest: The authors declare that they have no conflicts to report.
ORCID 0000-0003-4803-3007
OpenAccessLink https://doaj.org/article/41b11932da354301aa661ae717c0aa78
PMID 38414544
PQID 2932938414
PQPubID 23479
ParticipantIDs doaj_primary_oai_doaj_org_article_41b11932da354301aa661ae717c0aa78
pubmedcentral_primary_oai_pubmedcentral_nih_gov_10898505
proquest_miscellaneous_2932938414
pubmed_primary_38414544
crossref_citationtrail_10_23889_ijpds_v8i1_2158
crossref_primary_10_23889_ijpds_v8i1_2158
PublicationCentury 2000
PublicationDate 2023-01-01
PublicationDateYYYYMMDD 2023-01-01
PublicationDate_xml – month: 01
  year: 2023
  text: 2023-01-01
  day: 01
PublicationDecade 2020
PublicationPlace Wales
PublicationPlace_xml – name: Wales
PublicationTitle International journal of population data science
PublicationTitleAlternate Int J Popul Data Sci
PublicationYear 2023
Publisher Swansea University
Publisher_xml – name: Swansea University
SSID ssj0002142761
Score 2.3206322
SecondaryResourceType review_article
SourceID doaj
pubmedcentral
proquest
pubmed
crossref
SourceType Open Website
Open Access Repository
Aggregation Database
Index Database
Enrichment Source
StartPage 2158
SubjectTerms data confidentiality
data utility
Databases, Factual
Disclosure
Evidence Gaps
federated learning
Humans
Interior Design and Furnishings
Medical Records Systems, Computerized
Population Data Science
review
synthetic data
Title Federated learning for generating synthetic data: a scoping review
URI https://www.ncbi.nlm.nih.gov/pubmed/38414544
https://www.proquest.com/docview/2932938414
https://pubmed.ncbi.nlm.nih.gov/PMC10898505
https://doaj.org/article/41b11932da354301aa661ae717c0aa78
Volume 8
WOSCitedRecordID wos001114361600004&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
journalDatabaseRights – providerCode: PRVAON
  databaseName: DOAJ Directory of Open Access Journals
  customDbUrl:
  eissn: 2399-4908
  dateEnd: 99991231
  omitProxy: false
  ssIdentifier: ssj0002142761
  issn: 2399-4908
  databaseCode: DOA
  dateStart: 20170101
  isFulltext: true
  titleUrlDefault: https://www.doaj.org/
  providerName: Directory of Open Access Journals
– providerCode: PRVHPJ
  databaseName: ROAD: Directory of Open Access Scholarly Resources
  customDbUrl:
  eissn: 2399-4908
  dateEnd: 99991231
  omitProxy: false
  ssIdentifier: ssj0002142761
  issn: 2399-4908
  databaseCode: M~E
  dateStart: 20170101
  isFulltext: true
  titleUrlDefault: https://road.issn.org
  providerName: ISSN International Centre