Distributed Principal Subspace Analysis for Partitioned Big Data: Algorithms, Analysis, and Implementation


Bibliographic details
Published in: IEEE Transactions on Signal and Information Processing over Networks, Vol. 7, pp. 699-715
Main authors: Gang, Arpita; Xiang, Bingqing; Bajwa, Waheed U.
Format: Journal Article
Language: English
Published: Piscataway: IEEE, 2021
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
ISSN: 2373-776X, 2373-7778
Online access: Full text
Abstract: Principal Subspace Analysis (PSA), and its sibling, Principal Component Analysis (PCA), is one of the most popular approaches for dimensionality reduction in signal processing and machine learning. But centralized PSA/PCA solutions are fast becoming irrelevant in the modern era of Big Data, in which the number of samples and/or the dimensionality of samples often exceeds the storage and/or computational capabilities of individual machines. This has led to the study of distributed PSA/PCA solutions, in which the data are partitioned across multiple machines and an estimate of the principal subspace is obtained through collaboration among the machines. It is in this vein that this paper revisits the problem of distributed PSA/PCA under the general framework of an arbitrarily connected network of machines that lacks a central server. The main contributions of the paper in this regard are threefold. First, two algorithms are proposed for distributed PSA/PCA, one for data partitioned across samples and the other for data partitioned across (raw) features. Second, in the case of sample-wise partitioned data, the proposed algorithm and a variant of it are analyzed, and their convergence to the true subspace at linear rates is established. Third, extensive experiments on both synthetic and real-world data are carried out to validate the usefulness of the proposed algorithms. In particular, in the case of sample-wise partitioned data, an MPI-based distributed implementation is carried out to study the interplay between network topology and communication cost, as well as the effects of straggler machines on the proposed algorithms.
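Illustrative note: the abstract describes estimating a principal subspace from sample-wise partitioned data over a network without a central server, and the record's subject terms mention orthogonal iteration. The following is a minimal sketch of that general idea, assuming a simple combination of orthogonal iteration with neighbor (consensus) averaging on a four-node ring. It is not the paper's published algorithm. All function names, the toy data, the weight matrix `W`, and the iteration counts are hypothetical choices made for this illustration only.

```python
import math

def matmul(A, B):
    """Plain-Python product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def orthonormalize(V):
    """Gram-Schmidt on the columns of V (the Q factor of a thin QR)."""
    cols = []
    for v in map(list, zip(*V)):
        for q in cols:
            dot = sum(a * b for a, b in zip(q, v))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        cols.append([a / norm for a in v])
    return [list(row) for row in zip(*cols)]

def consensus_average(values, W, rounds):
    """Repeated neighbor averaging: values[i] <- sum_j W[i][j] * values[j]."""
    n, d, r = len(values), len(values[0]), len(values[0][0])
    for _ in range(rounds):
        values = [[[sum(W[i][j] * values[j][p][q] for j in range(n))
                    for q in range(r)] for p in range(d)] for i in range(n)]
    return values

def distributed_orthogonal_iteration(local_covs, W, Q0, outer_iters=20, consensus_rounds=30):
    """Each node multiplies by its local covariance, averages with neighbors, then orthonormalizes."""
    n = len(local_covs)
    Q = [orthonormalize(Q0) for _ in range(n)]
    for _ in range(outer_iters):
        V = [matmul(local_covs[i], Q[i]) for i in range(n)]
        V = consensus_average(V, W, consensus_rounds)
        Q = [orthonormalize(V[i]) for i in range(n)]
    return Q

# Toy setup (assumption): four nodes, one 4-dimensional sample per node.
# The global covariance is diag(9, 4, 1, 0.25), so the top-2 subspace is span(e1, e2).
samples = [[3.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 0.5]]
local_covs = [[[x[p] * x[q] for q in range(4)] for p in range(4)] for x in samples]

# Doubly stochastic weights for a ring: each node talks only to its two neighbors.
W = [[0.50, 0.25, 0.00, 0.25],
     [0.25, 0.50, 0.25, 0.00],
     [0.00, 0.25, 0.50, 0.25],
     [0.25, 0.00, 0.25, 0.50]]

Q0 = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]  # common, arbitrary initial guess
estimates = distributed_orthogonal_iteration(local_covs, W, Q0)

# Energy of each node's estimate outside span(e1, e2); small values mean convergence.
errors = [math.sqrt(sum(v * v for row in Q[2:] for v in row)) for Q in estimates]
print(max(errors))
```

In this toy example no single node can recover the global top-2 subspace from its own rank-one local covariance; the estimate improves only because neighbor averaging approximates the network-wide sum. The number of consensus rounds per outer iteration is where network topology and communication cost would interact in such a scheme.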
Authors:
1. Gang, Arpita (ORCID: 0000-0003-3825-130X; email: arpita.gang@rutgers.edu), Department of Electrical and Computer Engineering, Rutgers University, New Brunswick, NJ, USA
2. Xiang, Bingqing (email: xiangbqxyy@gmail.com), Department of Electrical and Computer Engineering, Rutgers University, New Brunswick, NJ, USA
3. Bajwa, Waheed U. (ORCID: 0000-0003-4406-5263; email: waheed.bajwa@rutgers.edu), Department of Electrical and Computer Engineering, Rutgers University, New Brunswick, NJ, USA
Copyright: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2021
DOI: 10.1109/TSIPN.2021.3122297
Funding:
Army Research Office (funder ID 10.13039/100000183); grants W911NF-17-1-0546 and W911NF-21-1-0301
National Science Foundation (funder ID 10.13039/501100004802); grants CCF-1453073, CCF-1907658, and OAC-1940074
Subject terms: Algorithms; Big Data; Covariance matrices; Dimensionality reduction; Distributed data; Eigenvalues and eigenfunctions; Machine learning; Network topologies; orthogonal iteration; Partitioning algorithms; Principal component analysis; Principal components analysis; principal subspace; Signal processing; straggler effect; Subspaces
URI https://ieeexplore.ieee.org/document/9591481
https://www.proquest.com/docview/2595721354