Distributed Principal Subspace Analysis for Partitioned Big Data: Algorithms, Analysis, and Implementation
| Published in: | IEEE Transactions on Signal and Information Processing over Networks, Vol. 7, pp. 699-715 |
|---|---|
| Main authors: | Gang, Arpita; Xiang, Bingqing; Bajwa, Waheed U. |
| Format: | Journal Article |
| Language: | English |
| Published: | Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2021 |
| Subjects: | |
| ISSN: | 2373-776X, 2373-7778 |
| Online access: | Full text |
| Abstract | Principal Subspace Analysis (PSA), along with its sibling Principal Component Analysis (PCA), is one of the most popular approaches for dimensionality reduction in signal processing and machine learning. But centralized PSA/PCA solutions are fast becoming irrelevant in the modern era of Big Data, in which the number of samples and/or the dimensionality of samples often exceed the storage and/or computational capabilities of individual machines. This has led to the study of distributed PSA/PCA solutions, in which the data are partitioned across multiple machines and an estimate of the principal subspace is obtained through collaboration among the machines. It is in this vein that this paper revisits the problem of distributed PSA/PCA under the general framework of an arbitrarily connected network of machines that lacks a central server. The main contributions of the paper in this regard are threefold. First, two algorithms are proposed in the paper that can be used for distributed PSA/PCA, with one in the case of data partitioned across samples and the other in the case of data partitioned across (raw) features. Second, in the case of sample-wise partitioned data, the proposed algorithm and a variant of it are analyzed, and their convergence to the true subspace at linear rates is established. Third, extensive experiments on both synthetic and real-world data are carried out to validate the usefulness of the proposed algorithms. In particular, in the case of sample-wise partitioned data, an MPI-based distributed implementation is carried out to study the interplay between network topology and communications cost as well as to study the effects of straggler machines on the proposed algorithms. |
|---|---|
| Author | Bajwa, Waheed U.; Gang, Arpita; Xiang, Bingqing |
| Author_xml | 1. Gang, Arpita (ORCID: 0000-0003-3825-130X; email: arpita.gang@rutgers.edu), Department of Electrical and Computer Engineering, Rutgers University, New Brunswick, NJ, USA. 2. Xiang, Bingqing (email: xiangbqxyy@gmail.com), Department of Electrical and Computer Engineering, Rutgers University, New Brunswick, NJ, USA. 3. Bajwa, Waheed U. (ORCID: 0000-0003-4406-5263; email: waheed.bajwa@rutgers.edu), Department of Electrical and Computer Engineering, Rutgers University, New Brunswick, NJ, USA. |
| CODEN | ITSIBW |
| CitedBy_id | crossref_primary_10_1109_TSIPN_2022_3190743 crossref_primary_10_1109_TII_2023_3323685 crossref_primary_10_1109_TSIPN_2023_3302658 crossref_primary_10_1109_TSP_2022_3229635 crossref_primary_10_1109_TSP_2023_3239806 |
| Cites_doi | 10.1137/14096668X 10.1109/CVPR.2009.5206848 10.1109/GlobalSIP.2016.7905887 10.1109/ICASSP.2019.8683095 10.1080/03081087.2016.1267104 10.1016/0893-6080(89)90014-2 10.1109/TSP.2016.2523448 10.1016/j.sysconle.2004.02.022 10.1109/ICASSP.2017.7952998 10.1002/9781119387596 10.1109/TSIPN.2016.2524588 10.1109/TAC.2008.2009515 10.1093/imanum/17.1.1 10.1007/978-3-540-30218-6_19 10.1109/MSP.2020.2973345 10.1007/BF01933494 10.1037/h0071325 10.1080/14786440109462720 10.1137/15M1054201 10.1109/ACSSC.2008.5074720 10.1137/1024100 10.1007/978-3-642-31464-3_24 10.1109/TSP.2015.2472372 10.1109/JPROC.2018.2846568 10.1016/j.jpdc.2005.03.010 10.1016/j.jcss.2007.04.014 |
| ContentType | Journal Article |
| Copyright | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2021 |
| Copyright_xml | – notice: Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2021 |
| DOI | 10.1109/TSIPN.2021.3122297 |
| DatabaseName | IEEE All-Society Periodicals Package (ASPP) 2005–Present IEEE All-Society Periodicals Package (ASPP) 1998–Present IEEE/IET Electronic Library (IEL) (UW System Shared) CrossRef Electronics & Communications Abstracts Technology Research Database Advanced Technologies Database with Aerospace |
| DatabaseTitle | CrossRef Technology Research Database Advanced Technologies Database with Aerospace Electronics & Communications Abstracts |
| DatabaseTitleList | Technology Research Database |
| Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) - NZ url: https://ieeexplore.ieee.org/ sourceTypes: Publisher |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Engineering |
| EISSN | 2373-7778 |
| EndPage | 715 |
| ExternalDocumentID | 10_1109_TSIPN_2021_3122297 9591481 |
| Genre | orig-research |
| GrantInformation_xml | – fundername: Army Research Office funderid: 10.13039/100000183 – fundername: W911NF-17-1-0546 grantid: W911NF-21-1-0301 – fundername: National Science Foundation grantid: CCF-1453073; CCF-1907658; OAC-1940074 funderid: 10.13039/501100004802 |
| ISICitedReferencesCount | 10 |
| ISSN | 2373-776X |
| IsPeerReviewed | true |
| IsScholarly | true |
| Language | English |
| License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html https://doi.org/10.15223/policy-029 https://doi.org/10.15223/policy-037 |
| ORCID | 0000-0003-4406-5263 0000-0003-3825-130X |
| PageCount | 17 |
| PublicationCentury | 2000 |
| PublicationDate | 20210000 2021-00-00 20210101 |
| PublicationDateYYYYMMDD | 2021-01-01 |
| PublicationDate_xml | – year: 2021 text: 20210000 |
| PublicationDecade | 2020 |
| PublicationPlace | Piscataway |
| PublicationPlace_xml | – name: Piscataway |
| PublicationTitle | IEEE transactions on signal and information processing over networks |
| PublicationTitleAbbrev | TSIPN |
| PublicationYear | 2021 |
| Publisher | IEEE The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
| Publisher_xml | – name: IEEE – name: The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
| StartPage | 699 |
| SubjectTerms | Algorithms; Big Data; Covariance matrices; Dimensionality reduction; Distributed data; Eigenvalues and eigenfunctions; Machine learning; Network topologies; orthogonal iteration; Partitioning algorithms; Principal component analysis; Principal components analysis; principal subspace; Signal processing; straggler effect; Subspaces |
| Title | Distributed Principal Subspace Analysis for Partitioned Big Data: Algorithms, Analysis, and Implementation |
| URI | https://ieeexplore.ieee.org/document/9591481 https://www.proquest.com/docview/2595721354 |
| Volume | 7 |
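
The sample-wise partitioned setting described in the abstract (data split across machines on a network with no central server) can be pictured with a generic textbook construction: orthogonal iteration in which each node multiplies by its local covariance, then averages the result with its neighbours. The sketch below is an illustration under stated assumptions, not the authors' algorithm. The function names (`ring_weights`, `distributed_orthogonal_iteration`), the ring topology, the mixing weights, and the iteration counts are all hypothetical choices, and the network is simulated inside a single process rather than with MPI.

```python
import numpy as np


def ring_weights(n_nodes):
    """Doubly stochastic mixing matrix for a ring (assumed topology).

    Each node gives weight 1/3 to itself and 1/3 to each ring neighbour.
    """
    W = np.zeros((n_nodes, n_nodes))
    for i in range(n_nodes):
        W[i, i] += 1.0 / 3.0
        W[i, (i - 1) % n_nodes] += 1.0 / 3.0
        W[i, (i + 1) % n_nodes] += 1.0 / 3.0
    return W


def distributed_orthogonal_iteration(local_covs, W, r,
                                     n_outer=50, n_consensus=30, seed=0):
    """Simulate consensus-based orthogonal iteration (illustrative only).

    local_covs : list of (d, d) arrays, one local covariance per node
    W          : (n_nodes, n_nodes) doubly stochastic mixing matrix
    r          : dimension of the subspace to estimate
    Returns an array of shape (n_nodes, d, r): one orthonormal basis per node.
    """
    n_nodes = len(local_covs)
    d = local_covs[0].shape[0]
    rng = np.random.default_rng(seed)

    # Every node starts from the same random orthonormal basis.
    Q0, _ = np.linalg.qr(rng.standard_normal((d, r)))
    Q = np.stack([Q0] * n_nodes)

    for _ in range(n_outer):
        # Local step: each node applies its own covariance to its estimate.
        V = np.stack([C @ Qi for C, Qi in zip(local_covs, Q)])

        # Consensus step: repeated neighbour averaging, V_i <- sum_j W_ij V_j,
        # drives every V_i towards the network-wide average of C_j Q_j.
        for _ in range(n_consensus):
            V = np.einsum("ij,jdr->idr", W, V)

        # Local orthonormalisation; QR is invariant to the 1/n_nodes scaling.
        Q = np.stack([np.linalg.qr(Vi)[0] for Vi in V])

    return Q
```

Each node only ever touches its own covariance and its neighbours' intermediate matrices, which is the point of the serverless setting; the number of consensus rounds per outer iteration is the knob that trades communication cost against accuracy.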