Spectral methods in machine learning and new strategies for very large datasets

Bibliographic Details
Published in: Proceedings of the National Academy of Sciences (PNAS), Vol. 106, No. 2, p. 369
Main Authors: Belabbas, Mohamed-Ali; Wolfe, Patrick J
Author Affiliation: Department of Statistics, School of Engineering and Applied Sciences, Harvard University, Oxford Street, Cambridge, MA 02138, USA
Format: Journal Article
Genre: Research Support, U.S. Gov't, Non-P.H.S.
Language: English
Published: United States, 13 January 2009
Subjects: Algorithms; Artificial Intelligence; Data Interpretation, Statistical; Databases, Factual; Information Storage and Retrieval - methods; Models, Statistical; Software
ISSN: 1091-6490
DOI: 10.1073/pnas.0810600105
PMID: 19129490
Online Access: http://doi.org/10.1073/pnas.0810600105
Abstract
Spectral methods are of fundamental importance in statistics and machine learning, because they underlie algorithms from classical principal components analysis to more recent approaches that exploit manifold structure. In most cases, the core technical problem can be reduced to computing a low-rank approximation to a positive-definite kernel. For the growing number of applications dealing with very large or high-dimensional datasets, however, the optimal approximation afforded by an exact spectral decomposition is too costly, because its complexity scales as the cube of either the number of training examples or their dimensionality. Motivated by such applications, we present here two new algorithms for the approximation of positive-semidefinite kernels, together with error bounds that improve on results in the literature. We approach this problem by seeking to determine, in an efficient manner, the most informative subset of our data relative to the kernel approximation task at hand. This leads to two new strategies based on the Nyström method that are directly applicable to massive datasets. The first of these, based on sampling, leads to a randomized algorithm whereupon the kernel induces a probability distribution on its set of partitions, whereas the second, based on sorting, provides for the selection of a partition in a deterministic way. We detail their numerical implementation and provide simulation results for a variety of representative problems in statistical data analysis, each of which demonstrates the improved performance of our approach relative to existing methods.
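The abstract's central computation, a Nyström low-rank approximation of a positive-semidefinite kernel built from a selected column subset, is compact enough to sketch. The code below is a minimal illustration only, not the authors' implementation: the names nystrom_approximation, sample_subset, and sort_subset are hypothetical, and the diagonal-weighted sampler and diagonal-sorting selector are simplified stand-ins for the paper's kernel-induced distribution on partitions and its deterministic sorting criterion, which the abstract does not specify in detail.

```python
import numpy as np

def nystrom_approximation(K, idx):
    """Nystrom approximation K_hat = C W^+ C^T, where C = K[:, idx] holds
    the selected columns and W is the kernel restricted to the chosen
    subset. Exact whenever rank(K) <= len(idx)."""
    C = K[:, idx]
    W = K[np.ix_(idx, idx)]
    return C @ np.linalg.pinv(W) @ C.T

def sample_subset(K, k, rng):
    """Randomized selection: draw k distinct indices with probability
    proportional to the kernel's diagonal entries (a simplified stand-in
    for the kernel-induced sampling distribution in the paper)."""
    p = np.diag(K) / np.trace(K)
    return rng.choice(K.shape[0], size=k, replace=False, p=p)

def sort_subset(K, k):
    """Deterministic selection: keep the k indices with the largest
    diagonal entries (a simple sorting criterion)."""
    return np.argsort(np.diag(K))[-k:]

# Toy positive-semidefinite kernel with nonconstant diagonal:
# a Gram matrix K = A A^T of rank 60, approximated from 50 columns.
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 60))
K = A @ A.T

for name, idx in (("sampling", sample_subset(K, 50, rng)),
                  ("sorting", sort_subset(K, 50))):
    err = np.linalg.norm(K - nystrom_approximation(K, idx), "fro")
    print(f"{name:8s} -> Frobenius error {err:.2f}")
```

Kept in factored form (C and the pseudoinverse of W), the approximation touches only an n x k and a k x k block, costing O(nk^2 + k^3) once the subset is chosen, rather than the O(n^3) of the exact spectral decomposition the abstract describes as too costly; this is the practical point of Nyström-type strategies for massive datasets.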