Spectral methods in machine learning and new strategies for very large datasets
| Published in: | Proceedings of the National Academy of Sciences - PNAS, Volume 106, Issue 2, p. 369 |
|---|---|
| Main authors: | Belabbas, Mohamed-Ali; Wolfe, Patrick J |
| Format: | Journal Article |
| Language: | English |
| Published: | United States, 13.01.2009 |
| ISSN: | 1091-6490 |
| Abstract | Spectral methods are of fundamental importance in statistics and machine learning, because they underlie algorithms from classical principal components analysis to more recent approaches that exploit manifold structure. In most cases, the core technical problem can be reduced to computing a low-rank approximation to a positive-definite kernel. For the growing number of applications dealing with very large or high-dimensional datasets, however, the optimal approximation afforded by an exact spectral decomposition is too costly, because its complexity scales as the cube of either the number of training examples or their dimensionality. Motivated by such applications, we present here two new algorithms for the approximation of positive-semidefinite kernels, together with error bounds that improve on results in the literature. We approach this problem by seeking to determine, in an efficient manner, the most informative subset of our data relative to the kernel approximation task at hand. This leads to two new strategies based on the Nyström method that are directly applicable to massive datasets. The first of these, based on sampling, leads to a randomized algorithm whereby the kernel induces a probability distribution on its set of partitions; the second, based on sorting, selects a partition deterministically. We detail their numerical implementation and provide simulation results for a variety of representative problems in statistical data analysis, each of which demonstrates the improved performance of our approach relative to existing methods. |
|---|---|
| Author | Belabbas, Mohamed-Ali; Wolfe, Patrick J |
| Affiliation | Department of Statistics, School of Engineering and Applied Sciences, Harvard University, Oxford Street, Cambridge, MA 02138, USA |
| PubMed | https://www.ncbi.nlm.nih.gov/pubmed/19129490 |
| ContentType | Journal Article |
| DOI | 10.1073/pnas.0810600105 |
| Discipline | Sciences (General) |
| EISSN | 1091-6490 |
| ExternalDocumentID | 19129490 |
| Genre | Research Support, U.S. Gov't, Non-P.H.S Journal Article |
| ISICitedReferencesCount | 90 |
| ISSN | 1091-6490 |
| IsDoiOpenAccess | false |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 2 |
| Language | English |
| OpenAccessLink | http://doi.org/10.1073/pnas.0810600105 |
| PMID | 19129490 |
| PublicationDate | 2009-01-13 |
| PublicationPlace | United States |
| PublicationTitle | Proceedings of the National Academy of Sciences - PNAS |
| PublicationTitleAlternate | Proc Natl Acad Sci U S A |
| PublicationYear | 2009 |
| StartPage | 369 |
| SubjectTerms | Algorithms Artificial Intelligence Data Interpretation, Statistical Databases, Factual Information Storage and Retrieval - methods Methods Models, Statistical Software |
| Title | Spectral methods in machine learning and new strategies for very large datasets |
| URI | https://www.ncbi.nlm.nih.gov/pubmed/19129490 https://www.proquest.com/docview/66816081 |
| Volume | 106 |