Distributed Strategies for Mining Outliers in Large Data Sets
We introduce a distributed method for detecting distance-based outliers in very large data sets. Our approach is based on the concept of the outlier detection solving set [2], a small subset of the data set that can also be employed for predicting novel outliers. The method exploits parallel co...
| Published in: | IEEE Transactions on Knowledge and Data Engineering, Vol. 25, No. 7, pp. 1520-1532 |
|---|---|
| Main authors: | Angiulli, F.; Basta, S.; Lodi, S.; Sartori, C. |
| Format: | Journal Article |
| Language: | English |
| Published: | IEEE, 2013-07-01 |
| Subjects: | |
| ISSN: | 1041-4347 |
| Online access: | Get full text |
| Abstract | We introduce a distributed method for detecting distance-based outliers in very large data sets. Our approach is based on the concept of the outlier detection solving set [2], a small subset of the data set that can also be employed for predicting novel outliers. The method exploits parallel computation in order to obtain vast time savings. Indeed, beyond preserving the correctness of the result, the proposed scheme exhibits excellent performance. From the theoretical point of view, for common settings, our algorithm is expected to be at least three orders of magnitude faster than the classical nested-loop-like approach to detecting outliers. Experimental results show that the algorithm is efficient and that its running time scales quite well with an increasing number of nodes. We also discuss a variant of the basic strategy that reduces the amount of data to be transferred in order to improve both the communication cost and the overall runtime. Importantly, the solving set computed by our approach in a distributed environment has the same quality as that produced by the corresponding centralized method. |
|---|---|
| Author | Angiulli, F.; Basta, S.; Lodi, S.; Sartori, C. |
| Author_xml | – sequence: 1 givenname: F. surname: Angiulli fullname: Angiulli, F. email: f.angiulli@dimes.unical.it organization: DIMES Dept., Univ. of Calabria, Rende, Italy – sequence: 2 givenname: S. surname: Basta fullname: Basta, S. email: basta@icar.cnr.it organization: Inst. of High-Performance Comput. & Networking, Rende, Italy – sequence: 3 givenname: S. surname: Lodi fullname: Lodi, S. email: stefano.lodi@unibo.it organization: Dept. of Comput. Sci. & Eng., Univ. of Bologna, Bologna, Italy – sequence: 4 givenname: C. surname: Sartori fullname: Sartori, C. email: claudio.sartori@unibo.it organization: Dept. of Comput. Sci. & Eng., Univ. of Bologna, Bologna, Italy |
| CODEN | ITKEEH |
| CitedBy_id | crossref_primary_10_1016_j_ins_2019_07_045 crossref_primary_10_1016_j_neucom_2015_05_135 crossref_primary_10_1016_j_eswa_2016_10_026 crossref_primary_10_1080_0305215X_2024_2315501 crossref_primary_10_1109_TPDS_2016_2528984 crossref_primary_10_1155_2017_2649535 crossref_primary_10_1016_j_is_2020_101569 crossref_primary_10_1016_j_ins_2015_11_005 crossref_primary_10_1016_j_procs_2020_01_006 crossref_primary_10_1109_TKDE_2016_2555804 crossref_primary_10_1007_s10586_018_1767_1 crossref_primary_10_1109_TCSS_2019_2918193 crossref_primary_10_1080_01431161_2024_2377837 crossref_primary_10_1016_j_knosys_2021_107256 crossref_primary_10_1109_TSMC_2017_2718592 crossref_primary_10_1007_s11390_015_1596_0 crossref_primary_10_1080_08839514_2021_1975393 crossref_primary_10_1016_j_eswa_2020_113215 crossref_primary_10_1007_s11042_018_5905_9 crossref_primary_10_1186_s40537_020_00320_x crossref_primary_10_1016_j_eswa_2019_02_020 crossref_primary_10_3390_f16060915 crossref_primary_10_1145_3381028 crossref_primary_10_1080_00051144_2019_1576966 crossref_primary_10_1016_j_trc_2014_07_005 |
| Cites_doi | 10.1145/956750.956758 10.1109/TKDE.2005.31 10.1007/s10618-008-0093-2 10.1007/s10618-005-0014-6 10.1007/978-3-642-15277-1_32 10.1145/1541880.1541882 10.1109/60.749142 10.1145/1497577.1497581 10.1145/1150402.1150447 10.1109/TKDE.2006.29 10.1049/cp:19950597 10.1109/ICDM.2005.116 10.1109/ACC.2002.1024528 10.1145/342009.335437 10.1137/1.9781611972771.47 |
| ContentType | Journal Article |
| DBID | 97E RIA RIE AAYXX CITATION |
| DOI | 10.1109/TKDE.2012.71 |
| DatabaseName | IEEE All-Society Periodicals Package (ASPP) 2005–Present IEEE All-Society Periodicals Package (ASPP) 1998–Present IEEE/IET Electronic Library (IEL) (UW System Shared) CrossRef |
| DatabaseTitle | CrossRef |
| DatabaseTitleList | |
| Database_xml | – sequence: 1 dbid: RIE name: IEEE/IET Electronic Library (IEL) (UW System Shared) url: https://ieeexplore.ieee.org/ sourceTypes: Publisher |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Engineering Computer Science |
| EndPage | 1532 |
| ExternalDocumentID | 10_1109_TKDE_2012_71 6175896 |
| Genre | orig-research |
| GroupedDBID | -~X .DC 0R~ 1OL 29I 4.4 5GY 5VS 6IK 97E 9M8 AAJGR AARMG AASAJ AAWTH ABAZT ABFSI ABQJQ ABVLG ACGFO ACIWK AENEX AETIX AGQYO AGSQL AHBIQ AI. AIBXA AKJIK AKQYR ALLEH ALMA_UNASSIGNED_HOLDINGS ASUFR ATWAV BEFXN BFFAM BGNUA BKEBE BPEOZ CS3 DU5 E.L EBS EJD F5P HZ~ H~9 ICLAB IEDLZ IFIPE IFJZH IPLJI JAVBF LAI M43 MS~ O9- OCL P2P PQQKQ RIA RIE RNI RNS RXW RZB TAE TAF TN5 UHB VH1 AAYXX CITATION |
| ID | FETCH-LOGICAL-c255t-9e0ef422e557efd3351cfea27d78cc0bf8ddb06170ceef7a83e0a3e2c7d320e13 |
| IEDL.DBID | RIE |
| ISICitedReferencesCount | 42 |
| ISICitedReferencesURI | http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=000319461800007&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| ISSN | 1041-4347 |
| IngestDate | Tue Nov 18 22:35:36 EST 2025 Sat Nov 29 08:05:25 EST 2025 Wed Aug 27 02:52:16 EDT 2025 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 7 |
| Language | English |
| License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html |
| LinkModel | DirectLink |
| MergedId | FETCHMERGED-LOGICAL-c255t-9e0ef422e557efd3351cfea27d78cc0bf8ddb06170ceef7a83e0a3e2c7d320e13 |
| PageCount | 13 |
| ParticipantIDs | crossref_primary_10_1109_TKDE_2012_71 crossref_citationtrail_10_1109_TKDE_2012_71 ieee_primary_6175896 |
| PublicationCentury | 2000 |
| PublicationDate | 2013-07-01 |
| PublicationDateYYYYMMDD | 2013-07-01 |
| PublicationDate_xml | – month: 07 year: 2013 text: 2013-07-01 day: 01 |
| PublicationDecade | 2010 |
| PublicationTitle | IEEE transactions on knowledge and data engineering |
| PublicationTitleAbbrev | TKDE |
| PublicationYear | 2013 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| References | [1] doi: 10.1007/978-3-642-15277-1_32; [2] doi: 10.1109/TKDE.2006.29; [3] doi: 10.1145/1497577.1497581; [4] doi: 10.1109/TKDE.2005.31; [5] Asuncion, UCI Machine Learning Repository; [6] doi: 10.1145/956750.956758; [7] doi: 10.1145/1541880.1541882; [8] doi: 10.1137/1.9781611972771.47; [9] doi: 10.1007/s10618-008-0093-2; [10] doi: 10.1109/60.749142; [11] Han, Data Mining: Concepts and Techniques; [12] Hung, "Parallel Mining of Outliers in Large Database," Distributed and Parallel Databases, vol. 12, no. 1, p. 5; [13] doi: 10.1109/ACC.2002.1024528; [14] Advances in Distributed and Parallel Knowledge Discovery, 2000; [15] Knorr, "Algorithms for Mining Distance-Based Outliers in Large Datasets," Proc. 24th Int’l Conf. Very Large Data Bases (VLDB), p. 392; [16] Koufakou, "A Fast Outlier Detection Strategy for Distributed High-Dimensional Data Sets with Mixed Attributes," Data Mining and Knowledge Discovery, vol. 20, p. 259; [17] doi: 10.1109/ICDM.2005.116; [18] doi: 10.1007/s10618-005-0014-6; [19] doi: 10.1145/342009.335437; [20] doi: 10.1145/1150402.1150447; [21] doi: 10.1049/cp:19950597; [22] Large-Scale Parallel Data Mining |
| SSID | ssj0008781 |
| Score | 2.298693 |
| SourceID | crossref ieee |
| SourceType | Enrichment Source Index Database Publisher |
| StartPage | 1520 |
| SubjectTerms | Arrays; Data mining; Decision support systems; Distance-based outliers; Distributed databases; Nickel; outlier detection; parallel and distributed algorithms; Upper bound |
| Title | Distributed Strategies for Mining Outliers in Large Data Sets |
| URI | https://ieeexplore.ieee.org/document/6175896 |
| Volume | 25 |
| WOSCitedRecordID | wos000319461800007&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| hasFullText | 1 |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |
| journalDatabaseRights | – providerCode: PRVIEE databaseName: IEEE/IET Electronic Library (IEL) (UW System Shared) issn: 1041-4347 databaseCode: RIE dateStart: 19890101 customDbUrl: isFulltext: true dateEnd: 99991231 titleUrlDefault: https://ieeexplore.ieee.org/ omitProxy: false ssIdentifier: ssj0008781 providerName: IEEE |
| linkProvider | IEEE |
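
The abstract compares the distributed method against the classical nested-loop approach to distance-based outlier detection. As a rough illustration of that baseline only (not of the authors' solving-set algorithm), the sketch below scores each point by the sum of distances to its k nearest neighbours and reports the n highest-scoring points. The scoring rule, the function names `knn_weight` and `top_n_outliers`, and the sample data are assumptions made for this example; the record itself does not define them.

```python
import heapq
import math


def knn_weight(i, points, k):
    """Assumed outlier score: sum of distances from points[i] to its k nearest neighbours."""
    dists = (math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return sum(heapq.nsmallest(k, dists))


def top_n_outliers(points, k, n):
    """Nested-loop baseline: every point is compared with every other point."""
    weights = [(knn_weight(i, points, k), i) for i in range(len(points))]
    # Return the indices of the n points with the largest scores.
    return [i for _, i in heapq.nlargest(n, weights)]


# Hypothetical toy data: four clustered points and one distant point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
outliers = top_n_outliers(points, k=2, n=1)
print(outliers)  # [4]
```

The baseline needs a number of distance computations that grows quadratically with the data set size, which is the cost the abstract says the distributed method improves on.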