Distributed Nearest Neighbor-Based Condensation of Very Large Data Sets

Bibliographic Details
Published in:	IEEE transactions on knowledge and data engineering Vol. 19; no. 12; pp. 1593 - 1606
Main Authors:	Angiulli, F., Folino, G.
Format:	Journal Article
Language:	English
Published:	New York IEEE 01.12.2007 The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects:	Algorithm design and analysis; Algorithms; Cellular neural networks; Classification; Clustering; Clustering, classification, and association rules; Collection; Computation; Concurrent computing; Condensing; Cost analysis; Costs; Data mining; Data reduction; Distributed algorithms; Distributed computing; Distributed systems; Grid computing; Nearest neighbor searches; Neural networks
ISSN:	1041-4347, 1558-2191

Description
Summary:	This work presents the parallel fast condensed nearest neighbor (PFCNN) rule, a distributed method for computing a consistent subset of a very large data set for the nearest neighbor classification rule. To cope with the communication overhead typical of distributed environments and to reduce memory requirements, different variants of the basic PFCNN method are introduced. Spatial cost, CPU cost, and communication overhead are analyzed for all the algorithms. Experiments on both synthetic and real very large data sets show that these methods can be profitably applied to enormous collections of data: they scale up well and are efficient in memory consumption, confirming the theoretical analysis, and they achieve noticeable data reduction and good classification accuracy. To the best of our knowledge, this is the first distributed algorithm for computing a training-set-consistent subset for the nearest neighbor rule.
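
Note: the summary relies on the idea of a training-set-consistent subset, that is, a subset whose nearest neighbor rule classifies every point of the full training set correctly. The sketch below is only an illustration of that definition and is not the PFCNN method of the article. It uses a simple sequential condensation loop in the spirit of the classic condensed nearest neighbor rule, and the function names (`nearest_label`, `is_consistent`, `condense`) and the toy data are assumptions made up for this example.

```python
from math import dist  # Euclidean distance, Python 3.8+


def nearest_label(point, subset):
    """Return the label of the subset element closest to `point` (1-NN rule)."""
    return min(subset, key=lambda item: dist(point, item[0]))[1]


def is_consistent(subset, training_set):
    """True if 1-NN over `subset` classifies every training point correctly."""
    return all(nearest_label(x, subset) == y for x, y in training_set)


def condense(training_set):
    """Illustrative sequential condensation (not PFCNN).

    Starts from one point and keeps adding misclassified training points
    until a full pass adds nothing. Assumes no two identical points carry
    different labels; otherwise the result may not be consistent.
    """
    subset = [training_set[0]]
    changed = True
    while changed:
        changed = False
        for x, y in training_set:
            if nearest_label(x, subset) != y and (x, y) not in subset:
                subset.append((x, y))
                changed = True
    return subset


# Hypothetical toy data: two well-separated classes in the plane.
training_set = [
    ((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
    ((1.0, 1.0), "b"), ((0.9, 1.1), "b"), ((1.1, 0.9), "b"),
]
subset = condense(training_set)
```

On this toy data the condensed subset is much smaller than the training set while still classifying all of it correctly, which is the kind of data reduction the summary refers to. Each pass scans the whole training set on a single machine, which hints at why a distributed formulation matters for very large collections.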
DOI:	10.1109/TKDE.2007.190665