A Data Clustering Algorithm on Distributed Memory Multiprocessors

Abstract: To cluster increasingly massive data sets that are common today in data and text mining, we propose a parallel implementation of the k-means clustering algorithm based on the message passing model. The proposed algorithm exploits the inherent data-parallelism in the k-means algorithm. We analytically show that the speedup and the scaleup of our algorithm approach the optimal as the number of data points increases. We implemented our algorithm on an IBM POWERparallel SP2 with a maximum of 16 nodes. On typical test data sets, we observe nearly linear relative speedups, for example, 15.62 on 16 nodes, and essentially linear scaleup in the size of the data set and in the number of clusters desired. For a 2 gigabyte test data set, our implementation drives the 16 node SP2 at more than 1.8 gigaflops.
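The data-parallel structure mentioned in the abstract can be illustrated with a minimal single-process sketch. This is not the authors' implementation. It assumes the data is split into one chunk per node, that each node computes per-cluster partial sums and counts for its own chunk, and that these are combined by an all-reduce style global sum. The names `local_stats`, `allreduce_sum`, and `parallel_kmeans` are hypothetical, and `allreduce_sum` is a plain Python stand-in for a message-passing collective.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
Partial = Tuple[List[List[float]], List[int]]


def local_stats(chunk: Sequence[Point], centroids: Sequence[Point]) -> Partial:
    """Work done by one (simulated) node: assign each local point to its
    nearest centroid and accumulate per-cluster coordinate sums and counts."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in chunk:
        # Index of the nearest centroid by squared Euclidean distance.
        j = min(
            range(k),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
        )
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts


def allreduce_sum(partials: Sequence[Partial]) -> Partial:
    """Stand-in for a message-passing all-reduce (assumption): element-wise
    sum of every node's partial sums and counts."""
    k, dim = len(partials[0][0]), len(partials[0][0][0])
    total_sums = [[0.0] * dim for _ in range(k)]
    total_counts = [0] * k
    for sums, counts in partials:
        for j in range(k):
            total_counts[j] += counts[j]
            for d in range(dim):
                total_sums[j][d] += sums[j][d]
    return total_sums, total_counts


def parallel_kmeans(
    chunks: Sequence[Sequence[Point]],
    centroids: Sequence[Point],
    iterations: int = 10,
) -> List[Point]:
    """Run a fixed number of k-means iterations over pre-partitioned data."""
    current = [tuple(c) for c in centroids]
    for _ in range(iterations):
        # On a real machine each chunk would be processed concurrently,
        # one chunk per node; here the nodes are simulated in a loop.
        partials = [local_stats(chunk, current) for chunk in chunks]
        sums, counts = allreduce_sum(partials)
        # Every node would compute the same new centroids from the global sums.
        # An empty cluster keeps its previous centroid.
        current = [
            tuple(s / counts[j] for s in sums[j]) if counts[j] else current[j]
            for j in range(len(current))
        ]
    return current
```

In this sketch only the k partial sums and counts cross node boundaries in each iteration, while the distance computations stay local to each chunk. That is one way to read the abstract's claim that speedup approaches the optimal as the number of data points grows, since the local work grows with the data while the communicated volume does not.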

Topics:
Data Clustering

Download: pdf

Citation

A Data Clustering Algorithm on Distributed Memory Multiprocessors (pdf, software)
I. Dhillon, D. Modha.
In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), August 1999.
(Also appears as IBM Research Report RJ 10134.)

Bibtex:
@inproceedings{dhillon1999adataclu,
  author = "Inderjit S. Dhillon and Dharmendra S. Modha",
  title = "A Data Clustering Algorithm on Distributed Memory Multiprocessors",
  booktitle = "ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD)",
  year = "1999",
  month = "aug",
  note = "Also appears as IBM Research Report RJ 10134.",
  abstract = "To cluster increasingly massive data sets that are common today in data and text mining, we propose a parallel implementation of the k-means clustering algorithm based on the message passing model. The proposed algorithm exploits the inherent data-parallelism in the k-means algorithm. We analytically show that the speedup and the scaleup of our algorithm approach the optimal as the number of data points increases. We implemented our algorithm on an IBM POWERparallel SP2 with a maximum of 16 nodes. On typical test data sets, we observe nearly linear relative speedups, for example, 15.62 on 16 nodes, and essentially linear scaleup in the size of the data set and in the number of clusters desired. For a 2 gigabyte test data set, our implementation drives the 16 node SP2 at more than 1.8 gigaflops."
}
