Efficient Clustering of Very Large Document Collections

Abstract: An invaluable portion of scientific data occurs naturally in text form. Given a large unlabeled document collection, it is often helpful to organize this collection into clusters of related documents. By using a vector space model, text data can be treated as high-dimensional but sparse numerical data vectors. It is a contemporary challenge to efficiently preprocess and cluster very large document collections. In this paper we present a time and memory efficient technique for the entire clustering process, including the creation of the vector space model. This efficiency is obtained by (i) a memory-efficient multi-threaded preprocessing scheme, and (ii) a fast clustering algorithm that fully exploits the sparsity of the data set. We show that this entire process takes time that is linear in the size of the document collection. Detailed experimental results are presented — a highlight of our results is that we are able to effectively cluster a collection of 113,716 NSF award abstracts in 23 minutes (including disk I/O costs) on a single workstation with modest memory consumption.
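The two ideas named in the abstract, a sparse vector space model and clustering that touches only nonzero entries, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the algorithm, so the cosine-similarity k-means loop below is an assumption, as are the function names (`to_unit_sparse_vector`, `sparse_dot`, `cluster`), the whitespace tokenizer, and the evenly spaced seeding. The multi-threaded preprocessing scheme is not shown.

```python
import math
from collections import Counter, defaultdict


def to_unit_sparse_vector(text):
    """Map a document to a sparse term-frequency vector with unit L2 norm.

    Only terms that occur in the document are stored, so memory grows with
    the number of distinct terms in the document, not with the vocabulary.
    """
    counts = Counter(text.lower().split())  # assumption: whitespace tokenizer
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()} if norm else {}


def sparse_dot(sparse_vec, centroid):
    """Dot product that visits only the nonzero entries of sparse_vec."""
    return sum(w * centroid.get(term, 0.0) for term, w in sparse_vec.items())


def cluster(docs, k, iterations=10):
    """Cosine-similarity k-means over sparse vectors (illustrative only)."""
    vectors = [to_unit_sparse_vector(d) for d in docs]
    # Assumption: deterministic seeding from evenly spaced documents.
    centroids = [dict(vectors[i * len(vectors) // k]) for i in range(k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each document to the centroid with the highest similarity.
        labels = [
            max(range(k), key=lambda j: sparse_dot(v, centroids[j]))
            for v in vectors
        ]
        # Recompute each centroid as the normalized sum of its members.
        sums = [defaultdict(float) for _ in range(k)]
        for v, j in zip(vectors, labels):
            for term, w in v.items():
                sums[j][term] += w
        for j, s in enumerate(sums):
            norm = math.sqrt(sum(w * w for w in s.values()))
            if norm:  # an empty cluster keeps its previous centroid
                centroids[j] = {t: w / norm for t, w in s.items()}
    return labels


if __name__ == "__main__":
    docs = [
        "sparse matrix clustering algorithm",
        "clustering algorithm sparse data",
        "protein gene sequence biology",
        "gene protein expression biology",
    ]
    print(cluster(docs, k=2))
```

In this sketch each iteration visits every nonzero entry once per centroid, so for a fixed number of clusters and iterations the work grows with the total number of nonzeros in the collection.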

Topics:
Data Clustering

Download: pdf

Citation

Efficient Clustering of Very Large Document Collections (pdf, software)
I. Dhillon, Y. Guan, J. Fan.
Data Mining for Scientific and Engineering Applications, pp. 357-381, 2001.
(Invited chapter)

Bibtex:
@incollection{dhillon2001efficient,
  author = "Inderjit S. Dhillon and Yuqiang Guan and J. Fan",
  title = "Efficient Clustering of Very Large Document Collections",
  booktitle = "Data Mining for Scientific and Engineering Applications",
  publisher = "Kluwer Academic Publishers",
  pages = "357--381",
  chapter = "20",
  year = "2001",
  note = "Invited chapter",
  abstract = "An invaluable portion of scientific data occurs naturally in text form. Given a large unlabeled document collection, it is often helpful to organize this collection into clusters of related documents. By using a vector space model, text data can be treated as high-dimensional but sparse numerical data vectors. It is a contemporary challenge to efficiently preprocess and cluster very large document collections. In this paper we present a time and memory efficient technique for the entire clustering process, including the creation of the vector space model. This efficiency is obtained by (i) a memory-efficient multi-threaded preprocessing scheme, and (ii) a fast clustering algorithm that fully exploits the sparsity of the data set. We show that this entire process takes time that is linear in the size of the document collection. Detailed experimental results are presented --- a highlight of our results is that we are able to effectively cluster a collection of 113,716 NSF award abstracts in 23 minutes (including disk I/O costs) on a single workstation with modest memory consumption."
}

Center for Big Data Analytics
