Efficient Clustering of Very Large Document Collections

Inderjit Dhillon, Yuqiang Guan, J. Fan

Abstract: An invaluable portion of scientific data occurs naturally in text form. Given a large unlabeled document collection, it is often helpful to organize this collection into clusters of related documents. By using a vector space model, text data can be treated as high-dimensional but sparse numerical data vectors. It is a contemporary challenge to efficiently preprocess and cluster very large document collections. In this paper we present a time- and memory-efficient technique for the entire clustering process, including the creation of the vector space model. This efficiency is obtained by (i) a memory-efficient multi-threaded preprocessing scheme, and (ii) a fast clustering algorithm that fully exploits the sparsity of the data set. We show that this entire process takes time that is linear in the size of the document collection. Detailed experimental results are presented. A highlight of our results is that we are able to effectively cluster a collection of 113,716 NSF award abstracts in 23 minutes (including disk I/O costs) on a single workstation with modest memory consumption.
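
To make the vector space idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes plain term-frequency weights normalized to unit length and a cosine-similarity assignment step of the kind used in k-means-style clustering; the abstract does not specify either choice. The function names (to_sparse_vector, sparse_dot, assign_to_nearest) and the sample documents are hypothetical. Each document is stored as a dictionary holding only its nonzero terms, so the cost of a similarity computation depends on the number of nonzeros rather than on the vocabulary size.

```python
import math
from collections import Counter


def to_sparse_vector(text):
    """Map a document to a sparse unit-length term-frequency vector.

    Assumption: whitespace tokenization and raw term counts; the source
    does not state its weighting scheme.
    """
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()} if norm else {}


def sparse_dot(u, v):
    """Dot product that iterates only over the nonzeros of the shorter vector."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(term, 0.0) for term, w in u.items())


def assign_to_nearest(vectors, centroids):
    """Assign each document to the centroid with the highest cosine similarity.

    Inputs are unit-length, so the dot product equals the cosine similarity.
    """
    return [
        max(range(len(centroids)), key=lambda j: sparse_dot(x, centroids[j]))
        for x in vectors
    ]


# Hypothetical toy collection: two topics, two documents each.
docs = [
    "sparse matrix eigenvalue solver",
    "eigenvalue solver for sparse matrix problems",
    "protein folding simulation",
    "simulation of protein folding dynamics",
]
vectors = [to_sparse_vector(d) for d in docs]
centroids = [vectors[0], vectors[2]]  # seed one centroid per topic
labels = assign_to_nearest(vectors, centroids)
print(labels)
```

Because every vector stores only the terms that actually occur in its document, one assignment pass costs time proportional to the total number of nonzero entries times the number of clusters, which is in the spirit of the linear-time behavior the abstract reports.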

Citation

  • Efficient Clustering of Very Large Document Collections
    I. Dhillon, Y. Guan, J. Fan.
    Data Mining for Scientific and Engineering Applications, pp. 357-381, 2001.
    (Invited chapter)
