Feature Selection and Document Clustering

Inderjit Dhillon, J. Kogan, M. Nicholas

Abstract: Feature selection is a basic step in the construction of a vector space or bag of words model [BB99]. In particular, when the processing task is to partition a given document collection into clusters of similar documents, a choice of good features along with good clustering algorithms is of paramount importance. This chapter suggests two techniques for feature or term selection along with a number of clustering strategies. The selection techniques significantly reduce the dimension of the vector space model. Examples that illustrate the effectiveness of the proposed algorithms are provided.

  • Feature Selection and Document Clustering (pdf, software)
    I. Dhillon, J. Kogan, M. Nicholas.
    A Comprehensive Survey of Text Mining, pp. 73-100, January 2003.