Abstract: In this paper we propose a new information-theoretic divisive algorithm for word clustering applied to text classification. In previous work, such "distributional clustering" of features has been found to improve on feature selection in terms of classification accuracy, especially when the number of features is small [2, 28]. However, the existing clustering techniques are agglomerative in nature and result in (i) sub-optimal word clusters and (ii) high computational cost. In order to explicitly capture the optimality of word clusters in an information-theoretic framework, we first derive a global criterion for feature clustering. We then present a fast, divisive algorithm that monotonically decreases this objective function value, thus converging to a local minimum. We show that our algorithm minimizes the "within-cluster Jensen-Shannon divergence" while simultaneously maximizing the "between-cluster Jensen-Shannon divergence". In comparison to the previously proposed agglomerative strategies, our divisive algorithm achieves higher classification accuracy, especially when the number of features is small. We further show that feature clustering is an effective technique for building smaller class models in hierarchical classification. We present detailed experimental results using Naive Bayes and Support Vector Machines on the 20 Newsgroups data set and a 3-level hierarchy of HTML documents collected from the Dmoz Open Directory.
- Enhanced Word Clustering for Hierarchical Text Classification (pdf, software)
I. Dhillon, S. Mallela, R. Kumar.
In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), July 2002.
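The divisive scheme the abstract describes alternates a k-means-style assignment step (put each word's class-conditional distribution with the nearest cluster mean in KL divergence) and an update step (recompute each cluster's mean as the prior-weighted average of its members), which monotonically decreases the within-cluster divergence. A minimal illustrative sketch follows; the function and variable names, the naive first-k initialization, and the toy data are assumptions for illustration, not the paper's implementation.

```python
import math

def kl(p, q):
    """KL divergence D(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) if qi > 0 else float("inf")
               for pi, qi in zip(p, q) if pi > 0)

def divisive_cluster(word_dists, priors, k, iters=50):
    """Sketch of divisive information-theoretic word clustering.

    word_dists: {word: P(C|w)} class-conditional distributions (lists).
    priors:     {word: P(w)} word prior probabilities.
    Alternates assignment (nearest cluster mean in KL divergence) and
    update (prior-weighted mean of the member distributions); each step
    can only lower the objective, so the loop converges to a local minimum.
    """
    words = list(word_dists)
    # naive deterministic initialization: first k word distributions
    means = [list(word_dists[w]) for w in words[:k]]
    assign = {}
    for _ in range(iters):
        new_assign = {
            w: min(range(k), key=lambda j: kl(word_dists[w], means[j]))
            for w in words
        }
        if new_assign == assign:   # assignments stable: converged
            break
        assign = new_assign
        for j in range(k):
            members = [w for w in words if assign[w] == j]
            if not members:
                continue           # keep an empty cluster's old mean
            total = sum(priors[w] for w in members)
            means[j] = [
                sum(priors[w] * word_dists[w][c] for w in members) / total
                for c in range(len(means[j]))
            ]
    return assign

# toy example: two classes (say sports vs. finance), four words
dists = {"ball": [0.9, 0.1], "game": [0.8, 0.2],
         "stock": [0.1, 0.9], "bank": [0.2, 0.8]}
clusters = divisive_cluster(dists, {w: 0.25 for w in dists}, k=2)
# "ball"/"game" land in one cluster, "stock"/"bank" in the other
```

The update step mirrors the mixture form P(C|W_j) ∝ Σ_w P(w) P(C|w) over a cluster's members, which is what makes the assignment-in-KL step decrease the global objective rather than an ad hoc distance.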