Abstract: High dimensionality of text can be a deterrent in applying complex learners such as Support Vector Machines to the task of text classification. Feature clustering is a powerful alternative to feature selection for reducing the dimensionality of text data. In this paper we propose a new information theoretic divisive algorithm for feature/word clustering and apply it to text classification. Existing techniques for such “distributional clustering” of words are agglomerative in nature and result in (i) sub-optimal word clusters and (ii) high computational cost. In order to explicitly capture the optimality of word clusters in an information theoretic framework, we first derive a global criterion for feature clustering. We then present a fast, divisive algorithm that monotonically decreases this objective function value. We show that our algorithm minimizes the “within-cluster Jensen-Shannon divergence” while simultaneously maximizing the “between-cluster Jensen-Shannon divergence”. In comparison to the previously proposed agglomerative strategies our divisive algorithm is much faster and achieves comparable or higher classification accuracies. We further show that feature clustering is an effective technique for building smaller class models in hierarchical classification. We present detailed experimental results using Naive Bayes and Support Vector Machines on the 20Newsgroups data set and a 3-level hierarchy of HTML documents collected from the Open Directory project (www.dmoz.org).
- A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification (pdf, software)
I. Dhillon, S. Mallela, R. Kumar.
Journal of Machine Learning Research (JMLR) 3, pp. 1265-1287, March 2003.