# Topic Models with Word Dependencies

*Primary researcher(s):* David Inouye

### Project Summary

This project develops **two** new topic models based on Poisson Markov Random Fields (PMRFs), which can model dependencies between words. Previous independent topic models such as PLSA (Hofmann, 1999), LDA (Blei, 2003), and SAM (Reisinger, 2010) did not explicitly model dependencies between words. More generally, this project considers two classes of admixture models that generalize previous topic models. The first model is called an Admixture of Poisson MRFs (APM) and the second is called a Fixed-Length Poisson MRF Topic Model (LPMRF-TM). These models are distinct relaxations of the LDA modeling assumptions. This project also considers tractable methods for estimating the parameters of these models based on the pseudo log-likelihood; essentially, estimation becomes a large-scale optimization problem. Preliminary qualitative and quantitative experiments show that APM and the LPMRF-TM have potential, but more experimental analysis is needed.

### Project Description

**Poisson MRF (PMRF) vs. Independent Distributions**

As stated in the previous section, instead of using independent Poisson distributions for the components, we propose a Poisson MRF for the component distributions so that we can model *dependencies between words*. A multivariate generalization of the univariate Poisson was recently introduced by (Yang, 2012). This Poisson MRF model assumes that the conditional distributions are univariate Poisson, which is similar to a Gaussian MRF. However, unlike a Gaussian MRF, the marginals are not univariate Poisson. This allows for a much more expressive topic distribution than an independent distribution. For example, if the word “classification” appears, “supervised” is more likely to appear than in general documents. Moreover, since these are graphical model distributions, the dependencies are Markov with respect to an undirected graph, which provides a visually appealing representation of any topic, in contrast to just a list of words as in PLSA/LDA.
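The node-conditional structure described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the names `theta` (node parameters) and `Theta` (symmetric edge-weight matrix with zero diagonal) are hypothetical. The conditional distribution of word *s* given the other counts is univariate Poisson whose log-rate is linear in the neighboring counts:

```python
import numpy as np

def conditional_rate(s, x, theta, Theta):
    """Poisson rate of word s conditioned on the other counts x_{-s}.

    In a Poisson MRF, p(x_s | x_{-s}) is univariate Poisson with
    log-rate theta[s] + sum_{t != s} Theta[s, t] * x[t].
    """
    mask = np.arange(len(x)) != s
    return np.exp(theta[s] + Theta[s, mask] @ x[mask])

# Toy example: a negative edge weight between words 0 and 1 means that
# seeing word 1 lowers the expected count of word 0 (the original PMRF
# can only express such negative dependencies).
theta = np.array([0.5, 0.2, -0.1])
Theta = np.zeros((3, 3))
Theta[0, 1] = Theta[1, 0] = -0.3
x = np.array([0, 2, 1])
rate = conditional_rate(0, x, theta, Theta)
```

The edge matrix plays a role analogous to the precision matrix of a Gaussian MRF: a zero entry means the two words are conditionally independent given all the others.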

**Fixed-Length Poisson MRFs (LPMRF)**

We propose a novel distribution that generalizes the Multinomial distribution to enable dependencies between the dimensions. Our distribution is based on the parametric form of the Poisson MRF model (Yang, 2012) but is fundamentally different because its domain is restricted to fixed-length vectors, as in a Multinomial where the number of trials is fixed or known. Unlike the original PMRF (Yang, 2012), which could only model negative dependencies, this distribution allows positive dependencies between words without artificially restricting the domain as in Truncated PMRFs (Yang, 2013) or altering the sufficient statistics as in Sublinear PMRFs (Yang, 2013). Because an LPMRF distribution has a known length, as with Multinomials, we can combine it with a distribution on length, such as a Poisson, to obtain a complete distribution over all count-valued vectors, as illustrated in the following figures.
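A minimal sketch of the unnormalized LPMRF log-density, under the assumption (not taken verbatim from the paper) that it combines the Multinomial base measure with linear and pairwise terms in the PMRF parametric form; `theta` and `Theta` are hypothetical names for the node and edge parameters:

```python
import numpy as np
from math import lgamma

def lpmrf_unnorm_logp(x, theta, Theta):
    """Unnormalized log-probability of a count vector x with fixed
    length L = sum(x): the Multinomial base measure L!/prod(x_s!)
    times exp(theta'x + x'Theta x), with Theta's diagonal zero.
    """
    L = int(x.sum())
    base = lgamma(L + 1) - sum(lgamma(c + 1) for c in x)
    return base + theta @ x + x @ Theta @ x

# With Theta = 0 this reduces to a Multinomial with logits theta
# (up to normalization); here both parameter sets are zero, so the
# value is just the log Multinomial coefficient log(2!/(1!1!)).
theta = np.zeros(2)
Theta = np.zeros((2, 2))
val = lpmrf_unnorm_logp(np.array([1, 1]), theta, Theta)
```

Setting all edge weights to zero recovers the independent (Multinomial) special case, which is what makes the LPMRF a strict generalization.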

Significant advantages of the LPMRF distribution over PMRFs include:

- Allows positive dependencies without unrealistic constraints like a truncated domain.
- Tractable estimation of the normalizing constant (log partition function) using Annealed Importance Sampling (AIS).
- Calculation of the likelihood, since the normalizing constant can now be estimated.
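For intuition about what AIS is estimating, here is a brute-force sketch that computes the fixed-length log partition function exactly by enumerating every count vector of length L. This is only feasible for tiny vocabularies (it is the kind of ground truth one would use to sanity-check a sampler); the parameter names mirror the hypothetical `theta`/`Theta` convention above:

```python
import itertools
import numpy as np
from math import lgamma

def exact_log_partition(theta, Theta, L):
    """Exact log normalizing constant over all count vectors summing
    to L, using the unnormalized LPMRF form (Multinomial base measure
    plus linear and pairwise terms). Exponential in vocabulary size,
    so AIS replaces this at any realistic scale."""
    p = len(theta)
    logps = []
    for x in itertools.product(range(L + 1), repeat=p):
        if sum(x) != L:
            continue
        x = np.array(x)
        base = lgamma(L + 1) - sum(lgamma(c + 1) for c in x)
        logps.append(base + theta @ x + x @ Theta @ x)
    m = max(logps)  # log-sum-exp for numerical stability
    return m + np.log(np.sum(np.exp(np.array(logps) - m)))

# With zero parameters and p = 2, L = 3, the sum of Multinomial
# coefficients is 2^3 = 8, so the log partition is log(8).
Z = exact_log_partition(np.zeros(2), np.zeros((2, 2)), 3)
```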

**Topic Models as Admixtures**

In this project, we develop the notion of *admixtures* and show that this notion generalizes many previous topic models including PLSA (Hofmann, 1999), LDA (Blei, 2003) and SAM (Reisinger, 2010). In contrast to *mixture* distributions, which assume that each observation is drawn from one of *k* component distributions, *admixture* distributions assume that each observation is drawn from an admixed distribution whose *parameters* are a mixture of the component *parameters*. See Figure 1 for an illustration of the difference between *admixtures* and *mixtures*. As examples of admixtures, PLSA and LDA are admixtures of Multinomials whereas SAM is an admixture of von Mises-Fisher distributions. In addition, because of the connections between Poissons and Multinomials, PLSA and LDA can be seen as admixtures of independent Poisson distributions. In this project, we remove this independence assumption from the Poisson distributions.
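The mixture/admixture distinction can be made concrete with a small sketch (illustrative only; the topic matrix and weights are made up). A mixture picks one component distribution per document; an admixture mixes the component *parameters* and samples from the resulting single distribution, as in the PLSA/LDA per-document step:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two component (topic) distributions over a 4-word vocabulary.
topics = np.array([[0.7, 0.1, 0.1, 0.1],
                   [0.1, 0.1, 0.1, 0.7]])

def sample_mixture(n_words, mix_weights):
    """Mixture: choose ONE topic, then draw all words from it."""
    j = rng.choice(len(topics), p=mix_weights)
    return rng.multinomial(n_words, topics[j])

def sample_admixture(n_words, mix_weights):
    """Admixture: mix the PARAMETERS, then draw from the admixed
    distribution (a convex combination of the topic vectors)."""
    admixed = mix_weights @ topics
    return rng.multinomial(n_words, admixed)

doc = sample_admixture(20, np.array([0.5, 0.5]))
```

With equal weights, the admixed parameter vector is `[0.4, 0.1, 0.1, 0.4]`, a distribution neither topic could produce alone, which is exactly what lets a document be "about" several topics at once.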

**Admixture of Poisson MRFs**

The main model introduced by this project is an admixture of Poisson MRFs and thus combines the two concepts discussed in the previous sections to provide a topic model whose topics can model the dependencies between words.

**LPMRF Topic Model (LPMRF-TM)**

The LPMRF-TM is another distinct generalization of topic models that directly replaces the Multinomial with an LPMRF distribution. Unlike the APM generalization, the LPMRF-TM relaxes the topic modeling assumptions by using fixed-length distributions. Essentially, the document word-count vector **x_i** is split into *k* component vectors (**z_i^j**) of the same dimension such that adding them together yields the original count vector. See the illustration below.
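The splitting step can be sketched as follows; this is a simplified illustration (assigning each word occurrence to a topic by a per-word multinomial draw from hypothetical topic weights `w`), not the model's actual inference procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def split_counts(x, w):
    """Split a word-count vector x into k component vectors z[j], one
    per topic, that sum back to x: each occurrence of each word is
    assigned to a topic drawn from the weights w."""
    k, p = len(w), len(x)
    z = np.zeros((k, p), dtype=int)
    for word, count in enumerate(x):
        z[:, word] = rng.multinomial(count, w)
    return z

x = np.array([3, 0, 2, 1])          # document word counts
z = split_counts(x, np.array([0.6, 0.4]))
# Each z[j] has the same dimension as x, and z.sum(axis=0) == x.
```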

**Tractable Parameter Estimation**

Essentially, we choose to maximize the posterior (i.e., a MAP estimate) with an exponential prior on the dependency matrix (which is analogous to the precision matrix of a Gaussian MRF). This is equivalent to adding an L1 regularization term to the log-likelihood. The exponential prior helps to enforce sparsity of the parameters for reasons of interpretability, generalizability, and computational tractability. Because the likelihood of a Poisson MRF is intractable, we use the pseudo-likelihood instead. The optimization can then be split into two alternating problems.
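The objective being minimized can be sketched as below. This is a simplified version assuming the hypothetical `theta`/`Theta` parameterization from earlier and a single shared component (the full models alternate over topic weights and topic parameters); constants in the Poisson likelihood are dropped:

```python
import numpy as np

def neg_pseudo_loglik(theta, Theta, X, lam):
    """Negative pseudo-log-likelihood of a Poisson MRF plus an L1
    penalty on the edge matrix. X is an (n_docs, p) count matrix and
    each node-conditional is Poisson with log-rate
    eta = theta + X @ Theta (Theta's diagonal held at zero)."""
    eta = theta + X @ Theta              # (n, p) conditional log-rates
    nll = np.sum(np.exp(eta) - X * eta)  # Poisson NLL up to constants
    return nll + lam * np.abs(Theta).sum()

# Toy check: with zero parameters every rate is exp(0) = 1, so the
# objective on a single all-ones document of 2 words is 2.0.
val = neg_pseudo_loglik(np.zeros(2), np.zeros((2, 2)), np.ones((1, 2)), 0.1)
```

A useful property of this objective is that it decouples into *p* independent L1-regularized Poisson regressions, one per word, which is what makes the large-scale optimization mentioned above feasible.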

**Evocation Metric**

Boyd-Graber et al. (2006) introduced the notion of *evocation*, which denotes the idea of which words “evoke” or “bring to mind” other words. There can be many types of evocation including: [rose – flower] (example), [brave – noble] (kind), [yell – talk] (manner), [eggs – bacon] (co-occurrence), [snore – sleep] (setting), [wet – desert] (antonymy), [work – lazy] (exclusivity), and [banana – kiwi] (likeness). This is distinct from word similarity or synonymy, since two words can have very different meanings yet still “bring to mind” each other (e.g. antonyms). This notion of word relatedness is much simpler than, but potentially more semantically meaningful and interpretable than, word similarity. For instance, “work” and “lazy” do not have similar meanings but are related through their semantic meanings. Another difference is that, unlike word semantic similarity, words that generally appear in very different contexts yet mean the same thing would probably not have a high evocation score. For example, “networks” and “graphs” both have a definition that means a set of nodes and edges, yet usually only one word is chosen in a particular context.

With the human scores provided by Boyd-Graber et al., word pairs discovered by a topic model (either implicitly as in LDA or explicitly as in APM) can be evaluated as illustrated in the figure below. Compared with many other topic models, APM can be much more effective at capturing semantically meaningful word pairs.
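The evaluation idea can be sketched as follows; the scores and pairs below are made-up stand-ins for the real Boyd-Graber et al. (2006) human judgments, and the averaging scheme is a simplified illustration of scoring a model's proposed word pairs, not the exact Evoc-1/Evoc-2 definitions:

```python
# Hypothetical human evocation scores (0-100) keyed by word pair; the
# real data comes from Boyd-Graber et al. (2006).
human_scores = {("husband.n", "wife.n"): 90,
                ("steel.n", "iron.n"): 75,
                ("year.n", "day.n"): 20}

def evoc_score(model_pairs, scores):
    """Average human evocation score over the word pairs a model
    proposes (pairs are unordered; unscored pairs count as 0)."""
    total = 0.0
    for a, b in model_pairs:
        total += scores.get((a, b), scores.get((b, a), 0))
    return total / len(model_pairs)

apm_pairs = [("husband.n", "wife.n"), ("steel.n", "iron.n")]
lda_pairs = [("year.n", "day.n"), ("wife.n", "husband.n")]
```

A model that surfaces high-evocation pairs (like the APM examples discussed below) scores higher under this kind of metric than one that surfaces generic co-occurring pairs.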

To validate the intuition of using evocation as a human-standard evaluation metric, we present the top 20 word pairs for the best standard topic model (in this case LDA) and the best APM model under the Evoc-2 metric, as seen below.

As seen in the table, APM captures many more word pairs with a human score greater than 50, whereas LDA only captures a few. One interesting example is that LDA finds two word pairs, [woman.n – wife.n] and [wife.n – man.n], that capture some semantic notion of marriage. However, APM directly captures this semantic meaning with [husband.n – wife.n]. APM also discovers several other familial relationships such as [aunt.n – uncle.n] and [mother.n – baby.n]. In addition, APM identifies multiple semantically coherent yet high-level word pairs such as [residential.a – home.n], [steel.n – iron.n], [job.n – employment.n] and [question.n – answer.n], whereas LDA finds several low-level word pairs such as [member.n – give.v], [west.n – state.n] and [year.n – day.n]. These overall trends become even more evident when looking at the top 50 word pairs. Both the quantitative evaluation metrics (i.e., Evoc-1 and Evoc-2) and a qualitative exploration of the top word pairs give strong evidence that APM can succinctly capture more interesting and higher-level semantic concepts through word dependencies than previous independent topic models.

### Main Publications

- Fixed-Length Poisson MRF: Adding Dependencies to the Multinomial (pdf, poster, code)
  D. Inouye, P. Ravikumar, I. Dhillon.
  In *Neural Information Processing Systems (NIPS)*, pp. 3195-3203, December 2015.
- Capturing Semantically Meaningful Word Dependencies with an Admixture of Poisson MRFs (pdf, poster, software, code)
  D. Inouye, P. Ravikumar, I. Dhillon.
  In *Neural Information Processing Systems (NIPS)*, pp. 3158-3166, December 2014.
- Admixture of Poisson MRFs: A Topic Model with Word Dependencies (pdf, slides, poster, code)
  D. Inouye, P. Ravikumar, I. Dhillon.
  In *International Conference on Machine Learning (ICML)*, pp. 683-691, June 2014.
- On Poisson Graphical Models (pdf)
  E. Yang, P. Ravikumar, G. Allen, Z. Liu.
  In *Neural Information Processing Systems (NIPS)*, December 2013.
- Graphical Models via Generalized Linear Models (pdf)
  E. Yang, P. Ravikumar, G. Allen, Z. Liu.
  In *Neural Information Processing Systems (NIPS)*, 2012. (**Oral**)