Topic Models with Word Dependencies

Primary researcher(s): David Inouye

Project Summary

This project develops two new topic models based on Poisson Markov Random Fields (PMRFs), which can model dependencies between words. Previous independent topic models such as PLSA (Hofmann, 1999), LDA (Blei, 2003), and SAM (Reisinger, 2010) did not explicitly model dependencies between words. More generally, this project considers two classes of admixture models that generalize previous topic models. The first model is called an Admixture of Poisson MRFs (APM) and the second is called a Fixed-Length Poisson MRF Topic Model (LPMRF-TM). These models are distinct relaxations of the LDA modeling assumptions. This project also considers tractable methods for estimating the parameters of these models based on the pseudo log-likelihood; essentially, estimation reduces to a large-scale optimization problem. Preliminary qualitative and quantitative experiments show that APM and the LPMRF-TM have potential, but more experimental analysis is needed.

Project Description

Previous topic models assume words are conditionally independent of each other given a topic, and thus, these previous models can only represent topics as a list of words ordered by frequency. However, an Admixture of Poisson MRFs (APM) can model dependencies between words and hence can represent each topic as a graph over words.

Poisson MRF (PMRF) vs. Independent Distributions
As stated in the previous section, instead of using independent Poisson distributions as the component distributions, we propose a Poisson MRF for the component distributions so that we can model dependencies between words. A multivariate generalization of the univariate Poisson was recently introduced (Yang, 2012). This Poisson MRF model assumes that the conditional distributions are univariate Poisson, similar to a Gaussian MRF; however, unlike a Gaussian MRF, the marginals are not univariate Poisson. This allows for a much more powerful topic distribution than an independent distribution. For example, if the word “classification” appears, “supervised” is more likely to appear than in general documents. Moreover, since these are graphical model distributions, the dependencies are Markov with respect to an undirected graph, which provides a visually appealing representation of any topic, in contrast to just a list of words as in PLSA/LDA.
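
Concretely, the PMRF joint distribution over a count vector $x \in \mathbb{Z}_+^p$ takes the form (notation here is ours)

$$ P(x \mid \theta, \Phi) = \exp\left( \sum_{s} \theta_s x_s + \sum_{s < t} \Phi_{st}\, x_s x_t - \sum_{s} \log(x_s!) - A(\theta, \Phi) \right), $$

where $\Phi$ encodes the pairwise word dependencies, $A(\theta, \Phi)$ is the log partition function, and each conditional $x_s \mid x_{-s}$ is univariate Poisson with rate $\exp(\theta_s + \sum_{t \neq s} \Phi_{st} x_t)$ (taking $\Phi$ symmetric). Setting $\Phi = 0$ recovers independent Poissons; for the joint to be normalizable, the off-diagonal entries of $\Phi$ must be non-positive, which is the negative-dependency restriction that the LPMRF below relaxes.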

The densities of three 2D Poisson MRFs that show possible dependency structures between two words. Negative dependencies (left) suggest that two words rarely co-occur whereas positive dependencies (right) suggest that two words often co-occur.

Fixed-Length Poisson MRFs (LPMRF)
We propose a novel distribution that generalizes the Multinomial distribution to enable dependencies between the dimensions. Our novel distribution is based on the parametric form of the Poisson MRF model (Yang, 2012) but is fundamentally different because its domain is restricted to fixed-length vectors, as in a Multinomial where the number of trials is fixed or known. Unlike the original PMRF (Yang, 2012), which could only model negative dependencies, this distribution allows positive dependencies between words, yet it does not artificially restrict the domain as in Truncated PMRFs (Yang, 2013) or alter the sufficient statistics as in Sublinear PMRFs (Yang, 2013). Because an LPMRF distribution has a known length, as with Multinomials, we can combine it with a distribution on length, such as a Poisson, to obtain a complete distribution over all count-valued vectors, as illustrated in the following figures.

LPMRF distribution for L = 20 with negative, zero, and positive dependencies. The LPMRF distribution can be quite different from a Multinomial (zero dependency) and thus provides a much more flexible parametric distribution for count data.

The Fixed-Length Poisson MRF (LPMRF) can replace the Multinomial by relaxing the independence assumption and allowing for positive and negative dependencies between words (or more generally between covariates).
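
For intuition, the LPMRF probability mass function can be computed exactly for tiny vocabularies by brute-force enumeration of the fixed-length support. Below is a minimal sketch; the parameter values are illustrative, not from the paper:

```python
import numpy as np
from itertools import product
from math import lgamma

def lpmrf_pmf(theta, Phi, L):
    """Exact LPMRF pmf by enumerating all count vectors summing to L.

    Unnormalized log-mass: theta'x + x'Phi x - sum_s log(x_s!).
    With Phi = 0 this reduces to a Multinomial with p_s proportional
    to exp(theta_s). Only feasible for tiny vocabularies, since the
    support grows combinatorially.
    """
    p = len(theta)
    support, logf = [], []
    for counts in product(range(L + 1), repeat=p):
        if sum(counts) != L:
            continue
        x = np.array(counts)
        support.append(x)
        logf.append(theta @ x + x @ Phi @ x
                    - sum(lgamma(c + 1) for c in counts))
    logf = np.array(logf)
    log_Z = np.logaddexp.reduce(logf)   # log partition function A_L(theta, Phi)
    return support, np.exp(logf - log_Z)

# Two-word example mirroring the figure above: L = 20, negative dependency.
theta = np.zeros(2)
Phi = np.array([[0.0, -0.05],           # off-diagonal entries couple the two words
                [-0.05, 0.0]])
support, pmf = lpmrf_pmf(theta, Phi, L=20)
```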

Significant advantages of the LPMRF distribution over PMRFs include:

  1. It allows positive dependencies without unrealistic constraints such as a truncated domain.
  2. Its normalizing constant (log partition function) can be tractably estimated using Annealed Importance Sampling (AIS), as sketched after this list.
  3. Its likelihood can therefore be computed, since the normalizing constant can now be estimated.
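
As a minimal sketch of point 2, one can anneal from the tractable Multinomial base (Phi = 0, closed-form partition function) to the full LPMRF. The temperature schedule, transition kernel, and sample sizes below are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from math import lgamma
from scipy.special import logsumexp, softmax

def log_f(x, theta, Phi, beta):
    """Unnormalized log-mass of the annealed distribution at inverse temperature beta."""
    return theta @ x + beta * (x @ Phi @ x) - sum(lgamma(c + 1) for c in x)

def ais_log_partition(theta, Phi, L, n_samples=100, n_temps=200, seed=0):
    """AIS estimate of the LPMRF log partition function A_L(theta, Phi).

    Anneals from the Multinomial base (beta = 0) to the full LPMRF (beta = 1).
    A minimal sketch; the paper's AIS details may differ.
    """
    rng = np.random.default_rng(seed)
    p = len(theta)
    betas = np.linspace(0.0, 1.0, n_temps)
    # Base log partition: sum over {|x| = L} of exp(theta'x) / prod(x_s!)
    # equals (sum_s exp(theta_s))^L / L! by the multinomial theorem.
    log_Z0 = L * logsumexp(theta) - lgamma(L + 1)
    log_w = np.zeros(n_samples)
    for i in range(n_samples):
        x = rng.multinomial(L, softmax(theta))    # exact draw from the base
        for b_prev, b in zip(betas[:-1], betas[1:]):
            log_w[i] += (b - b_prev) * (x @ Phi @ x)
            # One Metropolis move at temperature b: shift one count between words
            # (a symmetric proposal that preserves the fixed length L).
            s, t = rng.choice(p, size=2, replace=False)
            if x[s] > 0:
                x_new = x.copy()
                x_new[s] -= 1
                x_new[t] += 1
                if np.log(rng.random()) < log_f(x_new, theta, Phi, b) - log_f(x, theta, Phi, b):
                    x = x_new
    return log_Z0 + logsumexp(log_w) - np.log(n_samples)
```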

Topic Models as Admixtures
In this project, we develop the notion of admixtures and show that it generalizes many previous topic models, including PLSA (Hofmann, 1999), LDA (Blei, 2003) and SAM (Reisinger, 2010). In contrast to mixture distributions, which assume that each observation is drawn from one of k component distributions, admixture distributions assume that each observation is drawn from an admixed distribution whose parameters are a mixture of component parameters. See Figure 1 for an illustration of the difference between admixtures and mixtures. As examples, PLSA and LDA are admixtures of Multinomials whereas SAM is an admixture of von Mises-Fisher distributions. In addition, because of the connections between the Poisson and the Multinomial, PLSA and LDA can be seen as admixtures of independent Poisson distributions. In this project, we remove this independence assumption from the Poisson distributions.
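
In symbols (our notation, schematically): a mixture draws a single topic index per document, whereas an admixture mixes the component parameters themselves,

$$ \text{Mixture:} \quad z_i \sim \text{Categorical}(w), \qquad x_i \mid z_i \sim P(\cdot \mid \theta_{z_i}), $$

$$ \text{Admixture:} \quad w_i \sim \text{Dirichlet}(\alpha), \qquad x_i \mid w_i \sim P\left(\cdot \,\middle|\, \sum_{j=1}^{k} w_{ij}\, \theta_j\right), $$

where the Dirichlet prior on $w_i$ corresponds to the LDA case (PLSA instead treats the $w_i$ as free parameters), and APM mixes the natural parameters of Poisson MRFs.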

Figure 1: (Left) In mixtures, documents are drawn from exactly one component distribution. (Right) In admixtures, documents are drawn from a distribution whose parameters are a convex combination of component parameters.

Admixture of Poisson MRFs
The main model introduced by this project is an admixture of Poisson MRFs and thus combines the two concepts discussed in the previous sections to provide a topic model whose topics can model the dependencies between words.

LPMRF Topic Model (LPMRF-TM)
The LPMRF-TM is another distinct generalization of topic models that directly replaces the Multinomial with an LPMRF distribution. Unlike the APM generalization, the LPMRF-TM generalizes the topic-modeling assumption by using fixed-length distributions. Essentially, the document word-count vector x_i is split into k component vectors z_i^j of the same dimension such that adding them together yields the original count vector. See the illustration below.

The left is a generic generalization of LDA whereas the right is our generalization using LPMRF distributions to replace the Multinomial distribution.
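
Schematically (our notation), the decomposition is

$$ x_i = \sum_{j=1}^{k} z_i^{j}, \qquad z_i^{j} \sim \text{LPMRF}(\theta_j, \Phi_j) \ \text{of length } L_i^{j}, \qquad \sum_{j=1}^{k} L_i^{j} = L_i, $$

where $L_i$ is the total number of words in document $i$; setting every $\Phi_j = 0$ recovers a Multinomial-based topic model.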

Tractable Parameter Estimation
Essentially, we choose to maximize the posterior (i.e., a MAP estimate) with an exponential prior on the dependency matrix (which is similar to the precision matrix of a Gaussian MRF). This is equivalent to adding an L1 regularization term to the log-likelihood. The exponential prior helps enforce sparsity of the parameters for reasons of interpretability, generalizability, and computational tractability. Because the likelihood of a Poisson MRF is intractable, we use the pseudo-likelihood instead. The optimization can be split into two alternating problems as follows:

The MAP optimization can be split into alternating convex problems. Both subproblems decompose into independent parts that can be solved in parallel, leading to almost linear parallel speedup.
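
For concreteness, one node-wise subproblem of the pseudo-likelihood can be sketched as an L1-regularized Poisson regression solved by proximal gradient descent. This is a minimal single-topic illustration with an assumed plain ISTA solver, not the paper's optimized implementation:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the L1 norm (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def pmrf_node_fit(X, s, lam, step=1e-3, n_iters=500):
    """One node-wise subproblem of the L1-regularized Poisson pseudo-likelihood.

    Fits the conditional Poisson regression of word s on all other words via
    proximal gradient (ISTA). A minimal single-topic sketch; the full APM /
    LPMRF-TM objective alternates such fits with the topic-weight updates.
    """
    n, p = X.shape
    Z = np.delete(X, s, axis=1)           # counts of the other words
    y = X[:, s]                           # counts of word s
    theta, phi = 0.0, np.zeros(p - 1)
    for _ in range(n_iters):
        eta = theta + Z @ phi             # conditional log-rate per document
        mu = np.exp(eta)
        grad_phi = Z.T @ (mu - y) / n     # gradient of the smooth NLL part
        grad_theta = np.mean(mu - y)
        phi = soft_threshold(phi - step * grad_phi, step * lam)  # L1 prox step
        theta -= step * grad_theta
    return theta, phi   # theta_s and (sparse) row s of the dependency matrix

# Hypothetical usage on a small document-term count matrix X (n docs x p words):
# theta_s, phi_s = pmrf_node_fit(X, s=0, lam=0.5)
```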

Timing results for different sizes of a Wikipedia dataset show that the algorithm scales approximately as O(np^2). (Fixing k = 5, λ = 0.5)

Evocation Metric
Boyd-Graber et al. (2006) introduced the notion of evocation, which denotes the idea of which words “evoke” or “bring to mind” other words. There can be many types of evocation, including: [rose – flower] (example), [brave – noble] (kind), [yell – talk] (manner), [eggs – bacon] (co-occurrence), [snore – sleep] (setting), [wet – desert] (antonymy), [work – lazy] (exclusivity), and [banana – kiwi] (likeness). This is distinct from word similarity or synonymy, since two words can have very different meanings yet “bring to mind” each other (e.g. antonyms). This notion of word relatedness is much simpler than word similarity but potentially more semantically meaningful and interpretable. For instance, “work” and “lazy” do not have similar meanings but are related through their semantics. Another difference is that, unlike word similarity, words that generally appear in very different contexts yet mean the same thing would probably not have a high evocation score. For example, “networks” and “graphs” both have a definition that means a set of nodes and edges, yet usually only one of the two words is chosen in a particular context.

With the human scores provided by Boyd-Graber et al., word pairs discovered by a topic model (either implicitly as in LDA or explicitly as in APM) can be evaluated, as illustrated in the figure below. Compared with many other topic models, APM can be much more effective at capturing semantically meaningful word pairs.

The evocation metric evaluates whether the proposed model is capturing word pairs that are semantically interesting based on human scores.
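
A simplified version of this evaluation can be sketched as follows. The exact Evoc-1/Evoc-2 definitions are in the paper, so the ranking-by-|Φ| criterion and the sum-of-scores aggregation here are assumptions for illustration:

```python
import numpy as np

def evocation_score(Phi, vocab, human_scores, top_m=20):
    """Score a model's word-pair dependencies against human evocation judgments.

    Ranks word pairs by dependency strength |Phi_st|, then sums the human
    scores of the top_m pairs (pairs absent from the human data score 0).
    A simplified stand-in for the Evoc-1/Evoc-2 metrics in the paper.
    """
    p = Phi.shape[0]
    pairs = [(abs(Phi[s, t]), s, t) for s in range(p) for t in range(s + 1, p)]
    pairs.sort(reverse=True)
    return sum(human_scores.get(tuple(sorted((vocab[s], vocab[t]))), 0.0)
               for _, s, t in pairs[:top_m])

# Hypothetical usage, with human_scores keyed by alphabetically sorted pairs:
# human_scores = {("husband.n", "wife.n"): 87.0}   # illustrative value only
# score = evocation_score(Phi_topic, vocab, human_scores)
```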

Both Evoc-1 and Evoc-2 scores demonstrate that APM significantly outperforms other topic models in capturing semantically meaningful word pairs.

To validate the intuition of using evocation as a human-based evaluation metric, we present the top 20 word pairs for the best standard topic model (in this case LDA) and the best APM model under the Evoc-2 metric, as seen below.

Top 20 word pairs for LDA (left) and APM (right)

As seen in the table, APM captures many more word pairs with a human score greater than 50, whereas LDA only captures a few. One interesting example is that LDA finds two word pairs, [woman.n – wife.n] and [wife.n – man.n], that capture some semantic notion of marriage; APM captures this semantic meaning directly with [husband.n – wife.n]. APM also discovers several other familial relationships such as [aunt.n – uncle.n] and [mother.n – baby.n]. In addition, APM identifies multiple semantically coherent yet high-level word pairs such as [residential.a – home.n], [steel.n – iron.n], [job.n – employment.n] and [question.n – answer.n], whereas LDA finds several low-level word pairs such as [member.n – give.v], [west.n – state.n] and [year.n – day.n]. These trends become even more evident when looking at the top 50 word pairs. Both the quantitative evaluation metrics (i.e. Evoc-1 and Evoc-2) and a qualitative exploration of the top word pairs give strong evidence that, through word dependencies, APM can succinctly capture more interesting and higher-level semantic concepts than previous independent topic models.

Main Publications