Should we model “topics” as associative clusters, or as statistical factors?

I should say up front that this is going to be a geeky post about things that happen under the hood of the car. Many readers may be better served by scrolling down to “Trends, topics, and trending topics,” which has more to say about the visible applications of text mining.

I’ve developed a clustering methodology that I like pretty well. It allows me to map patterns of usage in a large collection by treating each term as a vector; I assess how often words occur together by measuring the angle between vectors, and then group the words with Ward’s clustering method. This produces a topic tree that seems to be both legible (in the sense that most branches have obvious affinities to a genre or subject category) and surprising (in the sense that they also reveal thematic connections I wouldn’t have expected). It’s a relatively simple technique that does what I want to do, practically, as a literary historian. (You can explore this map of eighteenth-century diction to check it out yourself; and I should link once again to Sapping Attention, which convinced me clustering could be useful.)

But as I learn more about Bayesian statistics, I’m coming to realize that it’s debatable whether the clusters of terms I’m finding count as topics at all. The topic-modeling algorithms that have achieved wide acceptance (for instance, Latent Dirichlet Allocation) are based on a clear definition of what a “topic” is. They hypothesize that the observed complexity of usage patterns is actually produced by a smaller set of hidden variables. Because those variables can be represented as lists of words, they’re called topics. But the algorithm isn’t looking for thematic connections between words so much as resolving a collection into a set of components or factors that could have generated it. In this sense, it’s related to a technique like Principal Component Analysis.

Deriving those hidden variables is a mathematical task of considerable complexity. In fact, it’s impossible to derive them precisely: you have to estimate. I can only understand the math for this when my head is constantly bathed in cool running water to keep the processor from overheating, so I won’t say much more about it — except that “factoring” is just a metaphor I’m using to convey the sort of causal logic involved. The actual math involves Bayesian inference rather than algebra. But it should be clear, anyway, that this is completely different from what I’m doing. My approach isn’t based on any generative model, and can’t claim to reveal the hidden factors that produce texts. It simply clusters words that are in practice associated with each other in a corpus.

I haven’t tried the Bayesian approach yet, but it has some clear advantages. For one thing, it should work better for purposes of classification and information retrieval, because it’s looking for topics that vary (at least in principle) independently of each other.* If you want to use the presence of a topic in a document to guide classification, this matters. A topic that correlated positively or negatively with another topic would become redundant; it wouldn’t tell you much you didn’t already know. It makes sense to me that people working in library science and information retrieval have embraced an approach that resolves a collection into independent variables, because problems of document classification are central to those disciplines.

On the other hand, if you’re interested in mapping associations between terms, or topics, the clustering approach has advantages. It doesn’t assume that topics vary independently. On the contrary, it’s based on a measure of association between terms that naturally extends to become a measure of association between the topics themselves. The clustering algorithm produces a branching tree structure that highlights some of the strongest relationships and contrasts, but you don’t have to stop there: any list of terms can be treated as a vector, and compared to any other list of terms.

A fragment of a larger tree, produced by clustering the top 1650 terms in a collection of 2,193 18c documents. Click through for a larger image giving more context.

Moreover, this flexibility means that you don’t have to treat the boundaries of “topics” as fixed. For instance, here’s part of the eighteenth-century tree that I found interesting: the words on this branch seemed to imply a strange connection between temporality and feeling, and they turned out to be particularly common in late-eighteenth-century novels by female writers. Intriguing, but we’re just looking at five words. Maybe the apparent connection is a coincidence. Besides, “cried” is ambiguous; it can mean “exclaimed” in this period more often than it means “wept.” How do we know what to make of a clue like this? Well, given the nature of the vector space model that produced the tree, you can do this: treat the cluster of terms itself as a vector, and look for other terms that are strongly related to it. When I did that, I got a list that confirmed the apparent thematic connection, and helped me begin to understand it.

A list of terms most strongly associated with the group "instantly, cried, felt, moment, longer" in a generically diverse corpus of 2,193 18c documents ranging from sermons to cookbooks. "Similarity" here is technically measured as the cosine of the angle between two vectors.

This is, very definitely, a list of words associated with temporality (moment, hastily, longer, instantly, recollected) and feeling (felt, regret, anxiety, astonishment, agony). Moreover, it’s fairly clear that the common principle uniting them is something like “suspense” (waiting, eagerly, impatiently, shocked, surprise). Gender is also involved — which might not have been immediately evident from the initial cluster, because gendered words were drawn even more strongly to other parts of the tree. But the associational logic of the clustering process makes it easy to treat topic boundaries as porous; the clusters that result don’t have to be treated as rigid partitions; they’re more appropriately understood as starting-places for exploration of a larger associational web.

[This would incidentally be my way of answering a valid critique of clustering — that it doesn’t handle polysemy well. The clustering algorithm has to make a choice when it encounters a word like “cried.” The word might in practice have different sets of associations, based on different uses (weep/exclaim), but it’s got to go in one branch or another. It can’t occupy multiple locations in the tree. We could try to patch that problem, but I think it may be better to realize that the problem isn’t as important as it appears, because the clusters aren’t end-points. Whether a term is obviously polysemous, or more subtly so, we’re always going to need to make a second pass where we explore the associations of the cluster itself in order to shake off the artificiality of the tree structure, and get a richer sense of multi-dimensional context. When we do that we’ll pick up words like “herself,” which could justifiably be located at any number of places in the tree.]

Much of this may already be clear to people in informatics, but I had to look at the math in order to understand that different kinds of “topic modeling” are really doing different things. Humanists are going to have some tricky choices to make here that I’m not sure we understand yet. Right now the Bayesian “factoring” approach is more prominent, partly because the people who develop text-mining algorithms tend to work in disciplines where classification problems are paramount, and where it’s important to prove that they can be solved without human supervision. For literary critics and historians, the appropriate choice is less clear. We may sometimes be interested in classifying documents (for instance, when we’re reasoning about genre), and in that case we too may need something like Latent Dirichlet Allocation or Principal Component Analysis to factor out underlying generative variables. But we’re just as often interested in thematic questions — and I think it’s possible that those questions may be more intuitively, and transparently, explored through associational clustering. To my mind, it’s fair to call both processes “topic modeling” — but they’re exploring topics of basically different kinds.

Postscript: I should acknowledge that there are lots of ways of combining these approaches, either by refining LDA itself, or by combining that sort of topic-factoring approach with an associational web. My point isn’t that we have to make a final choice between these processes; I’m just reflecting that, in principle, they do different things.

* (My limited understanding of the math behind Latent Dirichlet Allocation is based on a 2009 paper by D.M. Blei and J. D. Lafferty available here.)

Seriously geeking out.

The pace of posts here has slowed, and it may stay pretty slow until I get some new data-slicing tools set up.

I spent the weekend trying to understand when I might want to use a vector space model to compare documents or terms, and when ordinary Pearson’s correlation would be better. Also, I now understand how Ward’s method of hierarchical agglomerative clustering is different from all the other methods.

I know kung fu.

Aside from the sheer fun of geekery, what I’ve learned is that the digital humanities have become *much* easier to enter than they were in the 90s. I attempted a bit of data-mining in the early 90s, and published an article containing a few graphs in Studies in Romanticism, but didn’t pursue the approach much further because I found it nearly impossible to produce the kind of results I wanted on the necessary scale. (You have to remember that my interests lean toward the large end of the scale continuum in DH.)

I told myself that I would get back in the game when the kinds of collections I needed began to become available, and in the last couple of years it became clear to me that they were, if not available, at least possible to construct. But I actually had no idea how transparent and accessible things have become. So much information is freely available on the web, and with tools like Zotero and SEASR the web is also becoming a medium in which one can do the work itself. Everything’s frickin interoperable. It’s so different from the 90s when you had to build things more or less from scratch yourself.