What kinds of “topics” does topic modeling actually produce?

I’m having an interesting discussion with Lisa Rhody about the significance of topic modeling at different scales that I’d like to follow up with some examples.

I’ve been doing topic modeling on collections of eighteenth- and nineteenth-century volumes, using volumes themselves as the “documents” being modeled. Lisa has been pursuing topic modeling on a collection of poems, using individual poems as the documents being modeled.

The math we’re using is probably similar. I believe Lisa is using MALLET. I’m using a version of Latent Dirichlet Allocation that I wrote in Java so I could tinker with it.

But the interesting question we’re exploring is this: How does the meaning of LDA change when it’s applied to writing at different scales of granularity? Lisa’s documents (poems) are a typical size for LDA: this technique is often applied to identify topics in newspaper articles, for instance. This is a scale that seems roughly in keeping with the meaning of the word “topic.” We often assume that the topic of written discourse changes from paragraph to paragraph, “topic sentence” to “topic sentence.”

By contrast, I’m using documents (volumes) that are much larger than a paragraph, so how is it possible to produce topics as narrowly defined as this one?

[Figure: plot of this topic’s frequency across the collection; the gray line shows aggregate frequency including unplotted volumes.]
This is based on a generically diverse collection of 1,782 19c volumes, not all of which are plotted here (only the volumes where the topic is most prominent are plotted; the gray line represents an aggregate frequency including unplotted volumes). The most prominent words in this topic are “mother, little, child, children, old, father, poor, boy, young, family.” It’s clearly a topic about familial relationships, and more specifically about parent-child relationships. But there aren’t a whole lot of books in my collection specifically about parent-child relationships! True, the most prominent books in the topic are A. F. Chamberlain’s The Child and Childhood in Folk Thought (1896) and Alice Morse Earle’s Child Life in Colonial Days (1899), but most of the rest of the prominent volumes are novels — by, for instance, Catharine Sedgwick, William Thackeray, Louisa May Alcott, and so on. Since few novels are exclusively about parent-child relations, how can the differences between novels help LDA identify this topic?

The answer is that the LDA algorithm doesn’t demand anything remotely like a one-to-one relationship between documents and topics. LDA uses the differences between documents to distinguish topics — but not by establishing a one-to-one mapping. On the contrary, every document contains a bit of every topic, although it contains them in different proportions. The numerical variation of topic proportions between documents provides a kind of mathematical leverage that distinguishes topics from each other.

The implication of this is that your documents can be considerably larger than the kind of granularity you’re trying to model. As long as the documents are small enough that the proportions between topics vary significantly from one document to the next, you’ll get the leverage you need to discriminate those topics. Thus you can model a collection of volumes and get topics that are not mere “subject classifications” for volumes.
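
To make that concrete, here is a minimal sketch (written for this post, not excerpted from my actual Java code) of how per-document topic proportions are usually estimated once a model has been fit; the names docTopicCounts and alpha are placeholders for the standard count table and Dirichlet smoothing parameter. Because of the smoothing, every document receives a nonzero share of every topic; what gives the model leverage is the way those shares vary from document to document, and that variation survives even when the documents are whole volumes.

```java
// Minimal sketch: estimating theta, the per-document topic proportions, from
// the count of tokens in each document assigned to each topic. Names are
// illustrative, not taken from any particular implementation.
public class TopicProportions {

    // docTopicCounts[d][k] = number of tokens in document d currently assigned to topic k
    public static double[][] estimateTheta(int[][] docTopicCounts, double alpha) {
        int numDocs = docTopicCounts.length;
        int numTopics = docTopicCounts[0].length;
        double[][] theta = new double[numDocs][numTopics];

        for (int d = 0; d < numDocs; d++) {
            int docLength = 0;
            for (int k = 0; k < numTopics; k++) {
                docLength += docTopicCounts[d][k];
            }
            for (int k = 0; k < numTopics; k++) {
                // Dirichlet smoothing: even a topic with zero assigned tokens
                // gets a small nonzero proportion in this document.
                theta[d][k] = (docTopicCounts[d][k] + alpha)
                            / (docLength + numTopics * alpha);
            }
        }
        return theta;
    }
}
```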

Now, in the comments to an earlier post I also said that I thought “topic” was not always the right word to use for the categories that are produced by topic modeling. I suggested that “discourse” might be better, because topics are not always unified semantically. This is a place where Lisa starts to question my methodology a little, and I don’t blame her for doing so; I’m making a claim that runs against the grain of a lot of existing discussion about “topic modeling.” The computer scientists who invented this technique certainly thought they were designing it to identify semantically coherent “topics.” If I’m not doing that, then, frankly, am I using it right? Let’s consider this example:

[Figure: plot of this topic’s frequency across the collection, with individual volumes marked by colored circles.]
This is based on the same generically diverse 19c collection. The most prominent words are “love, life, soul, world, god, death, things, heart, men, man, us, earth.” Now, I would not call that a semantically coherent topic. There is some religious language in there, but it’s not about religion as such. “Love” and “heart” are mixed in there; so are “men” and “man,” “world” and “earth.” It’s clearly a kind of poetic diction (as you can tell from the color of the little circles), and one that increases in prominence as the nineteenth century goes on. But you would be hard pressed to identify this topic with a single concept.

Does that mean topic modeling isn’t working well here? Does it mean that I should fix the system so that it would produce topics that are easier to label with a single concept? Or does it mean that LDA is telling me something interesting about Victorian poetry — something that might be roughly outlined as an emergent discourse of “spiritual earnestness” and “self-conscious simplicity”? It’s an open question, but I lean toward the latter alternative. (By the way, the writers most prominently involved here include Christina Rossetti, Algernon Swinburne, and both Brownings.)

In an earlier comment I implied that the choice between “semantic” topics and “discourses” might be aligned with topic modeling at different scales, but I’m not really sure that’s true. I’m sure that the document size we choose does affect the level of granularity we’re modeling, but I’m not sure how radically it affects it. (I believe Matt Jockers has done some systematic work on that question, but I’ll also be interested to see the results Lisa gets when she models differences between poems.)

I actually suspect that the topics identified by LDA probably always have the character of “discourses.” They are, technically, “kinds of language that tend to occur in the same discursive contexts.” But a “kind of language” may or may not really be a “topic.” I suspect you’re always going to get things like “art hath thy thou,” which are better called a “register” or a “sociolect” than they are a “topic.” For me, this is not a problem to be fixed. After all, if I really want to identify topics, I can open a thesaurus. The great thing about topic modeling is that it maps the actual discursive contours of a collection, which may or may not line up with “concepts” any writer ever consciously held in mind.

Computer scientists don’t understand the technique that way.* But on this point, I think we literary scholars have something to teach them.

On the collective course blog for English 581 I have some other examples of topics produced at a volume level.

*[UPDATE April 3, 2012: Allen Riddell rightly points out in the comments below that Blei's original LDA article is elegantly agnostic about the significance of the "topics" -- which are at bottom just "latent variables." The word "topic" may be misleading, but computer scientists themselves are often quite careful about interpretation.]

Documentation / open data:
I’ve put the topic model I used to produce these visualizations on github, in the subfolder 19th150topics under the folder BrowseLDA. Each model’s folder contains an R script; when you run it, it prompts you to load the data files included in the same folder and then lets you browse around in the topic model, visualizing each topic as you go.

I have also pushed my Java code for LDA up to github. But really, most people are better off with MALLET, which is infinitely faster and has hyperparameter optimization that I haven’t added yet. I wrote this just so that I would be able to see all the moving parts and understand how they worked.
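
For anyone who wants to see what the MALLET route looks like in practice, here is a sketch adapted from MALLET’s published developer examples rather than from my own code; the file name volumes.tsv, the one-document-per-line format, and the parameter values are all assumptions. The call to setOptimizeInterval is the hyperparameter optimization I mentioned.

```java
// Sketch adapted from MALLET's developer examples; file names, input format,
// and parameter values are assumptions, not recommendations.
import cc.mallet.pipe.*;
import cc.mallet.pipe.iterator.CsvIterator;
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.InstanceList;

import java.io.*;
import java.util.ArrayList;
import java.util.regex.Pattern;

public class TrainTopics {
    public static void main(String[] args) throws Exception {
        // Standard import pipeline: lowercase, tokenize, remove stopwords, map to feature indices.
        ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
        pipeList.add(new CharSequenceLowercase());
        pipeList.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")));
        pipeList.add(new TokenSequenceRemoveStopwords(new File("stoplists/en.txt"), "UTF-8", false, false, false));
        pipeList.add(new TokenSequence2FeatureSequence());

        InstanceList instances = new InstanceList(new SerialPipes(pipeList));

        // One document per line: name, label, then the text of the volume.
        Reader reader = new InputStreamReader(new FileInputStream("volumes.tsv"), "UTF-8");
        instances.addThruPipe(new CsvIterator(reader,
                Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"), 3, 2, 1));

        ParallelTopicModel model = new ParallelTopicModel(150, 50.0, 0.01); // topics, alphaSum, beta
        model.addInstances(instances);
        model.setNumThreads(4);
        model.setNumIterations(1000);
        model.setOptimizeInterval(20);  // re-estimate the hyperparameters every 20 iterations
        model.estimate();

        // Per-document topic proportions for the first volume.
        double[] proportions = model.getTopicProbabilities(0);
        System.out.println(java.util.Arrays.toString(proportions));
    }
}
```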

16 thoughts on “What kinds of “topics” does topic modeling actually produce?”

  1. Something I should have mentioned: often, modeling nonfiction prose produces topics that look like “topics” in our ordinary sense of the word: groups of words that are linked by a concept or at least by a subject category.

    Modeling poetry, drama, and fiction does not tend to produce that. Oh, sure, you get some topics about war and topics about the sea and so on, because works of literature are after all loosely “topical” in that sense. But more commonly (at least at the century-plus scales I’m exploring) you get topics that are really linked to styles or kinds of diction, or even to specific authors (the “author” signature can be surprisingly strong). Again: we could view this as a problem to be fixed, in order to make literature “model” more like nonfiction. But I don’t see it as a problem: I see it as a reflection of the actually-significant differentiations between works of poetry and fiction.

  2. I’m very impressed you wrote your own LDA code–that’s something that’s been low on my own list for quite a while, but I don’t honestly think I’ll ever do it. I did run one model on my old corpus a year ago, but didn’t love the results for various reasons. (The fiction-literature thing on a mixed corpus was one–it pulled out a number of very specific topics for sciences and (IIRC) history, while producing for-me ungrokkable constellations that seemed to come from the literature/social sciences sphere. It may be that one just needs more consistent collections, or that I need to read up on hLDA, I guess.)

    Still, I remain a bit skeptical about the human interpretation of LDA results in general. So if you’re going to keep going through these, I’d love to hear in particular–

    1) what the least intuitive topics produced are, and what cracks they fill in the model space.
    2) About topics that seem to mirror each other in different time periods (you’re showing a lot of 20-year peaks here, which I’ve seen in other historical models as well, and often there are sibling topics elsewhere in the model); I’ve seen some charts that essentially sub out ngram curves for topic curves, and I wonder if that’s kosher.

    Also, apropos terminology: one advantage of ‘topic’ over ‘discourse’ is that it preserves some vestige of this being a spatial/statistical property more than a rhetorical one.

    • Will do. You’re so right that the first thing to do with a new technique is try to break it, and point out where it breaks.

      I can quickly share three “glitch areas” that I’ve already found: 1) Using the kind of corpora that you and I are using, running headers at the tops of pages can be a nightmare. I had to go back a couple of weeks ago and rebuild the whole collection without them. They weren’t a significant problem for anything but LDA. (I’ve sketched one possible way to handle them at the end of this comment.) 2) Stop words … as you already know and have blogged, a multitude of sins can be concealed under “(cough) a fairly standard list of stop words.” MALLET contains some fixes to address this, involving “hyperparameter optimization” — more later. And 3) the number of topics you create is important. Small shifts in that parameter can change things radically.

      I’ll say more about all of this, and point out the unintuitive/”broken” parts of a few models, in a couple of days. Unfortunately, the taxpayers of Illinois keep forcing me to do things other than text mining …
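
      Since I mentioned it above, here’s the kind of thing I have in mind for running headers. This is a purely illustrative sketch, not the procedure I actually used to rebuild the collection, and the names and cutoff are arbitrary; the idea is just that a line recurring nearly verbatim at the top of many pages in a volume is probably a header rather than body text.

```java
// Illustrative sketch only: drop lines that recur at the tops of pages,
// on the assumption that they are running headers rather than body text.
import java.util.*;

public class RunningHeaderFilter {

    /** pages: each inner list holds the lines of one page of a volume. */
    public static List<List<String>> stripRunningHeaders(List<List<String>> pages) {
        // Count how often each normalized first line appears across the volume's pages.
        Map<String, Integer> firstLineCounts = new HashMap<>();
        for (List<String> page : pages) {
            if (page.isEmpty()) continue;
            firstLineCounts.merge(normalize(page.get(0)), 1, Integer::sum);
        }

        // Arbitrary cutoff: a first line seen on at least 10% of pages (and at least 3 times)
        // is treated as a running header and removed.
        int threshold = Math.max(3, pages.size() / 10);

        List<List<String>> cleaned = new ArrayList<>();
        for (List<String> page : pages) {
            List<String> copy = new ArrayList<>(page);
            if (!copy.isEmpty() && firstLineCounts.getOrDefault(normalize(copy.get(0)), 0) >= threshold) {
                copy.remove(0);  // drop the suspected running header
            }
            cleaned.add(copy);
        }
        return cleaned;
    }

    // Normalize aggressively so page numbers and OCR noise don't hide the repetition.
    private static String normalize(String line) {
        return line.toLowerCase().replaceAll("[^a-z]", "");
    }
}
```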

    • Btw, LDA is not hard to write compared to other stuff you’ve already done. At its heart, it’s a simple clustering algorithm.

      The derivation that proves the algorithm is brain-squashingly hard, and I still don’t fully understand the math behind it. But the algorithm itself, as implemented in “Gibbs sampling,” is actually intuitive and straightforward. There’s some unnecessary mystification on this topic which I hope to address in another post.
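
      To show what I mean, here is a bare-bones sketch of the collapsed Gibbs sampling update, written from the standard formulation rather than copied out of my own implementation; the array names, parameter values, and fixed seed are just placeholders. Each token’s topic gets resampled in proportion to how much its document currently likes each topic, times how much each topic currently likes that word.

```java
// Bare-bones collapsed Gibbs sampler for LDA, written from the standard
// formulation as a sketch; not production code and not anyone's actual implementation.
import java.util.Random;

public class GibbsSketch {

    /**
     * docs[d][i] = vocabulary index of the i-th token in document d.
     * Returns docTopicCounts, from which per-document proportions can be estimated.
     */
    public static int[][] sample(int[][] docs, int numTopics, int vocabSize,
                                 double alpha, double beta, int iterations) {
        Random rng = new Random(42);  // fixed seed, just to make the sketch reproducible
        int numDocs = docs.length;

        int[][] docTopicCounts = new int[numDocs][numTopics];    // tokens in doc d assigned to topic k
        int[][] topicWordCounts = new int[numTopics][vocabSize]; // tokens of word w assigned to topic k
        int[] topicTotals = new int[numTopics];                  // total tokens assigned to topic k
        int[][] z = new int[numDocs][];                          // current topic of each token

        // Random initialization of every token's topic assignment.
        for (int d = 0; d < numDocs; d++) {
            z[d] = new int[docs[d].length];
            for (int i = 0; i < docs[d].length; i++) {
                int k = rng.nextInt(numTopics);
                z[d][i] = k;
                docTopicCounts[d][k]++;
                topicWordCounts[k][docs[d][i]]++;
                topicTotals[k]++;
            }
        }

        double[] p = new double[numTopics];
        for (int iter = 0; iter < iterations; iter++) {
            for (int d = 0; d < numDocs; d++) {
                for (int i = 0; i < docs[d].length; i++) {
                    int w = docs[d][i];
                    int old = z[d][i];

                    // Temporarily remove this token from the counts.
                    docTopicCounts[d][old]--;
                    topicWordCounts[old][w]--;
                    topicTotals[old]--;

                    // How much the document likes topic k, times how much topic k likes word w.
                    double total = 0.0;
                    for (int k = 0; k < numTopics; k++) {
                        p[k] = (docTopicCounts[d][k] + alpha)
                             * (topicWordCounts[k][w] + beta)
                             / (topicTotals[k] + vocabSize * beta);
                        total += p[k];
                    }

                    // Sample a new topic from that (unnormalized) distribution.
                    double u = rng.nextDouble() * total;
                    int newTopic = numTopics - 1;
                    for (int k = 0; k < numTopics; k++) {
                        u -= p[k];
                        if (u <= 0) { newTopic = k; break; }
                    }

                    // Put the token back under its new assignment.
                    z[d][i] = newTopic;
                    docTopicCounts[d][newTopic]++;
                    topicWordCounts[newTopic][w]++;
                    topicTotals[newTopic]++;
                }
            }
        }
        return docTopicCounts;
    }
}
```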

    • Hi Ben, I’ve tried to reply in my most recent post, but my replies are a bit askew to your questions, so I’m replying here as well. The difficulty I have addressing (1) is that the least intuitive topics … are the ones I value. The *most* intuitive topics I see are the ones unified by typography (i’ll, i’ve, can’t, etc.), or grammar (helping/modal verbs), or topics composed mostly of foreign words, and so on. I’m not very interested in these, but I’d call them very intuitive. Thematically unified topics (war, commerce) are almost as intuitive, and also not that interesting to me. There are a lot of them, especially if I include nonfiction in the corpus. Then we hit topics that are unified by elusive issues of genre/style/period, which are (to me) the interesting ones.

      I’m not sure I understand the question you’re asking in (2), because I’m not sure what kind of “mirroring” you mean. I think maybe I’m not seeing these “sibling topics.” Possibly I would see them if I increased the number of topics in the model. That tends to force division. But I do think you’re onto something interesting in your observation about 20-year peaks. I see a range of different kinds of curves in these models, but I don’t see many peaks that are *narrower* than about 20-25 years. Not sure what to make of that, although it might be related to some results you’ve blogged about concerning the generational basis of language change.

      • Thanks for this, and I owe you a response on the first post. I probably threw up my hands too early at some of the less intuitive topics, because I find it hard enough to talk about using ‘shall’ a lot, and trying to talk about clustered groups of words seems harder to do.

        That’s good if you’re not seeing ‘sibling topics’–I haven’t spent much time looking at LDA results, so this may not be as common as I thought. It has to do with what you’re calling the historian’s practice of wanting to slap a topic label on everything; I recall getting a couple of different groups that both wanted to be called ‘social sciences,’ and I’ve seen other people resort to having topics with names like “wars” and “major wars” that seem divided more by chronology than by subject matter.

        Anyhow, way to get the topic-modeling conversation rolling along. Hopefully I’ll chime in again from my own end soon.

  3. “We refer to the latent multinomial variables in the LDA model as topics, so as to exploit text-oriented intuitions, but we make no epistemological claims regarding these latent variables beyond their utility in representing probability distributions on sets of words.”
    (p. 996, footnote 1, of the original LDA paper: http://jmlr.csail.mit.edu/papers/v3/blei03a.html)

    Also, I found Wallach et al.’s “Rethinking LDA: Why Priors Matter” really helpful for thinking through some of these things. http://people.cs.umass.edu/~mimno/papers/NIPS2009_0929.pdf

    • Very apt citation. I should go back and insert an update so I don’t slander comp sci. I agree with you on the value of “Why Priors Matter” — it’s written beautifully (for a comp sci article!) and gives good intuitive accounts of why things work the way they do.

