What kinds of “topics” does topic modeling actually produce?

I’m having an interesting discussion with Lisa Rhody about the significance of topic modeling at different scales that I’d like to follow up with some examples.

I’ve been doing topic modeling on collections of eighteenth- and nineteenth-century volumes, using volumes themselves as the “documents” being modeled. Lisa has been pursuing topic modeling on a collection of poems, using individual poems as the documents being modeled.

The math we’re using is probably similar. I believe Lisa is using MALLET. I’m using a version of Latent Dirichlet Allocation that I wrote in Java so I could tinker with it.

But the interesting question we’re exploring is this: How does the meaning of LDA change when it’s applied to writing at different scales of granularity? Lisa’s documents (poems) are a typical size for LDA: this technique is often applied to identify topics in newspaper articles, for instance. This is a scale that seems roughly in keeping with the meaning of the word “topic.” We often assume that the topic of written discourse changes from paragraph to paragraph, “topic sentence” to “topic sentence.”

By contrast, I’m using documents (volumes) that are much larger than a paragraph, so how is it possible to produce topics as narrowly defined as this one?


This is based on a generically diverse collection of 1,782 19c volumes, not all of which are plotted here (only the volumes where the topic is most prominent are plotted; the gray line represents an aggregate frequency including unplotted volumes). The most prominent words in this topic are “mother, little, child, children, old, father, poor, boy, young, family.” It’s clearly a topic about familial relationships, and more specifically about parent-child relationships. But there aren’t a whole lot of books in my collection specifically about parent-child relationships! True, the most prominent books in the topic are A. F. Chamberlain’s The Child and Childhood in Folk Thought (1896) and Alice Morse Earle’s Child Life in Colonial Days (1899), but most of the rest of the prominent volumes are novels (by Catharine Sedgwick, William Thackeray, and Louisa May Alcott, for instance). Since few novels are exclusively about parent-child relations, how can the differences between novels help LDA identify this topic?

The answer is that LDA doesn’t demand anything remotely like a one-to-one relationship between documents and topics. It does use the differences between documents to distinguish topics, but it treats every document as a mixture: each one contains a bit of every topic, in different proportions. The numerical variation of topic proportions between documents provides a kind of mathematical leverage that distinguishes topics from each other.

The implication of this is that your documents can be considerably larger than the kind of granularity you’re trying to model. As long as the documents are small enough that the proportions between topics vary significantly from one document to the next, you’ll get the leverage you need to discriminate those topics. Thus you can model a collection of volumes and get topics that are not mere “subject classifications” for volumes.
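To make that leverage concrete, here is a minimal sketch of LDA by collapsed Gibbs sampling in plain Python. It is an illustration under stated assumptions, not the Java implementation mentioned above and not MALLET: the function name `lda_gibbs`, the hyperparameter values `alpha` and `beta`, and the toy “volumes” (random mixtures of two invented word pools) are all hypothetical.

```python
import random


def lda_gibbs(docs, num_topics, iterations=100, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    Returns one list of topic proportions per document.
    alpha and beta are illustrative smoothing values, not tuned settings.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]       # tokens of doc d in topic k
    topic_word = [dict() for _ in range(num_topics)]   # count of word w in topic k
    topic_total = [0] * num_topics                     # all tokens in topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] = topic_word[k].get(w, 0) + 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight each topic by (how common it is in this document)
                # times (how common this word is in that topic).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t].get(w, 0) + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] = topic_word[k].get(w, 0) + 1
                topic_total[k] += 1

    # Smoothed proportions: every topic gets a nonzero share of every document.
    proportions = []
    for d, doc in enumerate(docs):
        denom = len(doc) + alpha * num_topics
        proportions.append(
            [(doc_topic[d][t] + alpha) / denom for t in range(num_topics)]
        )
    return proportions


# Hypothetical "volumes": each mixes two invented word pools in a different ratio.
family_words = "mother child father boy family".split()
nature_words = "sea wind moon star night".split()
data_rng = random.Random(1)


def make_volume(family_share, length=200):
    return [
        data_rng.choice(family_words if data_rng.random() < family_share else nature_words)
        for _ in range(length)
    ]


volumes = [make_volume(share) for share in (0.8, 0.6, 0.4, 0.2)]
proportions = lda_gibbs(volumes, num_topics=2)
for row in proportions:
    print([round(x, 2) for x in row])
```

Each printed row is one volume’s topic proportions. Because of the `alpha` smoothing no row can be a pure topic, and in a run like this the rows typically differ from volume to volume; that variation is what the sampler has to work with.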

Now, in the comments to an earlier post I also said that I thought “topic” was not always the right word to use for the categories that are produced by topic modeling. I suggested that “discourse” might be better, because topics are not always unified semantically. This is a place where Lisa starts to question my methodology a little, and I don’t blame her for doing so; I’m making a claim that runs against the grain of a lot of existing discussion about “topic modeling.” The computer scientists who invented this technique certainly thought they were designing it to identify semantically coherent “topics.” If I’m not doing that, then, frankly, am I using it right? Let’s consider this example:


This is based on the same generically diverse 19c collection. The most prominent words are “love, life, soul, world, god, death, things, heart, men, man, us, earth.” Now, I would not call that a semantically coherent topic. There is some religious language in there, but it’s not about religion as such. “Love” and “heart” are mixed in there; so are “men” and “man,” “world” and “earth.” It’s clearly a kind of poetic diction (as you can tell from the color of the little circles), and one that increases in prominence as the nineteenth century goes on. But you would be hard pressed to identify this topic with a single concept.

Does that mean topic modeling isn’t working well here? Does it mean that I should fix the system so that it would produce topics that are easier to label with a single concept? Or does it mean that LDA is telling me something interesting about Victorian poetry — something that might be roughly outlined as an emergent discourse of “spiritual earnestness” and “self-conscious simplicity”? It’s an open question, but I lean toward the latter alternative. (By the way, the writers most prominently involved here include Christina Rossetti, Algernon Swinburne, and both Brownings.)

In an earlier comment I implied that the choice between “semantic” topics and “discourses” might be aligned with topic modeling at different scales, but I’m not really sure that’s true. I’m sure that the document size we choose does affect the level of granularity we’re modeling, but I’m not sure how radically it affects it. (I believe Matt Jockers has done some systematic work on that question, but I’ll also be interested to see the results Lisa gets when she models differences between poems.)

I actually suspect that the topics identified by LDA probably always have the character of “discourses.” They are, technically, “kinds of language that tend to occur in the same discursive contexts.” But a “kind of language” may or may not really be a “topic.” I suspect you’re always going to get things like “art hath thy thou,” which are better called a “register” or a “sociolect” than they are a “topic.” For me, this is not a problem to be fixed. After all, if I really want to identify topics, I can open a thesaurus. The great thing about topic modeling is that it maps the actual discursive contours of a collection, which may or may not line up with “concepts” any writer ever consciously held in mind.

Computer scientists don’t understand the technique that way.* But on this point, I think we literary scholars have something to teach them.

On the collective course blog for English 581 I have some other examples of topics produced at a volume level.

*[UPDATE April 3, 2012: Allen Riddell rightly points out in the comments below that Blei’s original LDA article is elegantly agnostic about the significance of the “topics” — which are at bottom just “latent variables.” The word “topic” may be misleading, but computer scientists themselves are often quite careful about interpretation.]

Documentation / open data:
I’ve put the topic model I used to produce these visualizations on github. It’s in the subfolder 19th150topics under folder BrowseLDA. Each folder contains an R script that you run; it then prompts you to load the data files included in the same folder, and allows you to browse around in the topic model, visualizing each topic as you go.

I have also pushed my Java code for LDA up to github. But really, most people are better off with MALLET, which is far faster and has hyperparameter optimization that I haven’t added yet. I wrote this just so that I would be able to see all the moving parts and understand how they worked.

Etymology and nineteenth-century poetic diction; or, singing the shadow of the bitter old sea.

In a couple of recent posts, I argued that fiction and poetry became less similar to nonfiction prose over the period 1700-1900. But because I only measured genres’ distance from each other, I couldn’t say much substantively about the direction of change. Toward the end of the second post, though, I did include a graph that hinted at a possible cause:


The older part of the lexicon (mostly words derived from Old English) gradually became more common in poetry, fiction, and drama than in nonfiction prose. This may not be the only reason for growing differentiation between literary and nonliterary language, but it seems worth exploring. (I should note that function words are excluded from this calculation for reasons explained below; we’re talking about verbs, nouns, and adjectives — not about a rising frequency of “the.”)

Why would genres become etymologically different? Well, it appears that words of different origins are associated in contemporary English with different registers (varieties of language appropriate for a particular social situation). Words of Old English provenance get used more often in speech than in writing — and in writing they are (now) used more often in narrative than in exposition. Moreover, writers learn to produce this distinction as they get older; there isn’t a marked difference for students in elementary school. But as they advance to high school, students learn to use Latinate words in formal expository writing (Bar-Ilan and Berman, 2007).

It’s not hard to see why words of Old English origin might be associated with spoken language. For nearly 200 years (1066-1250), English was almost exclusively spoken. The learned part of the Old English lexicon didn’t survive this period. Instead, when English began to be used again in writing, literate vocabulary was borrowed from French and Latin. As a result, etymological distinctions in English tend also to be distinctions between different social contexts of language use.

Instead of distinguishing “Germanic” and “Latinate” diction here, I have used the first attested date for each word, choosing 1150 as a dividing line because it’s roughly the midpoint of the period when English was not used in writing. Of course pre-1150 words are mostly from Old English, but I prefer to divide based on date-of-entry because that highlights the history of writing rather than a spurious ethnic mystique. (E.g., E. B. White: “Anglo-Saxon is a livelier tongue than Latin, so use Anglo-Saxon words.”) But the difference isn’t material. You could even just measure the average length of words in different genres and get results that are close to the results I’m graphing here (the correlation between the pre/post-1150 ratio and average word length is often -.85 or more strongly negative).
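A rough sketch of the two measurements being compared may help here. Everything in it is an assumption for illustration: the function names, the tiny word sets standing in for the real pre- and post-1150 lists, and the four invented “texts.” It is not the code or data behind the graphs.

```python
from math import sqrt

# Tiny stand-ins for the real word lists (illustrative only).
PRE_1150 = {"sea", "wind", "heart", "love", "night"}
POST_1150 = {"grandeur", "applause", "superior", "talents", "refined"}


def pre_post_ratio(tokens):
    """Count of pre-1150 words divided by count of post-1150 words.

    Assumes the text contains at least one post-1150 word.
    """
    pre = sum(1 for t in tokens if t in PRE_1150)
    post = sum(1 for t in tokens if t in POST_1150)
    return pre / post


def mean_word_length(tokens):
    """Average length of the words that appear on either list."""
    counted = [t for t in tokens if t in PRE_1150 or t in POST_1150]
    return sum(len(t) for t in counted) / len(counted)


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Four invented "texts" with different mixes of the two word sets.
samples = [
    ["sea", "wind", "heart", "grandeur"],
    ["sea", "night", "applause", "superior"],
    ["love", "grandeur", "talents", "refined"],
    ["wind", "love", "night", "heart", "talents"],
]

ratios = [pre_post_ratio(s) for s in samples]
lengths = [mean_word_length(s) for s in samples]
r = pearson(ratios, lengths)
print(round(r, 2))
```

On this toy data the coefficient comes out negative, which is the direction described above: texts with more of the older words tend to have shorter words on average.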

The bottom line is this: using fewer pre-1150 words tends to make diction more overtly literate or learned. Using more of them makes diction less overtly learned, and perhaps closer to speech. It would be dangerous to assume much more: people may think that Old English words are “concrete” — but this isn’t true, for instance, of “word” or “true.”

What can we learn by graphing this aspect of diction?


In the period 1700-1900, I think we learn three interesting things:

    All genres of writing (or at least of prose) seem to acquire an exaggeratedly “literate” diction in the course of the eighteenth century.

    Poetry and fiction reverse that process in the nineteenth century, and develop a diction that is markedly less learned than other kinds of writing — or than their own past history.

    But they do that to different degrees, and as a result the overall story is one of increasing differentiation — not just between “literary” and “nonliterary” diction — but between poetry and fiction as well.

I’m fascinated by this picture. It suggests that the difference linguists have observed between the registers of exposition and narrative may be a relatively recent development. It also raises interesting questions about “literariness” in the eighteenth and nineteenth centuries. For instance, contrast this picture to the standard story where “poetic diction” is an eighteenth-century refinement that the nineteenth century learns to dispense with. Where the etymological dimension of diction is concerned, that story doesn’t fit the evidence. On the contrary, nineteenth-century poetry differentiates itself from the diction of prose in a new and radical way: by the end of the century, the older part of the lexicon has become more than 2.5 times more prominent, on average, in verse than it is in nonfiction prose.

I could speculate about why this happened, but I don’t really know yet. What I can do is give a little more descriptive detail. For instance, if pre-1150 words became more common in 19c poetry … which words, exactly, were involved? One way to approach that is to ask which individual words correlate most strongly with the pre/post-1150 ratio. We might focus especially, for instance, on the rising trend in poetry from the middle of the eighteenth century to 1900. If you sort the top 10,000 words in the poetry collection by correlation with yearly values of the pre/post ratio, you get a list like this:


But the precise correlation coefficients don’t matter as much as an overall picture of diction, so I’ll simply list the hundred words that correlate most strongly with the pre/post-1150 ratio in poetry from 1755 to 1900:


We’re looking mostly at a list of pre-1150 words, with a few exceptions (“face,” “flower,” “surely”). That’s not an inevitable result; if the etymological trend had been a side-effect of something mostly unrelated to linguistic register (say, a vogue for devotional poetry), then sorting the top 10,000 words by correlation with the trend would reveal a list of words associated with its underlying (religious) cause. But instead we’re seeing a trend that seems to have a coherent sociolinguistic character. That’s not just a feature of the top 100 words: the average pre-1150 word is located 2210 places higher on this list than the average post-1150 word.
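The sorting step, and the comparison of average list positions, can be sketched roughly as follows. This is a sketch under assumptions rather than the code behind the lists: the function names are hypothetical, and the yearly frequencies and ratio values are invented so that two words rise with the trend and two fall.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_by_correlation(yearly_freq, yearly_ratio):
    """Sort words by how strongly their yearly frequency tracks the ratio."""
    years = sorted(yearly_ratio)
    ratio_series = [yearly_ratio[y] for y in years]
    scores = {
        word: pearson([freqs.get(y, 0.0) for y in years], ratio_series)
        for word, freqs in yearly_freq.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


def mean_rank(ranked, group):
    """Average 1-based position of the words in `group` on the ranked list."""
    positions = [i for i, w in enumerate(ranked, start=1) if w in group]
    return sum(positions) / len(positions)


# Invented yearly values of the pre/post-1150 ratio.
yearly_ratio = {1755: 0.8, 1800: 1.0, 1850: 1.4, 1900: 1.9}

# Invented per-year frequencies for four words.
yearly_freq = {
    "sea": {1755: 1, 1800: 2, 1850: 3, 1900: 5},
    "wind": {1755: 2, 1800: 2, 1850: 4, 1900: 6},
    "pomp": {1755: 6, 1800: 5, 1850: 3, 1900: 1},
    "merit": {1755: 5, 1800: 5, 1850: 2, 1900: 1},
}

pre_1150 = {"sea", "wind"}      # illustrative grouping
post_1150 = {"pomp", "merit"}   # illustrative grouping

ranked = rank_by_correlation(yearly_freq, yearly_ratio)
gap = mean_rank(ranked, post_1150) - mean_rank(ranked, pre_1150)
print(ranked, gap)
```

A positive `gap` means the older words sit higher on the list, on average, than the newer ones; the figure of 2210 places quoted above is that same kind of difference, computed over the full list of 10,000 words.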

It’s not, however, simply a list of common Anglo-Saxon words. The list clearly reflects a particular model of “poetic diction,” although the nature of that model is not easy to describe. It involves an odd mixture of nouns for large natural phenomena (wind, sea, rain, water, moon, sun, star, stars, sunset, sunrise, dawn, morning, days, night, nights) and verbs that express a subjective relation (sang, laughed, dreamed, seeing, kiss, kissed, heard, looked, loving, stricken). [Afterthought: I don’t think we have any Hopkins in our collection, but it sounds like my computer is parodying Gerard Manley Hopkins.] There’s also a bit of explicitly archaic Wardour Street in there (yea, nay, wherein, thereon, fro).

Here, by contrast, are the words at the bottom of the list — the ones that correlate negatively with the pre/post-1150 trend, because they are less common, on average, in years where that trend spikes.


There’s a lot that could be said about this list, but one thing that leaps out is an emphasis on social competition. Pomp, power, superior, powers, boast, bestow, applause, grandeur, taste, pride, refined, rival, fortune, display, genius, merit, talents. This is the language of poems that are not bashful about acknowledging the relationship between “social” distinction and the “arts” “inspired” by the “muse” — a theme entirely missing (or at any rate disavowed) in the other list. So we’re getting a fairly clear picture of a thematic transformation in the concerns of poetry from 1755 to 1900. But these lists are generated strictly by correlation with the unsmoothed year-to-year variation of an etymological trend! Moreover, the lists have themselves an etymological character. There are admittedly a few pre-1150 words in this list of negative correlators (mind, oft, every), but by and large it’s a list of words derived from French or Latin after 1150.

I think the apparent connection between sociolinguistic and thematic issues is the really interesting part of this. It begins to hint that the broader shift in poetic diction (using words from the older part of the lexicon to differentiate poetry from prose) had itself an unacknowledged social rationale — which was to disavow poetry’s connection to cultural distinction, and foreground instead a simplified individual subjectivity. I’m admittedly speculating a little here, and there’s a great deal more that could be said both about poetry and about parallel trends in fiction — but I’ve said enough for one blog post.

A couple of quick final notes. You’re wondering, what about drama?


Our collection of drama in the nineteenth century is actually too sparse to draw any conclusions yet, but there’s the trend line so far, if you’re interested.

You’re also wondering, how were pre- and post-1150 words actually sorted out? I made a list of the 10,500 most common words in the collection, and mined etymologies for them using a web-crawler on Dictionary.com. I excluded proper nouns, abbreviations, and words that entered English after 1699. I also excluded function words (determiners, prepositions, conjunctions, pronouns, and the verb to be) because as Bar-Ilan and Berman say, “register variation is essentially a matter of choice — of selecting high-level more formal alternatives instead of everyday, colloquial items or vice versa” (15). There is generally no alternative to prepositions, pronouns, etc., so they don’t tell us much about choice. After those exclusions, I had a list of 9,517 words, of which 2,212 entered the language before 1150 and 7,305 after 1149. (The list is available here.)
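The sorting procedure can be sketched in a few lines. This is only a sketch under assumptions: the function name is hypothetical, the exclusion set is a small stand-in for the full list of function words, proper nouns, and abbreviations, and the dates in the example are placeholders rather than looked-up etymologies.

```python
# Small stand-in for the excluded words (function words, proper nouns,
# abbreviations); the real exclusion list would be much longer.
EXCLUDED = {"the", "of", "and", "he", "is", "was", "in"}


def split_by_first_attestation(first_attested, cutoff=1150, latest=1699,
                               excluded=EXCLUDED):
    """Split words into pre- and post-cutoff sets by first attested year.

    Words in `excluded`, and words first attested after `latest`,
    are dropped from both sets.
    """
    pre, post = set(), set()
    for word, year in first_attested.items():
        if word in excluded or year > latest:
            continue
        if year < cutoff:
            pre.add(word)
        else:
            post.add(word)
    return pre, post


# Placeholder dates for illustration only; not real dictionary data.
first_attested = {
    "sea": 900,
    "wind": 900,
    "the": 900,          # function word: excluded
    "talents": 1300,
    "grandeur": 1500,
    "telegraph": 1794,   # entered after 1699: excluded
}

pre_words, post_words = split_by_first_attestation(first_attested)
print(sorted(pre_words), sorted(post_words))
```

Keeping the cutoff and the latest admissible date as parameters makes it easy to check how sensitive a result is to the choice of 1150 as the dividing line.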

Finally, I doubt we’ll be so lucky — but if you do cite this blog post, it should be cited as a collective work by Ted Underwood and Jordan Sellers, because the nineteenth-century part of the underlying collection is a product of Jordan’s research.

References
Laly Bar-Ilan and Ruth A. Berman, “Developing register differentiation: the Latinate-Germanic divide in English,” Linguistics 45 (2007): 1-35.

[UPDATE March 13, 2011: Twitter conversation about this post with Natalia Cecire.]