I’m getting ahead of myself with this post, because I don’t have time to explain everything I did to produce this. But it was just too striking not to share.
Basically, I’m experimenting with Latent Dirichlet Allocation, and I’m impressed. So first of all, thanks to Matt Jockers, Travis Brown, Neil Fraistat, and everyone else who tried to convince me that Bayesian methods are better. I’ve got to admit it. They are.
But anyway, in a class I’m teaching we’re using LDA on a generically diverse collection of 1,853 volumes published between 1751 and 1903. The collection includes fiction, poetry, drama, and a limited amount of nonfiction (just biography). We’re stumbling on a lot of fascinating things, but this was slightly moving. Here’s the graph for one particular topic.
The circles and X’s are individual volumes. Blue is fiction, green is drama, pinkish purple is poetry, black is biography. Only the volumes where this topic turned out to be prominent are plotted, because if you plot all 1,853 it’s just a blurry line at the bottom of the image. The gray line is an aggregate frequency curve, which is not related in any very intelligible way to the y-axis. (Work in progress …) As you can see, this topic is mostly prominent in fiction around the year 1800. Here are the top 50 words in the topic:
But here’s what I find slightly moving. The x’s at the top of the graph are the 10 works in the collection where the topic was most prominent. They include, in order: Mary Wollstonecraft Shelley, *Frankenstein*; Mary Wollstonecraft, *Mary*; William Godwin, *St. Leon*; Mary Wollstonecraft Shelley, *Lodore*; William Godwin, *Fleetwood*; William Godwin, *Mandeville*; and Mary Wollstonecraft Shelley, *Falkner*.
In short, this topic is exemplified by a family! Mary Hays does intrude into the family circle with *Memoirs of Emma Courtney*, but otherwise, it’s Mary Wollstonecraft, William Godwin, and their daughter.
Other critics have of course noticed that M. W. Shelley writes “Godwinian novels.” And if you go further down the list of works, the picture becomes less familial (Helen Maria Williams and Thomas Holcroft butt in, as well as P. B. Shelley). Plus, there’s another topic in the model (“myself these should situation”) that links William Godwin more closely to Charles Brockden Brown than it does to his wife or daughter. And LDA isn’t graven in stone; every time you run topic modeling you’re going to get something slightly different. But still, this is kind of a cool one. “Mind feelings heart felt” indeed.
7 replies on “A touching detail produced by LDA …”
Welcome =]
Ha! Just as I suspected, then — there’s a secret society.
http://www.youtube.com/watch?v=ZB_2oIKUVks
We can control the world with the right priors.
Hi, Ted. I really enjoyed this post because you’ve taken a single generated topic and then created a readable visualization which makes sense in terms of the probability of topic distribution… I have a few questions that probably push along the lines of exactly what you didn’t have time to document here, but since I’m working right in this same vein right now, it would be useful if you have a moment to articulate a couple of things…
1.) Have you done anything to normalize text length/density? Would there be a value in doing such a thing? You mention that these are individual volumes, but are the volumes themselves comparable in number of words? Could the length of volumes and their sustained attention to a subject throughout the volume affect the outcome? I’m thinking, for example, of something like the 1799 or 1850 Preludes… would those have similar mind/soul/feeling probabilities that might be obscured by the relative density of the number of words found within the Wollstonecraft / Shelley / Godwin texts? How would we know? Also, with a poem such as Tintern Abbey, in which one might also find similar mind/soul/feeling language, could its publication inside a larger volume reduce its relative probability to be affiliated with the topic? In other words, would it be reasonable to say that a volume-by-volume approach works better for prose than it does for poetry or even drama?
2.) What was used to generate the visualization? I find it particularly elegant in its layout. And yet the aggregate frequency curve seems to be penciled in… which is fine, honestly, to me… I’m just wondering what tools allowed you to make the visualization so legible… This is an entirely selfishly motivated question… as I’m trying to find something to display results of my own LDA experiments.
Thanks so much for sharing your ongoing work. I always find your posts readable and engaging when so much of the material in this area is dense and not enjoyable to read. It’s not only useful but necessary to post preliminary results as the field moves further ahead in topic modeling and other computational/textual analytic experiments, because as a community, we’re learning along with one another… What a boon for us all to get a glimpse of your work as it refines, adapts, and grows.
Thanks much for that kind comment, Lisa. It also speaks to several important questions. Matt Jockers has a book forthcoming that’s really going to give the definitive answer to some of these questions, but let me answer the parts of this that I can from my own unsystematic fiddling.
1a) Most people use much shorter “chunks” or contexts for topic modeling than I’m using here. I think Woodchipper, in particular, uses pages. But my own experience has been that volume-length chunks work very well; I’ve tried smaller chunks of fixed size, and haven’t found them to be better. In fact, when the chunks get very small (below about 1000 words), I actually think the usefulness of the model decreases.
Let me try to provide a bit of intuition for why that might be so. The LDA algorithm doesn’t assign each “chunk” equal weight in the process. On the contrary, it’s mostly interested in individual words. As it assigns words to topics, it does look at how many other words in the document belong to each topic, but the size of the document (or “chunk”) essentially gets factored out in that calculation.
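If it helps to see the arithmetic: here’s a minimal sketch, in R, of the weight a standard collapsed Gibbs sampler computes when it reassigns a single word. (A generic illustration, I should stress, not a transcription of the Java code I’m actually running; all the names here are made up for the example.)

```r
# Sketch of the collapsed Gibbs sampling weight for one word in LDA.
# Counts are assumed to exclude the word currently being resampled.
# n_dk: words in this document currently assigned to each topic
# n_kw: tokens of this word type assigned to each topic, corpus-wide
# n_k:  total tokens assigned to each topic, corpus-wide
# V: vocabulary size; alpha, beta: Dirichlet hyperparameters
sample_topic <- function(n_dk, n_kw, n_k, V, alpha = 0.1, beta = 0.01) {
  weights <- (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta)
  # Document length only enters through n_dk, and the normalizing
  # constant (document length + K * alpha) is identical for every
  # topic, so it cancels when the weights are renormalized: a long
  # volume doesn't outvote a short one on any given word.
  sample(length(weights), 1, prob = weights / sum(weights))
}
```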
So, if I may put it a little impressionistically: chunk size can affect the relative granularity of the phenomena you’re modeling, but it doesn’t in itself give more weight to one document or another. So chunks don’t have to be equally sized in order for LDA to work well, although yes, other things being equal, you might ideally aim for *roughly* similar-sized chunks so that you’re modeling a similar level of granularity across the corpus.
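(By the way, if you want to experiment with the effect of chunk size on your own corpus, the chunking itself is trivial. A hypothetical R helper, just to make the procedure concrete; `chunk_volume` and `volume_text` are invented names, not anything from my actual scripts:)

```r
# Split a volume, represented as a character vector of words, into
# consecutive chunks of roughly fixed length (the last may be shorter).
chunk_volume <- function(words, chunk_size = 1000) {
  split(words, ceiling(seq_along(words) / chunk_size))
}

# e.g., chunks <- chunk_volume(strsplit(tolower(volume_text), "\\s+")[[1]])
```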
1b) Now, there’s a separate question, which is not mathematical so much as it is a question about the objects we want to describe. You’re very right that “Tintern Abbey” is different from the rest of the Lyrical Ballads. That’s an interesting fact, and it’s a fact that we can’t get at unless we start segmenting books into parts below the volume level, at least in the minimal sense that we won’t be able to distinguish “Tintern” from the rest of the volume.
I’m less certain that the topics we produce will generally become more *interesting* as we make the chunk size smaller. This is going to depend on what you’re interested in, so there’s not going to be a single answer here. But I will say this: it might be a mistake to assume that literary applications of topic modeling are going to be looking for semantically-conceived “topics.” In my experience, we’re more interested in discourses.
2) Thanks for your kind words about my fairly cheesy visualization. This is just produced with the basic visualization tool included in the language R. I wrote an R script to visualize the results of topic modeling for my grad class. I’ll push the R script, as well as the Java code I’m using for LDA, up onto github later this afternoon.
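In the meantime, the bare bones of that kind of plot look something like the sketch below. This assumes a hypothetical data frame `vols` with columns for date, topic frequency, genre, and a top-ten flag, and it substitutes a smoothed lowess line for the aggregate frequency curve, so it’s an approximation of the figure rather than the script that produced it.

```r
# Rough sketch of the genre-coded scatterplot described in the post.
genre_colors <- c(fiction = "blue", drama = "green",
                  poetry = "orchid", biography = "black")
plot(vols$date, vols$topicfreq,
     pch = ifelse(vols$top10, 4, 1),   # pch 4 = x, pch 1 = circle
     col = genre_colors[as.character(vols$genre)],
     xlab = "publication date", ylab = "proportion of volume in topic")
lines(lowess(vols$date, vols$topicfreq), col = "gray")  # aggregate trend
```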
If you’re thinking that line about “discourses” is opaque … yes, it’s opaque, and I’m not even sure I understand my own meaning, to tell the truth.
To be a bit less gnomic: you’re raising an open and really interesting question about the different aspects of literature that get foregrounded at different levels of granularity. I’ll be very interested to see what you find, doing topic modeling at the level of individual poems. I think that interesting things can also emerge at the level of the “volume,” if the corpus is big enough. But I should perhaps call these patterns “discourses,” because they are ways of writing rather than “topics” in our ordinary semantic sense of the word.
[…] This post is the continuation of a conversation begun on Ted Underwood’s blog under the post “A touching detail produced by LDA”—in which he demonstrates that there is an overlay between the works of the Shelley/Godwin family […]