
The key to all mythologies.

Well, not really. But it is a classifying scheme that might turn out to be as loopy as Casaubon’s incomplete project in Middlemarch, and I thought I might embrace the comparison to make clear that I welcome skepticism.

In reality, it’s just a map of eighteenth-century diction. I took the 1,650 most common words in eighteenth-century writing, and asked my iMac to group them into clusters that tend to be common in the same eighteenth-century works. Since the clustering program works recursively, you end up with a gigantic branching tree that reveals how closely words are related to each other in 18c practice. If they appear on the same “branch,” they tend to occur in the same works. If they appear on the same “twig,” that tendency is even stronger.
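
For readers who want something concrete, here is a toy sketch of how a tree like this can be grown. It is not the program actually used for the map: it assumes a bottom-up (agglomerative) approach with cosine similarity and average linkage, and the four words, their counts, and the function names are all made up for illustration.

```python
import math
from itertools import combinations

# Hypothetical data: each word's frequency in four made-up works.
counts = {
    "ship":   [9, 8, 0, 1],
    "voyage": [7, 9, 1, 0],
    "grove":  [0, 1, 8, 9],
    "nymph":  [1, 0, 9, 8],
}

def cosine(u, v):
    """Cosine similarity between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def leaves(tree):
    """Flatten a nested-tuple tree into its list of words."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)

def similarity(a, b):
    """Average linkage: mean cosine over all cross-cluster word pairs."""
    pairs = [(x, y) for x in leaves(a) for y in leaves(b)]
    return sum(cosine(counts[x], counts[y]) for x, y in pairs) / len(pairs)

def build_tree(words):
    """Repeatedly merge the two most similar clusters until one tree remains."""
    clusters = list(words)
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda p: similarity(*p))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append((a, b))  # the merged pair becomes a new "twig"
    return clusters[0]

tree = build_tree(sorted(counts))
print(tree)
```

Words merged early sit on the same twig; words merged only at the last step share nothing but the trunk. This brute-force version recomputes every pairwise similarity at each step, so it would be far too slow for 1,650 words, but the logic of the tree is the same.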

You wouldn’t necessarily think that two words happening to occur in the same book would tell you much, but when you’re dealing with a large number of documents, it seems there’s a lot of information contained in the differences between them. In any case, this technique produced a detailed map of eighteenth-century topics that seemed — to me, anyway — surprisingly illuminating. To explore a couple of branches, or just marvel at this monument of digital folly, click here, or on the illustration to the right. That’ll take you through to a page where you can click on whichever branches interest you. (Click on the links in the right-hand margin, not the annotations on the tree itself.) To start with, I recommend Branch 18, which is a sort of travel narrative, Branch 13, which is 18c poetic diction in a nutshell, and Branch 5, which is saying something about gender and/or sexuality that I don’t yet understand.

If you want to know exactly how this was produced, and contrast it to other kinds of topic modeling, I describe the technique in this “technical note.” I should also give thanks to the usual cast of characters. Ryan Heuser and Ben Schmidt have produced analogous structures which gave me the idea of attempting this. Laura Mandell and 18th Connect helped me obtain the eighteenth-century texts on which the tree was based.

By tedunderwood

Ted Underwood is Professor of Information Sciences and English at the University of Illinois, Urbana-Champaign. On Twitter he is @Ted_Underwood.

2 replies on “The key to all mythologies.”

Professor Ted Underwood, I appreciate your work and analysis very much indeed. It’s fascinating. I have one question; would you be so kind as to answer it for me? You mentioned “the 1,650 most common words in eighteenth-century writing,” and I began to wonder how you came up with the number “1,650.” How can one make sure these words are the most common?
Thank you

No problem. To be honest, the number 1,650 is arbitrary. I was trying not to overwhelm my computer, so I used a relatively small list of words, but larger lists would also work.

I chose the most common words in this period, I believe, just by sorting the data itself into a list of words by frequency. If you have a whole lot of documents from a given period, then you have all the information you need to create a list of common words. How exactly you sort the information depends on how you have it stored. In this case, I had converted the documents into a “sparse table” in MySQL, where each row contained essentially (a) a document ID, (b) a word, and (c) the number of times that word appears in that document. Once you have that information, it is not difficult to produce a list of words sorted by frequency: add up each word's counts across all documents, then sort the totals from largest to smallest.
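
A minimal sketch of that last step, in Python rather than MySQL, might look like this. The rows, document IDs, and the function name `most_common_words` are invented for illustration; the real table was of course much larger.

```python
from collections import Counter

# Hypothetical sparse-table rows: (document ID, word, count of word in document).
rows = [
    ("doc1", "the", 120),
    ("doc1", "love", 7),
    ("doc2", "the", 95),
    ("doc2", "ship", 12),
    ("doc2", "love", 2),
    ("doc3", "ship", 4),
]

def most_common_words(rows, n):
    """Sum each word's counts over all documents and return the n most frequent."""
    totals = Counter()
    for _doc_id, word, count in rows:
        totals[word] += count
    return [word for word, _total in totals.most_common(n)]

top_words = most_common_words(rows, 2)
print(top_words)
```

With the real data you would ask for the top 1,650 instead of the top 2. In SQL the same idea is a `GROUP BY` on the word column with a `SUM` of the counts, ordered by that sum in descending order.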
