Categories
18c methodology

Revealing the relationships between topics in a corpus.

[UPDATE April 7: The illustrations in this post are now out of date, though some of the explanation may still be useful. The kinds of diction mapped in these illustrations are now mapped better in branches 13-14, 18, and 1 of this larger topic tree.] While trying to understand the question I posed in my last post (why did style become less “conversational” in the 18th century?), I stumbled on a technique that might be useful to other digital humanists. I thought I might pause to describe it.

The technique is basically a kind of topic modeling. But whereas most topic modeling aims to map recurring themes in a single work, this technique maps topics at the corpus level. In other words, it identifies groups of words that are linked by the fact that they tend to occur in the same kinds of books. I’m borrowing this basic idea from Ben Schmidt, who used tf-idf scores to do something similar. I’ve taken a slightly different approach by using a “vector space model,” which I prefer for reasons I’ll describe in some technical notes. But since you’ll need to see results before you care about the how, let me start by showing you what the technique produces.

part of a topic tree based on 2,200 18c works

This branch of a larger tree structure was produced by a clustering program that groups words together when they resemble each other according to some measure of similarity. In this case I defined “similarity” as a tendency to occur in the same eighteenth-century texts. Since the tree structure records the sequence of grouping operations, it can register different nested levels of similarity. In the image above, for instance, we can see that “proud” and “pride” are more likely to occur in the same texts than either is to occur together with “smile” or “gay.” But since this is just one branch of a much larger tree, all of these words are actually rather likely to occur together.

This tree is based on a generically diverse collection of 18c texts drawn from ECCO-TCP with help from 18thConnect, and was produced by applying the clustering program I wrote to the 1350 most common words in that collection. The branch shown above represents about 1/50th of the whole tree. But I’ve chosen this branch to start with because it neatly illustrates the underlying principle of association. What do these words have in common? They’re grouped together because they appear in the same kinds of texts, and it’s fairly clear that the “kinds” in this case are poetic. We could sharpen that hypothesis by using this list of words as a search query to see exactly which texts it turns up, but given the prevalence of syncope (“o’er” and “heav’n”), poetry is a safe guess.

It is true that semantically related words tend to be strongly grouped in the tree. Ease/care, charms/fair and so on, are closely linked. But that isn’t a rule built into the algorithm I’m using; the fact that it happens is telling us something about the way words are in practice distributed in the collection. As a result, you get a snapshot of eighteenth century “poetic diction,” not just in the sense of specialized words like “oft,” but in the sense that you can see which themes counted as “poetic” in the eighteenth century, and possibly gather some clues about the way those themes were divided into groups. (In order to find out whether those divisions were generic or historical, you would need to turn the process around and use the sublists as search queries.)

part of a topic tree based on 2,200 18c works

Here’s another part of the tree, showing words that are grouped together because they tend to appear in accounts of travel. The words at the bottom of the image (from “main” to “ships”) are very clearly connected to maritime travel, and the verbs of motion at the top of the image are connected to travel more generally. It’s less obvious that diurnal rhythms like morning/evening and day/night would be described heavily in the same contexts, but apparently they are.

In trees like these, some branches are transparently related to a single genre or subject category, while others are semantically fascinating but difficult to interpret as reflections of a single genre. They may well be produced by the intersection or overlap of several different generic (or historical) categories, and it’ll require more work to understand the nature of the overlap. In a few days I’ll post an overview of the architecture of the whole 1350-word eighteenth-century tree. It’ll be interesting to see how its architecture changes when I slide the collection forward in time to cover progressively later periods (like, say, 1750-1850). But I’m finding the tree interesting for reasons that aren’t limited to big architectural questions of classification: there are interesting thematic clues at every level of the structure. Here’s a portion of one that I constructed with a slightly different list of words.

part of a topic tree based on 2,200 18c works

Broadly, I would say that this is the language of sentiment: “alarm,” “softened,” “shocked,” “warmest,” “unfeeling.” But there are also ringers in there, and in a way they’re the most interesting parts. For instance, why are “moment” and “instantly” part of the language of sentiment in the eighteenth century?

By tedunderwood

Ted Underwood is Professor of Information Sciences and English at the University of Illinois, Urbana-Champaign. On Twitter he is @Ted_Underwood.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s