18c tree

This is super-outdated and definitely deprecated. Leaving it up for now to avoid link rot, but this will probably be archived somewhere eventually.

a topic tree of 1650 words, grouped according to their tendency to appear together in a generically diverse collection of 2,200 18c works


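[Image: the “trunk” of the topic tree. Numbers in the right-hand margin link to detailed views of selected branches: 53, 51, 50, 49, 45, 44, 43, 41, 39, 38, 37, 36, 35, 32, 31, 30, 29, 28, 22, 21, 19, 18, 17, 16, 15, 14, 13, 12, 11, 9, 5, 4, 3, 2, 1.]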






The image above is the “trunk” of what I’m calling a “topic tree.” It divides eighteenth-century vocabulary into clusters of words that tend to occur together in the same works. It’s based on a generically diverse collection (drawn from ECCO-TCP), covering poetry, drama, fiction, and a lot of different kinds of nonfiction. You can’t click on the tree itself, but the numbers in the right-hand margin are keyed to the branches; clicking on a number will reveal the detailed structure of that branch. I haven’t included links to all the branches yet, just a selection of interesting ones. For a detailed account of how this was created, see here and here.

Words in this tree are clustered based on their tendency to appear in the same works, but the clusters should be understood as topics rather than genre or subject classifications. In other words, branches of the tree don’t line up with categories of books in a one-to-one fashion: they’re defined by the differences between multiple categories. Also, since a century is a long time, it’s likely that some of these clusters are produced by diachronic as well as thematic differences. That may be why, for example, natural history and the language of feeling each seem to appear in two different places in the tree.
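To make “clustered based on their tendency to appear in the same works” a little more concrete, here is a minimal sketch. It is not the procedure that produced this tree; it assumes cosine similarity between per-work word counts and average-linkage agglomerative clustering, and the word counts are invented for illustration rather than drawn from ECCO-TCP.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy data: how often each word occurs in four imaginary works.
# The numbers are made up; they are not drawn from ECCO-TCP.
COUNTS = {
    "voyage": [9, 8, 0, 1],
    "coast":  [7, 9, 1, 0],
    "heir":   [0, 2, 8, 9],
    "estate": [1, 0, 9, 6],
}


def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def leaves(tree):
    """Flatten a nested-tuple tree into the list of words it contains."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def linkage(t1, t2, counts):
    """Average pairwise similarity between the words of two subtrees."""
    pairs = [(a, b) for a in leaves(t1) for b in leaves(t2)]
    return sum(cosine(counts[a], counts[b]) for a, b in pairs) / len(pairs)


def build_tree(counts):
    """Repeatedly merge the two most similar clusters until one tree remains."""
    clusters = sorted(counts)  # every word starts as its own cluster
    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]], counts),
        )
        merged = (clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0]


tree = build_tree(COUNTS)
print(tree)
```

With these toy counts, the travel words (“voyage,” “coast”) end up on one branch and the inheritance words (“heir,” “estate”) on the other, simply because each pair tends to turn up in the same works. A different similarity measure or linkage rule could draw the branches differently, which is one reason to treat any such tree as tentative.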

Although the tree structure will inevitably suggest systematic hubris, I don’t mean to make that sort of claim; I mean this to be as playful and interrogative as a massive tree graph can manage to be. I’ve added descriptive annotations to the image purely to help readers choose a couple of branches that might interest them; in reality, these descriptions are very tentative and should all be followed by question marks.

Why build a tree like this? Well, I’m still trying to figure out what, if anything, we might learn. A few clues are obvious. One is that Ireland appears in a section of the tree (53) associated with titles, inheritance, and violence, whereas Scotland is closely associated with English politics, and with England itself (38). “Natives,” of course, end up getting filed under landscape (18).

At the largest level, the tree is divided between relatively concrete and familiar language (probably overrepresented in letters, novels, poetry, drama) at the bottom, and more specialized discourses (philosophy, law, and so on) in the upper half. Poetic diction (13-14), and the conceptual structure of 18c philosophy (30-31), come through bright and clear.

But if this exercise really turns out to be worthwhile, it’ll be worthwhile because of the things I don’t yet understand. The thing I find most intriguing at the moment is the distinction between the language of emotion at (11) and the slightly different language of emotional response at (1), which seems more closely connected to immediacy (“moment,” “instantly”). I’ve annotated parts of those branches with the titles of some works that turn up when you use the branches as a search query: basically it seems to involve a difference between poetry/drama on one hand and the novel on the other, especially late-18c novels by female authors. I’m also intrigued by the way gender is represented at (5), although I’m not yet certain what to say about it. I don’t understand why freedom/slavery appears where it does at (17).

Finally, to tell the truth, I enjoy some of the branches in an unintellectual way as a sort of found poetry. The vocabulary of travel at (18) is almost a story in itself, and the structure of inheritance at (51) is visually fascinating. The language of sensation (12) and of landscape description (21) are also phenomenologically cool.

4 thoughts on “18c tree”

  1. This is really interesting. I like your solution to browsing down into the dendrogram, too—maybe we need something more interactive like that to help browse that sort of structure in general.

    I’d be curious to see some examples of topic composition of some individual novels—what percentage of the words in Robinson Crusoe are travel-related? What topics show the least variation in use among novels, and are they also the most intuitively difficult to understand? Do others correspond well to the genres that we already know well?

    And I had a random, not fully fleshed-out idea to filter out some of the diachronic differences—maybe you could use principal components analysis or something to find the vector that most separates out words by year of use, and then somehow (there’s the rub) subtract that out from the results before you cluster. It would also tend to underplay genres that emerged over time, but there ought to be some way to pull out a little bit of the time variation.

  2. I agree: these are two very promising avenues for further investigation. I was talking to Miles Efron the other day, and he was likewise wondering to what extent these corpus-level topics would or wouldn’t line up with the topic divisions that emerge in individual works. I guess a really strong version of structuralism would want to claim that they’ll have to turn out to be the same: individual works have to be constituted by larger structures at the level of discourse. I doubt it’ll work quite that neatly in practice, but just posing the question is a nice way to dramatize what’s at stake here.

    I also like your idea that it ought to be possible to somehow factor out the diachronic component of these differences from the generic-or-thematic component. I think that could be useful in a bunch of different ways, even if or especially if it turned out not to be possible to separate them fully. I’m actually most interested in the intersection of those two components, but you might have to try to separate them in order to figure out how they intersect.

  3. This is very, very cool, Ted. I’m particularly gratified to see your analysis of how “Irish” and “Scottish” domains get categorized so differently! This certainly confirms my sense of the very different status of these two entities in 18th c. thinking … Thanks very much for posting this!

    • Thanks, Evan. That seemed salient. It’s probably part of a broader pattern, too. In his blog Sapping Attention, Ben Schmidt has pointed out that when you cluster different subject categories in the 19c, the history of colonized regions has the strongest affinity to … military science.
