I’ve been talking about correlation since I started this blog. Actually, that was the reason why I did start it: I think literary scholars can get a huge amount of heuristic leverage out of the fact that thematically and socially-related words tend to rise and fall together. It’s a simple observation, and one that stares you in the face as soon as you start to graph word frequencies on the time axis.1 But it happens to be useful for literary historians, because it tends to uncover topics that also pose periodizable kinds of puzzles. Sometimes the puzzle takes the form of a topic we intuitively recognize (say, the concept of “color”) that increases or decreases in prominence for reasons that remain to be explained:
At other times, the connection between elements of the topic is not immediately intuitive, but the terms are related closely enough that their correlation suggests a pattern worthy of further exploration. The relationship between terms may be broadly historical:
Or it may involve a pattern of expression that characterizes a periodizable style:
Of course, as the semantic relationship between terms becomes less intuitively obvious, scholars are going to wonder whether they’re looking at a real connection or merely an accidental correlation. “Ardent” and “tranquil” seem like opposites; can they really be related as elements of a single discourse? And what’s the relationship to “bosom,” anyway?
Ultimately, questions like this have to be addressed on a case-by-case basis; the significance of the lead has to be fleshed out both with further analysis, and with close reading.
But scholars who are wondering about the heuristic value of correlation may be reassured to know that this sort of lead does generally tend to pan out. Words that correlate with each other across the time axis do in practice tend to appear in the same kinds of volumes. For instance, if you randomly select pairs of words from the top 10,000 words in the Google English ngrams dataset 1700-1849,2 measure their correlation with each other in that dataset across the period 1700-1849, and then measure their tendency to appear in the same volumes in a different collection3 (taking the cosine similarity of term vectors in a term-document matrix), the different measures of association correlate with each other strongly. (Pearson’s r is 0.265, significant at p < 0.0005.) Moreover, the relationship holds (less strongly, but still significantly) even in adjacent centuries: words that appear in the same eighteenth-century volumes still tend to rise and fall together in the nineteenth century.
Why should humanists care about the statistical relationship between two measures of association? It means that correlation-mining is in general going to be a useful way of identifying periodizable discourses. If you find a group of words that correlate with each other strongly, and that seem related at first glance, it's probably going to be worthwhile to follow up the hunch. You’re probably looking at a discourse that is bound together both diachronically (in the sense that the terms rise and fall together) and topically (in the sense that they tend to appear in the same kinds of volumes).
Ultimately, literary historians are going to want to assess correlation within different genres; a dataset like Google's, which mixes all genres in a single pool, is not going to be an ideal tool. However, this is also a domain where size matters, and in that respect, at the moment, the ngrams dataset is very helpful. It becomes even more helpful if you correct some of the errors that vitiate it in the period before 1820. A team of researchers at Illinois and Stanford4, supported by the Andrew W. Mellon Foundation, has been doing that over the course of the last year, and we're now able to make an early version of the tool available on the web. Right now, this ngram viewer only covers the period 1700-1899, but we hope it will be useful for researchers in that period, because it has mostly corrected the long-s problem that confufes opt1cal charader readers in the 18c — as well as a host of other, less notorious problems. Moreover, it allows researchers to mine correlations in the top 10,000 words of the lexicon, instead of trying words one by one to see whether an interesting pattern emerges. In the near future, we hope to expand the correlation miner to cover the twentieth century as well.
For further discussion of the statistical relationship between topics and trends, see this paper submitted to DHCS 2011.
UPDATE Nov 22, 2011: At DHCS 2011, Travis Brown pointed out to me that Topics Over Time (Wang and McCallum) might mine very similar patterns in a more elegant, generative way. I hope to find a way to test that method, and may perhaps try to build an implementation for it myself.
1) Ryan Heuser and I both noticed this pattern last winter. Ryan and Long Le-Khac presented on a related topic at DH2011: Heuser, Ryan, and Le-Khac, Long. “Abstract Values in the 19th Century British Novel: Decline and Transformation of a Semantic Field,” Digital Humanities 2011, Stanford University.
2) Jean-Baptiste Michel*, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, William Brockman, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden*. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science (Published online ahead of print: 12/16/2010)
3) The collection of 3134 documents (1700-1849) I used for this calculation was produced by combining ECCO-TCP volumes with nineteenth-century volumes selected and digitized by Jordan Sellers.
4) The SEASR Correlation Analysis and Ngrams Viewer was developed by Loretta Auvil and Boris Capitanu at the Illinois Informatics Institute, modeled on prototypes built by Ted Underwood, University of Illinois, and Ryan Heuser, Stanford.