I wrote a long post last Friday arguing that topic-modeling an 18c collection is a reliable way of discovering eighteenth- and nineteenth-century trends, even in a different collection.
But when I woke up on Saturday I realized that this result probably didn’t depend on anything special about “topic modeling.” After all, the topic-modeling process I was using was merely a form of clustering. And all the clustering process does is locate the hypothetical centers of more-or-less-evenly-sized clouds of words in vector space. It seemed likely that the specific locations of these centers didn’t matter. The key was the underlying measure of association between words — “cosine similarity in vector space,” which is a fancy way of saying “tendency to be common in the same 18c volumes.” Any group of words that were common (and uncommon) in the same 18c volumes would probably tend to track each other over time, even into the next century.
To test this I wrote a script that chose 200 words at random from the top 5000 in a collection of 2,193 18c volumes (drawn from TCP-ECCO with help from Laura Mandell), and then created a cluster around each word by choosing the 25 words most likely to appear in the same volumes (cosine similarity). Would pairs of words drawn from these randomly distributed clusters also show a tendency to correlate with each other over time?
They absolutely do. The Fisher weighted mean pairwise r for all possible pairs drawn from the same cluster is .267 in the 18c and .284 in the 19c (the 19c results are probably better because Google’s dataset is better in the 19c even after my efforts to clean the 18c up*). At n = 100 (measured over a century), both correlations have rock-solid statistical significance, p < .01. And in case you're wondering … yes, I wrote another script to test randomly selected words using the same statistical procedure, and the mean pairwise r for randomly selected pairs (factoring out, as usual, partial correlation with the larger 5000-word group they’re selected from) is .0008. So I feel confident that the error level here is low.**
What does this mean, concretely? It means that the universe of word-frequency data is not chaotic. Words that appear in the same discursive contexts tend, strongly, to track each other over time, and (although I haven’t tested this rigorously yet), it’s not hard to see that the converse proposition is also going to hold true: words that track each other over time are going to tend to have contextual associations as well.
To put it even more bluntly: pick a word, any word! We are likely to be able to define not just a topic associated with that word, but a discourse — a group of words that are contextually related and that also tend to wax and wane together over time. I don’t imagine that in doing so we prove anything of humanistic significance, but I do think it means that we can raise several thousand significant questions. To start with: what was the deal with eagerness, gratification, and disappointment in the second half of the eighteenth century?
* A better version of Google’s 18c dataset may be forthcoming from the NCSA.
** For people who care about the statistical reliability of data-mining, here’s the real kicker: if you run a Benjamini-Hochberg procedure on these 200 randomly-generated clusters, 134 of them have significance at p < .05 in the 19c even after controlling for the false discovery rate. To put that more intelligibly, these are guaranteed not to be xkcd’s green jelly beans. The coherence of these clusters is even greater than the ones produced by topic-modeling, but that’s probably because they are on average slightly smaller (25 words); I have yet to test the relative importance of different generation procedures while holding cluster size rigorously constant.