Benjamin Schmidt has been posting some fascinating reflections on different ways of analyzing texts digitally and characterizing the affinities between them.
I’m tempted to briefly comment on a technique of his that I find very promising. This is something that I don’t yet have the tools to put into practice myself, and perhaps I shouldn’t comment until I do. But I’m just finding the technique too intriguing to resist speculating about what might be done with it.
Basically, Schmidt describes a way of mapping the relationships between terms in a particular archive. He starts with a word like “evolution,” identifies texts in his archive that use the word, and then uses tf-idf weighting to identify the other words that, statistically, do most to characterize those texts.
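To make that first step concrete, here is a minimal Python sketch. It is not Schmidt's code: the function name `characteristic_terms`, the whitespace tokenization, and the particular tf-idf formula (relative term frequency times the log of inverse document frequency) are all assumptions made for illustration.

```python
import math
from collections import Counter


def characteristic_terms(docs, seed, top_n=5):
    """Rank words by summed tf-idf across the documents that contain `seed`.

    Hypothetical helper: the tokenization and weighting are illustrative
    assumptions, not a reconstruction of Schmidt's implementation.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

    scores = Counter()
    for tokens in tokenized:
        if seed not in tokens:
            continue  # only texts that use the seed word contribute
        counts = Counter(tokens)
        for word, count in counts.items():
            if word == seed:
                continue
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[word])
            scores[word] += tf * idf

    return [word for word, _ in scores.most_common(top_n)]
```

Words that appear in every document get an idf of zero, so they sink to the bottom of the ranking. Feeding the top-ranked words back in as new seeds would be one way to approximate the iteration described next.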
After iterating this process a few times, he has a list of something like 100 terms that are related to “evolution” in the sense that the whole group tends not just to occur in the same kinds of books, but to be statistically prominent in them. He then uses a range of different clustering algorithms to break this list into subsets. There is, for instance, one group of terms that’s clearly related to social applications of evolution, another that seems to be drawn from anatomy, and so on. Schmidt characterizes this as a process that maps different “discourses.” I’m particularly interested in his decision not to attempt topic modeling in the strict sense, because it echoes my own hesitation about that technique:
“In the language of text analysis, of course, I’m drifting towards not discourses, but a simple form of topic modeling. But I’m trying to only submerge myself slowly into that pool, because I don’t know how well fully machine-categorized topics will help researchers who already know their fields. Generally, we’re interested in heavily supervised models on locally chosen groups of texts.”
This makes a lot of sense to me. I’m not sure that I would want a tool that performed pure “topic modeling” from the ground up — because in a sense, the better that tool performed, the more it might replicate the implicit processing and clustering of a human reader, and I already have one of those.
Schmidt’s technique is interesting to me because the initial seed word gives it what you might call a bias, as well as a focus. The clusters he produces aren’t necessarily the same clusters that would emerge if you tried to map the latent topics of his whole archive from the ground up. Instead, he’s producing a map of the semantic space surrounding “evolution,” as seen from the perspective of that term. He offers this less as a finished product than as an example of a heuristic that humanists might use for any keyword that interested them, much in the way we’re now accustomed to using simple search strategies. Presumably it would also be possible to move from the semantic clusters he generates to a list of the documents they characterize.
I think this is a great idea, and I would add only that it could be adapted for a number of other purposes. Instead of starting with a particular seed word, you might start with a list of terms that happen to be prominent in a particular period or genre, and then use Schmidt’s technique of clustering based on tf-idf correlations to analyze the list. “Prominence” can be defined in a lot of different ways, but I’m particularly interested in words that display a similar profile of change across time.
Figure: “diction” and “elegance” in the English corpus, 1700-1900, plus the capitalized 18c versions.
For instance, I think it’s potentially rather illuminating that “diction” and “elegance” change in closely correlated ways in the late eighteenth and early nineteenth centuries. It’s interesting that they peak at the same time, and I might even be willing to say that the dip they both display, in the radical decade of the 1790s, suggests that they had a similar kind of social significance. But of course there will be dozens of other terms (and perhaps thousands of phrases) that also correlate with this profile of change, and the Google dataset won’t do anything to tell us whether they actually occurred in the same sorts of books. This could be a case of unrelated genres that happened to have emerged at the same time.
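One simple way to find terms with a similar profile of change is to correlate their yearly (or decade-by-decade) frequencies. The sketch below uses Pearson correlation; the function names, the 0.9 threshold, and the idea of storing each term's frequencies in a dictionary of lists are assumptions for illustration, and any numbers passed in would have to come from a real dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length frequency series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series has no profile to correlate with
    return cov / math.sqrt(var_x * var_y)


def correlated_terms(profiles, target, threshold=0.9):
    """Return terms whose profile correlates with `target` at or above threshold.

    `profiles` maps each term to a list of frequencies, one per time slice
    (a hypothetical data layout chosen for this sketch).
    """
    base = profiles[target]
    return sorted(
        term
        for term, series in profiles.items()
        if term != target and pearson(base, series) >= threshold
    )
```

As the paragraph above warns, a high correlation here says only that two words rose and fell together, not that they appeared in the same books.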
But I think a list of chronologically correlated terms could tell you a lot if you then took it to an archive with metadata, where Schmidt’s technique of tf-idf clustering could be used to break the list apart into subsets of terms that actually did occur in the same groups of works. In effect this would be a kind of topic modeling, but it would be topic modeling combined with a filter that selects for a particular kind of historical “topicality” or timeliness. I think this might tell me a lot, for instance, about the social factors shaping the late-eighteenth-century vogue for characterizing writing based on its “diction,” a vogue that, incidentally, has a loose relationship to data mining itself.
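The second half of that proposal, splitting a list of correlated terms into subsets that really do share documents, can be sketched very roughly as follows. This is a deliberately simple stand-in, not Schmidt's tf-idf clustering: it measures Jaccard overlap between the sets of documents in which two terms occur and links terms whose overlap meets a threshold. The function name, the threshold, and the whitespace tokenization are all assumptions.

```python
def group_by_cooccurrence(terms, docs, threshold=0.5):
    """Group terms that tend to occur in the same documents.

    Simplified stand-in for tf-idf clustering: terms are linked when the
    Jaccard overlap of their document sets is at least `threshold`, and
    linked terms are merged into one group (single-link grouping).
    """
    tokenized = [set(doc.lower().split()) for doc in docs]
    occurs = {
        term: {i for i, tokens in enumerate(tokenized) if term in tokens}
        for term in terms
    }

    def jaccard(a, b):
        union = occurs[a] | occurs[b]
        return len(occurs[a] & occurs[b]) / len(union) if union else 0.0

    groups = []
    for term in terms:
        # Find every existing group that this term links to.
        linked = [
            g for g in groups
            if any(jaccard(term, other) >= threshold for other in g)
        ]
        merged = [term]
        for g in linked:
            merged.extend(g)
            groups.remove(g)
        groups.append(merged)

    return [sorted(g) for g in groups]
```

Each resulting group also points back to documents: the works in which a group's terms co-occur are the ones a reader would then go and look at.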
I’m not sure whether other humanists would accept this kind of technique as evidence. Schmidt has some shrewd comments on the difference between data mining and assisted reading, and he’s right that humanists are usually going to prefer the latter. Plus, the same “bias” that makes a technique like this useful dispels any illusion that it is a purely objective or self-generating pattern. It’s clearly a tool used to slice an archive from a particular angle, for particular reasons.
But whether I could use it as evidence or not, a technique like this would be heuristically priceless: it would give me a way of identifying topics that peculiarly characterize a period — or perhaps even, as the dip in the 1790s hints, a particular impulse in that period — and I think it would often turn up patterns that are entirely unexpected. It might generate these patterns by looking for correlations between words, but it would then be fairly easy to turn lists of correlated words into lists of works, and investigate those in more traditionally humanistic ways.
For instance, I had no idea that “diction” would correlate with “elegance” until I stumbled on the connection, but having played around with the terms a bit in MONK, I’m already getting a sense that the terms are related not just through literary criticism (as you might expect), but also through historical discourse and (oddly) discourse about the physiology of sensation. I don’t have a tool yet that can really perform Schmidt’s sort of tf-idf clustering, but just to leave you with a sense of the interesting patterns I’m glimpsing, here’s a word cloud I generated in MONK by contrasting eighteenth-century works that contain “elegance” to the larger reference set of all eighteenth-century works. The cloud is based on Dunning’s log likelihood, and limited to adjectives, frankly, just because they’re easier to interpret at first glance.
Figure: Dark adjectives are overrepresented in a corpus of 18c works that contain "elegance," light ones underrepresented.
There’s a pretty clear contrast here between aesthetic and moral discourse, which is interesting to begin with. But it’s also a bit interesting that the emphasis on aesthetics extends into physiological terms like “sensorial,” “irritative,” and “numb,” and historical terms like “Greek” and “Latin.” Moreover, many of the same terms recur if you pursue the same strategy with “diction.”
Figure: Dark adjectives are overrepresented in a corpus of 18c works containing "diction," light ones underrepresented.
A lot of words here are predictably literary, but again you see sensory terms like “numb,” and historical ones like “Greek,” “Latin,” and “historical” itself. Once again, moreover, moral discourse is interestingly underrepresented. This is actually just one piece of the larger pattern you might generate if you pursued Schmidt’s clustering strategy — plus, Dunning’s is not the same thing as tf-idf clustering, and the MONK corpus of 1000 eighteenth-century works is smaller than one would wish — but the patterns I’m glimpsing are interesting enough to suggest to me that this general kind of approach could tell me a lot of things I don’t yet know about a period.
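For readers who want to see what sits behind clouds like these, here is a minimal sketch of Dunning's log-likelihood (G2) for a single word, computed from a two-by-two table of counts. It is not MONK's implementation. The function names are hypothetical, and the sketch assumes the analysis set and the reference set are treated as two separate corpora with positive sizes; the sign convention (positive for overrepresented, negative for underrepresented) is added here only to mirror the dark and light shading of the clouds.

```python
import math


def dunning_g2(count_a, total_a, count_b, total_b):
    """Dunning's log-likelihood (G2) for one word across two corpora.

    count_a, count_b: occurrences of the word in corpus A and corpus B.
    total_a, total_b: total word counts of each corpus (assumed > 0).
    """
    observed = [
        [count_a, count_b],                      # the word itself
        [total_a - count_a, total_b - count_b],  # all other words
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [total_a, total_b]
    grand_total = total_a + total_b

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                expected = row_totals[i] * col_totals[j] / grand_total
                g2 += o * math.log(o / expected)
    return 2 * g2


def signed_g2(count_a, total_a, count_b, total_b):
    """G2 with a sign: positive if the word is relatively more frequent in A."""
    g2 = dunning_g2(count_a, total_a, count_b, total_b)
    overrepresented = count_a / total_a >= count_b / total_b
    return g2 if overrepresented else -g2
```

Ranking every adjective in the analysis set by `signed_g2` would give the raw material for a cloud: the largest positive scores are the overrepresented words, the most negative the underrepresented ones.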