How not to do things with words.

In recent weeks, journals published two papers purporting to draw broad cultural inferences from Google’s ngram corpus. The first of these papers, in PLoS One, argued that “language in American books has become increasingly focused on the self and uniqueness” since 1960. The second, in The Journal of Positive Psychology, argued that “moral ideals and virtues have largely waned from the public conversation” in twentieth-century America. Both articles received substantial attention from journalists and blogs; both have been discussed skeptically by linguists and digital humanists. (Mark Liberman’s takes on Language Log are particularly worth reading.)

I’m writing this post because systems of academic review and communication are failing us in cases like this, and we need to step up our game. Tools like Google’s ngram viewer have created new opportunities, but also new methodological pitfalls. Humanists are aware of those pitfalls, but I think we need to work a bit harder to get the word out to journalists, and to disciplines like psychology.

The basic methodological problem in both articles is that researchers have used present-day patterns of association to define a wordlist that they then take as an index of the fortunes of some concept (morality, individualism, etc) over historical time. (In the second study, for instance, words associated with morality were extracted from a thesaurus and crowdsourced using Mechanical Turk.)

The fallacy involved here has little to do with hot-button issues of quantification. A basic premise of historicism is that human experience gets divided up in different ways in different eras. If we crowdsource “leadership” using twenty-first-century reactions on Mechanical Turk, for instance, we’ll probably get words like “visionary” and “professional.” “Loud-voiced” probably won’t be on the list — because that’s just rude. But to Homer, there’s nothing especially noble about working for hire (“professionally”), whereas “the loud-voiced Achilles” is cut out to be a leader of men, since he can be heard over the din of spears beating on shields (Blackwell).

The laws of perspective apply to history as well. We don’t have an objective overview; we have a position in time that produces its own kind of distortion and foreshortening. Photo 2004 by June Ruivivar.

The authors of both articles are dimly aware of this problem, but they imagine that it’s something they can dismiss if they’re just conscientious and careful to choose a good list of words. I don’t blame them; they’re not coming from historical disciplines. But one of the things you learn by working in a historical discipline is that our perspective is often limited by history in ways we are unable to anticipate. So if you want to understand what morality meant in 1900, you have to work to reconstruct that concept; it is not going to be intuitively accessible to you, and it cannot be crowdsourced.

The classic way to reconstruct concepts from the past involves immersing yourself in sources from the period. That’s probably still the best way, but where language is concerned, there are also quantitative techniques that can help. For instance, Ryan Heuser and Long Le-Khac have carried out research on word frequency in the nineteenth-century novel that might superficially look like the psychological articles I am critiquing. (It’s Pamphlet 4 in the Stanford Literary Lab series.) But their work is much more reliable and more interesting, because it begins by mining patterns of association from the period in question. They don’t start from an abstract concept like “individualism” and pick words that might be associated with it. Instead, they find groups of words that are associated with each other, in practice, in nineteenth-century novels, and then trace the history of those groups. In doing so, they find some intriguing patterns that scholars of the nineteenth-century novel are going to need to pay attention to.

It’s also relevant that Heuser and Le-Khac are working in a corpus that is limited to fiction. One of the problems with the Google ngram corpus is that really we have no idea what genres are represented in it, or how their relative proportions may vary over time. So it’s possible that an apparent decline in the frequency of words for moral values is actually a decline in the frequency of certain genres — say, conduct books, or hagiographic biographies. A decline of that sort would still be telling us something about literary culture; but it might be telling us something different than we initially assume from tracing the decline of a word like “fidelity.”

So please, if you know a psychologist, or journalist, or someone who blogs for The Atlantic: let them know that there is actually an emerging interdisciplinary field developing a methodology to grapple with this sort of evidence. Articles that purport to draw historical conclusions from language need to demonstrate that they have thought about the problems involved. That will require thinking about math, but it also, definitely, requires thinking about dilemmas of historical interpretation.

My illustration about “loud-voiced Achilles” is a very old example of the way concepts change over time, drawn via Friedrich Meinecke from Thomas Blackwell, An Enquiry into the Life and Writings of Homer, 1735. The word “professional,” by the way, also illustrates a kind of subtly moralized contemporary vocabulary that Kesebir & Kesebir may be ignoring in their account of the decline of moral virtue. One of the other dilemmas of historical perspective is that we’re in our own blind spot.

Exploring the relationship between topics and trends.

I’ve been talking about correlation since I started this blog. Actually, that was the reason why I did start it: I think literary scholars can get a huge amount of heuristic leverage out of the fact that thematically and socially-related words tend to rise and fall together. It’s a simple observation, and one that stares you in the face as soon as you start to graph word frequencies on the time axis.1 But it happens to be useful for literary historians, because it tends to uncover topics that also pose periodizable kinds of puzzles. Sometimes the puzzle takes the form of a topic we intuitively recognize (say, the concept of “color”) that increases or decreases in prominence for reasons that remain to be explained:

At other times, the connection between elements of the topic is not immediately intuitive, but the terms are related closely enough that their correlation suggests a pattern worthy of further exploration. The relationship between terms may be broadly historical:

Or it may involve a pattern of expression that characterizes a periodizable style:

Of course, as the semantic relationship between terms becomes less intuitively obvious, scholars are going to wonder whether they’re looking at a real connection or merely an accidental correlation. “Ardent” and “tranquil” seem like opposites; can they really be related as elements of a single discourse? And what’s the relationship to “bosom,” anyway?

Ultimately, questions like this have to be addressed on a case-by-case basis; the significance of the lead has to be fleshed out both with further analysis, and with close reading.

But scholars who are wondering about the heuristic value of correlation may be reassured to know that this sort of lead does generally tend to pan out. Words that correlate with each other across the time axis do in practice tend to appear in the same kinds of volumes. For instance, if you randomly select pairs of words from the top 10,000 words in the Google English ngrams dataset 1700-1849,2 measure their correlation with each other in that dataset across the period 1700-1849, and then measure their tendency to appear in the same volumes in a different collection3 (taking the cosine similarity of term vectors in a term-document matrix), the different measures of association correlate with each other strongly. (Pearson’s r is 0.265, significant at p < 0.0005.) Moreover, the relationship holds (less strongly, but still significantly) even in adjacent centuries: words that appear in the same eighteenth-century volumes still tend to rise and fall together in the nineteenth century.

Why should humanists care about the statistical relationship between two measures of association? It means that correlation-mining is in general going to be a useful way of identifying periodizable discourses. If you find a group of words that correlate with each other strongly, and that seem related at first glance, it's probably going to be worthwhile to follow up the hunch. You’re probably looking at a discourse that is bound together both diachronically (in the sense that the terms rise and fall together) and topically (in the sense that they tend to appear in the same kinds of volumes).

Ultimately, literary historians are going to want to assess correlation within different genres; a dataset like Google's, which mixes all genres in a single pool, is not going to be an ideal tool. However, this is also a domain where size matters, and in that respect, at the moment, the ngrams dataset is very helpful. It becomes even more helpful if you correct some of the errors that vitiate it in the period before 1820. A team of researchers at Illinois and Stanford4, supported by the Andrew W. Mellon Foundation, has been doing that over the course of the last year, and we're now able to make an early version of the tool available on the web. Right now, this ngram viewer only covers the period 1700-1899, but we hope it will be useful for researchers in that period, because it has mostly corrected the long-s problem that confufes opt1cal charader readers in the 18c — as well as a host of other, less notorious problems. Moreover, it allows researchers to mine correlations in the top 10,000 words of the lexicon, instead of trying words one by one to see whether an interesting pattern emerges. In the near future, we hope to expand the correlation miner to cover the twentieth century as well.

For further discussion of the statistical relationship between topics and trends, see this paper submitted to DHCS 2011.

UPDATE Nov 22, 2011: At DHCS 2011, Travis Brown pointed out to me that Topics Over Time (Wang and McCallum) might mine very similar patterns in a more elegant, generative way. I hope to find a way to test that method, and may perhaps try to build an implementation for it myself.

1) Ryan Heuser and I both noticed this pattern last winter. Ryan and Long Le-Khac presented on a related topic at DH2011: Heuser, Ryan, and Le-Khac, Long. “Abstract Values in the 19th Century British Novel: Decline and Transformation of a Semantic Field,” Digital Humanities 2011, Stanford University.

2) Jean-Baptiste Michel*, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, William Brockman, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden*. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science (Published online ahead of print: 12/16/2010)

3) The collection of 3134 documents (1700-1849) I used for this calculation was produced by combining ECCO-TCP volumes with nineteenth-century volumes selected and digitized by Jordan Sellers.

4) The SEASR Correlation Analysis and Ngrams Viewer was developed by Loretta Auvil and Boris Capitanu at the Illinois Informatics Institute, modeled on prototypes built by Ted Underwood, University of Illinois, and Ryan Heuser, Stanford.

Words that appear in the same 18c volumes also track each other over time, through the 19c.

I wrote a long post last Friday arguing that topic-modeling an 18c collection is a reliable way of discovering eighteenth- and nineteenth-century trends, even in a different collection.

But when I woke up on Saturday I realized that this result probably didn’t depend on anything special about “topic modeling.” After all, the topic-modeling process I was using was merely a form of clustering. And all the clustering process does is locate the hypothetical centers of more-or-less-evenly-sized clouds of words in vector space. It seemed likely that the specific locations of these centers didn’t matter. The key was the underlying measure of association between words — “cosine similarity in vector space,” which is a fancy way of saying “tendency to be common in the same 18c volumes.” Any group of words that were common (and uncommon) in the same 18c volumes would probably tend to track each other over time, even into the next century.

Six words that tend to occur in the same 18c volumes as 'gratify' (in TCP-ECCO), plotted over time in a different collection (a corrected version of Google ngrams).

To test this I wrote a script that chose 200 words at random from the top 5000 in a collection of 2,193 18c volumes (drawn from TCP-ECCO with help from Laura Mandell), and then created a cluster around each word by choosing the 25 words most likely to appear in the same volumes (cosine similarity). Would pairs of words drawn from these randomly distributed clusters also show a tendency to correlate with each other over time?

They absolutely do. The Fisher weighted mean pairwise r for all possible pairs drawn from the same cluster is .267 in the 18c and .284 in the 19c (the 19c results are probably better because Google’s dataset is better in the 19c even after my efforts to clean the 18c up*). At n = 100 (measured over a century), both correlations have rock-solid statistical significance, p < .01. And in case you're wondering … yes, I wrote another script to test randomly selected words using the same statistical procedure, and the mean pairwise r for randomly selected pairs (factoring out, as usual, partial correlation with the larger 5000-word group they’re selected from) is .0008. So I feel confident that the error level here is low.**

What does this mean, concretely? It means that the universe of word-frequency data is not chaotic. Words that appear in the same discursive contexts tend, strongly, to track each other over time, and (although I haven’t tested this rigorously yet), it’s not hard to see that the converse proposition is also going to hold true: words that track each other over time are going to tend to have contextual associations as well.

To put it even more bluntly: pick a word, any word! We are likely to be able to define not just a topic associated with that word, but a discourse — a group of words that are contextually related and that also tend to wax and wane together over time. I don’t imagine that in doing so we prove anything of humanistic significance, but I do think it means that we can raise several thousand significant questions. To start with: what was the deal with eagerness, gratification, and disappointment in the second half of the eighteenth century?

* A better version of Google’s 18c dataset may be forthcoming from the NCSA.

** For people who care about the statistical reliability of data-mining, here’s the real kicker: if you run a Benjamini-Hochberg procedure on these 200 randomly-generated clusters, 134 of them have significance at p < .05 in the 19c even after controlling for the false discovery rate. To put that more intelligibly, these are guaranteed not to be xkcd’s green jelly beans. The coherence of these clusters is even greater than the ones produced by topic-modeling, but that’s probably because they are on average slightly smaller (25 words); I have yet to test the relative importance of different generation procedures while holding cluster size rigorously constant.

Trends, topics, and trending topics.

I’ve developed a text-mining strategy that identifies what I call “trending topics” — with apologies to Twitter, where the term is used a little differently. These are diachronic patterns that I find practically useful as a literary historian, although they don’t fit very neatly into existing text-mining categories.

A “topic,” as the term is used in text-mining, is a group of words that occur together in a way that defines a thematic focus. Cameron Blevin’s analysis of Martha Ballard’s diary is often cited as an example: Blevin identifies groups of words that seem to be associated, for instance, with “midwifery,” “death,” or “gardening,” and tracks these topics over the course of the diary.

“Trends” haven’t received as much attention as topics, but we need some way to describe the pattern that Google’s ngram viewer has made so visible, where groups of related words rise and fall together across long periods of time. I suspect “trend” is as a good a name for this phenomenon as we’ll get.

blue, red, green, yellow, in the English corpus 1750-2000

From 1750 to 1920, the prominence of color vocabulary increases by a factor of three, for instance: and when it does, the names of different colors track each other very closely. I would call this a trend. Moreover, it’s possible to extend the principle that conceptually related words rise and fall together beyond cases like the colors and seasons where we’re dealing with an obvious physical category.

Google data graphed with my own viewer; if you compare this to Google's viewer, remember that I'm merging capitalized and uncapitalized forms, as well as ardor/ardour.

“Animated,” “attentive,” and “ardour” track each other almost as closely as the names of primary colors (the correlation coefficients are around 0.8), and they characterize conduct in ways that are similar enough to suggest that we’re looking at the waxing and waning not just of a few random words, but of a conceptual category — say, a particular sort of interest in states of heightened receptiveness or expressivity.

I think we could learn a lot by thoughtfully considering “trends” of this sort, but it’s also a kind of evidence that’s not easy to interpret, and that could easily be abused. A lot of other words correlate almost as closely with “attentive,” including “propriety,” “elegance,” “sentiments,” “manners,” “flattering,” and “conduct.” Now, I don’t think that’s exactly a random list (these terms could all be characterized loosely as a discourse of manners), but it does cover more conceptual ground than I initially indicated by focusing on words like “animated” and “ardour.” And how do we know that any of these terms actually belonged to the same “discourse”? Perhaps the books that talked about “conduct” were careful not to talk about “ardour”! Isn’t it possible that we have several distinct discourses here that just happened to be rising and falling at the same time?

In order to answer these questions, I’ve been developing a technique that mines “trends” that are at the same time “topics.” In other words, I look for groups of terms that hold together both in the sense that they rise and fall together (correlation across time), and in the sense that they tend to be common in the same documents (co-occurrence). My way of achieving this right now is a two-stage process: first I mine loosely defined trends from the Google ngrams dataset (long lists of, say, one hundred closely correlated words), and then I send those trends to a smaller, generically diverse collection (including everything from sermons to plays) where I can break the list into clusters of terms that tend to occur in the same kinds of documents.

I do this with the same vector space model and hierarchical clustering technique I’ve been using to map eighteenth-century diction on a larger scale. It turns the list of correlated words into a large, branching tree. When you look at a single branch of that tree you’re looking at what I would call a “trending topic” — a topic that represents, not a stable, more-or-less-familiar conceptual category, but a dynamically-linked set of concepts that became prominent at the same time, and in connection with each other.

one branch of a tree created by finding words that correlate with "manners," and then clustering them based on co-occurrence in 18c books

Here, for instance, is a branch of a larger tree that I produced by clustering words that correlate with “manners” in the eighteenth century. It may not immediately look thematically coherent. We might have expected “manners” to be associated with words like “propriety” or “conduct” (which do in fact correlate with it over time), but when we look at terms that change in correlated ways and occur in the same volumes, we get a list of words that are largely about wealth and rank (“luxury,” “opulence,” “magnificence”), as well as the puzzling “enervated.” To understand a phenomenon like this, you can simply reverse the process that generated it, by using the list as a search query in the eighteenth-century collection it’s based on. What turned up in this case were, pre-eminently, a set of mid-eighteenth-century works debating whether modern commercial opulence, and refinements in the arts, have had an enervating effect on British manners and civic virtue. Typical examples are John Brown’s Estimate of the Manners and Principles of the Times (1757) and John Trusler’s Luxury no Political Evil but Demonstratively Proved to be Necessary to the Preservation and Prosperity of States (1781). I was dimly aware of this debate, but didn’t grasp how central it became to debate about manners, and certainly wasn’t familiar with the works by Brown and Trusler.

I feel like this technique is doing what I want it to do, practically, as a literary historian. It makes the ngram viewer something more than a provocative curiosity. If I see an interesting peak in a particular word, I can can map the broader trend of which it’s a part, and then break that trend up into intersecting discourses, or individual works and authors.

Admittedly, there’s something inelegant about the two-stage process I’m using, where I first generate a list of terms and then use a smaller collection to break the list into clusters. When I discussed the process with Ben Schmidt and Miles Efron, they both, independently, suggested that there ought to be some simpler way of distinguishing “trends” from “topics” in a single collection, perhaps by using Principal Component Analysis. I agree about that, and PCA is an intriguing suggestion. On the other hand, the two-stage process is adapted to the two kinds of collections I actually have available at the moment: on the one hand, the Google dataset, which is very large and very good at mapping trends with precision, but devoid of metadata, on the other hand smaller, richer collections that are good at modeling topics, but not large enough to produce smooth trend lines. I’m going to experiment with Principal Component Analysis and see what it can do for me, but in the meantime — speaking as a literary historian rather than a computational linguist — I’m pretty happy with this rough-and-ready way of identifying trending topics. It’s not an analytical tool: it’s just a souped-up search technology that mines trends and identifies groups of works that could help me understand them. But as a humanist, that’s exactly what I want text mining to provide.

The Google dataset as an episode in the history of science.

In a few years, some enterprising historian of science is going to write a history of the “culturomics” controversy, and it’s going to be fun to read. In some ways, the episode is a classic model of the social processes underlying the production of knowledge. Whenever someone creates a new method or tool (say, an air pump), and claims to produce knowledge with it, they run head-on into the problem that knowledge is social. If the tool is really new, their experience with it is by definition anomalous, and anomalous experiences — no matter how striking — never count as knowledge. They get dismissed as amusing curiosities.

Robert Boyle's air pump.

The team that published in Science has attempted to address this social problem, as scientists usually do, by making their data public and carefully describing the conditions of their experiment. In this case, however, one runs into the special problem that the underlying texts are the private property of Google, and have been released only in a highly compressed form that strips out metadata. As Matt Jockers may have been the first to note, we don’t yet even have a bibliography of the contents of each corpus. Yesterday, in a new FAQ posted on (see section III.5), researchers acknowledged that they want to release such a bibliography, but haven’t yet received permission from Google to do it.

This is going to produce a very interesting deadlock. I’ve argued in many other posts that the Google dataset is invaluable, because its sheer scale allows us to grasp diachronic patterns that wouldn’t otherwise be visible. But without a list of titles, it’s going to be difficult to cite it as evidence. What I suspect may happen is that humanists will start relying on it in private to discover patterns, but then write those patterns up as if they had just been doing, you know, a bit of browsing in 500,000 books — much as we now use search engines quietly and without acknowledgment, although they in fact entail significant methodological choices. As Benjamin Schmidt has recently been arguing, search technology is based on statistical presuppositions more complex and specific than most people realize, presuppositions that humanists already “use all the time to, essentially, do a form of reading for them.”

A different solution, and the one I’ll try, is to use the Google dataset openly, but in conjunction with other smaller and more transparent collections. I’ll use the scope of the Google dataset to sketch broad contours of change, and then switch to a smaller archive in order to reach firmer and more detailed conclusions. But I still hope that Google can somehow be convinced to release a bibliography — at least of the works that are out of copyright — and I would urge humanists to keep lobbying them.

If some of the dilemmas surrounding this tool are classic history-of-science problems, others are specific to a culture clash between the humanities and the sciences. For instance, I’ve argued in the past that humanists need to develop a quantitative conception of error. We’re very talented at making the perfect the enemy of the good, but that simply isn’t how statistical knowledge works. As the newly-released FAQ points out, there’s a comparably high rate of error in fields like genomics.

On other topics, though, it may be necessary for scientists to learn a bit more about the way humanists think. For instance, one of the corpora included in the ngram viewer is labeled “English fiction.” Matt Jockers was the first to point out that this is potentially ambiguous. I assumed that it contained mostly novels and short stories, since that’s how we use the word in the humanities, but prompted by Matt’s skepticism, I wrote the culturomics team to inquire. Yesterday in the FAQ they answered my question, and it turns out that Matt’s skepticism was well founded.

Crucially, it’s not just actual works of fiction! The English fiction corpus contains some fiction and lots of fiction-associated work, like commentary and criticism. We created the fiction corpus as an experiment meant to explore the notion of creating a subject-specific corpus. We don’t actually use it in the main text of our paper because the experiment isn’t very far along. Even so, a thoughtful data analyst can do interesting things with this corpus, for instance by comparing it to the results for English as a whole.

Humanists are going to find that an eye-opening paragraph. This conception of fiction is radically different from the way we usually understand fiction — as a genre. Instead, the culturomics team has constructed a corpus based on fiction as a subject category; or perhaps it would be better to say that they have combined the two conceptions. I can say pretty confidently that no humanist will want to rely on the corpus of “English fiction” to make claims about fiction; it represents something new and anomalous.

On the other hand, I have to say that I’m personally grateful that the culturomics team made this corpus available — not because it tells me much about fiction, but because it tells me something about what happens when you try to hold “subject designations” constant across time instead of allowing the relative proportions of books in different subjects to fluctuate as they actually did in publishing history. I think they’re right that this is a useful point of comparison, although at the moment the corpus is labeled in a potentially misleading way.

In general, though, I’m going to use the main English corpus, which is easier to interpret. The lack of metadata is still a problem here, but this corpus seems to represent university library collections more fully than any other dataset I have access to. While sheer scale is a crude criterion of representativeness, for some questions it’s the useful one.

The long and short of it all is that the next few years are going to be a wild ride. I’m convinced that advances in digital humanities are reaching the point where they’re going to start allowing us to describe some large, fascinating, and until now largely invisible patterns. But at the moment, the biggest dataset — prominent in public imagination, but also genuinely useful — is curated by scientists, and by a private corporation that has not yet released full information about it. The stage is set for a conflict of considerable intensity and complexity.

Identifying topics with a specific kind of historical timeliness.

Benjamin Schmidt has been posting some fascinating reflections on different ways of analyzing texts digitally and characterizing the affinities between them.

I’m tempted to briefly comment on a technique of his that I find very promising. This is something that I don’t yet have the tools to put into practice myself, and perhaps I shouldn’t comment until I do. But I’m just finding the technique too intriguing to resist speculating about what might be done with it.

Basically, Schmidt describes a way of mapping the relationships between terms in a particular archive. He starts with a word like “evolution,” identifies texts in his archive that use the word, and then uses tf-idf weighting to identify the other words that, statistically, do most to characterize those texts.

After iterating this process a few times, he has a list of something like 100 terms that are related to “evolution” in the sense that this whole group of terms tends, not just to occur in the same kinds of books, but to be statistically prominent in them. He then uses a range of different clustering algorithms to break this list into subsets. There is, for instance, one group of terms that’s clearly related to social applications of evolution, another that seems to be drawn from anatomy, and so on. Schmidt characterizes this as a process that maps different “discourses.” I’m particularly interested in his decision not to attempt topic modeling in the strict sense, because it echoes my own hesitation about that technique:

In the language of text analysis, of course, I’m drifting towards not discourses, but a simple form of topic modeling. But I’m trying to only submerge myself slowly into that pool, because I don’t know how well fully machine-categorized topics will help researchers who already know their fields. Generally, we’re interested in heavily supervised models on locally chosen groups of texts.

This makes a lot of sense to me. I’m not sure that I would want a tool that performed pure “topic modeling” from the ground up — because in a sense, the better that tool performed, the more it might replicate the implicit processing and clustering of a human reader, and I already have one of those.

Schmidt’s technique is interesting to me because the initial seed word gives it what you might call a bias, as well as a focus. The clusters he produces aren’t necessarily the same clusters that would emerge if you tried to map the latent topics of his whole archive from the ground up. Instead, he’s producing a map of the semantic space surrounding “evolution,” as seen from the perspective of that term. He offers this less as a finished product than as an example of a heuristic that humanists might use for any keyword that interested them, much in the way we’re now accustomed to using simple search strategies. Presumably it would also be possible to move from the semantic clusters he generates to a list of the documents they characterize.

I think this is a great idea, and I would add only that it could be adapted for a number of other purposes. Instead of starting with a particular seed word, you might start with a list of terms that happen to be prominent in a particular period or genre, and then use Schmidt’s technique of clustering based on tf-idf correlations to analyze the list. “Prominence” can be defined in a lot of different ways, but I’m particularly interested in words that display a similar profile of change across time.

diction, elegance, in the English corpus, 1700-1900, plus the capitalized 18c versions

For instance, I think it’s potentially rather illuminating that “diction” and “elegance” change in closely correlated ways in the late eighteenth and early nineteenth century. It’s interesting that they peak at the same time, and I might even be willing to say that the dip they both display, in the radical decade of the 1790s, suggests that they had a similar kind of social significance. But of course there will be dozens of other terms (and perhaps thousands of phrases) that also correlate with this profile of change, and the Google dataset won’t do anything to tell us whether they actually occurred in the same sorts of books. This could be a case of unrelated genres that happened to have emerged at the same time.

But I think a list of chronologically correlated terms could tell you a lot if you then took it to an archive with metadata, where Schmidt’s technique of tf-idf clustering could be used to break the list apart into subsets of terms that actually did occur in the same groups of works. In effect this would be a kind of topic modeling, but it would be topic modeling combined with a filter that selects for a particular kind of historical “topicality” or timeliness. I think this might tell me a lot, for instance, about the social factors shaping the late-eighteenth-century vogue for characterizing writing based on its “diction” — a vogue that, incidentally, has a loose relationship to data mining itself.

I’m not sure whether other humanists would accept this kind of technique as evidence. Schmidt has some shrewd comments on the difference between data mining and assisted reading, and he’s right that humanists are usually going to prefer the latter. Plus, the same “bias” that makes a technique like this useful dispels any illusion that it is a purely objective or self-generating pattern. It’s clearly a tool used to slice an archive from a particular angle, for particular reasons.

But whether I could use it as evidence or not, a technique like this would be heuristically priceless: it would give me a way of identifying topics that peculiarly characterize a period — or perhaps even, as the dip in the 1790s hints, a particular impulse in that period — and I think it would often turn up patterns that are entirely unexpected. It might generate these patterns by looking for correlations between words, but it would then be fairly easy to turn lists of correlated words into lists of works, and investigate those in more traditionally humanistic ways.

For instance, I had no idea that “diction” would correlate with “elegance” until I stumbled on the connection, but having played around with the terms a bit in MONK, I’m already getting a sense that the terms are related not just through literary criticism (as you might expect), but also through historical discourse and (oddly) discourse about the physiology of sensation. I don’t have a tool yet that can really perform Schmidt’s sort of tf-idf clustering, but just to leave you with a sense of the interesting patterns I’m glimpsing, here’s a word cloud I generated in MONK by contrasting eighteenth-century works that contain “elegance” to the larger reference set of all eighteenth-century works. The cloud is based on Dunning’s log likelihood, and limited to adjectives, frankly, just because they’re easier to interpret at first glance.

Dark adjectives are overrepresented in a corpus of 18c works that contain "elegance," light ones underrepresented.

There’s a pretty clear contrast here between aesthetic and moral discourse, which is interesting to begin with. But it’s also a bit interesting that the emphasis on aesthetics extends into physiological terms like “sensorial,” “irritative,” and “numb,” and historical terms like “Greek” and “Latin.” Moreover, many of the same terms reoccur if you pursue the same strategy with “diction.”

Dark adjectives are overrepresented in a corpus of 18c works containing "diction," light ones underrepresented.

A lot of words here are predictably literary, but again you see sensory terms like “numb,” and historical ones like “Greek,” “Latin,” and “historical” itself. Once again, moreover, moral discourse is interestingly underrepresented. This is actually just one piece of the larger pattern you might generate if you pursued Schmidt’s clustering strategy — plus, Dunning’s is not the same thing as tf-idf clustering, and the MONK corpus of 1000 eighteenth-century works is smaller than one would wish — but the patterns I’m glimpsing are interesting enough to suggest to me that this general kind of approach could tell me a lot of things I don’t yet know about a period.

How to make the Google dataset work for humanists.

I started blogging about the Google dataset because it revealed stylistic trends so intriguing that I couldn’t wait to write them up. But these reflections are also ending up in a blog because they can’t yet go in an article. The ngram viewer, as fascinating as it is, is not yet very useful as evidence in a humanistic argument.

As I’ve explained at more length elsewhere, the problems that most humanists have initially pointed to don’t seem to me especially troubling. It’s true that the data contains noise — but so does all data. Researchers in other fields don’t wait for noiseless instruments before they draw any conclusions; they assess the signal/noise ratio and try to frame questions that are answerable within those limits.

It’s also true that the history of diction doesn’t provide transparent answers to social and literary questions. This kind of evidence will require context and careful interpretation. In which respect, it resembles every other kind of evidence humanists currently grapple with.

Satanic, Satanic influence, Satanic verses, in English corpus, 1800-2000

The problem that seems more significant to me is one that Matt Jockers has raised. We simply don’t yet know what’s in these corpora. We do know how they were constructed: that’s explained, in a fairly detailed way, in the background material supporting the original article in Science. But we don’t yet have access to a list of titles for each corpus.

Here differences between disciplines become amusing. For a humanist, it’s a little shocking that a journal like Science would publish results without what we would call simply “a bibliography” — a list of the primary texts that provide evidence for the assertion. The list contains millions of titles in this case, and would be heavy in print. But it seems easy enough for Google, or the culturomics research team, to make these lists available on the web. In fact, I assume they’re forthcoming; the datasets themselves aren’t fully uploaded yet, so apparently more information is on the way. I’ve written Google Labs asking whether they plan to release lists of titles, and I’ll update this post when they do.

Until they do, it will be difficult for humanists to use the ngram viewer as scholarly evidence. The background material to the Science article does suggest that these datasets have been constructed thoughtfully, with an awareness of publishing history, and on an impressive scale. But humanists and scientists understand evidence differently. I can’t convince other humanists by telling them “Look, here’s how I did the experiment.” I have to actually show them the stuff I experimented on — that is, a bibliography.

Ideally, one might ask even more from Google. They could make the original texts themselves available (at least those out of copyright), so that we could construct our own archives. With the ability to ask questions about genre and context of occurrence, we could connect quantitative trends to a more conventional kind of literary history. Instead of simply observing that a lot of physical adjectives peak around 1940, The Big Sleepwe could figure out how much of that is due to modernism (“The sunlight was hot and hard”), to Time magazine, or to some other source — and perhaps even figure out why the trend reversed itself.

Google seems unlikely to release all their digitized texts; it may not be in their corporate interest to do so. But fortunately, there are workarounds. HathiTrust, and other online archives, are making large electronic collections freely available, and these will eventually be used to construct more flexible tools. Even now, it’s possible to have the best of both worlds by pairing the scope of Google’s dataset with the analytic flexibility of a tool like MONK (constructed by a team of researchers funded by the Andrew W. Mellon Foundation, including several here at Illinois). When I discover an interesting 18c. or 19c. trend in the ngram viewer, I take it to MONK, which can identify genres, authors, works, or parts of works where a particular pattern of word choice was most prominent.

So, to make the ngram viewer useful, Google needs to release lists of titles, and humanists need to pair the scope of the Google dataset with the analytic power of a tool like MONK, which can ask more precise, and literarily useful, questions on a smaller scale. And then, finally, we have to read some books and say smart things about them. That part hasn’t changed.

But the ngram viewer itself could also be improved. It could, for instance

1) Give researchers the option to get rid of case sensitivity and (at least partly) undo the f/s substitution, which together make it very hard to see any patterns in the 18c.

2) Provide actual numbers as output, not just pretty graphs, so that we can assess correlation and statistical significance.

3) Offer better search strategies. Instead of plugging in words one by one to identify a pattern, I would like to be able to enter a seed word, and ask for a list of words that correlate with it across a given period, sorted by degree of positive (or inverse) correlation.

It would be even more interesting to do the same thing for ngrams. One might want the option to exclude phrases that contain only the original seed word(s) and stop words (“of,” “the,” and so on). But I suspect a tool like this could rapidly produce some extremely interesting results.

fight for existence, fight for life, fight for survival, fight to the death, in English, 1800-2000

4) Offer other ways to mine the list of 2,3,4, and 5-grams, where a lot of conceptually interesting material is hiding. For instance, “what were the most common phrases containing ‘feminine’ between 1950 and 1970?” Or, “which phrases containing ‘male’ increased most in frequency between 1940 and 1960?”

Of course, since the dataset is public, none of these improvements actually have to be made by Google itself.