The Google dataset as an episode in the history of science.

In a few years, some enterprising historian of science is going to write a history of the “culturomics” controversy, and it’s going to be fun to read. In some ways, the episode is a classic model of the social processes underlying the production of knowledge. Whenever someone creates a new method or tool (say, an air pump), and claims to produce knowledge with it, they run head-on into the problem that knowledge is social. If the tool is really new, their experience with it is by definition anomalous, and anomalous experiences — no matter how striking — never count as knowledge. They get dismissed as amusing curiosities.

Robert Boyle's air pump.

The team that published in Science has attempted to address this social problem, as scientists usually do, by making their data public and carefully describing the conditions of their experiment. In this case, however, one runs into the special problem that the underlying texts are the private property of Google, and have been released only in a highly compressed form that strips out metadata. As Matt Jockers may have been the first to note, we don’t yet even have a bibliography of the contents of each corpus. Yesterday, in a new FAQ (see section III.5), the researchers acknowledged that they want to release such a bibliography, but haven’t yet received permission from Google to do so.

This is going to produce a very interesting deadlock. I’ve argued in many other posts that the Google dataset is invaluable, because its sheer scale allows us to grasp diachronic patterns that wouldn’t otherwise be visible. But without a list of titles, it’s going to be difficult to cite it as evidence. What I suspect may happen is that humanists will start relying on it in private to discover patterns, but then write those patterns up as if they had just been doing, you know, a bit of browsing in 500,000 books — much as we now use search engines quietly and without acknowledgment, although they in fact entail significant methodological choices. As Benjamin Schmidt has recently been arguing, search technology is based on statistical presuppositions more complex and specific than most people realize, presuppositions that humanists already “use all the time to, essentially, do a form of reading for them.”

A different solution, and the one I’ll try, is to use the Google dataset openly, but in conjunction with other smaller and more transparent collections. I’ll use the scope of the Google dataset to sketch broad contours of change, and then switch to a smaller archive in order to reach firmer and more detailed conclusions. But I still hope that Google can somehow be convinced to release a bibliography — at least of the works that are out of copyright — and I would urge humanists to keep lobbying them.

If some of the dilemmas surrounding this tool are classic history-of-science problems, others are specific to a culture clash between the humanities and the sciences. For instance, I’ve argued in the past that humanists need to develop a quantitative conception of error. We’re very talented at making the perfect the enemy of the good, but that simply isn’t how statistical knowledge works. As the newly-released FAQ points out, there’s a comparably high rate of error in fields like genomics.

On other topics, though, it may be necessary for scientists to learn a bit more about the way humanists think. For instance, one of the corpora included in the ngram viewer is labeled “English fiction.” Matt Jockers was the first to point out that this is potentially ambiguous. I assumed that it contained mostly novels and short stories, since that’s how we use the word in the humanities, but prompted by Matt’s skepticism, I wrote the culturomics team to inquire. Yesterday in the FAQ they answered my question, and it turns out that Matt’s skepticism was well founded.

“Crucially, it’s not just actual works of fiction! The English fiction corpus contains some fiction and lots of fiction-associated work, like commentary and criticism. We created the fiction corpus as an experiment meant to explore the notion of creating a subject-specific corpus. We don’t actually use it in the main text of our paper because the experiment isn’t very far along. Even so, a thoughtful data analyst can do interesting things with this corpus, for instance by comparing it to the results for English as a whole.”

Humanists are going to find that an eye-opening paragraph. This conception of fiction is radically different from the way we usually understand fiction — as a genre. Instead, the culturomics team has constructed a corpus based on fiction as a subject category; or perhaps it would be better to say that they have combined the two conceptions. I can say pretty confidently that no humanist will want to rely on the corpus of “English fiction” to make claims about fiction; it represents something new and anomalous.

On the other hand, I have to say that I’m personally grateful that the culturomics team made this corpus available — not because it tells me much about fiction, but because it tells me something about what happens when you try to hold “subject designations” constant across time instead of allowing the relative proportions of books in different subjects to fluctuate as they actually did in publishing history. I think they’re right that this is a useful point of comparison, although at the moment the corpus is labeled in a potentially misleading way.
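For readers who want to try the kind of comparison the FAQ suggests, here is a minimal sketch in Python. It assumes you have already extracted, for each corpus, a per-year count for one word and a per-year total token count; the function names and the numbers are hypothetical and made up for illustration, not taken from the actual dataset or from the culturomics team's own code.

```python
from typing import Dict


def relative_frequency(word_counts: Dict[int, int],
                       total_counts: Dict[int, int]) -> Dict[int, float]:
    """Share of all tokens taken by one word, per year.

    Years with a missing or zero total are skipped.
    """
    return {
        year: word_counts[year] / total_counts[year]
        for year in sorted(word_counts)
        if total_counts.get(year, 0) > 0
    }


def corpus_ratio(freq_a: Dict[int, float],
                 freq_b: Dict[int, float]) -> Dict[int, float]:
    """Relative frequency in corpus A divided by that in corpus B.

    Only years where corpus B has a nonzero frequency are kept.
    """
    return {
        year: freq_a[year] / freq_b[year]
        for year in sorted(freq_a)
        if freq_b.get(year, 0) > 0
    }


# Made-up counts for illustration only; these are NOT real ngram data.
fiction_word = {1800: 30, 1850: 90}
fiction_total = {1800: 100_000, 1850: 200_000}
english_word = {1800: 200, 1850: 300}
english_total = {1800: 1_000_000, 1850: 2_000_000}

fiction_freq = relative_frequency(fiction_word, fiction_total)
english_freq = relative_frequency(english_word, english_total)
ratio = corpus_ratio(fiction_freq, english_freq)

for year, value in ratio.items():
    print(year, round(value, 2))
```

With these invented numbers the ratio rises from about 1.5 to about 3.0, which would mean the word grew more characteristic of the subject-defined corpus over time. A ratio is used rather than a difference so that very rare and very common words can be read on the same scale.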

In general, though, I’m going to use the main English corpus, which is easier to interpret. The lack of metadata is still a problem here, but this corpus seems to represent university library collections more fully than any other dataset I have access to. While sheer scale is a crude criterion of representativeness, for some questions it’s the useful one.

The long and short of it all is that the next few years are going to be a wild ride. I’m convinced that advances in digital humanities are reaching the point where they’re going to start allowing us to describe some large, fascinating, and until now largely invisible patterns. But at the moment, the biggest dataset — prominent in public imagination, but also genuinely useful — is curated by scientists, and by a private corporation that has not yet released full information about it. The stage is set for a conflict of considerable intensity and complexity.

By tedunderwood

Ted Underwood is Professor of Information Sciences and English at the University of Illinois, Urbana-Champaign. On Twitter he is @Ted_Underwood.

3 replies on “The Google dataset as an episode in the history of science.”

Google ended up with some legal problems that resulted from their creation of the digitized collection of books. Apparent reluctance to provide a bibliography may be linked to concern about creating further legal problems.

Might I say that the method, rather than the fine detail of these initial experiments or the initial results obtained with it, is the important milestone. Clearly, the dataset is less than perfect, and various aspects of this approach will be refined, improved on, or superseded. Some early results may prove to be wrong because of these deficiencies. But the core idea certainly seems worthy of the cover it garnered in Science.
