How to make the Google dataset work for humanists.

I started blogging about the Google dataset because it revealed stylistic trends so intriguing that I couldn’t wait to write them up. But these reflections are also ending up in a blog because they can’t yet go in an article. The ngram viewer, as fascinating as it is, is not yet very useful as evidence in a humanistic argument.

As I’ve explained at more length elsewhere, the problems that most humanists have initially pointed to don’t seem to me especially troubling. It’s true that the data contains noise — but so does all data. Researchers in other fields don’t wait for noiseless instruments before they draw any conclusions; they assess the signal/noise ratio and try to frame questions that are answerable within those limits.

It’s also true that the history of diction doesn’t provide transparent answers to social and literary questions. This kind of evidence will require context and careful interpretation. In which respect, it resembles every other kind of evidence humanists currently grapple with.

Figure: Satanic, Satanic influence, Satanic verses, in English corpus, 1800-2000

The problem that seems more significant to me is one that Matt Jockers has raised. We simply don’t yet know what’s in these corpora. We do know how they were constructed: that’s explained, in a fairly detailed way, in the background material supporting the original article in Science. But we don’t yet have access to a list of titles for each corpus.

Here differences between disciplines become amusing. For a humanist, it’s a little shocking that a journal like Science would publish results without what we would call simply “a bibliography” — a list of the primary texts that provide evidence for the assertion. The list contains millions of titles in this case, and would be heavy in print. But it seems easy enough for Google, or the culturomics research team, to make these lists available on the web. In fact, I assume they’re forthcoming; the datasets themselves aren’t fully uploaded yet, so apparently more information is on the way. I’ve written Google Labs asking whether they plan to release lists of titles, and I’ll update this post when they do.

Until they do, it will be difficult for humanists to use the ngram viewer as scholarly evidence. The background material to the Science article does suggest that these datasets have been constructed thoughtfully, with an awareness of publishing history, and on an impressive scale. But humanists and scientists understand evidence differently. I can’t convince other humanists by telling them “Look, here’s how I did the experiment.” I have to actually show them the stuff I experimented on — that is, a bibliography.

Ideally, one might ask even more from Google. They could make the original texts themselves available (at least those out of copyright), so that we could construct our own archives. With the ability to ask questions about genre and context of occurrence, we could connect quantitative trends to a more conventional kind of literary history. Instead of simply observing that a lot of physical adjectives peak around 1940, we could figure out how much of that is due to modernism (“The sunlight was hot and hard”), to Time magazine, or to some other source — and perhaps even figure out why the trend reversed itself.

Google seems unlikely to release all their digitized texts; it may not be in their corporate interest to do so. But fortunately, there are workarounds. HathiTrust, and other online archives, are making large electronic collections freely available, and these will eventually be used to construct more flexible tools. Even now, it’s possible to have the best of both worlds by pairing the scope of Google’s dataset with the analytic flexibility of a tool like MONK (constructed by a team of researchers funded by the Andrew W. Mellon Foundation, including several here at Illinois). When I discover an interesting 18c. or 19c. trend in the ngram viewer, I take it to MONK, which can identify genres, authors, works, or parts of works where a particular pattern of word choice was most prominent.

So, to make the ngram viewer useful, Google needs to release lists of titles, and humanists need to pair the scope of the Google dataset with the analytic power of a tool like MONK, which can ask more precise, and literarily useful, questions on a smaller scale. And then, finally, we have to read some books and say smart things about them. That part hasn’t changed.

But the ngram viewer itself could also be improved. It could, for instance:

1) Give researchers the option to get rid of case sensitivity and (at least partly) undo the f/s substitution, which together make it very hard to see any patterns in the 18c.

2) Provide actual numbers as output, not just pretty graphs, so that we can assess correlation and statistical significance.

3) Offer better search strategies. Instead of plugging in words one by one to identify a pattern, I would like to be able to enter a seed word, and ask for a list of words that correlate with it across a given period, sorted by degree of positive (or inverse) correlation.

It would be even more interesting to do the same thing for ngrams. One might want the option to exclude phrases that contain only the original seed word(s) and stop words (“of,” “the,” and so on). But I suspect a tool like this could rapidly produce some extremely interesting results.

Figure: fight for existence, fight for life, fight for survival, fight to the death, in English, 1800-2000

4) Offer other ways to mine the lists of 2-, 3-, 4-, and 5-grams, where a lot of conceptually interesting material is hiding. For instance, “what were the most common phrases containing ‘feminine’ between 1950 and 1970?” Or, “which phrases containing ‘male’ increased most in frequency between 1940 and 1960?”
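
The two questions in item 4 can be roughed out in a few lines of Python. This is only a sketch under stated assumptions: it supposes the n-gram data has already been loaded as (ngram, year, count) tuples, the function names `top_phrases` and `biggest_risers` are hypothetical, and the sample rows are invented purely to show the shape of the input. It also compares raw counts, which would need to be normalized by each year's total before supporting any real claim.

```python
from collections import Counter


def top_phrases(rows, word, start, end, n=10):
    """Most frequent n-grams containing `word` as a whole token, start..end inclusive."""
    target = word.lower()
    totals = Counter()
    for ngram, year, count in rows:
        # Token match, so that "male" does not also match "female".
        if start <= year <= end and target in ngram.lower().split():
            totals[ngram] += count
    return totals.most_common(n)


def biggest_risers(rows, word, start, end, n=10):
    """Rank n-grams containing `word` by (count in end year) - (count in start year)."""
    target = word.lower()
    change = Counter()
    for ngram, year, count in rows:
        if target not in ngram.lower().split():
            continue
        if year == end:
            change[ngram] += count
        elif year == start:
            change[ngram] -= count
    return change.most_common(n)


# Invented toy rows, for illustration only (not real counts from the dataset).
toy_rows = [
    ("feminine mystique", 1950, 2),
    ("feminine mystique", 1965, 40),
    ("feminine intuition", 1950, 10),
    ("feminine intuition", 1965, 12),
]

print(top_phrases(toy_rows, "feminine", 1950, 1970))
print(biggest_risers(toy_rows, "feminine", 1950, 1965))
```

Matching on whole tokens rather than substrings is a deliberate choice here; a substring test would be simpler but would conflate words like “male” and “female.”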

Of course, since the dataset is public, none of these improvements actually have to be made by Google itself.
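
To make that concrete, suggestions 1 through 3 might be approximated along the following lines. This is a rough sketch, not a tested tool, and it rests on several assumptions: that per-year frequencies have already been extracted into a dictionary of the form {token: {year: frequency}}, that a trusted word list (`vocabulary`) is available for undoing the f/s substitution, and that Pearson correlation is an adequate measure of “correlates with.” The names `undo_long_s`, `merge_variants`, and `correlates` are all hypothetical.

```python
from collections import defaultdict
from itertools import product
from math import sqrt


def undo_long_s(token, vocabulary):
    """Partly undo the f/s substitution: try swapping non-final f's for s's
    and accept the first spelling found in `vocabulary` (an assumed word list)."""
    if token in vocabulary:
        return token  # ambiguous cases like "fame" are left alone
    # The long s was not normally used at the end of a word, so skip the last letter.
    positions = [i for i, ch in enumerate(token[:-1]) if ch == "f"]
    for mask in product([False, True], repeat=len(positions)):
        chars = list(token)
        for pos, swap in zip(positions, mask):
            if swap:
                chars[pos] = "s"
        candidate = "".join(chars)
        if candidate in vocabulary:
            return candidate
    return token


def merge_variants(raw, vocabulary):
    """Fold case and long-s variants together, summing their yearly values."""
    merged = defaultdict(lambda: defaultdict(float))
    for token, by_year in raw.items():
        key = undo_long_s(token.lower(), vocabulary)
        for year, value in by_year.items():
            merged[key][year] += value
    return merged


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlates(series, seed, years):
    """Rank every other word by its correlation with `seed`, highest first."""
    seed_values = [series[seed].get(y, 0.0) for y in years]
    scores = []
    for word, by_year in series.items():
        if word == seed:
            continue
        values = [by_year.get(y, 0.0) for y in years]
        scores.append((word, pearson(seed_values, values)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Reading the ranked list from the top gives the positive correlates of the seed word, and reading it from the bottom gives the inverse ones. Because a spelling that is already a valid word is never altered, pairs like “fame” and “same” stay confused, which is why this undoes the substitution only partly.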

By tedunderwood
