Efficiency and pleasure

Okay, I’ve already spilled some ink railing against this application of the ngram viewer: using it to stage contests between abstract terms. In fact, I made this graph as a joke. But then I found myself hypnotized by the apparent inverse correlation between the two curves in the twentieth century. So … shoot … here it is.

Frequency of “efficiency” and “pleasure” in the English corpus, 1820-2000.

I have to admit that at first glance it appears that Taylorist discourse about efficiency in the 20th century (and perhaps the pressures of war) correlated closely with a sort of embarrassment about mentioning pleasure. But for now, I’m going to treat this kind of contrast the way physicists treat claims about cold fusion. It may be visually striking, but we should demand more confirmation before we treat the correlation as meaningful. When you hold genre constant, by restricting the search to fiction, the correlation is a little less striking, so it may be at least partly a fluctuation in the genres that got published, rather than a fluctuation in underlying patterns of expression.
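One way to ask for that extra confirmation is to check whether two curves still move together once their long-term trends are set aside. The sketch below is only an illustration: the numbers are invented for the example (they are not ngram data), and `pearson` and `year_to_year_changes` are hypothetical helper names. It compares the correlation of the raw series with the correlation of their year-to-year changes.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def year_to_year_changes(series):
    """Differences between consecutive values, which removes a steady trend."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

# Invented numbers, for illustration only: one series trends up, one trends
# down, and their short-term wiggles are unrelated to each other.
rising = [1.0, 2.5, 3.0, 3.5, 5.0, 6.5, 7.0, 7.5]
falling = [8.5, 7.0, 5.5, 5.0, 4.5, 3.0, 1.5, 1.0]

r_levels = pearson(rising, falling)
r_changes = pearson(year_to_year_changes(rising), year_to_year_changes(falling))
print(f"levels: {r_levels:.2f}, changes: {r_changes:.2f}")
```

With these made-up numbers the raw series are strongly negatively correlated, while the year-to-year changes are only weakly related. That is the pattern you would expect when two words merely trend in opposite directions, and it is one reason a striking pair of curves deserves a second look.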

In any case, there’s a broad decline in “pleasure” from beginning to end that Frederick W. Taylor can hardly explain. To understand that, we still have to consult Lionel Trilling on “The Fate of Pleasure,” and perhaps Thomas Carlyle on “The Gospel of Work.”

Categories
methodology ngrams

On the imperfection of the Google dataset, and imperfection in general

The dataset that Google made public last week isn’t perfect. As Natalie Binder, among others, has pointed out, the dataset contains many OCR (optical character recognition) errors, and at least a few errors in dating. (UPDATE 12/22: It is worth noting, however, that the dataset will have far fewer errors than Google Books itself, because the dataset is based on a subset of volumes with relatively clean OCR.)

Moreover, as Dennis Baron argues in The Web of Language, “books don’t always reflect the spoken language accurately.” Informal words like “hello” are likely to be underrepresented in books.

More importantly, the utility of the dataset is reduced by Google’s decision to strip out all information about the context in which each word or phrase originally occurred, as Mark Liberman has noted. If researchers had unfettered access to the full text of original works, we could draw much more interesting conclusions about context, genre, and authorship.

Finally, I would add that — even with the present structure of the dataset — it’s possible to imagine search strategies other than simply graphing the frequencies of individual words and phrases, one by one. The ngram viewer is an elegant interface, but a limited one.

All true. But the Google dataset is also turning out to be tremendously useful, and it’s likely to become even more useful as researchers refine it and develop more flexible ways to query it.

Of course, it has to be used appropriately. This is not a tool you should use if you want to know exactly how often Laurence Sterne referred to noses. It’s a tool for statistical questions about the written language that involve very large numbers of examples. When it’s applied to questions on that scale, the OCR errors in the English corpus (after 1820) are not significant enough to prevent the ngram viewer from producing useful results. Before 1820 there are more significant OCR problems, especially with the substitution of f for “long s.” But even there, I don’t see the problem as insuperable; there are straightforward ways for researchers to compensate for the most predictable OCR errors.
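To make the idea of compensating for predictable OCR errors concrete, here is a minimal sketch of one possible approach to the “long s” problem. It assumes you have a table of word counts and a trusted list of modern spellings; both are invented here, and the function names are hypothetical. A token that isn’t a known word is folded into a known word only when exactly one f-to-s substitution pattern produces a match.

```python
from collections import Counter
from itertools import product

def long_s_candidates(token):
    """Yield every spelling made by turning some subset of the f's into s's."""
    positions = [i for i, ch in enumerate(token) if ch == "f"]
    for choice in product("fs", repeat=len(positions)):
        chars = list(token)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        yield "".join(chars)

def normalize(token, vocabulary):
    """Map a likely long-s misreading onto a known word, if unambiguous."""
    if token in vocabulary:
        return token  # already a real word; leave it alone
    matches = [c for c in long_s_candidates(token) if c in vocabulary]
    return matches[0] if len(matches) == 1 else token

def merge_counts(counts, vocabulary):
    """Add the counts of misread tokens to the words they normalize to."""
    merged = Counter()
    for token, n in counts.items():
        merged[normalize(token, vocabulary)] += n
    return merged

# Invented vocabulary and counts, for illustration only.
vocabulary = {"such", "possess", "self", "fame", "same"}
counts = {"fuch": 3, "such": 10, "poffefs": 2, "felf": 1, "fame": 4}
merged = merge_counts(counts, vocabulary)
print(merged)
```

Note the limit of this kind of fix: “fame” is left untouched because it is a real word, so a “same” that was misread as “fame” cannot be recovered this way. Errors like that can only be estimated statistically, which is another reason the tool suits large-scale questions better than exact counts.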

The larger critique being leveled at the ngram viewer, by Natalie Binder and many other humanists, is that it’s impossible to know what an individual graph measures. Complex words have multiple meanings, Binder reminds us, so how should we interpret a graph showing a decline in the frequency of “nature”? How should we interpret a correlation between the increasing frequency of “vampire” and the declining frequency of “dilettante”?

The saying that correlation doesn’t prove causation definitely needs to be underlined in this domain. There are so many words in the language that a huge number of them will always correlate in largely accidental ways. More generally, it’s true that, in most cases, a graph of word frequency will not by itself tell us very much. You have to have some cultural context before the increasing frequency of “vampire” in the late twentieth century is going to mean anything at all to you. But of course, this is true of all historical evidence: no single poem or novel, in isolation, can tell us what was happening culturally around 1800. You need to compare different texts and authors from different social groups; it may be helpful to know that there was a revolution in France, and so on.
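The point about accidental correlation is easy to demonstrate with a small simulation. The sketch below rests on an assumption of mine: it uses independent random walks as a stand-in for slowly drifting word-frequency curves, and the parameter values (40 series, 100 steps, a threshold of 0.8) are arbitrary choices. It counts how many pairs of series that have nothing to do with each other nonetheless correlate strongly.

```python
import random
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def random_walk(rng, steps):
    """A series that drifts by accumulating independent random steps."""
    level, path = 0.0, []
    for _ in range(steps):
        level += rng.gauss(0, 1)
        path.append(level)
    return path

def count_strong_pairs(n_series=40, steps=100, threshold=0.8, seed=0):
    """Return (strongly correlated pairs, total pairs) among unrelated walks."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    walks = [random_walk(rng, steps) for _ in range(n_series)]
    strong = sum(
        1 for a, b in combinations(walks, 2)
        if abs(pearson(a, b)) >= threshold
    )
    total = n_series * (n_series - 1) // 2
    return strong, total

strong, total = count_strong_pairs()
print(f"{strong} of {total} unrelated pairs have |r| >= 0.8")
```

Forty series already yield 780 pairs to compare. A language with many thousands of words yields many millions of pairs, so finding a handful of striking correlations tells us very little unless we had a reason to look at that particular pair in advance.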

What puzzles me about humanistic disdain for the ngram viewer is that it often seems to presume that a piece of evidence must be legible in itself — naked and free of all context — in order to have any significance at all. If a graph doesn’t have a single determinate meaning, read from its face as easily as the value of a coin, then what is it good for? This critique seems to take hyper-positivism as a premise in order to refute a rather mild and contextual empiricism.

In short, the evidence produced by Google’s new tool is imperfect. It will have to be interpreted sensitively, by people who understand how it was produced. And it will need to be supplemented by multiple kinds of context (literary, social, political), before it acquires much historical significance. But these things are also true of all the other forms of evidence humanists invoke.

It seems likely that humanists are reluctant to take this kind of evidence seriously not because they find it too loose and indeterminate, but because they fear that the superficial certainty of quantitative evidence will seduce people away from more difficult kinds of interpretation. This concern can easily be exaggerated. If an awareness of social history doesn’t prevent us from reading sensitively (and I don’t think it does), then the much weaker evidence provided by text-mining isn’t likely to do so either. I’m reminded of an observation Matt Yglesias made in a different (political) context: that people are in general liable to take “an unduly zero-sum view of human interactions.” Different kinds of evidence needn’t be construed as competitive; they might conceivably enrich each other.

What I hope this blog will do

Changing patterns of expression often imply interesting questions about literary or social history. Electronic archives with diachronic scope have made it easier to perceive these questions, and Google’s recent decision to make a very large dataset available has turned that trickle of questions into a flood. It’s not always possible to explain puzzling phenomena at first glance, let alone write them up in a journal article. But it might be useful to record them and share them with other scholars, in the hope that different pieces of a puzzle will make more sense in context.

That’s what I hope to accomplish here. I’m going to record interesting patterns of change as I encounter them, and invite speculation about what they mean. I invite other people to submit observations as well. At first, many of these observations are going to be based on results from Google’s ngram viewer, but I expect that other archives, and other ways of querying them, will play an increasingly important role.

The name of the blog is drawn from a dream described in the fifth book of Wordsworth’s Prelude, where a shell seems to represent poetry, and a stone mathematics — or at any rate, “geometric truth.” Toward the end of the book, Wordsworth observes that

                    Visionary power
Attends upon the motions of the winds
Embodied in the mystery of words;
There darkness makes abode, and all the host
Of shadowy things do work their changes there
As in a mansion like their proper home.

So, there’s a bit of poetry about changes worked through the mystery of words. Now for some math.

Categories
methodology

Using changes in diction to frame historical questions ≠ ‘culturomics’

I hope this blog will focus on recording specific puzzles, rather than debating method; this area of inquiry is really too young for claims about method to be more than speculative.

But Google released its ngram viewer in tandem with an article in Science that made fairly strong claims for a new discipline to be called culturomics, and strong claims about a whole new discipline have naturally been met with strong skepticism. So it’s impossible to avoid a few reflections on method.

I don’t expect that quantitative studies of word frequency will in the end amount to a new discipline — although who knows? The team that published in Science chose questions where quantitative analysis could, in itself, count as proof — for instance, questions about the changing frequency and duration of references to dates.

References to dates, 1900-2000, image by Zach Seward, 12/18/10.

I don’t want to disparage this approach; posing a new kind of question is significant — although, if it does create a new discipline, I hope the discipline will be called “N-grammatology,” in homage to Derrida.

But most of the questions that interest humanists can’t be converted quite this directly into questions about the occurrence of a particular sign. We’re interested in questions about modes of thought and behavior that don’t map onto individual signs in a simple one-to-one fashion.

That doesn’t, of course, mean that there’s nothing to be gained by studying shifts in vocabulary, diction, and phraseology. But I think in most cases quantitative evidence about word choice will function as a clue rather than as demonstrative proof; it may alert scholars to a change in patterns of expression, and tell them where and when to look. To actually understand what happened, we’ll still have to read books all the way through, and study social history.