Identifying diction that characterizes an author or genre: why Dunning’s may not be the best method.

Most of what I’m about to say is directly lifted from articles in corpus linguistics (1, 2), but I don’t think these results have been widely absorbed yet by people working in digital humanities, so I thought it might be worthwhile to share them, while demonstrating their relevance to literary topics.

The basic question is just this: if I want to know what words or phrases characterize an author or genre, how do I find out? As Ben Schmidt has shown in an elegantly visual way, simple mathematical operations won’t work. If you compare ratios (dividing word frequencies in the genre A that interests you by the frequencies in a corpus B used as a point of comparison), you’ll get a list of very rare words. But if you compare the absolute magnitude of the difference between frequencies (subtracting B from A), you’ll get a list of very common words. So the standard algorithm that people use is Dunning’s log likelihood,

G² = 2 Σ O ln(O / E)

— a formula that incorporates both absolute magnitude (O is the observed frequency) and a ratio (O/E is the observed frequency divided by the frequency you would expect if the word were distributed evenly across both corpora). For a more complete account of how this is calculated, see Wordhoard.
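For concreteness, here is a minimal sketch of the two-corpus G² calculation described above. The word counts and corpus sizes below are invented for illustration; the expected frequencies follow the standard contingency-table formulation:

```python
import math

def dunning_g2(count_a, total_a, count_b, total_b):
    """Dunning's log-likelihood (G-squared) for one word across two corpora.

    count_a / count_b: observed frequency of the word in corpus A / B.
    total_a / total_b: total word counts of the two corpora.
    """
    # Expected frequencies under the null hypothesis that the word
    # is equally common (per word) in both corpora.
    combined = count_a + count_b
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:          # 0 * ln(0) is taken to be 0
            g2 += observed * math.log(observed / expected)
    return 2 * g2

# Invented example: a word occurs 40 times in 100,000 words of poetry
# and 10 times in 200,000 words of prose.
print(dunning_g2(40, 100_000, 10, 200_000))
```

Note that the score grows with both the frequency ratio and the raw counts, which is exactly why very common words can dominate the list even when their distribution across documents is lumpy.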

But there’s a problem with this measure, as Adam Kilgarriff has pointed out (1, pp. 237-38, 247-48). A word can be common in a corpus because it’s very common in one or two works. For instance, when I characterize early-nineteenth-century poetic diction (1800-1849) by comparing a corpus of 60 volumes of poetry to a corpus of fiction, drama, and nonfiction prose from the same period (3), I get this list:

Much of this looks like “poetic diction” — but “canto” is poetic diction only in a weird sense. It happens to be very common in a few works of poetry that are divided into cantos (works for instance by Lord Byron and Walter Scott). So when everything is added up, yes, it’s more common in poetry — but it doesn’t broadly characterize the corpus. Similar problems occur for a range of other reasons (proper nouns and pronouns can be extremely common in a restricted context).

The solution Kilgarriff offers is to instead use a Mann-Whitney ranks test. This allows us to assess how consistently a given term is more common in one corpus than in another. For instance, suppose I have eight text samples of equal length. Four of them are poetry, and four are prose. I want to know whether “lamb” is significantly more common in the poetry corpus than in prose. A simple form of the Mann-Whitney test would rank these eight samples by the frequency of “lamb” and then add up their respective ranks:

Since most works of poetry “beat” most works of prose in this ranking, the sum of ranks for poetry is higher, in spite of the 31 occurrences of lamb in one work of prose — which is, let us imagine, a novel about sheep-rustling in the Highlands. But a log-likelihood test would have identified this word as more common in prose.

In reality, one never has “equal-sized” documents, but the test is not significantly distorted if one simply replaces absolute frequency with relative frequency (normalized for document size). (If one corpus has on average much smaller documents than the other, there may admittedly be a slight distortion.) Since the number of documents in each corpus will also vary, it’s useful to replace the rank-sum (U) with a statistic ρ (Mann-Whitney rho): U divided by the product of the sizes of the two corpora.
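A sketch of that calculation, using the eight-sample “lamb” scenario from above. The per-document relative frequencies are invented, with the sheep-rustling novel as the 31-occurrence outlier; ties get the average of their ranks, which is the usual convention for this test:

```python
def mann_whitney_rho(freqs_a, freqs_b):
    """Mann-Whitney rho for one word: U divided by (n_a * n_b).

    freqs_a, freqs_b: per-document relative frequencies of the word
    in the two corpora (normalized for document size).
    Returns a value in [0, 1]; 0.5 means a document from either corpus
    is equally likely to rank higher.
    """
    pooled = sorted(freqs_a + freqs_b)
    # Assign average ranks to tied values (ranks start at 1).
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2   # mean of ranks i+1 .. j
        i = j
    rank_sum_a = sum(ranks[f] for f in freqs_a)
    n_a, n_b = len(freqs_a), len(freqs_b)
    u = rank_sum_a - n_a * (n_a + 1) / 2
    return u / (n_a * n_b)

# "lamb" occurrences per 10,000 words in eight samples (invented numbers):
poetry = [6.0, 5.0, 4.0, 0.0]
prose = [31.0, 1.0, 0.5, 0.0]     # the sheep-rustling novel is the outlier
print(mann_whitney_rho(poetry, prose))
```

With these invented numbers, prose has more occurrences in the aggregate (thanks to the one outlier), yet rho comes out above 0.5 for poetry, because most poetry samples outrank most prose samples.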

Using this measure of over-representation in a corpus produces a significantly different model of “poetic diction”:

This looks at first glance like a better model. It demotes oddities like “canto,” but also slightly demotes pronouns like “thou” and “his,” which may be very common in some works of poetry but not others. In general, it gives less weight to raw frequency, and more weight to the relative ubiquity of a term in different corpora. Kilgarriff argues that the Mann-Whitney test thereby does a better job of identifying the words that characterize male and female conversation (1, pp. 247-48).

On the other hand, Paul Rayson has argued that by reducing frequency to a rank measure, this approach discards “most of the evidence we have about the distribution of words” (2). For linguists, this poses an interesting, principled dilemma, where two statistically incompatible definitions of “distinctive diction” are pitted against each other. But for a shameless literary hack like myself, it’s no trouble to cut the Gordian knot with an improvised algorithm that combines both measures. For instance, one could multiply rho by the log of Dunning’s log likelihood (represented here as G-squared) …
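The improvised combination might look like the following. The per-word statistics here are wholly invented, just to show the mechanics: a word with an enormous G² but an inconsistent distribution across documents can be outranked by a word whose G² is modest but whose rho is high.

```python
import math

def combined_score(rho, g2):
    """Improvised measure: Mann-Whitney rho times the log of Dunning's G2.

    rho: U / (n_a * n_b) -- higher when the word is consistently more
         frequent across the documents of corpus A.
    g2:  Dunning's log-likelihood -- higher when the aggregate
         frequency difference between the corpora is large.
    """
    return rho * math.log(g2) if g2 > 1 else 0.0

# Hypothetical (invented) per-word statistics:
words = {"canto": (0.55, 900.0),   # huge G2, but concentrated in a few works
         "thou": (0.70, 800.0),
         "lamb": (0.85, 120.0)}    # modest G2, but consistently "poetic"

ranked = sorted(words, key=lambda w: combined_score(*words[w]), reverse=True)
print(ranked)
```

Taking the log of G² is what keeps raw frequency from swamping the consistency measure; with these invented numbers, “canto” drops to the bottom of the list despite having the largest G².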

I don’t yet know how well this algorithm will perform if used for classification or authorship attribution. But it does produce what is for me an entirely convincing portrait of early-nineteenth-century poetic diction:

Of course, once you have an algorithm that convincingly identifies the characteristic diction of a particular genre relative to other publications in the same period, it becomes possible to say how the distinctive diction of a genre is transformed by the passage of time. That’s what I hope to address in my next post.

UPDATE Nov 10, 2011: As I continue to use these tests in different ways (using them e.g. to identify distinctively “fictional” diction, and to compare corpora separated by time) I’m finding the Mann-Whitney ρ measure more and more useful on its own. I think my urge to multiply it by Dunning’s log-likelihood may have been the needless caution of someone who’s using an unfamiliar metric and isn’t sure yet whether it will work unassisted.

(1) Adam Kilgarriff, “Comparing Corpora,” International Journal of Corpus Linguistics 6.1 (2001): 97-133.
(2) Paul Rayson, Matrix: A Statistical Method and Software Tool for Linguistic Analysis through Corpus Comparison. Unpublished Ph.D. thesis, Lancaster University, 2003, p. 47. Cited in Magali Paquot and Yves Bestgen, “Distinctive words in academic writing: A comparison of three statistical tests for keyword extraction,” Corpora: Pragmatics and Discourse. Papers from the 29th International Conference on English Language Research on Computerized Corpora (ICAME 29), Ascona, Switzerland, 14-18 May 2008, p. 254.
(3) The corpora used in this post were selected by Jordan Sellers, mostly from texts available in the Internet Archive, and corrected with a Python script described in this post.

“… a selection of the language really spoken by men”?

William Wordsworth’s claim to have brought poetry back to “the language of conversation in the middle and lower classes of society” gets repeated to each new generation of students (1). But did early nineteenth-century writing in general become more accessible, or closer to speech? It’s hard to say. We’ve used remarks like Wordsworth’s to anchor literary history, but we haven’t had a good way to assess their representativeness.

Increasingly, though, we’re in a position to test some familiar stories about literary history — to describe how the language of one genre changed relative to others, or even relative to “the language of conversation.” We don’t have eighteenth-century English speakers to interview, but we do have evidence about the kinds of words that tend to be more common in spoken language. For instance, Laly Bar-Ilan and Ruth Berman have shown in the journal Linguistics that contemporary spoken English is distinguished from writing by containing a higher proportion of words from the Old English part of the lexicon (2). This isn’t terribly surprising, since English was for a couple of hundred years (1066-1250) almost exclusively a spoken language, while French and Latin were used for writing. Any word that entered English before this period, and survived, had to be the kind of word that gets used in conversation. Words that entered afterward were often borrowed from French or Latin to flesh out the written language.

If the spoken language was distinguished from writing this way in the thirteenth century, and the same thing holds true today, then one might expect it to hold true in the eighteenth and nineteenth centuries as well. And it does seem to hold true: eighteenth-century drama, written to be spoken on stage, is distinguished from nondramatic poetry and prose by containing a higher proportion of Old English words. This is a broad-brush approach to diction, and not one that I would use to describe individual works. But applied to an appropriately large canvas, it may give us a rough picture of how the “register” of written diction has changed across time, becoming more conversational or more formal.

This graph is based on a version of the Google English corpus that I’ve cleaned up in a number of ways. Common OCR errors involving s, f, and ct have been corrected. The graph shows the aggregate frequency of the 500 most common English words that entered the language before the twelfth century. (I’ve found date-of-entry a more useful metric of a word’s affinity with spoken language than terms like “Latinate” or “Germanic.” After all, “Latinate” words like “school,” “street,” and “wall” don’t feel learned to us, because they’ve been transmitted orally for more than a millennium.) I’ve excluded a list of stopwords that includes determiners, prepositions, pronouns, and conjunctions, as well as the auxiliary verbs “be,” “will,” and “have.”
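The measurement itself is straightforward once the wordlist exists. Here is a sketch with a toy stand-in for the 500-word pre-twelfth-century list and a toy stopword list (the real lists would be much longer, built from date-of-entry data); it reports the aggregate frequency of listed words per thousand tokens after stopword removal:

```python
import re

# Toy stand-in for the 500 most common pre-twelfth-century words;
# the real list would come from a dictionary with dates of entry.
PRE_12C = {"know", "think", "way", "need", "self", "can", "do", "king", "word"}

# Toy stopword list: determiners, prepositions, pronouns, conjunctions,
# plus the auxiliaries "be," "will," and "have."
STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "i", "it",
             "he", "she", "they", "is", "be", "will", "have"}

def per_thousand(text):
    """Frequency per 1,000 tokens of pre-12c vocabulary, stopwords excluded."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in STOPWORDS]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in PRE_12C)
    return 1000 * hits / len(tokens)

print(per_thousand("I know the way to think about it"))
```

One open design choice, worth flagging: whether stopwords are excluded from the denominator as well as the wordlist changes the absolute numbers (though not the shape of the curve); the sketch above excludes them from both.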

In relative terms, the change here may not look enormous; the peak in the early eighteenth century (181 words per thousand) is only about 20% higher than the trough in the late eighteenth century (152 words per thousand). But we’re talking about some of the most common words in the language (can, think, do, self, way, need, know). It’s a bit surprising that this part of the lexicon fluctuates at all. You might expect to see a gradual decline in the frequency of these words, as the overall size of the lexicon increases. But that’s not what happens: instead we see a rapid decline in the eighteenth century (as prose becomes less like speech, or at least less like the imagined speech of contemporaneous drama), and then a gradual recovery throughout the nineteenth century.

What does this tell us about literature? Not much, without information about genre. After all, as I mentioned, dramatic writing is a lot closer to speech than, say, poetry is. This curve might just be telling us that very few plays got written in the late eighteenth century.

Fortunately it’s possible to check the Google corpus against a smaller corpus of individual texts categorized by genre. I’ve made an initial pass at the first hundred years of this problem using a corpus of 2,188 eighteenth-century books produced by ECCO-TCP, which I obtained in plain text with help from Laura Mandell and 18thConnect. Two thousand books isn’t a huge corpus, especially not after you divide them up by genre, so these results are only preliminary. But the initial results seem to confirm that the change involved the language of prose itself, and not just changes in the relative prominence of different genres. Both fiction and nonfiction prose show a marked change across the century. If I’m right that the frequency of pre-12c words is a fair proxy for resemblance to spoken language, they became less and less like speech.

“Fiction” is of course a fuzzy category in the eighteenth century; the blurriness of the boundary between a sensationalized biography and a “novel” is a large part of the point of the genre. In the graph above, I’ve lumped biographies and collections of personal letters in with novels, because I’m less interested in distinguishing something unique about fiction than I am in confirming a broad change in the diction of nondramatic prose.

By contrast, there’s relatively little change in the diction of poetry and drama. The proportion of pre-twelfth-century words is roughly the same at the end of the century as it was at the beginning.

Are these results intuitive, or are they telling us something new? I think the general direction of these curves probably confirms some intuitions. Anyone who studies eighteenth- and nineteenth-century English knows that you get a lot of long words around 1800. Sad things become melancholy, needs become a necessity, and so on.

What may not be intuitive is how broad and steady the arc of change appears to be. To the extent that we English professors have any explanation for the elegant elaboration of late-eighteenth-century prose, I think we tend to blame Samuel Johnson. But these graphs suggest that much of the change had already taken place by the time Johnson published his Dictionary. Moreover, our existing stories about the history of style put a lot of emphasis on poetry — for instance, on Wordsworth’s critique of poetic diction. But the biggest changes in the eighteenth century seem to have involved prose rather than poetry. It’ll be interesting to see whether that holds true in the nineteenth century as well.

How do we explain these changes? I’m still trying to figure that out. In the next couple of weeks I’ll write a post asking what took up the slack: what kinds of language became common in books where old, common words were relatively underrepresented?

— references —
(1) William Wordsworth and Samuel T. Coleridge, Lyrical Ballads, with a Few Other Poems (Bristol: 1798), i.
(2) Laly Bar-Ilan and Ruth A. Berman, “Developing register differentiation: the Latinate-Germanic divide in English,” Linguistics 45 (2007): 1-35.