Distant reading and representativeness.

Digital collections are vastly expanding literary scholars’ field of view: instead of describing a few hundred well-known novels, we can now test our claims against corpora that include tens of thousands of works. But because this expansion of scope has also raised expectations, the question of representativeness is often discussed as if it were a weakness rather than a strength of digital methods. How can we ever produce a corpus complete and balanced enough to represent print culture accurately?

I think the question is wrongly posed, and I’d like to suggest an alternate frame. As I see it, the advantage of digital methods is that we never need to decide on a single model of representation. We can and should keep enlarging digital collections, to make them as inclusive as possible. But no matter how large our collections become, the logic of representation itself will always remain open to debate. For instance, men published more books than women in the eighteenth century. Would a corpus be correctly balanced if it reproduced those disproportions? Or would a better model of representation try to capture the demographic reality that there were roughly as many women as men? There’s something to be said for both views.

Scott Weingart tweet.To take another example, Scott Weingart has pointed out that there’s a basic tension in text mining between measuring “what was written” and “what was read.” A corpus that contains one record for every title, dated to its year of first publication, would tend to emphasize “what was written.” Measuring “what was read” is harder: a perfect solution would require sales figures, reviews, and other kinds of evidence. But, as a quick stab at the problem, we could certainly measure “what was printed,” by including one record for every volume in a consortium of libraries like HathiTrust. If we do that, a frequently-reprinted work like Robinson Crusoe will carry about a hundred times more weight than a novel printed only once.

We’ll never create a single collection that perfectly balances all these considerations. But fortunately, we don’t need to: there’s nothing to prevent us from framing our inquiry instead as a comparative exploration of many different corpora balanced in different ways.

For instance, if we’re troubled by the difference between “what was written” and “what was read,” we can simply create two different collections — one limited to first editions, the other including reprints and duplicate copies. Neither collection is going to be a perfect mirror of print culture. Counting the volumes of a novel preserved in libraries is not the same thing as counting the number of its readers. But comparing these collections should nevertheless tell us whether the issue of popularity makes much difference for a given research question.

I suspect in many cases we’ll find that it makes little difference. For instance, in tracing the development of literary language, I got interested in the relative prominence of words that entered English before and after the Norman Conquest — and more specifically, in how that ratio changed over time in different genres. My first approach to this problem was based on a collection of 4,275 volumes that were, for the most part, limited to first editions (773 of these were prose fiction).

But I recognized that other scholars would have questions about the representativeness of my sample. So I spent the last year wrestling with 470,000 volumes from HathiTrust; correcting their OCR and using classification algorithms to separate fiction from the rest of the collection. This produced a collection with a fundamentally different structure — where a popular work of fiction could be represented by dozens or scores of reprints scattered across the timeline. What difference did that make to the result? (click through to enlarge)

The same question posed to two different collections. 773 hand-selected first editions on the left; on the right, 47,549 volumes, including many translations and reprints.

The same question posed to two different collections. 773 hand-selected first editions on the left; on the right, 47,549 volumes, including many translations and reprints. Yearly ratios are plotted rather than individual works.


It made almost no difference. The scatterplots look different, of course, because the hand-selected collection (on the left) is relatively stable in size across the timespan, and has a consistent kind of noisiness, whereas the HathiTrust collection (on the right) gets so huge in the nineteenth century that noise almost disappears. But the trend lines are broadly comparable, although the collections were created in completely different ways and rely on incompatible theories of representation.

I don’t regret the year I spent getting a binocular perspective on this question. Although in this case changing the corpus made little difference to the result, I’m sure there are other questions where it will make a difference. And we’ll want to consider as many different models of representation as we can. I’ve been gathering metadata about gender, for instance, so that I can ask what difference gender makes to a given question; I’d also like to have metadata about the ethnicity and national origin of authors.

pullquoteBut the broader point I want to make here is that people pursuing digital research don’t need to agree on a theory of representation in order to cooperate.

If you’re designing a shared syllabus or co-editing an anthology, I suppose you do need to agree in advance about the kind of representativeness you’re aiming to produce. Space is limited; tradeoffs have to be made; you can only select one set of works.

But in digital research, there’s no reason why we should ever have to make up our minds about a model of representativeness, let alone reach consensus. The number of works we can select for discussion is not limited. So we don’t need to imagine that we’re seeking a correspondence between the reality of the past and any set of works. Instead, we can look at the past from many different angles and ask how it’s transformed by different perspectives. We can look at all the digitized volumes we have — and then at a subset of works that were widely reprinted — and then at another subset of works published in India — and then at three or four works selected as case studies for close reading. These different approaches will produce different pictures of the past, to be sure. But nothing compels us to make a final choice among them.

Where to start with text mining.

This post is an outline of discussion topics I’m proposing for a workshop at NASSR2012 (a conference of Romanticists). I’m putting it on the blog since some of the links might be useful for a broader audience.

In the morning I’ll give a few examples of concrete literary results produced by text mining. I’ll start the afternoon workshop by opening two questions for discussion: first, what are the obstacles confronting a literary scholar who might want to experiment with quantitative methods? Second, how do those methods actually work, and what are their limits?

I’ll also invite participants to play around with a collection of 818 works between 1780 and 1859, using an R program I’ve provided for the occasion. Links for these materials are at the end of this post.

I. HOW DIFFICULT IS IT TO GET STARTED?
There are two kinds of obstacles: getting the data you need, and getting the digital skills you need.

1. Is it really necessary to have a large collection of texts?
This is up for debate. But I tend to think the answer is “yes.”

Not because bigger is better, or because “distant reading” is the new hotness. It’s still true that a single passage, perceptively interpreted, may tell us more than a thousand volumes.

But if you want to interpret a single passage, you fortunately already have a wrinkled protein sponge that will do a better job than any computer. Quantitative analysis starts to make things easier only when we start working on a scale where it’s impossible for a human reader to hold everything in memory. Your mileage may vary, but I’d say, more than ten books?

And actually, you need a larger collection than that, because quantitative analysis tends to require context before it becomes meaningful. It doesn’t mean much to say that the word “motion” is common in Wordsworth, for instance, until we know whether “motion” is more common in his works than in other nineteenth-century poets. So yes, text-mining can provide clues that lead to real insights about a single author or text. But it’s likely that you’ll need a collection of several hundred volumes, for comparison, before those clues become legible.

Words that are consistently more common in works by William Wordsworth than in other poets from 1780 to 1850. I’ve used Wordle’s graphics, but the words have been selected by a Mann-Whitney test, which measures overrepresentation relative to a context — not by Wordle’s own (context-free) method. See the R script at the end of this post.

This isn’t to deny that there are interesting things that can be done digitally with a single text: digital editing, building timelines and maps, and so on. I just doubt that quantitative analysis adds much value at that scale. (And to give credit where it’s due: Mark Olsen was saying all this back in the 90s — see References.)

2. So, where do I get all those texts?
That’s what I was asking myself 18 months ago. A lot of excitement about digital humanities is premised on the notion that we already have large collections of digitized sources waiting to be used. But it’s not true, because page images are not the same thing as clean, machine-readable text.

If you’re interested in twentieth-century secondary sources, the JSTOR Data for Research API can probably get you what you need. Primary sources are a harder problem. In our own (Romantic) era, optical character recognition (OCR) is unreliable. The ratio of words transcribed accurately ranges from around 80% to around 98%, depending on print quality and typographical quirks like the notorious “long s.” For a lot of text-mining purposes, 95% might be fine, if the errors were randomly distributed. But they’re not random: errors cluster in certain words and periods.

What you see in a page image.

The problem can be addressed in several different ways. There are a few collections (like ECCO-TCP and the Brown Women Writers Project) that transcribe text manually. That’s an ideal solution, but coverage of that kind is stronger in the eighteenth than the nineteenth century.

What you may see as OCR.

What you may see as OCR.

So Jordan Sellers and I have supplemented those collections by automatically correcting 19c OCR that we got from the Internet Archive. Our strategy involved statistically cautious, period-specific spellchecking, combined with enough reasoning about context to realize that “mortal fin” is probably “mortal sin,” even though “fin” is a correctly spelled word. It’s not a perfect solution, but in our period it works well enough for text-mining purposes. We have corrected about 2,000 volumes this way, and are happy to share our texts and metadata, as well as the spellchecker itself (once I get it packaged well enough to distribute). I can give you either a zip file containing the 19c texts themselves, or a tab-separated file containing docIDs, words, and word counts for the whole collection. In either scheme, the docIDs are keyed to this metadata file.

Of course, selecting titles for a collection like this raises intractable questions about representativeness. We tried to maximize diversity while also selecting volumes that seemed to have reached a significant audience. But other scholars may have other priorities. I don’t think it would be useful to seek a single right answer about representativeness; instead, I’d like to see multiple scholars building different kinds of collections, making them all public, and building on each other’s work. Then we would be able to test a hypothesis against multiple collections, and see whether the obvious caveats about representativeness actually make a difference in any given instance.

3. Is it necessary to learn how to program?
I’m not going to try to answer that question, because it’s complex and better addressed through discussion.

I will tell a brief story. I went into this gig thinking that I wouldn’t have to do my own programming, since there were already public toolsets for text-mining (Voyant, MONK, MALLET, TAPoR, SEASR) and for visualization (Gephi). I figured I would just use those.

But I rapidly learned otherwise. Tools like MONK and Voyant taught me what was possible, but they weren’t well adapted for managing a very large collection of texts, and didn’t permit me to make my own methodological innovations. When you start trying to do either of those things, you rapidly need “nonstandard parts,” which means that someone in the team has to be able to program.

That doesn’t have to be a daunting prospect, because the programming involved is of a relatively forgiving sort. It’s not easy, but it’s also not professional software development. So if you want to do it yourself, that’s a plausible aspiration. Alternately, if you want to collaborate with someone, you don’t necessarily need to find “a computer scientist.” A graduate student or fellow humanist who can program will do just fine.

If you do want to learn to program, I would recommend starting with either Python or R. Of the two languages, Python is certainly easier. It’s intuitive, and well-documented, and great for working with text. If you expect to use existing tools (like MALLET), and just need some “glue” to connect them to each other, Python is probably the way to go. R is a more specialized and less intuitive language. But it happens to be specialized in some ways that are useful for text mining. In particular, it has built-in statistical functions, and a built-in plotting/graphing capacity. I’ve used it for the sample exercise that accompanies this post. But if you’re learning to program for the first time, Python might be a better all-around choice, and you could in principle extend it to do everything R does. [Later addition: You could do worse than start with The Programming Historian.]

II. WHAT CAN WE ACTUALLY DO WITH QUANTITATIVE METHODS?
What follows is just a list of elements. Interesting research projects tend to combine several of these elementary operations in ad-hoc ways suited to a particular question. The list of elements runs a little long, so let me cut to the chase: the overall theme I’m trying to convey is that you can build complex arguments on a very simple foundation. Yes, at bottom, text mining is often about counting words. But a) words matter and b) they hang together in interesting ways, like individual dabs of paint that together start to form a picture.

So, to return to the original question: what can we do?

1) Categorize documents. You can “categorize” in several different senses.

    a) Information retrieval: retrieve documents that match a query. This is what you do every time you use a search engine.

    b) (Supervised) classification: a program can learn to correctly distinguish texts by a given author, or learn (with a bit more difficulty) to distinguish poetry from prose, tragedies from history plays, or “gothic novels” from “sensation novels.” (See “Quantitative Formalism,” Pamphlet 1 from the Stanford Literary Lab.) The researcher has to provide examples of different categories, but doesn’t have to specify how to make the distinction: algorithms can learn to recognize a combination of features that is the “fingerprint” of a given category.

    An example of clustering from “Quantitative Formalism,” Allison, Heuser, Jockers, Moretti, and Witmore, Stanford Literary Lab.

    c) (Unsupervised) clustering: a program can subdivide a group of documents using general measures of similarity instead of predetermined categories. This may reveal patterns you don’t expect.

All three of these techniques can achieve amazing results armed with what seems like very crude information about the documents they’re categorizing. We know, intuitively, that merely counting words is not enough to distinguish a tragedy from a history play. But our intuitions are simply wrong — see the lit lab pamphlet I cited above. It turns out that there’s an enormous amount of information contained in relative word frequencies, even if you know nothing about sequence or syntax. As you consider other aspects of text mining, it’s useful to keep this intuitive misfire in mind. Relatively simple statistical techniques often characterize discourse a good deal better than our intuitions would predict.

2) Contrast the vocabulary of different corpora. In a way, this reverses the logic of classifying documents (1b). Instead of using features to sort documents into categories, you start with two categories of documents and contrast them to identify distinctive features.

For instance, you can discover which words (or phrases) are overrepresented in one author or genre (relative to, say, the rest of nineteenth-century literature). It can admittedly be a challenge to interpret the results: this is a kind of evidence we aren’t accustomed to yet. But lists of overrepresented words can be a fruitful source of critical leads to pursue in more traditional ways.

Beyond identifying distinctive words and phrases, corpora can be compared using metrics chosen for some more specific reason. It’s difficult to give an exhaustive list – but, for instance, the argument I’ve been making about generic differentiation is based on a kind of corpus comparison. As a general think-piece on the topic, I recommend Ben Schmidt’s blog post arguing that comparison is an underused and underrated tool; Schmidt’s taxonomy of text-mining techniques in that post was a strong influence on the taxonomy I’m offering here.

3) Trace the history of particular features (words or phrases) over time. This could be viewed as a special category of corpus comparison, where you’re comparing corpora segmented on the time axis.

The best-known example here would be Google’s ngram viewer. Digital humanists love to criticize the ngram viewer, partly for valid reasons (there’s no way to know what texts are being used). But it has probably been the single most influential application of text mining, so clearly people are finding this simple kind of diachronic visualization useful. A couple of other projects have built on the same dataset, slicing it in different ways. Mark Davies of BYU built an interface that lets you survey the history of collocations. Our team at Illinois built an interface that mines 18-19c correlations in the ngram dataset; it turns out that correlated words have a high likelihood of being related in other ways as well, and these can be intriguing leads: see what words correlate with “delicacy” in our period, for instance. Harvard has built Bookworm, which can be understood as a smaller but more flexible and better-documented version of the ngram viewer (built on the Open Library instead of Google Books).

Words whose frequencies correlate strongly over time are often related in other ways as well. Ngram viewer by Auvil, Capitanu, Heuser and Underwood, based on corrected Google dataset.

Of special interest to Romanticists: a project that isn’t built on the ngram dataset but that does use diachronic correlation-mining as a central methodology. In Stanford Lit Lab Pamphlet 4, Ryan Heuser and Long Le-Khac have traced some very interesting, strongly correlated changes in novelistic diction over the course of the 19th century.

Finally, anyone who wants to make a diachronic argument about diction should read Ben Schmidt’s simple, elegant experiment peeling apart two different components of change: generational succession and historical change within the diction of a single age-cohort.

4) Cluster features that tend to be associated in a given corpus of documents (aka topic modeling). In a way, this reverses the logic of clustering documents (1c). Instead of grouping documents that tend to share the same words, you group words that tend to appear in the same documents, or parts of documents. This produces something that looks like a semantic map of the period or corpus you’re studying. (It would be more accurate to call it a discursive map, because topics don’t actually have to be unified semantically. They are more analogous to “discourses.”)

There are a lot of ways to cluster features, ranging from older approaches (Latent Semantic Analysis), to the new, hip approach — “Bayesian topic modeling,” which has the advantage that it clusters individual occurrences of words (tokens) instead of word types. As a result, it can distinguish different senses of a word. (Scott Weingart has written a clear and comprehensive introduction to topic modeling for humanists.)

Topic modeling has become justifiably popular for several reasons. First and foremost, a “discursive map” can be a nice thing to have; it lends itself easily to interpretation. Also, frankly, this approach doesn’t require a whole lot of improvisation. You just pour text files into a tool like MALLET, and out come a list of topics, looking meaningful and authoritative. It’s important to remember that topic-modeling is in fact an imprecise process. Slightly different inputs (for instance, a different stopword list) can produce very different outputs.

5) Entity extraction. If you’re mainly interested in proper nouns (personal names or place names, or dates and prices) there are tools like OpenNLP that can extract these from text, using syntactic patterns as clues.

6) Visualization. Perhaps this isn’t technically a form of analysis, but in practice it’s important enough that it deserves to be treated as a separate analytical step. It’s impractical to list all possible forms of visualization here, but for instance, results can be visualized:

Putting things together.
There’s no limit to the number of ways you can combine these different operations. Matt Wilkens has extracted references to named entities from fiction, and then visualized their density geographically. Robert K. Nelson has performed topic modeling on the print run of a Civil-War-era newspaper, and then graphed the frequency of each topic over time. You could go a step further and look for correlations between topics (either over time, or in terms of their distribution over documents). Then you could visualize the relationships between topics as a network.

What’s the goal uniting all this experimentation? I suspect there are two different but equally valid goals. In some cases, we’re going to find patterns that actually function as evidence to support literary-historical arguments. (In a number of the examples cited above, I think that’s starting to happen.) In other cases, text mining may work mainly as an exploratory technique, revealing clues that need to be fleshed out and written up using more traditional critical methods. The boundary between those two applications will be hotly debated for years, so I won’t attempt to define it here.

III. SAMPLE DATA AND SCRIPT FOR EXPLORATION.
I don’t know whether we’ll really have time for this, but I ought to at least offer you a chance to do hands-on stuff. So here’s a medium-sized project.

I’ve created a pre-packaged set of 818 volumes of poetry and fiction between 1780 and 1859, including 243 authors. I can give you first, a metadata file that includes the authors, titles, dates, and so on for each volume, and second, a data file that includes word counts for each volume. (To keep from frying your laptop, I’ve only included the top 9,000 words in the collection. But actually that’s a lot.)

Finally, I’ve provided an R script that will let you define different chunks of the collection and compare them against each other, to identify words that are significantly overrepresented in a given author, genre, or period. The script will try two different measures of “overrepresentation”: the first, “log-likelihood,” is based on the aggregate frequency of words in the corpus you selected, adding all the volumes in the corpus together. The second, “Mann-Whitney rho,” tries to locate words that are consistently more common in corpus X by paying attention to individual volumes. For more on how that works, see this blog post.

Of course, the R script won’t work until you download R and open it from within R. Please understand that this is a very rough, ad-hoc piece of work for this one occasion, not a polished piece of software that I expect people to use for the long term.

Postscript about the word “mining.”
I know it has an industrial sound; I know humanists like “analysis” more. But I’m sticking with the mining metaphor on the principle of truth in advertising. I think that word accurately conveys the scale of this enterprise, and the fact that it’s often more exploratory than probative. Besides, “mining” is vivid, and that has its own sort of humanistic value.

References (that aren’t already implicit in links)
Mark Olsen, “Signs, Symbols, and Discourses: A New Direction for Computer-Aided Literature Studies” Computers and the Humanities 27 (1993): 309-314.

A touching detail produced by LDA …

I’m getting ahead of myself with this post, because I don’t have time to explain everything I did to produce this. But it was just too striking not to share.

Basically, I’m experimenting with Latent Dirichlet Allocation, and I’m impressed. So first of all, thanks to Matt Jockers, Travis Brown, Neil Fraistat, and everyone else who tried to convince me that Bayesian methods are better. I’ve got to admit it. They are.

But anyway, in a class I’m teaching we’re using LDA on a generically diverse collection of 1,853 volumes between 1751 and 1903. The collection includes fiction, poetry, drama, and a limited amount of nonfiction (just biography). We’re stumbling on a lot of fascinating things, but this was slightly moving. Here’s the graph for one particular topic.

Image of a topic.
The circles and X’s are individual volumes. Blue is fiction, green is drama, pinkish purple is poetry, black biography. Only the volumes where this topic turned out to be prominent are plotted, because if you plot all 1,853 it’s just a blurry line at the bottom of the image. The gray line is an aggregate frequency curve, which is not related in any very intelligible way to the y-axis. (Work in progress …) As you can see. this topic is mostly prominent in fiction around the year 1800. Here are the top 50 words in the topic:


But here’s what I find slightly moving. The x’s at the top of the graph are the 10 works in the collection where the topic was most prominent. They include, in order, Mary Wollstonecraft Shelley, Frankenstein, Mary Wollstonecraft, Mary, William Godwin, St. Leon, Mary Wollstonecraft Shelley, Lodore, William Godwin, Fleetwood, William Godwin, Mandeville, and Mary Wollstonecraft Shelley, Falkner.

In short, this topic is exemplified by a family! Mary Hays does intrude into the family circle with Memoirs of Emma Courtney, but otherwise, it’s Mary Wollstonecraft, William Godwin, and their daughter.

Other critics have of course noticed that M. W. Shelley writes “Godwinian novels.” And if you go further down the list of works, the picture becomes less familial (Helen Maria Williams and Thomas Holcroft butt in, as well as P. B. Shelley). Plus, there’s another topic in the model (“myself these should situation”) that links William Godwin more closely to Charles Brockden Brown than it does to his wife or daughter. And LDA isn’t graven on stone; every time you run topic modeling you’re going to get something slightly different. But still, this is kind of a cool one. “Mind feelings heart felt” indeed.

Etymology and nineteenth-century poetic diction; or, singing the shadow of the bitter old sea.

In a couple of recent posts, I argued that fiction and poetry became less similar to nonfiction prose over the period 1700-1900. But because I only measured genres’ distance from each other, I couldn’t say much substantively about the direction of change. Toward the end of the second post, though, I did include a graph that hinted at a possible cause:


The older part of the lexicon (mostly words derived from Old English) gradually became more common in poetry, fiction, and drama than in nonfiction prose. This may not be the only reason for growing differentiation between literary and nonliterary language, but it seems worth exploring. (I should note that function words are excluded from this calculation for reasons explained below; we’re talking about verbs, nouns, and adjectives — not about a rising frequency of “the.”)

Why would genres become etymologically different? Well, it appears that words of different origins are associated in contemporary English with different registers (varieties of language appropriate for a particular social situation). Words of Old English provenance get used more often in speech than in writing — and in writing they are (now) used more often in narrative than in exposition. Moreover, writers learn to produce this distinction as they get older; there isn’t a marked difference for students in elementary school. But as they advance to high school, students learn to use Latinate words in formal expository writing (Bar-Ilan and Berman, 2007).

It’s not hard to see why words of Old English origin might be associated with spoken language. English was for 200 years (1066-1250) almost exclusively spoken. The learned part of the Old English lexicon didn’t survive this period. Instead, when English began to be used again in writing, literate vocabulary was borrowed from French and Latin. As a result, etymological distinctions in English tend also to be distinctions between different social contexts of language use.

Instead of distinguishing “Germanic” and “Latinate” diction here, I have used the first attested date for each word, choosing 1150 as a dividing line because it’s the midpoint of the period when English was not used in writing. Of course pre-1150 words are mostly from Old English, but I prefer to divide based on date-of-entry because that highlights the history of writing rather than a spurious ethnic mystique. (E.g., “Anglo-Saxon is a livelier tongue than Latin, so use Anglo-Saxon words.” — E. B. White.) But the difference isn’t material. You could even just measure the average length of words in different genres and get results that are close to the results I’m graphing here (the correlation between the pre/post-1150 ratio and average word length is often -.85 or lower).

The bottom line is this: using fewer pre-1150 words tends to make diction more overtly literate or learned. Using more of them makes diction less overtly learned, and perhaps closer to speech. It would be dangerous to assume much more: people may think that Old English words are “concrete” — but this isn’t true, for instance, of “word” or “true.”

What can we learn by graphing this aspect of diction?


In the period 1700-1900, I think we learn three interesting things:

    All genres of writing (or at least of prose) seem to acquire an exaggeratedly “literate” diction in the course of the eighteenth century.

    Poetry and fiction reverse that process in the nineteenth century, and develop a diction that is markedly less learned than other kinds of writing — or than their own past history.

    But they do that to different degrees, and as a result the overall story is one of increasing differentiation — not just between “literary” and “nonliterary” diction — but between poetry and fiction as well.

I’m fascinated by this picture. It suggests that the difference linguists have observed between the registers of exposition and narrative may be a relatively recent development. It also raises interesting questions about “literariness” in the eighteenth and nineteenth centuries. For instance, contrast this picture to the standard story where “poetic diction” is an eighteenth-century refinement that the nineteenth century learns to dispense with. Where the etymological dimension of diction is concerned, that story doesn’t fit the evidence. On the contrary, nineteenth-century poetry differentiates itself from the diction of prose in a new and radical way: by the end of the century, the older part of the lexicon has become more than 2.5 times more prominent, on average, in verse than it is in nonfiction prose.

I could speculate about why this happened, but I don’t really know yet. What I can do is give a little more descriptive detail. For instance, if pre-1150 words became more common in 19c poetry … which words, exactly, were involved? One way to approach that is to ask which individual words correlate most strongly with the pre/post-1150 ratio. We might focus especially, for instance, on the rising trend in poetry from the middle of the eighteenth century to 1900. If you sort the top 10,000 words in the poetry collection by correlation with yearly values of the pre/post ratio, you get a list like this:


But the precise correlation coefficients don’t matter as much as an overall picture of diction, so I’ll simply list the hundred words that correlate most strongly with the pre/post-1150 ratio in poetry from 1755 to 1900:


We’re looking mostly at a list of pre-1150 words, with a few exceptions (“face,” “flower,” “surely”). That’s not an inevitable result; if the etymological trend had been a side-effect of something mostly unrelated to linguistic register (say, a vogue for devotional poetry), then sorting the top 10,000 words by correlation with the trend would reveal a list of words associated with its underlying (religious) cause. But instead we’re seeing a trend that seems to have a coherent sociolinguistic character. That’s not just a feature of the top 100 words: the average pre-1150 word is located 2210 places higher on this list than the average post-1150 word.

It’s not, however, simply a list of common Anglo-Saxon words. The list clearly reflects a particular model of “poetic diction,” although the nature of that model is not easy to describe. It involves an odd mixture of nouns for large natural phenomena (wind, sea, rain, water, moon, sun, star, stars, sunset, sunrise, dawn, morning, days, night, nights) and verbs that express a subjective relation (sang, laughed, dreamed, seeing, kiss, kissed, heard, looked, loving, stricken). [Afterthought: I don't think we have any Hopkins in our collection, but it sounds like my computer is parodying Gerard Manley Hopkins.] There’s also a bit of explicitly archaic Wardour Street in there (yea, nay, wherein, thereon, fro).

Here, by contrast, are the words at the bottom of the list — the ones that correlate negatively with the pre/post-1150 trend, because they are less common, on average, in years where that trend spikes.


There’s a lot that could be said about this list, but one thing that leaps out is an emphasis on social competition. Pomp, power, superior, powers, boast, bestow, applause, grandeur, taste, pride, refined, rival, fortune, display, genius, merit, talents. This is the language of poems that are not bashful about acknowledging the relationship between “social” distinction and the “arts” “inspired” by the “muse” — a theme entirely missing (or at any rate disavowed) in the other list. So we’re getting a fairly clear picture of a thematic transformation in the concerns of poetry from 1755 to 1900. But these lists are generated strictly by correlation with the unsmoothed year-to-year variation of an etymological trend! Moreover, the lists have themselves an etymological character. There are admittedly a few pre-1150 words in this list of negative correlators (mind, oft, every), but by and large it’s a list of words derived from French or Latin after 1150.

I think the apparent connection between sociolinguistic and thematic issues is the really interesting part of this. It begins to hint that the broader shift in poetic diction (using words from the older part of the lexicon to differentiate poetry from prose) had itself an unacknowledged social rationale — which was to disavow poetry’s connection to cultural distinction, and foreground instead a simplified individual subjectivity. I’m admittedly speculating a little here, and there’s a great deal more that could be said both about poetry and about parallel trends in fiction — but I’ve said enough for one blog post.

A couple of quick final notes. You’re wondering, what about drama?


Our collection of drama in the nineteenth century is actually too sparse to draw any conclusions yet, but there’s the trend line so far, if you’re interested.

You’re also wondering, how were pre- and post-1150 words actually sorted out? I made a list of the 10,500 most common words in the collection, and mined etymologies for them using a web-crawler on Dictionary.com. I excluded proper nouns, abbreviations, and words that entered English after 1699. I also excluded function words (determiners, prepositions, conjunctions, pronouns, and the verb to be) because as Bar-Ilan and Berman say, “register variation is essentially a matter of choice — of selecting high-level more formal alternatives instead of everyday, colloquial items or vice versa” (15). There is generally no alternative to prepositions, pronouns, etc, so they don’t tell us much about choice. After those exclusions, I had a list of 9,517 words, of which 2,212 entered the language before 1150 and 7,125 after 1149. (The list is available here.)

Finally, I doubt we’ll be so lucky — but if you do cite this blog post, it should be cited as a collective work by Ted Underwood and Jordan Sellers, because the nineteenth-century part of the underlying collection is a product of Jordan’s research.

References
Laly Bar-Ilan and Ruth A. Berman, “Developing register differentiation: the Latinate-Germanic divide in English,” Linguistics 45 (2007): 1-35.

[UPDATE March 13, 2011: Twitter conversation about this post with Natalia Cecire.]

Literary and nonliterary diction, the sequel.

In my last post, I suggested that literary and nonliterary diction seem to have substantially diverged over the course of the eighteenth and nineteenth centuries. The vocabulary of fiction, for instance, becomes less like nonfiction prose at the same time as it becomes more like poetry.

It’s impossible to interpret a comparative result like this purely as evidence about one side of the comparison. We’re looking at a process of differentiation that involves changes on both sides: the language of nonfiction and fiction, for instance, may both have specialized in different ways.

This post is partly a response to very helpful suggestions I received from commenters, both on this blog and at Language Log. It’s especially a response to Ben Schmidt’s effort to reproduce my results using the Bookworm dataset. I also try two new measures of similarity toward the end of the post (cosine similarity and etymology) which I think interestingly sharpen the original hypothesis.

I have improved my number-crunching in four main ways (you can skip these if you’re bored):

1) In order to normalize corpus size across time, I’m now comparing equal-sized samples. Because the sample sizes are small relative to the larger collection, I have been repeating the sampling process five times and averaging results with a Fisher’s r-to-z transform. Repeated sampling doesn’t make a huge difference, but it slightly reduces noise.

2) My original blog post used 39-year slices of time that overlapped with each other, producing a smoothing effect. Ben Schmidt persuasively suggests that it would be better to use non-overlapping samples, so in this post I’m using non-overlapping 20-year slices of time.

3) I’m now running comparisons on the top 5,000 words in each pair of samples, rather than the top 5,000 words in the collection as a whole. This is a crucial and substantive change.

4) Instead of plotting a genre’s similarity to itself as a flat line of perfect similarity at the top of each plot, I plot self-similarity between two non-overlapping samples selected randomly from that genre. (Nick Lamb at Language Log recommended this approach.) This allows us to measure the internal homogeneity of a genre and use it as a control for the differentiation between genres.

Briefly, I think the central claims I was making in my original post hold up. But the constraints imposed by this newly-rigorous methodology have forced me to focus on nonfiction, fiction, and poetry. Our collections of biography and drama simply aren’t large enough yet to support equal-sized random samples across the whole period.

Here are the results for fiction compared to nonfiction, and nonfiction compared to itself.


This strongly supports the conclusion that fiction was becoming less like nonfiction, but also reveals that the internal homogeneity of the nonfiction corpus was decreasing, especially in the 18c. So some of the differentiation between fiction and nonfiction may be due to the internal diversification of nonfiction prose.

By contrast, here are the results for poetry compared to fiction, and fiction compared to itself.

Poetry and fiction are becoming more similar in the period 1720-1900. I should note that I’ve dropped the first datapoint, for the period 1700-1719, because it seemed to be an outlier. Also, we’re using a smaller sample size here, because my poetry collection won’t support 1 million word samples across the whole period. (We have stripped the prose introduction and notes from volumes of poetry, so they’re small.)

Another question that was raised, both by Ben and by Mark Liberman at Language Log, involved the relationship between “diction” and “topical content.” The Spearman correlation coefficient gives common and uncommon words equal weight, which means (in effect) that it makes no effort to distinguish style from content.

But there are other ways of contrasting diction. And I thought I might try them, because I wanted to figure out how much of the growing distance between fiction and nonfiction was due simply to the topical differentiation of nonfiction in this period. So in the next graph, I’m comparing the cosine similarity of million-word samples selected from fiction and nonfiction to distinct samples selected from nonfiction. Cosine similarity is a measure that, in effect, gives more weight to common words.


I was surprised by this result. When I get very stable numbers for any variable I usually assume that something is broken. But I ran this twice, and used the same code to make different comparisons, and the upshot is that samples of nonfiction really are very similar to other samples of nonfiction in the same period (as measured by cosine similarity). I assume this is because the growing topical heterogeneity that becomes visible in Spearman’s correlation makes less difference to a measure that focuses on common words. Fiction is much more diverse internally by this measure — which makes sense, frankly, because the most common words can be totally different in first-person and third-person fiction. But — to return to the theme of this post — the key thing is that there’s a dramatic differentiation of fiction and nonfiction in this period. Here, by contrast, are the results for nonfiction and poetry compared to fiction, as well as fiction compared to itself.

This graph is a little wriggly, and the underlying data points are pretty bouncy — because fiction is internally diverse when measured by cosine similarity, and it makes a rather bouncy reference point. But through all of that I think one key fact does emerge: by this measure, fiction looks more similar to nonfiction prose in the eighteenth century, and more similar to poetry in the nineteenth.

There’s a lot more to investigate here. In my original post I tried to identify some of the words that became more common in fiction as it became less like nonfiction. I’d like to run that again, in order to explain why fiction and poetry became more similar to each other. But I’ll save that for another day. I do want to offer one specific metric that might help us explain the differentiation of “literary” and “nonliterary” diction: the changing etymological character of the vocabulary in these genres.


Measuring the ratio of “pre-1150″ to “post-1150″ words is roughly like measuring the ratio of “Germanic” to “Latinate” diction, except that there are a number of pre-1150 words (like “school” and “wall”) that are technically “Latinate.” So this is essentially a way of measuring the relative “familiarity” or “informality” of a genre (Bar-Ilan and Berman 2007). (This graph is based on the top 10k words in the whole collection. I have excluded proper nouns, words that entered the language after 1699, and stopwords — determiners, pronouns, conjunctions, and prepositions.)

I think this graph may help explain why we have the impression that literary language became less specialized in this period. It may indeed have become more informal — perhaps even closer to the spoken language. But in doing so it became more distinct from other kinds of writing.

I’d like to thank everyone who responded to the original post: I got a lot of good ideas for collection development as well as new ways of slicing the collection. Katherine Harris, for instance, has convinced me to add more women writers to the collection; I’m hoping that I can get texts from the Brown Women Writers Project. This may also be a good moment to reiterate that the nineteenth-century part of the collection I’m working with was selected by Jordan Sellers, and these results should be understood as built on his research. Finally, I have put the R code that I used for most of these plots in my Open Data page, but it’s ugly and not commented yet; prettier code will appear later this weekend.

References
Laly Bar-Ilan and Ruth A. Berman, “Developing register differentiation: the Latinate-Germanic divide in English,” Linguistics 45 (2007): 1-35.

The differentiation of literary and nonliterary diction, 1700-1900.

When you stumble on an interesting problem, the question arises: do you blog the problem itself — or wait until you have a full solution to publish as an article?

In this case, I think the problem is too big to be solved by a single person anyway, so I might as well get it out there where we can all chip away at it. At the end of this post, I include a link to a page where you can also download the data and code I’m using.

When we compare groups of texts, we’re often interested in characterizing the contrast between them. But instead of characterizing the contrast, you could also just measure the distance between categories. For instance, you could generate a list of word frequencies for two genres, and then run a Spearman’s correlation test, to measure the rank-order similarity of their diction.

In isolation, a measure of similarity between two genres is hard to interpret. But if you run the test repeatedly to compare genres at different points in time, the changes can tell you when the diction of the genres becomes more or less similar.

Spearman similarity to nonfiction, measured at 5-year intervals. At each interval, a 39-year chunk of the collection (19 years on either side of the midpoint) is being selected for comparison.


In the graph above, I’ve done that with four genres, in a collection of 3,724 eighteenth- and nineteenth-century volumes (constructed in part by TCP and in part by Jordan Sellers — see acknowledgments), using the 10,000 most frequent words in the collection, excluding proper nouns. The black line at the top is flat, because nonfiction is always similar to itself. But the other lines decline as poetry, drama, and fiction become progressively less similar to nonfiction where word choice is concerned. Unsurprisingly, prose fiction is always more similar to nonfiction than poetry is. But the steady decline in the similarity of all three genres to nonfiction is interesting. Literary histories of this period have tended to pivot on William Wordsworth’s rebellion against a specialized “poetic diction” — a story that would seem to suggest that the diction of 19c poetry should be less different from prose than 18c poetry had been. But that’s not the pattern we’re seeing here: instead it appears that a differentiation was setting in between literary and nonliterary language.

This should be described as a differentiation of “diction” rather than style. To separate style from content (for instance to determine authorship) you need to focus on the frequencies of common words. But when critics discuss “diction,” they’re equally interested, I think, in common and less common words — and that’s the kind of measure of similarity that Spearman’s correlation will give you (Kilgarriff 2001).

The graph above makes it look as though nonfiction was remaining constant while other genres drifted away from it. But we are after all graphing a comparison with two sides. This raises the question: were poetry, fiction, and drama changing relative to nonfiction, or was nonfiction changing relative to them? But of course the answer is “both.”

At each 5-year interval, the Spearman similarity is being measured between the 40-year span surrounding that point and the period 1700-1740.


Here we’re comparing each genre to its own past. The language of nonfiction changes somewhat more rapidly than the language of the other genres, but none of them remain constant. There is no fixed reference point in this world, which is why I’m talking about the “differentiation” of two categories. But even granting that, we might want to pose another skeptical question: when literary genres become less like nonfiction, is that merely a sign of some instability in the definition of “nonfiction”? Did it happen mostly because, say, the nineteenth century started to publish on specialized scientific topics? We can address this question to some extent by selecting a more tightly defined subset of nonfiction as a reference point — say, biographies, letters, and orations.

The Spearman similarities here happen to be generated on the top 5000 words rather than the top 10000, but I have tried both wordsets and it makes very little difference.


Even when we focus on this relatively stable category, we see significant differentiation. Two final skeptical questions need addressing before I try to explain what happened. First, I’ve been graphing results so far as solid lines, because our eyes can’t sort out individual data points for four different variables at once. But a numerically savvy reader will want to see some distributions and error bars before assessing the significance of these results. So here are yearly values for fiction. In some cases these are individual works of fiction, though when there are two or more works of fiction in a single year they have been summed and treated as a group. Each year of fiction is being compared against biographies, letters, and orations for 19 years on either side.

That’s a fairly persuasive trend. You may, however, notice that the Spearman similarities for individual years on this graph are about .1 lower than they were when we graphed fiction as a 39-year moving window. In principle Spearman similarity is independent of corpus size, but it can be affected by the diversity of a corpus. The similarity between two individual texts is generally going to be lower than the similarity between two large and diverse corpora. So could the changes we’ve seen be produced by changes in corpus size? There could be some effect, but I don’t think it’s large enough to explain the phenomenon. [See update at the bottom of this post. The results are in fact even clearer when you keep corpus size constant. -Ed.] The sizes of the corpora for different genres don’t change in a way that would produce the observed decreases in similarity; the fiction corpus, in particular, gets larger as it gets less like nonfiction. Meanwhile, it is at the same time becoming more like poetry. We’re dealing with some factor beyond corpus size.

So how then do we explain the differentiation of literary and nonliterary diction? As I started by saying, I don’t expect to provide a complete answer: I’m raising a question. But I can offer a few initial leads. In some ways it’s not surprising that novels would gradually become less like biographies and letters. The novel began very much as faked biography and faked correspondence. Over the course of the period 1700-1900 the novel developed a sharper generic identity, and one might expect it to develop a distinct diction. But the fact that poetry and drama seem to have experienced a similar shift (together with the fact that literary genres don’t seem to have diverged significantly from each other) begins to suggest that we’re looking at the emergence of a distinctively “literary” diction in this period.

To investigate the character of that diction, we need to compare the vocabulary of genres at many different points. If we just compared late-nineteenth-century fiction to late-nineteenth-century nonfiction, we would get the vocabulary that characterized fiction at that moment, but we wouldn’t know which aspects of it were really new. I’ve done that on the side here, using the Mann-Whitney rho test I described in an earlier post. As you’ll see, the words that distinguish fiction from nonfiction from 1850 to 1900 are essentially a list of pronouns and verbs used to describe personal interaction. But that is true to some extent about fiction in any period. We want to know what aspects of diction had changed.

In other words, we want to find the words that became overrepresented in fiction as fiction was becoming less like nonfiction prose. To find them, I compared fiction to nonfiction at five-year intervals between 1720 and 1880. At each interval I selected a 39-year slice of the collection and ranked words according to the extent to which they were consistently more prominent in fiction than nonfiction (using Mann-Whitney rho). After moving through the whole timeline you end up with a curve for each word that plots the degree to which it is over or under-represented in fiction over time. Then you sort the words to find ones that tend to become more common in fiction as the whole genre becomes less like nonfiction. (Technically, you’re looking for an inverse Pearson’s correlation, over time, between the Mann-Whitney rho for this word and the Spearman’s similarity between genres.) Here’s a list of the top 60 words you find when you do that:


It’s not hard to see that there are a lot of words for emotional conflict here (“horror, courage, confused, eager, anxious, despair, sorrow, dread, agony”). But I would say that emotion is just one aspect of a more general emphasis on subjectivity, ranging from verbs of perception (“listen, listened, watched, seemed, feel, felt”) to explicitly psychological vocabulary (“nerves, mind, unconscious, image, perception”) to questions about the accuracy of perception (“dream, real, sight, blind, forget, forgot, mystery, mistake”). To be sure, there are other kinds of words in the list (“cottage, boy, carriage”). But since we’re looking at a change across a period of 200 years, I’m actually rather stunned by the thematic coherence of the list. For good measure, here are words that became relatively less common in fiction (or more common in nonfiction — that’s the meaning of “relatively”) as the two genres differentiated:


Looking at that list, I’m willing to venture out on a limb and suggest that fiction was specializing in subjectivity while nonfiction was tending to view the world from an increasingly social perspective (“executive, population, colonists, department, european, colonists, settlers, number, individuals, average.”)

Now, I don’t pretend to have solved this whole problem. First of all, the lists I just presented are based on fiction; I haven’t yet assessed whether there’s really a shared “literary diction” that unites fiction with poetry and drama. Jordan and I probably need to build up our collection a bit before we’ll know. Also, the technique I just used to select lists of words looks for correlations across the whole period 1700-1900, so it’s going to select words that have a relatively continuous pattern of change throughout this period. But it’s also entirely possible that “the differentiation of literary and nonliterary diction” was a phenomenon composed of several different, overlapping changes with a smaller “wavelength” on the time axis. So I would say that there’s lots of room here for alternate/additional explanations.

But really, this is a question that does need explanation. Literary scholars may hate the idea of “counting words,” but arguments about a distinctively “literary” language have been central to literary criticism from John Dryden to the Russian Formalists. If we can historicize that phenomenon — if we can show that a systematic distinction between literary and nonliterary language emerged at a particular moment for particular reasons — it’s a result that ought to have significance even for literary scholars who don’t consider themselves digital humanists.

By the way, I think I do know why the results I’m presenting here don’t line up with our received impression that “poetic diction” is an eighteenth-century phenomenon that fades in the 19c. There is a two-part answer. For one thing, part of what we perceive as poetic diction in the 18c is orthography (“o’er”, “silv’ry”). In this collection, I have deliberately normalized orthography, so “silv’ry” is treated as equivalent to “silvery,” and that aspect of “poetic diction” is factored out.

But we may also miss differentiation because we wrongly assume that plain or vivid language cannot be itself a form of specialization. Poetic diction probably did become more accessible in the 19c than it had been in the 18c. But this isn’t the same thing as saying that it became less specialized! A self-consciously plain or restricted diction still counts as a mode of specialization relative to other written genres. More on this in a week or two …

Finally, let me acknowledge that the work I’m doing here is built on a collaborative foundation. Laura Mandell helped me obtain the TCP-ECCO volumes before they were public, and Jordan Sellers selected most of the nineteenth-century collection on which this work is based — something over 1,600 volumes. While Jordan and I were building this collection, we were also in conversation with Loretta Auvil, Boris Capitanu, Tanya Clement, Ryan Heuser, Matt Jockers, Long Le-Khac, Ben Schmidt, and John Unsworth, and were learning from them how to do this whole “text mining” thing. The R/MySQL infrastructure for this is pretty directly modeled on Ben’s. Also, since the work was built on a collaborative foundation, I’m going to try to give back by sharing links to my data and code on this “Open Data” page.

References
Adam Kilgarriff, “Comparing Corpora,” International Journal of Corpus Linguistics 6.1 (2001): 97-133.

[UPDATE Monday Feb 27th, 7 pm: After reading Ben Schmidt's comment below, I realized that I really had to normalize corpus size. "Probably not a problem" wasn't going to cut it. So I wrote a script that samples a million-word corpus for each genre every two years. As long as I was addressing that problem, I figured I would address another one that had been nagging at my conscience. I really ought to be comparing a different wordlist each time I run the comparison. It ought to be the top 5000 words in each pair of corpora that get compared -- not the top 5000 words in the collection as a whole.

The first time I ran the improved version I got a cloud of meaningless dots, and for a moment I thought my whole hypothesis about genre had been produced by a 'loose optical cable.' Not a good moment. But it was a simple error, and once I fixed it I got results that were actually much clearer than my earlier graphs.

I suppose you could argue that, since document size varies across time, it's better to select corpora that have a fixed number of documents rather than a fixed word size. I ran the script that way too, and it produces results that are noisier but still unambiguous. The moral of the story is: it's good to have blog readers who keep you honest and force you to clean up your methodology!]

MLA talk: just the thesis.

Giving a talk this morning at the MLA. There are two main arguments:

1) The first one will be familiar if you’ve read my blog. I suggest that the boundary between “text mining” and conventional literary research is far fuzzier than people realize. There appears to be a boundary only because literary scholars are pretty unreflective about the way we’re currently using full-text search. I’m going to press this point in detail, because it’s not just a metaphor: to produce a simple but useful topic-modeling algorithm, all you have to do is take a search engine and run it backwards.

2) The second argument is newer; I don’t think I’ve blogged about it yet. I’m going to present topic modeling as a useful bridge between “distant” and “close” reading. I’ve found that I often learn most about a genre by modeling it as part of a larger collection that includes many other genres. In that context, a topic-modeling algorithm can highlight peculiar convergences of themes that characterize the genre relative to its contemporary backdrop.

a slide from the talk, where a simple topic-modeling algorithm has been used to produce a dendrogram that offers a clue about the temporal framing of narration in late-18c novels


This is distant reading, in the sense that it requires a large collection. But it’s also close reading, in the sense that it’s designed to reveal subtle formal principles that shape individual works, and that might otherwise elude us.

Although the emphasis is different, a lot of the examples I use are recycled from a talk I gave in August, described here.

Exploring the relationship between topics and trends.

I’ve been talking about correlation since I started this blog. Actually, that was the reason why I did start it: I think literary scholars can get a huge amount of heuristic leverage out of the fact that thematically and socially-related words tend to rise and fall together. It’s a simple observation, and one that stares you in the face as soon as you start to graph word frequencies on the time axis.1 But it happens to be useful for literary historians, because it tends to uncover topics that also pose periodizable kinds of puzzles. Sometimes the puzzle takes the form of a topic we intuitively recognize (say, the concept of “color”) that increases or decreases in prominence for reasons that remain to be explained:

At other times, the connection between elements of the topic is not immediately intuitive, but the terms are related closely enough that their correlation suggests a pattern worthy of further exploration. The relationship between terms may be broadly historical:

Or it may involve a pattern of expression that characterizes a periodizable style:

Of course, as the semantic relationship between terms becomes less intuitively obvious, scholars are going to wonder whether they’re looking at a real connection or merely an accidental correlation. “Ardent” and “tranquil” seem like opposites; can they really be related as elements of a single discourse? And what’s the relationship to “bosom,” anyway?

Ultimately, questions like this have to be addressed on a case-by-case basis; the significance of the lead has to be fleshed out both with further analysis, and with close reading.

But scholars who are wondering about the heuristic value of correlation may be reassured to know that this sort of lead does generally tend to pan out. Words that correlate with each other across the time axis do in practice tend to appear in the same kinds of volumes. For instance, if you randomly select pairs of words from the top 10,000 words in the Google English ngrams dataset 1700-1849,2 measure their correlation with each other in that dataset across the period 1700-1849, and then measure their tendency to appear in the same volumes in a different collection3 (taking the cosine similarity of term vectors in a term-document matrix), the different measures of association correlate with each other strongly. (Pearson’s r is 0.265, significant at p < 0.0005.) Moreover, the relationship holds (less strongly, but still significantly) even in adjacent centuries: words that appear in the same eighteenth-century volumes still tend to rise and fall together in the nineteenth century.

Why should humanists care about the statistical relationship between two measures of association? It means that correlation-mining is in general going to be a useful way of identifying periodizable discourses. If you find a group of words that correlate with each other strongly, and that seem related at first glance, it's probably going to be worthwhile to follow up the hunch. You’re probably looking at a discourse that is bound together both diachronically (in the sense that the terms rise and fall together) and topically (in the sense that they tend to appear in the same kinds of volumes).

Ultimately, literary historians are going to want to assess correlation within different genres; a dataset like Google's, which mixes all genres in a single pool, is not going to be an ideal tool. However, this is also a domain where size matters, and in that respect, at the moment, the ngrams dataset is very helpful. It becomes even more helpful if you correct some of the errors that vitiate it in the period before 1820. A team of researchers at Illinois and Stanford4, supported by the Andrew W. Mellon Foundation, has been doing that over the course of the last year, and we're now able to make an early version of the tool available on the web. Right now, this ngram viewer only covers the period 1700-1899, but we hope it will be useful for researchers in that period, because it has mostly corrected the long-s problem that confufes opt1cal charader readers in the 18c — as well as a host of other, less notorious problems. Moreover, it allows researchers to mine correlations in the top 10,000 words of the lexicon, instead of trying words one by one to see whether an interesting pattern emerges. In the near future, we hope to expand the correlation miner to cover the twentieth century as well.

For further discussion of the statistical relationship between topics and trends, see this paper submitted to DHCS 2011.

UPDATE Nov 22, 2011: At DHCS 2011, Travis Brown pointed out to me that Topics Over Time (Wang and McCallum) might mine very similar patterns in a more elegant, generative way. I hope to find a way to test that method, and may perhaps try to build an implementation for it myself.

References
1) Ryan Heuser and I both noticed this pattern last winter. Ryan and Long Le-Khac presented on a related topic at DH2011: Heuser, Ryan, and Le-Khac, Long. “Abstract Values in the 19th Century British Novel: Decline and Transformation of a Semantic Field,” Digital Humanities 2011, Stanford University.

2) Jean-Baptiste Michel*, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, William Brockman, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden*. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science (Published online ahead of print: 12/16/2010)

3) The collection of 3134 documents (1700-1849) I used for this calculation was produced by combining ECCO-TCP volumes with nineteenth-century volumes selected and digitized by Jordan Sellers.

4) The SEASR Correlation Analysis and Ngrams Viewer was developed by Loretta Auvil and Boris Capitanu at the Illinois Informatics Institute, modeled on prototypes built by Ted Underwood, University of Illinois, and Ryan Heuser, Stanford.

LSA is a marvellous tool, but literary historians may want to customize it for their own discipline.

Right now Latent Semantic Analysis is the analytical tool I’m finding most useful. By measuring the strength of association between words or groups of words, LSA allows a literary historian to map themes, discourses, and varieties of diction in a given period.

This approach, more than any other I’ve tried, turns up leads that are useful for me as a literary scholar. But when I talk to other people in digital humanities, I rarely hear enthusiasm for it. Why doesn’t LSA get more love? I see three reasons:

1. The word “semantic” is a false lead: it points away from the part of this technique that would actually interest us. It’s true that Latent Semantic Analysis is based on the observation that a word’s distribution across a collection of documents works remarkably well as a first approximation of its meaning. A program running LSA can identify English synonyms on the TOEFL as well as the average student applying to college from a non-English-speaking country. [1]

But for a literary historian, the value of this technique does not depend on its claim to identify synonyms and antonyms. We may actually be more interested in contingent associations (e.g., “sensibility” — “rousseau” in the list on the left) than we are in the core “meaning” of a word.

I’ll return in a moment to this point. It has important implications, because it means that we want LSA to do something slightly different than linguists and information scientists have designed it to do. The “flaws” they have tried to iron out of the technique may not always be flaws for our purposes.

2. People who do topic-modeling may feel that they should use more-recently-developed Bayesian methods, which are supposed to be superior on theoretical grounds. I’m acknowledging this point just to set it aside; I’ve mused out loud about it once already, and I don’t want to do more musing until I have rigorously compared the two methods. I will say that from the perspective of someone just getting started, LSA is easier to implement than Bayesian topic modeling: it runs faster and scales up more easily.

3. The LSA algorithm provided by an off-the-shelf package is not necessarily the best algorithm for a literary historian. At bottom, that’s why I’m writing this post: humanists who want to use LSA are going to need guidance from people in their own discipline. Computer scientists do acknowledge that LSA requires “tuning, which is viewed as a kind of art.” [2] But they also offer advice about “best practices,” and some of those best practices are defined by disciplinary goals that humanists don’t share.

For instance, the power of LSA is often said to come from “reducing the dimensionality of the matrix.” The matrix in question is a term-document matrix — documents are listed along one side of the matrix, and terms along the other, and each cell of the matrix (tfi,j) records the number of times term i appears in document j, modified by a weighting algorithm described at the end of this post.

A (very small) term-document matrix.


That term-document matrix in and of itself can tell you a lot about the associations between words; all you have to do is measure the similarity between the vectors (columns of numbers) associated with each term. But associations of this kind won’t always reveal synonyms. For instance, “gas” and “petrol” might seem unrelated, because they substitute for each other in different sociolects and are rarely found together. To address that problem, you can condense the matrix by factorizing it with a technique called singular value decomposition (SVD). I’m not going to get into the math here, but the key is that condensing the matrix partially fuses related rows and columns — and as a result, the compressed matrix is able to measure transitive kinds of association. The words “gas” and “petrol” may rarely appear together. But they both appear with the same kinds of other words. So when dimensionality reduction “merges” the rows representing similar documents, “gas” and “petrol” will end up being strongly represented in the same merged rows. A compressed matrix is better at identifying synonyms, and for that reason at information retrieval. So there is a lot of consensus among linguists and information scientists that reducing the number of dimensions in the matrix is a good idea.

But literary historians approach this technique with a different set of goals. We care a lot about differences of sociolect and register, and may even be more interested in those sources of “noise” than we are in purely semantic relations. “Towering,” for instance, is semantically related to “high.” But I could look that up in a dictionary; I don’t need a computer program to tell me that! I might be more interested to discover that “towering” belongs to a particular subset of poetic diction in the eighteenth century. And that is precisely the kind of accident of distribution that dimensionality-reduction is designed to filter out. For that reason, I don’t think literary applications of LSA are always going to profit from the dimensionality-reduction step that other disciplines recommend.

For about eight months now, I’ve been using a version of LSA without dimensionality reduction. It mines associations simply by comparing the cosine-similarity of term vectors in a term-document matrix (weighted in a special way to address differences of document size). But I wanted to get a bit more clarity about the stakes of that choice, so recently I’ve been comparing it to a version of LSA that does use SVD to compress the matrix.

Comparing 18c associations for "delicacy" generated by two different algorithms.


Here’s a quick look at the results. (I’m using 2,193 18c volumes, mostly produced by TCP-ECCO; volumes that run longer than 100,000 words get broken into chunks that can range from 50k-100k words.) In many cases, the differences between LSA with and without compression are not very great. In the case of “delicacy,” for instance, both algorithms indicate that “delicate” has the strongest association. “Politeness” and “tenderness” are also very high on both lists. But compare the second row. The algorithm with compression produces “sensibility” — a close synonym. On the left-hand side, we have “woman.” This is not a synonym for “delicacy,” and if a linguist or computer scientist were evaluating these algorithms, it would probably be rejected as a mistake. But from a literary-historical point of view, it’s no mistake: the association between “delicacy” and femininity is possibly the most interesting fact about the word.

The 18c associations of "high" and "towering," in an uncompressed term-document matrix.


In short, compressing the matrix with SVD highlights semantic relationships at the cost of slightly blurring other kinds of association. In the case of “delicacy,” the effect is fairly subtle, but in other cases the difference between the two approaches is substantial. For instance, if you measure the similarity of term vectors in a matrix without compression, “high” and “towering” look entirely different. The main thing you discover about “high” is that it’s used for physical descriptions of landscape (“lies,” “hills”), and the main thing you discover about “towering” is that it’s used in poetic contexts (“flowery,” “glittering”).

The 18c. associations of "high" and "towering," as measured in a term-document matrix that has undergone SVD compression.


In a matrix that has undergone dimensionality reduction with SVD, associations have a much more semantic character, although they are still colored by other dimensions of context. Which of these two algorithms is more useful for humanistic purposes? I think the answer is going to depend on the goals being pursued in a given research project — if you’re interested in “topics” that are strictly semantic, you might want to use an algorithm that reduces dimensionality with SVD. If you’re interested in discourses, sociolects, genres, or types of diction, you might use LSA without dimensionality reduction.

My purpose here isn’t to choose between those approaches; it’s just to remind humanists that the algorithms we borrow from other disciplines are often going to need to be customized for our own disciplinary purposes. Information scientists have designed topic-modeling algorithms that produce semantically unified topics, because semantic categorization is important for them. But in literary history, we also care about other dimensions of language, and we don’t have to judge topic-modeling algorithms by strictly semantic criteria. How should we judge them? It will probably take decades for us to answer that question fully, but the short answer is just — by how well, in practice, they help us locate critically and historically interesting patterns.

A couple of technical notes: A fine point of LSA that can matter a great deal is how you weight the individual cells in the term-document matrix. For the normal LSA algorithm that uses dimensionality reduction, the consensus is that “log-entropy weighting” works well. You take the log of each frequency, and multiply the whole term vector by the entropy of the vector. I have found that this also works well for humanistic purposes.

For LSA without dimensionality reduction, I would recommend weighting cells by subtracting the expected frequency from the observed frequency. This formula “evens the playing field” between common and uncommon words — and it does so, vitally, in a way that gives a word’s absence from a long document more weight than its absence from a short one. (Much of LSA’s power actually comes from learning where a given word tends not to appear. [3]) I have tried various ways of applying log-entropy weighting without compressing the matrix, and I do not recommend it. Those two techniques belong together.

For reasons that remain somewhat mysterious (although the phenomenon itself is widely discussed), dimensionality reduction seems to work best when the number of dimensions retained is in the range of 250-350. Intuitively, it would seem possible to strike a sort of compromise between LSA methods that do and don’t compress the matrix by reducing dimensionality less drastically (perhaps only, say, cutting it by half). But in practice I find that doesn’t work very well; I suspect compression has to reach a certain threshold before the noise inherent in the process starts to cancel itself out and give way to a new sort of order.

[1] Thomas K. Landauer, Peter W. Foltz, and Darrell Latham, An Introduction to Latent Semantic Analysis, Discourse Processes 25 (1998): 259-84. Web reprint, p. 22.
[2] Preslav Nakov, Elena Valchanova, and Galia Angelova, “Towards Deeper Understanding of Latent Sematic Analysis Performance,” Recent Advances in Natural Language Processing, ed. Nicolas Nicolov (Samokov, Bulgaria: John Benjamins, 2004), 299.
[3] Landauer, Foltz, and Latham, p. 24.

The challenges of digital work on early-19c collections.

I’ve been posting mostly about collections built by other people (TCP-ECCO and Google). But I’m also in the process of building a small (thousand-title) 19c collection myself, in collaboration with E. Jordan Sellers. Jordan is selecting titles for the collection; I’m writing the Python scripts that process the texts. This is a modest project intended to support research for a few years, not a model for long-term curatorial practice. But we’ve encountered a few problems specific to the early 19c, and I thought I might share some of our experience and tools in case they’re useful for other early-19c scholars.

Literary and Characteristical Lives (1800), by William and Alexander Smellie. Note esp. the ligatures in 'first' and 'section.'


I originally wanted to create a larger collection, containing twenty or thirty thousand volumes, on the model of Ben Schmidt’s impressive work with nineteenth-century volumes vacuumed up from the Open Library. But because I needed a collection that bridged the eighteenth and nineteenth centuries, I found I had to proceed more slowly. The eighteenth century itself wasn’t the problem. Before 1800, archaic typography makes most optical character recognition unreliable — but for that very reason, TCP-ECCO has been producing clean, manually-keyed versions of 18c texts, enough at least for a small collection. The later 19c also isn’t a problem, because after 1830 or so, OCR quality is mostly adequate.

OCR version of Smellie, contributed by Columbia University Libraries to the Internet Archive.


But between 1800 and (say) 1830, you fall between two stools. It’s technically the nineteenth century, so people assume that OCR ought to work. But in practice, volumes from this period still have a lot of eighteenth-century typographical quirks, including loopy ligatures, the notorious “long s,” and worn or broken type. So the OCR is often pretty vile. I’m willing to put up with background noise if it’s evenly distributed. But these errors are distributed unevenly across the lexicon and across time, so they could actually distort conclusions if left unaddressed.

I decided to build a Python script to do post-processing correction of OCR. There are a lot of ways to do this; my approach was modeled on a paper written by Thomas A. Lasko and Susan E. Hauser for the National Library of Medicine. Briefly, what they show is that OCR correction becomes much more reliable when the program is given statistical information about the language, and errors, to be expected in a given domain. They’re working with contemporary text, but the principle holds even more strongly when you’re working in a different historical period. A generic spellchecker won’t perform well with texts that contain period spellings (“despatch,” “o’erflow’d”), systematic f/s substitution, and a much higher proportion of Latin and French than we’re used to. If your system corrects every occurrence of “même” to “mime,” you’re going to end up with a surprising number of mimes; if you accept “foul” at face value as a correctly-spelled word, you’re going to have very little “soul” in your collection.

Briefly, I customized my spellchecker for the early 19c in three ways:

    • The underlying dictionary included period spellings as well as common French and Latin terms, and recorded the frequency of each term in the 18/19c domain. I used frequencies (lightly) to guide fuzzy matching.
    • To calculate “edit distance,” I used a weighted matrix that recorded the probability of specific character substitutions in early-19c OCR, learning as it went along.
    • To resolve pairs like “foul/soul” and “flip/slip/ship,” where common OCR errors produce a token that could also be a real word, I extracted 2gram frequencies from the Google ngram database so that the program could judge which word made more sense in context. I.e., in the case of “the flip sailed,” the program can infer that a word before “sailed” is pretty likely to be “ship.”

A few other tricks are needed to optimize speed, and to make sure the script doesn’t over-correct proper nouns; anyone who’s interested in doing this should drop me a line for a fuller description and a copy of the code.

Automatically corrected version.


The results aren’t perfect, but they’re good enough to be usable (I am also recording the number of corrections and uncorrectable tokens so that I can assess margins of error later on).

I haven’t packaged this code yet for off-the-shelf use; it’s still got a few trailing wires. But if you want to cannibalize/adapt it, I’d be happy to give you a copy. Perhaps more importantly, I’d like to share a couple of sets of rules that might be helpful for anyone who’s attempting to normalize an 18/19c collection. Both of these rulesets are tab-delimited utf-8 .txt files. First, my list of 4600 rules for correcting 18/19c spellings, including syncopated past-tense forms like “bury’d” and “drop’d.” (Note that syncope cannot always be fixed simply by adding back an “e.” Rules for normalizing poetic syncope — “flow’ry,” “ta’en” — are clustered at the end of the file, so you can delete them if desired.) This ruleset has been transformed by a long series of joins and filtering operations, and edited manually, but I should acknowledge that part of the original list was borrowed from the source files that accompany WordHoard, developed at Northwestern University. I should also warn potential users that these rules are designed to normalize spelling to modern British practice.

The other thing it might be useful to share is a list of 2grams extracted from the Google English corpus, that I use for contextual spellchecking. This includes only 2grams where one of the two elements is a token like “fix” or “flip” that could be read either as a valid word or as an OCR error caused by the long s. Since the long s is also a problem in the Google dataset itself up to 1820, this list was based on frequencies from 1825-50. That’s not perfect for correcting texts in the 1800-1820 period, but I find that in practice it’s adequate. There are two columns here: the 2gram itself, and the frequency.