Problems of scale.

The Artist in Despair Over the Magnitude of Antique Fragments, by Henry Fuseli.

Just a quick note here to acknowledge a collaborative project that I hope will generate some useful resources for scholars interested in text mining. We don’t have many resources up on the website yet, but watch this space.

The project is called The Uses of Scale, and it’s a pilot project for the Humanities Without Walls planning initiative, run by the Illinois Program for Research in the Humanities at the University of Illinois at Urbana-Champaign.

The principal investigators most actively involved in Uses of Scale are Ted Underwood (University of Illinois, Urbana-Champaign), Robin Valenza (University of Wisconsin, Madison), and Matt Wilkens (Notre Dame). All of us have been mining large collections of printed books, ranging from the early modern period to the twentieth century. We’ll be joining forces this year to reflect critically on problems of scale in literary research, including the questions that arise when we try to connect different scales of analysis. But we also hope to generate a few resources that are immediately and practically useful for scholars attempting to “scale up” their research projects (resources, for instance, for correcting OCR). There’s already a bare-bones list of OCR-correction rules on the website, as well as a description of a more ambitious project now underway.

Where to start with text mining.

This post is an outline of discussion topics I’m proposing for a workshop at NASSR2012 (a conference of Romanticists). I’m putting it on the blog since some of the links might be useful for a broader audience.

In the morning I’ll give a few examples of concrete literary results produced by text mining. I’ll start the afternoon workshop by opening two questions for discussion: first, what are the obstacles confronting a literary scholar who might want to experiment with quantitative methods? Second, how do those methods actually work, and what are their limits?

I’ll also invite participants to play around with a collection of 818 works between 1780 and 1859, using an R program I’ve provided for the occasion. Links for these materials are at the end of this post.

I. HOW DIFFICULT IS IT TO GET STARTED?
There are two kinds of obstacles: getting the data you need, and getting the digital skills you need.

1. Is it really necessary to have a large collection of texts?
This is up for debate. But I tend to think the answer is “yes.”

Not because bigger is better, or because “distant reading” is the new hotness. It’s still true that a single passage, perceptively interpreted, may tell us more than a thousand volumes.

But if you want to interpret a single passage, you fortunately already have a wrinkled protein sponge that will do a better job than any computer. Quantitative analysis starts to make things easier only when we start working on a scale where it’s impossible for a human reader to hold everything in memory. Your mileage may vary, but I’d say, more than ten books?

And actually, you need a larger collection than that, because quantitative analysis tends to require context before it becomes meaningful. It doesn’t mean much to say that the word “motion” is common in Wordsworth, for instance, until we know whether “motion” is more common in his works than in other nineteenth-century poets. So yes, text-mining can provide clues that lead to real insights about a single author or text. But it’s likely that you’ll need a collection of several hundred volumes, for comparison, before those clues become legible.

Words that are consistently more common in works by William Wordsworth than in other poets from 1780 to 1850. I’ve used Wordle’s graphics, but the words have been selected by a Mann-Whitney test, which measures overrepresentation relative to a context — not by Wordle’s own (context-free) method. See the R script at the end of this post.

This isn’t to deny that there are interesting things that can be done digitally with a single text: digital editing, building timelines and maps, and so on. I just doubt that quantitative analysis adds much value at that scale. (And to give credit where it’s due: Mark Olsen was saying all this back in the 90s — see References.)

2. So, where do I get all those texts?
That’s what I was asking myself 18 months ago. A lot of excitement about digital humanities is premised on the notion that we already have large collections of digitized sources waiting to be used. But it’s not true, because page images are not the same thing as clean, machine-readable text.

If you’re interested in twentieth-century secondary sources, the JSTOR Data for Research API can probably get you what you need. Primary sources are a harder problem. In our own (Romantic) era, optical character recognition (OCR) is unreliable. The ratio of words transcribed accurately ranges from around 80% to around 98%, depending on print quality and typographical quirks like the notorious “long s.” For a lot of text-mining purposes, 95% might be fine, if the errors were randomly distributed. But they’re not random: errors cluster in certain words and periods.

What you see in a page image.

The problem can be addressed in several different ways. There are a few collections (like ECCO-TCP and the Brown Women Writers Project) that transcribe text manually. That’s an ideal solution, but coverage of that kind is stronger in the eighteenth than the nineteenth century.

What you may see as OCR.

So Jordan Sellers and I have supplemented those collections by automatically correcting 19c OCR that we got from the Internet Archive. Our strategy involved statistically cautious, period-specific spellchecking, combined with enough reasoning about context to realize that “mortal fin” is probably “mortal sin,” even though “fin” is a correctly spelled word. It’s not a perfect solution, but in our period it works well enough for text-mining purposes. We have corrected about 2,000 volumes this way, and are happy to share our texts and metadata, as well as the spellchecker itself (once I get it packaged well enough to distribute). I can give you either a zip file containing the 19c texts themselves, or a tab-separated file containing docIDs, words, and word counts for the whole collection. In either scheme, the docIDs are keyed to this metadata file.
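To make the “mortal fin” example concrete, here is a minimal Python sketch of context-aware correction for long-s errors. It is not the corrector described above: the bigram counts are invented toy numbers standing in for a period-specific language model, and the function names and the `margin` threshold are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical toy counts standing in for a period-specific language model.
# A real corrector would estimate these from a large, clean collection.
BIGRAM_COUNTS = Counter({
    ("mortal", "sin"): 120,
    ("dorsal", "fin"): 15,
})

def long_s_variants(word):
    """Return every spelling obtained by swapping one 'f' for 's'."""
    return [word[:i] + "s" + word[i + 1:]
            for i, ch in enumerate(word) if ch == "f"]

def correct_token(previous, word, margin=5):
    """Prefer an f->s variant only when the context supports it strongly.

    `margin` is an assumed caution threshold: a variant must follow
    `previous` at least `margin` more times than the original word does.
    """
    original_count = BIGRAM_COUNTS[(previous, word)]
    best, best_count = word, original_count
    for candidate in long_s_variants(word):
        count = BIGRAM_COUNTS[(previous, candidate)]
        if count >= original_count + margin and count > best_count:
            best, best_count = candidate, count
    return best

def correct_text(text):
    """Correct each token using the (already corrected) token before it."""
    tokens = text.lower().split()
    corrected = tokens[:1]
    for word in tokens[1:]:
        corrected.append(correct_token(corrected[-1], word))
    return " ".join(corrected)

print(correct_text("a mortal fin"))    # "fin" is a real word, but context favors "sin"
print(correct_text("the dorsal fin"))  # here the context supports leaving "fin" alone
```

The point of the margin is statistical caution: a correctly spelled word is only replaced when the evidence from context is clearly stronger for the alternative.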

Of course, selecting titles for a collection like this raises intractable questions about representativeness. We tried to maximize diversity while also selecting volumes that seemed to have reached a significant audience. But other scholars may have other priorities. I don’t think it would be useful to seek a single right answer about representativeness; instead, I’d like to see multiple scholars building different kinds of collections, making them all public, and building on each other’s work. Then we would be able to test a hypothesis against multiple collections, and see whether the obvious caveats about representativeness actually make a difference in any given instance.

3. Is it necessary to learn how to program?
I’m not going to try to answer that question, because it’s complex and better addressed through discussion.

I will tell a brief story. I went into this gig thinking that I wouldn’t have to do my own programming, since there were already public toolsets for text-mining (Voyant, MONK, MALLET, TAPoR, SEASR) and for visualization (Gephi). I figured I would just use those.

But I rapidly learned otherwise. Tools like MONK and Voyant taught me what was possible, but they weren’t well adapted for managing a very large collection of texts, and didn’t permit me to make my own methodological innovations. When you start trying to do either of those things, you rapidly need “nonstandard parts,” which means that someone in the team has to be able to program.

That doesn’t have to be a daunting prospect, because the programming involved is of a relatively forgiving sort. It’s not easy, but it’s also not professional software development. So if you want to do it yourself, that’s a plausible aspiration. Alternately, if you want to collaborate with someone, you don’t necessarily need to find “a computer scientist.” A graduate student or fellow humanist who can program will do just fine.

If you do want to learn to program, I would recommend starting with either Python or R. Of the two languages, Python is certainly easier. It’s intuitive, and well-documented, and great for working with text. If you expect to use existing tools (like MALLET), and just need some “glue” to connect them to each other, Python is probably the way to go. R is a more specialized and less intuitive language. But it happens to be specialized in some ways that are useful for text mining. In particular, it has built-in statistical functions, and a built-in plotting/graphing capacity. I’ve used it for the sample exercise that accompanies this post. But if you’re learning to program for the first time, Python might be a better all-around choice, and you could in principle extend it to do everything R does. [Later addition: You could do worse than start with The Programming Historian.]

II. WHAT CAN WE ACTUALLY DO WITH QUANTITATIVE METHODS?
What follows is just a list of elements. Interesting research projects tend to combine several of these elementary operations in ad-hoc ways suited to a particular question. The list of elements runs a little long, so let me cut to the chase: the overall theme I’m trying to convey is that you can build complex arguments on a very simple foundation. Yes, at bottom, text mining is often about counting words. But a) words matter and b) they hang together in interesting ways, like individual dabs of paint that together start to form a picture.

So, to return to the original question: what can we do?

1) Categorize documents. You can “categorize” in several different senses.

    a) Information retrieval: retrieve documents that match a query. This is what you do every time you use a search engine.

    b) (Supervised) classification: a program can learn to correctly distinguish texts by a given author, or learn (with a bit more difficulty) to distinguish poetry from prose, tragedies from history plays, or “gothic novels” from “sensation novels.” (See “Quantitative Formalism,” Pamphlet 1 from the Stanford Literary Lab.) The researcher has to provide examples of different categories, but doesn’t have to specify how to make the distinction: algorithms can learn to recognize a combination of features that is the “fingerprint” of a given category.

    An example of clustering from “Quantitative Formalism,” Allison, Heuser, Jockers, Moretti, and Witmore, Stanford Literary Lab.

    c) (Unsupervised) clustering: a program can subdivide a group of documents using general measures of similarity instead of predetermined categories. This may reveal patterns you don’t expect.

All three of these techniques can achieve amazing results armed with what seems like very crude information about the documents they’re categorizing. We know, intuitively, that merely counting words is not enough to distinguish a tragedy from a history play. But our intuitions are simply wrong — see the lit lab pamphlet I cited above. It turns out that there’s an enormous amount of information contained in relative word frequencies, even if you know nothing about sequence or syntax. As you consider other aspects of text mining, it’s useful to keep this intuitive misfire in mind. Relatively simple statistical techniques often characterize discourse a good deal better than our intuitions would predict.
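As a rough illustration of how far relative word frequencies alone can go, here is a deliberately crude Python sketch of supervised classification. It is not the method used in the Stanford pamphlet: it simply assigns a text to the category whose example texts have the most similar word-frequency profile (by average cosine similarity). The miniature “documents” and all function names are invented for illustration.

```python
import math
from collections import Counter

def relative_frequencies(text):
    """Map each word to its share of the document's total word count."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def cosine(a, b):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(text, examples):
    """Return the label whose example texts are, on average, most similar."""
    target = relative_frequencies(text)
    scores = {}
    for label, texts in examples.items():
        sims = [cosine(target, relative_frequencies(t)) for t in texts]
        scores[label] = sum(sims) / len(sims)
    return max(scores, key=scores.get)

# Invented miniature "documents", for illustration only.
EXAMPLES = {
    "tragedy": ["o death and grief and blood", "my grief my death my tomb"],
    "history": ["the king and his crown and army", "the army of the king"],
}

print(classify("grief and death", EXAMPLES))
print(classify("the king", EXAMPLES))
```

Notice that nothing here knows about word order or syntax. The researcher supplies labeled examples, and the comparison of frequency profiles does the rest.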

2) Contrast the vocabulary of different corpora. In a way, this reverses the logic of classifying documents (1b). Instead of using features to sort documents into categories, you start with two categories of documents and contrast them to identify distinctive features.

For instance, you can discover which words (or phrases) are overrepresented in one author or genre (relative to, say, the rest of nineteenth-century literature). It can admittedly be a challenge to interpret the results: this is a kind of evidence we aren’t accustomed to yet. But lists of overrepresented words can be a fruitful source of critical leads to pursue in more traditional ways.

Beyond identifying distinctive words and phrases, corpora can be compared using metrics chosen for some more specific reason. It’s difficult to give an exhaustive list – but, for instance, the argument I’ve been making about generic differentiation is based on a kind of corpus comparison. As a general think-piece on the topic, I recommend Ben Schmidt’s blog post arguing that comparison is an underused and underrated tool; Schmidt’s taxonomy of text-mining techniques in that post was a strong influence on the taxonomy I’m offering here.

3) Trace the history of particular features (words or phrases) over time. This could be viewed as a special category of corpus comparison, where you’re comparing corpora segmented on the time axis.

The best-known example here would be Google’s ngram viewer. Digital humanists love to criticize the ngram viewer, partly for valid reasons (there’s no way to know what texts are being used). But it has probably been the single most influential application of text mining, so clearly people are finding this simple kind of diachronic visualization useful. A couple of other projects have built on the same dataset, slicing it in different ways. Mark Davies of BYU built an interface that lets you survey the history of collocations. Our team at Illinois built an interface that mines 18-19c correlations in the ngram dataset; it turns out that correlated words have a high likelihood of being related in other ways as well, and these can be intriguing leads: see what words correlate with “delicacy” in our period, for instance. Harvard has built Bookworm, which can be understood as a smaller but more flexible and better-documented version of the ngram viewer (built on the Open Library instead of Google Books).

Words whose frequencies correlate strongly over time are often related in other ways as well. Ngram viewer by Auvil, Capitanu, Heuser and Underwood, based on corrected Google dataset.
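For readers wondering what “mining correlations” amounts to in practice, here is a minimal Python sketch under simple assumptions: each word has a series of relative frequencies (one value per decade), and words are ranked by Pearson correlation with a target word. The numbers are invented, and this is not the code behind any of the interfaces mentioned above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no information about co-variation
    return cov / math.sqrt(var_x * var_y)

def most_correlated(target, series_by_word, top_n=3):
    """Rank other words by how closely their history tracks the target's."""
    scores = [(pearson(series_by_word[target], series), word)
              for word, series in series_by_word.items() if word != target]
    scores.sort(reverse=True)
    return [word for _, word in scores[:top_n]]

# Invented per-decade frequencies (per million words), illustration only.
SERIES = {
    "delicacy":    [50, 48, 40, 30, 22, 15],
    "sensibility": [60, 55, 47, 33, 25, 20],
    "railway":     [0, 0, 2, 15, 40, 70],
    "the":         [60000, 60010, 59990, 60005, 59995, 60000],
}

print(most_correlated("delicacy", SERIES, top_n=1))
```

Correlation is scale-free, so a very common word and a rare one can still be compared; what matters is whether their curves rise and fall together.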

Of special interest to Romanticists: a project that isn’t built on the ngram dataset but that does use diachronic correlation-mining as a central methodology. In Stanford Lit Lab Pamphlet 4, Ryan Heuser and Long Le-Khac have traced some very interesting, strongly correlated changes in novelistic diction over the course of the 19th century.

Finally, anyone who wants to make a diachronic argument about diction should read Ben Schmidt’s simple, elegant experiment peeling apart two different components of change: generational succession and historical change within the diction of a single age-cohort.

4) Cluster features that tend to be associated in a given corpus of documents (aka topic modeling). In a way, this reverses the logic of clustering documents (1c). Instead of grouping documents that tend to share the same words, you group words that tend to appear in the same documents, or parts of documents. This produces something that looks like a semantic map of the period or corpus you’re studying. (It would be more accurate to call it a discursive map, because topics don’t actually have to be unified semantically. They are more analogous to “discourses.”)

There are a lot of ways to cluster features, ranging from older approaches (Latent Semantic Analysis), to the new, hip approach — “Bayesian topic modeling,” which has the advantage that it clusters individual occurrences of words (tokens) instead of word types. As a result, it can distinguish different senses of a word. (Scott Weingart has written a clear and comprehensive introduction to topic modeling for humanists.)

Topic modeling has become justifiably popular for several reasons. First and foremost, a “discursive map” can be a nice thing to have; it lends itself easily to interpretation. Also, frankly, this approach doesn’t require a whole lot of improvisation. You just pour text files into a tool like MALLET, and out come a list of topics, looking meaningful and authoritative. It’s important to remember that topic-modeling is in fact an imprecise process. Slightly different inputs (for instance, a different stopword list) can produce very different outputs.

5) Entity extraction. If you’re mainly interested in proper nouns (personal names or place names, or dates and prices) there are tools like OpenNLP that can extract these from text, using syntactic patterns as clues.

6) Visualization. Perhaps this isn’t technically a form of analysis, but in practice it’s important enough that it deserves to be treated as a separate analytical step. It’s impractical to list all possible forms of visualization here, but for instance, results can be visualized:

Putting things together.
There’s no limit to the number of ways you can combine these different operations. Matt Wilkens has extracted references to named entities from fiction, and then visualized their density geographically. Robert K. Nelson has performed topic modeling on the print run of a Civil-War-era newspaper, and then graphed the frequency of each topic over time. You could go a step further and look for correlations between topics (either over time, or in terms of their distribution over documents). Then you could visualize the relationships between topics as a network.

What’s the goal uniting all this experimentation? I suspect there are two different but equally valid goals. In some cases, we’re going to find patterns that actually function as evidence to support literary-historical arguments. (In a number of the examples cited above, I think that’s starting to happen.) In other cases, text mining may work mainly as an exploratory technique, revealing clues that need to be fleshed out and written up using more traditional critical methods. The boundary between those two applications will be hotly debated for years, so I won’t attempt to define it here.

III. SAMPLE DATA AND SCRIPT FOR EXPLORATION.
I don’t know whether we’ll really have time for this, but I ought to at least offer you a chance to do hands-on stuff. So here’s a medium-sized project.

I’ve created a pre-packaged set of 818 volumes of poetry and fiction between 1780 and 1859, including 243 authors. I can give you first, a metadata file that includes the authors, titles, dates, and so on for each volume, and second, a data file that includes word counts for each volume. (To keep from frying your laptop, I’ve only included the top 9,000 words in the collection. But actually that’s a lot.)

Finally, I’ve provided an R script that will let you define different chunks of the collection and compare them against each other, to identify words that are significantly overrepresented in a given author, genre, or period. The script will try two different measures of “overrepresentation”: the first, “log-likelihood,” is based on the aggregate frequency of words in the corpus you selected, adding all the volumes in the corpus together. The second, “Mann-Whitney rho,” tries to locate words that are consistently more common in corpus X by paying attention to individual volumes. For more on how that works, see this blog post.
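The difference between the two measures may be easier to see in code. What follows is a Python sketch, not a translation of the R script: the log-likelihood function uses the common two-term form of Dunning's G2 statistic, and “rho” is taken here to mean the Mann-Whitney U statistic divided by the number of volume pairs (the chance that a random volume from corpus X uses the word more than a random volume from corpus Y, with ties counting half). Both readings are assumptions.

```python
import math

def log_likelihood(count_x, count_y, total_x, total_y):
    """Dunning-style G2 for one word, pooling all volumes in each corpus.

    count_* are occurrences of the word; total_* are corpus sizes in words.
    """
    combined = count_x + count_y
    expected_x = total_x * combined / (total_x + total_y)
    expected_y = total_y * combined / (total_x + total_y)
    g2 = 0.0
    for observed, expected in ((count_x, expected_x), (count_y, expected_y)):
        if observed > 0:  # treat 0 * log(0) as 0
            g2 += observed * math.log(observed / expected)
    return 2 * g2

def mann_whitney_rho(freqs_x, freqs_y):
    """Share of (x, y) volume pairs in which x has the higher frequency.

    freqs_* hold one relative frequency per volume. Ties count as half a
    win, so 0.5 means "no consistent difference" and 1.0 means every
    volume in X outranks every volume in Y.
    """
    wins = 0.0
    for x in freqs_x:
        for y in freqs_y:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(freqs_x) * len(freqs_y))

# A word concentrated in a single volume: aggregate counts look dramatic,
# but volume by volume the difference is not consistent.
print(log_likelihood(500, 0, 40000, 40000))
print(mann_whitney_rho([0.05, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]))
```

In the toy example at the end, one volume that uses a word obsessively produces a large log-likelihood score, while rho stays close to 0.5 because most volumes in the two corpora are indistinguishable.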

Of course, the R script won’t work until you download R and open the script from within R. Please understand that this is a very rough, ad-hoc piece of work for this one occasion, not a polished piece of software that I expect people to use for the long term.

Postscript about the word “mining.”
I know it has an industrial sound; I know humanists like “analysis” more. But I’m sticking with the mining metaphor on the principle of truth in advertising. I think that word accurately conveys the scale of this enterprise, and the fact that it’s often more exploratory than probative. Besides, “mining” is vivid, and that has its own sort of humanistic value.

References (that aren’t already implicit in links)
Mark Olsen, “Signs, Symbols, and Discourses: A New Direction for Computer-Aided Literature Studies” Computers and the Humanities 27 (1993): 309-314.

Getting everything you want from HathiTrust.

NYPL, photo courtesy Alex Proimos

A fair number of scholars would like to work on large digital collections, but aren’t entirely sure where to get them.

For people who work on text after, say, 1700, I’d like to briefly make a case for HathiTrust. I’m a few months into a project based on 800,000 volumes, collaborating with Mike Black, an English Ph.D. student and extraordinary Python programmer. We decided to get our collection from HathiTrust, and it’s a decision I haven’t regretted. In terms of sheer numbers, I don’t know whether they’re larger than, say, the Internet Archive. But their collection has some subtle details that I’ve come to greatly appreciate.

For one thing, they divide documents into individual page files. At first this may seem like a pain (you want a file, right, not a folder of files?) But in fact it’s a significant advantage to have that hard-coded representation of page breaks. It has made it possible for Mike to design a Python script that a) recognizes running headers at the tops of pages b) uses them to make a reasonable guess about chapters and other document divisions and then c) removes the headers, which can otherwise throw a wrench in your topic model.
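Here is a minimal Python sketch of why page-level files help. It is not Mike's script, and it covers only steps (a) and (c): it treats the first non-blank line of each page as a candidate header, ignores page numbers when comparing candidates, and removes any candidate that recurs on a large share of pages. The 30% threshold and the function names are assumptions.

```python
import re
from collections import Counter

def normalize(line):
    """Collapse digits and case so 'THE MONK. 12' and 'THE MONK. 13' match."""
    return re.sub(r"\d+", "#", line.strip().lower())

def find_running_headers(pages, min_share=0.3):
    """Return normalized first lines that recur on at least `min_share` of pages.

    `pages` is a list of page texts; `min_share` is an assumed threshold.
    A line must also appear on at least two pages to count as a header.
    """
    first_lines = Counter()
    for page in pages:
        lines = [ln for ln in page.splitlines() if ln.strip()]
        if lines:
            first_lines[normalize(lines[0])] += 1
    threshold = max(2, min_share * len(pages))
    return {line for line, n in first_lines.items() if n >= threshold}

def strip_running_headers(pages, min_share=0.3):
    """Drop the first non-blank line of each page if it is a running header."""
    headers = find_running_headers(pages, min_share)
    cleaned = []
    for page in pages:
        lines = page.splitlines()
        for i, ln in enumerate(lines):
            if ln.strip():
                if normalize(ln) in headers:
                    del lines[i]
                break  # only the first non-blank line is a candidate
        cleaned.append("\n".join(lines))
    return cleaned

pages = [
    "THE MONK. 12\nIt was a dark night.",
    "THE MONK. 13\nAmbrosio trembled.",
    "CHAPTER II.\nThe abbey stood silent.",
    "THE MONK. 15\nShe fled.",
]
print(strip_running_headers(pages))
```

A first line that appears only once (like “CHAPTER II.” above) survives, which is also the kind of line one might use as a clue to chapter divisions.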

Also, the HathiTrust API is solid and well documented. If you request a large dataset from them, you will get metadata with it. But the availability of the bibliographic API can still be a significant benefit. (By the way, re: metadata — ask them to give you the complete .json record, not just the marc-record part of the json.)

For small numbers of texts, you could in fact get the text itself from the data API. But this is not recommended for a big collection. Instead you’re going to want to write Hathi and request that they construct a dataset for you, based on facets that would be available in their Advanced Search feature. Once they build it — which could take a few weeks to a month — you can send them a hard drive or download data through rsync. (I initially found rsync perplexing, but after the nice people at Hathi gave me precise instructions, it was easy.) Using rsync through my campus office connection, it took about two days to transfer 800,000 volumes, which consumed a little less than 1TB of disk space. It would have been slower if I had tried to do it at home through commercial broadband and an AirPort.

There is a lot of time involved simply in moving data around, and in part I’m writing this post to warn people about that. One really basic point that took me a while to figure out: do not try to unzip the files. Part of the reason why it’s slow to move a large collection is that separate files require your i/o to do a lot of starting and stopping. That’s hard enough with (say) 500,000 separate zipped document folders. If you unzip those documents and get 165 million separate page files, it becomes very hard indeed. I actually spent more than a week unzipping the collection, and about a week trying to move it from one drive to another, only to get a disk error halfway through the process that required a reformat.

Mothers, teach your children not to do as I have done. Just use the Python module zipfile, which works directly with the .zip file. It takes Python a few tenths of a second to extract the data, but it’s much better than trying to move 165 million individual pages. H/t to Loretta Auvil, by the way, for convincing me that this was simpler.
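For anyone who hasn't used it, reading pages straight out of a zipped volume with Python's standard `zipfile` module looks roughly like this. The sketch assumes that page files end in `.txt`, are UTF-8 encoded, and have names that sort into page order; those are assumptions about the archive layout, so check them against your own data.

```python
import zipfile

def iter_pages(zip_path):
    """Yield (member_name, text) for each .txt page, in sorted name order.

    Nothing is written to disk: each page is read from the archive
    into memory and decoded.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for name in sorted(archive.namelist()):
            if name.endswith(".txt"):
                raw = archive.read(name)
                yield name, raw.decode("utf-8", errors="replace")

def read_volume(zip_path):
    """Join all pages of one zipped volume into a single string."""
    return "\n".join(text for _, text in iter_pages(zip_path))
```

Calling `read_volume("some_volume.zip")` (a placeholder path) returns the whole text while leaving the archive intact, so the collection stays at one file per volume rather than hundreds of files per volume.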

I’m going to try to make available the Python scripts and lexica that Mike and I design for working with the collection. There are

a) Simple logistical issues, like navigating the pairtree folder structure where files are stored and extracting them from .zip.
b) Metadata issues, like normalizing dates of publication that can be “1871” or “[18--]”.
c) Document-format issues, like running headers and page numbers.
d) OCR issues, which are the really fun ones as far as I’m concerned.
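To give a flavor of issue (b), here is a small Python sketch of date normalization. It is not our code, just one plausible approach: instead of forcing an uncertain date like “[18--]” into a single year, it returns the earliest and latest years the string allows. The function name and the decision to require two known leading digits are assumptions.

```python
import re

# Two known digits followed by two characters that are digits or hyphens,
# which covers forms like "1871", "[187-?]" and "[18--]".
DATE_PATTERN = re.compile(r"\d{2}[\d-]{2}")

def normalize_date(raw):
    """Return (earliest, latest) possible years for a catalog date, or None."""
    match = DATE_PATTERN.search(raw)
    if match is None:
        return None
    token = match.group(0)
    earliest = int(token.replace("-", "0"))  # unknown digits as low as possible
    latest = int(token.replace("-", "9"))    # unknown digits as high as possible
    return earliest, latest

for raw in ["1871", "[18--]", "[187-?]", "n.d."]:
    print(raw, normalize_date(raw))
```

Keeping a range rather than a guess lets each researcher decide how to treat uncertain volumes, for example by excluding any volume whose range spans more than a decade.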

We’ve written pieces of all of this, and (a) through (c) are working, but it’s not yet in beta (to put it mildly). However, if you’re grappling with a similar problem, drop me a line and I’ll send you our code, such as it is. Development of this code was supported by the Andrew W. Mellon Foundation.

I’d also like to encourage everyone who’s interested in these kinds of problems to attend the HathiTrust Research Center UnCamp in Indiana this September (pre-register by August 1). This should be particularly useful if you’re interested in working on collections after 1923. HTRC has begun to design an infrastructure that will permit non-consumptive or non-expressive research on texts without transmitting the text itself to the researcher — obviously a crucial part of the solution to the problem of research on copyrighted text. They hope to demo parts of that infrastructure in September — but if you show up, you also have a fair chance of getting input on the design of the final version.

It’s the data: a plan of action.


I’m still fairly new at this gig, so take the following with a grain of salt. But the more I explore the text-mining side of DH, the more I wonder whether we need to rethink our priorities.

Over the last ten years we’ve been putting a lot of effort into building tools and cyberinfrastructure. And that’s been valuable: projects like MONK and Voyant play a crucial role in teaching people what’s possible. (I learned a lot from them myself.) But when I look around for specific results produced by text-mining, I tend to find that they come in practice from fairly simple, ad-hoc tools, applied to large datasets.

Ben Schmidt’s blog Sapping Attention is a good source of examples. Ben has discovered several patterns that really have the potential to change disciplines. For instance, he’s mapped the distribution of gender in nineteenth-century collections, and assessed the role of generational succession in vocabulary change. To do this, he hasn’t needed natural language processing, or TEI, or even topic modeling. He tends to rely on fairly straightforward kinds of corpus comparison. The leverage he’s getting comes ultimately from his decision to go ahead and build corpora as broad as possible using existing OCR.

I think that’s the direction to go right now. Moreover, before 1923 it doesn’t require any special agreement with publishers. There’s a lot of decent OCR in the public domain, because libraries can now produce cleaner copy than Google used to. Yes, some cleanup is still needed: running headers need to be removed, and the OCR needs to be corrected in period-sensitive ways. But it’s easier than people think to do that reliably. (You get a lot of clues, for instance, from cleaning up a whole collection at once. That way, the frequency of a particular form across the collection can help your corrector decide whether it’s an OCR error or a proper noun.)

In short, I think we should be putting a bit more collective effort into data preparation. Moreover, it seems to me that there’s a discernible sweet spot between vast collections of unreliable OCR and small collections of carefully-groomed TEI. What we need are collections in the 5,000 – 500,000 volume range, cleaned up to at least (say) 95% recall and 99% precision. Precision is more important than recall, because false negatives drop out of many kinds of analysis — as long as they’re randomly distributed (i.e. you can’t just ignore the f/s problem in the 18c). Collections of that kind are going to generate insights that we can’t glimpse as individual readers. They’ll be especially valuable once we enrich the metadata with information about (for instance) genre, gender, and nationality. I’m not confident that we can crowdsource OCR correction (it’s an awful lot of work), but I am confident that we could crowdsource some light enrichment of metadata.

So this is less a manifesto than a plan of action. I don’t think we need a center or a grant for this kind of thing: all we need is a coalition of the willing. I’ve asked HathiTrust for English-language OCR in the 18th and 19th centuries; once I get it, I’ll clean it up and make the cleaned version publicly available (as far as legally possible, which I think is pretty far). Then I’ll invite researchers to crowdsource metadata in some fairly low-tech way, and share the enriched metadata with everyone who participated in the crowdsourcing.

I would eagerly welcome suggestions about the kinds of metadata we ought to be recording (for instance, the genre categories we ought to use). Questions about selection/representativeness are probably better handled by individual researchers; I don’t think it’s possible to define a collective standard on that point, because people have different goals. Instead, I’ll simply take everything I can get, measure OCR quality, and allow people to define their own selection criteria. Researchers who want to produce a specific balance between X and Y can always do that by selecting a subset of the collection, or by combining it with another collection of their own.

The obvious thing we’re lacking.

I love Karen Coyle’s idea that we should make OCR usable by identifying the best-available copy of each text. It’s time to start thinking about this kind of thing. Digital humanists have been making big claims about our ability to interpret large collections. But — outside of a few exemplary projects like EEBO-TCP — we really don’t have free access to the kind of large, coherent collections that our rhetoric would imply. We’ve got feet of clay on this issue.

Do you think the 'ct' ligature looks a little like an ampersand? Well, so do OCR engines.


Moreover, this wouldn’t be a difficult problem to address. I think it can be even simpler than Coyle suggests. In many cases, libraries have digitized multiple copies of a single edition. The obvious, simple thing to do is just:

    Measure OCR quality — automatically, using a language model rather than ground truth — and associate a measurement of OCR quality with each bibliographic record.

This simple metric would save researchers a huge amount of labor, because a scholar could use an API to request “all the works you have between 1790 and 1820 that are above 90% probable accuracy” or “the best available copy of each edition in this period,” making it much easier to build a meaningfully normalized corpus. (This may be slightly different from Coyle’s idea about “urtexts,” because I’m talking about identifying the best copy of an edition rather than the best edition of a title.) And of course a metric destroys nothing: if you want to talk about print culture without filtering out poor OCR, all that metadata is still available. All this would do is empower researchers to make their own decisions.
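Here is a minimal Python sketch of what such a measurement and request could look like. Everything in it is a stand-in: the “language model” is reduced to the crudest possible proxy (the share of tokens found in a lexicon), the lexicon is a toy, and the record fields `edition_id`, `copy_id`, and `text` are hypothetical names rather than fields from any real repository's API.

```python
import re

# Hypothetical miniature lexicon; a real one would be large and period-specific.
LEXICON = {"the", "mortal", "sin", "of", "man", "was", "great", "and", "his"}

def estimated_accuracy(text, lexicon=LEXICON):
    """Share of alphabetic tokens found in the lexicon (0.0 for empty text).

    No ground-truth transcription is needed, which is the point.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in lexicon for t in tokens) / len(tokens)

def best_copies(records, min_accuracy=0.0):
    """Pick the highest-scoring copy of each edition.

    `records` is a list of dicts with the hypothetical keys
    'edition_id', 'copy_id', and 'text'. Editions with no copy at or
    above `min_accuracy` are left out of the result.
    """
    best = {}
    for rec in records:
        score = estimated_accuracy(rec["text"])
        current = best.get(rec["edition_id"])
        if score >= min_accuracy and (current is None or score > current[1]):
            best[rec["edition_id"]] = (rec["copy_id"], score)
    return best

records = [
    {"edition_id": "e1", "copy_id": "a", "text": "the mortal fin of man"},
    {"edition_id": "e1", "copy_id": "b", "text": "the mortal sin of man"},
    {"edition_id": "e2", "copy_id": "c", "text": "tbe grcat"},
]
print(best_copies(records))
print(best_copies(records, min_accuracy=0.9))
```

The two calls at the end correspond to the two kinds of request imagined above: “the best available copy of each edition” and “only works above 90% probable accuracy.” Because the score is stored alongside the record rather than used to discard anything, the unfiltered collection remains available.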

One could go even further, and construct a “Frankenstein” edition by taking the best version of each page in a given edition. Or one could improve OCR with post-processing. But I think those choices can be left to individual research projects and repositories. The only part of this that really does need to be a collective enterprise is an initial measurement of OCR quality that gets associated with each bibliographic record and exposed to the API. That measurement would save research assistants thousands of hours of labor picking “the cleanest version of X.” I think it’s the most obvious thing we’re lacking.

[Postscript: Obviously, researchers can do this for themselves by downloading everything in period X, measuring OCR quality, and then selecting copies accordingly. In fact, I'm getting ready to build that workflow this summer. But this is going to take time and consume a lot of disk space, and it's really the kind of thing an API ought to be doing for us.]

Big but not distant.

Big data. I’m tempted to begin “I, too, dislike it,” because the phrase has become a buzzword. To mainstream humanists, it sounds like a perversion. Even people who work in digital humanities protest that DH shouldn’t be normatively identified with big data — and I agree — so generally I keep quiet on the whole vexed question.

Except … there are a lot of grad students out there just starting to look at DH curiously, wondering whether it offers anything useful for their own subfield. In that situation, it’s natural to start by building a small collection that addresses a specific research problem you know about. And that might, in many cases, be a fine approach! But my conscience is nagging at me, because I can see some other, less obvious opportunities that students ought to be informed about.

It’s true that DH doesn’t have to be identified with scale. But the fact remains that problems of scale constitute a huge blind spot for individual researchers, and also define a problem that we know computers can help us explore. And when you first go into an area that was a blind spot for earlier generations of scholars, you’re almost guaranteed to find research opportunities — lying out on the ground like lumps of gold you don’t have to mine.

I'm just saying.


This suggests that it might be a mistake to assume that the most cost-effective way to get started in DH is to define a small collection focused on a particular problem you know about. It might actually be a better strategy to beg, borrow, or steal a large collection — and poke around in it for problems we don’t yet know about.

“But I’m not interested in big statistical generalizations; I care about describing individual works, decades, and social problems.” I understand; that’s a valid goal; but it’s not incompatible with the approach I’m recommending. I think it’s really vital that we do a better job of distinguishing “big data” (the resource) from “distant reading” (a particular interpretive strategy).* Big data doesn’t have to produce distant generalizations; we can use the leverage provided by scale and comparative analysis to crack open small and tightly-focused questions.

I don’t think most humanists have an intuitive grasp of how that “leverage” would work — but topic modeling is a good example. As I play around with topic-modeling large collections, I’m often finding that the process tells me interesting things about particular periods, genres, or works, by revealing how they differ from other relevant points of comparison. Topic modeling doesn’t use scale to identify a “trend” or an “average,” after all; what it does is identify the most salient dimensions of difference in a given collection. If you believe that the significance of a text is defined by its relation to context, then you can see how topic modeling a collection might help us crack open the (relational) significance of individual works.

“But how do we get our hands on the data?” Indeed: there’s the rub. Miriam Posner has recently suggested that the culture surrounding “coding” serves as a barrier that discourages women and minorities from entering certain precincts of DH. I think that’s right, but I’m even more concerned about the barriers embodied in access to data. Coding is actually not all that hard to pick up. Yes, it’s surrounded by gendered assumptions; but still, you can do it over a summer. [Update: Or, where that's not practical, you can collaborate with someone. At Illinois, Loretta Auvil and Boris Capitanu do kinds of DH programming that are beyond me. I don't mean to minimize issues of gender here, but I do mean to put "coding" in perspective. It's not a mysterious, magical key.] By contrast, none of us can build big data on our own (or even in small teams) over the summer. If we don’t watch out, our field could easily slip into a situation where power gravitates to established scholars at large/wealthy research universities.

I’ve tried to address that by making my own data public. I haven’t documented it very well yet, but give me a few weeks. I think peer pressure should be exerted on everyone (especially established scholars) to make their data public at the time of publication. I do understand that some kinds of data can’t be shared because they’re owned by private enterprise. I accept that. But if you’ve supplemented proprietary data with other things you’ve produced on your own, then in my opinion that supplementary data should be made public at the time of publication.

Moreover, if you do that, I’m not going to care very much about the mistakes you have made in building your collection. I may think your data is completely biased and unrepresentative, because it includes too much Y and not enough X. But if so, I have an easy solution — which is to take your data, add it to my own collection of X, and other data borrowed from Initiative Z, and then select whatever subset would in my opinion create a balanced and representative collection. Then I can publish my own article correcting your initial, biased result.

Humanists are used to approaching debates about historical representation as if they were zero-sum questions. I suppose we are on some level still imagining this as a debate about canonicity — which is, as John Guillory pointed out, really a debate about space on the syllabus. Space on the syllabus is a zero-sum game. But the process of building big data is not zero-sum; it is cumulative. Every single thing you digitize is more good news for me, even if I shudder at the tired 2007-vintage assumptions implicit in your research agenda.

Personally, I feel the same way about questions of markup and interoperability. It’s all good. If you can give me clean** ASCII text files with minimal metadata, I love you. If you can give me TEI with enriched metadata, I love you. I don’t want to waste a lot of breath arguing about which standard is better. In most cases, clean ASCII text would be a lot better than what I can currently get.

* I hasten to say that I’m using “distant reading” here as the phrase is commonly deployed in debate — not as Franco Moretti originally used it — because the limitation I’m playing on is not really present in Moretti’s own use of the term. Moretti pointedly emphasizes that the advantage of a distant perspective may be to reveal the relational significance of an individual work.

** And, when I say “clean” — I will definitely settle for a 5% error rate.

References
Guillory, John. Cultural Capital. Chicago: U. of Chicago Press, 1993.
Moretti, Franco. Graphs, Maps, Trees. New York: Verso, 2005.

[UPDATE: For a different perspective on the question of representativeness, see Katherine D. Harris on Big Data, DH, and Gender. Also, see Roger Whitson, who suggests that linked open data may help us address issues of representation.]

Literary and nonliterary diction, the sequel.

In my last post, I suggested that literary and nonliterary diction seem to have substantially diverged over the course of the eighteenth and nineteenth centuries. The vocabulary of fiction, for instance, becomes less like nonfiction prose at the same time as it becomes more like poetry.

It’s impossible to interpret a comparative result like this purely as evidence about one side of the comparison. We’re looking at a process of differentiation that involves changes on both sides: the language of nonfiction and fiction, for instance, may both have specialized in different ways.

This post is partly a response to very helpful suggestions I received from commenters, both on this blog and at Language Log. It’s especially a response to Ben Schmidt’s effort to reproduce my results using the Bookworm dataset. I also try two new measures of similarity toward the end of the post (cosine similarity and etymology) which I think interestingly sharpen the original hypothesis.

I have improved my number-crunching in four main ways (you can skip these if you’re bored):

1) In order to normalize corpus size across time, I’m now comparing equal-sized samples. Because the sample sizes are small relative to the larger collection, I have been repeating the sampling process five times and averaging results with a Fisher’s r-to-z transform. Repeated sampling doesn’t make a huge difference, but it slightly reduces noise.

2) My original blog post used 39-year slices of time that overlapped with each other, producing a smoothing effect. Ben Schmidt persuasively suggests that it would be better to use non-overlapping samples, so in this post I’m using non-overlapping 20-year slices of time.

3) I’m now running comparisons on the top 5,000 words in each pair of samples, rather than the top 5,000 words in the collection as a whole. This is a crucial and substantive change.

4) Instead of plotting a genre’s similarity to itself as a flat line of perfect similarity at the top of each plot, I plot self-similarity between two non-overlapping samples selected randomly from that genre. (Nick Lamb at Language Log recommended this approach.) This allows us to measure the internal homogeneity of a genre and use it as a control for the differentiation between genres.
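The four steps above can be sketched roughly as follows. This is not the R code behind the plots; it is a minimal illustration that assumes each genre and period is available as a flat list of word tokens, that samples are drawn at the level of individual tokens, and that the "top 5,000 words in each pair of samples" means the most frequent words in the two samples combined. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


def spearman_top_words(tokens_a, tokens_b, top_n=5000):
    """Spearman correlation over the top_n words of the two samples combined."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocab = [word for word, _ in (counts_a + counts_b).most_common(top_n)]
    xs = [counts_a[word] for word in vocab]
    ys = [counts_b[word] for word in vocab]
    # Spearman's rho is Pearson's r computed on ranks.
    return pearson(average_ranks(xs), average_ranks(ys))


def fisher_mean(correlations):
    """Average correlations via Fisher's r-to-z transform, then map back."""
    # Clamp so that atanh never receives exactly +1 or -1.
    zs = [math.atanh(max(-0.999999, min(0.999999, r))) for r in correlations]
    return math.tanh(sum(zs) / len(zs))


def sampled_similarity(corpus_a, corpus_b, sample_size, repeats=5, top_n=5000, seed=0):
    """Compare equal-sized random samples from two genres, repeated and averaged."""
    rng = random.Random(seed)
    rs = []
    for _ in range(repeats):
        sample_a = rng.sample(corpus_a, sample_size)
        sample_b = rng.sample(corpus_b, sample_size)
        rs.append(spearman_top_words(sample_a, sample_b, top_n))
    return fisher_mean(rs)


def self_similarity(corpus, sample_size, repeats=5, top_n=5000, seed=0):
    """Compare two non-overlapping random samples drawn from one genre."""
    rng = random.Random(seed)
    rs = []
    for _ in range(repeats):
        drawn = rng.sample(corpus, 2 * sample_size)
        rs.append(spearman_top_words(drawn[:sample_size], drawn[sample_size:], top_n))
    return fisher_mean(rs)
```

Running `sampled_similarity` once per 20-year slice would give the between-genre line, and `self_similarity` the control line for a single genre.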

Briefly, I think the central claims I was making in my original post hold up. But the constraints imposed by this newly-rigorous methodology have forced me to focus on nonfiction, fiction, and poetry. Our collections of biography and drama simply aren’t large enough yet to support equal-sized random samples across the whole period.

Here are the results for fiction compared to nonfiction, and nonfiction compared to itself.


This strongly supports the conclusion that fiction was becoming less like nonfiction, but also reveals that the internal homogeneity of the nonfiction corpus was decreasing, especially in the 18c. So some of the differentiation between fiction and nonfiction may be due to the internal diversification of nonfiction prose.

By contrast, here are the results for poetry compared to fiction, and fiction compared to itself.

Poetry and fiction are becoming more similar in the period 1720-1900. I should note that I’ve dropped the first datapoint, for the period 1700-1719, because it seemed to be an outlier. Also, we’re using a smaller sample size here, because my poetry collection won’t support 1 million word samples across the whole period. (We have stripped the prose introduction and notes from volumes of poetry, so they’re small.)

Another question that was raised, both by Ben and by Mark Liberman at Language Log, involved the relationship between “diction” and “topical content.” The Spearman correlation coefficient gives common and uncommon words equal weight, which means (in effect) that it makes no effort to distinguish style from content.

But there are other ways of contrasting diction, and I thought I might try them, because I wanted to figure out how much of the growing distance between fiction and nonfiction was due simply to the topical differentiation of nonfiction in this period. So in the next graph, I measure the cosine similarity between million-word samples of fiction and samples of nonfiction, and compare it to the cosine similarity between distinct samples of nonfiction. Because cosine similarity works on raw frequencies rather than ranks, it is a measure that, in effect, gives more weight to common words.
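For readers who haven't met the measure, here is a minimal sketch of cosine similarity over raw word counts. It assumes each sample is a plain list of tokens; the function name is illustrative, and this is not the code used for the graph.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two raw word-frequency vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sample has no direction to compare
    return dot / (norm_a * norm_b)
```

A toy case shows why common words dominate. Two ten-word samples that share nine occurrences of "the" and differ only in their last word ("soul" versus "ship") score 81/82, or about 0.99, because the products of large counts swamp everything else.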


I was surprised by this result. When I get very stable numbers for any variable I usually assume that something is broken. But I ran this twice, and used the same code to make different comparisons, and the upshot is that samples of nonfiction really are very similar to other samples of nonfiction in the same period (as measured by cosine similarity). I assume this is because the growing topical heterogeneity that becomes visible in Spearman’s correlation makes less difference to a measure that focuses on common words. Fiction is much more diverse internally by this measure — which makes sense, frankly, because the most common words can be totally different in first-person and third-person fiction. But — to return to the theme of this post — the key thing is that there’s a dramatic differentiation of fiction and nonfiction in this period. Here, by contrast, are the results for nonfiction and poetry compared to fiction, as well as fiction compared to itself.

This graph is a little wriggly, and the underlying data points are pretty bouncy — because fiction is internally diverse when measured by cosine similarity, and it makes a rather bouncy reference point. But through all of that I think one key fact does emerge: by this measure, fiction looks more similar to nonfiction prose in the eighteenth century, and more similar to poetry in the nineteenth.

There’s a lot more to investigate here. In my original post I tried to identify some of the words that became more common in fiction as it became less like nonfiction. I’d like to run that again, in order to explain why fiction and poetry became more similar to each other. But I’ll save that for another day. I do want to offer one specific metric that might help us explain the differentiation of “literary” and “nonliterary” diction: the changing etymological character of the vocabulary in these genres.


Measuring the ratio of “pre-1150” to “post-1150” words is roughly like measuring the ratio of “Germanic” to “Latinate” diction, except that there are a number of pre-1150 words (like “school” and “wall”) that are technically “Latinate.” So this is essentially a way of measuring the relative “familiarity” or “informality” of a genre (Bar-Ilan and Berman 2007). (This graph is based on the top 10k words in the whole collection. I have excluded proper nouns, words that entered the language after 1699, and stopwords: determiners, pronouns, conjunctions, and prepositions.)
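A ratio of this kind could be computed along the following lines. The sketch assumes a lookup table mapping each word to the year it entered the language, which would have to come from a dictionary source not specified here, and it assumes the ratio is weighted by token counts rather than by distinct words. Both are assumptions, as are the function and parameter names.

```python
from collections import Counter


def pre_post_ratio(tokens, first_attested, stopwords=frozenset(),
                   cutoff=1150, latest=1699):
    """Ratio of pre-cutoff to post-cutoff word occurrences in a sample.

    first_attested: hypothetical dict mapping word -> year of entry.
    Words that are undated, in stopwords, or later than `latest` are skipped.
    """
    pre = post = 0
    for word, count in Counter(tokens).items():
        year = first_attested.get(word)
        if word in stopwords or year is None or year > latest:
            continue
        if year < cutoff:
            pre += count
        else:
            post += count
    return pre / post if post else float("inf")
```

Restricting `tokens` to the top 10k words of the collection, and leaving proper nouns out of `first_attested`, would bring the sketch closer to the exclusions listed above.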

I think this graph may help explain why we have the impression that literary language became less specialized in this period. It may indeed have become more informal — perhaps even closer to the spoken language. But in doing so it became more distinct from other kinds of writing.

I’d like to thank everyone who responded to the original post: I got a lot of good ideas for collection development as well as new ways of slicing the collection. Katherine Harris, for instance, has convinced me to add more women writers to the collection; I’m hoping that I can get texts from the Brown Women Writers Project. This may also be a good moment to reiterate that the nineteenth-century part of the collection I’m working with was selected by Jordan Sellers, and these results should be understood as built on his research. Finally, I have put the R code that I used for most of these plots in my Open Data page, but it’s ugly and not commented yet; prettier code will appear later this weekend.

References
Laly Bar-Ilan and Ruth A. Berman, “Developing register differentiation: the Latinate-Germanic divide in English,” Linguistics 45 (2007): 1-35.

The challenges of digital work on early-19c collections.

I’ve been posting mostly about collections built by other people (TCP-ECCO and Google). But I’m also in the process of building a small (thousand-title) 19c collection myself, in collaboration with E. Jordan Sellers. Jordan is selecting titles for the collection; I’m writing the Python scripts that process the texts. This is a modest project intended to support research for a few years, not a model for long-term curatorial practice. But we’ve encountered a few problems specific to the early 19c, and I thought I might share some of our experience and tools in case they’re useful for other early-19c scholars.

Literary and Characteristical Lives (1800), by William and Alexander Smellie. Note esp. the ligatures in 'first' and 'section.'


I originally wanted to create a larger collection, containing twenty or thirty thousand volumes, on the model of Ben Schmidt’s impressive work with nineteenth-century volumes vacuumed up from the Open Library. But because I needed a collection that bridged the eighteenth and nineteenth centuries, I found I had to proceed more slowly. The eighteenth century itself wasn’t the problem. Before 1800, archaic typography makes most optical character recognition unreliable — but for that very reason, TCP-ECCO has been producing clean, manually-keyed versions of 18c texts, enough at least for a small collection. The later 19c also isn’t a problem, because after 1830 or so, OCR quality is mostly adequate.

OCR version of Smellie, contributed by Columbia University Libraries to the Internet Archive.


But between 1800 and (say) 1830, you fall between two stools. It’s technically the nineteenth century, so people assume that OCR ought to work. But in practice, volumes from this period still have a lot of eighteenth-century typographical quirks, including loopy ligatures, the notorious “long s,” and worn or broken type. So the OCR is often pretty vile. I’m willing to put up with background noise if it’s evenly distributed. But these errors are distributed unevenly across the lexicon and across time, so they could actually distort conclusions if left unaddressed.

I decided to build a Python script to do post-processing correction of OCR. There are a lot of ways to do this; my approach was modeled on a paper written by Thomas A. Lasko and Susan E. Hauser for the National Library of Medicine. Briefly, what they show is that OCR correction becomes much more reliable when the program is given statistical information about the language, and errors, to be expected in a given domain. They’re working with contemporary text, but the principle holds even more strongly when you’re working in a different historical period. A generic spellchecker won’t perform well with texts that contain period spellings (“despatch,” “o’erflow’d”), systematic f/s substitution, and a much higher proportion of Latin and French than we’re used to. If your system corrects every occurrence of “même” to “mime,” you’re going to end up with a surprising number of mimes; if you accept “foul” at face value as a correctly-spelled word, you’re going to have very little “soul” in your collection.

Briefly, I customized my spellchecker for the early 19c in three ways:

    • The underlying dictionary included period spellings as well as common French and Latin terms, and recorded the frequency of each term in the 18/19c domain. I used frequencies (lightly) to guide fuzzy matching.
    • To calculate “edit distance,” I used a weighted matrix that recorded the probability of specific character substitutions in early-19c OCR, learning as it went along.
    • To resolve pairs like “foul/soul” and “flip/slip/ship,” where common OCR errors produce a token that could also be a real word, I extracted 2gram frequencies from the Google ngram database so that the program could judge which word made more sense in context. I.e., in the case of “the flip sailed,” the program can infer that a word before “sailed” is pretty likely to be “ship.”
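The first two customizations can be illustrated with a short sketch. This is not the script described here. It is a minimal weighted Levenshtein distance with a fixed table of substitution costs (nothing is learned as it goes along) and a light frequency bonus. The cost values, the `freq_weight` parameter, and the function names are all assumptions made for illustration.

```python
import math


def weighted_edit_distance(observed, candidate, sub_cost, default_cost=1.0):
    """Levenshtein distance with per-pair substitution costs.

    sub_cost maps (observed_char, intended_char) to a cost below default_cost
    for confusions that are common in OCR, such as reading a long s as "f".
    """
    rows, cols = len(observed) + 1, len(candidate) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * default_cost
    for j in range(1, cols):
        dist[0][j] = j * default_cost
    for i in range(1, rows):
        for j in range(1, cols):
            a, b = observed[i - 1], candidate[j - 1]
            substitution = 0.0 if a == b else sub_cost.get((a, b), default_cost)
            dist[i][j] = min(
                dist[i - 1][j] + default_cost,      # delete a character
                dist[i][j - 1] + default_cost,      # insert a character
                dist[i - 1][j - 1] + substitution,  # match or substitute
            )
    return dist[rows - 1][cols - 1]


def best_correction(token, lexicon_freq, sub_cost, max_distance=1.0, freq_weight=0.01):
    """Return the closest lexicon word, lightly favoring frequent words.

    Tokens already in the lexicon (period spellings, French, Latin) are
    left alone; tokens with no close match are returned unchanged.
    """
    if token in lexicon_freq:
        return token
    best, best_score = token, None
    for word, freq in lexicon_freq.items():
        distance = weighted_edit_distance(token, word, sub_cost)
        if distance > max_distance:
            continue
        score = distance - freq_weight * math.log(freq + 1)
        if best_score is None or score < best_score:
            best, best_score = word, score
    return best
```

With `sub_cost = {("f", "s"): 0.2}`, the token "fhip" is closer to "ship" (0.2) than to "chip" (1.0). The linear scan over the lexicon is the simplest possible search and would be slow on a full dictionary, which is one reason extra tricks are needed to optimize speed. A token like "foul", which is itself a valid word, passes through untouched here; resolving it needs context, as the third item in the list explains.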

A few other tricks are needed to optimize speed, and to make sure the script doesn’t over-correct proper nouns; anyone who’s interested in doing this should drop me a line for a fuller description and a copy of the code.

Automatically corrected version.


The results aren’t perfect, but they’re good enough to be usable (I am also recording the number of corrections and uncorrectable tokens so that I can assess margins of error later on).

I haven’t packaged this code yet for off-the-shelf use; it’s still got a few trailing wires. But if you want to cannibalize/adapt it, I’d be happy to give you a copy. Perhaps more importantly, I’d like to share a couple of sets of rules that might be helpful for anyone who’s attempting to normalize an 18/19c collection. Both of these rulesets are tab-delimited utf-8 .txt files. First, my list of 4600 rules for correcting 18/19c spellings, including syncopated past-tense forms like “bury’d” and “drop’d.” (Note that syncope cannot always be fixed simply by adding back an “e.” Rules for normalizing poetic syncope — “flow’ry,” “ta’en” — are clustered at the end of the file, so you can delete them if desired.) This ruleset has been transformed by a long series of joins and filtering operations, and edited manually, but I should acknowledge that part of the original list was borrowed from the source files that accompany WordHoard, developed at Northwestern University. I should also warn potential users that these rules are designed to normalize spelling to modern British practice.

The other thing it might be useful to share is a list of 2grams extracted from the Google English corpus that I use for contextual spellchecking. This includes only 2grams where one of the two elements is a token like “fix” or “flip” that could be read either as a valid word or as an OCR error caused by the long s. Since the long s is also a problem in the Google dataset itself up to 1820, this list was based on frequencies from 1825-50. That’s not perfect for correcting texts in the 1800-1820 period, but I find that in practice it’s adequate. There are two columns here: the 2gram itself, and the frequency.
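A list in this shape could be used for contextual correction roughly as follows. The sketch assumes that the first column holds the two words separated by a space and that the second holds an integer count. It also assumes that the caller already has a set of candidate readings for the ambiguous token. The function names are hypothetical, and this is not the script described above.

```python
def load_bigrams(lines):
    """Parse tab-delimited "word1 word2<TAB>count" lines into a dict."""
    freqs = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip malformed rows
        words = fields[0].lower().split()
        count = fields[1].strip()
        if len(words) != 2 or not count.isdigit():
            continue
        key = (words[0], words[1])
        freqs[key] = freqs.get(key, 0) + int(count)
    return freqs


def choose_in_context(prev_word, candidates, next_word, bigram_freq):
    """Pick the candidate with the highest combined left and right 2gram count.

    Put the token as printed first in `candidates`: on a tie (including
    the case where no 2gram is known) the first candidate is kept.
    """
    def score(candidate):
        left = bigram_freq.get((prev_word, candidate), 0)
        right = bigram_freq.get((candidate, next_word), 0)
        return left + right

    return max(candidates, key=score)
```

To read the file itself, something like `load_bigrams(open(path, encoding="utf-8"))` should work, with `path` pointing at wherever the list is saved. In the case of “the flip sailed,” calling `choose_in_context("the", ["flip", "slip", "ship"], "sailed", freqs)` would return "ship" whenever the 2grams containing "ship" are the most frequent in the list.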