It’s the data: a plan of action.


I’m still fairly new at this gig, so take the following with a grain of salt. But the more I explore the text-mining side of DH, the more I wonder whether we need to rethink our priorities.

Over the last ten years we’ve been putting a lot of effort into building tools and cyberinfrastructure. And that’s been valuable: projects like MONK and Voyant play a crucial role in teaching people what’s possible. (I learned a lot from them myself.) But when I look around for specific results produced by text-mining, I tend to find that they come in practice from fairly simple, ad-hoc tools, applied to large datasets.

Ben Schmidt’s blog Sapping Attention is a good source of examples. Ben has discovered several patterns that really have the potential to change disciplines. For instance, he’s mapped the distribution of gender in nineteenth-century collections, and assessed the role of generational succession in vocabulary change. To do this, he hasn’t needed natural language processing, or TEI, or even topic modeling. He tends to rely on fairly straightforward kinds of corpus comparison. The leverage he’s getting comes ultimately from his decision to go ahead and build corpora as broad as possible using existing OCR.

I think that’s the direction to go right now. Moreover, for works published before 1923 it doesn’t require any special agreement with publishers. There’s a lot of decent OCR in the public domain, because libraries can now produce cleaner copy than Google used to. Yes, some cleanup is still needed: running headers need to be removed, and the OCR needs to be corrected in period-sensitive ways. But it’s easier than people think to do that reliably. (You get a lot of clues, for instance, from cleaning up a whole collection at once. That way, the frequency of a particular form across the collection can help your corrector decide whether it’s an OCR error or a proper noun.)
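To make that last point concrete, here is a minimal sketch of one way collection-wide frequency could feed a corrector. It is an illustration under stated assumptions, not a description of any existing pipeline: the function name `classify_unknown_forms`, the `lexicon` set, and the `min_volumes` threshold are all hypothetical, and counting the number of volumes a form appears in is just one plausible proxy for "frequency across the collection."

```python
from collections import Counter


def classify_unknown_forms(volumes, lexicon, min_volumes=3):
    """Split out-of-lexicon forms into likely real words and likely OCR errors.

    volumes:     list of token lists, one list per volume.
    lexicon:     set of known lowercase word forms (a hypothetical dictionary).
    min_volumes: assumed threshold; a form that shows up in at least this many
                 different volumes is treated as a real word (e.g. a proper noun).
    """
    # Count how many volumes each lowercase form appears in.
    volume_counts = Counter()
    for tokens in volumes:
        volume_counts.update({t.lower() for t in tokens})

    likely_real, likely_error = set(), set()
    for form, n_volumes in volume_counts.items():
        if form in lexicon:
            continue  # already a known word, nothing to decide
        if n_volumes >= min_volumes:
            likely_real.add(form)
        else:
            likely_error.add(form)
    return likely_real, likely_error


# Toy collection: "Pemberley" recurs across volumes, "houfe" appears once.
volumes = [
    ["Pemberley", "is", "fine"],
    ["Pemberley", "house"],
    ["to", "Pemberley"],
    ["the", "houfe"],
]
lexicon = {"is", "fine", "house", "to", "the"}
real, errors = classify_unknown_forms(volumes, lexicon, min_volumes=3)
print(real, errors)  # {'pemberley'} {'houfe'}
```

A threshold like this only catches errors that are rare. Systematic misreadings recur across many volumes and would need a separate, period-sensitive rule.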

In short, I think we should be putting a bit more collective effort into data preparation. Moreover, it seems to me that there’s a discernible sweet spot between vast collections of unreliable OCR and small collections of carefully-groomed TEI. What we need are collections in the 5,000 – 500,000 volume range, cleaned up to at least (say) 95% recall and 99% precision. Precision is more important than recall, because false negatives drop out of many kinds of analysis, as long as they’re randomly distributed (which is why you can’t just ignore a systematic error like the f/s problem in the 18c, where the long s is routinely read as an f). Collections of that kind are going to generate insights that we can’t glimpse as individual readers. They’ll be especially valuable once we enrich the metadata with information about (for instance) genre, gender, and nationality. I’m not confident that we can crowdsource OCR correction (it’s an awful lot of work), but I am confident that we could crowdsource some light enrichment of metadata.
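For readers who want the precision/recall target spelled out, here is a rough sketch of how the two numbers might be computed on a hand-checked sample. The definitions are my assumptions for illustration: the OCR tokens are taken to be aligned one-to-one with a ground-truth transcription, an OCR token counts as "accepted" if it appears in a (hypothetical) `lexicon`, precision is the share of accepted tokens that match the truth, and recall is the share of true tokens that come through correctly.

```python
def ocr_precision_recall(aligned_pairs, lexicon):
    """Estimate precision and recall from (truth, ocr) token pairs.

    aligned_pairs: list of (truth_token, ocr_token) tuples, assumed to be
                   aligned one-to-one (a simplification).
    lexicon:       set of word forms the analysis would accept as words.
    """
    accepted = 0        # OCR tokens that look like real words
    true_positive = 0   # accepted tokens that also match the ground truth
    for truth, ocr in aligned_pairs:
        if ocr in lexicon:
            accepted += 1
            if ocr == truth:
                true_positive += 1
    precision = true_positive / accepted if accepted else 0.0
    recall = true_positive / len(aligned_pairs) if aligned_pairs else 0.0
    return precision, recall


# "houfe" is a false negative (it drops out as a non-word);
# "fame" for "same" is a false positive (a wrong but real word).
pairs = [("same", "same"), ("house", "houfe"), ("same", "fame"), ("the", "the")]
lexicon = {"same", "house", "fame", "the"}
precision, recall = ocr_precision_recall(pairs, lexicon)
print(round(precision, 3), recall)  # 0.667 0.5
```

The toy example shows why the two kinds of error differ: "houfe" simply vanishes from word counts, while "fame" quietly inflates the count of a word that was never on the page.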

So this is less a manifesto than a plan of action. I don’t think we need a center or a grant for this kind of thing: all we need is a coalition of the willing. I’ve asked HathiTrust for OCR of English-language volumes from the 18th and 19th centuries; once I get it, I’ll clean it up and make the cleaned version publicly available (as far as legally possible, which I think is pretty far). Then I’ll invite researchers to crowdsource metadata in some fairly low-tech way, and share the enriched metadata with everyone who participated in the crowdsourcing.

I would eagerly welcome suggestions about the kinds of metadata we ought to be recording (for instance, the genre categories we ought to use). Questions about selection/representativeness are probably better handled by individual researchers; I don’t think it’s possible to define a collective standard on that point, because people have different goals. Instead, I’ll simply take everything I can get, measure OCR quality, and allow people to define their own selection criteria. Researchers who want to produce a specific balance between X and Y can always do that by selecting a subset of the collection, or by combining it with another collection of their own.
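As a sketch of what "measure OCR quality and let people define their own selection criteria" could look like in practice, here is a small example. Everything in it is hypothetical: the `Volume` fields, the use of lexicon hit rate as a quality score, and the `min_quality` cutoff are assumptions chosen for illustration, not the measures I will necessarily use.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    # Hypothetical record; real metadata would carry more fields.
    volume_id: str
    genre: str
    year: int
    tokens: list


def ocr_quality(volume, lexicon):
    """Crude quality proxy: fraction of tokens found in the lexicon."""
    if not volume.tokens:
        return 0.0
    known = sum(1 for t in volume.tokens if t.lower() in lexicon)
    return known / len(volume.tokens)


def select_subset(volumes, lexicon, min_quality=0.9, predicate=lambda v: True):
    """Keep volumes that satisfy a researcher-supplied predicate and a quality floor."""
    return [
        v for v in volumes
        if predicate(v) and ocr_quality(v, lexicon) >= min_quality
    ]


lexicon = {"the", "house", "was", "old"}
collection = [
    Volume("a", "fiction", 1850, ["The", "house", "was", "old"]),
    Volume("b", "fiction", 1790, ["Tbe", "houfe", "was", "old"]),
    Volume("c", "poetry", 1820, ["the", "old", "house"]),
]
# One researcher's criteria: fiction only, at least 90% recognizable tokens.
fiction = select_subset(collection, lexicon, 0.9, lambda v: v.genre == "fiction")
print([v.volume_id for v in fiction])  # ['a']
```

The point of passing in a predicate is that the collection itself stays neutral: each researcher encodes their own balance of genre, date, or anything else in the metadata.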

9 thoughts on “It’s the data: a plan of action.”

  1. One of the interesting things about crowdsourcing, I think, is that it is a tool for data collection but also for creating debate — which can threaten the stability of data but also enrich it by difference. So I guess I see crowdsourcing as an opportunity to gather sets of classifications and taxonomies, like genre, form, and literary period (Romantic, 18th c., or 19th c.), that are endemic to literary field work and that are so often contested. Audience, then, becomes the challenge: if the crowd in the crowdsource includes those who are not informed scholars, it puts your data at risk of not just containing a healthy debate, but also perhaps some misinformed responses in the mix.

    • I agree that debate is useful. Re: audience, “crowdsourcing” may have been a misnomer. I’m envisioning a pretty small team (<10 people) that would probably in practice be limited to scholars. We might start with something like 10k volumes — which is enough to reveal a lot. And it's not actually *that* much work for one person to categorize 500 volumes by genre, etc. I'm all about keeping it simple and getting it done on whatever scale is immediately doable.

      • Ah! Thanks for the clarification of “crowdsourcing” for your project. Knowing that you hope to work with a small, hand-selected scholarly team makes a huge difference in the kinds of data you can ask researchers to acquire but also the kind of platform or interface you can ask them to input data into. On the topic of speed: With the Stainforth Library project, we’re in the process of figuring out the best way to gather help from undergraduates working for the library to transcribe the Stainforth along with graduate student-level (and above) researchers who supplement the transcription with researched data from the Women Poets of the Romantic Period collection (~600 volumes) in the CUB archive and other sources. We’re guessing that with a team of undergraduates doing non-specialty-related work and grad students (or higher) tackling the field-specific work, it will get done faster — but this is a preliminary plan that needs testing.

    • The genre categories in Graphs, Maps, Trees are provocative. But for the purposes of cataloging objects, we would probably want to use less debatable kinds of categories. E.g., there are too many things that could be both a “bildungsroman” and a “historical novel,” etc.

      At the moment, I frankly just categorize as “fiction,” “poetry,” “drama,” “nonfiction,” “letters,” etc. I’m open to getting more specific than that. But I doubt we’ll be able to go all the way to “historical novel.”

  2. I agree with most of what you’re saying, although I think a minimal TEI format could be a good way of also recording, in a standardized way, some structural information (chapters, sections, paragraphs) and some of the metadata you’re interested in.
    As for categories, something I have found very useful in my own past small-scale work on the novel was the dominant narrative perspective as a major category. It is also not too controversial, at least as long as you don’t look too closely or try to distinguish various parts of any one novel. Of course, this only makes sense for your “fiction” category. For the French eighteenth century, the following types or forms of narrative have been useful:
    * autodiegetic single-narrator narration (e.g., roman-mémoires)
    * autodiegetic multiple-narrator narration (essentially, epistolary novel)
    * multiple direct speech narration (dialogue novel, such as Jacques le fataliste)
    * heterodiegetic, single-narrator narration (typical 19th century fare)
    Which of these narrative forms is dominant overall changes slowly over time, correlating with other changes in narrative form; at least in what I was studying at the time (descriptive practice in French eighteenth-century novels), these categories made a difference in many more or less subtle ways.

    • Thanks, Christof. I think both of those suggestions are good. I am moving toward (very) minimal TEI for at least pages and divisions of a volume.

      Also, I like the suggestion that point-of-view is worth recording. It’s a feature of narration that tends to be lexically salient — at any rate, 1st- and 3rd-person narration are certainly easy to distinguish by clustering. For that reason, it might be something you would need to control for.
