It’s the data: a plan of action.


I’m still fairly new at this gig, so take the following with a grain of salt. But the more I explore the text-mining side of DH, the more I wonder whether we need to rethink our priorities.

Over the last ten years we’ve been putting a lot of effort into building tools and cyberinfrastructure. And that’s been valuable: projects like MONK and Voyant play a crucial role in teaching people what’s possible. (I learned a lot from them myself.) But when I look around for specific results produced by text-mining, I tend to find that they come in practice from fairly simple, ad-hoc tools, applied to large datasets.

Ben Schmidt’s blog Sapping Attention is a good source of examples. Ben has discovered several patterns that really have the potential to change disciplines. For instance, he’s mapped the distribution of gender in nineteenth-century collections, and assessed the role of generational succession in vocabulary change. To do this, he hasn’t needed natural language processing, or TEI, or even topic modeling. He tends to rely on fairly straightforward kinds of corpus comparison. The leverage he’s getting comes ultimately from his decision to go ahead and build corpora as broad as possible using existing OCR.
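To make "straightforward corpus comparison" concrete, here's a minimal sketch of the general technique — my own toy illustration, not Ben's actual code, and the two corpora below are invented — ranking words by the ratio of their relative frequencies in two collections:

```python
from collections import Counter

def word_freqs(texts):
    """Relative frequency of each word across a list of documents."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def compare(corpus_a, corpus_b, smoothing=1e-6):
    """Ratio of each word's frequency in corpus A to its frequency in B.

    Smoothing keeps words absent from one corpus from dividing by zero.
    """
    fa, fb = word_freqs(corpus_a), word_freqs(corpus_b)
    vocab = set(fa) | set(fb)
    return {w: (fa.get(w, 0) + smoothing) / (fb.get(w, 0) + smoothing)
            for w in vocab}

# Invented toy corpora, just to show the shape of the comparison:
# words with ratio >> 1 are overrepresented in the first corpus.
ratios = compare(["she walked to the village", "she wrote letters"],
                 ["he rode to the city", "he wrote speeches"])
```

In practice you'd use thousands of volumes and a proper significance test (Dunning's log-likelihood is a common choice), but the leverage really does come from the size of the corpora rather than from the sophistication of the comparison.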

I think that’s the direction to go right now. Moreover, before 1923 it doesn’t require any special agreement with publishers. There’s a lot of decent OCR in the public domain, because libraries can now produce cleaner copy than Google used to. Yes, some cleanup is still needed: running headers need to be removed, and the OCR needs to be corrected in period-sensitive ways. But it’s easier than people think to do that reliably. (You get a lot of clues, for instance, from cleaning up a whole collection at once. That way, the frequency of a particular form across the whole collection can help an automated corrector decide whether it’s an OCR error or a proper noun.)
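Here's a rough sketch of that collection-level clue — the function, the toy dictionary, and the threshold are all hypothetical, not any particular project's pipeline. A form that's missing from the dictionary but frequent across the whole collection is probably a real word (often a proper noun); a rare unknown form is probably an OCR error:

```python
from collections import Counter

def classify_unknown_forms(collection, dictionary, min_count=3):
    """Sort out-of-dictionary forms into likely-real words and likely errors.

    collection: list of page strings; dictionary: set of known lowercase words.
    A form absent from the dictionary but frequent collection-wide (say, a
    proper noun like 'Pemberley') is probably not an OCR error; a rare
    unknown form (like 'hou5e') probably is.
    """
    counts = Counter(w for page in collection for w in page.split())
    likely_real, likely_errors = set(), set()
    for form, n in counts.items():
        if form.lower() in dictionary:
            continue
        (likely_real if n >= min_count else likely_errors).add(form)
    return likely_real, likely_errors

# Toy "collection" with an invented five-word dictionary.
pages = ["the old house at Pemberley"] * 5 + ["the old hou5e"]
real, errors = classify_unknown_forms(pages, {"the", "old", "house", "at"})
```

A real corrector would combine this frequency signal with edit distance to dictionary words and period-specific substitution patterns (long s, ligatures), but the collection-wide count is what makes the decision tractable.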

In short, I think we should be putting a bit more collective effort into data preparation. Moreover, it seems to me that there’s a discernible sweet spot between vast collections of unreliable OCR and small collections of carefully groomed TEI. What we need are collections in the 5,000–500,000 volume range, cleaned up to at least (say) 95% recall and 99% precision. Precision is more important than recall, because false negatives drop out of many kinds of analysis, as long as they’re randomly distributed. (That’s why you can’t just ignore the long-s problem in the eighteenth century: errors of that kind are systematic, not random.) Collections of that kind are going to generate insights that we can’t glimpse as individual readers. They’ll be especially valuable once we enrich the metadata with information about (for instance) genre, gender, and nationality. I’m not confident that we can crowdsource OCR correction (it’s an awful lot of work), but I am confident that we could crowdsource some light enrichment of metadata.
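For concreteness, here's one simple way those two numbers could be measured against a hand-corrected ground-truth page. This bag-of-words version is my own simplification for illustration, not a standard evaluation script:

```python
from collections import Counter

def token_precision_recall(corrected, ground_truth):
    """Token-level precision and recall of a corrected transcription.

    precision = fraction of emitted tokens that match the ground truth;
    recall    = fraction of ground-truth tokens that were recovered.
    (Bag-of-words matching: a deliberate simplification.)
    """
    c, g = Counter(corrected), Counter(ground_truth)
    matched = sum(min(c[w], g[w]) for w in c)
    return matched / sum(c.values()), matched / sum(g.values())

# One residual OCR error ('f0x') out of four tokens on a toy "page".
p, r = token_precision_recall(["the", "quick", "brown", "f0x"],
                              ["the", "quick", "brown", "fox"])
```

Here the one bad token costs both precision and recall; in a real pipeline, a corrector that drops uncertain tokens entirely trades recall for precision, which is the trade I'm suggesting we accept.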

So this is less a manifesto than a plan of action. I don’t think we need a center or a grant for this kind of thing: all we need is a coalition of the willing. I’ve asked HathiTrust for English-language OCR in the 18th and 19th centuries; once I get it, I’ll clean it up and make the cleaned version publicly available (as far as legally possible, which I think is pretty far). Then I’ll invite researchers to crowdsource metadata in some fairly low-tech way, and share the enriched metadata with everyone who participated in the crowdsourcing.

I would eagerly welcome suggestions about the kinds of metadata we ought to be recording (for instance, the genre categories we ought to use). Questions about selection/representativeness are probably better handled by individual researchers; I don’t think it’s possible to define a collective standard on that point, because people have different goals. Instead, I’ll simply take everything I can get, measure OCR quality, and allow people to define their own selection criteria. Researchers who want to produce a specific balance between X and Y can always do that by selecting a subset of the collection, or by combining it with another collection of their own.

OT: politics, social media, and the academy.

This is a blog about text mining, but from time to time I’m going to allow myself to wander off topic, briefly. At the moment, I think social media are adding a few interesting twists to an old question about the relationship between academic politics and politics-politics.

[Image: protests outside the Wisconsin State Capitol, Feb. 2011.]

It’s perhaps never a good idea to confuse politics with communicative rationality. Recently, in the United States, it isn’t even clear that all parties share a minimal respect for democratic norms. One side is willing to obstruct the right to vote, to lie about scientifically ascertainable fact, and to convert US attorneys (when they’re in power) into partisan enforcers. In circumstances like this, observers of good faith don’t need to spend a lot of time “debating” politics, because the other side isn’t debating. The only thing worth debating is how to fight back. And in a fight, dispassionate self-criticism becomes less important than solidarity.

Personally, I don’t mind a good fight with clearly drawn moral lines. But this same clarity can be a bad thing for the academy. Dispassionate debate is what our institution is designed to achieve. If contemporary political life teaches us that “debate” is usually a sham, staged to produce an illusion of equivalence between A) fact and B) bullshit, then we may start to lose faith in our own guiding principles.

This opens up a whole range of questions. But maybe the most interesting question for likely readers of this blog will involve the role of social media. I think the web has proven itself a good tool for grassroots push-back against corporate power; we’re all familiar with successful campaigns against SOPA and Susan G. Komen. But social media also work by harnessing the power of groupthink. “Click like.” “Share.” “Retweet.” This doesn’t bother me where politics itself is concerned; political life always entails a decision to “hang together or hang separately.”

But I’m uneasy about extending the same strategy to academic politics, because our main job, in the academy, is debate rather than solidarity. I hesitate to use Naomi Schaefer Riley’s recent blog post at the Chronicle as an example, because it’s not in any sense a model of the virtues of debate. It was a hastily tossed-off, sneering attack on junior scholars that failed to engage in any depth with the texts it attacked. Still, I’m uncomfortable when I see academics harnessing the power of social media to discourage the Chronicle from publishing Riley.

There was, after all, an idea buried underneath Riley’s sneers. It could have been phrased as a question about the role of politics in the humanities. Political content has become more central to humanistic research, at the same time as actual political debate has become less likely (for reasons sketched above). The result is that a lot of dissertations do seem to be proceeding toward a predetermined conclusion. This isn’t by any means a problem only in Black Studies, and Riley’s reasons for picking on Black Studies probably won’t bear close examination.

Still, I’m not persuaded that we would improve the academy by closing publications like the Chronicle to Riley’s kind of critique. Attacks on academic institutions can raise valid questions, even when they are poorly argued, sneering, and unfair. (E.g., I wouldn’t be writing this blog post if it weren’t for the outcry over Riley.) So in the end I agree with Liz McMillen’s refusal to take down the post.

But this particular incident is not of great significance in itself. I want to raise a more general question about the role that technologies of solidarity should play in academic politics. We’ve become understandably cynical about the ideal of “open debate” in politics and journalism. How cynical are we willing to become about its place in academia? It’s a question that may become especially salient if we move toward more public forms of review. Would we be comfortable, for instance, with a petition directed at a particular scholarly journal, urging it not to publish articles by a particular author?

[11 p.m. May 6th: This post was revised after initial publication, mainly for brevity. I also made the final paragraph a little more pointed.]

[Update May 7: The Chronicle has asked Schaefer Riley to leave the blog. It’s a justifiable decision, since she wrote a very poorly argued post. But it also does convince me that social media are acquiring a new power to shape the limits of academic debate. That’s a development worth watching.]

[Update May 8th: Kevin Drum at Mother Jones weighs in on the issue.]