I love Karen Coyle’s idea that we should make OCR usable by identifying the best-available copy of each text. It’s time to start thinking about this kind of thing. Digital humanists have been making big claims about our ability to interpret large collections. But — outside of a few exemplary projects like EEBO-TCP — we really don’t have free access to the kind of large, coherent collections that our rhetoric would imply. We’ve got feet of clay on this issue.
Moreover, this wouldn’t be a difficult problem to address. I think it can be even simpler than Coyle suggests. In many cases, libraries have digitized multiple copies of a single edition. The obvious, simple thing to do is just:
Measure OCR quality automatically, using a language model rather than ground truth, and associate that measurement with each bibliographic record.
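To make that concrete, here is a minimal sketch of an automatic quality estimate. It swaps in the crudest possible stand-in for a language model: a word list, with quality defined as the share of alphabetic tokens that appear in it. The vocabulary, the function name `estimate_ocr_quality`, and the sample strings are all illustrative assumptions; a real measurement would want a period-appropriate vocabulary or a proper character- or word-level model.

```python
import re

# Hypothetical reference vocabulary. In practice this would be built from
# clean, period-appropriate text; this toy set is only for illustration.
VOCAB = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}

# Runs of letters only (Unicode-aware): word characters minus digits and "_".
TOKEN = re.compile(r"[^\W\d_]+")


def estimate_ocr_quality(text, vocab=VOCAB):
    """Return the share of alphabetic tokens found in vocab (0.0 to 1.0).

    No ground-truth transcription is needed: the score only asks how much
    of the text looks like known words. Text with no tokens scores 0.0.
    """
    tokens = [t.lower() for t in TOKEN.findall(text)]
    if not tokens:
        return 0.0
    recognized = sum(1 for t in tokens if t in vocab)
    return recognized / len(tokens)


clean = "The quick brown fox jumps over the lazy dog"
noisy = "Tbe qnick brown fox jumps ovcr the lazy dog"

print(round(estimate_ocr_quality(clean), 3))
print(round(estimate_ocr_quality(noisy), 3))
```

A score like this is only a proxy: it will penalize proper nouns, archaic spellings, and other languages, which is one reason to store it as metadata rather than use it to discard anything.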
This simple metric would save researchers a huge amount of labor, because a scholar could use an API to request “all the works you have between 1790 and 1820 that are above 90% probable accuracy” or “the best available copy of each edition in this period,” making it much easier to build a meaningfully normalized corpus. (This may be slightly different from Coyle’s idea about “urtexts,” because I’m talking about identifying the best copy of an edition rather than the best edition of a title.) And of course a metric destroys nothing: if you want to talk about print culture without filtering out poor OCR, all that metadata is still available. All this would do is empower researchers to make their own decisions.
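The two requests quoted above are easy to express once each record carries a quality score. The sketch below runs them over an in-memory list rather than a real API; the `CopyRecord` fields, the function names, and the sample records are hypothetical, invented only to show the shape of the queries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CopyRecord:
    """One digitized copy. All field names here are illustrative assumptions."""
    copy_id: str
    edition_id: str
    year: int
    ocr_quality: float  # estimated accuracy, 0.0 to 1.0


def above_threshold(records, start, end, minimum):
    """All copies dated start..end (inclusive) with quality >= minimum."""
    return [r for r in records
            if start <= r.year <= end and r.ocr_quality >= minimum]


def best_copy_per_edition(records, start, end):
    """Map each edition_id in the date range to its highest-quality copy."""
    best = {}
    for r in records:
        if not (start <= r.year <= end):
            continue
        current = best.get(r.edition_id)
        if current is None or r.ocr_quality > current.ocr_quality:
            best[r.edition_id] = r
    return best


# Made-up sample data, not drawn from any real catalog.
RECORDS = [
    CopyRecord("copy-a", "edition-1", 1794, 0.95),
    CopyRecord("copy-b", "edition-1", 1794, 0.81),
    CopyRecord("copy-c", "edition-2", 1818, 0.88),
    CopyRecord("copy-d", "edition-3", 1825, 0.97),
]

clean_copies = above_threshold(RECORDS, 1790, 1820, 0.90)
best_copies = best_copy_per_edition(RECORDS, 1790, 1820)
```

Note that neither query deletes anything: the lower-scoring copies stay in `RECORDS`, so a researcher who wants the unfiltered collection still has it.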
One could go even further, and construct a “Frankenstein” edition by taking the best version of each page in a given edition. Or one could improve OCR with post-processing. But I think those choices can be left to individual research projects and repositories. The only part of this that really does need to be a collective enterprise is an initial measurement of OCR quality that gets associated with each bibliographic record and exposed to the API. That measurement would save research assistants thousands of hours of labor picking “the cleanest version of X.” I think it’s the most obvious thing we’re lacking.
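For projects that do want the “Frankenstein” approach, a rough sketch might look like the following. It assumes something real collections will not always offer: that every copy of the edition has its pages aligned by index, and that each page already has a quality score. The function name and the sample data are hypothetical.

```python
def frankenstein_edition(copies):
    """Build a composite edition from the best-scoring version of each page.

    copies maps a copy id to a list of (page_text, quality) pairs, and
    page i is assumed to be the same page in every copy. Returns a list
    of (copy_id, page_text) pairs, one per page. Only pages present in
    every copy are considered; ties go to the copy listed first.
    """
    if not copies:
        return []
    n_pages = min(len(pages) for pages in copies.values())
    composite = []
    for i in range(n_pages):
        best_id = max(copies, key=lambda cid: copies[cid][i][1])
        page_text, _quality = copies[best_id][i]
        composite.append((best_id, page_text))
    return composite


# Made-up sample data: two copies of a two-page edition.
COPIES = {
    "copy-a": [("It was a dreary night", 0.98), ("of Novcmbcr", 0.60)],
    "copy-b": [("It vvas a drcary night", 0.70), ("of November", 0.97)],
}

edition = frankenstein_edition(COPIES)
```

Recording which copy each page came from keeps the composite traceable back to its sources.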
[Postscript: Obviously, researchers can do this for themselves by downloading everything in period X, measuring OCR quality, and then selecting copies accordingly. In fact, I'm getting ready to build that workflow this summer. But this is going to take time and consume a lot of disk space, and it's really the kind of thing an API ought to be doing for us.]