I love Karen Coyle’s idea that we should make OCR usable by identifying the best-available copy of each text. It’s time to start thinking about this kind of thing. Digital humanists have been making big claims about our ability to interpret large collections. But — outside of a few exemplary projects like EEBO-TCP — we really don’t have free access to the kind of large, coherent collections that our rhetoric would imply. We’ve got feet of clay on this issue.
Moreover, this wouldn’t be a difficult problem to address. I think it can be even simpler than Coyle suggests. In many cases, libraries have digitized multiple copies of a single edition. The obvious, simple thing to do is just:
- Measure OCR quality (automatically, using a language model rather than ground truth) and associate a measurement of OCR quality with each bibliographic record.
This simple metric would save researchers a huge amount of labor, because a scholar could use an API to request “all the works you have between 1790 and 1820 that are above 90% probable accuracy” or “the best available copy of each edition in this period,” making it much easier to build a meaningfully normalized corpus. (This may be slightly different from Coyle’s idea about “urtexts,” because I’m talking about identifying the best copy of an edition rather than the best edition of a title.) And of course a metric destroys nothing: if you want to talk about print culture without filtering out poor OCR, all that metadata is still available. All this would do is empower researchers to make their own decisions.
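Concretely, the score doesn't have to be anything elaborate to be useful. Here is a minimal sketch of the kind of calculation I have in mind, in Python, assuming nothing more than a reference wordlist standing in for a real language model (the wordlist and sample strings are placeholders):

```python
import re

def ocr_quality(text, wordlist):
    """Fraction of alphabetic tokens found in a reference wordlist.
    A crude proxy for OCR accuracy, but it is the kind of number that
    could sit in a metadata field and be filtered on through an API."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in wordlist) / len(tokens)

# Toy wordlist; in practice this would be a large, period-appropriate lexicon.
wordlist = {"call", "me", "ishmael", "the", "whale"}
print(ocr_quality("Call me Ishmael", wordlist))   # 1.0
print(ocr_quality("CaU rne Islimael", wordlist))  # 0.0
```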
One could go even further, and construct a “Frankenstein” edition by taking the best version of each page in a given edition. Or one could improve OCR with post-processing. But I think those choices can be left to individual research projects and repositories. The only part of this that really does need to be a collective enterprise is an initial measurement of OCR quality that gets associated with each bibliographic record and exposed to the API. That measurement would save research assistants thousands of hours of labor picking “the cleanest version of X.” I think it’s the most obvious thing we’re lacking.
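And once a score like that is attached to each record, the "best available copy of each edition" request becomes almost trivial on the repository side. A hedged sketch, with a hypothetical record format (edition id, copy id, score); run the same grouping at the page level and you get the "Frankenstein" edition:

```python
from collections import defaultdict

# Hypothetical records: (edition_id, copy_id, ocr_quality) for each digitized copy.
records = [
    ("edition-1", "copy-A", 0.87),
    ("edition-1", "copy-B", 0.95),
    ("edition-2", "copy-C", 0.62),
]

def best_copy_per_edition(records, threshold=0.0):
    """Highest-scoring copy of each edition, dropping editions whose
    best copy still falls below the requested accuracy threshold."""
    by_edition = defaultdict(list)
    for edition, copy_id, score in records:
        by_edition[edition].append((score, copy_id))
    return {
        edition: max(copies)
        for edition, copies in by_edition.items()
        if max(copies)[0] >= threshold
    }

print(best_copy_per_edition(records, threshold=0.9))
# {'edition-1': (0.95, 'copy-B')}; edition-2 is dropped because its best copy scores 0.62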
[Postscript: Obviously, researchers can do this for themselves by downloading everything in period X, measuring OCR quality, and then selecting copies accordingly. In fact, I’m getting ready to build that workflow this summer. But this is going to take time and consume a lot of disk space, and it’s really the kind of thing an API ought to be doing for us.]
6 replies on “The obvious thing we’re lacking.”
The big problem is that no one treats an OCR’ed text as an object in itself; it’s just an associated field attached to the physical item (or the scan) in all the databases I’ve ever seen. There isn’t even a logical place to put metadata that pertains to the OCR in most production catalogs. So if OCR technology improved, for example, I assume the Internet Archive or Jstor or Hathi would just replace the old versions, not keep them around as something citeable. So I took Coyle’s post as being more about the need for digital editions than about a curation issue for existing works. But urtexts could actually be unicode things that would be curated in themselves. (Although her bit about page numbers brings the whole thing dangerously close to “let’s TEI encode everything.”)
I haven’t documented this anywhere because I want to switch algorithms, but the Bookworm ‘language’ field treats OCR errors as a separate language category and drops them out of “English” or “French”. (This was a happy accident–I just noticed a bunch of handwritten and otherwise unusable texts being classified as Latin or Greek.) But if that API comes online, it would make sense to have an “OCR error %” field that could be queried on.
The Stanford/U North Texas newspapers project does some stuff along these lines–I have to read the white paper to see how they determine percentage of ‘good’ words. It’s always seemed to me like this is a decent use for topic modeling, since there’s always an OCR error topic or two.
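Just to show the shape of that check, here is a schematic sketch (not how the Stanford/North Texas project actually does it): fit a small topic model, then flag topics whose top words are mostly non-words. The toy documents, wordlist, and parameters are placeholders, and with three documents the topics obviously aren't meaningful.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "call me ishmael some years ago",
    "tlie whale arid tlie sliip on tlie sea",   # noisy OCR
    "it was the best of times it was the worst of times",
]
wordlist = {"call", "me", "ishmael", "some", "years", "ago", "the", "whale",
            "and", "ship", "on", "sea", "it", "was", "best", "of", "worst", "times"}

vectorizer = CountVectorizer()
doc_term = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(doc_term)
vocab = vectorizer.get_feature_names_out()

# A topic whose top words fall mostly outside the wordlist looks like an OCR-error topic.
for k, weights in enumerate(lda.components_):
    top_words = [vocab[i] for i in weights.argsort()[-5:]]
    junk_share = sum(w not in wordlist for w in top_words) / len(top_words)
    print(f"topic {k}: {top_words}  non-word share: {junk_share:.2f}")
```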
Interesting point, that the OCR itself lacks any ontology. So you’re right: the first step is to give it one. It might be useful to think about OCR as a versioning process; a digital repository is like a GitHub for books. Project Bamboo seems to be headed in this direction: http://www.projectbamboo.org/newsletter/spring2012/
The metric for OCR quality is going to be a little tricky, because we might not want to penalize texts with large numbers of proper nouns, obscure Latin phrases, and so on. We might in the end need two different metrics: one based on the percentage of recognizable words, and one based on a language model of character bigrams. (We could use something like KL divergence from the model.) Travis Brown has pointed out that you can’t just use character bigrams, because some OCR engines actually use a bigram model in the “guessing” process and will be really good at producing nonsense that “looks” English.
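For the second metric, here is a rough sketch of what "KL divergence from a character-bigram model" might look like in practice; the reference text, the add-one smoothing, and the names are all placeholders rather than a worked-out proposal:

```python
import math
from collections import Counter

def char_bigrams(text):
    """Lowercase character bigrams, with runs of whitespace collapsed."""
    cleaned = " ".join(text.lower().split())
    return [cleaned[i:i + 2] for i in range(len(cleaned) - 1)]

def bigram_distribution(text, vocab):
    """Smoothed probability of each bigram in `vocab` (add-one smoothing)."""
    counts = Counter(char_bigrams(text))
    total = sum(counts[b] for b in vocab) + len(vocab)
    return {b: (counts[b] + 1) / total for b in vocab}

def kl_divergence(p, q):
    """D(p || q) over a shared vocabulary, in bits."""
    return sum(p[b] * math.log2(p[b] / q[b]) for b in p)

# Usage: score a page against a reference model built from known-clean text.
# `clean_corpus` and `ocr_page` stand in for real data.
clean_corpus = "call me ishmael some years ago never mind how long precisely"
ocr_page = "cali rne islimael sorne years ago riever rnind how long preciselv"

vocab = set(char_bigrams(clean_corpus)) | set(char_bigrams(ocr_page))
reference = bigram_distribution(clean_corpus, vocab)
page = bigram_distribution(ocr_page, vocab)
print(round(kl_divergence(page, reference), 3))  # larger = less English-like
```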
That presents an interesting paradox–any metric that we can come up with to measure OCR quality could also be an input for improving it, and so is unlikely to work in the long term. Which cuts against the idea that ‘OCR quality’ could be a metadata field on a digital record, since it, too, would have to be subject to endless emendation and improvement. Get enough NLP involved in the OCR process, and we could have perfectly sensible editions that are wrong in undetectable ways. (“It was the best at times, it was the worst at times…”).
OTOH, I do think it should be easy enough to measure something like “frequent ABBYY English misreadings.” The ratio of “tlie” to “the” is a good enough first approximation, for now, for a certain type of error.
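Something along these lines, say; the pairs beyond "tlie"/"the" are purely illustrative, not a vetted list of ABBYY misreadings:

```python
import re
from collections import Counter

# Known misreading -> intended word. Only "tlie"/"the" comes from the thread;
# the other pairs are illustrative placeholders.
MISREADINGS = {"tlie": "the", "arid": "and", "bo": "be"}

def misreading_ratio(text):
    """Occurrences of known misreadings per occurrence of their correct
    forms; higher values suggest dirtier OCR."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    bad = sum(tokens[wrong] for wrong in MISREADINGS)
    good = sum(tokens[right] for right in MISREADINGS.values())
    return bad / max(good, 1)

print(misreading_ratio("Tlie whale and tlie ship and the sea"))
# 2 misreadings against 3 correct forms = 0.666...
```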
I like that paradox — though, in practice, this metric interests me less as a way of defining final accuracy than as a way of filtering out the truly awful.
And ah, yes, “tlie.” I believe I’ll have to write a short story entirely composed of OCR errors. “CaU rne Islimael.”
Many years ago the volunteers at Distributed Proofreaders set up a scanner for these “stealth scannos,” jeebies. It’s a pretty long list, and every project I’ve worked on through the years has added to it, what with changes in page quality, original print, typesetter skill (you’d be amazed how well I can predict a publisher by the number of u/n typos; Lupton, I’m looking at you), and so forth.