I’ve been posting mostly about collections built by other people (TCP-ECCO and Google). But I’m also in the process of building a small (thousand-title) 19c collection myself, in collaboration with E. Jordan Sellers. Jordan is selecting titles for the collection; I’m writing the Python scripts that process the texts. This is a modest project intended to support research for a few years, not a model for long-term curatorial practice. But we’ve encountered a few problems specific to the early 19c, and I thought I might share some of our experience and tools in case they’re useful for other early-19c scholars.
I originally wanted to create a larger collection, containing twenty or thirty thousand volumes, on the model of Ben Schmidt’s impressive work with nineteenth-century volumes vacuumed up from the Open Library. But because I needed a collection that bridged the eighteenth and nineteenth centuries, I found I had to proceed more slowly. The eighteenth century itself wasn’t the problem. Before 1800, archaic typography makes most optical character recognition unreliable — but for that very reason, TCP-ECCO has been producing clean, manually-keyed versions of 18c texts, enough at least for a small collection. The later 19c also isn’t a problem, because after 1830 or so, OCR quality is mostly adequate.
But between 1800 and (say) 1830, you fall between two stools. It’s technically the nineteenth century, so people assume that OCR ought to work. But in practice, volumes from this period still have a lot of eighteenth-century typographical quirks, including loopy ligatures, the notorious “long s,” and worn or broken type. So the OCR is often pretty vile. I’m willing to put up with background noise if it’s evenly distributed. But these errors are distributed unevenly across the lexicon and across time, so they could actually distort conclusions if left unaddressed.
I decided to build a Python script to do post-processing correction of OCR. There are a lot of ways to do this; my approach was modeled on a paper written by Thomas A. Lasko and Susan E. Hauser for the National Library of Medicine. Briefly, what they show is that OCR correction becomes much more reliable when the program is given statistical information about the language, and errors, to be expected in a given domain. They’re working with contemporary text, but the principle holds even more strongly when you’re working in a different historical period. A generic spellchecker won’t perform well with texts that contain period spellings (“despatch,” “o’erflow’d”), systematic f/s substitution, and a much higher proportion of Latin and French than we’re used to. If your system corrects every occurrence of “même” to “mime,” you’re going to end up with a surprising number of mimes; if you accept “foul” at face value as a correctly-spelled word, you’re going to have very little “soul” in your collection.
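To make the long-s problem concrete, here's a toy sketch (in Python, but not a piece of the actual script) of the kind of check involved: generate the readings of a token in which each "f" might really be a long s, and see which of them are real words in a period-aware dictionary. The wordlist below is obviously a stand-in.

```python
from itertools import product

# Toy wordlist standing in for a period-aware dictionary, which would also
# include period spellings ("despatch") and common French and Latin terms.
PERIOD_WORDS = {"soul", "foul", "ship", "sensible", "despatch", "même"}

def long_s_readings(token):
    """Return every reading of token in which an 'f' may really be a
    long s, which OCR routinely misrecognizes as 'f'."""
    options = [("f", "s") if ch == "f" else (ch,) for ch in token.lower()]
    return {"".join(chars) for chars in product(*options)}

print(long_s_readings("fenfible") & PERIOD_WORDS)  # {'sensible'} -- unambiguous
print(long_s_readings("foul") & PERIOD_WORDS)      # {'foul', 'soul'} -- needs context
```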
In short, I customized my spellchecker for the early 19c in three ways:
• The underlying dictionary included period spellings as well as common French and Latin terms, and recorded the frequency of each term in the 18/19c domain. I used frequencies (lightly) to guide fuzzy matching.
• To calculate “edit distance,” I used a weighted matrix that recorded the probability of specific character substitutions in early-19c OCR, and that learned as it went along. (There’s a rough sketch of this kind of weighted distance, and of the frequency weighting mentioned above, after this list.)
• To resolve pairs like “foul/soul” and “flip/slip/ship,” where common OCR errors produce a token that could also be a real word, I extracted 2gram frequencies from the Google ngram database so that the program could judge which word made more sense in context. I.e., in the case of “the flip sailed,” the program can infer that the word before “sailed” is pretty likely to be “ship.” (A sketch of this contextual check appears further down, after the description of the 2gram list.)
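To give a sense of what the second customization involves, here's a rough sketch of a weighted edit distance. The substitution costs below are invented for illustration (the real matrix was learned from confirmed corrections rather than hand-set), and the ranking function only gestures at the light frequency weighting mentioned in the first item; the function names and cutoff are mine, not the script's.

```python
# Toy substitution costs: confusions that early-19c OCR makes constantly
# (long s read as "f", broken type turning "e" into "c", etc.) are cheap,
# everything else costs a full edit.
SUB_COST = {("f", "s"): 0.1, ("s", "f"): 0.1,
            ("c", "e"): 0.4, ("e", "c"): 0.4,
            ("l", "i"): 0.4, ("i", "l"): 0.4}

def weighted_edit_distance(ocr_token, candidate):
    """Levenshtein distance in which likely OCR confusions cost less."""
    m, n = len(ocr_token), len(candidate)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a, b = ocr_token[i - 1], candidate[j - 1]
            sub = 0.0 if a == b else SUB_COST.get((a, b), 1.0)
            d[i][j] = min(d[i - 1][j] + 1.0,      # deletion
                          d[i][j - 1] + 1.0,      # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[m][n]

def rank_candidates(ocr_token, lexicon, max_dist=1.0):
    """Rank dictionary words by weighted distance, breaking ties (lightly)
    by how common the word is in the 18/19c domain."""
    scored = [(weighted_edit_distance(ocr_token, word), -freq, word)
              for word, freq in lexicon.items()]
    return [word for dist, _, word in sorted(scored) if dist <= max_dist]

# lexicon maps each word to its frequency in the 18/19c domain (toy numbers)
lexicon = {"soul": 90000, "foul": 20000, "fowl": 15000}
print(rank_candidates("foul", lexicon))   # ['foul', 'soul', 'fowl']
```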
A few other tricks are needed to optimize speed, and to make sure the script doesn’t over-correct proper nouns; anyone who’s interested in doing this should drop me a line for a fuller description and a copy of the code.
The results aren’t perfect, but they’re good enough to be usable (I am also recording the number of corrections and uncorrectable tokens so that I can assess margins of error later on).
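(The bookkeeping itself is nothing fancy; something like the tally below, kept per volume, is enough to turn into an error estimate later. The field names are just illustrative.)

```python
from collections import Counter

stats = Counter()  # one Counter per volume

def record(original, correction):
    """Tally what happened to each token, for later margin-of-error estimates."""
    stats["tokens"] += 1
    if correction is None:
        stats["uncorrectable"] += 1   # no plausible candidate found
    elif correction != original:
        stats["corrected"] += 1
```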
I haven’t packaged this code yet for off-the-shelf use; it’s still got a few trailing wires. But if you want to cannibalize/adapt it, I’d be happy to give you a copy. Perhaps more importantly, I’d like to share a couple of sets of rules that might be helpful for anyone who’s attempting to normalize an 18/19c collection. Both files are tab-delimited utf-8 .txt files. First, my list of 4600 rules for correcting 18/19c spellings, including syncopated past-tense forms like “bury’d” and “drop’d.” (Note that syncope cannot always be fixed simply by adding back an “e.” Rules for normalizing poetic syncope — “flow’ry,” “ta’en” — are clustered at the end of the file, so you can delete them if desired.) This ruleset has been transformed by a long series of joins and filtering operations, and edited manually, but I should acknowledge that part of the original list was borrowed from the source files that accompany WordHoard, developed at Northwestern University. I should also warn potential users that these rules are designed to normalize spelling to modern British practice.
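If you do grab that file, applying it is straightforward. Here’s a minimal sketch, assuming one rule per line with the variant spelling and the normalized form separated by a tab; the filename is a placeholder.

```python
def load_rules(path="CorrectionRules.txt"):   # placeholder filename
    """Load tab-delimited rules: variant spelling <TAB> normalized form."""
    rules = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) >= 2:
                rules[parts[0]] = parts[1]
    return rules

def normalize(tokens, rules):
    """Replace period spellings (e.g. "bury'd") with their normalized forms."""
    return [rules.get(token, token) for token in tokens]

rules = load_rules()
print(normalize(["She", "bury'd", "the", "letter"], rules))
```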
The other thing it might be useful to share is a list of 2grams extracted from the Google English corpus, which I use for contextual spellchecking. This includes only 2grams where one of the two elements is a token like “fix” or “flip” that could be read either as a valid word or as an OCR error caused by the long s. Since the long s is also a problem in the Google dataset itself up to 1820, this list was based on frequencies from 1825-50. That’s not perfect for correcting texts in the 1800-1820 period, but I find that in practice it’s adequate. There are two columns here: the 2gram itself, and the frequency.
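Here’s a minimal sketch of how that list gets used, assuming the two-column format just described. The filename is a placeholder, and the add-one smoothing is an illustrative choice rather than necessarily what my script does.

```python
def load_bigrams(path="ambiguous_2grams.txt"):   # placeholder filename
    """Load tab-delimited lines: '2gram <TAB> frequency'."""
    counts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 2:
                counts[parts[0]] = float(parts[1])
    return counts

def choose_in_context(prev_word, candidates, next_word, bigrams):
    """Pick the candidate whose neighboring 2grams are most frequent.
    Given 'the ___ sailed' and {'flip', 'slip', 'ship'}, the frequency
    of 'ship sailed' should carry the day."""
    def score(word):
        left = bigrams.get(f"{prev_word} {word}", 0) + 1    # add-one smoothing
        right = bigrams.get(f"{word} {next_word}", 0) + 1
        return left * right
    return max(candidates, key=score)

bigrams = load_bigrams()
print(choose_in_context("the", ["flip", "slip", "ship"], "sailed", bigrams))
```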
7 replies on “The challenges of digital work on early-19c collections.”
[…] by automatically correcting 19c OCR that we got from the Internet Archive. Our strategy involved statistically cautious, period-specific spellchecking, combined with enough reasoning about context to realize that “mortal fin” is probably “mortal […]
[…] my era is late 18th and early 19th century. Which means I’ll come up against the problems described and solved by Ted Underwood. It’s fantastic that he’s found a solution which works, but 4,600 rules for solving […]
On the one hand I understand the motives behind OCR of older documents. These are not all bad motives. However, I would never rely on a text derived from an OCR scan. I work on translations of older documents and I only use scans of the original document. So, for example, if I am translating something from Migne, I use a PDF containing an optical scan, so in effect I am looking at the original print document. I would not even consider referring to a text derived from OCR.
Agreed. If I were translating or producing a scholarly edition of a text, I would naturally work with the original images, or the original object. Data mining 470,000 volumes is a very different task.
I am working on a Python project where I need to use OCR to rip a huge collection of DVD/Blu-ray subtitles. My needs are very different (I have much less text to process, and DVD subtitle images are cleaner than 19c documents), but I am still facing some typical OCR errors like character confusion (1 -> I, m -> iii). I have already implemented a voting system that chooses among the output of several OCR engines and merges them to improve accuracy (obviously not an option in your case). I am now considering OCR correction; there are tons of publications on the matter and even some implementations available online, but not much in Python, so I was pleased to read that you have already written a Python script along these lines. As you mention in your post, I’d like to get a copy of your implementation and cannibalize/adapt it to my needs ^^ cheers
[…] .zip file (350MB). The 19c files are based on optically scanned text, but we corrected the OCR with a Python script that considered the probability of specific character substitutions and word sequenc… Word-by-word recall is now better than 95%, and precision (which matters more for the analysis […]
[…] [18] “The challenges of digital work on early-19c collections,” October 7, 2011, The Stone and the Shell, https://tedunderwood.com/2011/10/07/the-challenges-of-digital-work-on-early-19c-collections/. […]