A window on the twentieth century may be about to open.

The nineteenth century gets a lot of attention from scholars interested in text mining, simply because it’s in the public domain. After 1923, you run into copyright laws that make it impossible to share digital texts of many volumes.

"Ray of Light," by Russell H Cribb, 2006.   CC-BY 2.0.

“Ray of Light,” by Russell H Cribb, 2006. CC-BY 2.0.

One of the most promising solutions to that problem is the non-consumptive research portal being designed by the HathiTrust Research Center. In non-consumptive research, algorithms characterize a collection without exposing the original texts to human reading or copying.

This could work in a range of ways. Some of them are complex — for instance, if worksets and algorithms have to be tailored to individual projects. HTRC is already supporting that kind of research, but expanding it to the twentieth century may pose problems of scale that take a while to solve. But where algorithms can be standardized, calculations can run once, in advance, across a whole collection, creating datasets that are easy to serve up in a secure way. This strategy could rapidly expand opportunities for research on twentieth-century print culture.

For instance, a great deal of interesting macroscopic research can be done, at bottom, by counting words. JSTOR has stirred up a lot of interest by making word counts available for scholarly journal articles. Word counts from printed books would be at least equally valuable, and relatively easy to provide.

So people interested in twentieth-century history and literary history should prick up their ears at the news that HathiTrust Research Center is releasing an initial set of word counts from public-domain works as an alpha test. This set only includes 250,000 of the eleven million volumes in HathiTrust, and does not yet include any data about works after 1923, but one can hope that the experiment will soon expand to cover the twentieth century. (I’m just an interested observer, so I don’t know how rapid the expansion will be, but the point of this experiment is ultimately to address obstacles to twentieth-century text mining.)

The data provided by HTRC is in certain ways richer than the data provided by JSTOR, and it may already provide a valuable service for scholars who study the nineteenth or early twentieth centuries. Words are tagged with parts of speech, and word counts are provided at the page level — an important choice, since a single volume may combine a number of different kinds of text. HTRC is also making an effort to separate recurring headers and footers from the main text on each page; they’re providing line counts and sentence counts for each page, and also providing a count of the characters that begin and end lines. In my own research, I’ve found that it’s possible to use this kind of information to separate genres and categories of paratext within a volume (the lines of an index tend to begin with capital letters and end with numbers, for instance).
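To make that concrete, here is a toy sketch (in Python, with invented field names rather than HTRC’s actual schema) of how page-level line features might be used to flag a likely index page:

```python
# Toy sketch: flag likely index pages from page-level line features.
# The dictionary keys below are invented for illustration; they are not
# the feature names HTRC actually uses.

def looks_like_index(page):
    """Guess whether a page is an index: most lines begin with a capital
    letter, and a large share of lines end with a digit (page numbers)."""
    total_lines = page.get("line_count", 0)
    if total_lines == 0:
        return False
    cap_starts = page.get("lines_starting_with_capital", 0)
    digit_ends = page.get("lines_ending_with_digit", 0)
    return cap_starts / total_lines > 0.8 and digit_ends / total_lines > 0.5

sample_page = {"line_count": 40,
               "lines_starting_with_capital": 37,
               "lines_ending_with_digit": 32}
print(looks_like_index(sample_page))  # True
```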

Of course, researchers would like to pose many questions that can’t be answered by page-level unigram counts. Some of those questions will require custom-tailored algorithms. But other questions might be possible to address with precalculated features extracted in a relatively standard way.

Whatever kinds of information interest you, speak up for them, using the e-mail address provided on the HTRC feature-extraction page. And if this kind of service would have value for your research, please write in to say how you would use it. Part of the point of this experiment is to assess the degree of scholarly interest.

You can’t govern reception.

I’ve read a number of articles lately that posit “digital humanities” as a coherent intellectual movement that makes strong, scary normative claims about the proper future of the humanities as a whole.

Adam Kirsch’s piece in The New Republic is the latest of these; he constructs an opposition between a “minimalist” DH that simply uses computers to edit or read things as we have always done, and a “maximalist” version where technology is taking over English departments and leveling solitary genius in order to impose a cooperative but “post-verbal” vision of the future.

I think there’s a large excluded middle in that picture, where everything interesting actually happens. But I’m resisting — or trying to resist — the urge to write a blog post of clarification and explanation. Increasingly, I believe that’s a futile impulse, not only because “DH” can be an umbrella for many different projects, but more fundamentally because “the meaning of DH” is a perspectival question.

I mean it’s true, objectively, that the number of scholars actually pursuing (say) digital history or game studies is still rather small. But I nevertheless believe that Kirsch is sincere in perceiving them as the narrow end of a terrifying wedge. And there’s no way to prove he’s wrong about that, because threats are very much in the eye of the beholder. Projects don’t have to be explicitly affiliated with each other, or organized around an explicit normative argument, in order to be perceived collectively as an implicit rebuke to some existing scheme of values. In fact, people don’t even really get to choose whether they’re part of a threatening phenomenon. Franco Moretti hasn’t been cheerleading for anything called “digital humanities,” but that point is rapidly becoming moot.

I’m reminded of a piece of advice Mark Seltzer gave me sixteen years ago, during my dissertation defense. Like all grad students in the 90s, I had written an overly long introduction explaining what my historical research meant in some grander theoretical way. As I recall, he said simply, “you can’t govern your own reception.” A surprisingly hard thing to accept! People of course want to believe that they’re the experts about the meaning of their own actions. But that’s not how social animals work.

So I’m going to try to resist the temptation to debate the meaning of “DH,” which is not in anyone’s control. Instead I’m going to focus on doing cool stuff. Like Alexis Madrigal’s reverse-engineering of Netflix genres, or Mark Sample’s Twitter bots, or the Scholars’ Lab project PRISM, which apparently forgot to take over English departments and took over K-12 education instead. At some future date, historians can decide whether any of that was digital humanities, and if so, what it meant.

(Comments are turned off, because you can’t moderate a comment thread titled “you can’t govern reception.”)

Postscript May 10th: This was written quickly, in the heat of the occasion, and I think my anecdote may be better at conveying a feeling than explaining its underlying logic. Obviously, “you can’t govern reception” cannot mean “never try to change what other people think.” Instead, I mean that “digital humanities” seems to me a historical generalization more than a “field” or a “movement” based on shared premises that could be debated. I see it as closer to “modernism,” for instance, than to “psychology” or “post-structuralism.”

You cannot really write editorials convincing people to like “modernism.” You’d have to write a book. Even then, understandings of the historical phenomenon are going to differ, and some people are going to feel nostalgic for impressionist painting. The analogy to “DH” is admittedly imperfect; DH is an academic phenomenon (mostly! at times it’s hard to distinguish from data journalism), and has slightly more institutional coherence than modernism did. But I’m not sure it has more intellectual coherence.

How much DH can we fit in a literature department?

It’s an open secret that the social phenomenon called “digital humanities” mostly grew outside the curriculum. Library-based programs like Scholars’ Lab at UVA have played an important role; so have “centers” like MITH (Maryland) and CHNM (George Mason) — not to mention the distributed unconference movement called THATCamp, which started at CHNM. At Stanford, the Literary Lab is a sui generis thing, related to departments of literature but not exactly contained inside them.

The list could go on, but I’m not trying to cover everything — just observing that “DH” didn’t begin by embedding itself in the curricula of humanities departments. It went around them, in improvisational and surprisingly successful ways.

That’s a history to be proud of, but I think it’s also setting us up for predictable frustrations at the moment, as disciplines decide to import “DH” and reframe it in disciplinary terms. (“Seeking a scholar of early modern drama, with a specialization in digital humanities …”)

Of course, digital methods do have consequences for existing disciplines; otherwise they wouldn’t be worth the trouble. In my own discipline of literary study, it’s now easy to point to a long sequence of substantive contributions that use digital methods to make thesis-driven interventions in literary history and even interpretive theory.

But although the research payoff is clear, the marriage between disciplinary and extradisciplinary institutions may not be so easy. I sense that a lot of friction around this topic is founded in a feeling that it ought to be straightforward to integrate new modes of study in disciplinary curricula and career paths. So when this doesn’t go smoothly, we feel there must be some irritating mistake in existing disciplines, or in the project of DH itself. Something needs to be trimmed to fit.

What I want to say is just this: there’s actually no reason this should be easy. Grafting a complex extradisciplinary project onto existing disciplines may not completely work. That’s not because anyone made a mistake.

Consider my home field of literary study. If digital methods were embodied in a critical “approach,” like psychoanalysis, they would be easy to assimilate. We could identify digital “readings” of familiar texts, add an article to every Norton edition, and be done with it. In some cases that actually works, because digital methods do after all change the way we read familiar texts. But DH also tends to raise foundational questions about the way literary scholarship is organized. Sometimes it valorizes things we once considered “mere editing” or “mere finding aids”; sometimes it shifts the scale of literary study, so that courses organized by period and author no longer make a great deal of sense. Disciplines can be willing to welcome new ideas, and yet (understandably) unwilling to undertake this sort of institutional reorganization.

Training is an even bigger problem. People have argued long and fiercely about the amount of digital training actually required to “do DH,” and I’m not going to resolve that question here. I just want to say that there’s a reason for the argument: it’s a thorny problem. In many cases, humanists are now tackling projects that require training not provided in humanities departments. There are a lot of possible fixes for that — we can make tools easier to use, foster collaboration — but none of those fixes solve the whole problem. Not everything can be externalized as a “tool.” Some digital methods are really new forms of interpretation; packaging them in a GUI would create a problematic black box. Collaboration, likewise, may not remove the need for new forms of training. Expecting computer scientists to do all the coding on a project can be like expecting English professors to do all the spelling.

I think these problems can find solutions, but I’m coming to suspect that the solutions will be messy. Humanities curricula may evolve, but I don’t think the majority of English or History departments are going to embrace rapid structural change — for instance, change of the kind that would be required to support graduate programs in distant reading. These disciplines have already spent a hundred years rejecting rapprochement with social science; why would they change course now? English professors may enjoy reading Moretti, but it’s going to be a long time before they add a course on statistical methods to the major.

Meanwhile, there are other players in this space (at least at large universities): iSchools, Linguistics, Departments of Communications, Colleges of Media. Digital methods are being assimilated rapidly in these places. New media, of course, are already part of media studies, and if a department already requires statistics, methods like topic modeling are less of a stretch. It’s quite possible that the distant reading of literary culture will end up being shared between literature departments and (say) Communications. The reluctance of literary studies to become a social science needn’t prevent social scientists from talking about literature.

I’m saying all this because I think there’s a strong tacit narrative in DH that understands extradisciplinary institutions as a wilderness, in which we have wandered that we may reach the promised land of recognition by familiar disciplinary authority. In some ways that’s healthy. It’s good to have work organized by clear research questions (so we aren’t just digitizing aimlessly), and I’m proud that digital methods are making contributions to the core concerns of literary studies.

But I’m also wary of the normative pressures associated with that narrative, because (if you’ll pardon the extended metaphor) I’m not sure this caravan actually fits in the promised land. I suspect that some parts of the sprawling enterprise called “DH” (in fact, some of the parts I enjoy most) won’t be absorbed easily in the curricula of History or English. That problem may be solved differently at different schools; the nice thing about strong extradisciplinary institutions is that they allow us to work together even if the question of disciplinary identity turns out to be complex.

postscript: This whole post should have footnotes to Bethany Nowviskie every time I use the term “extradisciplinary,” and to Matt Kirschenbaum every time I say “DH” with implicit air quotes.

New models of literary collectivity.

This is a version of a response I gave at session 155 of MLA 2014, “Literary Criticism at the Macroscale.” Slides and/or texts of the original papers by Andrew Piper and by Hoyt Long and Richard So are available on the web, as is another response by Haun Saussy.

* * *

The papers we heard today were not picking the low-hanging fruit of text mining. There’s actually a lot of low-hanging fruit out there still worth picking — big questions that are easy to answer quantitatively and that only require organizing large datasets — but these papers were tackling problems that are (for good or ill) inherently more difficult. Part of the reason involves their transnational provenance, but another reason is that they aren’t just counting or mapping known categories but trying to rethink some of the basic concepts we use to write literary history — in particular, the concept we call “influence” or “diffusion” or “intertextuality.”

I’m tossing several terms at this concept because I don’t think literary historians have ever agreed what it should be called. But to put it very naively: new literary patterns originate somehow, and somehow they are reproduced. Different generations of scholars have modeled this differently. Hoyt and Richard quote Laura Riding and Robert Graves exploring, in 1927, an older model centered on basically personal relationships of imitation or influence. But early-twentieth-century scholars could also think anthropologically about the transmission of motifs or myths or A. O. Lovejoy’s “unit ideas.” In the later 20th century, critics got more cautious about implying continuity, and reframed this topic abstractly as “intertextuality.” But then the specificity of New Historicism sometimes pushed us back in the direction of tracing individual sources.

I’m retelling a story you already know, but trying to retell it very frankly, in order to admit that (while we’ve gained some insight) there is also a sense in which literary historians keep returning to the same problem and keep answering it in semi-satisfactory ways. We don’t all, necessarily, aspire to give a causal account of literary change. But I think we keep returning to this problem because we would like to have a kind of narrative that can move more smoothly between individual examples and the level of the discourse or genre. When we’re writing our articles the way this often works in practice is: “here’s one example, two examples — magic hand-waving — a discourse!”

Something interesting and multivocal about literary history gets lost at the moment when we do that hand-waving. The things we call genres or discourses have an internal complexity that may be too big to illustrate with examples, but that also gets lost if you try to condense it into a single label, like “the epistolary novel.” Though we aspire to subtlety, in practice it’s hard to move from individual instances to groups without constructing something like the sovereign in the frontispiece for Hobbes’ Leviathan – a homogenous collection of instances composing a giant body with clear edges.

While they offer different solutions, I think both of the papers we heard today are imagining other ways to move between instances and groups. They both use digital methods to describe new forms of similarity between texts. And in both cases, the point of doing this lies less in precision than in creating a newly flexible model of collectivity. We gain a way of talking about texts that is collective and social, but not necessarily condensed into a single label. For Andrew, the “Werther effect” is less about defining a new genre than about recognizing a new set of relationships between different communities of works. For Hoyt and Richard, machine learning provides a way of talking about the reception of hokku that isn’t limited to formal imitation or to a group of texts obviously “influenced” by specific models. Algorithms help them work outward from clear examples of a literary-historical phenomenon toward a broader penumbra of similarity.

I think this kind of flexibility is one of the most important things digital tools can help us achieve, but I don’t think it’s on many radar screens right now. The reason, I suspect, is that it doesn’t fit our intuitions about computers. We understand that computers can help us with scale (distant reading), and we also get that they can map social networks. But the idea that computers can help us grapple with ambiguity and multiple determination doesn’t feel intuitive. Aren’t computers all about “binary logic”? If I tell my computer that this poem both is and is not a haiku, won’t it probably start to sputter and emit smoke?

Well, maybe not. And actually I think this is a point that should be obvious but just happens to fall in a cultural blind spot right now. The whole point of quantification is to get beyond binary categories — to grapple with questions of degree that aren’t well-represented as yes-or-no questions. Classification algorithms, for instance, are actually very good at shades of gray; they can express predictions as degrees of probability and assign the same text different degrees of membership in as many overlapping categories as you like. So I think it should feel intuitive that a quantitative approach to literary history would have the effect of loosening up categories that we now tend to treat too much as homogenous bodies. If you need to deal with gradients of difference, numbers are your friend.
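A minimal sketch of what that looks like in practice, with toy data and scikit-learn (this is just an illustration of graded, overlapping membership, not the method used in either paper):

```python
# Sketch: one binary classifier per category, so a single text can receive a
# different degree of membership in several overlapping categories at once.
# Toy data; assumes scikit-learn is installed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "moonlit blossoms fall on the quiet pond",
    "the detective examined the bloodstained letter",
    "petals drift past the inspector's window",
    "a quiet pond reflects the autumn moon",
]
labels = {
    "haiku-like": [1, 0, 1, 1],
    "detective":  [0, 1, 1, 0],
}

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
new_text = vectorizer.transform(["the inspector watched petals drift across the pond"])

for category, y in labels.items():
    model = LogisticRegression().fit(X, y)
    prob = model.predict_proba(new_text)[0, 1]
    print(f"{category}: {prob:.2f}")  # a degree of membership, not a verdict
```

The same text can score high for both categories; nothing sputters or emits smoke.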

Of course, how exactly this is going to work remains an open question. Technically, the papers we heard today approach the problem of similarity in different ways. Hoyt and Richard are borrowing machine learning algorithms that use the contrast between groups of texts to define similarity. Andrew’s improvising a different approach that uses a single work to define a set of features that can then be used to organize other works as an “exotext.” And other scholars have approached the same problem in other ways. Franco Moretti’s chapter on “Trees” also bridges the gap I’m talking about between individual examples and coherent discourses; he does it by breaking the genre of detective fiction up into a tree of differentiations. It’s not a computational approach, but for some problems we may not need computation. Matt Jockers, on the other hand, has a chapter on “influence” in Macroanalysis that uses topic modeling to define global criteria of similarity for nineteenth-century novels. And I could go on: Sara Steger, for instance, has done work on sentimentality in the nineteenth century novel that uses machine learning in a loosely analogous way to think about the affective dimension of genre.

The differences between these projects are worth discussing, but in this response I’m more interested in highlighting the common impulse they share. While these projects explore specific problems in literary history, they can also be understood as interventions in literary theory, because they’re all attempting to rethink certain basic concepts we use to organize literary-historical narrative. Andrew’s concept of the “exotext” makes this theoretical ambition most overt, but I think it’s implicit across a range of projects. For me the point of the enterprise, at this stage, is to brainstorm flexible alternatives to our existing, slightly clunky, models of literary collectivity. And what I find exciting at the moment is the sheer proliferation of alternatives.

Measurement and modeling.

If the Internet is good for anything, it’s good for speeding up the Ent-like conversation between articles, to make that rumble more perceptible to human ears. I thought I might help the process along by summarizing the Stanford Literary Lab’s latest pamphlet — a single-authored piece by Franco Moretti, “‘Operationalizing’: or the function of measurement in modern literary theory.”

One of the many strengths of Moretti’s writing is a willingness to dramatize his own learning process. This pamphlet situates itself as a twist in the ongoing evolution of “computational criticism,” a turn from literary history to literary theory.

Measurement as a challenge to literary theory, one could say, echoing a famous essay by Hans Robert Jauss. This is not what I expected from the encounter of computation and criticism; I assumed, like so many others, that the new approach would change the history, rather than the theory of literature ….

Measurement challenges literary theory because it asks us to “operationalize” existing critical concepts — to say, for instance, exactly how we know that one character occupies more “space” in a work than another. Are we talking simply about the number of words they speak? or perhaps about their degree of interaction with other characters?
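To make the crudest version of that concrete, here is a toy sketch that operationalizes character-space as the share of dialogue words each character speaks. It is deliberately simple, and it is not Woloch’s concept or Moretti’s actual procedure:

```python
# Toy operationalization of "character-space": the share of dialogue words
# each character speaks. Hypothetical snippet of data, purely illustrative.
from collections import Counter

dialogue = [
    ("Hamlet", "To be, or not to be, that is the question"),
    ("Ophelia", "Good my lord, how does your honour for this many a day?"),
    ("Hamlet", "I humbly thank you; well, well, well"),
]

words_spoken = Counter()
for speaker, speech in dialogue:
    words_spoken[speaker] += len(speech.split())

total = sum(words_spoken.values())
for speaker, count in words_spoken.most_common():
    print(f"{speaker}: {count / total:.0%} of dialogue words")
```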

Moretti uses Alex Woloch’s concept of “character-space” as a specific example of what it means to operationalize a concept, but he’s more interested in exploring the broader epistemological question of what we gain by operationalizing things. When literary scholars discuss quantification, we often tacitly assume that measurement itself is on trial. We ask ourselves whether measurement is an adequate proxy for our existing critical concepts. Can mere numbers capture the ineffable nuances we assume those concepts possess? Here, Moretti flips that assumption and suggests that measurement may have something to teach us about our concepts — as we’re forced to make them concrete, we may discover that we understood them imperfectly. At the end of the article, he suggests for instance (after begging divine forgiveness) that Hegel may have been wrong about “tragic collision.”

I think Moretti is frankly right about the broad question this pamphlet opens. If we engage quantitative methods seriously, they’re not going to remain confined to empirical observations about the history of predefined critical concepts. Quantification is going to push back against the concepts themselves, and spill over into theoretical debate. I warned y’all back in August that literary theory was “about to get interesting again,” and this is very much what I had in mind.

At this point in a scholarly review, the standard procedure is to point out that a work nevertheless possesses “oversights.” (Insight, meet blindness!) But I don’t think Moretti is actually blind to any of the reflections I add below. We have differences of rhetorical emphasis, which is not the same thing.

For instance, Moretti does acknowledge that trying to operationalize concepts could cause them to dissolve in our hands, if they’re revealed as unstable or badly framed (see his response to Bridgman on pp. 9-10). But he chooses to focus on a case where this doesn’t happen. Hegel’s concept of “tragic collision” holds together, on his account; we just learn something new about it.

In most of the quantitative projects I’m pursuing, this has not been my experience. For instance, in developing statistical models of genre, the first thing I learned was that critics use the word genre to cover a range of different kinds of categories, with different degrees of coherence and historical volatility. Instead of coming up with a single way to operationalize genre, I’m going to end up producing several different mapping strategies that address patterns on different scales.

Something similar might be true even about a concept like “character.” In Vladimir Propp’s Morphology of the Folktale, for instance, characters are reduced to plot functions. Characters don’t have to be people or have agency: when the hero plucks a magic apple from a tree, the tree itself occupies the role of “donor.” On Propp’s account, it would be meaningless to represent a tale like “Le Petit Chaperon Rouge” as a social network. Our desire to imagine narrative as a network of interactions between imagined “people” (wolf ⇌ grandmother) presupposes a separation between nodes and edges that makes no sense for Propp. But this doesn’t necessarily mean that Moretti is wrong to represent Hamlet as a social network: Hamlet is not Red Riding Hood, and tragic drama arguably envisions character in a different way. In short, one of the things we might learn by operationalizing the term “character” is that the term has genuinely different meanings in different genres, obscured for us by the mere continuity of a verbal sign. [I should probably be citing Tzvetan Todorov here, The Poetics of Prose, chapter 5.]

Illustration from "Learning Latent Personas of Film Characters," Bamman et. al.

Illustration from “Learning Latent Personas of Film Characters,” Bamman et. al.

Another place where I’d mark a difference of emphasis from Moretti involves the tension, named in my title, between “measurement” and “modeling.” Moretti acknowledges that there are people (like Graham Sack) who assume that character-space can’t be measured directly, and therefore look for “proxy variables.” But concepts that can’t be directly measured raise a set of issues that are quite a bit more challenging than the concept of a “proxy” might imply. Sack is actually trying to build models that postulate relations between measurements. Digital humanists are probably most familiar with modeling in the guise of topic modeling, a way of mapping discourse by postulating latent variables called “topics” that can’t be directly observed. But modeling is a flexible heuristic that could be used in a lot of different ways.
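Since topic modeling is the example most readers will know, here is a minimal sketch of it framed as latent-variable inference: the “topics” are never observed directly, only postulated to explain patterns of word co-occurrence. (A toy corpus, and not the film-personas model discussed below.)

```python
# Minimal sketch of topic modeling as a latent-variable model: infer unseen
# "topics" that explain observed word counts. Toy corpus; assumes scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the ship sailed across the stormy sea toward the harbor",
    "the captain ordered the crew to lower the sails",
    "she inherited the estate and married against her family's wishes",
    "the marriage settlement divided the estate among the heirs",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Each document gets a distribution over the inferred (latent) topics.
print(lda.transform(counts).round(2))
```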

The illustration on the right is a probabilistic graphical model drawn from a paper on the “Latent Personas of Film Characters” by Bamman, O’Connor, and Smith. The model represents a network of conditional relationships between variables. Some of those variables can be observed (like words in a plot summary w and external information about the film being summarized md), but some have to be inferred, like recurring character types (p) that are hypothesized to structure film narrative.

Having empirically observed the effects of illustrations like this on literary scholars, I can report that they produce deep, Lovecraftian horror. Nothing looks bristlier and more positivist than plate notation.

But I think this is a tragic miscommunication produced by language barriers that both sides need to overcome. The point of model-building is actually to address the reservations and nuances that humanists correctly want to interject whenever the concept of “measurement” comes up. Many concepts can’t be directly measured. In fact, many of our critical concepts are only provisional hypotheses about unseen categories that might (or might not) structure literary discourse. Before we can attempt to operationalize those categories, we need to make underlying assumptions explicit. That’s precisely what a model allows us to do.

It’s probably going to turn out that many things are simply beyond our power to model: ideology and social change, for instance, are very important and not at all easy to model quantitatively. But I think Moretti is absolutely right that literary scholars have a lot to gain by trying to operationalize basic concepts like genre and character. In some cases we may be able to do that by direct measurement; in other cases it may require model-building. In some cases we may come away from the enterprise with a better definition of existing concepts; in other cases those concepts may dissolve in our hands, revealed as more unstable than even poststructuralists imagined. The only thing I would say confidently about this project is that it promises to be interesting.

The imaginary conflicts disciplines create.

One thing I’ve never understood about humanities disciplines is our insistence on staging methodology as ethical struggle. I don’t think humanists are uniquely guilty here; at bottom, it’s probably the institution of disciplinarity itself that does it. But the normative tone of methodological conversation is particularly odd in the humanities, because we have a reputation for embracing multiple perspectives. And yet, where research methods are concerned, we actually seem to find that very hard.

It never seems adequate to say “hey, look through the lens of this method for a sec — you might see something new.” Instead, critics practicing historicism feel compelled to justify their approach by showing that close reading is the crypto-theological preserve of literary mandarins. Defenders of close reading, in turn, feel compelled to claim that distant reading is a slippery slope to takeover by the social sciences — aka, a technocratic boot stomping on the individual face forever. Or, if we do admit that multiple perspectives have value, we often feel compelled to prescribe some particular balance between them.

Imagine if biologists and sociologists went at each other in the same way.

“It’s absurd to study individual bodies, when human beings are social animals!”

“Your obsession with large social phenomena is a slippery slope — if we listened to you, we would eventually forget about the amazing complexity of individual cells!”

“Both of your methods are regrettably limited. What we need, today, is research that constantly tempers its critique of institutions with close analysis of mitochondria.”

As soon as we back up and think about the relation between disciplines, it becomes obvious that there’s a spectrum of mutually complementary approaches, and different points on the spectrum (or different combinations of points) can be valid for different problems.

So why can’t we see this when we’re discussing the possible range of methods within a discipline? Why do we feel compelled to pretend that different approaches are locked in zero-sum struggle — or that there is a single correct way of balancing them — or that importing methods from one discipline to another raises a grave ethical quandary?

It’s true that disciplines are finite, and space in the major is limited. But a debate about “what will fit in the major” is not the same thing as ideology critique or civilizational struggle. It’s not even, necessarily, a substantive methodological debate that needs to be resolved.

A half-decent OCR normalizer for English texts after 1700.

Perhaps not the most inspiring title. But the words are carefully chosen.

Basically, I’m sharing the code I use to correct OCR in my own research. I’ve shared parts of this before, but this is the first time I’ve made any effort to package it so that it will run on other people’s machines. If you’ve got Python 3.x, you should be able to clone this github repository, run OCRnormalizer.py, and point it at a folder of files you want corrected. The script is designed to handle data structures from HathiTrust, so (for instance) if you have zip files contained in a pairtree structure, it will recursively walk the directories to identify all zip files, concatenate pages, and write a file with the suffix “.clean.txt” in the same folder where each zip file lives. But it can also work on files from another source. If you point it at a flat folder of generic text files, it will correct those.
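For what it’s worth, here is a simplified illustration of the directory-walking convention I just described: find the zip files under a root folder and work out where each “.clean.txt” file would be written. This is a sketch, not the actual OCRnormalizer code.

```python
# Sketch of the convention described above: locate zip files under a root
# directory and derive the ".clean.txt" path beside each one.
# Illustrative only; not the actual OCRnormalizer logic.
import os

def find_output_paths(rootdir):
    pairs = []
    for folder, _, files in os.walk(rootdir):
        for name in files:
            if name.endswith(".zip"):
                source = os.path.join(folder, name)
                cleaned = source[:-len(".zip")] + ".clean.txt"
                pairs.append((source, cleaned))
    return pairs

for source, cleaned in find_output_paths("path/to/pairtree"):
    print(source, "->", cleaned)
```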

I’m calling this an OCR “normalizer” rather than “corrector” because it’s designed to accomplish very specific goals.

In my research, I’m mainly concerned with the kinds of errors that become problems for diachronic text mining. The algorithms I use can handle a pretty high level of error as long as those errors are distributed in a more-or-less random way. If a word is mistranscribed randomly in 200 different ways, each of those errors may be rare enough to drop out of the analysis. You don’t necessarily have to catch them all.

The percentage of tokens in the HathiTrust corpus that are recognized as words before (red) and after (black) correction by my script. Technically this is not “recall” but a count of (true and false) “positives.”

The errors that become problems are the ones that cluster in particular words or periods. The notorious example is eighteenth-century “long S,” which caufes subftantial diflortions before 1820. Other errors caused by ligaturcs and worn typc also tend to cluster toward the early end of the timeline. But as you can see in the illustration above, long S is a particularly big issue; there’s a major improvement in OCR transcription shortly after 1800 as it gets phased out.

The range of possible OCR errors is close to infinite. It would be impossible to catch them all, and as you can see above, my script doesn’t. For a lot of nineteenth-century texts it produces a pretty small improvement. But it does normalize major variations (like long S) that would otherwise create significant distortions. (In cases like fame/same where a word could be either an OCR error or a real word, it uses the words on either side to disambiguate.)
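Here is a simplified illustration of that kind of contextual rule. The word lists are invented and tiny, and the real script’s rules are more elaborate, so treat this as the shape of the idea rather than its actual logic:

```python
# Sketch: for words that could be either a long-s OCR error or a real word
# (like "fame"/"same"), look at a neighboring word before substituting.
# Invented, tiny word lists; not the script's actual rules.

AMBIGUOUS = {"fame": "same", "cafe": "case", "funk": "sunk"}
CONTEXT_CUES = {"the", "that", "this", "very"}  # hints favoring the long-s reading

def normalize_tokens(tokens):
    corrected = []
    for i, word in enumerate(tokens):
        if word in AMBIGUOUS:
            prev_word = tokens[i - 1] if i > 0 else ""
            # "the fame page" is probably "the same page"; "her fame grew" is not.
            if prev_word in CONTEXT_CUES:
                word = AMBIGUOUS[word]
        corrected.append(word)
    return corrected

print(normalize_tokens("her fame grew but the fame errors recur on the fame page".split()))
# ['her', 'fame', 'grew', 'but', 'the', 'same', 'errors', 'recur', 'on', 'the', 'same', 'page']
```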

Moreover, certain things that aren’t “errors” can be just as problematic for diachronic analysis. E.g., it’s a problem that “today” is sometimes written “to day” and sometimes “to-day,” and it’s a problem that eighteenth-century verbs get “condens’d.” A script designed to correct OCR might leave these variants unaltered, but in order to make meaningful diachronic comparisons, I have to produce a corpus where variations of spelling and word division are normalized.

The rulesets contained in the repo standardize (roughly) to modern British practice. Some of the rules about variant spellings were drawn in part from rules associated with the Wordhoard project, and some of the rules for OCR correction were developed in collaboration with Loretta Auvil. Subfolders of the repo contain scripts I used to develop new rules.

I’ve called this release version 0.1 because it’s very rough. You can write Python in a disciplined, object-oriented way … but I, um, tend not to. This code has grown by accretion, and I’m sure there are bugs. More importantly, as noted above, this isn’t a generic “corrector” but a script that normalizes in order to permit diachronic comparison. It won’t meet everyone’s needs. But there may be a few projects out there that would find it useful as a resource — if so, feel free to fork it and alter it to fit your project!