The real problem with distant reading.

This will be an old-fashioned, shamelessly opinionated, 1000-word blog post.

Anyone who has tried to write literary history using numbers knows that they are a double-edged sword. On the one hand they make it possible, not only to consider more examples, but often to trace subtler, looser patterns than we could trace by hand.

On the other hand, quantitative methods are inherently complex, and unfamiliar for humanists. So it’s easy to bog down in preambles about method.

Social scientists talking about access to healthcare may get away with being slightly dull. But literary criticism has little reason to exist unless it’s interesting; if it bogs down in a methodological preamble, it’s already dead. Some people call the cause of death “positivism,” but only because that sounds more official than “boredom.”

This is a rhetorical rather than epistemological problem, and it needs a rhetorical solution. For instance, Matthew Wilkens and Cameron Blevins have rightly been praised for focusing on historical questions, moving methods to an appendix if necessary. You may also recall that a book titled Distant Reading recently won a book award in the US. Clearly, distant reading can be interesting, even exciting, when writers are able to keep it simple. That requires resisting several temptations.

One temptation to complexity is technical, of course: writers who want to reach a broad audience need to resist geeking out over the latest algorithm. Perhaps fewer people recognize that the advice of more traditional colleagues can be another source of temptation. Scholars who haven’t used computational methods rarely understand the rhetorical challenges that confront distant readers. They worry that our articles won’t be messy enough — bless their hearts — so they advise us to multiply close readings, use special humanistic visualizations, add editorial apparatus to the corpus, and scatter nuances liberally over everything.

Some parts of this advice are useful: a crisp, pointed close reading can be a jolt of energy. And Katherine Bode is right that, in 2016, scholars should share data. But a lot of the extra knobs and nuances that colleagues suggest adding are gimcrack notions that would aggravate the real problem we face: complexity.

Consider the common advice that distant readers should address questions about the representativeness of their corpora by weighting all the volumes to reflect their relative cultural importance. A lot of otherwise smart people have recommended this. But as far as I know, no one ever does it. The people who recommend it, don’t do it themselves, because a moment’s thought reveals that weighting volumes can only multiply dubious assumptions. Personally, I suspect that all quests for the One Truly Representative Corpus are a mug’s game. People who worry about representativeness are better advised to consult several differently-selected samples. That sometimes reveals confounding variables — but just as often reveals that selection practices make little difference for the long-term history of the variable you’re tracing. (The differences between canon and archive are not always as large, or as central to this project, as Franco Moretti initially assumed.)


Sic semper robotis. “I, Mudd” (1967).

Another tempting piece of advice comes from colleagues who invite distant readers to prove humanistic credentials by adding complexity to their data models. This suggestion draws moral authority from a long-standing belief that computers force everything to be a one or a zero, whereas human beings are naturally at home in paradox. That’s why Captain Kirk could easily destroy alien robots by confronting them with the Liar’s Paradox. “How can? It be X. But also? Not X. Does not compute. <Smell of frying circuitry>.”

Maybe in 1967 it was true that computers could only handle exclusive binary categories. I don’t know: I was divided into a pair of gametes at the time myself. But nowadays data models can be as complex as you like. Go ahead, add another column to the database. No technical limit will stop you. Make categories perspectival, by linking each tag to a specific observer. Allow contradictory tags. If you’re still worried that things are too simple, make each variable a probability distribution. The computer won’t break a sweat, although your data model may look like the illustration below to human readers.


Rube Goldberg, “Professor Butts and the Self-Operating Napkin” (1931), via Wikimedia Commons.

I just wrote an article, for instance, where I consider eighteen different sources of testimony about genre — each of which models a genre in ways that can implicitly conflict with, or overlap with, other definitions. I trust you can see the danger: it’s not that the argument will be too reductive. I was willing to run a risk of complexity in this case because I was tired of being told that computers force everything into binaries. Machine learning is actually good at eschewing fixed categories to tease out loose family resemblances; it can be every bit as perspectival, multifaceted, and blurry as you wanna be.

I hope my article manages to remain lively, but I think readers will discover, by the end, that it could have succeeded with a simpler data model. When I rework it for the book version, I may do some streamlining.

It’s okay to simplify the world in order to investigate a specific question. That’s what smart qualitative scholars do themselves, when they’re not busy giving impractical advice to their quantitative friends. Max Weber and Hannah Arendt didn’t make an impact on their respective fields — or on public life — by adding the maximum amount of nuance to everything, so their models could represent every aspect of reality at once, and also function as self-operating napkins.

Because distant readers use larger corpora and more explicit data models than is usual for literary study, critics of the field (internal as well as external) have a tendency to push on those visible innovations, asking “What is still left out?” Something will always be left out, but I don’t think distant reading needs to be pushed toward even more complexity and completism. Computers make those things only too easy. Instead distant readers need to struggle to retain the vividness and polemical verve of the best literary criticism, and the  “combination of simplicity and strength” that characterizes useful social theory.


Versions of disciplinary history.

Accounts of the history of the humanities are being strongly shaped, right now, by stances for or against something called “digital humanities.” I have to admit I avoid the phrase when I can. The good thing about DH is, it creates a lively community that crosses disciplinary lines to exchange ideas. The bad thing is, it also creates a community that crosses disciplinary lines to fight pointlessly over the meaning of “digital humanities.” Egyptologists and scholars of game studies, who once got along just fine doing different things, suddenly understand themselves as advancing competing, incompatible versions of DH.

The desire to defend a coherent tradition called DH can also lead to models of intellectual history that I find bizarre. Sometimes, for instance, people trace all literary inquiry using computers back to Roberto Busa. That seems to me an oddly motivated genealogy: it would only make sense if you thought the physical computers themselves were very important. I tend to trace the things people are doing instead to Janice Radway, Roman Jakobson, Raymond Williams, or David Blei.

On the other hand, we’ve recently seen that a desire to take a stand against digital humanities can lead to equally unpersuasive genealogies. I’m referring to a recent critique of digital humanities in LARB by Daniel Allington, Sarah Brouillette, and David Golumbia. The central purpose of the piece is to identify digital humanities as a neoliberal threat to the humanities.

I’m not going to argue about whether digital humanities is neoliberal; I’ve already said that I fear the term is becoming a source of pointless fights. So I’m not the person to defend the phrase, or condemn it. But I do care about properly crediting people who contributed to the tradition of literary history I work in, and here I think the piece in LARB leads to important misunderstandings.

The argument is supported by two moves that I would call genealogical sleight-of-hand. On the one hand, it unifies a wide range of careers that might seem to have pursued different ends (from E. D. Hirsch to Rita Felski) by the crucial connecting link that all these people worked at the University of Virginia. On the other hand, it needs to separate various things that readers might associate with digital humanities, so if any intellectual advances happen to take place in some corner of a discipline, it can say “well, you know, that part wasn’t really related; it had a totally different origin.”

I don’t mean to imply that the authors are acting in bad faith here; nor do I think people who over-credit Roberto Busa for all literary work done with computers are motivated by bad faith. This is just an occupational hazard of doing history. If you belong to a particular group (a national identity, or a loose social network like “DH”), there’s always a danger of linking and splitting things so history becomes a story about “the rise of France.” The same thing can happen if you deeply dislike a group.

So, I take it as a sincere argument. But the article’s “splitting” impulses are factually mistaken in three ways. First, the article tries to crisply separate everything happening in distant reading from the East Coast — where people are generally tarnished (in the authors’ eyes) by association with UVA. Separating these traditions allows the article to conclude “well, Franco Moretti may be a Marxist, but the kind of literary history he’s pursuing had nothing to do with those editorial theory types.”

That’s just not true; the projects may be different, but there have also been strong personal and intellectual connections between them. At times, the connections have been embodied institutionally in the ADHO, but let me offer a more personal example: I wouldn’t be doing what I’m doing right now if it weren’t for the MONK project. Before I knew how to code — or, code in anything other than 1980s-era Basic — I spent hours playing with the naive Bayes feature in MONK online, discovering what it was capable of. For me, that was the gateway drug that led eventually to a deeper engagement with sociology of literature, book history, machine learning, and so on. MONK was created by a group centered at our Graduate School of Library and Information Science, but the dark truth is that several of those people had been trained at UVA (I know Unsworth, Ramsay, and Kirschenbaum were involved — pardon me if I’m forgetting others).

MONK is also an example of another way the article’s genealogy goes wrong: by trying to separate anything that might be achieved intellectually in a field like literary history from the mere “support functions for the humanities” provided by librarians and academic professionals. Just as a matter of historical fact, that’s not a correct account of how large-scale literary history has developed. My first experiment with quantitative methods — before MONK — took shape back in 1995, when my first published article, in Studies in Romanticism (1995). used quantitative methods influenced by Mark Olsen, a figure who deserves a lot more credit than he has received. Olsen had already sketched out the theoretical rationale for a research program you might call “distant reading” in 1989, arguing that text analysis would only really become useful for the humanities when it stopped trying to produce readings of individual books and engaged broad social-historical questions. But Olsen was not a literature professor. He had a Ph.D in French history, and was working off the tenure track with a digital library called ARTFL at the University of Chicago.

Really at every step of the way — from ARTFL, to MONK, to the Stanford Literary Lab, to HathiTrust Research Center — my thinking about this field has been shaped by projects that were organized and led by people with appointments in libraries and/or in library science. You may like that, or feel that it’s troubling — up to you — but it’s the historical fact.

Personally, I take it as a sign that, in historical disciplines, libraries and archives really matter. A methodology, by itself, is not enough; you also need material, and the material needs to be organized in ways that are far from merely clerical. Metadata is a hard problem. The organization of the past is itself an interpretive act, and libraries are one of the institutional forms it takes. I might not have realized that ten years ago, but after struggling to keep my head above water in a sea of several million books, I feel it very sincerely.

This is why I think the article is also wrong to treat distant reading as a mere transplantation of social-science methods. I suspect the article has seen this part of disciplinary history mainly through the lens of Daniel Allington’s training in linguistics, so I credit it as a good-faith understanding: if you’re trained in social science, then I understand, large-scale literary history will probably look like sociology and linguistics that happen to have gotten mixed in some way and then applied to the past.

But the article is leaving out something that really matters in this field, which is turning methods into historical arguments. To turn social-scientific methods into literary history, you have to connect the results of a model, meaningfully, to an existing conversation about the literary past. For that, you need a lot of things that aren’t contained in the original method. Historical scholarship. Critical insight, dramatized by lively writing. And also metadata. Authors’ dates of birth and death; testimony about perceived genre categories. A corpus isn’t enough. Social-scientific methods can only become literary history in collaboration with libraries.

I know nothing I have said here will really address the passions evoked on multiple sides by the LARB article. I expect this post will be read by some as an attempt to defend digital humanities, and by others as a mealy-mouthed failure to do so. That’s okay. But from my own (limited) perspective, I’m just trying to write some history here, giving proper credit to people who were involved in building the institutions and ideas I rely on. Those people included social scientists, humanists, librarians, scholars in library and information science, and people working off the tenure track in humanities computing.

Postscript: On the importance of libraries, see Steven E. Jones, quoting Bethany Nowviskie about the catalytic effect of Google Books (Emergence 8, and “Resistance in the Materials”). Since metadata matters, Google Books became enormously more valuable to scholars in the form of HathiTrust. The institutional importance I attribute to libraries is related to Alan Liu’s recent observations about the importance of critically engaging infrastructure.


Jones, Steven E. The Emergence of the Digital Humanities. New York: Routledge, 2014.

Olsen, Mark. “The History of Meaning: Computational and Quantitative Methods in Intellectual History,” Joumal of History and Politics 6 (1989): 121-54.

Olsen, Mark. “Signs, Symbols, and Discourses: A New Direction for Computer-Aided Literary Studies.” Computers and the Humanities 27 (1993): 309-14.