I think I see an interesting theoretical debate over the horizon. The debate is too big to resolve in a blog post, but I thought it might be narratively useful to foreshadow it—sort of as novelists create suspense by dropping hints about the character traits that will develop into conflict by the end of the book.
Basically, the problem is that scholars who use numbers to understand literary history have moved on from Stanley Fish’s critique, without much agreement about why or how. In the early 1970s, Fish gave a talk at the English Institute that defined a crucial problem for linguistic analysis of literature. Later published as “What Is Stylistics, and Why Are They Saying Such Terrible Things About It?”, the essay focused on “the absence of any constraint” governing the move “from description to interpretation.” Fish takes Louis Milic’s discussion of Jonathan Swift’s “habit of piling up words in series” as an example. Having demonstrated that Swift does this, Milic concludes that the habit “argues a fertile and well stocked mind.” But Fish asks how we can make that sort of inference, generally, about any linguistic pattern. How do we know that reliance on series demonstrates a “well stocked mind” rather than, say, “an anal-retentive personality”?
The problem is that isolating linguistic details for analysis also removes them from the context we normally use to give them a literary interpretation. We know what the exclamation “Sad!” implies, when we see it at the end of a Trumpian tweet. But if you tell me abstractly that writer A used “sad” more than writer B, I can’t necessarily tell you what it implies about either writer. If I try to find an answer by squinting at word lists, I’ll often make up something arbitrary. Word lists aren’t self-interpreting.
Thirty years passed; the internet got invented. In the excitement, dusty critiques from the 1970s got buried. But Fish’s argument was never actually killed, and if you listen to the squeaks of bats, you hear rumors that it still walks at night.
Or you could listen to blogs. This post is partly prompted by a blogged excerpt from a forthcoming work by Dennis Tenen, which quotes Fish to warn contemporary digital humanists that “a relation can always be found between any number of low-level, formal features of a text and a given high-level account of its meaning.” Without “explanatory frameworks,” we won’t know which of those relations are meaningful.
Ryan Cordell’s recent reflections on “machine objectivity” could lead us in a similar direction. At least they lead me in that direction, because I think the error Cordell discusses—over-reliance on machines themselves to ground analysis—often comes from a misguided attempt to solve the problem of arbitrariness exposed by Fish. Researchers are attracted to unsupervised methods like topic modeling in part because those methods seem to generate analytic categories that are entirely untainted by arbitrary human choices. But as Fish explained, you can’t escape making choices. (Should I label this topic “sadness” or “Presidential put-downs”?)
I don’t think any of these dilemmas are unresolvable. Although Fish’s critique identified a real problem, there are lots of valid solutions to it, and today I think most published research is solving the problem reasonably well. But how? Did something happen since the 1970s that made a difference? There are different opinions here, and the issues at stake are complex enough that it could take decades of conversation to work through them. Here I just want to sketch a few directions the conversation could go.
Dennis Tenen’s recent post implies that the underlying problem is that our models of form lack causal, explanatory force. “We must not mistake mere extrapolation for an account of deep causes and effects.” I don’t think he takes this conclusion quite to the point of arguing that predictive models should be avoided, but he definitely recommends that mere prediction be supplemented by explanatory inference. And to that extent, I agree—although, as I’ll say in a moment, I have a different diagnosis of the underlying problem.
It may also be worth reviewing Fish’s solution to his own dilemma in “What Is Stylistics,” which was that interpretive arguments need to be anchored in specific “interpretive acts” (93). That has always been a good idea. David Robinson’s analysis of Trump tweets identifies certain words (“badly,” “crazy”) as signs that a tweet was written by Trump, and others (“tomorrow,” “join”) as signs that it was written by his staff. But he also quotes whole tweets, so you can see how words are used in context, make your own interpretive judgment, and come to a better understanding of the model. There are many similar gestures in Stanford LitLab pamphlets: distant readers actually rely quite heavily on close reading.
My understanding of this problem has been shaped by a slightly later Fish essay, “Interpreting the Variorum” (1976), which returns to the problem broached in “What Is Stylistics,” but resolves it in a more social way. Fish concludes that interpretation is anchored not just in an individual reader’s acts of interpretation, but in “interpretive communities.” Here, I suspect, he is rediscovering an older hermeneutic insight, which is that human acts acquire meaning from the context of human history itself. So the interpretation of culture inevitably has a circular character.
One lesson I draw is simply that we shouldn’t work too hard to avoid making assumptions. Most of the time we do a decent job of connecting meaning to an implicit or explicit interpretive community. Pointing to examples, using word lists derived from a historical thesaurus or sentiment dictionary—all of that can work well enough. The really dubious moves we make often come from trying to escape circularity altogether, in order to achieve what Alan Liu has called “tabula rasa interpretation.”
Of course, if you pursue that approach systematically enough, it will lead you away from topic modeling toward methods that rely more explicitly on human judgment. I have been leaning on supervised algorithms a lot lately—not because they’re easier to test or more reliable than unsupervised ones, but because they explicitly acknowledge that interpretation has to be anchored in human history.
At first glance, this may seem to make progress impossible. “All we can ever discover is which books resemble these other books selected by a particular group of readers. The algorithm can only reproduce a category someone else already defined!” And yes, supervised modeling is circular. But this is a circularity shared by all interpretation of history, and it never merely reproduces its starting point. You can discover that books resemble each other to different degrees. You can discover that models defined by the responses of one interpretive community do or don’t align with models of another. And often you can, carefully, provisionally, draw explanatory inferences from the model itself, assisted perhaps by a bit of close reading.
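To make that circularity concrete, here is a minimal sketch in Python. The texts and labels are invented, standing in for the judgments of a hypothetical interpretive community that tagged passages “gothic” or “domestic”; a naive-Bayes-style classifier learns those human categories and then generalizes them to new passages. Nothing here is from any specific published model—it is just the shape of the supervised approach described above.

```python
# Toy sketch (invented data): a supervised model anchored in labels
# supplied by one "interpretive community," then applied to new texts.
from collections import Counter
from math import log

# Hypothetical labels: passages a community of readers tagged by genre.
corpus = [
    ("the ruined abbey loomed in moonlight", "gothic"),
    ("a spectre haunted the crumbling tower", "gothic"),
    ("she poured the tea and discussed the weather", "domestic"),
    ("the parlor conversation turned to marriage", "domestic"),
]

def train(examples):
    """Per-class word counts (naive Bayes style)."""
    counts = {}
    for text, label in examples:
        counts.setdefault(label, Counter()).update(text.split())
    return counts

def classify(counts, text):
    """Pick the label whose add-one-smoothed word probabilities best fit the text."""
    vocab = {w for c in counts.values() for w in c}
    best, best_score = None, float("-inf")
    for label, c in counts.items():
        total = sum(c.values()) + len(vocab)
        score = sum(log((c[w] + 1) / total) for w in text.split())
        if score > best_score:
            best, best_score = label, score
    return best

model = train(corpus)
print(classify(model, "a ghost in the ruined tower"))  # leans gothic
```

The categories come entirely from human readers, yet the model can still tell us something those readers didn’t explicitly encode: how strongly a new text resembles each category, and which words carry the resemblance.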
I’m not trying to diss unsupervised methods here. Actually, unsupervised methods are based on clear, principled assumptions. And a topic model is already a lot more contextually grounded than “use of series == well stocked mind.” I’m just saying that the hermeneutic circle is a little slipperier in unsupervised learning, easier to misunderstand, and harder to defend to crowds of pitchfork-wielding skeptics.
In short, there are lots of good responses to Fish’s critique. But if that critique is going to be revived by skeptics over the next few years—as I suspect—I think I’ll take my stand for the moment on supervised machine learning, which can explicitly build bridges between details of literary language and social contexts of reception. There are other ways to describe best practices: we could emphasize a need to seek “explanations,” or avoid claims of “objectivity.” But I think the crucial advance we have made over the 1970s is that we’re no longer just modeling language; we can model interpretive communities at the same time.
Photo credit: A school of yellow-tailed goatfish, photo for NOAA Photo Library, CC-BY Dwayne Meadows, 2004.
One of the many strengths of Moretti’s writing is a willingness to dramatize his own learning process. This pamphlet situates itself as a twist in the ongoing evolution of “computational criticism,” a turn from literary history to literary theory.
Measurement as a challenge to literary theory, one could say, echoing a famous essay by Hans Robert Jauss. This is not what I expected from the encounter of computation and criticism; I assumed, like so many others, that the new approach would change the history, rather than the theory of literature ….
Measurement challenges literary theory because it asks us to “operationalize” existing critical concepts — to say, for instance, exactly how we know that one character occupies more “space” in a work than another. Are we talking simply about the number of words they speak? Or perhaps about their degree of interaction with other characters?
Moretti uses Alex Woloch’s concept of “character-space” as a specific example of what it means to operationalize a concept, but he’s more interested in exploring the broader epistemological question of what we gain by operationalizing things. When literary scholars discuss quantification, we often tacitly assume that measurement itself is on trial. We ask ourselves whether measurement is an adequate proxy for our existing critical concepts. Can mere numbers capture the ineffable nuances we assume those concepts possess? Here, Moretti flips that assumption and suggests that measurement may have something to teach us about our concepts — as we’re forced to make them concrete, we may discover that we understood them imperfectly. At the end of the article, he suggests for instance (after begging divine forgiveness) that Hegel may have been wrong about “tragic collision.”
I think Moretti is frankly right about the broad question this pamphlet opens. If we engage quantitative methods seriously, they’re not going to remain confined to empirical observations about the history of predefined critical concepts. Quantification is going to push back against the concepts themselves, and spill over into theoretical debate. I warned y’all back in August that literary theory was “about to get interesting again,” and this is very much what I had in mind.
At this point in a scholarly review, the standard procedure is to point out that a work nevertheless possesses “oversights.” (Insight, meet blindness!) But I don’t think Moretti is actually blind to any of the reflections I add below. We have differences of rhetorical emphasis, which is not the same thing.
For instance, Moretti does acknowledge that trying to operationalize concepts could cause them to dissolve in our hands, if they’re revealed as unstable or badly framed (see his response to Bridgman on pp. 9-10). But he chooses to focus on a case where this doesn’t happen. Hegel’s concept of “tragic collision” holds together, on his account; we just learn something new about it.
In most of the quantitative projects I’m pursuing, this has not been my experience. For instance, in developing statistical models of genre, the first thing I learned was that critics use the word genre to cover a range of different kinds of categories, with different degrees of coherence and historical volatility. Instead of coming up with a single way to operationalize genre, I’m going to end up producing several different mapping strategies that address patterns on different scales.
Something similar might be true even about a concept like “character.” In Vladimir Propp’s Morphology of the Folktale, for instance, characters are reduced to plot functions. Characters don’t have to be people or have agency: when the hero plucks a magic apple from a tree, the tree itself occupies the role of “donor.” On Propp’s account, it would be meaningless to represent a tale like “Le Petit Chaperon Rouge” as a social network. Our desire to imagine narrative as a network of interactions between imagined “people” (wolf ⇌ grandmother) presupposes a separation between nodes and edges that makes no sense for Propp. But this doesn’t necessarily mean that Moretti is wrong to represent Hamlet as a social network: Hamlet is not Red Riding Hood, and tragic drama arguably envisions character in a different way. In short, one of the things we might learn by operationalizing the term “character” is that the term has genuinely different meanings in different genres, obscured for us by the mere continuity of a verbal sign. [I should probably be citing Tzvetan Todorov here, The Poetics of Prose, chapter 5.]
Another place where I’d mark a difference of emphasis from Moretti involves the tension, named in my title, between “measurement” and “modeling.” Moretti acknowledges that there are people (like Graham Sack) who assume that character-space can’t be measured directly, and therefore look for “proxy variables.” But concepts that can’t be directly measured raise a set of issues that are quite a bit more challenging than the concept of a “proxy” might imply. Sack is actually trying to build models that postulate relations between measurements. Digital humanists are probably most familiar with modeling in the guise of topic modeling, a way of mapping discourse by postulating latent variables called “topics” that can’t be directly observed. But modeling is a flexible heuristic that could be used in a lot of different ways.
Having empirically observed the effects of illustrations like this on literary scholars, I can report that they produce deep, Lovecraftian horror. Nothing looks bristlier and more positivist than plate notation.
But I think this is a tragic miscommunication produced by language barriers that both sides need to overcome. The point of model-building is actually to address the reservations and nuances that humanists correctly want to interject whenever the concept of “measurement” comes up. Many concepts can’t be directly measured. In fact, many of our critical concepts are only provisional hypotheses about unseen categories that might (or might not) structure literary discourse. Before we can attempt to operationalize those categories, we need to make underlying assumptions explicit. That’s precisely what a model allows us to do.
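What “making assumptions explicit” looks like in practice can be sketched with the generative story behind topic modeling. The vocabulary, topics, and probabilities below are invented for illustration; the point is only that the model openly postulates latent “topics” — distributions over words that documents mix in varying proportions — before any inference happens.

```python
# A minimal sketch of the *generative* assumption behind topic modeling:
# each document mixes latent "topics," and each topic is a probability
# distribution over words. (Toy vocabulary and probabilities are invented.)
import random

random.seed(0)

topics = {
    "sentiment": {"sad": 0.5, "happy": 0.3, "tears": 0.2},
    "politics":  {"election": 0.4, "president": 0.4, "vote": 0.2},
}

def sample(dist):
    """Draw one key from a dict of outcome -> probability."""
    r, acc = random.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if r < acc:
            return outcome
    return outcome  # guard against floating-point rounding

def generate_document(topic_weights, length=8):
    """For each word slot: draw a topic, then draw a word from that topic."""
    return [sample(topics[sample(topic_weights)]) for _ in range(length)]

doc = generate_document({"sentiment": 0.7, "politics": 0.3})
print(doc)  # mostly sentiment words, with some politics mixed in
```

Actual topic-modeling algorithms (LDA and its relatives) run this story in reverse, inferring the hidden topics from observed documents. The assumptions — how many topics, how documents mix them — are stated up front, which is exactly the explicitness the paragraph above recommends.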
It’s probably going to turn out that many things are simply beyond our power to model: ideology and social change, for instance, are very important and not at all easy to model quantitatively. But I think Moretti is absolutely right that literary scholars have a lot to gain by trying to operationalize basic concepts like genre and character. In some cases we may be able to do that by direct measurement; in other cases it may require model-building. In some cases we may come away from the enterprise with a better definition of existing concepts; in other cases those concepts may dissolve in our hands, revealed as more unstable than even poststructuralists imagined. The only thing I would say confidently about this project is that it promises to be interesting.
One thing I’ve never understood about humanities disciplines is our insistence on staging methodology as ethical struggle. I don’t think humanists are uniquely guilty here; at bottom, it’s probably the institution of disciplinarity itself that does it. But the normative tone of methodological conversation is particularly odd in the humanities, because we have a reputation for embracing multiple perspectives. And yet, where research methods are concerned, we actually seem to find that very hard.
It never seems adequate to say “hey, look through the lens of this method for a sec — you might see something new.” Instead, critics practicing historicism feel compelled to justify their approach by showing that close reading is the crypto-theological preserve of literary mandarins. Arguments for close reading, in turn, feel compelled to claim that distant reading is a slippery slope to takeover by the social sciences — aka, a technocratic boot stomping on the individual face forever. Or, if we do admit that multiple perspectives have value, we often feel compelled to prescribe some particular balance between them.
Imagine if biologists and sociologists went at each other in the same way.
“It’s absurd to study individual bodies, when human beings are social animals!”
“Your obsession with large social phenomena is a slippery slope — if we listened to you, we would eventually forget about the amazing complexity of individual cells!”
“Both of your methods are regrettably limited. What we need, today, is research that constantly tempers its critique of institutions with close analysis of mitochondria.”
As soon as we back up and think about the relation between disciplines, it becomes obvious that there’s a spectrum of mutually complementary approaches, and different points on the spectrum (or different combinations of points) can be valid for different problems.
So why can’t we see this when we’re discussing the possible range of methods within a discipline? Why do we feel compelled to pretend that different approaches are locked in zero-sum struggle — or that there is a single correct way of balancing them — or that importing methods from one discipline to another raises a grave ethical quandary?
It’s true that disciplines are finite, and space in the major is limited. But a debate about “what will fit in the major” is not the same thing as ideology critique or civilizational struggle. It’s not even, necessarily, a substantive methodological debate that needs to be resolved.
But there is one place where I’m coming to agree with people who say that quantitative methods can make us dumber. To put it simply: numbers tend to distract the eye. If you quantify part of your argument, critics (including your own internal critic) will tend to focus on problems in the numbers, and ignore the deeper problems located elsewhere.
I’ve discovered this in my own practice, for instance when I blogged about genre in large digital collections. I got a lot of useful feedback on those blog posts; it was probably the most productive conversation I’ve ever had as a scholar. But most of the feedback focused on potential problems in the quantitative dimension of my argument. E.g., how representative was this collection as a sample of print culture? Or, what smoothing strategies should I be using to plot results? My own critical energies were focused on similar questions.
Those questions were useful, and improved the project greatly, but in most cases they didn’t rock its foundations. And with a year’s perspective, I’ve come to recognize that there were after all foundation-rocking questions to be posed. For instance, in early versions of this project, I hadn’t really ironed out the boundary between “poetry” and “drama.” Those categories overlap, after all! This wasn’t creating quantitative problems (Jordan Sellers and I were handling cases consistently), but it was creating conceptual ones: the line “poetry” below should probably be labeled “nondramatic verse.”
The biggest problem was even less quantitative, and more fundamental: I needed to think harder about the concept of genre itself. As I model different kinds of genre, and read about similar (traditional and digital) projects by other scholars, I increasingly suspect the elephant in the room is that the word may not actually hold together. Genre may be a box we’ve inherited for a whole lot of basically different things. A bibliography is a genre; so is the novel; so is science fiction; so is the Kailyard school; so is acid house. But formally, socially, and chronologically, those are entities of very different kinds.
Skepticism about foundational concepts has been one of the great strengths of the humanities. The fact that we have a word for something (say genre or the individual) doesn’t necessarily imply that any corresponding entity exists in reality. Humanists call this mistake “reification,” and we should hold onto our skepticism about it. If I hand you a twenty-page argument using Google ngrams to prove that the individual has been losing ground to society over the last hundred years, your response should not be “yeah, but how representative is Google Books, and how good is their OCR?” (Those problems are relatively easy to solve.) Your response should be, “Uh … how do you distinguish ‘the individual’ from ‘society’ again?”
As I said, humanists have been good at catching reification; it’s a strength we should celebrate. But I don’t see this habit of skepticism as an endangered humanistic specialty that needs to be protected by a firewall. On the contrary, we should be exporting our skepticism! This habit of questioning foundational concepts can be just as useful in the sciences and social sciences, where quantitative methods similarly distract researchers from more fundamental problems. [I don’t mean to suggest that it’s never occurred to scientists to resist this distraction: as Matt Wilkens points out in the comments, they’re often good at it. -Ed.]
In psychology, for instance, emphasis on clearing a threshold of statistical significance (defined as a p-value) frequently distracts researchers from more fundamental questions of experimental design (like, are we attempting to measure an entity that actually exists?). Andrew Gelman persuasively suggests that this is not just a problem caused by quantification but can be more broadly conceived as a “dangerous lure of certainty.” In any field, it can be tempting to focus narrowly on the degree of certainty associated with a hypothesis. But it’s often more important to ask whether the underlying question is interesting and meaningfully framed.
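The lure is easy to simulate. The sketch below (purely illustrative, with invented parameters) runs a thousand “studies” comparing two groups drawn from the same distribution. Roughly five percent clear the p < .05 bar anyway — a reminder that a significant p-value says nothing about whether the entity being “measured” exists.

```python
# A thousand null "studies": both groups come from the SAME distribution,
# so every apparent effect is noise. About 5% cross the p < .05 line anyway.
import random
import statistics

random.seed(42)

def fake_study(n=50):
    """Compare two samples of pure noise using a normal-approximation test."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    se = (2 / n) ** 0.5  # standard error of the mean difference (sigma = 1)
    return abs(statistics.mean(a) - statistics.mean(b)) > 1.96 * se

hits = sum(fake_study() for _ in range(1000))
print(f"{hits} of 1000 null studies crossed p < .05")
```

By construction, every “discovery” here is false — which is why the design question (does the thing I’m measuring exist?) has to come before the certainty question.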
On the other hand, this doesn’t mean that humanists need to postpone quantitative research until we know how to define long-debated concepts. I’m now pretty skeptical about the coherence of this word genre, for instance, but it’s a skepticism I reached precisely by attempting to iron out details in a quantitative model. Questions about accuracy can prompt deeper conceptual questions, which reframe questions of accuracy, in a virtuous cycle. The important thing, I think, is not to let yourself stall out on the “accuracy” part of the cycle: it offers a tempting illusion of perfectibility, but that’s not actually our goal.
A couple of weeks ago, after reading abstracts from DH2013, I said that the take-away for me was that “literary theory is about to get interesting again” – subtweeting the course of history in a way that I guess I ought to explain.
In the twentieth century, “literary theory” was often a name for the sparks that flew when literary scholars pushed back against challenges from social science. Theory became part of the academic study of literature around 1900, when the comparative study of folklore seemed to reveal coherent patterns in national literatures that scholars had previously treated separately. Schools like the University of Chicago hired “Professors of Literary Theory” to explore the controversial possibility of generalization.* Later in the century, structural linguistics posed an analogous challenge, claiming to glimpse an organizing pattern in language that literary scholars sought to appropriate and/or deconstruct. Once again, sparks flew.
I think literary scholars are about to face a similarly productive challenge from the discipline of machine learning — a subfield of computer science that studies learning as a problem of generalization from limited evidence. The discipline has made practical contributions to commercial IT, but it’s an epistemological method founded on statistics more than it is a collection of specific tools, and it tends to be intellectually adventurous: lately, researchers are trying to model concepts like “character” (pdf) and “gender,” citing Judith Butler in the process (pdf).
This could be the beginning of a beautiful friendship. I realize a marriage between machine learning and literary theory sounds implausible: people who enjoy one of these things are pretty likely to believe the other is fraudulent and evil.** But after reading through a couple of ML textbooks,*** I’m convinced that literary theorists and computer scientists wrestle with similar problems, in ways that are at least loosely congruent. Neither field is interested in the mere accumulation of data; both are interested in understanding the way we think and the kinds of patterns we recognize in language. Both fields are interested in problems that lack a single correct answer, and have to be mapped in shades of gray (ML calls these shades “probability”). Both disciplines are preoccupied with the danger of overgeneralization (literary theorists call this “essentialism”; computer scientists call it “overfitting”). Instead of saying “every interpretation is based on some previous assumption,” computer scientists say “every model depends on some prior probability,” but there’s really a similar kind of self-scrutiny involved.
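The analogy between essentialism and overfitting can be made concrete with a standard toy example (invented data, unrelated to any of the texts discussed here): a degree-9 polynomial passes through all ten of its training points, but a plain line generalizes better to new samples from the same underlying process.

```python
# Overfitting in miniature: a model flexible enough to memorize its sample
# generalizes worse than a simpler one. Data are synthetic: y = x + noise.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = x + rng.normal(0, 0.1, 10)          # truth is linear, plus noise

# Held-out data from the same underlying process
x_new = np.linspace(0, 1, 100)
y_new = x_new + rng.normal(0, 0.1, 100)

errors = {}
for degree in (1, 9):
    coeffs = np.polyfit(x, y, degree)    # fit polynomial of given degree
    errors[degree] = float(np.mean((np.polyval(coeffs, x_new) - y_new) ** 2))
    print(f"degree {degree}: held-out error {errors[degree]:.4f}")
```

The degree-9 model has “essentialized” the accidents of its sample. A prior that penalizes flexibility — regularization, in ML terms — plays much the same cautionary role that anti-essentialist scruples play in theory.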
It’s already clear that machine learning algorithms (like topic modeling) can be useful tools for humanists. But I think I glimpse an even more productive conversation taking shape, where instead of borrowing fully-formed “tools,” humanists borrow the statistical language of ML to think rigorously about different kinds of uncertainty, and return the favor by exposing the discipline to boundary cases that challenge its methods.
Won’t quantitative models of phenomena like plot and genre simplify literature by flattening out individual variation? Sure. But the same thing could be said about Freud and Lévi-Strauss. When scientists (or social scientists) write about literature they tend to produce models that literary scholars find overly general. Which doesn’t prevent those models from advancing theoretical reflection on literature! I think humanists, conversely, can warn scientists away from blind alleys by reminding them that concepts like “gender” and “genre” are historically unstable. If you assume words like that have a single meaning, you’re already overfitting your model.
Of course, if literary theory and computer science do have a conversation, a large part of the conversation is going to be a meta-debate about what the conversation can or can’t achieve. And perhaps, in the end, there will be limits to the congruence of these disciplines. Alan Liu’s recent essay in PMLA pushes against the notion that learning algorithms can be analogous to human interpretation, suggesting that statistical models become meaningful only through the inclusion of human “seed concepts.” I’m not certain how deep this particular disagreement goes, because I think machine learning researchers would actually agree with Liu that statistical modeling never starts from a tabula rasa. Even “unsupervised” algorithms have priors. More importantly, human beings have to decide what kind of model is appropriate for a given problem: machine learning aims to extend our leverage over large volumes of data, not to take us out of the hermeneutic circle altogether.
But as Liu’s essay demonstrates, this is going to be a lively, deeply theorized conversation even where it turns out that literary theory and computer science have fundamental differences. These disciplines are clearly thinking about similar questions: Liu is right to recognize that unsupervised learning, for instance, raises hermeneutic questions of a kind that are familiar to literary theorists. If our disciplines really approach similar questions in incompatible ways, it will be a matter of some importance to understand why.
PS later that afternoon: Belatedly realize I didn’t say anything about the most controversial word in my original tweet: “literary theory is about to get interesting again.” I suppose I tacitly distinguish literary theory (which has been a little sleepy lately, imo) from theory-sans-adjective (which has been vigorous, although hard to define). But now I’m getting into a distinction that’s much too slippery for a short blog post.
Digital collections are vastly expanding literary scholars’ field of view: instead of describing a few hundred well-known novels, we can now test our claims against corpora that include tens of thousands of works. But because this expansion of scope has also raised expectations, the question of representativeness is often discussed as if it were a weakness rather than a strength of digital methods. How can we ever produce a corpus complete and balanced enough to represent print culture accurately?
I think the question is wrongly posed, and I’d like to suggest an alternate frame. As I see it, the advantage of digital methods is that we never need to decide on a single model of representation. We can and should keep enlarging digital collections, to make them as inclusive as possible. But no matter how large our collections become, the logic of representation itself will always remain open to debate. For instance, men published more books than women in the eighteenth century. Would a corpus be correctly balanced if it reproduced those disproportions? Or would a better model of representation try to capture the demographic reality that there were roughly as many women as men? There’s something to be said for both views.
To take another example, Scott Weingart has pointed out that there’s a basic tension in text mining between measuring “what was written” and “what was read.” A corpus that contains one record for every title, dated to its year of first publication, would tend to emphasize “what was written.” Measuring “what was read” is harder: a perfect solution would require sales figures, reviews, and other kinds of evidence. But, as a quick stab at the problem, we could certainly measure “what was printed,” by including one record for every volume in a consortium of libraries like HathiTrust. If we do that, a frequently-reprinted work like Robinson Crusoe will carry about a hundred times more weight than a novel printed only once.
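The difference between those two weighting schemes can be sketched with an invented three-title catalog, in which one much-reprinted title survives in a hundred library copies. Counting once per title (“what was written”) and once per volume (“what was printed”) yields different word frequencies from the very same texts.

```python
# Sketch (invented catalog): the same titles weighted two ways --
# once per first edition ("what was written") and once per surviving
# library copy ("what was printed").
from collections import Counter

catalog = [
    # (text, number of volumes surviving in libraries)
    ("castaway island survival providence", 100),   # much-reprinted
    ("parlor manners courtship letters", 2),
    ("castaway voyage shipwreck", 1),
]

def frequencies(catalog, weighted):
    """Relative word frequencies, optionally weighting each title by copies."""
    counts = Counter()
    for text, copies in catalog:
        w = copies if weighted else 1
        for word in text.split():
            counts[word] += w
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

written = frequencies(catalog, weighted=False)
printed = frequencies(catalog, weighted=True)
print(f"'castaway' share, by title:  {written['castaway']:.2f}")
print(f"'castaway' share, by volume: {printed['castaway']:.2f}")
```

Neither number is wrong; they answer different questions, which is exactly why it helps to build both collections and compare.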
We’ll never create a single collection that perfectly balances all these considerations. But fortunately, we don’t need to: there’s nothing to prevent us from framing our inquiry instead as a comparative exploration of many different corpora balanced in different ways.
For instance, if we’re troubled by the difference between “what was written” and “what was read,” we can simply create two different collections — one limited to first editions, the other including reprints and duplicate copies. Neither collection is going to be a perfect mirror of print culture. Counting the volumes of a novel preserved in libraries is not the same thing as counting the number of its readers. But comparing these collections should nevertheless tell us whether the issue of popularity makes much difference for a given research question.
I suspect in many cases we’ll find that it makes little difference. For instance, in tracing the development of literary language, I got interested in the relative prominence of words that entered English before and after the Norman Conquest — and more specifically, in how that ratio changed over time in different genres. My first approach to this problem was based on a collection of 4,275 volumes that were, for the most part, limited to first editions (773 of these were prose fiction).
But I recognized that other scholars would have questions about the representativeness of my sample. So I spent the last year wrestling with 470,000 volumes from HathiTrust, correcting their OCR and using classification algorithms to separate fiction from the rest of the collection. This produced a collection with a fundamentally different structure—where a popular work of fiction could be represented by dozens or scores of reprints scattered across the timeline. What difference did that make to the result?
It made almost no difference. The scatterplots look different, of course, because the hand-selected collection (on the left) is relatively stable in size across the timespan, and has a consistent kind of noisiness, whereas the HathiTrust collection (on the right) gets so huge in the nineteenth century that noise almost disappears. But the trend lines are broadly comparable, although the collections were created in completely different ways and rely on incompatible theories of representation.
I don’t regret the year I spent getting a binocular perspective on this question. Although in this case changing the corpus made little difference to the result, I’m sure there are other questions where it will make a difference. And we’ll want to consider as many different models of representation as we can. I’ve been gathering metadata about gender, for instance, so that I can ask what difference gender makes to a given question; I’d also like to have metadata about the ethnicity and national origin of authors.
But the broader point I want to make here is that people pursuing digital research don’t need to agree on a theory of representation in order to cooperate.
If you’re designing a shared syllabus or co-editing an anthology, I suppose you do need to agree in advance about the kind of representativeness you’re aiming to produce. Space is limited; tradeoffs have to be made; you can only select one set of works.
But in digital research, there’s no reason why we should ever have to make up our minds about a model of representativeness, let alone reach consensus. The number of works we can select for discussion is not limited. So we don’t need to imagine that we’re seeking a correspondence between the reality of the past and any set of works. Instead, we can look at the past from many different angles and ask how it’s transformed by different perspectives. We can look at all the digitized volumes we have — and then at a subset of works that were widely reprinted — and then at another subset of works published in India — and then at three or four works selected as case studies for close reading. These different approaches will produce different pictures of the past, to be sure. But nothing compels us to make a final choice among them.
Of all our literary-historical narratives it is the history of criticism itself that seems most wedded to a stodgy history-of-ideas approach—narrating change through a succession of stars or contending schools. While scholars like John Guillory and Gerald Graff have produced subtler models of disciplinary history, we could still do more to complicate the narratives that organize our discipline’s understanding of itself.
The archive of scholarship is also, unlike many twentieth-century archives, digitized and available for “distant reading.” Much of what we need is available through JSTOR’s Data for Research API. So last summer it occurred to a group of us that topic modeling PMLA might provide a new perspective on the history of literary studies. Although Goldstone and Underwood are writing this post, the impetus for the project also came from Natalia Cecire, Brian Croxall, and Roger Whitson, who may do deeper dives into specific aspects of this archive in the near future.
Topic modeling is a technique that automatically identifies groups of words that tend to occur together in a large collection of documents. It was developed about a decade ago by David Blei, among others. Underwood has a blog post explaining topic modeling, and you can find a practical introduction to the technique at the Programming Historian. Jonathan Goodwin has explained how it can be applied to the word-frequency data you get from JSTOR.
Obviously, PMLA is not an adequate synecdoche for literary studies. But, as a generalist journal with a long history, it makes a useful test case to assess the value of topic modeling for a history of the discipline.
Goldstone and Underwood each independently produced several different models of PMLA, using different software, stopword lists, and numbers of topics. Our results overlapped in places and diverged in places. But we’ve reached a shared sense that topic modeling can enrich the history of literary scholarship by revealing trends that are presently invisible.
What is a topic?
A “topic model” assigns every word in every document to one of a given number of topics. Every document is modeled as a mixture of topics in different proportions. A topic, in turn, is a distribution of words—a model of how likely given words are to co-occur in a document. The algorithm (called LDA, for latent Dirichlet allocation) knows nothing “meta” about the articles (when they were published, say), and it knows nothing about the order of words in a given document.
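For readers who want to see the machinery, here is a toy collapsed Gibbs sampler for LDA, the standard way of fitting such a model. This is a sketch of the idea only; real work uses MALLET or a similarly mature implementation, and the three "documents" are invented:

```python
import random
from collections import defaultdict

def gibbs_lda(docs, k, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (a sketch, not production code)."""
    rng = random.Random(seed)
    v = len({w for d in docs for w in d})
    # Randomly assign every word token to a topic, then track counts.
    z = [[rng.randrange(k) for _ in d] for d in docs]
    ndk = [[0] * k for _ in docs]               # document -> topic counts
    nkw = [defaultdict(int) for _ in range(k)]  # topic -> word counts
    nk = [0] * k
    for di, d in enumerate(docs):
        for wi, w in enumerate(d):
            t = z[di][wi]
            ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
    for _ in range(iters):
        for di, d in enumerate(docs):
            for wi, w in enumerate(d):
                t = z[di][wi]  # remove this token, then resample its topic
                ndk[di][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                weights = [(ndk[di][j] + alpha) * (nkw[j][w] + beta)
                           / (nk[j] + v * beta) for j in range(k)]
                t = rng.choices(range(k), weights)[0]
                z[di][wi] = t
                ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return ndk, nkw  # document-topic mixtures, topic-word distributions

docs = [["verse", "meter", "rhyme"] * 5,
        ["plot", "novel", "character"] * 5,
        ["verse", "rhyme", "novel", "plot"] * 5]
ndk, nkw = gibbs_lda(docs, k=2)
print(ndk)  # each row: how a document's tokens divide between the 2 topics
```

Note what the sampler sees: bags of word tokens, nothing else. Dates, authors, and word order never enter the computation, which is why any historical pattern that emerges has to be read back in by the interpreter.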
This is a picture of 5,940 articles from PMLA, showing the changing presence of each of 100 "topics" in PMLA over time. (Click through to enlarge; a longer list of topic keywords is here.) For example, the most probable words in the topic arbitrarily numbered 59 in the model visualized above are, in descending order:
che gli piu nel lo suo sua sono io delle perche questo quando ogni mio quella loro cosi dei
This is not a “topic” in the sense of a theme or a rhetorical convention. What these words have in common is simply that they’re basic Italian words, which appear together whenever an extended Italian text occurs. And this is the point: a “topic” is neither more nor less than a pattern of co-occurring words.
Nonetheless, a topic like topic 59 does tell us about the history of PMLA. The articles where this topic achieved its highest proportion were:
Antonio Illiano, “Momenti e problemi di critica pirandelliana: L’umorismo, Pirandello e Croce, Pirandello e Tilgher,” PMLA 83 no. 1 (1968): pp. 135-143
Domenico Vittorini, “I Dialogi ad Petrum Histrum di Leonardo Bruni Aretino (Per la Storia del Gusto Nell’Italia del Secolo XV),” PMLA 55 no. 3 (1940): pp. 714-720
Vincent Luciani, “Il Guicciardini E La Spagna,” PMLA 56 no. 4 (1941): pp. 992-1006
And here’s a plot of the changing proportions of this topic over time, showing 1-year and 5-year moving averages:
We see something about PMLA that is worth remembering for the history of criticism, namely, that it has embedded Italian less and less frequently in its language since midcentury. (The model shows that the same thing is true of French and German.)
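The smoothing used in plots like that one is straightforward to reproduce. Here is one simple (trailing) variant of a moving average, applied to invented yearly topic proportions:

```python
def moving_average(series, window):
    """Trailing moving average; windows are shorter at the series' start."""
    out = []
    for i in range(len(series)):
        lo = max(0, i - window + 1)
        out.append(sum(series[lo:i + 1]) / (i + 1 - lo))
    return out

yearly = [0.04, 0.06, 0.02, 0.08, 0.05]  # hypothetical topic proportions
print(moving_average(yearly, 1))  # window of 1: the raw yearly series
print(moving_average(yearly, 5))  # a 5-year window smooths the jitter
```

The wider window trades responsiveness for legibility, which is why it is useful to plot both at once.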
What can topics tell us about the history of theory?
Of course a topic can also be a subject category—modeling PMLA, we have found topics that are primarily “about Beowulf” or “about music.” Or a topic can be a group of words that tend to co-occur because they’re associated with a particular critical approach.
Here, for instance, we have a topic from Underwood’s 150-topic model associated with discussions of pattern and structure in literature. We can characterize it by listing words that occur more commonly in the topic than elsewhere, or by graphing the frequency of the topic over time, or by listing a few articles where it’s especially salient.
At first glance this topic might seem to fit neatly into a familiar story about critical history. We know that there was a mid-twentieth-century critical movement called “structuralism,” and the prominence of “structure” here might suggest that we’re looking at the rise and fall of that movement. In part, perhaps, we are. But the articles where this topic is most prominent are not specifically “structuralist.” In the top four articles, Ferdinand de Saussure, Claude Lévi-Strauss, and Northrop Frye are nowhere in evidence. Instead these articles appeal to general notions of symmetry, or connect literary patterns to Neoplatonism and Renaissance numerology.
By forcing us to attend to concrete linguistic practice, topic modeling gives us a chance to bracket our received assumptions about the connections between concepts. While there is a distinct mid-century vogue for structure, it does not seem strongly associated with the concepts that are supposed to have motivated it (myth, kinship, language, archetype). And it begins in the 1940s, a decade or more before “structuralism” is supposed to have become widespread in literary studies. We might be tempted to characterize the earlier part of this trend as “New Critical interest in formal unity” and the latter part of it as “structuralism.” But the dividing line between those rationales for emphasizing pattern is not evident in critical vocabulary (at least not at this scale of analysis).
This evidence doesn’t necessarily disprove theses about the history of structuralism. Topic modeling might not reveal varying “rationales” for using a word even if those rationales did vary. The strictly linguistic character of this technique is a limitation as well as a strength: it’s not designed to reveal motivation or conflict. But since our histories of criticism are already very intellectual and agonistic, foregrounding the conscious beliefs of contending critical “schools,” topic modeling may offer a useful corrective. This technique can reveal shifts of emphasis that are more gradual and less conscious than the ones we tend to celebrate.
It may even reveal shifts of emphasis of which we were entirely unaware. “Structure” is a familiar critical theme, but what are we to make of this?
A fuller list of terms included in this topic would include “character”, “fact,” “choice,” “effect,” and “conflict.” Reading some of the articles where the topic is prominent, it appears that in this topic “point” is rarely the sort of point one makes in an argument. Instead it’s a moment in a literary work (e.g., “at the point where the rain occurs,” in Robert apRoberts 379). Apparently, critics in the 1960s developed a habit of describing literature in terms of problems, questions, and significant moments of action or choice; the habit intensified through the early 1980s and then declined. This habit may not have a name; it may not line up neatly with any recognizable school of thought. But it’s a fact about critical history worth knowing.
Note that this concern with problem-situations is embodied in common words like “way” and “cannot” as well as more legible, abstract terms. Since common words are often difficult to interpret, it can be tempting to exclude them from the modeling process. It’s true that a word like “the” isn’t likely to reveal much. But subtle, interesting rhetorical habits can be encoded in common words. (E.g. “itself” is especially common in late-20c theoretical topics.)
We don’t imagine that this brief blog post has significantly contributed to the history of criticism. But we do want to suggest that topic modeling could be a useful resource for that project. It has the potential to reveal shifts in critical vocabulary that aren’t well described, and that don’t fit our received assumptions about the history of the discipline.
Why browse topics as a network?
The fact that a word is prominent in topic A doesn’t prevent it from also being prominent in topic B. So certain generalizations we might make about an individual topic (for instance, that Italian words decline in frequency after midcentury) will be true only if there’s not some other “Italian” topic out there, picking up where the first one left off.
For that reason, interpreters really need to survey a topic model as a whole, instead of considering single topics in isolation. But how can you browse a whole topic model? We’ve chosen relatively small numbers of topics, but it would not be unreasonable to divide literary scholarship into, say, 500 topics. Information overload becomes a problem.
We’ve found network graphs useful here. Click on the image of the network on the right to browse Underwood’s 150-topic model. The size of each node (roughly) indicates the number of words in the topic; color indicates the average date of words. (Blue topics are older; yellow topics are more recent.) Topics are linked to each other if they tend to appear in the same articles. Topics have been labeled with their most salient word—unless that word was already taken for another topic, or seemed misleading. Mousing over a topic reveals a list of words associated with it; with most topics it’s also possible to click through for more information.
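The linking rule behind such a graph can be sketched simply. The proportions and thresholds below are invented for illustration, not taken from the actual model:

```python
from itertools import combinations

# Hypothetical document-topic proportions (rows: articles, cols: 4 topics).
doc_topics = [
    [0.6, 0.3, 0.0, 0.1],
    [0.5, 0.4, 0.1, 0.0],
    [0.0, 0.1, 0.7, 0.2],
]

def topic_edges(doc_topics, presence=0.2, min_docs=2):
    """Link two topics if both are 'present' in enough of the same articles."""
    k = len(doc_topics[0])
    edges = []
    for a, b in combinations(range(k), 2):
        shared = sum(1 for row in doc_topics
                     if row[a] >= presence and row[b] >= presence)
        if shared >= min_docs:
            edges.append((a, b))
    return edges

print(topic_edges(doc_topics))  # → [(0, 1)]
```

Both thresholds are dials; raising or lowering them is exactly the "fiddling" discussed below that reshapes the visualized network without changing the underlying model.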
The structure of the network makes a loose kind of sense. Topics in French and German form separate networks floating free of the main English structure. Recent topics tend to cluster at the bottom of the page. And at the bottom, historical and pedagogical topics tend to be on the left, while formal, phenomenological, and aesthetic categories tend to be on the right.
But while it’s a little eerie to see patterns like this emerge automatically, we don’t advise readers to take the network structure too seriously. A topic model isn’t a network, and mapping one onto a network can be misleading. For instance, topics that are physically distant from each other in this visualization are not necessarily unrelated. Connections below a certain threshold go unrepresented.
Moreover, as you can see by comparing illustrations in this post, a little fiddling with dials can turn the same data into networks with rather different shapes. It’s probably best to view network visualization as a convenience. It may help readers browse a model by loosely organizing topics—but there can be other equally valid ways to organize the same material.
How did our models differ?
The two models we’ve examined so far in this post differ in several ways at once. They’re based on different spans of PMLA’s print run (1890–1999 and 1924–2006). They were produced with different software. Perhaps most importantly, we chose different numbers of topics (100 and 150).
But the models we’re presenting are only samples. Goldstone and Underwood each produced several models of PMLA, changing one variable at a time, and we have made some closer apples-to-apples comparisons.
Broadly, the conclusion we’ve reached is that there’s both a great deal of fluidity and a great deal of consistency in this process. The algorithm has to estimate parameters that are impossible to calculate exactly. So the results you get will be slightly different every time. If you run the algorithm on the same corpus with the same number of topics, the changes tend to be fairly minor. But if you change the number of topics, you can get results that look substantially different.
On the other hand, to say that two models “look substantially different” isn’t to say that they’re incompatible. A jigsaw puzzle cut into 100 pieces looks different from one with 150 pieces. If you examine them piece by piece, no two pieces are the same—but once you put them together you’re looking at the same picture. In practice, there was a lot of overlap between our models; on the older end of the spectrum you often see a topic like “evidence fact,” while the newer end includes topics that foreground narrative, rhetoric, and gender. Some of the more surprising details turned out to be consistent as well. For instance, you might expect the topic “literary literature” to skew toward the older end of the print run. But in fact this is a relatively recent topic in both of our models, associated with discussion of canonicity. (Perhaps the owl of Minerva flies only at dusk?)
Contrasting models: a short example
While some topics look roughly the same in all of our models, it’s not always possible to identify close correlates of that sort. As you vary the overall number of topics, some topics seem to simply disappear. Where do they go? For example, there is no exact counterpart in Goldstone’s model to that “structure” topic in Underwood’s model. Does that mean it is a figment? Underwood isolated the following article as the most prominent exemplar:
Robert E. Burkhart, The Structure of Wuthering Heights, Letter to the Editor, PMLA 87 no. 1 (1972): 104–5. (Incidentally, JSTOR has miscategorized this as a “full-length article.”)
Goldstone’s model puts more than half of Burkhart’s comment in three topics:
0.24 topic 38 time experience reality work sense form present point world human process structure concept individual reader meaning order real relationship
0.13 topic 46 novels fiction poe gothic cooper characters richardson romance narrator story novelist reader plot novelists character reade hero heroine drf
0.12 topic 13 point reader question interpretation meaning make reading view sense argument words word problem makes evidence read clear text readers
The other prominent documents in Underwood’s topic 109 (the “structure” topic discussed above) are connected to similar topics in Goldstone’s model. The keywords for Goldstone’s topic 38, the top topic here, immediately suggest an affinity with Underwood’s topic 109. Now compare the time course of Goldstone’s 38 with Underwood’s 109 (the latter is above):
It is reasonable to infer that some portion of the words in Underwood’s “structure” topic are absorbed in Goldstone’s “time experience” topic. But “time experience reality work sense” looks less like vocabulary for describing form (although “form” and “structure” are included in it, further down the list; cf. the top words for all 100 topics), and more like vocabulary for talking about experience in generalized ways—as is also suggested by the titles of some articles in which that topic is substantially present:
“The Vanishing Subject: Empirical Psychology and the Modern Novel”
“Toward a Modern Humanism”
“Wordsworth’s Inscrutable Workmanship and the Emblems of Reality”
This version of the topic is no less “right” or “wrong” than the one in Underwood’s model. They both reveal the same underlying evidence of word use, segmented in different but overlapping ways. Instead of focusing our vision on affinities between “form” and “structure”, Goldstone’s 100-topic model shows a broader connection between the critical vocabulary of form and structure and the keywords of “humanistic” reflection on experience.
The most striking contrast to these postwar themes is provided by a topic which dominates in the prewar period, then gives way before “time experience” takes hold. Here are box plots by ten-year intervals of the proportions of another topic, Goldstone’s topic 40, in PMLA articles:
Underwood’s model shows a similar cluster of topics centering on questions of evidence and textual documentation, which similarly decrease in frequency. The language of PMLA has shown a consistently declining interest in “evidence found fact” in the era of the postwar research university.
So any given topic model of a corpus is not definitive. Each variation in the modeling parameters can produce a new model. But although topic models vary, models of the same corpus remain fundamentally consistent with each other.
Using LDA as evidence
It’s true that a “topic model” is simply a model of how often words occur together in a corpus. But information of that kind has a deeper significance than we might at first assume. A topic model doesn’t just show you what people are writing about (a list of “topics” in our ordinary sense of the word). It can also show you how they’re writing. And that “how” seems to us a strong clue to social affinities—perhaps especially for scholars, who often identify with a methodology or critical vocabulary. To put this another way, topic modeling can identify discourses as well as subject categories and embedded languages. Naturally we also need other kinds of evidence to produce a history of the discipline, including social and institutional evidence that may not be fully manifest in discourse. But the evidence of topic modeling should be taken seriously.
As you change the number of topics (and other parameters), models provide different pictures of the same underlying collection. But this doesn’t mean that topic modeling is an indeterminate process, unreliable as evidence. All of those pictures will be valid. They are taken (so to speak) at different distances, and with different levels of granularity. But they’re all pictures of the same evidence and are by definition compatible. Different models may support different interpretations of the evidence, but not interpretations that absolutely conflict. Instead the multiplicity of models presents us with a familiar choice between “lumping” or “splitting” cultural phenomena—a choice where we have long known that multiple levels of analysis can coexist. This multiplicity of perspective should be understood as a strength rather than a limitation of the technique; it is part of the reason why an analysis using topic modeling can afford a richly detailed picture of an archive like PMLA.
Appendix: How did we actually do this?
The PMLA data obtained from JSTOR was independently processed by Goldstone and Underwood for their different LDA tools. This created some quantitative subtleties that we’ve saved for this appendix to keep this post accessible to a broad audience. If you read closely, you’ll notice that we sometimes talk about the “probability” of a term in a topic, and sometimes about its “salience.” Goldstone used MALLET for topic modeling, whereas Underwood used his own Java implementation of LDA. As a result, we also used slightly different formulas for ranking words within a topic. MALLET reports the raw probability of terms in each topic, whereas Underwood’s code uses a slightly more complex formula for term salience drawn from Blei & Lafferty (2009). In practice, this did not make a huge difference.
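To make the probability/salience distinction concrete, here is a sketch of the Blei & Lafferty (2009) term score as we understand it, applied to two invented three-word "topics." The smoothing floor for absent words is an assumption of the sketch:

```python
import math

# Invented topic-word probabilities for two toy topics.
topics = [
    {"structure": 0.5, "form": 0.3, "the": 0.2},
    {"gender": 0.5, "narrative": 0.3, "the": 0.2},
]

def term_scores(topics, floor=1e-12):
    """Term score: p(w|k) * log(p(w|k) / geometric mean of p(w|j) over all j).
    Unlike raw probability, it down-weights words probable in every topic."""
    k = len(topics)
    vocab = set().union(*topics)
    scores = []
    for topic in topics:
        s = {}
        for w in vocab:
            p = topic.get(w, floor)
            gmean = math.exp(sum(math.log(t.get(w, floor)) for t in topics) / k)
            s[w] = p * math.log(p / gmean)
        scores.append(s)
    return scores

scores = term_scores(topics)
# "the" is equally probable in both topics, so its score collapses to ~0,
# while "structure" (distinctive of topic 0) keeps a high score.
print(scores[0]["the"], scores[0]["structure"])
```

Raw probability would rank “the” above nothing; the term score pushes such undistinctive words down the list, which is why the two ranking formulas produce similar but not identical keyword lists.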
MALLET also has a “hyperparameter optimization” option. Before you run screaming, “hyperparameters” are just dials that control how much fuzziness is allowed in a topic’s distribution across words (beta) or across documents (alpha). Allowing alpha to vary permits greater differentiation between the sizes of large topics (often dominated by common words) and smaller, often more specialized, topics. (See Wallach, Mimno, and McCallum, “Why Priors Matter,” 2009.) Goldstone’s 100-topic model used hyperparameter optimization; Underwood’s 150-topic model did not. A comparison with several other models suggests that the difference between symmetric and asymmetric (optimized) alpha parameters explains much of the difference between their structures when visualized as networks.
I’m writing this post because systems of academic review and communication are failing us in cases like this, and we need to step up our game. Tools like Google’s ngram viewer have created new opportunities, but also new methodological pitfalls. Humanists are aware of those pitfalls, but I think we need to work a bit harder to get the word out to journalists, and to disciplines like psychology.
The basic methodological problem in both articles is that researchers have used present-day patterns of association to define a wordlist that they then take as an index of the fortunes of some concept (morality, individualism, etc) over historical time. (In the second study, for instance, words associated with morality were extracted from a thesaurus and crowdsourced using Mechanical Turk.)
The fallacy involved here has little to do with hot-button issues of quantification. A basic premise of historicism is that human experience gets divided up in different ways in different eras. If we crowdsource “leadership” using twenty-first-century reactions on Mechanical Turk, for instance, we’ll probably get words like “visionary” and “professional.” “Loud-voiced” probably won’t be on the list — because that’s just rude. But to Homer, there’s nothing especially noble about working for hire (“professionally”), whereas “the loud-voiced Achilles” is cut out to be a leader of men, since he can be heard over the din of spears beating on shields (Blackwell).
The authors of both articles are dimly aware of this problem, but they imagine that it’s something they can dismiss if they’re just conscientious and careful to choose a good list of words. I don’t blame them; they’re not coming from historical disciplines. But one of the things you learn by working in a historical discipline is that our perspective is often limited by history in ways we are unable to anticipate. So if you want to understand what morality meant in 1900, you have to work to reconstruct that concept; it is not going to be intuitively accessible to you, and it cannot be crowdsourced.
The classic way to reconstruct concepts from the past involves immersing yourself in sources from the period. That’s probably still the best way, but where language is concerned, there are also quantitative techniques that can help. For instance, Ryan Heuser and Long Le-Khac have carried out research on word frequency in the nineteenth-century novel that might superficially look like the psychological articles I am critiquing. (It’s Pamphlet 4 in the Stanford Literary Lab series.) But their work is much more reliable and more interesting, because it begins by mining patterns of association from the period in question. They don’t start from an abstract concept like “individualism” and pick words that might be associated with it. Instead, they find groups of words that are associated with each other, in practice, in nineteenth-century novels, and then trace the history of those groups. In doing so, they find some intriguing patterns that scholars of the nineteenth-century novel are going to need to pay attention to.
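A stripped-down version of that procedure, mining associations from the period itself rather than from present-day intuitions, might look like this. The four "documents" are invented, and the correlation threshold is arbitrary:

```python
# Group words whose per-document frequencies correlate across a corpus,
# instead of starting from a present-day concept list.
docs = [
    "duty honour duty station honour",
    "duty station honour rank",
    "rain mud rain fog",
    "fog rain mud mud",
]

vocab = sorted({w for d in docs for w in d.split()})
freqs = {w: [d.split().count(w) / len(d.split()) for d in docs] for w in vocab}

def pearson(x, y):
    """Pearson correlation of two equal-length frequency series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

# Word pairs that rise and fall together across documents form a group.
associated = [(a, b) for i, a in enumerate(vocab) for b in vocab[i + 1:]
              if pearson(freqs[a], freqs[b]) > 0.8]
print(associated)
```

Notice that the groups fall out of usage in the corpus itself (“duty” clusters with “honour,” not with anything a modern crowdworker would volunteer), which is the methodological point of starting from period-internal associations.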
It’s also relevant that Heuser and Le-Khac are working in a corpus that is limited to fiction. One of the problems with the Google ngram corpus is that really we have no idea what genres are represented in it, or how their relative proportions may vary over time. So it’s possible that an apparent decline in the frequency of words for moral values is actually a decline in the frequency of certain genres — say, conduct books, or hagiographic biographies. A decline of that sort would still be telling us something about literary culture; but it might be telling us something different than we initially assume from tracing the decline of a word like “fidelity.”
So please, if you know a psychologist, or journalist, or someone who blogs for The Atlantic: let them know that there is actually an emerging interdisciplinary field developing a methodology to grapple with this sort of evidence. Articles that purport to draw historical conclusions from language need to demonstrate that they have thought about the problems involved. That will require thinking about math, but it also, definitely, requires thinking about dilemmas of historical interpretation.
My illustration about “loud-voiced Achilles” is a very old example of the way concepts change over time, drawn via Friedrich Meinecke from Thomas Blackwell, An Enquiry into the Life and Writings of Homer, 1735. The word “professional,” by the way, also illustrates a kind of subtly moralized contemporary vocabulary that Kesebir & Kesebir may be ignoring in their account of the decline of moral virtue. One of the other dilemmas of historical perspective is that we’re in our own blind spot.
Big data. I’m tempted to begin “I, too, dislike it,” because the phrase has become a buzzword. To mainstream humanists, it sounds like a perversion. Even people who work in digital humanities protest that DH shouldn’t be normatively identified with big data — and I agree — so generally I keep quiet on the whole vexed question.
Except … there are a lot of grad students out there just starting to look at DH curiously, wondering whether it offers anything useful for their own subfield. In that situation, it’s natural to start by building a small collection that addresses a specific research problem you know about. And that might, in many cases, be a fine approach! But my conscience is nagging at me, because I can see some other, less obvious opportunities that students ought to be informed about.
It’s true that DH doesn’t have to be identified with scale. But the fact remains that problems of scale constitute a huge blind spot for individual researchers, and also define a problem that we know computers can help us explore. And when you first go into an area that was a blind spot for earlier generations of scholars, you’re almost guaranteed to find research opportunities — lying out on the ground like lumps of gold you don’t have to mine.
This suggests that it might be a mistake to assume that the most cost-effective way to get started in DH is to define a small collection focused on a particular problem you know about. It might actually be a better strategy to beg, borrow, or steal a large collection — and poke around in it for problems we don’t yet know about.
“But I’m not interested in big statistical generalizations; I care about describing individual works, decades, and social problems.” I understand; that’s a valid goal; but it’s not incompatible with the approach I’m recommending. I think it’s really vital that we do a better job of distinguishing “big data” (the resource) from “distant reading” (a particular interpretive strategy).* Big data doesn’t have to produce distant generalizations; we can use the leverage provided by scale and comparative analysis to crack open small and tightly-focused questions.
I don’t think most humanists have an intuitive grasp of how that “leverage” would work — but topic modeling is a good example. As I play around with topic-modeling large collections, I’m often finding that the process tells me interesting things about particular periods, genres, or works, by revealing how they differ from other relevant points of comparison. Topic modeling doesn’t use scale to identify a “trend” or an “average,” after all; what it does is identify the most salient dimensions of difference in a given collection. If you believe that the significance of a text is defined by its relation to context, then you can see how topic modeling a collection might help us crack open the (relational) significance of individual works.
“But how do we get our hands on the data?” Indeed: there’s the rub. Miriam Posner has recently suggested that the culture surrounding “coding” serves as a barrier that discourages women and minorities from entering certain precincts of DH. I think that’s right, but I’m even more concerned about the barriers embodied in access to data. Coding is actually not all that hard to pick up. Yes, it’s surrounded by gendered assumptions; but still, you can do it over a summer. [Update: Or, where that’s not practical, you can collaborate with someone. At Illinois, Loretta Auvil and Boris Capitanu do kinds of DH programming that are beyond me. I don’t mean to minimize issues of gender here, but I do mean to put “coding” in perspective. It’s not a mysterious, magical key.] By contrast, none of us can build big data on our own (or even in small teams) over the summer. If we don’t watch out, our field could easily slip into a situation where power gravitates to established scholars at large/wealthy research universities.
I’ve tried to address that by making my own data public. I haven’t documented it very well yet, but give me a few weeks. I think peer pressure should be exerted on everyone (especially established scholars) to make their data public at the time of publication. I do understand that some kinds of data can’t be shared because they’re owned by private enterprise. I accept that. But if you’ve supplemented proprietary data with other things you’ve produced on your own: in my opinion, that data should be made public at the time of publication.
Moreover, if you do that, I’m not going to care very much about the mistakes you have made in building your collection. I may think your data is completely biased and unrepresentative, because it includes too much Y and not enough X. But if so, I have an easy solution — which is to take your data, add it to my own collection of X, and other data borrowed from Initiative Z, and then select whatever subset would in my opinion create a balanced and representative collection. Then I can publish my own article correcting your initial, biased result.
Humanists are used to approaching debates about historical representation as if they were zero-sum questions. I suppose we are on some level still imagining this as a debate about canonicity — which is, as John Guillory pointed out, really a debate about space on the syllabus. Space on the syllabus is a zero-sum game. But the process of building big data is not zero-sum; it is cumulative. Every single thing you digitize is more good news for me, even if I shudder at the tired 2007-vintage assumptions implicit in your research agenda.
Personally, I feel the same way about questions of markup and interoperability. It’s all good. If you can give me clean** ascii text files with minimal metadata, I love you. If you can give me TEI with enriched metadata, I love you. I don’t want to waste a lot of breath arguing about which standard is better. In most cases, clean ascii text would be a lot better than what I can currently get.
* I hasten to say that I’m using “distant reading” here as the phrase is commonly deployed in debate — not as Franco Moretti originally used it — because the limitation I’m playing on is not really present in Moretti’s own use of the term. Moretti pointedly emphasizes that the advantage of a distant perspective may be to reveal the relational significance of an individual work.
** And, when I say “clean” — I will definitely settle for a 5% error rate.
Guillory, John. Cultural Capital. Chicago: U. of Chicago Press, 1993.
Moretti, Franco. Graphs, Maps, Trees. New York: Verso, 2005.
My reaction to Stanley Fish’s third column on digital humanities was at first so negative that I thought it not worth writing about. But in the light of morning, there is something here worth discussing. Fish raises a neglected issue that I (and a bunch of other people cited at the end of this post) have been trying to foreground: the role of discovery in the humanities. He raises the issue symptomatically, by suppressing it, but the problem is too important to let that slide.
Fish argues, in essence, that digital humanists let the data suggest hypotheses for them instead of framing hypotheses that are then tested against evidence.
The usual way of doing this is illustrated by my example: I began with a substantive interpretive proposition … and, within the guiding light, indeed searchlight, of that proposition I noticed a pattern that could, I thought, be correlated with it. I then elaborated the correlation.
The direction of my inferences is critical: first the interpretive hypothesis and then the formal pattern, which attains the status of noticeability only because an interpretation already in place is picking it out.
The direction is the reverse in the digital humanities: first you run the numbers, and then you see if they prompt an interpretive hypothesis. The method, if it can be called that, is dictated by the capability of the tool.
The underlying element of truth here is that all researchers — humanists and scientists alike — do need to separate the process of discovering a hypothesis from the process of testing it. Otherwise you run into what we unreflecting empiricists call “the problem of data dredging.” If you simply sweep a net through an ocean of data, and frame a conclusion based on whatever you catch, you’re not properly testing anything, because you’re implicitly testing an infinite number of hypotheses that are left unstated — and the significance of any single test is reduced when it’s run as part of a large battery.
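That erosion of significance is easy to demonstrate with a toy simulation. The sketch below (in Python, with invented data and a crude two-sample t statistic; none of this comes from any actual corpus) sweeps a “net” through pure noise: it runs a battery of comparisons where every true difference is zero, and still catches a handful of “significant” results, simply because so many tests were run.

```python
import random
from math import sqrt
from statistics import mean, variance

random.seed(0)

def t_stat(a, b):
    """Crude two-sample t statistic for equal-size samples."""
    n = len(a)
    se = sqrt(variance(a) / n + variance(b) / n)
    return (mean(a) - mean(b)) / se

# 200 hypothetical word-frequency comparisons between two corpora,
# drawn from the SAME distribution -- every true difference is zero.
n_tests = 200
hits = 0
for _ in range(n_tests):
    a = [random.gauss(0, 1) for _ in range(30)]
    b = [random.gauss(0, 1) for _ in range(30)]
    # |t| > 2 is roughly the conventional p < .05 threshold here.
    if abs(t_stat(a, b)) > 2:
        hits += 1

print(f"{hits} of {n_tests} null comparisons look 'significant'")
```

Roughly five percent of the null comparisons clear the threshold, which is exactly why a conclusion fished out of a large unstated battery of tests proves much less than the same result framed and tested in advance.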
That’s true, but it’s also a problem that people who do data mining are quite self-conscious about. It’s why I never stop linking to this xkcd comic about “significance.” And it’s why Matt Wilkens (mistargeted by Fish as an emblem of this interpretive sin) goes through a deliberately iterative process of first framing hypotheses about nineteenth-century geographical imagination and then testing them. (For instance, after noticing that certain states seem especially prominent in 19c American fiction, he tests whether this remains true after you compensate for differences in population size, and then proposes a pair of hypotheses that he suggests will need to be evaluated against additional “test cases.”)
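The normalization step in that kind of test can be sketched abstractly. The counts and populations below are invented for illustration (they are not Wilkens’s data, and the state names are arbitrary): the point is just that a state can look “prominent” in raw mention counts simply because it was populous, so dividing mentions by population asks a different, sharper question.

```python
# Hypothetical raw mention counts in a fiction corpus, and hypothetical
# nineteenth-century populations. All numbers invented for illustration.
mentions = {"New York": 500, "Massachusetts": 300, "Nevada": 40}
population = {"New York": 3_000_000, "Massachusetts": 1_000_000, "Nevada": 40_000}

# Mentions per capita: is a state over-represented relative to its size?
per_capita = {state: mentions[state] / population[state] for state in mentions}

for state, rate in sorted(per_capita.items(), key=lambda kv: -kv[1]):
    print(f"{state}: {rate * 1_000_000:.0f} mentions per million residents")
```

In this toy example New York dominates the raw counts but sinks once population is controlled for, which is the shape of the check described above: notice a pattern, then test whether it survives an obvious confound before hazarding a hypothesis.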
More importantly, Fish profoundly misrepresents his own (traditional) interpretive procedure by pretending that the act of interpretation is wholly contained in a single encounter with evidence. On his account we normally begin with a hypothesis (which seems to have sprung, like Sin, fully formed from our head), and test it against a single sentence.
In reality, of course, our “interpretive proposition” is often suggested by the same evidence that confirms it. Or — more commonly — we derive a hypothesis from one example, and then read patiently through dozens of books until we have gathered enough confirming evidence to write a chapter. This process runs into a different interpretive fallacy: if you keep testing a hypothesis until you’ve confirmed it, you’re not testing it at all. And it’s a bit worse than that, because in practice what we do now is go to a full-text search engine and search for terms that would go together if our assumptions were correct. (In the example Fish offers, this might be “bishops” and “presbyters.”) If you find three sentences where those terms coincide, you’ve got more than enough evidence to prop up an argument, using our richly humanistic (cough, anecdotal) conception of evidence. And of course a full-text search engine can find you three examples of just about anything. But we don’t have to worry about this, because search engines are not tools that dictate a method; they are transparent extensions of our interpretive sensibility.
The basic mistake that Fish is making is this: he pretends that humanists have no discovery process at all. For Fish, the interpretive act is always fully contained in an encounter with a single piece of evidence. How your “interpretive proposition” got framed in the first place is a matter of no consequence: some readers are just fortunate to have propositions that turn out to be correct. Fish is not alone in this idealized model of interpretation; it’s widespread among humanists.
Fish is resisting the assistance of digital techniques, not because they would impose scientism on the humanities, but because they would force us to acknowledge that our ideas do after all come from somewhere — whether a search engine or a commonplace book. But as Peter Stallybrass eloquently argued five years ago in PMLA (h/t Mark Sample), the process of discovery has always been collaborative, and has long — at least since early modernity — been embodied in specific textual technologies.
Stallybrass, Peter. “Against Thinking.” PMLA 122.5 (2007): 1580-1587.
Wilkens, Matthew. “Geolocation Extraction and Mapping of Nineteenth-Century U.S. Fiction.” DHCS 2011.
On the process of embodied play that generates ideas, see also Stephen Ramsay’s book Reading Machines (University of Illinois Press, 2011).