About tedunderwood

Ted Underwood is Professor of English at the University of Illinois, Urbana-Champaign. On Twitter he is @Ted_Underwood.

Two syllabi: Digital Humanities and Data Science in the Humanities.

When I began teaching graduate courses about digital humanities, I designed syllabi that tried to cover a little of everything.

I enjoyed teaching those courses, but if I’m being honest, it was a challenge to race from digital editing — to maps and networks — to distant reading — to critical reflection on the concept of DH itself. It was even harder to cover that range of topics while giving students meaningful hands-on experience.

The solution, obviously, was to break the subject into more than one course. But I didn’t know how to do that within an English graduate curriculum. Many students are interested in learning about “digital humanities,” because a lot of debate has swirled around that broad rubric. I think the specific fields of inquiry grouped under the rubric actually make better-sized topics for a course, but they don’t have the same kind of name recognition, and courses on those topics don’t enroll as heavily.

This problem became easier to solve when part of my job moved into the School of Information Sciences. Many aspects of digital humanities — from social reflection on information technology to data mining — are already represented in the curriculum here. So I could divide DH into parts, and still have confidence that students would recognize those parts and understand how each part fit into an existing program of study.

This year I’ve taught two courses in the LIS curriculum. I’m sharing syllabi for both at once so I can also describe the contrast between them.

1. The first of the two, “Digital Humanities” (syllabus), is fundamentally a survey of DH as a social phenomenon, with special emphasis on the role of academic libraries and librarians — since that is likely to be a career path that many MLIS students are considering. The course covers a wide range of humanistic themes and topics, but doesn’t go very deeply into hands-on exploration of methods.

2. The second course, “Data Science in the Humanities” (syllabus)  covers the field that digital humanists often call “cultural analytics” — or “distant reading,” when it focuses on literature. Although I know its history is actually more complex, I’m characterizing this field as a form of data science in order to highlight its value for a wide range of students who may or may not intend to work as researchers in universities. I think humanistic questions can be great training for the slippery problems one encounters in business and computational journalism, for instance. But as Dennis Tenen and Andrew Goldstone (among others) have rightly pointed out, it can be a huge challenge to cover all the methods required for this sort of work in a single course. I’m not sure I have a perfect solution to that problem yet. The course is only in its third week! But we are aiming to achieve a kind of hands-on experience that combines Python programming with basic principles of statistics and machine learning, and with reflection on the challenges of social interpretation. I believe this may be achievable, in a course that doesn’t have to cover other aspects of DH, and when many students have at least a little previous experience, both in programming and in the humanities.

As Jupyter notebooks for the data science course are developed, I’m sharing them in a github repo. In both of the syllabi linked above, I also mention other syllabi that served as models. My thanks go out to everyone who shared their experience; I leaned on some of those models very heavily.

data_science_vdThe question I haven’t resolved yet is, How do we connect courses like these to an English curriculum? That connection remains crucial: I chose the phrase “data science” partly because the conversation around data science has explicitly acknowledged the importance of domain expertise. (See Drew Conway’s famous Venn diagram on the right.) I do think researchers need substantive knowledge about specific aspects of cultural history in order to frame meaningful questions about the past and interpret the patterns they find.

Right now, the courses I’m offering in LIS are certainly open to graduate students from humanities departments. But over the long run, I would also like to develop courses located in humanities departments that focus on specific literary-historical problems (for instance, questions of canonicity and popularity in a particular century), integrating distant-reading approaches only as one element of a broader portfolio of methods. Courses like that would fit fairly easily into an English graduate curriculum.

On the other hand, none of the courses I’ve described above can (by themselves) solve the most challenging pedagogical problem in DH, which is to make distant reading useful for doctoral dissertations. Right now, that’s very hard. The research opportunities in distant reading are huge, I believe, but that hugeness becomes itself a barrier. A field where you start making important discoveries after two to three years initial start-up time (training yourself, developing corpora, etc) is not ideally configured for the individualistic model of doctoral research that prevails in the humanities. Collective lab-centered projects are probably a better fit for this field. We may need to envision dissertations as being (at least in part) pieces of a larger research project, exploring one aspect of a shared problem.

The Gender Balance of Fiction, 1800-2007

by Ted Underwood and David Bamman

Last year, we wrote a blog post that posed questions about the differentiation of gendered roles in fiction. In doing that, we skipped over a more obvious question: how equally (or unequally) do stories distribute their attention between men and women?

This year, we’re returning to that simple question, with a richer dataset (supported by ongoing work at HathiTrust Research Center). The full story will come out in an article, but we’d like to share a few big-picture points in advance.

To start with, why have we framed this as a question about “women” and “men”? Gender isn’t a binary phenomenon. But we aren’t inquiring about the truth of gender identity here — just about gross inequalities that have separated conventional public roles. English-language fiction does typically divide characters by calling them “he” or “she,” and that division is a good place to start posing questions.

We could measure underrepresentation by counting people, but then we’d have to decide how much weight to give minor characters. A simpler approach is just to ask how many words are used to describe fictional men or women, respectively. BookNLP gave us a way to answer that question; it uses names and honorifics to infer a character’s gender, and then traces grammatical dependencies to identify adjectives that modify a character, nouns she possesses, or verbs she governs. After swinging BookNLP through 93,708 English-language volumes identified as fiction from the HathiTrust Digital Library, we can estimate the percentage of words used in characterization that are used to describe women. (To simplify the task of reading this illustration, we have left out characters coded as “other” or unknown,” so a year with equal representation of men and women would be located on the 50% line.).  To help quantify our uncertainty, we present each measurement by year along with a 95% confidence interval calculated using the bootstrap; our uncertainty decreases over time, largely as a function of an increasing number of books being published.

fig1

There is a clear decline from the nineteenth century (when women generally take up 40% or more of the “character space” in fiction) to the 1950s and 60s, when their prominence hovers around a low of 30%. A correction, beginning in the 1970s, almost restores fiction to its nineteenth-century state. (One way of thinking about this: second-wave feminism was a desperately-needed rescue operation.)

The fluctuation is not enormous, but also not trivial: women lose roughly a fourth of the space on the page they had possessed in the nineteenth century. Nor is this something we already knew. It might be a mistake to call this pattern a “surprise”: it’s not as if everyone had clearly-formed expectations about “space on the page.” But when we do pose the question, and ask scholars what they expect to see before revealing this evidence, several people have predicted a series of advances toward equality that correspond to e.g. the suffrage movement and World War II, separated by partial retreats. Instead we see a fairly steady decline from 1860 to 1970, with no overall advance toward equality.

What’s the explanation? Our methods do have blind spots. For instance, we aren’t usually able to infer gender for first-person protagonists, so they are left out here. And our inferences about other characters have a known level of error. But after cross-checking the evidence, we don’t believe the level of error is large enough to explain away this pattern (see our github repo for fuller discussion). It is of course possible that our sample of fiction is skewed. For instance, a sample of 93,708 volumes will include a lot of obscure works and works in translation. What if we focus on slightly more prominent works? We have posed that question by comparing our Hathi sample to a smaller (10,000-volume) sample drawn from the Chicago Text Lab, which emphasizes relatively prominent American works, and filters out works in translation.

fig2_chicago

As you can see, the broad outlines of the trend don’t change. If anything, the decline from 1860 to 1970 is slightly more marked in the Chicago corpus (perhaps because it does a better job of filtering out reprints, which tend to muffle change). This doesn’t prove that we will see the same pattern in every sample. There are many ways to sample the history of fiction! Some scholars will want to know about paperbacks that tend to be underrepresented in university libraries; others will only be interested in a short list of hypercanonical authors. We can’t exhaust all possible modes of sampling, but we can say at least that this trend is not an artefact of a single sampling strategy.  Nor is it an artefact of our choice to represent characters by counting words syntactically associated with them: we see the same pattern of decline to different degrees when measuring the amount of dialogue spoken by men and women, and in simply counting the number of characters as well.

So what does explain the declining representation of women? We don’t yet know. But the trend seems too complex to dismiss with a single explanation. For instance, it can be partly — but only partly — explained by a decline in the proportion of fiction writers who were women.

author

Take specific dots with a grain of salt; there are sources of error here, especially because the wall of copyright at 1923 may change digitization practices or throw off our own data pipeline. (Note the outlier right at 1923.) But the general pattern above is echoed also in the Chicago sample of American fiction, so we feel confident that there was really a decline in the fraction of fiction writers who were women. As far as we know, Chris Forster was the first person to gather broad quantitative evidence of this decline. But many scholars have grasped pieces of the story: for instance, Anne E. Boyd takes The Atlantic around 1890 as a case study of a process whereby the professionalization and canonization of American fiction tended to push out women who had previously been prominent. [See also Tuchman and Fortin 1989 in references below.]

But this is not necessarily a story about the marginalization of women writers in general. (On the contrary, the prominence of women rose throughout this period in several nonfiction genres.) The decline was specific to fiction — either because the intellectual opportunities open to women were expanding beyond belles lettres, or because the rising prestige of fiction attracted a growing number of men.

Men are overrepresented in books by men, so a decline in the number of women novelists will also tend to reduce the number of characters who are women. But that doesn’t completely explain the marginalization of feminine characters from 1860 to 1970. For instance, we can also divide authors by gender, and look at shifting patterns of attention within works by women or by men.

by_author_gender

There are several interesting details here. The inequality of attention in books by men is depressingly durable (men rarely give more than 30% of their attention to fictional women). But it’s also interesting that the fluctuations we saw earlier remain visible even when works are divided by author gender: both trend lines above show a slight decline in the space allotted to women, from 1860 to 1970. In other words, it’s not just that there were fewer works of fiction written by women; even inside books written by women, feminine characters were occupying slightly less space on the page.

Why? The rise of genres devoted to “action” and “adventure” might play a role, although we haven’t found clear evidence yet that it makes a difference. (Genre boundaries are too blurry for the question to be answered easily.) Or fiction might have been masculinized in some broader sense, less tied to specific genre categories (see Suzanne Clark, for instance, on modernism as masculinization.)

But listing possible explanations is the easy part. Figuring out which are true — and to what extent — will be harder.

We will continue to explore these questions, in collaboration with grad students, but we also want to draw other scholars’ attention to resources that can support this kind of inquiry (and invite readers to share useful secondary sources in the comments).

HathiTrust Research Center’s Extracted Features Dataset doesn’t permit the syntactic parsing performed by BookNLP, but even authors’ names and the raw frequencies of gendered pronouns can tell you a lot. Working just with that dataset, Chris Forster was able to catch significant patterns involving gender.

When we publish our article, we will also share data produced by BookNLP about specific characters across a collection of 93,708 books. HTRC is also building a “Data Capsule” that will allow other scholars to produce similar data themselves. In the meantime, in collaboration with Nikolaus N. Parulian, we have produced an interactive visualization that allows you to explore changes in the gendering of words used in characterization. (Compare, for instance, “grin” to “smile,” or “house” to “room.”) We have also made available the metadata and yearly summaries behind the visualization.

Acknowledgments. The work described here has been supported by NovelTM, funded by the Canadian Social Sciences and Humanities Research Council, and by the WCSA+DC grant at HathiTrust Research Center, funded  by the Andrew W. Mellon Foundation. We thank Hoyt Long, Teddy Roland, and Richard Jean So for permission to use the Chicago Novel Corpus. The project often relied on GenderID.py, by Bridget Baird and Cameron Blevins (2014). Boris Capitanu helped parallelize BookNLP across hundreds of thousands of volumes. Attendees at the 2016 NovelTM meeting, and Justine Murison in Illinois, provided valuable advice about literary history.

References.

Boyd, Anne E. “‘What, Has She Got into the Atlantic?’ Women Writers, The Atlantic Monthly, and the Formation of the American Canon,” American Studies 39.3 (1998): 5-36.

Clark, Suzanne. Sentimental Modernism: Women Writers and the Revolution of the Word (Indianapolis: Indiana University Press, 1992).

Forster, Chris. “A Walk Through the Metadata: Gender in the HathiTrust Dataset.” September 8, 2015. http://cforster.com/2015/09/gender-in-hathitrust-dataset/

Tuchman, Gaye, with Nina E. Fortin. Edging Women Out: Victorian Novelists, Publishers, and Social Change. New Haven: Yale University Press, 1989.

 

 

 

Notebooks? What good are notebooks?

blitz

“The letter,” Cecil Beaton, from London, Portrait of a City (1940).

It’s difficult to concentrate on intellectual work in the midst of political upheaval, partly because you’re not sure that you ought to be concentrating. Maybe ideas have become a luxury or a distraction? The dilemma is nicely captured by the image on the right, circulating on Twitter under Briallen Hopper’s caption, “What writing feels like now.”

Despite the title of this post, Americans are not actually confronting “Life During Wartime.” But we are living through a painful political transition — from the administration of a President who spoke eloquently about the responsibilities of self-government, to a President-elect who won the job by stirring up hatred of racial, ethnic, and religious minorities, and who seems not to understand or value the principles of transparency and dissent that safeguard democracy. (It doesn’t help that he’s a habitual liar and a misogynist.)

At moments like this, there’s a natural impulse to refocus one’s attention on politics, and I don’t mean to discourage it. (I have a letter to the editor of my local paper open in another window, and I’ll send it before the day is done.)

But I also want to briefly reaffirm the importance of thinking about things other than contemporary politics. Briefly, because I don’t think this should require a lot of argument. The administration now coming to power is marked by a sweeping anti-intellectualism amounting to willful ignorance, which embraces all aspects of the world, from American history to economics to atmospheric science. Our endeavors to understand the world, and communicate our understanding, obviously acquire heightened importance in this climate.

But to go a bit further: it may be worth meditating for a moment on the range of impulses that brought Trump to power. Xenophobia played a role. Misogyny played a role. Racial divisions in the working classes played a large role, which has (rightly) been widely discussed. But the bulk of Trump’s supporters were not from the working classes; they were people earning more than $50,000 a year–many of them with college degrees. Many of those voters must have recognized something problematic in Trump’s obvious arrogance, disregard of fact, contempt for minorities, and disdain for democratic norms of transparency. Republicans had a fierce debate on those subjects in the summer of 2016. But in the end, the majority of Republican voters (and more fatefully, Republican office-holders) decided that these were abstract and impalpable worries, compared to the immediate gratification of a win for their political party.

That, I think, is the most acute problem we confront right now. Something about the structure of the contemporary media ecosystem is pushing us toward a debased view of politics as a zero-sum struggle between competing teams with incommensurable world views. We are losing our ability to see a larger picture. The history and fragility of democratic institutions, our shared aspirations as Americans or human beings, the fate of the planet itself, all recede into the background, replaced by a fierce hatred of people who wear fedoras or baseball caps.

This degradation of political discourse has been gradual over the last thirty years. In my view, it has been driven particularly by changes on the right (the Southern Strategy, Lee Atwater, and Fox News Corp are names worth mentioning). But none of us are necessarily immune to the larger drift: a pressure to understand ideas instrumentally, and subordinate thinking to the reaffirmation of partisan community.

In this context, I think it becomes especially important to explain why we value modes of thinking that don’t have an immediate, contemporary, political instrumentality. The history of the Song dynasty, the reproductive biology of sea grasses, the nature of gravity, all matter for us. Part of the dignity of being human is to be a creature for whom those things matter. It may be tempting to express this by saying that universities preserve a space for dispassionate reflection, but that wouldn’t be right. In the present state of human affairs, the struggle to back up, to get a bigger picture, to think more broadly and more candidly about the world, is itself a passionately political act. And given the bloody history of our species, this was probably always true.

A distant reading of moments

I delivered a talk about time at the English Institute yesterday. Since it could easily be a year before the print version comes out, I thought I would share the draft as a working paper.

The argument has two layers. On one level it’s about the tension between distant reading and New Historicism. The New Historical anecdote fuses history with literary representation in a vivid, influential way, by compressing a large theme into a brief episode. Can quantitative arguments about the past aspire to the same kind of compression and vividness?

Inside that metacritical frame, there’s a history of narrative pace, based on evidence I gathered in collaboration with Sabrina Lee and Jessica Mercado. (We’re also working on a separate co-authored piece that will dive more deeply into this data.)

We ask how much fictional time is narrated, on average, in 250 words. We discover some dramatic changes across a timeline of 300 years, and I’m tempted to include our results as an illustration here. But I’ve decided not to, because I want to explore whether scholars already, intuitively know how the representation of duration has changed, by asking readers to reflect for a moment on what they expect to see.

So instead of illustrating this post with real evidence, I’ve provided a plausible, counterfactual illustration based on an account of duration that one might extract from influential narratological works by Gérard Genette or Seymour Chatman.

blamemodernismuse

Artificial data, generated to simulate the account of narrative pace one might extract from Gérard Genette, Narrative Discourse. Logarithmic scale.

To find out what the real story is, you’ll have to read the paper, “Why Literary Time Is Measured in Minutes.”

(Open data and code aren’t out yet, but they will be released with our co-authored essay.)

A more intimate scale of distant reading.

How big, exactly, does a collection of literary texts have to be before it makes sense to say we’re doing “distant reading”?

It’s a question people often ask, and a question that distant readers often wriggle out of answering, for good reason. The answer is not determined by the technical limits of any algorithm. It depends, rather, on the size of the blind spots in our knowledge of the literary past — and it’s part of the definition of a blind spot that we don’t already know how big it is. How far do you have to back up before you start seeing patterns that were invisible at your ordinary scale of reading? That’s how big your collection needs to be.

But from watching trends over the last couple of years, I am beginning to get the sense that the threshold for distant reading is turning out to be a bit lower than many people are currently assuming (and lower than I assumed myself in the past). To cut to the chase: it’s probably dozens or scores of books, rather than thousands.

I think there are several reasons why we all got a different impression. One is that Franco Moretti originally advertised distant reading as a continuation of 1990s canon-expansion: the whole point, presumably, was to get beyond the canon and recover a vast “slaughterhouse of literature.” That’s still one part of the project — and it leads to a lot of debate about the difficulty of recovering that slaughterhouse. But sixteen years later, it is becoming clear that new methods also allow us to do a whole lot of things that weren’t envisioned in Moretti’s original manifesto. Even if we restricted our collections to explicitly canonical works, we would still be able to tease out trends that are too long, or family resemblances that are too loose, to be described well in existing histories.

The size of the collection required depends on the question you’re posing. Unsupervised algorithms, like those used for topic modeling, are easy to package as tools: just pour in the books, and out come some topics. But since they’re not designed to answer specific questions, these approaches tend to be most useful for exploratory problems, at large scales of inquiry. (One recent project by Emily Barry, for instance, uses 22,000 Supreme Court cases.)

By contrast, a lot of recent work in distant reading has used supervised models to zero in on narrowly specified historical questions about genre or form. This approach can tell you things you didn’t already know at a smaller scale of inquiry. In “Literary Pattern Recognition,” Hoyt Long and Richard So start by gathering 400 poems in the haiku tradition. In a recent essay on genre I talk about several hundred works of detective fiction, but also ten hardboiled detective novels, and seven Newgate novels.

 

Figure5Generational

Predictive accuracy for several genres of roughly generational size, plotted relative to a curve that indicates accuracy for a random sample of detective fiction drawn from the whole period 1829-1989. The shaded ribbon covers 90% of models for a given number of examples.

Admittedly, seven is on the low side. I wouldn’t put a lot of faith in any individual dot above. But I do think we can learn something by looking at five subgenres that each contain 7-21 volumes. (In the graph above we learn, for instance, that focused “generational” genres aren’t lexically more coherent than a sample drawn from the whole 160 years of detective fiction — because the longer tradition is remarkably coherent, and pretty easy to recognize, even when you downsample it to ten or twenty volumes.)

I’d like to pitch this reduction of scale as encouraging news. Grad students and assistant professors don’t have to build million-volume collections before they can start exploring new methods. And literary scholars can practice distant reading without feeling they need to buy into any cyclopean ethic of “big data.” (I’m not sure that ethic exists, except as a vaguely-sketched straw man. But if it did exist, you wouldn’t need to buy into it.)

Computational methods themselves won’t even be necessary for all of this work. For some questions, standard social-scientific content analysis (aka reading texts and characterizing them according to an agreed-upon scheme) is a better way to proceed. In fact, if you look back at “The Slaughterhouse of Literature,” that’s what Moretti did with “about twenty” detective stories (212). Shawna Ross recently did something similar, looking at the representation of women’s scholarship at MLA#16 by reading and characterizing 792 tweets.

Humanists still have a lot to learn about social-scientific methods, as Tanya Clement has recently pointed out. (Inter-rater reliability, anyone?) And I think content analysis will run into some limits as we stretch the timelines of our studies: as you try to cover centuries of social change, it gets hard to frame a predefined coding scheme that’s appropriate for everything on the timeline. Computational models have some advantages at that scale, because they can be relatively flexible. Plus, we actually do want to reach beyond the canon.

But my point is simply that “distant reading” doesn’t prescribe a single scale of analysis. There’s a smooth ramp that leads from describing seven books, to characterizing a score or so (still by hand, but in a more systematic way), to statistical reflection on the uncertainty and variation in your evidence, to text mining and computational modeling (which might cover seven books or seven hundred). Proceed only as far as you find useful for a given question.

The real problem with distant reading.

This will be an old-fashioned, shamelessly opinionated, 1000-word blog post.

Anyone who has tried to write literary history using numbers knows that they are a double-edged sword. On the one hand they make it possible, not only to consider more examples, but often to trace subtler, looser patterns than we could trace by hand.

On the other hand, quantitative methods are inherently complex, and unfamiliar for humanists. So it’s easy to bog down in preambles about method.

Social scientists talking about access to healthcare may get away with being slightly dull. But literary criticism has little reason to exist unless it’s interesting; if it bogs down in a methodological preamble, it’s already dead. Some people call the cause of death “positivism,” but only because that sounds more official than “boredom.”

This is a rhetorical rather than epistemological problem, and it needs a rhetorical solution. For instance, Matthew Wilkens and Cameron Blevins have rightly been praised for focusing on historical questions, moving methods to an appendix if necessary. You may also recall that a book titled Distant Reading recently won a book award in the US. Clearly, distant reading can be interesting, even exciting, when writers are able to keep it simple. That requires resisting several temptations.

One temptation to complexity is technical, of course: writers who want to reach a broad audience need to resist geeking out over the latest algorithm. Perhaps fewer people recognize that the advice of more traditional colleagues can be another source of temptation. Scholars who haven’t used computational methods rarely understand the rhetorical challenges that confront distant readers. They worry that our articles won’t be messy enough — bless their hearts — so they advise us to multiply close readings, use special humanistic visualizations, add editorial apparatus to the corpus, and scatter nuances liberally over everything.

Some parts of this advice are useful: a crisp, pointed close reading can be a jolt of energy. And Katherine Bode is right that, in 2016, scholars should share data. But a lot of the extra knobs and nuances that colleagues suggest adding are gimcrack notions that would aggravate the real problem we face: complexity.

Consider the common advice that distant readers should address questions about the representativeness of their corpora by weighting all the volumes to reflect their relative cultural importance. A lot of otherwise smart people have recommended this. But as far as I know, no one ever does it. The people who recommend it, don’t do it themselves, because a moment’s thought reveals that weighting volumes can only multiply dubious assumptions. Personally, I suspect that all quests for the One Truly Representative Corpus are a mug’s game. People who worry about representativeness are better advised to consult several differently-selected samples. That sometimes reveals confounding variables — but just as often reveals that selection practices make little difference for the long-term history of the variable you’re tracing. (The differences between canon and archive are not always as large, or as central to this project, as Franco Moretti initially assumed.)

treknorman

Sic semper robotis. “I, Mudd” (1967).

Another tempting piece of advice comes from colleagues who invite distant readers to prove humanistic credentials by adding complexity to their data models. This suggestion draws moral authority from a long-standing belief that computers force everything to be a one or a zero, whereas human beings are naturally at home in paradox. That’s why Captain Kirk could easily destroy alien robots by confronting them with the Liar’s Paradox. “How can? It be X. But also? Not X. Does not compute. <Smell of frying circuitry>.”

Maybe in 1967 it was true that computers could only handle exclusive binary categories. I don’t know: I was divided into a pair of gametes at the time myself. But nowadays data models can be as complex as you like. Go ahead, add another column to the database. No technical limit will stop you. Make categories perspectival, by linking each tag to a specific observer. Allow contradictory tags. If you’re still worried that things are too simple, make each variable a probability distribution. The computer won’t break a sweat, although your data model may look like the illustration below to human readers.

goldberg

Rube Goldberg, “Professor Butts and the Self-Operating Napkin” (1931), via Wikimedia Commons.

I just wrote an article, for instance, where I consider eighteen different sources of testimony about genre — each of which models a genre in ways that can implicitly conflict with, or overlap with, other definitions. I trust you can see the danger: it’s not that the argument will be too reductive. I was willing to run a risk of complexity in this case because I was tired of being told that computers force everything into binaries. Machine learning is actually good at eschewing fixed categories to tease out loose family resemblances; it can be every bit as perspectival, multifaceted, and blurry as you wanna be.

I hope my article manages to remain lively, but I think readers will discover, by the end, that it could have succeeded with a simpler data model. When I rework it for the book version, I may do some streamlining.

It’s okay to simplify the world in order to investigate a specific question. That’s what smart qualitative scholars do themselves, when they’re not busy giving impractical advice to their quantitative friends. Max Weber and Hannah Arendt didn’t make an impact on their respective fields — or on public life — by adding the maximum amount of nuance to everything, so their models could represent every aspect of reality at once, and also function as self-operating napkins.

Because distant readers use larger corpora and more explicit data models than is usual for literary study, critics of the field (internal as well as external) have a tendency to push on those visible innovations, asking “What is still left out?” Something will always be left out, but I don’t think distant reading needs to be pushed toward even more complexity and completism. Computers make those things only too easy. Instead distant readers need to struggle to retain the vividness and polemical verve of the best literary criticism, and the  “combination of simplicity and strength” that characterizes useful social theory.

 

Versions of disciplinary history.

Accounts of the history of the humanities are being strongly shaped, right now, by stances for or against something called “digital humanities.” I have to admit I avoid the phrase when I can. The good thing about DH is, it creates a lively community that crosses disciplinary lines to exchange ideas. The bad thing is, it also creates a community that crosses disciplinary lines to fight pointlessly over the meaning of “digital humanities.” Egyptologists and scholars of game studies, who once got along just fine doing different things, suddenly understand themselves as advancing competing, incompatible versions of DH.

The desire to defend a coherent tradition called DH can also lead to models of intellectual history that I find bizarre. Sometimes, for instance, people trace all literary inquiry using computers back to Roberto Busa. That seems to me an oddly motivated genealogy: it would only make sense if you thought the physical computers themselves were very important. I tend to trace the things people are doing instead to Janice Radway, Roman Jakobson, Raymond Williams, or David Blei.

On the other hand, we’ve recently seen that a desire to take a stand against digital humanities can lead to equally unpersuasive genealogies. I’m referring to a recent critique of digital humanities in LARB by Daniel Allington, Sarah Brouillette, and David Golumbia. The central purpose of the piece is to identify digital humanities as a neoliberal threat to the humanities.

I’m not going to argue about whether digital humanities is neoliberal; I’ve already said that I fear the term is becoming a source of pointless fights. So I’m not the person to defend the phrase, or condemn it. But I do care about properly crediting people who contributed to the tradition of literary history I work in, and here I think the piece in LARB leads to important misunderstandings.

The argument is supported by two moves that I would call genealogical sleight-of-hand. On the one hand, it unifies a wide range of careers that might seem to have pursued different ends (from E. D. Hirsch to Rita Felski) by the crucial connecting link that all these people worked at the University of Virginia. On the other hand, it needs to separate various things that readers might associate with digital humanities, so if any intellectual advances happen to take place in some corner of a discipline, it can say “well, you know, that part wasn’t really related; it had a totally different origin.”

I don’t mean to imply that the authors are acting in bad faith here; nor do I think people who over-credit Roberto Busa for all literary work done with computers are motivated by bad faith. This is just an occupational hazard of doing history. If you belong to a particular group (a national identity, or a loose social network like “DH”), there’s always a danger of linking and splitting things so history becomes a story about “the rise of France.” The same thing can happen if you deeply dislike a group.

So, I take it as a sincere argument. But the article’s “splitting” impulses are factually mistaken in three ways. First, the article tries to crisply separate everything happening in distant reading from the East Coast — where people are generally tarnished (in the authors’ eyes) by association with UVA. Separating these traditions allows the article to conclude “well, Franco Moretti may be a Marxist, but the kind of literary history he’s pursuing had nothing to do with those editorial theory types.”

That’s just not true; the projects may be different, but there have also been strong personal and intellectual connections between them. At times, the connections have been embodied institutionally in the ADHO, but let me offer a more personal example: I wouldn’t be doing what I’m doing right now if it weren’t for the MONK project. Before I knew how to code — or, code in anything other than 1980s-era Basic — I spent hours playing with the naive Bayes feature in MONK online, discovering what it was capable of. For me, that was the gateway drug that led eventually to a deeper engagement with sociology of literature, book history, machine learning, and so on. MONK was created by a group centered at our Graduate School of Library and Information Science, but the dark truth is that several of those people had been trained at UVA (I know Unsworth, Ramsay, and Kirschenbaum were involved — pardon me if I’m forgetting others).

MONK is also an example of another way the article’s genealogy goes wrong: by trying to separate anything that might be achieved intellectually in a field like literary history from the mere “support functions for the humanities” provided by librarians and academic professionals. Just as a matter of historical fact, that’s not a correct account of how large-scale literary history has developed. My first experiment with quantitative methods — before MONK — took shape back in 1995, when my first published article, in Studies in Romanticism (1995). used quantitative methods influenced by Mark Olsen, a figure who deserves a lot more credit than he has received. Olsen had already sketched out the theoretical rationale for a research program you might call “distant reading” in 1989, arguing that text analysis would only really become useful for the humanities when it stopped trying to produce readings of individual books and engaged broad social-historical questions. But Olsen was not a literature professor. He had a Ph.D in French history, and was working off the tenure track with a digital library called ARTFL at the University of Chicago.

Really at every step of the way — from ARTFL, to MONK, to the Stanford Literary Lab, to HathiTrust Research Center — my thinking about this field has been shaped by projects that were organized and led by people with appointments in libraries and/or in library science. You may like that, or feel that it’s troubling — up to you — but it’s the historical fact.

Personally, I take it as a sign that, in historical disciplines, libraries and archives really matter. A methodology, by itself, is not enough; you also need material, and the material needs to be organized in ways that are far from merely clerical. Metadata is a hard problem. The organization of the past is itself an interpretive act, and libraries are one of the institutional forms it takes. I might not have realized that ten years ago, but after struggling to keep my head above water in a sea of several million books, I feel it very sincerely.

This is why I think the article is also wrong to treat distant reading as a mere transplantation of social-science methods. I suspect the article has seen this part of disciplinary history mainly through the lens of Daniel Allington’s training in linguistics, so I credit it as a good-faith understanding: if you’re trained in social science, then I understand, large-scale literary history will probably look like sociology and linguistics that happen to have gotten mixed in some way and then applied to the past.

But the article is leaving out something that really matters in this field, which is turning methods into historical arguments. To turn social-scientific methods into literary history, you have to connect the results of a model, meaningfully, to an existing conversation about the literary past. For that, you need a lot of things that aren’t contained in the original method. Historical scholarship. Critical insight, dramatized by lively writing. And also metadata. Authors’ dates of birth and death; testimony about perceived genre categories. A corpus isn’t enough. Social-scientific methods can only become literary history in collaboration with libraries.

I know nothing I have said here will really address the passions evoked on multiple sides by the LARB article. I expect this post will be read by some as an attempt to defend digital humanities, and by others as a mealy-mouthed failure to do so. That’s okay. But from my own (limited) perspective, I’m just trying to write some history here, giving proper credit to people who were involved in building the institutions and ideas I rely on. Those people included social scientists, humanists, librarians, scholars in library and information science, and people working off the tenure track in humanities computing.

Postscript: On the importance of libraries, see Steven E. Jones, quoting Bethany Nowviskie about the catalytic effect of Google Books (Emergence 8, and “Resistance in the Materials”). Since metadata matters, Google Books became enormously more valuable to scholars in the form of HathiTrust. The institutional importance I attribute to libraries is related to Alan Liu’s recent observations about the importance of critically engaging infrastructure.

References

Jones, Steven E. The Emergence of the Digital Humanities. New York: Routledge, 2014.

Olsen, Mark. “The History of Meaning: Computational and Quantitative Methods in Intellectual History,” Joumal of History and Politics 6 (1989): 121-54.

Olsen, Mark. “Signs, Symbols, and Discourses: A New Direction for Computer-Aided Literary Studies.” Computers and the Humanities 27 (1993): 309-14.