Versions of disciplinary history.

Accounts of the history of the humanities are being strongly shaped, right now, by stances for or against something called “digital humanities.” I have to admit I avoid the phrase when I can. The good thing about DH is, it creates a lively community that crosses disciplinary lines to exchange ideas. The bad thing is, it also creates a community that crosses disciplinary lines to fight pointlessly over the meaning of “digital humanities.” Egyptologists and scholars of game studies, who once got along just fine doing different things, suddenly understand themselves as advancing competing, incompatible versions of DH.

The desire to defend a coherent tradition called DH can also lead to models of intellectual history that I find bizarre. Sometimes, for instance, people trace all literary inquiry using computers back to Roberto Busa. That seems to me an oddly motivated genealogy: it would only make sense if you thought the physical computers themselves were very important. I tend to trace the things people are doing instead to Janice Radway, Roman Jakobson, Raymond Williams, or David Blei.

On the other hand, we’ve recently seen that a desire to take a stand against digital humanities can lead to equally unpersuasive genealogies. I’m referring to a recent critique of digital humanities in LARB by Daniel Allington, Sarah Brouillette, and David Golumbia. The central purpose of the piece is to identify digital humanities as a neoliberal threat to the humanities.

I’m not going to argue about whether digital humanities is neoliberal; I’ve already said that I fear the term is becoming a source of pointless fights. So I’m not the person to defend the phrase, or condemn it. But I do care about properly crediting people who contributed to the tradition of literary history I work in, and here I think the piece in LARB leads to important misunderstandings.

The argument is supported by two moves that I would call genealogical sleight-of-hand. On the one hand, it unifies a wide range of careers that might seem to have pursued different ends (from E. D. Hirsch to Rita Felski) by the crucial connecting link that all these people worked at the University of Virginia. On the other hand, it needs to separate various things that readers might associate with digital humanities, so if any intellectual advances happen to take place in some corner of a discipline, it can say “well, you know, that part wasn’t really related; it had a totally different origin.”

I don’t mean to imply that the authors are acting in bad faith here; nor do I think people who over-credit Roberto Busa for all literary work done with computers are motivated by bad faith. This is just an occupational hazard of doing history. If you belong to a particular group (a national identity, or a loose social network like “DH”), there’s always a danger of linking and splitting things so history becomes a story about “the rise of France.” The same thing can happen if you deeply dislike a group.

So, I take it as a sincere argument. But the article’s “splitting” impulses are factually mistaken in three ways. First, the article tries to crisply separate everything happening in distant reading from the East Coast — where people are generally tarnished (in the authors’ eyes) by association with UVA. Separating these traditions allows the article to conclude “well, Franco Moretti may be a Marxist, but the kind of literary history he’s pursuing had nothing to do with those editorial theory types.”

That’s just not true; the projects may be different, but there have also been strong personal and intellectual connections between them. At times, the connections have been embodied institutionally in the ADHO, but let me offer a more personal example: I wouldn’t be doing what I’m doing right now if it weren’t for the MONK project. Before I knew how to code — or, code in anything other than 1980s-era Basic — I spent hours playing with the naive Bayes feature in MONK online, discovering what it was capable of. For me, that was the gateway drug that led eventually to a deeper engagement with sociology of literature, book history, machine learning, and so on. MONK was created by a group centered at our Graduate School of Library and Information Science, but the dark truth is that several of those people had been trained at UVA (I know Unsworth, Ramsay, and Kirschenbaum were involved — pardon me if I’m forgetting others).

MONK is also an example of another way the article’s genealogy goes wrong: by trying to separate anything that might be achieved intellectually in a field like literary history from the mere “support functions for the humanities” provided by librarians and academic professionals. Just as a matter of historical fact, that’s not a correct account of how large-scale literary history has developed. My first experiment with quantitative methods — before MONK — took shape back in 1995, when my first published article, in Studies in Romanticism (1995). used quantitative methods influenced by Mark Olsen, a figure who deserves a lot more credit than he has received. Olsen had already sketched out the theoretical rationale for a research program you might call “distant reading” in 1989, arguing that text analysis would only really become useful for the humanities when it stopped trying to produce readings of individual books and engaged broad social-historical questions. But Olsen was not a literature professor. He had a Ph.D in French history, and was working off the tenure track with a digital library called ARTFL at the University of Chicago.

Really at every step of the way — from ARTFL, to MONK, to the Stanford Literary Lab, to HathiTrust Research Center — my thinking about this field has been shaped by projects that were organized and led by people with appointments in libraries and/or in library science. You may like that, or feel that it’s troubling — up to you — but it’s the historical fact.

Personally, I take it as a sign that, in historical disciplines, libraries and archives really matter. A methodology, by itself, is not enough; you also need material, and the material needs to be organized in ways that are far from merely clerical. Metadata is a hard problem. The organization of the past is itself an interpretive act, and libraries are one of the institutional forms it takes. I might not have realized that ten years ago, but after struggling to keep my head above water in a sea of several million books, I feel it very sincerely.

This is why I think the article is also wrong to treat distant reading as a mere transplantation of social-science methods. I suspect the article has seen this part of disciplinary history mainly through the lens of Daniel Allington’s training in linguistics, so I credit it as a good-faith understanding: if you’re trained in social science, then I understand, large-scale literary history will probably look like sociology and linguistics that happen to have gotten mixed in some way and then applied to the past.

But the article is leaving out something that really matters in this field, which is turning methods into historical arguments. To turn social-scientific methods into literary history, you have to connect the results of a model, meaningfully, to an existing conversation about the literary past. For that, you need a lot of things that aren’t contained in the original method. Historical scholarship. Critical insight, dramatized by lively writing. And also metadata. Authors’ dates of birth and death; testimony about perceived genre categories. A corpus isn’t enough. Social-scientific methods can only become literary history in collaboration with libraries.

I know nothing I have said here will really address the passions evoked on multiple sides by the LARB article. I expect this post will be read by some as an attempt to defend digital humanities, and by others as a mealy-mouthed failure to do so. That’s okay. But from my own (limited) perspective, I’m just trying to write some history here, giving proper credit to people who were involved in building the institutions and ideas I rely on. Those people included social scientists, humanists, librarians, scholars in library and information science, and people working off the tenure track in humanities computing.

Postscript: On the importance of libraries, see Steven E. Jones, quoting Bethany Nowviskie about the catalytic effect of Google Books (Emergence 8, and “Resistance in the Materials”). Since metadata matters, Google Books became enormously more valuable to scholars in the form of HathiTrust. The institutional importance I attribute to libraries is related to Alan Liu’s recent observations about the importance of critically engaging infrastructure.

References

Jones, Steven E. The Emergence of the Digital Humanities. New York: Routledge, 2014.

Olsen, Mark. “The History of Meaning: Computational and Quantitative Methods in Intellectual History,” Joumal of History and Politics 6 (1989): 121-54.

Olsen, Mark. “Signs, Symbols, and Discourses: A New Direction for Computer-Aided Literary Studies.” Computers and the Humanities 27 (1993): 309-14.

Can we date revolutions in the history of literature and music?

Humanists know the subjects we study are complex. So on the rare occasions when we describe them with numbers at all, we tend to proceed cautiously. Maybe too cautiously. Distant readers have spent a lot of time, for instance, just convincing colleagues that it might be okay to use numbers for exploratory purposes.

But the pace of this conversation is not entirely up to us. Outsiders to our disciplines may rush in where we fear to tread, forcing us to confront questions we haven’t faced squarely.

For instance, can we use numbers to identify historical periods when music or literature changed especially rapidly or slowly? Humanists have often used qualitative methods to make that sort of argument. At least since the nineteenth century, our narratives have described periods of stasis separated by paradigm shifts and revolutionary ruptures. For scientists, this raises an obvious, tempting question: why not actually measure rates of change and specify the points on the timeline when ruptures happened?

The highest-profile recent example of this approach is an article in Royal Society Open Science titled “The evolution of popular music” (Mauch et al. 2015). The authors identify three moments of rapid change in US popular music between 1960 and 2010. Moreover, they rank those moments, and argue that the advent of rap caused popular music to change more rapidly than the British Invasion — a claim you may remember, because it got a lot of play in the popular press. Similar arguments have appeared about the pace of change in written expression — e.g, a recent article argues that 1917 was a turning point in political rhetoric (h/t Cameron Blevins).

When disciplinary outsiders make big historical claims, humanists may be tempted just to roll our eyes. But I don’t think this is a kind of intervention we can afford to ignore. Arguments about the pace of cultural change engage theoretical questions that are fundamental to our disciplines, and questions that genuinely fascinate the public. If scientists are posing these questions badly, we need to explain why. On the other hand, if outsiders are addressing important questions with new methods, we need to learn from them. Scholarship is not a struggle between disciplines where the winner is the discipline that repels foreign ideas with greatest determination.

I feel particularly obligated to think this through, because I’ve been arguing for a couple of years that quantitative methods tend to reveal gradual change rather than the sharply periodized plateaus we might like to discover in the past. But maybe I just haven’t been looking closely enough for discontinuities? Recent articles introduce new ways of locating and measuring them.

This blog post applies methods from “The evolution of popular music” to a domain I understand better — nineteenth-century literary history. I’m not making a historical argument yet, just trying to figure out how much weight these new methods could actually support. I hope readers will share their own opinions in the comments. So far I would say I’m skeptical about these methods — or at least skeptical that I know how to interpret them.

How scientists found musical revolutions.

Mauch et al. start by collecting thirty-second snippets of songs in the Billboard Hot 100 between 1960 and 2010. Then they topic-model the collection to identify recurring harmonic and timbral topics. To study historical change, they divide the fifty-year collection into two hundred quarter-year periods, and aggregate the topic frequencies for each quarter. They’re thus able to create a heat map of pairwise “distances” between all these quarter-year periods. This heat map becomes the foundation for the crucial next step in their argument — the calculation of “Foote novelty” that actually identifies revolutionary ruptures in music history.

Figure 5 from Mauch, et. al., “The evolution of popular music” (RSOS 2015).

The diagonal line from bottom left to top right of the heat map represents comparisons of each time segment to itself: that distance, obviously, should be zero. As you rise above that line, you’re comparing the same moment to quarters in its future; if you sink below, you’re comparing it to its past. Long periods where topic distributions remain roughly similar are visible in this heat map as yellowish squares. (In the center of those squares, you can wander a long way from the diagonal line without hitting much dissimilarity.) The places where squares are connected at the corners are moments of rapid change. (Intuitively, if you deviate to either side of the narrow bridge there, you quickly get into red areas. The temporal “window of similarity” is narrow.) Using an algorithm outlined by Jonathan Foote (2000), the authors translate this grid into a line plot where the dips represent musical “revolutions.”

Trying the same thing on the history of the novel.

Could we do the same thing for the history of fiction? The labor-intensive part would be coming up with a corpus. Nineteenth-century literary scholars don’t have a Billboard Hot 100. We could construct one, but before I spend months crafting a corpus to address this question, I’d like to know whether the question itself is meaningful. So this is a deliberately rough first pass. I’ve created a sample of roughly 1000 novels in a quick and dirty way by randomly selecting 50 male and 50 female authors from each decade 1820-1919 in HathiTrust. Each author is represented in the whole corpus only by a single volume. The corpus covers British and American authors; spelling is normalized to modern British practice. If I were writing an article on this topic I would want a larger dataset and I would definitely want to record things like each author’s year of birth and nationality. This is just a first pass.

Because this is a longer and sparser sample than Mauch et al. use, we’ll have to compare two-year periods instead of quarters of a year, giving us a coarser picture of change. It’s a simple matter to run a topic model (with 50 topics) and then plot a heat map based on cosine similarities between the topic distributions in each two-year period.

Heatmap and Foote novelty for 1000 novels, 1820-1919. Rises in the trend lines correspond to increased Foote novelty.

Heatmap and Foote novelty for 1000 novels, 1820-1919. Rises in the trend lines correspond to increased Foote novelty.

Voila! The dark and light patterns are not quite as clear here as they are in “The evolution of popular music.” But there are certainly some squarish areas of similarity connected at the corners. If we use Foote novelty to interpret this graph, we’ll have one major revolution in fiction around 1848, and a minor one around 1890. (I’ve flipped the axis so peaks, rather than dips, represent rapid change.) Between these peaks, presumably, lies a valley of Victorian stasis.

Is any of that true? How would we know? If we just ask whether this story fits our existing preconceptions, I guess we could make it fit reasonably well. As Eleanor Courtemanche pointed out when I discussed this with her, the end of the 1840s is often understood as a moment of transition to realism in British fiction, and the 1890s mark the demise of the three-volume novel. But it’s always easy to assimilate new evidence to our preconceptions. Before we rush to do it, let’s ask whether the quantitative part of this argument has given us any reason at all to believe that the development of English-language fiction really accelerated in the 1840s.

I want to pose four skeptical questions, covering the spectrum from fiddly quantitative details to broad theoretical doubts. I’ll start with the fiddliest part.

1) Is this method robust to different ways of measuring the “distance” between texts?

The short answer is “yes.” The heat maps plotted above are calculated on a topic model, after removing stopwords, but I get very similar results if I compare texts directly, without a topic model, using a range of different distance metrics. Mauch et al. actually apply PCA as well as a topic model; that doesn’t seem to make much difference. The “moments of revolution” stay roughly in the same place.

2) How crucial is the “Foote novelty” piece of the method?

Very crucial, and this is where I think we should start to be skeptical. Mauch et al. are identifying moments of transition using a method that Jonathan Foote developed to segment audio files. The algorithm is designed to find moments of transition, even if those moments are quite subtle. It achieves this by making comparisons — not just between the immediately previous and subsequent moments in a stream of observations — but between all segments of the timeline.

It’s a clever and sensitive method. But there are other, more intuitive ways of thinking about change. For instance, we could take the first ten years of the dataset as a baseline and directly compare the topic distributions in each subsequent novel back to the average distribution in 1820-1829. Here’s the pattern we see if we do that:

byyearThat looks an awful lot like a steady trend; the trend may gradually flatten out (either because change really slows down or, more likely, because cosine distances are bounded at 1.0) but significant spurts of revolutionary novelty are in any case quite difficult to see here.

That made me wonder about the statistical significance of “Foote novelty,” and I’m not satisfied that we know how to assess it. One way to test the statistical significance of a pattern is to randomly permute your data and see how often patterns of the same magnitude turn up. So I repeatedly scrambled the two-year periods I had been comparing, constructed a heat matrix by comparing them pairwise, and calculated Foote novelty.

A heatmap produced by randomly scrambling the fifty two-year periods in the corpus. The “dates” on the timeline are now meaningless.

When I do this I almost always find Foote novelties that are as large as the ones we were calling “revolutions” in the earlier graph.

The authors of “The evolution of popular music” also tested significance with a permutation test. They report high levels of significance (p < 0.01) and large effect sizes (they say music changes four to six times faster at the peak of a revolution than at the bottom of a trough). Moreover, they have generously made their data available, in a very full and clearly-organized csv. But when I run my permutation test on their data, I run into the same problem — I keep discovering random Foote novelties that seem as large as the ones in the real data.

It’s possible that I’m making some error, or that we're testing significance differently. I'm permuting the underlying data, which always gives me a matrix that has the checkerboardy look you see above. The symmetrical logic of pairwise comparison still guarantees that random streaks organize themselves in a squarish way, so there are still “pinch points” in the matrix that create high Foote novelties. But the article reports that significance was calculated “by random permutation of the distance matrix.” If I actually scramble the rows or columns of the distance matrix itself I get a completely random pattern that does give me very low Foote novelty scores. But I would never get a pattern like that by calculating pairwise distances in a real dataset, so I haven’t been able to convince myself that it’s an appropriate test.

3) How do we know that all forms of change should carry equal cultural weight?

Now we reach some questions that will make humanists feel more at home. The basic assumption we’re making in the discussion above is that all the features of an expressive medium bear historical significance. If writers replace “love” with “spleen,” or replace “cannot” with “can’t,” it may be more or less equal where this method is concerned. It all potentially counts as change.

This is not to say that all verbal substitutions will carry exactly equal weight. The weight assigned to words can vary a great deal depending on how exactly you measure the distance between texts; topic models, for instance, will tend to treat synonyms as equivalent. But — broadly speaking — things like contractions can still potentially count as literary change, just as instrumentation and timbre count as musical change in “The evolution of popular music.”

At this point a lot of humanists will heave a relieved sigh and say “Well! We know that cultural change doesn’t depend on that kind of merely verbal difference between texts, so I can stop worrying about this whole question.”

Not so fast! I doubt that we know half as much as we think we know about this, and I particularly doubt that we have good reasons to ignore all the kinds of change we’re currently ignoring. Paying attention to merely verbal differences is revealing some massive changes in fiction that previously slipped through our net — like the steady displacement of abstract social judgment by concrete description outlined by Heuser and Le-Khac in LitLab pamphlet #4.

For me, the bottom line is that we know very little about the kinds of change that should, or shouldn’t, count in cultural history. “The evolution of popular music” may move too rapidly to assume that every variation of a waveform bears roughly equal historical significance. But in our daily practice, literary historians rely on a set of assumptions that are much narrower and just as arbitrary. An interesting debate could take place about these questions, once humanists realize what’s at stake, but it’s going to be a thorny debate, and it may not be the only way forward, because …

4) Instead of discussing change in the abstract, we might get further by specifying the particular kinds of change we care about.

Our systems of cultural periodization tend to imply that lots of different aspects of writing (form and style and theme) all change at the same time — when (say) “aestheticism” is replaced by “modernism.” That underlying theory justifies the quest for generalized cultural growth spurts in “The evolution of popular music.”

But we don’t actually have to think about change so generally. We could specify particular social questions that interest us, and measure change relative to those questions.

The advantage of this approach is that you no longer have to start with arbitrary assumptions about the kind of “distance” that counts. Instead you could use social evidence to train a predictive model. Insofar as that model predicts the variables you care about, you know that it’s capturing the specific kind of change that matters for your question.

Jordan Sellers and I took this approach in a working paper we released last spring, modeling the boundary between volumes of poetry that were reviewed in prominent venues, and those that remained obscure. We found that the stylistic signals of poetic prestige remained relatively stable across time, but we also found that they did move, gradually, in a coherent direction. What we didn’t do, in that article, is try to measure the pace of change very precisely. But conceivably you could, using Foote novelty or some other method. Instead of creating a heatmap that represents pairwise distances between texts, you could create a grid where models trained to recognize a social boundary in particular decades make predictions about the same boundary in other decades. If gender ideologies or definitions of poetic prestige do change rapidly in a particular decade, it would show up in the grid, because models trained to predict authorial gender or poetic prominence before that point would become much worse at predicting it afterward.

Conclusion

I haven’t come to any firm conclusion about “The evolution of popular music.” It’s a bold article that proposes and tests important claims; I’ve learned a lot from trying the same thing on literary history. I don’t think I proved that there aren’t any revolutionary growth spurts in the history of the novel. It’s possible (my gut says, even likely) that something does happen around 1848 and around 1890. But I wasn’t able to show that there’s a statistically significant acceleration of change at those moments. More importantly, I haven’t yet been able to convince myself that I know how to measure significance and effect size for Foote novelty at all; so far my attempts to do that produce results that seem different from the results in a paper written by four authors who have more scientific training than I do, so there’s a very good chance that I’m misunderstanding something.

I would welcome comments, because there are a lot of open questions here. The broader task of measuring the pace of cultural change is the kind of genuinely puzzling problem that I hope we’ll be discussing at more length in the IPAM Cultural Analytics workshop next spring at UCLA.

Postscript Oct 5: More will be coming in a day or two. The suggestions I got from comments (below) have helped me think the quantitative part of this through, and I’m working up an iPython notebook that will run reliable tests of significance and effect size for the music data in Mauch et al. as well as a larger corpus of novels. I have become convinced that significance tests on Foote novelty are not a good way to identify moments of rapid change. The basic problem with that approach is that sequential datasets will always have higher Foote novelties than permuted (non-sequential) datasets, if you make the “window” wide enough — even if the pace of change remains constant. Instead, borrowing an idea from Hoyt Long and Richard So, I’m going to use a Chow test to see whether rates of change vary.

Postscript Oct 8: Actually it could be a while before I have more to say about this, because the quantitative part of the problem turns out to be hard. Rates of change definitely vary. Whether they vary significantly, may be a tricky question.

References:

Jonathan Foote. Automatic audio segmentation using a measure of audio novelty. In Proceedings of IEEE International Conference on Multimedia and Expo, vol. I, pp. 452-455, 2000.

Matthias Mauch, Robert M. MacCallum, Mark Levy, Armand M. Leroi. The evolution of popular music. Royal Society Open Science. May 6, 2015.

Distant reading and the blurry edges of genre.

There are basically two different ways to build collections for distant reading. You can build up collections of specific genres, selecting volumes that you know belong to them. Or you can take an entire digital library as your base collection, and subdivide it by genre.

Most people do it the first way, and having just spent two years learning to do it the second way, I’d like to admit that they’re right. There’s a lot of overhead involved in mining a library. The problem becomes too big for your desktop; you have to schedule batch jobs; you have to learn to interpret MARC records. All this may be necessary eventually, but it’s not the ideal place to start.

But some of the problems I’ve encountered have been interesting. In particular, the problem of “dividing a library by genre” has made me realize that literary studies is constituted by exclusions that are a bit larger and more arbitrary than I used to think.

First of all, why is dividing by genre even a problem? Well, most machine-readable catalog records don’t say much about genre, and even if they did, a single volume usually contains multiple genres anyway. (Think introductions, indexes, collected poems and plays, etc.) With support from the ACLS and NEH, I’ve spent the last year wrestling with that problem, and in a couple of weeks I’m going to share an imperfect page-level map of genre for English-language books in HathiTrust 1700-1923.

But the bigger thing I want to report is that the ambiguity of genre may run deeper than most scholars who aren’t librarians currently imagine. To be sure, we know that subgenres like “detective fiction” are social institutions rather than natural forms. And in a vague way we also accept that broader categories like “fiction” and “poetry” are social constructs with blurry edges. We can all point to a few anomalies: prose poems, eighteenth-century journalistic fictions like The Spectator, and so on.

But somehow, in spite of knowing this for twenty years, I never grasped the full scale of the problem. For instance, I knew the boundary between fiction and nonfiction was blurry in the 18c, but I thought it had stabilized over time. By the time you got to the Victorians, surely, you could draw a circle around “fiction.” Exceptions would just prove the rule.

Selecting volumes one by one for genre-specific collections didn’t shake my confidence. But if you start with a whole library and try to winnow it down, you’re forced to consider a lot of things you would otherwise never look at. I’ve become convinced that the subset of genre-typical cases (should we call them cis-genred volumes?) is nowhere near as paradigmatic as literary scholars like to imagine. A substantial proportion of the books in a library don’t fit those models.

This is both a photograph of a real, unnamed mother and baby, and a picture of a fictional character named Shinkah. Frontispiece to Shinkah, The Osage Indian (1916).

This is both a photograph of a real, unnamed mother and baby, and a picture of a fictional character named Shinkah. Frontispiece to Shinkah, The Osage Indian (1916).


Consider the case of Shinkah, the Osage Indian, published in 1916 by S. M. Barrett. The preface to this volume informs us that it’s intended as a contribution to “the sociology of the Osage Indians.” But it’s set a hundred years in the past, and the central character Shinkah is entirely fictional (his name just means “child.”) On the other hand, the book is illustrated with photographs of real contemporary people, who stand for the characters in an ethnotypical way.

After wading though 872,000 volumes, I’m sorry to report that odd cases of this kind are more typical of nineteenth- and early twentieth-century fiction than my graduate-school training had led me to believe. There’s a smooth continuum for instance between Shinkah and Old Court Life in France (1873), by Frances Elliot. This book has a bibliography, and a historiographical preface, but otherwise reads like a historical novel, complete with invented dialogue. I’m not sure how to distinguish it from other historical novels with real historical personages as characters.

Literary critics know there’s a problem with historical fiction. We also know about the blurry boundary between fiction, journalism, and travel writing represented by the genre of the “sketch.” And anyone who remembers James Frey being kicked out of Oprah Winfrey’s definition of nonfiction knows that autobiographies can be problematic. And we know that didactic fiction blurs into philosophical dialogue. And anyone who studies children’s literature knows that the boundary between fiction and nonfiction gets especially blurry there. And probably some of us know about ethnographic novels like Shinkah. But I’m not sure many of us (except for librarians) have added it all up. When you’re sorting through an entire library you’re forced to see the scale of it: in the period 1700-1923, maybe 10% of the volumes that could be cataloged as fiction present puzzling boundary cases.

You run into a lot of these works even if you browse or select titles at random; that’s how I met Shinkah. But I’ve also been training probabilistic models of genre that report, among other things, how certain or uncertain they are about each page. These models are good at identifying clear cases of our received categories; I found that they agreed with my research assistants almost exactly as often as the research assistants agreed with each other (93-94% of the time, about broad categories like fiction/nonfiction). But you can also ask a model to sift through several thousand volumes looking for hard cases. When I did that I was taken aback to discover that about half the volumes it had most trouble with were things I also found impossible to classify. The model was most uncertain, for instance, about The Terrific Register (1825) — an almanac that mixes historical anecdote, urban legend, and outright fiction randomly from page to page. The second-most puzzling book was Madagascar, or Robert Drury’s Journal (1729), a book that offers itself as a travel journal by a real person, and was for a long time accepted as one, although scholars have more recently argued that it was written by Defoe.

Of course, a statistical model of fiction doesn’t care whether things “really happened”; it pays attention mostly to word frequency. Past-tense verbs of speech, personal names, and “the,” for instance, are disproportionately common in fiction. “Is” and “also” and “mr” (and a few hundred other words) are common in nonfiction. Human readers probably think about genre in a more abstract way. But it’s not particularly miraculous that a model using word frequencies should be confused by the same examples we find confusing. The model was trained, after all, on examples tagged by human beings; the whole point of doing that was to reproduce as much as possible the contours of the boundary that separates genres for us. The only thing that’s surprising is that trawling the model through a library turns up more books right in the middle of the boundary region than our habits of literary attention would have suggested.

A lot of discussions of distant reading have imagined it as a move from canonical to popular or obscure examples of a (known) genre. But reconsidering our definitions of the genres we’re looking for may be just as important. We may come to recognize that “the novel” and “the lyric poem” have always been islands floating in a sea of other texts, widely read but never genre-typical enough to be replicated on English syllabi.

In the long run, this may require us to balance two kinds of inclusiveness. We already know that digital libraries exclude a lot. Allen Riddell has nicely demonstrated just how much: he concludes that there are digital scans for only about 58% of the novels listed in bibliographies as having been published between 1800 and 1836.

One way to ensure inclusion might be to start with those bibliographies, which highlight books invisible in digital libraries. On the other hand, bibliographies also make certain things invisible. The Terrific Register (1825), for instance, is not in Garside’s bibliography of early-nineteenth-century fiction. Neither is The Wonder-Working Water Mill (1791), to mention another odd thing I bumped into. These aren’t oversights; Garside et. al. acknowledge that they’re excluding certain categories of fiction from their conception of the novel. But because we’re trained to think about novels, the scale of that exclusion may only become visible after you spend some time trawling a library catalog.

I don’t want to present this as an aporia that makes it impossible to know where to start. It’s not. Most people attempting distant reading are already starting in the right place — which is to build up medium-sized collections of familiar generic categories like “the novel.” The boundaries of those categories may be blurrier than we usually acknowledge. But there’s also such a thing as fretting excessively about the synchronic representativeness of your sample. A lot of the interesting questions in distant reading are actually trends that involve relative, diachronic differences in the collection. Subtle differences of synchronic coverage may more or less drop out of questions about change over time.

On the other hand, if I’m right that the gray areas between (for instance) fiction and nonfiction are bigger and more persistently blurry than literary scholarship usually mentions, that’s probably in the long run an issue we should consider! When I release a page-level map of genre in a couple of weeks, I’m going to try to provide some dials that allow researchers to make more explicit choices about degrees of inclusion or exclusion.

Predictive models that report probabilities give us a natural way to handle this, because they allow us to characterize every boundary as a gradient, and explicitly acknowledge our compromises (for instance, trade-offs between precision and recall). People who haven’t done much statistical modeling often imagine that numbers will give humanists spuriously clear definitions of fuzzy concepts. My experience has been the opposite: I think our received disciplinary practices often make categories seem self-evident and stable because they teach us to focus on easy cases. Attempting to model those categories explicitly, on a large scale, can force you to acknowledge the real instability of the boundaries involved.

References and acknowledgments

Training data for this project was produced by Shawn Ballard, Jonathan Cheng, Lea Potter, Nicole Moore and Clara Mount, as well as me. Michael L. Black and Boris Capitanu built a GUI that helped us tag volumes at the page level. Material support was provided by the National Endowment for the Humanities and the American Council of Learned Societies. Some information about results and methods is online as a paper and a poster, but much more will be forthcoming in the next month or so — along with a page-level map of broad genre categories and types of paratext.

The project would have been impossible without help from HathiTrust and HathiTrust Research Center. I’ve also been taught to read MARC records by librarians and information scientists including Tim Cole, M. J. Han, Colleen Fallaw, and Jacob Jett, any of whom could teach a course on “Cursed Metadata in Theory and Practice.”

I mention Garside’s bibliography of early nineteenth-century fiction. This is Garside, Peter, and Rainer Schöwerling. The English novel, 1770-1829 : a bibliographical survey of prose fiction published in the British Isles. Ed. Peter Garside, James Raven, and Rainer Schöwerling. 2 vols. Oxford: Oxford University Press, 2000.

Paul Fyfe directed me to a couple of useful works on the genre of the sketch. Michael Widner has recently written a dissertation about the cognitive dimension of genre titled Genre Trouble. I’ve also tuned into ongoing thoughts about the temporal and social dimensions of genre from Daniel Allington and Michael Witmore. The now-classic pamphlet #1 from the Stanford Literary Lab, “Quantitative Formalism,” is probably responsible for my interest in the topic.

Genre, gender, and point of view.

A paper for the IEEE “big humanities” workshop, written in collaboration with Michael L. Black, Loretta Auvil, and Boris Capitanu, is available on arXiv now as a preprint.

The Institute of Electrical and Electronics Engineers is an odd venue for literary history, and our paper ends up touching so many disciplinary bases that it may be distracting.* So I thought I’d pull out four issues of interest to humanists and discuss them briefly here; I’m also taking the occasion to add a little information about gender that we uncovered too late to include in the paper itself.

1) The overall point about genre. Our title, “Mapping Mutable Genres in Structurally Complex Volumes,” may sound like the sort of impossible task heroines are assigned in fairy tales. But the paper argues that the blurry mutability of genres is actually a strong argument for a digital approach to their history. If we could start from some consensus list of categories, it would be easy to crowdsource the history of genre: we’d each take a list of definitions and fan out through the archive. But centuries of debate haven’t yet produced stable definitions of genre. In that context, the advantage of algorithmic mapping is that it can be comprehensive and provisional at the same time. If you change your mind about underlying categories, you can just choose a different set of training examples and hit “run” again. In fact we may never need to reach a consensus about definitions in order to have an interesting conversation about the macroscopic history of genre.

2) A workset of 32,209 volumes of English-language fiction. On the other hand, certain broad categories aren’t going to be terribly controversial. We can probably agree about volumes — and eventually specific page ranges — that contain (for instance) prose fiction and nonfiction, narrative and lyric poetry, and drama in verse, or prose, or some mixture of the two. (Not to mention interesting genres like “publishers’ ads at the back of the volume.”) As a first pass at this problem, we extract a workset of 32,209 volumes containing prose fiction from a collection of 469,200 eighteenth- and nineteenth-century volumes in HathiTrust Digital Library. The metadata for this workset is publicly available from Illinois’ institutional repository. More substantial page-level worksets will soon be produced and archived at HathiTrust Research Center.

3) The declining prevalence of first-person narration. Once we’ve identified this fiction workset, we switch gears to consider point of view — frankly, because it’s a temptingly easy problem with clear literary significance. Though the fiction workset we’re using is defined more narrowly than it was last February, we confirm the result I glimpsed at that point, which is that the prevalence of first-person point of view declines significantly toward the end of the eighteenth century and then remains largely stable for the nineteenth.

Mean probability that fiction is written in first person, 1700-1899. Based on a corpus of 32,209 volumes of fiction extracted from HathiTrust Digital Library. Points are mean probabilities for five-year spans of time; a trend line with standard errors has been plotted with loess smoothing.

Mean probability that fiction is written in first person, 1700-1899. Based on a corpus of 32,209 volumes of fiction extracted from HathiTrust Digital Library. Points are mean probabilities for five-year spans of time; a trend line with standard errors has been plotted with loess smoothing.


We can also confirm that result in a way I’m finding increasingly useful, which is to test it in a collection of a completely different sort. The HathiTrust collection includes reprints, which means that popular works have more weight in the collection than a novel printed only once. It also means that many volumes carry a date much later than their first date of publication. In some ways this gives a more accurate picture of print culture (an approximation to “what everyone read,” to borrow Scott Weingart’s phrase), but one could also argue for a different kind of representativeness, where each volume would be included only once, in a record dated to its first publication (an attempt to represent “what everyone wrote”).

Mean probability that fiction is written in first person, 1700-1899. Based on a corpus of 774 volumes of fiction selected by multiple hands from multiple sources. Plotted in 20-year bins because n is small here. Works are weighted by the number of words they contain.

Mean probability that fiction is written in first person, 1700-1899. Based on a corpus of 774 volumes of fiction selected by multiple hands from multiple sources. Plotted in 20-year bins because n is smaller here. Works are weighted by the number of words they contain.


Fortunately, Jordan Sellers and I produced a collection like that a few years ago, and we can run the same point-of-view classifier on this very different set of 774 fiction volumes (metadata available), selected by multiple hands from multiple sources (including TCP-ECCO, the Brown Women Writers Project, and the Internet Archive). Doing that reveals broadly the same trend line we saw in the HathiTrust collection. No collection can be absolutely representative (for one thing, because we don’t agree on what we ought to be representing). But discovering parallel results in collections that were constructed very differently does give me some confidence that we’re looking at a real trend.

4. Gender and point of view. In the process of classifying works of fiction, we stumbled on interesting thematic patterns associated with point of view. Features associated with first-person perspective include first-person pronouns, obviously, but also number words and words associated with sea travel. Some of this association may be explained by the surprising persistence of a particular two-century-long genre, the Robinsonade. A castaway premise obviously encourages first-person narration, but the colonial impulse in the Robinsonade also seems to have encouraged acquisitive enumeration of the objects (goats, barrels, guns, slaves) its European narrators find on ostensibly deserted islands. Thus all the number words. (But this association of first-person perspective with colonial settings and acquisitive enumeration may well extend beyond the boundaries of the Robinsonade to other genres of adventure fiction.)

Third-person perspective, on the other hand, is durably associated with words for domestic relationships (husband, lover, marriage). We’re still trying to understand these associations; they could be consequences of a preference for third-person perspective in, say, courtship fiction. But third-person pronouns correlate particularly strongly with words for feminine roles (girl, daughter, woman) — which suggests that there might also be a more specifically gendered dimension to this question.

Since transmitting our paper to the IEEE I’ve had a chance to investigate this hypothesis in the smaller of the two collections we used for that paper — 774 works of fiction between 1700 and 1899: 521 by men, 249 by women, and four not characterized by gender. (Mike Black and Jordan Sellers recorded this gender data by hand.) In this collection, it does appear that male writers choose first-person perspective significantly more than women do. The gender gap persists across the whole timespan, although it might be fading toward the end of the nineteenth century.

Proportion of works of fiction by men and women in first person. Based on the same set of 774 volumes described above. (This figure counts strictly by the number of works rather than weighting works by the number of words they contain.)

Proportion of works of fiction by men and women in first person. Based on the same set of 774 volumes described above. (This figure counts strictly by the number of works rather than weighting works by the number of words they contain.)


Over the whole timespan, women use first person in roughly 23% of their works, and men use it in roughly 35% of their works.** That’s not a huge difference, but in relative terms it’s substantial. (Men are using first person 52% more than women). The Bayesian mafia have made me wary of p-values, but if you still care: a chi-squared test on the 2×2 contingency table of gender and point of view gives p < 0.001. (Attentive readers may already be wondering whether the decline of first person might be partly explained by an increase in the proportion of women writers. But actually, in this collection, works by women have a distribution that skews slightly earlier than that of works by men.)

These are very preliminary results. 774 volumes is a small set when you could test 32,209. At the recent HTRC Uncamp, Stacy Kowalczyk described a method for gender identification in the larger HathiTrust corpus, which we will be eager to borrow once it’s published. Also, the mere presence of an association between gender and point of view doesn’t answer any of the questions literary critics will really want to pose about this phenomenon — like, why is point of view associated with gender? Is this actually a direct consequence of gender, or is it an indirect consequence of some other variable like genre? Does this gendering of narrative perspective really fade toward the end of the nineteenth century? I don’t pretend to have answered any of those questions, all I’m doing here is flagging the existence of an interesting open question that will deserve further inquiry.

— — — — —

*Other papers for the panel are beginning to appear online. Here’s “Infectious Texts: Modeling Text Reuse in Nineteenth-Century Newspapers,” by David A. Smith, Ryan Cordell, and Elizabeth Maddock Dillon.

** We don’t actually represent point of view as a binary choice between first person or third person; the classifier reports probabilities as a continuous range between 0 and 1. But for purposes of this blog post I’ve simplified by dividing the works into two sets at the 0.5 mark. On this point, and for many other details of quantitative methodology, you’ll want to consult the paper itself.

Distant reading and representativeness.

Digital collections are vastly expanding literary scholars’ field of view: instead of describing a few hundred well-known novels, we can now test our claims against corpora that include tens of thousands of works. But because this expansion of scope has also raised expectations, the question of representativeness is often discussed as if it were a weakness rather than a strength of digital methods. How can we ever produce a corpus complete and balanced enough to represent print culture accurately?

I think the question is wrongly posed, and I’d like to suggest an alternate frame. As I see it, the advantage of digital methods is that we never need to decide on a single model of representation. We can and should keep enlarging digital collections, to make them as inclusive as possible. But no matter how large our collections become, the logic of representation itself will always remain open to debate. For instance, men published more books than women in the eighteenth century. Would a corpus be correctly balanced if it reproduced those disproportions? Or would a better model of representation try to capture the demographic reality that there were roughly as many women as men? There’s something to be said for both views.

Scott Weingart tweet.To take another example, Scott Weingart has pointed out that there’s a basic tension in text mining between measuring “what was written” and “what was read.” A corpus that contains one record for every title, dated to its year of first publication, would tend to emphasize “what was written.” Measuring “what was read” is harder: a perfect solution would require sales figures, reviews, and other kinds of evidence. But, as a quick stab at the problem, we could certainly measure “what was printed,” by including one record for every volume in a consortium of libraries like HathiTrust. If we do that, a frequently-reprinted work like Robinson Crusoe will carry about a hundred times more weight than a novel printed only once.

We’ll never create a single collection that perfectly balances all these considerations. But fortunately, we don’t need to: there’s nothing to prevent us from framing our inquiry instead as a comparative exploration of many different corpora balanced in different ways.

For instance, if we’re troubled by the difference between “what was written” and “what was read,” we can simply create two different collections — one limited to first editions, the other including reprints and duplicate copies. Neither collection is going to be a perfect mirror of print culture. Counting the volumes of a novel preserved in libraries is not the same thing as counting the number of its readers. But comparing these collections should nevertheless tell us whether the issue of popularity makes much difference for a given research question.

I suspect in many cases we’ll find that it makes little difference. For instance, in tracing the development of literary language, I got interested in the relative prominence of words that entered English before and after the Norman Conquest — and more specifically, in how that ratio changed over time in different genres. My first approach to this problem was based on a collection of 4,275 volumes that were, for the most part, limited to first editions (773 of these were prose fiction).

But I recognized that other scholars would have questions about the representativeness of my sample. So I spent the last year wrestling with 470,000 volumes from HathiTrust; correcting their OCR and using classification algorithms to separate fiction from the rest of the collection. This produced a collection with a fundamentally different structure — where a popular work of fiction could be represented by dozens or scores of reprints scattered across the timeline. What difference did that make to the result? (click through to enlarge)

The same question posed to two different collections. 773 hand-selected first editions on the left; on the right, 47,549 volumes, including many translations and reprints.

The same question posed to two different collections. 773 hand-selected first editions on the left; on the right, 47,549 volumes, including many translations and reprints. Yearly ratios are plotted rather than individual works.


It made almost no difference. The scatterplots look different, of course, because the hand-selected collection (on the left) is relatively stable in size across the timespan, and has a consistent kind of noisiness, whereas the HathiTrust collection (on the right) gets so huge in the nineteenth century that noise almost disappears. But the trend lines are broadly comparable, although the collections were created in completely different ways and rely on incompatible theories of representation.

I don’t regret the year I spent getting a binocular perspective on this question. Although in this case changing the corpus made little difference to the result, I’m sure there are other questions where it will make a difference. And we’ll want to consider as many different models of representation as we can. I’ve been gathering metadata about gender, for instance, so that I can ask what difference gender makes to a given question; I’d also like to have metadata about the ethnicity and national origin of authors.

pullquoteBut the broader point I want to make here is that people pursuing digital research don’t need to agree on a theory of representation in order to cooperate.

If you’re designing a shared syllabus or co-editing an anthology, I suppose you do need to agree in advance about the kind of representativeness you’re aiming to produce. Space is limited; tradeoffs have to be made; you can only select one set of works.

But in digital research, there’s no reason why we should ever have to make up our minds about a model of representativeness, let alone reach consensus. The number of works we can select for discussion is not limited. So we don’t need to imagine that we’re seeking a correspondence between the reality of the past and any set of works. Instead, we can look at the past from many different angles and ask how it’s transformed by different perspectives. We can look at all the digitized volumes we have — and then at a subset of works that were widely reprinted — and then at another subset of works published in India — and then at three or four works selected as case studies for close reading. These different approaches will produce different pictures of the past, to be sure. But nothing compels us to make a final choice among them.

How not to do things with words.

In recent weeks, journals published two papers purporting to draw broad cultural inferences from Google’s ngram corpus. The first of these papers, in PLoS One, argued that “language in American books has become increasingly focused on the self and uniqueness” since 1960. The second, in The Journal of Positive Psychology, argued that “moral ideals and virtues have largely waned from the public conversation” in twentieth-century America. Both articles received substantial attention from journalists and blogs; both have been discussed skeptically by linguists and digital humanists. (Mark Liberman’s takes on Language Log are particularly worth reading.)

I’m writing this post because systems of academic review and communication are failing us in cases like this, and we need to step up our game. Tools like Google’s ngram viewer have created new opportunities, but also new methodological pitfalls. Humanists are aware of those pitfalls, but I think we need to work a bit harder to get the word out to journalists, and to disciplines like psychology.

The basic methodological problem in both articles is that researchers have used present-day patterns of association to define a wordlist that they then take as an index of the fortunes of some concept (morality, individualism, etc) over historical time. (In the second study, for instance, words associated with morality were extracted from a thesaurus and crowdsourced using Mechanical Turk.)

The fallacy involved here has little to do with hot-button issues of quantification. A basic premise of historicism is that human experience gets divided up in different ways in different eras. If we crowdsource “leadership” using twenty-first-century reactions on Mechanical Turk, for instance, we’ll probably get words like “visionary” and “professional.” “Loud-voiced” probably won’t be on the list — because that’s just rude. But to Homer, there’s nothing especially noble about working for hire (“professionally”), whereas “the loud-voiced Achilles” is cut out to be a leader of men, since he can be heard over the din of spears beating on shields (Blackwell).

The laws of perspective apply to history as well. We don’t have an objective overview; we have a position in time that produces its own kind of distortion and foreshortening. Photo 2004 by June Ruivivar.

The authors of both articles are dimly aware of this problem, but they imagine that it’s something they can dismiss if they’re just conscientious and careful to choose a good list of words. I don’t blame them; they’re not coming from historical disciplines. But one of the things you learn by working in a historical discipline is that our perspective is often limited by history in ways we are unable to anticipate. So if you want to understand what morality meant in 1900, you have to work to reconstruct that concept; it is not going to be intuitively accessible to you, and it cannot be crowdsourced.

The classic way to reconstruct concepts from the past involves immersing yourself in sources from the period. That’s probably still the best way, but where language is concerned, there are also quantitative techniques that can help. For instance, Ryan Heuser and Long Le-Khac have carried out research on word frequency in the nineteenth-century novel that might superficially look like the psychological articles I am critiquing. (It’s Pamphlet 4 in the Stanford Literary Lab series.) But their work is much more reliable and more interesting, because it begins by mining patterns of association from the period in question. They don’t start from an abstract concept like “individualism” and pick words that might be associated with it. Instead, they find groups of words that are associated with each other, in practice, in nineteenth-century novels, and then trace the history of those groups. In doing so, they find some intriguing patterns that scholars of the nineteenth-century novel are going to need to pay attention to.

It’s also relevant that Heuser and Le-Khac are working in a corpus that is limited to fiction. One of the problems with the Google ngram corpus is that really we have no idea what genres are represented in it, or how their relative proportions may vary over time. So it’s possible that an apparent decline in the frequency of words for moral values is actually a decline in the frequency of certain genres — say, conduct books, or hagiographic biographies. A decline of that sort would still be telling us something about literary culture; but it might be telling us something different than we initially assume from tracing the decline of a word like “fidelity.”

So please, if you know a psychologist, or journalist, or someone who blogs for The Atlantic: let them know that there is actually an emerging interdisciplinary field developing a methodology to grapple with this sort of evidence. Articles that purport to draw historical conclusions from language need to demonstrate that they have thought about the problems involved. That will require thinking about math, but it also, definitely, requires thinking about dilemmas of historical interpretation.

References
My illustration about “loud-voiced Achilles” is a very old example of the way concepts change over time, drawn via Friedrich Meinecke from Thomas Blackwell, An Enquiry into the Life and Writings of Homer, 1735. The word “professional,” by the way, also illustrates a kind of subtly moralized contemporary vocabulary that Kesebir & Kesebir may be ignoring in their account of the decline of moral virtue. One of the other dilemmas of historical perspective is that we’re in our own blind spot.