The Gender Balance of Fiction, 1800-2007

by Ted Underwood and David Bamman

Last year, we wrote a blog post that posed questions about the differentiation of gendered roles in fiction. In doing that, we skipped over a more obvious question: how equally (or unequally) do stories distribute their attention between men and women?

This year, we’re returning to that simple question, with a richer dataset (supported by ongoing work at HathiTrust Research Center). The full story will come out in an article, but we’d like to share a few big-picture points in advance.

To start with, why have we framed this as a question about “women” and “men”? Gender isn’t a binary phenomenon. But we aren’t inquiring about the truth of gender identity here — just about gross inequalities that have separated conventional public roles. English-language fiction does typically divide characters by calling them “he” or “she,” and that division is a good place to start posing questions.

We could measure underrepresentation by counting people, but then we’d have to decide how much weight to give minor characters. A simpler approach is just to ask how many words are used to describe fictional men or women, respectively. BookNLP gave us a way to answer that question; it uses names and honorifics to infer a character’s gender, and then traces grammatical dependencies to identify adjectives that modify a character, nouns she possesses, or verbs she governs. After running BookNLP on 93,708 English-language volumes identified as fiction from the HathiTrust Digital Library, we can estimate the percentage of words used in characterization that describe women. (To simplify the task of reading this illustration, we have left out characters coded as “other” or “unknown,” so a year with equal representation of men and women would be located on the 50% line.) To help quantify our uncertainty, we present each measurement by year along with a 95% confidence interval calculated using the bootstrap; our uncertainty decreases over time, largely as a function of an increasing number of books being published.
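For readers who want to see what a bootstrap interval of this kind involves, here is a minimal sketch in Python. It is not the code behind the figure: the input format (one pair of word counts per volume), the function names, and the choice to resample whole volumes are all assumptions made for illustration, and the counts are invented.

```python
import random

def share_for_women(volumes):
    """Percentage of characterization words that describe women.

    `volumes` is a list of (words_about_women, words_about_men) pairs,
    one pair per volume. This input format is hypothetical.
    """
    women = sum(w for w, _ in volumes)
    men = sum(m for _, m in volumes)
    return 100.0 * women / (women + men)

def bootstrap_ci(volumes, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling whole volumes with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        resample = [rng.choice(volumes) for _ in volumes]
        estimates.append(share_for_women(resample))
    estimates.sort()
    k = max(1, round(alpha / 2 * n_resamples))  # points trimmed from each tail
    return estimates[k - 1], estimates[-k]

# Invented counts for the volumes of a single year, purely for illustration.
year_volumes = [(400, 600), (350, 650), (500, 500), (200, 800), (450, 550)]
point = share_for_women(year_volumes)
low, high = bootstrap_ci(year_volumes)
```

With only a handful of volumes the interval is wide; adding more volumes per year narrows it, which is the effect described above for later decades.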


There is a clear decline from the nineteenth century (when women generally take up 40% or more of the “character space” in fiction) to the 1950s and 60s, when their prominence hovers around a low of 30%. A correction, beginning in the 1970s, almost restores fiction to its nineteenth-century state. (One way of thinking about this: second-wave feminism was a desperately-needed rescue operation.)

The fluctuation is not enormous, but also not trivial: women lose roughly a fourth of the space on the page they had possessed in the nineteenth century. Nor is this something we already knew. It might be a mistake to call this pattern a “surprise”: it’s not as if everyone had clearly-formed expectations about “space on the page.” But when we do pose the question, and ask scholars what they expect to see before revealing this evidence, several people have predicted a series of advances toward equality that correspond to e.g. the suffrage movement and World War II, separated by partial retreats. Instead we see a fairly steady decline from 1860 to 1970, with no overall advance toward equality.

What’s the explanation? Our methods do have blind spots. For instance, we aren’t usually able to infer gender for first-person protagonists, so they are left out here. And our inferences about other characters have a known level of error. But after cross-checking the evidence, we don’t believe the level of error is large enough to explain away this pattern (see our GitHub repo for fuller discussion). It is of course possible that our sample of fiction is skewed. For instance, a sample of 93,708 volumes will include a lot of obscure works and works in translation. What if we focus on slightly more prominent works? We have posed that question by comparing our Hathi sample to a smaller (10,000-volume) sample drawn from the Chicago Text Lab, which emphasizes relatively prominent American works, and filters out works in translation.


As you can see, the broad outlines of the trend don’t change. If anything, the decline from 1860 to 1970 is slightly more marked in the Chicago corpus (perhaps because it does a better job of filtering out reprints, which tend to muffle change). This doesn’t prove that we will see the same pattern in every sample. There are many ways to sample the history of fiction! Some scholars will want to know about paperbacks that tend to be underrepresented in university libraries; others will only be interested in a short list of hypercanonical authors. We can’t exhaust all possible modes of sampling, but we can say at least that this trend is not an artefact of a single sampling strategy. Nor is it an artefact of our choice to represent characters by counting words syntactically associated with them: we see the same pattern of decline, to different degrees, when we measure the amount of dialogue spoken by men and women, or simply count the number of characters.

So what does explain the declining representation of women? We don’t yet know. But the trend seems too complex to dismiss with a single explanation. For instance, it can be partly — but only partly — explained by a decline in the proportion of fiction writers who were women.


Take specific dots with a grain of salt; there are sources of error here, especially because the wall of copyright at 1923 may change digitization practices or throw off our own data pipeline. (Note the outlier right at 1923.) But the general pattern above is echoed also in the Chicago sample of American fiction, so we feel confident that there was really a decline in the fraction of fiction writers who were women. As far as we know, Chris Forster was the first person to gather broad quantitative evidence of this decline. But many scholars have grasped pieces of the story: for instance, Anne E. Boyd takes The Atlantic around 1890 as a case study of a process whereby the professionalization and canonization of American fiction tended to push out women who had previously been prominent. [See also Tuchman and Fortin 1989 in references below.]

But this is not necessarily a story about the marginalization of women writers in general. (On the contrary, the prominence of women rose throughout this period in several nonfiction genres.) The decline was specific to fiction — either because the intellectual opportunities open to women were expanding beyond belles lettres, or because the rising prestige of fiction attracted a growing number of men.

Men are overrepresented in books by men, so a decline in the number of women novelists will also tend to reduce the number of characters who are women. But that doesn’t completely explain the marginalization of feminine characters from 1860 to 1970. For instance, we can also divide authors by gender, and look at shifting patterns of attention within works by women or by men.


There are several interesting details here. The inequality of attention in books by men is depressingly durable (men rarely give more than 30% of their attention to fictional women). But it’s also interesting that the fluctuations we saw earlier remain visible even when works are divided by author gender: both trend lines above show a slight decline in the space allotted to women, from 1860 to 1970. In other words, it’s not just that there were fewer works of fiction written by women; even inside books written by women, feminine characters were occupying slightly less space on the page.

Why? The rise of genres devoted to “action” and “adventure” might play a role, although we haven’t found clear evidence yet that it makes a difference. (Genre boundaries are too blurry for the question to be answered easily.) Or fiction might have been masculinized in some broader sense, less tied to specific genre categories (see Suzanne Clark, for instance, on modernism as masculinization).

But listing possible explanations is the easy part. Figuring out which are true — and to what extent — will be harder.

We will continue to explore these questions, in collaboration with grad students, but we also want to draw other scholars’ attention to resources that can support this kind of inquiry (and invite readers to share useful secondary sources in the comments).

HathiTrust Research Center’s Extracted Features Dataset doesn’t permit the syntactic parsing performed by BookNLP, but even authors’ names and the raw frequencies of gendered pronouns can tell you a lot. Working just with that dataset, Chris Forster was able to catch significant patterns involving gender.
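To give a sense of how far raw pronoun frequencies can go, here is a small sketch that computes the feminine share of gendered pronouns from a table of token counts. The pronoun lists and the plain dictionary input are our own simplifying assumptions; this is not the Extracted Features file format or its API, and the counts are invented.

```python
from collections import Counter

# Short, deliberately incomplete pronoun lists (an assumption for illustration).
FEMININE = {"she", "her", "hers", "herself"}
MASCULINE = {"he", "him", "his", "himself"}

def feminine_pronoun_share(token_counts):
    """Fraction of gendered pronouns that are feminine, or None if there are none."""
    fem = sum(n for tok, n in token_counts.items() if tok.lower() in FEMININE)
    masc = sum(n for tok, n in token_counts.items() if tok.lower() in MASCULINE)
    total = fem + masc
    return fem / total if total else None

# Invented counts standing in for one volume.
counts = Counter({"She": 30, "her": 50, "he": 90, "his": 70, "him": 40, "house": 12})
share = feminine_pronoun_share(counts)  # 80 feminine out of 280 gendered pronouns
```

A measure like this says nothing about which character a word describes, so it is much cruder than the syntactic approach, but it needs nothing more than word counts.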

When we publish our article, we will also share data produced by BookNLP about specific characters across a collection of 93,708 books. HTRC is also building a “Data Capsule” that will allow other scholars to produce similar data themselves. In the meantime, in collaboration with Nikolaus N. Parulian, we have produced an interactive visualization that allows you to explore changes in the gendering of words used in characterization. (Compare, for instance, “grin” to “smile,” or “house” to “room.”) We have also made available the metadata and yearly summaries behind the visualization.

Acknowledgments. The work described here has been supported by NovelTM, funded by the Canadian Social Sciences and Humanities Research Council, and by the WCSA+DC grant at HathiTrust Research Center, funded by the Andrew W. Mellon Foundation. We thank Hoyt Long, Teddy Roland, and Richard Jean So for permission to use the Chicago Novel Corpus. The project often relied on work by Bridget Baird and Cameron Blevins (2014). Boris Capitanu helped parallelize BookNLP across hundreds of thousands of volumes. Attendees at the 2016 NovelTM meeting, and Justine Murison in Illinois, provided valuable advice about literary history.


Boyd, Anne E. “‘What, Has She Got into the Atlantic?’ Women Writers, The Atlantic Monthly, and the Formation of the American Canon,” American Studies 39.3 (1998): 5-36.

Clark, Suzanne. Sentimental Modernism: Women Writers and the Revolution of the Word. Indianapolis: Indiana University Press, 1992.

Forster, Chris. “A Walk Through the Metadata: Gender in the HathiTrust Dataset.” September 8, 2015.

Tuchman, Gaye, with Nina E. Fortin. Edging Women Out: Victorian Novelists, Publishers, and Social Change. New Haven: Yale University Press, 1989.




You say you found a revolution.

by Ted Underwood, Hoyt Long, Richard Jean So, and Yuancheng Zhu

This is the second part of a two-part blog post about quantitative approaches to cultural change, focusing especially on a recent article that claimed to identify “stylistic revolutions” in popular music.

Although “The Evolution of Popular Music” (Mauch et al.) appeared in a scientific journal, it raises two broad questions that humanists should care about:

  1. Are measures of the stylistic “distance” between songs or texts really what we mean by cultural change?
  2. If we did take that approach to measuring change, would we find brief periods where the history of music or literature speeds up by a factor of six, as Mauch et al. claim?

Underwood’s initial post last October discussed both of these questions. The first one is more important. But it may also be hard to answer — in part because “cultural change” could mean a range of different things (e.g., the ever-finer segmentation of the music market, not just changes that affect it as a whole).

So putting the first question aside for now, let’s look at the second one closely. When we do measure the stylistic or linguistic “distance” between works of music or literature, do we actually discover brief periods of accelerated change?

The authors of “The Evolution of Popular Music” say “yes!” Epochal breaks can be dated to particular years.

We identified three revolutions: a major one around 1991 and two smaller ones around 1964 and 1983 (figure 5b). From peak to succeeding trough, the rate of musical change during these revolutions varied four- to six-fold.

Tying musical revolutions to particular years (and making 1991 more important than 1964) won the article a lot of attention in the press. Underwood’s questions about these claims last October stirred up an offline conversation with three researchers at the University of Chicago, who have joined this post as coauthors. After gathering in Hyde Park to discuss the question for a couple of days, we’ve concluded that “The Evolution of Popular Music” overstates its results, but is also a valuable experiment, worth learning from. The article calculates significance in a misleading way: only two of the three “revolutions” it reported are really significant at p < 0.05, and it misses some odd periods of stasis that are just as significant as the periods of acceleration. But these details are less interesting than the reason for the error, which involved a basic challenge facing quantitative analysis of history.

To explain that problem, we’ll need to explain the central illustration in the original article. The authors’ strategy was to take every quarter-year of the Billboard Hot 100 between 1960 and 2010, and compare it to every other quarter, producing a distance matrix where light (yellow-white) colors indicate similarity, and dark (red) colors indicate greater differences. (Music historians may wonder whether “harmonic and timbral topics” are the right things to be comparing in the first place, and it’s a fair question — but not central to our purpose in this post, so we’ll give it a pass.)

You see a diagonal white line in the matrix, because comparing a quarter to itself naturally produces a lot of similarity. As you move away from that line (to the upper left or lower right), you’re making comparisons across longer and longer spans of time, so colors become darker (reflecting greater differences).
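As a rough illustration of how such a matrix is assembled, the sketch below builds one from toy data. The feature vectors are an invented random drift, and Euclidean distance is only a stand-in: we are not claiming it is the measure used in the original article.

```python
import math
import random

def distance_matrix(features):
    """Pairwise Euclidean distances between feature vectors (one per quarter)."""
    return [[math.dist(a, b) for b in features] for a in features]

# Toy data: 40 "quarters", each described by 3 numbers that drift slowly.
rng = random.Random(0)
features, current = [], [0.0, 0.0, 0.0]
for _ in range(40):
    current = [x + rng.gauss(0, 0.1) for x in current]
    features.append(current)

dist = distance_matrix(features)
# dist[i][i] is 0 (the white diagonal), and dist[i][j] equals dist[j][i].
```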


Figure 5 from Mauch et al., “The evolution of popular music” (RSOS 2015).

Then, underneath the distance matrix, Mauch et al. provide a second illustration that measures “Foote novelty” for each quarter. This is a technique for segmenting audio files developed by Jonathan Foote. The basic idea is to look for moments of acceleration where periods of relatively slow change are separated by a spurt of rapid change. In effect, that means looking for a point where yellow “squares” of similarity touch at their corners.

For instance, follow the dotted line associated with 1991 in the illustration above up to its intersection with the white diagonal. At that diagonal line, 1991 is (unsurprisingly) similar to itself. But if you move upward in the matrix (comparing 1991 to its own future), you rapidly get into red areas, revealing that 1994 is already quite different. The same thing is true if you move over a year to 1992 and then move down (comparing 1992 to its own past). At a “pinch point” like this, change is rapid. According to “The Evolution of Popular Music,” we’re looking at the advent of rap and hip-hop in the Billboard Hot 100. Contrast this pattern, for instance, to a year like 1975, in the middle of a big yellow square, where it’s possible to move several years up or down without encountering significant change.

Mathematically, “Foote novelty” is measured by sliding a smaller matrix along the diagonal timeline, multiplying it element-wise with the measurements of distance underlying all those red or yellow points. Then you add up the multiplied values. The smaller matrix has positive and negative coefficients corresponding to the “squares” you want to contrast, as seen on the right.

As you can see, matrices of this general shape will tend to produce a very high sum when they reach a pinch point where two yellow squares (of small distances) are separated by the corners of reddish squares (containing large distances) to the upper left and lower right. The areas of ones and negative-ones can be enlarged to measure larger windows of change.
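The sliding calculation is short enough to sketch directly. This is a minimal version written for this post's description, assuming a square distance matrix stored as a list of lists; the function name and the toy timeline are ours, and the kernel is the plain ones-and-negative-ones pattern with no tapering.

```python
def foote_novelty(dist, half_width):
    """Slide a checkerboard kernel along the diagonal of a distance matrix.

    Cells comparing the past with the future are weighted +1; cells comparing
    the past with itself, or the future with itself, are weighted -1.
    Returns {boundary: novelty} for every boundary where the kernel fits.
    """
    n = len(dist)
    scores = {}
    for t in range(half_width, n - half_width + 1):
        total = 0.0
        for i in range(t - half_width, t + half_width):
            for j in range(t - half_width, t + half_width):
                crosses_boundary = (i < t) != (j < t)
                total += dist[i][j] if crosses_boundary else -dist[i][j]
        scores[t] = total
    return scores

# Toy timeline: ten identical periods, an abrupt shift, ten more identical periods.
timeline = [0.0] * 10 + [1.0] * 10
dist = [[abs(a - b) for b in timeline] for a in timeline]
scores = foote_novelty(dist, half_width=3)
peak = max(scores, key=scores.get)  # the boundary just before index 10
```

In this toy case the score is zero inside either stable stretch and largest exactly at the shift, which is the “pinch point” behavior described above.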

This method works by subtracting the change on either side of a temporal boundary from the changes across the boundary itself. But it has one important weakness. The contrast between positive and negative areas in the matrix is not apples-to-apples, because comparisons made across a boundary are going to stretch across a longer span of time, on average, than the comparisons made within the half-spans on either side. (Concretely, you can see that the ones in the matrix above will be further from the central diagonal timeline than the negative-ones.)

If you’re interested in segmenting music, that imbalance may not matter. There’s a lot of repetition in music, and it’s not always true that a note will resemble a nearby note more than it resembles a note from elsewhere in the piece. Here’s a distance matrix, for instance, from The Well-Tempered Clavier, used by Foote as an example.


From Foote, “Automatic Audio Segmentation Using a Measure of Audio Novelty.”

Unlike the historical matrix in “The Evolution of Popular Music,” this has many light spots scattered all over — because notes are often repeated.


Original distance matrix produced using data from Mauch et al. (2015).

History doesn’t repeat itself in the same way. It’s extremely likely (almost certain) that music from 1992 will resemble music from 1991 more than it resembles music from 1965. That’s why the historical distance matrix has a single broad yellow path running from lower left to upper right.

As a result, historical sequences are always going to produce very high measurements of Foote novelty. Comparisons across a boundary will always tend to create higher distances than the comparisons within the half-spans on either side, because differences across longer spans of time always tend to be bigger.


Matrix produced by permuting years and then measuring the distances between them.

This also makes it tricky to assess the significance of “Foote novelty” on historical evidence. You might ordinarily do this using a “permutation test.” Scramble all the segments of the timeline repeatedly and check Foote novelty each time, in order to see how often you get “squares” as big or well-marked as the ones you got in testing the real data. But that sort of scrambling will make no sense at all when you’re looking at history. If you scramble the years, you’ll always get a matrix that has a completely different structure of similarity — because it’s no longer sequential.

The Foote novelties you get from a randomized matrix like this will always be low, because “Foote novelty” partly measures the contrast between areas close to, and far from, the diagonal line (a contrast that simply doesn’t exist here).


This explains a deeply puzzling aspect of the original article. If you look at the significance curves labeled .001, .01, and 0.05 in the visualization of Foote novelties (above), you’ll notice that every point in the original timeline had a strongly significant novelty score. As interpreted by the caption, this seems to imply that change across every point was faster than average for the sequence … which … can’t possibly be true everywhere.

All this image really reveals is that we’re looking at evidence that takes the form of a sequential chain. Comparisons across long spans of time always involve more difference than comparisons across short ones — to an extent that you would never find in a randomized matrix.

In short, the tests in Mauch et al. don’t prove that there were significant moments of acceleration in the history of music. They just prove that we’re looking at historical evidence! The authors have interpreted this as a sign of “revolution,” because all change looks revolutionary when compared to temporal chaos.

On the other hand, when we first saw the big yellow and red squares in the original distance matrix, it certainly looked like a significant pattern. Granted that the math used in the article doesn’t work — isn’t there some other way to test the significance of these variations?

It took us a while to figure out, but there is a reliable way to run significance tests for Foote novelty. Instead of scrambling the original data, you need to permute the distances along diagonals of the distance matrix.


Produced by permuting diagonals in the original matrix.

In other words, you take a single diagonal line in the original matrix and record the measurements of distance along that line. (If you’re looking at the central diagonal, this will contain a comparison of every quarter to itself; if you move up one notch, it will contain a comparison of every quarter to the quarter in its immediate future.) Then you scramble those values randomly, and put them back on the same line in the matrix. (We’ve written up a Jupyter notebook showing how to do it.) This approach distributes change randomly across time while preserving the sequential character of the data: comparisons over short spans of time will still tend to reveal more similarity than long ones.

If you run this sort of permutation 100 times, you can discover the maximum and minimum Foote novelties that would be likely to occur by chance.
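The following sketch shows one way the diagonal permutation and the resulting thresholds could be coded. It is a simplified stand-in, not the notebook itself: the function names, the toy random-walk timeline, and the rule for turning 100 permutations into thresholds (taking the fifth-largest maximum and fifth-smallest minimum) are assumptions made for illustration.

```python
import random

def permute_diagonals(dist, rng):
    """Shuffle the values along each diagonal, keeping the matrix symmetric."""
    n = len(dist)
    out = [[0.0] * n for _ in range(n)]
    for offset in range(n):
        values = [dist[i][i + offset] for i in range(n - offset)]
        rng.shuffle(values)
        for i, value in enumerate(values):
            out[i][i + offset] = value
            out[i + offset][i] = value
    return out

def foote_novelty(dist, w):
    """Checkerboard sum at each boundary: cross-boundary cells minus same-side cells."""
    n = len(dist)
    scores = []
    for t in range(w, n - w + 1):
        span = range(t - w, t + w)
        scores.append(sum(dist[i][j] if (i < t) != (j < t) else -dist[i][j]
                          for i in span for j in span))
    return scores

def chance_thresholds(dist, w, n_permutations=100, alpha=0.05, seed=0):
    """Lower and upper novelty thresholds estimated from permuted matrices."""
    rng = random.Random(seed)
    highs, lows = [], []
    for _ in range(n_permutations):
        scores = foote_novelty(permute_diagonals(dist, rng), w)
        highs.append(max(scores))
        lows.append(min(scores))
    k = max(1, round(alpha * n_permutations))
    return sorted(lows)[k - 1], sorted(highs)[-k]

# Toy timeline: a one-dimensional random walk standing in for real features.
walk_rng = random.Random(42)
position, timeline = 0.0, []
for _ in range(30):
    position += walk_rng.gauss(0, 1)
    timeline.append(position)
dist = [[abs(a - b) for b in timeline] for a in timeline]

w = 3
lower, upper = chance_thresholds(dist, w)
observed = foote_novelty(dist, w)
flagged = [t for t, s in zip(range(w, len(dist) - w + 1), observed)
           if s > upper or s < lower]
```

Because values only move along their own diagonal, short-span comparisons stay on short-span diagonals, so the permuted matrices keep the sequential structure that a full scramble of years destroys.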


Measurements of Foote novelty produced by a matrix with a five-year half-width, and the thresholds for significance.

Variation between the two red lines isn’t statistically significant — only the peaks of rapid change poking above the top line, and the troughs of stasis dipping below the bottom line. (The significance of those troughs couldn’t become visible in the original article, because the question had been framed in a way that made smaller-than-random Foote novelties impossible by definition.)

These corrected calculations do still reveal significant moments of acceleration in the history of the Billboard Hot 100: two out of three of the “revolutions” Mauch et al. report (around 1983 and 1991) are still significant at p < 0.05 and even p < 0.001. (The British Invasion, alas, doesn’t pass the test.) But the calculations also reveal something not mentioned in the original article: a very significant slowing of change after 1995.

Can we still call the moments of acceleration in this graph stylistic “revolutions”?

Foote novelty itself won’t answer the question. Instead of directly measuring a rate of change, it measures a difference between rates of change in overlapping periods. But once we’ve identified the periods that interest us, it’s simple enough to measure the pace of change in each of them. You can just divide the period in half and compare the first half to the second (see the “Effect size” section in our Jupyter notebook). This confirms the estimate in Mauch et al.: if you compare the most rapid period of change (from 1990 to 1994) to the slowest four years (2001 to 2005), there is a sixfold difference between them.

On the other hand, it could be misleading to interpret this as a statement about the height of the early-90s “peak” of change, since we’re comparing it to an abnormally stable period in the early 2000s. If we compare both of those periods to the mean rate of change across any four years in this dataset, we find that change in the early 90s was about 171% of the mean pace, whereas change in the early 2000s was only 29% of mean. Proportionally, the slowing of change after 1995 might be the more dramatic aberration here.
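A small sketch may make the half-versus-half comparison concrete. The version below measures the distance between the average feature vector of a period's first half and that of its second half; treating “compare the first half to the second” as a distance between averages is our assumption here, and the feature values are invented. The last line simply checks that the two percentages quoted above imply the sixfold contrast.

```python
import math

def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]

def pace_of_change(features, start, end):
    """Distance between the centroids of the two halves of features[start:end]."""
    middle = (start + end) // 2
    return math.dist(centroid(features[start:middle]),
                     centroid(features[middle:end]))

# Invented one-dimensional "features": a quiet stretch, then a fast one.
features = [[0.0], [0.0], [0.1], [0.1], [1.0], [1.0], [3.0], [3.0]]
slow = pace_of_change(features, 0, 4)  # halves average 0.0 and 0.1
fast = pace_of_change(features, 4, 8)  # halves average 1.0 and 3.0

# 171% of the mean pace against 29% of the mean pace: roughly a sixfold gap.
ratio_from_percentages = 1.71 / 0.29
```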

Overall, the picture we’re seeing is different from the story in “The Evolution of Popular Music.” Instead of three dramatic “revolutions” dated to specific years, we see two periods where change was significantly (but not enormously) faster than average, and two periods where it was slower. These periods range from four to fifteen years in length.

Humanists will surely want to challenge this picture in theoretical ways as well. Was the Billboard Hot 100 the right sample to be looking at? Are “timbral topics” the right things to be comparing? These are all valid questions.

But when scientists make quantitative claims about humanistic subjects, it’s also important to question the quantitative part of their argument. If humanists begin by ceding that ground, the conversation can easily become a stalemate where interpretive theory faces off against the (supposedly objective) logic of science, neither able to grapple with the other.

One of the authors of “The Evolution of Popular Music,” in fact, published an editorial in The New York Times representing interdisciplinary conversation as exactly this sort of stalemate between “incommensurable interpretive fashions” and the “inexorable logic” of math (“One Republic of Learning,” NYT Feb 2015). But in reality, as we’ve just seen, the mathematical parts of an argument about human culture also encode interpretive premises (assumptions, for instance, about historical difference and similarity). We need to make those premises explicit, and question them.

Having done that here, and having proposed a few corrections to “The Evolution of Popular Music,” we want to stress that the article still seems to us a bold and valuable experiment that has advanced conversation about cultural history. The basic idea of calculating “Foote novelty” on a distance matrix is useful: it can give historians a way of thinking about change that acknowledges several different scales of comparison at once.

The authors also deserve admiration for making their data available; that transparency has permitted us to replicate and test their claims, just as Andrew Goldstone recently tested Ted Underwood’s model of poetic prestige, and Annie Swafford tested Matt Jockers’ syuzhet package. Our understanding of these difficult problems can only advance through collective practices of data-sharing and replication. Being transparent in our methods is more important, in the long run, than being right about any particular detail.

The authors want to thank the NovelTM project for supporting the collaboration reported here. (And we promise to apply these methods to the history of the novel next.)


Jonathan Foote. Automatic audio segmentation using a measure of audio novelty. In Proceedings of IEEE International Conference on Multimedia and Expo, vol. I, pp. 452-455, 2000.

Mauch et al. 2015. “The Evolution of Popular Music.” Royal Society Open Science. May 6, 2015. DOI: 10.1098/rsos.150081

Postscript: Several commenters on the original blog post proposed simpler ways of measuring change that begin by comparing adjacent segments of a timeline. This is an intuitive approach, and a valid one, but it does run into difficulties, as we discovered when we tried to base changepoint analysis on it (Jupyter notebook here). The main problem is that apparent trajectories of change can become very delicately dependent on the particular window of comparison you use. You’ll see lots of examples of that problem toward the end of our notebook.

The advantage of the “Foote novelty” approach is that it combines lots of different scales of comparison (since you’re considering all the points in a matrix — some closer and some farther from the timeline). That makes the results more robust. Here, for instance, we’ve overlaid the “Foote novelties” generated by three different windows of comparison on the music dataset, flagging the quarters that are significant at p < 0.05 in each case.


This sort of close congruence is not something we found with simpler methods. Compare the analogous image below, for instance. Part of the chaos here is a purely visual issue related to the separation of curves — but part comes from using segments rather than a distance matrix.


A dataset for distant-reading literature in English, 1700-1922.

Literary critics have been having a speculative conversation about close and distant reading. It might be premature to call it a debate.

A “debate” is normally a situation where people are free to choose between two paths. “Should I believe Habermas, or Foucault? I’m listening; I could go either way.” Conversation about distant reading is different, first, because there’s not much need to make a choice. Have any critics stopped reading closely? A close reading of The Bourgeois suggests that Franco Moretti hasn’t.

More importantly, this isn’t a debate yet because most of the people involved aren’t free to explore both paths. So far only a tiny number of scholars have actually tried distant reading, and it’s easy to see why. You can wake up tomorrow and try a Foucauldian reading of Frankenstein, but you can’t wake up and trace patterns of change in a thousand novels. In either case, you may need to learn new methods, but in the “distant” case, it can also take years to assemble a collection of texts.

A dataset for distant reading

To reduce barriers to entry, I’ve collaborated with HathiTrust Research Center to create an easier place to start with English-language literature. It’s aimed at scholars studying long-nineteenth-century (1750-1922) fiction and poetry, but it will gradually expand into the twentieth century. This post describes the humanistic uses of the dataset; if you want technical information, there’s more on the page where the data actually lives.

HathiTrust contains more than a million volumes in English between 1700 and 1922. Contractual agreements make it hard to share the texts themselves in bulk, but many of the questions that can be posed “at a distance” can be posed just as well using simpler representations of the texts — for instance, by counting the words they contain. To support this project, HathiTrust Research Center has extracted page-level word counts for 4.8 million volumes; scholars who are interested in the highest level of detail should go directly to their data.

However, many literary scholars are mainly concerned with books in a particular genre; they limit their inquiries, say, to “poetry” or “prose fiction.” Finding those needles in a five-million-volume haystack is not easy. Many books in this period don’t carry genre tags; even when they do, volumes are heterogeneous things. A volume of poetry, for instance, may begin with a prose life of the author and end with publishers’ ads.

The relative sizes of different genres, represented as a percentage of pages in the English-language portion of HathiTrust. 854,476 volumes are covered. Nonfiction, front matter, and back matter aren’t represented here. Results have been smoothed with a five-year moving average.

To create datasets that reliably track a single genre, we need page-level metadata. The National Endowment for the Humanities and the American Council of Learned Societies funded a year-long project to create that metadata. (The methods involved are described in a white paper on “Understanding Genre,” along with information about accuracy.) Now, by pairing this metadata with HTRC’s page-level wordcounts, I’ve created three genre-specific datasets of word counts covering poetry, fiction, and drama from 1700 to 1922. (Coverage is relatively sparse before 1750; if you need the early eighteenth century, you might want a resource like ECCO-TCP instead of or in addition to this.)

The collection consists of word counts for 101,948 volumes of fiction, 58,724 volumes of poetry, and 17,709 volumes of drama, aggregated at the volume level and including only pages identified as belonging to the relevant genre. I’ve collected these volume-level files in tar.gz chunks by genre and date, and have provided basic metadata for them all. You can use the volume IDs to view the original texts on the HathiTrust website if you need to read them closely. I’m calling this a “collection” rather than a “corpus” because I don’t necessarily recommend that you use the whole thing, as is. The whole thing may or may not represent the sample you need for your research question. What it represents is, “American university and public libraries, insofar as they were digitized in the year 2012 (when the project began).” For some big diachronic questions, that’s a good sample; for other questions, you’ll need to be more selective.
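If you want a feel for working with the tar.gz chunks, the sketch below reads word counts out of an archive using only Python's standard library. The layout it assumes (one text file per volume, one word, a tab, and a count per line) is a guess made for illustration, and the member name is invented; check the technical documentation on the data page for the real format. The archive here is built in memory so the example runs without downloading anything.

```python
import io
import tarfile
from collections import Counter

def read_volume_counts(archive_file):
    """Return {member_name: Counter} for every regular file in a tar.gz archive.

    Assumes each member holds one "word<TAB>count" pair per line,
    which is a hypothetical layout.
    """
    volumes = {}
    with tarfile.open(fileobj=archive_file, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            text = archive.extractfile(member).read().decode("utf-8")
            counts = Counter()
            for line in text.splitlines():
                word, _, number = line.partition("\t")
                if number.strip().isdigit():
                    counts[word] += int(number)
            volumes[member.name] = counts
    return volumes

# Build a tiny in-memory archive with one invented volume.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
    payload = "the\t120\nwhale\t7\n".encode("utf-8")
    info = tarfile.TarInfo(name="example_volume.tsv")
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))
buffer.seek(0)

volumes = read_volume_counts(buffer)
```

To read a downloaded chunk instead, you would pass a file opened in binary mode in place of the in-memory buffer.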

Three big blocks of stone. Like collections, these don’t represent anything in particular. But like a statue, the corpus you want to create might be contained somewhere within them.

Because this is a very large collection, the sample you need for your research is likely to be contained somewhere within it. To address some questions, you might even select several samples and contrast them. To understand the history of literary prestige, for instance, Jordan Sellers and I gathered 360 prominent books of poetry by finding reviews in literary magazines and extracting the corresponding books from HathiTrust; we then contrasted them with a sample of 360 more obscure volumes selected from the whole HathiTrust collection of poetry. Just using volume-level wordcounts for those two samples, we were able to draw inferences about the way diachronic literary change is related to synchronic prestige.

Well-known texts may be represented in this dataset by dozens of reprints. For some questions, that may be exactly the sort of “weighted” sample you want; for other questions, you’ll want to winnow each title down to a single early example. More datasets may be developed to help you do that.
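One way to winnow reprints down to a single early copy, as a minimal sketch: the field names `author`, `title`, and `date` are hypothetical stand-ins for whatever the metadata table actually provides, and matching on a normalized author/title pair is a crude heuristic that will miss variant titles.

```python
def earliest_copies(rows):
    """Keep one row per (author, title) pair: the copy with the earliest date.

    `rows` is an iterable of dicts. The keys 'author', 'title', and 'date'
    are assumptions; substitute the column names in the real metadata.
    """
    def normalize(text):
        # Crude normalization: lowercase and collapse whitespace.
        return " ".join(text.lower().split())

    earliest = {}
    for row in rows:
        key = (normalize(row["author"]), normalize(row["title"]))
        if key not in earliest or int(row["date"]) < int(earliest[key]["date"]):
            earliest[key] = row
    # Return the survivors in chronological order.
    return sorted(earliest.values(), key=lambda r: int(r["date"]))


sample = [
    {"author": "Scott, Walter", "title": "Waverley", "date": "1829"},
    {"author": "Scott, Walter", "title": "Waverley", "date": "1814"},
    {"author": "Austen, Jane", "title": "Emma", "date": "1816"},
    {"author": "scott, walter", "title": "Waverley ", "date": "1871"},
]
unique = earliest_copies(sample)
```

Whether to deduplicate at all depends on the question: keeping every reprint weights the sample by what libraries actually held.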

Distant reading rarely means “big data”
I realize the practice described above (selecting samples of a few hundred or a few thousand books to address particular questions) doesn’t line up with the version of distant reading currently circulating in public imagination. Isn’t the point of distant reading to construct a massive database that includes “everything that has been thought and said”? The Nation recently said so, and also warned us that “in reality, servers powerful enough to process big data can only be located in a highly select number of well-endowed institutions.”

That sounds grim, but I’m happy to report that it’s also malarkey. You can download this dataset, and process it, on your laptop. It’s true that I used our campus cluster to create it (because I had to manage a terabyte of text). But a) managing a terabyte won’t put a hole in most endowments, and b) you don’t need to do that anyway. Once nonfiction is set aside, we’re talking about a smaller group of books (compressed, this whole dataset runs to about 5GB). A well-designed sampling strategy can make it even smaller.

Wait, what’s this about “sampling”? Aren’t distant readers supposed to claim to have everything? Not really. In the early days of distant reading, Franco Moretti did frame the project as a challenge to literary historians’ claims about synchronic coverage. (We only discuss a tiny number of books from any given period; what about all the rest?) But even in those early publications, Moretti acknowledged that we would only be able to represent “all the rest” through some kind of sample.

Fifteen years later, it’s becoming clear that distant reading has a lot of applications that aren’t about synchronic completeness at all. Expanding the diachronic scope of our research can be an equally important source of discovery. Certain kinds of change only become visible when you compare many examples across long timelines. Even if we restricted a digital corpus (say) to the academic canon, or to a thousand bestsellers, computational analysis would allow us to see long-term changes that aren’t visible to casual recollection.

It’s true that distant readers will often want to have the biggest possible table of metadata, so that our sampling strategies aren’t unduly constrained. But from that table, we may only sample a few hundred or a few thousand titles to address any single question. This scale of inquiry is not, in any meaningful sense, “big data.” (In fact, I doubt the phrase “big data” is often very meaningful, but that’s another story.) It’s a larger sample than literary scholars have usually attempted to describe, but it would not greatly distress our neighbors in linguistics and sociology.

How hard is this to use?
Of course, we’re not linguists or sociologists, so there is going to be a learning curve involved when we apply quantitative methods on any scale. The main dataset I’m providing here includes 178,381 separate files — one file for each volume. This is not something that can be sliced easily using a tool like Excel. Someone involved with the project needs to be able to program in order to pair the metadata table with the files.
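To give a sense of the programming involved, here is a minimal sketch of pairing a metadata table with per-volume files. Everything about the layout is an assumption made for illustration: a metadata CSV with `docid` and `date` columns, and one tab-separated file per volume named `<docid>.tsv` with a word and its count on each line. Check the data description for the real column names and file format.

```python
import csv
from collections import Counter
from pathlib import Path


def select_ids(rows, first_year, last_year):
    """Return IDs of volumes dated within [first_year, last_year].

    `rows` is an iterable of dicts such as csv.DictReader yields.
    The column names 'docid' and 'date' are assumptions.
    """
    return [row["docid"] for row in rows
            if row["date"].isdigit()
            and first_year <= int(row["date"]) <= last_year]


def parse_counts(lines):
    """Parse 'word<TAB>count' lines into a Counter, skipping malformed lines."""
    counts = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2 and parts[1].isdigit():
            counts[parts[0]] += int(parts[1])
    return counts


def total_counts(metadata_path, folder, first_year, last_year):
    """Sum word counts over every selected volume that has a file on disk."""
    with open(metadata_path, newline="", encoding="utf-8") as f:
        ids = select_ids(csv.DictReader(f), first_year, last_year)
    total = Counter()
    for docid in ids:
        path = Path(folder) / f"{docid}.tsv"  # assumed naming scheme
        if path.exists():
            with open(path, encoding="utf-8") as f:
                total.update(parse_counts(f))
    return total
```

A call such as `total_counts("fiction_metadata.csv", "fiction/", 1850, 1899)` (both paths hypothetical) would then return aggregate counts for the second half of the nineteenth century.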

On the other hand, some questions can be answered with a simple yearly summary, so I’ve also provided yearly_summary tables for each genre that aggregate term frequencies for the 10,000 most common tokens in each genre (selected by document frequency). This is the gentlest on-ramp to the dataset: data in this form can probably be sliced with Excel. To make it even easier, I’ve also applied OCR correction and spelling normalization to those tables.

But the yearly_summary table aggregates all the volumes in the collection, and (as I’ve stressed) you may not want all of them. This dataset is a roughly-hewn, but very large, block of stone. You may be able to find the corpus you need somewhere within it, but decisions about selection are yours to make. Over the course of the next two years I hope to extend coverage further into the twentieth century; it is not illegal to share word counts from texts still covered by copyright. If you’re interested in more complex kinds of distant reading where word order matters, you can contact the HathiTrust Research Center; they are creating a workflow that can handle more complex kinds of computational analysis.

Postscript: We’ve done a lot of testing, but this is still a beta release. General estimates about error are summarized in “Understanding Genre”. Precision in these datasets is higher than 97%, but that still means there will be hundreds of volumes and thousands of pages mistakenly included. If you notice systematic problems with the data, please send feedback to the e-mail address provided in the data description. But individual misclassified volumes are not problems we’re likely to fix on a case-by-case basis; that sort of problem will be addressed by improving our methods in our next release.

How quickly do literary standards change?

by Ted Underwood and Jordan Sellers

Part of this project will appear next year — revised and improved — in MLQ. But we’ve decided to release it as a free-standing draft rather than a preprint, because it allows us to use color and to explore some puzzling leads that won’t fit into the physical limits of one journal article.

To understand the aesthetic standards that govern reception, we contrasted two samples of English-language poetry, drawn from different social contexts: 1) a group of 360 volumes that we chose by sampling reviews in prominent periodicals, 1820-1919, and 2) a group of 360 volumes sampled at random from HathiTrust Digital Library, many of them pretty obscure.
We were curious whether the difference in prestige between these books would be legible in the texts themselves. For instance, could you train a statistical model to predict whether a volume of poetry came from the “reviewed” or “random” sample just by looking at diction? And if you could, what social difference exactly would you be detecting?

Scholars sometimes suggest that high culture hadn’t differentiated from the rest of the literary field very sharply yet in the early 19th century [1: Huyssen 1986]. If so, books of poetry reviewed in prestigious contexts might be hard to identify in that part of the timeline. It might get easier toward the 20th century, as different poetic styles specialized to address (say) “high” and “middlebrow” audiences.

On the other hand, if writers became prominent by occupying the leading edge of a rapidly-moving wave, we might only be able to separate these samples by training a sequence of different models for different periods. For instance, prominent poets in the 1820s might be united by gloomy Byronism; in the 1850s they might share an interest in history; by the 1890s what they had in common might be the word “mauve.” As for the randomly-selected volumes, who knows? Maybe they would share only a tendency to trail thirty years behind the trend.

Since it seemed reasonable to assume that the standards governing reception had been volatile, we began by training a different model of poetic prestige for each twenty-year period. But we found, in practice, that the best way to separate these samples was to treat the whole period 1820-1919 as a single unit organized by a single set of aesthetic standards. You can click on the image that follows to see a slightly larger and clearer version.


In the image above, each point is a volume of poetry, colored according to its actual social provenance. The y axis expresses a statistical model’s prediction about that provenance: How likely is it that this volume came from the “reviewed” sample, based only on the words in the volume?

As you can see, the model does a pretty decent job of sorting the two samples. It’s not right all the time, because of course a volume’s reception is determined by a lot of factors other than language (politics, the whims of reviewers, social networks). But the model is right 79.2% of the time, which is often enough to suggest that volumes reviewed in prominent venues had something in common. The sort of poetic language that got reviewed is distinguished from other poetic traditions not just toward the twentieth century, as we had expected, but throughout this period.

What’s even more puzzling is this: reviewed writers seem to have had the same thing in common throughout this century. The model is using essentially the same list of prestigious and banal words to separate Lord Byron from more obscure poets around 1819, and Christina Rossetti from more obscure poets around 1866, and T. S. Eliot from more obscure writers around 1917. That’s starting to sound like an oddly durable set of preferences. And actually, it’s even more durable than the image above suggests. A model trained on a quarter-century of the evidence can predict the other 75 years almost as accurately as a model trained on the whole century.

A model trained only on evidence from 1845-69 makes predictions about the other 75 years in the dataset.
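The train-on-one-period, test-on-the-rest design can be sketched as follows. This is an illustration under loud assumptions, not the project's code: the post says only "a statistical model," so the L2-regularized logistic regression here is a stand-in, and the two word-frequency features per volume are invented toy values.

```python
import math


def predict_proba(w, b, x):
    """Probability that a volume belongs to the 'reviewed' sample."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    z = max(-30.0, min(30.0, z))  # keep exp() from overflowing
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, l2=0.01, lr=0.5, epochs=500):
    """Fit L2-regularized logistic regression by batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        errors = [predict_proba(w, b, xi) - yi for xi, yi in zip(X, y)]
        grad_w = [sum(e * xi[j] for e, xi in zip(errors, X)) / n
                  for j in range(len(w))]
        w = [wj - lr * (gj + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * sum(errors) / n
    return w, b


def accuracy(w, b, volumes):
    """Share of volumes whose predicted sample matches the real one."""
    hits = sum((predict_proba(w, b, v["features"]) >= 0.5) == v["reviewed"]
               for v in volumes)
    return hits / len(volumes)


# Toy data (invented): one reviewed and one random volume every five years.
volumes = []
for year in range(1820, 1920, 5):
    volumes.append({"date": year, "reviewed": True, "features": [0.8, 0.2]})
    volumes.append({"date": year, "reviewed": False, "features": [0.2, 0.8]})

# Train on one quarter-century, test on the remaining 75 years.
train = [v for v in volumes if 1845 <= v["date"] <= 1869]
test = [v for v in volumes if not (1845 <= v["date"] <= 1869)]

w, b = train_logistic([v["features"] for v in train],
                      [1.0 if v["reviewed"] else 0.0 for v in train])
held_out_accuracy = accuracy(w, b, test)
```

The toy data are trivially separable, so the held-out accuracy here says nothing about real poetry. The point is the shape of the experiment: the model never sees a volume from outside 1845-69 during training, and publication date is not one of its features.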

So how is it even possible to characterize a whole century of poetic reception — based on fourteen different periodicals from both sides of the Atlantic — with a single set of aesthetic standards? Weren’t there supposed to be a couple of “poetic revolutions” in this century somewhere? W. B. Yeats certainly thought that one happened in the 1890s [2].

There’s another curious detail implied in the image above: why is the boundary between “reviewed” and “random” volumes drifting upward across the timeline? Technically, that’s an error. Volumes are not really “more likely to be reviewed” just because they were published later. But this is an error of an interesting kind. The model doesn’t know when these volumes were published: the dataset drifts upward because words that were more common in reviewed volumes across this period turn out to be more common in all volumes by the end of the period. If you divide the timeline into parts, the same pattern recurs in each part; and — to leak a detail from the next stage of this project — it also happens when we model fiction. That starts to suggest an interestingly general connection between synchronic judgment and diachronic change.

And there’s more. The detailed differences between reviewed and random poetry are interesting. In the article, we examine a haunting passage from Christina Rossetti; it turns out the model likes “haunting.” We also generalize about the theory of representativeness underpinning distant reading, and ask how our contemporary pedagogical canon looks when viewed by nineteenth-century aesthetic standards.

But all this, obviously, is too much to discuss in a blog post. See the article itself for our actual attempt to understand these puzzles.

We’ve released our code and data on Github, and hope readers will find flaws in our reasoning so we can improve the project. But this draft has been bounced off a couple of audiences already; at this point it’s stable enough to be cited and criticized. So, after some reflection, we’ve closed comments on this post in order to encourage a more public sort of critique. If we’re overlooking something, please say so in a blog post. It’s an explicit premise of the project that “being reviewed at all indicates a sort of literary distinction — even if the review is negative.”

[1]: One influential thesis holds that this division crystallized “in the last decades of the 19th century and the first few years of the 20th.” Andreas Huyssen, After the Great Divide: Modernism, Mass Culture, Postmodernism (Bloomington: Indiana UP, 1986), viii.

[2]: W.B. Yeats dated the “revolt against Victorianism” and against “the poetical diction of everybody” to the 1890s. See discussion in Richard Fallis, “Yeats and the Reinterpretation of Victorian Poetry,” Victorian Poetry 14.2 (1976): 89-100.

How to find English-language fiction, poetry, and drama in HathiTrust.

Although methods of analysis are more fun to discuss, the most challenging part of distant reading may still be locating the texts in the first place [1].

In principle, millions of books are available in digital libraries. But literary historians need collections organized by genre, and locating the fiction or poetry in a digital library is not as simple as it sounds. Older books don’t necessarily have genre information attached. (In HathiTrust, less than 40% of English-language fiction published before 1923 is tagged “fiction” in the appropriate MARC control field.)

Volume-level information wouldn’t be enough to guide machine reading in any case, because genres are mixed up inside volumes. For instance Hoyt Long, Richard So, and I recently published an article in Slate arguing (among other things) that references to specific amounts of money become steadily more common in fiction from 1825 to 1950.

Frequency of reference to “specific amounts” of money in 7,700 English-language works of fiction. Graphics here and throughout from Wickham, ggplot2 [2].

But Google’s “English Fiction” collection tells a very different story. The frequencies of many symbols that appear in prices (dollar signs, sixpence) skyrocket in the late nineteenth century, and then drop back by the early twentieth.

Frequencies of “$” and “6d” in Google’s “English Fiction” collection, 1800-1950.

On the other hand, several other words or symbols that tend to appear in advertisements for books follow a suspiciously similar trajectory.

Frequencies of “$”, “8vo” (octavo) and “cloth” in Google’s “English Fiction” collection, 1800-1950.

What we see in Google’s “Fiction” collection is something that happens in volumes of fiction, but not exactly in the genre of fiction — the rise and fall of publishers’ catalogs in the backs of books [3]. Individually, these two- or three-page lists of titles for sale may not look like significant noise, but because they often mention prices, and are distributed unevenly across the timeline, they add up to a significant potential pitfall for anyone interested in the role of money in fiction.

I don’t say this to criticize the team behind the Ngram Viewer. Genre wasn’t central to their goals; they provided a rough “fiction” collection merely as a cherry on top of a massively successful public-humanities project. My point is just that genres fail to line up with volume boundaries in ways that can really matter for the questions scholars want to pose. (In fact, fiction may be the genre that comes closest to lining up with volume boundaries: drama and poetry often appear mixed in The Collected Poems and Plays of So-and-So, With a Prose Life of the Author.)

You can solve this problem by selecting works manually, or by borrowing proprietary collections from a vendor. Those are both good, practical solutions, especially up to (say) 1900. But because they rely on received bibliographies, they may not entirely fulfill the promises we’ve been making about dredging the depths of “the great unread,” boldly going where no one has gone before, etc. [4]. Over the past two years, with support from the ACLS and NEH, I’ve been trying to develop another alternative: a way of starting with a whole library, and dividing it by genre at the page level, using machine learning.

In researching the Slate article, we relied on that automatic mapping of genre to select pages of fiction from HathiTrust. It helped us avoid conflating advertisements with fiction, and I hope other scholars will also find that it reduces the labor involved in creating large, genre-specific collections. The point of this blog post is to announce the release of a first version of the map we used (covering 854,476 English-language books in HathiTrust 1700-1922).

The whole dataset is available on Figshare, where it has a DOI and is citable as a publication. An interim report is also available; it addresses theoretical questions about genre, as well as questions about methods and data format. And the code we used for the project is available on Github.

For in-depth answers to questions, please consult the interim project report. It’s 47 pages long; it actually explains the project; this blog post doesn’t. But here are a few quick FAQs just so you can decide whether to read further.

“What categories did you try to separate?”

We identify pages as paratext (front matter, back matter, ads), prose nonfiction, poetry (narrative and lyric are grouped together), drama (including verse drama), or prose fiction. The report discusses the rationale for these choices, but other choices would be possible.

“How accurate is this map?”

Since genres are social institutions, questions about accuracy are relative to human dissensus. Our pairs of human readers agreed about the five categories just mentioned for 94.5% of the pages they tagged [5]. Relying on two-out-of-three voting (among other things), we boiled those varying opinions down to a human consensus, and our model agreed with the consensus 93.6% of the time. So this map is nearly as accurate as we might expect crowdsourcing to be. But it covers 276 million pages. For full details, see the confusion matrices in the report. Also, note that we provide ways of adjusting the tradeoff between recall and precision to fit a researcher’s top priority — which could be catching everything that might belong in a genre, or filtering out everything that doesn’t belong. We provide filtered collections of drama, fiction, and poetry for scholars who want to work with datasets that are 97-98% precise.
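The two steps described above, boiling several readers' tags down to a consensus and then measuring how often a model matches it, can be sketched as follows. The genre labels, the toy pages, and the choice to skip pages with no majority are assumptions; the report describes the actual procedure, which involved more than voting. The optional weights reflect the fact that accuracy was assessed by words rather than pages.

```python
from collections import Counter


def consensus(tags):
    """Return the label chosen by a strict majority of readers, else None."""
    label, votes = Counter(tags).most_common(1)[0]
    return label if votes * 2 > len(tags) else None


def agreement(reference, predicted, weights=None):
    """Weighted share of pages where `predicted` matches `reference`.

    Pages whose reference label is None (no consensus) are skipped.
    """
    if weights is None:
        weights = [1] * len(reference)
    matched = total = 0
    for ref, pred, weight in zip(reference, predicted, weights):
        if ref is None:
            continue
        total += weight
        if ref == pred:
            matched += weight
    return matched / total


# Toy example: three readers tag four pages.
page_tags = [
    ("fiction", "fiction", "nonfiction"),
    ("poetry", "poetry", "poetry"),
    ("drama", "fiction", "poetry"),        # no majority
    ("paratext", "paratext", "nonfiction"),
]
human = [consensus(tags) for tags in page_tags]
model = ["fiction", "poetry", "drama", "nonfiction"]
words_per_page = [300, 120, 250, 80]

page_rate = agreement(human, model)
word_rate = agreement(human, model, words_per_page)
```

In this toy case the model matches the consensus on two of the three pages that have one, but on 420 of the 500 words those pages contain, which shows why the unit of measurement matters.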

“You just wrote a blog post admitting that even simple generic boundaries like fiction/nonfiction are blurry and contested. So how can we pretend to stabilize a single map of genre?”

The short answer: we can’t. I don’t expect the genre predictions in this dataset to be more than one resource among many. We’ve also designed this dataset to have a certain amount of flexibility. There are confidence metrics associated with each volume, and users can define their collection of, say, poetry more broadly or narrowly by adjusting the confidence thresholds for inclusion. So even this dataset is not really a single map.
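How a confidence threshold trades recall for precision can be illustrated with a small sketch. The field names `confidence` and `is_poetry` and the six toy volumes are invented for the example; in practice the ground truth is only available for a hand-checked sample.

```python
def precision_recall(volumes, threshold):
    """Precision and recall of the collection kept at a confidence threshold.

    Returns (None, recall) if no volume clears the threshold, because
    precision is undefined for an empty selection.
    """
    selected = [v for v in volumes if v["confidence"] >= threshold]
    true_positives = sum(v["is_poetry"] for v in selected)
    all_positives = sum(v["is_poetry"] for v in volumes)
    precision = true_positives / len(selected) if selected else None
    recall = true_positives / all_positives if all_positives else None
    return precision, recall


# Toy volumes (invented values).
volumes = [
    {"confidence": 0.99, "is_poetry": True},
    {"confidence": 0.95, "is_poetry": True},
    {"confidence": 0.90, "is_poetry": False},
    {"confidence": 0.80, "is_poetry": True},
    {"confidence": 0.60, "is_poetry": False},
    {"confidence": 0.40, "is_poetry": True},
]

broad = precision_recall(volumes, 0.50)   # catches more, admits more errors
narrow = precision_recall(volumes, 0.93)  # cleaner, but misses real poetry
```

Raising the threshold from 0.50 to 0.93 lifts precision on this toy data from 0.6 to 1.0 while recall falls from 0.75 to 0.5.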

“What about divisions below the page level?”

With the exception of divisions between running headers and body text, we don’t address them. There are certainly a wide range of divisions below the page level that can matter, but we didn’t feel there was much to be gained by trying to solve all those problems at the same time as page-level mapping. In many cases, divisions below the page level are logically a subsequent step.

“How would I actually use this map to find stuff?”

There are three different ways — see “How to use this data?” in the interim report. If you’re working with HathiTrust Research Center, you could use this data to define a workset in their portal. Alternatively, if your research question can be answered with word frequencies, you could download public page-level features from HTRC and align them with our genre predictions on your own machine to produce a dataset of word counts from “only pages that have a 97% probability of being prose fiction,” or what have you. (HTRC hasn’t released feature counts for all the volumes we mapped yet, but they’re about to.) You can also align our predictions directly with HathiTrust zip files, if you have those. The pagealigner module in the utilities subfolder of our Github repo is intended as a handy shortcut for people who use Python; it will work both with HT zip files and HTRC feature files, aligning them with our genre predictions and returning a list of pages zipped with genre codes.
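The alignment step can be sketched in a few lines. This is not the pagealigner API, which is documented in the repo; it is a hypothetical illustration that assumes per-page word counts and per-page genre predictions have already been loaded as two parallel lists, and that fiction pages carry a genre code like `"fic"` (an invented code).

```python
from collections import Counter


def genre_counts(page_counts, page_genres, genre="fic", min_prob=0.97):
    """Sum word counts over pages predicted to be `genre` with high confidence.

    page_counts: list of dicts mapping word -> count, one per page.
    page_genres: list of (genre_code, probability) tuples, one per page.
    Both structures and the genre code are assumptions for illustration.
    """
    if len(page_counts) != len(page_genres):
        raise ValueError("page features and genre predictions are misaligned")
    total = Counter()
    for counts, (code, prob) in zip(page_counts, page_genres):
        if code == genre and prob >= min_prob:
            total.update(counts)
    return total


# Toy volume: a title page, two pages of fiction, and a page of ads.
pages = [
    {"london": 1, "published": 1},
    {"she": 4, "said": 3, "the": 9},
    {"he": 2, "said": 1, "the": 7},
    {"cloth": 3, "8vo": 2, "the": 1},
]
predictions = [("front", 0.99), ("fic", 0.99), ("fic", 0.98), ("ads", 0.97)]

fiction_only = genre_counts(pages, predictions)
```

Checking that the two lists have the same length is the important safeguard: if page sequences drift out of step, every later page receives the wrong genre.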

“Is this sort of collection really what I need for my project?”

Maybe not. There are a lot of books in HathiTrust. But as I admitted in my last post, a medium-sized collection based on bibliographies may be a better starting point for most scholars. Library-based collections include things like reprints, works in translation, juvenile fiction, and so on, that could be viewed as giving a fuller picture of literary culture … or could be viewed as messy complicating factors. I don’t mean to advocate for a library-based approach; I’m just trying to expand the range of alternatives we have available.

“What if I want to find fiction in French books between 1900 and 1970?”

Although we’ve made our code available as a resource, we definitely don’t want to represent it as a “tool” that could simply be pointed at other collections to do the same kind of genre mapping. Much of the work involved in this process is domain-specific (for instance, you have to develop page-level training data in a particular language and period). So this is better characterized as a method than a tool, and the report is probably more important than the repo. I plan to continue expanding the English-language map into the twentieth century (algorithmic mapping of genre may in fact be especially necessary for distant reading behind the veil of copyright). But I don’t personally have plans to expand this map to other languages; I hope someone else will take up that task.

As a reward for reading this far, here’s a visualization of the relative sizes of genres across time, represented as a percentage of pages in the English-language portion of HathiTrust.

The relative sizes of different genres, represented as a percentage of pages in the English-language portion of HathiTrust. 854,476 volumes are covered. Nonfiction, front matter, and back matter aren’t represented here. Results have been smoothed with a five-year moving average.

The image is discussed at more length in the interim progress report.
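For readers who want to apply the same smoothing to their own yearly series, here is a minimal sketch of a five-year moving average. The caption does not say how the window was positioned, so the centered window, truncated at both ends of the series, is an assumption.

```python
def moving_average(values, window=5):
    """Centered moving average; the window shrinks at the ends of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Toy yearly percentages (invented values).
yearly = [10, 20, 30, 40, 50, 60, 70]
smooth = moving_average(yearly)
```

A trailing window (each year averaged with the four before it) is an equally plausible reading of "five-year moving average"; it would shift the peaks slightly later.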


The blog post above often slips awkwardly into first-person plural, because I’m describing a project that involved a lot of people. Parts of the code involved were written by Michael L. Black and Boris Capitanu. The code also draws on machine learning libraries in Weka and Scikit-Learn [6, 7]. Shawn Ballard organized the process of gathering training data, assisted by Jonathan Cheng, Nicole Moore, Clara Mount, and Lea Potter. The project also depended on collaboration and conversation with a wide range of people at HathiTrust Digital Library, HathiTrust Research Center, and the University of Illinois Library, including but not limited to Loretta Auvil, Timothy Cole, Stephen Downie, Colleen Fallaw, Harriett Green, Myung-Ja Han, Jacob Jett, and Jeremy York. Jana Diesner and David Bamman offered useful advice about machine learning. Essential material support was provided by a Digital Humanities Start-Up Grant from the National Endowment for the Humanities and a Digital Innovation Fellowship from the American Council of Learned Societies. None of these people or agencies should be held responsible for mistakes.


[1] Perhaps it goes without saying, since the phrase has now lost its quotation marks, but “distant reading” comes from Franco Moretti, “Conjectures on World Literature,” New Left Review 1 (2000).

[2] Hadley Wickham, ggplot2: Elegant Graphics for Data Analysis. New York: Springer, 2009.

[3] Having mapped advertisements in volumes of fiction, I’m pretty certain that they’re responsible for the spike in dollar signs in Google’s “English Fiction” collection. The collection I mapped overlaps heavily with Google Books, and the number of pages of ads in fiction volumes tracks very closely with the frequency of dollar signs, “8vo,” and so on.

Percentage of pages in mostly-fiction volumes that are ads. Based on a filtered collection of 102,349 mostly-fiction volumes selected from a larger group of 854,476 volumes 1700-1922. Five-year moving average.

[4] “The great unread” comes from Margaret Cohen, The Sentimental Education of the Novel (Princeton NJ: Princeton University Press, 1999), 23.

[5] See the interim report (subsection, “Evaluating Confusion Matrices”) for a fuller description; it gets complicated, because we actually assessed accuracy in terms of the number of words misclassified, although the classification was taking place at a page level.

[6] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[7] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: An update. SIGKDD Explorations, 11(1), 2009.

Distant reading and the blurry edges of genre.

There are basically two different ways to build collections for distant reading. You can build up collections of specific genres, selecting volumes that you know belong to them. Or you can take an entire digital library as your base collection, and subdivide it by genre.

Most people do it the first way, and having just spent two years learning to do it the second way, I’d like to admit that they’re right. There’s a lot of overhead involved in mining a library. The problem becomes too big for your desktop; you have to schedule batch jobs; you have to learn to interpret MARC records. All this may be necessary eventually, but it’s not the ideal place to start.

But some of the problems I’ve encountered have been interesting. In particular, the problem of “dividing a library by genre” has made me realize that literary studies is constituted by exclusions that are a bit larger and more arbitrary than I used to think.

First of all, why is dividing by genre even a problem? Well, most machine-readable catalog records don’t say much about genre, and even if they did, a single volume usually contains multiple genres anyway. (Think introductions, indexes, collected poems and plays, etc.) With support from the ACLS and NEH, I’ve spent the last year wrestling with that problem, and in a couple of weeks I’m going to share an imperfect page-level map of genre for English-language books in HathiTrust 1700-1923.

But the bigger thing I want to report is that the ambiguity of genre may run deeper than most scholars who aren’t librarians currently imagine. To be sure, we know that subgenres like “detective fiction” are social institutions rather than natural forms. And in a vague way we also accept that broader categories like “fiction” and “poetry” are social constructs with blurry edges. We can all point to a few anomalies: prose poems, eighteenth-century journalistic fictions like The Spectator, and so on.

But somehow, in spite of knowing this for twenty years, I never grasped the full scale of the problem. For instance, I knew the boundary between fiction and nonfiction was blurry in the 18c, but I thought it had stabilized over time. By the time you got to the Victorians, surely, you could draw a circle around “fiction.” Exceptions would just prove the rule.

Selecting volumes one by one for genre-specific collections didn’t shake my confidence. But if you start with a whole library and try to winnow it down, you’re forced to consider a lot of things you would otherwise never look at. I’ve become convinced that the subset of genre-typical cases (should we call them cis-genred volumes?) is nowhere near as paradigmatic as literary scholars like to imagine. A substantial proportion of the books in a library don’t fit those models.

This is both a photograph of a real, unnamed mother and baby, and a picture of a fictional character named Shinkah. Frontispiece to Shinkah, The Osage Indian (1916).

Consider the case of Shinkah, the Osage Indian, published in 1916 by S. M. Barrett. The preface to this volume informs us that it’s intended as a contribution to “the sociology of the Osage Indians.” But it’s set a hundred years in the past, and the central character Shinkah is entirely fictional (his name just means “child”). On the other hand, the book is illustrated with photographs of real contemporary people, who stand for the characters in an ethnotypical way.

After wading through 872,000 volumes, I’m sorry to report that odd cases of this kind are more typical of nineteenth- and early twentieth-century fiction than my graduate-school training had led me to believe. There’s a smooth continuum, for instance, between Shinkah and Old Court Life in France (1873), by Frances Elliot. This book has a bibliography, and a historiographical preface, but otherwise reads like a historical novel, complete with invented dialogue. I’m not sure how to distinguish it from other historical novels with real historical personages as characters.

Literary critics know there’s a problem with historical fiction. We also know about the blurry boundary between fiction, journalism, and travel writing represented by the genre of the “sketch.” And anyone who remembers James Frey being kicked out of Oprah Winfrey’s definition of nonfiction knows that autobiographies can be problematic. And we know that didactic fiction blurs into philosophical dialogue. And anyone who studies children’s literature knows that the boundary between fiction and nonfiction gets especially blurry there. And probably some of us know about ethnographic novels like Shinkah. But I’m not sure many of us (except for librarians) have added it all up. When you’re sorting through an entire library you’re forced to see the scale of it: in the period 1700-1923, maybe 10% of the volumes that could be cataloged as fiction present puzzling boundary cases.

You run into a lot of these works even if you browse or select titles at random; that’s how I met Shinkah. But I’ve also been training probabilistic models of genre that report, among other things, how certain or uncertain they are about each page. These models are good at identifying clear cases of our received categories; I found that they agreed with my research assistants almost exactly as often as the research assistants agreed with each other (93-94% of the time, about broad categories like fiction/nonfiction). But you can also ask a model to sift through several thousand volumes looking for hard cases. When I did that I was taken aback to discover that about half the volumes it had most trouble with were things I also found impossible to classify. The model was most uncertain, for instance, about The Terrific Register (1825) — an almanac that mixes historical anecdote, urban legend, and outright fiction randomly from page to page. The second-most puzzling book was Madagascar, or Robert Drury’s Journal (1729), a book that offers itself as a travel journal by a real person, and was for a long time accepted as one, although scholars have more recently argued that it was written by Defoe.
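Asking a model to look for hard cases amounts to ranking volumes by how uncertain its page-level predictions are. The sketch below uses the mean binary entropy of each page's probability of being fiction; that scoring rule, the volume names, and the probabilities are all assumptions for illustration, since the post does not say how uncertainty was aggregated.

```python
import math


def binary_entropy(p):
    """Uncertainty in bits: 0 for a confident prediction, 1 when p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def hardest_volumes(page_probs, top_n=2):
    """Rank volumes by the mean entropy of their page-level predictions.

    page_probs maps a volume ID to a list of per-page probabilities
    that the page is fiction (a hypothetical structure).
    """
    scores = {vol: sum(binary_entropy(p) for p in probs) / len(probs)
              for vol, probs in page_probs.items() if probs}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Toy predictions (invented): two clear cases and two boundary cases.
page_probs = {
    "clear_novel": [0.98, 0.99, 0.97],
    "clear_history": [0.02, 0.05, 0.01],
    "miscellany": [0.50, 0.45, 0.60],
    "travel_journal": [0.70, 0.35, 0.80],
}
puzzles = hardest_volumes(page_probs)
```

A volume that alternates between confidently fictional and confidently nonfictional pages would score low on this measure even though it mixes genres, so a fuller search for boundary cases might also look at how much the predictions vary from page to page.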

Of course, a statistical model of fiction doesn’t care whether things “really happened”; it pays attention mostly to word frequency. Past-tense verbs of speech, personal names, and “the,” for instance, are disproportionately common in fiction. “Is” and “also” and “mr” (and a few hundred other words) are common in nonfiction. Human readers probably think about genre in a more abstract way. But it’s not particularly miraculous that a model using word frequencies should be confused by the same examples we find confusing. The model was trained, after all, on examples tagged by human beings; the whole point of doing that was to reproduce as much as possible the contours of the boundary that separates genres for us. The only thing that’s surprising is that trawling the model through a library turns up more books right in the middle of the boundary region than our habits of literary attention would have suggested.
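The idea of a model that "pays attention mostly to word frequency" can be sketched in a few lines. This is a toy illustration under stated assumptions, not the classifier trained for this project. It uses a naive log-odds score with add-one smoothing, the four training sentences are invented, and the names `train` and `log_odds` are hypothetical. With so little data it will not reproduce the real frequency patterns described above.

```python
import math
from collections import Counter


def train(examples):
    """Count word occurrences per label from (label, text) pairs."""
    counts = {"fiction": Counter(), "nonfiction": Counter()}
    for label, text in examples:
        counts[label].update(text.lower().split())
    return counts


def log_odds(counts, text):
    """Sum per-word log-odds; positive leans fiction, negative leans nonfiction."""
    vocab = set(counts["fiction"]) | set(counts["nonfiction"])
    total_f = sum(counts["fiction"].values()) + len(vocab)
    total_n = sum(counts["nonfiction"].values()) + len(vocab)
    score = 0.0
    for word in text.lower().split():
        if word not in vocab:
            continue  # words never seen in training carry no evidence
        p_f = (counts["fiction"][word] + 1) / total_f   # add-one smoothing
        p_n = (counts["nonfiction"][word] + 1) / total_n
        score += math.log(p_f / p_n)
    return score


# Invented toy training data (an assumption for illustration only).
examples = [
    ("fiction", "she said the door was open and john replied"),
    ("fiction", "mary whispered that the night was cold"),
    ("nonfiction", "the treaty is also notable and mr smith is the author"),
    ("nonfiction", "it is also true that the climate is dry"),
]
counts = train(examples)

print(log_odds(counts, "john said"))         # positive: leans fiction
print(log_odds(counts, "mr smith is also"))  # negative: leans nonfiction
```

Nothing in the score refers to whether events "really happened." A text that borrows the vocabulary of both categories ends up with a score near zero, which is exactly where boundary cases collect.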

A lot of discussions of distant reading have imagined it as a move from canonical to popular or obscure examples of a (known) genre. But reconsidering our definitions of the genres we’re looking for may be just as important. We may come to recognize that “the novel” and “the lyric poem” have always been islands floating in a sea of other texts, widely read but never genre-typical enough to be replicated on English syllabi.

In the long run, this may require us to balance two kinds of inclusiveness. We already know that digital libraries exclude a lot. Allen Riddell has nicely demonstrated just how much: he concludes that there are digital scans for only about 58% of the novels listed in bibliographies as having been published between 1800 and 1836.

One way to ensure inclusion might be to start with those bibliographies, which highlight books invisible in digital libraries. On the other hand, bibliographies also make certain things invisible. The Terrific Register (1825), for instance, is not in Garside’s bibliography of early-nineteenth-century fiction. Neither is The Wonder-Working Water Mill (1791), to mention another odd thing I bumped into. These aren’t oversights; Garside et al. acknowledge that they’re excluding certain categories of fiction from their conception of the novel. But because we’re trained to think about novels, the scale of that exclusion may only become visible after you spend some time trawling a library catalog.

I don’t want to present this as an aporia that makes it impossible to know where to start. It’s not. Most people attempting distant reading are already starting in the right place — which is to build up medium-sized collections of familiar generic categories like “the novel.” The boundaries of those categories may be blurrier than we usually acknowledge. But there’s also such a thing as fretting excessively about the synchronic representativeness of your sample. A lot of the interesting questions in distant reading are actually trends that involve relative, diachronic differences in the collection. Subtle differences of synchronic coverage may more or less drop out of questions about change over time.

On the other hand, if I’m right that the gray areas between (for instance) fiction and nonfiction are bigger and more persistently blurry than literary scholarship usually mentions, that’s probably in the long run an issue we should consider! When I release a page-level map of genre in a couple of weeks, I’m going to try to provide some dials that allow researchers to make more explicit choices about degrees of inclusion or exclusion.

Predictive models that report probabilities give us a natural way to handle this, because they allow us to characterize every boundary as a gradient, and explicitly acknowledge our compromises (for instance, trade-offs between precision and recall). People who haven’t done much statistical modeling often imagine that numbers will give humanists spuriously clear definitions of fuzzy concepts. My experience has been the opposite: I think our received disciplinary practices often make categories seem self-evident and stable because they teach us to focus on easy cases. Attempting to model those categories explicitly, on a large scale, can force you to acknowledge the real instability of the boundaries involved.
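The trade-off between precision and recall becomes visible as soon as the cutoff on a probability is treated as a dial. The sketch below is an illustration with made-up numbers, not results from this project. It assumes a list of hypothetical (probability of fiction, true label) pairs, and the function name `precision_recall` is likewise an assumption.

```python
def precision_recall(scored, threshold):
    """Compute precision and recall for 'fiction' at a probability cutoff.

    scored is a list of (probability_of_fiction, is_fiction) pairs.
    """
    tp = sum(1 for p, y in scored if p >= threshold and y)
    fp = sum(1 for p, y in scored if p >= threshold and not y)
    fn = sum(1 for p, y in scored if p < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Hypothetical scored volumes: (model probability, human judgment).
scored = [
    (0.95, True), (0.85, True), (0.70, False), (0.60, True),
    (0.40, True), (0.20, False), (0.05, False),
]

for threshold in (0.3, 0.5, 0.8):
    print(threshold, precision_recall(scored, threshold))
```

On this toy data, a strict cutoff of 0.8 admits only true fiction but misses half of it, while a loose cutoff of 0.3 catches every fiction volume at the cost of letting one nonfiction volume through. Choosing a cutoff is a way of stating explicitly how inclusive a collection is meant to be.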

References and acknowledgments

Training data for this project was produced by Shawn Ballard, Jonathan Cheng, Lea Potter, Nicole Moore and Clara Mount, as well as me. Michael L. Black and Boris Capitanu built a GUI that helped us tag volumes at the page level. Material support was provided by the National Endowment for the Humanities and the American Council of Learned Societies. Some information about results and methods is online as a paper and a poster, but much more will be forthcoming in the next month or so — along with a page-level map of broad genre categories and types of paratext.

The project would have been impossible without help from HathiTrust and HathiTrust Research Center. I’ve also been taught to read MARC records by librarians and information scientists including Tim Cole, M. J. Han, Colleen Fallaw, and Jacob Jett, any of whom could teach a course on “Cursed Metadata in Theory and Practice.”

I mention Garside’s bibliography of early nineteenth-century fiction. This is Garside, Peter, and Rainer Schöwerling. The English Novel, 1770-1829: A Bibliographical Survey of Prose Fiction Published in the British Isles. Ed. Peter Garside, James Raven, and Rainer Schöwerling. 2 vols. Oxford: Oxford University Press, 2000.

Paul Fyfe directed me to a couple of useful works on the genre of the sketch. Michael Widner has recently written a dissertation about the cognitive dimension of genre titled Genre Trouble. I’ve also tuned into ongoing thoughts about the temporal and social dimensions of genre from Daniel Allington and Michael Witmore. The now-classic pamphlet #1 from the Stanford Literary Lab, “Quantitative Formalism,” is probably responsible for my interest in the topic.

The long history of humanistic reaction to sociology.

N+1’s recent editorial on the sociology of taste is worth reading. Whatever it gets wrong, it’s probably right about the real source of tension in the humanities* right now.

People spend a lot of time arguing about the disruptive effects of technology. But if the humanities were challenged primarily by online delivery of recorded lectures, I would sleep very well at night.

The challenge humanists are confronting springs from social rather than technological change. And n+1 is right that part of the problem involves cynicism about the model of culture that justified the study of literature and other arts in the twentieth century. For much of that century, humanists felt comfortable claiming that their disciplines conveyed a kind of cultivation that transcended mere specialized learning. You learned about literary form not because it was in itself useful, but because it transformed you in a way that gave you full possession of a collective human legacy. I have to admit that the sociology of culture has made it harder to write sentences like that last one with a straight face. “Transformation” and “possession” are too obviously metaphors for cultural distinction.

John Guillory, Cultural Capital, Chicago, 1993.

This isn’t to say that Pierre Bourdieu and John Guillory are personally responsible for our predicament. I remember reading Guillory in 1993, and Cultural Capital didn’t come as a great shock. Rather, it seemed to explain, more candidly than usual, a state of imperial unclothedness that sidelong glances had already led most of us to privately suspect.

The n+1 editorial seems weakest when it tries to inflate this recent dilemma for humanists into a broader crisis for left politics or individual agency as such. If social theory necessarily sapped individuals’ will to action, we would be in very hot water indeed! We’d have to avoid reading Marx, as well as Bourdieu. But social analysis can of course coexist with a commitment to social change, and it’s not clear that the sociology of culture has done anything to undermine that commitment. The solidarity of middle and working classes against oligarchic power may even be in better shape today than it was in 1993.

That’s a bit beside the point, however, because n+1 doesn’t seem primarily interested in politics as such. They cite a few dubiously representative examples of contemporary(ish) political(ish) debate (e.g., David Brooks on bobos). But their heart seems to be in the academy, and their real concern appears to be that sociology is undermining academic humanists’ ability to defend their own institutions forcefully, untroubled by any doubt that those institutions merely reproduce cultural distinction. At least that’s what I infer when the editors write that “the spokespeople most effectively diminished by Bourdieu’s influence turn out to be those already in the precarious position of having to articulate and transmit a language of aesthetic experience that could remain meaningful outside either a regime of status or a regime of productivity.”

But here it seems to me that the editors are conflating two conversations. On the one hand, there’s a social and institutional debate about reforming and/or defending specific academic disciplines. On the other, there’s an abstract debate about the tension between social analysis and “aesthetic experience.” The rationale for treating them as the same seems weak.

Bowie, Heroes, 45 rpm, photo by Affendaddy. CC-BY-NC-SA.

For after all, aesthetic appreciation is doing just fine these days: the sociology of culture hasn’t even dented it. I don’t find my appreciation of David Bowie, for instance, even slightly compromised when I acknowledge that he concocted a specific kind of glamour out of racial, national, gender, and class identities. A historically specific fabulousness is no less fabulous.

The social specificity of Bowie’s glam does, on the other hand, complicate the kind of rationale I could provide for requiring students to study his music. It makes it harder to invoke him as a vehicle for a general cultivation that transcends mere specialized learning. And that’s why the sociology of culture has posed a problem for the humanities: not that it undermines aesthetic discourse as such, but that it complicates claims about the social necessity of aesthetic cultivation.

This is a real dilemma that I can’t begin to resolve in a blog post; instead I’ll just gesture at recent scholarly conversation on the topic broadly construed, including articles, courses, and presentations by Rachel Buurma, James English, Andrew Goldstone, and Laura Heffernan, among others.

The one detail I’d like to add to that conversation is that the concept of “the humanities” we are now tempted to defend may have been shaped in the early twentieth century by a reaction to social science rather like the reaction n+1 is now articulating.

It has been almost completely erased from the discipline’s collective memory, but between 1895 and 1925, literary studies came rather close to becoming a social science. The University of Chicago had a “Professor of Literary Theory and Interpretation” in 1903 — and what literary theory meant, at the time, was an ambitious project to articulate general laws of historical development for literary form. At other institutions this project was often called “general literatology” or “comparative literature,” but it had little in common with contemporary comparative literature. If you go back and read H. M. Posnett’s Comparative Literature (1886), you discover a project that resembles comparative anthropology more than contemporary literary study.

This period of the discipline’s history is now largely forgotten. English professors remember Matthew Arnold; we remember the New Criticism, and we vaguely remember that there was something dusty called “philology” in between. But we probably don’t remember that Chicago had a Professorship of (anthropologically conceived) “Literary Theory” in 1903.

The reason we don’t remember is that there was intense and effective push-back against the incorporation of social sciences (including history) in the study of arts and letters. The reaction stretched from works like Norman Foerster’s American Scholar (1929) to René Wellek’s widely-reprinted Theory of Literature (1949), and it argued at times rather explicitly that social-scientific approaches to culture would reduce the prestige of the arts by undermining the authority of personal cultivation. (One might almost say that critics of this period foresaw the danger posed by Bourdieu.)

It may not be an accident that this was also the period when a concept of “the humanities” (newly identified as an alternative to social science) became institutionally central in American universities (see Geoffrey Harpham’s Humanities and the Dream of America and my related blog post).

I’ll have a little more to say about the anthropologically-ambitious literary theory of the early twentieth century in a book forthcoming this summer (Why Literary Periods Mattered, Stanford UP). I don’t expect that book will resolve contemporary tension between the humanities and social sciences, but I do want to point out that the debate has been going on for more than a hundred years, and that it has constituted the humanities as a distinct entity at least as much as it has threatened them.

Postscript: For a response to n+1 by an actual sociologist of culture, see whatisthewhat.

* Postscript two days later: I now disagree with one aspect of this post, namely the way its opening paragraphs talk generally about a challenge “for the humanities.” Actually, it’s not clear to me that Bourdieu et al. have posed a problem for historians. I was describing a challenge “for the study of literature and the arts,” and I ought to have said that specifically. In fact, the tendency to inflate doubts about a specific model of literary culture into a generalized “crisis in the humanities” is part of what’s wrong with the n+1 editorial, and part of what I ought to be taking aim at here. But I guess blogging is about learning in public.