Can we date revolutions in the history of literature and music?

Humanists know the subjects we study are complex. So on the rare occasions when we describe them with numbers at all, we tend to proceed cautiously. Maybe too cautiously. Distant readers have spent a lot of time, for instance, just convincing colleagues that it might be okay to use numbers for exploratory purposes.

But the pace of this conversation is not entirely up to us. Outsiders to our disciplines may rush in where we fear to tread, forcing us to confront questions we haven’t faced squarely.

For instance, can we use numbers to identify historical periods when music or literature changed especially rapidly or slowly? Humanists have often used qualitative methods to make that sort of argument. At least since the nineteenth century, our narratives have described periods of stasis separated by paradigm shifts and revolutionary ruptures. For scientists, this raises an obvious, tempting question: why not actually measure rates of change and specify the points on the timeline when ruptures happened?

The highest-profile recent example of this approach is an article in Royal Society Open Science titled “The evolution of popular music” (Mauch et al. 2015). The authors identify three moments of rapid change in US popular music between 1960 and 2010. Moreover, they rank those moments, and argue that the advent of rap caused popular music to change more rapidly than the British Invasion — a claim you may remember, because it got a lot of play in the popular press. Similar arguments have appeared about the pace of change in written expression — e.g, a recent article argues that 1917 was a turning point in political rhetoric (h/t Cameron Blevins).

When disciplinary outsiders make big historical claims, humanists may be tempted just to roll our eyes. But I don’t think this is a kind of intervention we can afford to ignore. Arguments about the pace of cultural change engage theoretical questions that are fundamental to our disciplines, and questions that genuinely fascinate the public. If scientists are posing these questions badly, we need to explain why. On the other hand, if outsiders are addressing important questions with new methods, we need to learn from them. Scholarship is not a struggle between disciplines where the winner is the discipline that repels foreign ideas with greatest determination.

I feel particularly obligated to think this through, because I’ve been arguing for a couple of years that quantitative methods tend to reveal gradual change rather than the sharply periodized plateaus we might like to discover in the past. But maybe I just haven’t been looking closely enough for discontinuities? Recent articles introduce new ways of locating and measuring them.

This blog post applies methods from “The evolution of popular music” to a domain I understand better — nineteenth-century literary history. I’m not making a historical argument yet, just trying to figure out how much weight these new methods could actually support. I hope readers will share their own opinions in the comments. So far I would say I’m skeptical about these methods — or at least skeptical that I know how to interpret them.

How scientists found musical revolutions.

Mauch et al. start by collecting thirty-second snippets of songs in the Billboard Hot 100 between 1960 and 2010. Then they topic-model the collection to identify recurring harmonic and timbral topics. To study historical change, they divide the fifty-year collection into two hundred quarter-year periods, and aggregate the topic frequencies for each quarter. They’re thus able to create a heat map of pairwise “distances” between all these quarter-year periods. This heat map becomes the foundation for the crucial next step in their argument — the calculation of “Foote novelty” that actually identifies revolutionary ruptures in music history.

Figure 5 from Mauch, et. al., “The evolution of popular music” (RSOS 2015).

The diagonal line from bottom left to top right of the heat map represents comparisons of each time segment to itself: that distance, obviously, should be zero. As you rise above that line, you’re comparing the same moment to quarters in its future; if you sink below, you’re comparing it to its past. Long periods where topic distributions remain roughly similar are visible in this heat map as yellowish squares. (In the center of those squares, you can wander a long way from the diagonal line without hitting much dissimilarity.) The places where squares are connected at the corners are moments of rapid change. (Intuitively, if you deviate to either side of the narrow bridge there, you quickly get into red areas. The temporal “window of similarity” is narrow.) Using an algorithm outlined by Jonathan Foote (2000), the authors translate this grid into a line plot where the dips represent musical “revolutions.”

Trying the same thing on the history of the novel.

Could we do the same thing for the history of fiction? The labor-intensive part would be coming up with a corpus. Nineteenth-century literary scholars don’t have a Billboard Hot 100. We could construct one, but before I spend months crafting a corpus to address this question, I’d like to know whether the question itself is meaningful. So this is a deliberately rough first pass. I’ve created a sample of roughly 1000 novels in a quick and dirty way by randomly selecting 50 male and 50 female authors from each decade 1820-1919 in HathiTrust. Each author is represented in the whole corpus only by a single volume. The corpus covers British and American authors; spelling is normalized to modern British practice. If I were writing an article on this topic I would want a larger dataset and I would definitely want to record things like each author’s year of birth and nationality. This is just a first pass.

Because this is a longer and sparser sample than Mauch et al. use, we’ll have to compare two-year periods instead of quarters of a year, giving us a coarser picture of change. It’s a simple matter to run a topic model (with 50 topics) and then plot a heat map based on cosine similarities between the topic distributions in each two-year period.

Heatmap and Foote novelty for 1000 novels, 1820-1919. Rises in the trend lines correspond to increased Foote novelty.

Heatmap and Foote novelty for 1000 novels, 1820-1919. Rises in the trend lines correspond to increased Foote novelty.

Voila! The dark and light patterns are not quite as clear here as they are in “The evolution of popular music.” But there are certainly some squarish areas of similarity connected at the corners. If we use Foote novelty to interpret this graph, we’ll have one major revolution in fiction around 1848, and a minor one around 1890. (I’ve flipped the axis so peaks, rather than dips, represent rapid change.) Between these peaks, presumably, lies a valley of Victorian stasis.

Is any of that true? How would we know? If we just ask whether this story fits our existing preconceptions, I guess we could make it fit reasonably well. As Eleanor Courtemanche pointed out when I discussed this with her, the end of the 1840s is often understood as a moment of transition to realism in British fiction, and the 1890s mark the demise of the three-volume novel. But it’s always easy to assimilate new evidence to our preconceptions. Before we rush to do it, let’s ask whether the quantitative part of this argument has given us any reason at all to believe that the development of English-language fiction really accelerated in the 1840s.

I want to pose four skeptical questions, covering the spectrum from fiddly quantitative details to broad theoretical doubts. I’ll start with the fiddliest part.

1) Is this method robust to different ways of measuring the “distance” between texts?

The short answer is “yes.” The heat maps plotted above are calculated on a topic model, after removing stopwords, but I get very similar results if I compare texts directly, without a topic model, using a range of different distance metrics. Mauch et al. actually apply PCA as well as a topic model; that doesn’t seem to make much difference. The “moments of revolution” stay roughly in the same place.

2) How crucial is the “Foote novelty” piece of the method?

Very crucial, and this is where I think we should start to be skeptical. Mauch et al. are identifying moments of transition using a method that Jonathan Foote developed to segment audio files. The algorithm is designed to find moments of transition, even if those moments are quite subtle. It achieves this by making comparisons — not just between the immediately previous and subsequent moments in a stream of observations — but between all segments of the timeline.

It’s a clever and sensitive method. But there are other, more intuitive ways of thinking about change. For instance, we could take the first ten years of the dataset as a baseline and directly compare the topic distributions in each subsequent novel back to the average distribution in 1820-1829. Here’s the pattern we see if we do that:

byyearThat looks an awful lot like a steady trend; the trend may gradually flatten out (either because change really slows down or, more likely, because cosine distances are bounded at 1.0) but significant spurts of revolutionary novelty are in any case quite difficult to see here.

That made me wonder about the statistical significance of “Foote novelty,” and I’m not satisfied that we know how to assess it. One way to test the statistical significance of a pattern is to randomly permute your data and see how often patterns of the same magnitude turn up. So I repeatedly scrambled the two-year periods I had been comparing, constructed a heat matrix by comparing them pairwise, and calculated Foote novelty.

A heatmap produced by randomly scrambling the fifty two-year periods in the corpus. The “dates” on the timeline are now meaningless.

When I do this I almost always find Foote novelties that are as large as the ones we were calling “revolutions” in the earlier graph.

The authors of “The evolution of popular music” also tested significance with a permutation test. They report high levels of significance (p < 0.01) and large effect sizes (they say music changes four to six times faster at the peak of a revolution than at the bottom of a trough). Moreover, they have generously made their data available, in a very full and clearly-organized csv. But when I run my permutation test on their data, I run into the same problem — I keep discovering random Foote novelties that seem as large as the ones in the real data.

It’s possible that I’m making some error, or that we're testing significance differently. I'm permuting the underlying data, which always gives me a matrix that has the checkerboardy look you see above. The symmetrical logic of pairwise comparison still guarantees that random streaks organize themselves in a squarish way, so there are still “pinch points” in the matrix that create high Foote novelties. But the article reports that significance was calculated “by random permutation of the distance matrix.” If I actually scramble the rows or columns of the distance matrix itself I get a completely random pattern that does give me very low Foote novelty scores. But I would never get a pattern like that by calculating pairwise distances in a real dataset, so I haven’t been able to convince myself that it’s an appropriate test.

3) How do we know that all forms of change should carry equal cultural weight?

Now we reach some questions that will make humanists feel more at home. The basic assumption we’re making in the discussion above is that all the features of an expressive medium bear historical significance. If writers replace “love” with “spleen,” or replace “cannot” with “can’t,” it may be more or less equal where this method is concerned. It all potentially counts as change.

This is not to say that all verbal substitutions will carry exactly equal weight. The weight assigned to words can vary a great deal depending on how exactly you measure the distance between texts; topic models, for instance, will tend to treat synonyms as equivalent. But — broadly speaking — things like contractions can still potentially count as literary change, just as instrumentation and timbre count as musical change in “The evolution of popular music.”

At this point a lot of humanists will heave a relieved sigh and say “Well! We know that cultural change doesn’t depend on that kind of merely verbal difference between texts, so I can stop worrying about this whole question.”

Not so fast! I doubt that we know half as much as we think we know about this, and I particularly doubt that we have good reasons to ignore all the kinds of change we’re currently ignoring. Paying attention to merely verbal differences is revealing some massive changes in fiction that previously slipped through our net — like the steady displacement of abstract social judgment by concrete description outlined by Heuser and Le-Khac in LitLab pamphlet #4.

For me, the bottom line is that we know very little about the kinds of change that should, or shouldn’t, count in cultural history. “The evolution of popular music” may move too rapidly to assume that every variation of a waveform bears roughly equal historical significance. But in our daily practice, literary historians rely on a set of assumptions that are much narrower and just as arbitrary. An interesting debate could take place about these questions, once humanists realize what’s at stake, but it’s going to be a thorny debate, and it may not be the only way forward, because …

4) Instead of discussing change in the abstract, we might get further by specifying the particular kinds of change we care about.

Our systems of cultural periodization tend to imply that lots of different aspects of writing (form and style and theme) all change at the same time — when (say) “aestheticism” is replaced by “modernism.” That underlying theory justifies the quest for generalized cultural growth spurts in “The evolution of popular music.”

But we don’t actually have to think about change so generally. We could specify particular social questions that interest us, and measure change relative to those questions.

The advantage of this approach is that you no longer have to start with arbitrary assumptions about the kind of “distance” that counts. Instead you could use social evidence to train a predictive model. Insofar as that model predicts the variables you care about, you know that it’s capturing the specific kind of change that matters for your question.

Jordan Sellers and I took this approach in a working paper we released last spring, modeling the boundary between volumes of poetry that were reviewed in prominent venues, and those that remained obscure. We found that the stylistic signals of poetic prestige remained relatively stable across time, but we also found that they did move, gradually, in a coherent direction. What we didn’t do, in that article, is try to measure the pace of change very precisely. But conceivably you could, using Foote novelty or some other method. Instead of creating a heatmap that represents pairwise distances between texts, you could create a grid where models trained to recognize a social boundary in particular decades make predictions about the same boundary in other decades. If gender ideologies or definitions of poetic prestige do change rapidly in a particular decade, it would show up in the grid, because models trained to predict authorial gender or poetic prominence before that point would become much worse at predicting it afterward.


I haven’t come to any firm conclusion about “The evolution of popular music.” It’s a bold article that proposes and tests important claims; I’ve learned a lot from trying the same thing on literary history. I don’t think I proved that there aren’t any revolutionary growth spurts in the history of the novel. It’s possible (my gut says, even likely) that something does happen around 1848 and around 1890. But I wasn’t able to show that there’s a statistically significant acceleration of change at those moments. More importantly, I haven’t yet been able to convince myself that I know how to measure significance and effect size for Foote novelty at all; so far my attempts to do that produce results that seem different from the results in a paper written by four authors who have more scientific training than I do, so there’s a very good chance that I’m misunderstanding something.

I would welcome comments, because there are a lot of open questions here. The broader task of measuring the pace of cultural change is the kind of genuinely puzzling problem that I hope we’ll be discussing at more length in the IPAM Cultural Analytics workshop next spring at UCLA.

Postscript Oct 5: More will be coming in a day or two. The suggestions I got from comments (below) have helped me think the quantitative part of this through, and I’m working up an iPython notebook that will run reliable tests of significance and effect size for the music data in Mauch et al. as well as a larger corpus of novels. I have become convinced that significance tests on Foote novelty are not a good way to identify moments of rapid change. The basic problem with that approach is that sequential datasets will always have higher Foote novelties than permuted (non-sequential) datasets, if you make the “window” wide enough — even if the pace of change remains constant. Instead, borrowing an idea from Hoyt Long and Richard So, I’m going to use a Chow test to see whether rates of change vary.

Postscript Oct 8: Actually it could be a while before I have more to say about this, because the quantitative part of the problem turns out to be hard. Rates of change definitely vary. Whether they vary significantly, may be a tricky question.


Jonathan Foote. Automatic audio segmentation using a measure of audio novelty. In Proceedings of IEEE International Conference on Multimedia and Expo, vol. I, pp. 452-455, 2000.

Matthias Mauch, Robert M. MacCallum, Mark Levy, Armand M. Leroi. The evolution of popular music. Royal Society Open Science. May 6, 2015.

Digital humanities might never be evenly distributed.

In an eloquent and pragmatic blog post about building the UCL Centre for Digital Humanities, Melissa Terras stresses the importance of rooting a DH center in local institutional culture, in order to “link people” across the whole spectrum from arts and humanities to computer science and engineering. It’s an impressive achievement that has clearly fostered a lot of significant work at UCL, and it has started to change my own way of thinking about this perplexing phrase “digital humanities.”

In the past, I’ve tended to understand “digital humanities” as an abstract term. If you understand it that way, it’s easy to see that it covers a whole range of disparate things, which has sometimes led me to predict that it would fall apart in the near future into a bunch of separate projects.

Alan Liu, “Map of Digital Humanities” — photo by Quinn Dombrowski at UC Berkeley, August 17, 2015. CC-BY-SA.

But as time passes and the darn thing refuses to fall apart, it seems appropriate to revisit that prediction. I still think digital humanities is hard to define, but apparently, being hard to define doesn’t prevent human institutions from enduring and growing. When I read it a few months ago, Melissa’s post made me reflect that “DH” doesn’t have to be defined abstractly at all. It could be understood, quite concretely, as an institutional achievement that happens to exist on some campuses and not others.

If you understand DH abstractly, as a rubric covering many different projects, there’s a lot of it going on here at UIUC. On the west side of campus, we have a leading school of Library and Information Science (GSLIS), which regularly offers courses on digital humanities, and is one of two institutions piloting HathiTrust Research Center. At the north end of campus, we have the National Center for Supercomputing Applications (NCSA), which excels at providing computational support for the arts, humanities, and social sciences. The Colleges of Media, and Fine and Applied Arts, and Liberal Arts and Sciences are home to a lot of individual scholars pursuing research on or critique of digital media, and the campus as a whole hosts ambitious experiments like Learning to See Systems, that combine technological practice and theory.

On the other hand, we’ve never had a digital humanities center or curricular initiative. We have a program called I-CHASS, at NCSA, which provides computational support to scholars who need it (my own work would have been impossible without their support). And Scholarly Commons, at the Library, helps faculty and students find the resources and training they need. But we don’t have any center of the kind Melissa describes, tasked with building a bridge between all the different people mentioned above, and getting them in the same room.

One way to view this would be: we’re lagging behind. Digital humanities is getting organized at Berkeley and Stanford and Iowa and the University of Pennsylvania and Yale. From time to time I think “we need to get something moving.”

And from time to time I try. But I rapidly discover the size of this campus, and the huge range of digitally-human projects already scattered across it, already moving (quite successfully) in diametrically opposed directions — and it occurs to me, first, that it would take superhuman effort to herd them into the same room, and second, that maybe UIUC doesn’t have a digital humanities center because it doesn’t need one. I’m finding all the resources I need over at GSLIS and NCSA; other kinds of projects are also humming along; maybe we’ve never developed a single center precisely because our various distributed centers are so strong.

There are some drawbacks to this arrangement — mainly, that the strengths of the institution are not well-publicized either internally or abroad. For instance, I’m sure some grad students in the humanities here don’t realize that GSLIS regularly offers excellent courses in digital humanities. I’m writing this blog post partly in hopes of flagging that kind of local opportunity.

I think it’s even harder for undergraduates to envision creative connections between the humanities and other subjects without some kind of interdisciplinary program as a model. This is probably the biggest drawback of our distributed structure, and I do feel I should do something about it. But given the way my own interests fit into the local landscape, I suspect the wheel I can put my shoulder to may be an undergraduate program in data science rather than digital humanities. It seems increasingly possible to me that “digital humanities” — as such — may never take institutional form on this campus. By the time we organize that curricular space, it may be occupied by several distinct projects.

That possibility is making me reflect that public discussion of this topic (as skeptical and wide-ranging as it has been) may still have been too quick to assume we’re all moving in the same direction. William Gibson’s famous quip that the future is already here — but not evenly distributed — encourages us to imagine two possible futures for experiments like digital humanities: either they are destined to (eventually) get distributed everywhere, or they will turn out to have been blind alleys.

I think it’s pretty clear at this point that digital humanities is not a blind alley; there’s too much valuable research and teaching being done under that rubric in too many places, and momentum is continuing to build. But I also doubt its institutions — DH centers and curricula — will ever be evenly distributed. I suspect this is going to be one of those disciplinary spaces that different institutions handle differently even over the long run. In some places, the concept of “DH” may be exactly the seed crystal a local culture needs to bring people together. In other places, institutional DH may fail to coalesce, although — or even because — the interdisciplinary projects it would have organized are separately thriving.

A dataset for distant-reading literature in English, 1700-1922.

Literary critics have been having a speculative conversation about close and distant reading. It might be premature to call it a debate.

A “debate” is normally a situation where people are free to choose between two paths. “Should I believe Habermas, or Foucault? I’m listening; I could go either way.” Conversation about distant reading is different, first, because there’s not much need to make a choice. Have any critics stopped reading closely? A close reading of The Bourgeois suggests that Franco Moretti hasn’t.

More importantly, this isn’t a debate yet because most of the people involved aren’t free to explore both paths. So far only a tiny number of scholars have actually tried distant reading, and it’s easy to see why. You can wake up tomorrow and try a Foucauldian reading of Frankenstein, but you can’t wake up and trace patterns of change in a thousand novels. In either case, you may need to learn new methods, but in the “distant” case, it can also take years to assemble a collection of texts.

A dataset for distant reading
To reduce barriers to entry, I’ve collaborated with HathiTrust Research Center to create an easier place to start with English-language literature. It’s aimed at scholars studying long-nineteenth-century (1750-1922) fiction and poetry, but it will gradually expand into the twentieth century. This post describes the humanistic uses of the dataset; if you want technical information, there’s more on the page where the data actually lives.

HathiTrust contains more than a million volumes in English between 1700 and 1922. Contractual agreements make it hard to share the texts themselves in bulk, but many of the questions that can be posed “at a distance” can be posed just as well using simpler representations of the texts — for instance, by counting the words they contain. To support this project, HathiTrust Research Center has extracted page-level word counts for 4.8 million volumes; scholars who are interested in the highest level of detail should go directly to their data.

However, many literary scholars are mainly concerned with books in a particular genre — they limit their inquiries, say, to “poetry” or “prose fiction.” Finding those needles in a five-millon-volume haystack is not easy. Many books in this period don’t carry genre tags; even when they do, volumes are heterogenous things. A volume of poetry, for instance, may begin with a prose life of the author and end with publishers’ ads.

The relative sizes of different genres, represented as a percentage of pages in the English-language portion of HathiTrust. 854,476 volumes are covered. Nonfiction, front matter, and back matter aren't represented here. Results have been smoothed with a five-year moving average.

The relative sizes of different genres, represented as a percentage of pages in the English-language portion of HathiTrust. 854,476 volumes are covered. Nonfiction, front matter, and back matter aren’t represented here. Results have been smoothed with a five-year moving average.

To create datasets that reliably track a single genre, we need page-level metadata. The National Endowment for the Humanities and the American Council of Learned Societies funded a year-long project to create that metadata. (The methods involved are described in a white paper on “Understanding Genre,” along with information about accuracy.) Now, by pairing this metadata with HTRC’s page-level wordcounts, I’ve created three genre-specific datasets of word counts covering poetry, fiction, and drama from 1700 to 1922. (Coverage is relatively sparse before 1750; if you need the early eighteenth century, you might want a resource like ECCO-TCP instead of or in addition to this.)

The collection consists of word counts for 101,948 volumes of fiction, 58,724 volumes of poetry, and 17,709 volumes of drama, aggregated at the volume level and including only pages identified as belonging to the relevant genre. I’ve collected these volume-level files in tar.gz chunks by genre and date, and have provided basic metadata for them all. You can use the volume IDs to view the original texts on the HathiTrust website if you need to read them closely. I’m calling this a “collection” rather than a “corpus” because I don’t necessarily recommend that you use the whole thing, as is. The whole thing may or may not represent the sample you need for your research question. What it represents is, “American university and public libraries, insofar as they were digitized in the year 2012 (when the project began).” For some big diachronic questions, that’s a good sample; for other questions, you’ll need to be more selective.

Three big blocks of stone. Like collections, these don't represent anything in particular. But the corpus you want to create might be contained somewhere within them.

Three big blocks of stone. Like collections, these don’t represent anything in particular. But like a statue, the corpus you want to create might be contained somewhere within them.

Because this is a very large collection, it’s likely in any case that the sample you need for your research may be contained somewhere within it. To address some questions, you might even select several samples and contrast them. To understand the history of literary prestige, for instance, Jordan Sellers and I gathered 360 prominent books of poetry by finding reviews in literary magazines and extracting the corresponding books from HathiTrust; we then contrasted that to a sample of 360 more obscure volumes selected from the whole HathiTrust collection of poetry. Just using volume-level wordcounts for those two samples, we were able to draw inferences about the way diachronic literary change is related to synchronic prestige.

Well-known texts may be represented in this dataset by dozens of reprints. For some questions, that may be exactly the sort of “weighted” sample you want; for other questions, you’ll want to winnow each title down to a single early example. More datasets may be developed to help you do that.

Distant reading rarely means “big data”
I realize the practice described above (selecting samples of a few hundred or a few thousand books to address particular questions) doesn’t line up with the version of distant reading currently circulating in public imagination. Isn’t the point of distant reading to construct a massive database that includes “everything that has been thought and said”? The Nation recently said so, and also warned us that “in reality, servers powerful enough to process big data can only be located in a highly select number of well-endowed institutions.”

That sounds grim, but I’m happy to report that it’s also malarkey. You can download this dataset, and process it, on your laptop. It’s true that I used our campus cluster to create it (because I had to manage a terabyte of text). But a) managing a terabyte won’t put a hole in most endowments, and b) you don’t need to do that anyway. Once nonfiction is set aside, we’re talking about a smaller group of books (compressed, this whole dataset runs to about 5GB). A well-designed sampling strategy can make it even smaller.

Wait, what’s this about “sampling”? aren’t distant readers supposed to claim to have everything? Not really. In the early days of distant reading, Franco Moretti did frame the project as a challenge to literary historians’ claims about synchronic coverage. (We only discuss a tiny number of books from any given period — what about all the rest?) But even in those early publications, Moretti acknowledged that we would only be able to represent “all the rest” through some kind of sample.

Fifteen years later, it’s becoming clear that distant reading has a lot of applications that aren’t about synchronic completeness at all. Expanding the diachronic scope of our research can be an equally important source of discovery. Certain kinds of change only become visible when you compare many examples across long timelines. Even if we restricted a digital corpus (say) to the academic canon, or to a thousand bestsellers, computational analysis would allow us to see long-term changes that aren’t visible to casual recollection.

It’s true that distant readers will often want to have the biggest possible table of metadata, so that our sampling strategies aren’t unduly constrained. But from that table, we may only sample a few hundred or a few thousand titles to address any single question. This scale of inquiry is not, in any meaningful sense, “big data.” (In fact, I doubt the phrase “big data” is often very meaningful, but that’s another story.) It’s a larger sample than literary scholars have usually attempted to describe, but it would not greatly distress our neighbors in linguistics and sociology.

How hard is this to use?
Of course, we’re not linguists or sociologists, so there is going to be a learning curve involved when we apply quantitative methods on any scale. The main dataset I’m providing here includes 178,381 separate files — one file for each volume. This is not something that can be sliced easily using a tool like Excel. Someone involved with the project needs to be able to program in order to pair the metadata table with the files.

On the other hand, there may be some questions that can be answered with a simple yearly summary, so I’ve also provided yearly_summary tables for each genre that aggregate term frequencies for the 10,000 most common tokens in each genre (selected by document frequency). This is the gentlest on-ramp to the dataset; data in this form probably can be sliced with Excel; to make it even easier I’ve also gone ahead and applied OCR correction and spelling normalization to those tables.

But the yearly_summary table aggregates all the volumes in the collection, and (as I’ve stressed) you may not want all of them. This dataset is a roughly-hewn, but very large, block of stone. You may be able to find the corpus you need somewhere within it, but decisions about selection are yours to make. Over the course of the next two years I hope to extend coverage further into the twentieth century; it is not illegal to share word counts from texts still covered by copyright. If you’re interested in more complex kinds of distant reading where word order matters, you can contact the HathiTrust Research Center; they are creating a workflow that can handle more complex kinds of computational analysis.

Postscript: We’ve done a lot of testing, but this is still a beta release. General estimates about error are summarized in “Understanding Genre”. Precision in these datasets is higher than 97%, but that still means there will be hundreds of volumes and thousands of pages mistakenly included. If you notice systematic problems with the data, please send feedback to the e-mail address provided in the data description. But individual misclassified volumes are not problems we’re likely to fix on a case-by-case basis; that sort of problem will be addressed by improving our methods in our next release.

Seven ways humanists are using computers to understand text.

[This is an updated version of a blog post I wrote three years ago, which organized introductory resources for a workshop. Getting ready for another workshop this summer, I glanced back at the old post and realized it’s out of date, because we’ve collectively covered a lot of ground in three years. Here’s an overhaul.]

Why are humanists using computers to understand text at all?
Part of the point of the phrase “digital humanities” is to claim information technology as something that belongs in the humanities — not an invader from some other field. And it’s true, humanistic interpretation has always had a technological dimension: we organized writing with commonplace books and concordances before we took up keyword search [Nowviskie, 2004; Stallybrass, 2007].

But framing new research opportunities as a specifically humanistic movement called “DH” has the downside of obscuring a bigger picture. Computational methods are transforming the social and natural sciences as much as the humanities, and they’re doing so partly by creating new conversations between disciplines. One of the main ways computers are changing the textual humanities is by mediating new connections to social science. The statistical models that help sociologists understand social stratification and social change haven’t in the past contributed much to the humanities, because it’s been difficult to connect quantitative models to the richer, looser sort of evidence provided by written documents. But that barrier is dissolving. As new methods make it easier to represent unstructured text in a statistical model, a lot of fascinating questions are opening up for social scientists and humanists alike [O’Connor et. al. 2011].

In short, computational analysis of text is not a specific new technology or a subfield of digital humanities; it’s a wide-open conversation in the space between several different disciplines. Humanists often approach this conversation hoping to find digital tools that will automate familiar tasks. That’s a good place to start: I’ll mention tools you could use to create a concordance or a word cloud. And it’s fair to stop there. More involved forms of text analysis do start to resemble social science, and humanists are under no obligation to dabble in social science.

But I should also warn you that digital tools are gateway drugs. This thing called “text analysis” or “distant reading” is really an interdisciplinary conversation about methods, and if you get drawn into the conversation, you may find that you want to try a lot of things that aren’t packaged yet as tools.

What can we actually do?
The image below is a map of a few things you might do with text (inspired by, though different from, Alan Liu’s map of “digital humanities”). The idea is to give you a loose sense of how different activities are related to different disciplinary traditions. We’ll start in the center, and spiral out; this is just a way to organize discussion, and isn’t necessarily meant to suggest a sequential work flow.


1) Visualize single texts.
Text analysis is sometimes represented as part of a “new modesty” in the humanities [Williams]. Generally, that’s a bizarre notion. Most of the methods described in this post aim to reveal patterns hidden from individual readers — not a particularly modest project. But there are a few forms of analysis that might count as surface readings, because they visualize textual patterns that are open to direct inspection.

For instance, people love cartoons by Randall Munroe that visualize the plots of familiar movies by showing which characters are together at different points in the narrative.

Detail from an xkcd cartoon.

Detail from an xkcd cartoon.

These cartoons reveal little we didn’t know. They’re fun to explore in part because the narratives being represented are familiar: we get to rediscover familiar material in a graphical medium that makes it easy to zoom back and forth between macroscopic patterns and details. Network graphs that connect characters are fun to explore for a similar reason. It’s still a matter of debate what (if anything) they reveal; it’s important to keep in mind that fictional networks can behave very differently from real-world social networks [Elson, et al., 2010]. But people tend to find them interesting.

A concordance also, in a sense, tells us nothing we couldn’t learn by reading on our own. But critics nevertheless find them useful. If you want to make a concordance for a single work (or for that matter a whole library), AntConc is a good tool.

Visualization strategies themselves are a topic that could deserve a whole separate discussion.

2) Choose features to represent texts.
A scholar undertaking computational analysis of text needs to answer two questions. First, how are you going to represent texts? Second, what are you going to do with that representation once you’ve got it? Most what follows will focus on the second question, because there are a lot of equally good answers to the first one — and your answer to the first question doesn’t necessarily constrain what you do next.

In practice, texts are often represented simply by counting the various words they contain (they are treated as so-called “bags of words”). Because this representation of text is radically different from readers’ sequential experience of language, people tend to be surprised that it works. But the goal of computational analysis is not, after all, to reproduce the modes of understanding readers have already achieved. If we’re trying to reveal large-scale patterns that wouldn’t be evident in ordinary reading, it may not actually be necessary to retrace the syntactic patterns that organize readers’ understanding of specific passages. And it turns out that a lot of large-scale questions are registered at the level of word choice: authorship, theme, genre, intended audience, and so on. The popularity of Google’s Ngram Viewer shows that people often find word frequencies interesting in their own right.

But there are lots of other ways to represent text. You can count two-word phrases, or measure white space if you like. Qualitative information that can’t be counted can be represented as a “categorical variable.” It’s also possible to consider syntax, if you need to. Computational linguists are getting pretty good at parsing sentences; many of their insights have been packaged accessibly in projects like the Natural Language Toolkit. And there will certainly be research questions — involving, for instance, the concept of character — that require syntactic analysis. But they tend not to be questions that are appropriate for people just starting out.

3) Identify distinctive vocabulary.
It can be pretty easy, on the other hand, to produce useful insights on the level of diction. These are claims of a kind that literary scholars have long made: The Norton Anthology of English Literature proves that William Wordsworth emblematizes Romantic alienation, for instance, by saying that “the words ‘solitary,’ ‘by one self,’ ‘alone’ sound through his poems” [Greenblatt et. al., 16].

Of course, literary scholars have also learned to be wary of these claims. I guess Wordsworth does write “alone” a lot: but does he really do so more than other writers? “Alone” is a common word. How do we distinguish real insights about diction from specious cherry-picking?

Corpus linguists have developed a number of ways to identify locutions that are really overrepresented in one sample of writing relative to others. One of the most widely used is Dunning’s log-likelihood: Ben Schmidt has explained why it works, and it’s easily accessible online through Voyant or downloaded in the AntConc application already mentioned. So if you have a sample of one author’s writing (say Wordsworth), and a reference corpus against which to contrast it (say, a collection of other poetry), it’s really pretty straightforward to identify terms that typify Wordsworth relative to the other sample. (There are also other ways to measure overrepresentation; Adam Kilgarriff recommends a Mann-Whitney test.) And in fact there’s pretty good evidence that “solitary” is among the words that distinguish Wordsworth from other poets.

Words that are consistently more common in works by William Wordsworth than in other poets from 1780 to 1850. I’ve used Wordle’s graphics, but the words have been selected by a Mann-Whitney test, which measures overrepresentation relative to a context — not by Wordle’s own (context-free) method.

It’s also easy to turn results like this into a word cloud — if you want to. People make fun of word clouds, with some justice; they’re eye-catching but don’t give you a lot of information. I use them in blog posts, because eye-catching, but I wouldn’t in an article.

4) Find or organize works.
This rubric is shorthand for the enormous number of different ways we might use information technology to organize collections of written material or orient ourselves in discursive space. Humanists already do this all the time, of course: we rely very heavily on web search, as well as keyword searching in library catalogs and full-text databases.

But our current array of strategies may not necessarily reveal all the things we want to find. This will be obvious to historians, who work extensively with unpublished material. But it’s true even for printed books: works of poetry or fiction published before 1960, for instance, are often not tagged as “poetry” or “fiction.”

A detail from Fig 7 in So and Long, “Network Analysis and the Sociology of Modernism.”

Even if we believed that the task of simply finding things had been solved, we would still need ways to map or organize these collections. One interesting thread of research over the last few years has involved mapping the concrete social connections that organize literary production. Natalie Houston has mapped connections between Victorian poets and publishing houses; Hoyt Long and Richard Jean So have shown how writers are related by publication in the same journals [Houston 2014; So and Long 2013].

There are of course hundreds of other ways humanists might want to organize their material. Maps are often used to visualize references to places, or places of publication. Another obvious approach is to group works by some measure of textual similarity.

There aren’t purpose-built tools to support much of this work. There are tools for building visualizations, but often the larger part of the problem is finding, or constructing, the metadata you need.

5) Model literary forms or genres.
Throughout the rest of this post I’ll be talking about “modeling”; underselling the centrality of that concept seems to me the main oversight in the 2012 post I’m fixing.

A model treehouse, by Austin and Zak -- CC-NC-SA.

A model treehouse, by Austin and Zak — CC-NC-SA.

A model is a simplified representation of something, and in principle models can be built out of words, balsa wood, or anything you like. In practice, in the social sciences, statistical models are often equations that describe the probability of an association between variables. Often the “response variable” is the thing you’re trying to understand (literary form, voting behavior, or what have you), and the “predictor variables” are things you suspect might help explain or predict it.

This isn’t the only way to approach text analysis; historically, humanists have tended to begin instead by first choosing some aspect of text to measure, and then launching an argument about the significance of the thing they measured. I’ve done that myself, and it can work. But social scientists prefer to tackle problems the other way around: first identify a concept that you’re trying to understand, and then try to model it. There’s something to be said for their bizarrely systematic approach.

Building a model can help humanists in a number of ways. Classically, social scientists model concepts in order to understand them better. If you’re trying to understand the difference between two genres or forms, building a model could help identify the features that distinguish them.

Scholars can also frame models of entirely new genres, as Andrew Piper does in a recent essay on the “conversional novel.”

A very simple, imaginary statistical model that distinguishes pages of poetry from pages of prose.

A very simple, imaginary statistical model that distinguishes pages of poetry from pages of prose.

In other cases, the point of modeling will not actually be to describe or explain the concept being modeled, but very simply to recognize it at scale. I found that I needed to build predictive models simply to find the fiction, poetry, and drama in a collection of 850,000 volumes.

The tension between modeling-to-explain and modeling-to-predict has been discussed at length in other disciplines [Shmueli, 2010]. But statistical models haven’t been used extensively in historical research yet, and humanists may well find ways to use them that aren’t common in other disciplines. For instance, once we have a model of a phenomenon, we may want to ask questions about the diachronic stability of the pattern we’re modeling. (Does a model trained to recognize this genre in one decade make equally good predictions about the next?)

There are lots of software packages that can help you infer models of your data. But assessing the validity and appropriateness of a model is a trickier business. It’s important to fully understand the methods we’re borrowing, and that’s likely to require a bit of background reading. One might start by understanding the assumptions implicit in simple linear models, and work up to the more complex models produced by machine learning algorithms [Sculley and Pasanek 2008]. In particular, it’s important to learn something about the problem of “overfitting.” Part of the reason statistical models are becoming more useful in the humanities is that new methods make it possible to use hundreds or thousands of variables, which in turn makes it possible to represent unstructured text (those bags of words tend to contain a lot of variables). But large numbers of variables raise the risk of “overfitting” your data, and you’ll need to know how to avoid that.

6) Model social boundaries.
There’s no reason why statistical models of text need to be restricted to questions of genre and form. Texts are also involved in all kinds of social transactions, and those social contexts are often legible in the text itself.

For instance, Jordan Sellers and I have recently been studying the history of literary distinction by training models to distinguish poetry reviewed in elite periodicals from a random selection of volumes drawn from a digital library. There are a lot of things we might learn by doing this, but the top-line result is that the implicit standards distinguishing elite poetic discourse turn out to be relatively stable across a century.

plotmainmodelannotateSimilar questions could be framed about political or legal history.

7) Unsupervised modeling.
The models we’ve discussed so far are supervised in the sense that they have an explicit goal. You already know (say) which novels got reviewed in prominent periodicals, and which didn’t; you’re training a model in order to discover whether there are any patterns in the texts themselves that might help us explain this social boundary, or trace its history.

But advances in machine learning have also made it possible to train unsupervised models. Here you start with an unlabeled collection of texts; you ask a learning algorithm to organize the collection by finding clusters or patterns of some loosely specified kind. You don’t necessarily know what patterns will emerge.

If this sounds epistemologically risky, you’re not wrong. Since the hermeneutic circle doesn’t allow us to get something for nothing, unsupervised modeling does inevitably involve a lot of (explicit) assumptions. It can nevertheless be extremely useful as an exploratory heuristic, and sometimes as a foundation for argument. A family of unsupervised algorithms called “topic modeling” have attracted a lot of attention in the last few years, from both social scientists and humanists. Robert K. Nelson has used topic modeling, for instance, to identify patterns of publication in a Civil-War-era newspaper from Richmond.

But I’m putting unsupervised models at the end of this list because they may almost be too seductive. Topic modeling is perfectly designed for workshops and demonstrations, since you don’t have to start with a specific research question. A group of people with different interests can just pour a collection of texts into the computer, gather round, and see what patterns emerge. Generally, interesting patterns do emerge: topic modeling can be a powerful tool for discovery. But it would be a mistake to take this workflow as paradigmatic for text analysis. Usually researchers begin with specific research questions, and for that reason I suspect we’re often going to prefer supervised models.

* * *

In short, there are a lot of new things humanists can do with text, ranging from new versions of things we’ve always done (make literary arguments about diction), to modeling experiments that take us fairly deep into the methodological terrain of the social sciences. Some of these projects can be crystallized in a push-button “tool,” but some of the more ambitious projects require a little familiarity with a data-analysis environment like Rstudio, or even a programming language like Python, and more importantly with the assumptions underpinning quantitative social science. For that reason, I don’t expect these methods to become universally diffused in the humanities any time soon. In principle, everything above is accessible for undergraduates, with a semester or two of preparation — but it’s not preparation of a kind that English or History majors are guaranteed to have.

Generally I leave blog posts undisturbed after posting them, to document what happened when. But things are changing rapidly, and it’s a lot of work to completely overhaul a survey post like this every few years, so in this one case I may keep tinkering and adding stuff as time passes. I’ll flag my edits with a date in square brackets.

* * *


Elson, D. K., N. Dames, and K. R. McKeown. “Extracting Social Networks from Literary Fiction.” Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala, Sweden, 2010. 138-147.

Greenblatt, Stephen, et. al., Norton Anthology of English Literature 8th Edition, vol 2 (New York: WW Norton, 2006.

Houston, Natalie. “Towards a Computational Analysis of Victorian Poetics.” Victorian Studies 56.3 (Spring 2014): 498-510.

Nowviskie, Bethany. “Speculative Computing: Instruments for Interpretive Scholarship.” Ph.D dissertation, University of Virginia, 2004.

O’Connor, Brendan, David Bamman, and Noah Smith, “Computational Text Analysis for Social Science: Model Assumptions and Complexity,” NIPS Workshop on Computational Social Science, December 2011.

Piper, Andrew. “Novel Devotions: Conversional Reading, Computational Modeling, and the Modern Novel.” New Literary History 46.1 (2015).

Sculley, D., and Bradley M. Pasanek. “Meaning and Mining: The Impact of Implicit Assumptions in Data Mining for the Humanities.” Literary and Linguistic Computing 23.4 (2008): 409-24.

Shmueli, Galit. “To Explain or to Predict?” Statistical Science 25.3 (2010).

So, Richard Jean, and Hoyt Long, “Network Analysis and the Sociology of Modernism,” boundary 2 40.2 (2013).

Stallybrass, Peter. “Against Thinking.” PMLA 122.5 (2007): 1580-1587.

Williams, Jeffrey. “The New Modesty in Literary Criticism.” Chronicle of Higher Education January 5, 2015.

How quickly do literary standards change?

by Ted Underwood and Jordan Sellers

Part of this project will appear next year — revised and improved — in MLQ. But we’ve decided to release it as a free-standing draft rather than a preprint, because it allows us to use color and to explore some puzzling leads that won’t fit into the physical limits of one journal article.

To understand the aesthetic standards that govern reception, we contrasted two samples of English-language poetry, drawn from different social contexts: 1) a group of 360 volumes that we chose by sampling reviews in prominent periodicals, 1820-1919, and 2) a group of 360 volumes sampled at random from HathiTrust Digital Library, many of them pretty obscure.
We were curious whether the difference in prestige between these books would be legible in the texts themselves. For instance, could you train a statistical model to predict whether a volume of poetry came from the “reviewed” or “random” sample just by looking at diction? And if you could, what social difference exactly would you be detecting?

Scholars sometimes suggest that high culture hadn’t differentiated from the rest of the literary field very sharply yet in the early 19th century [1: Huyssen 1986]. If so, books of poetry reviewed in prestigious contexts might be hard to identify in that part of the timeline. It might get easier toward the 20th century, as different poetic styles specialized to address (say) “high” and “middlebrow” audiences.

On the other hand, if writers became prominent by occupying the leading edge of a rapidly-moving wave, we might only be able to separate these samples by training a sequence of different models for different periods. For instance, prominent poets in the 1820s might be united by gloomy Byronism; in the 1850s they might share an interest in history; by the 1890s what they had in common might be the word “mauve.” As for the randomly-selected volumes, who knows? Maybe they would share only a tendency to trail thirty years behind the trend.

Since it seemed reasonable to assume that the standards governing reception had been volatile, we began by training a different model of poetic prestige for each twenty-year period. But we found, in practice, that the best way to separate these samples was to treat the whole period 1820-1919 as a single unit organized by a single set of aesthetic standards. You can click on the image that follows to see a slightly larger and clearer version.


In the image above, each point is a volume of poetry, colored according to its actual social provenance. The y axis expresses a statistical model’s prediction about that provenance: How likely is it that this volume came from the “reviewed” sample, based only on the words in the volume?

As you can see, the model does a pretty decent job of sorting the two samples. It’s not right all the time, because of course a volume’s reception is determined by a lot of factors other than language (politics, the whims of reviewers, social networks). But the model is right 79.2% of the time, which is often enough to suggest that volumes reviewed in prominent venues had something in common. The sort of poetic language that got reviewed is distinguished from other poetic traditions not just toward the twentieth century, as we had expected, but throughout this period.

What’s even more puzzling is this: reviewed writers seem to have had the same thing in common throughout this century. The model is using essentially the same list of prestigious and banal words to separate Lord Byron from more obscure poets around 1819, and Christina Rossetti from more obscure poets around 1866, and T. S. Eliot from more obscure writers around 1917. That’s starting to sound like an oddly durable set of preferences. And actually, it’s even more durable than the image above suggests. A model trained on a quarter-century of the evidence can predict the other 75 years almost as accurately as a model trained on the whole century.

A model trained only on evidence from 1845-69 makes predictions about the other 75 years in the dataset.

A model trained only on evidence from 1845-69 makes predictions about the other 75 years in the dataset.

So how is it even possible to characterize a whole century of poetic reception — based on fourteen different periodicals from both sides of the Atlantic — with a single set of aesthetic standards? Weren’t there supposed to be a couple of “poetic revolutions” in this century somewhere? W. B. Yeats certainly thought that one happened in the 1890s [2].

There’s another curious detail implied in the image above: why is the boundary between “reviewed” and “random” volumes drifting upward across the timeline? Technically, that’s an error. Volumes are not really “more likely to be reviewed” just because they were published later. But this is an error of an interesting kind. The model doesn’t know when these volumes were published: the dataset drifts upward because words that were more common in reviewed volumes across this period turn out to be more common in all volumes by the end of the period. If you divide the timeline into parts, the same pattern recurs in each part; and — to leak a detail from the next stage of this project — it also happens when we model fiction. That starts to suggest an interestingly general connection between synchronic judgment and diachronic change.

And there’s more. The detailed differences between reviewed and random poetry are interesting. In the article, we examine a haunting passage from Christina Rossetti; it turns out the model likes “haunting.” We also generalize about the theory of representativeness underpinning distant reading, and ask how our contemporary pedagogical canon looks when viewed by nineteenth-century aesthetic standards.

But all this, obviously, is too much to discuss in a blog post. See the article itself for our actual attempt to understand these puzzles.

We’ve released our code and data on Github, and hope readers will find flaws in our reasoning so we can improve the project. But this draft has been bounced off a couple of audiences already; at this point it’s stable enough to be cited and criticized. So, after some reflection, we’ve closed comments on this post in order to encourage a more public sort of critique. If we’re overlooking something, please say so in a blog post. It’s an explicit premise of the project that “being reviewed at all indicates a sort of literary distinction — even if the review is negative.”

[1]: One influential thesis holds that this division crystallized “in the last decades of the 19th century and the first few years of the 20th.” Andreas Huyssen, After the Great Divide: Modernism, Mass Culture, Postmodernism (Bloomington: Indiana UP, 1986), viii.

[2]: W.B. Yeats dated the “revolt against Victorianism” and against “the poetical diction of everybody” to the 1890s. See discussion in Richard Fallis, “Yeats and the Reinterpretation of Victorian Poetry,” Victorian Poetry 14.2 (1976): 89-100.

Free research question about plot.

I think the whole syuzhet controversy is turning out to be fabulously productive.

I particularly enjoyed David Bamman’s latest contribution to the discussion, which begins to flesh out what validation might look like for questions about plot. Briefly, he got five human readers to evaluate the emotional pitch of different scenes in Romeo and Juliet, and visualized the range of their agreement over time.

It’s clear that there are differences; but it’s also clear that there’s a great deal of consensus. And not surprisingly. Romeo and Juliet is (spoiler alert) a tragedy, and the simple, strong difference in perceived tone between the first and second halves of the script is exactly what we might have expected.

David offered this brief project as an example of data one could use for validating methods, which it is. But mulling this over online with Ana-Maria Popescu (whose tweets are alas protected), I realized that David’s example might also help give us a sharper sense of the literary stakes of this whole discussion. Because of course the question arises, “Will the emotional trajectory of novels be as easy to chart as that of 16/17c drama?” We intuitively suspect not, and for good reason. As Popescu put it, “work … from that period (Elizabethan) would have a more clear pattern (bc. they used plot patterns).”

She’s right. It’s a well-worn thesis about the rise of the novel that the point of novelistic realism was, partly, to get away from the predictable trajectories of comedy, tragedy, and romance — to produce a messier arc with lots of contingent interruptions (people hate it when I cite this guy, but that’s Ian Watt’s conception of formal realism). If that’s true, David’s experiment might not work as well for novels.

Matt Jockers’ syuzhet package is based on a diametrically opposed account of novelistic plot, coming through Kurt Vonnegut. Vonnegut argued that novels are really still organized by a small number of predictable patterns moving, in fairly broad undulations, between fortune and misfortune. And … wait, that sounds plausible too.

The conflict between Vonnegut and Watt might give us a testable question with clear literary stakes. Are the perceived emotional trajectories of novels in fact more complex over time, or more uncertain at any given moment, than the perceived trajectories of (say) 17c comedy and tragedy? Watt says they should be. Vonnegut says no. To be sure, there are lots of complexities involved in answering this; “emotional valence” is still not very well defined. But with a question like this, where theories of the novel clash directly, it’s hard to fail — whatever you discover, you’re going to be overturning some well-documented received opinion.

There are potentially lots of ways to approach a problem like that. David’s sort of ground truth could be used as a foundation for predictive modeling, or we could use it to validate Jockers’ method. By the way, if anyone’s still interested in doing that, here’s the trajectory you get if you run Romeo and Juliet though syuzhet using afinn sentiment detection and a low-pass setting of 5. Compare it to Bamman’s human ground truth above. One example is not validation, and this is just an eyeball comparison, but it’s a pretty decent fit. And syuzhet was incredibly easy to install and run. I did this in literally five minutes. My gut is starting to tell me that’s a nice little R package Matt just gave away for free.

Then again, if predictive models or sentiment detection don’t work well enough to satisfy us, there’s no reason why a question like this couldn’t be pursued purely through human annotation. I don’t have time to tackle this question; I’m working on a different project where human ground truth is provided by reviewers. But I really think someone should go for it.

Robert Boyle's description of a controversial, leaky air-pump.

Robert Boyle’s description of a controversial, notoriously leaky air-pump.

For me the lesson of this conversation has also been that the open web and dissent are still good things. I’m glad Matt Jockers put syuzhet out there as a resource, and glad Annie Swafford critiqued it. I’ve been saying this reminds me of the Hobbes-Boyle dispute; I mean partly, as Anna Marie Roos points out in a review of Leviathan and the Air-Pump, that the clash between opposing interpretations in that case fruitfully advanced knowledge.

I also mean, of course, that experiments, with clearly defined predictive hypotheses, are good things.


PS: By the way, if anyone’s interested, here’s Romeo and Juliet smoothed with a rolling mean (using a 101-sentence window) rather than a Fourier transform. I still understand rolling means better, and I think the detail revealed here is interesting. The balcony scene is, unsurprisingly, the high point for human readers and sentiment detection alike. As David Bamman points out, readers are a bit divided about how to interpret the tone at the end of this tragedy. Syuzhet, however, considers it a downer.

And Bamman’s human readers again:

P.P.S: Thanks to David Wilson-Okamura for correcting my labeling of scenes.

Why it’s hard for syuzhet to be right or wrong yet.

I’ve enjoyed following the exchange between Matt Jockers, Annie Swafford, Jacob Eisenstein, and Dan Piepenbring about Jockers’ R package syuzhet — designed to illuminate plot by tracing the “emotional valence” of narration across the course of a novel.

I’ve found this a consistently impressive and informative conversation; it has taught me literally everything I know about “low-pass filters.” But I have no idea who is right or wrong.

More fundamentally, I’m unsure how anyone could be right or wrong here, because as far as I can tell there’s no thesis under discussion yet. Jockers’ article isn’t published. All we have is an R package, syuzhet, which does something I would call exploratory data analysis. And it’s hard to evaluate exploratory data analysis in the absence of a specific argument.

For instance, does syuzhet smooth plot arcs appropriately? I don’t know. Without a specific thesis we’re trying to test, how would we decide what scale of variation matters? In some novels it might be a scene-to-scene rhythm; in others it might be a long arc. Until I know what scale of variation matters for a particular question, I have no way of knowing what kind of smoothing is “too much” or “too little.”*

The same thing goes, more fundamentally, for the concepts of “plot” and “emotional valence” themselves. As Jacob Eisenstein has pointed out, these aren’t concepts that have a single agreed-upon meaning. To argue about them meaningfully, we’re going to need a particular historical or formal question we’re trying to solve.

It seems to me likely that syuzhet will usefully illuminate some aspects of plot. But I have no way of knowing which aspects until I look at a test involving groups of books that readers perceive as different in some specific way. For instance, if syuzhet reliably discriminates between books with tragic and comic endings, that would already be interesting. It’s not everything we mean by plot, but it’s one important thing.

The underlying issue here is that Matt hasn’t published his article yet. So we don’t actually have a thesis to debate. What we have is a new form of exploratory data analysis, released as an R package. Conversation about exploration can be interesting; it can teach me a lot about low-pass filters; but I don’t know how it could be wrong or right until I know what the exploration is trying to reveal.

I think this holds even for Matt’s claim that he’s identified six (or seven) fundamental plot patterns. That sounds like a thesis, but I would tend to say it’s still description of exploratory analysis — in this case a clustering process. Matt has done the clustering in a principled and careful way, but clustering is still (in my eyes) basically an exploratory method. I’m not sure how to evaluate it until I know what kind of generic or historical evidence would count as confirmation that we’re looking at a coherent “plot pattern.”

There are a range of ways to get that confirmation. Lynn Cherny has explored plot using supervised methods; if you do that, predictive accuracy gives you an easy test. But unsupervised methods can also be great, in cases where tests aren’t so easy to define; it’s just that an unsupervised method needs to be supplemented by historical or formal discussion that tells you what would count as confirmation for this method. I imagine there will be some of that in Matt’s article, when it comes out.

* [Edit March 31: After playing around with some artificial data myself, I have to acknowledge that the low-pass filter option in syuzhet can behave in unintuitive ways where extreme outliers and edges are involved. I think Annie Swafford (in blog posts) and Daniel Lepage (below) have been right to emphasize this. It could be less of an issue with real data; I had to use pretty extreme outliers to “break” the filter; it’s not actually the case that the whole shape is necessarily defined by its single highest point. But my guess is that this sort of filter would only add value if you wanted to build in a strong prior that plot fluctuates on or near a particular “wavelength.” On the other hand, Matt Jockers has alluded to unpublished evidence for that sort of prior (or at least for a particular filter setting). So, after changing my opinion a couple times, I’m still not feeling I have an answer here.]