Digital humanities might never be evenly distributed.

In an eloquent and pragmatic blog post about building the UCL Centre for Digital Humanities, Melissa Terras stresses the importance of rooting a DH center in local institutional culture, in order to “link people” across the whole spectrum from arts and humanities to computer science and engineering. It’s an impressive achievement that has clearly fostered a lot of significant work at UCL, and it has started to change my own way of thinking about this perplexing phrase “digital humanities.”

In the past, I’ve tended to understand “digital humanities” as an abstract term. If you understand it that way, it’s easy to see that it covers a whole range of disparate things, which has sometimes led me to predict that it would fall apart in the near future into a bunch of separate projects.

Alan Liu, “Map of Digital Humanities” — photo by Quinn Dombrowski at UC Berkeley, August 17, 2015. CC-BY-SA.

But as time passes and the darn thing refuses to fall apart, it seems appropriate to revisit that prediction. I still think digital humanities is hard to define, but apparently, being hard to define doesn’t prevent human institutions from enduring and growing. When I read it a few months ago, Melissa’s post made me reflect that “DH” doesn’t have to be defined abstractly at all. It could be understood, quite concretely, as an institutional achievement that happens to exist on some campuses and not others.

If you understand DH abstractly, as a rubric covering many different projects, there’s a lot of it going on here at UIUC. On the west side of campus, we have a leading school of Library and Information Science (GSLIS), which regularly offers courses on digital humanities, and is one of two institutions piloting HathiTrust Research Center. At the north end of campus, we have the National Center for Supercomputing Applications (NCSA), which excels at providing computational support for the arts, humanities, and social sciences. The Colleges of Media, and Fine and Applied Arts, and Liberal Arts and Sciences are home to a lot of individual scholars pursuing research on or critique of digital media, and the campus as a whole hosts ambitious experiments like Learning to See Systems, that combine technological practice and theory.

On the other hand, we’ve never had a digital humanities center or curricular initiative. We have a program called I-CHASS, at NCSA, which provides computational support to scholars who need it (my own work would have been impossible without their support). And Scholarly Commons, at the Library, helps faculty and students find the resources and training they need. But we don’t have any center of the kind Melissa describes, tasked with building a bridge between all the different people mentioned above, and getting them in the same room.

One way to view this would be: we’re lagging behind. Digital humanities is getting organized at Berkeley and Stanford and Iowa and the University of Pennsylvania and Yale. From time to time I think “we need to get something moving.”

And from time to time I try. But I rapidly discover the size of this campus, and the huge range of digitally-human projects already scattered across it, already moving (quite successfully) in diametrically opposed directions — and it occurs to me, first, that it would take superhuman effort to herd them into the same room, and second, that maybe UIUC doesn’t have a digital humanities center because it doesn’t need one. I’m finding all the resources I need over at GSLIS and NCSA; other kinds of projects are also humming along; maybe we’ve never developed a single center precisely because our various distributed centers are so strong.

There are some drawbacks to this arrangement — mainly, that the strengths of the institution are not well-publicized either internally or abroad. For instance, I’m sure some grad students in the humanities here don’t realize that GSLIS regularly offers excellent courses in digital humanities. I’m writing this blog post partly in hopes of flagging that kind of local opportunity.

I think it’s even harder for undergraduates to envision creative connections between the humanities and other subjects without some kind of interdisciplinary program as a model. This is probably the biggest drawback of our distributed structure, and I do feel I should do something about it. But given the way my own interests fit into the local landscape, I suspect the wheel I can put my shoulder to may be an undergraduate program in data science rather than digital humanities. It seems increasingly possible to me that “digital humanities” — as such — may never take institutional form on this campus. By the time we organize that curricular space, it may be occupied by several distinct projects.

That possibility is making me reflect that public discussion of this topic (as skeptical and wide-ranging as it has been) may still have been too quick to assume we’re all moving in the same direction. William Gibson’s famous quip that the future is already here — but not evenly distributed — encourages us to imagine two possible futures for experiments like digital humanities: either they are destined to (eventually) get distributed everywhere, or they will turn out to have been blind alleys.

I think it’s pretty clear at this point that digital humanities is not a blind alley; there’s too much valuable research and teaching being done under that rubric in too many places, and momentum is continuing to build. But I also doubt its institutions — DH centers and curricula — will ever be evenly distributed. I suspect this is going to be one of those disciplinary spaces that different institutions handle differently even over the long run. In some places, the concept of “DH” may be exactly the seed crystal a local culture needs to bring people together. In other places, institutional DH may fail to coalesce, although — or even because — the interdisciplinary projects it would have organized are separately thriving.

Free research question about plot.

I think the whole syuzhet controversy is turning out to be fabulously productive.

I particularly enjoyed David Bamman’s latest contribution to the discussion, which begins to flesh out what validation might look like for questions about plot. Briefly, he got five human readers to evaluate the emotional pitch of different scenes in Romeo and Juliet, and visualized the range of their agreement over time.

It’s clear that there are differences; but it’s also clear that there’s a great deal of consensus. And not surprisingly. Romeo and Juliet is (spoiler alert) a tragedy, and the simple, strong difference in perceived tone between the first and second halves of the script is exactly what we might have expected.

David offered this brief project as an example of data one could use for validating methods, which it is. But mulling this over online with Ana-Maria Popescu (whose tweets are alas protected), I realized that David’s example might also help give us a sharper sense of the literary stakes of this whole discussion. Because of course the question arises, “Will the emotional trajectory of novels be as easy to chart as that of 16/17c drama?” We intuitively suspect not, and for good reason. As Popescu put it, “work … from that period (Elizabethan) would have a more clear pattern (bc. they used plot patterns).”

She’s right. It’s a well-worn thesis about the rise of the novel that the point of novelistic realism was, partly, to get away from the predictable trajectories of comedy, tragedy, and romance — to produce a messier arc with lots of contingent interruptions (people hate it when I cite this guy, but that’s Ian Watt’s conception of formal realism). If that’s true, David’s experiment might not work as well for novels.

Matt Jockers’ syuzhet package is based on a diametrically opposed account of novelistic plot, coming through Kurt Vonnegut. Vonnegut argued that novels are really still organized by a small number of predictable patterns moving, in fairly broad undulations, between fortune and misfortune. And … wait, that sounds plausible too.

The conflict between Vonnegut and Watt might give us a testable question with clear literary stakes. Are the perceived emotional trajectories of novels in fact more complex over time, or more uncertain at any given moment, than the perceived trajectories of (say) 17c comedy and tragedy? Watt says they should be. Vonnegut says no. To be sure, there are lots of complexities involved in answering this; “emotional valence” is still not very well defined. But with a question like this, where theories of the novel clash directly, it’s hard to fail — whatever you discover, you’re going to be overturning some well-documented received opinion.

There are potentially lots of ways to approach a problem like that. David’s sort of ground truth could be used as a foundation for predictive modeling, or we could use it to validate Jockers’ method. By the way, if anyone’s still interested in doing that, here’s the trajectory you get if you run Romeo and Juliet though syuzhet using afinn sentiment detection and a low-pass setting of 5. Compare it to Bamman’s human ground truth above. One example is not validation, and this is just an eyeball comparison, but it’s a pretty decent fit. And syuzhet was incredibly easy to install and run. I did this in literally five minutes. My gut is starting to tell me that’s a nice little R package Matt just gave away for free.

Then again, if predictive models or sentiment detection don’t work well enough to satisfy us, there’s no reason why a question like this couldn’t be pursued purely through human annotation. I don’t have time to tackle this question; I’m working on a different project where human ground truth is provided by reviewers. But I really think someone should go for it.

Robert Boyle's description of a controversial, leaky air-pump.

Robert Boyle’s description of a controversial, notoriously leaky air-pump.

For me the lesson of this conversation has also been that the open web and dissent are still good things. I’m glad Matt Jockers put syuzhet out there as a resource, and glad Annie Swafford critiqued it. I’ve been saying this reminds me of the Hobbes-Boyle dispute; I mean partly, as Anna Marie Roos points out in a review of Leviathan and the Air-Pump, that the clash between opposing interpretations in that case fruitfully advanced knowledge.

I also mean, of course, that experiments, with clearly defined predictive hypotheses, are good things.


PS: By the way, if anyone’s interested, here’s Romeo and Juliet smoothed with a rolling mean (using a 101-sentence window) rather than a Fourier transform. I still understand rolling means better, and I think the detail revealed here is interesting. The balcony scene is, unsurprisingly, the high point for human readers and sentiment detection alike. As David Bamman points out, readers are a bit divided about how to interpret the tone at the end of this tragedy. Syuzhet, however, considers it a downer.

And Bamman’s human readers again:

P.P.S: Thanks to David Wilson-Okamura for correcting my labeling of scenes.

How to find English-language fiction, poetry, and drama in HathiTrust.

Although methods of analysis are more fun to discuss, the most challenging part of distant reading may still be locating the texts in the first place [1].

In principle, millions of books are available in digital libraries. But literary historians need collections organized by genre, and locating the fiction or poetry in a digital library is not as simple as it sounds. Older books don’t necessarily have genre information attached. (In HathiTrust, less than 40% of English-language fiction published before 1923 is tagged “fiction” in the appropriate MARC control field.)

Volume-level information wouldn’t be enough to guide machine reading in any case, because genres are mixed up inside volumes. For instance Hoyt Long, Richard So, and I recently published an article in Slate arguing (among other things) that references to specific amounts of money become steadily more common in fiction from 1825 to 1950.

Frequency of reference to "specific amounts" of money in 7,700 English-language works of fiction. Graphics from Wickham, ggplot2 [2].

Frequency of reference to “specific amounts” of money in 7,700 English-language works of fiction. Graphics here and throughout from Wickham, ggplot2 [2].

But Google’s “English Fiction” collection tells a very different story. The frequencies of many symbols that appear in prices (dollar signs, sixpence) skyrocket in the late nineteenth century, and then drop back by the early twentieth.

Frequencies of "$" and "6d" in Google's "English Fiction" collection, 1800-1950.

Frequencies of “$” and “6d” in Google’s “English Fiction” collection, 1800-1950.

On the other hand, several other words or symbols that tend to appear in advertisements for books follow a suspiciously similar trajectory.

Frequencies of "$", "8vo" (octavo) and "cloth" in Google's "English Fiction" collection, 1800-1950.

Frequencies of “$”, “8vo” (octavo) and “cloth” in Google’s “English Fiction” collection, 1800-1950.

What we see in Google’s “Fiction” collection is something that happens in volumes of fiction, but not exactly in the genre of fiction — the rise and fall of publishers’ catalogs in the backs of books [3]. Individually, these two- or three-page lists of titles for sale may not look like significant noise, but because they often mention prices, and are distributed unevenly across the timeline, they add up to a significant potential pitfall for anyone interested in the role of money in fiction.

I don’t say this to criticize the team behind the Ngram Viewer. Genre wasn’t central to their goals; they provided a rough “fiction” collection merely as a cherry on top of a massively successful public-humanities project. My point is just that genres fail to line up with volume boundaries in ways that can really matter for the questions scholars want to pose. (In fact, fiction may be the genre that comes closest to lining up with volume boundaries: drama and poetry often appear mixed in The Collected Poems and Plays of So-and-So, With a Prose Life of the Author.)

You can solve this problem by selecting works manually, or by borrowing proprietary collections from a vendor. Those are both good, practical solutions, especially up to (say) 1900. But because they rely on received bibliographies, they may not entirely fulfill the promises we’ve been making about dredging the depths of “the great unread,” boldly going where no one has gone before, etc [4]. Over the past two years, with support from the ACLS and NEH, I’ve been trying to develop another alternative — a way of starting with a whole library, and dividing it by genre at the page level, using machine learning.

In researching the Slate article, we relied on that automatic mapping of genre to select pages of fiction from HathiTrust. It helped us avoid conflating advertisements with fiction, and I hope other scholars will also find that it reduces the labor involved in creating large, genre-specific collections. The point of this blog post is to announce the release of a first version of the map we used (covering 854,476 English-language books in HathiTrust 1700-1922).

The whole dataset is available on Figshare, where it has a DOI and is citable as a publication. An interim report is also available; it addresses theoretical questions about genre, as well as questions about methods and data format. And the code we used for the project is available on Github.

For in-depth answers to questions, please consult the interim project report. It’s 47 pages long; it actually explains the project; this blog post doesn’t. But here are a few quick FAQs just so you can decide whether to read further.

“What categories did you try to separate?”

We identify pages as paratext (front matter, back matter, ads), prose nonfiction, poetry (narrative and lyric are grouped together), drama (including verse drama), or prose fiction. The report discusses the rationale for these choices, but other choices would be possible.

“How accurate is this map?”

Since genres are social institutions, questions about accuracy are relative to human dissensus. Our pairs of human readers agreed about the five categories just mentioned for 94.5% of the pages they tagged [5]. Relying on two-out-of-three voting (among other things), we boiled those varying opinions down to a human consensus, and our model agreed with the consensus 93.6% of the time. So this map is nearly as accurate as we might expect crowdsourcing to be. But it covers 276 million pages. For full details, see the confusion matrices in the report. Also, note that we provide ways of adjusting the tradeoff between recall and precision to fit a researcher’s top priority — which could be catching everything that might belong in a genre, or filtering out everything that doesn’t belong. We provide filtered collections of drama, fiction, and poetry for scholars who want to work with datasets that are 97-98% precise.

“You just wrote a blog post admitting that even simple generic boundaries like fiction/nonfiction are blurry and contested. So how can we pretend to stabilize a single map of genre?”

The short answer: we can’t. I don’t expect the genre predictions in this dataset to be more than one resource among many. We’ve also designed this dataset to have a certain amount of flexibility. There are confidence metrics associated with each volume, and users can define their collection of, say, poetry more broadly or narrowly by adjusting the confidence thresholds for inclusion. So even this dataset is not really a single map.

“What about divisions below the page level?”

With the exception of divisions between running headers and body text, we don’t address them. There are certainly a wide range of divisions below the page level that can matter, but we didn’t feel there was much to be gained by trying to solve all those problems at the same time as page-level mapping. In many cases, divisions below the page level are logically a subsequent step.

“How would I actually use this map to find stuff?”

There are three different ways — see “How to use this data?” in the interim report. If you’re working with HathiTrust Research Center, you could use this data to define a workset in their portal. Alternatively, if your research question can be answered with word frequencies, you could download public page-level features from HTRC and align them with our genre predictions on your own machine to produce a dataset of word counts from “only pages that have a 97% probability of being prose fiction,” or what have you. (HTRC hasn’t released feature counts for all the volumes we mapped yet, but they’re about to.) You can also align our predictions directly with HathiTrust zip files, if you have those. The pagealigner module in the utilities subfolder of our Github repo is intended as a handy shortcut for people who use Python; it will work both with HT zip files and HTRC feature files, aligning them with our genre predictions and returning a list of pages zipped with genre codes.

Is this sort of collection really what I need for my project?

Maybe not. There are a lot of books in HathiTrust. But as I admitted in my last post, a medium-sized collection based on bibliographies may be a better starting point for most scholars. Library-based collections include things like reprints, works in translation, juvenile fiction, and so on, that could be viewed as giving a fuller picture of literary culture … or could be viewed as messy complicating factors. I don’t mean to advocate for a library-based approach; I’m just trying to expand the range of alternatives we have available.

“What if I want to find fiction in French books between 1900 and 1970?”

Although we’ve made our code available as a resource, we definitely don’t want to represent it as a “tool” that could simply be pointed at other collections to do the same kind of genre mapping. Much of the work involved in this process is domain-specific (for instance, you have to develop page-level training data in a particular language and period). So this is better characterized as a method than a tool, and the report is probably more important than the repo. I plan to continue expanding the English-language map into the twentieth century (algorithmic mapping of genre may in fact be especially necessary for distant reading behind the veil of copyright). But I don’t personally have plans to expand this map to other languages; I hope someone else will take up that task.

As a reward for reading this far, here’s a visualization of the relative sizes of genres across time, represented as a percentage of pages in the English-language portion of HathiTrust.

The relative sizes of different genres, represented as a percentage of pages in the English-language portion of HathiTrust. 854,476 volumes are covered. Nonfiction, front matter, and back matter aren't represented here. Results have been smoothed with a five-year moving average.

The relative sizes of different genres, represented as a percentage of pages in the English-language portion of HathiTrust. 854,476 volumes are covered. Nonfiction, front matter, and back matter aren’t represented here. Results have been smoothed with a five-year moving average. Click through to enlarge.

The image is discussed at more length in the interim progress report.


The blog post above often slips awkwardly into first-person plural, because I’m describing a project that involved a lot of people. Parts of the code involved were written by Michael L. Black and Boris Capitanu. The code also draws on machine learning libraries in Weka and Scikit-Learn [6, 7]. Shawn Ballard organized the process of gathering training data, assisted by Jonathan Cheng, Nicole Moore, Clara Mount, and Lea Potter. The project also depended on collaboration and conversation with a wide range of people at HathiTrust Digital Library, HathiTrust Research Center, and the University of Illinois Library, including but not limited to Loretta Auvil, Timothy Cole, Stephen Downie, Colleen Fallaw, Harriett Green, Myung-Ja Han, Jacob Jett, and Jeremy York. Jana Diesner and David Bamman offered useful advice about machine learning. Essential material support was provided by a Digital Humanities Start-Up Grant from the National Endowment for the Humanities and a Digital Innovation Fellowship from the American Council of Learned Societies. None of these people or agencies should be held responsible for mistakes.


[1] Perhaps it goes without saying, since the phrase has now lost its quotation marks, but “distant reading” is Franco Moretti, “Conjectures on World Literature,” New Left Review 1 (2000).

[2] Hadley Wickham, ggplot2: Elegant Graphics for Data Analysis. http: // Springer New York, 2009.

[3] Having mapped advertisements in volumes of fiction, I’m pretty certain that they’re responsible for the spike in dollar signs in Google’s “English Fiction” collection. The collection I mapped overlaps heavily with Google Books, and the number of pages of ads in fiction volumes tracks very closely with the frequency of dollars signs, “8vo,” and so on.

Percentage of pages in mostly-fiction volumes that are ads. Based on a filtered collection of 102,349 mostly-fiction volumes selected from a larger group of 854,476 volumes 1700-1922.

Percentage of pages in mostly-fiction volumes that are ads. Based on a filtered collection of 102,349 mostly-fiction volumes selected from a larger group of 854,476 volumes 1700-1922. Five-year moving average.

[4] “The great unread” comes from Margaret Cohen, The Sentimental Education of the Novel (Princeton NJ: Princeton University Press, 1999), 23.

[5] See the interim report (subsection, “Evaluating Confusion Matrices”) for a fuller description; it gets complicated, because we actually assessed accuracy in terms of the number of words misclassified, although the classification was taking place at a page level.

[6] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[7] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: An update. SIGKDD Explorations, 11(1), 2009.

What no one tells you about the digital humanities.

There are already several great posts out there that exhaustively list resources and starting points for people getting into DH (a lot of them are by Lisa Spiro, who is good at it).

Opportunities are not always well signposted.

This will be a shorter list. I’m still new enough at this to remember what surprised me in the early going, and there were two areas where my previous experience in the academy failed to prepare me for the fluid nature of this field.

1) I had no idea, going into this, just how active a scholarly field could be online. Things are changing rapidly — copyright lawsuits, new tools, new ideas. To find out what’s happening, I think it’s actually vital to lurk on Twitter. Before I got on Twitter, I was flying blind, and didn’t even realize it. Start by following Brett Bobley, head of the Office of Digital Humanities at the NEH. Then follow everyone else.

2) The technical aspect of the field is important — too important, in many cases, to be delegated. You need to get your hands dirty. But the technical aspect is also much less of an obstacle than I originally assumed. There’s an amazing amount of information on the web, and you can teach yourself to do almost anything in a couple of weekends.* Realizing that you can is half the battle. For a pep talk / inspiring example, try this great narrative by Tim Sherratt.

That’s it. If you want more information, see the links to Lisa Spiro and DiRT at the top of this post. Lisa is right, by the way, that the place to start is with a particular problem you want to solve. Don’t dutifully acquire skills that you think you’re supposed to have for later use. Just go solve that problem!

* ps: Technical obstacles are minor even if you want to work with “big data.” We’re at a point now where you can harvest your own big data — big, at least, by humanistic standards. Hardware limitations are not quite irrelevant, but you won’t hit them for the first year or so, though you may listen anxiously while that drive grinds much more than you’re used to …