Over the last decade, the (small) fraction of articles in the humanities that use numbers has slowly grown. This is happening partly because computational methods are becoming flexible enough to represent a wider range of humanistic evidence. We can model concepts and social practices, for instance, instead of just counting people and things.
That’s exciting, but flexibility also makes arguments complex and hard to review. Journal editors in the humanities may not have a long list of reviewers who can evaluate statistical models. So while quantitative articles certainly encounter some resistance, they don’t always get the kind of detailed resistance they need. I thought it might be useful to stir up conversation on this topic with a few suggestions, aimed less at the DH community than at the broader community of editors and reviewers in the humanities. I’ll start with proposals where I think there’s consensus, and get more opinionated as I go along.
1. Ask to see code and data.
Getting an informed reviewer is a great first step. But to be honest, there’s not a lot of consensus yet about many methodological questions in the humanities. What we need is not so much strict gatekeeping as transparent debate.
As computational methods spread in the sciences, scientists have realized that it’s impossible to discuss this work fruitfully if you can’t see how the work was done. Journals like Cultural Analytics reflect this emerging consensus with policies that require authors to share code and data. But mainstream humanities journals don’t usually have a policy in place yet.
Three or four years ago, confusion on this topic was understandable. But in 2018, journals that accept quantitative evidence at all need a policy that requires authors to share code and data when they submit an article for review, and to make it public when the article is published.
I don’t think the details of that policy matter deeply. There are lots of different ways to archive code and data; they are all okay. Special cases and quibbles can be accommodated. For instance, texts covered by copyright (or other forms of IP) need not be shared in their original form. Derived data can be shared instead; that’s usually fine. (Ideally one might also share the code used to derive it.)
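For example, here’s a minimal R sketch of the kind of derived data one might release instead of a copyrighted text (the filename is hypothetical): word counts preserve much of what a model needs without reproducing the work itself.

```r
# Share derived features (word counts) instead of the copyrighted text itself.
raw <- readLines("copyrighted_novel.txt")            # hypothetical local file
words <- unlist(strsplit(tolower(paste(raw, collapse = " ")), "[^a-z']+"))
words <- words[words != ""]
counts <- as.data.frame(table(word = words), stringsAsFactors = FALSE)
counts <- counts[order(-counts$Freq), ]
write.csv(counts, "derived_word_counts.csv", row.names = FALSE)
```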
2. … especially code.
Humanists are usually skeptical enough about the data underpinning an argument, because decades of debate about canons have trained us to pose questions about the works an author chooses to discuss.
But we haven’t been trained to pose questions about the magnitude of a pattern, or the degree of uncertainty surrounding it. These aspects of a mathematical argument often deserve more discussion than an author initially provides, and to discuss them, we’re going to need to see the code.
I don’t think we should expect code to be polished, or to run easily on any machine. Writing an article doesn’t commit the author to produce an elegant software tool. (In fact, to be blunt, “it’s okay for academic software to suck.”) The author just needs to document what they did, and the best way to do that is to share the code and data they actually used, warts and all.
3. Reproducibility is great, but replication is the real point.
Ideally, the code and data supporting an article should permit a reader to reproduce all the stages of analysis the author(s) originally performed. When this is true, we say the research is “reproducible.”
But there are often rough spots in reproducibility. Stochastic processes may not run exactly the same way each time, for instance.
At this point, people who study reproducibility professionally will crowd forward and offer an eleven-point plan for addressing all rough spots. (“You just set the random number seed so it’s predictable …”)
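To be fair, that particular fix really is a one-liner; a minimal sketch:

```r
# Fixing the random seed makes a stochastic step (sampling a corpus,
# initializing a topic model, etc.) return the same result on every run.
set.seed(42)
sample(1:100, 5)   # now yields the same five "random" numbers every time
```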
That’s wonderful, if we really want to polish a system that allows a reader to push a button and get the same result as the original researcher, to the seventh decimal place. But in the humanities, we’re not always at the “polishing” stage of inquiry yet. Often, our question is more like “could this conceivably work? and if so, would it matter?”
In short, I think we shouldn’t let the imperative to share code foster a premature perfectionism. Our ultimate goal is not to prove that you get exactly the same result as the author if you use exactly the same assumptions and the same books. It’s to decide whether the experiment is revealing anything meaningful about the human past. And to decide that, we probably want to repeat the author’s question using different assumptions and a different sample of books.
When we do that, we are not reproducing the argument but replicating it. (See Language Log for a fuller discussion of the difference.) Replication is the real prize in most cases; that’s how knowledge advances. So the point of sharing code and data is often less to stabilize the results of your own work to the seventh decimal place, and more to guide investigators who may want to undertake parallel inquiries. (For instance, Jonathan Goodwin borrowed some of my code to pose a parallel question about Darko Suvin’s model of science fiction.)
I admit this is personal opinion. But I stress replication over reproducibility because it has some implications for the spirit of the whole endeavor. Since people often imagine that quantitative problems have a right answer, we may initially imagine that the point of sharing code and data is simply to catch mistakes.
In my view the point is rather to permit a (mathematical) conversation about the interpretation of the human past. I hope authors and readers will understand themselves as delayed collaborators, working together to explore different options. What if we did X differently? What if we tried a different sample of books? Usually neither sample is wrong, and neither is right. The point is to understand how much different interpretive assumptions do or don’t change our conclusions. In a sense no single article can answer that question “correctly”; it’s a question that has to be solved collectively, by returning to questions and adjusting the way we frame them. The real point of code-sharing is to permit that kind of delayed collaboration.
In an eloquent and pragmatic blog post about building the UCL Centre for Digital Humanities, Melissa Terras stresses the importance of rooting a DH center in local institutional culture, in order to “link people” across the whole spectrum from arts and humanities to computer science and engineering. It’s an impressive achievement that has clearly fostered a lot of significant work at UCL, and it has started to change my own way of thinking about this perplexing phrase “digital humanities.”
Alan Liu, “Map of Digital Humanities.” Photo by Quinn Dombrowski at UC Berkeley, August 17, 2015. CC-BY-SA.

But as time passes and the darn thing refuses to fall apart, it seems appropriate to revisit that prediction. I still think digital humanities is hard to define, but apparently, being hard to define doesn’t prevent human institutions from enduring and growing. When I read it a few months ago, Melissa’s post made me reflect that “DH” doesn’t have to be defined abstractly at all. It could be understood, quite concretely, as an institutional achievement that happens to exist on some campuses and not others.
If you understand DH abstractly, as a rubric covering many different projects, there’s a lot of it going on here at UIUC. On the west side of campus, we have a leading school of Library and Information Science (GSLIS), which regularly offers courses on digital humanities, and is one of two institutions piloting the HathiTrust Research Center. At the north end of campus, we have the National Center for Supercomputing Applications (NCSA), which excels at providing computational support for the arts, humanities, and social sciences. The Colleges of Media, Fine and Applied Arts, and Liberal Arts and Sciences are home to a lot of individual scholars pursuing research on or critique of digital media, and the campus as a whole hosts ambitious experiments like Learning to See Systems that combine technological practice and theory.
On the other hand, we’ve never had a digital humanities center or curricular initiative. We have a program called I-CHASS, at NCSA, which provides computational support to scholars who need it (my own work would have been impossible without their support). And Scholarly Commons, at the Library, helps faculty and students find the resources and training they need. But we don’t have any center of the kind Melissa describes, tasked with building a bridge between all the different people mentioned above, and getting them in the same room.
One way to view this would be: we’re lagging behind. Digital humanities is getting organized at Berkeley and Stanford and Iowa and the University of Pennsylvania and Yale. From time to time I think “we need to get something moving.”
And from time to time I try. But I rapidly discover the size of this campus, and the huge range of digitally-human projects already scattered across it, already moving (quite successfully) in diametrically opposed directions — and it occurs to me, first, that it would take superhuman effort to herd them into the same room, and second, that maybe UIUC doesn’t have a digital humanities center because it doesn’t need one. I’m finding all the resources I need over at GSLIS and NCSA; other kinds of projects are also humming along; maybe we’ve never developed a single center precisely because our various distributed centers are so strong.
There are some drawbacks to this arrangement — mainly, that the strengths of the institution are not well-publicized either internally or abroad. For instance, I’m sure some grad students in the humanities here don’t realize that GSLIS regularly offers excellent courses in digital humanities. I’m writing this blog post partly in hopes of flagging that kind of local opportunity.
I think it’s even harder for undergraduates to envision creative connections between the humanities and other subjects without some kind of interdisciplinary program as a model. This is probably the biggest drawback of our distributed structure, and I do feel I should do something about it. But given the way my own interests fit into the local landscape, I suspect the wheel I can put my shoulder to may be an undergraduate program in data science rather than digital humanities. It seems increasingly possible to me that “digital humanities” — as such — may never take institutional form on this campus. By the time we organize that curricular space, it may be occupied by several distinct projects.
That possibility is making me reflect that public discussion of this topic (as skeptical and wide-ranging as it has been) may still have been too quick to assume we’re all moving in the same direction. William Gibson’s famous quip that the future is already here — but not evenly distributed — encourages us to imagine two possible futures for experiments like digital humanities: either they are destined to (eventually) get distributed everywhere, or they will turn out to have been blind alleys.
I think it’s pretty clear at this point that digital humanities is not a blind alley; there’s too much valuable research and teaching being done under that rubric in too many places, and momentum is continuing to build. But I also doubt its institutions — DH centers and curricula — will ever be evenly distributed. I suspect this is going to be one of those disciplinary spaces that different institutions handle differently even over the long run. In some places, the concept of “DH” may be exactly the seed crystal a local culture needs to bring people together. In other places, institutional DH may fail to coalesce, although — or even because — the interdisciplinary projects it would have organized are separately thriving.
It’s clear that there are differences; but it’s also clear that there’s a great deal of consensus. And not surprisingly. Romeo and Juliet is (spoiler alert) a tragedy, and the simple, strong difference in perceived tone between the first and second halves of the script is exactly what we might have expected.
David offered this brief project as an example of data one could use for validating methods, which it is. But mulling this over online with Ana-Maria Popescu (whose tweets are alas protected), I realized that David’s example might also help give us a sharper sense of the literary stakes of this whole discussion. Because of course the question arises, “Will the emotional trajectory of novels be as easy to chart as that of 16/17c drama?” We intuitively suspect not, and for good reason. As Popescu put it, “work … from that period (Elizabethan) would have a more clear pattern (bc. they used plot patterns).”
She’s right. It’s a well-worn thesis about the rise of the novel that the point of novelistic realism was, partly, to get away from the predictable trajectories of comedy, tragedy, and romance — to produce a messier arc with lots of contingent interruptions (people hate it when I cite this guy, but that’s Ian Watt’s conception of formal realism). If that’s true, David’s experiment might not work as well for novels.
The conflict between Vonnegut and Watt might give us a testable question with clear literary stakes. Are the perceived emotional trajectories of novels in fact more complex over time, or more uncertain at any given moment, than the perceived trajectories of (say) 17c comedy and tragedy? Watt says they should be. Vonnegut says no. To be sure, there are lots of complexities involved in answering this; “emotional valence” is still not very well defined. But with a question like this, where theories of the novel clash directly, it’s hard to fail — whatever you discover, you’re going to be overturning some well-documented received opinion.
There are potentially lots of ways to approach a problem like that. David’s sort of ground truth could be used as a foundation for predictive modeling, or we could use it to validate Jockers’ method. By the way, if anyone’s still interested in doing that, here’s the trajectory you get if you run Romeo and Juliet through syuzhet using afinn sentiment detection and a low-pass setting of 5. Compare it to Bamman’s human ground truth above. One example is not validation, and this is just an eyeball comparison, but it’s a pretty decent fit. And syuzhet was incredibly easy to install and run. I did this in literally five minutes. My gut is starting to tell me that’s a nice little R package Matt just gave away for free.
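If anyone wants to repeat that eyeball test, the whole thing is only a few lines of R; a minimal sketch, assuming you have a plain-text copy of the play on disk (the filename is hypothetical):

```r
library(syuzhet)

# Sentence-level sentiment for Romeo and Juliet, using the afinn lexicon.
play <- get_text_as_string("romeo_and_juliet.txt")   # hypothetical local file
sentences <- get_sentences(play)
raw_values <- get_sentiment(sentences, method = "afinn")

# Fourier-based smoothing with a low-pass setting of 5.
smoothed <- get_transformed_values(raw_values, low_pass_size = 5)
plot(smoothed, type = "l",
     xlab = "narrative time", ylab = "emotional valence")
```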
Then again, if predictive models or sentiment detection don’t work well enough to satisfy us, there’s no reason why a question like this couldn’t be pursued purely through human annotation. I don’t have time to tackle this question; I’m working on a different project where human ground truth is provided by reviewers. But I really think someone should go for it.
Robert Boyle’s description of a controversial, notoriously leaky air-pump.
I also mean, of course, that experiments, with clearly defined predictive hypotheses, are good things.
PS: By the way, if anyone’s interested, here’s Romeo and Juliet smoothed with a rolling mean (using a 101-sentence window) rather than a Fourier transform. I still understand rolling means better, and I think the detail revealed here is interesting. The balcony scene is, unsurprisingly, the high point for human readers and sentiment detection alike. As David Bamman points out, readers are a bit divided about how to interpret the tone at the end of this tragedy. Syuzhet, however, considers it a downer.
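The rolling-mean version is just as quick; a sketch, reusing the raw afinn values from the snippet above:

```r
# A centered 101-sentence moving average over the raw sentiment values,
# instead of the Fourier-based low-pass filter.
window <- 101
rolling <- stats::filter(raw_values, rep(1 / window, window), sides = 2)
plot(as.numeric(rolling), type = "l",
     xlab = "sentence number", ylab = "mean emotional valence")
```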
And Bamman’s human readers again:
P.P.S: Thanks to David Wilson-Okamura for correcting my labeling of scenes.
The relative sizes of different genres, represented as a percentage of pages in the English-language portion of HathiTrust. 854,476 volumes are covered. Nonfiction, front matter, and back matter aren't represented here. Results have been smoothed with a five-year moving average.
Although methods of analysis are more fun to discuss, the most challenging part of distant reading may still be locating the texts in the first place [1].
In principle, millions of books are available in digital libraries. But literary historians need collections organized by genre, and locating the fiction or poetry in a digital library is not as simple as it sounds. Older books don’t necessarily have genre information attached. (In HathiTrust, less than 40% of English-language fiction published before 1923 is tagged “fiction” in the appropriate MARC control field.)
Volume-level information wouldn’t be enough to guide machine reading in any case, because genres are mixed up inside volumes. For instance, Hoyt Long, Richard So, and I recently published an article in Slate arguing (among other things) that references to specific amounts of money become steadily more common in fiction from 1825 to 1950.
Frequency of reference to “specific amounts” of money in 7,700 English-language works of fiction. Graphics here and throughout from Wickham, ggplot2 [2].
But Google’s “English Fiction” collection tells a very different story. The frequencies of many symbols that appear in prices (dollar signs, “6d” for sixpence) skyrocket in the late nineteenth century, and then drop back by the early twentieth.
Frequencies of “$” and “6d” in Google’s “English Fiction” collection, 1800-1950.
Frequencies of “$”, “8vo” (octavo) and “cloth” in Google’s “English Fiction” collection, 1800-1950.
What we see in Google’s “Fiction” collection is something that happens in volumes of fiction, but not exactly in the genre of fiction — the rise and fall of publishers’ catalogs in the backs of books [3]. Individually, these two- or three-page lists of titles for sale may not look like significant noise, but because they often mention prices, and are distributed unevenly across the timeline, they add up to a significant potential pitfall for anyone interested in the role of money in fiction.
I don’t say this to criticize the team behind the Ngram Viewer. Genre wasn’t central to their goals; they provided a rough “fiction” collection merely as a cherry on top of a massively successful public-humanities project. My point is just that genres fail to line up with volume boundaries in ways that can really matter for the questions scholars want to pose. (In fact, fiction may be the genre that comes closest to lining up with volume boundaries: drama and poetry often appear mixed in The Collected Poems and Plays of So-and-So, With a Prose Life of the Author.)
You can solve this problem by selecting works manually, or by borrowing proprietary collections from a vendor. Those are both good, practical solutions, especially up to (say) 1900. But because they rely on received bibliographies, they may not entirely fulfill the promises we’ve been making about dredging the depths of “the great unread,” boldly going where no one has gone before, etc [4]. Over the past two years, with support from the ACLS and NEH, I’ve been trying to develop another alternative — a way of starting with a whole library, and dividing it by genre at the page level, using machine learning.
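This is emphatically not the model we actually used (the interim report describes that in detail), but a toy sketch may convey the general shape of page-level classification; here synthetic data stands in for page-level word counts and human genre labels:

```r
library(glmnet)

# Synthetic stand-ins: 200 "pages," each described by 50 word-count features,
# with human-assigned genre labels for training.
set.seed(1)
page_features <- matrix(rpois(200 * 50, lambda = 2), nrow = 200)
page_labels <- factor(sample(c("fiction", "poetry", "drama", "nonfiction"),
                             200, replace = TRUE))

# Regularized multinomial logistic regression, tuned by cross-validation.
fit <- cv.glmnet(page_features, page_labels, family = "multinomial")

# Predict genres for new pages (here, just the first five training pages).
predict(fit, newx = page_features[1:5, ], s = "lambda.min", type = "class")
```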
In researching the Slate article, we relied on that automatic mapping of genre to select pages of fiction from HathiTrust. It helped us avoid conflating advertisements with fiction, and I hope other scholars will also find that it reduces the labor involved in creating large, genre-specific collections. The point of this blog post is to announce the release of a first version of the map we used (covering 854,476 English-language books in HathiTrust 1700-1922).
We identify pages as paratext (front matter, back matter, ads), prose nonfiction, poetry (narrative and lyric are grouped together), drama (including verse drama), or prose fiction. The report discusses the rationale for these choices, but other choices would be possible.
“How accurate is this map?”
Since genres are social institutions, questions about accuracy are relative to human dissensus. Our pairs of human readers agreed about the five categories just mentioned for 94.5% of the pages they tagged [5]. Relying on two-out-of-three voting (among other things), we boiled those varying opinions down to a human consensus, and our model agreed with the consensus 93.6% of the time. So this map is nearly as accurate as we might expect crowdsourcing to be. But it covers 276 million pages. For full details, see the confusion matrices in the report. Also, note that we provide ways of adjusting the tradeoff between recall and precision to fit a researcher’s top priority — which could be catching everything that might belong in a genre, or filtering out everything that doesn’t belong. We provide filtered collections of drama, fiction, and poetry for scholars who want to work with datasets that are 97-98% precise.
“Can we ever settle, once and for all, which pages count as fiction or poetry?”
The short answer: we can’t. I don’t expect the genre predictions in this dataset to be more than one resource among many. We’ve also designed this dataset to have a certain amount of flexibility. There are confidence metrics associated with each volume, and users can define their collection of, say, poetry more broadly or narrowly by adjusting the confidence thresholds for inclusion. So even this dataset is not really a single map.
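For instance, narrowing or broadening a collection comes down to filtering on those confidence values; a schematic sketch (the filename and column names here are hypothetical; the release documents its own formats):

```r
# Volume-level predictions, flattened to a table with (hypothetical) columns:
# htid, genre, confidence.
volumes <- read.csv("volume_genre_confidence.csv", stringsAsFactors = FALSE)

# A strict poetry collection (higher precision, lower recall) ...
strict_poetry <- subset(volumes, genre == "poetry" & confidence >= 0.95)
# ... or a broad one (higher recall, lower precision).
broad_poetry <- subset(volumes, genre == "poetry" & confidence >= 0.50)
```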
“What about divisions below the page level?”
With the exception of divisions between running headers and body text, we don’t address them. There is certainly a wide range of divisions below the page level that can matter, but we didn’t feel there was much to be gained by trying to solve all those problems at the same time as page-level mapping. In many cases, divisions below the page level are logically a subsequent step.
“How would I actually use this map to find stuff?”
There are three different ways — see “How to use this data?” in the interim report. If you’re working with HathiTrust Research Center, you could use this data to define a workset in their portal. Alternatively, if your research question can be answered with word frequencies, you could download public page-level features from HTRC and align them with our genre predictions on your own machine to produce a dataset of word counts from “only pages that have a 97% probability of being prose fiction,” or what have you. (HTRC hasn’t released feature counts for all the volumes we mapped yet, but they’re about to.) You can also align our predictions directly with HathiTrust zip files, if you have those. The pagealigner module in the utilities subfolder of our Github repo is intended as a handy shortcut for people who use Python; it will work both with HT zip files and HTRC feature files, aligning them with our genre predictions and returning a list of pages zipped with genre codes.
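For readers who go the second route, the logic is simple even if the file formats take some wrangling; a schematic sketch, assuming both datasets have been flattened to tables with matching volume and page columns (the column names are hypothetical; the real genre maps and HTRC feature files are structured differently, and Python users can lean on the pagealigner utility instead):

```r
# Hypothetical flattened inputs:
#   page_genre_predictions.csv : htid, page, genre, probability
#   page_word_counts.csv       : htid, page, word, count
genres <- read.csv("page_genre_predictions.csv", stringsAsFactors = FALSE)
features <- read.csv("page_word_counts.csv", stringsAsFactors = FALSE)

# Keep only pages with at least a 97% probability of being prose fiction,
# then attach their word counts.
fiction_pages <- subset(genres, genre == "fiction" & probability >= 0.97)
fiction_counts <- merge(features, fiction_pages, by = c("htid", "page"))

# fiction_counts now holds word counts drawn only from confidently
# identified pages of fiction.
```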
“Is this sort of collection really what I need for my project?”
Maybe not. There are a lot of books in HathiTrust. But as I admitted in my last post, a medium-sized collection based on bibliographies may be a better starting point for most scholars. Library-based collections include things like reprints, works in translation, juvenile fiction, and so on, that could be viewed as giving a fuller picture of literary culture … or could be viewed as messy complicating factors. I don’t mean to advocate for a library-based approach; I’m just trying to expand the range of alternatives we have available.
“What if I want to find fiction in French books between 1900 and 1970?”
Although we’ve made our code available as a resource, we definitely don’t want to represent it as a “tool” that could simply be pointed at other collections to do the same kind of genre mapping. Much of the work involved in this process is domain-specific (for instance, you have to develop page-level training data in a particular language and period). So this is better characterized as a method than a tool, and the report is probably more important than the repo. I plan to continue expanding the English-language map into the twentieth century (algorithmic mapping of genre may in fact be especially necessary for distant reading behind the veil of copyright). But I don’t personally have plans to expand this map to other languages; I hope someone else will take up that task.
As a reward for reading this far, here’s a visualization of the relative sizes of genres across time, represented as a percentage of pages in the English-language portion of HathiTrust.
The relative sizes of different genres, represented as a percentage of pages in the English-language portion of HathiTrust. 854,476 volumes are covered. Nonfiction, front matter, and back matter aren’t represented here. Results have been smoothed with a five-year moving average. Click through to enlarge.
The blog post above often slips awkwardly into first-person plural, because I’m describing a project that involved a lot of people. Parts of the code involved were written by Michael L. Black and Boris Capitanu. The code also draws on machine learning libraries in Weka and Scikit-Learn [6, 7]. Shawn Ballard organized the process of gathering training data, assisted by Jonathan Cheng, Nicole Moore, Clara Mount, and Lea Potter. The project also depended on collaboration and conversation with a wide range of people at HathiTrust Digital Library, HathiTrust Research Center, and the University of Illinois Library, including but not limited to Loretta Auvil, Timothy Cole, Stephen Downie, Colleen Fallaw, Harriett Green, Myung-Ja Han, Jacob Jett, and Jeremy York. Jana Diesner and David Bamman offered useful advice about machine learning. Essential material support was provided by a Digital Humanities Start-Up Grant from the National Endowment for the Humanities and a Digital Innovation Fellowship from the American Council of Learned Societies. None of these people or agencies should be held responsible for mistakes.
References
[1] Perhaps it goes without saying, since the phrase has now lost its quotation marks, but “distant reading” is Franco Moretti, “Conjectures on World Literature,” New Left Review 1 (2000).
[2] Hadley Wickham, ggplot2: Elegant Graphics for Data Analysis (New York: Springer, 2009), http://had.co.nz/ggplot2/book.
[3] Having mapped advertisements in volumes of fiction, I’m pretty certain that they’re responsible for the spike in dollar signs in Google’s “English Fiction” collection. The collection I mapped overlaps heavily with Google Books, and the number of pages of ads in fiction volumes tracks very closely with the frequency of dollar signs, “8vo,” and so on.
Percentage of pages in mostly-fiction volumes that are ads. Based on a filtered collection of 102,349 mostly-fiction volumes selected from a larger group of 854,476 volumes 1700-1922. Five-year moving average.
[4] “The great unread” comes from Margaret Cohen, The Sentimental Education of the Novel (Princeton NJ: Princeton University Press, 1999), 23.
[5] See the interim report (subsection, “Evaluating Confusion Matrices”) for a fuller description; it gets complicated, because we actually assessed accuracy in terms of the number of words misclassified, although the classification was taking place at a page level.
[6] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[7] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: An update. SIGKDD Explorations, 11(1), 2009.
There are already several great posts out there that exhaustively list resources and starting points for people getting into DH (a lot of them are by Lisa Spiro, who is good at it). Even so, opportunities are not always well signposted.
This will be a shorter list. I’m still new enough at this to remember what surprised me in the early going, and there were two areas where my previous experience in the academy failed to prepare me for the fluid nature of this field.
1) I had no idea, going into this, just how active a scholarly field could be online. Things are changing rapidly — copyright lawsuits, new tools, new ideas. To find out what’s happening, I think it’s actually vital to lurk on Twitter. Before I got on Twitter, I was flying blind, and didn’t even realize it. Start by following Brett Bobley, head of the Office of Digital Humanities at the NEH. Then follow everyone else.
2) The technical aspect of the field is important — too important, in many cases, to be delegated. You need to get your hands dirty. But the technical aspect is also much less of an obstacle than I originally assumed. There’s an amazing amount of information on the web, and you can teach yourself to do almost anything in a couple of weekends.* Realizing that you can is half the battle. For a pep talk / inspiring example, try this great narrative by Tim Sherratt.
That’s it. If you want more information, see the links to Lisa Spiro and DiRT at the top of this post. Lisa is right, by the way, that the place to start is with a particular problem you want to solve. Don’t dutifully acquire skills that you think you’re supposed to have for later use. Just go solve that problem!
* ps: Technical obstacles are minor even if you want to work with “big data.” We’re at a point now where you can harvest your own big data — big, at least, by humanistic standards. Hardware limitations are not quite irrelevant, but you won’t hit them for the first year or so, though you may listen anxiously while that drive grinds much more than you’re used to …