June 2014 – The Stone and the Shell

A window on the twentieth century may be about to open.

The nineteenth century gets a lot of attention from scholars interested in text mining, simply because it’s in the public domain. After 1923, you run into copyright laws that make it impossible to share digital texts of many volumes.

"Ray of Light," by Russell H Cribb, 2006. CC-BY 2.0. — “Ray of Light,” by Russell H Cribb, 2006. CC-BY 2.0.

One of the most promising solutions to that problem is the non-consumptive research portal being designed by the HathiTrust Research Center. In non-consumptive research, algorithms characterize a collection without exposing the original texts to human reading or copying.

This could work in a range of ways. Some of them are complex — for instance, if worksets and algorithms have to be tailored to individual projects. HTRC is already supporting that kind of research, but expanding it to the twentieth century may pose problems of scale that take a while to solve. But where algorithms can be standardized, calculations can run once, in advance, across a whole collection, creating datasets that are easy to serve up in a secure way. This strategy could rapidly expand opportunities for research on twentieth-century print culture.

For instance, a great deal of interesting macroscopic research can be done, at bottom, by counting words. JSTOR has stirred up a lot of interest by making word counts available for scholarly journal articles. Word counts from printed books would be at least equally valuable, and relatively easy to provide.

So people interested in twentieth-century history and literary history should prick up their ears at the news that HathiTrust Research Center is releasing an initial set of word counts from public-domain works as an alpha test. This set only includes 250,000 of the eleven million volumes in HathiTrust, and does not yet include any data about works after 1923, but one can hope that the experiment will soon expand to cover the twentieth century. (I’m just an interested observer, so I don’t know how rapid the expansion will be, but the point of this experiment is ultimately to address obstacles to twentieth-century text mining.)

The data provided by HTRC is in certain ways richer than the data provided by JSTOR, and it may already provide a valuable service for scholars who study the nineteenth or early twentieth centuries. Words are tagged with parts of speech, and word counts are provided at the page level — an important choice, since a single volume may combine a number of different kinds of text. HTRC is also making an effort to separate recurring headers and footers from the main text on each page; they’re providing line counts and sentence counts for each page, and also providing a count of the characters that begin and end lines. In my own research, I’ve found that it’s possible to use this kind of information to separate genres and categories of paratext within a volume (the lines of an index tend to begin with capital letters and end with numbers, for instance).

Of course, researchers would like to pose many questions that can’t be answered by page-level unigram counts. Some of those questions will require custom-tailored algorithms. But other questions might be possible to address with precalculated features extracted in a relatively standard way.

Whatever kinds of information interest you, speak up for them, using the e-mail address provided on the HTRC feature-extraction page. And if this kind of service would have value for your research, please write in to say how you would use it. Part of the point of this experiment is to assess the degree of scholarly interest.