A fair number of scholars would like to work on large digital collections, but aren’t entirely sure where to get them.
For people who work on text after, say, 1700, I’d like to briefly make a case for HathiTrust. I’m a few months into a project based on 800,000 volumes — collaborating with Mike Black, an English Ph.D student and extraordinary Python programmer. We decided to get our collection from HathiTrust, and it’s a decision I haven’t regretted. In terms of sheer numbers, I don’t know whether they’re larger than, say, the Internet Archive. But their collection has some subtle details that I’ve come to greatly appreciate.
For one thing, they divide documents into individual page files. At first this may seem like a pain (you want a file, right, not a folder of files?) But in fact it’s a significant advantage to have that hard-coded representation of page breaks. It has made it possible for Mike to design a Python script that a) recognizes running headers at the tops of pages b) uses them to make a reasonable guess about chapters and other document divisions and then c) removes the headers, which can otherwise throw a wrench in your topic model.
Also, the HathiTrust API is solid and well documented. If you request a large dataset from them, you will get metadata with it. But the availability of the bibliographic API can still be a significant benefit. (By the way, re: metadata — ask them to give you the complete .json record, not just the marc-record part of the json.)
For small numbers of texts, you could in fact get the text itself from the data API. But this is not recommended for a big collection. Instead you’re going to want to write Hathi and request that they construct a dataset for you, based on facets that would be available in their Advanced Search feature. Once they build it — which could take a few weeks to a month — you can send them a hard drive or download data through rsync. (I initially found rsync perplexing, but after the nice people at Hathi gave me precise instructions, it was easy.) Using rsync through my campus office connection, it took about two days to transfer 800,000 volumes, which consumed a little less than 1TB of disk space. It would have been slower if I had tried to do it at home through commercial broadband and an AirPort.
There is a lot of time involved simply in moving data around, and in part I’m writing this post to warn people about that. One really basic point that took me a while to figure out: do not try to unzip the files. Part of the reason why it’s slow to move a large collection is that separate files require your i/o to do a lot of starting and stopping. That’s hard enough with (say) 500,000 separate zipped document folders. If you unzip those documents and get 165 million separate page files, it becomes very hard indeed. I actually spent more than a week unzipping the collection, and about a week trying to move it from one drive to another — only to get a disk error halfway through the process that required reformat.
Mothers, teach your children not to do as I have done. Just use the Python module zipfiles that works directly with the .zip file. It takes Python a few tenths of a second to extract the data, but it’s much better than trying to move 165 million individual pages. H/t to Loretta Auvil, by the way, for convincing me that this was simpler.
I’m going to try to make available the Python scripts and lexica that Mike and I design for working with the collection. There are
a) Simple logistical issues, like navigating the pairtree folder structure where files are stored and extracting them from .zip.
b) Metadata issues, like normalizing dates of publication that can be “1871” or “[18–]”.
c) Document-format issues, like running headers and page numbers.
d) OCR issues, which are the really fun ones as far as I’m concerned.
We’ve written pieces of all of this, and (a) through (c) are working, but it’s not yet in beta (to put it mildly). However, if you’re grappling with a similar problem, drop me a line and I’ll send you our code, such as it is. Development of this code was supported by the Andrew W. Mellon Foundation.
I’d also like to encourage everyone who’s interested in these kinds of problems to attend the HathiTrust Research Center UnCamp in Indiana this September (pre-register by August 1). This should be particularly useful if you’re interested in working on collections after 1923. HTRC has begun to design an infrastructure that will permit non-consumptive or non-expressive research on texts without transmitting the text itself to the researcher — obviously a crucial part of the solution to the problem of research on copyrighted text. They hope to demo parts of that infrastructure in September — but if you show up, you also have a fair chance of getting input on the design of the final version.