Big data. I’m tempted to begin “I, too, dislike it,” because the phrase has become a buzzword. To mainstream humanists, it sounds like a perversion. Even people who work in digital humanities protest that DH shouldn’t be normatively identified with big data — and I agree — so generally I keep quiet on the whole vexed question.
Except … there are a lot of grad students out there just starting to look at DH curiously, wondering whether it offers anything useful for their own subfield. In that situation, it’s natural to start by building a small collection that addresses a specific research problem you know about. And that might, in many cases, be a fine approach! But my conscience is nagging at me, because I can see some other, less obvious opportunities that students ought to be informed about.
It’s true that DH doesn’t have to be identified with scale. But the fact remains that problems of scale constitute a huge blind spot for individual researchers, and also define a problem that we know computers can help us explore. And when you first go into an area that was a blind spot for earlier generations of scholars, you’re almost guaranteed to find research opportunities — lying out on the ground like lumps of gold you don’t have to mine.

This suggests that it might be a mistake to assume that the most cost-effective way to get started in DH is to define a small collection focused on a particular problem you know about. It might actually be a better strategy to beg, borrow, or steal a large collection — and poke around in it for problems we don’t yet know about.
“But I’m not interested in big statistical generalizations; I care about describing individual works, decades, and social problems.” I understand; that’s a valid goal; but it’s not incompatible with the approach I’m recommending. I think it’s really vital that we do a better job of distinguishing “big data” (the resource) from “distant reading” (a particular interpretive strategy).* Big data doesn’t have to produce distant generalizations; we can use the leverage provided by scale and comparative analysis to crack open small and tightly-focused questions.
I don’t think most humanists have an intuitive grasp of how that “leverage” would work — but topic modeling is a good example. As I play around with topic-modeling large collections, I’m often finding that the process tells me interesting things about particular periods, genres, or works, by revealing how they differ from other relevant points of comparison. Topic modeling doesn’t use scale to identify a “trend” or an “average,” after all; what it does is identify the most salient dimensions of difference in a given collection. If you believe that the significance of a text is defined by its relation to context, then you can see how topic modeling a collection might help us crack open the (relational) significance of individual works.
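To make that "leverage" a little more concrete, here is a minimal sketch in Python. To be clear, this is not topic modeling. It is a much simpler stand-in: a smoothed log-ratio of word frequencies that asks which words are overrepresented in one work relative to the rest of a collection. The three tiny "texts," the function names, and the smoothing value are all invented for illustration. The only point is that a comparison set is what lets us say something specific about a single work.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def distinctive_words(target, others, top_n=5, smoothing=1.0):
    """Rank words by how overrepresented they are in `target` relative to `others`.

    Each score is the log of the ratio between a word's smoothed relative
    frequency in the target and in the comparison texts. Positive scores
    mean the word is relatively more common in the target.
    """
    target_counts = Counter(tokenize(target))
    other_counts = Counter()
    for text in others:
        other_counts.update(tokenize(text))

    vocab = set(target_counts) | set(other_counts)
    # Add-one style smoothing so unseen words never produce a zero frequency.
    target_total = sum(target_counts.values()) + smoothing * len(vocab)
    other_total = sum(other_counts.values()) + smoothing * len(vocab)

    scores = {}
    for word in vocab:
        p_target = (target_counts[word] + smoothing) / target_total
        p_other = (other_counts[word] + smoothing) / other_total
        scores[word] = math.log(p_target / p_other)

    # Highest scores first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_n]


# A toy "collection" (invented sentences, not real data).
collection = {
    "gothic": "the castle was dark and the ghost walked the dark hall",
    "pastoral": "the shepherd sang and the sheep grazed the green hill",
    "domestic": "the family sat and the tea was poured in the parlor",
}

target = collection["gothic"]
others = [text for name, text in collection.items() if name != "gothic"]
top = distinctive_words(target, others, top_n=3)
for word, score in top:
    print(word, round(score, 2))
```

Common words like "the" drop out of the ranking, not because anyone filtered them, but because they are no more frequent in the target than in the comparison texts. The bigger and more varied the comparison set, the more meaningful that contrast becomes.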
“But how do we get our hands on the data?” Indeed: there’s the rub. Miriam Posner has recently suggested that the culture surrounding “coding” serves as a barrier that discourages women and minorities from entering certain precincts of DH. I think that’s right, but I’m even more concerned about the barriers embodied in access to data. Coding is actually not all that hard to pick up. Yes, it’s surrounded by gendered assumptions; but still, you can do it over a summer. [Update: Or, where that’s not practical, you can collaborate with someone. At Illinois, Loretta Auvil and Boris Capitanu do kinds of DH programming that are beyond me. I don’t mean to minimize issues of gender here, but I do mean to put “coding” in perspective. It’s not a mysterious, magical key.] By contrast, none of us can build big data on our own (or even in small teams) over the summer. If we don’t watch out, our field could easily slip into a situation where power gravitates to established scholars at large/wealthy research universities.
I’ve tried to address that by making my own data public. I haven’t documented it very well yet, but give me a few weeks. I think peer pressure should be exerted on everyone (especially established scholars) to make their data public at the time of publication. I do understand that some kinds of data can’t be shared because they’re owned by private enterprise. I accept that. But if you’ve supplemented proprietary data with other things you’ve produced on your own, then in my opinion that additional data should be made public at the time of publication.
Moreover, if you do that, I’m not going to care very much about the mistakes you have made in building your collection. I may think your data is completely biased and unrepresentative, because it includes too much Y and not enough X. But if so, I have an easy solution: take your data, combine it with my own collection of X and with other data borrowed from Initiative Z, and then select whatever subset would, in my opinion, create a balanced and representative collection. Then I can publish my own article correcting your initial, biased result.
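That merge-and-reselect move can be sketched in a few lines of Python. Everything here is hypothetical: the record fields (`id`, `genre`), the three sample collections, and the choice of "the same number of records per genre" as a definition of balance. What counts as representative is a judgment call that the code cannot make for you.

```python
import random
from collections import Counter, defaultdict


def merge_collections(*collections):
    """Combine several lists of records, keeping the first record seen for each id."""
    merged = {}
    for collection in collections:
        for record in collection:
            merged.setdefault(record["id"], record)
    return list(merged.values())


def balanced_subset(records, key, per_group, seed=0):
    """Draw up to `per_group` records at random from each value of `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    subset = []
    for value in sorted(groups):
        members = groups[value]
        subset.extend(rng.sample(members, min(per_group, len(members))))
    return subset


# Hypothetical collections: yours leans heavily toward genre Y.
yours = [{"id": f"y{i}", "genre": "Y"} for i in range(6)] + [{"id": "x0", "genre": "X"}]
mine = [{"id": f"x{i}", "genre": "X"} for i in range(4)]  # overlaps with yours on x0
initiative_z = [{"id": "z0", "genre": "X"}, {"id": "z1", "genre": "Y"}]

merged = merge_collections(yours, mine, initiative_z)
balanced = balanced_subset(merged, key="genre", per_group=4)
print(Counter(record["genre"] for record in balanced))
```

Nothing in your original collection is lost or overwritten along the way, so a third scholar who disagrees with my idea of balance can start from the same merged pool and select differently.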
Humanists are used to approaching debates about historical representation as if they were zero-sum questions. I suppose we are on some level still imagining this as a debate about canonicity — which is, as John Guillory pointed out, really a debate about space on the syllabus. Space on the syllabus is a zero-sum game. But the process of building big data is not zero-sum; it is cumulative. Every single thing you digitize is more good news for me, even if I shudder at the tired 2007-vintage assumptions implicit in your research agenda.
Personally, I feel the same way about questions of markup and interoperability. It’s all good. If you can give me clean** ASCII text files with minimal metadata, I love you. If you can give me TEI with enriched metadata, I love you. I don’t want to waste a lot of breath arguing about which standard is better. In most cases, clean ASCII text would be a lot better than what I can currently get.
* I hasten to say that I’m using “distant reading” here as the phrase is commonly deployed in debate — not as Franco Moretti originally used it — because the limitation I’m playing on is not really present in Moretti’s own use of the term. Moretti pointedly emphasizes that the advantage of a distant perspective may be to reveal the relational significance of an individual work.
** And, when I say “clean” — I will definitely settle for a 5% error rate.
References
Guillory, John. Cultural Capital. Chicago: University of Chicago Press, 1993.
Moretti, Franco. Graphs, Maps, Trees. New York: Verso, 2005.
[UPDATE: For a different perspective on the question of representativeness, see Katherine D. Harris on Big Data, DH, and Gender. Also, see Roger Whitson, who suggests that linked open data may help us address issues of representation.]








