Big data. I’m tempted to begin “I, too, dislike it,” because the phrase has become a buzzword. To mainstream humanists, it sounds like a perversion. Even people who work in digital humanities protest that DH shouldn’t be normatively identified with big data — and I agree — so generally I keep quiet on the whole vexed question.
Except … there are a lot of grad students out there just starting to look at DH curiously, wondering whether it offers anything useful for their own subfield. In that situation, it’s natural to start by building a small collection that addresses a specific research problem you know about. And that might, in many cases, be a fine approach! But my conscience is nagging at me, because I can see some other, less obvious opportunities that students ought to be informed about.
It’s true that DH doesn’t have to be identified with scale. But the fact remains that problems of scale constitute a huge blind spot for individual researchers, and also define a problem that we know computers can help us explore. And when you first go into an area that was a blind spot for earlier generations of scholars, you’re almost guaranteed to find research opportunities — lying out on the ground like lumps of gold you don’t have to mine.
This suggests that it might be a mistake to assume that the most cost-effective way to get started in DH is to define a small collection focused on a particular problem you know about. It might actually be a better strategy to beg, borrow, or steal a large collection — and poke around in it for problems we don’t yet know about.
“But I’m not interested in big statistical generalizations; I care about describing individual works, decades, and social problems.” I understand; that’s a valid goal; but it’s not incompatible with the approach I’m recommending. I think it’s really vital that we do a better job of distinguishing “big data” (the resource) from “distant reading” (a particular interpretive strategy).* Big data doesn’t have to produce distant generalizations; we can use the leverage provided by scale and comparative analysis to crack open small and tightly-focused questions.
I don’t think most humanists have an intuitive grasp of how that “leverage” would work — but topic modeling is a good example. As I play around with topic-modeling large collections, I’m often finding that the process tells me interesting things about particular periods, genres, or works, by revealing how they differ from other relevant points of comparison. Topic modeling doesn’t use scale to identify a “trend” or an “average,” after all; what it does is identify the most salient dimensions of difference in a given collection. If you believe that the significance of a text is defined by its relation to context, then you can see how topic modeling a collection might help us crack open the (relational) significance of individual works.
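A minimal sketch of that kind of leverage, using scikit-learn's `LatentDirichletAllocation` on a toy corpus. The documents, vocabulary, and topic count here are my own illustrative assumptions, not anyone's real data; the point is only the last step, where scale lets us ask how one work's topic mixture departs from the collection norm.

```python
# Sketch: topic-model a tiny collection, then compare one document's
# topic mix to the collection average -- the "relational significance"
# move described above. Toy corpus; parameters are illustrative only.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "whale sea ship captain voyage harpoon ocean",
    "love marriage estate drawing room letter ball",
    "sea storm sailor ship deck mast ocean wave",
    "courtship dance gown parlor letter engagement",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)   # one row per document; rows sum to 1

# How does the first document differ from the collection as a whole?
corpus_mean = doc_topics.mean(axis=0)
difference = doc_topics[0] - corpus_mean
print(difference)  # positive entries: topics this work over-represents
```

Nothing here identifies a trend or an average as the result; the model's dimensions of difference are the instrument, and the individual work is still the object of description.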
“But how do we get our hands on the data?” Indeed: there’s the rub. Miriam Posner has recently suggested that the culture surrounding “coding” serves as a barrier that discourages women and minorities from entering certain precincts of DH. I think that’s right, but I’m even more concerned about the barriers embodied in access to data. Coding is actually not all that hard to pick up. Yes, it’s surrounded by gendered assumptions; but still, you can do it over a summer. [Update: Or, where that’s not practical, you can collaborate with someone. At Illinois, Loretta Auvil and Boris Capitanu do kinds of DH programming that are beyond me. I don’t mean to minimize issues of gender here, but I do mean to put “coding” in perspective. It’s not a mysterious, magical key.] By contrast, none of us can build big data on our own (or even in small teams) over the summer. If we don’t watch out, our field could easily slip into a situation where power gravitates to established scholars at large/wealthy research universities.
I’ve tried to address that by making my own data public. I haven’t documented it very well yet, but give me a few weeks. I think peer pressure should be exerted on everyone (especially established scholars) to make their data public at the time of publication. I do understand that some kinds of data can’t be shared because they’re owned by private enterprise. I accept that. But if you’ve supplemented proprietary data with other things you’ve produced on your own: in my opinion, that data should be made public at the time of publication.
Moreover, if you do that, I’m not going to care very much about the mistakes you have made in building your collection. I may think your data is completely biased and unrepresentative, because it includes too much Y and not enough X. But if so, I have an easy solution — which is to take your data, add it to my own collection of X, and other data borrowed from Initiative Z, and then select whatever subset would in my opinion create a balanced and representative collection. Then I can publish my own article correcting your initial, biased result.
Humanists are used to approaching debates about historical representation as if they were zero-sum questions. I suppose we are on some level still imagining this as a debate about canonicity — which is, as John Guillory pointed out, really a debate about space on the syllabus. Space on the syllabus is a zero-sum game. But the process of building big data is not zero-sum; it is cumulative. Every single thing you digitize is more good news for me, even if I shudder at the tired 2007-vintage assumptions implicit in your research agenda.
Personally, I feel the same way about questions of markup and interoperability. It’s all good. If you can give me clean** ascii text files with minimal metadata, I love you. If you can give me TEI with enriched metadata, I love you. I don’t want to waste a lot of breath arguing about which standard is better. In most cases, clean ascii text would be a lot better than what I can currently get.
* I hasten to say that I’m using “distant reading” here as the phrase is commonly deployed in debate — not as Franco Moretti originally used it — because the limitation I’m playing on is not really present in Moretti’s own use of the term. Moretti pointedly emphasizes that the advantage of a distant perspective may be to reveal the relational significance of an individual work.
** And, when I say “clean” — I will definitely settle for a 5% error rate.
Guillory, John. Cultural Capital. Chicago: University of Chicago Press, 1993.
Moretti, Franco. Graphs, Maps, Trees. New York: Verso, 2005.
[UPDATE: For a different perspective on the question of representativeness, see Katherine D. Harris on Big Data, DH, and Gender. Also, see Roger Whitson, who suggests that linked open data may help us address issues of representation.]
5 replies on “Big but not distant.”
[…] Fair enough. Ted’s project is to use the corpus that is available. He’s moved his project beyond the labor of creating digital representations and is working on the humanistic queries that are so engaging (to me) in Digital Humanities. To be fair, Ted has returned to his data set to add more women authors from the Brown Women Writers Project and such. And, even as I’m finishing up this post, Ted posts a longer response to representation, big data, and the canon. […]
Last summer I was working on Heart of Darkness and found myself curious about paragraph length. Well, the text is short, so it was easy for me to use MS Word’s count function to find the length of each paragraph and then to transfer the results to Excel to graph the distribution (I lack scripting skills, so I had to do it this way). When I was done, the results were interesting, interesting enough to get the Language Log folks into action.
Now, it would be hard to call this “big data,” and it’s not exactly what’s conjured up by the term DH, but it surely is DH in some meaningful sense. What I did was very much about the particularities of an individual text, and some of my commentary set those paragraph lengths against the qualitative work that, at the moment, can only be done by a skilled human analyst. So that’s one thing.
I suspect, however, that HoD has a pattern of paragraph lengths that is likely to be rare, if not unique. There’s only one way to check that out: look at other texts. How many, and which ones? Well, you might say, let’s try fifty, and let’s just pick them at random from some collection of British fiction. That’s hardly big by current standards, but it’s way more than you’d want to do using my semi-manual techniques. At this point you’re going to want to write some scripts to automate the process, which is what Mark Liberman did when he ran two or three more texts just to see . . . The fact is, if I had the resources, I’d just do it for every text I could lay my hands on, and count sentence length too. If I were Google, I’d do it for every text in my collection.
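The automation step described above can be sketched in a few lines. This is a minimal illustration, not Liberman's actual script; the sample text and the blank-line convention for paragraph breaks are assumptions.

```python
# Sketch: compute paragraph lengths (in words) for a plain text,
# replacing the MS Word + Excel workflow described above.
# Assumes paragraphs are separated by blank lines.
from collections import Counter

def paragraph_lengths(text):
    """Split on blank lines; return the word count of each paragraph."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return [len(p.split()) for p in paragraphs]

# Toy sample standing in for a full novel's text file.
text = "One short paragraph.\n\nA second, slightly longer paragraph here.\n\nThird."
lengths = paragraph_lengths(text)
print(lengths)            # -> [3, 6, 1]
print(Counter(lengths))   # the distribution you would graph
```

Run over a directory of texts, a loop around this function yields the "millions of little files" of distributions the next paragraph imagines comparing.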
That’s big data. And with millions of little files containing information about paragraph lengths in millions of texts, well, now we’ve got something of a problem with comparing all those distributions, which, of course, is what we want to do. And if, as I suspect, the HoD pattern is relatively rare, well, what then?
I don’t know what would come out of such work but, given work that’s been done on shot length in film, I think we might find long-term historical trends in paragraph and sentence length. If we do, what then? Well, we’ve got something we’ve got to explain.
And so forth and so on. As you say, there’s lots of “low-hanging fruit,” as they call it in the consulting business. As more people overcome their fear of numbers and start poking around, we’re going to see more and more of it. And maybe, in time, people will begin to get the idea that explaining these patterns, which we’d never before even suspected, is going to require new ways of thinking.
And then there’s the unfortunate term, “distant reading.” Why unfortunate? Well, I think it was a mistake to allow the term “reading” to elide the distinction between the ordinary activity by which John, Jane, Suzy, and Timmy Doe read texts and the specialized activity of creating written explications of texts. The effect of such elision is to enable the belief that the two processes are basically the same, but that what the professional critic is doing is deeper and more rigorous than what John, Jane, Suzy, and Timmy are doing, and that the Does really ought to tighten up their act. Think about that for a moment or two and you realize that, on that view, Will Shakespeare, Leo Tolstoy, and Murasaki Shikibu were really little more than very skilled chimps, and they ought to get themselves to the nearest Summer School for Criticism in order to be able to “read” the texts they wrote.
No, what literary critics do is different from (ordinary) reading and ought not to be terminologically confused with it. And so it is with what Moretti has done, and with just about anything in DH. It’s not a species of reading and ought not to be so designated. When you read a text you’re looking for a certain kind of experience. When you’re examining the publication of texts by genre over time, or the appearance of topics in a collection of texts, or even a single text, that’s different, and so it should have a different name. Calling it some kind of reading just confuses the enterprise.
As you say, there’s really a continuum between an individual work, hundreds of texts in a particular genre or social context, and thousands of texts across a century or more. All of those scales of analysis are potentially useful, and we’re getting to a point where you don’t necessarily have to choose between them.